Basic Hive Tuning Methods (Part 1)

As stated in the title.

Answer 1, 2022-06-16

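1. View the execution plan: EXPLAIN

View the execution plan: explain select kind, count(*) from table_name group by kind;

Common terms in the output:
STAGE DEPENDENCIES: the dependencies between stages
Fetch Operator: the fetch operation
limit: -1 means no limit is placed on the number of rows
TableScan: the table scan operation
alias: the name (or alias) of the table being queried
Select Operator: the select (projection) operation
expressions: the columns being queried
outputColumnNames: the names of the output columns

Detailed execution plan: explain extended select kind, count(*) from table_name group by kind;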

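2. Partitioned tables: each partition maps to its own directory.
When querying, a where clause on the partition column selects the partition directory, e.g. dt='20211112'.
When creating the table, use partitioned by(dt string).
When loading data, specify the partition: into table partition_table partition(dt='20211112').
Add a partition: alter table partition_table add partition(dt='20211122').
Drop a partition: alter table partition_table drop partition(dt='20211122').
Several partitions can be added or dropped in one statement: when adding, separate the partition specs with spaces; when dropping, separate them with commas.
List partitions: show partitions partition_table;
More than one partition column can be specified.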
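
The partition statements above can be put together roughly as follows. This is a minimal sketch: the columns id and name, the tab delimiter, and the file path /tmp/data_20211112.txt are assumptions made for illustration; only partition_table and dt come from the text.

```sql
-- Hypothetical columns, delimiter and file path, for illustration only.
create table partition_table (
  id   int,
  name string
)
partitioned by (dt string)
row format delimited fields terminated by '\t';

-- Load a local file into a single partition.
load data local inpath '/tmp/data_20211112.txt'
into table partition_table partition (dt = '20211112');

-- Add two partitions in one statement (specs separated by spaces).
alter table partition_table add
  partition (dt = '20211122')
  partition (dt = '20211123');

-- Drop two partitions in one statement (specs separated by commas).
alter table partition_table drop
  partition (dt = '20211122'),
  partition (dt = '20211123');

show partitions partition_table;

-- Only the dt=20211112 directory needs to be read.
select id, name from partition_table where dt = '20211112';
```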

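3. Dynamic partitions:
Enable dynamic partitioning: set hive.exec.dynamic.partition=true;
Non-strict mode (all partition columns may be dynamic): set hive.exec.dynamic.partition.mode=nonstrict;
Maximum number of dynamic partitions that can be created in total: set hive.exec.max.dynamic.partitions=1000;
Maximum number of dynamic partitions that can be created on each mapper/reducer node: set hive.exec.max.dynamic.partitions.pernode=100;
Maximum number of HDFS files that one MR job can create: set hive.exec.max.created.files=100000;
Whether to throw an exception when an empty partition is generated: set hive.error.on.empty.partition=false;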
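
As a sketch of how these settings are typically used, the insert below lets Hive derive the partition value from the data. The table source_table and the columns id and name are hypothetical; the dynamic partition column must come last in the select list.

```sql
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

-- source_table is a hypothetical staging table with columns id, name, dt.
-- The value of dt in each row decides which partition the row is written to.
insert overwrite table partition_table partition (dt)
select id, name, dt
from source_table;
```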

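4. Bucketed tables: rows are distributed into separate files (buckets).
When creating the table, use clustered by(id) together with into N buckets.
Used for sampling: tablesample(bucket 1 out of 4 on id);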
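
A minimal sketch of a bucketed table and a sampling query, assuming a hypothetical table bucket_table with four buckets on id:

```sql
-- Hypothetical table and columns, for illustration only.
create table bucket_table (
  id   int,
  name string
)
clustered by (id) into 4 buckets
row format delimited fields terminated by '\t';

-- Sample the first of four buckets, hashing on id.
select id, name
from bucket_table tablesample (bucket 1 out of 4 on id);
```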

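5. File storage and compression formats

Row-oriented storage: TEXTFILE, SEQUENCEFILE
Column-oriented storage: ORC, PARQUET

LZO and SNAPPY offer fast compression and decompression with a moderate compression ratio.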
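
For example, a columnar table with Snappy compression could be declared as below. The table name log_orc and its columns are hypothetical; orc.compress is the ORC table property that selects the codec.

```sql
-- Hypothetical table: ORC storage with Snappy compression.
create table log_orc (
  id   int,
  kind string
)
stored as orc
tblproperties ("orc.compress" = "SNAPPY");
```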

6. Pruning
Column pruning: read only the columns that are needed
Partition pruning: read only the partitions that are needed
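
In query terms, both kinds of pruning come down to naming columns explicitly and filtering on the partition column. A small sketch, assuming a hypothetical table sales partitioned by dt:

```sql
-- Avoid: select * from sales;  (reads every column of every partition)

-- Column pruning: only kind and amount are read.
-- Partition pruning: only the dt=20211112 partition is scanned.
select kind, amount
from sales
where dt = '20211112';
```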

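7. Data skew in group by
Aggregate on the map side: set hive.map.aggr = true;
Number of rows after which the map-side aggregation check is performed: set hive.groupby.mapaggr.checkinterval = 100000;
Load-balance when the data is skewed: set hive.groupby.skewindata = true; With this enabled, the plan uses two MR jobs: the first spreads rows randomly across reducers for partial aggregation, and the second groups by key to produce the final result.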
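
A sketch of applying these settings to the aggregation query used earlier in this answer, on the assumption that a few kind values dominate the data:

```sql
set hive.map.aggr = true;
set hive.groupby.skewindata = true;

-- Partial aggregation happens first, so one hot key does not
-- overload a single reducer.
select kind, count(*) as cnt
from table_name
group by kind;
```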

8. Vectorized execution: operations such as scan, filter, and aggregation can process rows in batches
set hive.vectorized.execution.enabled = true;
set hive.vectorized.execution.reduce.enabled = true;

9. Use left semi join in place of exists/in
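
A minimal sketch of the rewrite, using hypothetical tables table_a and table_b. Note that in a left semi join the right-hand table may only be referenced in the on clause, not in the select list or where clause.

```sql
-- Original form with in:
-- select a.id, a.name from table_a a
-- where a.id in (select b.id from table_b b);

-- Equivalent left semi join: returns rows of table_a that have
-- at least one match in table_b, without duplicating them.
select a.id, a.name
from table_a a
left semi join table_b b on a.id = b.id;
```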

10. CBO
Cost-based optimizer
set hive.cbo.enable=true;
set hive.compute.query.using.stats=true;
set hive.stats.fetch.column.stats=true;
set hive.stats.fetch.partition.stats=true;
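
The optimizer can only cost plans well if statistics exist. A sketch of collecting them with analyze table, reusing the table_name placeholder from the explain example:

```sql
-- Table-level statistics (row count, size, and so on).
analyze table table_name compute statistics;

-- Column-level statistics, which the CBO uses for selectivity estimates.
analyze table table_name compute statistics for columns;
```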

11. Predicate pushdown
set hive.optimize.ppd = true;

12. Map join: a large table left joined with a small table
set hive.auto.convert.join=true;
set hive.mapjoin.smalltable.filesize=25000000;
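
With automatic conversion on, an ordinary join is turned into a map join when the small side is below the size threshold (25000000 bytes here, about 25 MB). The tables big_table and small_table and their columns are hypothetical:

```sql
set hive.auto.convert.join=true;
set hive.mapjoin.smalltable.filesize=25000000;

-- small_table is loaded into memory on each mapper,
-- so the join needs no reduce phase.
select b.id, b.amount, s.name
from big_table b
left join small_table s on b.id = s.id;
```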

13. Joining two large tables (rarely needed)
Sort Merge Bucket Join
When creating the tables:
clustered by(id)
sorted by(id)
into 3 buckets
Enable bucket join:
set hive.optimize.bucketmapjoin = true;
set hive.optimize.bucketmapjoin.sortedmerge = true;
set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
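
Putting the pieces together, a sketch of a sort merge bucket join. The tables orders_bucketed and users_bucketed and their columns are hypothetical; both are bucketed and sorted on the join key with the same number of buckets.

```sql
-- Both tables: same join key, same bucket count, sorted within buckets.
create table orders_bucketed (id int, amount double)
clustered by (id) sorted by (id) into 3 buckets;

create table users_bucketed (id int, name string)
clustered by (id) sorted by (id) into 3 buckets;

set hive.optimize.bucketmapjoin = true;
set hive.optimize.bucketmapjoin.sortedmerge = true;
set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;

-- Matching buckets are merged pairwise instead of shuffling both tables.
select o.id, o.amount, u.name
from orders_bucketed o
join users_bucketed u on o.id = u.id;
```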

14. Cartesian products
set hive.mapred.mode=strict; In strict mode, queries that would produce a Cartesian product are rejected.
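
As an illustration with hypothetical tables table_a and table_b, a join without a condition is refused in strict mode, while the same join with an on clause runs normally:

```sql
set hive.mapred.mode=strict;

-- Rejected in strict mode: no join condition, so every row of
-- table_a would be paired with every row of table_b.
-- select a.id, b.id from table_a a join table_b b;

-- Accepted: the join condition avoids the Cartesian product.
select a.id, b.id
from table_a a
join table_b b on a.id = b.id;
```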
