HIVE调优MapJoin
目录
HIVE调优MapJoin
1.mapjoin (1.2以后自动默认启动mapjoin)
2.创建表格
3.查询建表
4.通过 explain 展示执行计划
5.Map JOIN 相关设置:
1.mapjoin (1.2以后自动默认启动mapjoin)
select /*+mapjoin(b)*/ a.xx,b.xxx from a left outer join b on a.id=b.id
2.创建表格
CREATE EXTERNAL TABLE IF NOT EXISTS learn4.student1(
id STRING COMMENT "学生ID",
name STRING COMMENT "学生姓名",
age int COMMENT "年龄",
gender STRING COMMENT "性别",
clazz STRING COMMENT "班级"
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ",";
load data local inpath "/usr/local/soft/hive-3.1.2/data/students.txt" INTO TABLE learn4.student1;
CREATE EXTERNAL TABLE IF NOT EXISTS learn4.score1(
id STRING COMMENT "学生ID",
subject_id STRING COMMENT "科目ID",
score int COMMENT "成绩"
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ",";
load data local inpath "/usr/local/soft/hive-3.1.2/data/score.txt" INTO TABLE learn4.score1;
3.查询建表
CREATE TABLE mapJonTest AS SELECT max.name,min.subject_id,min.score FROM learn4.student1 max JOIN learn4.score1 min ON max.id = min.id;
建表所需时间:
INFO : Total MapReduce CPU Time Spent: 2 seconds 510 msec
INFO : Completed executing command(queryId=root_20240511090524_3a34bdda-4247-4af4-b686-d681856af110); Time taken: 19.199 seconds
INFO : OK
INFO : Concurrency mode is disabled, not creating a lock manager
No rows affected (20.707 seconds)
4.通过 explain 展示执行计划
explain SELECT max.name,min.subject_id,min.score FROM learn4.student1 max JOIN learn4.score1 min ON max.id = min.id;
查看详细信息:
explain extended SELECT max.name,min.subject_id,min.score FROM learn4.student1 max JOIN learn4.score1 min ON max.id = min.id;
| STAGE DEPENDENCIES: | -- 执行stage的依赖
| Stage-4 is a root stage | Stage-4 表示根流程 --表示最先执行的流程
| Stage-3 depends on stages: Stage-4 | Stage-3 依赖 Stage-4
| Stage-0 depends on stages: Stage-3 | Stage-0 依赖 Stage-3 依赖 Stage-4
| |
| STAGE PLANS: |
| Stage: Stage-4 |
| Map Reduce Local Work |
| Alias -> Map Local Tables: |
| $hdt$_1:min |
| Fetch Operator |
| limit: -1 |
| Alias -> Map Local Operator Tree: |
| $hdt$_1:min |
| TableScan | TableScan 扫描的表
| alias: min |
| Statistics: Num rows: 1 Data size: 1385400 Basic stats: COMPLETE Column stats: NONE |
| Filter Operator |
| predicate: id is not null (type: boolean) |
| Statistics: Num rows: 1 Data size: 1385400 Basic stats: COMPLETE Column stats: NONE |
| Select Operator |
| expressions: id (type: string), subject_id (type: string), score (type: int) |
| outputColumnNames: _col0, _col1, _col2 |
| Statistics: Num rows: 1 Data size: 1385400 Basic stats: COMPLETE Column stats: NONE |
| HashTable Sink Operator |
| keys: |
| 0 _col0 (type: string) |
| 1 _col0 (type: string) |
| |
| Stage: Stage-3 |
| Map Reduce |
| Map Operator Tree: |
| TableScan |
| alias: max |
| Statistics: Num rows: 1 Data size: 388080000 Basic stats: COMPLETE Column stats: NONE |
| Filter Operator |
| predicate: id is not null (type: boolean) |
| Statistics: Num rows: 1 Data size: 388080000 Basic stats: COMPLETE Column stats: NONE |
| Select Operator |
| expressions: id (type: string), name (type: string) |
| outputColumnNames: _col0, _col1 |
| Statistics: Num rows: 1 Data size: 388080000 Basic stats: COMPLETE Column stats: NONE |
| Map Join Operator | -- 不需要做任何操作 默认开启 Map JOIN 操作
| condition map: |
| Inner Join 0 to 1 |
| keys: |
| 0 _col0 (type: string) |
| 1 _col0 (type: string) |
| outputColumnNames: _col1, _col3, _col4 |
| Statistics: Num rows: 1 Data size: 426888009 Basic stats: COMPLETE Column stats: NONE |
| Select Operator |
| expressions: _col1 (type: string), _col3 (type: string), _col4 (type: int) |
| outputColumnNames: _col0, _col1, _col2 |
| Statistics: Num rows: 1 Data size: 426888009 Basic stats: COMPLETE Column stats: NONE |
| File Output Operator |
| compressed: false |
| Statistics: Num rows: 1 Data size: 426888009 Basic stats: COMPLETE Column stats: NONE |
| table: |
| input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
| output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
| serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
| Execution mode: vectorized |
| Local Work: |
| Map Reduce Local Work |
| |
| Stage: Stage-0 |
| Fetch Operator |
| limit: -1 |
| Processor Tree: |
| ListSink |
5.Map JOIN 相关设置:
1)设置自动选择Mapjoin
set hive.auto.convert.join = true; 默认为true
set hive.auto.convert.join = false; 默认为true
2)大表小表的阈值设置(默认25M以下认为是小表):
set hive.mapjoin.smalltable.filesize = 25000000;
set hive.mapjoin.smalltable.filesize = 10000000;