/Data Warehouse/Hive Extreme Optimization - Explain Plans (1)
1. Reading the Explain Plan
A HiveSQL query, like a query in a relational database, follows a definite execution logic, and the explain plan is what describes that logic. Once you can read the explain plan, you know exactly how a query will run and can spot its bottlenecks at a glance. In short: to optimize SQL, first read its explain plan!
1.1 How to get an explain plan
Like an RDBMS, Hive provides a simple way to print the explain plan, shown below. EXTENDED, DEPENDENCY, and AUTHORIZATION are the options used most often in practice, so those three are the ones we focus on.
EXPLAIN [EXTENDED|DEPENDENCY|AUTHORIZATION|LOCKS|VECTORIZATION] query
Now let's work through an example.
1.1.1 Data preparation
- Create the table
CREATE TABLE `odata.stock_asset_holding`(
`trade_id` string,
`secu_acc_id` string,
`prd_no` string,
`sys_code` string,
`qty` bigint,
`prsnt_qty` bigint,
`asset_amend_qty` bigint,
`mkt_val` double,
`last_price` double,
`undivi_last_price` double,
`scale_factor` double)
PARTITIONED BY (`busi_date` string)
STORED AS ORC
- Here we create a table stock_asset_holding in the odata database, partitioned by busi_date (a string in "yyyy-MM-dd" format) and stored as ORC files. The meaning of the individual columns does not matter here. We chose ORC because it has become the de facto standard in production environments, e.g., Ping An Technology's big data platform and the data platform of a Baidu project. ORC is a powerful complement to Hive, and we will analyze it in depth later. After creating a table, we usually check the result with the describe formatted command:
hive> describe formatted odata.stock_asset_holding;
OK
# col_name data_type comment
trade_id string
secu_acc_id string
prd_no string
sys_code string
qty bigint
prsnt_qty bigint
asset_amend_qty bigint
mkt_val double
last_price double
undivi_last_price double
scale_factor double
# Partition Information
# col_name data_type comment
busi_date string
# Detailed Table Information
Database: odata
Owner: hdfs
CreateTime: Thu Jul 12 19:06:20 CST 2018
LastAccessTime: UNKNOWN
Protect Mode: None
Retention: 0
Location: hdfs://bigdata-02:8020/mnt/data/bigdata/warehouse/odata.db/stock_asset_holding
Table Type: MANAGED_TABLE
Table Parameters:
numPartitions 51
spark.sql.partitionProvider catalog
spark.sql.sources.provider orc
spark.sql.sources.schema.numPartCols 1
spark.sql.sources.schema.numParts 1
spark.sql.sources.schema.part.0 {\"type\":\"struct\",\"fields\":[{\"name\":\"trade_id\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},
{\"name\":\"secu_acc_id\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},
{\"name\":\"prd_no\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},
{\"name\":\"sys_code\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},
{\"name\":\"qty\",\"type\":\"long\",\"nullable\":true,\"metadata\":{}},
{\"name\":\"prsnt_qty\",\"type\":\"long\",\"nullable\":true,\"metadata\":{}},
{\"name\":\"asset_amend_qty\",\"type\":\"long\",\"nullable\":true,\"metadata\":{}},
{\"name\":\"mkt_val\",\"type\":\"double\",\"nullable\":true,\"metadata\":{}},
{\"name\":\"last_price\",\"type\":\"double\",\"nullable\":true,\"metadata\":{}},
{\"name\":\"undivi_last_price\",\"type\":\"double\",\"nullable\":true,\"metadata\":{}},
{\"name\":\"scale_factor\",\"type\":\"double\",\"nullable\":true,\"metadata\":{}},
{\"name\":\"busi_date\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}}]}
spark.sql.sources.schema.partCol.0 busi_date
transient_lastDdlTime 1531393580
# Storage Information
SerDe Library: org.apache.hadoop.hive.ql.io.orc.OrcSerde
InputFormat: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
Compressed: No
Num Buckets: -1
Bucket Columns: []
Sort Columns: []
Storage Desc Params:
path hdfs://bigdata-02:8020/mnt/data/bigdata/warehouse/odata.db/stock_asset_holding
serialization.format 1
Time taken: 0.494 seconds, Fetched: 49 row(s)
- In the output above, we usually pay attention to a few fields in production:
- Table Type: the table type, managed or external. MANAGED_TABLE is a managed (internal) table; EXTERNAL_TABLE is an external table.
- Location: where the table data is stored.
- SerDe Library: the serialization/deserialization library.
- InputFormat/OutputFormat: the file input and output formats, here OrcInputFormat/OrcOutputFormat.
- Compressed: whether the table is compressed.
- Num Buckets/Bucket Columns: the number of buckets and the bucketing columns. Bucketing is another important topic in Hive optimization that we will come back to later; for now we just mention it (see the sketch after this list).
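To make Table Type and bucketing concrete, here is a minimal sketch that creates an external, bucketed variant of the holding table. The table name stock_asset_holding_ext, the trimmed column list, and the HDFS path are invented for illustration; EXTERNAL with an explicit LOCATION flips Table Type to EXTERNAL_TABLE, and CLUSTERED BY fills in Num Buckets and Bucket Columns.
CREATE EXTERNAL TABLE `odata.stock_asset_holding_ext`(
`trade_id` string,
`secu_acc_id` string,
`prd_no` string,
`qty` bigint,
`mkt_val` double)
PARTITIONED BY (`busi_date` string)
CLUSTERED BY (`trade_id`) INTO 16 BUCKETS
STORED AS ORC
LOCATION 'hdfs://bigdata-02:8020/mnt/data/bigdata/external/stock_asset_holding_ext';
-- describe formatted odata.stock_asset_holding_ext should now report
-- Table Type: EXTERNAL_TABLE, Num Buckets: 16, Bucket Columns: [trade_id]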
Following the same approach, we create a second table:
CREATE TABLE `odata.stock_cash_flow_detail`(
`trade_id` string,
`secu_acc_id` string,
`prd_no` string,
`timestamp` bigint,
`busi_flag_code` string,
`busi_flag_name` string,
`trd_qty` bigint,
`trd_cash_flow` double,
`orig_cash_flow` double,
`trd_capital_type` string,
`cash_flow_modi_label` bigint,
`trd_amt` double,
`trd_price` double,
`sys_code` string,
`trd_type` string,
`inner_busi_flag_code` string,
`amortize_label` bigint)
PARTITIONED BY (`busi_date` string)
STORED AS ORC
- Generate the data. Please generate the data yourself (a minimal sketch of loading a single partition is given after the statistics output below); here is the result I got after gathering statistics on the data I generated:
hive> ANALYZE TABLE odata.stock_asset_holding PARTITION(busi_date='2017-03-17') COMPUTE STATISTICS;
Query ID = hdfs_20180730200707_26903574-8c9b-4a40-af92-154079f5493e
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1530865492016_2702, Tracking URL = http://bigdata-03:8088/proxy/application_1530865492016_2702/
Kill Command = /opt/cloudera/parcels/CDH-5.14.0-1.cdh5.14.0.p0.24/lib/hadoop/bin/hadoop job -kill job_1530865492016_2702
Hadoop job information for Stage-0: number of mappers: 3; number of reducers: 0
2018-07-30 20:07:56,961 Stage-0 map = 0%, reduce = 0%
2018-07-30 20:08:04,274 Stage-0 map = 100%, reduce = 0%, Cumulative CPU 11.3 sec
MapReduce Total cumulative CPU time: 11 seconds 300 msec
Ended Job = job_1530865492016_2702
*Partition odata.stock_asset_holding{busi_date=2017-03-17} stats: [numFiles=16, numRows=455199, totalSize=11644188, rawDataSize=0]*
MapReduce Jobs Launched:
Stage-Stage-0: Map: 3 Cumulative CPU: 11.3 sec HDFS Read: 283021 HDFS Write: 317 SUCCESS
Total MapReduce CPU Time Spent: 11 seconds 300 msec
OK
Time taken: 17.007 seconds
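If you only need a handful of rows to follow along, a minimal sketch like the one below, assuming Hive 0.14 or later (which supports INSERT ... VALUES), inserts a couple of hand-written rows into one partition and then gathers statistics for it. The literal values are invented for illustration; the numbers shown in this article come from a much larger generated data set.
-- hypothetical sample rows; the 11 values match the table's non-partition columns
INSERT INTO TABLE odata.stock_asset_holding PARTITION (busi_date='2017-03-17')
VALUES ('66949', 'acc_01', 'prd_01', 'sys_a', 100, 0, 0, 1230.0, 12.3, 12.3, 1.0),
       ('66950', 'acc_02', 'prd_02', 'sys_a', 200, 0, 0, 2460.0, 12.3, 12.3, 1.0);
-- refresh the partition-level statistics after loading
ANALYZE TABLE odata.stock_asset_holding PARTITION (busi_date='2017-03-17') COMPUTE STATISTICS;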
1.1.2 The basic explain plan
hive>
>
> explain
> select * from odata.stock_asset_holding dt
> where dt.busi_date='2017-03-17' and trade_id='66949';
OK
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-0 depends on stages: Stage-1
STAGE PLANS:
Stage: Stage-1
Map Reduce
Map Operator Tree:
TableScan
alias: dt
Statistics: Num rows: 455199 Data size: 11644188 Basic stats: COMPLETE Column stats: NONE
Filter Operator
predicate: (trade_id = '66949') (type: boolean)
Statistics: Num rows: 227599 Data size: 5822081 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: '66949' (type: string), secu_acc_id (type: string), prd_no (type: string), sys_code (type: string), qty (type: bigint), prsnt_qty (type: bigint), asset_amend_qty (type: bigint), mkt_val (type: double), last_price (type: double), undivi_last_price (type: double), scale_factor (type: double), '2017-03-17' (type: string)
outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9, _col10, _col11
Statistics: Num rows: 227599 Data size: 5822081 Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: false
Statistics: Num rows: 227599 Data size: 5822081 Basic stats: COMPLETE Column stats: NONE
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Execution mode: vectorized
Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
ListSink
Time taken: 0.092 seconds, Fetched: 33 row(s)
--- end of command-line output ---
The explain plan above has two parts: STAGE DEPENDENCIES and STAGE PLANS.
STAGE DEPENDENCIES describes the stages the query is compiled into. Here two stages are generated, Stage-1 and Stage-0, and Stage-0 depends on Stage-1.
STAGE PLANS describes the execution logic of each stage. A stage can contain both a Map and a Reduce phase; to make things easier to follow I picked a query with only a Map phase. Once you can read the Map phase, the Reduce phase is just as easy. The plan above therefore reads as follows.
Stage-1
1. The execution mode is vectorized. Vectorized mode switches from processing one row at a time to processing rows in batches, 1024 rows by default.
2. Map Operator Tree: the stage begins with a table scan (TableScan) of the table aliased dt, with an estimated 455199 rows and a data size of 11644188.
3. Filter Operator: a row filter whose predicate is trade_id = '66949', estimated to pass 227599 rows with a data size of 5822081.
4. Select Operator: column projection, followed by the nested File Output Operator. For this part I suggest reading the deepest indentation first, so it reads like this:
4.1 The result is written with input format org.apache.hadoop.mapred.TextInputFormat, output format org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, and serde org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.
4.2 The output is estimated at 227599 rows with a data size of 5822081.
4.3 The output is not compressed.
4.4 Twelve columns are output; because MapReduce refers to columns by positional placeholders, they appear here as _col0 through _col11.
4.5 expressions lists the value and data type of each output column.
Stage-0
1. Fetch Operator: this is the client fetching the result; no work runs on the cluster for this step.
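The vectorized execution mode mentioned in step 1 of Stage-1 is controlled by a session setting. A minimal sketch, assuming your Hive build supports vectorization on ORC tables, of how it shows up in the plan:
set hive.vectorized.execution.enabled = false;
explain select * from odata.stock_asset_holding dt
where dt.busi_date='2017-03-17' and trade_id='66949';
-- the "Execution mode: vectorized" line should now be absent from Stage-1
set hive.vectorized.execution.enabled = true;
-- re-running the same explain should show "Execution mode: vectorized" again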
At this point you probably have a few questions: we created odata.stock_asset_holding as an ORC table, so why does the plan show TextInputFormat? And where does the estimate of 227599 rows come from? Before answering, keep one thing in mind: the explain plan is a rough description that Hive derives from its statistics, not the plan of an actual run. In short, this plan is only an estimate; it helps us understand the overall execution flow, but it cannot by itself pin down problems.
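One practical note: the plan above reports Column stats: NONE, so Hive falls back on a coarse heuristic for the filter (227599 is just half of the 455199 rows from the table statistics). A minimal sketch, assuming the metastore is configured to serve column statistics, of gathering column-level statistics so later plans can estimate more accurately:
set hive.stats.fetch.column.stats = true;
ANALYZE TABLE odata.stock_asset_holding PARTITION (busi_date='2017-03-17')
COMPUTE STATISTICS FOR COLUMNS;
-- re-running the EXPLAIN should then report "Column stats: COMPLETE" and a row
-- estimate for trade_id = '66949' derived from the column's number of distinct values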
1.1.3 The explain extended plan
The plan below is long, but only a few pieces matter. It consists of three parts:
- ABSTRACT SYNTAX TREE (AST): the abstract syntax tree. In practice it helps you quickly work out the logic of a piece of SQL, especially when reviewing someone else's code.
- STAGE DEPENDENCIES, the same as in 1.1.2.
- STAGE PLANS, which carries some extra information:
- Path, describing the input partitions. Use it when you want to know at a glance what data a job reads; describe formatted gives similar information, as sketched below.
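As a quick alternative to reading Path out of explain extended, a small sketch of inspecting a single partition directly:
describe formatted odata.stock_asset_holding partition (busi_date='2017-03-17');
-- prints the partition's location, input/output format, numFiles, numRows and totalSize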
hive> explain extended
> select * from odata.stock_asset_holding dt
> where dt.busi_date='2017-03-17' and trade_id='66949';
OK
ABSTRACT SYNTAX TREE:
TOK_QUERY
TOK_FROM
TOK_TABREF
TOK_TABNAME
odata
stock_asset_holding
dt
TOK_INSERT
TOK_DESTINATION
TOK_DIR
TOK_TMP_FILE
TOK_SELECT
TOK_SELEXPR
TOK_ALLCOLREF
TOK_WHERE
and
=
.
TOK_TABLE_OR_COL
dt
busi_date
'2017-03-17'
=
TOK_TABLE_OR_COL
trade_id
'66949'
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-0 depends on stages: Stage-1
STAGE PLANS:
Stage: Stage-1
Map Reduce
Map Operator Tree:
TableScan
alias: dt
Statistics: Num rows: 455199 Data size: 11644188 Basic stats: COMPLETE Column stats: NONE
GatherStats: false
Filter Operator
isSamplingPred: false
predicate: (trade_id = '66949') (type: boolean)
Statistics: Num rows: 227599 Data size: 5822081 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: '66949' (type: string), secu_acc_id (type: string), prd_no (type: string), sys_code (type: string), qty (type: bigint), prsnt_qty (type: bigint), asset_amend_qty (type: bigint), mkt_val (type: double), last_price (type: double), undivi_last_price (type: double), scale_factor (type: double), '2017-03-17' (type: string)
outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9, _col10, _col11
Statistics: Num rows: 227599 Data size: 5822081 Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: false
GlobalTableId: 0
directory: hdfs://bigdata-02:8020/tmp/hive/hdfs/fd76c958-9d3c-468e-bbb4-9578f1e56f60/hive_2018-07-31_14-25-36_536_2545346208890276381-1/-mr-10000/.hive-staging_hive_2018-07-31_14-25-36_536_2545346208890276381-1/-ext-10001
NumFilesPerFileSink: 1
Statistics: Num rows: 227599 Data size: 5822081 Basic stats: COMPLETE Column stats: NONE
Stats Publishing Key Prefix: hdfs://bigdata-02:8020/tmp/hive/hdfs/fd76c958-9d3c-468e-bbb4-9578f1e56f60/hive_2018-07-31_14-25-36_536_2545346208890276381-1/-mr-10000/.hive-staging_hive_2018-07-31_14-25-36_536_2545346208890276381-1/-ext-10001/
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
properties:
columns _col0,_col1,_col2,_col3,_col4,_col5,_col6,_col7,_col8,_col9,_col10,_col11
columns.types string:string:string:string:bigint:bigint:bigint:double:double:double:double:string
escape.delim \
hive.serialization.extend.additional.nesting.levels true
serialization.format 1
serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
TotalFiles: 1
GatherStats: false
MultiFileSpray: false
Path -> Alias:
hdfs://bigdata-02:8020/mnt/data/bigdata/warehouse/odata.db/stock_asset_holding/busi_date=2017-03-17 [dt]
Path -> Partition:
hdfs://bigdata-02:8020/mnt/data/bigdata/warehouse/odata.db/stock_asset_holding/busi_date=2017-03-17
Partition
base file name: busi_date=2017-03-17
input format: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
output format: org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
partition values:
busi_date 2017-03-17
properties:
COLUMN_STATS_ACCURATE true
bucket_count -1
columns trade_id,secu_acc_id,prd_no,sys_code,qty,prsnt_qty,asset_amend_qty,mkt_val,last_price,undivi_last_price,scale_factor
columns.comments
columns.types string:string:string:string:bigint:bigint:bigint:double:double:double:double
file.inputformat org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
file.outputformat org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
location hdfs://bigdata-02:8020/mnt/data/bigdata/warehouse/odata.db/stock_asset_holding/busi_date=2017-03-17
name odata.stock_asset_holding
numFiles 16
numRows 455199
partition_columns busi_date
partition_columns.types string
path hdfs://bigdata-02:8020/mnt/data/bigdata/warehouse/odata.db/stock_asset_holding
rawDataSize 0
serialization.ddl struct stock_asset_holding { string trade_id, string secu_acc_id, string prd_no, string sys_code, i64 qty, i64 prsnt_qty, i64 asset_amend_qty, double mkt_val, double last_price, double undivi_last_price, double scale_factor}
serialization.format 1
serialization.lib org.apache.hadoop.hive.ql.io.orc.OrcSerde
totalSize 11644188
transient_lastDdlTime 1532953515
serde: org.apache.hadoop.hive.ql.io.orc.OrcSerde
input format: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
output format: org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
properties:
bucket_count -1
columns trade_id,secu_acc_id,prd_no,sys_code,qty,prsnt_qty,asset_amend_qty,mkt_val,last_price,undivi_last_price,scale_factor
columns.comments
columns.types string:string:string:string:bigint:bigint:bigint:double:double:double:double
file.inputformat org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
file.outputformat org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
location hdfs://bigdata-02:8020/mnt/data/bigdata/warehouse/odata.db/stock_asset_holding
name odata.stock_asset_holding
partition_columns busi_date
partition_columns.types string
path hdfs://bigdata-02:8020/mnt/data/bigdata/warehouse/odata.db/stock_asset_holding
serialization.ddl struct stock_asset_holding { string trade_id, string secu_acc_id, string prd_no, string sys_code, i64 qty, i64 prsnt_qty, i64 asset_amend_qty, double mkt_val, double last_price, double undivi_last_price, double scale_factor}
serialization.format 1
serialization.lib org.apache.hadoop.hive.ql.io.orc.OrcSerde
spark.sql.partitionProvider catalog
spark.sql.sources.provider orc
spark.sql.sources.schema.numPartCols 1
spark.sql.sources.schema.numParts 1
spark.sql.sources.schema.part.0 {"type":"struct","fields":[{"name":"trade_id","type":"string","nullable":true,"metadata":{}},{"name":"secu_acc_id","type":"string","nullable":true,"metadata":{}},{"name":"prd_no","type":"string","nullable":true,"metadata":{}},{"name":"sys_code","type":"string","nullable":true,"metadata":{}},{"name":"qty","type":"long","nullable":true,"metadata":{}},{"name":"prsnt_qty","type":"long","nullable":true,"metadata":{}},{"name":"asset_amend_qty","type":"long","nullable":true,"metadata":{}},{"name":"mkt_val","type":"double","nullable":true,"metadata":{}},{"name":"last_price","type":"double","nullable":true,"metadata":{}},{"name":"undivi_last_price","type":"double","nullable":true,"metadata":{}},{"name":"scale_factor","type":"double","nullable":true,"metadata":{}},{"name":"busi_date","type":"string","nullable":true,"metadata":{}}]}
spark.sql.sources.schema.partCol.0 busi_date
transient_lastDdlTime 1531393580
serde: org.apache.hadoop.hive.ql.io.orc.OrcSerde
name: odata.stock_asset_holding
name: odata.stock_asset_holding
Truncated Path -> Alias:
/odata.db/stock_asset_holding/busi_date=2017-03-17 [dt]
Execution mode: vectorized
Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
ListSink
Time taken: 0.095 seconds, Fetched: 141 row(s)
1.1.4 The explain dependency plan
explain dependency describes the input data the whole SQL statement depends on. The output has two parts, input_partitions and input_tables, which, as the names suggest, list the input partitions and tables. Typical uses:
- Troubleshooting, e.g. checking whether a job silently skipped a partition; basic usage is shown in example 1.
- Working out which tables a job reads and understanding how it runs, in particular the inputs of a two-table join; see examples 2-4.
-- Example 1
explain DEPENDENCY
select * from odata.stock_asset_holding dt
where dt.busi_date>'2017-03-17' and dt.busi_date<'2017-03-31';
{"input_partitions": [{"partitionName": "odata@stock_asset_holding@busi_date=2017-03-20"},
{"partitionName": "odata@stock_asset_holding@busi_date=2017-03-21"},
{"partitionName": "odata@stock_asset_holding@busi_date=2017-03-22"}],
"input_tables": [{"tablename": "odata@stock_asset_holding", "tabletype": "MANAGED_TABLE"}]}
-- Example 2: inner join
explain DEPENDENCY
select * from odata.stock_asset_holding dt
inner join odata.stock_cash_flow_detail di
on dt.trade_id=di.trade_id and dt.busi_date=di.busi_date
and dt.busi_date>='2017-03-17' and dt.busi_date<'2017-03-23';
{"input_partitions": [{"partitionName": "odata@stock_asset_holding@busi_date=2017-03-17"},
{"partitionName": "odata@stock_asset_holding@busi_date=2017-03-20"},
{"partitionName": "odata@stock_asset_holding@busi_date=2017-03-21"},
{"partitionName": "odata@stock_asset_holding@busi_date=2017-03-22"},
{"partitionName": "odata@stock_cash_flow_detail@busi_date=2017-03-16"},
{"partitionName": "odata@stock_cash_flow_detail@busi_date=2017-03-17"},
{"partitionName": "odata@stock_cash_flow_detail@busi_date=2017-03-20"},
{"partitionName": "odata@stock_cash_flow_detail@busi_date=2017-03-21"},
{"partitionName": "odata@stock_cash_flow_detail@busi_date=2017-03-22"}],
"input_tables": [{"tablename": "odata@stock_asset_holding", "tabletype": "MANAGED_TABLE"},
{"tablename": "odata@stock_cash_flow_detail", "tabletype": "MANAGED_TABLE"}]}
-- Example 3: full join
explain DEPENDENCY
select * from odata.stock_asset_holding dt
full join odata.stock_cash_flow_detail di
on dt.trade_id=di.trade_id and dt.busi_date=di.busi_date
and dt.busi_date>='2017-03-17' and dt.busi_date<'2017-03-23'
and di.busi_date>='2017-03-17' and di.busi_date<'2017-03-23'
{"input_partitions": [{"partitionName": "odata@stock_asset_holding@busi_date=2017-03-15"},
{"partitionName": "odata@stock_asset_holding@busi_date=2017-03-16"},
{"partitionName": "odata@stock_asset_holding@busi_date=2017-03-17"},
{"partitionName": "odata@stock_asset_holding@busi_date=2017-03-20"},
{"partitionName": "odata@stock_asset_holding@busi_date=2017-03-21"},
{"partitionName": "odata@stock_asset_holding@busi_date=2017-03-22"},
{"partitionName": "odata@stock_asset_holding@busi_date=2017-03-23"},
{"partitionName": "odata@stock_asset_holding@busi_date=2017-03-24"},
{"partitionName": "odata@stock_asset_holding@busi_date=2017-03-27"},
{"partitionName": "odata@stock_asset_holding@busi_date=2017-03-28"},
{"partitionName": "odata@stock_cash_flow_detail@busi_date=2017-03-16"},
{"partitionName": "odata@stock_cash_flow_detail@busi_date=2017-03-17"},
{"partitionName": "odata@stock_cash_flow_detail@busi_date=2017-03-20"},
{"partitionName": "odata@stock_cash_flow_detail@busi_date=2017-03-21"},
{"partitionName": "odata@stock_cash_flow_detail@busi_date=2017-03-22"},
{"partitionName": "odata@stock_cash_flow_detail@busi_date=2017-03-23"},
{"partitionName": "odata@stock_cash_flow_detail@busi_date=2017-03-24"},
{"partitionName": "odata@stock_cash_flow_detail@busi_date=2017-03-27"},
{"partitionName": "odata@stock_cash_flow_detail@busi_date=2017-03-28"},
"input_tables": [{"tablename": "odata@stock_asset_holding", "tabletype": "MANAGED_TABLE"},
{"tablename": "odata@stock_cash_flow_detail", "tabletype": "MANAGED_TABLE"}]}
-- Example 4: left outer join
explain DEPENDENCY
select * from odata.stock_asset_holding dt
left outer join odata.stock_cash_flow_detail di
on dt.trade_id=di.trade_id and dt.busi_date=di.busi_date
and dt.busi_date>='2017-03-17' and dt.busi_date<'2017-03-23'
and di.busi_date>='2017-03-17' and di.busi_date<'2017-03-23'
{"input_partitions": [{"partitionName": "odata@stock_asset_holding@busi_date=2017-03-15"},
{"partitionName": "odata@stock_asset_holding@busi_date=2017-03-16"},
{"partitionName": "odata@stock_asset_holding@busi_date=2017-03-17"},
{"partitionName": "odata@stock_asset_holding@busi_date=2017-03-20"},
{"partitionName": "odata@stock_asset_holding@busi_date=2017-03-21"},
{"partitionName": "odata@stock_asset_holding@busi_date=2017-03-22"},
{"partitionName": "odata@stock_asset_holding@busi_date=2017-03-23"},
{"partitionName": "odata@stock_asset_holding@busi_date=2017-03-24"},
{"partitionName": "odata@stock_asset_holding@busi_date=2017-03-27"},
{"partitionName": "odata@stock_asset_holding@busi_date=2017-03-28"},
{"partitionName": "odata@stock_cash_flow_detail@busi_date=2017-03-17"},
{"partitionName": "odata@stock_cash_flow_detail@busi_date=2017-03-20"},
{"partitionName": "odata@stock_cash_flow_detail@busi_date=2017-03-21"},
{"partitionName": "odata@stock_cash_flow_detail@busi_date=2017-03-22"}],
"input_tables": [{"tablename": "odata@stock_asset_holding", "tabletype": "MANAGED_TABLE"},
{"tablename": "odata@stock_cash_flow_detail", "tabletype": "MANAGED_TABLE"}]}
From examples 2 through 4 we can see that even with the same restriction conditions, the full join and the outer join do not limit the input to the expected partitions. So when writing Hive SQL, even when two queries are semantically equivalent from a business point of view (they produce the same final result), be careful about attaching restriction conditions to the ON clause of a full/left/right join: it can trigger a full scan of the table and drastically slow the job down. Two recommendations:
- If you must restrict tables A and B in "A full join B", rewrite the restrictions as subqueries (a sketch follows this list).
- If you must restrict tables A and B in "A left/right outer join B", rewrite the restrictions as subqueries, or put them in the WHERE clause; both forms produce the same explain DEPENDENCY output.
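A minimal sketch of the first recommendation, pushing the date restriction into subqueries before the full join (the query has the same shape as example 3, trimmed for illustration):
explain DEPENDENCY
select * from
(select * from odata.stock_asset_holding
 where busi_date>='2017-03-17' and busi_date<'2017-03-23') dt
full join
(select * from odata.stock_cash_flow_detail
 where busi_date>='2017-03-17' and busi_date<'2017-03-23') di
on dt.trade_id=di.trade_id and dt.busi_date=di.busi_date;
-- with the filters inside the subqueries, input_partitions should list only the
-- 2017-03-17 .. 2017-03-22 partitions of both tables instead of every partition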
1.1.5 The explain authorization plan
This statement shows which INPUTS the CURRENT_USER reads and which OUTPUTS it writes. In the example below, the hue user reads the odata@stock_asset_holding@busi_date=2017-03-[20-30] partitions, and the query result is written temporarily to the directory hdfs://bigdata-02:8020/tmp/hive/hue/1b128e16-8c27-4a9b-a008-9c2357556ef6/hive_2018-07-31_17-10-04_422_1048252151662607045-1/-mr-10000
explain AUTHORIZATION
select * from odata.stock_asset_holding dt
where dt.busi_date>'2017-03-17' and dt.busi_date<'2017-03-31';
INPUTS:
odata@stock_asset_holding
odata@stock_asset_holding@busi_date=2017-03-20
odata@stock_asset_holding@busi_date=2017-03-21
odata@stock_asset_holding@busi_date=2017-03-22
odata@stock_asset_holding@busi_date=2017-03-23
odata@stock_asset_holding@busi_date=2017-03-24
odata@stock_asset_holding@busi_date=2017-03-27
odata@stock_asset_holding@busi_date=2017-03-28
odata@stock_asset_holding@busi_date=2017-03-29
odata@stock_asset_holding@busi_date=2017-03-30
OUTPUTS:
hdfs://bigdata-02:8020/tmp/hive/hue/1b128e16-8c27-4a9b-a008-9c2357556ef6/hive_2018-07-31_17-10-04_422_1048252151662607045-1/-mr-10000
CURRENT_USER:
hue
OPERATION:
SWITCHDATABASE
Links
LanguageManual Explain