
/Data Warehouse/Hive Extreme Optimization - Explain Plans (1)

1. Understanding the Explain Plan

HiveQL execution follows a defined logic, just as in a relational database, and the explain plan is what describes that logic. Once you can read the explain plan, you know the whole life story of how a query runs and can see its bottlenecks at a glance. In short: to optimize SQL, first learn to read the explain plan!

1.1 How to Produce an Explain Plan

Like an RDBMS, Hive provides a quick way to produce an explain plan, shown below. Among the options, EXTENDED, DEPENDENCY and AUTHORIZATION are the ones most commonly used in day-to-day work, so we will focus on these three.

EXPLAIN [EXTENDED|DEPENDENCY|AUTHORIZATION|LOCKS|VECTORIZATION] query

Now let's walk through a concrete example.

1.1.1 Data Preparation

  1. Create the tables
CREATE TABLE `odata.stock_asset_holding`(
    `trade_id` string,
    `secu_acc_id` string,
    `prd_no` string,
    `sys_code` string,
    `qty` bigint,
    `prsnt_qty` bigint,
    `asset_amend_qty` bigint,
    `mkt_val` double,
    `last_price` double,
    `undivi_last_price` double,
    `scale_factor` double)
    PARTITIONED BY (`busi_date` string)
    STORED AS ORC;
Here we create a table stock_asset_holding in the odata database, partitioned by busi_date (a string in "yyyy-MM-dd" format) and stored as ORC files. There is no need to understand what each column means. As for choosing ORC: in real production environments ORC has already become the standard, e.g. at Ping An Technology's big data platform and the data platform of a Baidu project. ORC is a powerful complement to Hive, and we will dissect it in detail later. After creating a table, we usually check the result with the describe formatted command:
hive> describe formatted odata.stock_asset_holding;
OK
# col_name            	data_type           	comment

trade_id            	string
secu_acc_id         	string
prd_no              	string
sys_code            	string
qty                 	bigint
prsnt_qty           	bigint
asset_amend_qty     	bigint
mkt_val             	double
last_price          	double
undivi_last_price   	double
scale_factor        	double

# Partition Information
# col_name            	data_type           	comment

busi_date           	string

# Detailed Table Information
Database:           	odata
Owner:              	hdfs
CreateTime:         	Thu Jul 12 19:06:20 CST 2018
LastAccessTime:     	UNKNOWN
Protect Mode:       	None
Retention:          	0
Location:           	hdfs://bigdata-02:8020/mnt/data/bigdata/warehouse/odata.db/stock_asset_holding
Table Type:         	MANAGED_TABLE
Table Parameters:
    numPartitions       	51
    spark.sql.partitionProvider	catalog
    spark.sql.sources.provider	orc
    spark.sql.sources.schema.numPartCols	1
    spark.sql.sources.schema.numParts	1
    spark.sql.sources.schema.part.0	{\"type\":\"struct\",\"fields\":[{\"name\":\"trade_id\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},
    {\"name\":\"secu_acc_id\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},
    {\"name\":\"prd_no\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},
    {\"name\":\"sys_code\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},
    {\"name\":\"qty\",\"type\":\"long\",\"nullable\":true,\"metadata\":{}},
    {\"name\":\"prsnt_qty\",\"type\":\"long\",\"nullable\":true,\"metadata\":{}},
    {\"name\":\"asset_amend_qty\",\"type\":\"long\",\"nullable\":true,\"metadata\":{}},
    {\"name\":\"mkt_val\",\"type\":\"double\",\"nullable\":true,\"metadata\":{}},
    {\"name\":\"last_price\",\"type\":\"double\",\"nullable\":true,\"metadata\":{}},
    {\"name\":\"undivi_last_price\",\"type\":\"double\",\"nullable\":true,\"metadata\":{}},
    {\"name\":\"scale_factor\",\"type\":\"double\",\"nullable\":true,\"metadata\":{}},
    {\"name\":\"busi_date\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}}]}
    spark.sql.sources.schema.partCol.0	busi_date
    transient_lastDdlTime	1531393580

# Storage Information
SerDe Library:      	org.apache.hadoop.hive.ql.io.orc.OrcSerde
InputFormat:        	org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
OutputFormat:       	org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
Compressed:         	No
Num Buckets:        	-1
Bucket Columns:     	[]
Sort Columns:       	[]
Storage Desc Params:
    path                	hdfs://bigdata-02:8020/mnt/data/bigdata/warehouse/odata.db/stock_asset_holding
    serialization.format	1
Time taken: 0.494 seconds, Fetched: 49 row(s)
In a production environment, we generally focus on the following points in this output:

  1. Table Type, the type of the table: MANAGED_TABLE is an internal (managed) table, EXTERNAL_TABLE is an external table.
  2. Location, where the table data is stored.
  3. SerDe Library, the serialization/deserialization library.
  4. InputFormat/OutputFormat, the input and output file formats; here they are OrcInputFormat/OrcOutputFormat.
  5. Compressed, whether the table data is compressed.
  6. Num Buckets/Bucket Columns, the number of buckets and the bucketing columns. Bucketing is another important topic in Hive optimization that we will cover later; for now we just mention it (see the DDL sketch after this list).
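For reference, here is a hedged sketch (hypothetical table names and a reduced column list) of how the same kind of schema would be declared as an external table and as a bucketed table, i.e. where Table Type and Num Buckets/Bucket Columns in the output above would differ:

-- Hypothetical external variant: Hive manages only the metadata, the files
-- stay at the given LOCATION (Table Type would show EXTERNAL_TABLE).
CREATE EXTERNAL TABLE odata.stock_asset_holding_ext (
  trade_id string,
  qty      bigint,
  mkt_val  double)
PARTITIONED BY (busi_date string)
STORED AS ORC
LOCATION 'hdfs://bigdata-02:8020/mnt/data/external/stock_asset_holding_ext';

-- Hypothetical bucketed variant: rows are hashed on trade_id into 16 buckets
-- (Num Buckets: 16, Bucket Columns: [trade_id]).
CREATE TABLE odata.stock_asset_holding_bkt (
  trade_id string,
  qty      bigint,
  mkt_val  double)
PARTITIONED BY (busi_date string)
CLUSTERED BY (trade_id) INTO 16 BUCKETS
STORED AS ORC;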
Following the same approach, we create another table:
CREATE TABLE `odata.stock_cash_flow_detail`(
`trade_id` string,
`secu_acc_id` string,
`prd_no` string,
`timestamp` bigint,
`busi_flag_code` string,
`busi_flag_name` string,
`trd_qty` bigint,
`trd_cash_flow` double,
`orig_cash_flow` double,
`trd_capital_type` string,
`cash_flow_modi_label` bigint,
`trd_amt` double,
`trd_price` double,
`sys_code` string,
`trd_type` string,
`inner_busi_flag_code` string,
`amortize_label` bigint)
PARTITIONED BY (`busi_date` string)
STORED AS ORC;
  2. Generate data. Please generate the data yourself; below is the result of analyzing the data I generated (a hedged sketch of one way to populate and analyze a partition follows this output).
hive> ANALYZE TABLE odata.stock_asset_holding PARTITION(busi_date='2017-03-17') COMPUTE STATISTICS;
Query ID = hdfs_20180730200707_26903574-8c9b-4a40-af92-154079f5493e
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1530865492016_2702, Tracking URL = http://bigdata-03:8088/proxy/application_1530865492016_2702/
Kill Command = /opt/cloudera/parcels/CDH-5.14.0-1.cdh5.14.0.p0.24/lib/hadoop/bin/hadoop job  -kill job_1530865492016_2702
Hadoop job information for Stage-0: number of mappers: 3; number of reducers: 0
2018-07-30 20:07:56,961 Stage-0 map = 0%,  reduce = 0%
2018-07-30 20:08:04,274 Stage-0 map = 100%,  reduce = 0%, Cumulative CPU 11.3 sec
MapReduce Total cumulative CPU time: 11 seconds 300 msec
Ended Job = job_1530865492016_2702
*Partition odata.stock_asset_holding{busi_date=2017-03-17} stats: [numFiles=16, numRows=455199, totalSize=11644188, rawDataSize=0]*
MapReduce Jobs Launched:
Stage-Stage-0: Map: 3   Cumulative CPU: 11.3 sec   HDFS Read: 283021 HDFS Write: 317 SUCCESS
Total MapReduce CPU Time Spent: 11 seconds 300 msec
OK
Time taken: 17.007 seconds
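Since the generated data is left to the reader, here is a minimal, hedged sketch (with made-up values) of one way to put a few rows into a partition by hand and to extend the row-level statistics above with column statistics; in a real environment the data would of course come from upstream loads:

-- Hypothetical sample row; the 11 values match the non-partition columns
-- of odata.stock_asset_holding and are purely made up.
INSERT INTO TABLE odata.stock_asset_holding PARTITION (busi_date = '2017-03-17')
VALUES ('66949', 'A0001', 'P001', 'S1', 100, 0, 0, 1000.0, 10.0, 10.0, 1.0);

-- Besides the row/size statistics computed above, column statistics can be
-- gathered as well; without them the explain plans below report
-- "Column stats: NONE".
ANALYZE TABLE odata.stock_asset_holding PARTITION (busi_date = '2017-03-17')
  COMPUTE STATISTICS FOR COLUMNS;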

1.1.2 The basic explain format

hive>
    >
    > explain
    > select * from odata.stock_asset_holding dt
    > where dt.busi_date='2017-03-17' and trade_id='66949';
OK
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: dt
            Statistics: Num rows: 455199 Data size: 11644188 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: (trade_id = '66949') (type: boolean)
              Statistics: Num rows: 227599 Data size: 5822081 Basic stats: COMPLETE Column stats: NONE
              Select Operator
                expressions: '66949' (type: string), secu_acc_id (type: string), prd_no (type: string), sys_code (type: string), qty (type: bigint), prsnt_qty (type: bigint), asset_amend_qty (type: bigint), mkt_val (type: double), last_price (type: double), undivi_last_price (type: double), scale_factor (type: double), '2017-03-17' (type: string)
                outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9, _col10, _col11
                Statistics: Num rows: 227599 Data size: 5822081 Basic stats: COMPLETE Column stats: NONE
                File Output Operator
                  compressed: false
                  Statistics: Num rows: 227599 Data size: 5822081 Basic stats: COMPLETE Column stats: NONE
                  table:
                      input format: org.apache.hadoop.mapred.TextInputFormat
                      output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                      serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
      Execution mode: vectorized

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

Time taken: 0.092 seconds, Fetched: 33 row(s)
--- end of command-line output ---

The explain plan above is made up of two parts: STAGE DEPENDENCIES and STAGE PLANS.

STAGE DEPENDENCIES describes the jobs that will ultimately be generated. Here we see two stages, Stage-1 and Stage-0, with Stage-0 depending on Stage-1.

STAGE PLANS describes the execution logic of the program. Each stage can contain a Map phase and a Reduce phase; to make things easier to follow I picked a query that only has a Map phase. Once you can read the Map phase, the Reduce phase is just as easy. The plan above therefore reads as follows.

Stage-1:

  1. The job runs in vectorized execution mode, which turns row-at-a-time processing into batch processing, 1024 rows per batch by default (a sketch of the corresponding setting follows below).
  2. Map Operator Tree: first a table scan (TableScan) is performed; an estimated 455199 rows and 11644188 bytes of data will be scanned, and the table alias is dt.
  3. Filter Operator: a row-filtering operation with the predicate trade_id = '66949'; an estimated 227599 rows and 5822081 bytes of data will be processed.
  4. Select Operator: column projection. For this part I suggest reading from the deepest indentation outward, which gives:
     4.1 The output table descriptor: the input format is org.apache.hadoop.mapred.TextInputFormat, the output format is org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, and the serialization library is org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.
     4.2 The output is estimated at 227599 rows and 5822081 bytes of data.
     4.3 The output is not compressed.
     4.4 Twelve columns are output in total; since in MapReduce all columns are referred to by placeholders, they appear here as _col0 through _col11.
     4.5 expressions describes the data type of each output column.

Stage-0:

  1. The Fetch Operator represents the client fetching the result set; in other words, inside the cluster this step can be skipped.
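For reference, vectorized execution is controlled by a session-level switch; a minimal hedged sketch using the standard Hive property:

-- Toggling vectorized execution for the current session; when it is off,
-- the "Execution mode: vectorized" line disappears from the plan.
SET hive.vectorized.execution.enabled = true;
SET hive.vectorized.execution.enabled = false;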

Reading this far, I am sure a few questions have come up: we created odata.stock_asset_holding in ORC format, so why does the explain plan report TextInputFormat? And where does the estimate of 227599 rows come from? Before answering, keep one concept in mind: the explain plan is a simple description that Hive derives from statistics, not the plan of an actual run. In short, this explain plan is "fake": it helps us roughly understand the whole execution process, but it cannot help us pinpoint problems.
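As a quick sanity check on those estimates, they can be compared with the real numbers; a hedged sketch (because column statistics were not computed, Hive falls back to a default selectivity, such as halving the partition's row count for the equality predicate, which is how 455199 becomes the estimated 227599 rather than a measured value):

-- The plan's 227599 rows is only a statistics-based guess; the real number
-- of matching rows has to come from actually running the query.
SELECT count(*)
FROM odata.stock_asset_holding
WHERE busi_date = '2017-03-17'
  AND trade_id = '66949';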

1.1.3 The explain extended format

The explain plan below is long, but there is not much new in it. It consists of three parts:

  1. ABSTRACT SYNTAX TREE (AST), the abstract syntax tree. In practice this is useful for quickly untangling the logic of a piece of SQL, especially when reviewing someone else's code.
  2. STAGE DEPENDENCIES, the same as in 1.1.2.
  3. STAGE PLANS, which carries a bit more information here. The main addition is Path, which describes the input partition information; use it when you want a quick view of the data a query reads (describe formatted can of course show this as well).
hive> explain extended
    > select * from odata.stock_asset_holding dt
    > where dt.busi_date='2017-03-17' and trade_id='66949';
OK
ABSTRACT SYNTAX TREE:

TOK_QUERY
   TOK_FROM
      TOK_TABREF
         TOK_TABNAME
            odata
            stock_asset_holding
         dt
   TOK_INSERT
      TOK_DESTINATION
         TOK_DIR
            TOK_TMP_FILE
      TOK_SELECT
         TOK_SELEXPR
            TOK_ALLCOLREF
      TOK_WHERE
         and
            =
               .
                  TOK_TABLE_OR_COL
                     dt
                  busi_date
               '2017-03-17'
            =
               TOK_TABLE_OR_COL
                  trade_id
               '66949'


STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: dt
            Statistics: Num rows: 455199 Data size: 11644188 Basic stats: COMPLETE Column stats: NONE
            GatherStats: false
            Filter Operator
              isSamplingPred: false
              predicate: (trade_id = '66949') (type: boolean)
              Statistics: Num rows: 227599 Data size: 5822081 Basic stats: COMPLETE Column stats: NONE
              Select Operator
                expressions: '66949' (type: string), secu_acc_id (type: string), prd_no (type: string), sys_code (type: string), qty (type: bigint), prsnt_qty (type: bigint), asset_amend_qty (type: bigint), mkt_val (type: double), last_price (type: double), undivi_last_price (type: double), scale_factor (type: double), '2017-03-17' (type: string)
                outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9, _col10, _col11
                Statistics: Num rows: 227599 Data size: 5822081 Basic stats: COMPLETE Column stats: NONE
                File Output Operator
                  compressed: false
                  GlobalTableId: 0
                  directory: hdfs://bigdata-02:8020/tmp/hive/hdfs/fd76c958-9d3c-468e-bbb4-9578f1e56f60/hive_2018-07-31_14-25-36_536_2545346208890276381-1/-mr-10000/.hive-staging_hive_2018-07-31_14-25-36_536_2545346208890276381-1/-ext-10001
                  NumFilesPerFileSink: 1
                  Statistics: Num rows: 227599 Data size: 5822081 Basic stats: COMPLETE Column stats: NONE
                  Stats Publishing Key Prefix: hdfs://bigdata-02:8020/tmp/hive/hdfs/fd76c958-9d3c-468e-bbb4-9578f1e56f60/hive_2018-07-31_14-25-36_536_2545346208890276381-1/-mr-10000/.hive-staging_hive_2018-07-31_14-25-36_536_2545346208890276381-1/-ext-10001/
                  table:
                      input format: org.apache.hadoop.mapred.TextInputFormat
                      output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                      properties:
                        columns _col0,_col1,_col2,_col3,_col4,_col5,_col6,_col7,_col8,_col9,_col10,_col11
                        columns.types string:string:string:string:bigint:bigint:bigint:double:double:double:double:string
                        escape.delim \
                        hive.serialization.extend.additional.nesting.levels true
                        serialization.format 1
                        serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                      serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                  TotalFiles: 1
                  GatherStats: false
                  MultiFileSpray: false
      Path -> Alias:
        hdfs://bigdata-02:8020/mnt/data/bigdata/warehouse/odata.db/stock_asset_holding/busi_date=2017-03-17 [dt]
      Path -> Partition:
        hdfs://bigdata-02:8020/mnt/data/bigdata/warehouse/odata.db/stock_asset_holding/busi_date=2017-03-17
          Partition
            base file name: busi_date=2017-03-17
            input format: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
            output format: org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
            partition values:
              busi_date 2017-03-17
            properties:
              COLUMN_STATS_ACCURATE true
              bucket_count -1
              columns trade_id,secu_acc_id,prd_no,sys_code,qty,prsnt_qty,asset_amend_qty,mkt_val,last_price,undivi_last_price,scale_factor
              columns.comments
              columns.types string:string:string:string:bigint:bigint:bigint:double:double:double:double
              file.inputformat org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
              file.outputformat org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
              location hdfs://bigdata-02:8020/mnt/data/bigdata/warehouse/odata.db/stock_asset_holding/busi_date=2017-03-17
              name odata.stock_asset_holding
              numFiles 16
              numRows 455199
              partition_columns busi_date
              partition_columns.types string
              path hdfs://bigdata-02:8020/mnt/data/bigdata/warehouse/odata.db/stock_asset_holding
              rawDataSize 0
              serialization.ddl struct stock_asset_holding { string trade_id, string secu_acc_id, string prd_no, string sys_code, i64 qty, i64 prsnt_qty, i64 asset_amend_qty, double mkt_val, double last_price, double undivi_last_price, double scale_factor}
              serialization.format 1
              serialization.lib org.apache.hadoop.hive.ql.io.orc.OrcSerde
              totalSize 11644188
              transient_lastDdlTime 1532953515
            serde: org.apache.hadoop.hive.ql.io.orc.OrcSerde

              input format: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
              output format: org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
              properties:
                bucket_count -1
                columns trade_id,secu_acc_id,prd_no,sys_code,qty,prsnt_qty,asset_amend_qty,mkt_val,last_price,undivi_last_price,scale_factor
                columns.comments
                columns.types string:string:string:string:bigint:bigint:bigint:double:double:double:double
                file.inputformat org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
                file.outputformat org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
                location hdfs://bigdata-02:8020/mnt/data/bigdata/warehouse/odata.db/stock_asset_holding
                name odata.stock_asset_holding
                partition_columns busi_date
                partition_columns.types string
                path hdfs://bigdata-02:8020/mnt/data/bigdata/warehouse/odata.db/stock_asset_holding
                serialization.ddl struct stock_asset_holding { string trade_id, string secu_acc_id, string prd_no, string sys_code, i64 qty, i64 prsnt_qty, i64 asset_amend_qty, double mkt_val, double last_price, double undivi_last_price, double scale_factor}
                serialization.format 1
                serialization.lib org.apache.hadoop.hive.ql.io.orc.OrcSerde
                spark.sql.partitionProvider catalog
                spark.sql.sources.provider orc
                spark.sql.sources.schema.numPartCols 1
                spark.sql.sources.schema.numParts 1
                spark.sql.sources.schema.part.0 {"type":"struct","fields":[{"name":"trade_id","type":"string","nullable":true,"metadata":{}},{"name":"secu_acc_id","type":"string","nullable":true,"metadata":{}},{"name":"prd_no","type":"string","nullable":true,"metadata":{}},{"name":"sys_code","type":"string","nullable":true,"metadata":{}},{"name":"qty","type":"long","nullable":true,"metadata":{}},{"name":"prsnt_qty","type":"long","nullable":true,"metadata":{}},{"name":"asset_amend_qty","type":"long","nullable":true,"metadata":{}},{"name":"mkt_val","type":"double","nullable":true,"metadata":{}},{"name":"last_price","type":"double","nullable":true,"metadata":{}},{"name":"undivi_last_price","type":"double","nullable":true,"metadata":{}},{"name":"scale_factor","type":"double","nullable":true,"metadata":{}},{"name":"busi_date","type":"string","nullable":true,"metadata":{}}]}
                spark.sql.sources.schema.partCol.0 busi_date
                transient_lastDdlTime 1531393580
              serde: org.apache.hadoop.hive.ql.io.orc.OrcSerde
              name: odata.stock_asset_holding
            name: odata.stock_asset_holding
      Truncated Path -> Alias:
        /odata.db/stock_asset_holding/busi_date=2017-03-17 [dt]
      Execution mode: vectorized

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

Time taken: 0.095 seconds, Fetched: 141 row(s)

1.1.4 The explain dependency format

This form describes the input data the whole SQL statement depends on. The output has two parts, input_partitions and input_tables, which, as the names suggest, are the input partitions and input tables. Practical use cases:

  1. Troubleshooting, e.g. checking whether a job may have skipped a partition during its run; see Example 1 for basic usage.
  2. Clarifying which table inputs a job depends on and understanding how it runs, especially the dependent inputs when two tables are joined; see Examples 2-4.
-- Example 1
explain DEPENDENCY
select * from odata.stock_asset_holding dt
where dt.busi_date>'2017-03-17' and dt.busi_date<'2017-03-31'; 

{"input_partitions": [{"partitionName": "odata@stock_asset_holding@busi_date=2017-03-20"},
                      {"partitionName": "odata@stock_asset_holding@busi_date=2017-03-21"},
                      {"partitionName": "odata@stock_asset_holding@busi_date=2017-03-22"}],
 "input_tables": [{"tablename": "odata@stock_asset_holding", "tabletype": "MANAGED_TABLE"}]}
-- Example 2: inner join between two tables
explain DEPENDENCY
select * from odata.stock_asset_holding dt
inner join odata.stock_cash_flow_detail di
on dt.trade_id=di.trade_id and dt.busi_date=di.busi_date  
and dt.busi_date>='2017-03-17' and dt.busi_date<'2017-03-23'; 

{"input_partitions": [{"partitionName": "odata@stock_asset_holding@busi_date=2017-03-17"},
                      {"partitionName": "odata@stock_asset_holding@busi_date=2017-03-20"},
                      {"partitionName": "odata@stock_asset_holding@busi_date=2017-03-21"},
                      {"partitionName": "odata@stock_asset_holding@busi_date=2017-03-22"},
                      {"partitionName": "odata@stock_cash_flow_detail@busi_date=2017-03-16"},
                      {"partitionName": "odata@stock_cash_flow_detail@busi_date=2017-03-17"},
                      {"partitionName": "odata@stock_cash_flow_detail@busi_date=2017-03-20"},
                      {"partitionName": "odata@stock_cash_flow_detail@busi_date=2017-03-21"},
                      {"partitionName": "odata@stock_cash_flow_detail@busi_date=2017-03-22"}],
 "input_tables": [{"tablename": "odata@stock_asset_holding", "tabletype": "MANAGED_TABLE"},
                  {"tablename": "odata@stock_cash_flow_detail", "tabletype": "MANAGED_TABLE"}]}
-- Example 3: full join between two tables
explain DEPENDENCY
select * from odata.stock_asset_holding dt
full join odata.stock_cash_flow_detail di
on dt.trade_id=di.trade_id and dt.busi_date=di.busi_date  
and dt.busi_date>='2017-03-17' and dt.busi_date<'2017-03-23'
and di.busi_date>='2017-03-17' and di.busi_date<'2017-03-23';

{"input_partitions": [{"partitionName": "odata@stock_asset_holding@busi_date=2017-03-15"},
                      {"partitionName": "odata@stock_asset_holding@busi_date=2017-03-16"},
                      {"partitionName": "odata@stock_asset_holding@busi_date=2017-03-17"},
                      {"partitionName": "odata@stock_asset_holding@busi_date=2017-03-20"},
                      {"partitionName": "odata@stock_asset_holding@busi_date=2017-03-21"},
                      {"partitionName": "odata@stock_asset_holding@busi_date=2017-03-22"},
                      {"partitionName": "odata@stock_asset_holding@busi_date=2017-03-23"},
                      {"partitionName": "odata@stock_asset_holding@busi_date=2017-03-24"},
                      {"partitionName": "odata@stock_asset_holding@busi_date=2017-03-27"},
                      {"partitionName": "odata@stock_asset_holding@busi_date=2017-03-28"},
                      {"partitionName": "odata@stock_cash_flow_detail@busi_date=2017-03-16"},
                      {"partitionName": "odata@stock_cash_flow_detail@busi_date=2017-03-17"},
                      {"partitionName": "odata@stock_cash_flow_detail@busi_date=2017-03-20"},
                      {"partitionName": "odata@stock_cash_flow_detail@busi_date=2017-03-21"},
                      {"partitionName": "odata@stock_cash_flow_detail@busi_date=2017-03-22"},
                      {"partitionName": "odata@stock_cash_flow_detail@busi_date=2017-03-23"},
                      {"partitionName": "odata@stock_cash_flow_detail@busi_date=2017-03-24"},
                      {"partitionName": "odata@stock_cash_flow_detail@busi_date=2017-03-27"},
                      {"partitionName": "odata@stock_cash_flow_detail@busi_date=2017-03-28"},
 "input_tables": [{"tablename": "odata@stock_asset_holding", "tabletype": "MANAGED_TABLE"},
                  {"tablename": "odata@stock_cash_flow_detail", "tabletype": "MANAGED_TABLE"}]}
-- Example 4: left outer join between two tables
explain DEPENDENCY
select * from odata.stock_asset_holding dt
left outer join odata.stock_cash_flow_detail di
on dt.trade_id=di.trade_id and dt.busi_date=di.busi_date  
and dt.busi_date>='2017-03-17' and dt.busi_date<'2017-03-23'
and di.busi_date>='2017-03-17' and di.busi_date<'2017-03-23';

{"input_partitions": [{"partitionName": "odata@stock_asset_holding@busi_date=2017-03-15"},
                      {"partitionName": "odata@stock_asset_holding@busi_date=2017-03-16"},
                      {"partitionName": "odata@stock_asset_holding@busi_date=2017-03-17"},
                      {"partitionName": "odata@stock_asset_holding@busi_date=2017-03-20"},
                      {"partitionName": "odata@stock_asset_holding@busi_date=2017-03-21"},
                      {"partitionName": "odata@stock_asset_holding@busi_date=2017-03-22"},
                      {"partitionName": "odata@stock_asset_holding@busi_date=2017-03-23"},
                      {"partitionName": "odata@stock_asset_holding@busi_date=2017-03-24"},
                      {"partitionName": "odata@stock_asset_holding@busi_date=2017-03-27"},
                      {"partitionName": "odata@stock_asset_holding@busi_date=2017-03-28"},
                      {"partitionName": "odata@stock_cash_flow_detail@busi_date=2017-03-17"},
                      {"partitionName": "odata@stock_cash_flow_detail@busi_date=2017-03-20"},
                      {"partitionName": "odata@stock_cash_flow_detail@busi_date=2017-03-21"},
                      {"partitionName": "odata@stock_cash_flow_detail@busi_date=2017-03-22"}],
 "input_tables": [{"tablename": "odata@stock_asset_holding", "tabletype": "MANAGED_TABLE"},
                  {"tablename": "odata@stock_cash_flow_detail", "tabletype": "MANAGED_TABLE"}]}

From Examples 2 through 4 we can see that even with identical filter conditions, the full join and the outer join do not restrict the input partitions as intended. So when using Hive, even when the business semantics are the same (i.e. the final results are the same), be careful about attaching filter conditions to the ON clause of a full/left/right join, because it can trigger a full scan of the table and drastically slow the job down. Two recommendations (a rewrite sketch follows this list):

  1. If you must constrain tables A and B in an A full join B, rewrite the constraints as subqueries.
  2. If you must constrain tables A and B in an A left/right outer join B, either rewrite the constraints as subqueries or put them in the WHERE clause; both forms produce the same explain DEPENDENCY output.
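A hedged sketch of the subquery rewrite for Example 3's full join: each side is filtered inside a derived table, so partition pruning happens before the join instead of the join reading every partition:

-- Rewrite of Example 3: constrain each input in a subquery so that only the
-- intended busi_date partitions appear in input_partitions.
explain DEPENDENCY
select *
from (select * from odata.stock_asset_holding
      where busi_date >= '2017-03-17' and busi_date < '2017-03-23') dt
full join (select * from odata.stock_cash_flow_detail
           where busi_date >= '2017-03-17' and busi_date < '2017-03-23') di
on dt.trade_id = di.trade_id and dt.busi_date = di.busi_date;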

1.1.5 The explain authorization format

This form shows which INPUTS the CURRENT_USER reads and which OUTPUTS it writes. In the example below, the hue user reads the odata@stock_asset_holding@busi_date=2017-03-[20-30] partitions, and the query result is temporarily written to the file hdfs://bigdata-02:8020/tmp/hive/hue/1b128e16-8c27-4a9b-a008-9c2357556ef6/hive_2018-07-31_17-10-04_422_1048252151662607045-1/-mr-10000.

explain AUTHORIZATION
    select * from odata.stock_asset_holding dt
    where dt.busi_date>'2017-03-17' and dt.busi_date<'2017-03-31'; 

	INPUTS: 
	  odata@stock_asset_holding
	  odata@stock_asset_holding@busi_date=2017-03-20
	  odata@stock_asset_holding@busi_date=2017-03-21
	  odata@stock_asset_holding@busi_date=2017-03-22
	  odata@stock_asset_holding@busi_date=2017-03-23
	  odata@stock_asset_holding@busi_date=2017-03-24
	  odata@stock_asset_holding@busi_date=2017-03-27
	  odata@stock_asset_holding@busi_date=2017-03-28
	  odata@stock_asset_holding@busi_date=2017-03-29
	  odata@stock_asset_holding@busi_date=2017-03-30
	OUTPUTS: 
	  hdfs://bigdata-02:8020/tmp/hive/hue/1b128e16-8c27-4a9b-a008-9c2357556ef6/hive_2018-07-31_17-10-04_422_1048252151662607045-1/-mr-10000
	CURRENT_USER: 
	  hue
	OPERATION: 
	  SWITCHDATABASE

Links

LanguageManual Explain

