35

Hive数据过滤之表过滤

 3 years ago
source link: https://mp.weixin.qq.com/s?__biz=MzA4NzA5NzE5Ng%3D%3D&%3Bmid=2650230317&%3Bidx=1&%3Bsn=2870a21f2495d3bc06a2c753951e6a02
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.

点击关注上方“ 知了小巷 ”,

设为“置顶或星标”,第一时间送达干货。

Hive数据过滤之表过滤

hadoop-3.1.1 hive-3.1.1

Hive表:

hive> desc emp;
OK
empno                int                                      
ename                varchar(10)                              
job                  varchar(9)                               
mgr                  int                                      
hiredate             date                                     
sal                  float                                    
comm                 float                                    
deptno               int                                      
Time taken: 0.313 seconds, Fetched: 8 row(s)

Hive中的数据过滤

  1. where子句过滤

  2. having子句过滤

  3. distinct子句过滤

  4. 表过滤

  5. 分区过滤

  6. 分桶过滤

  7. 索引过滤

  8. 列过滤

表过滤

HiveSQL执行计划中,在Map阶段或者FetchOperaotr中都会有TableScan,表扫描。表过滤就是要过滤掉同一个SQL语句中需要多次查询相同表的数据,将重复的访问操作过滤并压缩成只读一次,可以简单的理解为一次读取多次使用。

多个查询语句union all在一块

查询不同job的最大、最小销售额

-- 动态创建表分区
hive> set hive.exec.dynamic.partition=true;
hive> set hive.exec.dynamic.partition.mode=nonstrict;
-- 创建分区表
hive> create table emp_stat(job string, stat decimal(15,2)) partitioned by (tp string);
OK
...
-- 执行计划
hive> explain
    > insert into table emp_stat partition(tp)
    > select job, max(sal) stat, 'max' tp
    > from emp
    > group by job
    > union all
    > select job, min(sal) stat, 'min' tp
    > from emp
    > group by job;
OK
-- 总共有10个阶段
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  -- 第2阶段依赖于第1和第9
  Stage-2 depends on stages: Stage-1, Stage-9
  Stage-8 depends on stages: Stage-2 , consists of Stage-5, Stage-4, Stage-6
  Stage-5
  Stage-0 depends on stages: Stage-5, Stage-4, Stage-7
  Stage-3 depends on stages: Stage-0
  Stage-4
  Stage-6
  Stage-7 depends on stages: Stage-6
  Stage-9 is a root stage

STAGE PLANS:
  -- 第1阶段,计算最大值
  Stage: Stage-1
    Map Reduce
      -- Map端
      Map Operator Tree:
          -- 表扫描
          TableScan
            alias: emp
            ...
      Execution mode: vectorized
      -- Reduce端
      Reduce Operator Tree:
        Group By Operator
          -- 计算最大值
          aggregations: max(VALUE._col0)
          keys: KEY._col0 (type: varchar(9))
          ...
  -- 第2阶段:合并第1阶段的最大值和第9阶段的最小值
  Stage: Stage-2
    Map Reduce
      Map Operator Tree:
          TableScan
            Union
              ...
          TableScan
            Union
              ...
      Reduce Operator Tree:
        Group By Operator
          aggregations: compute_stats(VALUE._col0, 'hll'), compute_stats(VALUE._col2, 'hll')
          keys: KEY._col0 (type: string)
          ...

  ...

  -- 第9阶段,计算最小值
  Stage: Stage-9
    Map Reduce
      -- Map端
      Map Operator Tree:
          -- 表扫描
          TableScan
            alias: emp
            ...
      Execution mode: vectorized
      -- Reduce端
      Reduce Operator Tree:
        Group By Operator
          -- 计算最小值
          aggregations: min(VALUE._col0)
          keys: KEY._col0 (type: varchar(9))
          ...

...

上述的执行计划可以看到,在计算最大值和最小值的第1和第9两个阶段,分别在Map端都进行了一次表扫描TableScan操作。表是一样的,where条件也是一样的,存在重复读取表数据的操作,如果一次读取的数据比较大时会占用更多的IO资源。为了避免这种情况的发生,Hive引入了 from ... select 的形式,查询的表在一次读取之后,可以被多个查询语句使用。

from...select

hive> explain
    > from emp
    > insert into table emp_stat partition(tp)
    > select job, max(sal) stat, 'max' tp
    > group by job
    > insert into table emp_stat partition(tp)
    > select job, min(sal) stat, 'min' tp
    > group by job;
OK
-- 总共7个阶段
STAGE DEPENDENCIES:
  Stage-2 is a root stage
  Stage-0 depends on stages: Stage-2
  Stage-3 depends on stages: Stage-0
  Stage-4 depends on stages: Stage-2
  Stage-5 depends on stages: Stage-1, Stage-4, Stage-6
  Stage-1 depends on stages: Stage-2
  Stage-6 depends on stages: Stage-2

STAGE PLANS:
  Stage: Stage-2
    Map Reduce
      -- Map阶段只进行一次表扫描
      Map Operator Tree:
          TableScan
            alias: emp
            ...
      Execution mode: vectorized
      -- Reduce阶段同时计算最大值和最小值
      Reduce Operator Tree:
        -- Forward
        Forward
          Statistics: ...
          Group By Operator
            -- 计算最大值
            aggregations: max(VALUE._col0)
            keys: KEY._col0 (type: varchar(9))
            ...
          Group By Operator
            -- 计算最小值
            aggregations: min(VALUE._col0)
            keys: KEY._col0 (type: varchar(9))
            ...

  ...

...

from...select只需要一次数据读取就可以完成最大值和最小值计算。

往期精选

Apache Kafka生产环境集群资源规划与配置

Hive数据过滤之distinct子句

Hive数据过滤之having子句

入门Apache Kafka需要了解的方方面面

Spark Core基础面试题总结(下)

Spark源码解析-Yarn部署流程(ApplicationMaster)

Spark源码解析-Yarn部署流程(SparkSubmit)

Spark Core基础面试题总结(上)

Spark技术栈-Scala

数据中台实战系列笔记

浅谈OLAP系统核心技术点(建议收藏)

HBase基础面试题总结

Hive基础面试题总结

MapReduce和YARN基础面试题总结

HDFS基础面试题总结

nIzeUzu.png!mobile


About Joyk


Aggregate valuable and interesting links.
Joyk means Joy of geeK