Hive数据过滤之表过滤

点击关注上方“ 知了小巷 ”，

设为“置顶或星标”，第一时间送达干货。

Hive数据过滤之表过滤

hadoop-3.1.1 hive-3.1.1

Hive表：

hive> desc emp;
OK
empno                int                                      
ename                varchar(10)                              
job                  varchar(9)                               
mgr                  int                                      
hiredate             date                                     
sal                  float                                    
comm                 float                                    
deptno               int                                      
Time taken: 0.313 seconds, Fetched: 8 row(s)

Hive中的数据过滤

where子句过滤
having子句过滤
distinct子句过滤
表过滤
分区过滤
分桶过滤
索引过滤
列过滤

表过滤

HiveSQL执行计划中，在Map阶段或者FetchOperaotr中都会有TableScan，表扫描。表过滤就是要过滤掉同一个SQL语句中需要多次查询相同表的数据，将重复的访问操作过滤并压缩成只读一次，可以简单的理解为一次读取多次使用。

多个查询语句union all在一块

查询不同job的最大、最小销售额

-- 动态创建表分区
hive> set hive.exec.dynamic.partition=true;
hive> set hive.exec.dynamic.partition.mode=nonstrict;
-- 创建分区表
hive> create table emp_stat(job string, stat decimal(15,2)) partitioned by (tp string);
OK
...
-- 执行计划
hive> explain
    > insert into table emp_stat partition(tp)
    > select job, max(sal) stat, 'max' tp
    > from emp
    > group by job
    > union all
    > select job, min(sal) stat, 'min' tp
    > from emp
    > group by job;
OK
-- 总共有10个阶段
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  -- 第2阶段依赖于第1和第9
  Stage-2 depends on stages: Stage-1, Stage-9
  Stage-8 depends on stages: Stage-2 , consists of Stage-5, Stage-4, Stage-6
  Stage-5
  Stage-0 depends on stages: Stage-5, Stage-4, Stage-7
  Stage-3 depends on stages: Stage-0
  Stage-4
  Stage-6
  Stage-7 depends on stages: Stage-6
  Stage-9 is a root stage

STAGE PLANS:
  -- 第1阶段，计算最大值
  Stage: Stage-1
    Map Reduce
      -- Map端
      Map Operator Tree:
          -- 表扫描
          TableScan
            alias: emp
            ...
      Execution mode: vectorized
      -- Reduce端
      Reduce Operator Tree:
        Group By Operator
          -- 计算最大值
          aggregations: max(VALUE._col0)
          keys: KEY._col0 (type: varchar(9))
          ...
  -- 第2阶段：合并第1阶段的最大值和第9阶段的最小值
  Stage: Stage-2
    Map Reduce
      Map Operator Tree:
          TableScan
            Union
              ...
          TableScan
            Union
              ...
      Reduce Operator Tree:
        Group By Operator
          aggregations: compute_stats(VALUE._col0, 'hll'), compute_stats(VALUE._col2, 'hll')
          keys: KEY._col0 (type: string)
          ...

  ...

  -- 第9阶段，计算最小值
  Stage: Stage-9
    Map Reduce
      -- Map端
      Map Operator Tree:
          -- 表扫描
          TableScan
            alias: emp
            ...
      Execution mode: vectorized
      -- Reduce端
      Reduce Operator Tree:
        Group By Operator
          -- 计算最小值
          aggregations: min(VALUE._col0)
          keys: KEY._col0 (type: varchar(9))
          ...

...

上述的执行计划可以看到，在计算最大值和最小值的第1和第9两个阶段，分别在Map端都进行了一次表扫描TableScan操作。表是一样的，where条件也是一样的，存在重复读取表数据的操作，如果一次读取的数据比较大时会占用更多的IO资源。为了避免这种情况的发生，Hive引入了 from ... select 的形式，查询的表在一次读取之后，可以被多个查询语句使用。

from...select

hive> explain
    > from emp
    > insert into table emp_stat partition(tp)
    > select job, max(sal) stat, 'max' tp
    > group by job
    > insert into table emp_stat partition(tp)
    > select job, min(sal) stat, 'min' tp
    > group by job;
OK
-- 总共7个阶段
STAGE DEPENDENCIES:
  Stage-2 is a root stage
  Stage-0 depends on stages: Stage-2
  Stage-3 depends on stages: Stage-0
  Stage-4 depends on stages: Stage-2
  Stage-5 depends on stages: Stage-1, Stage-4, Stage-6
  Stage-1 depends on stages: Stage-2
  Stage-6 depends on stages: Stage-2

STAGE PLANS:
  Stage: Stage-2
    Map Reduce
      -- Map阶段只进行一次表扫描
      Map Operator Tree:
          TableScan
            alias: emp
            ...
      Execution mode: vectorized
      -- Reduce阶段同时计算最大值和最小值
      Reduce Operator Tree:
        -- Forward
        Forward
          Statistics: ...
          Group By Operator
            -- 计算最大值
            aggregations: max(VALUE._col0)
            keys: KEY._col0 (type: varchar(9))
            ...
          Group By Operator
            -- 计算最小值
            aggregations: min(VALUE._col0)
            keys: KEY._col0 (type: varchar(9))
            ...

  ...

...

from...select只需要一次数据读取就可以完成最大值和最小值计算。

往期精选

Apache Kafka生产环境集群资源规划与配置

Hive数据过滤之distinct子句

Hive数据过滤之having子句

入门Apache Kafka需要了解的方方面面

Spark Core基础面试题总结（下）

Spark源码解析-Yarn部署流程（ApplicationMaster）

Spark源码解析-Yarn部署流程（SparkSubmit）

Spark Core基础面试题总结（上）

Spark技术栈-Scala

数据中台实战系列笔记