Hive Data Filtering: Table Filtering
source link: https://mp.weixin.qq.com/s?__biz=MzA4NzA5NzE5Ng%3D%3D&mid=2650230317&idx=1&sn=2870a21f2495d3bc06a2c753951e6a02
Environment: hadoop-3.1.1, hive-3.1.1
The Hive table:

```
hive> desc emp;
OK
empno     int
ename     varchar(10)
job       varchar(9)
mgr       int
hiredate  date
sal       float
comm      float
deptno    int
Time taken: 0.313 seconds, Fetched: 8 row(s)
```
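For readers who want to follow along, here is a minimal sketch of creating and populating `emp`. The column names and types follow the `desc emp` output above; the sample rows are invented purely for illustration:

```sql
-- hypothetical setup; schema matches the `desc emp` output above
create table emp (
  empno    int,
  ename    varchar(10),
  job      varchar(9),
  mgr      int,
  hiredate date,
  sal      float,
  comm     float,
  deptno   int
);

-- sample rows, invented for illustration
insert into emp values
  (7369, 'SMITH', 'CLERK',    7902, '1980-12-17',  800.0, null,  20),
  (7499, 'ALLEN', 'SALESMAN', 7698, '1981-02-20', 1600.0, 300.0, 30),
  (7902, 'FORD',  'ANALYST',  7566, '1981-12-03', 3000.0, null,  20);
```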
Data filtering in Hive comes in several forms:

- WHERE clause filtering
- HAVING clause filtering
- DISTINCT filtering
- Table filtering
- Partition filtering
- Bucket filtering
- Index filtering
- Column filtering
Table Filtering
In a HiveSQL execution plan, a TableScan (table scan) appears either in the Map stage or in the FetchOperator. Table filtering means eliminating the repeated reads that occur when a single SQL statement queries the same table multiple times, collapsing the duplicate accesses into a single read. Put simply: read once, use many times.
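The TableScan is visible even without MapReduce: for a simple query, Hive can serve the result with a fetch task, and the plan's Fetch Operator then contains the TableScan. A quick way to see this, assuming the `emp` table above:

```sql
-- For a bare select, Hive may skip MapReduce entirely
-- (controlled by hive.fetch.task.conversion); the plan then
-- shows a Fetch Operator whose operator tree starts with TableScan.
hive> explain select * from emp;
```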
Multiple queries combined with UNION ALL

Query the maximum and minimum salary (`sal`) for each job:
```sql
-- enable dynamic partitioning
hive> set hive.exec.dynamic.partition=true;
hive> set hive.exec.dynamic.partition.mode=nonstrict;
-- create the partitioned table
hive> create table emp_stat(job string, stat decimal(15,2)) partitioned by (tp string);
OK
...
-- execution plan
hive> explain
    > insert into table emp_stat partition(tp)
    > select job, max(sal) stat, 'max' tp
    > from emp
    > group by job
    > union all
    > select job, min(sal) stat, 'min' tp
    > from emp
    > group by job;
OK
-- 10 stages in total
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  -- Stage-2 depends on Stage-1 and Stage-9
  Stage-2 depends on stages: Stage-1, Stage-9
  Stage-8 depends on stages: Stage-2 , consists of Stage-5, Stage-4, Stage-6
  Stage-5
  Stage-0 depends on stages: Stage-5, Stage-4, Stage-7
  Stage-3 depends on stages: Stage-0
  Stage-4
  Stage-6
  Stage-7 depends on stages: Stage-6
  Stage-9 is a root stage
STAGE PLANS:
  -- Stage-1: compute the maximum
  Stage: Stage-1
    Map Reduce
      -- Map side
      Map Operator Tree:
          -- table scan
          TableScan
            alias: emp
            ...
      Execution mode: vectorized
      -- Reduce side
      Reduce Operator Tree:
        Group By Operator
          -- compute max
          aggregations: max(VALUE._col0)
          keys: KEY._col0 (type: varchar(9))
          ...
  -- Stage-2: union the maximum from Stage-1 with the minimum from Stage-9
  Stage: Stage-2
    Map Reduce
      Map Operator Tree:
          TableScan
            Union
            ...
          TableScan
            Union
            ...
      Reduce Operator Tree:
        Group By Operator
          aggregations: compute_stats(VALUE._col0, 'hll'), compute_stats(VALUE._col2, 'hll')
          keys: KEY._col0 (type: string)
          ...
  ...
  -- Stage-9: compute the minimum
  Stage: Stage-9
    Map Reduce
      -- Map side
      Map Operator Tree:
          -- table scan
          TableScan
            alias: emp
            ...
      Execution mode: vectorized
      -- Reduce side
      Reduce Operator Tree:
        Group By Operator
          -- compute min
          aggregations: min(VALUE._col0)
          keys: KEY._col0 (type: varchar(9))
          ...
  ...
```
As the execution plan above shows, Stage-1 (max) and Stage-9 (min) each perform their own TableScan on the Map side. The table is the same and the WHERE conditions are the same, so the same table data is read twice; when the volume read is large, the duplicate reads consume extra IO. To avoid this, Hive provides the `from ... select` form: the table is read once, and that single read feeds multiple query clauses.
from...select
```sql
hive> explain
    > from emp
    > insert into table emp_stat partition(tp)
    > select job, max(sal) stat, 'max' tp
    > group by job
    > insert into table emp_stat partition(tp)
    > select job, min(sal) stat, 'min' tp
    > group by job;
OK
-- 7 stages in total
STAGE DEPENDENCIES:
  Stage-2 is a root stage
  Stage-0 depends on stages: Stage-2
  Stage-3 depends on stages: Stage-0
  Stage-4 depends on stages: Stage-2
  Stage-5 depends on stages: Stage-1, Stage-4, Stage-6
  Stage-1 depends on stages: Stage-2
  Stage-6 depends on stages: Stage-2
STAGE PLANS:
  Stage: Stage-2
    Map Reduce
      -- the Map stage performs only one table scan
      Map Operator Tree:
          TableScan
            alias: emp
            ...
      Execution mode: vectorized
      -- the Reduce stage computes the maximum and minimum together
      Reduce Operator Tree:
        -- Forward
        Forward
          Statistics: ...
          Group By Operator
            -- compute max
            aggregations: max(VALUE._col0)
            keys: KEY._col0 (type: varchar(9))
            ...
          Group By Operator
            -- compute min
            aggregations: min(VALUE._col0)
            keys: KEY._col0 (type: varchar(9))
            ...
  ...
```
With `from ... select`, a single read of the table is enough to compute both the maximum and the minimum.
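Dropping the `explain`, the multi-insert can be run and its effect checked directly; a sketch assuming the `emp` and `emp_stat` tables created above:

```sql
-- read emp once, write both aggregates in one pass
hive> from emp
    > insert into table emp_stat partition(tp)
    > select job, max(sal) stat, 'max' tp
    > group by job
    > insert into table emp_stat partition(tp)
    > select job, min(sal) stat, 'min' tp
    > group by job;

-- verify: each job should have one row in the tp='max'
-- partition and one in tp='min'
hive> select * from emp_stat;
hive> show partitions emp_stat;
```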