3

解析数仓OLAP函数:ROLLUP、CUBE、GROUPING SETS - 华为云开发者社区

 2 years ago
source link: https://www.cnblogs.com/huaweiyun/p/16198345.html
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.

解析数仓OLAP函数:ROLLUP、CUBE、GROUPING SETS

摘要:GaussDB(DWS) ROLLUP,CUBE,GROUPING SETS等OLAP函数的原理解析。

本文分享自华为云社区《GaussDB(DWS) OLAP函数浅析》,作者: DWS_Jack_2。

在一些报表场景中,经常会对数据做分组统计(group by),例如对一级部门下辖的二级部门员工数进行统计:

create table emp(
id int,     --工号
name text,  --员工名
dep_1 text, --一级部门
dep_2 text  --二级部门
);
gaussdb=# select count(*), dep_2 from emp group by dep_2;
 count | dep_2
-------+-------
   200 | SRE
   100 | EI
(2 rows)

常见的统计报表业务中,通常需要进一步计算一级部门的“合计”人数,也就是二级部门各分组的累加,就可以借助于rollup,如下所示,比前面的分组计算结果多了一行合计的数据:

gaussdb=# select count(*), dep_2 from emp group by rollup(dep_2);
 count | dep_2
-------+-------
   200 | SRE
   100 | EI
   300 |
(3 rows)

如上是一种group by扩展的高级分组函数使用场景,这一类分组函数统称为OLAP函数,在GaussDB(DWS)中支持 ROLLUP,CUBE,GROUPING SETS,下面对这几种OLAP函数的原理和应用场景做一下分析。

首先我们来创建一张表,customer,用户信息表,其中包含了用户id,用户名,年龄,国家,用户级别,性别,余额等信息:

create table customer
(
 c_id char(16) not null,
 c_name char(20) ,
 c_age integer ,
 c_country varchar(20) ,
 c_class char(10),
 c_sex text,
 c_balance numeric
);
insert into customer values(1, 'tom', '20', 'China', '1', 'male', 300);
insert into customer values(2, 'jack', '30', 'USA', '1', 'male', 100);
insert into customer values(3, 'rose', '40', 'UK', '1', 'female', 200);
insert into customer values(4, 'Frank', '60', 'GER', '1', 'male', 100);
insert into customer values(5, 'Leon', '20', 'China', '2', 'male', 200);
insert into customer values(6, 'Lucy', '20', 'China', '1', 'female', 500);

ROLLUP

本文开头的示例已经解释了,ROLLUP是在分组计算基础上增加了合计,从字面意思理解,就是从最小聚合级开始,聚合单位逐渐扩大,例如如下语句:

select c_country, c_class, sum(c_balance) from customer group by rollup(c_country, c_class) order by 1,2,3;
 c_country |  c_class   | sum  
-----------+------------+------
 China     | 1          |  800
 China     | 2          |  200
 China     |            | 1000
 GER       | 1          |  100
 GER       |            |  100
 UK        | 1          |  200
 UK        |            |  200
 USA       | 1          |  100
 USA       |            |  100
           |            | 1400
(10 rows)

该语句功能等价于如下:

select c_country, c_class, sum(c_balance) from customer group by c_country, c_class
union all
select c_country, null, sum(c_balance) from customer group by c_country
union all
select null, null, sum(c_balance) from customer order by 1,2,3;
 c_country |  c_class   | sum  
-----------+------------+------
 China     | 1          |  800
 China     | 2          |  200
 China     |            | 1000
 GER       | 1          |  100
 GER       |            |  100
 UK        | 1          |  200
 UK        |            |  200
 USA       | 1          |  100
 USA       |            |  100
           |            | 1400
(10 rows)

尝试理解一下

GROUP BY ROLLUP(A,B):

首先对(A,B)进行GROUP BY,然后对(A)进行GROUP BY,最后对全表进行GROUP BY操作

CUBE从字面意思理解,就是各个维度的意思,也就是说全部组合,即聚合键中所有字段的组合的分组统计结果,例如如下语句:

select c_country, c_class, sum(c_balance) from customer group by cube(c_country, c_class) order by 1,2,3;
 c_country |  c_class   | sum  
-----------+------------+------
 China     | 1          |  800
 China     | 2          |  200
 China     |            | 1000
 GER       | 1          |  100
 GER       |            |  100
 UK        | 1          |  200
 UK        |            |  200
 USA       | 1          |  100
 USA       |            |  100
           | 1          | 1200
           | 2          |  200
           |            | 1400
(12 rows)

该语句功能等价于如下:

select c_country, c_class, sum(c_balance) from customer group by c_country, c_class
union all
select c_country, null, sum(c_balance) from customer group by c_country
union all
select null, null, sum(c_balance) from customer
union all
select NULL, c_class, sum(c_balance) from customer group by c_class order by 1,2,3;
 c_country |  c_class   | sum  
-----------+------------+------
 China     | 1          |  800
 China     | 2          |  200
 China     |            | 1000
 GER       | 1          |  100
 GER       |            |  100
 UK        | 1          |  200
 UK        |            |  200
 USA       | 1          |  100
 USA       |            |  100
           | 1          | 1200
           | 2          |  200
           |            | 1400
(12 rows)

GROUP BY CUBE(A,B):

首先对(A,B)进行GROUP BY,然后依次对(A)、(B)进行GROUP BY,最后对全表进行GROUP BY操作。

GROUPING SETS

GROUPING SETS区别于ROLLUP和CUBE,并没有总体的合计功能,相当于从ROLLUP和CUBE的结果中提取出部分记录,例如如下语句:

select c_country, c_class, sum(c_balance) from customer group by grouping sets(c_country, c_class) order by 1,2,3;
 c_country |  c_class   | sum  
-----------+------------+------
 China     |            | 1000
 GER       |            |  100
 UK        |            |  200
 USA       |            |  100
           | 1          | 1200
           | 2          |  200
(6 rows)

该语句功能等价于如下:

select c_country, null, sum(c_balance) from customer group by c_country
union all
select null, c_class, sum(c_balance) from customer group by c_class
order by 1,2,3;
 c_country |  ?column?  | sum  
-----------+------------+------
 China     |            | 1000
 GER       |            |  100
 UK        |            |  200
 USA       |            |  100
           | 1          | 1200
           | 2          |  200
(6 rows)

GROUP BY GROUPING SETS(A,B):

分别对(B)、(A)进行GROUP BY计算

目前在GaussDB(DWS)中,OLAP函数的实现,会有排序(sort)操作,相比等价的union all操作,效率并不会有提升,后续会通过mixagg的支持来提升OLAP函数的执行效率,有兴趣的同学,可以explain打印一下计划,来看一下OLAP函数的执行流程。

点击关注,第一时间了解华为云新鲜技术~


About Joyk


Aggregate valuable and interesting links.
Joyk means Joy of geeK