42

HiveQL查询连续三天有销售记录的店铺

 3 years ago
source link: http://mp.weixin.qq.com/s?__biz=MzA4NzA5NzE5Ng%3D%3D&%3Bmid=2650228657&%3Bidx=3&%3Bsn=30b2165a5f284451fcd72e0caad92995
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.

点击关注上方“ 知了小巷

设为“置顶或星标”,第一时间送达干货。

背景

有一份关于各个店铺的销售数据,简单起见,只有店铺ID、记录时间、销售数量三个字段。

shopid dt sale 店铺ID 时间 销售数量

问题(Ask Question)

连续三天有销售记录的店铺

要点

1.时间连续,当前行的时间减去当前行所在的排序序号得到的结果相同即时间连续

2.把差值相同的行分组聚合计算count,数量大于等于3即连续三天

2016-1 = 2015

2017-2 = 2015

2018-3 = 2015

数据集

1.原始数据 sale.dat

1001,2019-10-11,300

1001,2019-10-12,200

1001,2019-10-13,100

1001,2019-10-15,100

1001,2019-10-16,300

1001,2019-10-17,150

1001,2019-10-18,340

1001,2019-10-19,360

1005,2019-10-11,400

1005,2019-10-12,200

1005,2019-10-15,600

1006,2019-10-11,350

1006,2019-10-13,250

1006,2019-10-14,300

1006,2019-10-15,400

1006,2019-10-16,200

1008,2019-10-13,500

1010,2019-10-14,600

1010,2019-10-15,500

1008,2019-10-14,600

2.分析 : 给每个用户一个编号,用日期减去编号,如果是同一天,那么就是连续的

1001,2019-10-11,300,1,2019-10-10

1001,2019-10-12,200,2,2019-10-10

1001,2019-10-13,100,3,2019-10-10

1001,2019-10-15,100,4,2019-10-11

1001,2019-10-16,300,5,2019-10-11

1001,2019-10-17,150,6,2019-10-11

1001,2019-10-18,340,7,2019-10-11

1001,2019-10-19,360,8,2019-10-11

1005,2019-10-11,400

1005,2019-10-12,200

1005,2019-10-15,600

1006,2019-10-11,350

1006,2019-10-13,250

1006,2019-10-14,300

1006,2019-10-15,400

1006,2019-10-16,200

1008,2019-10-13,500

1010,2019-10-14,600

1010,2019-10-15,500

1008,2019-10-14,600

数据表和数据准备

0.建表,加载数据

create table t_jd (

shopid string,

dt string,

sale int

) row format delimited fields terminated by ',';

hive> load data local inpath '/Users/xxx/Development/logs/sale.dat' into table t_jd;

Loading data to table default.t_jd

OK

SQL流程

1.打编号-排序序号

select

shopid,

dt,

sale,

row_number() over(partition by shopid order by dt) as rn

from t_jd;

结果:

...

MapReduce Jobs Launched:

Stage-Stage-1: HDFS Read: 13776 HDFS Write: 1774 SUCCESS

Total MapReduce CPU Time Spent: 0 msec

OK

1001 2019-10-11 300 1

1001 2019-10-12 200 2

1001 2019-10-13 100 3

1001 2019-10-15 100 4

1001 2019-10-16 300 5

1001 2019-10-17 150 6

1001 2019-10-18 340 7

1001 2019-10-19 360 8

1005 2019-10-11 400 1

1005 2019-10-12 200 2

1005 2019-10-15 600 3

1006 2019-10-11 350 1

1006 2019-10-13 250 2

1006 2019-10-14 300 3

1006 2019-10-15 400 4

1006 2019-10-16 200 5

1008 2019-10-13 500 1

1008 2019-10-14 600 2

1010 2019-10-14 600 1

1010 2019-10-15 500 2

2.根据编号,生成连续日期

select

shopid,

dt,

sale,

rn,

date_sub(to_date(dt), rn)

from (

select

shopid,

dt,

sale,

row_number() over(partition by shopid order by dt) as rn

from t_jd

) tmp;

结果:

...

MapReduce Jobs Launched:

Stage-Stage-1: HDFS Read: 14576 HDFS Write: 1774 SUCCESS

Total MapReduce CPU Time Spent: 0 msec

OK

1001 2019-10-11 300 1 2019-10-10

1001 2019-10-12 200 2 2019-10-10

1001 2019-10-13 100 3 2019-10-10

1001 2019-10-15 100 4 2019-10-11

1001 2019-10-16 300 5 2019-10-11

1001 2019-10-17 150 6 2019-10-11

1001 2019-10-18 340 7 2019-10-11

1001 2019-10-19 360 8 2019-10-11

1005 2019-10-11 400 1 2019-10-10

1005 2019-10-12 200 2 2019-10-10

1005 2019-10-15 600 3 2019-10-12

1006 2019-10-11 350 1 2019-10-10

1006 2019-10-13 250 2 2019-10-11

1006 2019-10-14 300 3 2019-10-11

1006 2019-10-15 400 4 2019-10-11

1006 2019-10-16 200 5 2019-10-11

1008 2019-10-13 500 1 2019-10-12

1008 2019-10-14 600 2 2019-10-12

1010 2019-10-14 600 1 2019-10-13

1010 2019-10-15 500 2 2019-10-13

4 分组,求count

select

shopid,

count(1) as cnt

from (

select

shopid,

dt,

sale,

rn,

date_sub(to_date(dt), rn) as flag

from (

select

shopid,

dt,

sale,

row_number() over(partition by shopid order by dt) as rn

from t_jd

) tmp

) tmp2

group by shopid, flag;

结果:

...

MapReduce Jobs Launched:

Stage-Stage-1: HDFS Read: 800 HDFS Write: 0 SUCCESS

Stage-Stage-2: HDFS Read: 800 HDFS Write: 0 SUCCESS

Total MapReduce CPU Time Spent: 0 msec

OK

1001 3

1001 5

1005 2

1005 1

1006 1

1006 4

1008 2

1010 2

5 筛选出连续天数大于等于3的

select

shopid

from (

select

shopid,

count(1) as cnt

from (

select

shopid,

dt,

sale,

rn,

date_sub(to_date(dt), rn) as flag

from (

select

shopid,

dt,

sale,

row_number() over(partition by shopid order by dt) as rn

from t_jd

) tmp

) tmp2

group by shopid, flag

) tmp3

where tmp3.cnt >= 3;

结果:

...

MapReduce Jobs Launched:

Stage-Stage-1: HDFS Read: 1600 HDFS Write: 0 SUCCESS

Stage-Stage-2: HDFS Read: 1600 HDFS Write: 0 SUCCESS

Total MapReduce CPU Time Spent: 0 msec

OK

1001

1001

1006

6 去重

select

distinct shopid

from (

select

shopid,

count(1) as cnt

from (

select

shopid,

dt,

sale,

rn,

date_sub(to_date(dt), rn) as flag

from (

select

shopid,

dt,

sale,

row_number() over(partition by shopid order by dt) as rn

from t_jd

) tmp

) tmp2

group by shopid, flag

) tmp3

where tmp3.cnt >= 3;

结果:

...

OK

1001

1006

推荐阅读:

HiveQL实战蚂蚁森林低碳用户排名分析:解法二

HiveQL实战蚂蚁森林低碳用户排名分析:解法一

HiveQL实战蚂蚁森林植物申领统计分析

深入理解Hive数据存储格式和压缩要点

Hive-函数

Hive-查询

Hive-DML(Data Manipulation Language)数据操作语言

Hive-DDL(Data Definition Language)数据定义

Hive优化(整理版)

Spark Core之Shuffle解析

数据仓库开发规范

欢迎关注我的公众号“ 知了小巷 ”,如果喜欢,麻烦点一下“ 在看 ”~

nIzeUzu.png!web


About Joyk


Aggregate valuable and interesting links.
Joyk means Joy of geeK