HiveQL查询连续三天有销售记录的店铺
source link: http://mp.weixin.qq.com/s?__biz=MzA4NzA5NzE5Ng%3D%3D&%3Bmid=2650228657&%3Bidx=3&%3Bsn=30b2165a5f284451fcd72e0caad92995
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.
点击关注上方“ 知了小巷 “ ,
设为“置顶或星标”,第一时间送达干货。
背景
有一份关于各个店铺的销售数据,简单起见,只有店铺ID、记录时间、销售数量三个字段。
shopid dt sale 店铺ID 时间 销售数量问题(Ask Question)
连续三天有销售记录的店铺
要点
1.时间连续,当前行的时间减去当前行所在的排序序号得到的结果相同即时间连续
2.把差值相同的行分组聚合计算count,数量大于等于3即连续三天
2016-1 = 2015
2017-2 = 2015
2018-3 = 2015
数据集
1.原始数据 sale.dat
1001,2019-10-11,300
1001,2019-10-12,200
1001,2019-10-13,100
1001,2019-10-15,100
1001,2019-10-16,300
1001,2019-10-17,150
1001,2019-10-18,340
1001,2019-10-19,360
1005,2019-10-11,400
1005,2019-10-12,200
1005,2019-10-15,600
1006,2019-10-11,350
1006,2019-10-13,250
1006,2019-10-14,300
1006,2019-10-15,400
1006,2019-10-16,200
1008,2019-10-13,500
1010,2019-10-14,600
1010,2019-10-15,500
1008,2019-10-14,600
2.分析 : 给每个用户一个编号,用日期减去编号,如果是同一天,那么就是连续的
1001,2019-10-11,300,1,2019-10-10
1001,2019-10-12,200,2,2019-10-10
1001,2019-10-13,100,3,2019-10-10
1001,2019-10-15,100,4,2019-10-11
1001,2019-10-16,300,5,2019-10-11
1001,2019-10-17,150,6,2019-10-11
1001,2019-10-18,340,7,2019-10-11
1001,2019-10-19,360,8,2019-10-11
1005,2019-10-11,400
1005,2019-10-12,200
1005,2019-10-15,600
1006,2019-10-11,350
1006,2019-10-13,250
1006,2019-10-14,300
1006,2019-10-15,400
1006,2019-10-16,200
1008,2019-10-13,500
1010,2019-10-14,600
1010,2019-10-15,500
1008,2019-10-14,600
数据表和数据准备
0.建表,加载数据
create table t_jd (
shopid string,
dt string,
sale int
) row format delimited fields terminated by ',';
hive> load data local inpath '/Users/xxx/Development/logs/sale.dat' into table t_jd;
Loading data to table default.t_jd
OK
SQL流程
1.打编号-排序序号
select
shopid,
dt,
sale,
row_number() over(partition by shopid order by dt) as rn
from t_jd;
结果:
...
MapReduce Jobs Launched:
Stage-Stage-1: HDFS Read: 13776 HDFS Write: 1774 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
1001 2019-10-11 300 1
1001 2019-10-12 200 2
1001 2019-10-13 100 3
1001 2019-10-15 100 4
1001 2019-10-16 300 5
1001 2019-10-17 150 6
1001 2019-10-18 340 7
1001 2019-10-19 360 8
1005 2019-10-11 400 1
1005 2019-10-12 200 2
1005 2019-10-15 600 3
1006 2019-10-11 350 1
1006 2019-10-13 250 2
1006 2019-10-14 300 3
1006 2019-10-15 400 4
1006 2019-10-16 200 5
1008 2019-10-13 500 1
1008 2019-10-14 600 2
1010 2019-10-14 600 1
1010 2019-10-15 500 2
2.根据编号,生成连续日期
select
shopid,
dt,
sale,
rn,
date_sub(to_date(dt), rn)
from (
select
shopid,
dt,
sale,
row_number() over(partition by shopid order by dt) as rn
from t_jd
) tmp;
结果:
...
MapReduce Jobs Launched:
Stage-Stage-1: HDFS Read: 14576 HDFS Write: 1774 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
1001 2019-10-11 300 1 2019-10-10
1001 2019-10-12 200 2 2019-10-10
1001 2019-10-13 100 3 2019-10-10
1001 2019-10-15 100 4 2019-10-11
1001 2019-10-16 300 5 2019-10-11
1001 2019-10-17 150 6 2019-10-11
1001 2019-10-18 340 7 2019-10-11
1001 2019-10-19 360 8 2019-10-11
1005 2019-10-11 400 1 2019-10-10
1005 2019-10-12 200 2 2019-10-10
1005 2019-10-15 600 3 2019-10-12
1006 2019-10-11 350 1 2019-10-10
1006 2019-10-13 250 2 2019-10-11
1006 2019-10-14 300 3 2019-10-11
1006 2019-10-15 400 4 2019-10-11
1006 2019-10-16 200 5 2019-10-11
1008 2019-10-13 500 1 2019-10-12
1008 2019-10-14 600 2 2019-10-12
1010 2019-10-14 600 1 2019-10-13
1010 2019-10-15 500 2 2019-10-13
4 分组,求count
select
shopid,
count(1) as cnt
from (
select
shopid,
dt,
sale,
rn,
date_sub(to_date(dt), rn) as flag
from (
select
shopid,
dt,
sale,
row_number() over(partition by shopid order by dt) as rn
from t_jd
) tmp
) tmp2
group by shopid, flag;
结果:
...
MapReduce Jobs Launched:
Stage-Stage-1: HDFS Read: 800 HDFS Write: 0 SUCCESS
Stage-Stage-2: HDFS Read: 800 HDFS Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
1001 3
1001 5
1005 2
1005 1
1006 1
1006 4
1008 2
1010 2
5 筛选出连续天数大于等于3的
select
shopid
from (
select
shopid,
count(1) as cnt
from (
select
shopid,
dt,
sale,
rn,
date_sub(to_date(dt), rn) as flag
from (
select
shopid,
dt,
sale,
row_number() over(partition by shopid order by dt) as rn
from t_jd
) tmp
) tmp2
group by shopid, flag
) tmp3
where tmp3.cnt >= 3;
结果:
...
MapReduce Jobs Launched:
Stage-Stage-1: HDFS Read: 1600 HDFS Write: 0 SUCCESS
Stage-Stage-2: HDFS Read: 1600 HDFS Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
1001
1001
1006
6 去重
select
distinct shopid
from (
select
shopid,
count(1) as cnt
from (
select
shopid,
dt,
sale,
rn,
date_sub(to_date(dt), rn) as flag
from (
select
shopid,
dt,
sale,
row_number() over(partition by shopid order by dt) as rn
from t_jd
) tmp
) tmp2
group by shopid, flag
) tmp3
where tmp3.cnt >= 3;
结果:
...
OK
1001
1006
推荐阅读:
Hive-DML(Data Manipulation Language)数据操作语言
Hive-DDL(Data Definition Language)数据定义
欢迎关注我的公众号“ 知了小巷 ”,如果喜欢,麻烦点一下“ 在看 ”~
Recommend
About Joyk
Aggregate valuable and interesting links.
Joyk means Joy of geeK