HQL练习某视频网站的常规TopN指标分析
source link: http://mp.weixin.qq.com/s?__biz=MzA4NzA5NzE5Ng%3D%3D&%3Bmid=2650228676&%3Bidx=2&%3Bsn=57fee54c1040fef1bfc17cc577e59141
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.
点击关注上方“ 知了小巷 ”,
设为“置顶或星标”,第一时间送达干货。
某视频网站的TopN指标需求
1.统计视频观看数Top10
2.统计视频类别热度Top10
3.统计视频观看数Top20所属类别
4.统计视频观看数Top50所关联视频的所属类别Rank排名
5.统计每个类别中的视频热度Top10
6.统计每个类别中视频流量Top10
7.统计上传视频最多的用户Top10以及他们上传的视频
8.统计每个类别视频观看数Top10
指标分析所涉及的核心数据结构
1.视频表
字段 备注 详细描述 video_id 视频唯一id 11位字符串 uploader 视频上传者 上传视频的用户名String age 视频年龄 视频在平台上的整数天 category 视频类别 上传视频指定的视频分类 length 视频长度 整形数字标识的视频长度 views 观看次数 视频被浏览的次数 rate 视频评分 满分5分 ratings 流量 视频的流量,整型数字 coments 评论数 一个视频的整数评论数 related_id 相关视频id 相关视频的id,最多20个2.用户表
字段 备注 字段类型 uploader 上传者用户名 string videos 上传视频数 int friends 朋友数量 int建表和数据准备
创建表(load原始数据)
ods_video_ori
ods_video_user_ori
创建表(默认TEXTFILE转到ORC格式)
ods_video_orc
ods_video_user_orc
ods_video_ori
create table ods_video_ori (
video_id string,
uploader string,
age int,
category array<string>,
length int,
views int,
rate float,
ratings int,
comments int,
related_id array<string>
) row format delimited
fields terminated by "\t"
collection items terminated by "&"
stored as textfile;
ods_video_user_ori
create table ods_video_user_ori (
uploader string,
videos int,
friends int
) row format delimited
fields terminated by "\t"
stored as textfile;
然后把原始数据插入到ORC表中
ods_video_orc
create table ods_video_orc (
video_id string,
uploader string,
age int,
category array<string>,
length int,
views int,
rate float,
ratings int,
comments int,
related_id array<string>
) row format delimited fields terminated by "\t"
collection items terminated by "&"
stored as orc;
ods_video_user_orc
create table ods_video_user_orc (
uploader string,
videos int,
friends int
) row format delimited
fields terminated by "\t"
stored as orc;
导入数据
ods_video_ori(一个文件夹下多个文件)
hive> load data local inpath "/Users/xxx/Development/logs/video/" into table ods_video_ori;
Loading data to table default.ods_video_ori
OK
hive> select * from ods_video_ori limit 5;
OK
LKh7zAJ4nwo TheReceptionist 653 ["Entertainment"] 424 13021 4.34 1305 744 ["DjdA-5oKYFQ"]
7D0Mf4Kn4Xk periurban 583 ["Music"] 201 6508 4.19 687 312 ["e2k0h6tPvGc"]
n1cEq1C8oqQ Pipistrello 525 ["Comedy"] 125 1687 4.01 363 141 ["eprHhmurMHg"]
OHkEzL4Unck ichannel 638 ["Comedy"] 299 8043 4.4 518 371 ["eyUSTmEUQRg"]
-boOvAGNKUc mrpitifulband 639 ["Music"] 287 7548 4.48 606 386 ["fmUwUURgsX0"]
Time taken: 0.293 seconds, Fetched: 5 row(s)
ods_video_user_ori
hive> load data local inpath "/Users/xxx/Development/logs/user.txt" into table ods_video_user_ori;
hive> select * from ods_video_user_ori limit 5;
OK
barelypolitical 151 5106
bonk65 89 144
camelcars 26 674
cubskickass34 13 126
boydism08 32 50
Time taken: 2.855 seconds, Fetched: 5 row(s)
向ORC表插入数据
ods_video_orc
insert into table ods_video_orc select * from ods_video_ori;
ods_video_user_orc
insert into table ods_video_user_orc select * from ods_video_user_ori;
业务分析
统计视频观看数Top10
思路:使用order by按照views字段做一个全局排序即可,同时我们设置只显示前10条。
【全局排序】
最终sql:
select
video_id,
uploader,
age,
category,
length,
views,
rate,
ratings,
comments
from ods_video_orc
order by views desc limit 10;
需要注意内存不够的话会报错:
Ended Job = job_local1559464187_0005 with errors
Error during job, obtaining debugging information...
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask
统计视频类别热度Top10
思路:某类别下的视频越多则热度越高
1.即统计每个类别有多少个视频,显示出包含视频最多的前10个类别。
2.我们需要按照类别group by聚合,然后count组内的video_id个数即可。
3.因为当前表结构为:一个视频对应一个或多个类别。所以如果要group by类别,需要先将类别进行列转行(展开),然后再进行count即可。
4.最后按照热度排序,显示前10条。
【lateral view explode的使用】
最终sql:
select
category_name as category,
count(t1.video_id) as hot
from (
select
video_id,
category_name
from ods_video_orc lateral view explode(category) t_catetory as category_name
) t1
group by t1.category_name
order by hot desc limit 10;
...
MapReduce Jobs Launched:
Stage-Stage-1: HDFS Read: 13812458 HDFS Write: 0 SUCCESS
Stage-Stage-2: HDFS Read: 13812458 HDFS Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
Music 179049
Entertainment 127674
Comedy 87818
Film 73293
Animation 73293
Sports 67329
Games 59817
Gadgets 59817
Blogs 48890
People 48890
...
统计出视频观看数最高的20个视频的所属类别以及类别包含Top20视频的个数
思路:
1.先找到观看数最高的20个视频所属条目的所有信息,降序排列
2.把这20条信息中的category分裂出来(列转行)
3.最后查询视频分类名称和该分类下有多少个Top20的视频
最终sql:
select
category_name as category,
count(t2.video_id) as hot_with_views
from (
select
video_id,
category_name
from (
select
*
from
ods_video_orc
order by
views
desc limit
20
) t1 lateral view explode(category) t_catetory as category_name
) t2
group by category_name
order by hot_with_views desc;
...
MapReduce Jobs Launched:
Stage-Stage-1: HDFS Read: 30471366 HDFS Write: 0 SUCCESS
Stage-Stage-2: HDFS Read: 30471366 HDFS Write: 0 SUCCESS
Stage-Stage-3: HDFS Read: 30471366 HDFS Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
Entertainment 6
Comedy 6
Music 5
People 2
Blogs 2
UNA 1
...
统计视频观看数Top50所关联视频的所属类别排序
思路:
1.查询出观看数最多的前50个视频的所有信息(当然包含了每个视频对应的关联视频),记为临时表t1
t1:观看数前50的视频
select
*
from ods_video_orc
order by views desc limit 50;
2.将找到的50条视频信息的相关视频related_id列转行,记为临时表t2
t2:将相关视频的id进行列转换操作
select
explode(related_id) as videoId
from t1;
3.将相关视频的id和ods_video_orc表进行inner join操作
t5:得到两列数据,一列是category,一列是之前查询出来的相关视频id
(
select
distinct(t2.video_id),
t3.category
from t2 inner join ods_video_orc t3 on t2.videoId = t3.videoId
) t4 lateral view explode(category) t_catetory as category_name;
4.按照视频类别进行分组,统计每组视频个数,然后排行
最终sql:
select
category_name as category,
count(t5.video_id) as hot
from (
select
video_id,
category_name
from (
select
distinct(t2.video_id),
t3.category
from (
select
explode(related_id) as video_id
from (
select
*
from ods_video_orc order by views desc limit 50
) t1
) t2 inner join ods_video_orc t3 on t2.video_id = t3.video_id
) t4 lateral view explode(category) t_catetory as category_name
) t5
group by category_name order by hot desc;
统计每个类别中的视频热度Top10,以Music为例
思路:
1.要想统计Music类别中的视频热度Top10,需要先找到Music类别,那么就需要将category展开,所以可以创建一张表用于存放category_id展开的数据。
2.向category展开的表中插入数据。
3.统计对应类别(Music)中的视频热度。
最终sql:
创建表类别表:
create table ods_video_category(
video_id string,
uploader string,
age int,
category_id string,
length int,
views int,
rate float,
ratings int,
comments int,
related_id array<string>)
row format delimited
fields terminated by "\t"
collection items terminated by "&"
stored as orc;
向类别表中插入数据:
insert into table ods_video_category
select
video_id,
uploader,
age,
category_id,
length,
views,
rate,
ratings,
comments,
related_id
from ods_video_orc lateral view explode(category) catetory as category_id;
hive> select count(1) from ods_video_category;
OK
1019206
统计Music类别的Top10(也可以统计其他)
select
video_id,
views
from ods_video_category
where category_id = "Music"
order by views desc limit 10;
统计每个类别中视频流量Top10,以Music为例
思路:
1.创建视频类别展开表(category_id列转行后的表)
2.按照ratings排序即可
最终sql:
select
video_id,
views,
ratings
from ods_video_category
where category_id = "Music"
order by ratings desc limit 10;
统计上传视频最多的用户Top10以及他们上传的观看次数在前20的视频
思路:
1.先找到上传视频最多的10个用户的用户信息
select
*
from ods_video_user_orc
order by videos desc limit 10;
2.通过uploader字段与ods_video_orc表进行join,得到的信息按照views观看次数进行排序即可。
最终sql:
select
t2.video_id,
t2.views,
t2.ratings,
t1.videos,
t1.friends
from (
select
*
from
ods_video_user_orc
order by videos desc
limit 10
) t1 join ods_video_orc t2 on t1.uploader = t2.uploader
order by views desc limit 20;
统计每个类别视频观看数Top10
思路:
1.先得到category_id展开的表数据
2.子查询按照category_id进行分区,然后分区内排序,并生成递增数字,该递增数字这一列起名为rank列
3.通过子查询产生的临时表,查询rank值小于等于10的数据行即可。
最终sql:
select
t1.*
from (
select
video_id,
category_id,
views,
row_number() over(partition by category_id order by views desc
) rank
from ods_video_category
) t1
where rank <= 10;
推荐阅读:
Hive-DML(Data Manipulation Language)数据操作语言
Hive-DDL(Data Definition Language)数据定义
欢迎关注我的公众号“ 知了小巷 ”,如果喜欢,可以点一下“ 在看 ”~ 谢谢~~
Recommend
About Joyk
Aggregate valuable and interesting links.
Joyk means Joy of geeK