33

HQL练习某视频网站的常规TopN指标分析

 3 years ago
source link: http://mp.weixin.qq.com/s?__biz=MzA4NzA5NzE5Ng%3D%3D&%3Bmid=2650228676&%3Bidx=2&%3Bsn=57fee54c1040fef1bfc17cc577e59141
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.

点击关注上方“ 知了小巷 ”,

设为“置顶或星标”,第一时间送达干货。

某视频网站的TopN指标需求

1.统计视频观看数Top10

2.统计视频类别热度Top10

3.统计视频观看数Top20所属类别

4.统计视频观看数Top50所关联视频的所属类别Rank排名

5.统计每个类别中的视频热度Top10

6.统计每个类别中视频流量Top10

7.统计上传视频最多的用户Top10以及他们上传的视频

8.统计每个类别视频观看数Top10

指标分析所涉及的核心数据结构

1.视频表

字段 备注 详细描述 video_id 视频唯一id 11位字符串 uploader 视频上传者 上传视频的用户名String age 视频年龄 视频在平台上的整数天 category 视频类别 上传视频指定的视频分类 length 视频长度 整形数字标识的视频长度 views 观看次数 视频被浏览的次数 rate 视频评分 满分5分 ratings 流量 视频的流量,整型数字 coments 评论数 一个视频的整数评论数 related_id 相关视频id 相关视频的id,最多20个

2.用户表

字段 备注 字段类型 uploader 上传者用户名 string videos 上传视频数 int friends 朋友数量 int

建表和数据准备

创建表(load原始数据)

ods_video_ori

ods_video_user_ori

创建表(默认TEXTFILE转到ORC格式)

ods_video_orc

ods_video_user_orc

ods_video_ori

create table ods_video_ori (

video_id string,

uploader string,

age int,

category array<string>,

length int,

views int,

rate float,

ratings int,

comments int,

related_id array<string>

) row format delimited

fields terminated by "\t"

collection items terminated by "&"

stored as textfile;

ods_video_user_ori

create table ods_video_user_ori (

uploader string,

videos int,

friends int

) row format delimited

fields terminated by "\t"

stored as textfile;

然后把原始数据插入到ORC表中

ods_video_orc

create table ods_video_orc (

video_id string,

uploader string,

age int,

category array<string>,

length int,

views int,

rate float,

ratings int,

comments int,

related_id array<string>

) row format delimited fields terminated by "\t"

collection items terminated by "&"

stored as orc;

ods_video_user_orc

create table ods_video_user_orc (

uploader string,

videos int,

friends int

) row format delimited

fields terminated by "\t"

stored as orc;

导入数据

ods_video_ori(一个文件夹下多个文件)

hive> load data local inpath "/Users/xxx/Development/logs/video/" into table ods_video_ori;

Loading data to table default.ods_video_ori

OK

hive> select * from ods_video_ori limit 5;

OK

LKh7zAJ4nwo TheReceptionist 653 ["Entertainment"] 424 13021 4.34 1305 744 ["DjdA-5oKYFQ"]

7D0Mf4Kn4Xk periurban 583 ["Music"] 201 6508 4.19 687 312 ["e2k0h6tPvGc"]

n1cEq1C8oqQ Pipistrello 525 ["Comedy"] 125 1687 4.01 363 141 ["eprHhmurMHg"]

OHkEzL4Unck ichannel 638 ["Comedy"] 299 8043 4.4 518 371 ["eyUSTmEUQRg"]

-boOvAGNKUc mrpitifulband 639 ["Music"] 287 7548 4.48 606 386 ["fmUwUURgsX0"]

Time taken: 0.293 seconds, Fetched: 5 row(s)

ods_video_user_ori

hive> load data local inpath "/Users/xxx/Development/logs/user.txt" into table ods_video_user_ori;

hive> select * from ods_video_user_ori limit 5;

OK

barelypolitical 151 5106

bonk65 89 144

camelcars 26 674

cubskickass34 13 126

boydism08 32 50

Time taken: 2.855 seconds, Fetched: 5 row(s)

向ORC表插入数据

ods_video_orc

insert into table ods_video_orc select * from ods_video_ori;

ods_video_user_orc

insert into table ods_video_user_orc select * from ods_video_user_ori;

业务分析

统计视频观看数Top10

思路:使用order by按照views字段做一个全局排序即可,同时我们设置只显示前10条。

【全局排序】

最终sql:

select

video_id,

uploader,

age,

category,

length,

views,

rate,

ratings,

comments

from ods_video_orc

order by views desc limit 10;

需要注意内存不够的话会报错:

Ended Job = job_local1559464187_0005 with errors

Error during job, obtaining debugging information...

FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask

统计视频类别热度Top10

思路:某类别下的视频越多则热度越高

1.即统计每个类别有多少个视频,显示出包含视频最多的前10个类别。

2.我们需要按照类别group by聚合,然后count组内的video_id个数即可。

3.因为当前表结构为:一个视频对应一个或多个类别。所以如果要group by类别,需要先将类别进行列转行(展开),然后再进行count即可。

4.最后按照热度排序,显示前10条。

【lateral view explode的使用】

最终sql:

select

category_name as category,

count(t1.video_id) as hot

from (

select

video_id,

category_name

from ods_video_orc lateral view explode(category) t_catetory as category_name

) t1

group by t1.category_name

order by hot desc limit 10;

...

MapReduce Jobs Launched:

Stage-Stage-1: HDFS Read: 13812458 HDFS Write: 0 SUCCESS

Stage-Stage-2: HDFS Read: 13812458 HDFS Write: 0 SUCCESS

Total MapReduce CPU Time Spent: 0 msec

OK

Music 179049

Entertainment 127674

Comedy 87818

Film 73293

Animation 73293

Sports 67329

Games 59817

Gadgets 59817

Blogs 48890

People 48890

...

统计出视频观看数最高的20个视频的所属类别以及类别包含Top20视频的个数

思路:

1.先找到观看数最高的20个视频所属条目的所有信息,降序排列

2.把这20条信息中的category分裂出来(列转行)

3.最后查询视频分类名称和该分类下有多少个Top20的视频

最终sql:

select

category_name as category,

count(t2.video_id) as hot_with_views

from (

select

video_id,

category_name

from (

select

*

from

ods_video_orc

order by

views

desc limit

20

) t1 lateral view explode(category) t_catetory as category_name

) t2

group by category_name

order by hot_with_views desc;

...

MapReduce Jobs Launched:

Stage-Stage-1: HDFS Read: 30471366 HDFS Write: 0 SUCCESS

Stage-Stage-2: HDFS Read: 30471366 HDFS Write: 0 SUCCESS

Stage-Stage-3: HDFS Read: 30471366 HDFS Write: 0 SUCCESS

Total MapReduce CPU Time Spent: 0 msec

OK

Entertainment 6

Comedy 6

Music 5

People 2

Blogs 2

UNA 1

...

统计视频观看数Top50所关联视频的所属类别排序

思路:

1.查询出观看数最多的前50个视频的所有信息(当然包含了每个视频对应的关联视频),记为临时表t1

t1:观看数前50的视频

select

*

from ods_video_orc

order by views desc limit 50;

2.将找到的50条视频信息的相关视频related_id列转行,记为临时表t2

t2:将相关视频的id进行列转换操作

select

explode(related_id) as videoId

from t1;

3.将相关视频的id和ods_video_orc表进行inner join操作

t5:得到两列数据,一列是category,一列是之前查询出来的相关视频id

(

select

distinct(t2.video_id),

t3.category

from t2 inner join ods_video_orc t3 on t2.videoId = t3.videoId

) t4 lateral view explode(category) t_catetory as category_name;

4.按照视频类别进行分组,统计每组视频个数,然后排行

最终sql:

select

category_name as category,

count(t5.video_id) as hot

from (

select

video_id,

category_name

from (

select

distinct(t2.video_id),

t3.category

from (

select

explode(related_id) as video_id

from (

select

*

from ods_video_orc order by views desc limit 50

) t1

) t2 inner join ods_video_orc t3 on t2.video_id = t3.video_id

) t4 lateral view explode(category) t_catetory as category_name

) t5

group by category_name order by hot desc;

统计每个类别中的视频热度Top10,以Music为例

思路:

1.要想统计Music类别中的视频热度Top10,需要先找到Music类别,那么就需要将category展开,所以可以创建一张表用于存放category_id展开的数据。

2.向category展开的表中插入数据。

3.统计对应类别(Music)中的视频热度。

最终sql:

创建表类别表:

create table ods_video_category(

video_id string,

uploader string,

age int,

category_id string,

length int,

views int,

rate float,

ratings int,

comments int,

related_id array<string>)

row format delimited

fields terminated by "\t"

collection items terminated by "&"

stored as orc;

向类别表中插入数据:

insert into table ods_video_category

select

video_id,

uploader,

age,

category_id,

length,

views,

rate,

ratings,

comments,

related_id

from ods_video_orc lateral view explode(category) catetory as category_id;

hive> select count(1) from ods_video_category;

OK

1019206

统计Music类别的Top10(也可以统计其他)

select

video_id,

views

from ods_video_category

where category_id = "Music"

order by views desc limit 10;

统计每个类别中视频流量Top10,以Music为例

思路:

1.创建视频类别展开表(category_id列转行后的表)

2.按照ratings排序即可

最终sql:

select

video_id,

views,

ratings

from ods_video_category

where category_id = "Music"

order by ratings desc limit 10;

统计上传视频最多的用户Top10以及他们上传的观看次数在前20的视频

思路:

1.先找到上传视频最多的10个用户的用户信息

select

*

from ods_video_user_orc

order by videos desc limit 10;

2.通过uploader字段与ods_video_orc表进行join,得到的信息按照views观看次数进行排序即可。

最终sql:

select

t2.video_id,

t2.views,

t2.ratings,

t1.videos,

t1.friends

from (

select

*

from

ods_video_user_orc

order by videos desc

limit 10

) t1 join ods_video_orc t2 on t1.uploader = t2.uploader

order by views desc limit 20;

统计每个类别视频观看数Top10

思路:

1.先得到category_id展开的表数据

2.子查询按照category_id进行分区,然后分区内排序,并生成递增数字,该递增数字这一列起名为rank列

3.通过子查询产生的临时表,查询rank值小于等于10的数据行即可。

最终sql:

select

t1.*

from (

select

video_id,

category_id,

views,

row_number() over(partition by category_id order by views desc

) rank

from ods_video_category

) t1

where rank <= 10;

推荐阅读:

Hive企业级调优

HiveQL查询连续三天有销售记录的店铺

HiveQL实战蚂蚁森林低碳用户排名分析:解法一

HiveQL实战蚂蚁森林低碳用户排名分析:解法二

HiveQL实战蚂蚁森林植物申领统计分析

Hive-函数

Hive-查询

Hive-DML(Data Manipulation Language)数据操作语言

Hive-DDL(Data Definition Language)数据定义

Hive优化(整理版)

Spark Core之Shuffle解析

数据仓库开发规范

欢迎关注我的公众号“ 知了小巷 ”,如果喜欢,可以点一下“ 在看 ”~ 谢谢~~

nIzeUzu.png!web


About Joyk


Aggregate valuable and interesting links.
Joyk means Joy of geeK