Crawling a thousand comments from Bingbing's Bilibili video to see what everyone is saying

source link: https://blog.csdn.net/qq_45176548/article/details/112100932

Using Python, crawl the thousand-plus comments under Bingbing's first Bilibili video and draw a word cloud to see what everyone is saying.

Related posts: fetching a day's worth of Bilibili danmaku (B站当日弹幕获取), and how the danmaku crawler for Bingbing's Bilibili video works (冰冰B站视频弹幕爬取原理解析).
import pandas as pd
data = pd.read_excel(r"bingbing.xlsx")  # comment data exported by the crawler
data.head()
   用户              性别  等级  评论                                           点赞
0  食贫道            男    6     [呆][呆][呆]你来了嘿!                       158457
1  毕导THU          男    6     我是冰冰仅有的3个关注之一[tv_doge]我和冰冰贴贴  148439
2  老师好我叫何同学  男    6     [热词系列_知识增加]                          89634
3  央视网快看        保密  6     冰冰来了!我们要失业了吗[doge][doge]          118370
4  厦门大学          保密  5     哇欢迎冰冰!!!                             66196


Data preprocessing

data.describe()
              等级            点赞
count  1180.000000    1180.000000
mean      4.481356    2200.617797
std       1.041379   10872.524850
min       2.000000       1.000000
25%       4.000000       4.000000
50%       5.000000       9.000000
75%       5.000000     203.750000
max       6.000000  158457.000000
data.dropna()  # returns a new frame (assign back to keep it); no row contains NaN, so all 1180 remain

      用户            性别  等级  评论                                           点赞
0     食贫道          男    6     [呆][呆][呆]你来了嘿!                       158457
1     毕导THU        男    6     我是冰冰仅有的3个关注之一[tv_doge]我和冰冰贴贴  148439
2     老师好我叫何同学 男    6     [热词系列_知识增加]                         89634
3     央视网快看      保密  6     冰冰来了!我们要失业了吗[doge][doge]          118370
4     厦门大学        保密  5     哇欢迎冰冰!!!                             66196
...   ...             ...   ...    ...                                           ...
1175  黑旗鱼          保密  5     11小时一百万,好快[惊讶]                     5
1176  是你的益达哦    男    6     冰冰粉丝上涨速度:11小时107.3万,平均每小时上涨9.75万,每分钟上涨1625,每秒钟...  5
1177  快乐风男崔斯特  男    4     军训的时候去了趟厕所,出来忘记是哪个队伍了。看了up的视频才想起来,是三连[doge][滑稽]  5
1178  很认真的大熊    男    5     我觉得冰冰主持春晚应该问题不大吧。[OK]        5
1179  飞拖鞋呀吼      保密  5     《论一个2级号如何在2020年最后一天成为百大up主》  5

1180 rows × 5 columns

data.drop_duplicates()  # one exact duplicate row is dropped (1180 → 1179)

      用户            性别  等级  评论                                           点赞
0     食贫道          男    6     [呆][呆][呆]你来了嘿!                       158457
1     毕导THU        男    6     我是冰冰仅有的3个关注之一[tv_doge]我和冰冰贴贴  148439
2     老师好我叫何同学 男    6     [热词系列_知识增加]                         89634
3     央视网快看      保密  6     冰冰来了!我们要失业了吗[doge][doge]          118370
4     厦门大学        保密  5     哇欢迎冰冰!!!                             66196
...   ...             ...   ...    ...                                           ...
1175  黑旗鱼          保密  5     11小时一百万,好快[惊讶]                     5
1176  是你的益达哦    男    6     冰冰粉丝上涨速度:11小时107.3万,平均每小时上涨9.75万,每分钟上涨1625,每秒钟...  5
1177  快乐风男崔斯特  男    4     军训的时候去了趟厕所,出来忘记是哪个队伍了。看了up的视频才想起来,是三连[doge][滑稽]  5
1178  很认真的大熊    男    5     我觉得冰冰主持春晚应该问题不大吧。[OK]        5
1179  飞拖鞋呀吼      保密  5     《论一个2级号如何在2020年最后一天成为百大up主》  5

1179 rows × 5 columns
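Note that dropna() and drop_duplicates() return new DataFrames rather than modifying data in place, so the cleaned result must be assigned back to be kept. A minimal sketch on a made-up four-row frame with the same columns:

```python
import pandas as pd

# Hypothetical mini-frame mirroring bingbing.xlsx: one row with missing
# values and one exact duplicate
df = pd.DataFrame({
    "用户": ["食贫道", "厦门大学", "厦门大学", None],
    "性别": ["男", "保密", "保密", "男"],
    "等级": [6, 5, 5, 4],
    "评论": ["你来了嘿!", "哇欢迎冰冰!!!", "哇欢迎冰冰!!!", None],
    "点赞": [158457, 66196, 66196, 5],
})

# Chain the two cleaning steps and assign the result back
clean = df.dropna().drop_duplicates()
print(len(clean))  # 2: the NaN row and the duplicate are gone
```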

Top 20 comments by likes (点赞)

df1 = data.sort_values(by="点赞",ascending=False).head(20)
from pyecharts import options as opts
from pyecharts.charts import Bar
from pyecharts.faker import Faker

c1 = (
    Bar()
    .add_xaxis(df1["评论"].to_list())
    .add_yaxis("点赞数", df1["点赞"].to_list(), color=Faker.rand_color())
    .set_global_opts(
        title_opts=opts.TitleOpts(title="评论热度Top20"),
        datazoom_opts=[opts.DataZoomOpts(), opts.DataZoomOpts(type_="inside")],
    )
    .render_notebook()
)
c1

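Incidentally, the same top-20 selection can be written with DataFrame.nlargest, which avoids sorting the whole frame; a small sketch on toy data:

```python
import pandas as pd

df = pd.DataFrame({"评论": ["a", "b", "c"], "点赞": [5, 9, 7]})

# nlargest(k, col) is equivalent to sort_values(col, ascending=False).head(k)
top2 = df.nlargest(2, "点赞")
print(top2["评论"].to_list())  # ['b', 'c']
```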

data.等级.value_counts().sort_index(ascending=False)
6    165
5    502
4    312
3    138
2     63
Name: 等级, dtype: int64
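Rather than retyping those counts into the Pie call, the [name, value] pairs pyecharts expects can be built straight from value_counts; a sketch with made-up levels:

```python
import pandas as pd

# Hypothetical stand-in for data.等级
levels = pd.Series([6, 5, 5, 4, 5, 3, 2, 4])
counts = levels.value_counts().sort_index()  # index = level, value = count

# Same shape as the hardcoded zip(...): [["2", 63], ["3", 138], ...]
pairs = [[str(level), int(n)] for level, n in counts.items()]
print(pairs)  # [['2', 1], ['3', 1], ['4', 2], ['5', 3], ['6', 1]]
```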
from pyecharts import options as opts
from pyecharts.charts import Pie
from pyecharts.faker import Faker

c2 = (
    Pie()
    .add(
        "",
        [list(z) for z in zip([str(i) for i in range(2,7)], [63,138,312,502,165])],
        radius=["40%", "75%"],
    )
    .set_global_opts(
        title_opts=opts.TitleOpts(title="等级分布"),
        legend_opts=opts.LegendOpts(orient="vertical", pos_top="15%", pos_left="2%"),
    )
    .set_series_opts(label_opts=opts.LabelOpts(formatter="{b}: {c}"))
    .render_notebook()
)
c2


data.性别.value_counts().sort_index(ascending=False)
男     404
女     103
保密    673
Name: 性别, dtype: int64
from pyecharts import options as opts
from pyecharts.charts import Pie
from pyecharts.faker import Faker

c4 = (
    Pie()
    .add(
        "",
        [list(z) for z in zip(["男", "女", "保密"], [404, 103, 673])],  # counts as ints, not strings
        radius=["40%", "75%"],
    )
    .set_global_opts(
        title_opts=opts.TitleOpts(title="性别分布"),
        legend_opts=opts.LegendOpts(orient="vertical", pos_top="15%", pos_left="2%"),
    )
    .set_series_opts(label_opts=opts.LabelOpts(formatter="{b}: {c}"))
    .render_notebook()
)
c4


Drawing the word cloud

from wordcloud import WordCloud
import jieba
from tkinter import _flatten  # private tkinter helper, used here to flatten nested lists
import matplotlib.pyplot as plt
with open('stoplist.txt', 'r', encoding='utf-8') as f:
    stopWords = f.read()
with open('停用词.txt','r',encoding='utf-8') as t:
    stopWord = t.read()
total = stopWord.split() + stopWords.split()
def my_word_cloud(data=None, stopWords=None, img=None):
    dataCut = data.apply(jieba.lcut)  # tokenize each comment
    dataAfter = dataCut.apply(lambda x: [i for i in x if i not in stopWords])  # drop stopwords
    wordFre = pd.Series(_flatten(list(dataAfter))).value_counts()  # count word frequencies
    mask = plt.imread(img)  # image whose non-white area shapes the cloud
    plt.figure(figsize=(20, 20))
    wc = WordCloud(scale=10, font_path='C:/Windows/Fonts/STXINGKA.TTF', mask=mask, background_color="white")
    wc.fit_words(wordFre)
    plt.imshow(wc)
    plt.axis('off')
my_word_cloud(data=data["评论"], stopWords=total, img="1.jpeg")  # pass the merged stopword list, not the raw file text
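The tokenize-filter-count pipeline inside my_word_cloud can be checked in isolation. A sketch on pre-tokenized lists (standing in for jieba.lcut output), using the standard itertools.chain.from_iterable in place of tkinter's private _flatten helper; the tokens and stopwords here are made up:

```python
import pandas as pd
from itertools import chain

token_lists = pd.Series([["冰冰", "来", "了"], ["欢迎", "冰冰"]])
stop_words = {"来", "了"}

# Drop stopwords per comment, flatten, then count occurrences
kept = token_lists.apply(lambda toks: [t for t in toks if t not in stop_words])
word_freq = pd.Series(list(chain.from_iterable(kept))).value_counts()
print(word_freq.to_dict())  # {'冰冰': 2, '欢迎': 1}
```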

If you have followed the earlier posts on this blog, you should already understand how Python web crawling works, so the crawler code itself is left out here; I encourage you to write it yourself, with these references:
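For orientation only: Bilibili serves comments as paged JSON from its public reply endpoint, so a crawler boils down to requesting pages and pulling fields out of each reply object. A hedged sketch — the endpoint is real, but the field names follow the commonly documented response shape and may change, and `oid` (the video's av number) is a placeholder:

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

API = "https://api.bilibili.com/x/v2/reply"  # public comment endpoint

def parse_page(payload):
    """Pull (user, gender, level, comment, likes) rows out of one JSON page."""
    rows = []
    for reply in payload.get("data", {}).get("replies") or []:
        member = reply["member"]
        rows.append((
            member["uname"],
            member["sex"],
            int(member["level_info"]["current_level"]),
            reply["content"]["message"],
            reply["like"],
        ))
    return rows

def fetch_comments(oid, pages=3):
    """Fetch `pages` pages of comments for video `oid` (type=1 means video)."""
    rows = []
    for pn in range(1, pages + 1):
        url = API + "?" + urlencode({"type": 1, "oid": oid, "pn": pn})
        with urlopen(url, timeout=10) as resp:
            rows += parse_page(json.load(resp))
    return rows
```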

Recommended reading:


That's all for this post. If it helped you, a like and a follow would be appreciated — your likes mean a lot to me.

