Using the Beautiful Soup Library
source link: https://www.fdevops.com/2022/08/31/beautiful-soup-31162
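Installing Beautiful Soup and Basic Usage
Beautiful Soup is a Python library for extracting data from HTML and XML files.
It works with the parser of your choice to provide idiomatic ways of navigating, searching, and modifying a document. Beautiful Soup can save you hours or even days of work.
Installing Beautiful Soup: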
pip install beautifulsoup4
A small example of using Beautiful Soup:
>>> import requests
>>> from bs4 import BeautifulSoup
# Fetch the page with requests
>>> r = requests.get("https://python123.io/ws/demo.html")
>>> demo = r.text
# Parse the page with Beautiful Soup
>>> soup = BeautifulSoup(demo, "html.parser")
>>> type(soup)
<class 'bs4.BeautifulSoup'>
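The example above needs network access. As a minimal offline sketch, the same steps can be run against an inline HTML string. The `SAMPLE_HTML` string below is a hand-written stand-in modeled on the demo page, not the page's exact source.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the demo page, so the example runs without a network
SAMPLE_HTML = """<html><head><title>This is a python demo page</title></head>
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language.
<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and
<a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>
</body></html>"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

page_title = soup.title.string         # text inside <title>
first_link = soup.a["href"]            # href of the first <a> tag
link_count = len(soup.find_all("a"))   # number of <a> tags in the document

print(page_title)
print(first_link)
print(link_count)
```

Parsing a string this way is also convenient for testing, since the result does not depend on the remote page staying unchanged.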
Basic Elements of the Beautiful Soup Library
Parsers for Beautiful Soup
Parser | Usage | Requirement |
Python's built-in HTML parser | BeautifulSoup(mk, 'html.parser') | Install the bs4 library |
lxml's HTML parser | BeautifulSoup(mk, 'lxml') | pip install lxml |
lxml's XML parser | BeautifulSoup(mk, 'xml') | pip install lxml |
html5lib's parser | BeautifulSoup(mk, 'html5lib') | pip install html5lib |
Each of the four parsers above has its own strengths and weaknesses. They are summarized below to help you choose the right parser for your scenario.
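Parser | Advantages | Disadvantages |
Python's built-in HTML parser | Part of the Python standard library; decent speed; lenient with malformed documents | Not very lenient in versions before Python 2.7.3 and 3.2.2 |
lxml's HTML parser | Very fast; lenient with malformed documents | Requires an external C library |
lxml's XML parser | Very fast; the only supported XML parser | Requires an external C library |
html5lib's parser | Most lenient; parses pages the way a browser does; produces valid HTML5 | Very slow; requires an external Python package |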
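Because lxml and html5lib are optional installs, a script may want to fall back to the built-in parser when they are missing. The sketch below shows one way to do that; `make_soup` and its preference order are illustrative assumptions, not part of bs4. It relies on Beautiful Soup raising `FeatureNotFound` when the requested parser is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound

def make_soup(markup, preferred=("lxml", "html5lib", "html.parser")):
    """Hypothetical helper: parse with the first parser in `preferred` that is installed."""
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser), parser
        except FeatureNotFound:
            # This parser is not installed; try the next one
            continue
    raise RuntimeError("no usable parser found")

# Deliberately unclosed tags: every parser repairs them, though the trees differ in detail
soup, used = make_soup("<p>Hello<b>world")
print(used)            # which parser was actually picked on this machine
print(soup.b.string)
```

Different parsers can build different trees from the same malformed markup, so it is worth fixing one parser explicitly when results must be reproducible.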
Basic Elements of the BeautifulSoup Class
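Element | Description |
Tag | A tag, the most basic unit of information; its start and end are marked with <> and </> |
Name | The tag's name; the name of <p>…</p> is 'p'; syntax: <tag>.name |
Attributes | The tag's attributes, organized as a dictionary; syntax: <tag>.attrs |
NavigableString | The non-attribute string inside a tag, that is, the text between <> and </>; syntax: <tag>.string |
Comment | The comment portion of a tag's string; a special type of NavigableString |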
Demonstration of the basic elements
>>> type(soup.a)
<class 'bs4.element.Tag'>
>>> soup.a
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>
>>> type(soup.p.name)
<class 'str'>
>>> soup.p.name # the name of the p tag
'p'
>>> soup.p.parent.name # the name of the p tag's parent
'body'
Attributes
>>> type(soup.a.attrs)
<class 'dict'>
>>> soup.a.attrs # all attributes of the a tag
{'href': 'http://www.icourse163.org/course/BIT-268001', 'class': ['py1'], 'id': 'link1'}
>>> soup.a.attrs["href"] # the href attribute of the a tag
'http://www.icourse163.org/course/BIT-268001'
>>> soup.a.attrs["class"] # the class attribute of the a tag
['py1']
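The session above shows that `class` comes back as a list while `href` is a plain string. A short sketch of the related access patterns follows; the inline markup and the `example.com` URL are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with a two-valued class attribute
soup = BeautifulSoup(
    '<a class="py1 course" href="http://example.com/" id="link1">Basic Python</a>',
    "html.parser",
)
tag = soup.a

classes = tag["class"]        # multi-valued attribute: parsed into a list of strings
href = tag["href"]            # tag[...] is shorthand for tag.attrs[...]
missing = tag.get("title")    # .get() returns None instead of raising KeyError

print(classes, href, missing)
```

Using `.get()` is the safer choice when scraping pages where an attribute may be absent on some tags.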
NavigableString
>>> type(soup.p.string)
<class 'bs4.element.NavigableString'>
>>> soup.p.string
'The demo python introduces several python courses.'
Comment
>>> soup = BeautifulSoup("<b><!-- This is a comment! --></b><p>This is not a comment!</p>", "html.parser")
>>> type(soup.b.string)
<class 'bs4.element.Comment'>
>>> type(soup.p.string)
<class 'bs4.element.NavigableString'>
>>> soup.b.string
' This is a comment! '
>>> soup.p.string
'This is not a comment!'
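Since `Comment` is a subclass of `NavigableString`, code that collects text usually has to tell the two apart by type. A minimal sketch, reusing the markup from the session above:

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString

soup = BeautifulSoup(
    "<b><!-- This is a comment! --></b><p>This is not a comment!</p>",
    "html.parser",
)

comments = []
texts = []
for node in soup.descendants:
    if isinstance(node, Comment):
        # Check Comment first, because every Comment is also a NavigableString
        comments.append(str(node))
    elif isinstance(node, NavigableString):
        texts.append(str(node))

print(comments)
print(texts)
```

Checking the types in the opposite order would put the comment into `texts`, which is a common source of stray comment text in scraped output.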
Traversing HTML Content
Common attributes for downward traversal of the tag tree
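Attribute | Description |
.contents | A list of child nodes; stores all direct children of <tag> in a list |
.children | An iterator over child nodes; like .contents, but meant for looping over direct children |
.descendants | An iterator over all descendant nodes, meant for looping over every descendant |
Examples: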
.contents
>>> soup = BeautifulSoup(r.text, "html.parser")
>>> soup.head
<head><title>This is a python demo page</title></head>
>>> soup.head.contents
[<title>This is a python demo page</title>]
>>> len(soup.head.contents)
1
.children
>>> soup.head.children # an iterator over the direct children
<list_iterator object at 0x103df8fd0>
>>> for child in soup.head.children:
...     print(child)
...
<title>This is a python demo page</title>
.descendants
>>> soup.body.descendants # a generator over all descendant nodes
<generator object Tag.descendants at 0x104367ad0>
>>> for child in soup.body.descendants:
...     print(child)
...
<p class="title"><b>The demo python introduces several python courses.</b></p>
<b>The demo python introduces several python courses.</b>
The demo python introduces several python courses.
...
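To make the difference between the three attributes concrete, the sketch below counts what each one yields on a small hand-written fragment (the markup is an assumption chosen for illustration).

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

soup = BeautifulSoup("<div><p>one <b>two</b></p><p>three</p></div>", "html.parser")
div = soup.div

direct = div.contents                            # list of the two <p> children
child_names = [c.name for c in div.children]     # same nodes, visited lazily
all_nodes = list(div.descendants)                # every tag and string below <div>
descendant_tags = [n.name for n in all_nodes if isinstance(n, Tag)]

print(len(direct), child_names, len(all_nodes), descendant_tags)
```

Note that `.descendants` also yields the text nodes (`'one '`, `'two'`, `'three'`), which is why it returns six nodes here but only three tags.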
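Common attributes for upward traversal of the tag tree
Attribute | Description |
.parent | The node's parent tag |
.parents | An iterator over the node's ancestor tags, meant for looping over ancestors |
Examples: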
.parent
>>> soup.title.parent # the parent of the title tag
<head><title>This is a python demo page</title></head>
>>> soup.b.parent # the parent of the b tag
<p class="title"><b>The demo python introduces several python courses.</b></p>
.parents
>>> for parent in soup.a.parents: # print the names of all ancestors of the a tag
...     if parent is not None:
...         print(parent.name)
...
p
body
html
[document]
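One practical use of `.parents` is building a path from the document root down to a tag. The following is a small sketch on hand-written markup; the `path` and `path_str` names are illustrative.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<html><body><p>See <a href='#'>link</a></p></body></html>",
    "html.parser",
)

# .parents walks upward: nearest ancestor first, the BeautifulSoup object last
path = [parent.name for parent in soup.a.parents]

# Reverse it to read from the root down to the tag itself
path_str = " > ".join(list(reversed(path)) + [soup.a.name])

print(path)
print(path_str)
```

The last ancestor is the `BeautifulSoup` object itself, whose name is the special string `[document]`.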
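Common attributes for sibling traversal of the tag tree
Attribute | Description |
.next_sibling | The next sibling node in HTML document order |
.previous_sibling | The previous sibling node in HTML document order |
.next_siblings | An iterator over all following sibling nodes in HTML document order |
.previous_siblings | An iterator over all preceding sibling nodes in HTML document order |
Examples: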
.next_sibling
>>> soup.a.next_sibling # a sibling of the a tag may be a NavigableString rather than a tag; this is expected, and siblings can be filtered by type
' and '
>>> soup.a.next_sibling.next_sibling # the second sibling after the a tag
<a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>
.previous_sibling
>>> soup.a.previous_sibling # again, the sibling may be a NavigableString rather than a tag; filter by type if only tags are wanted
'Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:\r\n'
>>> soup.a.parent # the parent shows that the node before the a tag is a NavigableString
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>
.next_siblings
>>> for sibling in soup.a.next_siblings:
...     print(sibling)
...
and
<a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>
.previous_siblings
>>> for sibling in soup.a.previous_siblings:
...     print(sibling)
...
Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
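As the sessions above show, siblings can be strings as well as tags. A minimal sketch of filtering siblings by type follows, using simplified hand-written markup. `find_next_sibling` is shown as an alternative that skips strings for you.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = '<p>Intro: <a id="link1">Basic Python</a> and <a id="link2">Advanced Python</a>.</p>'
soup = BeautifulSoup(html, "html.parser")
first = soup.a

# Keep only siblings that are tags, dropping the ' and ' and '.' strings
tag_siblings = [s for s in first.next_siblings if isinstance(s, Tag)]

# find_next_sibling() returns the next sibling matching the given tag name
next_tag = first.find_next_sibling("a")

print([t["id"] for t in tag_siblings])
print(next_tag["id"])
```

The `isinstance` filter is the general approach; `find_next_sibling` is shorter when only a specific tag name is of interest.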
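Formatted HTML Output with bs4
The prettify() method of bs4 outputs HTML in a neatly indented, readable format.
Also note that Beautiful Soup converts every document it parses to Unicode and writes documents out as UTF-8 by default. Python 3 strings are Unicode and its default encoding is UTF-8, so bs4 works there without friction. If you are still on Python 2, upgrading to Python 3 is recommended; otherwise you will have to convert encodings repeatedly.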
>>> import requests
>>> from bs4 import BeautifulSoup
>>> r = requests.get("https://python123.io/ws/demo.html")
>>> demo = r.text
>>> soup = BeautifulSoup(demo, "html.parser")
>>> soup.prettify()
'<html>\n <head>\n  <title>\n   This is a python demo page\n  </title>\n </head>\n <body>\n  <p class="title">\n   <b>\n    The demo python introduces several python courses.\n   </b>\n  </p>\n  <p class="course">\n   Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:\n   <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">\n    Basic Python\n   </a>\n   and\n   <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">\n    Advanced Python\n   </a>\n   .\n  </p>\n </body>\n</html>'
>>> print(soup.prettify()) # pretty-print the whole document
<html>
 <head>
  <title>
   This is a python demo page
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The demo python introduces several python courses.
   </b>
  </p>
  <p class="course">
   Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
   <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">
    Basic Python
   </a>
   and
   <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">
    Advanced Python
   </a>
   .
  </p>
 </body>
</html>
>>> print(soup.a.prettify()) # pretty-print only the a tag
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">
 Basic Python
</a>
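The encoding behavior described in this section can be checked with a short offline sketch. It assumes a small GBK-encoded byte string as input and passes `from_encoding` so that Beautiful Soup does not have to guess the encoding.

```python
from bs4 import BeautifulSoup

raw = "<p>中文</p>".encode("gbk")    # input bytes in a non-UTF-8 encoding
soup = BeautifulSoup(raw, "html.parser", from_encoding="gbk")

text = soup.p.string                 # decoded to Unicode while parsing
as_bytes = soup.encode("utf-8")      # serialized back out as UTF-8 bytes
pretty = soup.prettify()             # a str, one node per line

print(text)
print(as_bytes)
print(pretty)
```

Whatever the input encoding, the parsed tree holds Unicode strings. An explicit encoding only matters again when the document is turned back into bytes.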
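This is an original article; reproducing articles from this site without authorization is prohibited.
Original source: 兰玉磊's personal blog
Original link: https://www.fdevops.com/2022/08/31/beautiful-soup-31162
License: This article is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International license.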