Using the Beautiful Soup Library

 1 year ago
source link: https://www.fdevops.com/2022/08/31/beautiful-soup-31162
兰玉磊 • 1 day ago • Python • 17 views

Installing Beautiful Soup and basic usage

Beautiful Soup is a Python library for extracting data from HTML and XML files.

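It works with the parser of your choice to provide idiomatic ways of navigating, searching, and modifying a document. Beautiful Soup can save you hours or even days of work.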

Install Beautiful Soup:

pip install beautifulsoup4

A small example of using Beautiful Soup:

>>> import requests
>>> from bs4 import BeautifulSoup

# Fetch the page with requests
>>> r = requests.get("https://python123.io/ws/demo.html")
>>> demo = r.text

# Parse the page with Beautiful Soup
>>> soup = BeautifulSoup(demo, "html.parser")
>>> type(soup)
<class 'bs4.BeautifulSoup'>
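
The same steps can also be tried as a script without network access. The sketch below is only an illustration: the inline `html` string is an assumed stand-in for the downloaded page, not the real content of demo.html.

```python
from bs4 import BeautifulSoup

# Assumption: a tiny inline document stands in for the downloaded page,
# so the example runs without any network request.
html = (
    "<html><head><title>Demo page</title></head>"
    "<body><p class='title'><b>Hello</b></p></body></html>"
)

# Parse the markup with Python's built-in parser.
soup = BeautifulSoup(html, "html.parser")

# Dotted access returns the first matching tag; .string is its text.
title_text = soup.title.string
bold_text = soup.p.b.string

print(title_text)
print(bold_text)
```

Parsing a literal string like this is a convenient way to experiment with the API before pointing the code at a real page.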

Basic elements of the Beautiful Soup library

Beautiful Soup parsers

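| Parser | Usage | Requirement |
| --- | --- | --- |
| Python's built-in HTML parser | BeautifulSoup(mk, "html.parser") | install the bs4 library |
| lxml HTML parser | BeautifulSoup(mk, "lxml") | pip install lxml |
| lxml XML parser | BeautifulSoup(mk, "xml") | pip install lxml |
| html5lib parser | BeautifulSoup(mk, "html5lib") | pip install html5lib |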

Each of the four parsers above has its own strengths and weaknesses. They are summarized below so you can choose the right parser for your scenario.

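| Parser | Advantages | Disadvantages |
| --- | --- | --- |
| Python's built-in HTML parser | Part of the Python standard library; moderate speed; tolerant of malformed documents | Poor tolerance of malformed documents in Python versions before 2.7.3 and 3.2.2 |
| lxml HTML parser | Fast; tolerant of malformed documents | Requires an external C library |
| lxml XML parser | Fast; the only parser that supports XML | Requires an external C library |
| html5lib parser | Best tolerance of malformed documents; parses documents the way a browser does; produces HTML5 output | Slow; requires an external Python package |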
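
One way to act on these trade-offs is to prefer a fast parser and fall back to the built-in one when it is not installed. The sketch below assumes that Beautiful Soup raises `FeatureNotFound` for a parser that is unavailable; `make_soup` and its `preferred` parameter are hypothetical names introduced only for this illustration.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html.parser")):
    """Parse markup with the first parser in `preferred` that is installed.

    Hypothetical helper for illustration; not part of bs4.
    """
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser)
        except FeatureNotFound:
            # This parser is not installed; try the next candidate.
            continue
    raise RuntimeError("none of the requested parsers is available")


# Malformed input: the <p> tag is never closed.
soup = make_soup("<p>unclosed paragraph")
paragraph_text = soup.p.string
print(paragraph_text)
```

Keep in mind that different parsers can build different trees from the same malformed markup, so code that depends on exact structure is safer when it names one parser explicitly.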

Basic elements of the BeautifulSoup class

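| Element | Description |
| --- | --- |
| Tag | A tag, the most basic unit of information; opened with <> and closed with </> |
| Name | The tag's name; the name of <p>…</p> is 'p'. Format: <tag>.name |
| Attributes | The tag's attributes, organized as a dictionary. Format: <tag>.attrs |
| NavigableString | The non-attribute string inside a tag, that is, the text between <> and </>. Format: <tag>.string |
| Comment | The comment portion of a string inside a tag; a special type of NavigableString |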

Demonstration of the basic elements

>>> type(soup.a)
<class 'bs4.element.Tag'>
>>> soup.a
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>
>>> type(soup.p.name)
<class 'str'>
>>> soup.p.name  # name of the p tag
'p'
>>> soup.p.parent.name  # name of the p tag's parent
'body'

Attributes

>>> type(soup.a.attrs)
<class 'dict'>
>>> soup.a.attrs  # all attributes of the a tag
{'href': 'http://www.icourse163.org/course/BIT-268001', 'class': ['py1'], 'id': 'link1'}
>>> soup.a.attrs["href"]  # the href attribute of the a tag
'http://www.icourse163.org/course/BIT-268001'
>>> soup.a.attrs["class"]  # the class attribute of the a tag
['py1']
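
Indexing `.attrs` raises `KeyError` when an attribute is absent. As a hedged sketch of safer access, the example below uses `Tag.get` and `Tag.has_attr` on an inline tag; the markup and the URL in it are made up for illustration.

```python
from bs4 import BeautifulSoup

# Assumption: a single made-up link, not taken from the demo page.
soup = BeautifulSoup(
    '<a class="py1 external" href="http://example.com/course" id="link1">'
    "Basic Python</a>",
    "html.parser",
)
link = soup.a

href = link["href"]            # same as link.attrs["href"]
classes = link["class"]        # class is multi-valued, so this is a list
missing = link.get("title")    # no KeyError: returns None when absent
has_id = link.has_attr("id")   # explicit membership check

print(href, classes, missing, has_id)
```

Note that `class` comes back as a list even when it holds one value, which matches the `['py1']` result shown above.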

NavigableString

>>> type(soup.p.string)
<class 'bs4.element.NavigableString'>
>>> soup.p.string
'The demo python introduces several python courses.'

Comment

>>> soup = BeautifulSoup("<b><!-- This is a comment! --></b><p>This is not a comment!</p>", "html.parser")
>>> type(soup.b.string)
<class 'bs4.element.Comment'>
>>> type(soup.p.string)
<class 'bs4.element.NavigableString'>
>>> soup.b.string
' This is a comment! '
>>> soup.p.string
'This is not a comment!'
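
Because `.string` returns a comment's text without the `<!-- -->` markers, the value alone does not reveal whether it came from a comment. A minimal sketch of a type check follows; `visible_string` is a hypothetical helper name, not part of bs4.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

soup = BeautifulSoup(
    "<b><!-- hidden note --></b><p>visible text</p>", "html.parser"
)


def visible_string(tag):
    """Return the tag's string as str, or None for a comment or no string."""
    text = tag.string
    if text is None or isinstance(text, Comment):
        return None
    return str(text)


b_text = visible_string(soup.b)   # the only child of <b> is a comment
p_text = visible_string(soup.p)   # ordinary text

print(b_text, p_text)
```

The check relies on `Comment` being a subclass of `NavigableString`, so testing for `Comment` specifically is what separates the two.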

Traversing HTML content

Common attributes for traversing down the tag tree

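| Attribute | Description |
| --- | --- |
| .contents | List of child nodes; all direct children of <tag> are stored in a list |
| .children | Iterator over child nodes; similar to .contents, used to loop over direct children |
| .descendants | Iterator over descendant nodes; includes every descendant, used for looping |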

Examples:

.contents

>>> soup = BeautifulSoup(r.text, "html.parser")
>>> soup.head
<head><title>This is a python demo page</title></head>
>>> soup.head.contents
[<title>This is a python demo page</title>]
>>> len(soup.head.contents)
1

.children

>>> soup.head.children # an iterator over the direct children
<list_iterator object at 0x103df8fd0>
>>> for child in soup.head.children:
...     print(child)
...
<title>This is a python demo page</title>

.descendants

>>> soup.body.descendants # a generator that walks all descendant nodes
<generator object Tag.descendants at 0x104367ad0>
>>> for child in soup.body.descendants:
...     print(child)
...
<p class="title"><b>The demo python introduces several python courses.</b></p>
<b>The demo python introduces several python courses.</b>
The demo python introduces several python courses.

...
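
Downward traversal yields text nodes (including whitespace between tags) as well as tags. The sketch below, using an assumed inline document rather than the demo page, keeps only `Tag` objects.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Assumption: a small inline document with newlines between the tags.
html = "<body>\n<p>first</p>\n<p>second <b>bold</b></p>\n</body>"
soup = BeautifulSoup(html, "html.parser")

# .contents counts the newline strings between tags as children too.
n_children = len(soup.body.contents)

# Keep only real tags among the direct children.
child_tags = [c.name for c in soup.body.children if isinstance(c, Tag)]

# .descendants also reaches nested tags such as <b>.
descendant_tags = [d.name for d in soup.body.descendants if isinstance(d, Tag)]

print(n_children, child_tags, descendant_tags)
```

Filtering by type this way avoids attribute errors when a loop expects every node to have `.name` or `.attrs` like a tag.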

Common attributes for traversing up the tag tree

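| Attribute | Description |
| --- | --- |
| .parent | The node's parent tag |
| .parents | Iterator over the node's ancestor tags, used to loop over ancestors |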

Examples:

.parent

>>> soup.title.parent # parent of the title tag
<head><title>This is a python demo page</title></head>
>>> soup.b.parent # parent of the b tag
<p class="title"><b>The demo python introduces several python courses.</b></p>

.parents

>>> for parent in soup.a.parents: # print the names of all ancestors of the a tag
...     if parent is not None:
...         print(parent.name)
...
p
body
html
[document]
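
The ancestor names can be collected into a readable path, which is handy when debugging where a tag sits in the tree. This is a small sketch on an assumed inline document; the ` > ` separator is an arbitrary choice.

```python
from bs4 import BeautifulSoup

# Assumption: a minimal inline document for illustration.
soup = BeautifulSoup(
    "<html><body><p><a href='#'>link</a></p></body></html>", "html.parser"
)

# .parents walks from the nearest ancestor up to the BeautifulSoup object,
# whose name is the special string "[document]".
ancestors = [parent.name for parent in soup.a.parents]

# Reverse the list to read the path from the document root downward.
path = " > ".join(reversed(ancestors))

print(ancestors)
print(path)
```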

Common attributes for traversing sideways in the tag tree

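| Attribute | Description |
| --- | --- |
| .next_sibling | The next sibling node in HTML document order |
| .previous_sibling | The previous sibling node in HTML document order |
| .next_siblings | Iterator over all following sibling nodes in HTML document order |
| .previous_siblings | Iterator over all preceding sibling nodes in HTML document order |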

Examples:

.next_sibling

>>> soup.a.next_sibling  # a sibling of the a tag may be a NavigableString rather than a tag; filter by type when only tags are wanted
' and '
>>> soup.a.next_sibling.next_sibling # the sibling after the next one
<a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>

.previous_sibling

>>> soup.a.previous_sibling # the previous sibling may also be a NavigableString; filter by type when only tags are wanted
'Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:\r\n'
>>> soup.a.parent # the parent shows that the node before the a tag is a NavigableString
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>

.next_siblings

>>> for sibling in soup.a.next_siblings:
...     print(sibling)
...
 and 
<a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>
.

.previous_siblings

>>> for sibling in soup.a.previous_siblings:
...     print(sibling)
...
Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
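
Since siblings can be strings as well as tags, filtering by type is often the next step. The sketch below shows two options on an assumed inline paragraph modeled loosely on the demo page: an `isinstance` filter and `find_next_sibling`, which skips non-matching nodes.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Assumption: simplified markup, not the exact demo page.
html = (
    '<p>Courses: <a id="link1">Basic Python</a> and '
    '<a id="link2">Advanced Python</a>.</p>'
)
soup = BeautifulSoup(html, "html.parser")
first = soup.a

# The immediate next sibling is the text between the two links.
raw_next = first.next_sibling

# Option 1: keep only Tag objects among the following siblings.
tag_siblings = [s for s in first.next_siblings if isinstance(s, Tag)]

# Option 2: ask directly for the next sibling that is an <a> tag.
next_link = first.find_next_sibling("a")

print(repr(raw_next), [t["id"] for t in tag_siblings], next_link["id"])
```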

Formatted HTML output with bs4

The prettify() method of bs4 outputs HTML code in a cleanly indented, readable format.

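Also note that bs4 converts every document it reads to Unicode and writes its output as UTF-8. Python 3 strings are Unicode and its default encoding is UTF-8, so bs4 works there without friction. If you are still on Python 2, upgrading to Python 3 is recommended; otherwise you will have to convert encodings constantly.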

>>> import requests
>>> from bs4 import BeautifulSoup
>>> r = requests.get("https://python123.io/ws/demo.html")
>>> demo = r.text
>>> soup = BeautifulSoup(demo, "html.parser")
>>> soup.prettify()
'<html>\n <head>\n  <title>\n   This is a python demo page\n  </title>\n </head>\n <body>\n  <p class="title">\n   <b>\n    The demo python introduces several python courses.\n   </b>\n  </p>\n  <p class="course">\n   Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:\n   <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">\n    Basic Python\n   </a>\n   and\n   <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">\n    Advanced Python\n   </a>\n   .\n  </p>\n </body>\n</html>'
>>> print(soup.prettify()) # pretty-print the whole document
<html>
 <head>
  <title>
   This is a python demo page
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The demo python introduces several python courses.
   </b>
  </p>
  <p class="course">
   Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
   <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">
    Basic Python
   </a>
   and
   <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">
    Advanced Python
   </a>
   .
  </p>
 </body>
</html>
>>> print(soup.a.prettify()) # pretty-print only the a tag
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">
 Basic Python
</a>
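
To make the encoding note concrete, the sketch below compares three ways of serializing a parsed document. The inline markup is an assumption chosen to include non-ASCII text.

```python
from bs4 import BeautifulSoup

# Assumption: a short inline document containing non-ASCII characters.
soup = BeautifulSoup("<p>中文 <b>text</b></p>", "html.parser")

pretty = soup.prettify()      # indented text, returned as str
compact = str(soup)           # unindented text, returned as str
raw = soup.encode("utf-8")    # the same markup as UTF-8 bytes

print(type(pretty).__name__, type(compact).__name__, type(raw).__name__)
print(compact)
```

Use `prettify()` for reading and debugging; it inserts newlines and indentation, so `str(soup)` or `encode()` is the safer choice when whitespace in the output matters.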

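This is an original article. Reproduction of articles from this site without authorization is prohibited.
Source: 兰玉磊's personal blog
Original link: https://www.fdevops.com/2022/08/31/beautiful-soup-31162
License: this article is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International license.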

