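[Reading time] Reference document, meant for lookup
[Summary] XPath usage and examples for quick reference (the text after ➜ is the output of the corresponding statement)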

XPath Example Notes

from lxml import etree
sample1 = """<html>
  <head>
    <title>My page</title>
  </head>
  <body>
    <h2>Welcome to my <a href="#" src="x">page</a></h2>
    <p>This is the first paragraph.</p>
    <!-- this is the end -->
  </body>
</html>
"""
def getxpath(html):
    return etree.HTML(html)
s1 = getxpath(sample1)

// selects matching nodes anywhere in the document, at any depth; text() returns the text content of the selected node

s1.xpath('//title/text()') ➜ ['My page']

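/ at the start of an expression is an absolute path from the document root, so every step must be spelled out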

s1.xpath('/html/head/title/text()') ➜ ['My page']
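
The difference between the two path styles, and how a path behaves when evaluated from an element rather than from the whole tree, can be sketched as follows. The markup and the variable names are made up for illustration; only `etree.HTML` and `xpath()` come from lxml.

```python
from lxml import etree

# Hypothetical sample markup, trimmed down from sample1
html = ("<html><head><title>My page</title></head>"
        "<body><h2>Welcome to my <a href='#'>page</a></h2></body></html>")
root = etree.HTML(html)

# '//' matches at any depth, so the full path is not needed
anywhere = root.xpath('//a/text()')

# A leading '/' starts at the document root and needs every step
from_root = root.xpath('/html/body/h2/a/text()')

# The root element is <html>, so '/a' matches nothing
wrong_root = root.xpath('/a/text()')

# A path without a leading slash is relative to the element it is called on
body = root.xpath('/html/body')[0]
relative = body.xpath('h2/a/text()')

# './/' searches the descendants of the current element only
relative_desc = body.xpath('.//a/text()')

print(anywhere, from_root, wrong_root, relative, relative_desc)
# ['page'] ['page'] [] ['page'] ['page']
```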

Get the value of the src attribute

s1.xpath('//h2/a/@src') ➜ ['x']

Get the values of all href attributes

s1.xpath('//@href') ➜ ['#']
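
When several attributes of the same element are needed, it may be simpler to select the element once and read the attributes in Python. A minimal sketch, assuming a one-line fragment similar to sample1 (the fragment and variable names are illustrative):

```python
from lxml import etree

# Hypothetical fragment; etree.HTML wraps it in <html><body> automatically
root = etree.HTML('<h2>Welcome to my <a href="#" src="x">page</a></h2>')

# Select the element itself instead of one attribute value
link = root.xpath('//a')[0]

src = link.get('src')          # value of an existing attribute
missing = link.get('title')    # None when the attribute is absent
attrs = dict(link.attrib)      # all attributes as a plain dict

print(src, missing, attrs)
```

`get()` returns None for a missing attribute, whereas the `@attr` form simply yields an empty list when nothing matches.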

Get all text nodes in the page (whitespace between tags is included)

s1.xpath('//text()')
➜
['\n  ',
 '\n    ',
 'My page',
 '\n  ',
 '\n  ',
 '\n    ',
 'Welcome to my ',
 'page',
 '\n    ',
 'This is the first paragraph.',
 '\n    ',
 '\n  ',
 '\n']
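
Most of the entries in that list are whitespace between tags. One way to drop them is a `normalize-space()` predicate, which is false for text nodes that contain only whitespace. The sketch below uses a copy of the sample markup without the comment; the variable names are illustrative.

```python
from lxml import etree

html = """<html>
  <head>
    <title>My page</title>
  </head>
  <body>
    <h2>Welcome to my <a href="#" src="x">page</a></h2>
    <p>This is the first paragraph.</p>
  </body>
</html>
"""
root = etree.HTML(html)

# Keep only text nodes that still have content after whitespace is collapsed
non_blank = root.xpath('//text()[normalize-space()]')

# The predicate filters nodes but does not trim them, so strip in Python
cleaned = [t.strip() for t in non_blank]

print(non_blank)
# ['My page', 'Welcome to my ', 'page', 'This is the first paragraph.']
print(cleaned)
# ['My page', 'Welcome to my', 'page', 'This is the first paragraph.']
```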

Get all comments in the page

s1.xpath('//comment()') ➜ [<!-- this is the end -->]
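
The result above is a list of comment nodes, not strings. A short sketch of reading the comment body through the node's `.text` attribute, using a made-up fragment:

```python
from lxml import etree

# Hypothetical markup containing a single comment
root = etree.HTML('<html><body><p>Hi</p><!-- this is the end --></body></html>')

comments = root.xpath('//comment()')

# .text holds the comment body, including the surrounding spaces
bodies = [c.text for c in comments]
trimmed = [c.text.strip() for c in comments]

print(bodies)   # [' this is the end ']
print(trimmed)  # ['this is the end']
```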

sample2 = """
<html>
  <body>
    <ul>
      <li>Quote 1</li>
      <li>Quote 2 with <a href="...">link</a></li>
      <li>Quote 3 with <a href="...">another link</a></li>
      <li><h2>Quote 4 title</h2>Something here.</li>
    </ul>
  </body>
</html>
"""
s2 = getxpath(sample2)

Get the text directly inside every li (text inside child elements such as a or h2 is not included)

s2.xpath('//li/text()') ➜ ['Quote 1', 'Quote 2 with ', 'Quote 3 with ', 'Something here.']

Get the text of the first and the second li; the long and short forms are equivalent

s2.xpath('//li[position() = 1]/text()') ➜ ['Quote 1']

s2.xpath('//li[1]/text()') ➜ ['Quote 1']

s2.xpath('//li[position() = 2]/text()') ➜ ['Quote 2 with ']

s2.xpath('//li[2]/text()') ➜ ['Quote 2 with ']

Odd positions, even positions, and the last item (the mod operator needs a space before its operand)

s2.xpath('//li[position() mod 2 = 1]/text()') ➜ ['Quote 1', 'Quote 3 with ']

s2.xpath('//li[position() mod 2 = 0]/text()') ➜ ['Quote 2 with ', 'Something here.']

s2.xpath('//li[last()]/text()') ➜ ['Something here.']
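
Positional predicates are evaluated per parent, which only becomes visible once a page has more than one list. The sketch below uses a made-up document with two ul elements to contrast `//li[1]` with `(//li)[1]`:

```python
from lxml import etree

# Hypothetical markup with two separate lists
html = "<ul><li>A1</li><li>A2</li></ul><ul><li>B1</li><li>B2</li></ul>"
root = etree.HTML(html)

# First li within each parent: one result per list
first_per_list = root.xpath('//li[1]/text()')

# Parentheses build the whole node set first, then index into it
first_overall = root.xpath('(//li)[1]/text()')

# The same distinction applies to last()
last_per_list = root.xpath('//li[last()]/text()')
last_overall = root.xpath('(//li)[last()]/text()')

print(first_per_list, first_overall)  # ['A1', 'B1'] ['A1']
print(last_per_list, last_overall)    # ['A2', 'B2'] ['B2']
```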

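Text of every li that has an a child (the predicate [a] filters the li elements; the text returned is the li's own)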

s2.xpath('//li[a]/text()') ➜ ['Quote 2 with ', 'Quote 3 with ']

Text of every li that has an a or an h2 child

s2.xpath('//li[a or h2]/text()') ➜ ['Quote 2 with ', 'Quote 3 with ', 'Something here.']

Use | to get the content of both a and h2 in one query

s2.xpath('//a/text()|//h2/text()') ➜ ['link', 'another link', 'Quote 4 title']
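
The union result follows the order of the nodes in the document, not the order of the operands around `|`. A small sketch with made-up markup in which the h2 comes before the links:

```python
from lxml import etree

# Hypothetical markup: the h2 appears before both links
html = ("<div><h2>Title</h2><a href='#'>first</a>"
        "<p>text</p><a href='#'>second</a></div>")
root = etree.HTML(html)

# 'a' is written first in the expression, but h2 comes first in the document
tags = [el.tag for el in root.xpath('//a|//h2')]
texts = root.xpath('//a/text()|//h2/text()')

print(tags)   # ['h2', 'a', 'a']
print(texts)  # ['Title', 'first', 'second']
```

If the results must be grouped by tag instead, running two separate queries and concatenating the lists is one option.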

sample3 = """<html>
  <body>
    <ul>
      <li id="begin"><a href="https://scrapy.org">Scrapy</a>begin</li>
      <li><a href="https://scrapinghub.com">Scrapinghub</a></li>
      <li><a href="https://blog.scrapinghub.com">Scrapinghub Blog</a></li>
      <li id="end"><a href="http://quotes.toscrape.com">Quotes To Scrape</a>end</li>
      <li data-xxxx="end" abc="abc"><a href="http://quotes.toscrape.com">Quotes To Scrape</a>end</li>
    </ul>
  </body>
</html>
"""
s3 = getxpath(sample3)

Get the a elements whose href starts with https

s3.xpath('//a[starts-with(@href, "https")]/text()') ➜ ['Scrapy', 'Scrapinghub', 'Scrapinghub Blog']

Match href = https://scrapy.org

s3.xpath('//li/a[@href="https://scrapy.org"]/text()') ➜ ['Scrapy']

Match id = begin

s3.xpath('//li[@id="begin"]/text()') ➜ ['begin']

Match text = Scrapinghub

s3.xpath('//li/a[text()="Scrapinghub"]/text()') ➜ ['Scrapinghub']

Match any attribute of a tag by value, including custom attributes

s3.xpath('//li[@data-xxxx="end"]/text()') ➜ ['end']

s3.xpath('//li[@abc="abc"]/text()') ➜ ['end']
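
When the value to match comes from Python, lxml accepts XPath variables as keyword arguments to `xpath()`, which avoids building the expression by string concatenation. A minimal sketch with a shortened copy of sample3; the variable names `wanted` and `prefix` are arbitrary:

```python
from lxml import etree

# Shortened, hypothetical variant of sample3
html = """<ul>
  <li id="begin"><a href="https://scrapy.org">Scrapy</a>begin</li>
  <li id="end"><a href="http://quotes.toscrape.com">Quotes To Scrape</a>end</li>
</ul>"""
root = etree.HTML(html)

# $wanted is bound to the keyword argument of the same name
begin_text = root.xpath('//li[@id=$wanted]/text()', wanted='begin')
end_text = root.xpath('//li[@id=$wanted]/text()', wanted='end')

# Variables work as function arguments too
secure = root.xpath('//a[starts-with(@href, $prefix)]/text()', prefix='https')

print(begin_text, end_text, secure)
# ['begin'] ['end'] ['Scrapy']
```

Because the value is passed separately, quotes inside it do not need escaping.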

sample4 = u"""
<html>
  <head>
    <title>My page</title>
  </head>
  <body>
    <h2>Welcome to my <a href="#" src="x">page</a></h2>
    <p>This is the first paragraph.</p>
    <p class="test">
    编程语言<a href="#">python</a>
    <img src="#" alt="test"/>javascript
    <a href="#"><strong>C#</strong>JAVA</a>
    </p>
    <p class="content-a">a</p>
    <p class="content-b">b</p>
    <p class="content-c">c</p>
    <p class="content-d">d</p>
    <p class="econtent-e">e</p>
    <!-- this is the end -->
  </body>
</html>
"""
s4 = etree.HTML(sample4)

Get the text nodes directly inside the element with class = test (text inside its child elements is not included)

s4.xpath('//p[@class="test"]/text()')
➜ ['\n    编程语言', '\n    ', 'javascript\n    ', '\n    ']

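Use string() to get the concatenated text of the element, descendants included; strip() removes leading and trailing characters, whitespace by default

print(s4.xpath('string(//p[@class="test"])').strip())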
➜
编程语言python
    javascript
    C#JAVA
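
The output above still carries the line breaks and indentation from the markup. If a single clean line is preferred, `normalize-space()` collapses every run of whitespace into one space. The sketch below uses an English variant of the paragraph from sample4 (an assumption for readability) and also shows that `itertext()` yields the same raw text as `string()`:

```python
from lxml import etree

# Hypothetical English variant of the class="test" paragraph
html = """<p class="test">
    Languages <a href="#">python</a>
    <img src="#" alt="test"/>javascript
    <a href="#"><strong>C#</strong>JAVA</a>
    </p>"""
root = etree.HTML(html)

# string() keeps the original whitespace
raw = root.xpath('string(//p[@class="test"])')

# normalize-space() trims both ends and collapses inner whitespace
flat = root.xpath('normalize-space(//p[@class="test"])')

# The same raw text, assembled in Python from the element
p = root.xpath('//p[@class="test"]')[0]
joined = ''.join(p.itertext())

print(flat)  # Languages python javascript C#JAVA
```

Note that `string()` and `normalize-space()` only look at the first node when given a node set, so for several elements they have to be applied per element.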

Get the p elements whose class attribute starts with content

s4.xpath('//p[starts-with(@class,"content")]/text()') ➜ ['a', 'b', 'c', 'd']

Get all elements whose class attribute contains content (a substring match, so econtent-e matches too)

s4.xpath('//*[contains(@class,"content")]/text()') ➜ ['a', 'b', 'c', 'd', 'e']
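
Both `starts-with()` and `contains()` treat the class attribute as one plain string, which can surprise when an element carries several classes. A common XPath 1.0 idiom for matching one whole class name pads the attribute with spaces before searching. The markup below is made up to show the three behaviours side by side:

```python
from lxml import etree

# Hypothetical markup: the last p has two classes
html = ('<div><p class="content-a">a</p>'
        '<p class="econtent-e">e</p>'
        '<p class="note content-a">n</p></div>')
root = etree.HTML(html)

# Substring match: also hits "econtent-e"
substring = root.xpath('//p[contains(@class, "content")]/text()')

# Prefix match: misses "note content-a" because the string starts with "note"
prefix = root.xpath('//p[starts-with(@class, "content")]/text()')

# Whole-token match: pad with spaces, then look for " content-a "
token = root.xpath(
    '//p[contains(concat(" ", normalize-space(@class), " "), " content-a ")]/text()'
)

print(substring)  # ['a', 'e', 'n']
print(prefix)     # ['a']
print(token)      # ['a', 'n']
```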
