Xpath-Wiki

 3 years ago
source link: https://charlesliuyx.github.io/2017/08/28/Xpath%E4%BD%BF%E7%94%A8%E6%8C%87%E5%8D%97/

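[Reading time] Reference document
[Summary] A reference of XPath usage and examples, meant for lookup (the text after ➜ is the output of the corresponding statement)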

XPath Example Notes

from lxml import etree
sample1 = """<html>
<head>
<title>My page</title>
</head>
<body>
<h2>Welcome to my <a href="#" src="x">page</a></h2>
<p>This is the first paragraph.</p>
<!-- this is the end -->
</body>
</html>
"""
def getxpath(html):
    # Parse an HTML string and return the root element, ready for .xpath() queries
    return etree.HTML(html)
s1 = getxpath(sample1)

// selects matching nodes at any depth in the document; text() returns the text content of the selected nodes

s1.xpath('//title/text()') ➜ ['My page']

A leading / is an absolute path that starts at the document root

s1.xpath('/html/head/title/text()') ➜ ['My page']
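
The difference between `//`, a leading `/`, and `.//` matters most when a query is run on a sub-element rather than on the root. A minimal sketch, assuming a shortened copy of the sample page (the variable names are illustrative only):

```python
from lxml import etree

# Shortened, hypothetical copy of the sample page
html = "<html><head><title>My page</title></head><body><h2>Welcome</h2><p>First</p></body></html>"
root = etree.HTML(html)

# '//' matches at any depth, so the full path from the root is not needed
anywhere = root.xpath('//title/text()')
# A leading '/' starts at the document root and names every step
from_root = root.xpath('/html/head/title/text()')

# When called on an element, './/' searches only that element's descendants,
# while a bare '//' still searches the whole document
body = root.xpath('/html/body')[0]
inside_body = body.xpath('.//title/text()')
whole_doc = body.xpath('//title/text()')

print(anywhere, from_root, inside_body, whole_doc)
# ['My page'] ['My page'] [] ['My page']
```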

Get the value of the src attribute

s1.xpath('//h2/a/@src') ➜ ['x']

Get the values of all href attributes

s1.xpath('//@href') ➜ ['#']
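
Attribute queries return plain-looking strings, but in lxml these are "smart strings" that remember where they came from. The sketch below, on a hypothetical one-line snippet, uses `getparent()` and `attrname` to recover the owning element:

```python
from lxml import etree

root = etree.HTML('<h2>Welcome to my <a href="#" src="x">page</a></h2>')

# Every attribute of every <a> element
values = root.xpath('//a/@*')

# Each result knows the element it belongs to and the attribute name
pairs = [(v.getparent().tag, v.attrname, str(v)) for v in values]
print(pairs)
# [('a', 'href', '#'), ('a', 'src', 'x')]
```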

Get all text nodes in the page (including whitespace-only ones)

s1.xpath('//text()')

['\n ',
'\n ',
'My page',
'\n ',
'\n ',
'\n ',
'Welcome to my ',
'page',
'\n ',
'This is the first paragraph.',
'\n ',
'\n ',
'\n']
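
Most of the entries above are whitespace between tags. One way to drop them, sketched here on a reduced copy of the sample, is to add a `normalize-space()` predicate so that only non-blank text nodes are kept:

```python
from lxml import etree

# Reduced, hypothetical copy of the sample page
html = """<html>
  <head>
    <title>My page</title>
  </head>
  <body>
    <h2>Welcome to my <a href="#">page</a></h2>
    <p>This is the first paragraph.</p>
  </body>
</html>"""
root = etree.HTML(html)

# [normalize-space()] is false for text nodes that contain only whitespace
texts = root.xpath('//text()[normalize-space()]')
print(texts)
# ['My page', 'Welcome to my ', 'page', 'This is the first paragraph.']
```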

Get all comments in the page

s1.xpath('//comment()') ➜ [<!-- this is the end -->]
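
The result above is a list of comment nodes, not strings. If only the comment text is needed, reading each node's `.text` should be enough; a small sketch on a hypothetical snippet:

```python
from lxml import etree

root = etree.HTML('<html><body><p>Hi</p><!-- this is the end --></body></html>')

comments = root.xpath('//comment()')
# .text holds the content between '<!--' and '-->', surrounding spaces included
comment_texts = [c.text for c in comments]
print(comment_texts)
# [' this is the end ']
```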


sample2 = """
<html>
<body>
<ul>
<li>Quote 1</li>
<li>Quote 2 with <a href="...">link</a></li>
<li>Quote 3 with <a href="...">another link</a></li>
<li><h2>Quote 4 title</h2>Something here.</li>
</ul>
</body>
</html>
"""
s2 = getxpath(sample2)

Get the text directly inside every li (text inside child elements is not included)

s2.xpath('//li/text()') ➜ ['Quote 1', 'Quote 2 with ', 'Quote 3 with ', 'Something here.']

Get the text of the first and second li; the two forms are equivalent

s2.xpath('//li[position() = 1]/text()') ➜ ['Quote 1']


s2.xpath('//li[1]/text()') ➜ ['Quote 1']
s2.xpath('//li[position() = 2]/text()') ➜ ['Quote 2 with ']
s2.xpath('//li[2]/text()') ➜ ['Quote 2 with ']

Odd positions, even positions, and the last item

s2.xpath('//li[position() mod 2 = 1]/text()') ➜ ['Quote 1', 'Quote 3 with ']


s2.xpath('//li[position() mod 2 = 0]/text()') ➜ ['Quote 2 with ', 'Something here.']
s2.xpath('//li[last()]/text()') ➜ ['Something here.']
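
Positional predicates count within each parent, which is easy to miss when the page has a single list. The sketch below uses a hypothetical page with two lists to contrast `//li[1]` with `(//li)[1]`:

```python
from lxml import etree

# Hypothetical page with two separate lists
html = """<html><body>
<ul><li>A1</li><li>A2</li></ul>
<ul><li>B1</li><li>B2</li></ul>
</body></html>"""
root = etree.HTML(html)

# The position is evaluated per parent: the first/last li of each ul
per_parent_first = root.xpath('//li[1]/text()')
per_parent_last = root.xpath('//li[last()]/text()')

# Parentheses build the full node set first, then index into it
overall_first = root.xpath('(//li)[1]/text()')
overall_last = root.xpath('(//li)[last()]/text()')

print(per_parent_first, per_parent_last)  # ['A1', 'B1'] ['A2', 'B2']
print(overall_first, overall_last)        # ['A1'] ['B2']
```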

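Text of the li elements that have an a child (the predicate [a] filters the li; the text returned is the li's own)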

s2.xpath('//li[a]/text()') ➜ ['Quote 2 with ', 'Quote 3 with ']

Text of the li elements that have an a or an h2 child

s2.xpath('//li[a or h2]/text()') ➜ ['Quote 2 with ', 'Quote 3 with ', 'Something here.']

Use | (union) to get the text of a and h2 in one query

s2.xpath('//a/text()|//h2/text()') ➜ ['link', 'another link', 'Quote 4 title']


sample3 = """<html>
<body>
<ul>
<li id="begin"><a href="https://scrapy.org">Scrapy</a>begin</li>
<li><a href="https://scrapinghub.com">Scrapinghub</a></li>
<li><a href="https://blog.scrapinghub.com">Scrapinghub Blog</a></li>
<li id="end"><a href="http://quotes.toscrape.com">Quotes To Scrape</a>end</li>
<li data-xxxx="end" abc="abc"><a href="http://quotes.toscrape.com">Quotes To Scrape</a>end</li>
</ul>
</body>
</html>
"""
s3 = getxpath(sample3)

Get the a elements whose href starts with https

s3.xpath('//a[starts-with(@href, "https")]/text()') ➜ ['Scrapy', 'Scrapinghub', 'Scrapinghub Blog']

Get the a whose href = https://scrapy.org

s3.xpath('//li/a[@href="https://scrapy.org"]/text()') ➜ ['Scrapy']

Get the li whose id = begin

s3.xpath('//li[@id="begin"]/text()') ➜ ['begin']

Get the a whose text = Scrapinghub

s3.xpath('//li/a[text()="Scrapinghub"]/text()') ➜ ['Scrapinghub']

Get an element by the value of an arbitrary attribute (here data-xxxx and abc)

s3.xpath('//li[@data-xxxx="end"]/text()') ➜ ['end']


s3.xpath('//li[@abc="abc"]/text()') ➜ ['end']
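
When the attribute value comes from a variable, lxml can bind it as an XPath variable instead of formatting it into the expression string, which avoids quoting problems. A minimal sketch; the helper `li_text_by_id` and the variable name `$wanted` are illustrative, not part of the original notes:

```python
from lxml import etree

# Reduced, hypothetical copy of the sample list
html = """<ul>
<li id="begin"><a href="https://scrapy.org">Scrapy</a>begin</li>
<li id="end"><a href="http://quotes.toscrape.com">Quotes To Scrape</a>end</li>
</ul>"""
root = etree.HTML(html)

def li_text_by_id(tree, wanted_id):
    # $wanted is an XPath variable; lxml fills it from the keyword argument
    return tree.xpath('//li[@id=$wanted]/text()', wanted=wanted_id)

begin_text = li_text_by_id(root, 'begin')
end_text = li_text_by_id(root, 'end')
print(begin_text, end_text)
# ['begin'] ['end']
```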
sample4 = u"""
<html>
<head>
<title>My page</title>
</head>
<body>
<h2>Welcome to my <a href="#" src="x">page</a></h2>
<p>This is the first paragraph.</p>
<p class="test">
编程语言<a href="#">python</a>
<img src="#" alt="test"/>javascript
<a href="#"><strong>C#</strong>JAVA</a>
</p>
<p class="content-a">a</p>
<p class="content-b">b</p>
<p class="content-c">c</p>
<p class="content-d">d</p>
<p class="econtent-e">e</p>
<!-- this is the end -->
</body>
</html>
"""
s4 = etree.HTML(sample4)

Get the text nodes directly inside the element with class = test

s4.xpath('//p[@class="test"]/text()')
➜ ['\n 编程语言', '\n ', 'javascript\n ', '\n ']

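Use string() to get the full text of the element, descendants included; strip() removes leading and trailing characters (whitespace by default)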

print(s4.xpath('string(//p[@class="test"])').strip())

编程语言python
javascript
C#JAVA
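
For comparison, the sketch below runs `text()`, `string(.)`, `normalize-space(.)` and lxml's `itertext()` against the same element. The markup is a simplified, hypothetical stand-in for the sample paragraph:

```python
from lxml import etree

# Simplified stand-in for the class="test" paragraph
html = '<p class="test">\n  Languages: <a href="#">python</a>\n  <strong>C#</strong> JAVA\n</p>'
root = etree.HTML(html)
p = root.xpath('//p[@class="test"]')[0]

direct = p.xpath('text()')                 # only text nodes that are direct children of p
full = p.xpath('string(.)')                # text of p and all descendants, concatenated
collapsed = p.xpath('normalize-space(.)')  # same text with whitespace runs collapsed
joined = ''.join(p.itertext())             # lxml-side equivalent of string(.)

print(len(direct))   # 3
print(collapsed)     # Languages: python C# JAVA
print(joined == full)  # True
```

`normalize-space(.)` is often the most convenient of the four when the text is going to be stored or compared, since it removes the line breaks that come from source indentation.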

Get the p elements whose class attribute starts with content

s4.xpath('//p[starts-with(@class,"content")]/text()') ➜ ['a', 'b', 'c', 'd']

Get all elements whose class attribute contains content (a substring match, so econtent-e matches too)

s4.xpath('//*[contains(@class,"content")]/text()') ➜ ['a', 'b', 'c', 'd', 'e']
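
Because `contains()` is a plain substring test, it also matches classes such as econtent-e. A common XPath 1.0 idiom for matching one whole class name is to pad the attribute with spaces before testing. A sketch on hypothetical markup:

```python
from lxml import etree

# Hypothetical markup: the second <p> carries two class names
html = """<div>
<p class="content-a">a</p>
<p class="note content">b</p>
<p class="econtent-e">e</p>
</div>"""
root = etree.HTML(html)

# Substring match: any class attribute containing the letters "content"
substring = root.xpath('//p[contains(@class, "content")]/text()')

# Token match: the class list must contain the exact name "content"
token = root.xpath(
    '//p[contains(concat(" ", normalize-space(@class), " "), " content ")]/text()'
)

print(substring)  # ['a', 'b', 'e']
print(token)      # ['b']
```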
