XPath Syntax Explained

The examples below use Python's lxml library to demonstrate XPath.

Installing lxml

pip install lxml

Basic XPath usage

Sample HTML

htm = """
<html>
	<div>
		<ul>
			<li class="item-0"><a href="link1.html">first item</a></li>
			<li class="item-1"><a href="link2.html">second item</a></li>
			<li class="item-inactive"><a href="link3.html">third item</a></li>
			<li class="item-1"><a href="link4.html">fourth item</a></li>
			<li class="item-0"><a href="link5.html">fifth item</a></li>
			<li class="else-1">something else</li>
			this is ul item
		</ul>
	</div>
</html>		
"""

Finding all xx elements under xxx

from lxml import etree  # an IDE may underline etree in red because it cannot resolve the name; this does not affect execution

htm = """
<html>
	<div>
		<ul>
			<li class="item-0"><a href="link1.html">first item</a></li>
			<li class="item-1"><a href="link2.html">second item</a></li>
			<li class="item-inactive"><a href="link3.html">third item</a></li>
			<li class="item-1"><a href="link4.html">fourth item</a></li>
			<li class="item-0"><a href="link5.html">fifth item</a></li>
			<li class="else-1">something else</li>
			this is ul item
		</ul>
	</div>
</html>		
"""


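selector = etree.HTML(htm)  # parse the HTML string into an element tree
all_li = selector.xpath('//div/ul/li')  # // matches at any depth in the document; this selects every li directly under div/ul
for i in all_li:
    print(i)

Output:
	<Element li at 0x1a7955a2808>  # 0x1a7955a2808 is a memory address. Each result is an element object; to see its contents, extend the path, e.g. /a/text() for the text of the a tag (demonstrated further down)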
	<Element li at 0x1a7955a27c8>
	<Element li at 0x1a7955a28c8>
	<Element li at 0x1a7955a2908>
	<Element li at 0x1a7955a2948>
	<Element li at 0x1a7955a29c8>
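
The element objects printed above can also be inspected directly instead of extending the path. The following is a minimal sketch, assuming a shortened copy of the sample HTML; the `rows` list is only an illustrative name. It reads each element's tag and attributes and serializes it back to markup with `etree.tostring`.

```python
from lxml import etree

# A shortened copy of the sample HTML used in this article (assumption: two items are enough to illustrate).
htm = """
<html>
  <div>
    <ul>
      <li class="item-0"><a href="link1.html">first item</a></li>
      <li class="else-1">something else</li>
    </ul>
  </div>
</html>
"""

selector = etree.HTML(htm)
rows = []
for li in selector.xpath('//div/ul/li'):
    # tag and attrib describe the element; tostring() turns it back into markup.
    # with_tail=False leaves out the text that follows the closing </li>.
    markup = etree.tostring(li, encoding='unicode', method='html', with_tail=False)
    rows.append((li.tag, dict(li.attrib), markup))

for row in rows:
    print(row)
```

This is usually more readable than the default `<Element li at 0x...>` representation when debugging a query.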

Finding the first xx element under xxx

from lxml import etree  # an IDE may underline etree in red because it cannot resolve the name; this does not affect execution

htm = """
<html>
	<div>
		<ul>
			<li class="item-0"><a href="link1.html">first item</a></li>
			<li class="item-1"><a href="link2.html">second item</a></li>
			<li class="item-inactive"><a href="link3.html">third item</a></li>
			<li class="item-1"><a href="link4.html">fourth item</a></li>
			<li class="item-0"><a href="link5.html">fifth item</a></li>
			<li class="else-1">something else</li>
			this is ul item
		</ul>
	</div>
</html>		
"""


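selector = etree.HTML(htm)  # parse the HTML string into an element tree
all_li = selector.xpath('//div/ul/li[1]')  # select the first li; note that XPath positions start at 1, not 0
print(all_li)

Output:
	[<Element li at 0x1d0e2612608>]

Note: xpath() always returns a list. Without a position predicate the list contains every matching element, and when nothing matches it is empty. An exception only appears when you index an empty result, for example with [0].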
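
A minimal sketch of how lxml behaves when a query matches several elements or none at all, assuming a two-item version of the sample HTML. `first_or_default` is a hypothetical helper written for this illustration, not part of lxml.

```python
from lxml import etree

# Two-item version of the sample HTML (assumption: shortened for brevity).
htm = (
    '<html><div><ul>'
    '<li class="item-0"><a href="link1.html">first item</a></li>'
    '<li class="item-1"><a href="link2.html">second item</a></li>'
    '</ul></div></html>'
)
selector = etree.HTML(htm)

all_links = selector.xpath('//li/a/text()')   # every match, in document order
missing = selector.xpath('//li[9]/a/text()')  # there is no ninth li: empty list, no exception


def first_or_default(results, default=None):
    """Hypothetical helper: return the first XPath result, or default if there is none."""
    return results[0] if results else default


print(all_links)
print(missing)
print(first_or_default(missing, 'n/a'))
```

Writing `missing[0]` here would raise IndexError, which is the usual source of exceptions in scraping code built on long paths.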

Getting the text of an xx element

from lxml import etree  # an IDE may underline etree in red because it cannot resolve the name; this does not affect execution

htm = """
<html>
	<div>
		<ul>
			<li class="item-0"><a href="link1.html">first item</a></li>
			<li class="item-1"><a href="link2.html">second item</a></li>
			<li class="item-inactive"><a href="link3.html">third item</a></li>
			<li class="item-1"><a href="link4.html">fourth item</a></li>
			<li class="item-0"><a href="link5.html">fifth item</a></li>
			<li class="else-1">something else</li>
			this is ul item
		</ul>
	</div>
</html>		
"""


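selector = etree.HTML(htm)  # parse the HTML string into an element tree
# all_li = selector.xpath('//div/ul/li[1]/a/text()')[0]  # indexing with [0] returns the first text string directly
all_li = selector.xpath('//div/ul/li[1]/a/text()')  # text() extracts the text inside the a tag
print(all_li)  # all_li[0] would give the string itself: first item

Output:
	['first item']

Tip

As long as the target element is unique in the HTML page, the path does not have to start from the root. A few simple examples:

all_li = selector.xpath('//ul/li[1]/a/text()')[0]  # omitting div works just as well
all_li = selector.xpath('//*[@class="item-inactive"]/a/text()')[0]  # find the third li's text directly by its class
all_li = selector.xpath('//a[@href="link2.html"]/text()')[0]  # find the second li's text directly by its href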

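Getting the attributes of elements under xxx

Getting a single attribute

from lxml import etree  # an IDE may underline etree in red because it cannot resolve the name; this does not affect execution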

htm = """
<html>
	<div>
		<ul>
			<li class="item-0"><a href="link1.html">first item</a></li>
			<li class="item-1"><a href="link2.html">second item</a></li>
			<li class="item-inactive"><a href="link3.html">third item</a></li>
			<li class="item-1"><a href="link4.html">fourth item</a></li>
			<li class="item-0"><a href="link5.html">fifth item</a></li>
			<li class="else-1">something else</li>
			this is ul item
		</ul>
	</div>
</html>		
"""


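selector = etree.HTML(htm)  # parse the HTML string into an element tree
all_li = selector.xpath('//li[3]/a/@href')[0]  # get the value of the href attribute
print(all_li)

Output:
	link3.html

Getting the class attribute of every li

from lxml import etree  # an IDE may underline etree in red because it cannot resolve the name; this does not affect execution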

htm = """
<html>
	<div>
		<ul>
			<li class="item-0"><a href="link1.html">first item</a></li>
			<li class="item-1"><a href="link2.html">second item</a></li>
			<li class="item-inactive"><a href="link3.html">third item</a></li>
			<li class="item-1"><a href="link4.html">fourth item</a></li>
			<li class="item-0"><a href="link5.html">fifth item</a></li>
			<li class="else-1">something else</li>
			this is ul item
		</ul>
	</div>
</html>		
"""


selector = etree.HTML(htm)  # parse the HTML string into an element tree
all_li = selector.xpath('//li/@class')  # get the class attribute of every li
print(all_li)

Output:
	['item-0', 'item-1', 'item-inactive', 'item-1', 'item-0', 'else-1']
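
Attribute values can also be read from the element objects themselves, which helps when the text and an attribute of the same element are needed together. A minimal sketch, assuming a two-item version of the sample HTML; the `links` dictionary is only an illustrative name.

```python
from lxml import etree

# Two-item version of the sample HTML (assumption: shortened for brevity).
htm = """
<html>
  <div>
    <ul>
      <li class="item-0"><a href="link1.html">first item</a></li>
      <li class="item-1"><a href="link2.html">second item</a></li>
    </ul>
  </div>
</html>
"""

selector = etree.HTML(htm)
links = {}
for a in selector.xpath('//li/a'):
    # a.text is the text inside the tag; a.get('href') reads one attribute
    # and returns None if the attribute is absent.
    links[a.text] = a.get('href')

print(links)
```

Selecting the elements once and reading both values from each keeps the text and the href paired, which two separate `text()` and `@href` queries do not guarantee when some items lack a link.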

Advanced XPath usage

Finding xxx elements whose attribute starts with xx

The same HTML is used for the demonstration:

<html>
	<div>
		<ul>
			<li class="item-0"><a href="link1.html">first item</a></li>
			<li class="item-1"><a href="link2.html">second item</a></li>
			<li class="item-inactive"><a href="link3.html">third item</a></li>
			<li class="item-1"><a href="link4.html">fourth item</a></li>
			<li class="item-0"><a href="link5.html">fifth item</a></li>
			<li class="else-1">something else</li>
			this is ul item
		</ul>
	</div>
</html>		

Using starts-with()

Example code:

from lxml import etree  # an IDE may underline etree in red because it cannot resolve the name; this does not affect execution

htm = """
<html>
	<div>
		<ul>
			<li class="item-0"><a href="link1.html">first item</a></li>
			<li class="item-1"><a href="link2.html">second item</a></li>
			<li class="item-inactive"><a href="link3.html">third item</a></li>
			<li class="item-1"><a href="link4.html">fourth item</a></li>
			<li class="item-0"><a href="link5.html">fifth item</a></li>
			<li class="else-1">something else</li>
			this is ul item
		</ul>
	</div>
</html>		
"""


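selector = etree.HTML(htm)  # parse the HTML string into an element tree
all_li = selector.xpath("//li[starts-with(@class, 'item-')]")  # select every li whose class starts with 'item-'
all_a = []
for i in all_li:
    all_a.append(i.xpath('a/text()')[0])  # run a relative xpath on each matched li to get the text of its a child

print(all_a)

Output: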
	['first item', 'second item', 'third item', 'fourth item', 'fifth item']
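
Running xpath() on an element, as in the loop above, depends on the path being relative. A minimal sketch of the difference, assuming a two-item version of the sample HTML: a path starting with `//` is still evaluated against the whole document, even when called on a single li.

```python
from lxml import etree

# Two-item version of the sample HTML (assumption: shortened for brevity).
htm = """
<html>
  <div>
    <ul>
      <li class="item-0"><a href="link1.html">first item</a></li>
      <li class="item-1"><a href="link2.html">second item</a></li>
    </ul>
  </div>
</html>
"""

selector = etree.HTML(htm)
first_li = selector.xpath("//li[starts-with(@class, 'item-')]")[0]

relative = first_li.xpath('./a/text()')     # a children of this li only
descendant = first_li.xpath('.//a/text()')  # a at any depth below this li
absolute = first_li.xpath('//a/text()')     # every a in the document, regardless of first_li

print(relative)
print(descendant)
print(absolute)
```

Inside a loop over matched elements, `./` or `.//` is therefore the safer prefix; `a/text()` as used above is equivalent to `./a/text()`.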

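It can also be written as a single expression:

from lxml import etree  # an IDE may underline etree in red because it cannot resolve the name; this does not affect execution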

htm = """
<html>
	<div>
		<ul>
			<li class="item-0"><a href="link1.html">first item</a></li>
			<li class="item-1"><a href="link2.html">second item</a></li>
			<li class="item-inactive"><a href="link3.html">third item</a></li>
			<li class="item-1"><a href="link4.html">fourth item</a></li>
			<li class="item-0"><a href="link5.html">fifth item</a></li>
			<li class="else-1">something else</li>
			this is ul item
		</ul>
	</div>
</html>		
"""


selector = etree.HTML(htm)  # parse the HTML string into an element tree
all_li = selector.xpath("//li[starts-with(@class, 'item-')]/a/text()")  # text of the a tag inside every li whose class starts with 'item-'
print(all_li)

Output:
	['first item', 'second item', 'third item', 'fourth item', 'fifth item']

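Getting all the text

Using string()

Example code:

from lxml import etree  # an IDE may underline etree in red because it cannot resolve the name; this does not affect execution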

htm = """
<html>
	<div>
		<ul>
			<li class="item-0"><a href="link1.html">first item</a></li>
			<li class="item-1"><a href="link2.html">second item</a></li>
			<li class="item-inactive"><a href="link3.html">third item</a></li>
			<li class="item-1"><a href="link4.html">fourth item</a></li>
			<li class="item-0"><a href="link5.html">fifth item</a></li>
			<li class="else-1">something else</li>
			this is ul item
		</ul>
	</div>
</html>		
"""


selector = etree.HTML(htm)  # parse the HTML string into an element tree
all_li = selector.xpath("string(//ul)")  # concatenate all the text under the first ul into one string
print(all_li)

Output:
	first item
	second item
	third item
	fourth item
	fifth item
	something else
	this is ul item
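
Because string() returns a single string that keeps the line breaks and indentation of the source HTML, some cleanup is often needed. A minimal sketch of two alternatives, assuming a shortened version of the sample HTML: the XPath function normalize-space(), and collecting the individual text nodes with `//text()`.

```python
from lxml import etree

# Shortened version of the sample HTML (assumption: two items plus the loose text).
htm = """
<html><div><ul>
    <li class="item-0"><a href="link1.html">first item</a></li>
    <li class="else-1">something else</li>
    this is ul item
</ul></div></html>
"""

selector = etree.HTML(htm)

raw = selector.xpath('string(//ul)')                 # one string, source whitespace preserved
collapsed = selector.xpath('normalize-space(//ul)')  # one string, runs of whitespace collapsed to single spaces
# One list entry per text node; whitespace-only nodes are filtered out in Python.
pieces = [t.strip() for t in selector.xpath('//ul//text()') if t.strip()]

print(repr(raw))
print(collapsed)
print(pieces)
```

The list form is convenient when each item should stay separate; normalize-space() is convenient when a single clean line is enough.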

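A small practical example

Fetch the text and link of the "豆瓣讀書" (Douban Books) entry on the Douban home page, then take one image from the home page and save it locally.

import requests
from lxml import etree  # an IDE may underline etree in red because it cannot resolve the name; this does not affect execution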

r = requests.get('https://www.douban.com/')
r.encoding = 'utf-8'
html = etree.HTML(r.text)
text = html.xpath('//*[@id="anony-nav"]/div[1]/ul/li[1]/a/@href')[0]
h1 = html.xpath('//*[@id="anony-nav"]/div[1]/ul/li[1]/a/text()')[0]
logs = html.xpath('//*[@id="anony-sns"]/div/div[3]/div/div[1]/ul/li[3]/div/a/img/@src')[0]
print(text)
print(h1)
print(logs)
log = requests.get(logs)
with open('d:/a.gif', 'wb') as file:  # wb: write in binary mode
    file.write(log.content)  # save the image

Output:
	https://book.douban.com
	豆瓣讀書
	https://img3.doubanio.com/f/shire/a1fdee122b95748d81cee426d717c05b5174fe96/pics/blank.gif