Using XPath in Python (in detail)

阿新 • Published: 2020-10-07

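> Install the `lxml` package before you start.

## Getting started

As with BeautifulSoup, the first step is to obtain a document tree.

- Convert a string into a document tree object

```
from lxml import etree

if __name__ == '__main__':
    doc = '''
        <div>
            <ul>
                 <li class="item-0"><a href="link1.html">first item</a></li>
                 <li class="item-1"><a href="link2.html">second item</a></li>
                 <li class="item-inactive"><a href="link3.html">third item</a></li>
                 <li class="item-1"><a href="link4.html">fourth item</a></li>
                 <li class="item-0"><a href="link5.html">fifth item</a> # Note: this line is missing a </li> closing tag
             </ul>
         </div>
        '''
    # etree.HTML() uses the lenient HTML parser and wraps the fragment in <html><body>
    html = etree.HTML(doc)
    result = etree.tostring(html)
    print(str(result, 'utf-8'))
```

- Convert a file into a document tree object

```
from lxml import etree

# Read the external file index.html
html = etree.parse('./index.html')
# pretty_print=True formats the output
result = etree.tostring(html, pretty_print=True)
print(result)
```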
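`etree.parse()` uses lxml's XML parser unless told otherwise, so a page that is not well-formed XML (for example one with an unclosed `<li>`) may fail to load from a file even though `etree.HTML()` accepts the same markup as a string. The sketch below shows one way around that by passing an `etree.HTMLParser()`. The sample markup, the temporary file, and the helper name `parse_html_file` are assumptions made up for illustration.

```python
import os
import tempfile

from lxml import etree

# Hypothetical sample markup: the <li> is left unclosed on purpose.
SAMPLE = '<div><ul><li class="item-0"><a href="link1.html">first item</a></ul></div>'


def parse_html_file(path):
    """Parse a file with the lenient HTML parser instead of the XML default."""
    return etree.parse(path, etree.HTMLParser())


def demo():
    # Write the sample to a throwaway file so the example is self-contained.
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, 'index.html')
        with open(path, 'w', encoding='utf-8') as f:
            f.write(SAMPLE)
        tree = parse_html_file(path)
        return tree.xpath('//li/a/text()')


if __name__ == '__main__':
    print(demo())
```

If the file is known to be well-formed XML, the plain `etree.parse(path)` call is enough.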

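> Both the string version and the file version above print the document content.

## Nodes: elements, attributes, and content

**The idea behind XPath is to locate nodes through path expressions. Nodes include `elements`, `attributes`, and `content`.**

- Element examples

```
html ---> <html> ... </html>
div ---> <div> ... </div>
a ---> <a> ... </a>
```

As you can see, an `element` here means the same thing as a `tag` in HTML. A bare element name only matches direct children of the current node, so it is rarely useful on its own and is normally combined with the path operators below.

## Path expressions

```xpath
/     root node; also the separator between steps
//    any position in the document
.     current node
..    parent node
@     attribute
```

## Wildcards

```xpath
*        any element
@*       any attribute
node()   any child node (element or text; attributes are not children, use @* for them)
```

## Predicates

**Square brackets restrict the matched elements; this is called a predicate.**

```xpath
//a[n]                       n is an integer greater than zero: the a that is the n-th a child of its parent
//a[last()]                  the a that is the last a child of its parent
//a[last()-1]                likewise, the second to last
//a[position()<3]            position less than 3, i.e. the first two; note that XPath positions start at 1
//a[@href]                   a elements that have an href attribute
//a[@href='www.baidu.com']   a elements whose href equals 'www.baidu.com'
//book[@price>2]             book elements whose price is greater than 2
```

## Multiple paths

Joining two expressions with `|` performs an "or" match: the result contains the nodes matched by either expression.

```
//book/title | //book/price
```

## Functions

**XPath has many built-in functions. For more, see [https://www.w3school.com.cn/xpath/xpath_functions.asp](https://www.w3school.com.cn/xpath/xpath_functions.asp)**

- contains(string1,string2)
- starts-with(string1,string2)
- ends-with(string1,string2) # not supported in lxml
- upper-case(string) # not supported in lxml
- text()
- last()
- position()
- node()

**Note that last() is also a function; we already met it in the predicates above.**

## Examples

### Locating elements

**A match on several elements returns a list**

```python
from lxml import etree

if __name__ == '__main__':
    doc = '''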

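        <div>
            <ul>
                 <li class="item-0"><a href="link1.html">first item</a></li>
                 <li class="item-1"><a href="link2.html">second item</a></li>
                 <li class="item-inactive"><a href="link3.html">third item</a></li>
                 <li class="item-1"><a href="link4.html">fourth item</a></li>
                 <li class="item-0"><a href="link5.html">fifth item</a> # Note: this line is missing a </li> closing tag
             </ul>
         </div>
        '''
    html = etree.HTML(doc)
    print(html.xpath("//li"))
    print(html.xpath("//p"))
```

**Result**

```
[<Element li at 0x...>, <Element li at 0x...>, <Element li at 0x...>, <Element li at 0x...>, <Element li at 0x...>]
[]  # no p element found
```

```
html = etree.HTML(doc)
print(etree.tostring(html.xpath("//li[@class='item-inactive']")[0]))
print(html.xpath("//li[@class='item-inactive']")[0].text)
print(html.xpath("//li[@class='item-inactive']/a")[0].text)
print(html.xpath("//li[@class='item-inactive']/a/text()"))
print(html.xpath("//li[@class='item-inactive']/.."))
print(html.xpath("//li[@class='item-inactive']/../li[@class='item-0']"))
```

**Result**

```
b'<li class="item-inactive"><a href="link3.html">third item</a></li>\n                 '
None  # the third li has no text directly inside it, so .text is None
third item
['third item']
[<Element ul at 0x...>]
[<Element li at 0x...>, <Element li at 0x...>]
```

### Using functions

#### contains

Sometimes `@class='....'` is a poor selector because it is an exact match: when the page's styling changes, a class such as `active` may be added to or removed from the attribute. `contains` handles this case conveniently.

```
from lxml import etree

if __name__ == '__main__':
    doc = '''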

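        <div>
            <ul>
                 <p class="item-0 active"><a href="link1.html">first item</a></p>
                 <li class="item-1"><a href="link2.html">second item</a></li>
                 <li class="item-inactive"><a href="link3.html">third item</a></li>
                 <li class="item-1"><a href="link4.html">fourth item</a></li>
                 <li class="item-0"><a href="link5.html">fifth item</a> # Note: this line is missing a </li> closing tag
             </ul>
         </div>
        '''
    # Assumption: the tags of this sample were lost; the <p> element and its class
    # are chosen to be consistent with the five results printed below.
    html = etree.HTML(doc)
    print(html.xpath("//*[contains(@class,'item')]"))
```

**Result**

```
[<Element p at 0x...>, <Element li at 0x...>, <Element li at 0x...>, <Element li at 0x...>, <Element li at 0x...>]
```

#### starts-with

```
from lxml import etree

if __name__ == '__main__':
    doc = '''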

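        <div>
            <ul class="ul_items">
                 <p class="item-0 active"><a href="link1.html">first item</a></p>
                 <li class="item-1"><a href="link2.html">second item</a></li>
                 <li class="item-inactive"><a href="link3.html">third item</a></li>
                 <li class="item-1"><a href="link4.html">fourth item</a></li>
                 <li class="item-0"><a href="link5.html">fifth item</a> # Note: this line is missing a </li> closing tag
             </ul>
         </div>
        '''
    # Assumption: the tags of this sample were lost; the ul class name and the <p>
    # element are chosen to be consistent with the results printed below.
    html = etree.HTML(doc)
    print(html.xpath("//*[contains(@class,'item')]"))
    print(html.xpath("//*[starts-with(@class,'ul')]"))
```

**Result**

```
[<Element ul at 0x...>, <Element p at 0x...>, <Element li at 0x...>, <Element li at 0x...>, <Element li at 0x...>, <Element li at 0x...>]
[<Element ul at 0x...>]
```

#### ends-with

```
print(html.xpath("//*[ends-with(@class,'ul')]"))
```

**Result**

```
Traceback (most recent call last):
  File "F:/OneDrive/pprojects/shoes-show-spider/test/xp5_test.py", line 18, in <module>
    print(html.xpath("//*[ends-with(@class,'ul')]"))
  File "src\lxml\etree.pyx", line 1582, in lxml.etree._Element.xpath
  File "src\lxml\xpath.pxi", line 305, in lxml.etree.XPathElementEvaluator.__call__
  File "src\lxml\xpath.pxi", line 225, in lxml.etree._XPathEvaluatorBase._handle_result
lxml.etree.XPathEvalError: Unregistered function
```

**So Python's lxml does not support every function in that XPath function list.**

#### upper-case

**Like ends-with, this function is not supported and raises the same error, `lxml.etree.XPathEvalError: Unregistered function`.**

```
print(html.xpath("//a[contains(upper-case(@class),'ITEM-INACTIVE')]"))
```

#### text, last

```
# Only the last li is selected
print(html.xpath("//li[last()]/a/text()"))
# Returns the content of every <a> inside an li, because each <a> is the last
# <a> child of its own parent. Every li has only one child, so each one is the last.
print(html.xpath("//li/a[last()]/text()"))
print(html.xpath("//li/a[contains(text(),'third')]"))
```

**Result**

```
['fifth item']
['second item', 'third item', 'fourth item', 'fifth item']
[<Element a at 0x...>]
```

#### position

```
print(html.xpath("//li[position()=2]/a/text()"))
# Result: ['third item'] (the first entry in this sample is not an li, so the second li holds the third item)
```

**We already covered this form in the predicates section.**

**One question remains: can `position()` be used as a path step the way `text()` can?**

```
print(html.xpath("//li[last()]/a/position()"))
# Result: lxml.etree.XPathEvalError: Unregistered function
```

*The takeaway is that a function does not work just anywhere in an expression. `text()` and `node()` are node tests and can be path steps, while `position()` and `last()` return numbers and belong inside predicates.*

#### node

**Returns all child nodes regardless of type (element or text). Attributes are not child nodes, so they are not included.**

```
print(html.xpath("//ul/li[@class='item-inactive']/node()"))
print(html.xpath("//ul/node()"))
```

**Result**

```
[<Element a at 0x...>]
['\n                 ', <Element p at 0x...>, '\n                 ', <Element li at 0x...>, '\n                 ', <Element li at 0x...>, '\n                 ', <Element li at 0x...>, '\n                 ', <Element li at 0x...>, ' closing tag\n             ']
```

### Getting content

**As mentioned above, an element's content can be read with either `.text` or `text()`.**

```
from lxml import etree

if __name__ == '__main__':
    doc = '''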

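        <div>
            <ul>
                 <li class="item-0 active"><a href="link1.html">first item</a></li>
                 <li class="item-1"><a href="link2.html">second item</a></li>
                 <li class="item-inactive"><a href="link3.html">third item</a></li>
                 <li class="item-1"><a href="link4.html">fourth item</a></li>
                 <li class="item-0"><a href="link5.html">fifth item</a> # Note: this line is missing a </li> closing tag
             </ul>
         </div>
        '''
    html = etree.XML(doc)
    print(html.xpath("//a/text()"))
    print(html.xpath("//a")[0].text)
    print(html.xpath("//ul")[0].text)
    print(len(html.xpath("//ul")[0].text))
    print(html.xpath("//ul/text()"))
```

**Result**

```
['first item', 'second item', 'third item', 'fourth item', 'fifth item']
first item


18
['\n                 ', '\n                 ', '\n                 ', '\n                 ', '\n                 ', ' closing tag\n             ']
```

**This shows the difference between `text()` and `.text`: `.text` holds only the text that comes before the element's first child (here a newline plus indentation, 18 characters), while `text()` returns every text node that is a direct child of the element.**

### Getting attributes

```
print(html.xpath("//a/@href"))
print(html.xpath("//li/@class"))
```

**Result**

```
['link1.html', 'link2.html', 'link3.html', 'link4.html', 'link5.html']
['item-0 active', 'item-1', 'item-inactive', 'item-1', 'item-0']
```

## Custom functions

While using functions we found that some are supported and some are not. So which ones are supported? The lxml site has the answer: [https://lxml.de/xpathxslt.html](https://lxml.de/xpathxslt.html). lxml supports XPath 1.0, and it provides XSLT 1.0 and the EXSLT extensions through libxml2 and libxslt in a standards-compliant way. See the [official XPath 1.0 specification](https://www.w3.org/TR/1999/REC-xpath-19991116/) and the documents for other XPath versions at [https://www.w3.org/TR/xpath/](https://www.w3.org/TR/xpath/).

```
lxml supports XPath 1.0, XSLT 1.0 and the EXSLT extensions through libxml2 and libxslt in a standards compliant way.
```

In addition, lxml lets you register custom functions to extend its XPath support: [https://lxml.de/extensions.html](https://lxml.de/extensions.html)

```
from lxml import etree

# Define the function
def ends_with(context, s1, s2):
    return s1[0].endswith(s2)

if __name__ == '__main__':
    doc = '''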

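        <div>
            <ul>
                 <li class="item-0 active"><a href="link1.html">first item</a></li>
                 <li class="item-1"><a href="link2.html">second item</a></li>
                 <li class="item-inactive"><a href="link3.html">third item</a></li>
                 <li class="item-1"><a href="link4.html">fourth item</a></li>
                 <li class="item-0"><a href="link5.html">fifth item</a> # Note: this line is missing a </li> closing tag
             </ul>
         </div>
        '''
    html = etree.XML(doc)
    ns = etree.FunctionNamespace(None)
    ns['ends-with'] = ends_with  # register ends_with in the function namespace
    print(html.xpath("//li[ends-with(@class,'active')]"))
    print(html.xpath("//li[ends-with(@class,'active')]/a/text()"))
```

**Result**

```
[<Element li at 0x...>, <Element li at 0x...>]
['first item', 'third item']
```

- **Parameter `s1` receives the first XPath argument, `@class`. Note that it arrives as a list.**
- **Parameter `s2` receives the second XPath argument, `'active'`, which is a string.**

Example from the official site: [https://lxml.de/extensions.html](https://lxml.de/extensions.html)

```
def hello(context, a):
    return "Hello %s" % a

from lxml import etree
ns = etree.FunctionNamespace(None)
ns['hello'] = hello

root = etree.XML('<a><b>Haegar</b></a>')
print(root.xpath("hello('Dr. Falken')"))
# Result: Hello Dr. Falken
```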
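Registering a custom function is one option. As a complement, the two functions reported as unsupported earlier (`ends-with` and `upper-case`) can usually be emulated with plain XPath 1.0 built-ins: `substring()` with `string-length()` for the suffix test, and `translate()` for ASCII upper-casing. The following is a minimal sketch under assumptions, not the article's method: the sample markup and the helper names `ends_with_xpath`, `find_by_class_suffix`, and `find_by_upper_class` are made up for illustration.

```python
from lxml import etree

# Hypothetical sample, modelled on the list markup used in this article.
DOC = '''
<ul>
    <li class="item-0 active"><a href="link1.html">first item</a></li>
    <li class="item-1"><a href="link2.html">second item</a></li>
    <li class="item-inactive"><a href="link3.html">third item</a></li>
</ul>
'''

LOWER = 'abcdefghijklmnopqrstuvwxyz'
UPPER = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'


def ends_with_xpath(attr, suffix):
    """Build an XPath 1.0 predicate body that behaves like ends-with(@attr, suffix).

    The suffix is pasted into the expression, so it must not contain a single quote.
    """
    return ("substring(@{a}, string-length(@{a}) - string-length('{s}') + 1) = '{s}'"
            .format(a=attr, s=suffix))


def find_by_class_suffix(root, suffix):
    """Return the link text of every li whose class attribute ends with suffix."""
    return root.xpath("//li[{}]/a/text()".format(ends_with_xpath('class', suffix)))


def find_by_upper_class(root, value):
    """Return the link text of every li whose upper-cased class contains value."""
    # translate() maps each lowercase ASCII letter to its uppercase counterpart;
    # the $name placeholders are XPath variables filled from keyword arguments.
    return root.xpath(
        "//li[contains(translate(@class, $lower, $upper), $value)]/a/text()",
        lower=LOWER, upper=UPPER, value=value)


if __name__ == '__main__':
    root = etree.XML(DOC.strip())
    print(find_by_class_suffix(root, 'active'))
    print(find_by_upper_class(root, 'ITEM-INACTIVE'))
```

The `translate()` approach only covers the letters listed in `LOWER` and `UPPER`, so a registered Python function remains the more general choice for non-ASCII text.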
