Python Web Scraping for Beginners (23): Getting Started with the pyquery Parsing Library

阿新 • Published: 2019-12-20

Life is short, I use Python.

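Previous posts in this series:

Python Web Scraping for Beginners (1): Getting Started

Python Web Scraping for Beginners (2): Prerequisites (Part 1): Installing the Basic Libraries

Python Web Scraping for Beginners (3): Prerequisites (Part 2): Linux Basics

Python Web Scraping for Beginners (4): Prerequisites (Part 3): Docker Basics

Python Web Scraping for Beginners (5): Prerequisites (Part 4): Database Basics

Python Web Scraping for Beginners (6): Prerequisites (Part 5): Installing the Crawler Frameworks

Python Web Scraping for Beginners (7): HTTP Basics

Python Web Scraping for Beginners (8): Web Page Basics

Python Web Scraping for Beginners (9): Crawler Basics

Python Web Scraping for Beginners (10): Session and Cookies

Python Web Scraping for Beginners (11): Basic urllib Usage (Part 1)

Python Web Scraping for Beginners (12): Basic urllib Usage (Part 2)

Python Web Scraping for Beginners (13): Basic urllib Usage (Part 3)

Python Web Scraping for Beginners (14): Basic urllib Usage (Part 4)

Python Web Scraping for Beginners (15): Basic urllib Usage (Part 5)

Python Web Scraping for Beginners (16): urllib in Practice: Scraping Meizitu

Python Web Scraping for Beginners (17): Basic Requests Usage

Python Web Scraping for Beginners (18): Advanced Requests Usage

Python Web Scraping for Beginners (19): XPath Basics

Python Web Scraping for Beginners (20): Advanced XPath

Python Web Scraping for Beginners (21): The Beautiful Soup Parsing Library (Part 1)

Python Web Scraping for Beginners (22): The Beautiful Soup Parsing Library (Part 2)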

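Introduction

The previous post showed that Beautiful Soup supports CSS selectors, but its CSS selector support turned out to be less powerful than one might expect.

This post introduces pyquery, a library that is friendlier to CSS selectors. Its syntax stays much closer to jQuery, so it will probably be welcome news for back-end developers.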

First, as usual, the official links:

Official documentation: https://pyquery.readthedocs.io/en/latest/

PyPI: https://pypi.org/project/pyquery/

Github: https://github.com/gawel/pyquery

When in doubt, go to the official sources; that advice never fails.

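Initialization

First, make sure pyquery is installed. If it is not, look back at the earlier prerequisites posts, where I covered the installation steps.

Here is a simple initialization example (reusing the HTML from the previous post, because I am hopelessly lazy):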

from pyquery import PyQuery

html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
'''

d = PyQuery(html)
print(d('p'))

The result is:

<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>

The example above initializes PyQuery directly from a string. It also accepts a URL to initialize from:

d_url = PyQuery(url='https://www.geekdigging.com/', encoding='UTF-8')
print(d_url('title'))

The result is:

<title>極客挖掘機</title>

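Written this way, PyQuery first requests the URL and then initializes itself with the HTML from the response, which is equivalent to the following:

import requests

# Fetch the page ourselves, then hand the decoded HTML to PyQuery
r = requests.get('https://www.geekdigging.com/')
r.encoding = 'UTF-8'
d_requests = PyQuery(r.text)
print(d_requests('title'))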

CSS Selectors

Let's first get a quick feel for CSS selectors; they really are simple and convenient:

d_css = PyQuery(html)
print(d_css('.story .sister'))
print(type(d_css('.story .sister')))

The result is:

<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.
<class 'pyquery.pyquery.PyQuery'>

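This selector first finds the nodes whose class is story, and then looks among their descendants (not only their direct children) for nodes whose class is sister.

The printed type is still pyquery.pyquery.PyQuery, which means the result can itself be queried further.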

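Finding Nodes

Next, the commonly used query functions. The best thing about them is that they work the same way as their jQuery counterparts.

find(): finds all descendant nodes that match the selector.
children(): finds direct child nodes only.
parent(): finds the parent node.
parents(): finds ancestor nodes.
siblings(): finds sibling nodes.

A few simple examples: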

# Find descendant nodes
items = d('body')
print('Descendant nodes:', items.find('p'))
print(type(items.find('p')))

# Find the parent node
items = d('#link1')
print('Parent node:', items.parent())
print(type(items.parent()))

# Find sibling nodes
items = d('#link1')
print('Sibling nodes:', items.siblings())
print(type(items.siblings()))

The result is:

Descendant nodes: <p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>

<class 'pyquery.pyquery.PyQuery'>
Parent node: <p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>


<class 'pyquery.pyquery.PyQuery'>
Sibling nodes: <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.
<class 'pyquery.pyquery.PyQuery'>

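Traversal

As the examples above show, when pyquery matches several nodes the result is still a single PyQuery object. Unlike Beautiful Soup, it does not hand back a plain list of node objects that can each be queried further, so to keep working with the individual nodes we need to traverse the result. The items() method returns the matched nodes for traversal: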

a = d('a')
for item in a.items():
    print(item)

The result is:

<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,

<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and

<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.

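Calling items() returns a generator. Iterating over it yields the a nodes one by one, and each of them is also a PyQuery object. Each a node can therefore use the methods described earlier, for example to find its child nodes or to look up a particular ancestor, which is very flexible.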

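Extracting Information

Once we have the nodes, the next step is to extract the information we need.

There are two kinds of information to extract: a node's text and a node's attributes.

Getting Text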

a_1 = d('#link1')
print(a_1.text())

The result is:

Elsie

To get the HTML inside a node, use the html() method:

a_2 = d('.story')
print(a_2.html())

The result is:

Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.

Getting Attributes

Once we have a node, attr() returns the value of one of its attributes:

attr_1 = d('#link1')
print(attr_1.attr('href'))

The result is:

http://example.com/elsie

Besides the attr() method, pyquery also provides an attr property, so the example above can also be written like this:

print(attr_1.attr.href)

The result is the same as in the example above.

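Summary

That covers all the parsers installed in the prerequisites posts. To compare them: Beautiful Soup is friendly to beginners and can be picked up without much extra background, although parsing a complex DOM still calls for some grounding in CSS selectors. If you are fluent in XPath, using lxml directly is the most convenient. And if, like me, you are comfortable with both jQuery and CSS selectors, pyquery is a very good choice.

Next, I plan to share a few simple hands-on projects, so stay tuned.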

Sample Code

All the code for this series is kept in repositories on Github and Gitee for easy access.

Sample code - Github

Sample code - Gi
