Python Crawlers for Beginners (22): The Beautiful Soup Parsing Library (Part 2)

A Xin • Published: 2019-12-19

Life is short, I use Python

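Previous posts in this series:

Python Crawlers for Beginners (1): Opening

Python Crawlers for Beginners (2): Preparation (1) Installing the Basic Libraries

Python Crawlers for Beginners (3): Preparation (2) Linux Basics

Python Crawlers for Beginners (4): Preparation (3) Docker Basics

Python Crawlers for Beginners (5): Preparation (4) Database Basics

Python Crawlers for Beginners (6): Preparation (5) Installing Crawler Frameworks

Python Crawlers for Beginners (7): HTTP Basics

Python Crawlers for Beginners (8): Web Page Basics

Python Crawlers for Beginners (9): Crawler Basics

Python Crawlers for Beginners (10): Session and Cookies

Python Crawlers for Beginners (11): Basic urllib Usage (1)

Python Crawlers for Beginners (12): Basic urllib Usage (2)

Python Crawlers for Beginners (13): Basic urllib Usage (3)

Python Crawlers for Beginners (14): Basic urllib Usage (4)

Python Crawlers for Beginners (15): Basic urllib Usage (5)

Python Crawlers for Beginners (16): urllib in Practice: Scraping Meizitu

Python Crawlers for Beginners (17): Basic Requests Usage

Python Crawlers for Beginners (18): Advanced Requests Usage

Python Crawlers for Beginners (19): Xpath Basics

Python Crawlers for Beginners (20): Xpath Advanced

Python Crawlers for Beginners (21): The Beautiful Soup Parsing Library (Part 1)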

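Introduction

The selection methods covered in the previous post all work by attribute access on the tree. That approach is very simple to use, but it becomes awkward once the DOM structure gets complicated.

So Beautiful Soup also provides search methods such as find_all() and find(). When a DOM node is hard to reach by attribute access, we can simply search for it.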

find_all()

First, the syntax:

find_all( name , attrs , recursive , string , **kwargs )

find_all() searches all tag descendants of the current tag (not only its direct children) and returns those that match the filter conditions.
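As a rough sketch of how two of these parameters behave: recursive appears in the signature above, while limit is a further find_all() parameter that the signature above does not list. The markup below is a trimmed copy of this article's sample, and html.parser is used only so the snippet runs without lxml installed; both are choices made for this illustration.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the article's sample document (an assumption for brevity).
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
</body></html>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# Default behaviour: every descendant of <html> is examined,
# so the <title> nested inside <head> is found.
deep = soup.html.find_all("title")

# recursive=False: only the direct children of <html> are examined.
# <title> sits inside <head>, so nothing matches.
shallow = soup.html.find_all("title", recursive=False)

# limit stops the search once that many matches have been collected.
first_two = soup.find_all("a", limit=2)

print(len(deep), len(shallow), [a["id"] for a in first_two])
```

On a large page, limit can save work when only the first few matches are needed.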

name

The name parameter finds every tag whose name equals the given value; text strings are ignored automatically.

from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

soup = BeautifulSoup(html_doc, 'lxml')

print(soup.find_all(name = "a"))
print(type(soup.find_all(name = "a")[0]))

The result:

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
<class 'bs4.element.Tag'>

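This time the sample HTML is given as a string, mainly so it is easier to follow; there is no need to cross-check against a screenshot any more.

In this example we call find_all() with the name parameter set to a, meaning we want every <a> node. As you can see, the result is a list of length 3 whose elements are of type bs4.element.Tag.

Because the elements are bs4.element.Tag objects, we can read their content directly with the attributes introduced in the previous post: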

for a in soup.find_all(name = "a"):
    print(a.string)

The result:

Elsie
Lacie
Tillie
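Besides a single string, name also accepts other filter types. The following is a minimal sketch using a list and a compiled regular expression; the shortened markup and the use of html.parser are assumptions made for this illustration so that it runs without lxml.

```python
import re

from bs4 import BeautifulSoup

# Shortened version of the article's sample document (an assumption).
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story"><a href="http://example.com/elsie" class="sister" id="link1">Elsie</a></p>
</body></html>"""

soup = BeautifulSoup(html_doc, "html.parser")

# A list matches a tag whose name is any of the listed names.
names_from_list = [tag.name for tag in soup.find_all(["a", "b"])]

# A compiled regular expression is searched against each tag name;
# "^b" matches names that start with the letter b.
names_from_regex = [tag.name for tag in soup.find_all(re.compile("^b"))]

print(names_from_list)   # ['b', 'a'] (document order)
print(names_from_regex)  # ['body', 'b']
```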

attrs

Besides searching by name, we can also search by attribute:

print(soup.find_all(attrs={'id': 'link1'}))
print(soup.find_all(attrs={'id': 'link2'}))
print(type(soup.find_all(attrs={'id': 'link1'})))
print(type(soup.find_all(attrs={'id': 'link2'})))

The result:

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
<class 'bs4.element.ResultSet'>
<class 'bs4.element.ResultSet'>

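Here we pass the attrs parameter, whose value is a dictionary. Note that find_all() itself returns a bs4.element.ResultSet, which behaves like a list.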

string

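This parameter matches the text of a node. The value can be a string or a compiled regular expression object. Used on its own, it returns the matching text nodes rather than tags. (Older code spells this parameter text; it has been called string since Beautiful Soup 4.4.0.)

import re

print(soup.find_all(string=re.compile('sisters')))

The result: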

['Once upon a time there were three little sisters; and their names were\n']
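The string parameter also accepts a list, and it can be combined with a tag name. The sketch below assumes Beautiful Soup 4.4.0 or later (where the parameter is named string), a trimmed copy of the sample markup, and html.parser in place of lxml so that it runs without extra dependencies.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the article's sample markup (an assumption).
html_doc = """<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>"""

soup = BeautifulSoup(html_doc, "html.parser")

# string on its own returns the text nodes that equal any item in the list.
text_nodes = soup.find_all(string=["Elsie", "Tillie"])

# Combined with a tag name, find_all() returns the tags whose .string matches.
tags = soup.find_all("a", string="Lacie")

print([str(s) for s in text_nodes])  # ['Elsie', 'Tillie']
print([t["id"] for t in tags])       # ['link2']
```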

keyword

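Any keyword argument that is not one of the built-in parameter names is treated as a filter on the tag attribute of the same name. Because class is a reserved word in Python, the keyword for it is written class_. The example below searches for the node whose id is link1 and the node whose class is title: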

print(soup.find_all(id='link1'))
print(soup.find_all(class_='title'))

The result:

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
[<p class="title"><b>The Dormouse's story</b></p>]

We can also pass several keyword arguments at once to filter on multiple attributes of a tag:

print(soup.find_all(href=re.compile("elsie"), id='link1'))

The result:

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

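Some attribute names cannot be used as keyword arguments, for example the HTML5 data-* attributes, because they are not valid Python identifiers. In that case, fall back to the attrs parameter introduced above.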
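A minimal sketch of that fallback follows. The one-line markup and the attribute name data-foo are hypothetical, chosen only for illustration, and html.parser is used so the snippet runs without lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical markup carrying an HTML5 data-* attribute.
data_soup = BeautifulSoup('<div data-foo="value">foo!</div>', "html.parser")

# Writing data_soup.find_all(data-foo="value") would be a SyntaxError,
# because a hyphen is not allowed in a Python keyword argument name.
# Passing the attribute through the attrs dictionary works instead.
matches = data_soup.find_all(attrs={"data-foo": "value"})

print(matches)  # [<div data-foo="value">foo!</div>]
```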

find()

find() is very similar to find_all(). The difference is that instead of returning every matching node, it returns only the first match. A simple example:

print(soup.find(name = "a"))
print(type(soup.find(name = "a")))

The result:

<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
<class 'bs4.element.Tag'>
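One practical difference is worth a sketch: when nothing matches, find() returns None, whereas find_all() returns an empty list. The one-line markup below is a cut-down stand-in for the article's sample, and html.parser is an assumption so that the snippet runs without lxml.

```python
from bs4 import BeautifulSoup

# Cut-down stand-in for the article's sample markup (an assumption).
soup = BeautifulSoup(
    '<p class="story"><a id="link1" href="http://example.com/elsie">Elsie</a></p>',
    "html.parser",
)

first_link = soup.find("a")      # first matching Tag
missing = soup.find("table")     # no <table> in the document, so None
empty = soup.find_all("table")   # same search with find_all(): empty list

# Guard against None before reading attributes from the result.
href = first_link["href"] if first_link is not None else None

print(missing, empty, href)
```

Checking for None before indexing avoids a TypeError when a page does not contain the expected node.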

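For the remaining search methods, see the official documentation. Here is a brief list:

find_parents() and find_parent(): search the ancestors of the current node; the former returns all matching ancestors, the latter only the nearest one.
find_next_siblings() and find_next_sibling(): the former returns all following siblings, the latter the first following sibling.
find_previous_siblings() and find_previous_sibling(): the former returns all preceding siblings, the latter the first preceding sibling.
find_all_next() and find_next(): the former returns all matching nodes after the current node, the latter the first matching node.
find_all_previous() and find_previous(): the former returns all matching nodes before the current node, the latter the first matching node.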
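The methods listed above can be sketched as follows, starting from the middle link. The whitespace-free markup is a simplified stand-in for the article's sample, and html.parser is used so the snippet runs without lxml; both are assumptions for this illustration.

```python
from bs4 import BeautifulSoup

# Simplified, whitespace-free stand-in for the article's sample (an assumption).
html_doc = (
    '<html><body><p class="story">'
    '<a id="link1">Elsie</a><a id="link2">Lacie</a><a id="link3">Tillie</a>'
    "</p></body></html>"
)
soup = BeautifulSoup(html_doc, "html.parser")

middle = soup.find(id="link2")

# Nearest ancestor that is a <p> tag.
parent = middle.find_parent("p")

# Siblings after and before the current node, filtered to <a> tags.
after = [a["id"] for a in middle.find_next_siblings("a")]
before = [a["id"] for a in middle.find_previous_siblings("a")]

# Matching nodes later and earlier in document order (not only siblings).
next_link = middle.find_next("a")
previous_links = [a["id"] for a in middle.find_all_previous("a")]

print(parent["class"], after, before, next_link["id"], previous_links)
```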

CSS

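Besides attribute-style selection and the search methods above, Beautiful Soup offers one more way to get nodes: CSS selectors.

If you are not familiar with CSS selectors, see https://www.w3school.com.cn/css/index.asp .

Using CSS selectors is very simple: call the select() method and pass in the selector. A few simple examples: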

print(soup.select('#link1'))
print(type(soup.select('#link1')[0]))
print(soup.select('.story .sister'))

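The result:

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
<class 'bs4.element.Tag'>
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

As you can see, a CSS selector query also returns a list, and its elements are again bs4.element.Tag objects, which means we can use their attributes to get the information we need.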
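As a small sketch of working with those Tag results, the snippet below pulls attribute values and text out of a select() query and uses select_one(), which returns only the first match. The trimmed markup and html.parser are assumptions for this illustration.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the article's sample markup (an assumption).
html_doc = """<html><body>
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
</body></html>"""

soup = BeautifulSoup(html_doc, "html.parser")

# Child combinator: <a class="sister"> directly inside <p class="story">.
hrefs = [a["href"] for a in soup.select("p.story > a.sister")]

# select_one() returns the first match, or None if nothing matches.
first = soup.select_one("a.sister")

# Attribute selector: href values ending in "lacie".
names = [a.get_text() for a in soup.select('a[href$="lacie"]')]

print(hrefs)
print(first["id"], names)
```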

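Summary

That wraps up this short introduction to Beautiful Soup. A few takeaways:

When choosing a parser, prefer lxml where possible; the official documentation recommends it, reportedly for its speed.
Selecting nodes by attribute access is simple but limited.
find_all() and find() conveniently cover the vast majority of tasks.
CSS selectors are recommended for readers who already know them; for picking DOM nodes, they are hard to beat for convenience.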

Sample code

All code for this series is kept in repositories on Github and Gitee for easy access.

Sample code on Github

Sample code on Gitee

References

https://beautifulsoup.readthedocs.io/zh_CN/v4.4.
