[Python3 Web Scraping] Using the Beautiful Soup Library

阿新 • Published: 2018-03-28

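I previously studied regular expressions, but writing a web crawler with regular expressions alone turns out to be quite complicated. That is where Beautiful Soup comes in.

In short, Beautiful Soup is a Python library whose main purpose is extracting data from web pages.

Beautiful Soup provides simple, Pythonic functions for navigating, searching, and modifying the parse tree. It is a toolkit that parses a document and hands you the data you want to extract. Because it is simple, a complete application takes very little code.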

Installing Beautiful Soup

Install it with pip:

pip install beautifulsoup4

If pip finishes with a "Successfully installed" message, the installation succeeded.

Using Beautiful Soup

1. First, import the class from the bs4 package

from bs4 import BeautifulSoup

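2. Define the HTML content (used by the examples that follow)

The HTML snippet below is used repeatedly as an example. It is an excerpt from Alice in Wonderland (referred to below as the "Alice" document):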

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

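3. Create a BeautifulSoup object

# Create the BeautifulSoup object. The parser is named explicitly
# ("html.parser" ships with Python) to avoid a parser-guessing warning.
soup = BeautifulSoup(html, "html.parser")
"""
If the HTML is stored in a file named a.html, create the object like this:
with open("a.html") as f:
    soup = BeautifulSoup(f, "html.parser")
"""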

4. Pretty-print the document

# Pretty-print the parsed document
print(soup.prettify())

Output: the document is printed as indented HTML, with each tag and each string on its own line.
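To see what prettify() changes, here is a minimal sketch that compares it with the plain string form of the same document. The one-line HTML string is made up for illustration and is not part of the Alice document.

```python
from bs4 import BeautifulSoup

# A tiny, made-up document so the difference is easy to see.
demo_soup = BeautifulSoup('<p class="title"><b>Hi</b></p>', "html.parser")

compact = str(demo_soup)        # the markup on a single line
pretty = demo_soup.prettify()   # the same markup, indented across several lines

print(compact)
print(pretty)
```

prettify() returns an ordinary string, so it is meant for reading and debugging. The extra whitespace it adds becomes part of that string.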

5. Beautiful Soup turns a complex HTML document into a tree structure

Every node in the tree is a Python object, and all of these objects fall into four kinds:

Tag
NavigableString
BeautifulSoup
Comment

(1) Tags

A Tag corresponds to a tag in the HTML, for example:

<a></a>

<p></p>

…

These are all tags.

Here is how Beautiful Soup makes it easy to get tags:

# Get tags by name
print(soup.title)
# Output: <title>The Dormouse's story</title>
print(soup.head)
# Output: <head><title>The Dormouse's story</title></head>
print(soup.a)
# Output: <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
print(soup.p)
# Output: <p class="title"><b>The Dormouse's story</b></p>

Note that this returns only the first matching tag in the whole document, as the output for the <a> tag shows.
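When every matching tag is needed rather than just the first, find_all() is the usual tool. The sketch below goes slightly beyond what this post covers, and it uses a shortened copy of the Alice links so that it runs on its own.

```python
from bs4 import BeautifulSoup

# A shortened copy of the three links from the Alice document.
html = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
</p>
"""
soup = BeautifulSoup(html, "html.parser")

# soup.a gives only the first <a>; find_all("a") gives all of them.
links = soup.find_all("a")
print(len(links))  # 3
for link in links:
    print(link["id"], link["href"])
# link1 http://example.com/elsie
# link2 http://example.com/lacie
# link3 http://example.com/tillie
```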

We can use type() to check the type of these tags:

# Check the data type of a tag
print(type(soup.title))
# Output: <class 'bs4.element.Tag'>

A Tag also has two important attributes, name and attrs:

# Look at the two Tag attributes, name and attrs
print(soup.a.name)
# Output: a
print(soup.a.attrs)
# Output: {'href': 'http://example.com/elsie', 'class': ['sister'], 'id': 'link1'}

As the output shows, the attrs attribute of the <a> tag is a dictionary. To get a specific value from it, index it by key:

p = soup.a.attrs
print(p['class'])
# print(p.get('class')) gives the same result here
# Output: ['sister']
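A Tag can also be indexed directly, without going through attrs first. The sketch below shows this shortcut on a single link copied from the Alice document. The "title" attribute is a deliberately missing key chosen for illustration.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>',
    "html.parser",
)
tag = soup.a

# Index the tag like a dictionary to read one attribute.
print(tag["href"])       # http://example.com/elsie
# class can hold several values, so it comes back as a list.
print(tag["class"])      # ['sister']
# get() returns None for a missing attribute instead of raising KeyError.
print(tag.get("id"))     # link1
print(tag.get("title"))  # None
```

Using get() is the safer choice when an attribute may be absent, which is common on real pages.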

(2) NavigableString

Now that we can get tags, how do we get the text inside them?

# Get the text inside a tag (a NavigableString)
print(soup.a.string)
# Output: Elsie

As before, type() shows what kind of object this is:

print(type(soup.a.string))
# Output: <class 'bs4.element.NavigableString'>
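One thing worth knowing about .string is that it is None when a tag has more than one child. In that case get_text() collects all the text instead. The sketch below uses a made-up sentence, loosely based on the Alice document, to show both behaviours.

```python
from bs4 import BeautifulSoup

# Made-up markup: a <p> that contains text, a link, and more text.
html = '<p class="story">Once upon a time <a id="link1">Elsie</a> lived in a well.</p>'
soup = BeautifulSoup(html, "html.parser")

# <a> has exactly one child, so .string returns its text.
print(soup.a.string)      # Elsie
# <p> has three children, so .string cannot pick one and returns None.
print(soup.p.string)      # None
# get_text() joins every piece of text under the tag.
print(soup.p.get_text())  # Once upon a time Elsie lived in a well.
```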

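(3) BeautifulSoup

The soup object itself also has the name and attrs attributes, although their values are special:

# Look at the attributes of the BeautifulSoup object
print(soup.name)
# Output: [document]
print(soup.attrs)
# Output: {}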
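The reason soup has these attributes is that the BeautifulSoup class is a subclass of Tag, so it supports the same navigation. The sketch below checks this on a small made-up document.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# A tiny, made-up document.
soup = BeautifulSoup('<p class="title"><b>Hi</b></p>', "html.parser")

# The BeautifulSoup object represents the whole document and is itself a Tag.
print(isinstance(soup, Tag))  # True
print(soup.name)              # [document]

# So the same navigation that works on a Tag works on soup.
print(soup.b.string)          # Hi
print(soup.p["class"])        # ['title']
```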

(4) Comment

Suppose we change the first link in the HTML above as follows, turning the content of the <a></a> tag into an HTML comment:

<a href="http://example.com/elsie" class="sister" id="link1"><!--Elsie--></a>

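The commented-out content can be extracted the same way. It comes back as a Comment object:

# Get the text inside the tag
print(soup.a.string)
# Output: Elsie

Check its type:

print(type(soup.a.string))
# Output: <class 'bs4.element.Comment'>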
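Because a Comment prints exactly like ordinary text, it is easy to mistake commented-out content for visible content. A minimal sketch of one way to tell them apart with isinstance() follows. The two-link HTML string is made up for illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

# Made-up markup: the first link holds a comment, the second holds real text.
html = '<a id="link1"><!--Elsie--></a><a id="link2">Lacie</a>'
soup = BeautifulSoup(html, "html.parser")

results = []
for tag in soup.find_all("a"):
    text = tag.string
    # Comment is a subclass of NavigableString, so test for Comment explicitly.
    kind = "comment" if isinstance(text, Comment) else "text"
    results.append((tag["id"], kind, str(text)))
    print(tag["id"], kind, text)
# link1 comment Elsie
# link2 text Lacie
```

A check like this is useful when only text that is actually visible on the page should be kept.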
