python爬蟲之Beautiful Soup基礎知識+例項

阿新 • • 發佈：2020-08-12

#python爬蟲之Beautiful Soup基礎知識 >Beautiful Soup是一個可以從HTML或XML檔案中提取資料的python庫。它能通過你喜歡的轉換器實現慣用的文件導航，查詢，修改文件的方式。需要注意的是，Beautiful Soup已經自動將輸入文件轉換為Unicode編碼，輸出文件轉換為utf-8編碼。因此在使用它的時候不需要考慮編碼方式，僅僅需要說明一下原始編碼方式就可以了。 ## 一、安裝Beautiful Soup庫使用pip命令工具安裝Beautiful Soup4庫 ```python pip install beautifulsoup4 ``` ##二、BeautifulSoup庫的主要解析器 |解析器|使用方法|條件| |---|---|---| |bs4的html解析器|BeautifulSoup(markup, 'html.parser')|安裝bs4庫| |lxml的html解析器|BeautifulSoup(markup, 'lxml')|pip install lxml| |lxml的lxml解析器|BeautifulSoup(markup, 'lxml')|pip install lxml| |html5lib的解析器|BeautifulSoup(markup, 'html5lib')|pip install html5lib| 具體操作： ```python html = 'https://www.baidu.com' bs = BeautifulSoup(html, 'html.parser') ``` ##三、BeautifulSoup的簡單使用提取百度搜索頁面的部分原始碼為例： ```html

百度一下，你就知道新聞 hao123 地圖視訊貼吧更多產品 ``` 綜合requests和使用BeautifulSoup庫的html解析器,對其進行解析如下： ```python import requests from bs4 import BeautifulSoup # 使用requests庫載入頁面程式碼 r = requests.get('https://www.baidu.com') r.encoding = r.apparent_encoding html = r.text bs = BeautifulSoup(html, 'html.parser') print(bs.prettify()) # prettify 方式輸出頁面 ``` 結果如下： ```

百度一下，你就知道

新聞 hao123 地圖視訊貼吧

更多產品

關於百度 About Baidu

``` ##四、BeautifulSoup類的基本元素 BeautifulSoup將複製的HTML文件轉換成一個複雜的樹型結構，每個節點都是python物件，所有物件可以歸納為四種**Tag，NavigableString，Comment，Beautifulsoup**。 |基本元素|說明| |---|---| |Tag|標籤，最基本的資訊組織單元，分別用<>和標明開頭和結尾，格式：bs.a或者bs.p（獲取a標籤中或者p標籤中的內容）。| |Name|標籤的名字，格式為.name.| |Attributes|標籤的屬性，字典形式，格式：.attrs.| |NavigableString|標籤內非屬性字串，<>...中的字串,格式：.string.| |Comment|標籤內的註釋部分，一種特殊的Comment型別。| ### Tag 任何存在於HTML語法中的標籤都可以bs.tag訪問獲得，如果在HTML文件中存在多個相同的tag對應的內容時，bs.tag返回第一個。示例程式碼如下： ```python import requests from bs4 import BeautifulSoup # 使用requests庫載入頁面程式碼 r = requests.get('https://www.baidu.com') r.encoding = r.apparent_encoding html = r.text bs = BeautifulSoup(html, 'html.parser') # 獲取第一個a標籤的所有內容 print(bs.a) # 新聞 print(type(bs.a)) # ``` 在Tag標籤中最重要的就是html頁面中的nam和attrs屬性，使用方法如下： ```python print(bs.a.name) # a # 把a標籤的所有屬性列印輸出出來，返回一個字典型別 print(bs.a.attrs) # {'href': 'http://news.baidu.com', 'name': 'tj_trnews', 'class': ['mnav']} # 等價 bs.a.get('class') print(bs.a['class']) # ['mnav'] bs.a['class'] = 'newClass' # 對class屬性的值進行修改 print(bs.a) # 新聞 del bs.a['class'] # 刪除class屬性 print(bs.a) # 新聞 ``` ### NavigableString NavigableString中的string方法用於獲取標籤內部的文字，程式碼如下： ```python import requests from bs4 import BeautifulSoup # 使用requests庫載入頁面程式碼 r = requests.get('https://www.baidu.com') r.encoding = r.apparent_encoding html = r.text bs = BeautifulSoup(html, 'html.parser') print(bs.title.string) # 百度一下，你就知道 print(type(bs.title.string)) # ``` ###Comment Comment物件是一個特殊型別的NavigableString物件，其輸出的內容不包括註釋符號，用於輸出註釋的內容。 ```python from bs4 import BeautifulSoup html = """""" bs = BeautifulSoup(html, 'html.parser') print(bs.a.string) # 新聞 print(type(bs.a.string)) # ``` ### BeautifulSoup bs物件表示的是一個文件的全部內容，大部分時候，可以把它當作Tag物件，支援遍歷文件樹和搜尋文件中描述的大部分方法。因為Beautifulsoup物件並不是真正的HTML或者XML的tag，所以它沒有name和attribute屬性。所以BeautifulSoup物件一般包含值為"[document]"的特殊屬性.name ```python print(bs.name) # [document] ``` ## 五、基於bs4庫的HTML內容的遍歷方法在HTML中有如下特定的基本格式，也是構成HTML頁面的基本組成成分。 ![](https://img2020.cnblogs.com/blog/1878490/202008/1878490-20200812100550269-1088624377.png) 而在這種基本的格式下有三種基本的遍歷流程 * 下行遍歷 * 上行遍歷 * 平行遍歷三種遍歷方式分別是從當前節點出發，對之上、之下、平行的格式以及關係進行遍歷。 ### 下行遍歷下行遍歷分別有三種遍歷屬性，如下所示： |屬性|說明| |---|---| |.contents|子節點的列表，將所有兒子節點存入列表。| |.children|子節點的迭代型別，用於迴圈遍歷兒子節點。| |.descendants|子孫節點的迭代型別，包涵所有子孫節點，用於迴圈遍歷。| 程式碼如下： ```python import requests from bs4 import BeautifulSoup # 使用requests庫載入頁面程式碼 r = requests.get('https://www.baidu.com') r.encoding = r.apparent_encoding html = r.text bs = BeautifulSoup(html, 'html.parser') # 迴圈遍歷兒子節點 for child in bs.body.children: print(child) # 迴圈遍歷子孫節點 for child in bs.body.descendants: print(child) # 輸出子節點，以列表的形式 print(bs.head.contents) print(bs.head.contents[0]) # 用列表索引來獲取它的某一個元素 ``` ### 上行遍歷上行遍歷有兩種方式，如下所示： |屬性|說明| |---|---| |.parent|節點的父親標籤。| |.parents|節點先輩標籤的迭代型別，用於迴圈遍歷先輩節點，返回一個生成器。| 程式碼如下： ```python import requests from bs4 import BeautifulSoup # 使用requests庫載入頁面程式碼 r = requests.get('https://www.baidu.com') r.encoding = r.apparent_encoding html = r.text bs = BeautifulSoup(html, 'html.parser') for parent in bs.a.parents: if parent is not None: print(parent.name) print(bs.a.parent.name) ``` ### 平行遍歷平行遍歷有四種屬性，如下所示： |屬性|說明| |---|---| |.next_sibling|返回按照HTML文字順序的下一個平行節點標籤。| |.previous_sibling|返回按照HTML文字順序的上一個平行節點標籤。| |.next_siblings|迭代型別，返回按照HTML文字順序的所有後續平行節點標籤。| |.previous_siblings|迭代型別，返回按照HTML文字順序的前序所有平行節點標籤。| ![](https://img2020.cnblogs.com/blog/1878490/202008/1878490-20200812100646403-2008381432.png) 程式碼如下： ```python import requests from bs4 import BeautifulSoup # 使用requests庫載入頁面程式碼 r = requests.get('https://www.baidu.com') r.encoding = r.apparent_encoding html = r.text bs = BeautifulSoup(html, 'html.parser') for sibling in bs.a.next_siblings: print(sibling) for sibling in bs.a.previous_siblings: print(sibling) ``` ### 其它遍歷 |屬性|說明| |---|---| |.strings|如果Tag包含多個字串，即在子孫節點中有內容，可以用此獲取，然後進行遍歷。| |.stripped_strings|與strings用法一致，可以去除掉那些多餘的空白內容。| |.has_attr|判斷Tag是否包含屬性。| ## 六、檔案樹搜尋使用bs.find_all(name, attires, recursive, string, **kwargs)方法，用於返回一個列表型別，儲存查詢的結果。 |屬性|說明| |---|---| |name|對標籤的名稱的檢索字串。| |attrs|對標籤屬性值的檢索字串，可標註屬性檢索。| |recursive|是否對子孫全部檢索，預設為True。| |string|用與在資訊文字中特定字串的檢索。| ### name引數如果是**指定的字串**：會查詢與字串完全匹配的內容，程式碼如下： ```python a_list = bs.find_all("a") print(a_list) ``` **使用正則表示式**：將會使用BeautifulSoup4中的search()方法來匹配，程式碼如下： ```python import requests from bs4 import BeautifulSoup import re # 使用requests庫載入頁面程式碼 r = requests.get('https://www.baidu.com') r.encoding = r.apparent_encoding html = r.text bs = BeautifulSoup(html, 'html.parser') t_list = bs.find_all(re.compile("p")) for item in t_list: print(item) ``` **傳入一個列表**：Beautifulsoup4將會與列表中的任一元素匹配到的節點返回，程式碼如下： ```python import requests from bs4 import BeautifulSoup # 使用requests庫載入頁面程式碼 r = requests.get('https://www.baidu.com') r.encoding = r.apparent_encoding html = r.text bs = BeautifulSoup(html, 'html.parser') t_list = bs.find_all(["meta", "link"]) for item in t_list: print(item) ``` 傳入一個函式或方法：將會根據函式或者方法來匹配，程式碼如下： ```python import requests from bs4 import BeautifulSoup # 使用requests庫載入頁面程式碼 r = requests.get('https://www.baidu.com') r.encoding = r.apparent_encoding html = r.text bs = BeautifulSoup(html, 'html.parser') def name_is_exists(tag): return tag.has_attr("name") t_list = bs.find_all(name_is_exists) for item in t_list: print(item) ``` ### attrs引數並不是所有的屬性都可以使用上面這種方法進行搜尋，比如HTML的data屬性，用與指定屬性搜尋。 ```python import requests from bs4 import BeautifulSoup # 使用requests庫載入頁面程式碼 r = requests.get('https://www.baidu.com') r.encoding = r.apparent_encoding html = r.text bs = BeautifulSoup(html, 'html.parser') t_list = bs.find_all(attrs={"class": "mnav"}) for item in t_list: print(item) ``` ### string引數通過string引數可以搜尋文件中的字串內容，與name引數的可選值一樣，string引數接受字串，正則表示式，列表。 ```python import requests from bs4 import BeautifulSoup import re # 使用requests庫載入頁面程式碼 r = requests.get('https://www.baidu.com') r.encoding = r.apparent_encoding html = r.text bs = BeautifulSoup(html, 'html.parser') t_list = bs.find_all(attrs={"class": "mnav"}) for item in t_list: print(item) # text用於搜尋字串 t_list = bs.find_all(text="hao123") for item in t_list: print(item) # text可以通其它引數混合使用用來過濾tag t_list = bs.find_all("a", text=["hao123", "地圖", "貼吧"]) for item in t_list: print(item) t_list = bs.find_all(text=re.compile("\d\d")) for item in t_list: print(item) ``` 使用find_all()方法，常用到的正則表示式形式**import re**程式碼如下： ```python bs.find_all(string = re.compile('python')) # 指定查詢內容 # 或者指定使用正則表示式要搜尋的內容 string = re.compile('python') # 字元為python bs.find_all(string) # 呼叫方法模版 ``` ## 七、常用的find()方法如下 |方法|說明| |---|---| |<>find()|搜尋且只返回一個結果，字串型別，同.find_all()引數。| | <>find_parent() |在先輩節點中返回一個結果，字串型別，同.find_all()引數。| |<>.find_parents()|在先輩節點中搜索，返回列表型別，同.find_all()引數。| |<>.find_next_sibling()|在後續平行節點中返回一個結果，同.find_all()引數。| |<>.find_next_siblings()|在後續平行節點中搜索，返回列表型別，同.find_all()引數。| |<>.find_previous_sibling()|在前序平行節點中返回一個結果，字串型別，同.find_all()引數。| |<>.find_previous_siblings()|在前序平行節點中搜索，返回列表型別，同.find_all()引數。| ## 八、爬取京東電腦資料爬取的例子直接輸出到螢幕。 (1)要爬取京東一頁的電腦商品資訊，下圖所示： ![](https://img2020.cnblogs.com/blog/1878490/202008/1878490-20200812100711635-544856496.png) (2)所爬取的網頁連線：https://search.jd.com/search?keyword=macbook%20pro&qrst=1&suggest=5.def.0.V09&wq=macbook%20pro (3)我們的目的是需要獲取京東這一個頁面上所有的電腦資料，包括價格，名稱，ID等。具體程式碼如下： ```python #!/usr/bin/env python # -*- coding:utf-8 -*- import requests from bs4 import BeautifulSoup headers = { 'User-agent': "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) " "Chrome/66.0.3359.139 Safari/537.36" } URL = "https://search.jd.com/search?keyword=macbook%20pro&qrst=1&suggest=5.def.0.V09&wq=macbook%20pro" r = requests.get(URL, headers=headers) r.encoding = r.apparent_encoding html = r.text bs = BeautifulSoup(html, 'html.parser') all_items = bs.find_all('li', attrs={"class": "gl-item"}) for item in all_items: computer_id = item["data-sku"] computer_name = item.find('div', attrs={'class': 'p-name p-name-type-2'}) computer_price = item.find('div', attrs={'class': 'p-price'}) print('電腦ID為：' + computer_id) print('電腦名稱為：' + computer_name.em.text) print('電腦價格為：' + computer_price.find('i').string) print('------------------------------------------------------------') ``` 部分結果如下圖所示： ![](https://img2020.cnblogs.com/blog/1878490/202008/1878490-20200812100724777-1398370549.png)

python爬蟲之Beautiful Soup基礎知識+例項

python爬蟲之Beautiful Soup基礎知識+例項

Python爬蟲之Beautiful Soup解析庫的使用（五）

Python 爬蟲利器 Beautiful Soup 4 之文件樹的搜尋

python爬蟲入門--Beautiful Soup庫介紹及例項

Python爬蟲(4):Beautiful Soup的常用方法

2017.08.11 Python網絡爬蟲實戰之Beautiful Soup爬蟲

Python爬蟲之CSS基礎知識

python下很帥氣的爬蟲包 - Beautiful Soup 示例

python-之函數基礎知識

PYTHON之計算機語言基礎知識 —— 編程語言的分類

python小白之路（基礎知識一）

Python入門基礎知識例項，

零基礎寫python爬蟲之使用Scrapy框架編寫爬蟲

python學習之-網路程式設計基礎（網路基本知識）

python爬蟲之下載檔案的方式總結以及程式例項

Python學習筆記——pycharm 爬蟲：Beautiful soup

Python 之 Beautiful Soup 4文件

Python爬蟲之五：抓取智聯招聘基礎版

一個鹹魚的Python爬蟲之路（三）：爬取網頁圖片

python大法之二-一些基礎（一）

python爬蟲之Beautiful Soup基礎知識+例項

相關推薦