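Beautiful Soup 4.0 (bs4): Traversing the Document Tree (2)

阿新 • Published: 2019-01-30

1. Overview

When a crawler processes the documents it has fetched, one of the most common operations is walking through the document. A parsed document is organized as a tree, so this operation is called traversing the document tree. Beautiful Soup provides many ways to do this; this article covers the navigation attributes for reaching child, parent, and sibling nodes.

2. Traversing the document tree

2.1 Preparation

This article uses a WeChat official-account article as the document to traverse, so the whole page has to be fetched and parsed first. For accurate analysis, redundant elements should be stripped so that only the main body of the document remains. The following code fetches and parses the page: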

from bs4 import BeautifulSoup
import requests
# URL of the WeChat article used as the sample document
bk_url = 'https://mp.weixin.qq.com/s?src=11&timestamp=1536542507&ver=1113&signature=bx7vKGc*P*VXdS7ymgZqblP666yq31soLbvKx1ehonqJAojCMT9aUq5y8Sv8CKTWd4C87pCCSw*kts5k4KaNUmdXuZkEMoVP59DPzKxQZhZlE6UCV3f3iw5qe5XvnJxs&new=1'
# Send a browser-like User-Agent header (the value must not repeat the header name)
headers = {"User-Agent": "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)"}
r = requests.get(bk_url, headers=headers)
r.encoding = 'utf-8'  # decode the response body as UTF-8
soup = BeautifulSoup(r.text, 'html.parser')  # parse with Python's built-in HTML parser

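2.2 Getting child nodes

Getting child nodes is one of the most frequent operations when traversing the document tree. The following approaches are available.

2.2.1 Getting a child node by tag name

A node can be reached directly by using its tag name as an attribute. See the following code: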

# soup is the BeautifulSoup object built in section 2.1
body = soup.body  # the first <body> tag in the document
print(body)
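
Tag-name access can also be chained, and it only ever returns the first matching tag. The following is a minimal sketch of that behavior, assuming a small hypothetical snippet (sample_html) instead of the fetched article so that it runs offline; the variable names are illustrative only.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, not the article fetched in section 2.1
sample_html = (
    "<html><head><title>Demo</title></head>"
    "<body><p>first</p><p>second</p></body></html>"
)
sample_soup = BeautifulSoup(sample_html, "html.parser")

title_tag = sample_soup.head.title  # chain tag names to walk down the tree
first_p = sample_soup.body.p        # only the first <p> is returned
missing = sample_soup.body.table    # no <table> exists, so this is None

print(title_tag)          # <title>Demo</title>
print(first_p.get_text()) # first
print(missing)            # None
```

To collect every matching tag rather than just the first one, a search method such as find_all() is the usual choice.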

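2.2.2 .contents and .children

1. A tag's .contents attribute returns the tag's direct children as a list. See the following code:

head = soup.head
print(head.contents)

Note: a string has no .contents attribute, because a string cannot have child nodes.

2. A tag's .children iterator lets you loop over the tag's direct children without building a list. See the following code: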

for child in head.children:
    print(child)
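
The difference between the two attributes is easiest to see on a tiny document. This is a sketch under the assumption of a hypothetical two-child <head> (sample_html); it is not the fetched article.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup with two direct children under <head>
sample_html = "<head><meta charset='utf-8'/><title>Demo</title></head>"
sample_soup = BeautifulSoup(sample_html, "html.parser")
sample_head = sample_soup.head

child_list = sample_head.contents  # a real list: supports len() and indexing
child_names = [child.name for child in sample_head.children]  # iterate lazily

print(len(child_list))     # 2
print(child_list[1])       # <title>Demo</title>
print(child_names)         # ['meta', 'title']
```

Use .contents when you need indexing or a length, and .children when you only need to loop once.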

2.2.3 .descendants

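The .children and .contents attributes only reach a tag's direct children; the children of those children are not included. To iterate over all of a tag's descendants, use the .descendants attribute, a generator that walks the whole subtree.

See the following code: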

for child in head.descendants:
    print(child.name)
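
To make the contrast with .children concrete, the sketch below counts both on a hypothetical one-tag <head> (sample_html is an assumption, not the fetched page). The text inside <title> is a descendant of <head> but not a direct child.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical sample markup: <head> has one child tag, which holds one string
sample_html = "<head><title>Demo</title></head>"
sample_soup = BeautifulSoup(sample_html, "html.parser")
sample_head = sample_soup.head

direct = list(sample_head.children)         # only the <title> tag
everything = list(sample_head.descendants)  # the <title> tag, then its text

print(len(direct))      # 1
print(len(everything))  # 2
# Keep only the tags among the descendants, skipping text nodes
tag_names = [node.name for node in everything if isinstance(node, Tag)]
print(tag_names)        # ['title']
```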


2.2.4 .string

If a tag has exactly one child and that child is a NavigableString, the child is available as .string. See the following code:

print(head.title.string)


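If a tag's only child is another tag, and that child has a .string, the parent's .string gives the same value as the child's .string. If a tag contains more than one child, it is unclear which child .string should refer to, so .string is None. See the following code: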

print(head.string)
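
The three cases described above can be checked side by side. This is a minimal sketch on a hypothetical snippet (sample_html), chosen so that each case appears once.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup covering the three .string cases
sample_html = (
    "<head><title>Demo</title></head>"
    "<body><p><b>only child</b></p><div>one<i>two</i></div></body>"
)
sample_soup = BeautifulSoup(sample_html, "html.parser")

single = sample_soup.title.string  # one string child
nested = sample_soup.p.string      # one tag child that itself has a .string
multiple = sample_soup.div.string  # two children, so the result is None

print(single)    # Demo
print(nested)    # only child
print(multiple)  # None
```

When .string is None because of multiple children, get_text() or the .strings generator can be used to reach the text instead.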


2.2.5 .strings and .stripped_strings

If a tag contains more than one string, use the .strings generator to loop over them. See the following code:

for child_str in head.strings:
    print(child_str)

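The strings produced this way may contain many extra spaces and blank lines. Use .stripped_strings to remove the extra whitespace: strings that consist entirely of whitespace are skipped, and leading and trailing whitespace is removed from each remaining string. See the following code: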

for child_str in head.stripped_strings:
    print(child_str)
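
The effect of the stripping is clearer when both generators run over the same markup. The sketch below assumes a hypothetical, deliberately whitespace-heavy snippet (sample_html).

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup with indentation and padding around the text
sample_html = "<div>\n  <p>  first  </p>\n\n  <p>second</p>\n</div>"
sample_soup = BeautifulSoup(sample_html, "html.parser")

raw = list(sample_soup.div.strings)             # includes whitespace-only strings
clean = list(sample_soup.div.stripped_strings)  # whitespace-only strings dropped

print(raw)
print(clean)  # ['first', 'second']
```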

2.2.6 Parent nodes

1) .parent returns a node's direct parent. See the following code:

print(head.parent)

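2) A node's .parents attribute iterates over all of its ancestors. The example below uses .parents to walk from an <a> tag up to the root of the document. The output shown assumes soup was parsed from the "three sisters" sample document used in the Beautiful Soup documentation, not from the article fetched in section 2.1.

link = soup.a  # the first <a> tag in the document
print(link)
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
# The loop ends at the BeautifulSoup object itself, whose name is [document]
for parent in link.parents:
    print(parent.name)
# p
# body
# html
# [document]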
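
The same walk can be reproduced offline. This is a minimal sketch on a hypothetical snippet (sample_html); it also shows that the top-level BeautifulSoup object has no parent.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup: <a> nested three levels deep
sample_html = "<html><body><p><a href='#'>link</a></p></body></html>"
sample_soup = BeautifulSoup(sample_html, "html.parser")
link_tag = sample_soup.a

direct_parent = link_tag.parent.name                      # nearest ancestor
parent_names = [parent.name for parent in link_tag.parents]  # nearest first
top_parent = sample_soup.parent                           # the root has no parent

print(direct_parent)  # p
print(parent_names)   # ['p', 'body', 'html', '[document]']
print(top_parent)     # None
```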

2.2.7 Sibling nodes

The examples below are based on the following HTML:

# 'html.parser' does not wrap the fragment in <html> and <body>
sibling_soup = BeautifulSoup("<a><b>text1</b><c>text2</c></a>", 'html.parser')
print(sibling_soup.prettify())
# <a>
#  <b>
#   text1
#  </b>
#  <c>
#   text2
#  </c>
# </a>

1) .next_sibling and .previous_sibling

In the document tree, the .next_sibling and .previous_sibling attributes move between nodes that share the same parent:

See the following code:

print(sibling_soup.b.next_sibling)
# <c>text2</c>

print(sibling_soup.c.previous_sibling)
# <b>text1</b>
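
In real pages, tags are usually separated by line breaks, so .next_sibling often returns a whitespace string rather than the next tag. The sketch below illustrates this on a hypothetical list (sample_html) and uses find_next_sibling() to skip to the next tag.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup with a newline between the two <li> tags
sample_html = "<ul>\n<li>one</li>\n<li>two</li>\n</ul>"
sample_soup = BeautifulSoup(sample_html, "html.parser")
first_li = sample_soup.li

raw_next = first_li.next_sibling             # the newline text node
tag_next = first_li.find_next_sibling("li")  # the next <li> tag
no_sibling = sample_soup.ul.next_sibling     # <ul> is the last node, so None

print(repr(raw_next))  # '\n'
print(tag_next)        # <li>two</li>
print(no_sibling)      # None
```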

2) .next_siblings and .previous_siblings

The .next_siblings and .previous_siblings attributes iterate over a node's following and preceding siblings:

See the following code:

# soup here is parsed from the "three sisters" sample document in the Beautiful Soup documentation
for sibling in soup.a.next_siblings:
    print(repr(sibling))
# ',\n'
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
# ' and\n'
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
# '; and they lived at the bottom of a well.'

# The iteration simply ends after the last sibling; None is never yielded
for sibling in soup.find(id="link3").previous_siblings:
    print(repr(sibling))
# ' and\n'
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
# ',\n'
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
# 'Once upon a time there were three little sisters; and their names were\n'
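
Since these generators yield text nodes as well as tags, a common follow-up is to keep only the tags. The following is a sketch under the assumption of a hypothetical paragraph (sample_html) modeled loosely on the example above; the filtering with isinstance is one possible approach, not the only one.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical sample markup: three links separated by text
sample_html = (
    "<p><a id='link1'>Elsie</a>,\n"
    "<a id='link2'>Lacie</a> and\n"
    "<a id='link3'>Tillie</a>; the end.</p>"
)
sample_soup = BeautifulSoup(sample_html, "html.parser")

first_link = sample_soup.find(id="link1")
last_link = sample_soup.find(id="link3")

# Keep only Tag siblings, dropping the text nodes between them
following_ids = [s["id"] for s in first_link.next_siblings if isinstance(s, Tag)]
preceding_ids = [s["id"] for s in last_link.previous_siblings if isinstance(s, Tag)]

print(following_ids)  # ['link2', 'link3']
print(preceding_ids)  # ['link2', 'link1'] (nearest sibling first)
```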
