Learning the Python Scraping Package BeautifulSoup (Part 8): Using .parent and Related Attributes

阿新 • Published: 2019-02-13

We continue with the HTML page content from the previous post:

from bs4 import BeautifulSoup

html_doc = """
<html>
<head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>; and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body>
</html>
"""

# The parser choice is an assumption; another installed parser such as "lxml" also works.
soup = BeautifulSoup(html_doc, "html.parser")

Continuing our walk through the document tree: every tag or string has a parent node, namely the tag that contains it.

.parent

Use the .parent attribute to get an element's parent node. In the example HTML document, the <head> tag is the parent of the <title> tag:

title_tag = soup.title 
title_tag
# <title>The Dormouse's story</title> 
title_tag.parent 
# <head><title>The Dormouse's story</title></head>

The string inside <title> also has a parent, the <title> tag itself:

title_tag.string.parent 
# <title>The Dormouse's story</title>

The parent of a top-level node such as <html> is the BeautifulSoup object:

html_tag = soup.html 
type(html_tag.parent) 
# <class 'bs4.BeautifulSoup'>

The .parent of the BeautifulSoup object itself is None.
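The cases above can be checked in one place. The following is a minimal sketch, assuming Beautiful Soup 4 is installed and the built-in "html.parser" is used; the short `doc` string is a made-up stand-in for the full example page.

```python
from bs4 import BeautifulSoup

# A shortened, hypothetical stand-in for the example page.
doc = "<html><head><title>The Dormouse's story</title></head><body></body></html>"
soup = BeautifulSoup(doc, "html.parser")

title_tag = soup.title

# A tag's parent is the tag that directly contains it.
head_name = title_tag.parent.name

# The string inside <title> has the <title> tag as its parent.
string_parent_name = title_tag.string.parent.name

# The top-level <html> tag's parent is the BeautifulSoup object,
# and the BeautifulSoup object has no parent at all.
html_parent_is_soup = soup.html.parent is soup
soup_parent = soup.parent

print(head_name, string_parent_name, html_parent_is_soup, soup_parent)
```

Because the root's .parent is None, code that climbs the tree by hand can use that value as its stopping condition.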

.parents

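The .parents attribute lets you iterate over all of an element's ancestors. The example below uses .parents to walk from the <a> tag all the way up to the root of the document: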

link = soup.a
link
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
for parent in link.parents:
    print(parent.name)
# p
# body
# html
# [document]
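One practical use of .parents is building a readable "path" from the root down to a tag. This is only a sketch under assumptions: the `doc` string is a trimmed, hypothetical version of the example page, and `ancestor_names` and `path` are names chosen here for illustration.

```python
from bs4 import BeautifulSoup

# Trimmed, hypothetical version of the example page.
doc = '<html><body><p class="story"><a id="link1">Elsie</a></p></body></html>'
soup = BeautifulSoup(doc, "html.parser")

link = soup.a

# .parents yields the nearest ancestor first and the BeautifulSoup object last.
ancestor_names = [parent.name for parent in link.parents]

# Reverse the list to read the path from the root down to the tag's parent.
path = " > ".join(reversed(ancestor_names))

print(ancestor_names)
print(path)
```

The last name in `ancestor_names` is "[document]", which is the name Beautiful Soup gives to the BeautifulSoup object itself.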

Sibling nodes

Consider this example:

<a>
    <b>text1</b>
    <c>text2</c>
</a>

Here the b and c nodes are siblings: both are direct children of the same a tag.

.next_sibling and .previous_sibling

Within the document tree, use the .next_sibling and .previous_sibling attributes to move between sibling nodes:

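sibling_soup = BeautifulSoup("<a><b>text1</b><c>text2</c></a>", "html.parser")
sibling_soup.b.next_sibling
# <c>text2</c>
sibling_soup.c.previous_sibling
# <b>text1</b>

The b tag has a .next_sibling, but its .previous_sibling is None because b is the first node at its level. Likewise, the c tag has a .previous_sibling, but its .next_sibling is None. In a real document, the sibling of a tag is often a string rather than another tag: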

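link = soup.a
link
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
link.next_sibling
# ',\n'

Note: the .next_sibling of the first a tag is the string ',\n' (the comma and line break that separate it from the second a tag), not the second a tag itself.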

link.next_sibling.next_sibling 
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>

The second a tag (Lacie) is the .next_sibling of that string, that is, the first a tag's .next_sibling.next_sibling.
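When you only care about the next tag and not the text in between, chaining .next_sibling twice is fragile. Beautiful Soup's find_next_sibling() method skips over the strings for you. Here is a minimal sketch, assuming a made-up `doc` fragment modelled on the example page:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment modelled on the "story" paragraph.
doc = (
    '<p><a id="link1">Elsie</a>,\n'
    '<a id="link2">Lacie</a> and\n'
    '<a id="link3">Tillie</a></p>'
)
soup = BeautifulSoup(doc, "html.parser")

first = soup.find(id="link1")

# .next_sibling returns whatever node comes next, here a string.
raw_next = first.next_sibling

# find_next_sibling("a") skips strings and returns the next <a> tag.
next_tag = first.find_next_sibling("a")

print(repr(raw_next))
print(next_tag["id"])
```

find_previous_sibling() works the same way in the other direction.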

.next_siblings and .previous_siblings

Use the .next_siblings and .previous_siblings attributes to iterate over the siblings of the current node:

for sibling in soup.a.next_siblings:
    print(repr(sibling))
# ',\n'
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
# ' and\n'
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
# '; and they lived at the bottom of a well.'

for sibling in soup.find(id="link3").previous_siblings:
    print(repr(sibling))
# ' and\n'
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
# ',\n'
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
# 'Once upon a time there were three little sisters; and their names were\n'
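Since these iterators mix tags and strings, a common follow-up step is to separate the two. The sketch below is one way to do it, assuming a made-up `doc` fragment; `tag_ids` and `texts` are illustrative names, not part of the Beautiful Soup API.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical fragment modelled on the "story" paragraph.
doc = (
    '<p><a id="link1">Elsie</a>,\n'
    '<a id="link2">Lacie</a> and\n'
    '<a id="link3">Tillie</a>; the end.</p>'
)
soup = BeautifulSoup(doc, "html.parser")

first = soup.find(id="link1")

tag_ids = []  # ids of sibling tags
texts = []    # plain text found between the tags

for sibling in first.next_siblings:
    if isinstance(sibling, Tag):
        tag_ids.append(sibling["id"])
    else:
        texts.append(str(sibling))

print(tag_ids)
print(texts)
```

The loop ends on its own after the last sibling, so no check for None is needed inside it.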

Going back and forth

Take the following HTML as an example:

<html><head><title>The Dormouse's story</title></head> <p class="title"><b>The Dormouse's story</b></p>

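An HTML parser turns this string into a series of events: "open a tag", "add a string", "close a tag", "open a tag", and so on.

Beautiful Soup provides tools for replaying the order in which the parser originally built the document.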
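To make the idea of parse events concrete, the sketch below logs them with the standard library's html.parser.HTMLParser. This is for illustration only: the `EventLogger` class and the `markup` string are invented here, and nothing in the post says Beautiful Soup exposes its events this way.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Records each parse event in the order the parser reports it."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("open", tag))

    def handle_data(self, data):
        self.events.append(("string", data))

    def handle_endtag(self, tag):
        self.events.append(("close", tag))


# Hypothetical fragment taken from the example page.
markup = '<p class="title"><b>The Dormouse\'s story</b></p>'

logger = EventLogger()
logger.feed(markup)
logger.close()

for event in logger.events:
    print(event)
```

The order of these events is what .next_element and .previous_element follow: each object points to whatever the parser produced right after or right before it.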

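.next_element and .previous_element

The .next_element attribute points to the object (a string or a tag) that was parsed immediately after the current one. The result may be the same as .next_sibling, but it usually differs.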

last_a_tag = soup.find("a", id="link3") 
last_a_tag 
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a> 
last_a_tag.next_sibling 
# '; and they lived at the bottom of a well.'

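The .next_element of this <a> tag, however, is whatever was parsed right after the <a> tag was opened. That is not the rest of the sentence after the tag but the string 'Tillie', because the parser met the opening <a> tag, then the word 'Tillie', then the closing </a>, and only then the remainder of the sentence:

last_a_tag.next_element
# 'Tillie'

The .previous_element attribute is the exact opposite of .next_element: it points to the object that was parsed immediately before the current one:

last_a_tag.previous_element
# ' and\n'
last_a_tag.previous_element.next_element
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>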
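The difference between the two attributes is easiest to see side by side. The following is a minimal sketch, assuming a made-up one-paragraph `doc`; the variable names are illustrative.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: the last link and the text that follows it.
doc = '<p><a id="link3">Tillie</a>; and they lived at the bottom of a well.</p>'
soup = BeautifulSoup(doc, "html.parser")

last_a = soup.find(id="link3")

# Same level of the tree: the text that comes after </a>.
sibling = last_a.next_sibling

# Parse order: the first thing parsed after <a> was opened, i.e. its own text.
element = last_a.next_element

# Continuing in parse order from the string 'Tillie' reaches the sibling text.
after_string = element.next_element

print(repr(sibling))
print(repr(element))
print(repr(after_string))
```

In other words, .next_sibling stays on one level of the tree, while .next_element descends into the tag first and only afterwards moves on to what follows it.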

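.next_elements and .previous_elements

The .next_elements and .previous_elements iterators let you move forward or backward through the document in the order it was parsed, as if the parse were happening again:

for element in last_a_tag.next_elements:
    print(repr(element))
# 'Tillie'
# '; and they lived at the bottom of a well.'
# '\n'
# <p class="story">...</p>
# '...'
# '\n'
# (any further whitespace-only strings depend on how the document ends)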
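Because .next_elements yields tags and strings alike, it helps to label each item while iterating. This sketch assumes a made-up two-paragraph `doc`; `following` is an illustrative name.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

# Hypothetical fragment: the last link, the rest of its paragraph, and one more paragraph.
doc = '<p><a id="link3">Tillie</a>; the end.</p><p>...</p>'
soup = BeautifulSoup(doc, "html.parser")

last_a = soup.find(id="link3")

following = []
for element in last_a.next_elements:
    if isinstance(element, NavigableString):
        following.append(("string", str(element)))
    else:
        following.append(("tag", element.name))

for item in following:
    print(item)
```

Note that the second <p> tag appears before its own text '...', which mirrors the "open a tag, then add a string" order of the original parse.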
