Web Scraping Series, Chapter 2: The BS & XPath Modules

阿新 • Published: 2018-09-30


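I. BeautifulSoup

Introduction to BeautifulSoup

In short, Beautiful Soup is a Python library whose main purpose is extracting data from web pages. The official description reads:

Beautiful Soup provides a few simple, Pythonic functions for navigating, searching, and modifying a parse tree.
It is a toolkit that parses a document and hands you the data you want to extract; because it is simple, a complete application takes very little code.

Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree, and it commonly saves hours or even days of work. If you are looking for the Beautiful Soup 3 documentation: Beautiful Soup 3 is no longer being developed, and the official site recommends Beautiful Soup 4 for current projects.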

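Installation

pip3 install beautifulsoup4

Parsers

Beautiful Soup supports the HTML parser in Python's standard library as well as several third-party parsers. If no third-party parser is installed, Beautiful Soup falls back to Python's built-in parser. The lxml parser is more capable and faster, so installing it is recommended.

pip3 install lxml

Another option is html5lib, a pure-Python parser that parses HTML the same way a browser does. It can be installed with:

pip3 install html5lib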

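Parser comparison:

[Figure: parser comparison table]

Official documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html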
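One way to act on the parser advice above is to try the faster parsers first and fall back to the built-in one. The sketch below is only an illustration: the helper name make_soup and the preference order are assumptions, not part of the original article. FeatureNotFound is the exception Beautiful Soup raises when a requested parser is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Preference order is an assumption for illustration: third-party parsers
# first, with the always-available standard-library parser last.
PREFERRED_PARSERS = ("lxml", "html5lib", "html.parser")


def make_soup(markup, parsers=PREFERRED_PARSERS):
    """Return (soup, parser_name) using the first parser that is installed."""
    for parser in parsers:
        try:
            return BeautifulSoup(markup, parser), parser
        except FeatureNotFound:
            continue  # this parser is not installed; try the next one
    raise RuntimeError("no usable parser found")


soup, used = make_soup("<p>hello</p>")
print(used, soup.p.string)
```

Naming the parser explicitly, as every example in this chapter does, also keeps results consistent across machines that have different parsers installed.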

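Basic usage

The following HTML snippet is used as an example throughout this chapter. It is an excerpt from Alice's Adventures in Wonderland (referred to below as the "Alice" document):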

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
 
"""

Parsing this snippet with BeautifulSoup yields a BeautifulSoup object:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, "html.parser")

print(soup.head)  # the <head> tag
print(soup.head.title)  # the <title> tag inside <head>
print(soup.a)  # note: only the first <a> tag is returned

Find the links of all <a> tags in the document:

for link in soup.find_all("a"):
    print(link.get("href"))

Get all the text in the document:

print(soup.get_text())

Tag objects

Put simply, a Tag object is one of the tags in the HTML; it corresponds to a tag in the original XML or HTML document:

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', "html.parser")
tag = soup.b
type(tag)
# <class 'bs4.element.Tag'>

Tag names

Taking the html_doc from Alice's Adventures in Wonderland again, the simplest way to navigate the tree is to name the tag you want. To get the <head> tag, just use soup.head:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, "html.parser")  # build the soup object

print(soup.find_all("a"))  # every <a> Tag object (a list of tags, not a single one)
print(soup.find_all("p"))  # every <p> Tag object

for link in soup.find_all("a"):
    print(link.name)  # the name of each tag, here "a"

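Tag attributes (they can be read, added, changed, and deleted, just like a dictionary)

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, "html.parser")

for link in soup.find_all("a"):
    # link is an <a> Tag object; its other data is reached through the tag itself
    # print(link.name)  # the tag name, "a"
    print(link.get("id"))
    print(link.get("class"))  # class is multi-valued, so this is a list: ['sister']
    print(link.get("href"))  # access method 1: returns None if the attribute is missing
    print(link["href"])  # access method 2: raises KeyError if the attribute is missing
    print(link.attrs)  # e.g. {'href': 'http://example.com/elsie', 'class': ['sister'], 'id': 'link1'}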
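The heading above says attributes behave like a dictionary, but the loop only reads them. A minimal sketch of the other three operations follows; the data-rank attribute and the new id value are made up for illustration.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>',
    "html.parser",
)
link = soup.a

link["id"] = "first"       # change an existing attribute
link["data-rank"] = "1"    # add a new attribute (hypothetical name)
del link["class"]          # delete an attribute

print(link.has_attr("class"))  # membership test
print(link.attrs)              # the underlying dictionary
print(link)                    # the tag reflects every change
```

The changes are made on the parsed tree, so printing the tag or the whole soup afterwards shows the modified markup.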

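Getting a tag's text

soup = BeautifulSoup(html_doc, "html.parser")

for link in soup.find_all("a"):
    print(link.text)  # method 1
    print(link.string)  # the tag's single string child, if it has one
    print(link.get_text())  # method 2

The difference between text and string

print(soup.p.text)  # all text under p, concatenated
print(soup.p.string)  # the text only when p contains a single string (directly or through a chain of single children); otherwise None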
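The difference only shows up when a tag has more than one child, which the first <p> of the Alice document does not. The sketch below uses a shortened, made-up variant of that document with two paragraphs so both cases appear side by side.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the Alice document (an assumption for illustration).
html = (
    "<p class='title'><b>The Dormouse's story</b></p>"
    "<p class='story'>Once <a>Elsie</a> and <a>Lacie</a></p>"
)
soup = BeautifulSoup(html, "html.parser")
title_p, story_p = soup.find_all("p")

# One child (<b>) which itself holds one string: .string follows the chain.
print(title_p.string)

# Several children (strings and tags mixed): .string gives up and returns None,
# while .text still concatenates everything.
print(story_p.string)
print(story_p.text)
```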

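Traversing the document tree

Nested selection (simply "dotting" your way to the data you want)

print(soup.head.title.string)
print(soup.body.a.string)

Children and descendants

# Children
print(soup.p.contents)  # a list of all direct children of p
print(soup.p.children)  # an iterator over the direct children of p; loop over it with for
for i, child in enumerate(soup.p.children):
    print(i, child)

# Descendants
print(soup.p.descendants)  # a generator over everything nested under p, at any depth
for i, child in enumerate(soup.p.descendants):
    print(i, child)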
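To make the children/descendants distinction concrete, the following sketch counts both for the title paragraph of the Alice document. The paragraph is repeated inline so the snippet runs on its own.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<p class='title'><b>The Dormouse's story</b></p>", "html.parser"
)
p = soup.p

children = list(p.children)        # direct children only: the <b> tag
descendants = list(p.descendants)  # every level: the <b> tag, then its string

print(len(children), len(descendants))
print(p.contents == children)      # .contents is the same data as a list
```

Note that strings count as nodes too, which is why the descendants list is longer than the list of tags alone.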

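Parents and ancestors

print(soup.a.parent)  # the parent of the first <a> tag
print(soup.a.parents)  # all ancestors of the <a> tag: the parent, the parent's parent, and so on
# parents is a generator; loop over it with for
for i in soup.p.parents:
    print(i)

Siblings

print(soup.a.next_sibling)  # the next sibling
print("----", soup.a.previous_sibling)  # the previous sibling

print(list(soup.a.next_siblings))  # all following siblings (next_siblings is a generator)
print(soup.a.previous_siblings)  # all preceding siblings (a generator object)
for i in soup.a.previous_siblings:
    print(i)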
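A common surprise with the sibling attributes is that the "next sibling" of a tag is often a piece of text, not the next tag. The sketch below shows this on a cut-down version of the story paragraph and uses find_next_siblings() to skip the text nodes.

```python
from bs4 import BeautifulSoup, NavigableString

html = (
    '<p><a id="link1">Elsie</a>,\n'
    '<a id="link2">Lacie</a> and\n'
    '<a id="link3">Tillie</a>;</p>'
)
soup = BeautifulSoup(html, "html.parser")
first = soup.a

# The immediate sibling is the comma and newline between the two tags.
print(repr(first.next_sibling))
print(isinstance(first.next_sibling, NavigableString))

# All following siblings, text nodes included.
all_following = list(first.next_siblings)
print(len(all_following))

# Only the following siblings that are <a> tags.
following_ids = [tag["id"] for tag in first.find_next_siblings("a")]
print(following_ids)
```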

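Searching the document tree

BeautifulSoup defines many search methods. This section focuses on two of them, find() and find_all(); the others take similar arguments and are used in much the same way.

1. The five kinds of filter

String: the tag name

print(soup.find_all(name="a"))  # a list containing every <a> tag

Regular expression: the result has the same form as with a string

import re

reg = re.compile("^b")
ret = soup.find_all(name=reg)  # tags whose name starts with "b": <body> and <b>
print(ret)

List: matches any tag named in the list

ret = soup.find_all(name=["a", "b"])  # returns every <a> and <b> tag that exists
print(ret)

The fourth and fifth kinds, True and a function, are used less often. True matches every tag; a function receives each tag and returns True for the ones to keep:

def has_class_but_no_id(tag):
    return tag.has_attr("class") and not tag.has_attr("id")


for tag in soup.find_all(name=has_class_but_no_id):
    print(tag)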
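The True and function filters are easier to follow with their results spelled out. The sketch below runs both against a reduced, made-up version of the Alice document.

```python
from bs4 import BeautifulSoup

html = (
    "<html><head><title>T</title></head><body>"
    '<p class="title"><b>x</b></p>'
    '<p class="story"><a class="sister" id="link1">Elsie</a></p>'
    "</body></html>"
)
soup = BeautifulSoup(html, "html.parser")

# True matches every tag, but never a text node.
all_names = [tag.name for tag in soup.find_all(True)]
print(all_names)


def has_class_but_no_id(tag):
    """Function filter: keep tags that carry a class but no id."""
    return tag.has_attr("class") and not tag.has_attr("id")


# Both <p> tags qualify; the <a> tag has an id and is filtered out.
kept = [tag["class"] for tag in soup.find_all(has_class_but_no_id)]
print(kept)
```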

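2. Other find_all() filters

Signature: find_all( name , attrs , recursive , text , **kwargs )

import re

soup = BeautifulSoup(html_doc, "html.parser")

# keyword arguments: key=value, where value can be a filter: a string, a regular expression, a list, or True
print(soup.find_all(name=re.compile("^t")))  # match tag names starting with "t"
print(soup.find_all(id=re.compile("link1")))  # match on id
print(soup.find_all(href=re.compile("http"), id=re.compile("")))  # filters are combined with AND: both must match
print(soup.find_all(id=True))  # tags that have an id attribute

# Searching by class: the keyword is class_ (class is reserved in Python); its value can be any of the five filters
print(soup.find_all("a", class_="sister"))  # <a> tags with class sister
print(soup.find_all("a", class_="sister ssss"))  # matches only the exact class string "sister ssss"; a different order does not match
print(soup.find_all(class_=re.compile("^sis")))  # all tags with a class starting with "sis"

# attrs: search with a dictionary
print(soup.find_all("p", attrs={"class": "story"}))

# text: the value can be a string, a list, True, or a regular expression
# (newer Beautiful Soup versions name this argument string; text is the older name)
print(soup.find_all(text="Elsie"))
print(soup.find_all("a", text="Elsie"))

# limit: searching a large tree can be slow. If you do not need every result, limit caps
# how many are returned. It works like LIMIT in SQL: the search stops as soon as
# that many results have been found
print(soup.find_all("a", limit=2))  # return at most 2 results

# recursive: by default, tag.find_all() examines all descendants of the tag.
# To search only the tag's direct children, pass recursive=False.
print(soup.html.find_all("a"))
print(soup.html.find_all("a", recursive=False))  # [] here: no <a> is a direct child of <html>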
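The class_ rules above are the easiest to get wrong, so here is a small sketch that isolates them. The three tags and their ids are invented for the example; the last line uses a CSS selector as an order-independent alternative.

```python
from bs4 import BeautifulSoup

html = (
    '<a class="sister big" id="x">A</a>'
    '<a class="big sister" id="y">B</a>'
    '<a class="sister" id="z">C</a>'
)
soup = BeautifulSoup(html, "html.parser")


def ids(tags):
    """Helper for this example only: collect the id of each tag."""
    return [tag["id"] for tag in tags]


# A single class name matches any tag that has it among its classes.
one_class = ids(soup.find_all("a", class_="sister"))

# A multi-class string is compared with the exact attribute value.
exact = ids(soup.find_all("a", class_="sister big"))

# A CSS selector matches both classes regardless of their order.
both = ids(soup.select("a.sister.big"))

print(one_class, exact, both)
```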

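Shorthand usage

Calling a tag is like calling find_all(). find_all() is probably the most commonly used search method in Beautiful Soup, so a shortcut is defined for it: a BeautifulSoup object or a tag object can be called like a function, and the result is the same as calling that object's find_all() method. The following two lines are equivalent:

print(soup.find_all("a"))
print(soup("a"))

# These two are equivalent as well
print(soup.title.find_all(text=True))
print(soup.title(text=True))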

3. find()

Signature: find( name , attrs , recursive , text , **kwargs )

find_all() returns every tag in the document that matches, but sometimes only one result is wanted. If a document has only one <body> tag, for instance, scanning for all of them with find_all() is a poor fit; rather than calling find_all() with limit=1, call find() directly. The following two calls are nearly equivalent:

soup.find_all('title', limit=1)
# [<title>The Dormouse's story</title>]
soup.find('title')
# <title>The Dormouse's story</title>

The only difference is that find_all() returns a list containing a single element, whereas find() returns the element itself.
When nothing matches, find_all() returns an empty list and find() returns None.

print(soup.find("nosuchtag"))
# None

soup.head.title is shorthand for navigating by tag name. It works by calling find() repeatedly on the current tag:

soup.head.title
# <title>The Dormouse's story</title>
soup.find("head").find("title")
# <title>The Dormouse's story</title>
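Because find() returns None when nothing matches, chaining straight onto its result raises AttributeError on pages that lack the tag. A minimal defensive sketch follows; the helper name first_href is hypothetical.

```python
from bs4 import BeautifulSoup


def first_href(soup, tag_name):
    """Return the href of the first matching tag, or None if there is none."""
    tag = soup.find(tag_name)
    if tag is None:          # find() found nothing
        return None
    return tag.get("href")   # .get() also returns None if href is absent


soup = BeautifulSoup(
    '<p><a href="http://example.com/elsie">Elsie</a></p>', "html.parser"
)
found = first_href(soup, "a")
missing = first_href(soup, "img")
print(found, missing)
```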

4. Other methods

See the official documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#find-parents-find-parent

5. CSS selectors

In CSS, a tag name is written bare, a class name is prefixed with a dot, and an id is prefixed with #. The same syntax can be used to filter elements here through soup.select(), which returns a list.

(1) Select by tag name

print(soup.select("title"))  # [<title>The Dormouse's story</title>]
print(soup.select("b"))      # [<b>The Dormouse's story</b>]

(2) Select by class name

print(soup.select(".sister"))

'''
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

'''

(3) Select by id

print(soup.select("#link1"))
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

(4) Combined selectors

Selectors combine exactly as they do in a CSS file, mixing tag names, class names, and ids. For example, to find the element with id link2 inside a p tag, separate the two parts with a space:

print(soup.select("p #link2"))

# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

Direct child selection

print(soup.select("p > #link2"))
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

(5) Select by attribute

A selector can also test attributes, written in square brackets. The attribute belongs to the same node as the tag, so there must be no space between them; otherwise nothing matches.

print(soup.select("a[href='http://example.com/tillie']"))
# [<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

select() always returns a list, so you can iterate over it and call get_text() to read each element's content:

for title in soup.select('a'):
    print(title.get_text())

'''
Elsie
Lacie
Tillie
'''
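Two extensions of the selectors above are often handy: select_one(), which returns the first match or None instead of a list, and the CSS prefix and suffix attribute operators (^= and $=). The sketch below applies them to the story paragraph of the Alice document, repeated inline so the snippet is self-contained.

```python
from bs4 import BeautifulSoup

html = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
</p>
"""
soup = BeautifulSoup(html, "html.parser")

# select_one(): a single tag, or None when nothing matches.
lacie = soup.select_one("#link2").get_text()
nothing = soup.select_one("#no-such-id")

# $= tests the end of the attribute value, ^= tests the start.
ends_with = [a["id"] for a in soup.select('a[href$="tillie"]')]
starts_with = [a["id"] for a in soup.select('a[href^="http://example.com/"]')]

print(lacie, nothing, ends_with, starts_with)
```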

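Modifying the document tree

XPath syntax

Selecting nodes

Expression   Description                                                  Example              Result
nodename     Selects child elements of the current node named nodename    xpath('div')         div children of the current node
/            Selects from the root node                                   xpath('/div')        the div node at the root
//           Selects matching nodes anywhere in the document              xpath('//div')       all div nodes
.            Selects the current node                                     xpath('./div')       div nodes under the current node
..           Selects the parent of the current node                       xpath('..')          moves up one node
@            Selects attributes                                           xpath('//@class')    all class attributes

ret = selector.xpath("//div")
ret = selector.xpath("/div")
ret = selector.xpath("./div")
ret = selector.xpath("//p[@id='p1']")
ret = selector.xpath("//div[@class='d1']/div/p[@class='story']")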
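The calls above use a selector object that the article never constructs. One plausible way to obtain it, assumed here, is lxml's etree.HTML(), which parses a string and returns the root element. The sample markup (classes d1 and d3, id p1) is invented to match the expressions above.

```python
from lxml import etree

html_doc = """
<html><body>
<div class="d1"><div>
<p class="story" id="p1">Once <a href="/elsie" class="sister">Elsie</a> and <a href="/lacie">Lacie</a></p>
</div></div>
<div class="d3">end</div>
</body></html>
"""

# Assumption: selector is the root <html> element returned by etree.HTML().
selector = etree.HTML(html_doc)

all_divs = selector.xpath("//div")         # every div, at any depth
root_div = selector.xpath("/div")          # empty: the root element is <html>
p1 = selector.xpath("//p[@id='p1']")       # lookup by attribute value
hrefs = selector.xpath("//div[@class='d1']/div/p[@class='story']/a/@href")

print(len(all_divs), root_div, p1[0].get("class"), hrefs)
```

xpath() returns a list of elements for node expressions and a list of strings when the expression ends in an attribute such as @href.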

Predicates

Expression                                Result
xpath('/body/div[1]')                     the first div under body
xpath('/body/div[last()]')                the last div under body
xpath('/body/div[last()-1]')              the second-to-last div under body
xpath('/body/div[position()<3]')          the first two div nodes under body
xpath('/body/div[@class]')                div nodes under body that have a class attribute
xpath('/body/div[@class="main"]')         div nodes under body whose class attribute is main
xpath('/body/div[@price>35.00]')          div nodes under body whose price attribute is greater than 35

ret = selector.xpath("//p[@class='story']//a[2]")
ret = selector.xpath("//p[@class='story']//a[last()]")
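The positional predicates can be checked against a small document. The sketch below assumes lxml and uses a one-paragraph stand-in for the Alice story; note that XPath positions start at 1, not 0.

```python
from lxml import etree

selector = etree.HTML(
    '<p class="story">'
    '<a id="link1">Elsie</a>, <a id="link2">Lacie</a> and <a id="link3">Tillie</a>'
    "</p>"
)

second = selector.xpath("//p[@class='story']/a[2]/text()")         # 1-based index
last = selector.xpath("//p[@class='story']/a[last()]/text()")      # final <a>
first_two = selector.xpath("//p[@class='story']/a[position()<3]/@id")
with_id = selector.xpath("//a[@id]")                               # attribute exists

print(second, last, first_two, len(with_id))
```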

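Wildcards

XPath uses wildcards to select XML elements whose names are not known in advance.

Expression             Result
xpath('/div/*')        all child elements of div
xpath('/div[@*]')      div nodes that have at least one attribute

ret = selector.xpath("//p[@class='story']/*")
ret = selector.xpath("//p[@class='story']/a[@class]")

Selecting several paths

The "|" operator selects the union of several paths.

Expression                  Result
xpath('//div|//table')      all div and table nodes

ret = selector.xpath("//p[@class='story']/a[@class]|//div[@class='d3']")
print(ret)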

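XPath axes

An axis defines a node set relative to the current node.

Axis                    Expression                           Description
ancestor                xpath('./ancestor::*')               all ancestors of the current node (parent, grandparent, and so on)
ancestor-or-self        xpath('./ancestor-or-self::*')       all ancestors of the current node plus the node itself
attribute               xpath('./attribute::*')              all attributes of the current node
child                   xpath('./child::*')                  all children of the current node
descendant              xpath('./descendant::*')             all descendants of the current node (children, grandchildren, and so on)
following               xpath('./following::*')              everything in the document after the current node's closing tag
following-sibling       xpath('./following-sibling::*')      all siblings after the current node
parent                  xpath('./parent::*')                 the parent of the current node
preceding               xpath('./preceding::*')              everything in the document before the current node's opening tag (ancestors excluded)
preceding-sibling       xpath('./preceding-sibling::*')      all siblings before the current node
self                    xpath('./self::*')                   the current node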

Functions

Functions make fuzzy matching easier.

Function            Usage                                                         Explanation
starts-with         xpath('//div[starts-with(@id,"ma")]')                         div nodes whose id starts with ma
contains            xpath('//div[contains(@id,"ma")]')                            div nodes whose id contains ma
and                 xpath('//div[contains(@id,"ma") and contains(@id,"in")]')     div nodes whose id contains both ma and in
text()              xpath('//div[contains(text(),"ma")]')                         div nodes whose text contains ma
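The axes and functions in the two tables above can be tried together on a small document. This is a sketch assuming lxml; the ids main, side, link1, link2, and other are invented so that each expression has something to match and something to reject.

```python
from lxml import etree

doc = etree.HTML(
    '<div id="main"><p>intro</p>'
    '<a id="link1">Elsie</a><a id="link2">Lacie</a><a id="other">Tillie</a>'
    '</div><div id="side">map</div>'
)

# Axes: siblings after link1, siblings before link2, ancestors of link1.
after = doc.xpath("//a[@id='link1']/following-sibling::a/@id")
before = [el.tag for el in doc.xpath("//a[@id='link2']/preceding-sibling::*")]
ancestors = [el.tag for el in doc.xpath("//a[@id='link1']/ancestor::*")]

# Functions: prefix match, combined substring match, text match.
prefixed = doc.xpath("//a[starts-with(@id,'link')]/text()")
both = doc.xpath("//div[contains(@id,'ma') and contains(@id,'in')]/@id")
by_text = doc.xpath("//div[contains(text(),'ma')]/@id")

print(after, before, ancestors, prefixed, both, by_text)
```

Results come back in document order, which is why the ancestors list starts at html even though the axis conceptually walks upward.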

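Element objects

from lxml.etree import _Element
for obj in ret:
    print(obj)
    print(type(obj))  # <class 'lxml.etree._Element'>

'''
Element objects

class xml.etree.ElementTree.Element(tag, attrib={}, **extra)

    tag: string, the kind of data the element represents.
    text: string, the element's content.
    tail: string, the text that follows the element's end tag.
    attrib: dictionary, the element's attributes.

    # Attribute operations
    clear(): removes all descendants and attributes, and sets text and tail to None.
    get(key, default=None): returns the value of attribute key, or default if the attribute does not exist.
    items(): returns the attributes as a list of (key, value) pairs.
    keys(): returns a list of the element's attribute names.
    set(key, value): sets an attribute to the given value.

    # Operations on descendants
    append(subelement): adds a direct child element.
    extend(subelements): adds a sequence of elements as children.  # new in Python 2.7
    find(match): finds the first matching child; match can be a tag or a path.
    findall(match): finds all matching children; match can be a tag or a path.
    findtext(match): finds the first matching child and returns its text; match can be a tag or a path.
    insert(index, element): inserts a child element at the given position.
    iter(tag=None): returns an iterator over all descendants of the element, or only those with the given tag.  # new in Python 2.7
    iterfind(match): finds all matching descendants by tag or path.
    itertext(): iterates over all descendants and yields their text.
    remove(subelement): removes a child element.

'''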
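The listing above documents the standard library's xml.etree.ElementTree.Element; lxml's _Element offers a largely compatible interface. The sketch below therefore uses only the standard library, so it runs without lxml installed. The sample markup is a well-formed fragment invented for the example.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    '<p class="story">Once <a id="link1" href="/elsie">Elsie</a>, '
    '<a id="link2" href="/lacie">Lacie</a>.</p>'
)

first = root.find("a")                 # first matching child
print(first.tag, first.text)           # element name and its own text
print(repr(first.tail))                # text between </a> and the next tag
print(first.attrib)                    # attribute dictionary

# Attribute operations
fallback = first.get("missing", "n/a")   # default when the attribute is absent
first.set("class", "sister")             # add a new attribute
attr_names = sorted(first.keys())

# Operations on descendants
link_ids = [a.get("id") for a in root.findall("a")]
first_text = root.findtext("a")
full_text = "".join(root.itertext())     # text and tail pieces in document order
tags = [el.tag for el in root.iter()]    # the element itself, then descendants

print(fallback, attr_names, link_ids, first_text, full_text, tags)
```

The tail attribute is what separates this model from Beautiful Soup's: text that follows a child element belongs to that child's tail, not to a separate string node.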
