Beautiful Soup Library in Detail
Installation
pip install lxml
pip install beautifulsoup4
Verifying the installation
In [1]: from bs4 import BeautifulSoup

In [2]: soup = BeautifulSoup('<p>Hello</p>', 'lxml')

In [3]: print(soup.p.string)
Hello
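Introduction to Beautiful Soup
Parsers supported by Beautiful Soup

Beautiful Soup can work with Python's built-in html.parser as well as the third-party lxml (HTML and XML modes) and html5lib parsers.
Comparing them overall, the lxml parser is the better choice.
To use it, pass 'lxml' as the second argument when initializing BeautifulSoup.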
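As a rough sketch of how the second argument selects the parser, the snippet below tries several parser names and keeps the ones that are installed. The helper name `try_parsers` is hypothetical and only for illustration; `FeatureNotFound` is the exception Beautiful Soup raises when the requested parser is not available.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def try_parsers(markup, names=("lxml", "html.parser", "html5lib")):
    """Parse `markup` with every parser that is installed (hypothetical helper)."""
    results = {}
    for name in names:
        try:
            results[name] = str(BeautifulSoup(markup, name))
        except FeatureNotFound:
            # The parser library behind this name is not installed; skip it.
            continue
    return results


parsed = try_parsers("<p>Hello</p>")
for name, output in parsed.items():
    print(f"{name}: {output}")
```

html.parser ships with Python, so it should always appear in the result. It leaves the fragment as it is, while lxml and html5lib wrap it in html and body tags.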
from bs4 import BeautifulSoup

html = '''
<html>
<head><title>Beautiful Soup test</title></head>
<body>
<p class="first" name="first_p"><b>first content</b></p>
<p class="second">second content
<a href="http://example.com/first"></a>
<a href="http://example.com/second">
'''

soup = BeautifulSoup(html, 'lxml')
print(soup.prettify())    # indent and pretty-print the output
print(soup.title.string)  # get the text content of the title node
Note: the HTML in the code above is incomplete; several tags are never closed.
Output:
<html>
 <head>
  <title>
   Beautiful Soup test
  </title>
 </head>
 <body>
  <p class="first" name="first_p">
   <b>
    first content
   </b>
  </p>
  <p class="second">
   second content
   <a href="http://example.com/first">
   </a>
   <a href="http://example.com/second">
   </a>
  </p>
 </body>
</html>
Beautiful Soup test
Beautiful Soup automatically completes the missing HTML tags.
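The tag completion can be checked on a much smaller fragment. This is a minimal sketch that assumes the built-in html.parser, chosen only so it runs without lxml installed; with lxml the result would additionally be wrapped in html and body tags.

```python
from bs4 import BeautifulSoup

# Deliberately broken markup: neither <a> nor <p> is ever closed.
broken = '<p class="second">second content <a href="http://example.com/second">'

# html.parser is an assumption made here so the sketch has no extra dependency.
soup = BeautifulSoup(broken, "html.parser")
repaired = str(soup)
print(repaired)
```

Serializing the tree back to a string shows both closing tags in place, innermost first.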
Node selectors
from bs4 import BeautifulSoup

html = '''
<html>
<head><title>Beautiful Soup test</title></head>
<body>
<p class="first" name="first_p"><b>first content</b></p>
<p class="second">second content
<a href="http://example.com/first"></a>
<a href="http://example.com/second">
'''

soup = BeautifulSoup(html, 'lxml')
print(soup.title)         # <title>Beautiful Soup test</title>
print(type(soup.title))   # <class 'bs4.element.Tag'>
print(soup.title.string)  # Beautiful Soup test
print(soup.head)          # <head><title>Beautiful Soup test</title></head>
print(soup.p)             # <p class="first" name="first_p"><b>first content</b></p>
Node name
In [3]: print(soup.title.name)
title
All attributes of a node
In [4]: print(soup.p.attrs)
{'class': ['first'], 'name': 'first_p'}
A specific attribute of a node
In [5]: print(soup.p.attrs['name'])
first_p
Shorthand for a specific attribute
In [6]: print(soup.p['name'])
first_p
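The attribute lookups above can be sketched a little further. The markup here is a made-up variant of the example paragraph with a second class added, and html.parser is used only to keep the sketch dependency-free. Note that `class` comes back as a list because it can hold several values, and that indexing a missing attribute with `tag['id']` raises `KeyError`, whereas `tag.get('id')` returns `None`.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the example paragraph with an extra class value.
html = '<p class="first big" name="first_p"><b>first content</b></p>'
soup = BeautifulSoup(html, "html.parser")
p = soup.p

classes = p["class"]             # multi-valued attribute, returned as a list
name = p["name"]                 # single-valued attribute, returned as a string
missing = p.get("id")            # attribute absent: get() returns None
fallback = p.get("id", "no-id")  # get() also accepts a default value
has_name = p.has_attr("name")    # explicit existence check

print(classes, name, missing, fallback, has_name)
```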
Node text content
In [7]: print(soup.p.string)
first content
Nested selection
In [8]: print(soup.head.title)
<title>Beautiful Soup test</title>

In [9]: print(type(soup.head.title))
<class 'bs4.element.Tag'>

In [10]: print(soup.head.title.string)
Beautiful Soup test
Selecting related nodes
In [11]: print(soup.body.children)
<list_iterator object at 0x10825a6d8>

In [12]: for i, child in enumerate(soup.body.children):
    ...:     print(i, child)
    ...:
0

1 <p class="first" name="first_p"><b>first content</b></p>
2

3 <p class="second">second content
<a href="http://example.com/first"></a>
<a href="http://example.com/second">
</a></p>
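- children: all direct child nodes
- descendants: all descendant nodes
- parent: the direct parent node
- parents: all ancestor nodes
- next_sibling: the next sibling node
- previous_sibling: the previous sibling node
- next_siblings: all sibling nodes that follow
- previous_siblings: all sibling nodes that come before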
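The attributes listed above can be exercised on a small tree. This sketch uses made-up markup written without whitespace between tags, so that the siblings are tags rather than newline strings, and it assumes html.parser so that no html/body wrapper is added.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with no whitespace between the tags.
html = '<div><p id="a">one</p><p id="b">two</p><p id="c">three</p></div>'
soup = BeautifulSoup(html, "html.parser")
middle = soup.find(id="b")

parent_name = middle.parent.name                       # direct parent
ancestor_names = [tag.name for tag in middle.parents]  # up to the document root
next_id = middle.next_sibling["id"]
prev_id = middle.previous_sibling["id"]
following_ids = [tag["id"] for tag in soup.find(id="a").next_siblings]
preceding_ids = [tag["id"] for tag in soup.find(id="c").previous_siblings]
child_count = len(list(soup.div.children))             # the three <p> tags
descendant_count = len(list(soup.div.descendants))     # the tags plus their strings

print(parent_name, ancestor_names, next_id, prev_id)
print(following_ids, preceding_ids, child_count, descendant_count)
```

The plural attributes are iterators, which is why they are wrapped in `list()` or a comprehension here; `previous_siblings` walks backwards from the node.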
Method selectors
find_all
Preparing the data
In [13]: from bs4 import BeautifulSoup
    ...:
    ...: html = '''
    ...: <div class="panel">
    ...:     <div class="panel-heading">
    ...:         <h4>Hello</h4>
    ...:     </div>
    ...:     <div class="panel-body">
    ...:         <ul class="list" id="list-1">
    ...:             <li class="element">Foo</li>
    ...:             <li class="element">Bar</li>
    ...:             <li class="element">Jay</li>
    ...:         </ul>
    ...:         <ul class="list list-small" id="list-2">
    ...:             <li class="element">Foo</li>
    ...:             <li class="element">Bar</li>
    ...:         </ul>
    ...:     </div>
    ...: </div>
    ...: '''
    ...:
    ...: soup = BeautifulSoup(html, 'lxml')
    ...:
All ul elements
In [16]: soup.find_all(name='ul')
Out[16]:
[<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>,
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>]
Each ul in the result is a Tag, so you can loop over the results and call find_all() again on each one to search inside it.
In [17]: type(soup.find_all(name='ul')[0])
Out[17]: bs4.element.Tag

In [18]: for ul in soup.find_all(name='ul'):
    ...:     print(ul.find_all(name='li'))
    ...:
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>]
[<li class="element">Foo</li>, <li class="element">Bar</li>]
Then iterate over the li elements to get the text of each one.
In [19]: for ul in soup.find_all(name='ul'):
    ...:     print(ul.find_all(name='li'))
    ...:     for li in ul.find_all(name='li'):
    ...:         print(li.string)
    ...:
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>]
Foo
Bar
Jay
[<li class="element">Foo</li>, <li class="element">Bar</li>]
Foo
Bar
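Beyond a single tag name, find_all() also accepts a list of names, a `limit`, and `recursive=False`. The following is a small sketch on a compacted, simplified version of the panel markup, parsed with html.parser so it runs without lxml; the variable names are only illustrative.

```python
from bs4 import BeautifulSoup

# Compacted, simplified version of the panel markup used above.
html = (
    '<div class="panel"><h4>Hello</h4>'
    '<ul id="list-1"><li>Foo</li><li>Bar</li><li>Jay</li></ul>'
    '<ul id="list-2"><li>Foo</li><li>Bar</li></ul></div>'
)
soup = BeautifulSoup(html, "html.parser")

# Stop after the first two matches.
first_two = [li.get_text() for li in soup.find_all("li", limit=2)]

# A list of names matches any of them, in document order.
mixed_names = [tag.name for tag in soup.find_all(["h4", "ul"])]

# recursive=False only looks at direct children of the tag.
direct_names = [tag.name for tag in soup.div.find_all(recursive=False)]

# Nested search, collected into a dict keyed by the ul id.
texts_by_list = {
    ul["id"]: [li.get_text() for li in ul.find_all("li")]
    for ul in soup.find_all("ul")
}

print(first_two, mixed_names, direct_names, texts_by_list)
```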
attrs
Search by attribute
In [26]: soup.find_all(attrs={'id': 'list-1'})
Out[26]:
[<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>]
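Common attributes can also be passed as keyword arguments instead of an `attrs` dict. Because `class` is a reserved word in Python, Beautiful Soup spells that keyword `class_`. The sketch below uses a trimmed version of the two lists and html.parser; a class filter matches when any one of the tag's classes equals the given value.

```python
from bs4 import BeautifulSoup

# Trimmed version of the two lists from the panel markup.
html = (
    '<ul class="list" id="list-1"><li class="element">Foo</li></ul>'
    '<ul class="list list-small" id="list-2"><li class="element">Bar</li></ul>'
)
soup = BeautifulSoup(html, "html.parser")

by_id = soup.find_all(id="list-1")                # keyword form of attrs={'id': ...}
by_class = soup.find_all("ul", class_="list")     # matches both lists
small_only = soup.find_all("ul", class_="list-small")
combined = soup.find_all("ul", attrs={"class": "list", "id": "list-2"})

print([ul["id"] for ul in by_id])
print([ul["id"] for ul in by_class])
print([ul["id"] for ul in small_only])
print([ul["id"] for ul in combined])
```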
text
Match the text content of nodes
In [28]: import re

# Returns a list of all node text strings that match the regular expression
In [29]: soup.find_all(text=re.compile('ar'))
Out[29]: ['Bar', 'Bar']
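Current Beautiful Soup documentation names this argument `string`; `text`, as used above, is the older spelling. The sketch below assumes a version recent enough to accept `string` (4.4.0 or later) and uses a shortened list with html.parser. Passing only `string` returns the matching text nodes, while combining it with a tag name returns the tags whose text matches.

```python
import re

from bs4 import BeautifulSoup

# Shortened version of the first list from the panel markup.
html = (
    '<ul><li class="element">Foo</li>'
    '<li class="element">Bar</li>'
    '<li class="element">Jay</li></ul>'
)
soup = BeautifulSoup(html, "html.parser")

pattern = re.compile("a")  # matched with search(), so "Bar" and "Jay" qualify

matching_strings = soup.find_all(string=pattern)    # text nodes only
matching_tags = soup.find_all("li", string=pattern)  # the enclosing <li> tags
owners = [s.parent.name for s in matching_strings]   # each string knows its tag

print(matching_strings)
print(matching_tags)
print(owners)
```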
find
Returns the first matching element
In [30]: soup.find(text=re.compile('ar'))
Out[30]: 'Bar'

In [31]: soup.find('li')
Out[31]: <li class="element">Foo</li>
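One detail worth keeping in mind is that find() returns `None` when nothing matches, whereas find_all() returns an empty list. A minimal sketch of guarding against that follows; the helper name `first_text` is hypothetical, and html.parser is used only to avoid the lxml dependency.

```python
from bs4 import BeautifulSoup

html = '<ul><li class="element">Foo</li><li class="element">Bar</li></ul>'
soup = BeautifulSoup(html, "html.parser")


def first_text(tree, name, default=""):
    """Return the text of the first `name` tag, or `default` if there is none.

    Hypothetical helper, shown only to illustrate the None check.
    """
    tag = tree.find(name)
    return tag.get_text() if tag is not None else default


no_table = soup.find("table")      # no match: None
all_tables = soup.find_all("table")  # no match: empty list

print(first_text(soup, "li"))
print(first_text(soup, "table", default="n/a"))
```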
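Beyond find, there are several related methods. In each pair, the first returns all matches and the second returns only the first match:
- find_parents() and find_parent(): search the ancestors of a node
- find_next_siblings() and find_next_sibling(): search the siblings that follow a node
- find_previous_siblings() and find_previous_sibling(): search the siblings that come before a node
- find_all_next() and find_next(): search everything after a node in the document
- find_all_previous() and find_previous(): search everything before a node in the document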
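The methods above can be tried from a single starting node. This sketch uses made-up markup with two lists and no whitespace between tags, parsed with html.parser; the starting point is the li whose text is Bar.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two lists, written without whitespace between tags.
html = (
    '<ul id="list-1"><li>Foo</li><li>Bar</li><li>Jay</li></ul>'
    '<ul id="list-2"><li>Baz</li></ul>'
)
soup = BeautifulSoup(html, "html.parser")
bar = soup.find("li", string="Bar")

parent_id = bar.find_parent("ul")["id"]                     # nearest enclosing ul
next_text = bar.find_next_sibling("li").get_text()          # sibling after Bar
prev_text = bar.find_previous_sibling("li").get_text()      # sibling before Bar
later = [li.get_text() for li in bar.find_all_next("li")]   # crosses into list-2
earlier = [li.get_text() for li in bar.find_all_previous("li")]
next_list_id = bar.find_next("ul")["id"]                    # first ul after Bar

print(parent_id, next_text, prev_text, later, earlier, next_list_id)
```

The sibling methods stay inside the same parent, while find_all_next() and find_all_previous() follow document order and therefore can leave the list the node belongs to.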
CSS selectors
Just call the select() method and pass in the appropriate CSS selector.
In [32]: soup.select('.panel .panel-heading')
Out[32]:
[<div class="panel-heading">
<h4>Hello</h4>
</div>]

In [33]: soup.select('ul li')
Out[33]:
[<li class="element">Foo</li>,
<li class="element">Bar</li>,
<li class="element">Jay</li>,
<li class="element">Foo</li>,
<li class="element">Bar</li>]

In [34]: soup.select('#list-2 .element')
Out[34]: [<li class="element">Foo</li>, <li class="element">Bar</li>]

In [35]: soup.select('ul')[0]
Out[35]:
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
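A few more selector forms may be useful alongside the ones above. This sketch uses a compacted version of the two lists and html.parser, and assumes a Beautiful Soup version that bundles the soupsieve selector engine (4.7 or later). select_one() returns the first match, or `None` if there is none, which avoids indexing into the list that select() returns.

```python
from bs4 import BeautifulSoup

# Compacted version of the two lists from the panel markup.
html = (
    '<div class="panel">'
    '<ul class="list" id="list-1"><li class="element">Foo</li>'
    '<li class="element">Bar</li><li class="element">Jay</li></ul>'
    '<ul class="list list-small" id="list-2"><li class="element">Foo</li>'
    '<li class="element">Bar</li></ul></div>'
)
soup = BeautifulSoup(html, "html.parser")

first_list = soup.select_one("ul")                  # like select('ul')[0]
no_match = soup.select_one("table")                 # None instead of IndexError
children = soup.select("#list-2 > li")              # child combinator
by_attribute = soup.select('ul[id="list-1"] li')    # attribute selector
both_classes = soup.select("ul.list.list-small")    # tag carrying both classes

print(first_list["id"], no_match)
print(len(children), len(by_attribute), [ul["id"] for ul in both_classes])
```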
Nested selection
In [36]: for ul in soup.select('ul'):
    ...:     print(ul.select('li'))
    ...:
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>]
[<li class="element">Foo</li>, <li class="element">Bar</li>]
Getting attributes
In [37]: for ul in soup.select('ul'):
    ...:     print(ul['id'])
    ...:     print(ul.attrs['id'])
    ...:
list-1
list-1
list-2
list-2
Getting text
In [39]: for li in soup.select('li'):
    ...:     print('Get Text:', li.get_text())
    ...:     print('String:', li.string)
    ...:
Get Text: Foo
String: Foo
Get Text: Bar
String: Bar
Get Text: Jay
String: Jay
Get Text: Foo
String: Foo
Get Text: Bar
String: Bar
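get_text() and string give the same result above because each li holds a single piece of text. They differ once a tag has more than one child: string then returns `None`, while get_text() concatenates all nested text. The sketch below uses made-up markup (a paragraph with a link that has visible text) and html.parser.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: a paragraph containing text plus a link with its own text.
html = (
    '<p class="first"><b>first content</b></p>'
    '<p class="second">second content <a href="http://example.com/first">link</a></p>'
)
soup = BeautifulSoup(html, "html.parser")
first, second = soup.find_all("p")

single_child = first.string                  # one child chain, so the text is found
multi_child = second.string                  # text plus <a>: string gives None
joined = second.get_text()                   # all nested text, concatenated
cleaned = second.get_text(" ", strip=True)   # strip each piece, join with a space
pieces = list(second.stripped_strings)       # the individual stripped pieces

print(single_child, multi_child)
print(joined, cleaned, pieces)
```

When a tag may contain nested markup, get_text() is usually the safer choice, at the cost of losing the boundaries between the pieces unless a separator is passed.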