Python Web Scraping: Using the Beautiful Soup Parsing Library (Part 5)
阿新 • Published: 2018-12-18
Beautiful Soup: Introduction
A third-party Python library for extracting data from HTML or XML. Official site: http://www.crummy.com/software/BeautifulSoup/
Install: pip install beautifulsoup4
Beautiful Soup: Syntax
soup = BeautifulSoup(html_doc, 'html.parser', from_encoding='utf-8')
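First argument: the HTML document, as a string
Second argument: the HTML parser to use
Third argument: the encoding of the HTML document (only used when the document is passed as bytes; for a str it is ignored)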
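Since from_encoding only matters for byte input, a minimal sketch of that case may help. The sample markup and the variable names raw_bytes, soup_from_bytes and title_text are made up for illustration; html.parser is used so that no extra parser has to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical sample document, encoded to bytes as if it had been read
# from a file or an HTTP response body.
raw_bytes = '<html><head><title>資料</title></head><body><p>測試</p></body></html>'.encode('utf-8')

# from_encoding tells Beautiful Soup how to decode the bytes.
soup_from_bytes = BeautifulSoup(raw_bytes, 'html.parser', from_encoding='utf-8')

title_text = soup_from_bytes.title.string
print(title_text)
# The encoding Beautiful Soup ended up using to decode the input.
print(soup_from_bytes.original_encoding)
```

If the document is already a str, the from_encoding argument can simply be left out.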
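Beautiful Soup: Usage
Tag selector operations
Note: a tag selector returns only the first tag that matches; this is how tag selectors behave by design.
Selecting elements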
from bs4 import BeautifulSoup

html_doc = '''
<div class="container">
<a href="/pc/home?sign=360_79aabe15" class="logo"></a>
<nav id="nnav" data-mod="nnav">
<div class="nnav-wrap">
<ul class="nnav-items" id="nnav_main">
<li data-index="0"><a class="nnav-item" href="/pc/home?ch=youlike&sign=360_79aabe15" target="_blank" data-ch="youlike">推薦<span></span></a></li><li data-index="1"><a class="nnav-item" href="/pc/home?ch=good_safe2toera&sign=360_79aabe15" target="_blank" data-ch="good_safe2toera">新時代<span></span></a></li><li data-index="2"><a class="nnav-item" href="/pc/home?ch=fun&sign=360_79aabe15" target="_blank" data-ch="fun">娛樂<span></span></a></li><li data-index="3"><a class="nnav-item" href="/pc/home? data-index="4"><a class="nnav-item" href="/pc/home?ch=economy&sign=360_79aabe15" target="_blank" data-ch="economy">財經<span></span></a></li>
'''
soup = BeautifulSoup(html_doc, 'lxml')
# Complete the (unclosed) HTML automatically and print it in formatted form
print(soup.prettify())
# Print the first a tag
print(soup.a)
# Print the first span tag
print(soup.span)

The output is as follows:
<html>
 <body>
  <div class="container">
   <a class="logo" href="/pc/home?sign=360_79aabe15">
   </a>
   <nav data-mod="nnav" id="nnav">
    <div class="nnav-wrap">
     <ul class="nnav-items" id="nnav_main">
      <li data-index="0">
       <a class="nnav-item" data-ch="youlike" href="/pc/home?ch=youlike&sign=360_79aabe15" target="_blank">
        推薦
        <span>
        </span>
       </a>
      </li>
      <li data-index="1">
       <a class="nnav-item" data-ch="good_safe2toera" href="/pc/home?ch=good_safe2toera&sign=360_79aabe15" target="_blank">
        新時代
        <span>
        </span>
       </a>
      </li>
      <li data-index="2">
       <a class="nnav-item" data-ch="fun" href="/pc/home?ch=fun&sign=360_79aabe15" target="_blank">
        娛樂
        <span>
        </span>
       </a>
      </li>
      <li data-index="3">
       <a class="nnav-item" href="/pc/home? data-index=">
       </a>
       <a class="nnav-item" data-ch="economy" href="/pc/home?ch=economy&sign=360_79aabe15" target="_blank">
        財經
        <span>
        </span>
       </a>
      </li>
     </ul>
    </div>
   </nav>
  </div>
 </body>
</html>
<a class="logo" href="/pc/home?sign=360_79aabe15"></a>
<span></span>
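To make the "first match only" behaviour concrete, here is a small sketch on a made-up two-link document (sample_html and the variable names are illustrative, not taken from the page above). It compares attribute-style access with find_all, and shows what happens when the tag does not exist at all.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with two links
sample_html = '<ul><li><a href="/a">first</a></li><li><a href="/b">second</a></li></ul>'
soup = BeautifulSoup(sample_html, 'html.parser')

first_link = soup.a               # attribute access stops at the first <a>
all_links = soup.find_all('a')    # find_all collects every <a>
missing_tag = soup.table          # no <table> in the document, so this is None

print(first_link['href'])
print([a['href'] for a in all_links])
print(missing_tag)
```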
Getting the name
Getting attributes
Getting the content
from bs4 import BeautifulSoup

html_doc = '''
<div class="container">
<a href="/pc/home?sign=360_79aabe15" class="logo"></a>
<nav id="nnav" data-mod="nnav">
<div class="nnav-wrap">
<ul class="nnav-items" id="nnav_main">
<li data-index="0"><a class="nnav-item" href="/pc/home?ch=youlike&sign=360_79aabe15" target="_blank" data-ch="youlike">推薦<span></span></a></li><li data-index="1"><a class="nnav-item" href="/pc/home?ch=good_safe2toera&sign=360_79aabe15" target="_blank" data-ch="good_safe2toera">新時代<span></span></a></li><li data-index="2"><a class="nnav-item" href="/pc/home?ch=fun&sign=360_79aabe15" target="_blank" data-ch="fun">娛樂<span></span></a></li><li data-index="3"><a class="nnav-item" href="/pc/home? data-index="4"><a class="nnav-item" href="/pc/home?ch=economy&sign=360_79aabe15" target="_blank" data-ch="economy">財經<span></span></a></li>
'''
soup = BeautifulSoup(html_doc, 'lxml')
# Print the name of the first a tag
print(soup.a.name)
# Print the class attribute of the first a tag; either of the two forms below works
print(soup.a.attrs['class'])
print(soup.a['class'])
# Print the content of the first a tag (None here, because the first a tag is empty)
print(soup.a.string)

The output is as follows:
a
['logo']
['logo']
None
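The None in the last line is easy to misread, so the following sketch (on made-up markup; sample_html and the variable names are assumptions for illustration) shows when .string yields a value and when get_text() is the safer choice. It also shows that class comes back as a list while href comes back as a plain string.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: an <a> with text plus an empty <span>, and a <p> wrapping a <b>
sample_html = '<a class="nav item" href="/home">Home<span></span></a><p><b>bold</b></p>'
soup = BeautifulSoup(sample_html, 'html.parser')
link = soup.a

tag_name = link.name              # 'a'
class_value = link['class']       # class is multi-valued, so a list is returned
href_value = link['href']         # ordinary attributes come back as str
title_value = link.get('title')   # .get() returns None for a missing attribute

# .string only works when a tag has exactly one child; this <a> has two
# (the text and the <span>), so it is None, while get_text() still works.
link_string = link.string
link_text = link.get_text()

# <p> has a single child <b>, which has a single string, so .string drills down.
paragraph_string = soup.p.string

print(tag_name, class_value, href_value, title_value)
print(link_string, link_text, paragraph_string)
```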
Nested selection
from bs4 import BeautifulSoup

html_doc = '''
<a class="nnav-item" href="/pc/home?ch=youlike&sign=360_79aabe15" target="_blank" data-ch="youlike"><span>推薦</span></a>
'''
soup = BeautifulSoup(html_doc, 'lxml')
# Select the span nested inside the first a tag and print its text
print(soup.a.span.string)

The output is as follows:
推薦
Child and descendant node operations
Getting all child nodes
from bs4 import BeautifulSoup

html = '''
<div class="bc">
<span class="fl" style="padding-top: 1px;"><a href="http://www.koolearn.com/" target="_blank" title="新東方線上網路課堂"><img src="http://images.koolearn.com/fe_upload/2015_9_2_1441179226504.jpg" width="105" height="48" alt="新東方線上網路課堂"></a></span>
<span class="fl" style="padding-top: 6px;">
<a href="http://cet4.koolearn.com/" target="_blank" rel="nofollow" class="ky">四級</a>
<a title="新東方線上網路課堂" href="http://www.koolearn.com/" target="_self">新東方線上</a> > <a title="四級網路課堂" href="http://cet4.koolearn.com/" target="_self">四級</a> > <a href="http://cet4.koolearn.com/cihui/" title="英語四級詞彙">英語四級詞彙</a> > 正文
</span>
<a href="http://www.xdf.cn/" target="_blank" rel="nofollow" class="fr logo_p2"><img src="http://images.koolearn.com/fe_upload/2015_9_2_1441179317774.jpg" width="208" height="24"></a>
</div>
'''
soup = BeautifulSoup(html, 'lxml')
# Method 1: .contents returns the direct children as a list
print(soup.div.contents)
# Method 2: .children returns an iterator over the same direct children
print(soup.div.children)
for i, child in enumerate(soup.div.children):
    print(i, child)

The output is as follows:
['\n', <span class="fl" style="padding-top: 1px;"><a href="http://www.koolearn.com/" target="_blank" title="新東方線上網路課堂"><img alt="新東方線上網路課堂" height="48" src="http://images.koolearn.com/fe_upload/2015_9_2_1441179226504.jpg" width="105"/></a></span>, '\n', <span class="fl" style="padding-top: 6px;"> <a class="ky" href="http://cet4.koolearn.com/" rel="nofollow" target="_blank">四級</a> <a href="http://www.koolearn.com/" target="_self" title="新東方線上網路課堂">新東方線上</a> > <a href="http://cet4.koolearn.com/" target="_self" title="四級網路課堂">四級</a> > <a href="http://cet4.koolearn.com/cihui/" title="英語四級詞彙">英語四級詞彙</a> > 正文 </span>, '\n', <a class="fr logo_p2" href="http://www.xdf.cn/" rel="nofollow" target="_blank"><img height="24" src="http://images.koolearn.com/fe_upload/2015_9_2_1441179317774.jpg" width="208"/></a>, '\n']
<list_iterator object at 0x0000000002E498D0>
0
1 <span class="fl" style="padding-top: 1px;"><a href="http://www.koolearn.com/" target="_blank" title="新東方線上網路課堂"><img alt="新東方線上網路課堂" height="48" src="http://images.koolearn.com/fe_upload/2015_9_2_1441179226504.jpg" width="105"/></a></span>
2
3 <span class="fl" style="padding-top: 6px;"> <a class="ky" href="http://cet4.koolearn.com/" rel="nofollow" target="_blank">四級</a> <a href="http://www.koolearn.com/" target="_self" title="新東方線上網路課堂">新東方線上</a> > <a href="http://cet4.koolearn.com/" target="_self" title="四級網路課堂">四級</a> > <a href="http://cet4.koolearn.com/cihui/" title="英語四級詞彙">英語四級詞彙</a> > 正文 </span>
4
5 <a class="fr logo_p2" href="http://www.xdf.cn/" rel="nofollow" target="_blank"><img height="24" src="http://images.koolearn.com/fe_upload/2015_9_2_1441179317774.jpg" width="208"/></a>
6
Getting all descendant nodes
from bs4 import BeautifulSoup

html = '''
<div class="bc">
<span class="fl" style="padding-top: 1px;">
<a href="http://www.koolearn.com/" target="_blank" title="新東方線上網路課堂"><img src="http://images.koolearn.com/fe_upload/2015_9_2_1441179226504.jpg" width="105" height="48" alt="新東方線上網路課堂"></a></span>
<span class="fl" style="padding-top: 6px;">
<a href="http://cet4.koolearn.com/" target="_blank" rel="nofollow" class="ky">四級</a>
<a title="新東方線上網路課堂" href="http://www.koolearn.com/" target="_self">新東方線上</a> > <a title="四級網路課堂" href="http://cet4.koolearn.com/" target="_self">四級</a> > <a href="http://cet4.koolearn.com/cihui/" title="英語四級詞彙">英語四級詞彙</a> > 正文</span>
<a href="http://www.xdf.cn/" target="_blank" rel="nofollow" class="fr logo_p2"><img src="http://images.koolearn.com/fe_upload/2015_9_2_1441179317774.jpg" width="208" height="24"></a>
</div>
'''
soup = BeautifulSoup(html, 'lxml')
# .descendants is a generator that walks children, grandchildren, and so on
print(soup.div.descendants)
for i, child in enumerate(soup.div.descendants):
    print(i, child)

The output is as follows:
<generator object descendants at 0x00000000028F5AF0>
0
1 <span class="fl" style="padding-top: 1px;"> <a href="http://www.koolearn.com/" target="_blank" title="新東方線上網路課堂"><img alt="新東方線上網路課堂" height="48" src="http://images.koolearn.com/fe_upload/2015_9_2_1441179226504.jpg" width="105"/></a></span>
2
3 <a href="http://www.koolearn.com/" target="_blank" title="新東方線上網路課堂"><img alt="新東方線上網路課堂" height="48" src="http://images.koolearn.com/fe_upload/2015_9_2_1441179226504.jpg" width="105"/></a>
4 <img alt="新東方線上網路課堂" height="48" src="http://images.koolearn.com/fe_upload/2015_9_2_1441179226504.jpg" width="105"/>
5
6 <span class="fl" style="padding-top: 6px;"> <a class="ky" href="http://cet4.koolearn.com/" rel="nofollow" target="_blank">四級</a> <a href="http://www.koolearn.com/" target="_self" title="新東方線上網路課堂">新東方線上</a> > <a href="http://cet4.koolearn.com/" target="_self" title="四級網路課堂">四級</a> > <a href="http://cet4.koolearn.com/cihui/" title="英語四級詞彙">英語四級詞彙</a> > 正文</span>
7
8 <a class="ky" href="http://cet4.koolearn.com/" rel="nofollow" target="_blank">四級</a>
9 四級
10
11 <a href="http://www.koolearn.com/" target="_self" title="新東方線上網路課堂">新東方線上</a>
12 新東方線上
13 >
14 <a href="http://cet4.koolearn.com/" target="_self" title="四級網路課堂">四級</a>
15 四級
16 >
17 <a href="http://cet4.koolearn.com/cihui/" title="英語四級詞彙">英語四級詞彙</a>
18 英語四級詞彙
19 > 正文
20
21 <a class="fr logo_p2" href="http://www.xdf.cn/" rel="nofollow" target="_blank"><img height="24" src="http://images.koolearn.com/fe_upload/2015_9_2_1441179317774.jpg" width="208"/></a>
22 <img height="24" src="http://images.koolearn.com/fe_upload/2015_9_2_1441179317774.jpg" width="208"/>
23
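Both listings above are padded with whitespace-only text nodes, which is usually not what a scraper wants. One possible way to keep only real tags is to filter on the Tag type, as in this sketch; the markup and variable names are made up for illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical markup with newlines between the child tags
sample_html = '<div>\n<span><a href="/x">link</a></span>\n<p>text</p>\n</div>'
soup = BeautifulSoup(sample_html, 'html.parser')
container = soup.div

# .children yields direct children only, including the '\n' text nodes
direct_children = list(container.children)

# Keep only Tag objects to skip text nodes
child_tags = [c.name for c in container.children if isinstance(c, Tag)]

# .descendants also reaches the nested <a>
descendant_tags = [d.name for d in container.descendants if isinstance(d, Tag)]

print(len(direct_children))
print(child_tags)
print(descendant_tags)
```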
Parent and ancestor node operations
Getting the parent and ancestor nodes
from bs4 import BeautifulSoup

html = '''
<div class="bc">
<span class="fl" style="padding-top: 1px;">
<a href="http://www.koolearn.com/" target="_blank" title="新東方線上網路課堂"><img src="http://images.koolearn.com/fe_upload/2015_9_2_1441179226504.jpg" width="105" height="48" alt="新東方線上網路課堂"></a></span>
<span class="fl" style="padding-top: 6px;">
<a href="http://cet4.koolearn.com/" target="_blank" rel="nofollow" class="ky">四級</a>
<a title="新東方線上網路課堂" href="http://www.koolearn.com/" target="_self">新東方線上</a> > <a title="四級網路課堂" href="http://cet4.koolearn.com/" target="_self">四級</a> > <a href="http://cet4.koolearn.com/cihui/" title="英語四級詞彙">英語四級詞彙</a> > 正文</span>
<a href="http://www.xdf.cn/" target="_blank" rel="nofollow" class="fr logo_p2"><img src="http://images.koolearn.com/fe_upload/2015_9_2_1441179317774.jpg" width="208" height="24"></a>
</div>
'''
soup = BeautifulSoup(html, 'lxml')
print(soup.a.parent)   # Get the direct parent node
print(soup.a.parents)  # Get the ancestor nodes (a generator)

The output is as follows:
<span class="fl" style="padding-top: 1px;"> <a href="http://www.koolearn.com/" target="_blank" title="新東方線上網路課堂"><img alt="新東方線上網路課堂" height="48" src="http://images.koolearn.com/fe_upload/2015_9_2_1441179226504.jpg" width="105"/></a></span>
<generator object parents at 0x00000000028C5B48>
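Printing .parents only shows a generator object, so to see the ancestors it has to be iterated. A minimal sketch on made-up markup (names are illustrative); html.parser is used here, so no html or body wrapper tags are added.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: <a> inside <span> inside <div>
sample_html = '<div class="bc"><span class="fl"><a href="/x">link</a></span></div>'
soup = BeautifulSoup(sample_html, 'html.parser')

parent_name = soup.a.parent.name

# Iterating .parents walks upward; the last entry is the BeautifulSoup
# object itself, whose name is '[document]'.
ancestor_names = [ancestor.name for ancestor in soup.a.parents]

print(parent_name)
print(ancestor_names)
```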
Sibling node operations
Getting sibling nodes
from bs4 import BeautifulSoup

html = '''
<div class="more_box" id="moreBox">
<h3>360識圖</h3>
<a href="javascript:;" id="btnLoadMore" class="btn_loadmore">載入更多</a>
<p id="imgTotal" class="img_total">找到相關圖片約 2637 張</p>
</div>
'''
soup = BeautifulSoup(html, 'lxml')
print(soup.a.next_siblings)      # Get the siblings that come after this node
print(soup.a.previous_siblings)  # Get the siblings that come before this node

The output is as follows:
<generator object next_siblings at 0x0000000002885B48>
<generator object previous_siblings at 0x0000000002885B48>
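As with .parents, the sibling generators need to be iterated before anything useful appears. The sketch below uses made-up markup without whitespace between the tags, so the siblings are tags only; with the indented HTML above, newline text nodes would show up as siblings too.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: <h3>, <a> and <p> side by side inside one <div>
sample_html = '<div><h3>title</h3><a href="#">more</a><p>total</p></div>'
soup = BeautifulSoup(sample_html, 'html.parser')

# Siblings that come after the <a>
next_names = [sibling.name for sibling in soup.a.next_siblings]
# Siblings that come before the <a>
previous_names = [sibling.name for sibling in soup.a.previous_siblings]

print(next_names)
print(previous_names)
```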
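Python generators
# A list comprehension builds the whole list at once
l = [x * x for x in range(10)]
# A generator expression produces values lazily, one at a time
g = (x * x for x in range(10))
print(l)
print(g)

The output is as follows: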
[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
<generator object <genexpr> at 0x000000000251C468>
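l is a list and g is a generator. The most basic difference when creating them is that a list comprehension is written with [ ] and a generator expression with ( ).
To print the values one by one, call next() to get the generator's next value.
g = (x * x for x in range(10))
for i in range(10):
    print(next(g))

The output is as follows: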
0
1
4
9
16
25
36
49
64
81
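One detail the next() loop hides is that a generator can only be consumed once, and that calling next() on an exhausted generator raises StopIteration. A short sketch using only the standard library (the variable names are illustrative):

```python
squares = (x * x for x in range(3))

first_value = next(squares)           # consumes the first value
remaining = list(squares)             # consumes whatever is left
after_exhaustion = next(squares, None)  # a default avoids the exception

try:
    next(squares)                     # no default: raises StopIteration
    raised_stop_iteration = False
except StopIteration:
    raised_stop_iteration = True

print(first_value, remaining, after_exhaustion, raised_stop_iteration)
```

In practice a plain for loop over the generator is usually simpler, because it stops on its own when the values run out. The same applies to the Beautiful Soup generators shown earlier, such as .parents and .next_siblings.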
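Standard selector operations
# Search the document by tag name, attributes, or text content; find_all returns every match
find_all(name, attrs, recursive, string, **kwargs)
(string was called text in older versions of Beautiful Soup)

# Find all nodes whose tag is a
soup.find_all('a')
# Find all a nodes whose link has the form /view/123.htm (the second line needs import re)
soup.find_all('a', href='/view/123.htm')
soup.find_all('a', href=re.compile(r'/view/\d+\.htm'))
# Find all div nodes whose class is abc and whose text is python
soup.find_all('div', class_='abc', string='python')

Attributes:
# Get the tag name of a node that was found
node.name
# Get the href attribute of an a node that was found
node['href']
# Get the link text of an a node that was found
node.get_text()

find(name, attrs, recursive, string, **kwargs)
Searches by tag name, attributes, or text content in the same way as find_all, but returns only the first match.

find_parents() find_parent()
find_parents() returns all ancestor nodes; find_parent() returns the direct parent node.

find_next_siblings() find_next_sibling()
find_next_siblings() returns all following siblings; find_next_sibling() returns the first following sibling.

find_previous_siblings() find_previous_sibling()
find_previous_siblings() returns all preceding siblings; find_previous_sibling() returns the first preceding sibling.

find_all_next() find_next()
find_all_next() returns all matching nodes after the current node; find_next() returns the first matching node after it.

find_all_previous() find_previous()
find_all_previous() returns all matching nodes before the current node; find_previous() returns the first matching node before it.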
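The method list above is easier to follow with a concrete document. The sketch below runs several of the methods against made-up markup; sample_html, the /view/ links and all variable names are assumptions for illustration only.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup: one div and a list of three links
sample_html = (
    '<div class="abc">python</div>'
    '<ul>'
    '<li><a href="/view/1.htm">one</a></li>'
    '<li><a href="/view/22.htm">two</a></li>'
    '<li><a href="/about.htm">about</a></li>'
    '</ul>'
)
soup = BeautifulSoup(sample_html, 'html.parser')

# find_all with a regular expression on an attribute
view_links = soup.find_all('a', href=re.compile(r'/view/\d+\.htm'))
view_hrefs = [a['href'] for a in view_links]

# find with a class and a text condition
python_div = soup.find('div', class_='abc', string='python')

# Navigating from the second link
second_link = view_links[1]
second_item = second_link.find_parent('li')
next_item = second_item.find_next_sibling('li')          # the "about" item
previous_item = second_item.find_previous_sibling('li')  # the "one" item
following_link = second_link.find_next('a')              # first <a> after it
earlier_links = second_link.find_all_previous('a')       # every <a> before it

print(view_hrefs)
print(python_div.get_text())
print(next_item.get_text(), previous_item.get_text())
print(following_link['href'], [a['href'] for a in earlier_links])
```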
Test example:
import bs4

html_doc = '''<div class="container">
<a href="/pc/home?sign=360_79aabe15" class="logo"></a>
<nav id="nnav" data-mod="nnav">
<div class="nnav-wrap">
<ul class="nnav-items" id="nnav_main">
<li data-index="0"><a class="nnav-item" href="/pc/home?ch=youlike&sign=360_79aabe15" target="_blank" data-ch="youlike">推薦<span></span></a></li><li data-index="1"><a class="nnav-item" href="/pc/home?ch=good_safe2toera&sign=360_79aabe15" target="_blank" data-ch="good_safe2toera">新時代<span></span></a></li><li data-index="2"><a class="nnav-item" href="/pc/home?ch=fun&sign=360_79aabe15" target="_blank" data-ch="fun">娛樂<span></span></a></li><li data-index="3"><a class="nnav-item" href="/pc/home? data-index="4"><a class="nnav-item" href="/pc/home?ch=economy&sign=360_79aabe15" target="_blank" data-ch="economy">財經<span></span></a></li><li data-index="5"><a class="nnav-item" href="/pc/home?ch=estate&sign=360_79aabe15" target="_blank" data-ch="estate">房產<span></span></a></li><li data-index="6"><a class="nnav-item" href="/pc/home?ch=car&sign=360_79aabe15" target="_blank" data-ch="car">汽車<span></span></a></li><li data-index="7"><a class="nnav-item" href="/pc/home?ch=sport&sign=360_79aabe15" target="_blank" data-ch="sport">體育<span></span></a></li><li data-index="8"><a class="nnav-item" href="/pc/home?ch=domestic&sign=360_79aabe15" target="_blank" data-ch="domestic">國內'''

# Create the BeautifulSoup object
soup = bs4.BeautifulSoup(html_doc, 'html.parser')

# Get all links
links = soup.find_all('a')
for link in links:
    print(link.name, link['href'], link.get_text())

# Get the link whose href is /pc/home?sign=360_79aabe15
link_node = soup.find('a', href='/pc/home?sign=360_79aabe15')
print(link_node.name, link_node['href'], link_node.get_text())

The output is as follows:
a /pc/home?sign=360_79aabe15
a /pc/home?ch=youlike&sign=360_79aabe15 推薦
a /pc/home?ch=good_safe2toera&sign=360_79aabe15 新時代
a /pc/home?ch=fun&sign=360_79aabe15 娛樂
a /pc/home? data-index= 財經
a /pc/home?ch=economy&sign=360_79aabe15 財經
a /pc/home?ch=estate&sign=360_79aabe15 房產
a /pc/home?ch=car&sign=360_79aabe15 汽車
a /pc/home?ch=sport&sign=360_79aabe15 體育
a /pc/home?ch=domestic&sign=360_79aabe15 國內
a /pc/home?sign=360_79aabe15
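The example works because every a tag in that document has an href and the searched link exists. On real pages neither is guaranteed: link['href'] raises KeyError when the attribute is missing, and find() returns None when nothing matches. A defensive sketch on made-up markup (the names are illustrative):

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one normal link and one anchor without href
sample_html = '<a href="/home">Home</a><a name="top">anchor without href</a>'
soup = BeautifulSoup(sample_html, 'html.parser')

# .get() returns None instead of raising KeyError for a missing attribute
hrefs = [link.get('href') for link in soup.find_all('a')]

# href=True keeps only tags that actually have the attribute
links_with_href = soup.find_all('a', href=True)

# find() returns None when there is no match, so check before using the result
missing_link = soup.find('a', href='/nope')
missing_text = missing_link.get_text() if missing_link is not None else ''

print(hrefs)
print(len(links_with_href))
print(repr(missing_text))
```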