Python Web Scraping: Using the Beautiful Soup Parsing Library (Part 5)

Beautiful Soup: Introduction

Beautiful Soup is a third-party Python library for extracting data from HTML or XML. Official site: http://www.crummy.com/software/BeautifulSoup/

Installation: pip install beautifulsoup4

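Beautiful Soup: Syntax

soup = BeautifulSoup(html_doc, 'html.parser', from_encoding='utf-8')

First argument: the HTML document (a string or bytes)

Second argument: the HTML parser to use

Third argument: the encoding of the HTML document (only applies when the document is passed as bytes)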
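The constructor arguments described above can be sketched as follows. This is a minimal illustration, not the only way to call the constructor: the `café` snippet is a made-up sample, and `html.parser` is used so that no extra parser needs to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: UTF-8 encoded bytes, as they might arrive from a network response
raw = '<p>café</p>'.encode('utf-8')

# from_encoding tells Beautiful Soup how to decode the bytes
soup = BeautifulSoup(raw, 'html.parser', from_encoding='utf-8')
print(soup.p.string)  # café

# When the markup is already a str, no encoding argument is needed
soup_from_str = BeautifulSoup('<p>café</p>', 'html.parser')
print(soup_from_str.p.string)  # café
```

Passing `from_encoding` together with a `str` has no effect, because the text is already decoded.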

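Beautiful Soup: Usage

Tag selector operations

Note: a tag selector returns only the first matching tag; this is a defining characteristic of tag selectors.

Selecting elements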

from bs4 import BeautifulSoup
html_doc='''
<div class="container"> <a href="/pc/home?sign=360_79aabe15" class="logo"></a> <nav id="nnav" data-mod="nnav"> <div class="nnav-wrap"> <ul class="nnav-items" id="nnav_main"> <li data-index="0"><a class="nnav-item" href="/pc/home?ch=youlike&sign=360_79aabe15" target="_blank" data-ch="youlike">推薦<span></span></a></li><li data-index="1"><a class="nnav-item" href="/pc/home?ch=good_safe2toera&sign=360_79aabe15" target="_blank" data-ch="good_safe2toera">新時代<span></span></a></li><li data-index="2"><a class="nnav-item" href="/pc/home?ch=fun&sign=360_79aabe15" target="_blank" data-ch="fun">娛樂<span></span></a></li><li data-index="3"><a class="nnav-item" href="/pc/home?
data-index="4"><a class="nnav-item" href="/pc/home?ch=economy&sign=360_79aabe15" target="_blank" data-ch="economy">財經<span></span></a></li>
'''
soup = BeautifulSoup(html_doc, 'lxml')
# Parsing auto-completes the HTML; prettify() returns it as indented HTML
print(soup.prettify())
# Print the first a tag
print(soup.a)
# Print the first span tag
print(soup.span)

Output:

<html>
 <body>
  <div class="container">
   <a class="logo" href="/pc/home?sign=360_79aabe15">
   </a>
   <nav data-mod="nnav" id="nnav">
    <div class="nnav-wrap">
     <ul class="nnav-items" id="nnav_main">
      <li data-index="0">
       <a class="nnav-item" data-ch="youlike" href="/pc/home?ch=youlike&sign=360_79aabe15" target="_blank">
        推薦
        <span>
        </span>
       </a>
      </li>
      <li data-index="1">
       <a class="nnav-item" data-ch="good_safe2toera" href="/pc/home?ch=good_safe2toera&sign=360_79aabe15" target="_blank">
        新時代
        <span>
        </span>
       </a>
      </li>
      <li data-index="2">
       <a class="nnav-item" data-ch="fun" href="/pc/home?ch=fun&sign=360_79aabe15" target="_blank">
        娛樂
        <span>
        </span>
       </a>
      </li>
      <li data-index="3">
       <a class="nnav-item" href="/pc/home?
data-index=">
       </a>
       <a class="nnav-item" data-ch="economy" href="/pc/home?ch=economy&sign=360_79aabe15" target="_blank">
        財經
        <span>
        </span>
       </a>
      </li>
     </ul>
    </div>
   </nav>
  </div>
 </body>
</html>
<a class="logo" href="/pc/home?sign=360_79aabe15"></a>
<span></span>
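The output above shows that soup.a and soup.span each return a single tag. The sketch below contrasts a tag selector with find_all() and shows what happens when the tag does not exist. The two-link HTML snippet is a made-up sample, and html.parser is used here instead of lxml so the example runs without extra dependencies.

```python
from bs4 import BeautifulSoup

# Hypothetical sample with two links
html_doc = '<ul><li><a href="/a">A</a></li><li><a href="/b">B</a></li></ul>'
soup = BeautifulSoup(html_doc, 'html.parser')

first = soup.a                   # tag selector: only the first <a>
all_links = soup.find_all('a')   # every <a> in the document
missing = soup.table             # no <table> exists, so this is None

print(first['href'])                     # /a
print([a['href'] for a in all_links])    # ['/a', '/b']
print(missing)                           # None
```

Because a missing tag yields None, chaining such as soup.table.tr would raise AttributeError; checking for None first avoids that.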

Getting the name

Getting attributes

Getting content

from bs4 import BeautifulSoup
html_doc='''
<div class="container"> <a href="/pc/home?sign=360_79aabe15" class="logo"></a> <nav id="nnav" data-mod="nnav"> <div class="nnav-wrap"> <ul class="nnav-items" id="nnav_main"> <li data-index="0"><a class="nnav-item" href="/pc/home?ch=youlike&sign=360_79aabe15" target="_blank" data-ch="youlike">推薦<span></span></a></li><li data-index="1"><a class="nnav-item" href="/pc/home?ch=good_safe2toera&sign=360_79aabe15" target="_blank" data-ch="good_safe2toera">新時代<span></span></a></li><li data-index="2"><a class="nnav-item" href="/pc/home?ch=fun&sign=360_79aabe15" target="_blank" data-ch="fun">娛樂<span></span></a></li><li data-index="3"><a class="nnav-item" href="/pc/home?
data-index="4"><a class="nnav-item" href="/pc/home?ch=economy&sign=360_79aabe15" target="_blank" data-ch="economy">財經<span></span></a></li>
'''
soup = BeautifulSoup(html_doc,'lxml')
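# Print the name of the first a tag
print(soup.a.name)
# Print the class attribute of the first a tag; both forms below are equivalent
print(soup.a.attrs['class'])
print(soup.a['class'])
# Print the content of the first a tag (None here, because that tag is empty)
print(soup.a.string)

Output: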

a
['logo']
['logo']
None
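Two details of the output above are worth a closer look: class comes back as a list, and .string is None. The sketch below explores both, assuming a made-up two-link snippet and html.parser. It also uses get(), which returns None for a missing attribute where the square-bracket form would raise KeyError.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: one link with text plus a nested tag, one empty link
html_doc = '<a class="nav item" href="/x">Home<span>!</span></a><a href="/y"></a>'
soup = BeautifulSoup(html_doc, 'html.parser')
first, second = soup.find_all('a')

print(first['class'])       # ['nav', 'item']  (class is multi-valued, so a list)
print(first['href'])        # /x               (single-valued, so a str)
print(first.string)         # None             (the tag has more than one child)
print(first.get_text())     # Home!            (all nested text joined together)
print(second.string)        # None             (the tag has no children)
print(second.get('class'))  # None             (attribute is absent)
```

In short, .string is only useful when a tag has exactly one child; get_text() is the more forgiving choice for mixed content.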

Nested selection

from bs4 import BeautifulSoup
html_doc='''
<a class="nnav-item" href="/pc/home?ch=youlike&sign=360_79aabe15" target="_blank" data-ch="youlike"><span>推薦</span></a>
'''
soup = BeautifulSoup(html_doc,'lxml')
print(soup.a.span.string)

Output:

推薦

Child and descendant node operations

Getting all child nodes

from bs4 import BeautifulSoup
html='''
<div class="bc">
    <span class="fl" style="padding-top: 1px;"><a href="http://www.koolearn.com/" target="_blank" title="新東方線上網路課堂"><img src="http://images.koolearn.com/fe_upload/2015_9_2_1441179226504.jpg" width="105" height="48" alt="新東方線上網路課堂"></a></span>
    <span class="fl" style="padding-top: 6px;">
        <a href="http://cet4.koolearn.com/" target="_blank" rel="nofollow" class="ky">四級</a>
        <a title="新東方線上網路課堂" href="http://www.koolearn.com/" target="_self">新東方線上</a> >
        <a title="四級網路課堂" href="http://cet4.koolearn.com/" target="_self">四級</a> >
        <a href="http://cet4.koolearn.com/cihui/" title="英語四級詞彙">英語四級詞彙</a> > 正文
    </span>
    <a href="http://www.xdf.cn/" target="_blank" rel="nofollow" class="fr logo_p2"><img src="http://images.koolearn.com/fe_upload/2015_9_2_1441179317774.jpg" width="208" height="24"></a> 
</div>
'''

soup = BeautifulSoup(html,'lxml')
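# Method 1: .contents returns the children as a list
print(soup.div.contents)
# Method 2: .children returns an iterator over the same children
print(soup.div.children)
for i,child in enumerate(soup.div.children):
   print(i,child)

Output: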

['\n', <span class="fl" style="padding-top: 1px;"><a href="http://www.koolearn.com/" target="_blank" title="新東方線上網路課堂"><img alt="新東方線上網路課堂" height="48" src="http://images.koolearn.com/fe_upload/2015_9_2_1441179226504.jpg" width="105"/></a></span>, '\n', <span class="fl" style="padding-top: 6px;">
<a class="ky" href="http://cet4.koolearn.com/" rel="nofollow" target="_blank">四級</a>
<a href="http://www.koolearn.com/" target="_self" title="新東方線上網路課堂">新東方線上</a> >
        <a href="http://cet4.koolearn.com/" target="_self" title="四級網路課堂">四級</a> >
        <a href="http://cet4.koolearn.com/cihui/" title="英語四級詞彙">英語四級詞彙</a> > 正文
    </span>, '\n', <a class="fr logo_p2" href="http://www.xdf.cn/" rel="nofollow" target="_blank"><img height="24" src="http://images.koolearn.com/fe_upload/2015_9_2_1441179317774.jpg" width="208"/></a>, '\n']
<list_iterator object at 0x0000000002E498D0>
0 

1 <span class="fl" style="padding-top: 1px;"><a href="http://www.koolearn.com/" target="_blank" title="新東方線上網路課堂"><img alt="新東方線上網路課堂" height="48" src="http://images.koolearn.com/fe_upload/2015_9_2_1441179226504.jpg" width="105"/></a></span>
2 

3 <span class="fl" style="padding-top: 6px;">
<a class="ky" href="http://cet4.koolearn.com/" rel="nofollow" target="_blank">四級</a>
<a href="http://www.koolearn.com/" target="_self" title="新東方線上網路課堂">新東方線上</a> >
        <a href="http://cet4.koolearn.com/" target="_self" title="四級網路課堂">四級</a> >
        <a href="http://cet4.koolearn.com/cihui/" title="英語四級詞彙">英語四級詞彙</a> > 正文
    </span>
4 

5 <a class="fr logo_p2" href="http://www.xdf.cn/" rel="nofollow" target="_blank"><img height="24" src="http://images.koolearn.com/fe_upload/2015_9_2_1441179317774.jpg" width="208"/></a>
6 
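In the output above, every other child is a bare newline: the whitespace between tags is itself a text node. A minimal sketch of keeping only the tag children follows; the small div is a made-up sample, and html.parser is assumed.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical sample with whitespace between the child tags
html = '''
<div>
    <span>one</span>
    <a href="/two">two</a>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')

# .children yields text nodes (here: whitespace) as well as tags
all_children = list(soup.div.children)

# Keep only real tags by checking the node type
tag_children = [c for c in soup.div.children if isinstance(c, Tag)]

print(len(all_children))               # 5
print([c.name for c in tag_children])  # ['span', 'a']
```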

Getting all descendant nodes

from bs4 import BeautifulSoup
html='''
<div class="bc">
    <span class="fl" style="padding-top: 1px;">
      <a href="http://www.koolearn.com/" target="_blank" title="新東方線上網路課堂"><img src="http://images.koolearn.com/fe_upload/2015_9_2_1441179226504.jpg" width="105" height="48" alt="新東方線上網路課堂"></a></span>
      <span class="fl" style="padding-top: 6px;">
    <a href="http://cet4.koolearn.com/" target="_blank" rel="nofollow" class="ky">四級</a>
    <a title="新東方線上網路課堂" href="http://www.koolearn.com/" target="_self">新東方線上</a> >
    <a title="四級網路課堂" href="http://cet4.koolearn.com/" target="_self">四級</a> >
    <a href="http://cet4.koolearn.com/cihui/" title="英語四級詞彙">英語四級詞彙</a> > 正文</span>
    <a href="http://www.xdf.cn/" target="_blank" rel="nofollow" class="fr logo_p2"><img src="http://images.koolearn.com/fe_upload/2015_9_2_1441179317774.jpg" width="208" height="24"></a>  </div>
'''
soup = BeautifulSoup(html,'lxml')
print(soup.div.descendants)
for i,child in enumerate(soup.div.descendants):
   print(i,child)

Output:

<generator object descendants at 0x00000000028F5AF0>
0 

1 <span class="fl" style="padding-top: 1px;">
<a href="http://www.koolearn.com/" target="_blank" title="新東方線上網路課堂"><img alt="新東方線上網路課堂" height="48" src="http://images.koolearn.com/fe_upload/2015_9_2_1441179226504.jpg" width="105"/></a></span>
2 

3 <a href="http://www.koolearn.com/" target="_blank" title="新東方線上網路課堂"><img alt="新東方線上網路課堂" height="48" src="http://images.koolearn.com/fe_upload/2015_9_2_1441179226504.jpg" width="105"/></a>
4 <img alt="新東方線上網路課堂" height="48" src="http://images.koolearn.com/fe_upload/2015_9_2_1441179226504.jpg" width="105"/>
5 

6 <span class="fl" style="padding-top: 6px;">
<a class="ky" href="http://cet4.koolearn.com/" rel="nofollow" target="_blank">四級</a>
<a href="http://www.koolearn.com/" target="_self" title="新東方線上網路課堂">新東方線上</a> >
    <a href="http://cet4.koolearn.com/" target="_self" title="四級網路課堂">四級</a> >
    <a href="http://cet4.koolearn.com/cihui/" title="英語四級詞彙">英語四級詞彙</a> > 正文</span>
7 

8 <a class="ky" href="http://cet4.koolearn.com/" rel="nofollow" target="_blank">四級</a>
9 四級
10 

11 <a href="http://www.koolearn.com/" target="_self" title="新東方線上網路課堂">新東方線上</a>
12 新東方線上
13  >
    
14 <a href="http://cet4.koolearn.com/" target="_self" title="四級網路課堂">四級</a>
15 四級
16  >
    
17 <a href="http://cet4.koolearn.com/cihui/" title="英語四級詞彙">英語四級詞彙</a>
18 英語四級詞彙
19  > 正文
20 

21 <a class="fr logo_p2" href="http://www.xdf.cn/" rel="nofollow" target="_blank"><img height="24" src="http://images.koolearn.com/fe_upload/2015_9_2_1441179317774.jpg" width="208"/></a>
22 <img height="24" src="http://images.koolearn.com/fe_upload/2015_9_2_1441179317774.jpg" width="208"/>
23  
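The difference between children and descendants is easier to see on a small tree. The sketch below assumes a made-up snippet without whitespace between tags and uses html.parser: .children stops at the first level, while .descendants walks the whole subtree in document order, text nodes included.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical sample: div > (span > a > text), (p > text)
html = '<div><span><a href="/x">link</a></span><p>text</p></div>'
soup = BeautifulSoup(html, 'html.parser')

# Direct children only
children = [c.name for c in soup.div.children]

# Every nested node, filtered down to tags
descendant_tags = [d.name for d in soup.div.descendants if isinstance(d, Tag)]

# Tags plus text nodes
descendant_count = len(list(soup.div.descendants))

print(children)          # ['span', 'p']
print(descendant_tags)   # ['span', 'a', 'p']
print(descendant_count)  # 5
```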

Parent and ancestor node operations

Getting parent and ancestor nodes

from bs4 import BeautifulSoup
html='''
<div class="bc">
    <span class="fl" style="padding-top: 1px;">
      <a href="http://www.koolearn.com/" target="_blank" title="新東方線上網路課堂"><img src="http://images.koolearn.com/fe_upload/2015_9_2_1441179226504.jpg" width="105" height="48" alt="新東方線上網路課堂"></a></span>
      <span class="fl" style="padding-top: 6px;">
    <a href="http://cet4.koolearn.com/" target="_blank" rel="nofollow" class="ky">四級</a>
    <a title="新東方線上網路課堂" href="http://www.koolearn.com/" target="_self">新東方線上</a> >
    <a title="四級網路課堂" href="http://cet4.koolearn.com/" target="_self">四級</a> >
    <a href="http://cet4.koolearn.com/cihui/" title="英語四級詞彙">英語四級詞彙</a> > 正文</span>
    <a href="http://www.xdf.cn/" target="_blank" rel="nofollow" class="fr logo_p2"><img src="http://images.koolearn.com/fe_upload/2015_9_2_1441179317774.jpg" width="208" height="24"></a>  </div>
'''
soup = BeautifulSoup(html,'lxml')
print(soup.a.parent)  # get the parent node
print(soup.a.parents) # get the ancestor nodes (a generator)

Output:

<span class="fl" style="padding-top: 1px;">
<a href="http://www.koolearn.com/" target="_blank" title="新東方線上網路課堂"><img alt="新東方線上網路課堂" height="48" src="http://images.koolearn.com/fe_upload/2015_9_2_1441179226504.jpg" width="105"/></a></span>
<generator object parents at 0x00000000028C5B48>
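Printing .parents only shows the generator object; to see the ancestors it has to be iterated. A minimal sketch follows, assuming a simplified version of the sample and html.parser (with lxml the chain would also include the body and html tags that the parser adds).

```python
from bs4 import BeautifulSoup

# Simplified, hypothetical version of the sample above
html = '<div class="bc"><span class="fl"><a href="/x">link</a></span></div>'
soup = BeautifulSoup(html, 'html.parser')

parent_name = soup.a.parent.name

# Walk upward from the a tag to the top of the tree
ancestor_names = [p.name for p in soup.a.parents]

print(parent_name)     # span
print(ancestor_names)  # ['span', 'div', '[document]']
```

The last entry is the BeautifulSoup object itself, whose name is '[document]'.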

Sibling node operations

Getting sibling nodes

from bs4 import BeautifulSoup
html='''
<div class="more_box" id="moreBox">
       <h3>360識圖</h3>
        <a href="javascript:;" id="btnLoadMore" class="btn_loadmore">載入更多</a>
        <p id="imgTotal" class="img_total">找到相關圖片約 2637 張</p>
</div>
'''
soup = BeautifulSoup(html,'lxml')
print(soup.a.next_siblings)     # siblings after this node (a generator)
print(soup.a.previous_siblings) # siblings before this node (a generator)

Output:

<generator object next_siblings at 0x0000000002885B48>
<generator object previous_siblings at 0x0000000002885B48>
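Both calls above print only generator objects. The sketch below materializes them with list() to show which direction each one walks. It assumes a compact, made-up version of the sample with no whitespace between tags, and html.parser; with the original indented HTML, newline text nodes would appear among the siblings as well.

```python
from bs4 import BeautifulSoup

# Hypothetical compact sample: h3, a, p are siblings inside one div
html = '<div><h3>title</h3><a href="#">more</a><p>total</p></div>'
soup = BeautifulSoup(html, 'html.parser')

following = list(soup.a.next_siblings)      # nodes after the a tag
preceding = list(soup.a.previous_siblings)  # nodes before the a tag

print([s.name for s in following])  # ['p']
print([s.name for s in preceding])  # ['h3']
```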

Python generators

l = [x * x for x in range(10)]
g = (x * x for x in range(10))
print(l)
print(g)

Output:

[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
<generator object <genexpr> at 0x000000000251C468>

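l is a list and g is a generator. The most basic difference when creating them is that a list comprehension uses [ ] while a generator expression uses ( ).

To print the values one by one, call next() to get the generator's next value.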

g = (x * x for x in range(10))
for i in range(10):
   print(next(g))

Output:

0
1
4
9
16
25
36
49
64
81
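Calling next() a fixed number of times works only when the length is known in advance. A minimal sketch of the more common pattern follows: iterate with a loop or comprehension, which stops automatically, and keep in mind that a generator can be consumed only once. The variable names are illustrative.

```python
squares = (x * x for x in range(3))

# A for loop or comprehension stops by itself when the generator is exhausted
collected = [value for value in squares]
print(collected)  # [0, 1, 4]

# The generator is now used up, so a second pass yields nothing
leftover = list(squares)
print(leftover)   # []

# Calling next() on an exhausted generator raises StopIteration
try:
    next(squares)
    exhausted = False
except StopIteration:
    exhausted = True
print(exhausted)  # True
```

The same applies to the generators returned by .parents, .next_siblings and .descendants: wrap them in list() if the nodes are needed more than once.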

Standard selector operations

# Searches the document by tag name, attributes, or text content and returns all matches
find_all(name, attrs, recursive, string, limit, **kwargs)


# Find all nodes whose tag is a
soup.find_all('a')

# Find all a nodes whose link has the form /view/123.htm (the second line needs import re)
soup.find_all('a',href='/view/123.htm')
soup.find_all('a',href=re.compile(r'/view/\d+\.htm'))

# Find all div nodes whose class is abc and whose text is python
soup.find_all('div',class_='abc',string='python')

Attributes:
# Get the tag name of the node found
node.name

# Get the href attribute of the a node found
node['href']

# Get the link text of the a node found
node.get_text()
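The find_all() fragments above can be combined into one runnable sketch. The HTML here is a made-up sample built to match those fragments, and html.parser is assumed; limit is a standard find_all() argument that caps the number of results.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample matching the patterns discussed above
html = '''
<div class="abc">python</div>
<a href="/view/123.htm">Python</a>
<a href="/view/456.htm">Scrapy</a>
<a href="/about.htm">About</a>
'''
soup = BeautifulSoup(html, 'html.parser')

# Match a tags whose href fits a regular expression
view_links = soup.find_all('a', href=re.compile(r'/view/\d+\.htm'))
for node in view_links:
    print(node.name, node['href'], node.get_text())

# Combine tag name, class and exact text
divs = soup.find_all('div', class_='abc', string='python')

# Stop after the first match
first_only = soup.find_all('a', limit=1)

print(len(divs), len(first_only))  # 1 1
```

Note that class_ has a trailing underscore because class is a reserved word in Python.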


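find(name, attrs, recursive, string, **kwargs)
Searches the document by tag name, attributes, or text content. It is used much like find_all, but returns only the first match.

find_parents() find_parent()
find_parents() returns all ancestor nodes; find_parent() returns the direct parent node.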

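find_next_siblings() find_next_sibling()
find_next_siblings() returns all following sibling nodes; find_next_sibling() returns the first following sibling node.

find_previous_siblings() find_previous_sibling()
find_previous_siblings() returns all preceding sibling nodes; find_previous_sibling() returns the first preceding sibling node.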

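find_all_next() find_next()
find_all_next() returns all matching nodes after the current node; find_next() returns the first matching node after it.

find_all_previous() find_previous()
find_all_previous() returns all matching nodes before the current node; find_previous() returns the first matching node before it.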
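The directional find methods listed above can be tried on a small tree. The sketch below assumes a made-up snippet with no whitespace between tags and html.parser; each call starts from the a tag and searches in a different direction.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: h3, a and two p tags are siblings inside one div
html = '<div id="box"><h3>title</h3><a href="#">more</a><p>first</p><p>second</p></div>'
soup = BeautifulSoup(html, 'html.parser')
a = soup.a

parent = a.find_parent('div')            # nearest matching ancestor
next_p = a.find_next_sibling('p')        # first following sibling that is a p
all_next_p = a.find_next_siblings('p')   # all following siblings that are p
prev = a.find_previous_sibling()         # first preceding sibling tag
after = a.find_next('p')                 # first p anywhere later in the document
before = a.find_all_previous('h3')       # all h3 tags earlier in the document

print(parent['id'])                      # box
print(next_p.string)                     # first
print([p.string for p in all_next_p])    # ['first', 'second']
print(prev.name)                         # h3
print(after.string)                      # first
print([t.name for t in before])          # ['h3']
```

The sibling methods stay on the same level of the tree, while find_next() and find_all_previous() follow document order and can cross levels.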

Test example:

import bs4
html_doc='''
<div class="container"> <a href="/pc/home?sign=360_79aabe15" class="logo"></a> <nav id="nnav" data-mod="nnav"> <div class="nnav-wrap"> <ul class="nnav-items" id="nnav_main"> <li data-index="0"><a class="nnav-item" href="/pc/home?ch=youlike&sign=360_79aabe15" target="_blank" data-ch="youlike">推薦<span></span></a></li><li data-index="1"><a class="nnav-item" href="/pc/home?ch=good_safe2toera&sign=360_79aabe15" target="_blank" data-ch="good_safe2toera">新時代<span></span></a></li><li data-index="2"><a class="nnav-item" href="/pc/home?ch=fun&sign=360_79aabe15" target="_blank" data-ch="fun">娛樂<span></span></a></li><li data-index="3"><a class="nnav-item" href="/pc/home?
data-index="4"><a class="nnav-item" href="/pc/home?ch=economy&sign=360_79aabe15" target="_blank" data-ch="economy">財經<span></span></a></li><li data-index="5"><a class="nnav-item" href="/pc/home?ch=estate&sign=360_79aabe15" target="_blank" data-ch="estate">房產<span></span></a></li><li data-index="6"><a class="nnav-item" href="/pc/home?ch=car&sign=360_79aabe15" target="_blank" data-ch="car">汽車<span></span></a></li><li data-index="7"><a class="nnav-item" href="/pc/home?ch=sport&sign=360_79aabe15" target="_blank" data-ch="sport">體育<span></span></a></li><li data-index="8"><a class="nnav-item" href="/pc/home?ch=domestic&sign=360_79aabe15" target="_blank" data-ch="domestic">國內
'''
# Create the BeautifulSoup object
soup = bs4.BeautifulSoup(html_doc, 'html.parser')
# Get all links
links = soup.find_all('a')
for link in links:
    print(link.name, link['href'], link.get_text())
# Get the link whose href is /pc/home?sign=360_79aabe15
link_node = soup.find('a', href='/pc/home?sign=360_79aabe15')
print(link_node.name, link_node['href'], link_node.get_text())

Output:

a /pc/home?sign=360_79aabe15 
a /pc/home?ch=youlike&sign=360_79aabe15 推薦
a /pc/home?ch=good_safe2toera&sign=360_79aabe15 新時代
a /pc/home?ch=fun&sign=360_79aabe15 娛樂
a /pc/home?
data-index= 財經
a /pc/home?ch=economy&sign=360_79aabe15 財經
a /pc/home?ch=estate&sign=360_79aabe15 房產
a /pc/home?ch=car&sign=360_79aabe15 汽車
a /pc/home?ch=sport&sign=360_79aabe15 體育
a /pc/home?ch=domestic&sign=360_79aabe15 國內

a /pc/home?sign=360_79aabe15