Python and BeautifulSoup, Part 2: Scraping Tags by Attribute Value (find_all('tag', attrs={'class':'value'}))

System: Windows / Python 2.7.11

Use the BeautifulSoup library to pull specific tags out of a page,

then read the values of particular attributes on those tags.

Sample tag:

<cc>            
<div id="post_content_79076951035" class="d_post_content j_d_post_content ">            進來呀<br>都是自己喜歡的<br>拿圖就走你是狗
<br><img class="BDE_Image" src="http://imgsrc.baidu.com/forum/w%3D580/sign=f4a2042b3c87e9504217f3642039531b/55f8e6cd7b899e514d1131fc44a7d933c9950db8.jpg" size="20418" height="852" width="480">
<br><img class="BDE_Image" src="http://imgsrc.baidu.com/forum/w%3D580/sign=914d48d14d36acaf59e096f44cd88d03/6a57b319ebc4b745190bbcfec9fc1e178b8215b8.jpg" size="12400" height="600" width="400">
<br><img class="BDE_Image" src="http://imgsrc.baidu.com/forum/w%3D580/sign=522fecd8bca1cd1105b672288910c8b0/6c318744ebf81a4cfbfce421d12a6059242da60a.jpg" size="21266" height="852" width="479"></div>
<br>
</cc>
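
Before touching the live page, the attribute filter can be tried offline on a cut-down copy of the sample tag. This is only a minimal sketch: the helper name image_urls, the SAMPLE_HTML string, and the example.com URLs are made up for illustration, and the `__future__` import is added so the snippet runs on Python 2.7 as well as Python 3.

```python
from __future__ import print_function  # same file runs on Python 2.7 and 3

from bs4 import BeautifulSoup

# Shortened stand-in for the sample markup; the example.com URLs are placeholders.
SAMPLE_HTML = """
<cc>
<div id="post_content_1" class="d_post_content j_d_post_content ">
some text<br>
<img class="BDE_Image" src="http://example.com/a.jpg" width="480">
<img class="emoticon" src="http://example.com/smiley.png">
<img class="BDE_Image" src="http://example.com/b.jpg" width="400">
</div>
</cc>
"""


def image_urls(html):
    """Return the src of every <img> whose class attribute is BDE_Image."""
    soup = BeautifulSoup(html, "html.parser")
    # attrs maps attribute name -> required value; only matching tags are returned.
    tags = soup.find_all('img', attrs={'class': 'BDE_Image'})
    # tag['src'] reads one attribute value, just like a dictionary lookup.
    return [tag['src'] for tag in tags]


print(image_urls(SAMPLE_HTML))
```

The img tag with class "emoticon" is filtered out, so only the two BDE_Image URLs come back.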

=============================== Code follows ==================================


# coding=utf-8
import urllib2
from bs4 import BeautifulSoup


def getImg(url):
    # Download the page and parse it with Python's built-in HTML parser.
    html = urllib2.urlopen(url)
    page = html.read()
    soup = BeautifulSoup(page, "html.parser")
    # find_all('cc') returns a list of every <cc> tag: [<cc>...</cc>, <cc>...</cc>, ...]
    for s in soup.find_all('cc'):
        # Skip any <cc> block that contains no img tag.
        if 'img' not in str(s):
            continue
        # Collect every <img> in this block whose class attribute is "BDE_Image".
        for img in s.find_all('img', attrs={'class': 'BDE_Image'}):
            # Print the value of the src attribute, i.e. the image URL (http://...).
            print(img.attrs['src'])


url = 'http://tieba.baidu.com/p/4161148236?fr=frs'
getImg(url)
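
A side note on the `'img' not in str(s)` check: find_all returns an empty list when nothing matches, so the inner loop simply does nothing for a cc block without images. The sketch below compares the nested find_all approach (without the pre-check) against a CSS selector passed to select(), which the code above does not use. The function names and the inline HTML are hypothetical, and select() is assumed to be available, which on recent bs4 releases means the soupsieve package is installed.

```python
from __future__ import print_function  # same file runs on Python 2.7 and 3

from bs4 import BeautifulSoup

# Made-up page: three <cc> blocks plus one image outside any <cc>.
HTML = """
<cc><div><img class="BDE_Image" src="http://example.com/1.jpg"></div></cc>
<cc><div>text only, no images</div></cc>
<cc><div><img class="BDE_Image" src="http://example.com/2.jpg">
<img class="other" src="http://example.com/3.jpg"></div></cc>
<img class="BDE_Image" src="http://example.com/outside.jpg">
"""


def image_urls_nested(html):
    """Nested find_all calls, without the string pre-check."""
    soup = BeautifulSoup(html, "html.parser")
    urls = []
    for block in soup.find_all('cc'):
        # An empty result list just means this loop body never runs.
        for img in block.find_all('img', attrs={'class': 'BDE_Image'}):
            urls.append(img['src'])
    return urls


def image_urls_select(html):
    """Same result with one CSS selector: img.BDE_Image nested inside cc."""
    soup = BeautifulSoup(html, "html.parser")
    return [img['src'] for img in soup.select('cc img.BDE_Image')]


print(image_urls_nested(HTML))
print(image_urls_select(HTML))
```

Both functions skip the image that sits outside any cc tag and the one whose class is "other", so they return the same two URLs.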

========================================end========================================