
Notes on "Web Scraping with Python": BeautifulSoup


I. A First Look at Web Scrapers

All examples use Python 3.

A simple example:

from urllib.request import urlopen
html = urlopen("http://pythonscraping.com/pages/page1.html")
print(html.read())

Python 2.x used the urllib2 library. In Python 3.x, its functionality lives in the urllib package, which is split into submodules: urllib.request, urllib.parse, and urllib.error.
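
As a small illustration of that split, the sketch below touches the two submodules the first example does not use. The relative link "page3.html" is assumed for illustration, and nothing here contacts the network.

```python
from urllib.error import HTTPError, URLError
from urllib.parse import urljoin, urlparse

# urllib.parse: split a URL into components and resolve a relative link.
page_url = "http://pythonscraping.com/pages/page1.html"
parts = urlparse(page_url)
print(parts.scheme, parts.netloc, parts.path)

# "page3.html" is a hypothetical relative link found on page_url.
absolute = urljoin(page_url, "page3.html")
print(absolute)

# urllib.error: HTTPError is a subclass of URLError, so catching URLError
# alone also covers HTTP status errors such as 404.
print(issubclass(HTTPError, URLError))
```

Because HTTPError is the more specific class, code that wants to treat the two cases differently should list it before URLError in its except clauses.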

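II. BeautifulSoup

1. Using BeautifulSoup

Notes: (1) Install the module with pip install beautifulsoup4.

(2) Establish a reliable network connection and handle the exceptions the program may raise.

As in the following example: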

from urllib.error import HTTPError
from urllib.request import urlopen
from bs4 import BeautifulSoup


def getTitle(url):
    # Return the first h1 inside body, or None if the page or the tag is missing
    try:
        html = urlopen(url)
    except HTTPError as e:
        return None
    try:
        bsobj = BeautifulSoup(html.read(), "html.parser")
        title = bsobj.body.h1
    except AttributeError as e:
        return None
    return title


title = getTitle("http://pythonscraping.com/pages/page1.html")
if title is None:
    print("title was not found")
else:
    print(title)
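
The function above only catches HTTPError, which covers error responses such as 404 or 500. When the server cannot be reached at all, urlopen raises URLError instead. Below is a minimal sketch of a fetch helper that handles both. The name fetch_bytes is hypothetical, and the file: and data: URLs are stand-ins chosen so the sketch runs without a network connection.

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def fetch_bytes(url):
    """Return the response body as bytes, or None if the request fails.

    fetch_bytes is a hypothetical helper, not part of urllib.
    """
    try:
        with urlopen(url) as response:
            return response.read()
    except HTTPError:
        # The server answered, but with an error status (404, 500, ...).
        return None
    except URLError:
        # The server could not be reached, or the resource could not be opened.
        return None


# A file: URL pointing at a path assumed not to exist raises URLError.
missing = fetch_bytes("file:///no/such/dir/no-such-file.html")
# A data: URL carries its own payload, so no server is involved.
inline = fetch_bytes("data:text/plain,hello")
print(missing, inline)
```

Returning None for both failures keeps the caller's check simple (if result is None), at the cost of hiding which error occurred. Logging the exception before returning would be a reasonable variation.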

2. A scraper can select specific content by the value of the class attribute

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://pythonscraping.com/pages/warandpeace.html")

bsobj = BeautifulSoup(html, "html.parser")

# Use findAll on the bsobj object to extract the span tags whose class attribute is "red"
contentList = bsobj.findAll("span", {"class": "red"})

for content in contentList:
    print(content.get_text())
    print("\n")
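
To see what the class filter selects without fetching the page, here is a small offline sketch. The HTML snippet is invented for illustration, and the code uses find_all, the current spelling of findAll in BeautifulSoup 4. It also shows that a tag carrying several classes still matches when one of them is red.

```python
from bs4 import BeautifulSoup

# Invented sample markup; the real page is much longer.
sample = (
    '<p><span class="red">Anna Pavlovna</span> spoke to '
    '<span class="green">the prince</span> and '
    '<span class="red bold">Pierre</span>.</p>'
)

soup = BeautifulSoup(sample, "html.parser")

# class is a multi-valued attribute, so "red bold" also matches "red".
red_texts = [span.get_text() for span in soup.find_all("span", {"class": "red"})]
print(red_texts)
```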

3. Navigating the tree

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://pythonscraping.com/pages/page3.html")
bsobj = BeautifulSoup(html, "html.parser")


# Find the child tags of the table
for child in bsobj.find("table", {"id": "giftList"}).children:
    print(child)

# Find the sibling tags that follow the first row
for sibling in bsobj.find("table", {"id": "giftList"}).tr.next_siblings:
    print(sibling)

# Print the text of every h2 heading
for h2title in bsobj.findAll("h2"):
    print(h2title.get_text())

print(bsobj.find("img",{"src":"../img/gifts/img1.jpg"}).parent.previous_sibling.get_text())
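
The navigation attributes used above are easier to follow on a tiny table. The sketch below uses invented markup that loosely mimics the gift list. It is written without whitespace between tags, an assumption that keeps whitespace text nodes out of the children and siblings.

```python
from bs4 import BeautifulSoup

# Invented sample table; tags are adjacent so no whitespace nodes appear.
sample = (
    '<table id="giftList">'
    '<tr><th>Item</th><th>Price</th><th>Image</th></tr>'
    '<tr><td>Vase</td><td>$15.00</td><td><img src="../img/gifts/img1.jpg"></td></tr>'
    '<tr><td>Rock</td><td>$0.50</td><td><img src="../img/gifts/img2.jpg"></td></tr>'
    '</table>'
)

soup = BeautifulSoup(sample, "html.parser")
table = soup.find("table", {"id": "giftList"})

# .children yields only the direct children: here, the three rows.
child_count = len(list(table.children))

# .tr is the first row (the header); .next_siblings yields the rows after it.
item_names = [row.td.get_text() for row in table.tr.next_siblings]

# From the image: .parent is its td cell, and .previous_sibling of that
# cell is the price cell to its left.
img = soup.find("img", {"src": "../img/gifts/img1.jpg"})
price = img.parent.previous_sibling.get_text()

print(child_count, item_names, price)
```

On a real page, line breaks between tags show up as text nodes, so loops over .children or .next_siblings may also yield strings that need to be skipped.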

5. Regular expressions and BeautifulSoup

from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

html = urlopen("http://pythonscraping.com/pages/page3.html")
bsobj = BeautifulSoup(html, "html.parser")
# findAll returns a list-like ResultSet of the matching img tags
images = bsobj.findAll("img", {"src": re.compile(r"\.\./img/gifts/img.*\.jpg")})
for image in images:
    print(image["src"])
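
As a quick offline check of how the pattern filters attribute values, the sketch below runs the same regular expression against a few invented img tags. The sample paths are assumptions made for illustration, and find_all is the current spelling of findAll.

```python
import re

from bs4 import BeautifulSoup

# Invented sample markup: only two of the four paths start with gifts/img.
sample = (
    '<img src="../img/gifts/logo.jpg">'
    '<img src="../img/gifts/img1.jpg">'
    '<img src="../img/gifts/img2.jpg">'
    '<img src="../img/banner.jpg">'
)

soup = BeautifulSoup(sample, "html.parser")

# A raw string keeps the backslashes intact for the regex engine.
pattern = re.compile(r"\.\./img/gifts/img.*\.jpg")

matched = [img["src"] for img in soup.find_all("img", {"src": pattern})]
print(matched)
```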
