Notes on "Web Scraping with Python": BeautifulSoup
阿新 • Published: 2017-07-23
I. A First Look at Web Scrapers
All examples use Python 3.
A simple example:
from urllib.request import urlopen

html = urlopen("http://pythonscraping.com/pages/page1.html")
print(html.read())
Python 2.x provided this functionality in the urllib2 library. In Python 3.x, urllib2 was renamed urllib and split into several submodules, including urllib.request, urllib.parse, and urllib.error.
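To make the split concrete, here is a minimal sketch that touches urllib.parse and urllib.error without making a network request. The variable names and the relative image path are illustrative choices, not taken from the book.

```python
from urllib.error import HTTPError, URLError
from urllib.parse import urljoin, urlparse

# Break the example URL into its components with urllib.parse.
url = "http://pythonscraping.com/pages/page1.html"
parts = urlparse(url)
print(parts.scheme)   # http
print(parts.netloc)   # pythonscraping.com
print(parts.path)     # /pages/page1.html

# Resolve a relative link against the page URL.
image_url = urljoin(url, "../img/gifts/img1.jpg")
print(image_url)

# urllib.error defines the exceptions raised by urllib.request;
# HTTPError is a subclass of URLError.
print(issubclass(HTTPError, URLError))
```

urllib.request does the fetching, urllib.parse handles URL strings, and urllib.error holds the exception classes that a scraper should expect to catch.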
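II. BeautifulSoup
1. Using BeautifulSoup
Note: 1. Install the module with pip install beautifulsoup4
2. Connect reliably: the program should handle the exceptions that may occur along the way
As in the following example: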
from urllib.error import HTTPError
from urllib.request import urlopen

from bs4 import BeautifulSoup


def getTitle(url):
    try:
        html = urlopen(url)
    except HTTPError:
        # The server answered with an error status (for example 404 or 500)
        return None
    try:
        # Name the parser explicitly to avoid the "no parser specified" warning
        bsobj = BeautifulSoup(html.read(), "html.parser")
        title = bsobj.body.h1
    except AttributeError:
        # bsobj.body was None, so looking up .h1 on it failed
        return None
    return title


title = getTitle("http://pythonscraping.com/pages/page1.html")
if title is None:
    print("title was not found")
else:
    print(title)
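The two ways the parsing step can come up empty are easy to confuse. The sketch below uses inline HTML strings (made up here for illustration) instead of a live page, and assumes the built-in "html.parser" backend. A tag that is missing yields None, and only a further lookup on that None raises AttributeError.

```python
from bs4 import BeautifulSoup

# Hypothetical documents, written inline so no network access is needed.
with_h1 = BeautifulSoup("<html><body><h1>A Title</h1></body></html>", "html.parser")
no_h1 = BeautifulSoup("<html><body><p>no heading</p></body></html>", "html.parser")
no_body = BeautifulSoup("<p>just a fragment</p>", "html.parser")

print(with_h1.body.h1)   # the <h1> tag
print(no_h1.body.h1)     # None: body exists, h1 does not

try:
    no_body.body.h1      # no_body.body is None, so .h1 fails
except AttributeError:
    print("AttributeError: the document has no body tag")
```

This is why the caller of getTitle still has to check the result for None: a page with a body but no h1 never reaches the except branch.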
2. A scraper can pick out specific content by the value of the class attribute
from urllib.request import urlopen

from bs4 import BeautifulSoup

html = urlopen("http://pythonscraping.com/pages/warandpeace.html")
bsobj = BeautifulSoup(html, "html.parser")

# Use findAll on bsobj to extract the span tags whose class attribute is "red"
contentList = bsobj.findAll("span", {"class": "red"})
for content in contentList:
    print(content.get_text())
    print('\n')
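The same filter can be tried offline. The following is a minimal sketch on a hand-written snippet that only imitates the colored spans of the page above; the sample text and variable names are assumptions made for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet imitating the structure of the page: colored spans.
sample = """
<p><span class="red">spoken words</span> said
<span class="green">Anna Pavlovna</span> to
<span class="green">Prince Vasili</span>.</p>
"""
soup = BeautifulSoup(sample, "html.parser")

# find_all is the current spelling of findAll; both accept a tag name
# and a dictionary of attribute filters.
names = [span.get_text() for span in soup.find_all("span", {"class": "green"})]
print(names)

# class_ is a keyword shortcut for the same kind of filter.
quotes = [span.get_text() for span in soup.find_all("span", class_="red")]
print(quotes)
```

The keyword is spelled class_ because class is a reserved word in Python.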
3. Navigating the tree
from urllib.request import urlopen

from bs4 import BeautifulSoup

html = urlopen("http://pythonscraping.com/pages/page3.html")
bsobj = BeautifulSoup(html, "html.parser")

# Children of the table
for child in bsobj.find("table", {"id": "giftList"}).children:
    print(child)

# Siblings that follow the first row of the table
for sibling in bsobj.find("table", {"id": "giftList"}).tr.next_siblings:
    print(sibling)

# Text of every h2 heading
for h2title in bsobj.findAll("h2"):
    print(h2title.get_text())

# From the image, go up to its parent, then to the sibling just before it
print(bsobj.find("img", {"src": "../img/gifts/img1.jpg"}).parent.previous_sibling.get_text())
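The navigation attributes are easier to follow on a table small enough to read at a glance. This is a sketch on a hypothetical two-row table, not the real page. The tags are written with no whitespace between them on purpose, because whitespace text nodes would otherwise appear among the children and siblings.

```python
from bs4 import BeautifulSoup

# Hypothetical table: one header row and two data rows, no whitespace between tags.
sample = (
    '<table id="giftList">'
    "<tr><th>Item</th><th>Cost</th><th>Image</th></tr>"
    '<tr><td>Vegetable Basket</td><td>$15.00</td><td><img src="img1.jpg"></td></tr>'
    '<tr><td>Russian Nesting Dolls</td><td>$10,000.52</td><td><img src="img2.jpg"></td></tr>'
    "</table>"
)
soup = BeautifulSoup(sample, "html.parser")
table = soup.find("table", {"id": "giftList"})

# .children yields the direct children only: the three tr tags.
rows = list(table.children)
print(len(rows))

# .next_siblings starts after the first tr, so the header row is skipped.
data_rows = list(table.tr.next_siblings)
first_items = [row.td.get_text() for row in data_rows]
print(first_items)

# img -> parent td -> previous td, which holds the price.
img = soup.find("img", {"src": "img1.jpg"})
price = img.parent.previous_sibling.get_text()
print(price)
```

On a real page the markup usually contains newlines between tags, so it is worth printing each sibling before relying on its position.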
5. Regular expressions and BeautifulSoup
import re
from urllib.request import urlopen

from bs4 import BeautifulSoup

html = urlopen("http://pythonscraping.com/pages/page3.html")
bsobj = BeautifulSoup(html, "html.parser")

# findAll returns a list-like ResultSet of the matching img tags (not a dictionary)
images = bsobj.findAll("img", {"src": re.compile(r"\.\./img/gifts/img.*\.jpg")})
for image in images:
    print(image["src"])
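To see which src values the pattern from the example above accepts, it can be tested on its own with the re module. The candidate strings below are hypothetical and chosen to show both matches and misses.

```python
import re

# The pattern from the example above, written as a raw string.
pattern = re.compile(r"\.\./img/gifts/img.*\.jpg")

# Hypothetical src values; only the first two follow the gift-image naming.
candidates = [
    "../img/gifts/img1.jpg",
    "../img/gifts/img42.jpg",
    "../img/gifts/logo.jpg",
    "../img/gifts/img3.png",
]

# BeautifulSoup applies a compiled pattern with search(), so the
# pattern may match anywhere in the attribute value.
matching = [src for src in candidates if pattern.search(src)]
print(matching)
```

Writing the pattern as a raw string keeps the backslashes intact, so Python does not try to interpret sequences such as \. as string escapes.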