Python Web Scraping Study: Day 12

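Today I learned how to use BeautifulSoup functions to fetch specific nodes, and how to start from the current node and follow the tree to its child nodes, descendant nodes, sibling nodes, and parent nodes.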

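Exercise 1: Using findAll to extract only the text inside <span class="green"></span> tags
Along the way I also extracted the contents of the class='red' tags.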

from urllib.request import urlopen
from bs4 import BeautifulSoup

r = urlopen('http://www.pythonscraping.com/pages/warandpeace.html')
# Name the parser explicitly so BeautifulSoup does not have to guess one
bsObj = BeautifulSoup(r, 'html.parser')
persons = bsObj.findAll('span', {'class': 'green'})
conversations = bsObj.findAll('span', {'class': 'red'})
# Print the text of every green span, then every red span
for name in persons:
    print(name.get_text())
print('\n')
for talks in conversations:
    print(talks.get_text())
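To check the class filter without hitting the network, here is a minimal sketch that runs the same kind of query on an inline snippet. The HTML string is made up for illustration and only imitates the page's green/red spans; the `class_` keyword form is an alternative spelling of the same filter.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet (not the real page): names in green, speech in red.
html = """
<p><span class="red">Well, Prince, what news do you bring?</span>
said <span class="green">Anna Pavlovna</span> to
<span class="green">Prince Vasili</span>.</p>
"""

soup = BeautifulSoup(html, "html.parser")

# Attribute-dictionary form, as used in the exercise
names = [tag.get_text() for tag in soup.findAll("span", {"class": "green"})]

# Keyword form; the trailing underscore is needed because class is a Python keyword
quotes = [tag.get_text() for tag in soup.find_all("span", class_="red")]

print(names)
print(quotes)
```

findAll is the older camelCase name for find_all; both accept the same arguments.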

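Exercise 2: Finding HTML elements by matching content
I already practiced finding HTML elements yesterday with the find/findAll functions.
The tag and attributes arguments of these two functions (named name and attrs in the BeautifulSoup signature) let us retrieve most tags. We can also use the text argument, as in the example below, to match on text content. Note that a plain string passed to text must equal a text node exactly, and the results are the matched strings rather than the tags around them.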

from urllib.request import urlopen
from bs4 import BeautifulSoup

r = urlopen('http://www.pythonscraping.com/pages/warandpeace.html')
bsObj = BeautifulSoup(r, 'html.parser')
# Collect every text node that is exactly 'the prince'
test = bsObj.findAll(text='the prince')
# Number of matches found
print(len(test))
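A small offline sketch of how the text filter behaves, using a made-up HTML string. It uses `string`, the newer name for the `text` argument, and contrasts an exact-match string with a compiled regular expression for substring matching.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical snippet: two text nodes equal "the prince", one merely contains it.
html = "<p>the prince</p><p>I saw the prince today</p><p><b>the prince</b></p>"
soup = BeautifulSoup(html, "html.parser")

# A plain string only matches text nodes that are exactly equal to it
exact = soup.find_all(string="the prince")

# A compiled regex matches any text node in which the pattern is found
partial = soup.find_all(string=re.compile("the prince"))

# The results are strings; .parent leads back to the enclosing tag
exact_parents = [s.parent.name for s in exact]

print(len(exact), len(partial), exact_parents)
```

If the goal is the enclosing tags rather than the strings, going through `.parent` as above is one way to get them.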

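Exercise 3: Child tags and descendant tags (note the difference)

A child tag sits exactly one level below its parent tag, whereas descendant tags are the tags at every level below a parent tag. All child tags are descendant tags, but not all descendant tags are child tags.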

from urllib.request import urlopen
from bs4 import BeautifulSoup

r = urlopen('http://www.pythonscraping.com/pages/page3.html')

bsObj = BeautifulSoup(r, 'html.parser')
# Direct children of the table only
for child in bsObj.find('table', {'id': 'giftList'}).children:
    print(child)
print('\n')
# Everything nested under the table, at any depth
for descendant in bsObj.find('table', {'id': 'giftList'}).descendants:
    print(descendant)
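The difference is easier to see on a tiny table. The following is a sketch on an invented two-row table (not the real page3.html markup) that lists only the tag names returned by `.children` and `.descendants`.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical miniature of a gift table, written on one line to avoid whitespace nodes.
html = ('<table id="giftList"><tr><th>Item</th><th>Cost</th></tr>'
        '<tr><td>Vegetable Basket</td><td>$15.00</td></tr></table>')
soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", {"id": "giftList"})

# .children yields only the nodes one level below the table
children = [node.name for node in table.children if isinstance(node, Tag)]

# .descendants walks every level in document order; text nodes are skipped here
descendants = [node.name for node in table.descendants if isinstance(node, Tag)]

print(children)
print(descendants)
```

Both iterators also yield text nodes, which is why the sketch filters with `isinstance(node, Tag)`.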

Exercise 4: Getting sibling nodes with next_siblings

from urllib.request import urlopen
from bs4 import BeautifulSoup

r = urlopen('http://www.pythonscraping.com/pages/page3.html')
bsObj = BeautifulSoup(r, 'html.parser')
# Start from the first row; next_siblings yields what follows it, not the row itself
for sibling in bsObj.find('table', {'id': 'giftList'}).tr.next_siblings:
    print(sibling)
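Two details of next_siblings are worth checking on a small example: the starting node is not part of the result, and whitespace between tags shows up as sibling strings. This is a sketch on an invented table, not the real page.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical table with a header row and two data rows, one row per line.
html = """<table id="giftList">
<tr><th>Item</th></tr>
<tr><td>Vegetable Basket</td></tr>
<tr><td>Russian Nesting Dolls</td></tr>
</table>"""
soup = BeautifulSoup(html, "html.parser")

first_row = soup.find("table", {"id": "giftList"}).tr
siblings = list(first_row.next_siblings)

# Keep only the tag siblings; the newlines between rows are strings, not tags
rows = [s.get_text() for s in siblings if isinstance(s, Tag)]
whitespace_only = [s for s in siblings if not isinstance(s, Tag)]

print(rows)
print(len(siblings), len(whitespace_only))
```

Starting from the header row is a convenient way to iterate over the data rows only.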

Exercise 5: Working with parent nodes via parent/parents

from urllib.request import urlopen
from bs4 import BeautifulSoup

r = urlopen('http://www.pythonscraping.com/pages/page3.html')
bsObj = BeautifulSoup(r, 'html.parser')
# From the image, go up to its parent cell, then to the cell just before it
money = bsObj.find('img', {'src': '../img/gifts/img1.jpg'}).parent.previous_sibling
print(money.get_text())
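The exercise only uses .parent, so here is a minimal sketch that also iterates .parents. The row below is invented to mirror the image-in-a-cell layout and is an assumption about the markup, not a copy of the page.

```python
from bs4 import BeautifulSoup

# Hypothetical row: name cell, price cell, image cell, with no whitespace between cells.
html = ('<table id="giftList"><tr><td>Vegetable Basket</td><td>$15.00</td>'
        '<td><img src="../img/gifts/img1.jpg"></td></tr></table>')
soup = BeautifulSoup(html, "html.parser")
img = soup.find("img", {"src": "../img/gifts/img1.jpg"})

# .parent is the cell holding the image; its previous sibling is the price cell
price = img.parent.previous_sibling.get_text()

# .parents yields every ancestor, from the nearest one up to the document itself
ancestors = [p.name for p in img.parents]

print(price)
print(ancestors)
```

If there were whitespace between the cells, previous_sibling would return that whitespace string instead of the price cell, so the chain depends on how the markup is laid out.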