kennethreitz releases requests-html, billed as an HTML parsing library for humans

HTML · Published 2019-03-18 11:02:49

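kennethreitz, the author of the requests library, has designed a new library: requests-html. It currently has 9,195 stars.

The requests library was billed as an HTTP library for humans; requests-html is billed as an HTML parsing library for humans. I trust kennethreitz's skill, and he is not one to exaggerate. I read through the new library's documentation and it really is impressive. People learning web scraping may no longer need requests plus bs4 as the standard starter kit, because requests-html alone can handle the whole job.

I looked up requests-html on Google Trends and found that the earliest searches date from January 2018. That puts me (Da Deng) more than a year behind the latest tools in the scraping world; I found out far too late. If you come across something new and good, please leave a comment or send me a message.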

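What makes requests-html powerful:

Strong page parsing capabilities on top of everything requests already does
Full JavaScript support
Element selection with CSS selectors (jQuery style, similar to how pyquery works) and XPath selectors
Requests that present themselves as a browser (User-Agent)
Built-in automatic pagination for static pages, which spares you the chore of constructing URLs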

Installation

The documentation says Python 3.6 is currently supported, but in my own testing the library also installed and ran fine on Python 3.7.

pip install requests-html

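Smart pagination (needs improvement)

This is the most eye-catching feature I found. It has problems in practice, but I still want to cover it first. Normally, before writing a scraper for a static site, we have to work out the URL pattern, for example:

Page 1: https://book.douban.com/tag/小說
Page 2: https://book.douban.com/tag/小說?start=20&type=T
Page 3: https://book.douban.com/tag/小說?start=40&type=T
Page 4: https://book.douban.com/tag/小說?start=60&type=T

To send these requests in bulk, the code has to look like this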

from bs4 import BeautifulSoup
import requests

# {page} is the offset of the first item on the page (20 items per page)
base = 'https://book.douban.com/tag/小說?start={page}&type=T'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36'}

for i in range(100):
    url = base.format(page=i*20)
    resp = requests.get(url, headers=headers)
    bsObj = BeautifulSoup(resp.text, 'html.parser')
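
The offset arithmetic behind those URLs can be checked without sending any requests. The following is only a sketch: the helper name page_url and the constant PAGE_SIZE are made up for illustration, and the page size of 20 is inferred from the URLs listed above rather than from any documentation.

```python
from urllib.parse import quote, urlencode

PAGE_SIZE = 20  # assumption: 20 items per page, inferred from start=20, 40, 60


def page_url(tag, page):
    """Build the listing URL for a zero-based page number (hypothetical helper)."""
    base = "https://book.douban.com/tag/" + quote(tag)
    if page == 0:
        # The first page is the bare tag URL with no query string.
        return base
    return base + "?" + urlencode({"start": page * PAGE_SIZE, "type": "T"})


urls = [page_url("小說", page) for page in range(4)]
for url in urls:
    print(url)
```

The tag is percent-encoded here, which is also what the HTTP client ends up sending when it is given the raw characters.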

With requests-html, however, all you need is

from requests_html import HTMLSession
session = HTMLSession()

r = session.get('https://book.douban.com/tag/小說')
# iterating over r.html is meant to yield one HTML object per page
for html in r.html:
    print(html)

<HTML url='https://book.douban.com/tag/%E5%B0%8F%E8%AF%B4'>

In practice, however, this did not work for me, and kennethreitz says as much in the documentation:

There’s also intelligent pagination support (always improving)

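"always improving" is a way of saying that smart pagination does not yet work well and still needs ongoing work. The idea behind the feature is excellent, though, and I look forward to an update that makes it usable.

Don't let hope turn into disappointment: there is plenty of good material still to come.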

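A proper GET request

We send a GET request to the official Python website, https://python.org/, and get back a Response object.

The Response object's methods are similar to those in the requests library. Let's look at the commonly used ones.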

from requests_html import HTMLSession
session = HTMLSession()

r = session.get('https://python.org/')
r

Run and output!

<Response [200]>

Get the response HTML as text

r.text[:50]

Run and output!

'<!doctype html>\n<!--[if lt IE 7]><html class="n'

Get the response HTML as raw bytes

r.content[:50]

Run and output!

b'<!doctype html>\n<!--[if lt IE 7]><html class="n'

Convert the response into an HTML object, which makes parsing and element selection convenient.

r.html

Run and output!

<HTML url='https://www.python.org/'>

Methods of the HTML object

# a mix of absolute and relative URLs
print(len(r.html.links))
list(r.html.links)[:5]

Run and output!

119
['/success-stories/category/arts/',
 'https://kivy.org/',
 'https://www.python.org/psf/codeofconduct/',
 'http://www.scipy.org',
 'https://docs.python.org/3/license.html']

htmlObj.absolute_links also converts relative paths into absolute ones

print(len(r.html.absolute_links))
list(r.html.absolute_links)[:5]

Run and output!

119
['https://kivy.org/',
 'https://www.python.org/psf/codeofconduct/',
 'http://www.scipy.org',
 'https://jobs.python.org',
 'https://docs.python.org/3/license.html']

Notes

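Relative URL: //docs.python.org/3/tutorial/

Absolute URL: http://docs.python.org/3/tutorial/

HTML.links returns the URLs as they appear in the page
HTML.absolute_links returns them as absolute URLs

Both properties return 119 URLs, which shows that absolute_links takes the relative paths found in links and fills them out into absolute URLs.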
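
To see what filling out a relative path means, here is a small sketch that uses urljoin from the standard library. It illustrates the general idea only and is not a claim about how requests-html implements absolute_links internally. The sample links are modelled on the output above, and base_url is assumed to be the URL of the fetched page.

```python
from urllib.parse import urljoin

base_url = "https://www.python.org/"  # assumed URL of the fetched page

raw_links = {
    "/success-stories/category/arts/",  # path relative to the site root
    "//docs.python.org/3/tutorial/",    # scheme-relative URL
    "https://kivy.org/",                # already absolute, left untouched
}

# Resolve every link against the page URL; absolute links pass through unchanged.
absolute_links = {urljoin(base_url, link) for link in raw_links}

for link in sorted(absolute_links):
    print(link)
```

Each raw link maps to exactly one absolute link, which is consistent with both collections having the same size.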

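JavaScript support

requests-html supports JavaScript. As an example, take https://pythonclock.org/, which displays a countdown clock. The countdown is generated by JavaScript embedded in the page, so an ordinary HTML parsing library cannot see this data.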

from requests_html import HTMLSession
session = HTMLSession()
r2 = session.get('https://pythonclock.org/')
r2.html.search('Python 2.7 will retire in...{}Enable Guido Mode')[0]

Run and output!

'</h1>\n</div>\n<div class="python-27-clock"></div>\n<div class="center">\n<div class="guido-button-block">\n<button class="js-guido-mode guido-button">'
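
To make the {} placeholder in the search template concrete, here is a minimal sketch that mimics it with the standard-library re module. requests-html itself relies on the third-party parse library for this, so the helper search_between is a hypothetical stand-in whose matching rules may differ in detail, and the page string is a shortened, made-up excerpt rather than the real response.

```python
import re


def search_between(template, text):
    """Return the text matched by a single {} placeholder, or None (hypothetical helper)."""
    before, _, after = template.partition("{}")
    # Match the literal text on both sides and capture whatever lies between, non-greedily.
    pattern = re.escape(before) + r"(.*?)" + re.escape(after)
    match = re.search(pattern, text, flags=re.DOTALL)
    return match.group(1) if match else None


# Shortened stand-in for the unrendered page: the clock <div> is still empty.
page = (
    'Python 2.7 will retire in...</h1>\n'
    '<div class="python-27-clock"></div>\n'
    'Enable Guido Mode'
)

print(repr(search_between("Python 2.7 will retire in...{}Enable Guido Mode", page)))
```

The captured text contains the empty clock element, which is the point of the example: without rendering, the countdown values are simply not in the HTML.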

requests-html has a render method that uses Chromium to execute the JavaScript. The first time you call it, it downloads Chromium, which takes a few minutes.

r2.html.render()
r2.html.search('Python 2.7 will retire in...{}Enable Guido Mode')[0]

Run and output!

'</h1>\n</div>\n<div class="python-27-clock is-countdown"><span class="countdown-row countdown-show6"><span class="countdown-section"><span class="countdown-amount">1</span><span class="countdown-period">Year</span></span><span class="countdown-section"><span class="countdown-amount">2</span><span class="countdown-period">Months</span></span><span class="countdown-section"><span class="countdown-amount">28</span><span class="countdown-period">Days</span></span><span class="countdown-section"><span class="countdown-amount">16</span><span class="countdown-period">Hours</span></span><span class="countdown-section"><span class="countdown-amount">52</span><span class="countdown-period">Minutes</span></span><span class="countdown-section"><span class="countdown-amount">46</span><span class="countdown-period">Seconds</span></span></span></div>\n<div class="center">\n<div class="guido-button-block">\n<button class="js-guido-mode guido-button">'

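The result above now contains the countdown data, which can be extracted like this

# use r2 (the rendered pythonclock.org response), not r (python.org)
periods = [element.text for element in r2.html.find('.countdown-period')]
amounts = [element.text for element in r2.html.find('.countdown-amount')]
countdown_data = dict(zip(periods, amounts))
countdown_data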

Run and output!

{'Year': '1', 'Months': '2', 'Days': '5', 'Hours': '23', 'Minutes': '34', 'Seconds': '37'}
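
The pairing step can be tried offline. This sketch copies the labels and values from the sample output above in place of live elements, converts the amounts to integers, and adds a length check, because zip silently drops unmatched items. The helper name pair_countdown is made up for illustration.

```python
def pair_countdown(periods, amounts):
    """Pair each period label with its amount as an int (hypothetical helper)."""
    if len(periods) != len(amounts):
        # zip() would silently truncate to the shorter list, so fail loudly instead.
        raise ValueError("periods and amounts must have the same length")
    return {period: int(amount) for period, amount in zip(periods, amounts)}


# Values copied from the sample output above; on a live page they would be the
# element.text of the '.countdown-period' and '.countdown-amount' matches.
periods = ['Year', 'Months', 'Days', 'Hours', 'Minutes', 'Seconds']
amounts = ['1', '2', '5', '23', '34', '37']

countdown = pair_countdown(periods, amounts)
print(countdown)
```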

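CSS selection

Extract elements at a given position from the HTML object

htmlObj.find('selector', first=False) returns every Element that matches the selector; the return value is a list of Element objects.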

r.html.find('#about')

Run and output!

[<Element 'li' aria-haspopup='true' class=('tier-1', 'element-1') id='about'>]

Setting first to True returns only the first matching Element. In that case the return value is a single Element rather than a list.

about = r.html.find('#about',first=True)
about

Run and output!

<Element 'li' aria-haspopup='true' class=('tier-1', 'element-1') id='about'>

Methods of the Element object

r = session.get('https://github.com/')
htmlObj = r.html
htmlObj.xpath('a',first=True)

Run and output!

<Element 'a' class=('btn', 'ml-2') href='https://help.github.com/articles/supported-browsers'>

For more, see the requests-html documentation at http://html.python-requests.org/
