Notes on Prof. Song Tian's Python Web Crawling and Information Extraction course: Unit 1. Getting started with the requests library

阿新 • Published: 2018-12-11

Contents

Introduction to the Requests library
The requests.get(url, params, **kwargs) method and the other request methods
A brief look at the Response class attributes
encoding vs. apparent_encoding in the Response class
Requests library exceptions

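1. Introduction to the `Requests` library

requests is an excellent third-party HTTP library. Python's standard library can also fetch web pages, through the request module of the urllib package.
I usually take one of two routes to request a page: 1) the get() and post() functions of the requests library; 2) urllib.request.urlopen().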
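To make the two routes concrete, here is a minimal sketch of each. The helper names fetch_with_requests and fetch_with_urllib, the 10-second timeout, and the UTF-8 default for urllib are assumptions made for illustration; they are not part of the course material.

```python
import urllib.request

import requests


def fetch_with_requests(url, timeout=10):
    """Route 1: requests.get() returns a Response; .text is the decoded body."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()
    return response.text


def fetch_with_urllib(url, timeout=10, encoding="utf-8"):
    """Route 2: urlopen() returns a file-like object; decode the bytes yourself."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode(encoding)
```

Both helpers take a full URL such as "http://www.baidu.com". The main practical difference is that requests decodes the body for you, while urllib hands back raw bytes.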

2. The `requests.get()` method

Let's go straight to the source code:

def get(url, params=None, **kwargs):
    r"""Sends a GET request.

    :param url: URL for the new :class:`Request` object.
    :param params: (optional) Dictionary or bytes to be sent in the query string for the :class:`Request`.
    :param \*\*kwargs: Optional arguments that ``request`` takes.
    :return: :class:`Response <Response>` object
    :rtype: requests.Response
    """

    kwargs.setdefault('allow_redirects', True)
    return request('get', url, params=params, **kwargs)

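The source shows that get() simply calls requests.request(method, url, **kwargs), which builds a Request object, sends the GET request to the server, and returns a Response object. The other helpers, such as requests.post(), delegate to request() in the same way, so they are not covered one by one here.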
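One way to see what the params argument does, without touching the network, is to build the request by hand. The sketch below uses requests.Request and its prepare() method to show how params ends up in the query string. The example.com URL and the parameter names are placeholders chosen for illustration.

```python
import requests

# Build (but do not send) a GET request with query parameters,
# the same kind of request that requests.get(url, params=params) sends.
url = "http://example.com/search"
params = {"q": "python", "page": 2}

prepared = requests.Request("GET", url, params=params).prepare()

print(prepared.method)  # GET
print(prepared.url)     # http://example.com/search?q=python&page=2
```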

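3. A brief look at the `Response` and `Request` objects

1). `Response`:

Only the attributes Prof. Song Tian covers are described here: status_code, encoding, apparent_encoding, content, and text.

First, a quick note on Python's built-in decorator @property: put simply, it turns a getter method into something that is accessed like an attribute. Some of the attributes above are exposed this way, by decorating the corresponding method with @property.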
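As a small illustration of @property, the toy class below exposes a computed value as an attribute. The Page class and its fields are hypothetical and only mimic the idea behind Response.text; they are not taken from the requests source.

```python
class Page:
    """Toy class: `text` is computed on access, in the spirit of Response.text."""

    def __init__(self, raw, encoding="utf-8"):
        self._raw = raw            # stored bytes
        self.encoding = encoding   # ordinary attribute

    @property
    def text(self):
        # Runs on every access to page.text; no parentheses are needed.
        return self._raw.decode(self.encoding)


page = Page("requests 入門".encode("utf-8"))
print(page.text)  # requests 入門
```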

The Response class declares the following attributes (everything below can be found in the source of requests.Response):

__attrs__ = [
        '_content', 'status_code', 'headers', 'url', 'history',
        'encoding', 'reason', 'cookies', 'elapsed', 'request'
    ]

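apparent_encoding, content, and text are missing from this list because they are not stored attributes: each is a method decorated with @property, which is why it can be read like an attribute.

status_code: the status code of the HTTP response. Common values include 200 (OK), 404 (Not Found), 403 (Forbidden), and 500 (Internal Server Error).
encoding: the encoding taken from the charset field of the HTTP Content-Type header. For a text content type with no charset, requests falls back to ISO-8859-1.
apparent_encoding: the encoding inferred by analysing the actual content of the response. It is the apparent_encoding() method decorated with @property so that it reads as an attribute. Its source: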

@property
def apparent_encoding(self):
    """The apparent encoding, provided by the chardet library."""
    return chardet.detect(self.content)['encoding']

text: the response body decoded into a Unicode string. The docstring from the source:

@property
def text(self):
    """Content of the response, in unicode."""

content: the response body as raw bytes. The docstring from the source:

@property
def content(self):
    """Content of the response, in bytes."""

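The attributes described in this section can be explored offline with a hand-built Response. This is only a sketch: assigning the private _content field directly is a shortcut for illustration (normally requests fills it in from the network), and the sample body is made up.

```python
import requests

response = requests.Response()
response.status_code = 200
response.encoding = "utf-8"  # normally taken from the HTTP header
response._content = "你好, requests".encode("utf-8")  # assumption: set by hand

print(response.status_code)    # 200
print(type(response.content))  # <class 'bytes'>
print(type(response.text))     # <class 'str'>
print(response.text)           # 你好, requests
```

content is the raw bytes as stored, while text is those bytes decoded with response.encoding.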
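4. `encoding` vs. `apparent_encoding` in the `Response` class

As mentioned above, encoding comes from the charset in the HTTP header, while apparent_encoding is inferred from the content actually returned. The two can therefore disagree; when they do, assigning the value of apparent_encoding to encoding makes text decode correctly.

Taking the Baidu homepage as an example, run the following code: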

import requests

url = "http://www.baidu.com"
response = requests.get(url)
print(response.encoding)            # Output: ISO-8859-1
print(response.apparent_encoding)   # Output: utf-8

As the output shows, encoding and apparent_encoding return different results. To make the page content display correctly, just add the line response.encoding = response.apparent_encoding before reading response.text.
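The effect of a wrong encoding can be reproduced without a network call. In the sketch below the body is set by hand through the private _content field (an assumption for illustration), and the encoding is switched to "utf-8" explicitly as a stand-in for assigning apparent_encoding, since detection on such a short sample may vary.

```python
import requests

original = "百度一下，你就知道"

response = requests.Response()
response.status_code = 200
response._content = original.encode("utf-8")  # assumption: body set by hand

# Header-derived value: the fallback for text content with no charset.
response.encoding = "ISO-8859-1"
garbled = response.text

# Stand-in for `response.encoding = response.apparent_encoding`.
response.encoding = "utf-8"
fixed = response.text

print(garbled == original)  # False
print(fixed == original)    # True
```

text is decoded again on every access, so changing encoding after the request has completed is enough to fix the output.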

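5. `Requests` library exceptions

Exception	Description
`requests.ConnectionError`	Network connection error, such as a failed DNS lookup or a refused connection
`requests.HTTPError`	HTTP error
`requests.URLRequired`	A URL is missing
`requests.TooManyRedirects`	The maximum number of redirects was exceeded
`requests.ConnectTimeout`	Timed out while connecting to the remote server
`requests.Timeout`	The request to the URL timed out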
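The exceptions in the table are related by inheritance, which determines what a given except clause catches. The short check below is a sketch that only inspects the class hierarchy and sends no request; the variable name table is an arbitrary choice.

```python
import requests

# ConnectTimeout is both a connection error and a timeout.
print(issubclass(requests.ConnectTimeout, requests.Timeout))          # True
print(issubclass(requests.ConnectTimeout, requests.ConnectionError))  # True

# Every exception in the table derives from RequestException.
table = [
    requests.ConnectionError,
    requests.HTTPError,
    requests.URLRequired,
    requests.TooManyRedirects,
    requests.ConnectTimeout,
    requests.Timeout,
]
print(all(issubclass(exc, requests.RequestException) for exc in table))  # True
```

Because they share a base class, a single except requests.RequestException clause catches all of them.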

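Note the difference between requests.ConnectTimeout and requests.Timeout: the former is raised only when the connection to the remote server times out, while the latter covers a timeout at any stage of the request. The instructor also presented a simple general-purpose code framework in class: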

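import requests

def getHTMLText(url):
    """Fetch url and return its text, or a fixed message on failure."""
    try:
        response = requests.get(url, timeout=30)
        response.raise_for_status()     # raise HTTPError on a 4xx/5xx status code
        response.encoding = response.apparent_encoding
        return response.text
    except requests.RequestException:   # base class of all requests exceptions
        return "An exception occurred"

if __name__ == "__main__":
    url = "http://www.baidu.com"
    print(getHTMLText(url))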

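One new thing to note here is response.raise_for_status(): it checks the status code returned by the server and raises an HTTPError when the code indicates a client or server error (4xx or 5xx).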
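The behaviour of raise_for_status() can be checked offline on hand-built responses. The helpers make_response and raises_http_error are hypothetical names introduced for this sketch; a bare Response with only a status code is used purely for illustration.

```python
import requests


def make_response(status_code):
    """Build a bare Response carrying only a status code (illustration only)."""
    response = requests.Response()
    response.status_code = status_code
    return response


def raises_http_error(status_code):
    """Return True if raise_for_status() raises HTTPError for this code."""
    try:
        make_response(status_code).raise_for_status()
    except requests.HTTPError:
        return True
    return False


for code in (200, 302, 404, 500):
    print(code, raises_http_error(code))
# 200 False
# 302 False
# 404 True
# 500 True
```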
