Python Crawler Study Notes: Writing a Crawler with the requests Library (1)

阿新 • Published: 2019-02-05

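First, thanks to http://python.jobbole.com . Reading the articles on that site is what gave me the idea for this post. I only started learning Python recently, and this post simply records the problems I ran into along the way, written as I learn. It is my first post of this kind, so mistakes are inevitable; if you spot one, please point it out and I will be grateful.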

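In my experience, articles about writing crawlers with requests are scarce, or at least unsatisfying. Most tutorials use urllib2 from Python 2, or urllib (urllib.request) from Python 3. As I see it, requests bundles the fiddly parts of urllib into a cleaner, more readable interface.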

I won't go into downloading and installing the requests library; a quick search (on Baidu, for example) turns up plenty of articles.

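The official requests documentation and its Chinese translation are available here. Parts of the translation are a bit stiff, but it is generally understandable, and this manual is what I felt my way along with: http://cn.python-requests.org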

Once requests is installed, open api.py in the package to see which interfaces it provides.

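I'll start by pasting the definition of request from api.py, both for reference and to help explain the functions that follow. Don't worry if it doesn't make sense yet; it needs to be read together with those functions.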

def request(method, url, **kwargs):
    """Constructs and sends a :class:`Request <Request>`.
    :param method: method for the new :class:`Request` object.
    :param url: URL for the new :class:`Request` object.
 
    :param params: (optional) Dictionary or bytes to be sent in the query string for the :class:`Request`.
    :param data: (optional) Dictionary, bytes, or file-like object to send in the body of the :class:`Request`.
    :param json: (optional) json data to send in the body of the :class:`Request`.
 
    :param headers: (optional) Dictionary of HTTP Headers to send with the :class:`Request`.
    :param cookies: (optional) Dict or CookieJar object to send with the :class:`Request`.
    :param files: (optional) Dictionary of ``'name': file-like-objects`` (or ``{'name': file-tuple}``) for multipart encoding upload.
        ``file-tuple`` can be a 2-tuple ``('filename', fileobj)``, 3-tuple ``('filename', fileobj, 'content_type')``
        or a 4-tuple ``('filename', fileobj, 'content_type', custom_headers)``, where ``'content-type'`` is a string
        defining the content type of the given file and ``custom_headers`` a dict-like object containing additional headers
        to add for the file.
    :param auth: (optional) Auth tuple to enable Basic/Digest/Custom HTTP Auth.
    :param timeout: (optional) How long to wait for the server to send data
        before giving up, as a float, or a :ref:`(connect timeout, read
        timeout) <timeouts>` tuple.
    :type timeout: float or tuple
    :param allow_redirects: (optional) Boolean. Set to True if POST/PUT/DELETE redirect following is allowed.
    :type allow_redirects: bool
    :param proxies: (optional) Dictionary mapping protocol to the URL of the proxy.
    :param verify: (optional) whether the SSL cert will be verified. A CA_BUNDLE path can also be provided. Defaults to ``True``.
    :param stream: (optional) if ``False``, the response content will be immediately downloaded.
    :param cert: (optional) if String, path to ssl client cert file (.pem). If Tuple, ('cert', 'key') pair.
    :return: :class:`Response <Response>` object
    :rtype: requests.Response
    Usage::
      >>> import requests
      >>> req = requests.request('GET', 'http://httpbin.org/get')
      <Response [200]>
    """
    # By using the 'with' statement we are sure the session is closed, thus we
    # avoid leaving sockets open which can trigger a ResourceWarning in some
    # cases, and look like a memory leak in others.
    with sessions.Session() as session:
        return session.request(method=method, url=url, **kwargs)

Let's start with get, which is defined as follows:

def get(url, params=None, **kwargs):
    """Sends a GET request.
    :param url: URL for the new :class:`Request` object.
    :param params: (optional) Dictionary or bytes to be sent in the query string for the :class:`Request`.
    :param \*\*kwargs: Optional arguments that ``request`` takes.
    :return: :class:`Response <Response>` object
    :rtype: requests.Response
    """
    kwargs.setdefault('allow_redirects', True)
    return request('get', url, params=params, **kwargs)

The get function sends a GET request to the server, just as a browser does when you visit a page. It takes these parameters:

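url: the target URL, for example http://www.acfun.tv. For instance, r = requests.get(url='http://www.acfun.tv') sends a request to the address given by url and assigns the returned Response to r, which you can then work with.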

params: the query parameters. The docstring of request marks params as (optional), meaning it can be omitted, and says the value should be a dictionary or bytes. (Details will come later.)

**kwargs: other optional arguments such as timeout, data, json and so on; these are simply the optional parameters of request. (Again, details will come later.)
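To make the params argument a little more concrete, here is a minimal sketch that builds a GET request without sending it, so the effect of params on the URL is visible. The httpbin.org address and the parameter names q and page are placeholders chosen for illustration; they do not come from the original notes.

```python
import requests

# Build a GET request without sending it, to see how params end up in the URL.
# The URL and the parameter names are placeholders for illustration only.
req = requests.Request(
    "GET",
    "http://httpbin.org/get",
    params={"q": "python", "page": 2},
)
prepared = req.prepare()

print(prepared.method)  # GET
print(prepared.url)     # http://httpbin.org/get?q=python&page=2
```

Passing the same dictionary as requests.get(url, params=...) should produce the same query string, with the request actually being sent.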

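At this point things may get a little confusing: what does request actually return? api.py shows that it returns session.request(......), which means reading sessions.py next, and it only gets more involved from there. Put simply, the return value works like a container holding what the server sent back, such as the page's HTML and the response headers. (It is just a Response object! I nearly cried when I finally read the source.) A crawler filters the information it needs out of this. If you have the energy, you can of course work through the source piece by piece for a deeper understanding; it just takes a lot of time.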
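For readers who want a peek at that chain without reading sessions.py, the following sketch pushes a request through a Session but stops just before sending, so it needs no network access. The URL is a placeholder, and this is only an approximation of the first step session.request() performs, not a reproduction of its source.

```python
import requests

# A placeholder URL; nothing is sent over the network in this sketch.
url = "http://httpbin.org/get"

with requests.Session() as session:
    # prepare_request merges the session's defaults (headers, cookies)
    # into the request before it would be sent.
    prepared = session.prepare_request(requests.Request("GET", url))

print(prepared.method)
print(prepared.url)
print(sorted(prepared.headers))  # names of the default headers, e.g. User-Agent
```

Calling session.send(prepared) inside the with block would then perform the actual request and return the Response.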

Here is a simple example. Theory makes the head spin; a bit of practice makes things clearer.

import requests

url = 'http://www.acfun.tv'
r = requests.get(url)
print(r.content)  # use content here because of encoding issues
print(r)
# Output (the print(r.content) result is long, so only the beginning is shown):
# b'<!DOCTYPE html><html xmlns="http://www.w3.org/1999/xhtml"><head><.......
# <Response [200]>

As you can see, r.content holds the page's source, so get() lets us pull down an entire web page. But what is r? It prints as <Response [200]>; what does that mean?

Let's check the type of r:

import requests

url = 'http://www.acfun.tv'
r = requests.get(url)
print(type(r))
# Output: <class 'requests.models.Response'>

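The result shows that r is an instance of the Response class defined in models.py of the requests module. The 200 in <Response [200]> is the HTTP status code, which indicates that the request succeeded.

Reading the source of Response shows its attributes and methods, including the content used above. I won't paste it here; a class definition is rather long.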

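Two members deserve attention here: content and text. The source shows that both are properties. content returns the member variable self._content, and text returns the following:

content = str(self.content, encoding, errors='replace')
return content

Checking their types shows that content is bytes and text is str, already decoded automatically. This matters. In Python 3, str no longer has a decode() method (only bytes does), so text is generally the recommended choice, although it depends on the situation. The point of being clear about this is to avoid type mismatches later when filtering with regular expressions. Anyone who has used urllib will have run into this, especially in Python 2.7, where str holds bytes and Unicode text is a separate type; in Python 3, str is always Unicode text and source files default to UTF-8. For more on encodings, see http://www.ruanyifeng.com/blog/2007/10/ascii_unicode_and_utf-8.html.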
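The type mismatch mentioned above can be reproduced without any network access. In this sketch a hand-written byte string stands in for r.content (the HTML is made up for illustration), and decoding it with str() mirrors the line quoted from the source of text.

```python
import re

# Stand-in for r.content: the raw bytes a server might send (made-up HTML).
content = "<html><head><title>AcFun</title></head><p>彈幕</p></html>".encode("utf-8")

# Roughly what r.text does: decode the bytes with a chosen encoding.
text = str(content, "utf-8", errors="replace")

print(type(content).__name__, type(text).__name__)  # bytes str
print(len(content) > len(text))  # True: the non-ASCII characters take several bytes each

# A str pattern works on str ...
title = re.findall(r"<title>(.*?)</title>", text)
print(title)  # ['AcFun']

# ... but raises TypeError when applied to bytes.
try:
    re.findall(r"<title>(.*?)</title>", content)
    mismatch = False
except TypeError:
    mismatch = True
print(mismatch)  # True

# On bytes, a bytes pattern is needed instead.
title_bytes = re.findall(rb"<title>(.*?)</title>", content)
print(title_bytes)  # [b'AcFun']
```

Working on text with str patterns throughout is usually the simpler route; the bytes pattern is shown only to make the contrast clear.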

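Finally, a brief introduction to some other commonly used members (there are more that I'm not yet familiar with; I'll fill them in over time):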

url: the URL of the response (the final one, if redirects were followed)

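raw: as the name suggests, the raw response: a file-like object wrapping the undecoded data from the server. It is meant for advanced use, and you are unlikely to read it directly.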

headers: the response headers. I use Firefox, where pressing F12 opens the developer tools and shows these headers.

encoding: the encoding used to decode the body; it can be changed, and it determines how text is decoded

cookies: the cookies sent back by the server, which it uses to record and recognize your client
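The members listed above can be gathered in one place. The sketch below uses two hypothetical helper names, describe and fetch_and_describe, which are not part of requests; the second one needs network access and is therefore only defined here, not called.

```python
import requests


def describe(response):
    """Collect a few commonly used Response members into a plain dict."""
    return {
        "url": response.url,
        "status_code": response.status_code,
        "encoding": response.encoding,
        "content_type": response.headers.get("Content-Type"),
        "cookies": response.cookies.get_dict(),
    }


def fetch_and_describe(url):
    """Send a GET request and summarise the response (needs network access)."""
    r = requests.get(url, timeout=10)
    return describe(r)
```

For example, print(fetch_and_describe('http://www.acfun.tv')) would show the summary for the page used earlier, assuming the site is reachable. Setting r.encoding before reading r.text changes how the body is decoded.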

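In short, get in requests takes the arguments you supply, sends a request to the server much as a browser would, and gives you back a Response. The response naturally includes the page's source, and with the source in hand you can extract whatever information you want from the page. This post only covers the basics of get; the requests library has many similar functions, such as post and put, which I am working through one by one. These are personal notes, so my understanding may be wrong or shallow in places. Having written this far, I wonder whether my pace is too slow: it took a whole evening to cover just request and get. Still, while reading the source, many problems I had run into before suddenly became much clearer.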
