Python Web Scraping: The requests Library in Detail (Part 1: How to Fetch Page Content with requests)

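Preface:

  1. Web scraping starts with connecting to a web page, which is typically done with a GET or a POST request. Either the urllib library (just urllib in Python 3; urllib and urllib2 in Python 2) or the requests library can do this. In terms of code complexity and readability, requests is clearly the better fit for programmers, but I could not find a detailed Chinese-language walkthrough of the library, which is why I wrote this article.
  2. The article contains some supplementary material; feel free to skim it if it does not interest you.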

Part 1: How to use the requests library

1. First, import the requests package:

import requests

 2. Then request the page with get or post (the two differ, so choose the one that fits your needs):

req_1 = requests.get('https://m.weibo.cn/status/4278783500356969')
req_2 = requests.post('https://m.weibo.cn/status/4278783500356969')

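 A: A quick aside: what do these two calls actually return?

Now, we have a Response object called req_1/req_2. We can get all the information we need from this object.
# The line above comes from the official documentation. What we get back is an object holding the source of the requested page (print req_1.text to take a look) along with related information,
# and its contents are accessed with the '.' operator. I will summarize them in detail at the end of the article [Note 1].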

 B: Another aside: what other operations can be performed on a URL?

 req = requests.put('http://httpbin.org/put', data = {'key':'value'})
 req = requests.delete('http://httpbin.org/delete')
 req = requests.head('http://httpbin.org/get')
 req = requests.options('http://httpbin.org/get')

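3. Most requests also need extra parameters. If you have used urllib, you will appreciate how convenient requests makes this:

 A: First, how to add parameters, form data, or other information to a request

  • Passing parameters / form data:

get: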

payload = {'key1': 'value1', 'key2': 'value2'}
req = requests.get('http://httpbin.org/get', params=payload)

A value here can also be a list.
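
To see what a list value does to the query string without sending anything, the request can be built and prepared locally. This is only a minimal sketch: the extra value 'value3' is made up for illustration, and requests.Request(...).prepare() is used here purely to inspect the final URL.

```python
import requests

# A list value is expanded into one key=value pair per item.
payload = {'key1': 'value1', 'key2': ['value2', 'value3']}

# Build the request locally; nothing is sent over the network.
prepared = requests.Request('GET', 'http://httpbin.org/get', params=payload).prepare()
print(prepared.url)
```

The printed URL should repeat key2 once for each item in the list.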

post:

yourData = {'key':'value'}
req = requests.post('http://httpbin.org/post', data=yourData)

# The next two examples show that form data can carry several kinds of values

# Example 1
payload_tuples = [('key1', 'value1'), ('key1', 'value2')]
r1 = requests.post('http://httpbin.org/post', data=payload_tuples)
payload_dict = {'key1': ['value1', 'value2']}
r2 = requests.post('http://httpbin.org/post', data=payload_dict)
print(r1.text)
{
  ...
  "form": {
    "key1": [
      "value1",
      "value2"
    ]
  },
  ...
}

# Example 2
r1.text == r2.text
True
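
The equality above can also be checked offline by comparing the form-encoded bodies directly. The sketch below assumes nothing beyond the two payloads from the example; prepare() encodes the data locally instead of contacting httpbin.org.

```python
import requests

url = 'http://httpbin.org/post'
payload_tuples = [('key1', 'value1'), ('key1', 'value2')]
payload_dict = {'key1': ['value1', 'value2']}

# prepare() form-encodes the data locally without sending the request.
body_tuples = requests.Request('POST', url, data=payload_tuples).prepare().body
body_dict = requests.Request('POST', url, data=payload_dict).prepare().body

print(body_tuples)
print(body_tuples == body_dict)
```

Both payloads should produce the same body, with key1 repeated once per value.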


# This example shows that the request body can be encoded in other ways as well, for instance sent as JSON

# Approach 1
import json
url = 'https://api.github.com/some/endpoint'
payload = {'some': 'data'}
req = requests.post(url, data=json.dumps(payload))

# Approach 2
url = 'https://api.github.com/some/endpoint'
payload = {'some': 'data'}
req = requests.post(url, json=payload)
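
The two approaches send the same JSON text but are not fully equivalent: json= also sets the Content-Type header, while a string passed through data= leaves it unset. A minimal sketch that prepares both requests locally (nothing is sent; the endpoint is the placeholder from the example above):

```python
import json
import requests

url = 'https://api.github.com/some/endpoint'
payload = {'some': 'data'}

# Prepare both variants locally; nothing is sent.
manual = requests.Request('POST', url, data=json.dumps(payload)).prepare()
automatic = requests.Request('POST', url, json=payload).prepare()

# A plain string body gets no Content-Type; json= declares the body as JSON.
print(manual.headers.get('Content-Type'))
print(automatic.headers.get('Content-Type'))
# The bodies themselves decode to the same data.
print(json.loads(manual.body) == json.loads(automatic.body))
```

If a server insists on a JSON content type, the first approach would need an explicit headers={'Content-Type': 'application/json'}.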
  • Passing headers

get:

headers = {'user-agent': 'my-app/0.0.1'}
req = requests.get(url, headers=headers)

post:

header = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36'}
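# xsrf is assumed to hold the _xsrf token taken from the login page beforehand
data = {'_xsrf': xsrf, 'email': 'your email', 'password': 'your password',
        'remember_me': True}
session = requests.Session()
result = session.post('https://www.zhihu.com/login/email', headers=header, data=data)  # result is a Response object; result.text is a JSON string containing the login result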
  • Passing cookies

get:

 url = 'http://httpbin.org/cookies'
 req = requests.get(url, cookies=dict(cookies_are='working'))

post:

import requests
r = requests.get(url1)  # the first URL you request
headers = {
    'Accept':'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Encoding':'gzip, deflate, sdch',
    'Accept-Language':'zh-CN,zh;q=0.8',
    'Connection':'keep-alive',
    'Cache-Control':'no-cache',
    'Content-Length':'6',
    'Content-Type':'application/x-www-form-urlencoded; charset=UTF-8',
    'Host':'www.mm131.com',
    'Pragma':'no-cache',
    'Origin':'http://www.mm131.com/xinggan/',
    'Upgrade-Insecure-Requests':'1',
    'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36',
    'X-Requested-With':'XMLHttpRequest'
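}  # example headers; use the ones your own POST request actually sends
# Send the cookies returned by the first response back as a single Cookie header
headers['Cookie'] = '; '.join('='.join(item) for item in r.cookies.items())
r = requests.post(url2, headers=headers, data=data)  # the second URL you request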
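
Assembling the Cookie header by hand is often unnecessary, because a Session keeps cookies between requests and attaches them automatically. The sketch below is an offline illustration under stated assumptions: the cookie name and value are hypothetical and set by hand (a real server would set them in its first response), and prepare_request() is used only to inspect what would be sent.

```python
import requests

session = requests.Session()

# Normally the server sets this cookie in response to the first request;
# here it is set by hand (hypothetical name and value) so the sketch runs offline.
session.cookies.set('sessionid', 'abc123', domain='httpbin.org', path='/')

# prepare_request() merges the session's cookies into the outgoing request.
request = requests.Request('POST', 'http://httpbin.org/post', data={'key': 'value'})
prepared = session.prepare_request(request)
print(prepared.headers.get('Cookie'))
```

In real use, calling session.get(url1) followed by session.post(url2, data=data) would carry the cookies over in the same way.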
  • Passing files

post:

# Basic version:
url = 'http://httpbin.org/post'
files = {'file': open('report.xls', 'rb')}

req = requests.post(url, files=files)
req.text
{
  ...
  "files": {
    "file": "<censored...binary...data>"
  },
  ...
}

# Advanced version (explicit filename, content type, and headers):
url = 'http://httpbin.org/post'
files = {'file': ('report.xls', open('report.xls', 'rb'), 'application/vnd.ms-excel', {'Expires': '0'})}

req = requests.post(url, files=files)
req.text
{
  ...
  "files": {
    "file": "<censored...binary...data>"
  },
  ...
}

# A string can also be uploaded as a file:
url = 'http://httpbin.org/post'
files = {'file': ('report.csv', 'some,data,to,send\nanother,row,to,send\n')}

req = requests.post(url, files=files)
req.text
{
  ...
  "files": {
    "file": "some,data,to,send\\nanother,row,to,send\\n"
  },
  ...
}
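
What the files argument produces can be inspected without uploading anything. A minimal sketch, reusing the string upload from the last example and preparing the request locally:

```python
import requests

url = 'http://httpbin.org/post'
files = {'file': ('report.csv', 'some,data,to,send\nanother,row,to,send\n')}

# Encode the upload locally; nothing is sent.
prepared = requests.Request('POST', url, files=files).prepare()

# The body is multipart/form-data with a generated boundary.
print(prepared.headers['Content-Type'])
# The filename travels inside the multipart body.
print(b'filename="report.csv"' in prepared.body)
```

When uploading a real file, opening it in a with block makes sure it is closed once the request has been sent.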

 B: As a further aside, here are the function definitions of get and post, which give a fuller picture of the parameters:

get:

def get(url, params=None, **kwargs):
    r"""Sends a GET request.

    :param url: URL for the new :class:`Request` object.
    :param params: (optional) Dictionary or bytes to be sent in the query string for the :class:`Request`.
    :param \*\*kwargs: Optional arguments that ``request`` takes.
    :return: :class:`Response <Response>` object
    :rtype: requests.Response
    """

    kwargs.setdefault('allow_redirects', True)
    return request('get', url, params=params, **kwargs)

post:

def post(url, data=None, json=None, **kwargs):
    r"""Sends a POST request.

    :param url: URL for the new :class:`Request` object.
    :param data: (optional) Dictionary (will be form-encoded), bytes, or file-like object to send in the body of the :class:`Request`.
    :param json: (optional) json data to send in the body of the :class:`Request`.
    :param \*\*kwargs: Optional arguments that ``request`` takes.
    :return: :class:`Response <Response>` object
    :rtype: requests.Response
    """

    return request('post', url, data=data, json=json, **kwargs)

 C: Next, a way to print the URL after the parameters have been added:

print(req.url)

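D: Another thing to watch out for is encoding:

When you use print(req.text), requests automatically decodes the body for you (the raw content comes back as bytes, and with urllib you have to decode it yourself). Changing the encoding it uses is just as simple: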

req.encoding = 'ISO-8859-1'
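
Why the choice matters: req.text is req.content decoded with req.encoding, so a wrong encoding yields garbled text rather than an error. The sketch below uses only the standard library and an arbitrary sample string (an assumption for illustration) to show the same bytes decoded two ways.

```python
# The same bytes, as a server might return them for a UTF-8 page.
text = 'caf\u00e9'
raw = text.encode('utf-8')

as_utf8 = raw.decode('utf-8')         # correct guess: the original text
as_latin1 = raw.decode('ISO-8859-1')  # wrong guess: one character per byte

print(as_utf8 == text)
print(len(as_utf8), len(as_latin1))
```

The second decode succeeds but produces a longer, mangled string, which is the typical symptom of a wrong req.encoding.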

If you want the result as raw bytes (content is an attribute, not a method):

 req.content

And if you want the result parsed as JSON:

req.json()
# Always handle exceptions here: the page may well not be valid JSON, or the request itself may have failed
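
One possible shape for that exception handling is a small wrapper around json(). This is a sketch, not part of requests: safe_json and FakeResponse are hypothetical names, and FakeResponse only stands in for a real Response so the example runs offline.

```python
import json

def safe_json(response, default=None):
    """Return the parsed JSON body, or `default` if the body is not valid JSON."""
    try:
        return response.json()
    except ValueError:
        # JSON decoding errors are ValueError subclasses
        return default

class FakeResponse:
    """Hypothetical stand-in for requests.Response so the sketch runs offline."""
    def __init__(self, text):
        self.text = text

    def json(self):
        return json.loads(self.text)

print(safe_json(FakeResponse('{"ok": true}')))
print(safe_json(FakeResponse('<html>not json</html>'), default={}))
```

With a real response, safe_json(req) is called the same way; checking req.status_code first helps separate a failed request from a non-JSON page.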

If you want the raw, unprocessed response:

req = requests.get('https://api.github.com/events', stream=True)
req.raw
<urllib3.response.HTTPResponse object at 0x101194810>

req.raw.read(10)
b'\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\x03'

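# Save the streamed body to a file in chunks (filename is a path of your choice);
# of course, real code should add some exception handling here
with open(filename, 'wb') as fd:
    for chunk in req.iter_content(chunk_size=128):
        fd.write(chunk)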
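
The writing half of that loop can be tried without a network connection, since iter_content simply yields bytes chunks. In this sketch write_chunks is a hypothetical helper, and the list of chunks and the temporary path are stand-ins for a real streamed response and a real filename.

```python
import os
import tempfile

def write_chunks(chunks, path):
    """Write an iterable of bytes chunks to path and return the byte count."""
    written = 0
    with open(path, 'wb') as fd:
        for chunk in chunks:
            if chunk:  # skip empty chunks
                written += fd.write(chunk)
    return written

# Stand-in for req.iter_content(chunk_size=128): any iterable of bytes works.
fake_chunks = [b'hello ', b'', b'world']
path = os.path.join(tempfile.mkdtemp(), 'download.bin')
print(write_chunks(fake_chunks, path))
```

With a live request the call would look like write_chunks(req.iter_content(chunk_size=128), filename).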

4. If you need information about the response, such as its headers:

req.headers
{
    'content-encoding': 'gzip',
    'transfer-encoding': 'chunked',
    'connection': 'close',
    'server': 'nginx/1.0.4',
    'x-runtime': '148ms',
    'etag': '"e1ca502697e5c9317743dc078f67693f"',
    'content-type': 'application/json'
}

req.headers['Content-Type']
'application/json'

req.headers.get('content-type')
'application/json'
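
Both lookups above work regardless of capitalization because req.headers is a case-insensitive dictionary. A minimal offline sketch using the same class, requests.structures.CaseInsensitiveDict, with a single sample header:

```python
from requests.structures import CaseInsensitiveDict

headers = CaseInsensitiveDict({'content-type': 'application/json'})

# Any capitalization of the key finds the same entry.
print(headers['Content-Type'])
print(headers.get('CONTENT-TYPE'))
print('content-type' in headers)
```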

5. How to read cookies and send them:

# Reading a cookie
>>> url = 'http://example.com/some/cookie/setting/url'
>>> r = requests.get(url)

>>> r.cookies['example_cookie_name']
'example_cookie_value'
# Sending cookies
>>> url = 'http://httpbin.org/cookies'
>>> cookies = dict(cookies_are='working')

>>> r = requests.get(url, cookies=cookies)
>>> r.text
'{"cookies": {"cookies_are": "working"}}'


# Using a cookie jar (RequestsCookieJar) for both steps
>>> jar = requests.cookies.RequestsCookieJar()
>>> jar.set('tasty_cookie', 'yum', domain='httpbin.org', path='/cookies')
>>> jar.set('gross_cookie', 'blech', domain='httpbin.org', path='/elsewhere')
>>> url = 'http://httpbin.org/cookies'
>>> r = requests.get(url, cookies=jar)
>>> r.text
'{"cookies": {"tasty_cookie": "yum"}}'
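
Only tasty_cookie comes back because the jar matches each cookie's domain and path against the request URL. The sketch below checks this offline by preparing requests locally; cookie_header_for is a hypothetical helper introduced only for this illustration.

```python
import requests

jar = requests.cookies.RequestsCookieJar()
jar.set('tasty_cookie', 'yum', domain='httpbin.org', path='/cookies')
jar.set('gross_cookie', 'blech', domain='httpbin.org', path='/elsewhere')

def cookie_header_for(url):
    """Return the Cookie header that would be sent to url (None if no cookie matches)."""
    prepared = requests.Request('GET', url, cookies=jar).prepare()
    return prepared.headers.get('Cookie')

print(cookie_header_for('http://httpbin.org/cookies'))
print(cookie_header_for('http://httpbin.org/elsewhere'))
```

Each URL should receive only the cookie whose path it falls under.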

6. Other topics (to be covered in a later post):

A: Status codes

B: Timeouts

C: Handling exceptions and errors
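
Pending that fuller treatment, the three topics can be sketched together in one small wrapper. Everything here is an assumption for illustration: fetch is a hypothetical helper, the 5-second timeout is arbitrary, and the deliberately invalid URL is chosen so the example fails before any network traffic.

```python
import requests

def fetch(url, timeout=5):
    """Return the Response for url, or None if the request fails."""
    try:
        response = requests.get(url, timeout=timeout)
        response.raise_for_status()  # turn 4xx/5xx status codes into exceptions
        return response
    except requests.exceptions.RequestException as exc:
        # Base class for timeouts, connection errors, bad URLs and HTTP errors
        print('request failed:', exc)
        return None

# A URL without a scheme is rejected before any network traffic happens.
print(fetch('not-a-valid-url'))
```

On success, response.status_code holds the numeric status code of the reply.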