Overview and HTTP Request/Response Handling

阿新 • Published: 2018-12-05

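1. Overview

　　A crawler, more precisely a web crawler, is also called a web spider, a web ant, and so on.

　　Search engines are the best-known users of web crawlers.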

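2. Types of crawlers

　　General-purpose crawlers:

　　　　The typical example is a search engine: it collects data indiscriminately, stores it, extracts keywords, builds an index, and offers users a search interface.

　　　　The usual crawl flow:

　　　　　　1. Start with an initial batch of URLs and put them in the queue of URLs waiting to be crawled.

　　　　　　2. Take URLs from the queue, resolve each host to an IP address via DNS, download the HTML page from the site at that IP, save it on a local server, and move the finished URL to the queue of already-crawled URLs.

　　　　　　3. Analyze the content of those pages, find the other URLs of interest that they link to, and repeat step 2 until the stop condition for the crawl is met.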
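　　　　The three steps above can be sketched as a breadth-first loop over a queue. This is only an illustration under stated assumptions: FAKE_WEB, fetch_links, and crawl are hypothetical names, and an in-memory dictionary stands in for DNS resolution, downloading, and link extraction.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> URLs linked from that page.
FAKE_WEB = {
    "http://a.example/": ["http://b.example/", "http://c.example/"],
    "http://b.example/": ["http://c.example/", "http://a.example/"],
    "http://c.example/": ["http://d.example/"],
    "http://d.example/": [],
}


def fetch_links(url):
    """Stand-in for 'download the page and extract its links'."""
    return FAKE_WEB.get(url, [])


def crawl(seed_urls, max_pages=100):
    """Breadth-first crawl; returns URLs in the order they were crawled."""
    pending = deque(seed_urls)   # step 1: URLs waiting to be crawled
    crawled = []                 # URLs already crawled, in order
    seen = set(seed_urls)        # every URL ever queued, to avoid repeats

    while pending and len(crawled) < max_pages:   # stop condition
        url = pending.popleft()                   # step 2: take a URL
        links = fetch_links(url)                  # ... and "download" it
        crawled.append(url)
        for link in links:                        # step 3: queue new links
            if link not in seen:
                seen.add(link)
                pending.append(link)
    return crawled


order = crawl(["http://a.example/"])
print(order)
```

　　　　The seen set is what keeps the loop from revisiting pages that link back to each other; max_pages is one possible stop condition among many.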

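　　　　How a search engine learns the URL of a new site:

  - The new site submits itself to the search engine
  - Through outbound links on the pages of other sites
  - The search engine cooperates with DNS providers to learn about newly listed sites

　　Focused crawlers:

　　　　Crawler programs written for data in one specific domain. They collect only certain categories of data, so they are topic-oriented crawlers.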

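3. The Robots protocol

　　A site publishes a robots.txt file that tells crawler engines what may and may not be crawled.

　　Allow: permitted; Disallow: not permitted

　　Wildcards can be used.

　　For example, Taobao: http://www.taobao.com/robots.txt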


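　　This is a gentlemen's agreement: even crawling has its code of conduct.

　　To help search engines index a site's content more efficiently, the protocol also lets the site point to files such as a Sitemap.

　　A Sitemap is usually an XML file listing update information for the content the site wants crawled.

　　The paths that robots.txt forbids are often exactly the content we might be interested in, so the file ends up revealing those addresses.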
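　　As a small sketch of how a crawler can honor these rules, the standard-library urllib.robotparser can evaluate Allow and Disallow lines. The robots.txt text, the crawler name "MyCrawler", and the example.com URLs below are hypothetical; the rules are parsed from a string so that no network access is needed.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, supplied inline instead of downloaded.
ROBOTS_TXT = """\
User-agent: *
Allow: /public/
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Ask whether a given user agent may fetch a given URL.
allowed = parser.can_fetch("MyCrawler", "http://example.com/public/index.html")
blocked = parser.can_fetch("MyCrawler", "http://example.com/private/data.html")
print(allowed, blocked)
```

　　For a live site, set_url() followed by read() would load the real file instead of the inline text.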

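4. Handling HTTP requests and responses

　　Crawling a page simply means accessing it over HTTP. Visiting it in a browser is a manual action; a crawler turns that action into a program.

　　The urllib package:

　　　　urllib is part of the standard library. It is a package that bundles the following modules for working with URLs:

  - urllib.request opens and reads URLs
  - urllib.error contains the exceptions raised by urllib.request
  - urllib.parse parses URLs
  - urllib.robotparser parses robots.txt files

　　　　Python 2 provided urllib and urllib2. The former offered a lower-level interface and urllib2 wrapped it further. Python 3 merged the two into a single standard-library package named urllib.

　　The urllib.request module

　　　　The module defines functions and classes for opening URLs (mostly HTTP) in settings that involve basic and digest authentication, redirects, cookies, and so on.

　　　　The urlopen function:

　　　　　　urlopen(url, data=None)

　　　　　　url is a URL string or an instance of the Request class.

　　　　　　data is the data to submit. If data is None a GET request is issued, otherwise a POST request; see urllib.request.Request.get_method.

　　　　　　It returns a response object of class http.client.HTTPResponse, which is a file-like object.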

from urllib.request import urlopen

# Open a URL; the return value is a response object, which is file-like.
# The link below responds with a 301 redirect.
response = urlopen('https://www.bing.com')  # GET, because data is None
print(response)  # a file-like object
with response:
    print(1, type(response))
    print(2, response.status, response.reason)
    print(3, response.geturl())
    print(4, response.info())
    print(5, response.read())

print(response.closed)


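　　　　In the code above, urllib.request.urlopen issues an HTTP GET request and the web server returns the page content. The response data is wrapped in a file-like object, so it can be read with read, readline, or readlines. status and reason hold the status code and reason phrase, and info() returns the response headers.

　　The User-Agent problem

　　　　The example above is very compact and already retrieves the site's response, but urlopen called with just a URL string and data offers no way to change the request headers.

　　　　To modify an HTTP header such as User-Agent, another mechanism is needed.

　　　　The source code builds the default User-Agent as follows: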

class OpenerDirector:
    def __init__(self):
        client_version = "Python-urllib/%s" % __version__
        self.addheaders = [('User-agent', client_version)]
        # self.handlers is retained only for backward compatibility
        self.handlers = []
        # manage the individual handlers
        self.handle_open = {}
        self.handle_error = {}
        self.process_response = {}
        self.process_request = {}

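　　　　With the interpreter used here it shows up as Python-urllib/3.7.

　　　　Some sites block crawlers, so the crawler has to pose as a browser: open any browser, copy its UA (User-Agent) value, and send that instead.

　　The Request class

　　　　Request(url, data=None, headers={})

　　　　The initializer builds a request object; a dictionary of headers can be passed in.

　　　　The data argument decides between GET and POST (None means GET, any data means POST).

　　　　add_header(key, val) adds one key-value pair to the headers.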

from urllib.request import Request, urlopen

# Build a Request object for a URL, then open it.
# url = 'http://movie.douban.com/'  # note: the trailing slash is required
url = 'http://www.bing.com/'

ua = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3486.0 Safari/537.36'

# Build a request object
request = Request(url)
request.add_header('User-Agent', ua)

print(type(request), '=============')
# add_header stores the key as key.capitalize(), i.e. 'User-agent',
# so get_header must be called with that spelling.
print(5, request.get_header('User-agent'))
# Obtain the response object
response = urlopen(request, timeout=20)  # accepts a Request object or a URL string

print(type(response))

with response:
    # getcode() simply returns status
    print(1, response.status, response.getcode(), response.reason)
    # URL the data came from; differs from the original URL after a redirect
    print(2, response.geturl())
    # response headers
    print(3, response.info())
    # read the returned content
    print(4, response.read())

print(5, request.get_header('User-agent'))
print(6, 'user-agent'.capitalize())  # 'User-agent'
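　　　　The two rules described for Request (data selects GET or POST, and header names are stored capitalized) can be checked without sending anything over the network. A minimal sketch; the URL and the User-Agent string are hypothetical placeholders.

```python
from urllib.request import Request

ua = "ExampleCrawler/0.1"  # hypothetical User-Agent string

# No data: GET. Headers passed to the constructor go through add_header.
get_req = Request("http://example.com/", headers={"User-Agent": ua})
# Any data (bytes): POST.
post_req = Request("http://example.com/", data=b"q=1")

print(get_req.get_method())
print(post_req.get_method())

# The header key was stored as 'User-agent', so only that spelling is found.
print(get_req.get_header("User-agent"))
print(get_req.get_header("User-Agent"))
```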


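　　The urllib.parse module

　　This module handles URL encoding and decoding.

　　　　Encoding: the first argument of urlencode must be a dictionary or a sequence of two-element tuples.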

from urllib import parse

u = parse.urlencode({
    'url': 'http://www.magedu.com/python',
    'p_url': 'http:www.magedu.com/python?id=1&name=张三'
})
print(u)

url=http%3A%2F%2Fwww.magedu.com%2Fpython&p_url=http%3Awww.magedu.com%2Fpython%3Fid%3D1%26name%3D%E5%BC%A0%E4%B8%89

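　　The output shows that the colon, slash, &, equals sign, and question mark have all been encoded. What follows each % is the hexadecimal value of a single byte.

　　The address part of a URL normally has no need for Chinese paths. The parameter part is different: with either GET or POST, the submitted data may contain slashes and similar symbols that are meant as data, not as metacharacters. Sent as-is, they would leave the receiver unable to tell metacharacters from data, so to be safe the data part is usually URL-encoded, which removes the ambiguity. Chinese text can be sent as well and is encoded the same way: it is first converted to a byte sequence according to the character encoding, and each byte is then written as its hexadecimal value prefixed with a percent sign.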
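　　The byte-level rule just described can be sketched by hand and compared with urllib.parse.quote. The helper percent_encode is a hypothetical name; unlike quote, it encodes every byte and leaves no "safe" characters untouched.

```python
from urllib.parse import quote


def percent_encode(text, encoding="utf-8"):
    """Write every byte of text as %XX (two uppercase hex digits)."""
    return "".join("%{:02X}".format(byte) for byte in text.encode(encoding))


print(percent_encode("中"))          # three UTF-8 bytes, three %XX groups
print(quote("中"))                   # the library gives the same result
print(percent_encode("/"), quote("/", safe=""))  # a slash treated as data
```

　　quote leaves "/" alone by default because it is usually a path separator; passing safe="" treats it as data.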

'''
The page uses UTF-8 encoding
https://www.baidu.com/s?wd=中
After URL encoding, the URL above becomes:
https://www.baidu.com/s?wd=%E4%B8%AD
'''
from urllib import parse

u = parse.urlencode({'wd': '中'})  # encode
print(u)

url = 'https://www.baidu.com/s?{}'.format(u)
print(url)

print('中'.encode('utf-8'))
# decode
print(parse.unquote(u))
print(parse.unquote(url))

　　Output:

wd=%E4%B8%AD
https://www.baidu.com/s?wd=%E4%B8%AD
b'\xe4\xb8\xad'
wd=中
https://www.baidu.com/s?wd=中
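　　Decoding also covers taking a URL apart. A short sketch using urlparse and parse_qs from the same module; the extra parameter pn=10 is a hypothetical addition to the Baidu URL used above.

```python
from urllib.parse import urlparse, parse_qs

url = "https://www.baidu.com/s?wd=%E4%B8%AD&pn=10"  # pn=10 is hypothetical

parts = urlparse(url)              # split into scheme, host, path, query...
print(parts.scheme, parts.netloc, parts.path)

query = parse_qs(parts.query)      # decode the query into {key: [values]}
print(query)
```

　　parse_qs returns a list for every key because a key may legitimately appear more than once in a query string.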

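5. Submission methods

　　The HTTP methods most often used to exchange data are GET and POST.

　　With GET, the data is passed in the URL, so it travels in the head of the HTTP message rather than in the body.

　　With POST, the data is submitted in the body of the HTTP message.

　　In both cases the data consists of key-value pairs, with multiple parameters joined by &.

　　GET:

　　Open the Bing search site and take a search URL: http://cn.bing.com/search?q=張三

　　Requirement:

　　Write a program that runs a Bing search for a keyword and saves the returned result to an HTML file.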

from urllib.request import Request, urlopen
from urllib.parse import urlencode

keyword = input('>> Enter a keyword: ')

data = urlencode({'q': keyword})

base_url = 'http://cn.bing.com/search'

url = '{}?{}'.format(base_url, data)

print(url)

# Pose as a browser
ua = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3486.0 Safari/537.36'
request = Request(url, headers={'User-agent': ua})
response = urlopen(request)

with response:
    # read() returns bytes, so the file is opened in binary mode
    with open('./bing.html', 'wb') as f:
        f.write(response.read())

　　Result:

　　The content obtained through an HTTP GET arrives as bytes (binary).

　　POST:

　　　　http://httpbin.org/ is a test site that behaves like an echo: it sends back whatever you send it.

from urllib.request import Request, urlopen
from urllib.parse import urlencode

request = Request('http://httpbin.org/post')

ua = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3486.0 Safari/537.36'

request.add_header('User-agent', ua)

data = urlencode({'name': '张三,@=/&=', 'age': '12'})

print(data)

# Passing data makes this a POST; the data is submitted as a form and must be bytes
res = urlopen(request, data=data.encode())
with res:
    print(res.read().decode())

　　Result:

name=%E5%BC%A0%E4%B8%89%2C%40%3D%2F%26%3D&age=12
{
  "args": {}, 
  "data": "", 
  "files": {}, 
  "form": {
    "age": "12", 
    "name": "\u5f20\u4e09,@=/&="
  }, 
  "headers": {
    "Accept-Encoding": "identity", 
    "Connection": "close", 
    "Content-Length": "48", 
    "Content-Type": "application/x-www-form-urlencoded", 
    "Host": "httpbin.org", 
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3486.0 Safari/537.36"
  }, 
  "json": null, 
  "origin": "61.149.196.193", 
  "url": "http://httpbin.org/post"
}
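　　The echoed response lines up with what was sent: the form body is 48 bytes long, matching the Content-Length header, and decoding it recovers the original fields. A small offline sketch of that round trip; it only mirrors what a server might do when it decodes the form and is not httpbin's actual code.

```python
from urllib.parse import urlencode, parse_qs

form = {"name": "张三,@=/&=", "age": "12"}

body = urlencode(form).encode()   # bytes, as urlopen(data=...) requires
print(body)
print(len(body))                  # the length sent as Content-Length

# Decoding the body recovers the fields, special characters included.
decoded = parse_qs(body.decode())
print(decoded)
```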

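　　Handling JSON data

　　　　Open Douban Movies and look at the "hot" tag in the list of recently popular movies.

　　　　Inspecting the page shows that this part of the content is JSON data fetched from the backend through AJAX.

　　　　The URL requested is: https://movie.douban.com/j/search_subjects?type=movie&tag=%E7%83%AD%E9%97%A8&page_limit=50&page_start=0

　　　　　　The server answers this request with JSON data.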

from urllib import parse

# Decode the URL:
print(parse.unquote('https://movie.douban.com/j/search_subjects?type=movie&tag=%E7%83%AD%E9%97%A8&page_limit=50&page_start=0'))

# Output:
# https://movie.douban.com/j/search_subjects?type=movie&tag=热门&page_limit=50&page_start=0

　　　　The same content can also be retrieved in code.
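　　　　One way this might look is sketched below. The function names are hypothetical, and the JSON layout (a top-level "subjects" list whose items carry a "title") is an assumption, not something confirmed by the text. The network call is wrapped in a function that is not invoked here, so the example runs on a made-up sample body.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

BASE_URL = "https://movie.douban.com/j/search_subjects"


def build_url(tag="热门", limit=50, start=0):
    """Rebuild the AJAX URL from its parameters."""
    query = urlencode({"type": "movie", "tag": tag,
                       "page_limit": limit, "page_start": start})
    return "{}?{}".format(BASE_URL, query)


def parse_titles(raw):
    """Decode a JSON body; the 'subjects' and 'title' keys are assumed."""
    payload = json.loads(raw.decode("utf-8"))
    return [item.get("title") for item in payload.get("subjects", [])]


def fetch_titles(ua):
    """Send the real request (needs network access; not called below)."""
    request = Request(build_url(), headers={"User-Agent": ua})
    with urlopen(request, timeout=20) as response:
        return parse_titles(response.read())


# Made-up sample body standing in for the server's reply.
sample = b'{"subjects": [{"title": "Movie A"}, {"title": "Movie B"}]}'
print(build_url())
print(parse_titles(sample))
```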

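6. Ignoring HTTPS certificates

　　HTTPS uses the SSL (Secure Sockets Layer) protocol to encrypt network data at the transport layer. Using HTTPS requires a certificate, and the certificate has to be certified by a CA (certificate authority).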

from urllib.request import Request, urlopen

# These can be accessed:
# request = Request('http://www.12306.cn/mormhweb')
# request = Request('https://www.baidu.com')

request = Request('https://www.12306.cn/mormhweb/')

print(request)

ua = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3486.0 Safari/537.36'

request.add_header('User-agent', ua)

with urlopen(request) as res:
    print(res._method)
    print(res.read())

　　Ignoring the untrusted-certificate warning:

from urllib.request import Request, urlopen
import ssl

# These can be accessed:
# request = Request('http://www.12306.cn/mormhweb')
# request = Request('https://www.baidu.com')

request = Request('https://www.12306.cn/mormhweb/')

print(request)

ua = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3486.0 Safari/537.36'

request.add_header('User-agent', ua)
# Ignore untrusted certificates (a context that performs no verification)
context = ssl._create_unverified_context()
res = urlopen(request, context=context)

with res:
    print(res._method)
    print(res.geturl())
    print(res.read().decode())
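　　The listing above relies on ssl._create_unverified_context, which is a private function. As an alternative sketch (a swapped-in technique, not the original post's method), a similar non-verifying context can be built from public ssl attributes. Nothing is sent over the network here.

```python
import ssl

# Default context: certificates and host names are verified.
verified = ssl.create_default_context()

# Non-verifying context built from public attributes only.
unverified = ssl.create_default_context()
unverified.check_hostname = False        # must be turned off first
unverified.verify_mode = ssl.CERT_NONE   # then skip certificate checks

print(verified.verify_mode == ssl.CERT_REQUIRED, verified.check_hostname)
print(unverified.verify_mode == ssl.CERT_NONE, unverified.check_hostname)
```

　　Such a context is passed the same way, as urlopen(request, context=unverified). Skipping verification removes the protection against impersonated servers, so it is best kept to test scenarios.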

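7. The urllib3 library

　　https://urllib3.readthedocs.io/en/latest

　　The standard-library urllib lacks some key features that the third-party library urllib3 provides, such as connection pool management.

　　Installation: pip install urllib3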

import urllib3

url = 'http://movie.douban.com/'

ua = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3486.0 Safari/537.36'

# Connection pool management
with urllib3.PoolManager() as http:
    response = http.request('GET', url, headers={"User-agent": ua})
    print(1, type(response))
    print(2, response.status, response.reason)
    print(3, response.headers)
    print(4, response.data.decode())


8. The requests library (the one actually used in development)

　　requests is built on urllib3 but offers a friendlier API, so it is the recommended choice.

import requests

ua = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3486.0 Safari/537.36'

url = 'http://movie.douban.com/'

response = requests.request('GET', url, headers={'User-agent': ua})

with response:
    print(type(response))
    print(response)
    print(response.url)
    print(response.status_code)
    print(response.request.headers)  # request headers
    print(response.headers)          # response headers
    print(response.encoding)
    response.encoding = 'utf-8'
    print(response.text[:100])

    with open('./movie.html', 'w', encoding='utf-8') as f:
        f.write(response.text)  # save to a file

　　Result:

<class 'requests.models.Response'>
<Response [200]>
https://movie.douban.com/
200
{'User-agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3486.0 Safari/537.36', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}
{'Date': 'Wed, 05 Dec 2018 03:34:12 GMT', 'Content-Type': 'text/html; charset=utf-8', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', 'Keep-Alive': 'timeout=30', 'Vary': 'Accept-Encoding', 'X-Xss-Protection': '1; mode=block', 'X-Douban-Mobileapp': '0', 'Expires': 'Sun, 1 Jan 2006 01:00:00 GMT', 'Pragma': 'no-cache', 'Cache-Control': 'must-revalidate, no-cache, private', 'Set-Cookie': 'll="108288"; path=/; domain=.douban.com; expires=Thu, 05-Dec-2019 03:34:12 GMT, bid=8QfNwA452hU; Expires=Thu, 05-Dec-19 03:34:12 GMT; Domain=.douban.com; Path=/', 'X-DOUBAN-NEWBID': '8QfNwA452hU', 'X-DAE-Node': 'brand15', 'X-DAE-App': 'movie', 'Server': 'dae', 'X-Content-Type-Options': 'nosniff', 'Content-Encoding': 'gzip'}
utf-8
<!DOCTYPE html>
<html lang="zh-cmn-Hans" class="ua-windows ua-webkit">
<head>
    <meta http-equiv="

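　　requests uses a Session object under the hood. Working with a Session directly keeps session information, such as cookies, across multiple exchanges with the server.

　　Otherwise every call starts a fresh request.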

# Using a Session directly
import requests

ua = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3486.0 Safari/537.36'

urls = ['https://www.baidu.com', 'https://www.baidu.com']

# Variant 1: requests.request() does not go through a shared session, so each
# call is a brand-new request, as if two separate browsers had been opened.
for url in urls:
    response = requests.request('GET', url, headers={'User-agent': ua})
    print(response)
    with response:
        print(response.request.headers)
        print(response.cookies)
        print(response.text[:20])
        print('----------------------------------------------')
'''
<Response [200]>
{'User-agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3486.0 Safari/537.36', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}
<RequestsCookieJar[<Cookie BAIDUID=808F0938C6A9CC144A1F6BEA823FF4F5:FG=1 for .baidu.com/>, <Cookie BIDUPSID=808F0938C6A9CC144A1F6BEA823FF4F5 for .baidu.com/>, <Cookie H_PS_PSSID=1444_21081_27509 for .baidu.com/>, <Cookie PSTM=1543981317 for .baidu.com/>, <Cookie delPer=0 for .baidu.com/>, <Cookie BDSVRTM=0 for www.baidu.com/>, <Cookie BD_HOME=0 for www.baidu.com/>]>
<!DOCTYPE html>
<!--
----------------------------------------------
<Response [200]>
{'User-agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3486.0 Safari/537.36', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}
<RequestsCookieJar[<Cookie BAIDUID=808F0938C6A9CC14FA9EECCB0280B074:FG=1 for .baidu.com/>, <Cookie BIDUPSID=808F0938C6A9CC14FA9EECCB0280B074 for .baidu.com/>, <Cookie H_PS_PSSID=26523_1469_25810_21098_26350_22073 for .baidu.com/>, <Cookie PSTM=1543981317 for .baidu.com/>, <Cookie delPer=0 for .baidu.com/>, <Cookie BDSVRTM=0 for www.baidu.com/>, <Cookie BD_HOME=0 for www.baidu.com/>]>
<!DOCTYPE html>
<!--
----------------------------------------------
'''

# Variant 2: session.get() reuses the session, so cookies are carried over.
with requests.Session() as session:
    for url in urls:
        response = session.get(url, headers={'User-agent': ua})
        print(response)
        with response:
            print(response.request.headers)
            print(response.cookies)
            print(response.text[:20])
            print('----------------------------------------------')
'''
<Response [200]>
{'User-agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3486.0 Safari/537.36', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}
<RequestsCookieJar[<Cookie BAIDUID=5A320955507582B839E723DB6F55B2BD:FG=1 for .baidu.com/>, <Cookie BIDUPSID=5A320955507582B839E723DB6F55B2BD for .baidu.com/>, <Cookie H_PS_PSSID=1434_21091_18559_27245_27509 for .baidu.com/>, <Cookie PSTM=1543981366 for .baidu.com/>, <Cookie delPer=0 for .baidu.com/>, <Cookie BDSVRTM=0 for www.baidu.com/>, <Cookie BD_HOME=0 for www.baidu.com/>]>
<!DOCTYPE html>
<!--
----------------------------------------------
<Response [200]>    (second request: the cookies are sent along)
{'User-agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3486.0 Safari/537.36', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive', 'Cookie': 'BAIDUID=5A320955507582B839E723DB6F55B2BD:FG=1; BIDUPSID=5A320955507582B839E723DB6F55B2BD; H_PS_PSSID=1434_21091_18559_27245_27509; PSTM=1543981366; delPer=0; BDSVRTM=0; BD_HOME=0'}
<RequestsCookieJar[<Cookie H_PS_PSSID=1434_21091_18559_27245_27509 for .baidu.com/>, <Cookie delPer=0 for .baidu.com/>, <Cookie BDSVRTM=0 for www.baidu.com/>, <Cookie BD_HOME=0 for www.baidu.com/>]>
<!DOCTYPE html>
<!--
----------------------------------------------
'''
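　　What a session adds is essentially a cookie jar that outlives a single request. The sketch below imitates that behavior with the standard library only, as an illustration of the idea and not of how requests is implemented. FakeResponse, the cookie value, and the example URLs are hypothetical, and no network traffic is involved.

```python
from email.message import Message
from http.cookiejar import CookieJar
from urllib.request import Request


class FakeResponse:
    """Hypothetical stand-in for a server response carrying Set-Cookie."""

    def __init__(self, set_cookie):
        self._headers = Message()
        self._headers["Set-Cookie"] = set_cookie

    def info(self):
        return self._headers


jar = CookieJar()  # plays the role of the session's cookie storage

# First exchange: the "server" sets a cookie, and the jar remembers it.
first = Request("http://example.com/")
jar.extract_cookies(FakeResponse("sid=abc123; Path=/"), first)

# Second request to the same host: the jar attaches the cookie.
second = Request("http://example.com/page")
jar.add_cookie_header(second)
print(second.get_header("Cookie"))

# A request to a different host gets no cookie.
other = Request("http://other.example/")
jar.add_cookie_header(other)
print(other.get_header("Cookie"))
```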

