
Python Crawler - Getting Started - Breaking Through Blocking


>> Related concepts

 

  >> The request concept: a request is sent from the client to the server and carries both the data submitted by the user and some information about the client itself. The client can submit data through an HTML form or by appending parameters after the page address, and the server then retrieves that data through the relevant methods of its request object; those methods mainly deal with the parameters and options in the request submitted by the client's browser. In a Python crawler, "request" simply means Python sending a request to the server and receiving the information it returns, as the sketch below illustrates.
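
    This sketch is not part of the original post; http://www.example.com is only a placeholder target. It shows Python issuing a request with urllib and reading back a few pieces of the server's reply:

#-*- coding: utf-8 -*-

import urllib.request

# Send a request and look at what the server returns
response = urllib.request.urlopen("http://www.example.com")
print(response.status)                       # HTTP status code, e.g. 200
print(response.getheader("Content-Type"))    # one of the response headers
print(response.read(200))                    # first 200 bytes of the response body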

  

  >> POST and GET data transfer:

    > Common HTTP request methods include GET, POST, PUT, DELETE and so on.

    > GET is the simpler HTTP request: the data to be sent to the web server is appended directly after the request address, i.e. passed in the form ?key1=value1&key2=value2. It is only suitable for small amounts of data with no security requirements.

    > POST places the data to be sent to the web server, after encoding, in the request body. It can carry large amounts of data and offers a degree of security, so it is commonly used for form submission (a sketch of the difference follows this list).
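
    This sketch is not from the original post; http://httpbin.org is assumed here purely as a convenient request-echo service. GET parameters ride in the URL, while POST data is encoded into the request body:

#-*- coding: utf-8 -*-

import urllib.parse
import urllib.request

# Encode the key/value pairs once and use them for both request styles
params = urllib.parse.urlencode({"key1": "value1", "key2": "value2"})

# GET: the data is appended to the URL after "?"
get_response = urllib.request.urlopen("http://httpbin.org/get?" + params)
print(get_response.read().decode("utf-8"))

# POST: the same data goes into the request body (passing data= makes urllib send a POST)
post_request = urllib.request.Request("http://httpbin.org/post", data=params.encode("utf-8"))
post_response = urllib.request.urlopen(post_request)
print(post_response.read().decode("utf-8"))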

 

  >> Constructing a proper HTTP request

    > Some sites will not accept a program accessing them directly in the way shown above; if the request looks wrong, the site may simply not respond at all. So, to imitate the way a browser works more completely, we need to set some HTTP request header (Headers) information.

    > HTTP request headers are a set of attributes and configuration values passed every time a request is sent to a web server. HTTP defines a dozen or so obscure header types, but most of them are rarely used. Only the following seven fields are used by most browsers to initialize every network request (see the table and the sketch after it).

Property          Content
Host              The host name of the server the request is sent to
Connection        Defaults to a persistent connection (keep-alive); close indicates the TCP connection in use will be dropped once the current request has been handled
Accept            The content types the browser can accept in the server's response
User-Agent        Identifies to the visited site the browser type, operating system and version, CPU type, rendering engine, browser language, browser plugins and so on
Referrer          The page from which the current request originated
Accept-Encoding   The content encodings (compression formats) the browser can accept
Accept-Language   The languages the browser can accept
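
    As a small sketch (not from the original post), these fields can be attached to a urllib.request.Request and listed back to confirm what the crawler will actually send; the header values here are only illustrative.

#-*- coding: utf-8 -*-

import urllib.request

# Attach the seven common header fields to a request
headers = {
    "Host": "www.baidu.com",
    "Connection": "keep-alive",
    "Accept": "text/html,application/xhtml+xml,*/*",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.186 Safari/537.36",
    "Referer": "http://www.baidu.com",    # on the wire this header is historically spelled "Referer"
    "Accept-Encoding": "identity",        # ask for an uncompressed body so .read() stays simple
    "Accept-Language": "zh-CN,zh;q=0.9",
}
request = urllib.request.Request("http://www.baidu.com", headers=headers)

# List the headers exactly as urllib will send them
for name, value in request.header_items():
    print(name, ":", value)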

 

 

    >> A simple example:

#-*- coding: utf-8 -*-

import urllib.request

def baiduNet():
    # Request headers that imitate a real browser
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.186 Safari/537.36",
        'Connection': 'keep-alive'
    }
    # Build the request with the custom headers and fetch the page
    request = urllib.request.Request("http://www.baidu.com", headers=headers)
    response = urllib.request.urlopen(request).read()
    netcontext = response.decode("utf-8")

    # Save the page content to a local file (closed automatically by "with")
    with open("baidutext.txt", "w", encoding='UTF-8') as file:
        file.write(netcontext)

if __name__ == "__main__":
    baiduNet()

     >> Upgraded example:

#-*- coding: utf-8 -*-

import urllib.request
import random

def requests_headers():
    # Candidate values for the common request header fields
    head_connection = ['Keep-Alive', 'close']
    head_accept = ['text/html,application/xhtml+xml,*/*']
    head_accept_language = ['zh-CN,fr-FR;q=0.5', 'en-US,en;q=0.8,zh-Hans-CN;q=0.5,zh-Hans;q=0.3']
    head_user_agent = ['Mozilla/5.0 (Windows NT 6.3; WOW64; Trident/7.0; rv:11.0) like Gecko',
                       'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1500.95 Safari/537.36',
                       'Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; rv:11.0) like Gecko',
                       'Mozilla/5.0 (Windows; U; Windows NT 5.2) Gecko/2008070208 Firefox/3.0.1',
                       'Mozilla/5.0 (Windows; U; Windows NT 5.1) Gecko/20070309 Firefox/2.0.0.3',
                       'Mozilla/5.0 (Windows; U; Windows NT 5.1) Gecko/20070803 Firefox/1.5.0.12',
                       'Opera/9.27 (Windows NT 5.2; U; zh-cn)',
                       'Mozilla/5.0 (Macintosh; PPC Mac OS X; U; en) Opera 8.0',
                       'Opera/8.0 (Macintosh; PPC Mac OS X; U; en)',
                       'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.12) Gecko/20080219 Firefox/2.0.0.12 Navigator/9.0.0.6',
                       'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; Win64; x64; Trident/4.0)',
                       'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; Trident/4.0)',
                       'Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64; Trident/6.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; InfoPath.2; .NET4.0C; .NET4.0E)',
                       'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Maxthon/4.0.6.2000 Chrome/26.0.1410.43 Safari/537.1',
                       'Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64; Trident/6.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; InfoPath.2; .NET4.0C; .NET4.0E; QQBrowser/7.3.9825.400)',
                       'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:21.0) Gecko/20100101 Firefox/21.0',
                       'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.92 Safari/537.1 LBBROWSER',
                       'Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64; Trident/6.0; BIDUBrowser 2.x)',
                       'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.11 TaoBrowser/3.0 Safari/536.11']

    # Build a header dict with a randomly chosen value for each common field
    header = {
        'Connection': random.choice(head_connection),
        'Accept': head_accept[0],
        'Accept-Language': random.choice(head_accept_language),
        'User-Agent': random.choice(head_user_agent),
    }
    return header  # return the headers dictionary


def baiduNet():
    # Use a freshly randomized set of headers for this request
    headers = requests_headers()
    request = urllib.request.Request("http://www.baidu.com", headers=headers)
    response = urllib.request.urlopen(request).read()
    netcontext = response.decode("utf-8")

    # Save the page content to a local file (closed automatically by "with")
    with open("baidutext.txt", "w", encoding='UTF-8') as file:
        file.write(netcontext)

if __name__ == "__main__":
    baiduNet()
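
    Because each call to requests_headers() picks a fresh Connection, Accept-Language and User-Agent combination, repeated requests from the script no longer all carry identical headers, which makes them look a little less like a single automated client.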

 

    

    >> If you keep scraping the target site's data from the same IP and make too many requests, the target server will block your access, so you need to change your IP regularly; this is where proxy servers come in.

     >> Sample code:

#-*- coding: utf-8 -*-

import urllib.request
import random

def requests_headers():
    # Candidate values for the common request header fields
    head_connection = ['Keep-Alive', 'close']
    head_accept = ['text/html,application/xhtml+xml,*/*']
    head_accept_language = ['zh-CN,fr-FR;q=0.5', 'en-US,en;q=0.8,zh-Hans-CN;q=0.5,zh-Hans;q=0.3']
    head_user_agent = ['Mozilla/5.0 (Windows NT 6.3; WOW64; Trident/7.0; rv:11.0) like Gecko',
                       'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1500.95 Safari/537.36',
                       'Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; rv:11.0) like Gecko',
                       'Mozilla/5.0 (Windows; U; Windows NT 5.2) Gecko/2008070208 Firefox/3.0.1',
                       'Mozilla/5.0 (Windows; U; Windows NT 5.1) Gecko/20070309 Firefox/2.0.0.3',
                       'Mozilla/5.0 (Windows; U; Windows NT 5.1) Gecko/20070803 Firefox/1.5.0.12',
                       'Opera/9.27 (Windows NT 5.2; U; zh-cn)',
                       'Mozilla/5.0 (Macintosh; PPC Mac OS X; U; en) Opera 8.0',
                       'Opera/8.0 (Macintosh; PPC Mac OS X; U; en)',
                       'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.12) Gecko/20080219 Firefox/2.0.0.12 Navigator/9.0.0.6',
                       'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; Win64; x64; Trident/4.0)',
                       'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; Trident/4.0)',
                       'Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64; Trident/6.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; InfoPath.2; .NET4.0C; .NET4.0E)',
                       'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Maxthon/4.0.6.2000 Chrome/26.0.1410.43 Safari/537.1',
                       'Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64; Trident/6.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; InfoPath.2; .NET4.0C; .NET4.0E; QQBrowser/7.3.9825.400)',
                       'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:21.0) Gecko/20100101 Firefox/21.0',
                       'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.92 Safari/537.1 LBBROWSER',
                       'Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64; Trident/6.0; BIDUBrowser 2.x)',
                       'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.11 TaoBrowser/3.0 Safari/536.11']

    # Build a header dict with a randomly chosen value for each common field
    header = {
        'Connection': random.choice(head_connection),
        'Accept': head_accept[0],
        'Accept-Language': random.choice(head_accept_language),
        'User-Agent': random.choice(head_user_agent),
    }
    return header  # return the headers dictionary


def baiduNetProxy():
    headers = requests_headers()
    # Fill in the address and port of a working proxy before running
    proxies = ["proxy IP address:proxy port"]
    # Create a proxy handler with a randomly chosen proxy
    proxy_handler = urllib.request.ProxyHandler({"http": random.choice(proxies)})
    # Build an opener that sends HTTP requests through the proxy
    opener = urllib.request.build_opener(proxy_handler)

    # Convert the header dict into the (name, value) tuples the opener expects
    header = []
    for key, value in headers.items():
        elem = (key, value)
        header.append(elem)
    opener.addheaders = header  # attach the headers

    request = opener.open("http://www.baidu.com")
    response = request.read()
    netcontext = response.decode("utf-8")

    # Save the page content to a local file (closed automatically by "with")
    with open("baidutext.txt", "w", encoding='UTF-8') as file:
        file.write(netcontext)

if __name__ == "__main__":
    baiduNetProxy()
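
    Note that the proxy placeholder above must be replaced with the address and port of a real, working proxy before the code will run, and ProxyHandler only routes the schemes listed in its dictionary (only "http" here), so an https URL would not go through this proxy unless an "https" entry is added as well.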

 

 

If anything here is wrong, corrections are welcome!

If you repost this, please credit the source: https://www.cnblogs.com/Charles-Yuan/p/9903489.html