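Python Crawler Development and Project Practice 3: A First Look at Crawlers

阿新 • Published: 2019-01-12

3.1 Overview of Web Crawlers

Concept: by system architecture and implementation technique, web crawlers fall roughly into four types: general-purpose crawlers, focused crawlers, incremental crawlers, and deep-web crawlers. Real crawler systems usually combine several of these techniques.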

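Search engines: these are general-purpose crawlers, but they have several limitations:

Search results contain many pages the user does not care about.

Server resources are limited, while the data on the web is effectively unlimited.

General-purpose search engines cope poorly with information-dense, structured data such as audio and video.

Keyword-based retrieval has difficulty supporting queries that are based on semantic information.

Focused crawlers, which fetch only the web resources relevant to a topic, emerged to address these problems.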

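Focused crawler: a program that automatically downloads web pages and prepares data for topic-oriented user queries.

Incremental crawler: re-fetches pages that have changed and crawls only newly created pages. This reduces time and storage costs, but makes the algorithm more complex and harder to implement.

Deep-web crawler: web pages are divided into surface pages (which search engines can index) and deep pages (reachable only after submitting a form).

Example scenarios: BT search sites (https://www.cilisou.org/) and cloud-drive search sites (http://www.pansou.com/).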
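The incremental strategy described above can be sketched as follows. This is a minimal Python 3 sketch under stated assumptions, not the book's implementation: it assumes a change is detected by comparing a SHA-256 fingerprint of the page body, and the names IncrementalStore and needs_processing are hypothetical.

```python
import hashlib


class IncrementalStore:
    """Remembers one content fingerprint per URL (hypothetical helper)."""

    def __init__(self):
        self._fingerprints = {}  # url -> hex digest of the last stored body

    def needs_processing(self, url, body):
        """Return True if the page is new or its content has changed."""
        digest = hashlib.sha256(body).hexdigest()
        if self._fingerprints.get(url) == digest:
            return False  # unchanged page: skip storing and parsing it again
        self._fingerprints[url] = digest
        return True


store = IncrementalStore()
print(store.needs_processing('http://example.com/a', b'<html>v1</html>'))  # new page
print(store.needs_processing('http://example.com/a', b'<html>v1</html>'))  # unchanged
print(store.needs_processing('http://example.com/a', b'<html>v2</html>'))  # updated
```

The per-URL bookkeeping is the extra complexity the definition mentions: the crawler saves storage and parsing work, but it has to maintain and consult this state on every fetch.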

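The basic workflow is as follows:

First, choose a carefully selected set of seed URLs.
Put these URLs into the to-crawl URL queue.
Take a URL from the to-crawl queue, resolve its DNS name to an IP address, download and store the page, then put the URL into the crawled URL queue.
Analyze the crawled pages for the URLs they contain, compare them with the URLs already crawled to remove duplicates, put the remaining ones into the to-crawl queue, and start the next cycle.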
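The workflow above can be sketched as a small loop. This is a minimal Python 3 sketch, not the book's code: the crawl function, the LinkParser class, and the in-memory FAKE_WEB pages are all hypothetical, and the download step is passed in as a fetch callable so the example runs without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.links.append(value)


def crawl(seeds, fetch, max_pages=100):
    """Breadth-first crawl; fetch(url) must return HTML text or None."""
    to_crawl = deque(seeds)      # URLs waiting to be fetched
    seen = set(seeds)            # every URL ever queued, used for deduplication
    crawled = []                 # URLs already fetched, in order
    while to_crawl and len(crawled) < max_pages:
        url = to_crawl.popleft()
        html = fetch(url)        # DNS resolution and download happen in here
        crawled.append(url)
        if html is None:
            continue
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)
            if absolute not in seen:   # skip URLs that were already queued
                seen.add(absolute)
                to_crawl.append(absolute)
    return crawled


# A tiny in-memory "web" stands in for real downloads (made-up pages).
FAKE_WEB = {
    'http://example.com/': '<a href="/a">A</a> <a href="/b">B</a>',
    'http://example.com/a': '<a href="/b">B</a> <a href="/">home</a>',
    'http://example.com/b': '<a href="/a">A</a>',
}

print(crawl(['http://example.com/'], FAKE_WEB.get))
```

Keeping a single seen set for both queued and crawled URLs is one simple way to do the deduplication step; a real crawler would also persist the downloaded pages.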

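3.2 Implementing HTTP Requests in Python

There are three ways to make HTTP requests in Python: urllib2/urllib, httplib/urllib, and Requests. The examples in this section are written for Python 2, where these modules are available.

urllib2/urllib: two built-in Python modules; urllib2 does most of the work and urllib plays a supporting role.

1. A complete request and response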

import urllib2
response = urllib2.urlopen('http://www.zhihu.com')
html = response.read()
print html

The same exchange can be split into two steps, one for the request and one for the response:

import urllib2
request = urllib2.Request('http://www.zhihu.com')
response = urllib2.urlopen(request)
html = response.read()
print html

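POST requests: pass the form data as the second argument to urllib2.Request, as in the request-header example in the next item.

Sometimes the server rejects the request because it inspects the request headers. This is a common anti-crawler measure.

2. Handling request headers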

import urllib
import urllib2
url = 'http://www.xxxx.com/login'
user_agent = ''
referer = 'http://www.xxxx.com/'
postdata = {'username': 'qiye',
             'password': 'qiye_pass' }
# Set the request headers
headers = {'User-Agent': user_agent, 'Referer': referer}
data = urllib.urlencode(postdata)
req = urllib2.Request(url, data, headers)
response = urllib2.urlopen(req)
html = response.read()
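For readers on Python 3, where urllib2 was split into urllib.request and urllib.error, the same POST request might be built as in the sketch below. The helper name build_login_request and the User-Agent string are placeholders introduced here; the request is only constructed, not sent, so the example runs offline.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_login_request(url, form, user_agent, referer):
    """Build (but do not send) a POST request with custom headers."""
    data = urlencode(form).encode('utf-8')  # Python 3 needs bytes for the body
    headers = {'User-Agent': user_agent, 'Referer': referer}
    return Request(url, data=data, headers=headers)


req = build_login_request(
    'http://www.xxxx.com/login',
    {'username': 'qiye', 'password': 'qiye_pass'},
    user_agent='Mozilla/5.0 (X11; Linux x86_64)',  # placeholder value
    referer='http://www.xxxx.com/',
)
print(req.get_method())              # POST, because a body is attached
print(req.data)
print(req.get_header('User-agent'))  # Request stores header names capitalized
```

Passing req to urllib.request.urlopen would actually send it.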

3. Cookie handling: use the CookieJar class to manage cookies

import urllib2
import cookielib
cookie = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie))
response = opener.open('http://www.zhihu.com')
for item in cookie:
	print item.name + ':' + item.value

Sample output:

SessionID_R3:4y3gT2mcOjBQEQ7RDiqDz6DfdauvG8C5j6jxFg8jIcJvE5ih4USzM0h8WRt1PZomR1C9755SGG5YIzDJZj7XVraQyomhEFA0v6pvBzV94V88uQqUyeDnsMj8MALBSKr
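To see what the cookie jar does without contacting a real site, the Python 3 sketch below (http.cookiejar is the Python 3 name of cookielib) feeds it a simulated response. The FakeResponse class and the cookie value are made up for illustration; a real program would let HTTPCookieProcessor call these methods automatically.

```python
import email
import http.cookiejar
import urllib.request


class FakeResponse:
    """Stands in for a server response so the example needs no network."""

    def __init__(self, header_lines):
        self._headers = email.message_from_string('\n'.join(header_lines))

    def info(self):
        return self._headers


jar = http.cookiejar.CookieJar()

# Pretend the first visit returned a Set-Cookie header (made-up value).
first = urllib.request.Request('http://www.example.com/')
jar.extract_cookies(FakeResponse(['Set-Cookie: SessionID=abc123; Path=/']), first)
for item in jar:
    print(item.name + ':' + item.value)

# The jar attaches the stored cookie to later requests for the same site.
second = urllib.request.Request('http://www.example.com/profile')
jar.add_cookie_header(second)
print(second.get_header('Cookie'))
```

The two steps shown, storing cookies from a response and attaching them to the next request, are what makes a sequence of requests look like one browsing session to the server.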

4. Setting a timeout

import urllib2
request = urllib2.Request('http://www.zhihu.com')
response = urllib2.urlopen(request, timeout=2)
html = response.read()
print html

5. Getting the HTTP response code

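import urllib2
try:
    response = urllib2.urlopen('http://www.google.com')
    # A successful response exposes its status code through getcode()
    print response.getcode()
except urllib2.HTTPError as e:
    # For error statuses urlopen raises HTTPError, which carries the code
    if hasattr(e, 'code'):
        print 'Error code:', e.code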
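Timeouts and status codes are usually handled together. The sketch below is a Python 3 version (urllib.request and urllib.error replace urllib2) that returns the status code or a short error description. The function name fetch_status and the opener parameter are assumptions made here so the behaviour can be demonstrated with a fake opener instead of a live server.

```python
import socket
import urllib.error
import urllib.request


def fetch_status(url, timeout=2.0, opener=urllib.request.urlopen):
    """Return (status_code, error_text); exactly one of them is None."""
    try:
        with opener(url, timeout=timeout) as response:
            return response.getcode(), None
    except urllib.error.HTTPError as e:
        # The server answered, but with an error status such as 404 or 500.
        return e.code, None
    except urllib.error.URLError as e:
        # No HTTP answer at all: DNS failure, refused connection, and so on.
        return None, str(e.reason)
    except socket.timeout:
        # The server accepted the connection but was too slow to respond.
        return None, 'timed out'


def always_404(url, timeout=None):
    """Fake opener (for illustration) that behaves like a missing page."""
    raise urllib.error.HTTPError(url, 404, 'Not Found', None, None)


print(fetch_status('http://www.example.com/missing', opener=always_404))
```

HTTPError is a subclass of URLError, so it has to be caught first; otherwise the status code would be lost in the more general handler.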

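6. Redirects: by default, urllib2 automatically follows redirects for HTTP 3XX status codes

To tell whether a redirect happened, check whether the URL of the response differs from the URL of the request: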

import urllib2
response = urllib2.urlopen('http://www.zhihu.com')
isRedirected = response.geturl() != 'http://www.zhihu.com'

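7. Proxy settings: by default, urllib2 reads the http_proxy environment variable to configure an HTTP proxy. In practice this is rarely used; instead, ProxyHandler sets the proxy dynamically inside the program.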

import urllib2
proxy = urllib2.ProxyHandler({'http': '127.0.0.1:8087'})
opener = urllib2.build_opener(proxy, )
urllib2.install_opener(opener)
response = urllib2.urlopen('http://www.zhihu.com/')
print response.read()

install_opener() sets a global opener. To use two different proxies, it is better to call the opener's own open method instead of the global urlopen function:

import urllib2
proxy = urllib2.ProxyHandler({'http': '127.0.0.1:8087'})
opener = urllib2.build_opener(proxy, )
response = opener.open('http://www.zhihu.com/')
print response.read()
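The two-proxy case mentioned above might look like the following Python 3 sketch (urllib.request replaces urllib2). The helper name opener_for_proxy and the second proxy address are assumptions for illustration; the openers are only built here, so no proxy needs to be running.

```python
import urllib.request


def opener_for_proxy(proxy_url):
    """Build an opener that routes plain-HTTP requests through proxy_url."""
    handler = urllib.request.ProxyHandler({'http': proxy_url})
    return urllib.request.build_opener(handler)


# Two independent openers; neither one is installed globally.
opener_a = opener_for_proxy('http://127.0.0.1:8087')
opener_b = opener_for_proxy('http://127.0.0.1:8088')  # second address is made up

# Each call would go through its own proxy (needs both proxies running):
# html_a = opener_a.open('http://www.zhihu.com/').read()
# html_b = opener_b.open('http://www.zhihu.com/').read()
```

Because nothing is installed with install_opener(), code elsewhere that calls urlopen keeps its default behaviour.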

httplib/urllib: httplib is a low-level module that exposes every step of building an HTTP request, but it offers fewer features.
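The article gives no httplib example, so here is a hedged sketch of what "every step" looks like. It uses Python 3, where httplib is named http.client; the function name get_with_http_client and the User-Agent value are made up for illustration.

```python
import http.client


def get_with_http_client(host, path='/', port=80, timeout=5):
    """Fetch one page step by step; returns (status, reason, body_bytes)."""
    # 1. Create the connection object (no traffic is sent yet).
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        # 2. Send the request line and headers.
        conn.request('GET', path, headers={'User-Agent': 'demo-client'})
        # 3. Read the status line and response headers.
        response = conn.getresponse()
        # 4. Read the response body.
        body = response.read()
        return response.status, response.reason, body
    finally:
        # 5. Release the socket.
        conn.close()


# Example call (requires network access):
# status, reason, body = get_with_http_client('www.baidu.com')
```

Compared with urlopen, nothing here is automatic: redirects, cookies, and proxies would all have to be handled by hand, which is the limited feature set the text refers to.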

Requests: a more user-friendly third-party module; install it with pip install requests

1. A complete request and response

import requests
r = requests.get('http://www.baidu.com')
print r.content

2. Responses and encoding

import requests
r = requests.get('http://www.baidu.com')
print 'content-->' + r.content
print 'text-->' + r.text
print 'encoding-->' + r.encoding
r.encoding = 'utf-8'
print 'new text-->' + r.text
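The difference between r.content (raw bytes) and r.text (bytes decoded with r.encoding) can be reproduced without any request. The Python 3 sketch below uses a made-up sample string and assumes the wrongly guessed charset is ISO-8859-1, which is the fallback Requests documents for text responses that declare no charset.

```python
# The bytes a server might send for a UTF-8 page (sample text, not a real response).
raw = '百度一下'.encode('utf-8')

# Decoding with the wrong charset "succeeds" but yields garbled text.
garbled = raw.decode('iso-8859-1')
# Decoding with the right charset recovers the original characters.
correct = raw.decode('utf-8')

print(repr(garbled))
print(correct)
```

Setting r.encoding = 'utf-8' before reading r.text has the same effect as choosing the second decode call here.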

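chardet is an excellent module for detecting the encoding of strings and files; install it with pip install chardet

Assign the encoding detected by chardet directly to r.encoding, and r.text will then decode without garbled characters.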

import requests
import chardet
r = requests.get('http://www.baidu.com')
print chardet.detect(r.content)
r.encoding = chardet.detect(r.content)['encoding']
print r.text

Streaming mode

import requests
r = requests.get('http://www.baidu.com', stream=True)
print r.raw.read(10)

3. Handling request headers

import requests
user_agent = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36'
headers = {'User-Agent': user_agent}
r = requests.get('http://www.baidu.com', headers=headers)
print r.content

4. Handling the status code and response headers

# -*- coding: utf-8 -*-
import requests
user_agent = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36'
headers = {'User-Agent': user_agent}
r = requests.get('http://www.baidu.com', headers=headers)
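if r.status_code == requests.codes.ok:
    print r.status_code    # status code
    print r.headers        # response headers
    print r.headers.get('content-type')  # recommended: returns None if the header is missing
    print r.headers['content-type']      # not recommended: raises KeyError if the header is missing
else:
    r.raise_for_status()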

5. Cookie handling

# -*- coding: utf-8 -*-
import requests
user_agent = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36'
headers = {'User-Agent': user_agent}
r = requests.get('http://www.baidu.com', headers=headers)
# Iterate over all cookie fields and print their values
for cookie in r.cookies.keys():
	print cookie + ":" + r.cookies.get(cookie)

Sending custom cookie values:

# -*- coding: utf-8 -*-
import requests
user_agent = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36'
headers = {'User-Agent': user_agent}
cookies = dict(name='qiye', age='10')
r = requests.get('http://www.baidu.com', headers=headers, cookies=cookies)
print r.text

Requests provides sessions, so consecutive page visits work without managing cookie values by hand:

# -*- coding: utf-8 -*-
import requests
loginUrl = "http://www.xxx.com/login"
s = requests.Session()
# First visit as a guest; the server assigns a cookie
r = s.get(loginUrl, allow_redirects=True)
datas = {'name':'qiye', 'passwd': 'qiye'}
# POST to the login URL; the guest session is upgraded to a member session
r = s.post(loginUrl, data=datas, allow_redirects=True)
print r.text

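This is a problem met in real projects: if you skip the first step of visiting the login page and POST directly to the login URL, the site treats you as an illegitimate user. Visiting the login page assigns a cookie, and that cookie must accompany the POST request. Handling cookies through a Session in this way will come up often later on.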
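The way a session carries the guest cookie into the login POST can be inspected without sending anything. In this sketch the cookie name and value are made up and set by hand to simulate the first visit; prepare_request builds the request the session would send, so the attached Cookie header becomes visible.

```python
import requests

session = requests.Session()

# Simulate the cookie a server would assign on the first (guest) visit.
# The cookie name and value are made up for this sketch.
session.cookies.set('sessionid', 'guest123')

# Build the login POST through the session without sending it.
login = requests.Request(
    'POST',
    'http://www.xxx.com/login',
    data={'name': 'qiye', 'passwd': 'qiye'},
)
prepared = session.prepare_request(login)

print(prepared.headers.get('Cookie'))  # cookie attached by the session
print(prepared.body)                   # form-encoded POST body
```

A plain requests.post call has no such stored state, which is why posting directly to the login URL arrives without the cookie.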

6. Redirects and history

Redirect handling is controlled by the allow_redirects parameter, and the r.history attribute shows the redirect history:

# -*- coding: utf-8 -*-
import requests
r = requests.get('http://github.com')   # redirected to https://github.com
print r.url
print r.status_code
print r.history

7. Timeout settings

requests.get('http://github.com', timeout=2)

8. Proxy settings

# -*- coding: utf-8 -*-
import requests
proxies = {
    "http": "http://0.10.10.01:3234",
    "https": "http://0.0.0.2:1020",
}
r= requests.get('http://github.com', proxies=proxies)

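Proxies can also be configured through the HTTP_PROXY and HTTPS_PROXY environment variables, but this is less common.

If your proxy requires HTTP Basic Auth, use the http://user:password@host/ syntax.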
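When the user name or password contains characters such as '@' or ':', building the proxy URL by hand is error-prone. The Python 3 sketch below percent-encodes the credentials first; the helper name proxy_url, the credentials, and the proxy address are all hypothetical.

```python
from urllib.parse import quote


def proxy_url(user, password, host, port, scheme='http'):
    """Build a proxy URL with Basic Auth credentials embedded in it."""
    # Percent-encode the credentials so characters such as '@' or ':'
    # inside them cannot be confused with the URL's own separators.
    return '{}://{}:{}@{}:{}/'.format(
        scheme, quote(user, safe=''), quote(password, safe=''), host, port)


# Hypothetical credentials and proxy address, for illustration only.
url = proxy_url('qiye', 'p@ss:word', '10.10.1.10', 3128)
proxies = {'http': url, 'https': url}
print(url)
```

The resulting proxies dictionary has the same shape as the one passed to requests.get in the proxy example.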
