
A First Look at Web Crawlers: An Overview

3.1 Overview of Web Crawlers

3.1.1 Web Crawlers and Their Applications

Web crawlers fall into four categories: general-purpose, focused, incremental, and deep-web.
General-purpose crawlers: the kind that search engines are built on.
Focused crawlers: fetch resources from a targeted set of relevant pages.
Incremental crawlers: re-fetch only pages whose content has been updated.
Deep-web crawlers: reach web pages hidden behind surface links.
Real-world uses of crawlers include BT (torrent) search sites and cloud-drive search.

3.1.2 The Structure of a Web Crawler

[Figure: typical web crawler architecture]
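
The architecture reduces to a loop: seed URLs go into a queue, each URL is downloaded and parsed, and newly discovered links are pushed back onto the queue until a stop condition is met. Below is a minimal sketch of that loop; the crawl function, the regex-based link extraction, and the max_pages limit are illustrative assumptions rather than part of the original text.

import re
import urllib2
from collections import deque

def crawl(seed, max_pages=10):
    # classic crawl loop: pop a URL, download it, enqueue unseen links
    queue, seen = deque([seed]), set([seed])
    while queue and len(seen) <= max_pages:
        url = queue.popleft()
        try:
            html = urllib2.urlopen(url, timeout=5).read()
        except Exception:
            continue  # skip pages that fail to download
        for link in re.findall(r'href="(http[^"]+)"', html):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return seen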

3.2 Implementing HTTP Requests in Python

Three approaches: urllib2/urllib, httplib/urllib, and Requests.

3.2.1 Implementation with urllib2/urllib

1. Sending a request to a given URL:

import urllib2
response = urllib2.urlopen('http://www.zhihu.com')
html = response.read()
print(html)

The same call, decomposed into an explicit request and response:

import urllib2
# build the request
request = urllib2.Request('http://www.zhihu.com')
# send it and receive the response
response = urllib2.urlopen(request)
html = response.read()
print(html)

A POST request, with form data attached:

import urllib
import urllib2
url = 'http://www.zhihu.com'
postdata = {'username': 'qiye', 'password': 'qiye-pass'}
data = urllib.urlencode(postdata)  # encode the form data
req = urllib2.Request(url,data)
response = urllib2.urlopen(req)
html = response.read()

2. Handling request headers

import urllib
import urllib2
url = 'http://www.zhihu.com'
user_agent = '...'
referer = '...'
postdata = {...}
# put the User-Agent and Referer into the headers dict
headers = {'User-Agent': user_agent, 'Referer': referer}
data = urllib.urlencode(postdata)
req = urllib2.Request(url,data,headers)
response = urllib2.urlopen(req)
html = response.read()

3. Cookie handling
Reading the value of each cookie:

import urllib2
import cookielib
cookie = cookielib.CookieJar()
# build an opener that stores cookies in the jar
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie))
response = opener.open('...')
for item in cookie:
    print(item.name + ':' + item.value)

Adding your own cookie values manually:

import urllib2
opener = urllib2.build_opener()
opener.addheaders.append(('Cookie','email='+"..."))
req = urllib2.Request("...")
response = opener.open(req)
print(response.headers)
retdata = response.read()

4. Setting a timeout

import urllib2
request = urllib2.Request('...')
response = urllib2.urlopen(request,timeout=2)
html = response.read()
print(html)
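
The timeout argument above applies to a single call. The standard library also supports a process-wide default that covers every request; this variant is an aside, not something the original text shows:

import socket
socket.setdefaulttimeout(10)  # applies to every socket opened afterwards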

5. Getting the HTTP status code

import urllib2
try:
    response = urllib2.urlopen('...')
    print(response.getcode())  # status code of a successful response
except urllib2.HTTPError as e:
    if hasattr(e, 'code'):
        print('Error code:', e.code)

6. Redirection
urllib2 follows HTTP 3xx redirects automatically by default.
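
A minimal way to confirm that a redirect happened is to compare the URL actually fetched with the one requested; the URL is elided here as in the original:

import urllib2
url = '...'
response = urllib2.urlopen(url)
# geturl() returns the final URL after any redirects were followed
redirected = response.geturl() != url
print(redirected)
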
7. Setting a proxy

import urllib2
proxy = urllib2.ProxyHandler({'http': '127.0.0.1:8087'})
opener = urllib2.build_opener(proxy)
urllib2.install_opener(opener)  # make this opener the global default
response = urllib2.urlopen('...')
print(response.read())

3.2.2 Implementation with httplib/urllib

httplib is a low-level module that exposes every step of an HTTP request. It is rarely needed in everyday crawler development, but it is worth knowing as background:
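As a sketch of what working at this level looks like, here is a minimal GET with httplib that walks through the connection, request, and response steps explicitly; the host is only an example:

import httplib

conn = None
try:
    conn = httplib.HTTPConnection('www.zhihu.com')  # open the TCP connection
    conn.request('GET', '/')                        # send the request line and headers
    response = conn.getresponse()                   # read the status line and headers
    print(response.status)                          # e.g. 200
    print(response.read())                          # read the body
except Exception as e:
    print(e)
finally:
    if conn:
        conn.close()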

3.2.3 Requests: A More Human-Friendly Library

1. The complete request-response model
GET:

import requests
r = requests.get('...')
print(r.content)

POST:

import requests
postdata = {...}
r = requests.post('...',data=postdata)
print(r.content)

2. Responses and encoding

import requests
r = requests.get('...')
print('content-->>'+r.content)
print('text-->>'+r.text)
print('encoding-->>'+r.encoding)
r.encoding = 'utf-8'
print('new text -->>'+r.text)

chardet is a module for detecting the encoding of strings and files.
Assign the encoding chardet detects directly to r.encoding, and r.text will decode without garbled characters:

import requests
import chardet
r = requests.get('...')
print(chardet.detect(r.content))
r.encoding = chardet.detect(r.content)['encoding']
print(r.text)

3. Handling request headers

import requests
user_agent = '...'
headers = {'User-Agent':user_agent}
r = requests.get('...',headers = headers)
print(r.content)

4. Handling the status code and response headers
Status code: the status_code field
Response headers: the headers field

import requests
r = requests.get('...')
if r.status_code == requests.codes.OK:
    print(r.status_code)  # the status code
    print(r.headers)  # the response headers
    print(r.headers.get('content-type'))  # a single header field
else:
    r.raise_for_status()  # raise an exception for an error status

5. Cookie handling
Reading the cookie fields:

import requests
user_agent = '...'
headers = {'User-Agent': user_agent}
r = requests.get('...',headers = headers)
for cookie in r.cookies.keys():
    print(cookie + ':' + r.cookies.get(cookie))

Sending custom cookies:

import requests
user_agent = '...'
headers = {'User-Agent': user_agent}
cookies = dict(name='qiye',age='10')
r = requests.get('...',headers = headers,cookies = cookies)
print(r.text)

Requests also provides a Session object that attaches cookies to subsequent requests automatically:

import requests
loginurl = '...'
s = requests.Session()
# first visit the login page as a guest; the server assigns a cookie
r = s.get(loginurl, allow_redirects=True)
datas = {'name': 'qiye', 'passwd': 'qiye'}
# POST the credentials; on success the guest session is upgraded to a member session
r = s.post(loginurl, data=datas, allow_redirects=True)
print(r.text)

6. Redirects and history
Control redirects with the allow_redirects parameter.
Inspect the redirect chain with r.history (see the variation after the code below):

import requests
r = requests.get('...')
print(r.url)
print(r.status_code)
print(r.history)
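
To keep Requests from following redirects at all, pass allow_redirects=False; the 3xx response is then returned as-is. The URL is elided as in the original:

import requests
r = requests.get('...', allow_redirects=False)
print(r.status_code)  # e.g. 301 or 302 instead of the final page's 200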

7. Setting a timeout

requests.get('...',timeout=2)
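
Requests 2.4.0 and later also accept a (connect, read) tuple so the two phases can be limited separately; this note is an aside, not from the original text:

requests.get('...', timeout=(3.05, 27))  # connect timeout, then read timeout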

8. Setting proxies

import requests
proxies = {"....","......"}
requests.get("...",proxies = proxies)