Python Web Scraping: Getting to Know urllib/urllib2 and requests

阿新 • • Published: 2018-02-08

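First, a note on my environment: the scraping code here targets Python 2.x. Why this version? Because py2.x is widely supported and is the environment most people use. Most of it carries over to py3.x without much trouble, although the module names differ: in Python 3, urllib2 was merged into urllib.request and urllib.error, and urlencode moved to urllib.parse. With that out of the way, let's get started!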

urllib and urllib2

urllib and urllib2 are built into Python. For HTTP requests, urllib2 does most of the work and urllib plays a supporting role.

Building a request/response model

import urllib2

strUrl = "http://www.baidu.com"
response = urllib2.urlopen(strUrl)
print response.read()

Output (excerpt):
<div class="s_tab" id="s_tab">
    <b>網頁</b><a href="http://news.baidu.com/ns?cl=2&rn=20&tn=news&word=" wdfield="word"  onmousedown="return c({'fm':'tab','tab':'news'})">新聞</a><a href="http://tieba.baidu.com/f?kw=&fr=wwwt" wdfield="kw"  onmousedown="return c({'fm':'tab','tab':'tieba'})">貼吧</a><a href="http://zhidao.baidu.com/q?ct=17&pn=0&tn=ikaslist&rn=10&word=&fr=wwwt" wdfield="word"  onmousedown="return c({'fm':'tab','tab':'zhidao'})">知道</a><a href="http://music.baidu.com/search?fr=ps&ie=utf-8&key=" wdfield="key"  onmousedown="return c({'fm':'tab','tab':'music'})">音樂</a><a href="http://image.baidu.com/search/index?tn=baiduimage&ps=1&ct=201326592&lm=-1&cl=2&nc=1&ie=utf-8&word=" wdfield="word"  onmousedown="return c({'fm':'tab','tab':'pic'})">圖片</a><a href="http://v.baidu.com/v?ct=301989888&rn=20&pn=0&db=0&s=25&ie=utf-8&word=" wdfield="word"  onmousedown="return c({'fm':'tab','tab':'video'})">視頻</a><a href="http://map.baidu.com/m?word=&fr=ps01000" wdfield="word"  onmousedown="return c({'fm':'tab','tab':'map'})">地圖</a><a href="http://wenku.baidu.com/search?word=&lm=0&od=0&ie=utf-8" wdfield="word"  onmousedown="return c({'fm':'tab','tab':'wenku'})">文庫</a><a href="//www.baidu.com/more/"  onmousedown="return c({'fm':'tab','tab':'more'})">更多?</a>
</div>

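This retrieves the entire page content.
Notes
urlopen(strUrl, data, timeout)

The first argument, the URL, is required. The second, data, is the data to send with the request. The third, timeout, sets the timeout in seconds. The last two are optional.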
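To make the three parameters concrete, here is a minimal sketch of a wrapper around urlopen that passes data and timeout through and returns None when the request fails. The helper name fetch and its default timeout are assumptions for illustration, not part of the original code. The imports fall back to urllib2 on Python 2.

```python
import socket

try:  # Python 3
    from urllib.request import urlopen
    from urllib.error import URLError
except ImportError:  # Python 2
    from urllib2 import urlopen, URLError


def fetch(url, data=None, timeout=5):
    """Return the response body as bytes, or None if the request fails.

    `fetch` is a hypothetical helper; `data` and `timeout` are passed
    straight through to urlopen.
    """
    try:
        response = urlopen(url, data, timeout)
    except (URLError, socket.timeout) as error:
        print("request failed: %s" % error)
        return None
    try:
        return response.read()
    finally:
        response.close()
```

A call such as fetch("http://www.baidu.com", timeout=2) gives up after two seconds instead of waiting indefinitely. Note that on Python 3 the data argument must be bytes.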

Sending data with GET and POST
GET and POST are the two most common ways of sending data; in most cases these two are all you need.

Sending data with GET

import urllib2
import urllib

values = {}
values['username'] = '136xxxx0839'
values['password'] = '123xxx'
data = urllib.urlencode(values)  # encode the dict as a query string
url = 'https://accounts.douban.com/login?alias=&redir=https%3A%2F%2Fwww.douban.com%2F&source=index_nav&error=1001'
getUrl = url + '&' + data  # the URL already has a query string, so join with '&' rather than '?'
request = urllib2.Request(getUrl)
response = urllib2.urlopen(request)
# print response.read()
print getUrl

Output: https://accounts.douban.com/login?alias=&redir=https%3A%2F%2Fwww.douban.com%2F&source=index_nav&error=1001&username=136xxxx0839&password=123xxx
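Joining a URL and its parameters by hand is easy to get wrong when the URL may or may not already contain a query string. The sketch below shows one way to handle both cases with the standard library. The helper name add_query and the example.com URLs are assumptions made for illustration.

```python
try:  # Python 3
    from urllib.parse import urlencode, urlsplit, urlunsplit
except ImportError:  # Python 2
    from urllib import urlencode
    from urlparse import urlsplit, urlunsplit


def add_query(url, params):
    """Append params to url, keeping any query string it already has."""
    parts = urlsplit(url)
    extra = urlencode(params)
    # Join with '&' only when there is an existing query string.
    query = parts.query + "&" + extra if parts.query else extra
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, parts.fragment))


print(add_query("http://example.com/login?source=index_nav", [("username", "alice")]))
print(add_query("http://example.com/login", [("username", "alice")]))
```

Passing the parameters as a list of pairs keeps their order predictable, which a plain dict does not guarantee on Python 2.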

Sending data with POST

import urllib2
import urllib

values = {}
values['username'] = '136xxxx0839'
values['password'] = '123xxx'
data = urllib.urlencode(values)
url = 'https://accounts.douban.com/login?alias=&redir=https%3A%2F%2Fwww.douban.com%2F&source=index_nav&error=1001'
request = urllib2.Request(url, data)  # passing data makes this a POST request
response = urllib2.urlopen(request)
print response.read()

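Difference between the two methods:
With GET, the encoded data is appended to the URL. With POST, it is passed as the second argument, urllib2.Request(url, data), and travels in the request body.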
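The difference can be checked without sending anything: a Request object reports which HTTP method it will use. This is a small sketch with a hypothetical example.com URL and made-up credentials. The data is encoded to bytes because Python 3 requires that.

```python
try:  # Python 3
    from urllib.request import Request
    from urllib.parse import urlencode
except ImportError:  # Python 2
    from urllib2 import Request
    from urllib import urlencode

url = "http://example.com/login"  # hypothetical endpoint
data = urlencode([("username", "alice"), ("password", "secret")]).encode("ascii")

# GET: the data is part of the URL, and no body is attached.
get_request = Request(url + "?" + data.decode("ascii"))
# POST: the data is attached as the request body.
post_request = Request(url, data)

print(get_request.get_method())   # GET
print(post_request.get_method())  # POST
```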

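Handling request headers
Servers often check whether a request comes from a browser, so we set headers that make the request look like one. It is generally best to do this for every request to avoid errors such as access being denied; checking the User-Agent is a common anti-scraping measure.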

import urllib2

# The User-Agent must be a plain string (wrapping it in {} would create a set).
user_agent = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.104 Safari/537.36 Core/1.53.4295.400 QQBrowser/9.7.12661.400'
header = {'User-Agent': user_agent}
url = 'http://www.qq.com/'
request = urllib2.Request(url, headers=header)
response = urllib2.urlopen(request)
# Decode the body with the page's charset; check which charset the page declares first.
print response.read().decode('gbk')
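Hard-coding 'gbk' only works while the page keeps that charset. As a rough sketch, the charset can instead be read from the Content-Type response header when the server provides one. The helper name charset_from_content_type and the utf-8 default are assumptions for illustration.

```python
import re


def charset_from_content_type(content_type, default="utf-8"):
    """Return the charset named in a Content-Type value, lowercased.

    Falls back to `default` when the header carries no charset parameter.
    """
    match = re.search(r'charset\s*=\s*"?([\w.:-]+)', content_type or "", re.IGNORECASE)
    return match.group(1).lower() if match else default


print(charset_from_content_type("text/html; charset=GB2312"))  # gb2312
print(charset_from_content_type("text/html"))                  # utf-8
```

Some pages declare their charset only in an HTML meta tag, in which case the header gives no answer and the page source still has to be inspected.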

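To see the User-Agent your browser sends, open www.qq.com in a browser, press F12, and inspect the request headers.

User-Agent: some servers or proxies use this value to decide whether the request came from a browser.
Content-Type: when calling a REST interface, the server checks this value to decide how to parse the HTTP body.
application/xml: used for XML RPC calls, such as RESTful/SOAP services
application/json: used for JSON RPC calls
application/x-www-form-urlencoded: used when a browser submits a web form
When calling a RESTful or SOAP service, a wrong Content-Type can cause the server to reject the request.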
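As an illustration of setting Content-Type, the sketch below builds (but does not send) a JSON request. The endpoint URL, payload, and shortened User-Agent are assumptions. One detail worth knowing is that urllib stores header names in capitalized form, so the header is read back as "Content-type".

```python
import json

try:  # Python 3
    from urllib.request import Request
except ImportError:  # Python 2
    from urllib2 import Request

payload = json.dumps({"user": "aaa", "id": "123"}).encode("utf-8")
headers = {
    "User-Agent": "Mozilla/5.0",         # shortened for the example
    "Content-Type": "application/json",  # tells the server the body is JSON
}
request = Request("http://example.com/api", data=payload, headers=headers)

# Header names are normalized with str.capitalize(), hence "Content-type".
print(request.get_header("Content-type"))
print(request.get_method())
```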

requests

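requests is the most widely used HTTP library for Python, and it is very simple to use. It has to be installed first, for example with PyCharm's one-click install or with pip install requests.

1. Response and encoding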

import requests

url = 'http://www.baidu.com'
r = requests.get(url)
print type(r)
print r.status_code
print r.encoding
#print r.content
print r.cookies

Output:
<class 'requests.models.Response'>
200
ISO-8859-1
<RequestsCookieJar[<Cookie BDORZ=27315 for .baidu.com/>]>
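The ISO-8859-1 in the output deserves a note: when a text response carries no charset in its Content-Type header, requests falls back to ISO-8859-1, which garbles pages that are really UTF-8. The sketch below reproduces the effect with plain byte strings, so it needs neither requests nor a network connection; the sample text is an arbitrary two-character string.

```python
# Bytes as a UTF-8 page would send them for a two-character Chinese word.
body = u"\u767e\u5ea6".encode("utf-8")

wrong = body.decode("iso-8859-1")  # what you get when ISO-8859-1 is assumed
right = body.decode("utf-8")       # what the page actually contains

print(len(body))   # 6 bytes
print(len(wrong))  # 6: every byte became a separate character
print(len(right))  # 2
```

With requests, the usual remedy is to assign the correct value to r.encoding (for example 'utf-8') before reading r.text.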

2. GET requests

values = {'user': 'aaa', 'id': '123'}
url = 'http://www.baidu.com'
r = requests.get(url, params=values)  # params are encoded into the query string
print r.url

Output: http://www.baidu.com/?user=aaa&id=123
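To see how requests builds the final URL without contacting any server, a request can be prepared instead of sent. This is a sketch using a hypothetical example.com URL; the parameters are given as a list of pairs so their order is fixed.

```python
import requests

params = [("user", "aaa"), ("id", "123")]
# prepare() builds the request exactly as it would be sent, without sending it.
prepared = requests.Request("GET", "http://example.com", params=params).prepare()

print(prepared.method)  # GET
print(prepared.url)     # http://example.com/?user=aaa&id=123
```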

3. POST requests

values = {'user': 'aaa', 'id': '123'}
url = 'http://www.baidu.com'
r = requests.post(url, data=values)  # data is sent in the request body, not in the URL
print r.url
#print r.text

Output:
http://www.baidu.com/
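The URL printed for a POST stays clean because the values travel in the body. A prepared request makes this visible without any network traffic. The example.com URL is an assumption for illustration.

```python
import requests

form = [("user", "aaa"), ("id", "123")]
prepared = requests.Request("POST", "http://example.com", data=form).prepare()

print(prepared.url)                      # http://example.com/
print(prepared.body)                     # user=aaa&id=123
print(prepared.headers["Content-Type"])  # application/x-www-form-urlencoded
```

requests sets the form Content-Type on its own when data is a dict or a list of pairs, which matches what a browser sends when submitting a web form.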

4. Handling request headers

# The User-Agent must be a plain string (wrapping it in {} would create a set).
user_agent = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.104 Safari/537.36 Core/1.53.4295.400 QQBrowser/9.7.12661.400'
header = {'User-Agent': user_agent}
url = 'http://www.baidu.com/'
r = requests.get(url, headers=header)
print r.content
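When many requests need the same headers, one option is to set them once on a Session instead of passing headers= every time. The sketch below only prepares a request, so nothing is sent; the shortened User-Agent string and the example.com URL are assumptions.

```python
import requests

session = requests.Session()
# Headers set on the session are merged into every request it makes.
session.headers.update({"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64)"})

prepared = session.prepare_request(requests.Request("GET", "http://example.com/"))
print(prepared.headers["User-Agent"])
```

After this setup, session.get(url) sends the browser-like User-Agent without any extra arguments.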

5. Status codes and response headers

url = 'http://www.baidu.com'
r = requests.get(url)

if r.status_code == requests.codes.ok:
    print r.status_code
    print r.headers
    print r.headers.get('content-type')  # get() is the recommended way to read a header field
else:
    r.raise_for_status()

Output:
200
{'Content-Encoding': 'gzip', 'Transfer-Encoding': 'chunked', 'Set-Cookie': 'BDORZ=27315; max-age=86400; domain=.baidu.com; path=/', 'Server': 'bfe/1.0.8.18', 'Last-Modified': 'Mon, 23 Jan 2017 13:27:57 GMT', 'Connection': 'Keep-Alive', 'Pragma': 'no-cache', 'Cache-Control': 'private, no-cache, no-store, proxy-revalidate, no-transform', 'Date': 'Wed, 17 Jan 2018 07:21:21 GMT', 'Content-Type': 'text/html'}
text/html
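Two details of the snippet above can be tried offline. The headers object is case-insensitive and its get() returns None for a missing field, and raise_for_status() raises an exception for error codes. The sketch builds a Response by hand purely for demonstration, which is not how responses are normally obtained.

```python
import requests
from requests.structures import CaseInsensitiveDict

# r.headers is a CaseInsensitiveDict, so lookups ignore letter case.
headers = CaseInsensitiveDict({"Content-Type": "text/html"})
print(headers.get("content-type"))  # text/html
print(headers.get("X-Missing"))     # None, instead of raising KeyError

# A hand-built response standing in for a failed request.
response = requests.Response()
response.status_code = 404
try:
    response.raise_for_status()
    outcome = "ok"
except requests.exceptions.HTTPError as error:
    outcome = "HTTP error"
    print("request failed: %s" % error)
```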

6. Handling cookies

url = 'https://www.zhihu.com/'
r = requests.get(url)
print r.cookies
print r.cookies.keys()

Output:
<RequestsCookieJar[<Cookie aliyungf_tc=AQAAACYMglZy2QsAEnaG2yYR0vrtlxfz for www.zhihu.com/>]>
['aliyungf_tc']
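Reading cookies is one half; sending them back is the other. requests accepts a plain dict through the cookies parameter. The sketch prepares a request to show the resulting Cookie header without sending anything; the cookie name, value, and URL are made up for the example.

```python
import requests

cookies = {"session_id": "abc123"}  # hypothetical cookie
prepared = requests.Request("GET", "http://example.com/", cookies=cookies).prepare()

print(prepared.headers["Cookie"])  # session_id=abc123
```

In real use the same dict is passed directly, as in requests.get(url, cookies=cookies). A requests.Session keeps cookies between requests automatically.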

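7. Redirects and history

Redirect handling is controlled by the allow_redirects parameter: True allows redirects to be followed and False disables them. The responses that led to the final one are recorded in r.history.

url = 'http://www.baidu.com'
r = requests.get(url, allow_redirects=True)
print r.url
print r.status_code
print r.history

Output:
http://www.baidu.com/
200
[]

The empty list shows that no redirect took place for this URL.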
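Since the example URL does not redirect, r.history stays empty. To see a non-empty history without relying on any real site, the sketch below starts a tiny local server whose /old path redirects to /new. The server, its paths, and the handler class are all hypothetical test scaffolding, not part of requests.

```python
import threading

import requests

try:  # Python 3
    from http.server import BaseHTTPRequestHandler, HTTPServer
except ImportError:  # Python 2
    from BaseHTTPServer import BaseHTTPRequestHandler, HTTPServer


class RedirectHandler(BaseHTTPRequestHandler):
    """Hypothetical local server: /old answers 302 and points at /new."""

    def do_GET(self):
        if self.path == "/old":
            self.send_response(302)
            self.send_header("Location", "/new")
            self.send_header("Content-Length", "0")
            self.end_headers()
        else:
            body = b"ok"
            self.send_response(200)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo output quiet


server = HTTPServer(("127.0.0.1", 0), RedirectHandler)  # port 0 picks a free port
thread = threading.Thread(target=server.serve_forever)
thread.daemon = True
thread.start()
base = "http://127.0.0.1:%d" % server.server_port

session = requests.Session()
session.trust_env = False  # ignore proxy environment variables for this local demo

followed = session.get(base + "/old", timeout=5)
print(followed.status_code)                       # 200
print([r.status_code for r in followed.history])  # [302]

not_followed = session.get(base + "/old", allow_redirects=False, timeout=5)
print(not_followed.status_code)                   # 302
print(not_followed.headers["Location"])           # /new

server.shutdown()
server.server_close()
```

With redirects allowed, the final response is the 200 from /new and the intermediate 302 appears in history. With allow_redirects=False, the 302 itself is returned and the caller decides what to do with the Location header.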

8. Timeouts

The timeout is set with the timeout parameter, in seconds.

url = 'http://www.baidu.com'
r = requests.get(url, timeout=2)
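When the limit is exceeded, requests raises requests.exceptions.Timeout, so a scraper should be ready to catch it. The timeout can also be a (connect, read) pair. The sketch below demonstrates this against a local socket that accepts connections but never replies, a stand-in for a slow server; the socket setup is demo scaffolding only.

```python
import socket

import requests

# A local socket that accepts connections but never answers.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))  # port 0 picks a free port
listener.listen(1)
url = "http://127.0.0.1:%d/" % listener.getsockname()[1]

session = requests.Session()
session.trust_env = False  # ignore proxy environment variables for this local demo

try:
    # Allow 2 seconds to connect and 0.5 seconds to wait for the response.
    session.get(url, timeout=(2, 0.5))
    result = "response received"
except requests.exceptions.Timeout:
    result = "timed out"
finally:
    listener.close()

print(result)
```

Without a timeout, requests may wait indefinitely for a server that never answers, so setting one on every call is a sensible habit for scrapers.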

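9. Proxies

The proxies dict maps a URL scheme ('http', 'https') to the address of a proxy server. Use one entry per scheme, since a dict keeps only one value per key. The addresses below are placeholders; replace them with a real proxy.

proxies = {
    'http': 'http://10.10.1.10:3128',   # placeholder proxy address
    'https': 'http://10.10.1.10:1080',  # placeholder proxy address
}

url = 'http://www.baidu.com'
r = requests.get(url, proxies=proxies)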
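If several proxies are available, they cannot all go into one dict under the same key. One possible approach, sketched here, is to keep them in a list and pick one per request. The pool addresses and the helper name pick_proxies are placeholders invented for the example.

```python
import random

# Placeholder proxy addresses; replace with proxies you are allowed to use.
PROXY_POOL = [
    "http://10.10.1.10:3128",
    "http://10.10.1.11:3128",
    "http://10.10.1.12:3128",
]


def pick_proxies(pool=PROXY_POOL):
    """Return a proxies dict routing http and https through one random proxy."""
    proxy = random.choice(pool)
    return {"http": proxy, "https": proxy}


proxies = pick_proxies()
print(proxies)
```

The result is passed as before, for example requests.get(url, proxies=pick_proxies(), timeout=5), so each call may leave through a different proxy.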
