Crawler Study Notes [1]: Fetching WWW Resources with urllib

阿新 • Published: 2018-12-15

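1. Fetching ordinary web pages

Inspecting basic information about urllib.request

The most commonly used function in urllib.request is urlopen(), and it is the basic way to fetch an ordinary web page with urllib. Before using it, we first look at urllib's source code, a professional habit worth building for anyone doing software work. urllib is part of the Python 3 standard library, so nothing needs to be installed. After import urllib or import urllib.request, the path to the source can be read from the module's __file__ attribute.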

import urllib.request

# Print the module's list of public names
print(urllib.request.__all__)

# Print where the urllib.request source file lives on disk
print(urllib.request.__file__)
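Besides opening the source file, the standard inspect module can show a function's signature and docstring from inside the interpreter. The following is a minimal sketch; the helper name describe is made up for illustration and is not part of urllib.

```python
import inspect
import urllib.request


def describe(func):
    """Return (signature text, first docstring line) for a callable."""
    signature = str(inspect.signature(func))
    doc = inspect.getdoc(func) or ""
    first_line = doc.splitlines()[0] if doc else ""
    return signature, first_line


signature, summary = describe(urllib.request.urlopen)
print("urlopen" + signature)
print(summary)
```

The exact parameter list printed depends on the Python version in use; as far as I know, cafile, capath and cadefault were deprecated and no longer appear from Python 3.13 on.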

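Using urllib.request.urlopen()

The docstring at the top of urlopen tells us that it takes the page URL as a string, opens that URL, and returns a file-like response object.

def urlopen(url, data=None, timeout=socket._GLOBAL_DEFAULT_TIMEOUT, *, cafile=None, capath=None, cadefault=False, context=None)

The prototype above shows what urlopen does and how to call it: url is the only required argument, and all the others have defaults. Below we try using urlopen to open a web page.

We call it inside a with ... as ... statement, which makes sure the connection is closed properly once it is no longer needed. The return value is an HTTPResponse object. Its read() method returns the body as bytes, which we then decode into text.

"""Fetch a simple web page"""
import urllib.request

url = 'http://www.smartcrane.club'

# Open the page with urlopen.
# The with ... as ... statement makes sure the connection is closed once it is no longer in use.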
with urllib.request.urlopen(url) as resp:
    print(resp.read().decode('utf-8'))
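The listing above assumes the page is encoded in UTF-8. When that is not guaranteed, the charset named in the Content-Type response header can be used instead. A minimal sketch, assuming a hypothetical helper decode_body that is not part of urllib; it works on plain bytes so it can be tried without a network connection:

```python
from email.message import Message


def decode_body(raw, content_type, default="utf-8"):
    """Decode response bytes using the charset named in a Content-Type value."""
    msg = Message()
    msg["Content-Type"] = content_type or "application/octet-stream"
    charset = msg.get_content_charset() or default
    try:
        return raw.decode(charset)
    except (LookupError, UnicodeDecodeError):
        # Unknown or wrong charset label: fall back to the default encoding
        return raw.decode(default, errors="replace")


print(decode_body("北航".encode("gbk"), "text/html; charset=GBK"))
print(decode_body("北航".encode("utf-8"), "text/html"))
```

With a live response this would be called as decode_body(resp.read(), resp.headers.get("Content-Type")); resp.headers.get_content_charset() returns the same charset directly.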

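Using urllib.request.urlretrieve()

urllib.request.urlretrieve() fetches page content in a different way: it saves the content to a file (a temporary file when no filename is given) and returns that file's path together with the response headers.

The prototype of urlretrieve:

def urlretrieve(url, filename=None, reporthook=None, data=None)

"""Trying out urlretrieve()"""
import urllib.request

url = 'http://www.smartcrane.club'

tempfile, header = urllib.request.urlretrieve(url)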

print(header)
print('--'*10)
print(tempfile)
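The filename and reporthook parameters in the prototype are not used above. The sketch below shows one way they might be used: urlretrieve calls the hook with the block number, the block size and the total size (which is -1 when the server does not report it). The function names percent_done, show_progress and download are illustrative only.

```python
import urllib.request


def percent_done(block_num, block_size, total_size):
    """Return download progress in percent, or None if the size is unknown."""
    if total_size <= 0:
        return None
    return min(100.0, block_num * block_size * 100.0 / total_size)


def show_progress(block_num, block_size, total_size):
    """reporthook callback: urlretrieve calls it as blocks are transferred."""
    pct = percent_done(block_num, block_size, total_size)
    print("progress unknown" if pct is None else "%.1f%%" % pct)


def download(url, filename):
    """Save url to filename and return (local path, response headers)."""
    return urllib.request.urlretrieve(url, filename, reporthook=show_progress)
```

A call such as download('http://www.smartcrane.club', 'index.html') would save the page under the given name; index.html is just an assumed filename.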

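Understanding the HTTPResponse object

An HTTPResponse object is file-like: besides reading its content with read() as you would a file, it has other attributes and methods.

For example, r.code and r.status hold the status code of the response; r.headers holds the response headers; r.url holds the URL of the resource that was actually retrieved. You can also try the info() and geturl() methods.

(Use the response's geturl() and info() methods to check that the request and response are what we expect. Sometimes the server that answers is not the host the request was sent to, for example after a redirect.)

"""Understanding the HTTPResponse object"""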

import urllib.request

url = 'http://www.smartcrane.club'

with urllib.request.urlopen(url) as resp:
    print(resp)
    print(resp.code)
    print(resp.status)
    print(resp.headers)
    print(resp.url)
    print(resp.info())
    print(resp.geturl())

2. Making the crawler look more like a browser

By default, the request headers sent by urllib look roughly like this:

GET / HTTP/1.1
Accept-Encoding: identity 
Host: 10.10.10.135 
User-Agent: Python-urllib/3.6 
Connection: close

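Many web servers inspect incoming requests and check the client type, partly to serve different kinds of clients appropriately and partly to restrict access by crawler programs.

The User-Agent header shown above is one of the main things they check. A typical browser sends request headers like these: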

Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8
Accept-Encoding: gzip, deflate
Accept-Language: zh-CN,zh;q=0.9
Cache-Control: max-age=0
Connection: keep-alive
Cookie: _ga=GA1.2.1750953090.1536588381; _gid=GA1.2.14861119.1539324601; _gat=1
Host: www.smartcrane.club
If-Modified-Since: Tue, 09 Oct 2018 06:29:24 GMT
If-None-Match: W/"5bbc4ac4-82e1"
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36

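A server that detects that the client is not an ordinary browser may refuse to serve it. For example, when accessing www.z.cn, the following code fails with HTTP Error 503: Service Unavailable:

with urllib.request.urlopen(url) as response:
    print(response.status)

In that case we can build a customized urllib.request.Request object so that the request looks more like one sent by a browser.

"""Customizing the request object so the crawler looks more like a browser"""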

import urllib.request

url = 'http://www.smartcrane.club'

header = {
    'Accept': 'text/html',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36'
}

request = urllib.request.Request(url = url, headers = header)

with urllib.request.urlopen(request) as resp:
    print(resp.status)
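Before sending a customized request it can help to check which headers it will actually carry. The sketch below builds a Request without opening any connection; the shortened User-Agent string and the extra Accept-Language header are assumptions made for the example.

```python
import urllib.request

url = 'http://www.smartcrane.club'
header = {
    'Accept': 'text/html',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64)',  # shortened for the example
}

request = urllib.request.Request(url=url, headers=header)
# add_header sets one more header after construction
request.add_header('Accept-Language', 'zh-CN,zh;q=0.9')

# Request stores header names capitalized ("User-agent"), so look them up that way
print(request.get_header('User-agent'))
print(request.has_header('Accept-language'))
for name, value in request.header_items():
    print(name, ':', value)
```

Note that get_header('User-Agent') returns None here, because the stored key is 'User-agent'.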

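3. Passing parameters to the server

Many HTTP methods can carry data to the server. The two most common, GET and POST, both can, but they do it in different ways.

Sending data to the server with GET

"""Submitting parameters to the server"""
import urllib.parse
import urllib.request

# 1. Submit the information as URL query parameters with GET

url = 'http://www.baidu.com/s?'

wd = {'wd':'北航'}

# urlencode percent-encodes the non-ASCII value so it is safe to put in a URL
wdcoded = urllib.parse.urlencode(wd)

url1 =  url + wdcoded


header = {
    'Accept': 'text/html',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36'
}

# Request the URL that carries the query string (url1), not the bare url
request = urllib.request.Request(url = url1, headers = header)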

with urllib.request.urlopen(request) as resp:
    print(resp.status)
    print(resp.headers)
    print(resp.read().decode('utf-8'))
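String concatenation works for the simple case above. A slightly more defensive sketch splits the base URL first, so that an existing query string is preserved; build_url is a hypothetical helper name and example.com is a placeholder host.

```python
from urllib.parse import urlencode, urlsplit, urlunsplit, parse_qs


def build_url(base, params):
    """Return base with params appended as a percent-encoded query string."""
    parts = urlsplit(base)
    query = urlencode(params)
    if parts.query:
        # Keep any query string the base URL already carries
        query = parts.query + '&' + query
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, parts.fragment))


search_url = build_url('http://www.baidu.com/s', {'wd': '北航'})
print(search_url)
# parse_qs reverses the encoding, which is handy for checking the result
print(parse_qs(urlsplit(search_url).query))
```

For instance, build_url('http://example.com/a?x=1', {'y': 2}) gives 'http://example.com/a?x=1&y=2'.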

Sending data to the server with POST

"""
Submit data to http://httpbin.org with POST.
Set up the site beforehand and enable the trial link.
"""

import urllib.request
import urllib.parse

url = 'http://www.httpbin.org/post'
payload = {'key1':'value1','key2':'value'}
header = {
    'Accept': 'text/html',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36',
}

request = urllib.request.Request(url=url,data=urllib.parse.urlencode(payload).encode('utf-8'),headers=header)

with urllib.request.urlopen(request) as resp:
    print(resp.read().decode('utf-8'))
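What turns the request above into a POST is the data argument: when data is given, Request reports POST as its method. The sketch below checks this without sending anything, and also shows a JSON body as an alternative to form encoding; the JSON variant is an addition for illustration and is not used in the original listing.

```python
import json
import urllib.parse
import urllib.request

url = 'http://www.httpbin.org/post'
payload = {'key1': 'value1', 'key2': 'value'}

# Form-encoded body: giving data turns the request into a POST
form_request = urllib.request.Request(
    url, data=urllib.parse.urlencode(payload).encode('utf-8'))

# JSON body: the Content-Type header tells the server how to parse it
json_request = urllib.request.Request(
    url,
    data=json.dumps(payload).encode('utf-8'),
    headers={'Content-Type': 'application/json'},
    method='POST',
)

print(form_request.get_method(), form_request.data)
print(json_request.get_method(), json_request.get_header('Content-type'))
```

Either request object can then be passed to urllib.request.urlopen() exactly as in the listing above.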

4. Setting timeouts and handling exceptions

urllib.error handles exceptions. The two commonly used exception classes are urllib.error.URLError and urllib.error.HTTPError.

Setting a timeout

"""Setting a timeout"""
import urllib.request
import socket

# timeout in seconds
timeout = 3
socket.setdefaulttimeout(timeout)

# this call to urllib.request.urlopen now uses the default timeout
# we have set in the socket module
req = urllib.request.Request("http://www.python.org/")
a = urllib.request.urlopen(req).read()
print(a)
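socket.setdefaulttimeout() changes the default for every socket in the process. The timeout parameter in the urlopen prototype allows a per-call limit instead. A minimal sketch, assuming a hypothetical helper named fetch that returns None when the request times out or fails:

```python
import socket
import urllib.error
import urllib.request


def fetch(url, timeout=3):
    """Return the response body as bytes, or None on timeout or URL error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.read()
    except socket.timeout:
        # Raised when the server accepts the connection but answers too slowly
        print('Timed out after %s seconds: %s' % (timeout, url))
    except urllib.error.URLError as e:
        # Covers connection failures, including a timeout while connecting
        print('Failed to reach %s: %s' % (url, e.reason))
    return None
```

For example, fetch('http://www.python.org/', timeout=3) limits only this one request and leaves other sockets untouched.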

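Handling exceptions with urllib.error

URLError inherits from OSError and is the base class of urllib's exceptions. HTTPError is a subclass of URLError that is raised when the server answers with an HTTP error status; it can also be used like a response object.

HTTP protocol errors are still valid responses: they carry a status code, headers, and a body.

"""Handling exceptions with urllib.error"""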

import urllib.request
import urllib.error
import urllib.parse
import logging

logging.basicConfig(format='%(asctime)s:%(levelname)s:%(message)s',
                   datefmt='%Y-%m-%d %H:%M:%S',
                   filename='C:\\Users\\Wen Xuhonghe\\Documents\\Crawler\\CrawlerLesson1_crawler.log',
                   level=logging.DEBUG)
try:
    url = 'http://www.baidu11.com'   # A wrong website url
    headers = {
        'Accept': 'text/html',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36',
    }
    request = urllib.request.Request(url,headers=headers)
    with urllib.request.urlopen(request) as response:
        print(response.status)
        print(response.read().decode('utf-8'))

except urllib.error.HTTPError as e:
    import http.server
    # BaseHTTPRequestHandler.responses maps a status code to a (short, long) message tuple
    #print(http.server.BaseHTTPRequestHandler.responses[e.code])
    logging.error('HTTPError code: %s and Messages: %s' %(str(e.code),http.server.BaseHTTPRequestHandler.responses[e.code]))
    logging.info('HTTPError headers: ' + str(e.headers))
    logging.info(e.read().decode('utf-8'))
    print('Error : urllib.error.HTTPError')
except urllib.error.URLError as e:
    logging.error(e.reason)
    print('Error : urllib.error.URLError')
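Because HTTPError is a subclass of URLError, it has to be tested (or caught) first, otherwise the URLError branch would handle it. The sketch below illustrates this with exceptions built by hand, so it runs without network access. The helper name describe_error and the example.com URL are assumptions, and http.HTTPStatus is used here as another way to look up the status text.

```python
import email.message
import http
import io
import urllib.error


def describe_error(exc):
    """Return a short description of an exception raised by urlopen."""
    # Test HTTPError first: it is a subclass of URLError
    if isinstance(exc, urllib.error.HTTPError):
        try:
            phrase = http.HTTPStatus(exc.code).phrase
        except ValueError:
            phrase = 'Unknown status'
        return 'HTTP %d %s' % (exc.code, phrase)
    if isinstance(exc, urllib.error.URLError):
        return 'URL error: %s' % exc.reason
    return 'Unexpected error: %r' % exc


# Build exceptions by hand so the example runs without network access
not_found = urllib.error.HTTPError(
    'http://example.com/missing', 404, 'Not Found',
    email.message.Message(), io.BytesIO(b''))
unreachable = urllib.error.URLError('name resolution failed')

print(issubclass(urllib.error.HTTPError, urllib.error.URLError))
print(describe_error(not_found))
print(describe_error(unreachable))
```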

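5. Example: downloading multiple pages of a Baidu Tieba thread

loadPage(url) fetches a web page
WritePage(html, filename) saves a fetched page to a local file
tiebaCrawler(url, beginPage, endPage, keyword) does the scheduling and supplies the URLs of the pages to crawl
main: the program's main control module, providing a basic command-line interface

In the listing that follows, the scheduling loop is written directly in the main block rather than in a separate tiebaCrawler function.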

import urllib.request
import urllib.error
import urllib.parse
import logging

logging.basicConfig(format='%(asctime)s:%(levelname)s:%(message)s',
                   datefmt='%Y-%m-%d %H:%M:%S',
                   filename='C:\\Users\\Wen Xuhonghe\\Documents\\Crawler\\CrawlerLesson1_crawler.log',
                   level=logging.DEBUG)

    
def loadPage(url):
    try:
        headers = {
            'Accept': 'text/html',
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36',
        }
        request = urllib.request.Request(url,headers=headers)
        with urllib.request.urlopen(request) as response:
            print(response.status)
            #print(response.read().decode('utf-8'))
            return response.read()
            
    except urllib.error.HTTPError as e:
        import http.server
        # BaseHTTPRequestHandler.responses maps a status code to a (short, long) message tuple
        #print(http.server.BaseHTTPRequestHandler.responses[e.code])
        logging.error('HTTPError code: %s and Messages: %s' %(str(e.code),http.server.BaseHTTPRequestHandler.responses[e.code]))
        logging.info('HTTPError headers: ' + str(e.headers))
        logging.info(e.read().decode('utf-8'))
        print('Error : urllib.error.HTTPError')
    except urllib.error.URLError as e:
        logging.error(e.reason)
        print('Error : urllib.error.URLError')

def WritePage(html, filename):
    # Write the raw bytes; the with statement closes the file even on error
    with open(filename, 'wb') as fp:
        fp.write(html)

if __name__ == "__main__":
    for i in range(3):
        url = 'http://tieba.baidu.com/p/5872199831?pn=' + str(i+1)
        response = loadPage(url)
        # loadPage returns None when the request failed, so skip that page
        if response is None:
            continue
        filename = 'C:\\Users\\Wen Xuhonghe\\Documents\\Crawler\\CrawlerLesson1_Tieba' + str(i+1) + '.html'
        WritePage(response,filename)
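The scheduling role described for tiebaCrawler could be sketched roughly as follows. This is only one possible shape, not the author's implementation: the parameter list differs from the one given in the outline (the fetch and save functions are passed in, and there is no keyword parameter), and the names page_urls, load, save and prefix are assumptions.

```python
from urllib.parse import urlencode


def page_urls(base_url, begin_page, end_page):
    """Yield (page number, URL) for each page from begin_page to end_page inclusive."""
    for page in range(begin_page, end_page + 1):
        yield page, base_url + '?' + urlencode({'pn': page})


def tiebaCrawler(base_url, begin_page, end_page, load, save, prefix='Tieba'):
    """Fetch each page with load(url) and store it with save(html, filename).

    Returns the list of filenames written; pages where load returns None are skipped.
    """
    written = []
    for page, url in page_urls(base_url, begin_page, end_page):
        html = load(url)
        if html is None:
            continue
        filename = '%s%d.html' % (prefix, page)
        save(html, filename)
        written.append(filename)
    return written
```

With the functions from the listing above, a call might look like tiebaCrawler('http://tieba.baidu.com/p/5872199831', 1, 3, loadPage, WritePage). Passing the fetch and save functions as arguments also makes the scheduler easy to try out with stand-ins that do not touch the network or the disk.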
