
Python Web Scraping: Basic Usage of the Built-in urllib Library

The urllib package contains the following modules:

urllib.request      request module
urllib.error        exception handling module
urllib.parse        URL parsing module
urllib.robotparser  robots.txt parsing module (see the sketch below)
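The first three modules are used throughout this article; urllib.robotparser is not demonstrated elsewhere, so here is a minimal sketch (the robots.txt URL is just an example):

from urllib import robotparser

# download and parse a site's robots.txt
rp = robotparser.RobotFileParser()
rp.set_url("http://www.baidu.com/robots.txt")
rp.read()

# can_fetch(user_agent, url) reports whether crawling that URL is allowed
print(rp.can_fetch("*", "http://www.baidu.com/index.html"))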

py2 vs. py3

Python 2:
urllib.urlopen()

Python 3:
urllib.request.urlopen()
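For code that must run on both versions, a common pattern is to try the Python 3 import first and fall back to Python 2; a minimal sketch:

# resolve urlopen on both Python 3 and Python 2
try:
    from urllib.request import urlopen  # Python 3
except ImportError:
    from urllib import urlopen  # Python 2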

Import the required modules

from urllib import request
from urllib import parse
from urllib import error
from http import cookiejar
import socket

Sending requests

A request is made up of the URL, parameters, data, and headers.

urlopen

urlopen(url, data=None, timeout=socket._GLOBAL_DEFAULT_TIMEOUT, *,
    cafile=None, capath=None, cadefault=False, context=None)
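The cafile, capath, and context parameters control HTTPS certificate handling. A minimal sketch of passing an explicit SSL context (the URL is just an example):

import ssl

# a default context verifies server certificates against the system CAs
ctx = ssl.create_default_context()
response = request.urlopen("https://httpbin.org/get", context=ctx)
print(response.status)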

# send a GET request
def foo1():
    response = request.urlopen("http://www.baidu.com")
    # bytes -> utf-8 decode -> str
    print(response.read().decode("utf-8"))

# send a POST request
def foo2():
    data = bytes(parse.urlencode({"word": "hello"}), encoding="utf-8")
    response = request.urlopen("http://httpbin.org/post", data=data)
    print(response.read())

# set a timeout and catch the resulting exception
def foo3():
    try:
        response = request.urlopen("http://httpbin.org/post", timeout=0.1)
        print(response.read())
    except error.URLError as e:
        print(type(e.reason))  # <class 'socket.timeout'>
        if isinstance(e.reason, socket.timeout):
            print("timeout error:", e)

Responses


# status code and response headers
def foo4():
    response = request.urlopen("http://www.baidu.com")
    print(type(response))
    # from http.client import HTTPResponse
    # <class 'http.client.HTTPResponse'>

    print(response.status)
    print(response.getheaders())
    print(response.getheader("Server"))
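The response object also provides geturl(), getcode(), and info() for compatibility with older code. A short sketch (the name foo4b just continues this article's numbering):

def foo4b():
    response = request.urlopen("http://www.baidu.com")
    print(response.geturl())   # final URL, after any redirects
    print(response.getcode())  # status code, same as response.status
    print(response.info())     # response headers as a message object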

The Request object

def foo5():
    req = request.Request("http://www.baidu.com")
    response = request.urlopen(req)
    print(response.read().decode("utf-8"))

# request with browser headers, method 1
def foo6():
    url = "http://httpbin.org/post"
    headers = {
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6)",
        "Host": "httpbin.org"
    }
    dct = {"name": "Tom"}

    data = bytes(parse.urlencode(dct), encoding="utf-8")
    req = request.Request(url=url, data=data, headers=headers)
    response = request.urlopen(req)
    print(response.read().decode("utf-8"))


# request with browser headers, method 2
def foo7():
    url = "http://httpbin.org/post"
    dct = {"name": "Tom"}
    data = bytes(parse.urlencode(dct), encoding="utf-8")

    req = request.Request(url=url, data=data, method="POST")
    req.add_header("User-Agent",
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6)")

    response = request.urlopen(req)

    print(response.read().decode("utf-8"))

Proxies


def foo8():
    proxy_handler = request.ProxyHandler({
        "http": "http://183.159.94.185:18118",
        "https": "https://183.159.94.187:18118",
        })
    opener = request.build_opener(proxy_handler)
    response = opener.open("http://www.baidu.com")
    print(response.read())
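Calling opener.open() works, but the opener can also be installed globally so that plain request.urlopen() goes through the proxy; a minimal sketch reusing the example proxy address above (foo8b is a hypothetical name):

def foo8b():
    proxy_handler = request.ProxyHandler({
        "http": "http://183.159.94.185:18118",
    })
    opener = request.build_opener(proxy_handler)
    request.install_opener(opener)  # make this opener the global default
    response = request.urlopen("http://www.baidu.com")  # now uses the proxy
    print(response.status)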

Cookies


def foo9():
    cookie = cookiejar.CookieJar()
    cookie_handler = request.HTTPCookieProcessor(cookie)
    opener = request.build_opener(cookie_handler)
    response = opener.open("http://www.baidu.com")
    print(response.status)
    for item in cookie:
        print(item.name, item.value)

# save cookies, method 1 (Mozilla format)
def foo10():
    filename = "cookie.txt"
    cookie = cookiejar.MozillaCookieJar(filename)
    cookie_handler = request.HTTPCookieProcessor(cookie)
    opener = request.build_opener(cookie_handler)
    response = opener.open("http://www.baidu.com")
    cookie.save(ignore_discard=True, ignore_expires=True)

# save cookies, method 2 (LWP format)
def foo11():
    filename = "cookie1.txt"
    cookie = cookiejar.LWPCookieJar(filename)
    cookie_handler = request.HTTPCookieProcessor(cookie)
    opener = request.build_opener(cookie_handler)
    response = opener.open("http://www.baidu.com")
    cookie.save(ignore_discard=True, ignore_expires=True)

# load cookies from a file
def foo12():
    filename = "cookie1.txt"
    cookie = cookiejar.LWPCookieJar()
    cookie.load(filename, ignore_discard=True, ignore_expires=True)
    cookie_handler = request.HTTPCookieProcessor(cookie)
    opener = request.build_opener(cookie_handler)
    response = opener.open("http://www.baidu.com")
    print(response.read().decode("utf-8"))

Exception handling

The error module mainly provides URLError, HTTPError, and ContentTooShortError.


def foo13():
    try:
        response = request.urlopen("http://www.xxooxxooxox.com/xxx")
        print(response.status)
    except error.HTTPError as e:  # subclass exception, catch it first
        print(e.name, e.reason, e.code, e.headers, sep="\n")
    except error.URLError as e:  # parent-class exception
        print(e.reason)
    else:
        print("successful")

Parsing URLs with the parse module

urlparse(url, scheme='', allow_fragments=True)

def foo14():
    result = parse.urlparse("http://www.baidu.com/xxx.html;user?id=5#comment")
    print(type(result), result, sep="\n")
    """
    <class 'urllib.parse.ParseResult'>
    ParseResult(scheme='http', netloc='www.baidu.com', path='/xxx.html', 
            params='user', query='id=5', fragment='comment')
    """

    # scheme supplies a default protocol; a scheme already in the URL takes precedence
    result = parse.urlparse("www.baidu.com", scheme="https")
    print(result)
    """
    ParseResult(scheme='https', netloc='', path='www.baidu.com',
          params='', query='', fragment='')
    """

    result = parse.urlparse("http://www.baidu.com", scheme="https")
    print(result)
    """
    ParseResult(scheme='http', netloc='www.baidu.com', path='', 
            params='', query='', fragment='')
    """

    # allow_fragments determines where the fragment (anchor) ends up
    result = parse.urlparse("http://www.baidu.com/xxx.html;user?id=5#comment",
                    allow_fragments=True)
    print(result)
    """
    ParseResult(scheme='http', netloc='www.baidu.com', path='/xxx.html', 
            params='user', query='id=5', fragment='comment')
    """

    result = parse.urlparse("http://www.baidu.com/xxx.html;user?id=5#comment",
                    allow_fragments=False)
    print(result)
    """
    ParseResult(scheme='http', netloc='www.baidu.com', path='/xxx.html', 
            params='user', query='id=5#comment', fragment='')

    """

    result = parse.urlparse("http://www.baidu.com/xxx.html;user#comment",
                    allow_fragments=False)
    print(result)
    """
    ParseResult(scheme='http', netloc='www.baidu.com', path='/xxx.html', 
            params='user#comment', query='', fragment='')

    """

# urlunparse assembles a URL from its parts; note the order of components
def foo15():
    data = ["http", "www.baidu.com", "index.html", "user", "a=6", "comment"]
    print(parse.urlunparse(data))
    # http://www.baidu.com/index.html;user?a=6#comment

# urljoin joins URLs, similar to os.path.join; the second argument takes precedence
def foo16():
    print(parse.urljoin("http://www.baidu.com", "index.html"))
    print(parse.urljoin("http://www.baidu.com", "http://www.qq.com/index.html"))
    print(parse.urljoin("http://www.baidu.com/index.html", "http://www.qq.com/?id=6"))
    """
    http://www.baidu.com/index.html
    http://www.qq.com/index.html
    http://www.qq.com/?id=6
    """

# urlencode converts a dict into URL query-string form
def foo17():
    params ={
        "name": "Tom",
        "age": 18
    }
    # the ? is lost here: urljoin treats the encoded params as a relative path
    url = parse.urljoin("http://www.baidu.com/?", parse.urlencode(params))
    print(url)
    # http://www.baidu.com/name=Tom&age=18

    url = "http://www.baidu.com/?" + parse.urlencode(params)
    print(url)
    # http://www.baidu.com/?name=Tom&age=18
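Two related helpers in parse are worth knowing: quote() percent-encodes characters that are not URL-safe, and parse_qs() is the inverse of urlencode(). A short sketch (foo18 is a hypothetical name continuing the numbering):

def foo18():
    # quote percent-encodes unsafe characters (space, non-ASCII, ...)
    print(parse.quote("hello world/你好"))
    # hello%20world/%E4%BD%A0%E5%A5%BD

    # parse_qs turns a query string back into a dict of lists
    print(parse.parse_qs("name=Tom&age=18"))
    # {'name': ['Tom'], 'age': ['18']}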