
Python3 Web Scraping Study Notes (1. The urllib Library in Detail)

1. What is a web crawler: omitted here; explanations are easy to find elsewhere.

Although this is introductory material, readers without a Python background may find it hard going; learning some basic Python first is recommended.

I studied front-end topics earlier precisely so I could read HTML, which makes learning web scraping easier; some familiarity with front-end technology is recommended.

2. A first look at the requests library:

Print the source code of the Baidu homepage:

import requests
response = requests.get("http://www.baidu.com")
print(response.text)
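A side note: requests guesses the text encoding from the response headers, which can garble Chinese pages. A minimal sketch of one safeguard, using the library's apparent_encoding attribute (detected from the body itself):
import requests
response = requests.get("http://www.baidu.com")
# apparent_encoding is guessed from the body; assigning it to
# response.encoding makes response.text decode correctly
response.encoding = response.apparent_encoding
print(response.text)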
Print the response headers:
import requests
response = requests.get("http://www.baidu.com")
print(response.headers)
Print the status code (200 means success):

import requests
response = requests.get("http://www.baidu.com")
print(response.status_code)
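Rather than comparing status codes by hand, requests can raise on failure; a minimal sketch using the library's raise_for_status():
import requests
response = requests.get("http://www.baidu.com")
# raises requests.HTTPError for 4xx/5xx responses; a successful
# request simply falls through
response.raise_for_status()
print("OK:", response.status_code)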

Spoof the User-Agent request header:

import requests
headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.84 Safari/537.36'}
response = requests.get("http://www.baidu.com", headers=headers)
print(response.status_code)
Get an image as binary data:
import requests
headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.84 Safari/537.36'}
response = requests.get("https://www.baidu.com/img/bd_logo1.png", headers=headers)
print(response.content)
Save the image to a local file (the URL is a PNG, so save it with a .png extension):
import requests
headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.84 Safari/537.36'}
response = requests.get("https://www.baidu.com/img/bd_logo1.png", headers=headers)
with open("pic.png", "wb") as f:
    f.write(response.content)
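For larger files, downloading in chunks avoids holding the whole body in memory; a sketch using requests' stream mode (the chunk size is an arbitrary choice):
import requests
headers = {'User-Agent': 'Mozilla/5.0'}
response = requests.get("https://www.baidu.com/img/bd_logo1.png",
                        headers=headers, stream=True)
with open("pic.png", "wb") as f:
    # iter_content yields the body piece by piece instead of
    # loading it all at once
    for chunk in response.iter_content(chunk_size=8192):
        f.write(chunk)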

3. The urllib library in detail:

Fetch the page source:

import urllib.request
response = urllib.request.urlopen("http://www.baidu.com")
print(response.read().decode("utf-8"))
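Hard-coding utf-8 can fail on pages served with another charset; a minimal sketch that reads the charset from the Content-Type header instead, falling back to utf-8 when the header omits it:
import urllib.request
response = urllib.request.urlopen("http://www.baidu.com")
# get_content_charset() parses the charset out of the Content-Type header
charset = response.headers.get_content_charset() or "utf-8"
print(response.read().decode(charset))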

Send a dictionary via POST (this URL is an HTTP testing service and is worth remembering):

import urllib.parse
import urllib.request
data = bytes(urllib.parse.urlencode({"word":"hello"}), encoding="utf-8")
response = urllib.request.urlopen("http://httpbin.org/post", data=data)
print(response.read())
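httpbin.org echoes the request back as JSON, so the response can be parsed to confirm the form data arrived; a sketch assuming httpbin.org is reachable:
import json
import urllib.parse
import urllib.request
data = bytes(urllib.parse.urlencode({"word": "hello"}), encoding="utf-8")
response = urllib.request.urlopen("http://httpbin.org/post", data=data)
# the submitted form fields come back under the "form" key
result = json.loads(response.read().decode("utf-8"))
print(result["form"])   # {'word': 'hello'}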
Catch a timeout exception:
import socket
import urllib.error
import urllib.request
try:
    response = urllib.request.urlopen("http://httpbin.org/get", timeout=0.1)
except urllib.error.URLError as e:
    if isinstance(e.reason, socket.timeout):
        print("TIME OUT")
Get the response type:
import urllib.request
response = urllib.request.urlopen("http://www.python.org")
print(type(response))
#<class 'http.client.HTTPResponse'>
Status code and response headers:
import urllib.request
response = urllib.request.urlopen("http://www.python.org")
print(response.status)
print(response.getheaders())
print(response.getheader("Server"))
Construct a Request object:
import urllib.request
request = urllib.request.Request("http://python.org")
response = urllib.request.urlopen(request)
print(response.read().decode("utf-8"))
A complete construction:
from urllib import request, parse
url = "http://www.python.org/post"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.84 Safari/537.36",
"Host": "httpbin.org"
}
dict = {
    "name": "Germey"
}
data = bytes(parse.urlencode(dict), encoding="utf8")
req = request.Request(url=url, data=data, headers=headers, method="POST")
response = request.urlopen(req)
print(response.read().decode("utf-8"))
An alternative implementation (adding the header after constructing the Request):
from urllib import request, parse
url = "http://httpbin.org/post"
form = {
    "name": "Germey"
}
data = bytes(parse.urlencode(form), encoding="utf8")
req = request.Request(url=url, data=data, method="POST")
req.add_header("User-Agent", "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.84 Safari/537.36")
response = request.urlopen(req)
print(response.read().decode("utf-8"))
Proxies (rotate IPs so the crawler does not get blocked):
import urllib.request
proxy_handler = urllib.request.ProxyHandler({
    "http": "http://222.222.169.60:53281",
    "https": "http://222.222.169.60:53281"
})
opener = urllib.request.build_opener(proxy_handler)
response = opener.open("http://www.baidu.com")
print(response.read())
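The opener can also be installed globally so that plain urlopen() calls go through the proxy too; a sketch using install_opener (the proxy address is the sample one above and is unlikely to still be alive):
import urllib.request
proxy_handler = urllib.request.ProxyHandler({
    "http": "http://222.222.169.60:53281"
})
opener = urllib.request.build_opener(proxy_handler)
# after install_opener, every urllib.request.urlopen() call
# routes through this opener's proxy
urllib.request.install_opener(opener)
response = urllib.request.urlopen("http://www.baidu.com")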
Get cookies:
import http.cookiejar
import urllib.request
cookie = http.cookiejar.CookieJar()
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open("http://www.baidu.com")
for item in cookie:
    print(item.name+"="+item.value)

Save cookies to a text file:

import http.cookiejar
import urllib.request
filename = "cookie.txt"
cookie = http.cookiejar.MozillaCookieJar(filename)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open("http://www.baidu.com")
cookie.save(ignore_discard=True, ignore_expires=True)
Another way to save them (LWPCookieJar writes the libwww-perl Set-Cookie3 format rather than the Mozilla cookies.txt format):
import http.cookiejar
import urllib.request
filename = "cookie.txt"
cookie = http.cookiejar.LWPCookieJar(filename)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open("http://www.baidu.com")
cookie.save(ignore_discard=True, ignore_expires=True)
Access a page using the cookies saved above:
import http.cookiejar
import urllib.request
cookie = http.cookiejar.LWPCookieJar()
cookie.load("cookie.txt", ignore_discard=True, ignore_expires=True)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open("http://www.baidu.com")
print(response.read().decode("utf-8"))
Exception handling:
from urllib import request,error
try:
    response = request.urlopen("http://heiheiyiqing.com/index.htm")
except error.URLError as e:
    print(e.reason)

Detailed exception handling (HTTPError is a subclass of URLError, so it is caught first):

from urllib import request,error
try:
    response = request.urlopen("http://heiheiyiqing.com/index.htm")
except error.HTTPError as e:
    print(e.reason, e.code, e.headers, sep='\n')
except error.URLError as e:
    print(e.reason)
else:
    print("Request Successfully")

URL parsing:

from urllib.parse import urlparse
result = urlparse("https://www.baidu.com/index.php?tn=monline_3_dg")
print(type(result),result)
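The result is a named tuple, so each component can be read by name, and parse_qs turns the query string into a dictionary; a small sketch of both:
from urllib.parse import urlparse, parse_qs
result = urlparse("https://www.baidu.com/index.php?tn=monline_3_dg")
# ParseResult fields: scheme, netloc, path, params, query, fragment
print(result.scheme)           # https
print(result.netloc)           # www.baidu.com
print(result.path)             # /index.php
print(parse_qs(result.query))  # {'tn': ['monline_3_dg']}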

URL assembly:

from urllib.parse import urlunparse
data = ["http","www.baidu.com", "index.html", "user", "a=6", "comment"]
print(urlunparse(data))
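The call above assembles the six components (scheme, netloc, path, params, query, fragment) into http://www.baidu.com/index.html;user?a=6#comment. For resolving a relative link against a base URL, urllib.parse also offers urljoin; a small sketch:
from urllib.parse import urljoin
# a relative link is resolved against the base URL
print(urljoin("http://www.baidu.com/index.html", "about.html"))
# http://www.baidu.com/about.html
# an absolute second argument simply wins
print(urljoin("http://www.baidu.com", "https://www.python.org/FAQ.html"))
# https://www.python.org/FAQ.html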

URL encoding:

from urllib.parse import urlencode
params = {
    "name": "yiqing",
    "age": "18"
}
base_url = "http://www.baidu.com?"
url = base_url+urlencode(params)
print(url)
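urlencode handles whole dictionaries; for percent-encoding a single component such as a Chinese keyword, urllib.parse also provides quote and unquote; a small sketch:
from urllib.parse import quote, unquote
keyword = quote("爬蟲")  # percent-encode one component
url = "https://www.baidu.com/s?wd=" + keyword
print(url)            # https://www.baidu.com/s?wd=%E7%88%AC%E8%9F%B2
print(unquote(url))   # https://www.baidu.com/s?wd=爬蟲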