Python [Web Crawler] Study Notes, Part 1 (Basics)

阿新 • Published: 2018-11-01

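First, a look at how to use urllib.

urllib provides a set of functions for working with URLs.

Commonly used modules:

urllib.request: request module
urllib.error: exception-handling module
urllib.parse: URL-parsing module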
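
To make the module list a little more concrete, here is a minimal sketch that touches urllib.parse and urllib.error without sending any request. The URL is the Youdao address used later in this post; everything else is only illustration.

```python
from urllib import error, parse

# Split a URL into its components.
parts = parse.urlparse(
    "http://fanyi.youdao.com/translate?smartresult=dict&smartresult=rule"
)
print(parts.scheme)  # http
print(parts.netloc)  # fanyi.youdao.com
print(parts.path)    # /translate

# parse_qs groups repeated query keys into lists.
query = parse.parse_qs(parts.query)
print(query)         # {'smartresult': ['dict', 'rule']}

# HTTPError is a subclass of URLError, so an except clause for
# HTTPError has to come before one for URLError.
print(issubclass(error.HTTPError, error.URLError))  # True
```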

urllib.request

The request module of urllib makes it easy to fetch the contents of a URL: it sends a GET request to the specified page and returns the HTTP response:

from urllib import request
response = request.urlopen("https://www.baidu.com/")

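.read() reads the entire response body in one call and returns it as a bytes object, which is usually stored in a single variable.

html = response.read()

That is not enough on its own, because the returned value is raw bytes. Most web pages are encoded as UTF-8, so a decode() step usually follows.

html = html.decode("utf-8")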
print(html)
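
The bytes-to-text step can be tried without any network access. The sketch below uses a hypothetical helper, decode_body (not part of urllib), that falls back to UTF-8 when no charset is known; the sample strings are made up for illustration.

```python
def decode_body(raw, charset=None):
    """Decode response bytes, falling back to UTF-8 when no charset is given."""
    return raw.decode(charset or "utf-8", errors="replace")


# Bytes as they would come back from response.read().
utf8_bytes = "百度一下".encode("utf-8")
print(type(utf8_bytes).__name__)  # bytes
print(decode_body(utf8_bytes))    # 百度一下

# A page served in another encoding needs that encoding's name.
gbk_bytes = "百度".encode("gbk")
print(decode_body(gbk_bytes, "gbk"))  # 百度
```

With a real response, the charset declared by the server can usually be read with response.headers.get_content_charset(), which returns None when the header does not name one, so decode_body(response.read(), response.headers.get_content_charset()) should cover both cases.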

This produces the following output:

<html>

<head>

	<script>

		location.replace(location.href.replace("https://","http://"));

	</script>

</head>

<body>

	<noscript><meta http-equiv="refresh" content="0;url=http://www.baidu.com/"></noscript>

</body>

</html>

Example 1: download a picture of a kitten. Source site: placekitten.com

The code is as follows:

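from urllib import request

# Fetch a 500x600 kitten image; the response body is the raw image bytes.
response = request.urlopen("http://placekitten.com/g/500/600")
cat_img = response.read()

# Write in binary mode. Every backslash in the Windows path is escaped.
with open("D:\\實驗樓\\cat_500_600.jpg", 'wb') as f:
    f.write(cat_img)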

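urllib.parse

This is where urllib.parse comes in. urllib.parse.urlencode() turns the POST data (a dict) into a URL-encoded string, and converting that string to bytes, for example with bytes(urllib.parse.urlencode(data), encoding="utf-8"), gives a value that can be passed as the data argument of urllib.request.urlopen. That completes a POST request.
In other words, a request that carries a data argument is sent as POST, and a request without data is sent as GET.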
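
The GET/POST rule above can be checked without sending anything, because a Request object reports the method it would use. This is only a small sketch; the form fields are placeholders chosen for illustration.

```python
from urllib import parse, request

url = "http://fanyi.youdao.com/translate"

# No data argument: the request would be sent as GET.
get_req = request.Request(url)

# urlencode() gives a str; urlopen/Request need bytes, hence encode().
form = {"i": "hello", "doctype": "json"}  # placeholder fields
body = parse.urlencode(form).encode("utf-8")
post_req = request.Request(url, data=body)

print(get_req.get_method())   # GET
print(post_req.get_method())  # POST
print(body)                   # b'i=hello&doctype=json'
```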

Example 2: translating text with Youdao Translate.

from urllib import request
from urllib import parse
import json

content = input("Enter the text to translate: ")

url ="http://fanyi.youdao.com/translate?smartresult=dict&smartresult=rule"


head ={ }
head['User-Agent']='Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36'

data ={}
data["from"]="AUTO"
data["to"]="AUTO"
data["i"]= content
data["client"]="fanyideskweb"
data["sign"]="c8f3a6d3a2e68a5ba21a0c36de9ed9cd"
data["salt"]="1539962031171"
data["smartresult"]="dict"
data["doctype"]="json"
data["version"]= 2.1
data["keyfrom"]="fanyi.web"
data["action"]="FY_BY_REALTIME"
data["typoResult"]= "false"

data = parse.urlencode(data).encode("utf-8")
# urllib.parse.urlencode() turns a dict into a URL query string.
# Characters outside the ASCII-safe set are UTF-8 encoded and written
# as percent escapes:
# >>> params = {'query': '中文', 'submit': 'search'}
# >>> data = urllib.parse.urlencode(params)
# >>> data
# 'query=%E4%B8%AD%E6%96%87&submit=search'
req = request.Request(url, data, head)
response = request.urlopen(req)

# Alternatively, leave head out of Request(...) and add the header afterwards:
# req.add_header("User-Agent", 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36')
html = response.read().decode('utf-8')

target = json.loads(html)
# json.loads decodes JSON text and returns the matching Python object (here a dict).
res = target['translateResult'][0][0]['tgt']
print("Translation result:", res)
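
The nested lookup target['translateResult'][0][0]['tgt'] is easier to follow against a concrete payload. The JSON string below is a hypothetical sample shaped only to match that access path; the real service response may carry other fields or a different layout.

```python
import json

# Hypothetical sample payload, modeled on the lookup used in the code above.
sample = '{"errorCode": 0, "translateResult": [[{"src": "hello", "tgt": "你好"}]]}'


def extract_translation(text):
    """Return the first translated segment from a JSON response string."""
    target = json.loads(text)  # str -> dict
    # translateResult is a list of lists of dicts; take the first entry.
    return target["translateResult"][0][0]["tgt"]


print(extract_translation(sample))  # 你好
```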

How to use a proxy IP:

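1. What is the difference between build_opener and urlopen, and what does build_opener offer?

Answer: The pages you want to crawl vary widely. Some require a CAPTCHA, some require cookies, and many use other, more advanced mechanisms that get in a crawler's way. My understanding of urlopen is simply that it opens a page: it takes a URL, given either as a string or as a Request object. build_opener adds handlers on top of that, so it can deal with these cases in a more specialized and customizable way.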
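
As one illustration of what a handler adds, the sketch below builds an opener that keeps cookies between requests. It only constructs the opener and sends nothing; the shortened User-Agent string is a placeholder.

```python
from http import cookiejar
from urllib import request

# The jar stores cookies set by the server so later requests send them back.
jar = cookiejar.CookieJar()
opener = request.build_opener(request.HTTPCookieProcessor(jar))

# Headers set here are added to every request made through this opener.
opener.addheaders = [("User-Agent", "Mozilla/5.0")]  # placeholder value

# opener.open(url) would send a request through these handlers.
print(len(jar))  # 0, since nothing has been fetched yet
```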

2. How should the code be written when using the default handlers?

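Example 1:

import random
import urllib.request

# url, data and headers are assumed to be defined already.
req = urllib.request.Request(url, data, headers or {})

iplist = ['119.6.144.73:81',...........]

# Pick one proxy at random and build an opener that routes HTTP traffic through it.
proxy_support = urllib.request.ProxyHandler({'http': random.choice(iplist)})
opener = urllib.request.build_opener(proxy_support)
Example 2:
# install_opener returns None; it makes opener the default used by urlopen.
urllib.request.install_opener(opener)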
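
The proxy steps can be wrapped in a small helper. This is a sketch under assumptions: make_proxy_opener is a hypothetical name, the proxy list holds only the single address quoted above, and that address is unlikely to still be a working proxy.

```python
import random
from urllib import request


def make_proxy_opener(iplist, user_agent="Mozilla/5.0"):
    """Build an opener that sends HTTP traffic through a random proxy."""
    proxy = random.choice(iplist)
    handler = request.ProxyHandler({"http": proxy})
    opener = request.build_opener(handler)
    opener.addheaders = [("User-Agent", user_agent)]
    return opener, proxy


# With a one-item list the choice is deterministic.
opener, chosen = make_proxy_opener(["119.6.144.73:81"])
print(chosen)  # 119.6.144.73:81
```

From here there are two ways to use it: call opener.open(req) for a single request, or call request.install_opener(opener) once so that every later request.urlopen(...) goes through the proxy.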
