
Python 3 urllib crawling: the only guide you need

A note up front: all data below has been anonymized.

```python
from urllib import request

import requests

if __name__ == "__main__":
    data = {'username': '11111111', 'password': '11111111'}
    requrl = 'https://xxxxxx.com/xx/login?xxxxxxxxxxxxxxxxxxxxxxx'  # login request URL
    headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:58.0) Gecko/20100101 Firefox/58.0'}
    # Send the login request. data and headers must be passed as keyword
    # arguments: the third positional parameter of requests.post() is json,
    # not headers.
    conn = requests.post(requrl, data=data, headers=headers)
    # The headers actually sent on the (possibly redirected) login request,
    # including the session cookie
    newheaders = conn.request.headers
    print(newheaders)
    url = "http://xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx.htm"  # URL of the page to crawl
    print(type(newheaders))
    # Convert the CaseInsensitiveDict into a plain dict so a key can be deleted
    newheaders = dict(newheaders)
    print(type(newheaders))
    del newheaders['Accept-Encoding']
    print(newheaders)
    req = request.Request(url=url, headers=newheaders)
    rsp = request.urlopen(req)
    html = rsp.read().decode("utf-8", "ignore")
    print(html)
```

If you do not remove Accept-Encoding, the request will either raise an error or return garbled text:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

The 0x8b byte is the second byte of the gzip magic number: with Accept-Encoding present, the server returns gzip-compressed data, which urllib does not decompress automatically, so decoding it as UTF-8 fails. That is why the headers are converted to a dict and Accept-Encoding is deleted, asking the server for an uncompressed response.
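An alternative to deleting the header is to keep Accept-Encoding and decompress the body yourself with the stdlib gzip module. A minimal local sketch (the compressed bytes here just simulate a gzipped response body, not a real server reply):

```python
import gzip

# Simulate a gzip-encoded response body. It starts with the magic bytes
# 0x1f 0x8b, which is why naive UTF-8 decoding fails on byte 0x8b.
body = gzip.compress("<html>hello</html>".encode("utf-8"))
assert body[:2] == b"\x1f\x8b"

# Decompress first, then decode
html = gzip.decompress(body).decode("utf-8")
print(html)  # <html>hello</html>
```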

A brief explanation: first build the login request with the username and password; once the login succeeds, grab the cookie and use it to request the page you actually want to crawl. Without the cookie, the site will just bounce you back to the login page.
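The same log-in-then-crawl flow can also be done entirely with urllib by attaching a cookie jar to an opener, so the session cookie from the login response is stored and re-sent automatically. A sketch under placeholder URLs and field names (the actual network calls are commented out):

```python
from http.cookiejar import CookieJar
from urllib import parse, request

# One jar shared by every request made through this opener
jar = CookieJar()
opener = request.build_opener(request.HTTPCookieProcessor(jar))

# Form-encode the login credentials (field names are placeholders)
login_data = parse.urlencode({"username": "11111111", "password": "11111111"}).encode("utf-8")

# opener.open("https://xxxxxx.com/xx/login", data=login_data)  # server's Set-Cookie lands in `jar`
# html = opener.open("http://xxxxxx.com/target.htm").read()    # cookie is sent back automatically
print(login_data)
```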

Once you can fetch the page you want, you can do whatever you like with it.

As for the cookie, you can also find it manually: press F12, open the Network tab, and under Headers look at Request Headers. The most important entry there is your cookie; it holds all the information for the current login session and changes every time you log in again.
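If you do copy the cookie out of DevTools by hand, it arrives as one "name=value; name=value" string. The stdlib can split it into a dict for you (the cookie names below are made up for illustration):

```python
from http.cookies import SimpleCookie

# A hypothetical Cookie value copied from the Request Headers panel
raw = "sessionid=abc123; csrftoken=xyz789"

cookie = SimpleCookie()
cookie.load(raw)
cookies = {name: morsel.value for name, morsel in cookie.items()}
print(cookies)  # {'sessionid': 'abc123', 'csrftoken': 'xyz789'}
```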