Python 3 urllib scraping: this one article is all you need
阿新 • Published: 2018-11-01
A note up front: all data below has been anonymized.
```python
from urllib import request

import requests

if __name__ == "__main__":
    # Login form data (anonymized)
    data = {'username': '11111111', 'password': '11111111'}
    requrl = 'https://xxxxxx.com/xx/login?xxxxxxxxxxxxxxxxxxxxxxx'  # login request URL
    headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; '
               'rv:58.0) Gecko/20100101 Firefox/58.0'}

    # Send the login request. data and headers must be passed as keyword
    # arguments: requests.post(requrl, data, headers) would bind headers
    # to the json parameter instead, and the headers would never be sent.
    conn = requests.post(requrl, data=data, headers=headers)
    # cookies = conn.cookies.get_dict()
    print(conn.request.headers)
    newheaders = conn.request.headers  # the headers actually sent, cookie included

    url = "http://xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx.htm"  # URL of the page to scrape
    print(type(newheaders))
    newheaders = dict(newheaders)      # CaseInsensitiveDict -> plain dict
    print(type(newheaders))
    del newheaders['Accept-Encoding']  # see the note below
    print(newheaders)

    req = request.Request(url=url, headers=newheaders)
    rsp = request.urlopen(req)
    html = rsp.read().decode("utf-8", "ignore")
    print(html)
```
If you don't remove Accept-Encoding, you'll get an error or garbled text:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte
So convert the headers to a plain dict, then delete Accept-Encoding.
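That 0x8b byte in the traceback is a clue: gzip-compressed data always starts with the magic bytes \x1f\x8b, so the error means the server honored `Accept-Encoding: gzip` and returned compressed bytes that urllib does not decompress for you. An alternative to deleting the header is to decompress the body yourself; a minimal sketch with the stdlib `gzip` module (the sample HTML is made up):

```python
import gzip

# gzip output always begins with the magic bytes 0x1f 0x8b -- the 0x8b
# "invalid start byte" seen in the UnicodeDecodeError above.
body = gzip.compress("<html>hello</html>".encode("utf-8"))
assert body[:2] == b"\x1f\x8b"

# If you keep Accept-Encoding: gzip, decompress before decoding:
html = gzip.decompress(body).decode("utf-8")
print(html)  # <html>hello</html>
```

Deleting the header is simpler; decompressing yourself saves bandwidth on large pages.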
A brief explanation: first build the login request with your username and password; once the login succeeds, take its cookie and use it to visit the page you want to scrape. Without that cookie, the login page will intercept you.
Once you can fetch the page you want, you can do whatever you like with it next.
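If you'd rather stay entirely inside the standard library, urllib can manage the login cookie for you via `http.cookiejar`: an opener wired to a CookieJar stores any Set-Cookie from the login response and replays it on later requests, much like `requests.session()`. A sketch under the same placeholder URLs; `fetch_after_login` is a name I've made up, and it assumes a form-encoded login like the article's:

```python
import urllib.parse
import urllib.request
from http.cookiejar import CookieJar

def fetch_after_login(login_url, page_url, username, password):
    """Log in, let the CookieJar capture the session cookie,
    then fetch the protected page with that cookie attached."""
    jar = CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    opener.addheaders = [('User-Agent', 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; '
                          'rv:58.0) Gecko/20100101 Firefox/58.0')]
    data = urllib.parse.urlencode({'username': username,
                                   'password': password}).encode('utf-8')
    opener.open(login_url, data)   # the login response's Set-Cookie lands in jar
    rsp = opener.open(page_url)    # the opener replays the cookie automatically
    return rsp.read().decode('utf-8', 'ignore')
```

Called with the article's login URL, page URL, and credentials, this replaces the manual copying of `conn.request.headers`.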
About the cookie: you can also inspect it manually. Press F12, open the Network tab, and look under Headers for the Request Headers section. The most important entry there is your cookie, which stores everything about the current login session and changes every time you log in again.
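If you go that manual route and copy the Cookie value out of Request Headers, the stdlib `http.cookies.SimpleCookie` can parse the string into name/value pairs. A small illustration with a made-up cookie string:

```python
from http.cookies import SimpleCookie

# A made-up example of what you might copy from DevTools' Request Headers.
raw = "JSESSIONID=abc123; theme=dark"

cookie = SimpleCookie()
cookie.load(raw)
pairs = {name: morsel.value for name, morsel in cookie.items()}
print(pairs)  # {'JSESSIONID': 'abc123', 'theme': 'dark'}

# Or reattach the raw string to a scraping request verbatim:
headers = {"Cookie": raw}
```

Remember that a freshly copied cookie only stays valid until the session expires or you log in again.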