[Python 3.6 Crawler Learning Notes] (5) Using Cookies and a Simple Zhihu Scrape

阿新 • Published: 2019-01-28

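Preface
A cookie is data that a website stores on the user's local machine (usually encrypted) in order to identify the user and track the session.
Some sites only show certain pages after login, such as Zhihu answers, the friends list in QQ Zone, or the followed accounts and followers on Weibo. Before logging in, you are not allowed to fetch those pages. We can use a library to save the cookie obtained after logging in and let the crawler reuse it to open and scrape those pages. The site then still treats the requests as coming from a logged-in human visitor, not a crawler. A cookie does not stay valid forever, though, and has to be replaced after a while.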

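Handling login
1 Logging in with a form
This is a POST request: the form data is sent to the server first, and the cookie the server returns is then stored locally.
data = {'data1': 'XXXXX', 'data2': 'XXXXX'}
With Requests, data can be a dict or JSON.
import requests
response = requests.post(url=url, data=data)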
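To make the "send the form data first" step concrete, here is a minimal sketch of what a form POST body looks like, using only the standard library instead of Requests. The login URL is a placeholder I made up, and nothing is actually sent because the request is only built, never opened.

```python
from urllib import parse, request

# Same placeholder form fields as above
data = {'data1': 'XXXXX', 'data2': 'XXXXX'}

# A form login sends the fields as a URL-encoded body
body = parse.urlencode(data).encode('utf-8')

# Hypothetical login URL; supplying a body turns the request into a POST
req = request.Request('https://example.com/login', data=body)

print(req.get_method())
print(body)
```

Requests does the same encoding internally when data is a dict, so the one-line requests.post call above is usually all you need.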
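2 Logging in with a cookie
When a request carries a login cookie, the server treats you as a logged-in user and returns the logged-in content. Sites that require a CAPTCHA can therefore be handled by reusing the cookie from a login that already passed the CAPTCHA; more on that later.
import requests
# A session keeps the cookies returned by the login response and sends them with later requests
requests_session = requests.session()
response = requests_session.post(url=url_login, data=data)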
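What "logging in with a cookie" boils down to is sending back the name=value pair the server handed out earlier. Below is a rough standard-library sketch of that round trip. The Set-Cookie value and the URL are invented for illustration, and the request is only constructed, not sent.

```python
from http.cookies import SimpleCookie
from urllib import request

# Hypothetical Set-Cookie value, as a server might return after a successful login
set_cookie_value = 'sessionid=abc123; Path=/; HttpOnly'

# Parse it and keep only the name=value pairs, which is what the client sends back
parsed = SimpleCookie()
parsed.load(set_cookie_value)
cookie_header = '; '.join(
    '%s=%s' % (name, morsel.value) for name, morsel in parsed.items()
)

# Attach the cookie to a later request (hypothetical URL)
req = request.Request('https://example.com/profile',
                      headers={'Cookie': cookie_header})
print(req.get_header('Cookie'))
```

A requests session or an opener built with HTTPCookieProcessor does this bookkeeping automatically, which is why the rest of this post relies on them.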

This post only uses a simple cookie-based simulated login; for the details, see the linked tutorial.

# Scraping Zhihu: using headers
from http import cookiejar
from urllib import request
from bs4 import BeautifulSoup

# # Cookie test
# # Create a CookieJar instance
# cookie = cookiejar.CookieJar()
# # Create a cookie processor
# handle = request.HTTPCookieProcessor(cookie)
# # Build an opener from the cookie processor
# opener = request.build_opener(handle)

# # Open the page with the opener
# response = opener.open('https://www.zhihu.com/question/25313930')
# # Print the cookies
# for item in cookie:
#     print('Name = %s' % item.name)
#     print('Value = %s' % item.value)


# Name of the file used to store the cookie
filename = 'cookie.txt'

# Save the cookie to a file
def saveCookie():
    cookie = cookiejar.MozillaCookieJar(filename)
    handler = request.HTTPCookieProcessor(cookie)
    opener = request.build_opener(handler)
    response = opener.open('https://www.zhihu.com/question/25313930')
    # ignore_discard: save cookies even if they are marked to be discarded
    # ignore_expires: save cookies even if they have already expired
    cookie.save(ignore_discard=True, ignore_expires=True)

saveCookie()
# Load the cookie from the file and visit the page
# Create a MozillaCookieJar instance
cookie = cookiejar.MozillaCookieJar()
# Read the cookie contents from the file into the jar
cookie.load(filename, ignore_discard=True, ignore_expires=True)
# Create a cookie processor
handle = request.HTTPCookieProcessor(cookie)
# Build an opener from the cookie processor
opener = request.build_opener(handle)
# Open the page with the opener's open method
response = opener.open('https://www.zhihu.com/question/25313930')
html = response.read()

soup = BeautifulSoup(html, 'lxml')
storys = soup.find_all('div', class_="List-item")
print(len(storys))
for story in storys:
    nameLabel = story.find('meta', itemprop="name")
    name = nameLabel["content"]
    # Write as UTF-8 so Chinese text does not depend on the platform default encoding
    with open('By ' + str(name) + '.txt', 'w', encoding='utf-8') as f:
        storyText = story.find('span', class_="RichText CopyrightRichText-richText")
        # storyPages = storyText.find_all('p')
        try:
            # .strings yields every text node, so iterate over it to get all the content
            for string in storyText.strings:
                f.write(repr(string) + '\n')
            print('By ' + str(name) + ' has been finished')
        except Exception:
            print('Something is wrong on writing to txt')

print('That is all')
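The ignore_discard and ignore_expires flags are easy to mix up, so here is a small offline sketch of what they do when saving and loading a MozillaCookieJar. The cookies, the example.com domain and the make_cookie helper are all made up for the demo, and the file goes into a temporary directory.

```python
import os
import tempfile
import time
from http import cookiejar


def make_cookie(name, value, expires):
    """Build a cookie for the hypothetical domain example.com."""
    return cookiejar.Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain='example.com', domain_specified=True, domain_initial_dot=False,
        path='/', path_specified=True,
        secure=False,
        expires=expires,
        discard=expires is None,  # a session cookie has no expiry date
        comment=None, comment_url=None, rest={},
    )


path = os.path.join(tempfile.mkdtemp(), 'cookie.txt')
now = int(time.time())

jar = cookiejar.MozillaCookieJar(path)
jar.set_cookie(make_cookie('persistent', 'a', now + 3600))  # valid for an hour
jar.set_cookie(make_cookie('session', 'b', None))           # session cookie
jar.set_cookie(make_cookie('stale', 'c', now - 3600))       # already expired
jar.save(ignore_discard=True, ignore_expires=True)          # write all three

# Default load: session cookies and expired cookies are skipped
strict = cookiejar.MozillaCookieJar()
strict.load(path)

# Load with both flags: everything in the file comes back
lenient = cookiejar.MozillaCookieJar()
lenient.load(path, ignore_discard=True, ignore_expires=True)

print(sorted(c.name for c in strict))
print(sorted(c.name for c in lenient))
```

Login cookies are often session cookies, which is presumably why the script above passes ignore_discard=True both when saving and when loading.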

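Related issues
① An approach to refreshing the cookie
Define a saveCookie() method that, each time it is called, fetches and saves the cookie obtained from the previous login.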
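One way to decide when to call saveCookie() again is to look at the age of the cookie file. This is only a sketch of that idea: cookie_file_is_stale and the one-hour limit are my own assumptions, not something the site specifies.

```python
import os
import tempfile
import time


def cookie_file_is_stale(path, max_age_seconds):
    """Return True if the cookie file is missing or older than max_age_seconds."""
    if not os.path.exists(path):
        return True
    return time.time() - os.path.getmtime(path) > max_age_seconds


# Demonstration with a temporary file standing in for cookie.txt
demo_path = os.path.join(tempfile.mkdtemp(), 'cookie.txt')
missing = cookie_file_is_stale(demo_path, 3600)      # no file yet

with open(demo_path, 'w') as f:
    f.write('# Netscape HTTP Cookie File\n')
fresh = cookie_file_is_stale(demo_path, 3600)        # just written

two_hours_ago = time.time() - 7200
os.utime(demo_path, (two_hours_ago, two_hours_ago))  # pretend the file is old
old = cookie_file_is_stale(demo_path, 3600)

print(missing, fresh, old)
```

In the script above this would be used as: if cookie_file_is_stale(filename, 3600): saveCookie().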

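② Scraping the full text of an answer
Source code
The code below did not work, and printing storyPages showed an empty list:

storyPages = storyText.find_all('p')
for storyPage in storyPages:
    f.write(str(storyPage) + '\n')

Iterating over storyText.strings works instead:

for string in storyText.strings:
    f.write(repr(string) + '\n')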
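One possible reason find_all('p') comes back empty is that the answer text is not wrapped in p tags at all, for example when lines are separated by br tags. The sketch below illustrates that case with the standard library's html.parser as a stand-in for BeautifulSoup. The HTML snippet and the TextAndTagCollector class are invented for the demo and are not Zhihu's real markup.

```python
from html.parser import HTMLParser


class TextAndTagCollector(HTMLParser):
    """Count <p> tags and collect every non-empty text node."""

    def __init__(self):
        super().__init__()
        self.p_count = 0
        self.strings = []

    def handle_starttag(self, tag, attrs):
        if tag == 'p':
            self.p_count += 1

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.strings.append(text)


# Hypothetical answer body: text separated by <br>, no <p> anywhere
html = ('<span class="RichText">First line<br>Second line<br>'
        '<b>Third</b> line</span>')

collector = TextAndTagCollector()
collector.feed(html)
collector.close()

print(collector.p_count)   # number of <p> tags found
print(collector.strings)   # all text nodes, in document order
```

Walking the text nodes, which is what .strings does in BeautifulSoup, still reaches all the content even when no p tags exist.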

③ Not all answers on the page were scraped
Problems encountered

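④ Equivalent ways of getting a URL's page source
Whether or not a cookie is used, the two approaches below yield the same thing when printed: the raw page source as bytes.

# With a cookie
response = opener.open('https://www.zhihu.com/question/25313930')
html = response.read()

# Without a cookie
url = 'http://www.jianshu.com/p/82833d443e76'
html = requests.get(url).content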
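To check the equivalence without touching the network, the sketch below opens the same page through a cookie-aware opener and through plain urlopen. It uses a data: URL as a stand-in page, which is my own choice for the demo, and it uses urlopen in place of requests.get so that only the standard library is needed.

```python
from http import cookiejar
from urllib import parse, request

# A tiny stand-in page embedded in a data: URL, so no network access is needed
page = '<p>hello</p>'
url = 'data:text/html;charset=utf-8,' + parse.quote(page)

# Route 1: an opener with a cookie processor, as in the Zhihu script
opener = request.build_opener(request.HTTPCookieProcessor(cookiejar.CookieJar()))
with opener.open(url) as response:
    html_with_opener = response.read()

# Route 2: a plain request with no cookie handling
with request.urlopen(url) as response:
    html_plain = response.read()

print(html_with_opener == html_plain)
print(type(html_plain))
```

Both routes return bytes, so either result can be handed to BeautifulSoup in the same way.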
