
Crawling all the images under a Zhihu question


I recently came across this question while browsing Zhihu:


The top-voted answer was a crawler that downloaded every photo in the thread.


Heh heh heh, the power of technology.


I happen to be studying this myself, and since that answer is quite old and Zhihu has been redesigned since then, I decided to write my own version in Python 3 as practice (absolutely not just to download the photos)....


A small goal to start with: crawl the photos of all the "female" programmers who answered.

First, we need the total number of answers, which is straightforward:

url="https://www.zhihu.com/question/37787176"
html=requests.get(url,headers=headers).text
answer=BeautifulSoup(html,"lxml").find("h4",class_="List-headerText").find("span").get_text()
answer_num=int(re.sub("\s\S+","",answer))

Zhihu loads answers when you click "more", fetching 20 answers at a time. Simulating a login with selenium is too slow and cumbersome, so inspecting Zhihu's Ajax requests is the more reliable route. Thanks to Cui Qingcai's tutorial for this (http://cuiqingcai.com/4380.html).

In the browser's developer tools you can see that each click on "more" requests a "fetch"-type file plus the related images (jpeg); this "fetch" file contains the answerer information and the answer content.
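To make that structure concrete, here is a minimal sketch that walks over one such response. The sample_response below is a made-up, trimmed-down stand-in for the real "fetch" JSON, keeping only the fields this post uses (data, author.gender, content):

# Made-up, trimmed-down stand-in for one "fetch" response (illustration only);
# the real JSON carries many more fields per answer
sample_response = {
    "data": [
        {"author": {"name": "answerer_a", "gender": 0},
         "content": '<p>my photo</p><img src="https://example.com/photo1.jpg">'},
        {"author": {"name": "answerer_b", "gender": 1},
         "content": "<p>text only, no image</p>"},
    ]
}

for row in sample_response["data"]:
    # gender: 0 = female, 1 = male
    print(row["author"]["name"], "gender:", row["author"]["gender"])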


After parsing the JSON, the author's gender field tells us the answerer's gender (0 = female, 1 = male).

Then grab all the image links from the src attributes under "content", and that's it.
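As a small self-contained sketch of that step (the answer_html string is made up for illustration; in the real script the HTML comes from row["content"]):

from bs4 import BeautifulSoup

# Made-up answer HTML for illustration; the real HTML comes from row["content"]
answer_html = '<p>photo below</p><img src="https://example.com/photo1.jpg"><img src="placeholder.gif">'

for pic in BeautifulSoup(answer_html, "lxml").find_all("img"):
    down_url = pic["src"]
    # Keep only absolute http(s) links, the same filter the full script applies
    if down_url.startswith("http"):
        print("would download:", down_url)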

Note: the request headers must include an "authorization" field.
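For example, a placeholder sketch (copy the real token from the "authorization" request header shown in the browser's developer tools; the value below is not a working token):

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...",
    # Placeholder: replace with the value copied from the browser's request headers
    "authorization": "Bearer <token copied from the browser>",
}
# Every requests.get() call in the script below passes headers=headers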


The full code follows:

import requests
import os
import json
from bs4 import BeautifulSoup
import re
import time

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36",
    "Connection": "keep-alive",
    "Accept": "text/html,application/xhtml+xml,application/xml;",
    "authorization": "Bearer Mi4wQUFEQVB4VkNBQUFBVU1MWktySDJDeGNBQUFCaEFsVk5TZ0YyV1FBaGsxRnJlTFd3ZGR6QzZrTXptcDFuWGNOQk5B|1498313802|2d5466ef4550588f5fc28553ea8981e7a4e398ad"
}

# Create the download directory if it does not exist, then work inside it
save_dir = "D:/crawler_study/zhihu"
if not os.path.exists(save_dir):
    os.makedirs(save_dir)
os.chdir(save_dir)

# Total number of answers, taken from the question page header
url = "https://www.zhihu.com/question/37787176"
html = requests.get(url, headers=headers).text
answer = BeautifulSoup(html, "lxml").find("h4", class_="List-headerText").find("span").get_text()
answer_num = int(re.sub(r"\s\S+", "", answer))

# The Ajax endpoint behind the "more" button; 20 answers per request
url_prefix = "https://www.zhihu.com/api/v4/questions/37787176/answers?sort_by=default&include=data%5B%2A%5D.is_normal%2Cis_collapsed%2Ccollapse_reason%2Cis_sticky%2Ccollapsed_by%2Csuggest_edit%2Ccomment_count%2Ccan_comment%2Ccontent%2Ceditable_content%2Cvoteup_count%2Creshipment_settings%2Ccomment_permission%2Cmark_infos%2Ccreated_time%2Cupdated_time%2Creview_info%2Crelationship.is_authorized%2Cis_author%2Cvoting%2Cis_thanked%2Cis_nothelp%2Cupvoted_followees%3Bdata%5B%2A%5D.author.follower_count%2Cbadge%5B%3F%28type%3Dbest_answerer%29%5D.topics&limit=20&offset="
offset = 0

while offset < answer_num:
    answer_url = url_prefix + str(offset)
    html = requests.get(answer_url, headers=headers).text
    content = json.loads(html)["data"]
    for row in content:
        gender = row["author"]["gender"]
        if gender == 0:  # 0 = female, 1 = male
            answer = row["content"]
            pic_list = BeautifulSoup(answer, "lxml").find_all("img")
            for pic in pic_list:
                down_url = pic["src"]
                if down_url.startswith("http"):
                    # File name = everything after the last "/" in the image URL
                    name = re.sub(r".*/", "", down_url)
                    print("Downloading:", name)
                    with open(name, "wb") as file:
                        file.write(requests.get(down_url).content)
                    print("Finished:", name)
    offset += 20
    time.sleep(3)
