Python Scrapy: crawling the content and images of all answers under Zhihu questions and collections
阿新 · Published 2018-12-09
The previous article covered the full process of crawling Zhihu question information. This one covers crawling the content and images of all answers under a question; the overall process is much the same, only part of the core code differs.
The rough flow for crawling everything under one question is:
- Start from a question URL
- Request the URL and get the number of answers under the question (I skip this step, because the answer count was already saved when the question information was crawled earlier)
- Fetch the answers through the answer API (if each call returns 5 answers and there are 100 answers in total, that works out to 20 API calls; see the short sketch after this list) [the answer API address appears in the code below]
- Save the content returned by the answer API to MySQL
- Extract the image URLs from the content and save the images locally
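
As a quick illustration of the arithmetic in the third step (the numbers here are hypothetical examples, not the article's):

```python
# Hypothetical numbers to illustrate the limit/offset pagination of the answer API.
answer_num = 100                             # total answers reported for the question
limit = 5                                    # answers returned per API call
offsets = list(range(0, answer_num, limit))  # [0, 5, 10, ..., 95]
print(len(offsets))                          # 20 API calls are needed
```

The spider below simply issues `answer_num // 5 + 1` requests, which at worst fetches one extra, empty page.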

Crawling code:
Query the question ids from the MySQL database, then hit the answer API directly to fetch the data.
```python
answer_template = "https://www.zhihu.com/api/v4/questions/%s/answers?include=data[*].is_normal,admin_closed_comment,reward_info,is_collapsed,annotation_action,annotation_detail,collapse_reason,is_sticky,collapsed_by,suggest_edit,comment_count,can_comment,content,editable_content,voteup_count,reshipment_settings,comment_permission,created_time,updated_time,review_info,relevant_info,question,excerpt,relationship.is_authorized,is_author,voting,is_thanked,is_nothelp;data[*].mark_infos[*].url;data[*].author.follower_count,badge[?(type=best_answerer)].topics&limit=5&offset=%s&sort_by=default"

def check_login(self, response):
    # Read the question info from MySQL and crawl the answers of each question
    db = MySQLdb.connect("localhost", "root", "", "crawl", charset='utf8')
    cursor = db.cursor()
    selectsql = "select questionid,answer_num from zhihu_question where id in (251,138,93,233,96,293,47,24,288,151,120,311,214,33);"
    try:
        cursor.execute(selectsql)
        results = cursor.fetchall()
        for row in results:
            questionid = row[0]
            answer_num = row[1]
            fornum = answer_num // 5  # how many times the answer API has to be called
            print("questionid: " + str(questionid) + " answer_num: " + str(answer_num))
            for i in range(fornum + 1):
                answer_url = self.answer_template % (str(questionid), str(i * 5))
                yield scrapy.Request(answer_url, callback=self.parse_answer, headers=self.headers)
    except Exception as e:
        print(e)
    db.close()
```
Parsing the response
parse_answer parses the content returned by the API. This part is straightforward because the response is JSON. Code as follows:
```python
def parse_answer(self, response):
    # While testing, the raw response was written to a local file and parsed by a
    # standalone Python main method; those test scripts are under the test_code directory.
    # temfn = str(random.randint(0, 100))
    # f = open("/var/www/html/scrapy/answer/" + temfn, 'wb')
    # f.write(response.body)
    # f.write("------")
    # f.close()
    res = json.loads(response.text)
    # print(res)
    data = res['data']
    # Each response contains several answers (5 by default), so iterate over them
    for od in data:
        # print(od)
        item = AnswerItem()
        item['answer_id'] = str(od['id'])  # answer id
        item['question_id'] = str(od['question']['id'])
        item['question_title'] = od['question']['title']
        item['author_url_token'] = od['author']['url_token']
        item['author_name'] = od['author']['name']
        item['voteup_count'] = str(od['voteup_count'])
        item['comment_count'] = str(od["comment_count"])
        item['content'] = od['content']
        yield item
        testh = etree.HTML(od['content'])
        itemimg = MyImageItem()
        itemimg['question_answer_id'] = str(od['question']['id']) + "/" + str(od['id'])
        itemimg['image_urls'] = testh.xpath("//img/@data-original")
        yield itemimg
```
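
Steps 4 and 5 of the flow above (saving the answer content to MySQL and the images to disk) are handled by item pipelines that the article does not show. Below is a minimal sketch of what they could look like, built only from the item fields that parse_answer fills in; the class names, the `zhihu_answer` table, and the directory layout are assumptions for illustration.

```python
# Hypothetical pipelines (not shown in the article). Enable them in settings.py via
# ITEM_PIPELINES and set IMAGES_STORE for the image pipeline.
import MySQLdb
import scrapy
from scrapy.pipelines.images import ImagesPipeline


class AnswerMysqlPipeline(object):
    """Writes each AnswerItem into an assumed zhihu_answer table."""

    def open_spider(self, spider):
        self.db = MySQLdb.connect("localhost", "root", "", "crawl", charset='utf8')
        self.cursor = self.db.cursor()

    def process_item(self, item, spider):
        if 'answer_id' not in item:  # image items pass through untouched
            return item
        insertsql = ("insert into zhihu_answer(answer_id, question_id, question_title, "
                     "author_url_token, author_name, voteup_count, comment_count, content) "
                     "values (%s, %s, %s, %s, %s, %s, %s, %s)")
        try:
            self.cursor.execute(insertsql, (
                item['answer_id'], item['question_id'], item['question_title'],
                item['author_url_token'], item['author_name'],
                item['voteup_count'], item['comment_count'], item['content']))
            self.db.commit()
        except Exception as e:
            self.db.rollback()
            print(e)
        return item

    def close_spider(self, spider):
        self.db.close()


class AnswerImagePipeline(ImagesPipeline):
    """Stores each answer's images under <IMAGES_STORE>/<question_id>/<answer_id>/."""

    def get_media_requests(self, item, info):
        for url in item.get('image_urls', []):
            # Carry the question/answer id along so file_path() can use it
            yield scrapy.Request(url, meta={'qa_id': item['question_answer_id']})

    def file_path(self, request, response=None, info=None, *, item=None):
        name = request.url.split('/')[-1]  # naive file name taken from the URL
        return "%s/%s" % (request.meta['qa_id'], name)
```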
Results
Crawled 40,000+ answers and 12 GB of images (my personal server only had 12 GB of space left~)

Crawling the answer content and images under a collection:
The flow for crawling the answers under a collection is basically the same as for a question; the differences are:
- For questions, start_urls holds several question URLs at once; collections are crawled one at a time
- The question page has a content API that returns JSON, which is convenient. I did not find such an API for the collection page, so I request every page and parse the HTML.
Constructing the start URL of each page:
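The article's own snippet for this step is not included here. Below is a minimal sketch, assuming the collection pages are paginated with a `?page=N` query parameter; the collection id and page count are hypothetical placeholders.

```python
# Minimal sketch (not the article's code). The pagination format, collection id
# and page count below are assumptions for illustration only.
collection_template = "https://www.zhihu.com/collection/%s?page=%s"
collection_id = "000000000"   # hypothetical collection id
page_num = 10                 # hypothetical number of pages in the collection

start_urls = [collection_template % (collection_id, str(p))
              for p in range(1, page_num + 1)]
```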

Core HTML-parsing code:
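The original code for this step is not included here either. Below is a minimal sketch of what such a parser could look like, mirroring parse_answer above; the xpath class names and the id used for the image folder are hypothetical and would need to be adapted to the collection page's actual markup.

```python
from lxml import etree

def parse_collection(self, response):
    # Minimal sketch only; the class names in the xpath expressions are hypothetical.
    html = etree.HTML(response.text)
    for idx, node in enumerate(html.xpath("//div[contains(@class, 'zm-item')]")):
        content = node.xpath(".//div[contains(@class, 'content')]")
        if not content:
            continue
        item = AnswerItem()
        item['content'] = etree.tostring(content[0], encoding='unicode')
        yield item
        itemimg = MyImageItem()
        # A page/position id stands in for the question/answer id here (hypothetical naming)
        itemimg['question_answer_id'] = response.url.split('=')[-1] + "/" + str(idx)
        # Image URLs are pulled out the same way as in parse_answer above
        itemimg['image_urls'] = content[0].xpath(".//img/@data-original")
        yield itemimg
```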