
Scrapy Study Notes (3): Crawling Zhihu Homepage Questions and Answers

Goal: crawl the details of the first x questions on the Zhihu homepage, plus excerpts of a specified range of answers for each question.

Powered by:

  1. Python 3.6
  2. Scrapy 1.4
  3. json
  4. pymysql

Step 1: Overview
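
All of the snippets below are methods of one spider class, which these notes never show in full. Here is a minimal skeleton, reconstructed from the attributes those snippets reference (the class name is hypothetical; the capacha_index, next_page, and more_answer_url values are placeholders to fill in from your own browser's network panel, not verified endpoints):

import json
import re

import scrapy


class ZhihuSpider(scrapy.Spider):
    name = 'zhihu'
    allowed_domains = ['www.zhihu.com']

    # Sent with every request; 'X-Xsrftoken' and 'authorization' are filled in
    # by login_zhihu() and parse() respectively
    headers = {
        'User-Agent': 'Mozilla/5.0 ...',
        'Referer': 'https://www.zhihu.com/',
    }

    # Login form data; '_xsrf' and 'captcha' are filled in during login
    post_data = {'email': 'your_email', 'password': 'your_password'}

    # Click coordinates of the 7 captcha characters on the 200x44 image,
    # e.g. [[x1, y1], [x2, y2], ...]; measure these against a real captcha
    capacha_index = []

    question_count = 50   # how many homepage questions to crawl
    answer_count = 100    # upper bound on answers per question
    answer_offset = 0     # offset of the first answer to fetch

    # Paging URL templates (placeholders; copy the real URLs from devtools).
    # next_page is formatted with (session_token, n) in parse();
    # more_answer_url with (question_id, n, n + 20) in parse_question()
    next_page = '...'
    more_answer_url = '...'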

Step 2: Simulating Login

Zhihu serves almost nothing to visitors who are not logged in, so the first task is to simulate a login.
The main steps:

  1. Fetch the xsrf token and the captcha image
  2. Enter the captcha and submit the login form
  3. Check whether the login succeeded

Fetch the xsrf token and the captcha image:

def start_requests(self):

    yield scrapy.Request('https://www.zhihu.com/', callback=self.login_zhihu)

def login_zhihu(self, response):
    """ 獲取xsrf及驗證碼圖片 """
    xsrf = re.findall(r'name="_xsrf" value="(.*?)"/>', response.text)[0]
    self.headers['X-Xsrftoken'] = xsrf
    self.post_data['_xsrf'] = xsrf

    times = re.findall(r'<script type="text/json" class="json-inline" data-n'
                       r'ame="ga_vars">{"user_created":0,"now":(\d+),', response.text)[0]
    captcha_url = 'https://www.zhihu.com/' + 'captcha.gif?r=' + times + '&type=login&lang=cn'

    yield scrapy.Request(captcha_url, headers=self.headers, meta={'post_data': self.post_data},
                         callback=self.veri_captcha)

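Note that both regular expressions are tied to the markup Zhihu served at the time these notes were written; if the login page changes, the _xsrf field and the ga_vars script block may move or disappear, so check them against the current page source if the crawl fails at this step.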

Enter the captcha and submit the login form:

def veri_captcha(self, response):
    """ 輸入驗證碼資訊進行登入 """
    with open('captcha.jpg', 'wb') as f:
        f.write(response.body)

    print('只有一個倒立文字則第二個位置為0')
    loca1 = input('input the loca 1:')
    loca2 = input('input the loca 2:')
    captcha = self.location(int(loca1), int(loca2))

    self.post_data = response.meta.get('post_data', {})
    self.post_data['captcha'] = captcha
    post_url = 'https://www.zhihu.com/login/email'

    yield scrapy.FormRequest(post_url, formdata=self.post_data, headers=self.headers,
                             callback=self.login_success)

def location(self, a, b):
    """ Convert the typed positions into the captcha payload """
    if b != 0:
        captcha = '{"img_size":[200,44],"input_points":[%s,%s]}' % (
            str(self.capacha_index[a - 1]),
            str(self.capacha_index[b - 1]))
    else:
        captcha = '{"img_size":[200,44],"input_points":[%s]}' % str(self.capacha_index[a - 1])
    return captcha
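
For context: at the time, Zhihu's captcha displayed seven Chinese characters, one or two of them upside down, and the login form expected the click coordinates of the inverted ones. location() simply maps the typed 1-based positions to the pre-measured coordinates in capacha_index and packs them into the JSON payload the endpoint expects, e.g. location(3, 0) when only the third character is inverted.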

Check whether the login succeeded:

def login_success(self, response):

    if 'err' in response.text:
        print(response.text)
        print("error!!!!!!")
    else:
        print("successful!!!!!!")
        yield scrapy.Request('https://www.zhihu.com', headers=self.headers, dont_filter=True)


Step 3: Fetching Homepage Questions

Extracting the first page of questions is just a matter of pulling the question URLs out of the response, but the homepage initially shows only about 10 of them; to collect more, the feed's paging requests have to be simulated.

def parse(self, response):
    """ 獲取首頁問題 """
    question_urls = re.findall(r'https://www.zhihu.com/question/(\d+)', response.text)

    # session_token and authorization, both needed for paging, can be found in the homepage source
    self.session_token = re.findall(r'session_token=([0-9,a-z]{32})', response.text)[0]
    auto = re.findall(r'carCompose&quot;:&quot;(.*?)&quot', response.text)[0]
    self.headers['authorization'] = 'Bearer ' + auto

    # Questions on the first page of the homepage
    for url in question_urls:
        question_detail = 'https://www.zhihu.com/question/' + url
        yield scrapy.Request(question_detail, headers=self.headers, callback=self.parse_question)

    # Page through the feed until the requested number of questions is reached
    n = 10
    while n < self.question_count:
        yield scrapy.Request(self.next_page.format(self.session_token, n), headers=self.headers,
                             callback=self.get_more_question)
        n += 10


def get_more_question(self, response):
    """ 獲取更多首頁問題 """
    question_url = 'https://www.zhihu.com/question/{0}'
    questions = json.loads(response.text)

    for que in questions['data']:
        question_id = re.findall(r'(\d+)', que['target']['question']['url'])[0]
        yield scrapy.Request(question_url.format(question_id), headers=self.headers,
                             callback=self.parse_question)
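
For reference, this parser assumes the paging endpoint returns JSON of roughly the following shape. The structure is reconstructed from the fields get_more_question() accesses, and the values are purely illustrative:

{
    "data": [
        {
            "target": {
                "question": {
                    "url": "https://api.zhihu.com/questions/12345678"
                }
            }
        }
    ]
}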

Step 4: Parsing Question Details

Parse the question page for the question's details, then request the specified number of answers within the specified range.

Item structure:

class ZhihuQuestionItem(scrapy.Item):

    name = scrapy.Field()
    url = scrapy.Field()
    keywords = scrapy.Field()
    answer_count = scrapy.Field()
    comment_count = scrapy.Field()
    flower_count = scrapy.Field()
    date_created = scrapy.Field()

Extract the question details and request answers in the specified range:

def parse_question(self, response):
    """ 解析問題詳情及獲取指定範圍答案 """
    text = response.text
    item = ZhihuQuestionItem()

    item['name'] = re.findall(r'<meta itemprop="name" content="(.*?)"', text)[0]
    item['url'] = re.findall(r'<meta itemprop="url" content="(.*?)"', text)[0]
    item['keywords'] = re.findall(r'<meta itemprop="keywords" content="(.*?)"', text)[0]
    item['answer_count'] = re.findall(r'<meta itemprop="answerCount" content="(.*?)"', text)[0]
    item['comment_count'] = re.findall(r'<meta itemprop="commentCount" content="(.*?)"', text)[0]
    item['flower_count'] = re.findall(r'<meta itemprop="zhihu:followerCount" '
                                      r'content="(.*?)"', text)[0]
    item['date_created'] = re.findall(r'<meta itemprop="dateCreated" content="(.*?)"', text)[0]

    count_answer = int(item['answer_count'])
    yield item

    question_id = int(re.match(r'https://www.zhihu.com/question/(\d+)', response.url).group(1))

    # Fetch the specified number of answers, starting from the given offset
    if count_answer > self.answer_count:
        count_answer = self.answer_count
    n = self.answer_offset
    while n + 20 <= count_answer:
        yield scrapy.Request(self.more_answer_url.format(question_id, n, n + 20), 
                             headers=self.headers, callback=self.parse_answer)
        n += 20
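
One subtlety: the loop only requests full pages of 20 answers, so with answer_offset = 0 and answer_count = 50 it fetches answers 0 through 40 and silently skips the trailing partial page; relax the condition to n < count_answer if the remainder matters to you.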

Step 5: Fetching Answers

The answer URLs requested in parse_question() return JSON.

Item structure:

class ZhihuAnswerItem(scrapy.Item):

    question_id = scrapy.Field()
    author = scrapy.Field()
    ans_url = scrapy.Field()
    comment_count = scrapy.Field()
    upvote_count = scrapy.Field()
    excerpt = scrapy.Field()

Parse the answers:

def parse_answer(self, response):
    """ 解析獲取到的指定範圍答案 """
    answers = json.loads(response.text)

    for ans in answers['data']:
        item = ZhihuAnswerItem()
        item['question_id'] = re.match(r'http://www.zhihu.com/api/v4/questions/(\d+)', 
                                       ans['question']['url']).group(1)
        item['author'] = ans['author']['name']
        item['ans_url'] = ans['url']
        item['comment_count'] = ans['comment_count']
        item['upvote_count'] = ans['voteup_count']
        item['excerpt'] = ans['excerpt']

        yield item
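
As with question paging, the answer endpoint is assumed to return JSON of roughly this shape, reconstructed from the fields parse_answer() reads; the values are illustrative:

{
    "data": [
        {
            "question": {"url": "http://www.zhihu.com/api/v4/questions/12345678"},
            "author": {"name": "SomeUser"},
            "url": "http://www.zhihu.com/api/v4/answers/87654321",
            "comment_count": 12,
            "voteup_count": 345,
            "excerpt": "The first few sentences of the answer..."
        }
    ]
}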

Step 6: Storing Questions and Answers in MySQL

import pymysql
from scrapy.utils.project import get_project_settings


class ZhihuPipeline(object):

    def __init__(self):

        self.settings = get_project_settings()
        self.connect = pymysql.connect(
            host=self.settings['MYSQL_HOST'],
            db=self.settings['MYSQL_DBNAME'],
            user=self.settings['MYSQL_USER'],
            passwd=self.settings['MYSQL_PASSWD'],
            charset=self.settings['MYSQL_CHARSET'],
            use_unicode=True
        )
        self.cursor = self.connect.cursor()

    def process_item(self, item, spider):

        if item.__class__.__name__ == 'ZhihuQuestionItem':
            sql = 'insert into Scrapy_test.zhihuQuestion(name,url,keywords,answer_count,' \
                  'flower_count,comment_count,date_created) values (%s,%s,%s,%s,%s,%s,%s)'

            self.cursor.execute(sql, (item['name'], item['url'], item['keywords'],
                                      item['answer_count'], item['flower_count'],
                                      item['comment_count'], item['date_created']))
        else:
            sql = 'insert into Scrapy_test.zhihuAnswer(question_id,author,ans_url,' \
                  'upvote_count,comment_count,excerpt) values (%s,%s,%s,%s,%s,%s)'

            self.cursor.execute(sql, (item['question_id'], item['author'],
                                      item['ans_url'], item['upvote_count'],
                                      item['comment_count'], item['excerpt']))
        self.connect.commit()
        return item
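
For the pipeline to run, settings.py needs the MySQL keys it looks up, plus the pipeline registration. The key names come straight from the code above; the pipeline path assumes a project named zhihu and is otherwise hypothetical, and the values are examples:

ITEM_PIPELINES = {
    'zhihu.pipelines.ZhihuPipeline': 300,
}

MYSQL_HOST = 'localhost'
MYSQL_DBNAME = 'Scrapy_test'
MYSQL_USER = 'root'
MYSQL_PASSWD = 'your_password'
MYSQL_CHARSET = 'utf8'

The two tables must also exist beforehand. A minimal sketch of the DDL, with column names taken from the insert statements and types chosen as plausible defaults:

CREATE TABLE Scrapy_test.zhihuQuestion (
    name          VARCHAR(255),
    url           VARCHAR(255),
    keywords      VARCHAR(255),
    answer_count  INT,
    flower_count  INT,
    comment_count INT,
    date_created  VARCHAR(32)
);

CREATE TABLE Scrapy_test.zhihuAnswer (
    question_id   VARCHAR(32),
    author        VARCHAR(64),
    ans_url       VARCHAR(255),
    upvote_count  INT,
    comment_count INT,
    excerpt       TEXT
);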

Results:

(Screenshots of the crawled questions and answers stored in MySQL.)