[scrapy] Simulating a Zhihu login

阿新 • Published: 2019-02-14

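There is a widely shared tutorial for this online, but I spent far too much time debugging it. After talking with people on Zhihu, I found that many of them got stuck at the same place. The end result... was that I gave up on CrawlSpider.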

Here is the link first: http://ju.outofmemory.cn/entry/105646 (use with caution).

The problems I ran into with that tutorial:

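Problem 1: Zhihu's login URL is no longer /login. It is now split into /login/phone_num and /login/email, depending on whether you log in with a phone number or an email address, so the URL in start_requests has to be changed.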
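To make the split concrete, here is a minimal sketch of picking the endpoint from the account identifier. The helper name `login_url_for` and the rule "contains `@` means email" are my own assumptions for illustration; only the two paths come from the observation above.

```python
LOGIN_BASE = "https://www.zhihu.com/login"


def login_url_for(account):
    """Pick the login endpoint for an account identifier (hypothetical helper)."""
    # Assumption: anything containing '@' is an email address,
    # everything else is treated as a phone number.
    kind = "email" if "@" in account else "phone_num"
    return "{}/{}".format(LOGIN_BASE, kind)


phone_url = login_url_for("13800000000")
email_url = login_url_for("user@example.com")
```

The same URL would then be used both for the initial GET in start_requests and for the POST that submits the form.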

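Problem 2: Following the documentation, I simulated the login with FormRequest.from_response, but printing response.body in after_login still showed the login page. Other people have hit this as well; according to one explanation, the login actually succeeded but the method that fetches the URLs was never called. I did not verify this. I gave up on that approach and submitted the data directly with FormRequest instead. Also, FormRequest.from_response seemed to be sending a GET; when I changed it to method=post it returned 403. I don't know whether the method cannot be changed or whether there is some other cause.

FormRequest can have its method set to POST. But printing response.body in after_login still showed the login page.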

Problem 3: My first idea for solving Problem 2 was to revisit zhihu.com from after_login with the post-login cookie, via make_requests_from_url. The result was:

no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)
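That message comes from the scheduler's duplicate filter. The toy model below is only a simplified sketch of the idea, not Scrapy's actual implementation (the real fingerprint also covers the request body and a canonicalized URL). The class name `ToyDupeFilter` is hypothetical. It shows why a second request to an already-seen URL is dropped unless `dont_filter=True` is passed.

```python
class ToyDupeFilter:
    """Simplified stand-in for a scheduler-level duplicate filter."""

    def __init__(self):
        self.seen = set()

    def should_schedule(self, method, url, dont_filter=False):
        # Requests flagged dont_filter skip the check entirely.
        if dont_filter:
            return True
        fingerprint = (method.upper(), url)
        if fingerprint in self.seen:
            return False  # duplicate: silently dropped
        self.seen.add(fingerprint)
        return True


df = ToyDupeFilter()
first = df.should_schedule("GET", "https://www.zhihu.com/")
second = df.should_schedule("GET", "https://www.zhihu.com/")
forced = df.should_schedule("GET", "https://www.zhihu.com/", dont_filter=True)
```

Here `first` and `forced` are scheduled while `second` is filtered out, which is why the final spider passes `dont_filter=True` when it revisits a URL after logging in.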

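Problem 4: After using FormRequest in post_login, printing response.body in after_login returns {r'0',msg:''}. If I build a request for my personal profile page at that point, I can fetch the response. But after setting start_urls to people/****, yield make_requests_from_url(start_urls) hits a 302 redirect, and parse_page still ends up parsing the home page.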

Problem 5: Test with a URL that does not require login, "https://www.zhihu.com/question/21872451":

In after_login:

return [Request("https://www.zhihu.com/question/21872451",meta={'cookiejar':response.meta['cookiejar']},headers = self.headers_zhihu,callback=self.parse_page)]

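The current page can be parsed, but the rules do not take effect, and parse_page is called twice for "https://www.zhihu.com/question/21872451" in the background. The logged-in state is available on this page, though.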

Problem 6: As the reverse test of Problem 5, I set start_urls to "https://www.zhihu.com/question/21872451" and again called yield make_requests_from_url(start_urls); the logged-in state was lost again. So I changed it back to:

return [Request("https://www.zhihu.com/question/21872451",meta={'cookiejar':response.meta['cookiejar']},headers = self.headers_zhihu,
    #callback=self.parse_page
)]

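but with the callback commented out. Now /question/21872451 itself is not parsed, while the rules do take effect and parse_page is called, but without the logged-in state.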

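Summary:

1. make_requests_from_url: if no callback is set, the default parse is called. The stock make_requests_from_url also does not carry the cookie, so it has to be overridden. If the callback is set to the same one the rule uses, the first page is parsed correctly but the rules do not take effect. (Some people say this is caused by a conflict, so I tried changing the callback in make_requests_from_url to 'parse_item'; the rules still did not take effect and parse_page was not executed, so the problem is probably not there.) Conversely, if no callback is set, the rules take effect, but the first page is not parsed and there is no logged-in state.

2. CrawlSpider rules cannot automatically carry the cookie when they build requests, and parse() must not be overridden; the official documentation says so, and the spider fails to run if parse is overridden.

Conclusion: I gave up on CrawlSpider and chose to override parse(), building my own requests inside parse.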
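The lost login state is consistent with how the `cookiejar` meta key behaves: the cookie session is looked up by that key on every request, and the key is not "sticky", so it must be passed along each time. The following is a simplified model of that lookup, not Scrapy's own code; `ToyCookieStore` and the cookie name `session` are placeholders I made up for illustration.

```python
from collections import defaultdict


class ToyCookieStore:
    """Simplified model of cookie sessions keyed by meta['cookiejar']."""

    def __init__(self):
        # One independent cookie dict per jar key; a request without the
        # key falls into the default jar stored under None.
        self.jars = defaultdict(dict)

    def save(self, meta, cookies):
        self.jars[meta.get("cookiejar")].update(cookies)

    def cookies_for(self, meta):
        return dict(self.jars[meta.get("cookiejar")])


store = ToyCookieStore()
login_meta = {"cookiejar": 1}
store.save(login_meta, {"session": "logged-in"})  # placeholder cookie

# A follow-up request that forwards the key sees the login cookie...
with_jar = store.cookies_for({"cookiejar": login_meta["cookiejar"]})
# ...while a request built without it gets the empty default jar.
without_jar = store.cookies_for({})
```

Under this model, any request created without `meta={'cookiejar': ...}` starts from an empty session, which matches the symptom of rule-generated requests arriving without the login.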

The final login spider:

zhihu.py

# -*- coding: utf-8 -*-
import scrapy
from bs4 import BeautifulSoup
from MyTest.items import *
from scrapy.http import Request, FormRequest
from scrapy.selector import Selector
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.spider import BaseSpider
import urlparse
from scrapy import log

class ZhihuSpider(BaseSpider):
    name = "zhihu"
    #allowed_domains = ["zhihu.com"]
    start_urls = (
        'https://www.zhihu.com/',
    )



    headers_zhihu = {
           'Host':'www.zhihu.com',
           'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:39.0) Gecko/20100101 Firefox/39.0',
           'Accept':'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
           'Accept-Language':'zh-CN,zh;q=0.8,en-US;q=0.5,en;q=0.3',
           'Accept-Encoding':'gzip,deflate,sdch',
           'Referer':'https://www.zhihu.com',
           'If-None-Match':"FpeHbcRb4rpt_GuDL6-34nrLgGKd.gz",
           'Cache-Control':'max-age=0',
           'Connection':'keep-alive'
          # 'cookie':cookie


    }


    def start_requests(self):
        return [Request("https://www.zhihu.com/login/phone_num",meta={'cookiejar':1},headers = self.headers_zhihu,callback=self.post_login)]

    def post_login(self,response):
        print 'post_login'
        xsrf = Selector(response).xpath('//input[@name="_xsrf"]/@value').extract()[0]  # without [0] this fails: extract() returns a list

        print 'xsrf'+xsrf

        return [FormRequest('https://www.zhihu.com/login/phone_num',
                method='POST',
                meta = {
                    'cookiejar': response.meta['cookiejar'],
                    '_xsrf':xsrf

                },

                headers = self.headers_zhihu,

                formdata = {
                    'phone_num':'******',  # the quotes around the value must stay (it has to be a string)
                    'password':'*****',
                     '_xsrf':xsrf


                },

                callback = self.after_login,
                #dont_filter = True

        )]

    def after_login(self,response):
        print 'after_login'
        print response.body    # prints the returned msg
        for url in self.start_urls:
            print 'url...................'+url
            yield self.make_requests_from_url(url,response)


    def make_requests_from_url(self, url,response):
        return Request(url,dont_filter=True, meta = {
                 'cookiejar':response.meta['cookiejar'],
                  'dont_redirect': True,
                  'handle_httpstatus_list': [301,302]
            },
                 #      callback=self.parse
                       )


    def parse(self, response):
        items = []

        problem = Selector(response)

        item = ZhihuItem()
        name = problem.xpath('//span[@class="name"]/text()').extract()
        print name
        item['name'] = name
        urls = problem.xpath('//a[@class="question_link"]/@href').extract()

        print urls
        item['urls'] = urls
        print 'response ............url'+response.url
        item['url'] = response.url
        print item['url']


        items.append(item)
        yield item                                                     # return the item
        for url in urls:
            print url

            yield scrapy.Request(urlparse.urljoin('https://www.zhihu.com', url),dont_filter=True,   # using the raw url directly raises an error
                 meta = {
                 'cookiejar':response.meta['cookiejar'],               # set the cookiejar
                  'dont_redirect': True,                               # prevent redirects
                  'handle_httpstatus_list': [301,302]
            },
                       callback=self.parse
                       )


        #return  item
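The spider joins each extracted link with the site root before requesting it, presumably because the `href` values are relative paths. A small sketch of that step follows, written for Python 3, where the `urlparse` module lives at `urllib.parse`. The helper name `absolutize` is hypothetical.

```python
from urllib.parse import urljoin

BASE = "https://www.zhihu.com"


def absolutize(hrefs, base=BASE):
    """Turn relative hrefs into absolute URLs; absolute ones pass through."""
    return [urljoin(base, href) for href in hrefs]


links = absolutize([
    "/question/21872451",                       # relative path from the page
    "https://www.zhihu.com/question/21872451",  # already absolute
])
```

Both entries resolve to the same absolute URL, so the join is safe to apply to every extracted link.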

settings.py

COOKIES_ENABLED = True
COOKIES_DEBUG = True

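The rest of the processing is much the same as when I crawled qiubai earlier, so I won't explain it further.

Open questions: why do the rules stop taking effect once make_requests_from_url sets a callback?

And if start_urls matches a rule, why is it still not parsed?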
