Scrapy: Scraping Zhihu

The main goals are:

·       Start from the questions under a "how do you evaluate X" style topic, then crawl the related questions of each question, and keep looping

·       For each question, scrape the title, the number of followers, the number of answers, and other data

1    Create the project

$ scrapy startproject zhihu

New Scrapy project 'zhihu', using template directory '/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Scrapy-1.3.2-py2.7.egg/scrapy/templates/project', created in:

    /Users/huilinwang/zhihu

 

You can start your first spider with:

    cd zhihu

    scrapy genspider example example.com
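
For reference, the generated project has roughly the standard Scrapy layout (a sketch; file names follow the default template):

    zhihu/
        scrapy.cfg            # deploy configuration
        zhihu/
            __init__.py
            items.py          # item definitions (section 4)
            pipelines.py      # item pipelines (sections 3 and 6)
            settings.py       # project settings (section 3.3)
            spiders/
                __init__.py
                zhihuspider.py   # the spider created below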

2    Write the spider

In the /zhihu/zhihu/spiders directory, create a file named zhihuspider.py; its full content is listed later in this document (section 5).

2.1    Function def start_requests(self):

     This method must return an iterable with the first Requests to crawl for this spider. It is called only once.

    It is called when no start URLs are specified (via the start_urls attribute); if start URLs are given, make_requests_from_url() is used for them instead.
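
For contrast, a minimal sketch of that default start_urls behaviour; the spider name and site here are illustrative only and not part of this project:

    import scrapy

    class ExampleSpider(scrapy.Spider):
        name = "example"
        # With start_urls defined, the default start_requests() yields one
        # Request per URL and routes each response to self.parse().
        start_urls = ["http://www.example.com"]

        def parse(self, response):
            # Default callback: extract whatever you need from the page.
            yield {"title": response.css("title::text").extract_first()}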

    To POST a login request instead, you can use code like the following:

class MySpider(scrapy.Spider):
    name = 'myspider'

    def start_requests(self):
        return [scrapy.FormRequest("http://www.example.com/login",
                                   formdata={'user': 'john', 'pass': 'secret'},
                                   callback=self.logged_in)]

    def logged_in(self, response):
        # here you would extract links to follow and return Requests for
        # each of them, with another callback
        pass

In this project, start_requests looks like this:

    def start_requests(self):
        yield scrapy.Request(
            url=self.zhihu_url,
            headers=self.headers_dict,
            callback=self.request_captcha
        )

Here zhihu_url and headers_dict are class attributes shared by the spider's callbacks.

request_captcha is the callback function; it is invoked once the downloader returns the response for this request.

2.1.1 The scrapy.Request class

It lives in scrapy/http/request/__init__.py, with the following initializer:

    def __init__(self, url, callback=None, method='GET', headers=None, body=None,
                 cookies=None, meta=None, encoding='utf-8', priority=0,
                 dont_filter=False, errback=None):

This module implements the Request class, which represents an HTTP request. See the official documentation (docs/topics/request-response.rst).

    Scrapy uses Request and Response objects to crawl web sites. Request objects are generated in the spider and passed through the system until they reach the downloader, which executes the request and returns a Response object.

    Both Request and Response have subclasses that add functionality not required in the base classes.

    The Request object looks like this:

class scrapy.http.Request(url[, callback, method='GET', headers, body, cookies, meta, encoding='utf-8', priority=0, dont_filter=False, errback])

    A Request object represents an HTTP request; it is generated in the spider, executed by the downloader, and produces a Response.

2.1.2    Request parameters

    url (string) – the URL of this request. This attribute is read-only; to change the URL use replace().

    callback (callable) – the function that will be called with the response of this request once it has been downloaded. If no callback is specified, parse() is used by default. Note that if an exception is raised during processing, errback is called instead.

    method (string) – the HTTP method of this request. Defaults to 'GET'.

    meta (dict) – the initial values for the request.meta attribute. If given, the dict is shallow-copied.

    body (str or unicode) – the request body. If a unicode is passed, it is encoded to utf-8. If the body is not given, an empty string is stored. Regardless of the type of this argument, the final value stored is a str.

    headers (dict) – the headers of this request. Values can be single-valued or multi-valued headers. If None is passed as a value, that HTTP header is not sent at all.

    cookies (dict or list) – the request cookies. They can be sent in two forms.

    Using a dict:

    request_with_cookies = Request(url="http://www.example.com",
                                   cookies={'currency': 'USD', 'country': 'UY'})

    Using a list of dicts:

    request_with_cookies = Request(url="http://www.example.com",
                                   cookies=[{'name': 'currency',
                                             'value': 'USD',
                                             'domain': 'example.com',
                                             'path': '/currency'}])

    The saved cookies only matter when making subsequent requests.

    Some sites return cookies in their responses, and these are stored and sent again in later requests. This is the usual behaviour of a web browser. If, for some reason, you want to avoid merging with existing cookies, set dont_merge_cookies to True in request.meta.

    For example, a request that does not merge cookies:

    request_with_cookies = Request(url="http://www.example.com",
                                   cookies={'currency': 'USD', 'country': 'UY'},
                                   meta={'dont_merge_cookies': True})

    encoding (string) – the encoding of this request (defaults to 'utf-8'). It is used to convert strings to the given encoding.

    priority (int) – the priority of this request (defaults to 0). The scheduler uses the priority to decide the order in which requests are processed; higher-priority requests run earlier, and negative values indicate relatively low priority.

    dont_filter (boolean) – indicates that this request should not be filtered by the scheduler. Useful when an identical request has to be issued multiple times and the duplicates filter must be bypassed. Use with care, or you will end up in a crawling loop. Defaults to False.

    errback (callable) – a function called when an exception is raised while processing the request, including 404 and other HTTP errors.
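
Putting several of these parameters together, a hedged sketch of a fully specified Request (the URL, callback names and meta key are illustrative only):

    import scrapy

    def handle_error(failure):
        # errback: called on exceptions, 404s and other HTTP errors
        print("request failed: %s" % failure)

    def parse_detail(response):
        # meta values attached to the Request come back on the Response
        print("%s (hint=%s)" % (response.url, response.meta.get("hint")))

    request = scrapy.Request(
        url="http://www.example.com/some-page",
        callback=parse_detail,       # invoked with the downloaded Response
        errback=handle_error,
        method="GET",
        headers={"User-Agent": "Mozilla/5.0"},
        meta={"hint": 1},            # arbitrary data carried along with the request
        priority=10,                 # scheduled before priority-0 requests
        dont_filter=True,            # bypass the duplicates filter
    )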

2.2    Function def request_captcha(self, response):

         This is the callback of start_requests. Once the downloader has processed the request it returns a response, and this function handles that response.

    def request_captcha(self, response):
        _xsrf = response.css('input[name="_xsrf"]::attr(value)').extract()[0]
        captcha_url = "http://www.zhihu.com/captcha.gif?r=" + str(time.time() * 1000)
        yield scrapy.Request(
            url=captcha_url,
            headers=self.headers_dict,
            callback=self.download_captcha,
            meta={"_xsrf": _xsrf}  # pass the token on; download_captcha reads response.meta["_xsrf"]
        )

Here time.time() returns the current timestamp (floating-point seconds since the 1970 epoch); multiplying by 1000 simply produces a fresh value for the r= query parameter.

In the code,

    response.css('input[name="_xsrf"]::attr(value)').extract()[0]

extracts the value bound to _xsrf, which comes from this hidden field in the HTML source:

    <input type="hidden" name="_xsrf" value="fb57ee37dc9bd70821e6ed878bdfe24f"/>

The function finally hands off to download_captcha.
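
As a quick, self-contained illustration of this selector (the HTML snippet is made up), scrapy.Selector can be used outside a spider:

    from scrapy import Selector

    html = '<form><input type="hidden" name="_xsrf" value="abc123"/></form>'
    sel = Selector(text=html)
    # ::attr(value) selects the value attribute of the matched <input>
    print(sel.css('input[name="_xsrf"]::attr(value)').extract_first())  # -> abc123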

2.3    Function def download_captcha(self, response):

The function is:

    def download_captcha(self, response):
        with open("captcha.gif", "wb") as fp:
            fp.write(response.body)
        os.system('open captcha.gif')  # macOS: show the captcha image to the user
        print "Please enter the captcha:\n"
        captcha = raw_input()
        yield scrapy.FormRequest(
            url=self.login_url,       # login_url, email and password are class attributes (see section 5)
            headers=self.headers_dict,
            formdata={
                "email": self.email,
                "password": self.password,
                "_xsrf": response.meta["_xsrf"],
                "remember_me": "true",
                "captcha": captcha
            },
            callback=self.request_zhihu
        )

This function is the callback of request_captcha.

Its main new element is scrapy.FormRequest.

2.3.1 FormRequest

    The FormRequest class extends the Request base class. It uses lxml.html forms to pre-populate form fields with data from a Response object.

    The class adds one new argument to the constructor; the remaining arguments are the same as for Request.

classmethod from_response(response[, formname=None, formid=None, formnumber=0, formdata=None, formxpath=None, formcss=None, clickdata=None, dont_click=False, ...])

    Returns a new FormRequest object with its form field values pre-populated from the form found in the given response.

Its parameters include:

·      formdata (dict) – fields to override in the form data. If a field was already present in the response <form> element, its value is overridden by the one passed in this parameter.

    In other words, this is the data that overrides what is in the form. It is mainly used to simulate an HTML form POST and send a set of key/value pairs.
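
A hedged sketch of from_response (this project builds its FormRequest by hand instead, as shown above; the site and field names below are illustrative):

    import scrapy

    class LoginSpider(scrapy.Spider):
        name = 'login-example'
        start_urls = ['http://www.example.com/users/login.php']

        def parse(self, response):
            # from_response() copies the page's <form> fields (e.g. hidden
            # CSRF tokens) and merges in the values given via formdata.
            return scrapy.FormRequest.from_response(
                response,
                formdata={'username': 'john', 'password': 'secret'},
                callback=self.after_login
            )

        def after_login(self, response):
            if "authentication failed" in response.body:
                self.logger.error("Login failed")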

Then request_zhihu is called as the callback.

2.4    Function def request_zhihu(self, response):

The code is:

    def request_zhihu(self, response):
        yield scrapy.Request(url=self.topic + '/19760570',
                             headers=self.headers_dict,
                             callback=self.get_topic_question,
                             dont_filter=True)

Crawling starts from https://www.zhihu.com/topic/19760570/hot.

dont_filter=True is set because this URL has to be requested repeatedly as the crawl loops.

The callback is get_topic_question.

2.5    Function def get_topic_question(self, response):

The code is:

    def get_topic_question(self, response):
        # with open("topic.html", "wb") as fp:
        #     fp.write(response.body)
        question_urls = response.css(".question_link[target=_blank]::attr(href)").extract()
        length = len(question_urls)
        k = -1
        j = 0
        temp = []
        # keep every third link (indices 2, 5, 8, ...)
        for j in range(length/3):
            temp.append(question_urls[k+3])
            j += 1
            k += 3
        for url in temp:
            yield scrapy.Request(url=self.zhihu_url + url,
                                 headers=self.headers_dict,
                                 callback=self.parse_question_data)

This collects the relative links (the href attributes), which are the relative URLs of the questions under the Zhihu topic, into question_urls. It then keeps every third URL fragment and, for each one, issues a new Request whose URL is the fragment joined onto the site root. The callback is parse_question_data.
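
The index bookkeeping with j and k is equivalent to a simple slice; a minimal sketch with made-up fragments:

    question_urls = ["/q/1", "/t/1", "/a/1", "/q/2", "/t/2", "/a/2"]

    # every third element starting at index 2 -- same result as the j/k loop above
    temp = question_urls[2::3]
    print(temp)  # ['/a/1', '/a/2']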

2.6    Function def parse_question_data(self, response):

This is the spider's last callback:

    def parse_question_data(self, response):
        item = ZhihuItem()
        item["qid"] = re.search('\d+', response.url).group()
        item["title"] = response.css(".zm-item-title::text").extract()[0].strip()
        item["answers_num"] = response.css("h3::attr(data-num)").extract()[0]
        question_nums = response.css(".zm-side-section-inner .zg-gray-normal strong::text").extract()
        item["followers_num"] = question_nums[0]
        item["visitsCount"] = question_nums[1]
        item["topic_views"] = question_nums[2]
        topic_tags = response.css(".zm-item-tag::text").extract()
        if len(topic_tags) >= 3:
            item["topic_tag0"] = topic_tags[0].strip()
            item["topic_tag1"] = topic_tags[1].strip()
            item["topic_tag2"] = topic_tags[2].strip()
            print item
        elif len(topic_tags) == 2:
            item["topic_tag0"] = topic_tags[0].strip()
            item["topic_tag1"] = topic_tags[1].strip()
            item["topic_tag2"] = '-'
        elif len(topic_tags) == 1:
            item["topic_tag0"] = topic_tags[0].strip()
            item["topic_tag1"] = '-'
            item["topic_tag2"] = '-'
        # print type(item["title"])
        question_links = response.css(".question_link::attr(href)").extract()
        yield item
        for url in question_links:
            yield scrapy.Request(url=self.zhihu_url + url,
                                 headers=self.headers_dict,
                                 callback=self.parse_question_data)

The crawl keeps looping this way until it finishes.

3    Edit pipelines.py

The pipeline imports the items into a database.

3.1    Function def open_spider(self, spider):

This method is called when the spider is opened.

3.2    Function def process_item(self, item, spider):

   This method is called for every item pipeline component. It must return an Item (or any subclass) object, or raise a DropItem exception; dropped items are not processed by any further pipeline components.
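
A minimal sketch of that contract (the validation rule here is illustrative, not part of this project):

    from scrapy.exceptions import DropItem

    class ValidateQuestionPipeline(object):
        def process_item(self, item, spider):
            # Either return the (possibly modified) item so later pipelines see it...
            if item.get("title"):
                return item
            # ...or raise DropItem to stop processing it entirely.
            raise DropItem("missing title in %s" % item)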

3.3    Edit settings.py

Set ROBOTSTXT_OBEY = False.
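
The relevant lines in settings.py would look roughly like this; note that the pipeline from section 6 also has to be enabled via ITEM_PIPELINES (the exact module path depends on your project layout and is assumed here):

    # settings.py
    ROBOTSTXT_OBEY = False  # do not skip pages disallowed by robots.txt (see section 8)

    # assumed: enable the MySQL pipeline defined in pipelines.py
    ITEM_PIPELINES = {
        'zhihu.pipelines.ZhihuPipeline': 300,
    }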

 

4    Items code (items.py)

import scrapy

class ZhihuItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    qid = scrapy.Field()
    title = scrapy.Field()
    followers_num = scrapy.Field()
    answers_num = scrapy.Field()
    visitsCount = scrapy.Field()
    topic_views = scrapy.Field()
    topic_tag0 = scrapy.Field()
    topic_tag1 = scrapy.Field()
    topic_tag2 = scrapy.Field()
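
Items behave like dictionaries with a fixed set of declared keys; a short usage sketch:

    item = ZhihuItem()
    item["qid"] = "19760570"
    item["title"] = "Example question"
    print(item["qid"])          # fields are read back like dict keys
    # item["unknown"] = "x"     # would raise KeyError: "unknown" is not a declared Field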

 

5    Spider code (zhihuspider.py)

#coding=utf-8
import scrapy
import os
import time
import re
import json

from ..items import ZhihuItem

class zhihutopicSpider(scrapy.Spider):
    name = "zhihu"  # assumed spider name, matching the "scrapy crawl zhihu" command in section 7
    zhihu_url = "https://www.zhihu.com"
    login_url = "https://www.zhihu.com/login/email"  # assumed login endpoint, referenced in download_captcha
    topic = "https://www.zhihu.com/topic"            # topic base URL, referenced in request_zhihu
    email = "your_email@example.com"                 # placeholder credentials, replace with your own
    password = "your_password"
    headers_dict = {
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
        "Accept-Language": "zh-CN,zh;q=0.8",
        "Connection": "keep-alive",
        "Host": "www.zhihu.com",
        "Upgrade-Insecure-Requests": "1",
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.109 Safari/537.36"
    }

    def start_requests(self):
        yield scrapy.Request(
            url=self.zhihu_url,
            headers=self.headers_dict,
            callback=self.request_captcha
        )

    def request_captcha(self, response):
        _xsrf = response.css('input[name="_xsrf"]::attr(value)').extract()[0]
        captcha_url = "http://www.zhihu.com/captcha.gif?r=" + str(time.time() * 1000)
        yield scrapy.Request(
            url=captcha_url,
            headers=self.headers_dict,
            callback=self.download_captcha,
            meta={"_xsrf": _xsrf}  # pass the token on for the login form
        )

    def download_captcha(self, response):
        with open("captcha.gif", "wb") as fp:
            fp.write(response.body)
        os.system('open captcha.gif')  # macOS: show the captcha image to the user
        print "Please enter the captcha:\n"
        captcha = raw_input()
        yield scrapy.FormRequest(
            url=self.login_url,
            headers=self.headers_dict,
            formdata={
                "email": self.email,
                "password": self.password,
                "_xsrf": response.meta["_xsrf"],
                "remember_me": "true",
                "captcha": captcha
            },
            callback=self.request_zhihu
        )

    def request_zhihu(self, response):
        yield scrapy.Request(url=self.topic + '/19760570',
                             headers=self.headers_dict,
                             callback=self.get_topic_question,
                             dont_filter=True)

    def get_topic_question(self, response):
        # with open("topic.html", "wb") as fp:
        #     fp.write(response.body)
        question_urls = response.css(".question_link[target=_blank]::attr(href)").extract()
        length = len(question_urls)
        k = -1
        j = 0
        temp = []
        for j in range(length/3):
            temp.append(question_urls[k+3])
            j += 1
            k += 3
        for url in temp:
            yield scrapy.Request(url=self.zhihu_url + url,
                                 headers=self.headers_dict,
                                 callback=self.parse_question_data)

    def parse_question_data(self, response):
        item = ZhihuItem()
        item["qid"] = re.search('\d+', response.url).group()
        item["title"] = response.css(".zm-item-title::text").extract()[0].strip()
        item["answers_num"] = response.css("h3::attr(data-num)").extract()[0]
        question_nums = response.css(".zm-side-section-inner .zg-gray-normal strong::text").extract()
        item["followers_num"] = question_nums[0]
        item["visitsCount"] = question_nums[1]
        item["topic_views"] = question_nums[2]
        topic_tags = response.css(".zm-item-tag::text").extract()
        if len(topic_tags) >= 3:
            item["topic_tag0"] = topic_tags[0].strip()
            item["topic_tag1"] = topic_tags[1].strip()
            item["topic_tag2"] = topic_tags[2].strip()
        elif len(topic_tags) == 2:
            item["topic_tag0"] = topic_tags[0].strip()
            item["topic_tag1"] = topic_tags[1].strip()
            item["topic_tag2"] = '-'
        elif len(topic_tags) == 1:
            item["topic_tag0"] = topic_tags[0].strip()
            item["topic_tag1"] = '-'
            item["topic_tag2"] = '-'
        # print type(item["title"])
        question_links = response.css(".question_link::attr(href)").extract()
        yield item
        for url in question_links:
            yield scrapy.Request(url=self.zhihu_url + url,
                                 headers=self.headers_dict,
                                 callback=self.parse_question_data)

6    Pipeline code (pipelines.py)

import MySQLdb

class ZhihuPipeline(object):
    print "\n\n\n\n\n\n\n\n"
    sql_questions = (
            "INSERT INTO questions("
            "qid, title, answers_num, followers_num, visitsCount, topic_views, topic_tag0, topic_tag1, topic_tag2) "
            "VALUES ('%s', '%s', '%s', '%s', '%s', '%s', '%s', '%s', '%s')")
    count = 0

    def open_spider(self, spider):
        host = "localhost"
        user = "root"
        password = "wangqi"
        dbname = "zh"
        self.conn = MySQLdb.connect(host, user, password, dbname)
        self.cursor = self.conn.cursor()
        self.conn.set_character_set('utf8')
        self.cursor.execute('SET NAMES utf8;')
        self.cursor.execute('SET CHARACTER SET utf8;')
        self.cursor.execute('SET character_set_connection=utf8;')
        print "\n\nMYSQL DB CURSOR INIT SUCCESS!!\n\n"
        sql = (
            "CREATE TABLE IF NOT EXISTS questions ("
                "qid VARCHAR (100) NOT NULL,"
                "title varchar(100),"
                "answers_num INT(11),"
                "followers_num INT(11) NOT NULL,"
                "visitsCount INT(11),"
                "topic_views INT(11),"
                "topic_tag0 VARCHAR (600),"
                "topic_tag1 VARCHAR (600),"
                "topic_tag2 VARCHAR (600),"
                "PRIMARY KEY (qid)"
            ")")
        self.cursor.execute(sql)
        print "\n\nTABLES ARE READY!\n\n"

    def process_item(self, item, spider):
        # note: parameterized queries (cursor.execute(sql, params)) would be
        # safer than string interpolation against quoting problems in titles
        sql = self.sql_questions % (item["qid"], item["title"], item["answers_num"], item["followers_num"],
                                    item["visitsCount"], item["topic_views"], item["topic_tag0"], item["topic_tag1"], item["topic_tag2"])
        self.cursor.execute(sql)
        if self.count % 10 == 0:
            self.conn.commit()  # commit in batches of ten items
        self.count += 1
        print item["qid"] + " DATA COLLECTED!"
        return item  # item pipelines should return the item (see section 3.2)

 

7    Run the crawler

scrapy crawl zhihu
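
Optionally, Scrapy's built-in feed export can also dump the scraped items to a file alongside the MySQL pipeline, which is handy for quick inspection:

scrapy crawl zhihu -o questions.json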

 

8    About anti-crawling (robots.txt)

    robots.txt (always lowercase) is an ASCII text file placed in the root directory of a web site. It tells crawlers which parts of the site should not be fetched by search-engine spiders and which parts may be fetched. robots.txt is a gentleman's agreement rather than a standard, merely a widely followed convention; some search engines respect it and others do not. Scrapy obeys the robots protocol automatically, which is why you have to configure it in settings.py not to obey (ROBOTSTXT_OBEY = False) before you can crawl sites, such as Zhihu, that list scrapy in their robots.txt.
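
For reference, a typical robots.txt that blocks one crawler entirely while restricting others only partially looks roughly like this (illustrative content, not Zhihu's actual file):

    User-agent: Scrapy
    Disallow: /

    User-agent: *
    Disallow: /login
    Disallow: /admin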

 

 

 

 

           
