A Source-Code Walkthrough of the CrawlSpider Module in the Scrapy Framework

阿新 • Published: 2018-11-09

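I. Preface

1. Scrapy's genspider command, run from a terminal, creates a spider from one of four templates: basic, crawl, csvfeed and xmlfeed, which correspond to Spider, CrawlSpider, CSVFeedSpider and XMLFeedSpider. Of these, Spider (the basic template) and CrawlSpider are the most commonly used.
2. Anyone who has done web back-end development knows that many sites define their URLs according to fixed rules (the urls rules in Django's routing system, for example, are regular expressions). A crawler can be designed around this property instead of using a plain Spider to analyze each page's format by hand every time, and CrawlSpider exists for exactly this need.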
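As a minimal sketch of this idea, assume a hypothetical site whose item pages live under /item/<number>.html (the URL layout and the names ITEM_URL and item_id are invented for illustration). A single regular expression is then enough to tell item pages from everything else:

```python
import re

# Hypothetical URL pattern: item pages look like /item/<digits>.html
ITEM_URL = re.compile(r"/item/(\d+)\.html$")


def item_id(url):
    """Return the numeric item id if the URL is an item page, else None."""
    match = ITEM_URL.search(url)
    return int(match.group(1)) if match else None


urls = [
    "http://www.example.com/item/42.html",
    "http://www.example.com/category/books.html",
]
item_ids = [item_id(u) for u in urls]
```

CrawlSpider applies the same kind of pattern matching to the links it finds on each page, so the crawl logic is driven by URL rules rather than by page-specific parsing code.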

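II. Introduction to CrawlSpider

On top of what it inherits from Spider, CrawlSpider adds features of its own:

rules: a list of one (or more) Rule objects. Each Rule defines a particular behavior for crawling the site. Rule objects are described below. If several rules match the same link, the first one, in the order in which they are defined in this attribute, is used.
parse_start_url: this method is called for the responses to start_urls. It allows the initial responses to be parsed and must return an Item object, a Request object, or an iterable containing any of them.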
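The "first matching rule wins" behavior can be illustrated without Scrapy. The sketch below uses hypothetical stand-ins (plain name and regex pairs instead of Rule objects, and a helper called assign_links that is not part of Scrapy) and mimics the bookkeeping with a set of links that have already been claimed:

```python
import re

# Hypothetical stand-ins for Rule objects: (name, compiled pattern).
rules = [
    ("category", re.compile(r"category\.php")),
    ("any_php", re.compile(r"\.php")),  # deliberately overlaps with the first rule
]


def assign_links(links, rules):
    """Map each link to the first rule (in definition order) that matches it."""
    seen = set()
    assignment = {}
    for name, pattern in rules:
        for link in links:
            if link in seen:
                continue  # an earlier rule already claimed this link
            if pattern.search(link):
                seen.add(link)
                assignment[link] = name
    return assignment


links = [
    "http://example.com/category.php?id=1",
    "http://example.com/item.php?id=2",
]
assignment = assign_links(links, rules)
```

Both patterns match the first link, but it is assigned to "category" because that rule is defined first; the second link falls through to "any_php".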

Rule parameters

class scrapy.spiders.Rule(link_extractor, callback=None, cb_kwargs=None, follow=None, process_links=None, process_request=None)

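link_extractor: a Link Extractor object that defines how links are extracted from each crawled page.

callback: a callable or a string (in which case the method of the spider object with that name is used) that is called for each link extracted with the specified link_extractor. The callback receives a response as its first argument and must return a list containing Item and/or Request objects (or any subclass of them).

Warning: when writing crawl spider rules, avoid using parse as a callback, because CrawlSpider uses the parse method itself to implement its logic. If you override the parse method, the crawl spider will no longer work.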

cb_kwargs: a dict containing the keyword arguments to be passed to the callback function.

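follow: a boolean that specifies whether links should be followed from each response extracted with this rule. If callback is None, follow defaults to True; otherwise it defaults to False.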

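process_links: a callable or a string (in which case the method of the spider object with that name is used) that is called with each list of links extracted from a response by the specified link_extractor. It is mainly used for filtering.

process_request: a callable or a string (in which case the method of the spider object with that name is used) that is called for every request extracted by this rule and must return a request or None (to filter the request out).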
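The two hooks can be sketched as plain functions. To keep the example runnable without Scrapy, Link is a namedtuple stand-in for Scrapy's link object and a request is modeled as a dict; both, along with the function names, are assumptions made for illustration. The process_request hook follows the one-argument call form used in the source listing analyzed in this article (rule.process_request(r)):

```python
from collections import namedtuple

# Stand-in for Scrapy's Link object so the sketch runs without Scrapy installed.
Link = namedtuple("Link", ["url", "text"])


def drop_logout_links(links):
    """process_links-style hook: takes the extracted list, returns a filtered list."""
    return [link for link in links if "logout" not in link.url]


def skip_pdf_requests(request):
    """process_request-style hook: return the request to keep it, None to drop it."""
    return None if request["url"].endswith(".pdf") else request


links = [
    Link("http://example.com/a", "A"),
    Link("http://example.com/logout", "Log out"),
]
kept_links = drop_logout_links(links)
kept_request = skip_pdf_requests({"url": "http://example.com/a"})
dropped_request = skip_pdf_requests({"url": "http://example.com/manual.pdf"})
```

In a real spider these would be passed as Rule(..., process_links=drop_logout_links, process_request=skip_pdf_requests), or given as the names of spider methods.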

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class MySpider(CrawlSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com']

    rules = (
        # Extract links matching 'category.php' (but not matching 'subsection.php')
        # and follow links from them (since no callback means follow=True by default).
        Rule(LinkExtractor(allow=(r'category\.php', ), deny=(r'subsection\.php', ))),

        # Extract links matching 'item.php' and parse them with the spider's method parse_item
        Rule(LinkExtractor(allow=(r'item\.php', )), callback='parse_item'),
    )

    def parse_item(self, response):
        self.logger.info('Hi, this is an item page! %s', response.url)
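        # A bare scrapy.Item() declares no fields, so assigning item['id'] would raise KeyError;
        # a plain dict is used here instead.
        item = {}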
        item['id'] = response.xpath('//td[@id="item_id"]/text()').re(r'ID: (\d+)')
        item['name'] = response.xpath('//td[@id="item_name"]/text()').extract()
        item['description'] = response.xpath('//td[@id="item_description"]/text()').extract()
        return item

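This spider starts by crawling the home page of example.com, collects category links and item links, and parses the latter with the parse_item method. For each item response, some data is extracted from the HTML with XPath and returned as an item, which Scrapy then passes on to the item pipelines.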
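The comment in the example ("no callback means follow=True by default") can be expressed as a small function. This is a sketch of the defaulting logic only; resolve_follow is a hypothetical helper name, not a Scrapy function:

```python
def resolve_follow(callback=None, follow=None):
    """Return the effective follow flag for a rule."""
    if follow is None:
        # No explicit value: follow only when the rule has no callback.
        return not callback
    return follow


category_rule_follow = resolve_follow()                   # no callback given
item_rule_follow = resolve_follow(callback="parse_item")  # callback given
forced_follow = resolve_follow(callback="parse_item", follow=True)
```

Under this logic the category rule keeps following links, while the item rule stops at the item page unless follow=True is passed explicitly.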

III. Source-code analysis

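# Imports that this excerpt relies on
import copy

from scrapy.http import Request, HtmlResponse
from scrapy.spiders import Spider
from scrapy.utils.spider import iterate_spider_output


class CrawlSpider(Spider):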
    rules = ()
    def __init__(self, *a, **kw):
        super(CrawlSpider, self).__init__(*a, **kw)
        self._compile_rules()
 
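    # parse() is the first callback invoked for the responses to start_urls.
    # It hands each response to _parse_response(), with parse_start_url() as the callback
    # and the follow flag set to True.
    # parse() therefore yields the items and the follow-up Request objects.
    def parse(self, response):
        return self._parse_response(response, self.parse_start_url, cb_kwargs={}, follow=True)
    
    # Handles the responses to start_urls; override it to parse them (returns nothing by default)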
    def parse_start_url(self, response):
        return []
 
    def process_results(self, response, results):
        return results
    
    # Extract the links in the response that match any user-defined rule and build Request objects from them
    def _requests_to_follow(self, response):
        if not isinstance(response, HtmlResponse):
            return
        seen = set()
        # Run every rule's link extractor over the response; a link is taken by the first rule that extracts it
        for n, rule in enumerate(self._rules):
            links = [l for l in rule.link_extractor.extract_links(response) if l not in seen]
            # Let the user-supplied process_links hook filter or rewrite the list of links
            if links and rule.process_links:
                links = rule.process_links(links)
            # Add each link to the seen set and build a Request for it, with _response_downloaded() as its callback
            for link in links:
                seen.add(link)
                # The Request's callback is _response_downloaded; the rule's own callback is looked up later via meta['rule']
                r = Request(url=link.url, callback=self._response_downloaded)
                r.meta.update(rule=n, link_text=link.text)
                # Call process_request() on every Request. It defaults to identity, i.e. it returns the Request unchanged.
                yield rule.process_request(r)
    # Handle the response to a link extracted by a rule, returning items and requests
    def _response_downloaded(self, response):
        rule = self._rules[response.meta['rule']]
        return self._parse_response(response, rule.callback, rule.cb_kwargs, rule.follow)
    
    # Parse a response: run the callback on it and yield the resulting Request or Item objects
    def _parse_response(self, response, callback, cb_kwargs, follow=True):
        # First check whether a callback was given (either a rule's callback or parse_start_url).
        # If so, the callback processes the response first and its output is then
        # passed through process_results(); every element of the result is yielded.
        if callback:
            # Whether the callback is parse_start_url (called via parse) or a rule callback,
            # its output may contain both Items and Requests
            cb_res = callback(response, **cb_kwargs) or ()
            cb_res = self.process_results(response, cb_res)
            for requests_or_item in iterate_spider_output(cb_res):
                yield requests_or_item

        # If links should be followed, extract them with the defined rules and yield the Requests
        if follow and self._follow_links:
            # Yield each Request object
            for request_or_item in self._requests_to_follow(response):
                yield request_or_item
 
    def _compile_rules(self):
        def get_method(method):
            if callable(method):
                return method
            # basestring exists only in Python 2; on Python 3 the equivalent check is isinstance(method, str)
            elif isinstance(method, basestring):
                return getattr(self, method, None)
 
        self._rules = [copy.copy(r) for r in self.rules]
        for rule in self._rules:
            rule.callback = get_method(rule.callback)
            rule.process_links = get_method(rule.process_links)
            rule.process_request = get_method(rule.process_request)
 
    def set_crawler(self, crawler):
        super(CrawlSpider, self).set_crawler(crawler)
        self._follow_links = crawler.settings.getbool('CRAWLSPIDER_FOLLOW_LINKS', True)

__init__: mainly runs the _compile_rules method.

parse: the default callback, overridden here. It calls _parse_response directly and passes parse_start_url as the method that processes the response.

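parse_start_url: its job is to process the responses that parse receives for start_urls, for example to extract the data needed. It must also return an item, a request, or an iterable of them. It is an ordinary callback and is used in the same way as rule.callback.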

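_requests_to_follow: reading the source shows that it extracts the target URLs from the response and wraps each one in a Request. The callback of these requests is _response_downloaded, and a rule entry is added to each request's meta whose value is the index in rules of the rule that matched the URL.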

_response_downloaded: the callback of the requests built in _requests_to_follow. It calls _parse_response to process the response returned by the downloader, with rule.callback as the method that handles the response.
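The round trip through meta['rule'] can be sketched with plain dicts. Everything below is a simplified stand-in (rules, requests and responses are dicts, matching is a substring test, and deduplication across rules is omitted); it only shows how the rule index stored at request time is used to find the right callback at response time:

```python
# Hypothetical, simplified stand-ins: rules, requests and responses are plain dicts.
rules = [
    {"pattern": "category", "callback": None},
    {"pattern": "item", "callback": lambda response: {"url": response["url"]}},
]


def requests_to_follow(links):
    """Build one request per matching link, tagged with the index of its rule."""
    for n, rule in enumerate(rules):
        for url in links:
            if rule["pattern"] in url:
                yield {"url": url, "meta": {"rule": n}}


def response_downloaded(response):
    """Look the rule up again by index and hand the response to its callback, if any."""
    rule = rules[response["meta"]["rule"]]
    return rule["callback"](response) if rule["callback"] else None


requests = list(requests_to_follow(["/category/1", "/item/9"]))
# Pretend each request was downloaded; the response carries the same meta.
results = [response_downloaded(request) for request in requests]
```

Storing an index rather than the callback itself keeps the request small and lets the spider recover the rule's other attributes (cb_kwargs, follow) from the same lookup.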

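_parse_response: hands the response to the method given by the callback parameter and then yields the requests or items that the callback produces. It then uses rule.follow and spider._follow_links to decide whether to keep crawling; if so, the response is passed to _requests_to_follow, which extracts the relevant links according to the rules. The value of spider._follow_links comes from the CRAWLSPIDER_FOLLOW_LINKS setting.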
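The two-stage flow (callback output first, then follow-up requests only if both switches are on) can be sketched as a generator. The function and parameter names below are hypothetical, and responses and requests are plain strings so the sketch runs without Scrapy:

```python
def parse_response(response, callback, follow, follow_links_setting, extract_requests):
    """Yield the callback's output first, then follow-up requests if both switches are on."""
    if callback:
        for result in callback(response) or ():
            yield result
    if follow and follow_links_setting:
        for request in extract_requests(response):
            yield request


def callback(response):
    return [{"item": response}]


def extract_requests(response):
    return ["request:" + response + "/next"]


both = list(parse_response("page", callback, True, True, extract_requests))
setting_off = list(parse_response("page", callback, True, False, extract_requests))
no_callback = list(parse_response("page", None, True, True, extract_requests))
```

With the setting switched off, the rule's follow flag is overridden and only the callback output is produced; without a callback, only the follow-up requests are produced.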

_compile_rules: replaces the methods that a rule refers to by name (as strings) with the actual methods.
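The name-to-method resolution amounts to a getattr lookup on the spider. The sketch below uses a hypothetical MiniSpider class and the Python 3 str check; it is an illustration of the technique, not Scrapy's code:

```python
class MiniSpider:
    """Hypothetical spider used only to demonstrate name-to-method resolution."""

    def parse_item(self, response):
        return {"parsed": response}

    def resolve(self, method):
        if callable(method):
            return method  # already a function or bound method
        if isinstance(method, str):
            return getattr(self, method, None)  # look the name up on the spider
        return None


spider = MiniSpider()
by_name = spider.resolve("parse_item")
by_callable = spider.resolve(len)
missing = spider.resolve("no_such_method")
```

Note that with this lookup a misspelled method name resolves to None rather than raising an error, so a rule with a typo in its callback name would behave as if it had no callback.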

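A flow chart of how the source code executes is attached below to make it easier to follow:
[Figure: execution flow chart of the CrawlSpider source; image not available]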
