
Python Crawlers: Scrapy Middleware and Pipelines


Scrapy provides two kinds of customizable middleware plus one data processor:

Name | Role | User setting
Item pipeline (Item-Pipeline) | processes items | override
Downloader middleware (Downloader-Middleware) | processes requests/responses | merge
Spider middleware (Spider-Middleware) | processes items/responses/requests | merge

Note: "user setting" refers to values supplied via custom_settings, i.e. whether they replace Scrapy's defaults outright (override) or are merged with them (merge).
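The difference is easy to verify: Scrapy composes each dict setting with its *_BASE counterpart, and ITEM_PIPELINES_BASE is empty by default. A minimal sketch, assuming a recent Scrapy version (the scrapys.* paths come from the example below):

from scrapy.settings import Settings

settings = Settings()
settings.setdict({
    "DOWNLOADER_MIDDLEWARES": {"scrapys.mymiddleware.MyMiddleware": 100},
    "ITEM_PIPELINES": {"scrapys.mypipeline.MyPipeline": 100},
}, priority="spider")

# getwithbase() composes FOO with FOO_BASE (the built-in defaults)
print(dict(settings.getwithbase("DOWNLOADER_MIDDLEWARES")))
# -> the built-in downloader middlewares plus MyMiddleware (merged)

print(dict(settings.getwithbase("ITEM_PIPELINES")))
# -> only MyPipeline, because ITEM_PIPELINES_BASE is empty (overridden)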

Oddly, the parent class they all inherit from is plain object..., so you end up checking the documentation every time.

Normally you would expect an abstract interface whose methods users implement with their own logic; it is unclear why it was designed this way. For comparison, see the sketch below.
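A hypothetical sketch of what such an interface could look like. PipelineInterface is made up for illustration and is not part of Scrapy:

from abc import ABC, abstractmethod


class PipelineInterface(ABC):
    """Hypothetical base class; Scrapy's real components are duck-typed."""

    @abstractmethod
    def process_item(self, item, spider):
        """Return the item to keep it, or raise DropItem to discard it."""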

The following snippets, with comments, briefly illustrate what each of the three components does.

1. Spider

baidu_spider.py


from scrapy import Spider, cmdline

class BaiduSpider(Spider):
    name = "baidu_spider"

    start_urls = [
        "https://www.baidu.com/"
    ]

    custom_settings = {
        "SPIDER_DATA": "this is spider data",
        "DOWNLOADER_MIDDLEWARES": {
                "scrapys.mymiddleware.MyMiddleware"
: 100, }, "ITEM_PIPELINES": { "scrapys.mypipeline.MyPipeline": 100, }, "SPIDER_MIDDLEWARES":{ "scrapys.myspidermiddleware.MySpiderMiddleware": 100, } } def parse(self, response): pass if __name__ == '__main__': cmdline.
execute("scrapy crawl baidu_spider".split())

2. Pipeline

mypipeline.py


class MyPipeline(object):
    def __init__(self, spider_data):
        self.spider_data = spider_data

    @classmethod
    def from_crawler(cls, crawler):
        """
        Read settings from the crawler and return a Pipeline instance
        """
        spider_data = crawler.settings.get("SPIDER_DATA")
        print("### pipeline get spider_data: {}".format(spider_data))

        return cls(spider_data)

    def process_item(self, item, spider):
        """
        return the item to continue processing it
        raise DropItem to discard it
        """
        print("### call process_item")

        return item

    def open_spider(self, spider):
        """
        Called when the spider is opened
        """
        print("### spider open {}".format(spider.name))


    def close_spider(self, spider):
        """
        Called when the spider is closed
        """
        print("### spider close {}".format(spider.name))


3. Downloader-Middleware

mymiddleware.py


class MyMiddleware(object):
    def __init__(self, spider_data):
        self.spider_data = spider_data

    @classmethod
    def from_crawler(cls, crawler):
        """
        Read settings from the crawler and return a middleware instance
        """
        spider_data = crawler.settings.get("SPIDER_DATA")
        print("### middleware get spider_data: {}".format(spider_data))

        return cls(spider_data)

    def process_request(self, request, spider):
        """
        return
            None: continue processing this request
            Response: stop the chain and return this response
            Request: reschedule the new request
        raise IgnoreRequest: process_exception() is called, then Request.errback
        """
        print("### call process_request")

    def process_response(self, request, response, spider):
        """
        return
            Response: continue processing this response
            Request: reschedule the new request
        raise IgnoreRequest: the request's errback is called
        """
        print("### call process_response")
        return response

    def process_exception(self, request, exception, spider):
        """
        return
            None: continue handling this exception
            Response: stop the chain and return this response
            Request: reschedule the new request
        """
        pass
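As a concrete use of process_request(), here is a minimal sketch of a middleware that injects a random User-Agent header; the agent strings are placeholders:

import random


class RandomUserAgentMiddleware(object):
    # Placeholder values; use a real user-agent pool in practice
    USER_AGENTS = [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
    ]

    def process_request(self, request, spider):
        request.headers["User-Agent"] = random.choice(self.USER_AGENTS)
        # Returning None lets the request continue through the chain
        return None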

4. Spider-Middleware

myspidermiddleware.py


class MySpiderMiddleware(object):
    def __init__(self, spider_data):
        self.spider_data = spider_data

    @classmethod
    def from_crawler(cls, crawler):
        """
        Read settings from the crawler and return a middleware instance
        """
        spider_data = crawler.settings.get("SPIDER_DATA")
        print("### spider middleware get spider_data: {}".format(spider_data))

        return cls(spider_data)

    def process_spider_input(self, response, spider):
        """
        Called when a response passes through the middleware into the spider
        return None to continue processing the response
        raise an exception to abort processing
        """

        print("### call process_spider_input")

    def process_spider_output(self, response, result, spider):
        """
        Called with the result the spider returns for a response
        return
            an iterable of Request, dict, or Item objects
        """
        print("### call process_spider_output")

        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        """
        return
            None
            an iterable of Request, dict, or Item objects
        """
        pass
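A typical use of process_spider_output() is filtering what the spider yields. A minimal sketch that drops requests to unwanted domains; the blocked list is an example value:

from scrapy import Request


class DomainFilterMiddleware(object):
    BLOCKED = ("ads.example.com",)  # example value

    def process_spider_output(self, response, result, spider):
        for r in result:
            # Let items pass through untouched; filter only requests
            if isinstance(r, Request) and any(b in r.url for b in self.BLOCKED):
                continue
            yield r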

Run the spider and inspect the log:

### middleware get spider_data: this is spider data
### spider middleware get spider_data: this is spider data
### pipeline get spider_data: this is spider data

### spider open baidu_spider
### call process_request
### call process_response
### call process_spider_input
### call process_spider_output
### spider close baidu_spider

The log output shows that the overall flow matches Scrapy's data-flow diagram:

Instantiation: downloader middleware -> spider middleware -> pipeline
Runtime: spider open -> process_request -> process_response -> process_spider_input -> process_spider_output -> spider close