
Scrapy: deduplicating crawled URLs with a custom filter class

Previously we handled URL deduplication by keeping a set inside the parse function and checking each URL against it before scheduling a new request.
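For reference, a minimal sketch of that earlier approach (the spider name, start URL, and link selector are illustrative, not from the original project):

import scrapy

class ChoutiSpider(scrapy.Spider):
    name = "chouti"
    start_urls = ["https://dig.chouti.com/"]

    visited_urls = set()  # deduplication state lives inside the spider itself

    def parse(self, response):
        for href in response.css("a::attr(href)").extract():
            url = response.urljoin(href)
            if url in self.visited_urls:  # skip URLs we have already queued
                continue
            self.visited_urls.add(url)
            yield scrapy.Request(url, callback=self.parse)

This works, but it ties deduplication logic to one spider; Scrapy's dupefilter mechanism lets us move it into a reusable component.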

First, create a new duplication.py file in the project root. Look at the source behind from scrapy.dupefilter import RFPDupeFilter (note the module has been renamed to scrapy.dupefilters; the old name is deprecated) and copy the BaseDupeFilter class into the new duplication.py as the skeleton for our own filter:

class RepeatFilter(object):
    def __init__(self):
        self.visited_set = set()

    @classmethod
    def from_settings(cls, settings):
        # Class method Scrapy calls to build the filter; returns RepeatFilter()
        return cls()

    def request_seen(self, request):
        # Filtering hook: True means "already seen, drop this request"
        if request.url in self.visited_set:
            return True
        else:
            self.visited_set.add(request.url)
            return False

    def open(self):  # called when the crawl starts
        print("---crawl started---")

    def close(self, reason):  # called when the crawl finishes
        print("---crawl finished---")

    def log(self, request, spider):  # log a filtered (duplicate) request
        pass

The URL-filtering logic goes in the request_seen method: return True to drop a request whose URL has already been seen, and False to let it through (recording the URL as visited). A request can also opt out of this check entirely, as the sketch below shows.
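Scrapy's Request accepts a dont_filter flag that bypasses request_seen for that request. A minimal sketch (the spider and URLs are illustrative):

import scrapy

class ExampleSpider(scrapy.Spider):
    name = "example"  # illustrative spider

    def parse(self, response):
        # This request is checked by RepeatFilter.request_seen and
        # dropped if the URL was already visited:
        yield scrapy.Request("https://dig.chouti.com/all/hot/recent/2",
                             callback=self.parse)
        # dont_filter=True makes the scheduler skip the dupefilter entirely:
        yield scrapy.Request("https://dig.chouti.com/",
                             callback=self.parse, dont_filter=True)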

The methods run in this order:

1. from_settings

2. __init__

3. open

4. log

5. close

Finally, don't forget to add DUPEFILTER_CLASS = "shan.duplication.RepeatFilter" to the settings.py file.

The default is DUPEFILTER_CLASS = "scrapy.dupefilters.RFPDupeFilter".
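In context, the relevant part of settings.py might look like this ("shan" is this example project's name; adjust the dotted path to match your own project and module):

# settings.py
# Point Scrapy at the custom filter: <project>.<module>.<class>
DUPEFILTER_CLASS = "shan.duplication.RepeatFilter"

# If this line is omitted, Scrapy falls back to the built-in
# "scrapy.dupefilters.RFPDupeFilter".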

(venv) D:\shan>scrapy crawl chouti --nolog
D:\shan\shan\spiders\chouti.py:9: ScrapyDeprecationWarning: Module `scrapy.dupefilter` is deprecated, use `scrapy.dupefilters` instead
  from scrapy.dupefilter import RFPDupeFilter
---crawl started---
https://dig.chouti.com/
https://dig.chouti.com/all/hot/recent/2
https://dig.chouti.com/all/hot/recent/3
https://dig.chouti.com/all/hot/recent/8
https://dig.chouti.com/all/hot/recent/5
https://dig.chouti.com/all/hot/recent/7
https://dig.chouti.com/all/hot/recent/6
https://dig.chouti.com/all/hot/recent/10
https://dig.chouti.com/all/hot/recent/9
https://dig.chouti.com/all/hot/recent/4
---crawl finished---