
Modifying scrapy's deduplication strategy

1. First, define a custom 'duplication.py' file:

class RepeatFilter(object):

    def __init__(self):
        """
        2. Initialize the filter instance
        """
        self.visited_set = set()

    @classmethod
    def from_settings(cls, settings):
        """
        1. Create the filter object
        :param settings:
        :return:
        """
        print('......')
        return cls()

    def request_seen(self, request):
        """
        4. Check whether this request has already been visited
        :param request:
        :return:
        """
        if request.url in self.visited_set:
            return True
        self.visited_set.add(request.url)
        return False

    def open(self):  # can return deferred
        """
        3. Called when crawling starts
        :return:
        """
        print('open')

    def close(self, reason):  # can return a deferred
        """
        5. Called when crawling stops
        :param reason:
        :return:
        """
        print('close')

    def log(self, request, spider):  # log that a request has been filtered
        pass
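Note that the check above compares raw URL strings, so two URLs that differ only in query-parameter order are treated as different pages. For comparison, Scrapy's built-in RFPDupeFilter deduplicates on request fingerprints instead. Below is a minimal sketch of a fingerprint-based variant; it assumes the request_fingerprint helper from scrapy.utils.request (present in the Scrapy versions this tutorial targets, though newer releases may deprecate it), and the class name FingerprintRepeatFilter is made up for illustration.

# Sketch: deduplicate on request fingerprints instead of raw URLs.
from scrapy.utils.request import request_fingerprint


class FingerprintRepeatFilter(object):

    def __init__(self):
        self.visited_fingerprints = set()

    @classmethod
    def from_settings(cls, settings):
        return cls()

    def request_seen(self, request):
        # URLs that differ only in query-parameter order produce the same
        # fingerprint, so they are treated as duplicates.
        fp = request_fingerprint(request)
        if fp in self.visited_fingerprints:
            return True
        self.visited_fingerprints.add(fp)
        return False

    def open(self):
        pass

    def close(self, reason):
        pass

    def log(self, request, spider):
        pass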

2. Modify the settings file and add:

DUPEFILTER_CLASS = 'day96.duplication.RepeatFilter'
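To see the filter in action, the hypothetical spider below (its name and URLs are placeholders, not part of the original project) yields the same URL twice: the second request is dropped because request_seen returns True, while passing dont_filter=True bypasses the dupe filter entirely. The '......', 'open', and 'close' prints from RepeatFilter should appear in the crawl output.

# Hypothetical spider used only to exercise the custom dupe filter.
import scrapy


class DemoSpider(scrapy.Spider):
    name = "demo"
    start_urls = ["http://example.com/"]

    def parse(self, response):
        # Both requests target the same URL; RepeatFilter.request_seen
        # returns True for the second one, so Scrapy discards it.
        yield scrapy.Request("http://example.com/page", callback=self.parse_page)
        yield scrapy.Request("http://example.com/page", callback=self.parse_page)

        # dont_filter=True skips the dupe filter, so this one is scheduled anyway.
        yield scrapy.Request(
            "http://example.com/page",
            callback=self.parse_page,
            dont_filter=True,
        )

    def parse_page(self, response):
        self.logger.info("Visited %s", response.url)

Because this custom filter's log() method is a no-op, the dropped request is discarded silently; Scrapy's built-in filter would emit a "Filtered duplicate request" debug message at that point.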