Scrapy: Using Redis for URL Deduplication and Incremental Crawling
阿新 · Published 2019-02-01
Introduction
During an earlier data-collection project there were two requirements: URL deduplication and incremental crawling (only newly added URLs should be requested, otherwise the crawl puts unnecessary load on the target site's servers). The initial plan was simply to use a Redis set for URL deduplication, but in the course of development the incremental-crawling requirement turned out to be solved by the same mechanism. The main code is posted below.
Implementation steps
- Store every crawled link in Redis (pipeline.py)
class InsertRedis(object):
    def __init__(self):
        self.Redis = RedisOpera('insert')

    def process_item(self, item, spider):
        # record the crawled URL in Redis, then pass the item on
        self.Redis.write(item['url'])
        return item
Note: the details of the Redis operations themselves are not covered here (see the end of the post).
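Since pipelines are opt-in in Scrapy, InsertRedis also has to be registered in settings.py. A minimal sketch, assuming the project package is named webcrawl (the package name and priority value are assumptions, not taken from the original project):

# settings.py -- enable the Redis pipeline (package name is an assumption)
ITEM_PIPELINES = {
    'webcrawl.pipelines.InsertRedis': 300,
}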
- Check whether each URL about to be requested has already been crawled (middlewares.py)
from scrapy.exceptions import IgnoreRequest

class IngoreRequestMiddleware(object):
    def __init__(self):
        self.Redis = RedisOpera('query')

    def process_request(self, request, spider):
        # drop the request if its URL is already stored in Redis
        if self.Redis.query(request.url):
            raise IgnoreRequest("IgnoreRequest : %s" % request.url)
        else:
            return None
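Like the pipeline, the middleware only takes effect once it is listed in settings.py. A sketch under the same assumed package name:

# settings.py -- enable the dedup middleware (package name and priority are assumptions)
DOWNLOADER_MIDDLEWARES = {
    'webcrawl.middlewares.IngoreRequestMiddleware': 543,
}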
- Implement incremental crawling (in the spider)
def start_requests(self):
    # page_num is the pagination parameter
    yield FormRequest('https://www.demo.org/vuldb/vulnerabilities?page=' + str(self.page_num),
                      callback=self.parse_page)

def parse_page(self, response):
    urls = response.xpath('//tbody/tr').extract()
    for url in urls:
        request_url = Selector(text=url).xpath('//td[@class=\'vul-title-wrapper\']/a/@href').extract()[0]
        if re.search(r'/vuldb/ssvid-\d+', request_url):
            yield FormRequest('https://www.demo.org' + request_url.strip(),
                              callback=self.parse_item, dont_filter=False)
    # a full page holds 20 rows, so move on to the next page
    if len(urls) == 20:
        self.page_num += 1

def parse_item(self, response):
    item = WebcrawlItem()
    self.count += 1
    item['url'] = response.url
    yield item
    yield FormRequest('https://www.demo.org/vuldb/vulnerabilities?page=' + str(self.page_num),
                      callback=self.parse_page)
In the third function, parse_item() calls back into parse_page(). If the Redis database holds no URLs at all, the spider keeps paging through the entire site. But if the data was already crawled at some earlier point and the program is restarted to pick up newly added entries, every outgoing URL is checked against Redis first: a duplicate URL is dropped by the middleware, so parse_item() is never called for it (URL dedup), and therefore the follow-up yield FormRequest('https://www.demo.org/vuldb/vu...' + str(self.page_num), callback=self.parse_page) is never executed either. The crawl thus drops out of the loop and finishes.
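For completeness, the three methods above live inside an ordinary Scrapy spider. A minimal sketch of the enclosing class, in which the spider name, the item import path, and the initial values of page_num and count are assumptions rather than details from the original project:

# sketch of the enclosing spider (name, item module and initial values are assumptions)
import re
from scrapy import Spider, Selector, FormRequest
from webcrawl.items import WebcrawlItem  # assumed item module

class VulnSpider(Spider):
    name = 'vuln'

    def __init__(self, *args, **kwargs):
        super(VulnSpider, self).__init__(*args, **kwargs)
        self.page_num = 1   # pagination counter used by start_requests/parse_page
        self.count = 0      # number of items scraped so far

    # start_requests(), parse_page() and parse_item() as shown above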
The Redis-related operations are attached below.
1 redisopera.py
# -*- coding: utf-8 -*-
import redis
import time
from scrapy import log
from newscrawl.util import RedisCollection

class RedisOpera:
    def __init__(self, stat):
        log.msg('init redis %s connection!!!!!!!!!!!!!!!!!!!!!!!!!' % stat, log.INFO)
        self.r = redis.Redis(host='localhost', port=6379, db=0)

    def write(self, values):
        # print self.r.keys('*')
        # add the URL to the set whose name is derived from the URL itself
        collectionname = RedisCollection(values).getCollectionName()
        self.r.sadd(collectionname, values)

    def query(self, values):
        # check whether the URL is already a member of its set
        collectionname = RedisCollection(values).getCollectionName()
        return self.r.sismember(collectionname, values)
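A quick usage sketch of RedisOpera (it assumes a Redis server is running locally on the default port; the example URL is made up):

r_insert = RedisOpera('insert')
r_insert.write('https://www.demo.org/vuldb/ssvid-123')

r_query = RedisOpera('query')
r_query.query('https://www.demo.org/vuldb/ssvid-123')   # True -> already crawled, request will be ignored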
2 util.py
# -*- coding: utf-8 -*-
import re
from scrapy import log

class RedisCollection(object):
    def __init__(self, OneUrl):
        self.collectionname = OneUrl

    def getCollectionName(self):
        # name = None
        # use a site-specific set name when the URL matches a known site,
        # otherwise fall back to the shared 'publicurls' set
        if self.IndexAllUrls() is not None:
            name = self.IndexAllUrls()
        else:
            name = 'publicurls'
        # log.msg("the collections name is %s" % name, log.INFO)
        return name

    def IndexAllUrls(self):
        allurls = ['wooyun', 'freebuf']
        result = None
        for site in allurls:
            if re.findall(site, self.collectionname):
                result = site
                break
        return result
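RedisCollection simply maps a URL to the name of the Redis set it should be stored in. A short illustration of the behaviour of the class above (the example URLs are made up):

RedisCollection('http://www.wooyun.org/bugs/wooyun-1').getCollectionName()    # 'wooyun'
RedisCollection('https://www.demo.org/vuldb/ssvid-1').getCollectionName()     # 'publicurls'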