Distributed crawler principles

First, let's look at Scrapy's single-machine architecture:

(figure: Scrapy single-machine architecture)

As you can see, in single-machine mode the Scrapy engine, working through a scheduler, takes requests from the request queue and hands them to the downloader, which fetches the pages.

The key to having multiple hosts cooperate is therefore sharing a single crawl queue.

The single-host crawler architecture is shown below:

(figure: single-host crawler architecture)

As mentioned above, the key to a distributed crawler is the shared requests queue. The host that maintains this queue is called the master, while the slaves handle fetching, processing and storing the data, so the distributed crawler architecture looks like this:

(figure: master/slave distributed crawler architecture)

MasterSpider builds requests from the urls in start_urls and fetches the responses
MasterSpider parses each response, extracts the target page urls, deduplicates them with redis, and builds the queue of requests waiting to be crawled
SlaveSpider reads the pending queue from redis and constructs requests
SlaveSpider issues the requests and fetches the target pages' responses
SlaveSpider parses each response, extracts the target data, and writes it to the production database

Increasing concurrency

Concurrency is the number of requests processed at the same time. There is a global limit and a local (per-website) limit.

Scrapy's default global concurrency limit is not suitable when crawling many websites at once, so you will want to raise it. How far depends on how much CPU the crawler can use; 100 is a reasonable starting point.
The best approach, though, is to run some tests and measure how the Scrapy process's CPU usage relates to the concurrency level. For good performance, pick a concurrency that keeps CPU usage around 80-90%.
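
As a minimal sketch of where these knobs live (the numbers are illustrative, not prescribed by the text above), the relevant lines in settings.py would look like this:

# settings.py -- concurrency knobs (tune the values against measured CPU usage)
CONCURRENT_REQUESTS = 100            # global concurrency limit (Scrapy default: 16)
CONCURRENT_REQUESTS_PER_DOMAIN = 16  # per-domain limit
DOWNLOAD_DELAY = 0                   # any delay > 0 will mask the effect of higher concurrency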

Redis remote connections

After installation, redis does not accept remote connections by default. To change that, edit the configuration file /etc/redis.conf and comment out the bind line:

# bind 127.0.0.1

After the change, restart the redis server. (Depending on your redis version, you may also need to disable protected-mode or set a password before remote clients are accepted.)
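
A quick way to confirm the server is reachable from another machine is a ping via redis-py (the host, port and password below are placeholders for your master's address and credentials):

import redis

# placeholders -- substitute your redis master's address and password
r = redis.Redis(host='192.168.1.100', port=6379, password='yourpassword', db=0)
print(r.ping())  # True means the remote connection works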


Windows users may run into problems installing Scrapy with pip. Anaconda is recommended; otherwise, honestly, just use Linux.

conda install scrapy
or
pip install scrapy

Install Scrapy-Redis

conda install scrapy-redis
or
pip install scrapy-redis

Before we start, we need to go over some scrapy-redis settings. Note: these go in the Scrapy project's settings.py!

  • settings.py

     # -*- coding: utf-8 -*-
    
     # Scrapy settings for companyNews project
     #
     # For simplicity, this file contains only settings considered important or
     # commonly used. You can find more settings consulting the documentation:
     #
     #     http://doc.scrapy.org/en/latest/topics/settings.html
     #     http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
     #     http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
     from DBSetting import host_redis,port_redis,db_redis,password_redis
    
     BOT_NAME = 'companyNews'
    
     SPIDER_MODULES = ['companyNews.spiders']
     NEWSPIDER_MODULE = 'companyNews.spiders'
    
     #----------------------- Logging configuration -----------------------
     # Log file name
     #LOG_FILE = "dg.log"
     # Log level
     LOG_LEVEL = 'WARNING'

     # Obey robots.txt rules
     # robots.txt is a file that follows the Robots protocol. It lives on the website's server and tells
     # search-engine crawlers which directories of the site they are not welcome to crawl or index.
     # When Scrapy starts it fetches the site's robots.txt first and uses it to decide the crawl scope.
     # ROBOTSTXT_OBEY = True

     # ------------------------ Global concurrency settings -------------------------------
     # Configure maximum concurrent requests performed by Scrapy (default: 16)
     # CONCURRENT_REQUESTS = 32
     # Maximum number of items processed concurrently (default: 100)
     # CONCURRENT_ITEMS = 100
     # The download delay setting will honor only one of:
     # Concurrent requests per domain
     #CONCURRENT_REQUESTS_PER_DOMAIN = 16
     # Maximum concurrent requests per IP; 0 means the per-IP limit is ignored
     # CONCURRENT_REQUESTS_PER_IP = 0

     # Configure a delay for requests for the same website (default: 0)
     # See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
     # See also autothrottle settings and docs
     # DOWNLOAD_DELAY interacts with CONCURRENT_REQUESTS: a non-zero delay keeps the concurrency from taking effect
     #DOWNLOAD_DELAY = 3

     # Disable cookies (enabled by default)
     # COOKIES_ENABLED = True
     # COOKIES_DEBUG = True
    
     # Disable Telnet Console (enabled by default)
     #TELNETCONSOLE_ENABLED = False
    
     # Crawl responsibly by identifying yourself (and your website) on the user-agent
     #USER_AGENT = 'haoduofuli (+http://www.yourdomain.com)'
    
     # Override the default request headers:
     DEFAULT_REQUEST_HEADERS = {
       'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
       'Accept-Language': 'en',
     }
    
     # Enable or disable spider middlewares
     # See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
     SPIDER_MIDDLEWARES = {
         'companyNews.middlewares.UserAgentmiddleware': 401,
         'companyNews.middlewares.ProxyMiddleware':426,
     }
    
     # Enable or disable downloader middlewares
     # See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
     DOWNLOADER_MIDDLEWARES = {
          'companyNews.middlewares.UserAgentmiddleware': 400,
         'companyNews.middlewares.ProxyMiddleware':425,
         # 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware':423,
          # 'companyNews.middlewares.CookieMiddleware': 700,
     }
    
     MYEXT_ENABLED = True    # enable the custom extension
     IDLE_NUMBER = 10        # idle duration before closing, measured in units of 5 s (here 10 units = 50 s)
     # Enable or disable extensions
     # See http://scrapy.readthedocs.org/en/latest/topics/extensions.html
     # Register the extension under EXTENSIONS to enable it
     EXTENSIONS = {
         # 'scrapy.extensions.telnet.TelnetConsole': None,
         'companyNews.extensions.RedisSpiderSmartIdleClosedExensions': 500,
     }

     # Configure item pipelines
     # See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
     # Note: custom pipelines should get higher priority (a lower number) than RedisPipeline, because
     # RedisPipeline does not return the item; if RedisPipeline ran first, the custom pipeline would never see it.
     ITEM_PIPELINES = {
         # Register RedisPipeline here to have scraped items stored in Redis
         # 'scrapy_redis.pipelines.RedisPipeline': 400,
         'companyNews.pipelines.companyNewsPipeline': 300,  # custom pipeline, register as needed (optional)
     }
    
     # Enable and configure the AutoThrottle extension (disabled by default)
     # See http://doc.scrapy.org/en/latest/topics/autothrottle.html
     #AUTOTHROTTLE_ENABLED = True
     # The initial download delay
     #AUTOTHROTTLE_START_DELAY = 5
     # The maximum download delay to be set in case of high latencies
     #AUTOTHROTTLE_MAX_DELAY = 60
     # The average number of requests Scrapy should be sending in parallel to
     # each remote server
     #AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
     # Enable showing throttling stats for every response received:
     #AUTOTHROTTLE_DEBUG = False
    
     # Enable and configure HTTP caching (disabled by default)
     # See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
     # ---------------- Scrapy already ships with an HTTP cache; configuration below -----------------
     # Enable the cache
     #HTTPCACHE_ENABLED = True
     # Cache expiration time in seconds (0 = never expire)
     #HTTPCACHE_EXPIRATION_SECS = 0
     # Cache directory (default: .scrapy/httpcache)
     #HTTPCACHE_DIR = 'httpcache'
     # HTTP status codes not to cache
     #HTTPCACHE_IGNORE_HTTP_CODES = []
     # Cache storage backend (filesystem cache)
     #HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
    
     #----------------- Scrapy-Redis distributed crawler settings --------------------------
     # Enables scheduling storing requests queue in redis.
     # Use the Scrapy-Redis scheduler, which keeps the request queue in redis, instead of Scrapy's own scheduler
     SCHEDULER = "scrapy_redis.scheduler.Scheduler"

     # Ensure all spiders share same duplicates filter through redis.
     # Make all spiders deduplicate through redis, using the Scrapy-Redis dupefilter instead of Scrapy's own
     DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

     # Requests are serialized with pickle by default, but this can be swapped for something similar.
     # Note from the original author: this works on Python 2.x but not on 3.x.
     # SCHEDULER_SERIALIZER = "scrapy_redis.picklecompat"

     # Which Scrapy-Redis queue to pop requests from; pick one of the three:
     # (1) by request priority (the default), (2) FIFO queue (first in, first out), (3) LIFO stack (last in, first out)
     # SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.PriorityQueue'
     # Other available queues
     SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.FifoQueue'
     # SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.LifoQueue'

     # Don't cleanup redis queues, allows to pause/resume crawls.
     # Keeping the redis queues lets you pause/resume the crawl:
     # the request records in redis are not lost, so restarting the spider does not re-crawl pages already seen
     #SCHEDULER_PERSIST = True

     #---------------------- Redis connection settings -------------------------------------
     # Specify the full Redis URL for connecting (optional).
     # If set, this takes precedence over the REDIS_HOST and REDIS_PORT settings.
     # REDIS_URL = 'redis://root:password@host_ip:port'
     # REDIS_URL = 'redis://root:[email protected]:6379'
     REDIS_URL = 'redis://root:%s@%s:%s'%(password_redis,host_redis,port_redis)
     # Extra redis parameters (connection timeouts and the like)
     REDIS_PARAMS={'db': db_redis}
     # Specify the host and port to use when connecting to Redis (optional).
     #REDIS_HOST = '127.0.0.1'
     #REDIS_PORT = 6379
     #REDIS_PASS = '19940225'

     # REDIRECT_ENABLED = False
     #
     # HTTPERROR_ALLOWED_CODES = [302, 301]
     #
     # DEPTH_LIMIT = 3

     #------------------------------------------------------------------------------------------------
     # Maximum idle time, to keep the distributed crawler from shutting down while it waits for work.
     # This only applies when the queue class above is SpiderQueue or SpiderStack,
     # and it can also delay startup when the spider first launches (the queue is still empty).
     # SCHEDULER_IDLE_BEFORE_CLOSE = 10

     # Redis key under which the item pipeline stores serialized items
     # REDIS_ITEMS_KEY = '%(spider)s:items'

     # Items are serialized with ScrapyJSONEncoder by default.
     # You can use any importable path to a callable object.
     # REDIS_ITEMS_SERIALIZER = 'json.dumps'

     # Custom redis client class
     # REDIS_PARAMS['redis_cls'] = 'myproject.RedisClient'

     # If True, use redis' 'spop' operation for the start URLs.
     # Useful to avoid duplicates in the start URL list; with this enabled, urls must be added with 'sadd',
     # otherwise you get a type error.
     # REDIS_START_URLS_AS_SET = False

     # Default start_urls key for RedisSpider and RedisCrawlSpider
     # REDIS_START_URLS_KEY = '%(name)s:start_urls'

     # Use an encoding other than utf-8 for redis
     # REDIS_ENCODING = 'latin1'
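
One thing the settings above rely on but that is never shown is the DBSetting module that host_redis, port_redis, db_redis and password_redis are imported from. A minimal sketch of what it might contain (all values are placeholders) is:

# DBSetting.py -- hypothetical sketch; every value here is a placeholder
host_redis = '127.0.0.1'        # address of the redis server shared by master and slaves
port_redis = 6379               # redis port
db_redis = 4                    # db index used by the scheduler and dupefilter
password_redis = 'yourpassword' # redis password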

Nice, that's the configuration file done. Now let's add some basic defences against anti-crawling measures.

The most basic one: rotating the User-Agent!

First, create a useragent.py file in the project to hold a pile of User-Agent strings (you can find more online, or just use the ones below).

  • useragent.py

    # -*- coding: utf-8 -*-
    agents = [
        "Mozilla/5.0 (Linux; U; Android 2.3.6; en-us; Nexus S Build/GRK39F) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
        "Avant Browser/1.2.789rel1 (http://www.avantbrowser.com)",
        "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/532.5 (KHTML, like Gecko) Chrome/4.0.249.0 Safari/532.5",
        "Mozilla/5.0 (Windows; U; Windows NT 5.2; en-US) AppleWebKit/532.9 (KHTML, like Gecko) Chrome/5.0.310.0 Safari/532.9",
        "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) AppleWebKit/534.7 (KHTML, like Gecko) Chrome/7.0.514.0 Safari/534.7",
        "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US) AppleWebKit/534.14 (KHTML, like Gecko) Chrome/9.0.601.0 Safari/534.14",
        "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.14 (KHTML, like Gecko) Chrome/10.0.601.0 Safari/534.14",
        "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.20 (KHTML, like Gecko) Chrome/11.0.672.2 Safari/534.20",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.27 (KHTML, like Gecko) Chrome/12.0.712.0 Safari/534.27",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/13.0.782.24 Safari/535.1",
        "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.874.120 Safari/535.2",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.36 Safari/535.7",
        "Mozilla/5.0 (Windows; U; Windows NT 6.0 x64; en-US; rv:1.9pre) Gecko/2008072421 Minefield/3.0.2pre",
        "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.10) Gecko/2009042316 Firefox/3.0.10",
        "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-GB; rv:1.9.0.11) Gecko/2009060215 Firefox/3.0.11 (.NET CLR 3.5.30729)",
        "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6 GTB5",
        "Mozilla/5.0 (Windows; U; Windows NT 5.1; tr; rv:1.9.2.8) Gecko/20100722 Firefox/3.6.8 ( .NET CLR 3.5.30729; .NET4.0E)",
        "Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
        "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
        "Mozilla/5.0 (Windows NT 5.1; rv:5.0) Gecko/20100101 Firefox/5.0",
        "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:6.0a2) Gecko/20110622 Firefox/6.0a2",
        "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:7.0.1) Gecko/20100101 Firefox/7.0.1",
        "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:2.0b4pre) Gecko/20100815 Minefield/4.0b4pre",
        "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0 )",
        "Mozilla/4.0 (compatible; MSIE 5.5; Windows 98; Win 9x 4.90)",
        "Mozilla/5.0 (Windows; U; Windows XP) Gecko MultiZilla/1.6.1.0a",
        "Mozilla/2.02E (Win95; U)",
        "Mozilla/3.01Gold (Win95; I)",
        "Mozilla/4.8 [en] (Windows NT 5.1; U)",
        "Mozilla/5.0 (Windows; U; Win98; en-US; rv:1.4) Gecko Netscape/7.1 (ax)",
        "HTC_Dream Mozilla/5.0 (Linux; U; Android 1.5; en-ca; Build/CUPCAKE) AppleWebKit/528.5  (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
        "Mozilla/5.0 (hp-tablet; Linux; hpwOS/3.0.2; U; de-DE) AppleWebKit/534.6 (KHTML, like Gecko) wOSBrowser/234.40.1 Safari/534.6 TouchPad/1.0",
        "Mozilla/5.0 (Linux; U; Android 1.5; en-us; sdk Build/CUPCAKE) AppleWebkit/528.5  (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
        "Mozilla/5.0 (Linux; U; Android 2.1; en-us; Nexus One Build/ERD62) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
        "Mozilla/5.0 (Linux; U; Android 2.2; en-us; Nexus One Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
        "Mozilla/5.0 (Linux; U; Android 1.5; en-us; htc_bahamas Build/CRB17) AppleWebKit/528.5  (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
        "Mozilla/5.0 (Linux; U; Android 2.1-update1; de-de; HTC Desire 1.19.161.5 Build/ERE27) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
        "Mozilla/5.0 (Linux; U; Android 2.2; en-us; Sprint APA9292KT Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
        "Mozilla/5.0 (Linux; U; Android 1.5; de-ch; HTC Hero Build/CUPCAKE) AppleWebKit/528.5  (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
        "Mozilla/5.0 (Linux; U; Android 2.2; en-us; ADR6300 Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
        "Mozilla/5.0 (Linux; U; Android 2.1; en-us; HTC Legend Build/cupcake) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
        "Mozilla/5.0 (Linux; U; Android 1.5; de-de; HTC Magic Build/PLAT-RC33) AppleWebKit/528.5  (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1 FirePHP/0.3",
        "Mozilla/5.0 (Linux; U; Android 1.6; en-us; HTC_TATTOO_A3288 Build/DRC79) AppleWebKit/528.5  (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
        "Mozilla/5.0 (Linux; U; Android 1.0; en-us; dream) AppleWebKit/525.10  (KHTML, like Gecko) Version/3.0.4 Mobile Safari/523.12.2",
        "Mozilla/5.0 (Linux; U; Android 1.5; en-us; T-Mobile G1 Build/CRB43) AppleWebKit/528.5  (KHTML, like Gecko) Version/3.1.2 Mobile Safari 525.20.1",
        "Mozilla/5.0 (Linux; U; Android 1.5; en-gb; T-Mobile_G2_Touch Build/CUPCAKE) AppleWebKit/528.5  (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
        "Mozilla/5.0 (Linux; U; Android 2.0; en-us; Droid Build/ESD20) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
        "Mozilla/5.0 (Linux; U; Android 2.2; en-us; Droid Build/FRG22D) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
        "Mozilla/5.0 (Linux; U; Android 2.0; en-us; Milestone Build/ SHOLS_U2_01.03.1) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
        "Mozilla/5.0 (Linux; U; Android 2.0.1; de-de; Milestone Build/SHOLS_U2_01.14.0) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
        "Mozilla/5.0 (Linux; U; Android 3.0; en-us; Xoom Build/HRI39) AppleWebKit/525.10  (KHTML, like Gecko) Version/3.0.4 Mobile Safari/523.12.2",
        "Mozilla/5.0 (Linux; U; Android 0.5; en-us) AppleWebKit/522  (KHTML, like Gecko) Safari/419.3",
        "Mozilla/5.0 (Linux; U; Android 1.1; en-gb; dream) AppleWebKit/525.10  (KHTML, like Gecko) Version/3.0.4 Mobile Safari/523.12.2",
        "Mozilla/5.0 (Linux; U; Android 2.0; en-us; Droid Build/ESD20) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
        "Mozilla/5.0 (Linux; U; Android 2.1; en-us; Nexus One Build/ERD62) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
        "Mozilla/5.0 (Linux; U; Android 2.2; en-us; Sprint APA9292KT Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
        "Mozilla/5.0 (Linux; U; Android 2.2; en-us; ADR6300 Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
        "Mozilla/5.0 (Linux; U; Android 2.2; en-ca; GT-P1000M Build/FROYO) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
        "Mozilla/5.0 (Linux; U; Android 3.0.1; fr-fr; A500 Build/HRI66) AppleWebKit/534.13 (KHTML, like Gecko) Version/4.0 Safari/534.13",
        "Mozilla/5.0 (Linux; U; Android 3.0; en-us; Xoom Build/HRI39) AppleWebKit/525.10  (KHTML, like Gecko) Version/3.0.4 Mobile Safari/523.12.2",
        "Mozilla/5.0 (Linux; U; Android 1.6; es-es; SonyEricssonX10i Build/R1FA016) AppleWebKit/528.5  (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
        "Mozilla/5.0 (Linux; U; Android 1.6; en-us; SonyEricssonX10i Build/R1AA056) AppleWebKit/528.5  (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
    ]

Now let's override Scrapy's downloader middleware.
Create a middlewares.py file in the project (newer versions of Scrapy generate this file when the project is created, so just use it).

First, import UserAgentMiddleware, since that's what we are going to override!

import json      # for handling JSON
import redis     # Python redis client
import random    # for random choices
from .useragent import agents   # the User-Agent list we just wrote
from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware   # the UserAgent middleware
from scrapy.downloadermiddlewares.retry import RetryMiddleware           # the retry middleware

Let's write it:

class UserAgentmiddleware(UserAgentMiddleware):

    def process_request(self, request, spider):
        agent = random.choice(agents)
        request.headers["User-Agent"] = agent

Line 1: defines a class UserAgentmiddleware that inherits from UserAgentMiddleware.

Line 2: defines process_request(request, spider). Why this method? Because Scrapy calls it for every request that passes through the middleware.

Line 3: picks a random User-Agent from the list.

Line 4: sets the request's User-Agent header to the one we picked.

And that's one middleware finished! See, easy, right?

For sites that require logging in, we also need to override the cookie middleware. This is a distributed crawler: you can't hand-write a cookie for every spider, and you wouldn't even know whether a given cookie had expired. So we need to maintain a cookie pool (backed by redis).

OK, let's think it through: what does a cookie pool need to be able to do, at a minimum?

Get cookies
Update cookies
Delete cookies
Check whether a cookie is still valid and react accordingly (e.g. retry)
Let's handle the first three operations first.

First, create a cookies.py file in the project for the cookie-related operations.

Start with the usual imports:

import requests
import json
import redis
import logging
from .settings import REDIS_URL  ## read REDIS_URL from settings.py

First, store the login accounts and passwords in redis as key:value pairs. Don't use db0 for this (that's what Scrapy-Redis uses by default); keep the credentials in a db of their own.
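
For example, seeding one account into db2 (the account name and password below are placeholders) could look like this:

import redis

# placeholders -- store each account as key (username) : value (password) in its own db
r = redis.Redis(host='127.0.0.1', port=6379, db=2, decode_responses=True)
r.set('myaccount', 'mypassword')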

Problem one: getting a cookie:

import requests
import json
import redis
import logging
from .settings import REDIS_URL

logger = logging.getLogger(__name__)
## connect to redis with REDIS_URL; decode_responses=True is required, otherwise values come back as bytes and are unusable
reds = redis.Redis.from_url(REDIS_URL, db=2, decode_responses=True)
login_url = 'http://haoduofuli.pw/wp-login.php'

## get a cookie
def get_cookie(account, password):
    s = requests.Session()
    payload = {
        'log': account,
        'pwd': password,
        'rememberme': "forever",
        'wp-submit': "登入",
        'redirect_to': "http://www.haoduofuli.pw/wp-admin/",
    }
    response = s.post(login_url, data=payload)
    cookies = response.cookies.get_dict()
    logger.warning("Got cookie successfully! (account: %s)" % account)
    return json.dumps(cookies)

This part should be easy to follow.

We use the requests module to submit the login form, collect the cookies, and return them serialized as JSON (if you don't serialize them, they end up in Redis as plain text and the cookie is useless when you read it back).

Problem two: writing the cookies into redis (it's distributed, so of course the other spiders need to be able to use these cookies too).

def init_cookie(red, spidername):
    redkeys = reds.keys()
    for user in redkeys:
        password = reds.get(user)
        if red.get("%s:Cookies:%s--%s" % (spidername, user, password)) is None:
            cookie = get_cookie(user, password)
            red.set("%s:Cookies:%s--%s"% (spidername, user, password), cookie)

Using the redis connection we created above, fetch all the keys in db2 (which we set to the account names) and, for each one, its value (which I set to the password).

If no cookie exists yet in redis for this spider and account, call get_cookie with the account and password we just read;

then store it in redis with the spider name plus account and password as the key and the cookie as the value.
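
The full middlewares.py below also imports update_cookie and remove_cookie from cookies.py, which are never shown. A minimal, hypothetical sketch of them, reusing the same key scheme as init_cookie, might be:

# hypothetical helpers -- not shown in the original post; they mirror init_cookie's key scheme
def update_cookie(spidername, red, accountText):
    # accountText is "user--password"; log in again and overwrite the stale cookie
    user, password = accountText.split("--")
    red.set("%s:Cookies:%s--%s" % (spidername, user, password), get_cookie(user, password))

def remove_cookie(spidername, red, accountText):
    # drop the cookie of an account that is no longer usable
    red.delete("%s:Cookies:%s" % (spidername, accountText))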

Now to override the cookie middleware. If you followed the User-Agent override above, you can probably already guess how this goes.

OK, let's keep going with middlewares.py!

class CookieMiddleware(RetryMiddleware):

    def __init__(self, settings, crawler):
        RetryMiddleware.__init__(self, settings)
        self.rconn = redis.from_url(settings['REDIS_URL'], db=1, decode_responses=True)  ## decode_responses makes redis return str instead of bytes
        init_cookie(self.rconn, crawler.spider.name)

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.settings, crawler)

    def process_request(self, request, spider):
        redisKeys = self.rconn.keys()
        while len(redisKeys) > 0:
            elem = random.choice(redisKeys)
            if spider.name + ':Cookies' in elem:
                cookie = json.loads(self.rconn.get(elem))
                request.cookies = cookie
                request.meta["accountText"] = elem.split("Cookies:")[-1]
                break

Line 1: nothing to say.

Lines 2 and 3 deserve a word; this is overriding the parent's __init__. What is it for?

Nothing deep: once you define your own __init__ in a subclass, the parent's __init__ no longer runs automatically, so you call RetryMiddleware.__init__(self, settings) explicitly to keep the parent properly initialised.

Line 4: what is settings['REDIS_URL']? That is us reading Scrapy's settings (settings behaves like a dict). How do we get access to them in the first place? More on that below.

Line 5: pushes the cookies into redis; the second argument, crawler.spider.name, is how we get hold of the spider's name.

@classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.settings, crawler)

This one may look puzzling at first. What it does: from_crawler is the classmethod Scrapy calls when it builds the middleware, and Scrapy passes it the crawler object; we use it to hand crawler.settings (and the crawler itself) to __init__. Once you see that, it clicks, right?

As for how to access the settings, the official documentation explains it in detail:

http://scrapy-chs.readthedocs.io/zh_CN/latest/topics/settings.html#how-to-access-settings

 

Here is the complete middlewares.py file:

# -*- coding: utf-8 -*-

# Define here the models for your spider middleware
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/spider-middleware.html

from scrapy import signals
import json
import redis
import random
from .useragent import agents
from .cookies import init_cookie, remove_cookie, update_cookie
from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware
from scrapy.downloadermiddlewares.retry import RetryMiddleware
import logging

logger = logging.getLogger(__name__)

class UserAgentmiddleware(UserAgentMiddleware):

    def process_request(self, request, spider):
        agent = random.choice(agents)
        request.headers["User-Agent"] = agent

class CookieMiddleware(RetryMiddleware):

    def __init__(self, settings, crawler):
        RetryMiddleware.__init__(self, settings)
        self.rconn = redis.from_url(settings['REDIS_URL'], db=1, decode_responses=True)  ## decode_responses makes redis return str instead of bytes
        init_cookie(self.rconn, crawler.spider.name)

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.settings, crawler)

    def process_request(self, request, spider):
        redisKeys = self.rconn.keys()
        while len(redisKeys) > 0:
            elem = random.choice(redisKeys)
            if spider.name + ':Cookies' in elem:
                cookie = json.loads(self.rconn.get(elem))
                request.cookies = cookie
                request.meta["accountText"] = elem.split("Cookies:")[-1]
                break
            #else:
                #redisKeys.remove(elem)

    #def process_response(self, request, response, spider):

         #"""
         # I've removed the body here; try finishing this part yourself:

         # this is where you would check whether the cookie has expired

         # and then act on it, e.g. update the cookie or remove an account that no longer works

         # It's fine if you can't write it -- the crawler still runs without it.
         #"""

MasterSpider

# coding: utf-8
from scrapy import Item, Field
from scrapy.spiders import Rule
from scrapy_redis.spiders import RedisCrawlSpider
from scrapy.linkextractors import LinkExtractor
from redis import Redis
from time import time
from urllib.parse import urlparse, parse_qs, urlencode

class MasterSpider(RedisCrawlSpider):
    name = 'ebay_master'
    redis_key = 'ebay:start_urls'

    ebay_main_lx = LinkExtractor(allow=(r'http://www.ebay.com/sch/allcategories/all-categories', ))
    ebay_category2_lx = LinkExtractor(allow=(r'http://www.ebay.com/sch/[^\s]*/\d+/i\.html',
                                             r'http://www.ebay.com/sch/[^\s]*/\d+/i\.html\?_ipg=\d+&_pgn=\d+',
                                             r'http://www.ebay.com/sch/[^\s]*/\d+/i\.html\?_pgn=\d+&_ipg=\d+',))

    rules = (
        Rule(ebay_category2_lx, callback='parse_category2', follow=False),
        Rule(ebay_main_lx, callback='parse_main', follow=False),
    )

    def __init__(self, *args, **kwargs):
        domain = kwargs.pop('domain', '')
        # self.allowed_domains = filter(None, domain.split(','))
        super(MasterSpider, self).__init__(*args, **kwargs)

    def parse_main(self, response):
        data = response.xpath("//div[@class='gcma']/ul/li/a[@class='ch']")
        data = response.xpath("//div[@class='gcma']/ul/li/a[@class='ch']")
        for d in data:
            try:
                item = LinkItem()
                item['name'] = d.xpath("text()").extract_first()
                item['link'] = d.xpath("@href").extract_first()
                yield self.make_requests_from_url(item['link'] + r"?_fsrp=1&_pppn=r1&scp=ce2")
            except:
                pass

    def parse_category2(self, response):
        data = response.xpath("//ul[@id='ListViewInner']/li/h3[@class='lvtitle']/a[@class='vip']")
        redis = Redis()
        for d in data:
            # item = LinkItem()
            try:
                self._filter_url(redis, d.xpath("@href").extract_first())

            except:
                pass
        try:
            next_page = response.xpath("//a[@class='gspr next']/@href").extract_first()
        except:
            pass
        else:
            # yield self.make_requests_from_url(next_page)
            new_url = self._build_url(response.url)
            redis.lpush("test:new_url", new_url)
            # yield self.make_requests_from_url(new_url)
            # yield Request(url, headers=self.headers, callback=self.parse2)

    def _filter_url(self, redis, url, key="ebay_slave:start_urls"):
        is_new_url = bool(redis.pfadd(key + "_filter", url))
        if is_new_url:
            redis.lpush(key, url)

    def _build_url(self, url):
        parse = urlparse(url)
        query = parse_qs(parse.query)
        base = parse.scheme + '://' + parse.netloc + parse.path

        if '_ipg' not in query.keys() or '_pgn' not in query.keys() or '_skc' in query.keys():
            # the original line here was garbled; as a reconstruction, start from page 1 with an assumed page size
            new_url = base + "?" + urlencode({"_ipg": 200, "_pgn": 1})
        else:
            new_url = base + "?" + urlencode({"_ipg": query['_ipg'][0], "_pgn": int(query['_pgn'][0]) + 1})
        return new_url

class LinkItem(Item):
    name = Field()
    link = Field()

MasterSpider inherits from RedisCrawlSpider in the scrapy-redis package. Compared with a plain Scrapy spider, the changes are:

  • redis_key
    The container for the spider's start_urls moves from a Python list to a redis list, so redis_key holds the key of that redis list
  • rules
    • rules is a tuple containing one or more Rule objects
    • The three most commonly used Rule parameters are link_extractor / callback / follow
    • link_extractor is a LinkExtractor object; it defines how links are extracted from crawled pages
    • callback is a callable or a string (in which case the spider method with that name is used). It is called for every link obtained from link_extractor, receives a response as its first argument, and returns a list containing Item and/or Request objects (or their subclasses). If a rule specifies no callback, the matched pages are not parsed and are only used for following links; if one is specified, that custom function does the parsing
    • follow is a boolean saying whether links extracted from the response under this rule should be followed. If callback is None, follow defaults to True; otherwise it defaults to False
    • process_links is a callback applied to the whole list of links extracted from the response, usually used for filtering
    • process_request preprocesses each link's request (e.g. adding headers or cookies)
  • ebay_main_lx / ebay_category2_lx
    LinkExtractor objects
    • allow (a regular expression, or a list of them) – only URLs matching it are extracted; if not given (or empty), every link matches
    • deny excludes links matching the regular expression (takes precedence over allow)
    • allow_domains domains to allow (str or list)
    • deny_domains domains to exclude (str or list)
    • restrict_xpaths only extract links from regions matching the XPath expression (str or list)
    • restrict_css only extract links from regions matching the CSS selector (str or list)
    • tags extract links from the given tags, a and area by default (str or list)
    • attrs attributes to read links from, href by default (list)
    • unique whether extracted links are deduplicated (boolean)
    • process_value a function applied to each extracted value (takes precedence over allow)
  • parse_main / parse_category2
    • Methods that parse the responses of URLs matched by the corresponding rule
  • _filter_url / _build_url
    • Small URL helper methods
  • LinkItem
    • Inherits from Item
    • An Item is a simple container for scraped data; it provides a dict-like API plus a simple syntax for declaring the available fields.

SlaveSpider

# coding: utf-8
from scrapy import Item, Field
from scrapy_redis.spiders import RedisSpider

class SlaveSpider(RedisSpider):
    name = "ebay_slave"
    redis_key = "ebay_slave:start_urls"

    def parse(self, response):
        item = ProductItem()
        item["price"] = response.xpath("//span[contains(@id,'prcIsum')]/text()").extract_first()
        item["item_id"] = response.xpath("//div[@id='descItemNumber']/text()").extract_first()
        item["seller_name"] = response.xpath("//span[@class='mbg-nw']/text()").extract_first()
        item["sold"] = response.xpath("//span[@class='vi-qtyS vi-bboxrev-dsplblk vi-qty-vert-algn vi-qty-pur-lnk']/a/text()").extract_first()
        item["cat_1"] = response.xpath("//li[@class='bc-w'][1]/a/span/text()").extract_first()
        item["cat_2"] = response.xpath("//li[@class='bc-w'][2]/a/span/text()").extract_first()
        item["cat_3"] = response.xpath("//li[@class='bc-w'][3]/a/span/text()").extract_first()
        item["cat_4"] = response.xpath("//li[@class='bc-w'][4]/a/span/text()").extract_first()
        yield item

class ProductItem(Item):
    name = Field()
    price = Field()
    sold = Field()
    seller_name = Field()
    pl_id = Field()
    cat_id = Field()
    cat_1 = Field()
    cat_2 = Field()
    cat_3 = Field()
    cat_4 = Field()
    item_id = Field()

SlaveSpider inherits from RedisSpider. Its attributes and methods are much simpler than MasterSpider's: there are no rules and a few other things are gone, but the overall workings are similar.

SlaveSpider reads the ready-made target-page requests from ebay_slave:start_urls, parses the target data out of each response, and emits it as ProductItem items.
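
To actually run the pair (shown here as an assumed example implied by the code above): push a seed URL into the master's redis_key, start the master with scrapy crawl ebay_master, and run scrapy crawl ebay_slave on each slave machine. Seeding the master's queue from Python looks like this:

import redis

# seed the MasterSpider's start queue (the key matches redis_key = 'ebay:start_urls' above)
r = redis.Redis(host='127.0.0.1', port=6379)
r.lpush('ebay:start_urls', 'http://www.ebay.com/sch/allcategories/all-categories')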

IP proxy

There are two ways to attach a proxy to requests: the first is to override your spider class's start_requests method, the second is to add a downloader middleware.

Overriding start_requests
In my spider class I overrode the start_requests method (the original post showed this as a screenshot; a sketch follows):
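
A minimal sketch of that override, assuming a single hard-coded proxy and a hypothetical spider, would be:

import scrapy

class ProxiedSpider(scrapy.Spider):
    name = 'proxied_example'          # hypothetical spider, for illustration only
    start_urls = ['http://example.com/']

    def start_requests(self):
        for url in self.start_urls:
            # attach the proxy to each initial request through request.meta
            yield scrapy.Request(url, meta={'proxy': 'http://178.33.6.236:3128'})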
One of the most common anti-crawling measures is IP blocking. To guard against the worst case you can crawl through proxy servers; in Scrapy, setting a proxy just means adding a proxy value to the Request object's meta before the request is sent.
This can be done in a middleware, in either of two ways:

 # Option 1 -------------------------------
 import base64

 class ProxyMiddleware(object):
     def process_request(self, request, spider):
         proxy = 'http://178.33.6.236:3128'     # proxy server
         request.meta['proxy'] = proxy
         proxy_user_pass = b'test:test'          # username:password (as bytes)
         request.headers['Proxy-Authorization'] = b'Basic ' + base64.b64encode(proxy_user_pass)

 # Option 2 -------------------------------
 import base64
 from scrapy.downloadermiddlewares.httpproxy import HttpProxyMiddleware

 class ProxyMiddleware(HttpProxyMiddleware):

     def process_request(self, request, spider):
         ip = '178.33.6.236:3128'                # placeholder -- in practice, take this from your proxy pool
         proxy = 'http://%s' % ip
         request.meta['proxy'] = proxy
         proxy_user_pass = b'test:test'
         request.headers['Proxy-Authorization'] = b'Basic ' + base64.b64encode(proxy_user_pass)

Then register it in the settings file:

DOWNLOADER_MIDDLEWARES = {
    'project_name.middlewares_module.ProxyMiddleware': 543,  # replace with your project name and the module that holds ProxyMiddleware
}

You can also gather a large number of proxy IPs into a proxy pool and pick one at random per request, to get around stricter IP-based limits; the approach is much like the User-Agent pool.
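
A minimal sketch of that idea (the proxy list here is a placeholder; in practice you would load it from redis or a file):

import random

PROXIES = ['http://178.33.6.236:3128', 'http://1.2.3.4:8080']  # placeholder pool

class RandomProxyMiddleware(object):
    def process_request(self, request, spider):
        # pick a different proxy at random for every outgoing request
        request.meta['proxy'] = random.choice(PROXIES)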

URL Filter

Under normal business logic a crawler should not crawl the same page twice, so duplicate requests are filtered by default. But once the crawl reaches tens of millions of pages, the memory used by the default filter grows far beyond what you would expect.
To deal with that, you can trade a tiny amount of filtering accuracy for a much smaller space footprint by switching algorithms.

Bloom Filter

A Bloom filter tests whether an element belongs to a set. Its strength is that both its space efficiency and its query time far exceed ordinary approaches; its drawbacks are a certain false-positive rate and the difficulty of deleting elements.

HyperLogLog

HyperLogLog is a cardinality-estimation algorithm. It is extremely space-efficient: with about 1.5 KB of memory it can estimate the cardinality of sets of well over a billion elements with an error of no more than about 2%.

Either algorithm is a suitable choice. Taking HyperLogLog as the example:
redis already provides a HyperLogLog data structure, so all we need to do is operate on it.

_filter_url in MasterSpider implements the URL filtering:

def _filter_url(self, redis, url, key="ebay_slave:start_urls"):
    is_new_url = bool(redis.pfadd(key + "_filter", url))
    if is_new_url:
        redis.lpush(key, url)

When redis.pfadd() runs, the url is offered to the HyperLogLog structure: it returns 0 if the url has (in all likelihood) been seen before, and 1 if it is new. That return value decides whether the url gets pushed onto the pending-crawl queue.
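
A quick check of that behaviour from a Python shell (the key name matches the default used by _filter_url above; the test url is arbitrary):

import redis

r = redis.Redis()
print(r.pfadd('ebay_slave:start_urls_filter', 'http://www.ebay.com/itm/1'))  # 1 -- new url, would be queued
print(r.pfadd('ebay_slave:start_urls_filter', 'http://www.ebay.com/itm/1'))  # 0 -- already seen, skipped
print(r.pfcount('ebay_slave:start_urls_filter'))                             # approximate count of distinct urls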