
Learning by Doing: Integrating an IP Proxy into scrapy (Using Abuyun as an Example)

1. Foreword

I had a project that needed to crawl a securities association site, and the site blocks IPs. So I had to implement automatic IP switching in scrapy in order to finish the crawl.

Before this, I had tried the third-party library scrapy-proxys together with Zhima IP's proxy API; it probably failed because my code was not set up correctly (I will test it again when I get a chance).
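For reference, that kind of per-request rotation is usually wired up roughly like the sketch below. This is not my original attempt and not scrapy-proxys itself; the API URL and class name are hypothetical, and it only illustrates the idea of pulling a proxy list from an API and setting request.meta["proxy"] on each request.

import random

import requests


class RandomProxyMiddleware(object):
    """ Rough sketch of a per-request rotating-proxy middleware (hypothetical). """

    def __init__(self):
        # Hypothetical API endpoint that returns one "host:port" per line.
        api_url = "http://proxy-api.example.com/get_ips"
        self.proxy_list = requests.get(api_url).text.split()

    def process_request(self, request, spider):
        # Attach a randomly chosen proxy to every outgoing request.
        request.meta["proxy"] = "http://" + random.choice(self.proxy_list)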

2. Abuyun's Examples

Abuyun's official documentation provides sample code for both plain Python and scrapy.

Python 3 example

    from urllib import request

    # Target page to fetch
    targetUrl = "http://test.abuyun.com/proxy.php"

    # Proxy server
    proxyHost = "http-dyn.abuyun.com"
    proxyPort = "9020"

    # Proxy tunnel credentials
    proxyUser = "H01234567890123D"
    proxyPass = "0123456789012345"

    proxyMeta = "http://%(user)s:%(pass)s@%(host)s:%(port)s" % {
        "host": proxyHost,
        "port": proxyPort,
        "user": proxyUser,
        "pass": proxyPass,
    }

    proxy_handler = request.ProxyHandler({
        "http": proxyMeta,
        "https": proxyMeta,
    })

    # auth = request.HTTPBasicAuthHandler()
    # opener = request.build_opener(proxy_handler, auth, request.HTTPHandler)
    opener = request.build_opener(proxy_handler)

    request.install_opener(opener)
    resp = request.urlopen(targetUrl).read()
    print(resp)

The code above is the plain-Python approach; below is the scrapy middleware version.

scrapy middleware example

    import base64

    # Proxy server
    proxyServer = "http://http-dyn.abuyun.com:9020"

    # Proxy tunnel credentials
    proxyUser = "H01234567890123D"
    proxyPass = "0123456789012345"

    # for Python 2
    proxyAuth = "Basic " + base64.b64encode(proxyUser + ":" + proxyPass)

    # for Python 3
    # proxyAuth = "Basic " + base64.urlsafe_b64encode(bytes((proxyUser + ":" + proxyPass), "ascii")).decode("utf8")

    class ProxyMiddleware(object):
        def process_request(self, request, spider):
            request.meta["proxy"] = proxyServer
            request.headers["Proxy-Authorization"] = proxyAuth

For scrapy, this just goes into the project's middlewares file.

3. The Actual Integration

Add the following class to the project's middlewares.py:

import base64

""" 阿布雲ip代理配置,包括賬號密碼 """
proxyServer = "http://http-dyn.abuyun.com:9020"
proxyUser = "HWFHQ5YP14Lxxx"
proxyPass = "CB8D0AD56EAxxx"
# for Python3
proxyAuth = "Basic " + base64.urlsafe_b64encode(bytes((proxyUser + ":" + proxyPass), "ascii")).decode("utf8")


class ABProxyMiddleware(object):
    """ 阿布雲ip代理配置 """
    def process_request(self, request, spider):
        request.meta["proxy"] = proxyServer
        request.headers["Proxy-Authorization"] = proxyAuth

Then enable the middleware in settings.py:

DOWNLOADER_MIDDLEWARES = {
    # 'Securities.middlewares.SecuritiesDownloaderMiddleware': None,
    'Securities.middlewares.ABProxyMiddleware': 1,
}
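With the middleware enabled, a quick sanity check (my own addition, not from the original post) is a throwaway spider that fetches Abuyun's test page from the sample above and logs what comes back; the spider name here is made up.

import scrapy


class ProxyCheckSpider(scrapy.Spider):
    """ Hypothetical one-off spider to confirm requests go through the proxy tunnel. """
    name = "proxy_check"
    start_urls = ["http://test.abuyun.com/proxy.php"]

    def parse(self, response):
        # The test page should report the proxy's egress IP rather than your own.
        self.logger.info("Response via proxy: %s", response.text)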

4. Caveats

Abuyun's dynamic IP tunnel allows 5 requests per second by default (you can pay extra for a higher rate). So while I am on the default 5 requests/second, I need to throttle the spider. Still in settings.py, add the following anywhere:

""" 啟用限速設定 """
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 0.2  # 初始下載延遲
DOWNLOAD_DELAY = 0.2  # 每次請求間隔時間
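One caveat of my own (not from the original post): DOWNLOAD_DELAY spaces out requests per download slot, so a spider running with a lot of concurrency could still burst past 5 requests per second. Capping concurrency in settings.py as well is a reasonable extra safeguard; the exact numbers below are assumptions.

CONCURRENT_REQUESTS = 5              # overall cap on in-flight requests
CONCURRENT_REQUESTS_PER_DOMAIN = 5   # per-domain cap
AUTOTHROTTLE_MAX_DELAY = 10          # upper bound AutoThrottle may back off to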

Of course, if you do pay for a higher request rate, throttling is not a concern at all.

That completes the integration of Abuyun's dynamic proxy IPs into scrapy. Happy crawling!