
Scrapy Crawler in Practice: Accessing Sites Through a Proxy


Previously, simply setting the request headers was enough to fool ip138.com, but most more sophisticated sites are not so easily deceived. For those we need something stronger. If money is no object, you can buy a VPN; otherwise you can fall back on free proxies. Here we will try using a proxy.

Setting a Proxy in Downloader Middleware

middlewares.py

import random

from tutorial.settings import PROXIES

class TutorialDownloaderMiddleware(object):
    def process_request(self, request, spider):
        # Attach a randomly chosen proxy to every outgoing request
        request.meta['proxy'] = "http://%s" % random.choice(PROXIES)
        return None

settings.py

PROXIES = [
    '113.59.59.73:58451',
    '113.214.13.1:8000',
    '119.7.192.27:8088',
    '60.13.74.183:80',
    '110.180.90.181:8088',
    '125.38.239.199:8088'
]

# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    'tutorial.middlewares.TutorialDownloaderMiddleware': 1,
}
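You can exercise the middleware logic without running a full crawl. Below is a minimal sketch in plain Python (outside Scrapy; `DummyRequest` is a hypothetical stand-in for `scrapy.Request`, just enough to show how the random proxy ends up in `request.meta`):

```python
import random

PROXIES = [
    '113.59.59.73:58451',
    '113.214.13.1:8000',
    '119.7.192.27:8088',
]

class DummyRequest(object):
    """Hypothetical stand-in for scrapy.Request, for demo purposes only."""
    def __init__(self, url):
        self.url = url
        self.meta = {}

class TutorialDownloaderMiddleware(object):
    def process_request(self, request, spider):
        # Same logic as the real middleware: attach a random proxy
        request.meta['proxy'] = "http://%s" % random.choice(PROXIES)
        return None

request = DummyRequest('http://2018.ip138.com/ic.asp')
TutorialDownloaderMiddleware().process_request(request, spider=None)
print(request.meta['proxy'])  # e.g. http://113.214.13.1:8000
```

Returning `None` from `process_request` tells Scrapy to continue processing the request through the remaining middlewares and the downloader.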

spider

# -*- coding: utf-8 -*-
import scrapy

class Ip138Spider(scrapy.Spider):
    name = 'ip138'
    allowed_domains = ['www.ip138.com', '2018.ip138.com']
    start_urls = ['http://2018.ip138.com/ic.asp']

    def parse(self, response):
        print("*" * 40)
        print("response text: %s" % response.text)
        print("response headers: %s" % response.headers)
        print("response meta: %s" % response.meta)
        print("request headers: %s" % response.request.headers)
        print("request cookies: %s" % response.request.cookies)
        print("request meta: %s" % response.request.meta)

The output is as follows:
(screenshot of the middleware run output)

Setting the Proxy via Request meta

# -*- coding: utf-8 -*-
import scrapy

class Ip138Spider(scrapy.Spider):
    name = 'ip138'
    allowed_domains = ['www.ip138.com','2018.ip138.com']
    start_urls = ['http://2018.ip138.com/ic.asp']

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, meta={'proxy': 'http://116.62.134.173:9999'}, callback=self.parse)

    def parse(self, response):
        print("*" * 40)
        print("response text: %s" % response.text)
        print("response headers: %s" % response.headers)
        print("response meta: %s" % response.meta)
        print("request headers: %s" % response.request.headers)
        print("request cookies: %s" % response.request.cookies)
        print("request meta: %s" % response.request.meta)
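The spider above pins one fixed proxy; if you want a different proxy per request, `start_requests` can rotate through a pool. Here is a minimal sketch of just the rotation logic in plain Python (no Scrapy dependency; the proxy addresses are illustrative):

```python
from itertools import cycle

# Illustrative proxy pool; replace with your own working proxies
PROXY_POOL = cycle([
    'http://116.62.134.173:9999',
    'http://113.214.13.1:8000',
])

def proxy_meta():
    """Build the meta dict for the next request, rotating proxies."""
    return {'proxy': next(PROXY_POOL)}

print(proxy_meta())  # {'proxy': 'http://116.62.134.173:9999'}
print(proxy_meta())  # {'proxy': 'http://113.214.13.1:8000'}
print(proxy_meta())  # wraps around to the first proxy again
```

Inside a spider you would pass this dict as the `meta=` argument when yielding each `scrapy.Request`.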

The output:
(screenshot of the spider run output)

Kuaidaili Free Proxies

We can pick up some free proxies from https://www.kuaidaili.com/free/inha/
(screenshot of the kuaidaili free proxy list)

The free proxies are fairly slow, but they are still usable.
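Free proxies also die quickly, so it is worth filtering the list before putting it into `PROXIES`. A rough sketch of a liveness check (a plain TCP connect with a short timeout; it only proves the port answers, not that the proxy actually forwards traffic; the candidate addresses are from the article's list and are likely dead by now):

```python
import socket

def proxy_alive(proxy, timeout=2.0):
    """Return True if the proxy's host:port accepts a TCP connection."""
    host, port = proxy.split(':')
    try:
        # socket.timeout and ConnectionRefusedError are both OSError subclasses
        with socket.create_connection((host, int(port)), timeout=timeout):
            return True
    except OSError:
        return False

candidates = ['113.59.59.73:58451', '113.214.13.1:8000']
PROXIES = [p for p in candidates if proxy_alive(p, timeout=1.0)]
```

Run this periodically, since a proxy that worked an hour ago may already be gone.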

GitHub source code