Setting Up Proxies in the Scrapy Framework

阿新 • Published: 2017-06-04


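When all requests come from a single IP, NetEase Music frequently answers with HTTP status code 503.
On investigation, the 503 means that the request volume from a single IP has exceeded a limit; this is presumably one of NetEase Music's anti-scraping measures.
The original music download program is built on the Scrapy framework, so the problem has to be solved by using proxies inside Scrapy.
There are two ways to use a proxy in Scrapy: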

1. Use a middleware
2. Set the meta parameter of the Request class directly

Each approach is briefly described in turn below.

Approach 1: use a middleware
This takes the following two steps:

Activate the proxy middleware ProxyMiddleware in settings.py
Implement the class ProxyMiddleware in middlewares.py
1. In settings.py:
# settings.py

DOWNLOADER_MIDDLEWARES = {
    'project_name.middlewares.ProxyMiddleware': 100,  # remember to change project_name
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
}
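Notes:
The numbers 100 and 110 set the order in which the middlewares are called. For process_request, the smaller the number, the earlier the middleware is called, so ProxyMiddleware sets the proxy before HttpProxyMiddleware reads it.
The official documentation describes the ordering convention as follows (the passage is written about item pipelines, which use the same integer scheme):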

The integer values you assign to classes in this setting determine the order in which they run: items go through from lower valued to higher valued classes. It’s customary to define these numbers in the 0-1000 range.
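
To make the ordering rule concrete, here is a minimal sketch in plain Python that lists middleware paths in the order their process_request methods would run. The helper name request_order is made up for this illustration and is not part of Scrapy; the real framework also merges in its built-in DOWNLOADER_MIDDLEWARES_BASE defaults, which this sketch ignores.

```python
# Illustration only: mimics how integer priorities order the middlewares.
DOWNLOADER_MIDDLEWARES = {
    "project_name.middlewares.ProxyMiddleware": 100,
    "scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware": 110,
}


def request_order(middlewares):
    """Return middleware paths sorted by ascending priority.

    A value of None means the middleware is disabled, so it is skipped.
    """
    enabled = {path: prio for path, prio in middlewares.items() if prio is not None}
    return sorted(enabled, key=enabled.get)


for path in request_order(DOWNLOADER_MIDDLEWARES):
    print(path)
```

With the settings above, ProxyMiddleware is listed first, which is what lets it fill in request.meta['proxy'] before HttpProxyMiddleware uses it.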

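2. middlewares.py looks something like this:
The proxy changes on every request.

Here the proxy is fetched with a GET request to an online API. (This requires an APIKEY, which you get by registering a free account. The APIKEY below is my own, and I cannot guarantee it will stay valid!)
Proxies can also be scraped from the web on the fly,
or read from a local file.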
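# middlewares.py

import requests


class ProxyMiddleware(object):

    def process_request(self, request, spider):
        APIKEY = 'f95f08afc952c034cc2ff9c5548d51be'
        url = 'https://www.proxicity.io/api/v1/{}/proxy'.format(APIKEY)  # online API endpoint
        r = requests.get(url, timeout=10)  # blocking call, made once per request

        # Format is scheme://IP:port (e.g. http://5.39.85.100:30059)
        request.meta['proxy'] = r.json()['curl']
        # Return None so Scrapy keeps processing this request; returning the
        # request itself would send it back to the scheduler instead.
        return None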
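
For the "read from a local file" option mentioned above, a middleware could look roughly like the sketch below. It assumes a text file with one proxy URL per line; the names load_proxies, FileProxyMiddleware, and the PROXY_LIST_FILE setting are hypothetical and chosen only for this example. Scrapy downloader middlewares are plain classes, so nothing needs to be imported from Scrapy here.

```python
import random
from pathlib import Path


def load_proxies(path):
    """Read one proxy URL per line, skipping blank lines and '#' comments."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    stripped = (line.strip() for line in lines)
    return [line for line in stripped if line and not line.startswith("#")]


class FileProxyMiddleware:
    """Downloader middleware that assigns a random proxy from a local list."""

    def __init__(self, proxies):
        if not proxies:
            raise ValueError("proxy list is empty")
        self.proxies = list(proxies)

    @classmethod
    def from_crawler(cls, crawler):
        # PROXY_LIST_FILE is a hypothetical custom setting, not a built-in one.
        path = crawler.settings.get("PROXY_LIST_FILE", "proxies.txt")
        return cls(load_proxies(path))

    def process_request(self, request, spider):
        # Keep a proxy the spider already chose; otherwise pick one at random.
        request.meta.setdefault("proxy", random.choice(self.proxies))
        return None  # let Scrapy continue processing this request
```

Because the file is read once in from_crawler, this variant avoids a blocking HTTP call on every request, at the cost of not noticing changes to the file while the crawl is running.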
Approach 2: set the meta parameter of the Request class directly
import random

import scrapy

# Proxy pool prepared in advance
proxy_pool = [
    'http://proxy_ip1:port',
    'http://proxy_ip2:port',
    # ...
    'http://proxy_ipn:port',
]


class MySpider(scrapy.Spider):
    name = "my_spider"

    allowed_domains = ["example.com"]

    start_urls = [
        'http://www.example.com/articals/',
    ]

    def start_requests(self):
        for url in self.start_urls:
            proxy_addr = random.choice(proxy_pool)  # pick one at random
            # Attach the proxy through the meta parameter
            yield scrapy.Request(url, callback=self.parse, meta={'proxy': proxy_addr})

    def parse(self, response):
        # doing parse
        pass
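Further reading
1. The official documentation's description of the Request class shows that, besides the proxy, you can also set method, headers, cookies, encoding, and so on:

class scrapy.http.Request(url[, callback, method='GET', headers, body, cookies, meta, encoding='utf-8', priority=0, dont_filter=False, errback])

2. The official documentation's list of keys that can be set in Request.meta: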

dont_redirect
dont_retry
handle_httpstatus_list
handle_httpstatus_all
dont_merge_cookies (see cookies parameter of Request constructor)
cookiejar
dont_cache
redirect_urls
bindaddress
dont_obey_robotstxt
download_timeout
download_maxsize
proxy
For example, to set a random request header and a random proxy:

# my_spider.py

import random

import scrapy

# Proxy pool collected in advance
proxy_pool = [
    'http://proxy_ip1:port',
    'http://proxy_ip2:port',
    # ...
    'http://proxy_ipn:port',
]

# Headers collected in advance
headers_pool = [
    {'User-Agent': 'Mozilla 1.0'},
    {'User-Agent': 'Mozilla 2.0'},
    {'User-Agent': 'Mozilla 3.0'},
    {'User-Agent': 'Mozilla 4.0'},
    {'User-Agent': 'Chrome 1.0'},
    {'User-Agent': 'Chrome 2.0'},
    {'User-Agent': 'Chrome 3.0'},
    {'User-Agent': 'Chrome 4.0'},
    {'User-Agent': 'IE 1.0'},
    {'User-Agent': 'IE 2.0'},
    {'User-Agent': 'IE 3.0'},
    {'User-Agent': 'IE 4.0'},
]


class MySpider(scrapy.Spider):
    name = "my_spider"

    allowed_domains = ["example.com"]

    start_urls = [
        'http://www.example.com/articals/',
    ]

    def start_requests(self):
        for url in self.start_urls:
            headers = random.choice(headers_pool)  # pick random headers
            proxy_addr = random.choice(proxy_pool)  # pick a random proxy
            yield scrapy.Request(url, callback=self.parse, headers=headers, meta={'proxy': proxy_addr})

    def parse(self, response):
        # doing parse
        pass
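
Since the original problem was a 503 from a rate-limited IP, one possible extension is to retry a failed URL through a different proxy. The sketch below only builds the keyword arguments for scrapy.Request, so it runs without Scrapy installed. The helper name build_request_kwargs and the 15-second timeout are assumptions made for this illustration; the meta keys used are taken from the list above.

```python
import random


def build_request_kwargs(url, proxy_pool, headers_pool, failed_proxy=None, rng=random):
    """Build keyword arguments for scrapy.Request with random proxy and headers.

    If failed_proxy is given, it is avoided when another proxy is available.
    """
    candidates = [p for p in proxy_pool if p != failed_proxy] or list(proxy_pool)
    return {
        "url": url,
        "headers": rng.choice(headers_pool),
        "meta": {
            "proxy": rng.choice(candidates),
            "handle_httpstatus_list": [503],  # let 503 responses reach the callback
            "dont_retry": True,  # skip the built-in retry, which reuses the same proxy
            "download_timeout": 15,  # seconds; an arbitrary value for this sketch
        },
        # A retry targets a URL that was already seen, so bypass the duplicate filter.
        "dont_filter": failed_proxy is not None,
    }
```

In a spider, start_requests could then yield scrapy.Request(callback=self.parse, **build_request_kwargs(url, proxy_pool, headers_pool)), and parse could check response.status == 503 and yield a new request built with failed_proxy=response.meta['proxy']. A real crawler would also want to cap the number of such retries per URL.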
