python爬蟲之Scrapy框架:兩種隨機選擇User-Agent的方法
阿新 • • 發佈:2019-01-24
修改請求時的User-Agent一般有兩種思路:
一是修改setting中的User-Agent變數(適用於極少量的agent更換,一般不採用);另一種就是通過Downloader Middleware的process_request()方法來修改,即在middlewares.py裡面新增一個RandomUserAgentMiddleware的類.
首先介紹下scrapy框架預設的UserAgentMiddleware:
from scrapy import signals class UserAgentMiddleware(object): ''' 設定User-Agent ''' def __init__(self, user_agent='Scrapy'): self.user_agent = user_agent @classmethod def from_crawler(cls, crawler): o = cls(crawler.setting['USER-AGENT']) crawler.signals.connect(o.spider_opened, signal=signals.spider_opened) return o def spider_opened(self, spider): self.user_agent = getattr(spider, 'user-agent', self.user_agent) def process_request(self, request, spider): if self.user_agent: request.headers.setdefault(b'User-Agent', self.user_agent)
在不設定agent時,呼叫時返回的agent為:
"User-Agent": "Scrapy/1.5.0 (+https://scrapy.org)"
下面設定,介紹第二種思路的新增UserAgentMiddleware的類的三種形式:
1.第一種方法:
在useragent檔案下新增加一個user_agents模組程式碼,程式碼如下(為了節省空間,拿出部分agents作為參考):
# encoding=utf-8 """ User-Agents """ user_agents = [ "Mozilla/5.0 (Linux; U; Android 2.3.6; en-us; Nexus S Build/GRK39F) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1", "Avant Browser/1.2.789rel1 (http://www.avantbrowser.com)", "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/532.5 (KHTML, like Gecko) Chrome/4.0.249.0 Safari/532.5", "Mozilla/5.0 (Windows; U; Windows NT 5.2; en-US) AppleWebKit/532.9 (KHTML, like Gecko) Chrome/5.0.310.0 Safari/532.9", "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) AppleWebKit/534.7 (KHTML, like Gecko) Chrome/7.0.514.0 Safari/534.7", "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US) AppleWebKit/534.14 (KHTML, like Gecko) Chrome/9.0.601.0 Safari/534.14", "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.14 (KHTML, like Gecko) Chrome/10.0.601.0 Safari/534.14", "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.20 (KHTML, like Gecko) Chrome/11.0.672.2 Safari/534.20", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.27 (KHTML, like Gecko) Chrome/12.0.712.0 Safari/534.27", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/13.0.782.24 Safari/535.1", "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.874.120 Safari/535.2", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.36 Safari/535.7", "Mozilla/5.0 (Windows; U; Windows NT 6.0 x64; en-US; rv:1.9pre) Gecko/2008072421 Minefield/3.0.2pre", "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.10) Gecko/2009042316 Firefox/3.0.10", "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-GB; rv:1.9.0.11) Gecko/2009060215 Firefox/3.0.11 (.NET CLR 3.5.30729)", "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6 GTB5", "Mozilla/5.0 (Windows; U; Windows NT 5.1; tr; rv:1.9.2.8) Gecko/20100722 Firefox/3.6.8 ( .NET CLR 3.5.30729; .NET4.0E)", "Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1", "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:2.0.1) Gecko/20100101 Firefox/4.0.1", "Mozilla/5.0 (Windows NT 5.1; rv:5.0) Gecko/20100101 Firefox/5.0", "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:6.0a2) Gecko/20110622 Firefox/6.0a2", "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:7.0.1) Gecko/20100101 Firefox/7.0.1", "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:2.0b4pre) Gecko/20100815 Minefield/4.0b4pre", "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0 )", ]
接著,在middleware.py內新增一個RandomUserAgentMiddleware類,程式碼實現如下:
# -*- coding: utf-8 -*- import random from useragent.user_agents import user_agents class RandomUserAgentMiddelware(object): """ 換User-Agent """ def process_request(self, request, spider): request.headers['User-Agent'] = random.choice(user_agents)
最後,將setting中的DOWNLOADER_MIDDLEWARES取消註釋:將其設定為:
DOWNLOADER_MIDDLEWARES = {
'useragent.middlewares.RandomUserAgentMiddelware': 543,
}
這樣scrapy的agents隨機選擇就實現了,執行程式得到的agent都不一樣,證明成功.
"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24"
2.第二種方法:
import random
class RandomUserAgentMiddelware(object):
def __init__(self):
self.user_agent = [
"Mozilla/5.0 (Linux; U; Android 2.3.6; en-us; Nexus S Build/GRK39F) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
"Avant Browser/1.2.789rel1 (http://www.avantbrowser.com)",
"Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/532.5 (KHTML, like Gecko) Chrome/4.0.249.0 Safari/532.5",
"Mozilla/5.0 (Windows; U; Windows NT 5.2; en-US) AppleWebKit/532.9 (KHTML, like Gecko) Chrome/5.0.310.0 Safari/532.9",
"Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) AppleWebKit/534.7 (KHTML, like Gecko) Chrome/7.0.514.0 Safari/534.7",
"Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US) AppleWebKit/534.14 (KHTML, like Gecko) Chrome/9.0.601.0 Safari/534.14",
"Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.14 (KHTML, like Gecko) Chrome/10.0.601.0 Safari/534.14",
"Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.20 (KHTML, like Gecko) Chrome/11.0.672.2 Safari/534.20",
]
def process_request(self, request, spider):
request.headers['User-Agent'] = random.choice(self.user_agent)
然後將setting中的DOWNLOADER_MIDDLEWARES取消註釋:將其設定為:
DOWNLOADER_MIDDLEWARES = {
'useragent.middlewares.RandomUserAgentMiddelware': 543,
}