A Brief Look at User Agents in Scrapy (Two Methods)

A quick overview of the user agent

The User Agent (UA for short) is a special header string that lets a server identify the client's operating system and version, CPU type, browser and version, browser rendering engine, browser language, browser plugins, and so on.

Getting started (testing what different user agents get back)

Mobile user agent for testing: Mozilla/5.0 (Linux; U; Android 0.5; en-us) AppleWebKit/522 (KHTML, like Gecko) Safari/419.3

Desktop user agent for testing: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.62 Safari/537.36
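Before bringing Scrapy into it, here is a minimal sketch (not from the original post) of how either string would be attached to a plain request with the standard library; `build_request` is a hypothetical helper and the URL is only illustrative:

```python
import urllib.request

MOBILE_UA = ('Mozilla/5.0 (Linux; U; Android 0.5; en-us) '
             'AppleWebKit/522 (KHTML, like Gecko) Safari/419.3')
DESKTOP_UA = ('Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 '
              '(KHTML, like Gecko) Chrome/62.0.3202.62 Safari/537.36')

def build_request(url, user_agent):
    # The User-Agent header is what the server inspects to pick a template
    return urllib.request.Request(url, headers={'User-Agent': user_agent})

mobile_req = build_request('http://www.baidu.com/', MOBILE_UA)
# urllib stores header keys capitalized, hence 'User-agent' here
print(mobile_req.get_header('User-agent'))
```

Opening the two requests and comparing the body sizes is exactly the `len()` check used later in the post.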

1. Create a new Scrapy project (using Baidu as the example):

scrapy startproject myspider

scrapy genspider bdspider www.baidu.com

2. Enable the user agent in settings.py

# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'testspider (+http://www.yourdomain.com)'

Set USER_AGENT to the mobile string and then the desktop string in turn. The mobile page returns less content than the desktop page, so a quick len() comparison of the response body is enough to tell them apart.

3. Write the spider and compare the two user agents

class BdspiderSpider(scrapy.Spider):
    # name matches the "scrapy genspider bdspider" command above
    name = 'bdspider'
    allowed_domains = ['www.baidu.com']
    start_urls = ['http://www.baidu.com/']

    def parse(self, response):
        # Mobile vs. desktop UAs give clearly different body lengths
        print(len(response.text))


The difference between the mobile and desktop runs is obvious, which shows that Baidu uses the user agent to detect the client type and return different content.

Now for the key point: to avoid triggering anti-crawling measures, a single fixed user agent is clearly not enough for a crawler.

So how do we rotate through a large pool of user agents?

There are many ways to handle this; two are covered here:

Method 1: define a user agent list in settings.py and pick from it randomly (let settings do the choosing).

Method 2: define the list in settings.py, create a class in middlewares.py, and enable it via DOWNLOADER_MIDDLEWARES (let the downloader middleware do the choosing).

Method 1: randomly selecting the user agent in settings

Create the user agent list in settings.py,

then import random and pick an entry with the choice() function:

import random
# user agent list
USER_AGENT_LIST = [
    'MSIE (MSIE 6.0; X11; Linux; i686) Opera 7.23',
    'Opera/9.20 (Macintosh; Intel Mac OS X; U; en)',
    'Opera/9.0 (Macintosh; PPC Mac OS X; U; en)',
    'iTunes/9.0.3 (Macintosh; U; Intel Mac OS X 10_6_2; en-ca)',
    'Mozilla/4.76 [en_jp] (X11; U; SunOS 5.8 sun4u)',
    'iTunes/4.2 (Macintosh; U; PPC Mac OS X 10.2)',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:5.0) Gecko/20100101 Firefox/5.0',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:9.0) Gecko/20100101 Firefox/9.0',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.8; rv:16.0) Gecko/20120813 Firefox/16.0',
    'Mozilla/4.77 [en] (X11; I; IRIX;64 6.5 IP30)',
    'Mozilla/4.8 [en] (X11; U; SunOS; 5.7 sun4u)'
]
# pick one user agent at random (evaluated once, when settings are loaded)
USER_AGENT = random.choice(USER_AGENT_LIST)
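One caveat worth noting: settings.py is a plain Python module evaluated once at startup, so random.choice here fixes a single user agent for the whole run; the value varies between runs, not between requests. A short sketch of that behavior:

```python
import random

USER_AGENT_LIST = [
    'Opera/9.20 (Macintosh; Intel Mac OS X; U; en)',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:5.0) Gecko/20100101 Firefox/5.0',
    'Mozilla/4.8 [en] (X11; U; SunOS; 5.7 sun4u)',
]

# Evaluated once, just like a settings.py module: one UA per run
USER_AGENT = random.choice(USER_AGENT_LIST)

# Every request issued during this run reuses the same value
picks = [USER_AGENT for _ in range(3)]
assert len(set(picks)) == 1
assert USER_AGENT in USER_AGENT_LIST
```

If you want a different user agent per request, that is exactly what method 2's middleware provides.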

Write the spider:

# -*- coding: utf-8 -*-
import scrapy

class BdspiderSpider(scrapy.Spider):
    name = 'bdspider'
    allowed_domains = ['www.baidu.com']
    start_urls = ['http://www.baidu.com']

    def parse(self, response):
        # Show which user agent this request actually went out with
        print(response.request.headers['User-Agent'])

Comparing the results

Running the spider several times shows a different user agent on each run (within a single run it stays fixed, since settings are only read once).


Method 2: choosing the user agent in a downloader middleware

Comment out USER_AGENT in settings.py so it does not interfere.

Create a class in middlewares.py:

import random

class UserAgentMiddleware(object):
    def __init__(self):
        self.user_agent_list = [
            'MSIE (MSIE 6.0; X11; Linux; i686) Opera 7.23',
            'Opera/9.20 (Macintosh; Intel Mac OS X; U; en)',
            'Opera/9.0 (Macintosh; PPC Mac OS X; U; en)',
            'iTunes/9.0.3 (Macintosh; U; Intel Mac OS X 10_6_2; en-ca)',
            'Mozilla/4.76 [en_jp] (X11; U; SunOS 5.8 sun4u)',
            'iTunes/4.2 (Macintosh; U; PPC Mac OS X 10.2)',
            'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:5.0) Gecko/20100101 Firefox/5.0',
            'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:9.0) Gecko/20100101 Firefox/9.0',
            'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.8; rv:16.0) Gecko/20120813 Firefox/16.0',
            'Mozilla/4.77 [en] (X11; I; IRIX;64 6.5 IP30)',
            'Mozilla/4.8 [en] (X11; U; SunOS; 5.7 sun4u)'
        ]

    def process_request(self, request, spider):
        # Pick a fresh user agent for every outgoing request
        request.headers['User-Agent'] = random.choice(self.user_agent_list)
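The class above can be exercised without running a full crawl. In this sketch, FakeRequest is a hypothetical stand-in (not a Scrapy API) that mimics just enough of scrapy's Request, namely a headers mapping, to show process_request mutating the header:

```python
import random

class UserAgentMiddleware(object):
    # Trimmed copy of the middleware above, with a short list for the demo
    def __init__(self):
        self.user_agent_list = [
            'Opera/9.20 (Macintosh; Intel Mac OS X; U; en)',
            'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:5.0) Gecko/20100101 Firefox/5.0',
        ]

    def process_request(self, request, spider):
        request.headers['User-Agent'] = random.choice(self.user_agent_list)

class FakeRequest(object):
    # Hypothetical stand-in for scrapy.Request: only headers matter here
    def __init__(self):
        self.headers = {}

mw = UserAgentMiddleware()
req = FakeRequest()
mw.process_request(req, spider=None)
assert req.headers['User-Agent'] in mw.user_agent_list
```

Returning None (implicitly, as here) from process_request tells Scrapy to continue processing the request through the remaining middlewares.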

Enable the downloader middleware in settings.py:

DOWNLOADER_MIDDLEWARES = {
    'myspider.middlewares.UserAgentMiddleware': 300,
}

Run the test again and compare the results.


And that wraps up both methods.