
Scrapy Learning 2: Spider Middleware and Downloader Middleware, Adding Proxies


Middleware

Note: all of these middlewares live in the project's middlewares.py.

Downloader middleware


Purpose

A downloader middleware sits between the Scrapy engine and the downloader, so it can inspect or modify every outgoing request and every incoming response (attaching a proxy, for example).
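A minimal sketch of the downloader-middleware hooks (the method names are Scrapy's; the class name is arbitrary):

class MyDownloaderMiddleware(object):
    def process_request(self, request, spider):
        # called for every outgoing request before it reaches the downloader;
        # return None to continue, or return a Response/Request to short-circuit
        return None

    def process_response(self, request, response, spider):
        # called for every downloaded response; must return a Response (or a Request to retry)
        return response

    def process_exception(self, request, exception, spider):
        # called when downloading (or an earlier process_request) raises an exception
        return None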

Example:

When the crawler's IP gets blocked, add a proxy.

Method 1: the built-in proxy support

# -*- coding: utf-8 -*-
import os
import scrapy
from scrapy.http import Request


class ChoutiSpider(scrapy.Spider):
    name = 'chouti'
    allowed_domains = ['chouti.com']
    start_urls = ['https://dig.chouti.com/']

    def start_requests(self):
        # the built-in HttpProxyMiddleware picks the proxy up from the environment
        os.environ['HTTP_PROXY'] = "http://192.168.11.11"
        for url in self.start_urls:
            yield Request(url=url, callback=self.parse)

    def parse(self, response):
        print(response)
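Scrapy's built-in HttpProxyMiddleware also honours a proxy set directly on request.meta, so the proxy can be attached per request instead of process-wide through environment variables. A minimal sketch of an alternative start_requests (same placeholder proxy address as above):

    def start_requests(self):
        for url in self.start_urls:
            # per-request proxy, picked up by the built-in HttpProxyMiddleware
            yield Request(url=url, callback=self.parse,
                          meta={'proxy': "http://192.168.11.11"})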

Method 2: add proxies with a custom downloader middleware (used when you have several proxies and want to rotate through them randomly to avoid getting blocked). This is the approach used most often.


Then add the custom proxy class and its process_request method in middlewares.py.

The code is as follows:

import random
import base64
import six


def to_bytes(text, encoding=None, errors='strict'):
    """Return the binary representation of `text`.
    If `text` is already a bytes object, return it as-is."""
    if isinstance(text, bytes):
        return text
    if not isinstance(text, six.string_types):
        raise TypeError('to_bytes must receive a unicode, str or bytes '
                        'object, got %s' % type(text).__name__)
    if encoding is None:
        encoding = 'utf-8'
    return text.encode(encoding, errors)


class MyProxyDownloaderMiddleware(object):
    def process_request(self, request, spider):
        proxy_list = [
            {'ip_port': '111.11.228.75:80', 'user_pass': 'xxx:123'},
            {'ip_port': '120.198.243.22:80', 'user_pass': ''},
            {'ip_port': '111.8.60.9:8123', 'user_pass': ''},
            {'ip_port': '101.71.27.120:80', 'user_pass': ''},
            {'ip_port': '122.96.59.104:80', 'user_pass': ''},
            {'ip_port': '122.224.249.122:8088', 'user_pass': ''},
        ]
        # pick a random proxy for each request
        proxy = random.choice(proxy_list)
        if proxy['user_pass'] is not None:
            request.meta['proxy'] = to_bytes("http://%s" % proxy['ip_port'])
            encoded_user_pass = base64.encodestring(to_bytes(proxy['user_pass']))
            request.headers['Proxy-Authorization'] = to_bytes('Basic ' + encoded_user_pass)
        else:
            request.meta['proxy'] = to_bytes("http://%s" % proxy['ip_port'])

Configuration:

DOWNLOADER_MIDDLEWARES = {
    # 'xiaohan.middlewares.MyProxyDownloaderMiddleware': 543,
}
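Note that base64.encodestring no longer exists on recent Python 3 releases, and request.meta['proxy'] is normally a plain str there. A hedged Python 3 variant of the authenticated branch (not from the original post) could look like this:

import base64


def set_authenticated_proxy(request, ip_port, user_pass):
    # ip_port like "111.11.228.75:80", user_pass like "user:password"
    request.meta['proxy'] = "http://%s" % ip_port
    encoded = base64.b64encode(user_pass.encode('utf-8')).decode('ascii')
    request.headers['Proxy-Authorization'] = 'Basic ' + encoded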

Problem 2: HTTPS certificates


If the website being crawled uses a certificate bought from a trusted CA (the certificate exists so that data sent by users cannot be intercepted and read in transit; without the corresponding keys it cannot be decrypted), you can crawl it normally with no extra work.

If the website did not pay for one and uses a self-signed certificate, your requests must carry the certificate files before the data can be crawled.

    Method: write the code below into a module (middlewares.py, or a dedicated https.py as in the example), then add the two configuration lines shown below to the settings file.

20. HTTPS access
    There are two situations when crawling over HTTPS:
    1. The target site uses a trusted certificate (supported by default)
        DOWNLOADER_HTTPCLIENTFACTORY = "scrapy.core.downloader.webclient.ScrapyHTTPClientFactory"
        DOWNLOADER_CLIENTCONTEXTFACTORY = "scrapy.core.downloader.contextfactory.ScrapyClientContextFactory"

    2. The target site uses a custom (self-signed) certificate
        DOWNLOADER_HTTPCLIENTFACTORY = "scrapy.core.downloader.webclient.ScrapyHTTPClientFactory"
        DOWNLOADER_CLIENTCONTEXTFACTORY = "step8_king.https.MySSLFactory"

        # https.py
        from scrapy.core.downloader.contextfactory import ScrapyClientContextFactory
        from twisted.internet.ssl import (optionsForClientTLS, CertificateOptions, PrivateCertificate)

        class MySSLFactory(ScrapyClientContextFactory):
            def getCertificateOptions(self):
                from OpenSSL import crypto
                v1 = crypto.load_privatekey(crypto.FILETYPE_PEM, open('/Users/wupeiqi/client.key.unsecure', mode='r').read())
                v2 = crypto.load_certificate(crypto.FILETYPE_PEM, open('/Users/wupeiqi/client.pem', mode='r').read())
                return CertificateOptions(
                    privateKey=v1,  # a PKey object
                    certificate=v2,  # an X509 object
                    verify=False,
                    method=getattr(self, 'method', getattr(self, '_ssl_method', None))
                )
    Other notes:
        Related classes
            scrapy.core.downloader.handlers.http.HttpDownloadHandler
            scrapy.core.downloader.webclient.ScrapyHTTPClientFactory
            scrapy.core.downloader.contextfactory.ScrapyClientContextFactory
        Related settings
            DOWNLOADER_HTTPCLIENTFACTORY
            DOWNLOADER_CLIENTCONTEXTFACTORY

Spider middleware


Note: spider middleware wraps the spider itself; the comments in the code below mark exactly when each hook is called (for example, process_spider_input runs after a download finishes but before the parse callback).

Code:

middlewares.py
                class XiaohanSpiderMiddleware(object):
                    # Not all methods need to be defined. If a method is not defined,
                    # scrapy acts as if the spider middleware does not modify the
                    # passed objects.
                    def __init__(self):
                        pass
                    @classmethod
                    def from_crawler(cls, crawler):
                        # This method is used by Scrapy to create your spiders.
                        s = cls()
                        return s

                    # Runs after each download finishes, before the parse callback is executed.
                    def process_spider_input(self, response, spider):
                        # Called for each response that goes through the spider
                        # middleware and into the spider.

                        # Should return None or raise an exception.
                        print('process_spider_input', response)
                        return None

                    def process_spider_output(self, response, result, spider):
                        # Called with the results returned from the Spider, after
                        # it has processed the response.

                        # Must return an iterable of Request, dict or Item objects.
                        print('process_spider_output', response)
                        for i in result:
                            yield i

                    def process_spider_exception(self, response, exception, spider):
                        # Called when a spider or process_spider_input() method
                        # (from other spider middleware) raises an exception.

                        # Should return either None or an iterable of Response, dict
                        # or Item objects.
                        pass

                    # Triggered when the spider starts, the first time start_requests is executed (runs only once).
                    def process_start_requests(self, start_requests, spider):
                        # Called with the start requests of the spider, and works
                        # similarly to the process_spider_output() method, except
                        # that it doesn’t have a response associated.

                        # Must return only requests (not items).

                        print('process_start_requests')
                        for r in start_requests:
                            yield r

Configuration in settings.py:

SPIDER_MIDDLEWARES = {
    'xiaohan.middlewares.XiaohanSpiderMiddleware': 543,
}

Extensions and signals

A plain extension:

(does nothing useful on its own, but it shows the skeleton)

extends.py 
                class MyExtension(object):
                    def __init__(self):
                        pass

                    @classmethod
                    def from_crawler(cls, crawler):
                        obj = cls()
                        return obj
            Configuration:
                EXTENSIONS = {
                    'xiaohan.extends.MyExtension': 500,
                }
        

Extension + signals:


extends.py

from scrapy import signals


                class MyExtension(object):
                    def __init__(self):
                        pass

                    @classmethod
                    def from_crawler(cls, crawler):
                        obj = cls()
                        # When the spider opens, trigger every function connected to the spider_opened signal
                        crawler.signals.connect(obj.xxxxxxxxxxx1, signal=signals.spider_opened)
                        # When the spider closes, trigger every function connected to the spider_closed signal
                        crawler.signals.connect(obj.uuuuuuuuuu, signal=signals.spider_closed)
                        return obj

                    def xxxxxxxxxxx1(self, spider):
                        print('open')

                    def uuuuuuuuuu(self, spider):
                        print('close')
        

Configuration:

EXTENSIONS = {
    'xiaohan.extends.MyExtension': 500,
}
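Any of Scrapy's other built-in signals can be connected the same way. For example, a sketch that counts scraped items (the extension and attribute names are made up for illustration):

from scrapy import signals


class ItemCounterExtension(object):
    def __init__(self):
        self.count = 0

    @classmethod
    def from_crawler(cls, crawler):
        obj = cls()
        # item_scraped fires each time an item has passed through the item pipelines
        crawler.signals.connect(obj.item_scraped, signal=signals.item_scraped)
        return obj

    def item_scraped(self, item, response, spider):
        self.count += 1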

7. Custom commands

  • Create a directory (any name) at the same level as spiders, e.g. commands

  • Inside it, create a crawlall.py file (the file name becomes the name of the custom command)

        from scrapy.commands import ScrapyCommand
        from scrapy.utils.project import get_project_settings


        class Command(ScrapyCommand):
            requires_project = True

            def syntax(self):
                # supported command-line syntax
                return '[options]'

            def short_desc(self):
                return 'Runs all of the spiders'

            def run(self, args, opts):
                # get the names of all spiders in the project
                spider_list = self.crawler_process.spiders.list()
                for name in spider_list:
                    # crawler_process is the entry point for running spiders,
                    # e.g. self.crawler_process.crawl('chouti')
                    self.crawler_process.crawl(name, **opts.__dict__)
                # start all of the queued crawls
                self.crawler_process.start()
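For Scrapy to discover the new command, the commands package has to be registered in the project settings; a minimal sketch, assuming the project is named xiaohan as in the earlier examples (the directory also needs an empty __init__.py so it can be imported):

        # settings.py
        COMMANDS_MODULE = 'xiaohan.commands'

After that, running scrapy crawlall starts every spider in the project.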

11. TinyScrapy (building up a miniature Scrapy-style engine with Twisted)

from twisted.web.client import getPage
from twisted.internet import reactor
from twisted.internet import defer

url_list = ['http://www.bing.com', 'http://www.baidu.com', ]


def callback(arg):
    print('got one back', arg)


defer_list = []
for url in url_list:
    ret = getPage(bytes(url, encoding='utf8'))
    ret.addCallback(callback)
    defer_list.append(ret)


def stop(arg):
    print('all downloads finished', arg)
    reactor.stop()


d = defer.DeferredList(defer_list)
d.addBoth(stop)

reactor.run()

Twisted example 1
#!/usr/bin/env python
# -*- coding:utf-8 -*-
from twisted.web.client import getPage
from twisted.internet import reactor
from twisted.internet import defer


@defer.inlineCallbacks
def task(url):
    ret = getPage(bytes(url, encoding='utf8'))
    ret.addCallback(callback)
    yield ret


def callback(arg):
    print('got one back', arg)


url_list = ['http://www.bing.com', 'http://www.baidu.com', ]
defer_list = []
for url in url_list:
    ret = task(url)
    defer_list.append(ret)


def stop(arg):
    print('all downloads finished', arg)
    reactor.stop()


d = defer.DeferredList(defer_list)
d.addBoth(stop)
reactor.run()
Twisted example 2
#!/usr/bin/env python
# -*- coding:utf-8 -*-
from twisted.internet import defer
from twisted.web.client import getPage
from twisted.internet import reactor
import threading


def _next_request():
    _next_request_from_scheduler()


def _next_request_from_scheduler():
    ret = getPage(bytes('http://www.chouti.com', encoding='utf8'))
    ret.addCallback(callback)
    ret.addCallback(lambda _: reactor.callLater(0, _next_request))


_closewait = None

@defer.inlineCallbacks
def engine_start():
    global _closewait
    _closewait = defer.Deferred()
    yield _closewait


@defer.inlineCallbacks
def task(url):
    reactor.callLater(0, _next_request)
    yield engine_start()


counter = 0
def callback(arg):
    global counter
    counter +=1
    if counter == 10:
        _closewait.callback(None)
    print('one', len(arg))


def stop(arg):
    print('all done', arg)
    reactor.stop()


if __name__ == '__main__':
    url = 'http://www.cnblogs.com'

    defer_list = []
    deferObj = task(url)
    defer_list.append(deferObj)

    v = defer.DeferredList(defer_list)
    v.addBoth(stop)
    reactor.run()
Twisted example 3


Additional notes


How to create a second spider

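A second spider is just another module in the spiders package with a unique name attribute; it can be written by hand or generated with scrapy genspider. A minimal sketch (the spider name, domain, and URL are placeholders):

# spiders/second.py
import scrapy


class SecondSpider(scrapy.Spider):
    name = 'second'                      # must be unique within the project
    allowed_domains = ['example.com']
    start_urls = ['https://example.com/']

    def parse(self, response):
        print(response)

Run it with scrapy crawl second, or with the crawlall command defined above to run every spider in the project.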

Deduplication with scrapy-redis

The principle is simply to put the addresses that have already been visited into a set, and then check whether a new address has been visited before.
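A minimal sketch of that idea with the redis-py client (the key name visited_urls is arbitrary):

import redis

conn = redis.Redis(host='localhost', port=6379)


def seen_before(url):
    # SADD returns 1 if the member was newly added, 0 if it was already in the set
    return conn.sadd('visited_urls', url) == 0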

Redis itself is only a supplementary topic here.


Source code walkthrough:

Configuration file (to use scrapy-redis, these are the settings you change)

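A sketch of the deduplication-related part of those settings, using the option names from scrapy-redis's documented defaults (host, port, and encoding values are placeholders):

# settings.py
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"  # store request fingerprints in Redis

REDIS_HOST = '127.0.0.1'      # Redis connection used by scrapy-redis
REDIS_PORT = 6379
REDIS_PARAMS = {}             # extra keyword arguments passed to the redis client
REDIS_ENCODING = 'utf-8'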

Code (see the instructor's notes)

If you want to customize or extend the deduplication rule

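A minimal sketch of a custom dedup rule in plain Scrapy: subclass BaseDupeFilter, implement request_seen, and point DUPEFILTER_CLASS at the class (the module path xiaohan.dupefilters is illustrative):

# dupefilters.py
from scrapy.dupefilters import BaseDupeFilter
from scrapy.utils.request import request_fingerprint


class MyDupeFilter(BaseDupeFilter):
    def __init__(self):
        self.fingerprints = set()

    def request_seen(self, request):
        # return True to drop the request as a duplicate
        fp = request_fingerprint(request)
        if fp in self.fingerprints:
            return True
        self.fingerprints.add(fp)
        return False


# settings.py
# DUPEFILTER_CLASS = 'xiaohan.dupefilters.MyDupeFilter'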

Supplementary Redis notes



Custom scheduler


Where it plugs in: the scheduler is swapped in through the settings file (see the sketch below).

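A sketch of pointing Scrapy at the scrapy-redis scheduler in settings (option names follow scrapy-redis's documented defaults; the values are illustrative):

# settings.py
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
SCHEDULER_QUEUE_CLASS = "scrapy_redis.queue.PriorityQueue"  # FifoQueue and LifoQueue also ship with scrapy-redis
SCHEDULER_PERSIST = True          # keep the request queue and dupefilter in Redis between runs
SCHEDULER_FLUSH_ON_START = False  # set True to clear them when the spider starts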

The three ways the scheduler and the dedup rule can be combined: see the instructor's code (in the notes).

