Python3 Web Crawling: Getting Started with Scrapy, Downloading Images with ImagesPipeline

阿新 • Published: 2018-12-22

Python version: Python 3.x
Runtime environment: macOS
IDE: PyCharm

I. Foreword
II. A First Look at ImagesPipeline
III. Changing the Default Image File Names with ImagesPipeline
IV. Summary

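I. Foreword

The previous post used a simple hands-on project to get familiar with the Scrapy framework. The images there, however, were downloaded with the requests library, even though Scrapy ships with its own image-download facility, ImagesPipeline.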

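II. A First Look at ImagesPipeline

1. Features of ImagesPipeline:

Avoids re-downloading media that was downloaded recently
Lets you specify the storage location
Converts all downloaded images to a common format (JPG) and mode (RGB)
Generates thumbnails
Checks image width/height to make sure they meet a minimum size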

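2. The ImagesPipeline workflow

In a spider, you scrape an item and put the URLs of its images into the image_urls field (type = list).
The item is returned from the spider and enters the item pipelines.
When the item reaches ImagesPipeline, the URLs in image_urls are scheduled for download using the standard Scrapy scheduler and downloader (so the scheduler and downloader middlewares are reused), but with a higher priority, so they are processed before other pages are scraped. The item remains "locked" at that pipeline stage until the downloads have finished (or have failed for some reason).
When the downloads finish, another field (images) is populated with the results. This field holds a list of dicts with information about each downloaded image, such as the download path, the original scraped URL (taken from image_urls) and the image checksum. The entries in images keep the same order as the original image_urls field. If an image fails to download, an error is logged and the image does not appear in images.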
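The first step of the workflow above can be sketched without Scrapy at all. The helper below is a hypothetical illustration (build_image_item and its arguments are not from the original post): it turns the src values found on a page into the image_urls list the pipeline expects, resolving relative links against the page URL because download requests need absolute URLs.

```python
from urllib.parse import urljoin


def build_image_item(page_url, srcs):
    """Return a dict item whose image_urls field is a list of absolute URLs.

    build_image_item is a hypothetical helper used for illustration only.
    """
    image_urls = []
    for src in srcs:
        absolute = urljoin(page_url, src)  # resolve relative links
        if absolute not in image_urls:     # skip duplicates, keep order
            image_urls.append(absolute)
    return {"image_urls": image_urls}


item = build_image_item(
    "http://www.example.com/gallery/",
    ["a.jpg", "/static/b.jpg", "http://www.example.com/gallery/a.jpg"],
)
print(item["image_urls"])
```

A spider callback could yield a dict like this one directly, since the pipeline only looks for the image_urls key.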

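3. ImagesPipeline usage example

a. Define the item

To use a media pipeline, you first need to enable it.
Then, if a spider returns a dict with a 'file_urls' or 'image_urls' key (depending on whether the Files or Images Pipeline is used), the pipeline puts the results under the corresponding key ('files' or 'images').

If you prefer to define your items with Item, you need to declare the necessary fields, as in this Images Pipeline example: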


import scrapy

class MyItem(scrapy.Item):

    # ... other item fields ...
    image_urls = scrapy.Field()
    images = scrapy.Field()

Here I defined my own item:

import scrapy

class MscrapyItem(scrapy.Item):
    # URLs of the images to download (read by ImagesPipeline)
    image_urls = scrapy.Field()
    # ids used later to build readable file names
    image_ids = scrapy.Field()
    # paths of the downloaded files, filled in by the custom pipeline
    image_paths = scrapy.Field()

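b. Configure the settings

First, add the pipeline to ITEM_PIPELINES in the project settings:

ITEM_PIPELINES = {'scrapy.pipelines.images.ImagesPipeline': 1}

Then set IMAGES_STORE to a valid directory that will be used to store the downloaded images. Otherwise the pipeline stays disabled, even if you have added it to ITEM_PIPELINES.

For the Images Pipeline, set IMAGES_STORE:

IMAGES_STORE = '/path/to/valid/dir'

For thumbnails and the other options, see the official documentation.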
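As a rough sketch of those other options, a settings.py fragment might look like the following. The setting names are the ones read by from_settings() in the ImagesPipeline source listing later in this post; the specific values and the directory are only example assumptions.

```python
# settings.py (fragment): the values below are illustrative assumptions
ITEM_PIPELINES = {"scrapy.pipelines.images.ImagesPipeline": 1}
IMAGES_STORE = "/path/to/valid/dir"

# Treat images downloaded within the last 30 days as fresh (no re-download).
IMAGES_EXPIRES = 30

# Generate two thumbnail sizes; they are stored under thumbs/<name>/.
IMAGES_THUMBS = {
    "small": (50, 50),
    "big": (270, 270),
}

# Skip images smaller than 110x110 pixels.
IMAGES_MIN_WIDTH = 110
IMAGES_MIN_HEIGHT = 110
```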

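III. Changing the Default Image File Names with ImagesPipeline

1. Reading the documentation

Of ImagesPipeline's many behaviours, the one that deserves particular attention is file system storage, because it determines the default name each file is saved under. To change the default image names, this is where we have to start.

File system storage

Files are stored using the SHA1 hash of their URL as the file name.

For example, the following image URL:

http://www.example.com/image.jpg

whose SHA1 hash is:

3afec3b4765f8f0a07b78f98c07b83f013567a0a

will be downloaded and stored as the following file:

<IMAGES_STORE>/full/3afec3b4765f8f0a07b78f98c07b83f013567a0a.jpg

where:

<IMAGES_STORE> is the directory defined in the IMAGES_STORE setting
full is a sub-directory that separates full-size images from thumbnails (if thumbnails are used)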
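The naming scheme described above can be reproduced with the standard library. This is a minimal sketch, not Scrapy's own code: default_image_path is a hypothetical name, and encoding the URL as UTF-8 is an assumption made here because hashlib.sha1() needs bytes in Python 3.

```python
import hashlib


def default_image_path(url):
    """Mimic the default naming scheme: full/<sha1 of the URL>.jpg"""
    image_guid = hashlib.sha1(url.encode("utf-8")).hexdigest()
    return "full/%s.jpg" % image_guid


path = default_image_path("http://www.example.com/image.jpg")
print(path)
```

Because the name depends only on the URL, the same URL always maps to the same file, which is what lets the pipeline recognise images it has already downloaded.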

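Naturally we do not want our downloaded images to be named with an unreadable string of hex digits, so the file names need to be changed.

The official documentation describes two methods that can be overridden:

get_media_requests(item, info)
item_completed(results, item, info)
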
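get_media_requests(item, info)

As seen in the workflow, the pipeline takes the URLs of the images to download from the item. To do this, you override the get_media_requests() method and return a Request for each image URL:

def get_media_requests(self, item, info):
    # one download request per image URL stored on the item
    for image_url in item['image_urls']:
        yield scrapy.Request(image_url)

These requests are processed by the pipeline and, when they have finished downloading, the results are sent to the item_completed() method as a list of 2-element tuples. Each tuple contains (success, file_info_or_error), where:

success is a boolean that is True if the image was downloaded successfully and False if it failed for some reason.
file_info_or_error is a dict containing the following keys (if success is True) or a Twisted Failure if there was a problem:
    url - the URL the file was downloaded from. This is the URL of the request returned from get_media_requests().
    path - the path (relative to IMAGES_STORE) where the image was stored.
    checksum - an MD5 hash of the image contents.

The list of tuples received by item_completed() is guaranteed to keep the same order as the requests returned from get_media_requests(). Here is a typical value of the results argument:

[(True,
  {'checksum': '2b00042f7481c7b056c4b410d28f33cf',
   'path': 'full/0a79c461a4062ac383dc4fade7bc09f1384a3910.jpg',
   'url': 'http://www.example.com/files/product1.pdf'}),
 (False,
  Failure(...))]

In the base media pipeline, get_media_requests() returns None by default, which means the item has no files to download. ImagesPipeline's own implementation, shown in the source listing near the end of this post, builds one Request per URL in image_urls.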
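To make the shape of results concrete, here is a small stand-alone sketch that separates the successful entries from the failed ones. split_results is a hypothetical helper, and a plain Exception stands in for the Twisted Failure that a real crawl would deliver.

```python
def split_results(results):
    """Separate (success, info_or_error) tuples into stored paths and errors."""
    paths = [info["path"] for ok, info in results if ok]
    errors = [err for ok, err in results if not ok]
    return paths, errors


# A plain Exception stands in for the Twisted Failure of a real run.
sample_results = [
    (True, {"checksum": "2b00042f7481c7b056c4b410d28f33cf",
            "path": "full/0a79c461a4062ac383dc4fade7bc09f1384a3910.jpg",
            "url": "http://www.example.com/files/product1.pdf"}),
    (False, Exception("download failed")),
]

paths, errors = split_results(sample_results)
print(paths)
print(len(errors))
```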

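item_completed(results, item, info)

The ImagesPipeline.item_completed() method is called when all image requests for a single item have completed (either finished downloading or failed for some reason).

item_completed() must return the output that will be sent on to subsequent item pipeline stages, so you need to return (or drop) the item, as you would in any pipeline. Here is an example of item_completed() that stores the downloaded image paths (passed in results) in the image_paths item field and drops the item if it contains no images:

from scrapy.exceptions import DropItem

def item_completed(self, results, item, info):
    # keep the paths of the successful downloads only
    image_paths = [x['path'] for ok, x in results if ok]
    if not image_paths:
        raise DropItem("Item contains no images")
    item['image_paths'] = image_paths
    return item

By default, item_completed() returns the item.

Below is a complete example of an image pipeline that uses the methods shown above: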

import scrapy
from scrapy.pipelines.images import ImagesPipeline
from scrapy.exceptions import DropItem

class MyImagesPipeline(ImagesPipeline):

    def get_media_requests(self, item, info):
        for image_url in item['image_urls']:
            yield scrapy.Request(image_url)

    def item_completed(self, results, item, info):
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem("Item contains no images")
        item['image_paths'] = image_paths
        return item
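For a custom class like this to run, it has to be listed in ITEM_PIPELINES in place of the stock pipeline. A minimal settings sketch, assuming the class lives in a hypothetical module myproject/pipelines.py:

```python
# settings.py (fragment): "myproject" is a placeholder project name
ITEM_PIPELINES = {
    # dotted path to the subclass; the integer sets its order among pipelines
    "myproject.pipelines.MyImagesPipeline": 300,
}
IMAGES_STORE = "/path/to/valid/dir"
```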

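2. Hands-on code

Continuing the demo from the previous post, I changed the code in pipelines.py as follows:

import os
import scrapy
from scrapy.exceptions import DropItem
from scrapy.pipelines.images import ImagesPipeline
from scrapy.utils.project import get_project_settings

# directory configured as IMAGES_STORE in the project settings
IMAGES_STORE = get_project_settings().get('IMAGES_STORE')

class UnsplashPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        for image_url in item['image_urls']:
            yield scrapy.Request(image_url)

    def item_completed(self, results, item, info):
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem("Item contains no images")
        new_path = image_paths[0]  # keep the hashed name if there is no id
        if item.get('image_ids'):
            # rename the first downloaded file to full/<image_id>.jpg
            new_path = "full/" + item['image_ids'][0] + ".jpg"
            os.rename(os.path.join(IMAGES_STORE, image_paths[0]),
                      os.path.join(IMAGES_STORE, new_path))
        item['image_paths'] = new_path
        return item

In essence, this approach lets ImagesPipeline save the file under its default name first and then renames the file.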
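The rename step can be tried on its own in a throwaway directory. This is a sketch under stated assumptions: rename_image and the id "sunset-001" are hypothetical, the temporary directory stands in for IMAGES_STORE, and os.replace is my choice here (instead of os.rename) so that a re-run overwrites an existing file of the same name on every platform.

```python
import os
import tempfile


def rename_image(store, old_rel_path, image_id):
    """Rename <store>/<old_rel_path> to <store>/full/<image_id>.jpg.

    Returns the new path relative to the store directory.
    """
    new_rel_path = "full/" + image_id + ".jpg"
    os.replace(os.path.join(store, old_rel_path),
               os.path.join(store, new_rel_path))
    return new_rel_path


# Demonstration in a temporary directory standing in for IMAGES_STORE.
store = tempfile.mkdtemp()
os.makedirs(os.path.join(store, "full"))
hashed = "full/0a79c461a4062ac383dc4fade7bc09f1384a3910.jpg"
with open(os.path.join(store, hashed), "wb") as f:
    f.write(b"fake image bytes")

new_path = rename_image(store, hashed, "sunset-001")
print(new_path)
```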

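3. A brief look at the ImagesPipeline source

Reading the source shows that file_path() is the method that assigns each image its file name, so overriding it directly looks tempting. First, here is the source of file_path():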

def file_path(self, request, response=None, info=None):
        ## start of deprecation warning block (can be removed in the future)
        def _warn():
            from scrapy.exceptions import ScrapyDeprecationWarning
            import warnings
            warnings.warn('ImagesPipeline.image_key(url) and file_key(url) methods are deprecated, '
                          'please use file_path(request, response=None, info=None) instead',
                          category=ScrapyDeprecationWarning, stacklevel=1)

        # check if called from image_key or file_key with url as first argument
        if not isinstance(request, Request):
            _warn()
            url = request
        else:
            url = request.url

        # detect if file_key() or image_key() methods have been overridden
        if not hasattr(self.file_key, '_base'):
            _warn()
            return self.file_key(url)
        elif not hasattr(self.image_key, '_base'):
            _warn()
            return self.image_key(url)
        ## end of deprecation warning block

        image_guid = hashlib.sha1(url).hexdigest()  # change to request.url after deprecation
        return 'full/%s.jpg' % (image_guid)

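Overriding file_path just to change the file path, however, intrudes too deeply on the library's code, and the official documentation at the time of writing did not suggest overriding file_path either.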
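For comparison, the naming decision itself is small. The function below is a hypothetical sketch (readable_image_path is not part of Scrapy or of the original post) of the logic an overridden file_path() could delegate to. How the image id would reach that method depends on the Scrapy version and is left out here.

```python
import hashlib


def readable_image_path(url, image_id=None):
    """Return full/<image_id>.jpg, or fall back to the SHA1-based name."""
    if image_id:
        # keep only characters that are safe in a file name
        safe_id = "".join(c for c in image_id if c.isalnum() or c in "-_")
        if safe_id:
            return "full/%s.jpg" % safe_id
    return "full/%s.jpg" % hashlib.sha1(url.encode("utf-8")).hexdigest()


print(readable_image_path("http://www.example.com/image.jpg", "sunset 001/x"))
print(readable_image_path("http://www.example.com/image.jpg"))
```

The fallback keeps the default hashed name, so items without an id still get a unique file.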

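Below is the ImagesPipeline source for reference. It comes from an older, Python 2-era Scrapy release, as its use of StringIO and iteritems() shows: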

class ImagesPipeline(FilesPipeline):
    """Abstract pipeline that implement the image thumbnail generation logic

    """

    MEDIA_NAME = 'image'
    MIN_WIDTH = 0
    MIN_HEIGHT = 0
    THUMBS = {}
    DEFAULT_IMAGES_URLS_FIELD = 'image_urls'
    DEFAULT_IMAGES_RESULT_FIELD = 'images'

    @classmethod
    def from_settings(cls, settings):
        cls.MIN_WIDTH = settings.getint('IMAGES_MIN_WIDTH', 0)
        cls.MIN_HEIGHT = settings.getint('IMAGES_MIN_HEIGHT', 0)
        cls.EXPIRES = settings.getint('IMAGES_EXPIRES', 90)
        cls.THUMBS = settings.get('IMAGES_THUMBS', {})
        s3store = cls.STORE_SCHEMES['s3']
        s3store.AWS_ACCESS_KEY_ID = settings['AWS_ACCESS_KEY_ID']
        s3store.AWS_SECRET_ACCESS_KEY = settings['AWS_SECRET_ACCESS_KEY']

        cls.IMAGES_URLS_FIELD = settings.get('IMAGES_URLS_FIELD', cls.DEFAULT_IMAGES_URLS_FIELD)
        cls.IMAGES_RESULT_FIELD = settings.get('IMAGES_RESULT_FIELD', cls.DEFAULT_IMAGES_RESULT_FIELD)
        store_uri = settings['IMAGES_STORE']
        return cls(store_uri)

    def file_downloaded(self, response, request, info):
        return self.image_downloaded(response, request, info)

    def image_downloaded(self, response, request, info):
        checksum = None
        for path, image, buf in self.get_images(response, request, info):
            if checksum is None:
                buf.seek(0)
                checksum = md5sum(buf)
            width, height = image.size
            self.store.persist_file(
                path, buf, info,
                meta={'width': width, 'height': height},
                headers={'Content-Type': 'image/jpeg'})
        return checksum

    def get_images(self, response, request, info):
        path = self.file_path(request, response=response, info=info)
        orig_image = Image.open(StringIO(response.body))

        width, height = orig_image.size
        if width < self.MIN_WIDTH or height < self.MIN_HEIGHT:
            raise ImageException("Image too small (%dx%d < %dx%d)" %
                                 (width, height, self.MIN_WIDTH, self.MIN_HEIGHT))

        image, buf = self.convert_image(orig_image)
        yield path, image, buf

        for thumb_id, size in self.THUMBS.iteritems():
            thumb_path = self.thumb_path(request, thumb_id, response=response, info=info)
            thumb_image, thumb_buf = self.convert_image(image, size)
            yield thumb_path, thumb_image, thumb_buf

    def convert_image(self, image, size=None):
        if image.format == 'PNG' and image.mode == 'RGBA':
            background = Image.new('RGBA', image.size, (255, 255, 255))
            background.paste(image, image)
            image = background.convert('RGB')
        elif image.mode != 'RGB':
            image = image.convert('RGB')

        if size:
            image = image.copy()
            image.thumbnail(size, Image.ANTIALIAS)

        buf = StringIO()
        image.save(buf, 'JPEG')
        return image, buf

    def get_media_requests(self, item, info):
        return [Request(x) for x in item.get(self.IMAGES_URLS_FIELD, [])]

    def item_completed(self, results, item, info):
        if self.IMAGES_RESULT_FIELD in item.fields:
            item[self.IMAGES_RESULT_FIELD] = [x for ok, x in results if ok]
        return item

    def file_path(self, request, response=None, info=None):
        ## start of deprecation warning block (can be removed in the future)
        def _warn():
            from scrapy.exceptions import ScrapyDeprecationWarning
            import warnings
            warnings.warn('ImagesPipeline.image_key(url) and file_key(url) methods are deprecated, '
                          'please use file_path(request, response=None, info=None) instead',
                          category=ScrapyDeprecationWarning, stacklevel=1)

        # check if called from image_key or file_key with url as first argument
        if not isinstance(request, Request):
            _warn()
            url = request
        else:
            url = request.url

        # detect if file_key() or image_key() methods have been overridden
        if not hasattr(self.file_key, '_base'):
            _warn()
            return self.file_key(url)
        elif not hasattr(self.image_key, '_base'):
            _warn()
            return self.image_key(url)
        ## end of deprecation warning block

        image_guid = hashlib.sha1(url).hexdigest()  # change to request.url after deprecation
        return 'full/%s.jpg' % (image_guid)

    def thumb_path(self, request, thumb_id, response=None, info=None):
        ## start of deprecation warning block (can be removed in the future)
        def _warn():
            from scrapy.exceptions import ScrapyDeprecationWarning
            import warnings
            warnings.warn('ImagesPipeline.thumb_key(url) method is deprecated, please use '
                          'thumb_path(request, thumb_id, response=None, info=None) instead',
                          category=ScrapyDeprecationWarning, stacklevel=1)

        # check if called from thumb_key with url as first argument
        if not isinstance(request, Request):
            _warn()
            url = request
        else:
            url = request.url

        # detect if thumb_key() method has been overridden
        if not hasattr(self.thumb_key, '_base'):
            _warn()
            return self.thumb_key(url, thumb_id)
        ## end of deprecation warning block

        thumb_guid = hashlib.sha1(url).hexdigest()  # change to request.url after deprecation
        return 'thumbs/%s/%s.jpg' % (thumb_id, thumb_guid)

    # deprecated
    def file_key(self, url):
        return self.image_key(url)
    file_key._base = True

    # deprecated
    def image_key(self, url):
        return self.file_path(url)
    image_key._base = True

    # deprecated
    def thumb_key(self, url, thumb_id):
        return self.thumb_path(url, thumb_id)
    thumb_key._base = True

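IV. Summary

The tools Scrapy provides out of the box are already rich and practical. My understanding of Scrapy is still limited and at a beginner's level; this post is just a summary of what I learned about ImagesPipeline through self-study. If you spot any mistakes, corrections are welcome.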
