Crawler 2.4 - Scrapy framework - downloading images by category

阿新 • Published: 2018-12-31

Scrapy framework - image downloads

A Python aside:

The map function: passes each value of an iterable, in turn, to a function and returns an iterator (a lazy map object, not a list).

urls = uibox.xpath(".//ul/li/a/img/@src").getall()
urls = list(map(lambda url: 'https:'+url, urls))

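urls is itself a list. map passes each element of urls to the function as url and evaluates 'https:' + url for it. lambda defines an anonymous function, which makes it easy to write the function on one line. Finally, list() turns the iterator returned by map into a list.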
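To make the point about map concrete, here is a small sketch. The sample URLs are made up for illustration; only the 'https:' prefixing comes from the snippet above.

```python
urls = ['//img.example.com/a.jpg', '//img.example.com/b.jpg']  # made-up sample data

mapped = map(lambda url: 'https:' + url, urls)
print(type(mapped).__name__)   # a lazy map object, not a list

full_urls = list(mapped)       # consuming the iterator builds the list
print(full_urls)

# A list comprehension gives the same result and is often easier to read.
assert full_urls == ['https:' + url for url in urls]

# The iterator is now exhausted, so a second pass yields nothing.
print(list(mapped))
```

Because the map object can only be consumed once, converting it to a list straight away, as the spider does, is the safe choice when the URLs are needed more than once.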

1 The traditional download method:

pipelines.py
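import os
from urllib import request

def process_item(self, item, spider):
    category = item['category']  # a string
    urls = item['urls']  # a list
    # images/<category> under the current working directory
    category_dir = os.path.join(os.path.abspath('.'), 'images', category)
    os.makedirs(category_dir, exist_ok=True)  # also creates images/ and tolerates reruns
    for url in urls:
        filename = url.split('_')[-1]  # image name: the part after the last underscore
        request.urlretrieve(url, os.path.join(category_dir, filename))  # save to images/<category>/<filename>
    return item  # hand the item on to any later pipeline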

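The drawback is that this is not asynchronous: images are downloaded one at a time, in order, which is slow when there are many of them.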
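For comparison, the sequential loop can be made concurrent without Scrapy by using a thread pool. This is only a sketch of that alternative, not what Scrapy does internally (Scrapy schedules downloads asynchronously). The function name download_all and its fetch and workers parameters are my own for illustration.

```python
import os
from concurrent.futures import ThreadPoolExecutor
from urllib import request


def download_all(urls, target_dir, fetch=request.urlretrieve, workers=8):
    """Download every URL into target_dir using a small thread pool.

    fetch(url, path) defaults to urllib's urlretrieve; it is a parameter
    only so the function can be exercised without network access.
    """
    os.makedirs(target_dir, exist_ok=True)
    # Same naming rule as the pipeline above: the text after the last underscore.
    jobs = [(url, os.path.join(target_dir, url.split('_')[-1])) for url in urls]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # list() waits for every download and re-raises the first exception, if any.
        list(pool.map(lambda job: fetch(*job), jobs))
    return [path for _, path in jobs]
```

Calling download_all(item['urls'], category_dir) inside process_item would replace the for loop, but the pipeline would still block until that item's images are done, which is one reason to prefer the built-in pipelines described next.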

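2 Downloading with the Scrapy framework

scrapy.pipelines provides two pipeline classes for downloading files.

One is FilesPipeline in scrapy.pipelines.files; the other is ImagesPipeline in scrapy.pipelines.images.

ImagesPipeline is dedicated to downloading images and inherits from FilesPipeline.

Using ImagesPipeline

1) Define image_urls and images in items.py. image_urls holds the URLs of the images to download and must be a list; images will hold the attributes of each downloaded image, such as its download path, URL and checksum.

2) When the images have finished downloading, information about the downloaded files is stored in the item's images field.

3) Set IMAGES_STORE in the settings.py configuration file; this setting is the directory the images are downloaded to.

4) Enable the pipeline: in ITEM_PIPELINES, comment out the original pipeline and add 'scrapy.pipelines.images.ImagesPipeline': 1.

FilesPipeline is used in much the same way.

Define file_urls and files, set FILES_STORE, and add 'scrapy.pipelines.files.FilesPipeline': 1 to ITEM_PIPELINES.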
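Since the rest of this post only shows the image variant, here is a minimal sketch of what the FilesPipeline configuration might look like. The 'files' directory name is an arbitrary choice of mine.

```python
import os

# settings.py fragment (sketch). The storage directory is an example only.
ITEM_PIPELINES = {
    'scrapy.pipelines.files.FilesPipeline': 1,
}
FILES_STORE = os.path.join(os.path.abspath('.'), 'files')

# The item then needs two fields with these names:
#   file_urls = scrapy.Field()   # list of URLs to download
#   files = scrapy.Field()       # filled in by the pipeline after downloading
```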

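2.1 Downloading all images into the full folder

I came into the lab in high spirits and finished the required steps in two minutes, then spent an hour unable to work out why the spider would not run parse(). After all kinds of Baidu searches, I finally compared it with a spider I had written earlier and found that, still groggy from an afternoon nap, I had renamed start_urls to start_image_urls. When the Scrapy framework checks, it finds no starting point, so it simply exits...

1) Edit settings.py: comment out the existing contents of ITEM_PIPELINES and add 'scrapy.pipelines.images.ImagesPipeline': 1

Also add the image storage path IMAGES_STORE = os.path.join(os.path.abspath('.'), 'images')

settings.py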
import os

ITEM_PIPELINES = {
    # 'bmw.pipelines.BmwPipeline': 300,
    'scrapy.pipelines.images.ImagesPipeline': 1
}
# images/ under the directory the spider is run from (normally the project root)
IMAGES_STORE = os.path.join(os.path.abspath('.'), 'images')

2) Edit items.py: add or update image_urls and images

image_urls = scrapy.Field()
images = scrapy.Field()

3) Edit the spider file so that it yields image_urls

item = BmwItem(category=category, image_urls=image_urls)  # the key part is image_urls
yield item

Run it, and the result is a pile of uncategorized images downloaded into ./images/full
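The file names in full/ look random because, as far as I understand the default ImagesPipeline behaviour, each image is named after the SHA-1 hash of its URL. The sketch below mimics that naming with the standard library only; default_image_path is a hypothetical helper of mine, not a Scrapy function, and the sample URL is made up.

```python
import hashlib


def default_image_path(url):
    """Mimic the default naming: 'full/' + SHA-1 hex digest of the URL + '.jpg'."""
    image_guid = hashlib.sha1(url.encode('utf-8')).hexdigest()
    return 'full/{}.jpg'.format(image_guid)


print(default_image_path('https://example.com/pic_001.jpg'))
```

One consequence is that the same URL always maps to the same file name, so the category has to be added to the path separately, which is what section 2.2 does.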

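At this point the question is how to download the images sorted by category. For example, body shots (the experiment code downloads BMW 5x images from Autohome) should go in /images/車身 and interior shots in /images/內部; only then is the perfectionist in me satisfied.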

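2.2 Sorting images into different directories

This requires overriding some methods of scrapy.pipelines.images.ImagesPipeline.

Note that each item yielded by parse() has the form category: 'xxx' image_urls: ['url', 'url', ...] images: ''

So in the steps that follow, all the image_urls in one yielded item belong to the same category.

1) Subclass ImagesPipeline and adjust some return values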

import os  # needed in step 2)
from scrapy.pipelines.images import ImagesPipeline
from bmw import settings  # bmw is the project name

class BMWImagesPipeline(ImagesPipeline):  # define a new subclass of ImagesPipeline
    def get_media_requests(self, item, info):
        request_objs = super(BMWImagesPipeline, self).get_media_requests(item, info)
        for request_obj in request_objs:
            request_obj.item = item  # attach the item, which carries the category, to each request
        return request_objs  # the parent class must return this list for other methods to use

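get_media_requests originally returns [Request(x) for x in item.get(self.images_urls_field, [])], that is, a list of Request(url) objects. Here it is adapted: super calls the parent's method to get that return value, and an item attribute is added to each request for use in the next step.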

2) Override the file_path() method

def file_path(self, request, response=None, info=None):
    # path is the value returned by the parent's file_path, i.e. 'full/hash.jpg', where hash is the image's hash
    path = super(BMWImagesPipeline, self).file_path(request, response, info)
    # request is one element of the list returned by the previous method, so it has an item attribute
    category = request.item.get('category')
    images_store = settings.IMAGES_STORE  # IMAGES_STORE is the download directory, e.g. c:/images
    category_path = os.path.join(images_store, category)  # build each category's path and check whether it already exists
    if not os.path.exists(category_path):
        os.mkdir(category_path)
    image_name = path.replace('full/', '')  # strip the leading 'full/' and keep only the file name
    image_path = os.path.join(category_path, image_name)
    # full path and file name of the image, e.g. c:/images/aaa/js451v88225F45sd42f4y4.jpg
    return image_path
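A possible simplification, offered as a sketch rather than as the method used here: file_path can return a path relative to IMAGES_STORE instead of an absolute one. My understanding is that the pipeline's file store joins the returned path onto IMAGES_STORE and creates missing directories itself, so the os.mkdir call and the settings import would no longer be needed; please verify this against the Scrapy version you run. Recent Scrapy releases also appear to pass the item to file_path as a keyword argument, which may remove the need to attach it to the request. The helper below (categorized_path is a hypothetical name) shows only the path manipulation, using the standard library.

```python
import posixpath


def categorized_path(default_path, category):
    """Turn 'full/<hash>.jpg' into '<category>/<hash>.jpg' (a relative path)."""
    image_name = posixpath.basename(default_path)  # drop the 'full/' directory
    return posixpath.join(category, image_name)


# 'abc123' stands in for the real hash.
assert categorized_path('full/abc123.jpg', '車身') == '車身/abc123.jpg'
```

Inside the pipeline, the overridden file_path would then end with return categorized_path(path, category).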

3) Update the configuration in settings.py

ITEM_PIPELINES = {
    # 'bmw.pipelines.BmwPipeline': 300,  comment out the original pipeline
    'bmw.pipelines.BMWImagesPipeline': 1
}
IMAGES_STORE = os.path.join(os.path.abspath('.'), 'images')  # add the image storage location

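Done. After the override, each image's path includes its own category, and once the download finishes the directory layout looks like this:

images
   車頭
       41312r322r2234.jpg
       1425flj5l2.jpg
   車身
   ...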

3 Complete code for the categorized download

Project name: bmw; spider name: bmw5x

bmw5x.py

import scrapy
from bmw.items import BmwItem


class Bmw5xSpider(scrapy.Spider):
    name = 'bmw5x'
    allowed_domains = ['car.autohome.com.cn']
    start_urls = ['https://car.autohome.com.cn/pic/series/65.html#pvareaid=2042209']

    def parse(self, response):
        uiboxs = response.xpath("//div[@class='uibox']")[1:]
        for uibox in uiboxs:
            category = uibox.xpath("./div/a/text()").get()
            image_urls = uibox.xpath(".//ul/li/a/img/@src").getall()
            image_urls = list(map(lambda url: 'https:'+url, image_urls))
            item = BmwItem(category=category, image_urls=image_urls)
            yield item

items.py

import scrapy
class BmwItem(scrapy.Item):
    category = scrapy.Field()
    image_urls = scrapy.Field()
    images = scrapy.Field()

pipelines.py

import os
from scrapy.pipelines.images import ImagesPipeline
from bmw import settings


class BMWImagesPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        request_objs = super(BMWImagesPipeline, self).get_media_requests(item, info)
        for request_obj in request_objs:
            request_obj.item = item
        return request_objs

    def file_path(self, request, response=None, info=None):
        path = super(BMWImagesPipeline, self).file_path(request, response, info)
        category = request.item.get('category')
        images_store = settings.IMAGES_STORE
        category_path = os.path.join(images_store, category)
        if not os.path.exists(category_path):
            os.mkdir(category_path)
        image_name = path.replace('full/', '')
        image_path = os.path.join(category_path, image_name)
        return image_path

settings.py

import os
ROBOTSTXT_OBEY = False
DEFAULT_REQUEST_HEADERS = {
  'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
  'Accept-Language': 'en',
  'User-Agent': 'xxx',
  'Referer': 'xxx',
  'Cookie': 'xxx'
}
ITEM_PIPELINES = {
    'bmw.pipelines.BMWImagesPipeline': 1
}
IMAGES_STORE = os.path.join(os.path.abspath('.'), 'images')
