Scrapy Crawler Framework (Part 4: ImagePipeline)
ImagePipeline
When scraping with the Scrapy framework we often need to download images in addition to text; Scrapy provides the ImagesPipeline for this.
The ImagesPipeline also supports some extra features:
1 Generating thumbnails: configure IMAGES_THUMBS = {'size_name': (width, height), }
2 Filtering out small images: configure IMAGES_MIN_HEIGHT and IMAGES_MIN_WIDTH to skip images that are too small.
For the other features, see the official manual: https://docs.scrapy.org/en/latest/topics/media-pipeline.html
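For example, a minimal settings.py sketch for these two options (the sizes and thumbnail names here are just placeholders):
# settings.py
IMAGES_THUMBS = {
    'small': (50, 50),      # a thumbnail set named "small", 50x50 pixels
    'big': (270, 270),      # a thumbnail set named "big", 270x270 pixels
}
IMAGES_MIN_HEIGHT = 110     # skip images shorter than 110 px
IMAGES_MIN_WIDTH = 110      # skip images narrower than 110 px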
How the ImagesPipeline works
1 In the spider, collect the URLs of the images to download and put them in the item's image_urls field.
2 The spider passes the item on to the pipeline.
3 When the ImagesPipeline processes the item, it checks for the image_urls field; if it is present, the URLs are handed to the Scrapy scheduler and downloader.
4 Once the downloads finish, the results are written to another item field, images, which contains each image's local path, checksum, and URL.
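After downloading, each entry in the images field is roughly a dict of the following shape (field names as documented by Scrapy; the URL and hash values below are made up):
item['images'] = [
    {
        'url': 'http://example.com/hero.jpg',                              # the original image URL
        'path': 'full/0a79c461a4062ac383dc4fade7bc09f1384a3910.jpg',       # path relative to IMAGES_STORE
        'checksum': '2b00042f7481c7b056c4b410d28f33cf',                    # checksum of the downloaded file
    },
]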
Example: scraping LoL hero artwork from tgbus
We only scrape the first page:
http://lol.tgbus.com/tu/yxmt/
Step 1: items.py
import scrapy

class Happy4Item(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    image_urls = scrapy.Field()
    images = scrapy.Field()
The spider file, lol.py
# -*- coding: utf-8 -*-
import scrapy
from happy4.items import Happy4Item

class LolSpider(scrapy.Spider):
    name = 'lol'
    allowed_domains = ['lol.tgbus.com']
    start_urls = ['http://lol.tgbus.com/tu/yxmt/']

    def parse(self, response):
        li_list = response.xpath('//div[@class="list cf mb30"]/ul//li')
        for one_li in li_list:
            item = Happy4Item()
            item['image_urls'] = one_li.xpath('./a/img/@src').extract()
            yield item
Finally, settings.py
BOT_NAME = 'happy4'
SPIDER_MODULES = ['happy4.spiders']
NEWSPIDER_MODULE = 'happy4.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

ITEM_PIPELINES = {
    'scrapy.pipelines.images.ImagesPipeline': 1,
}
IMAGES_STORE = 'images'
Without touching the pipeline file at all, you can already download the images to disk, which greatly reduces the amount of code.
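With the default pipeline, the files end up under IMAGES_STORE in a full/ subdirectory, named by the SHA1 hash of the image URL. A rough sketch of the default naming logic:
import hashlib

def default_image_path(url):
    # roughly what the built-in ImagesPipeline.file_path() does
    image_guid = hashlib.sha1(url.encode('utf-8')).hexdigest()
    return 'full/%s.jpg' % image_guid

# e.g. images/full/0a79c461a4062ac383dc4fade7bc09f1384a3910.jpg on disk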
Note: the images pipeline tries to convert every image to JPG format; if you read its source code, you will see that the file extension is hard-coded to .jpg. So if you want to keep images in their original format, use the files pipeline (FilesPipeline) instead.
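A minimal sketch of switching to the files pipeline (the field and setting names come from the Scrapy docs; the item class name here is just an illustration):
# items.py
import scrapy

class OriginalImageItem(scrapy.Item):
    file_urls = scrapy.Field()   # FilesPipeline reads the URLs from here
    files = scrapy.Field()       # download results are written here

# settings.py
ITEM_PIPELINES = {
    'scrapy.pipelines.files.FilesPipeline': 1,
}
FILES_STORE = 'files'            # files keep their original extension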
Example: scraping images from mm131
http://www.mm131.com/xinggan/
This is the site we want to scrape.
The site has anti-scraping measures: when you try to download an image directly, you will find the URL redirected elsewhere, or you may simply get a 302. That is because the site checks the Referer field in the request headers: when you open an image URL, your request must carry a Referer header, otherwise you are identified as a crawler and the download is blocked. So how do we get around this?
Manually add the Referer header to every image request, as the sketch below shows.
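To see the effect of the Referer check, you can compare a request with and without the header (a quick sketch using the requests library; the image URL is hypothetical and the site's behavior may have changed):
import requests

img_url = 'http://img1.mm131.me/pic/1/1.jpg'   # hypothetical image URL

# without Referer: the server redirects the request away (or rejects it)
r1 = requests.get(img_url, allow_redirects=False)
print(r1.status_code)   # expected: 302 or similar

# with Referer: the image is served normally
r2 = requests.get(img_url, headers={'Referer': 'http://www.mm131.com/xinggan/'})
print(r2.status_code)   # expected: 200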
xingan.py
# -*- coding: utf-8 -*-
import scrapy
from happy5.items import Happy5Item
import re

class XinganSpider(scrapy.Spider):
    name = 'xingan'
    allowed_domains = ['www.mm131.com']
    start_urls = ['http://www.mm131.com/xinggan/']

    def parse(self, response):
        every_html = response.xpath('//div[@class="main"]/dl//dd')
        for one_html in every_html[0:-1]:
            item = Happy5Item()
            # link to each gallery
            link = one_html.xpath('./a/@href').extract_first()
            # title of each gallery
            title = one_html.xpath('./a/img/@alt').extract_first()
            item['title'] = title
            # follow the link into the gallery page
            request = scrapy.Request(url=link, callback=self.parse_one, meta={'item': item})
            yield request

    # parse one gallery (one model's set of pictures)
    def parse_one(self, response):
        item = response.meta['item']
        # total number of pages in this gallery
        total_page = response.xpath('//div[@class="content-page"]/span[@class="page-ch"]/text()').extract_first()
        num = int(re.findall(r'(\d+)', total_page)[0])
        # current page number
        now_num = response.xpath('//div[@class="content-page"]/span[@class="page_now"]/text()').extract_first()
        now_num = int(now_num)
        # image URLs on the current page
        every_pic = response.xpath('//div[@class="content-pic"]/a/img/@src').extract()
        item['image_urls'] = every_pic
        # referer for the current page's images
        item['referer'] = response.url
        yield item
        # if the current page number is smaller than the total, request the next page
        if now_num < num:
            if now_num == 1:
                url1 = response.url[0:-5] + '_%d' % (now_num + 1) + '.html'
            elif now_num > 1:
                url1 = re.sub(r'_(\d+)', '_' + str(now_num + 1), response.url)
            headers = {
                'referer': self.start_urls[0]
            }
            # send a request for the next page of the gallery
            yield scrapy.Request(url=url1, headers=headers, callback=self.parse_one, meta={'item': item})
items.py
import scrapy

class Happy5Item(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    image_urls = scrapy.Field()
    images = scrapy.Field()
    title = scrapy.Field()
    referer = scrapy.Field()
pipelines.py
from scrapy.pipelines.images import ImagesPipeline
from scrapy.exceptions import DropItem
from scrapy.http import Request

class Happy5Pipeline(object):
    def process_item(self, item, spider):
        return item

class QiushiImagePipeline(ImagesPipeline):
    # add the Referer header when requesting each image
    def get_media_requests(self, item, info):
        for image_url in item['image_urls']:
            headers = {'referer': item['referer']}
            # pass the item along via meta, because file_path() below needs the title for the file name
            yield Request(image_url, meta={'item': item}, headers=headers)

    # check the download results; drop the item if nothing was downloaded
    def item_completed(self, results, item, info):
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem("Item contains no images")
        return item

    # customize the file name and path of the saved images
    def file_path(self, request, response=None, info=None):
        item = request.meta['item']
        image_guid = request.url.split('/')[-1]
        filename = './{}/{}'.format(item['title'], image_guid)
        return filename
settings.py
BOT_NAME = 'happy5'
SPIDER_MODULES = ['happy5.spiders']
NEWSPIDER_MODULE = 'happy5.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

ITEM_PIPELINES = {
    # 'scrapy.pipelines.images.ImagesPipeline': 1,
    'happy5.pipelines.QiushiImagePipeline': 2,
}
IMAGES_STORE = 'images'
And we get the downloaded image files.
Better not to look at this kind of picture too often, though.