Scrapy Crawler Framework (Part 4: ImagePipeline)
ImagePipeline
When scraping with the Scrapy framework we often need to download images in addition to text; Scrapy provides the ImagesPipeline for this.
The ImagesPipeline also supports some extra features:
1 Generating thumbnails: configure IMAGES_THUMBS = {'size_name': (width, height), }
2 Filtering out small images: configure IMAGES_MIN_HEIGHT and IMAGES_MIN_WIDTH to skip images that are too small.
For the other features, see the official manual: https://docs.scrapy.org/en/latest/topics/media-pipeline.html
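For example, a minimal settings.py sketch for these two options (the sizes and thumbnail names here are just placeholders):
# settings.py
IMAGES_THUMBS = {
    'small': (50, 50),      # a thumbnail set named "small", 50x50 pixels
    'big': (270, 270),      # a thumbnail set named "big", 270x270 pixels
}
IMAGES_MIN_HEIGHT = 110     # skip images shorter than 110 px
IMAGES_MIN_WIDTH = 110      # skip images narrower than 110 px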
How the ImagesPipeline works
1 In the spider, collect the URLs of the images to download and put them in the item's image_urls field.
2 The spider passes the item on to the pipeline.
3 When the ImagesPipeline processes the item, it checks for the image_urls field; if it is present, the URLs are handed to the Scrapy scheduler and downloader.
4 Once the downloads finish, the results are written to another item field, images, which contains each image's local path, checksum, and URL.
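After downloading, each entry in the images field is roughly a dict of the following shape (field names as documented by Scrapy; the URL and hash values below are made up):
item['images'] = [
    {
        'url': 'http://example.com/hero.jpg',                              # the original image URL
        'path': 'full/0a79c461a4062ac383dc4fade7bc09f1384a3910.jpg',       # path relative to IMAGES_STORE
        'checksum': '2b00042f7481c7b056c4b410d28f33cf',                    # checksum of the downloaded file
    },
]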
Example: scraping LoL hero artwork from tgbus
We only scrape the first page:
http://lol.tgbus.com/tu/yxmt/
Step 1: items.py
import scrapy

class Happy4Item(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    image_urls = scrapy.Field()
    images = scrapy.Field()
The spider file, lol.py
# -*- coding: utf-8 -*-
import scrapy
from happy4.items import Happy4Item

class LolSpider(scrapy.Spider):
    name = 'lol'
    allowed_domains = ['lol.tgbus.com']
    start_urls = ['http://lol.tgbus.com/tu/yxmt/']

    def parse(self, response):
        li_list = response.xpath('//div[@class="list cf mb30"]/ul//li')
        for one_li in li_list:
            item = Happy4Item()
            item['image_urls'] = one_li.xpath('./a/img/@src').extract()
            yield item
Finally, settings.py
BOT_NAME = 'happy4'
SPIDER_MODULES = ['happy4.spiders']
NEWSPIDER_MODULE = 'happy4.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

ITEM_PIPELINES = {
    'scrapy.pipelines.images.ImagesPipeline': 1,
}
IMAGES_STORE = 'images'
Without touching the pipeline file at all, you can already download the images to disk, which greatly reduces the amount of code.
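With the default pipeline, the files end up under IMAGES_STORE in a full/ subdirectory, named by the SHA1 hash of the image URL. A rough sketch of the default naming logic:
import hashlib

def default_image_path(url):
    # roughly what the built-in ImagesPipeline.file_path() does
    image_guid = hashlib.sha1(url.encode('utf-8')).hexdigest()
    return 'full/%s.jpg' % image_guid

# e.g. images/full/0a79c461a4062ac383dc4fade7bc09f1384a3910.jpg on disk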
Note: the images pipeline tries to convert every image to JPG format; if you read its source code, you will see that the file extension is hard-coded to .jpg. So if you want to keep images in their original format, use the files pipeline (FilesPipeline) instead.
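A minimal sketch of switching to the files pipeline (the field and setting names come from the Scrapy docs; the item class name here is just an illustration):
# items.py
import scrapy

class OriginalImageItem(scrapy.Item):
    file_urls = scrapy.Field()   # FilesPipeline reads the URLs from here
    files = scrapy.Field()       # download results are written here

# settings.py
ITEM_PIPELINES = {
    'scrapy.pipelines.files.FilesPipeline': 1,
}
FILES_STORE = 'files'            # files keep their original extension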
Example: scraping images from mm131
http://www.mm131.com/xinggan/
This is the site we want to scrape.
The site has anti-scraping measures: when you try to download an image directly, you will find the URL redirected elsewhere, or you may simply get a 302. That is because the site checks the Referer field in the request headers: when you open an image URL, your request must carry a Referer header, otherwise you are identified as a crawler and the download is blocked. So how do we get around this?
Manually add the Referer header to every image request, as the sketch below shows.
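To see the effect of the Referer check, you can compare a request with and without the header (a quick sketch using the requests library; the image URL is hypothetical and the site's behavior may have changed):
import requests

img_url = 'http://img1.mm131.me/pic/1/1.jpg'   # hypothetical image URL

# without Referer: the server redirects the request away (or rejects it)
r1 = requests.get(img_url, allow_redirects=False)
print(r1.status_code)   # expected: 302 or similar

# with Referer: the image is served normally
r2 = requests.get(img_url, headers={'Referer': 'http://www.mm131.com/xinggan/'})
print(r2.status_code)   # expected: 200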
xingan.py
# -*- coding: utf-8 -*-
import scrapy
from happy5.items import Happy5Item
import re

class XinganSpider(scrapy.Spider):
    name = 'xingan'
    allowed_domains = ['www.mm131.com']
    start_urls = ['http://www.mm131.com/xinggan/']

    def parse(self, response):
        every_html = response.xpath('//div[@class="main"]/dl//dd')
        for one_html in every_html[0:-1]:
            item = Happy5Item()
            # link to each gallery
            link = one_html.xpath('./a/@href').extract_first()
            # title of each gallery
            title = one_html.xpath('./a/img/@alt').extract_first()
            item['title'] = title
            # follow the link into the gallery page
            request = scrapy.Request(url=link, callback=self.parse_one, meta={'item': item})
            yield request

    # parse one gallery (one model's set of pictures)
    def parse_one(self, response):
        item = response.meta['item']
        # total number of pages in this gallery
        total_page = response.xpath('//div[@class="content-page"]/span[@class="page-ch"]/text()').extract_first()
        num = int(re.findall(r'(\d+)', total_page)[0])
        # current page number
        now_num = response.xpath('//div[@class="content-page"]/span[@class="page_now"]/text()').extract_first()
        now_num = int(now_num)
        # image URLs on the current page
        every_pic = response.xpath('//div[@class="content-pic"]/a/img/@src').extract()
        item['image_urls'] = every_pic
        # referer for the current page's images
        item['referer'] = response.url
        yield item
        # if the current page number is smaller than the total, request the next page
        if now_num < num:
            if now_num == 1:
                url1 = response.url[0:-5] + '_%d' % (now_num + 1) + '.html'
            elif now_num > 1:
                url1 = re.sub(r'_(\d+)', '_' + str(now_num + 1), response.url)
            headers = {
                'referer': self.start_urls[0]
            }
            # send a request for the next page of the gallery
            yield scrapy.Request(url=url1, headers=headers, callback=self.parse_one, meta={'item': item})
items.py
import scrapy

class Happy5Item(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    image_urls = scrapy.Field()
    images = scrapy.Field()
    title = scrapy.Field()
    referer = scrapy.Field()
pipelines.py
from scrapy.pipelines.images import ImagesPipeline
from scrapy.exceptions import DropItem
from scrapy.http import Request

class Happy5Pipeline(object):
    def process_item(self, item, spider):
        return item

class QiushiImagePipeline(ImagesPipeline):
    # add the Referer header when requesting each image
    def get_media_requests(self, item, info):
        for image_url in item['image_urls']:
            headers = {'referer': item['referer']}
            # pass the item along via meta, because file_path() below needs the title for the file name
            yield Request(image_url, meta={'item': item}, headers=headers)

    # check the download results; drop the item if nothing was downloaded
    def item_completed(self, results, item, info):
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem("Item contains no images")
        return item

    # customize the file name and path of the saved images
    def file_path(self, request, response=None, info=None):
        item = request.meta['item']
        image_guid = request.url.split('/')[-1]
        filename = './{}/{}'.format(item['title'], image_guid)
        return filename
settings.py
BOT_NAME = 'happy5'
SPIDER_MODULES = ['happy5.spiders']
NEWSPIDER_MODULE = 'happy5.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

ITEM_PIPELINES = {
    # 'scrapy.pipelines.images.ImagesPipeline': 1,
    'happy5.pipelines.QiushiImagePipeline': 2,
}
IMAGES_STORE = 'images'
And we get the downloaded image files.
Better not to look at this kind of picture too often, though.