Persistent Storage in the Scrapy Framework

Overview

Persistent storage via terminal commands
Persistent storage via pipelines

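1. Persistent storage via terminal commands

Make sure the spider's parse method returns an iterable (usually a list or a dict) or yields its items one at a time. Scrapy can then write that output to a file in the format you choose, directly from the command line.

Run the crawl with an output file; the file extension selects the format:
    scrapy crawl <spider name> -o xxx.json
    scrapy crawl <spider name> -o xxx.xml
    scrapy crawl <spider name> -o xxx.csv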
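
To make the effect of `-o` concrete, here is a rough sketch of what it means to write the same records as JSON and as CSV. It uses only the standard library and is not Scrapy's own exporter code; `rows`, `export_json` and `export_csv` are hypothetical names chosen for this illustration.

```python
import csv
import io
import json

# Hypothetical records, shaped like the dicts a parse() method might yield.
rows = [
    {"author": "alice", "content": "first joke"},
    {"author": "bob", "content": "second joke"},
]


def export_json(items):
    """Serialise the items as one JSON array, keeping non-ASCII text readable."""
    return json.dumps(items, ensure_ascii=False)


def export_csv(items):
    """Serialise the items as CSV, with a header row taken from the first item."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(items[0].keys()), lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(items)
    return buffer.getvalue()


json_text = export_json(rows)
csv_text = export_csv(rows)
```

The point is only that one stream of records can be rendered in several formats, which is why the command line needs nothing more than a file name.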

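2. Persistent storage via pipelines

Scrapy ships with an efficient, convenient persistence mechanism, the item pipeline, that we can use directly. To use it, start by getting to know two files:

    items.py: the data structure template. It defines the fields of the data.
    pipelines.py: the pipeline file. It receives the data (items) and persists it.

Persistence workflow:
    1. After the spider has scraped the data, wrap it in an item object.
    2. Use the yield keyword to submit the item object to the pipeline for persistence.
    3. In the pipeline file, the process_item method receives each item submitted by the spider; write the code there that persists the data stored in the item.
    4. Enable the pipeline in the settings.py configuration file.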
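
The steps above can be sketched without Scrapy at all. The following is a minimal simulation, assuming a pipeline object with the three hook methods used later in this article (`open_spider`, `process_item`, `close_spider`); `TextPipeline`, `fake_parse` and `run` are hypothetical names, and the real scheduling is done by Scrapy itself.

```python
import io


class TextPipeline:
    """Hypothetical stand-in for a pipeline class; writes one line per item."""

    def __init__(self, stream):
        self.stream = stream

    def open_spider(self, spider):
        # Runs once, before any item is processed.
        self.stream.write("# start\n")

    def process_item(self, item, spider):
        # Runs once per item; returning the item passes it on.
        self.stream.write(f"{item['author']}:{item['content']}\n")
        return item

    def close_spider(self, spider):
        # Runs once, after the last item.
        self.stream.write("# end\n")


def fake_parse():
    """Stand-in for a spider's parse(): yields items one at a time."""
    yield {"author": "alice", "content": "first joke"}
    yield {"author": "bob", "content": "second joke"}


def run(pipeline, items, spider=None):
    """Drive the pipeline in the order open -> process each item -> close."""
    pipeline.open_spider(spider)
    try:
        return [pipeline.process_item(item, spider) for item in items]
    finally:
        pipeline.close_spider(spider)


output = io.StringIO()
processed = run(TextPipeline(output), fake_parse())
```

Because `process_item` runs once per item while the other two hooks run once per crawl, resources such as files and connections belong in the open and close hooks.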

Warm-up exercise: scrape the jokes and their authors from the Qiushibaike home page, then persist them.

- Spider file: qiubaiDemo.py

# -*- coding: utf-8 -*-
import scrapy
from secondblood.items import SecondbloodItem


class QiubaidemoSpider(scrapy.Spider):
    name = 'qiubaiDemo'
    allowed_domains = ['www.qiushibaike.com']
    start_urls = ['http://www.qiushibaike.com/']

    def parse(self, response):
        odiv = response.xpath('//div[@id="content-left"]/div')
        for div in odiv:
            # xpath() returns a list of Selector objects. The parsed content is
            # wrapped inside each Selector, so call extract_first() (or extract())
            # to pull the text out.
            author = div.xpath('.//div[@class="author clearfix"]//h2/text()').extract_first()
            content = div.xpath('.//div[@class="content"]/span/text()').extract_first()
            # extract_first() returns None when nothing matches, so guard before
            # stripping the surrounding newlines.
            author = author.strip('\n') if author else ''
            content = content.strip('\n') if content else ''

            # Wrap the parsed data in an item object.
            item = SecondbloodItem()
            item['author'] = author
            item['content'] = content

            yield item  # Submit the item to the pipeline (pipelines.py).

- Items file: items.py

import scrapy


class SecondbloodItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    author = scrapy.Field()   # Stores the author
    content = scrapy.Field()  # Stores the joke text

- Pipeline file: pipelines.py

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html


class SecondbloodPipeline(object):

    def __init__(self):
        self.fp = None  # File object; opened in open_spider.

    # The methods below are hooks that Scrapy calls on the pipeline.

    # Called once when the spider starts.
    def open_spider(self, spider):
        print('Spider started')
        self.fp = open('./data.txt', 'w', encoding='utf-8')

    # Called once per item, which is why the file is opened and closed in the
    # two other methods that each run only once.
    def process_item(self, item, spider):
        # Persist the item submitted by the spider.
        self.fp.write(item['author'] + ':' + item['content'] + '\n')
        return item

    # Called once when the spider finishes.
    def close_spider(self, spider):
        self.fp.close()
        print('Spider finished')

- Configuration file: settings.py

# Enable the pipeline
ITEM_PIPELINES = {
    'secondblood.pipelines.SecondbloodPipeline': 300,  # 300 is the priority; lower values run first
}

2.1 Pipeline storage with MySQL

In the warm-up exercise the pipeline wrote the item data to a file on disk. To write the items to a MySQL database instead, change the pipeline file from that example as follows:

- pipelines.py file

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html

# Import the MySQL client library
import pymysql


class QiubaiproPipelineByMysql(object):
    conn = None    # MySQL connection object
    cursor = None  # MySQL cursor object

    def open_spider(self, spider):
        print('Spider started')
        # Connect once and create the cursor here, so close_spider can always close it.
        self.conn = pymysql.Connect(host='127.0.0.1', port=3306, user='root',
                                    password='123456', db='qiubai')
        self.cursor = self.conn.cursor()

    # Store each item in the database.
    def process_item(self, item, spider):
        # Use a parameterized query so that quotes in the scraped text cannot
        # break (or inject into) the SQL statement.
        sql = 'insert into qiubai values(%s, %s)'
        # Run the insert as a transaction.
        try:
            self.cursor.execute(sql, (item['author'], item['content']))
            self.conn.commit()
        except Exception as e:
            print(e)
            self.conn.rollback()
        return item

    def close_spider(self, spider):
        print('Spider finished')
        self.cursor.close()
        self.conn.close()

- settings.py

ITEM_PIPELINES = {
    'qiubaiPro.pipelines.QiubaiproPipelineByMysql': 300,
}
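
Scraped text often contains quotes, which is where building SQL with string formatting goes wrong. The sketch below shows the parameterized-insert pattern with a connection opened once per crawl. It deliberately swaps MySQL for the standard library's `sqlite3` (in-memory) so it runs without a server; note that `sqlite3` uses `?` placeholders where pymysql uses `%s`. `SqlitePipeline` is a hypothetical name, and the two-column `qiubai` table layout is an assumption based on the insert statement in this article.

```python
import sqlite3


class SqlitePipeline:
    """Hypothetical pipeline using sqlite3 as a stand-in for MySQL."""

    def open_spider(self, spider):
        # One connection for the whole crawl.
        self.conn = sqlite3.connect(":memory:")
        self.conn.execute("CREATE TABLE qiubai (author TEXT, content TEXT)")

    def process_item(self, item, spider):
        try:
            # Values are passed separately from the SQL text, so quotes
            # inside them are stored literally instead of breaking the query.
            self.conn.execute(
                "INSERT INTO qiubai VALUES (?, ?)",
                (item["author"], item["content"]),
            )
            self.conn.commit()
        except sqlite3.Error:
            self.conn.rollback()
            raise
        return item

    def close_spider(self, spider):
        self.conn.close()


pipeline = SqlitePipeline()
pipeline.open_spider(None)
pipeline.process_item({"author": 'O"Brien', "content": "it's a \"quoted\" joke"}, None)
stored = pipeline.conn.execute("SELECT author, content FROM qiubai").fetchall()
pipeline.close_spider(None)
```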

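2.2 Pipeline storage with Redis

In the warm-up exercise the pipeline wrote the item data to a file on disk. To write the items to a Redis database instead, change the pipeline file from that example as follows:

- pipelines.py file

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html

import json

import redis


class QiubaiproPipelineByRedis(object):
    conn = None

    def open_spider(self, spider):
        print('Spider started')
        # Create the connection object.
        self.conn = redis.Redis(host='127.0.0.1', port=6379)

    def process_item(self, item, spider):
        data = {
            'author': item['author'],
            'content': item['content'],
        }
        # Write to Redis. redis-py 3.0 and later reject dict values, so
        # serialise the record to a JSON string first.
        self.conn.lpush('data', json.dumps(data, ensure_ascii=False))
        return item

- settings.py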

ITEM_PIPELINES = {
    'qiubaiPro.pipelines.QiubaiproPipelineByRedis': 300,
}
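
A Redis list holds flat values (bytes, strings, numbers), so a dict-shaped record has to be serialised before it is pushed and parsed again when it is read back. The sketch below shows that round trip with JSON. `FakeListClient` is a hypothetical in-memory stand-in that only mimics the behaviour relevant here (head insertion and rejecting non-flat values); it is not the redis-py API.

```python
import json


class FakeListClient:
    """Hypothetical in-memory stand-in for a Redis client (lpush only)."""

    def __init__(self):
        self.lists = {}

    def lpush(self, key, value):
        # Mimic the restriction that only flat values can be stored.
        if not isinstance(value, (bytes, str, int, float)):
            raise TypeError(f"cannot store value of type {type(value).__name__}")
        # lpush inserts at the head of the list.
        self.lists.setdefault(key, []).insert(0, value)
        return len(self.lists[key])


def push_item(client, item):
    """Serialise the item to JSON and push it onto the 'data' list."""
    payload = json.dumps(
        {"author": item["author"], "content": item["content"]},
        ensure_ascii=False,
    )
    return client.lpush("data", payload)


client = FakeListClient()
push_item(client, {"author": "alice", "content": "first joke"})
push_item(client, {"author": "bob", "content": "second joke"})

# The most recently pushed record sits at index 0; parse it back to a dict.
newest = json.loads(client.lists["data"][0])
```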

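- Interview question: if the scraped data must be stored twice, once in a file on disk and once in a database, how should Scrapy be set up?

- Answer: the code in the pipeline file is

# This is a pipeline class; its process_item method implements the persistence step.
class DoublekillPipeline(object):

    def process_item(self, item, spider):
        # Persistence code (option 1: write to a file on disk)
        return item


# For another kind of persistence, define a second pipeline class:
class DoublekillPipeline_db(object):

    def process_item(self, item, spider):
        # Persistence code (option 2: write to a database)
        return item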

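The code that enables both pipelines in settings.py is:

# The dict below maps each pipeline class to be enabled to its priority.
ITEM_PIPELINES = {
    'doublekill.pipelines.DoublekillPipeline': 300,
    'doublekill.pipelines.DoublekillPipeline_db': 200,
}
# Each entry causes the process_item method of the corresponding pipeline class
# to run, giving two different kinds of persistence. Lower values run first, so
# every item passes through DoublekillPipeline_db and then DoublekillPipeline;
# each process_item must return the item for the next pipeline to receive it.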
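
The ordering rule can be sketched in a few lines. This is a simplified simulation, not Scrapy's internals: it sorts the enabled pipelines by their priority value and hands each item from one `process_item` to the next. `FilePipeline`, `DbPipeline`, `registry`, `calls` and `run_pipelines` are hypothetical names used only here.

```python
# Priority values mirror the settings in this article: lower runs first.
ITEM_PIPELINES = {
    "FilePipeline": 300,
    "DbPipeline": 200,
}

calls = []  # Records the order in which the pipelines saw each item.


class FilePipeline:
    def process_item(self, item, spider):
        calls.append(("file", item["author"]))
        return item  # Returning the item keeps the chain going.


class DbPipeline:
    def process_item(self, item, spider):
        calls.append(("db", item["author"]))
        return item


registry = {"FilePipeline": FilePipeline, "DbPipeline": DbPipeline}


def run_pipelines(item, settings, spider=None):
    """Pass the item through every enabled pipeline, lowest value first."""
    for name in sorted(settings, key=settings.get):
        item = registry[name]().process_item(item, spider)
    return item


result = run_pipelines({"author": "alice", "content": "first joke"}, ITEM_PIPELINES)
```

With the values above, the database pipeline (200) sees the item before the file pipeline (300).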
