Saving Scrapy-crawled data to CSV, MySQL, MongoDB, and JSON

阿新 • Published: 2019-01-14

Preface

This post walks through several common ways to store the data you scrape with Scrapy.

Items

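An item is the container we store scraped data in, and it behaves much like a Python dict. The advantage of using an item is that Item adds an extra layer of protection: assigning to a field that was never declared raises an error, so a misspelled field name is caught immediately. Here is an example: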

import scrapy


class Doubantop250Item(scrapy.Item):
    title = scrapy.Field()  # movie title
    star = scrapy.Field()  # movie rating
    quote = scrapy.Field()  # a well-known one-line quote
    movieInfo = scrapy.Field()  # movie description: director, cast, and genre


Pipelines

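pipelines.py is where we normally save data; the figure below gives a brief introduction to its methods. In what follows I save the data in several different formats, so you can pick whichever suits you.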

Saving to JSON

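import json
import os

base_dir = os.getcwd()


class JsonPipeline(object):
    file_name = base_dir + '/doubanTop250/data.json'  # path of the JSON file

    def process_item(self, item, spider):
        # Load the existing list; start from an empty one if the file is missing or empty
        try:
            with open(self.file_name, 'r', encoding='utf-8') as file:
                load_data = json.load(file)
        except (FileNotFoundError, json.JSONDecodeError):
            load_data = []
        load_data.append({"title": item["title"].strip()})  # append the new record
        with open(self.file_name, 'w', encoding='utf-8') as file:
            json.dump(load_data, file, ensure_ascii=False)  # write the data back
        return item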
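The pipeline above re-reads and rewrites the whole file for every item, so the cost per item grows with the file. One possible alternative, shown here as a minimal sketch and not as the author's method, keeps the file open for the life of the spider and writes one JSON object per line (the JSON Lines layout). The class name JsonLinesPipeline and the path data.jl are hypothetical; open_spider, process_item, and close_spider are the standard Scrapy pipeline hooks.

```python
import json


class JsonLinesPipeline(object):
    """Hypothetical pipeline: write each item as one JSON object per line."""

    file_name = 'data.jl'  # assumed output path, not from the original post

    def open_spider(self, spider):
        # Open once when the spider starts; 'a' keeps lines from earlier runs
        self.file = open(self.file_name, 'a', encoding='utf-8')

    def process_item(self, item, spider):
        record = {"title": item["title"].strip()}
        # ensure_ascii=False keeps Chinese titles readable in the file
        self.file.write(json.dumps(record, ensure_ascii=False) + '\n')
        return item

    def close_spider(self, spider):
        # Close the file when the spider finishes
        self.file.close()
```

The trade-off is that the output is no longer a single JSON array, so a reader has to parse it line by line.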


Saving to CSV

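import csv
import os

base_dir = os.getcwd()


class CsvPipeline(object):
    # Assumed path, mirroring the JSON pipeline; the original listing omits it
    file_name = base_dir + '/doubanTop250/data.csv'

    def appendDta2Csv(self, file_name, new_headers, new_data):
        headers, old_data = new_headers, []
        if os.path.exists(file_name):
            with open(file_name, 'r', newline='', encoding='utf-8') as f:
                f_csv = csv.reader(f)
                # If the existing file has no header row, use the headers passed in
                headers = next(f_csv, new_headers)
                old_data = list(f_csv)
        old_data.append(new_data)  # append the new row
        with open(file_name, 'w', newline='', encoding='utf-8') as f2:  # save the data
            f_csv = csv.writer(f2)
            f_csv.writerow(headers)
            f_csv.writerows(old_data)

    def process_item(self, item, spider):
        self.appendDta2Csv(self.file_name, ["title"], [item["title"].strip()])
        return item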
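Rewriting the whole CSV file for each item is simple but slow on large crawls. As a hedged sketch of another option (the class name CsvAppendPipeline and the path data.csv are assumptions, not part of the original post), the file can be opened once in append mode, with the header written only when the file is new or empty:

```python
import csv
import os


class CsvAppendPipeline(object):
    """Hypothetical pipeline: append rows without re-reading the file."""

    file_name = 'data.csv'  # assumed output path
    headers = ['title']

    def open_spider(self, spider):
        # Write the header only when the file is new or empty
        is_new = (not os.path.exists(self.file_name)
                  or os.path.getsize(self.file_name) == 0)
        # newline='' lets the csv module control line endings itself
        self.file = open(self.file_name, 'a', newline='', encoding='utf-8')
        self.writer = csv.writer(self.file)
        if is_new:
            self.writer.writerow(self.headers)

    def process_item(self, item, spider):
        self.writer.writerow([item['title'].strip()])
        return item

    def close_spider(self, spider):
        # Closing the file also flushes any buffered rows
        self.file.close()
```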


Saving to MongoDB

from pymongo import MongoClient
import os

base_dir = os.getcwd()


class MongoPipeline(object):
    # Pipeline that saves items to a MongoDB database
    collection = 'douban'  # name of the MongoDB collection

    def __init__(self, mongo_uri, db_name, db_user, db_pass):
        self.mongo_uri = mongo_uri
        self.db_name = db_name
        self.db_user = db_user
        self.db_pass = db_pass

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy provides this hook so a pipeline can read the project settings.
        # Here we fetch the database URI, name, and credentials from settings.py
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            db_name=crawler.settings.get('DB_NAME'),
            db_user=crawler.settings.get('DB_USER'),
            db_pass=crawler.settings.get('DB_PASS'))

    def open_spider(self, spider):  # called when the spider starts; connects to the database
        # Credentials go to MongoClient; Database.authenticate() was removed in PyMongo 4
        self.client = MongoClient(self.mongo_uri, username=self.db_user,
                                  password=self.db_pass, authSource=self.db_name)
        self.zfdb = self.client[self.db_name]

    def close_spider(self, spider):  # called when the spider closes; closes the connection
        self.client.close()

    def process_item(self, item, spider):
        # insert_one() replaces Collection.insert(), which was removed in PyMongo 4
        self.zfdb[self.collection].insert_one({"title": item["title"].strip()})
        return item
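from_crawler reads four keys from the project settings. A settings.py fragment that would satisfy it might look like the following; every value here is a placeholder assumption and none comes from the original post.

```python
# settings.py (fragment): keys read by MongoPipeline.from_crawler
MONGO_URI = 'mongodb://localhost:27017'  # assumed local server
DB_NAME = 'douban'                       # hypothetical database name
DB_USER = 'douban_user'                  # hypothetical account
DB_PASS = 'change-me'                    # placeholder; keep real passwords out of version control
```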

Saving to MySQL

from sqlalchemy import create_engine, Column, String, BIGINT
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker


class MysqlPipeline(object):
    # username, password, host, and db_name are placeholders; replace them with your own
    MYSQL_URI = 'mysql+pymysql://username:password@host:3306/db_name'
    # With echo=True, SQLAlchemy prints the raw SQL statements it runs
    engine = create_engine(MYSQL_URI, echo=True)
    Base = declarative_base()

    # Define a single table
    class Movie(Base):
        __tablename__ = 'movies'
        id = Column(BIGINT, primary_key=True, autoincrement=True)
        title = Column(String(200))

    # Create the tables
    def init_db(self):
        self.Base.metadata.create_all(self.engine)

    # Drop the tables
    def drop_db(self):
        self.Base.metadata.drop_all(self.engine)

    def open_spider(self, spider):  # called when the spider starts; connects to the database
        self.init_db()
        Session = sessionmaker(bind=self.engine)
        self.session = Session()

    def close_spider(self, spider):  # called when the spider closes; releases the session
        self.session.close()

    def process_item(self, item, spider):
        new_movie = self.Movie(title=item["title"].strip())
        self.session.add(new_movie)
        try:
            self.session.commit()
        except Exception:
            self.session.rollback()  # keep the session usable for the next item
            raise
        return item

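Once the pipelines are written, enable them in settings.py. The number after each pipeline sets its order: items pass through lower-valued pipelines first, and by convention the values fall in the 0-1000 range, so you can choose your own. You can save in every format at once, or comment out the others and keep only one.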

ITEM_PIPELINES = {
    'doubanTop250.pipelines.MongoPipeline': 300,
    'doubanTop250.pipelines.MysqlPipeline': 301,
    'doubanTop250.pipelines.CsvPipeline': 302,
    'doubanTop250.pipelines.JsonPipeline': 303,
}
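To see which pipeline an item reaches first under this configuration, the entries can be sorted by their value. The snippet is only an illustration of the ordering rule in plain Python; it is not how Scrapy builds its pipeline chain internally, and run_order is a name introduced for this example.

```python
ITEM_PIPELINES = {
    'doubanTop250.pipelines.MongoPipeline': 300,
    'doubanTop250.pipelines.MysqlPipeline': 301,
    'doubanTop250.pipelines.CsvPipeline': 302,
    'doubanTop250.pipelines.JsonPipeline': 303,
}

# Lower values run first, so sort the class paths by their number
run_order = [path.rsplit('.', 1)[-1]
             for path, _ in sorted(ITEM_PIPELINES.items(), key=lambda kv: kv[1])]
print(run_order)
# ['MongoPipeline', 'MysqlPipeline', 'CsvPipeline', 'JsonPipeline']
```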
