Technical notes on saving Scrapy crawler output as CSV files

For work I needed to save the crawler's output as CSV files; until now I had only ever saved to JSON. Most of the methods floating around online did not work for me. The two most common are the following:

from scrapy import signals
# In modern Scrapy the exporter lives in scrapy.exporters;
# older releases used scrapy.contrib.exporter
from scrapy.exporters import CsvItemExporter


class CSVPipeline(object):

    def __init__(self):
        self.files = {}

    @classmethod
    def from_crawler(cls, crawler):
        # Hook the pipeline into the spider_opened / spider_closed signals
        pipeline = cls()
        crawler.signals.connect(pipeline.spider_opened, signals.spider_opened)
        crawler.signals.connect(pipeline.spider_closed, signals.spider_closed)
        return pipeline

    def spider_opened(self, spider):
        # CsvItemExporter expects a binary file object
        file = open('%s_items.csv' % spider.name, 'w+b')
        self.files[spider] = file
        self.exporter = CsvItemExporter(file)
        # Replace with your item's field names; the order here determines the column order
        self.exporter.fields_to_export = ['field1', 'field2', 'field3']
        self.exporter.start_exporting()

    def spider_closed(self, spider):
        self.exporter.finish_exporting()
        file = self.files.pop(spider)
        file.close()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item
The second approach:
import csv


class CSVPipeline(object):

    def __init__(self):
        # In Python 3 a csv file must be opened in text mode with newline=''
        self.csvfile = open('items.csv', 'w', newline='', encoding='utf-8')
        self.csvwriter = csv.writer(self.csvfile, delimiter=',')
        # Header row
        self.csvwriter.writerow(['names', 'stars', 'subjects', 'reviews'])

    def process_item(self, item, spider):
        # Assumes each item field holds a list; zip pairs the lists up row by row
        rows = zip(item['names'], item['stars'], item['subjects'], item['reviews'])
        for row in rows:
            self.csvwriter.writerow(row)
        return item

Neither approach worked for me; nothing was saved. After digging into it, the root cause turned out to be that the format of the data the spider produced did not match the format the file was being written in. Once the two were aligned, the file saved successfully.
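
As a concrete illustration of that kind of mismatch: if each item field holds a single string rather than a list, zip() in the second pipeline iterates character by character and produces useless rows. A minimal sketch of a pipeline that instead writes one row per item, assuming scalar field values (the field names names/stars/subjects/reviews are carried over from the example above):

import csv


class SingleRowCSVPipeline(object):
    """Writes one CSV row per item, assuming each field is a plain scalar value."""

    def open_spider(self, spider):
        # newline='' prevents blank lines between rows on Windows
        self.file = open('items.csv', 'w', newline='', encoding='utf-8')
        self.writer = csv.DictWriter(
            self.file, fieldnames=['names', 'stars', 'subjects', 'reviews'])
        self.writer.writeheader()

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        self.writer.writerow(dict(item))
        return item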

Opening the saved file directly in Excel shows garbled characters.

Open it with another editor such as EditPlus and save it again in a BOM-prefixed encoding (UTF-8 with BOM).

Open it once more and the file now displays correctly.
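
The EditPlus step can also be skipped by writing the BOM from Python in the first place: the 'utf-8-sig' codec prepends the byte order mark that lets Excel recognise the file as UTF-8. A minimal sketch of writing the same header row that way:

import csv

# 'utf-8-sig' prepends a BOM, so Excel detects UTF-8 and Chinese text is not garbled
with open('items.csv', 'w', newline='', encoding='utf-8-sig') as f:
    writer = csv.writer(f)
    writer.writerow(['names', 'stars', 'subjects', 'reviews'])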