Scrapy: loading parsing rules from a JSON file so one spider can be reused, plus data cleaning

阿新 • Published: 2018-12-25

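When we build crawlers with the Scrapy framework, pages that follow different layouts normally need separate spider files, so much of the same code has to be written again and again. One way around this is to keep the parsing rules in a JSON file and have a single spider load them at run time.
Scraped data also tends to contain values we do not need, or values whose type or format is not what we want. It has to be cleaned and consolidated before it meets our needs and can be stored.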
1. Create a JSON file and write the page's parsing rules into it.

{
    "title":"tr#places_country__row td.w2p_fw::text",
    "population":"tr#places_population__row td.w2p_fw::text"
}
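
Because the rules now live outside the code, a typo in the file only shows up when the spider runs. The following is a minimal sketch of a loader that checks the file first. The function name `load_rules` and the `allowed_fields` argument are assumptions made for illustration and are not part of the original project.

```python
import json


def load_rules(path, allowed_fields):
    """Read a {field: css_selector} mapping and reject malformed entries.

    `allowed_fields` is a hypothetical argument: the set of field names
    declared on the item (here "title" and "population").
    """
    with open(path, "r", encoding="utf8") as fp:
        rules = json.load(fp)
    if not isinstance(rules, dict):
        raise ValueError("rules file must contain a JSON object")
    for field, selector in rules.items():
        if field not in allowed_fields:
            raise ValueError(f"unknown field in rules file: {field!r}")
        if not isinstance(selector, str) or not selector.strip():
            raise ValueError(f"selector for {field!r} must be a non-empty string")
    return rules
```

Failing early like this keeps a misspelled field name from surfacing only after pages have already been downloaded.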

2. Create the spider.

# -*- coding: utf-8 -*-
import scrapy
import json
from scrapy import Request
from country_spider.items import CountryItem
from country_spider.items import CountryItemLoader


class CountrySpider(scrapy.Spider):
    name = 'country'
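    # Must cover the host being crawled, otherwise the offsite filter can drop requests
    allowed_domains = ['example.webscraping.com']

    def __init__(self, *args, **kwargs):
        # Let scrapy.Spider run its own initialisation (name, spider arguments)
        super().__init__(*args, **kwargs)
        self.urls = ['http://example.webscraping.com/places/default/view/China-47']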

    def start_requests(self):
        for url_str in self.urls:
            yield Request(url_str, callback=self.parse, dont_filter=True)

    def parse(self, response):
        loader = CountryItemLoader(item=CountryItem(), response=response)

        # Load the parsing rules from the JSON file
        with open("<path to the JSON file>", "r", encoding="utf8") as fp:
            rules = json.load(fp)
        for field, selector in rules.items():
            loader.add_css(field, selector)

        # The rules can also be added field by field,
        # using loader.add_xpath / loader.add_css:
        # loader.add_css("title", "tr#places_country__row td.w2p_fw::text")
        # loader.add_css("population", "tr#places_population__row td.w2p_fw::text")
        return loader.load_item()
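
The spider above treats every rule as a CSS selector, while the comments mention that `add_xpath` is available too. As a rough sketch, the rule file could say which kind of selector each field uses. The nested `{"css": ...}` / `{"xpath": ...}` layout and the helper name `apply_rules` are assumptions for illustration; only `add_css` and `add_xpath` come from the item loader itself.

```python
def apply_rules(loader, rules):
    """Register every rule on the loader, choosing add_css or add_xpath.

    Hypothetical rule layout:
        {"title": "td.w2p_fw::text",                 # plain string -> CSS
         "population": {"xpath": "//td/text()"}}     # explicit selector type
    """
    for field, rule in rules.items():
        if isinstance(rule, str):
            # Plain strings keep the original behaviour: treat them as CSS
            loader.add_css(field, rule)
        elif isinstance(rule, dict) and "xpath" in rule:
            loader.add_xpath(field, rule["xpath"])
        elif isinstance(rule, dict) and "css" in rule:
            loader.add_css(field, rule["css"])
        else:
            raise ValueError(f"rule for {field!r} needs a 'css' or 'xpath' key")
```

Inside `parse`, the `for` loop over the rules would then shrink to a single `apply_rules(...)` call on the loader, and existing rule files made of plain strings keep working.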

3. Clean the scraped data in the item.

# -*- coding: utf-8 -*-

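from scrapy.loader import ItemLoader
from scrapy.item import Item
from scrapy import Field
# Scrapy 2.2+ ships the processors in the itemloaders package;
# older releases import them from scrapy.loader.processors instead.
from itemloaders.processors import MapCompose, TakeFirst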


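def str_convert(value):
    # Prefix the country name so the cleaned value is easy to recognise
    return "country_" + value


def get_nums(value):
    # Remove thousands separators, e.g. "1,234" -> "1234" (the result is still a string)
    return value.replace(",", "")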


class CountryItemLoader(ItemLoader):
    # TakeFirst returns the first value in the list (similar to extract_first).
    # Use it as the default output processor for every field.
    default_output_processor = TakeFirst()


class CountryItem(Item):
    # Define an input processor: MapCompose passes each extracted value through str_convert to clean it
    title = Field(
        input_processor=MapCompose(str_convert),
    )
    population = Field(
        input_processor=MapCompose(get_nums),
    )
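
To make the data flow easier to follow, here is a plain-Python sketch of what the two processors do to a field's values. `map_compose` and `take_first` are simplified stand-ins written for this illustration, not the library classes, and the input strings are sample values.

```python
def str_convert(value):
    return "country_" + value


def get_nums(value):
    return value.replace(",", "")


def map_compose(*functions):
    """Simplified stand-in for MapCompose: run each function over every value."""
    def process(values):
        for func in functions:
            next_values = []
            for value in values:
                result = func(value)
                if result is None:
                    continue  # a None result drops the value
                if isinstance(result, list):
                    next_values.extend(result)  # list results are flattened
                else:
                    next_values.append(result)
            values = next_values
        return values
    return process


def take_first(values):
    """Simplified stand-in for TakeFirst: first value that is not None or ''."""
    for value in values:
        if value is not None and value != "":
            return value
    return None


# Selectors always yield a list of strings, even for a single match
title = take_first(map_compose(str_convert)(["China"]))
population = take_first(map_compose(get_nums)(["1,330,044,000"]))
print(title)       # country_China
print(population)  # 1330044000
```

The input processor runs once per extracted value as it is collected, and the output processor runs once when `load_item()` builds the final item, which is why each field ends up holding a single string rather than a list.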

4. Store the data in pipelines.


class CountrySpiderPipeline(object):
    def process_item(self, item, spider):
        # Only prints the cleaned fields; a real pipeline would write them to a file or database
        print("#" * 50)
        print("item name is ::", item["title"])
        print("item content is ::", item["population"])
        print("#" * 50)
        return item
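
The pipeline above only prints each item. As one possible way to actually persist the data, the sketch below appends every item to a JSON Lines file. The class name `JsonLinesPipeline` and the default file name `countries.jl` are assumptions for illustration, and the pipeline would still have to be enabled through the project's `ITEM_PIPELINES` setting.

```python
import json


class JsonLinesPipeline:
    """Write each item to a JSON Lines file, one JSON object per line."""

    def __init__(self, path="countries.jl"):  # hypothetical output file
        self.path = path
        self.file = None

    def open_spider(self, spider):
        # Called once when the spider starts
        self.file = open(self.path, "w", encoding="utf8")

    def close_spider(self, spider):
        # Called once when the spider finishes
        self.file.close()

    def process_item(self, item, spider):
        # dict(item) works for Scrapy items as well as plain dicts
        line = json.dumps(dict(item), ensure_ascii=False)
        self.file.write(line + "\n")
        return item
```

Returning the item from `process_item` lets any later pipeline in the chain keep processing it.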

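Finally, the spider's start URLs can also be loaded from a JSON file. That is not written out here; it is easy to implement by following the same approach as the parsing rules.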
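
For readers who want a starting point, a minimal sketch of that idea follows. It assumes a JSON file whose `start_urls` key holds a list of URL strings; both the key name and the helper `load_start_urls` are hypothetical.

```python
import json


def load_start_urls(path):
    """Return the list stored under "start_urls" in a JSON file.

    Assumed file layout (hypothetical):
        {"start_urls": ["http://example.webscraping.com/places/default/view/China-47"]}
    """
    with open(path, "r", encoding="utf8") as fp:
        data = json.load(fp)
    urls = data.get("start_urls") if isinstance(data, dict) else None
    if not isinstance(urls, list) or not all(isinstance(u, str) for u in urls):
        raise ValueError('"start_urls" must be a list of URL strings')
    return urls
```

In the spider, the hard-coded list assigned to `self.urls` could then be replaced with the result of this function, leaving `start_requests` unchanged.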
