CNTK API Documentation Translation (11): Predicting Time Series Data with LSTM (IoT Data)

阿新 • Published: 2019-01-11

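In the previous installment we built a simple LSTM network to predict values in time series data. In this installment we apply that model to real-world Internet of Things (IoT) data. As an example, we predict the daily power output of a solar panel from the first readings observed that day.

Solar power forecasting is an important and difficult problem, and it is closely tied to weather forecasting. In practice the problem has two parts: first, forecasting solar irradiance or other meteorological variables; second, computing the power a panel produces under the forecast weather. Problems of this kind usually involve variation across spatial and temporal scales. This tutorial takes a much simpler approach and predicts the day's output only from the power readings the panel has already produced.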

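Goal

Using the historical daily production of a solar panel, we want to predict the total power production of the solar panel array for a given day. We use the LSTM-based time series model developed in the previous installment to make this prediction from earlier readings.

We train the model on historical data from the solar panel. In our example, we predict the total production of the solar panel array for the day from the readings seen so far. We start predicting after the first two readings and adjust the prediction as each new reading arrives.

This tutorial reuses the LSTM model from the previous installment, so it is organized into similar parts:

Initialization
Data generation
LSTM model construction
Model training and evaluation

Initialization

We start by importing some libraries and defining constants that are used later.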

from matplotlib import pyplot as plt
import math
import numpy as np
import os
import pandas as pd
import random
import time

import cntk as C

try:
    from urllib.request import urlretrieve
except ImportError:
    from urllib import urlretrieve

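The code below selects the device (GPU or CPU) by checking an environment variable defined for CNTK's internal use. Without this check, CNTK falls back to its default policy and picks the best available device (the GPU if one is available, otherwise the CPU).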

if 'TEST_DEVICE' in os.environ:
    if os.environ['TEST_DEVICE'] == 'cpu':
        C.device.try_set_default_device(C.device.cpu())
    else:
        C.device.try_set_default_device(C.device.gpu(0))

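There are two run modes:

Fast mode: isFast is set to True. This is the default. In this mode we train for fewer epochs and use less data. It verifies that the code works, but the trained model is far from usable.
Slow mode: we recommend that learners try setting isFast to False while working through the tutorial, which gives a better understanding of the material.

In fast mode we train for 100 epochs. The resulting accuracy is low, but it is sufficient for development. The model currently performs well after roughly 1000 to 2000 epochs.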

isFast = True

# we need around 2000 epochs to see good accuracy. For testing 100 epochs will do.
EPOCHS = 100 if isFast else 2000

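Our solar panel reports a reading every 30 minutes:

solar.current is the current production, in watts.
solar.total is the total energy produced so far that day, in watt-hours.

Our prediction starts from the first two readings of the day. From there, we keep predicting and adjust the prediction as each new actual reading arrives. The data is in CSV format and looks like this: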

time,solar.current,solar.total
7am,6.3,1.7
7:30am,44.3,11.4
…

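Preprocessing

Most of the code in this example is data preprocessing. Thanks to the pandas library, this work is straightforward.

The generate_solar_data function performs the following tasks:

Read the CSV data into a pandas DataFrame
Normalize the data
Group the data by day
Add the solar.current.max and solar.total.max columns
Generate the sequences for each day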
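The grouping and max-column steps can be sketched on a toy frame. This is a minimal illustration rather than the tutorial's own code: it uses groupby(...).transform("max") instead of the merge used in generate_solar_data, and the dates and the second day's readings are made up for the example.

```python
import pandas as pd

# Toy readings: the first day's values come from the CSV sample above,
# the timestamps and the second day are invented for illustration.
df = pd.DataFrame(
    {"solar.current": [6.3, 44.3, 0.0, 30.0],
     "solar.total": [1.7, 11.4, 0.0, 9.0]},
    index=pd.to_datetime(["2019-01-01 07:00", "2019-01-01 07:30",
                          "2019-01-02 07:00", "2019-01-02 07:30"]),
)

# Compute each day's maximum and broadcast it back onto every row of that day.
daily_max = df.groupby(df.index.date)[["solar.current", "solar.total"]].transform("max")
df["solar.current.max"] = daily_max["solar.current"]
df["solar.total.max"] = daily_max["solar.total"]

print(df)
```

Every row of a given day ends up carrying that day's maximum, which is what the model later uses as its label.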

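Sequence generation: all sequences are concatenated into a single list of sequences. The training input no longer has any notion of a timestamp; only the sequences matter.

Note: if a day has fewer than 8 readings, we assume data is missing and skip that day. If a day has more than 14 readings, we truncate it to 14.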
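The skip and truncate rules, together with the growing-prefix scheme, can be isolated in a small helper. The function name day_to_sequences and its parameters are hypothetical; the loop mirrors the inner loop of generate_solar_data, including the detail that the longest prefix stops one short of the full day when the day has 14 readings or fewer.

```python
import numpy as np

def day_to_sequences(readings, time_steps=14, min_readings=8):
    """Return the growing prefixes of one day's readings, or [] if the day is skipped."""
    readings = np.asarray(readings, dtype=np.float32)
    if len(readings) < min_readings:
        return []  # too few readings: assume data is missing and skip the day
    sequences = []
    for j in range(2, len(readings)):
        sequences.append(readings[:j])  # the first j readings of the day
        if j >= time_steps:
            break  # never feed more than time_steps readings
    return sequences

short_day = day_to_sequences(np.arange(5))
normal_day = day_to_sequences(np.arange(10))
long_day = day_to_sequences(np.arange(20))
print(len(short_day), len(normal_day), len(long_day))  # 0 8 13
```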

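Training / Testing / Validation

We start by reading the CSV file for use with CNTK. The rows are ordered by time. Normally we would shuffle them before splitting into training, validation, and test sets, but that would destroy the chronological order of the data. We therefore split as follows: taking the days in sequence, eight go to training, one to validation, and one to test. This spreads the training, validation, and test sets across the full timeline.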

def generate_solar_data(input_url, time_steps, normalize=1, val_size=0.1, test_size=0.1):
    """
    generate sequences to feed to rnn based on data frame with solar panel data
    the csv has the format: time ,solar.current, solar.total
     (solar.current is the current output in Watt, solar.total is the total production
      for the day so far in Watt hours)
    """
    # try to find the data file locally. If it doesn't exist, download it.
    cache_path = os.path.join("data", "iot")
    cache_file = os.path.join(cache_path, "solar.csv")
    if not os.path.exists(cache_path):
        os.makedirs(cache_path)
    if not os.path.exists(cache_file):
        urlretrieve(input_url, cache_file)
        print("downloaded data successfully from ", input_url)
    else:
        print("using cache for ", input_url)

    df = pd.read_csv(cache_file, index_col="time", parse_dates=['time'], dtype=np.float32)

    df["date"] = df.index.date

    # normalize data
    df['solar.current'] /= normalize
    df['solar.total'] /= normalize

    # group by day, find the max for a day and add a new column .max
    grouped = df.groupby(df.index.date).max()
    grouped.columns = ["solar.current.max", "solar.total.max", "date"]

    # merge continuous readings and daily max values into a single frame
    df_merged = pd.merge(df, grouped, right_index=True, on="date")
    df_merged = df_merged[["solar.current", "solar.total",
                           "solar.current.max", "solar.total.max"]]
    # we group by day so we can process a day at a time.
    grouped = df_merged.groupby(df_merged.index.date)
    per_day = []
    for _, group in grouped:
        per_day.append(group)

    # split the dataset into train, validation and test sets on day boundaries
    val_size = int(len(per_day) * val_size)
    test_size = int(len(per_day) * test_size)
    next_val = 0
    next_test = 0

    result_x = {"train": [], "val": [], "test": []}
    result_y = {"train": [], "val": [], "test": []}    

    # generate sequences a day at a time
    for i, day in enumerate(per_day):
        # if we have less than 8 datapoints for a day we skip over the
        # day assuming something is missing in the raw data
        total = day["solar.total"].values
        if len(total) < 8:
            continue
        if i >= next_val:
            current_set = "val"
            next_val = i + int(len(per_day) / val_size)
        elif i >= next_test:
            current_set = "test"
            next_test = i + int(len(per_day) / test_size)
        else:
            current_set = "train"
        max_total_for_day = np.array(day["solar.total.max"].values[0])
        for j in range(2, len(total)):
            result_x[current_set].append(total[0:j])
            result_y[current_set].append([max_total_for_day])
            if j >= time_steps:
                break
    # make result_y a numpy array
    for ds in ["train", "val", "test"]:
        result_y[ds] = np.array(result_y[ds])
    return result_x, result_y
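The way generate_solar_data rotates days between the three sets is easier to see in isolation. The sketch below reproduces only that assignment logic. The name assign_days is hypothetical, the skipping of short days is left out, and it assumes at least 10 days so that the per-set counts are non-zero.

```python
from collections import Counter

def assign_days(n_days, val_frac=0.1, test_frac=0.1):
    """Label each day index as 'train', 'val', or 'test' along the timeline."""
    val_count = int(n_days * val_frac)
    test_count = int(n_days * test_frac)
    next_val = 0
    next_test = 0
    labels = []
    for i in range(n_days):
        if i >= next_val:
            labels.append("val")
            next_val = i + n_days // val_count    # next validation day
        elif i >= next_test:
            labels.append("test")
            next_test = i + n_days // test_count  # next test day
        else:
            labels.append("train")
    return labels

labels = assign_days(100)
print(labels[:12])
print(Counter(labels))
```

With 100 days, every block of ten days contributes one validation day, one test day, and eight training days.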

Data caching

For routine testing we prefer a locally cached copy of the data. If no cached copy is available, we download it.

# there are 14 lstm cells, 1 for each possible reading we get per day
TIMESTEPS = 14

# 20000 is the maximum total output in our dataset. We normalize all values with 
# this so our inputs are between 0.0 and 1.0 range.
NORMALIZE = 20000

X, Y = generate_solar_data("https://www.cntk.ai/jup/dat/solar.csv", 
                           TIMESTEPS, normalize=NORMALIZE)

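Data fetching utility

next_batch yields the next batch of data for training. We use the variable-length sequences that CNTK supports, so each batch is a list of numpy arrays of varying length.

The standard practice is to shuffle the batches on every epoch. We do not do that here, so that the data stays easy to interpret when we plot it later.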

# process batches of 10 days
BATCH_SIZE = TIMESTEPS * 10

def next_batch(x, y, ds):
    """get the next batch for training"""

    def as_batch(data, start, count):
        return data[start:start + count]

    for i in range(0, len(x[ds]), BATCH_SIZE):
        yield as_batch(x[ds], i, BATCH_SIZE), as_batch(y[ds], i, BATCH_SIZE)
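To see what such a generator produces, here is a self-contained sketch with toy data. The name iter_batches and the toy arrays are made up for illustration. It slices a list of variable-length numpy arrays in order, without shuffling, and the last batch is simply shorter.

```python
import numpy as np

def iter_batches(sequences, labels, batch_size):
    """Yield consecutive (sequences, labels) slices; the last batch may be smaller."""
    for start in range(0, len(sequences), batch_size):
        yield sequences[start:start + batch_size], labels[start:start + batch_size]

# Toy data: 25 sequences whose lengths cycle through 2..6, one label per sequence.
toy_x = [np.zeros(2 + k % 5, dtype=np.float32) for k in range(25)]
toy_y = np.ones((25, 1), dtype=np.float32)

batch_sizes = [len(xb) for xb, _ in iter_batches(toy_x, toy_y, batch_size=10)]
print(batch_sizes)  # [10, 10, 5]
```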

Understanding the data format

Now you can see the data we feed into the LSTM network.

X['train'][0:3]

Output:

[array([ 0.       ,  0.0006985], dtype=float32),
 array([ 0.       ,  0.0006985,  0.0033175], dtype=float32),
 array([ 0.       ,  0.0006985,  0.0033175,  0.010375 ], dtype=float32)]

Y['train'][0:3]

Output:

array([[ 0.23899999],
       [ 0.23899999],
       [ 0.23899999]], dtype=float32)
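The values above are normalized. Assuming they were divided by NORMALIZE = 20000 as in generate_solar_data, multiplying by the same constant recovers the original watt-hour figures. A quick check of the arithmetic:

```python
NORMALIZE = 20000  # same constant as in the tutorial

inputs = [0.0, 0.0006985, 0.0033175, 0.010375]  # the third sequence shown above
label = 0.23899999                               # the label shown above

inputs_wh = [round(v * NORMALIZE, 2) for v in inputs]
label_wh = round(label * NORMALIZE)
print(inputs_wh)
print(label_wh)  # 4780
```

So this sequence belongs to a day whose total production was about 4780 watt-hours.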

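LSTM network setup

Corresponding to the maximum of 14 input readings, our model has 14 LSTM cells, one for each possible reading in a day. Because the number of readings per day varies between 8 and 14, the input sequences differ in length. We take advantage of CNTK's variable-length sequences instead of padding the inputs.

The output of the network is the total power output for the day, and every sequence from a given day has the same total output. For example: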

1.7,11.4 -> 10300
1.7,11.4,67.5 -> 10300
1.7,11.4,67.5,250.5 ... -> 10300
1.7,11.4,67.5,250.5,573.5 -> 10300

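The LSTM output is fed into a fully connected layer, with 20% of the values randomly dropped to avoid overfitting. The output of the fully connected layer is the model's prediction.

The code below implements this design: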

def create_model(x):
    """Create the model for time series prediction"""
    with C.layers.default_options(initial_state = 0.1):
        m = C.layers.Recurrence(C.layers.LSTM(TIMESTEPS))(x)
        m = C.sequence.last(m)
        m = C.layers.Dropout(0.2)(m)
        m = C.layers.Dense(1)(m)
        return m
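For readers who want to see the data flow without CNTK, the same pipeline (recurrence over the sequence, take the last state, dense layer with one output) can be sketched in plain numpy. This is an illustrative forward pass under stated assumptions, not CNTK's implementation. The weights are random and untrained, the initial state is zero rather than the 0.1 used above, dropout is omitted because it is inactive at inference, and all names are hypothetical.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def lstm_last_hidden(seq, W, U, b):
    """Run one LSTM layer over seq (shape [T, input_dim]); return the last hidden state."""
    hidden = U.shape[1]
    h = np.zeros(hidden)
    c = np.zeros(hidden)
    for x_t in seq:
        gates = W @ x_t + U @ h + b  # all four gates at once, shape [4 * hidden]
        i, f, o, g = np.split(gates, 4)
        i, f, o, g = sigmoid(i), sigmoid(f), sigmoid(o), np.tanh(g)
        c = f * c + i * g            # update the cell state
        h = o * np.tanh(c)           # new hidden state
    return h

def predict(seq, params):
    """Last LSTM state followed by a dense layer with a single output."""
    h = lstm_last_hidden(seq, params["W"], params["U"], params["b"])
    return params["w_out"] @ h + params["b_out"]

rng = np.random.default_rng(0)
HIDDEN, INPUT_DIM = 14, 1  # 14 matches LSTM(TIMESTEPS); one value per reading
params = {
    "W": rng.normal(scale=0.1, size=(4 * HIDDEN, INPUT_DIM)),
    "U": rng.normal(scale=0.1, size=(4 * HIDDEN, HIDDEN)),
    "b": np.zeros(4 * HIDDEN),
    "w_out": rng.normal(scale=0.1, size=(1, HIDDEN)),
    "b_out": np.zeros(1),
}

short_seq = np.array([[0.0], [0.0006985]])
long_seq = np.array([[0.0], [0.0006985], [0.0033175], [0.010375]])
print(predict(short_seq, params).shape, predict(long_seq, params).shape)  # (1,) (1,)
```

Sequences of different lengths go through the same function unchanged, which is the property the tutorial relies on when it skips padding.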

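Training

Before training we need to bind the input variables and choose an optimizer. In this example we use an Adam-style optimizer (the code calls CNTK's fsadagrad learner) with squared_error as the loss function.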

# input sequences
x = C.sequence.input_variable(1)

# create the model
z = create_model(x)

# expected output (label); the label input uses the same dynamic axes
# as the model output
l = C.input_variable(1, dynamic_axes=z.dynamic_axes, name="y")

# the learning rate
learning_rate = 0.005
lr_schedule = C.learning_rate_schedule(learning_rate, C.UnitType.minibatch)

# loss function
loss = C.squared_error(z, l)

# use squared error to determine error for now
error = C.squared_error(z, l)

# use an Adam-style optimizer (CNTK's fsadagrad learner)
momentum_time_constant = C.momentum_as_time_constant_schedule(BATCH_SIZE / -math.log(0.9)) 
learner = C.fsadagrad(z.parameters, 
                      lr = lr_schedule, 
                      momentum = momentum_time_constant)
trainer = C.Trainer(z, (loss, error), [learner])


# training
loss_summary = []

start = time.time()
for epoch in range(0, EPOCHS):
    for x_batch, l_batch in next_batch(X, Y, "train"):
        trainer.train_minibatch({x: x_batch, l: l_batch})

    if epoch % (EPOCHS / 10) == 0:
        training_loss = trainer.previous_minibatch_loss_average
        loss_summary.append(training_loss)
        print("epoch: {}, loss: {:.4f}".format(epoch, training_loss))

print("Training took {:.1f} sec".format(time.time() - start))

Output:

epoch: 0, loss: 0.0966
epoch: 10, loss: 0.0305
epoch: 20, loss: 0.0208
epoch: 30, loss: 0.0096
epoch: 40, loss: 0.0088
epoch: 50, loss: 0.0072
epoch: 60, loss: 0.0071
epoch: 70, loss: 0.0075
epoch: 80, loss: 0.0082
epoch: 90, loss: 0.0082
Training took 134.4 sec
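The momentum in the training code is expressed as a time constant. Assuming CNTK's convention that a time constant tc corresponds to a per-sample momentum of exp(-1/tc), the expression BATCH_SIZE / -math.log(0.9) amounts to a momentum of 0.9 per minibatch of BATCH_SIZE samples. The arithmetic can be checked without CNTK:

```python
import math

TIMESTEPS = 14
BATCH_SIZE = TIMESTEPS * 10

# Same expression as in the training code above.
time_constant = BATCH_SIZE / -math.log(0.9)

# Assumed convention: momentum over n samples is exp(-n / time_constant).
per_sample = math.exp(-1.0 / time_constant)
per_minibatch = math.exp(-BATCH_SIZE / time_constant)

print(round(time_constant, 1))
print(round(per_minibatch, 6))  # 0.9
```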

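Next we evaluate the model on the training, validation, and test sets. Mean squared error is a somewhat simplistic measure. A better one would be the fraction of predictions that fall within a given tolerance.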

# validate
def get_mse(X,Y,labeltxt):
    result = 0.0
    for x1, y1 in next_batch(X, Y, labeltxt):
        eval_error = trainer.test_minibatch({x : x1, l : y1})
        result += eval_error
    return result/len(X[labeltxt])

# Print the train and validation errors
for labeltxt in ["train", "val"]:
    print("mse for {}: {:.6f}".format(labeltxt, get_mse(X, Y, labeltxt)))

# Print the test error
labeltxt = "test"
print("mse for {}: {:.6f}".format(labeltxt, get_mse(X, Y, labeltxt)))
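A tolerance-based measure of the kind mentioned earlier could look like the following sketch. The function name, the 10% default tolerance, and the sample numbers are all made up for illustration. In the tutorial's setting, the predictions would come from z.eval and the actual values from Y.

```python
import numpy as np

def fraction_within_tolerance(predicted, actual, tolerance=0.1):
    """Share of predictions whose relative error is at most `tolerance`."""
    predicted = np.asarray(predicted, dtype=np.float64).ravel()
    actual = np.asarray(actual, dtype=np.float64).ravel()
    # Guard against division by zero for days with no production.
    relative_error = np.abs(predicted - actual) / np.maximum(np.abs(actual), 1e-12)
    return float(np.mean(relative_error <= tolerance))

# Hypothetical daily totals in watt-hours.
actual = np.array([10300.0, 4780.0, 8000.0, 12000.0])
predicted = np.array([10000.0, 5500.0, 8100.0, 11900.0])
print(fraction_within_tolerance(predicted, actual))  # 0.75
```

Unlike the mean squared error, this number reads directly as "how often was the forecast close enough", at the cost of having to pick a tolerance.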

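Visualizing the predictions

The model has trained well: the training, validation, and test errors are all in the same range. Predicted time series are easy to visualize, so let us compare the actual values with the predictions.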

# predict
f, a = plt.subplots(2, 1, figsize=(12, 8))
for j, ds in enumerate(["val", "test"]):
    results = []
    for x_batch, _ in next_batch(X, Y, ds):
        pred = z.eval({x: x_batch})
        results.extend(pred[:, 0])
    # because we normalized the input data we need to multiply the prediction
    # by NORMALIZE to get the real values.
    a[j].plot((Y[ds] * NORMALIZE).flatten(), label=ds + ' raw');
    a[j].plot(np.array(results) * NORMALIZE, label=ds + ' pred');
    a[j].legend();

If we train for 2000 epochs, the predictions follow the actual data much more closely.
