3. Multivariate LSTM Forecast Model

In this section, we will fit an LSTM to the problem.


LSTM Data Preparation

The first step is to prepare the pollution dataset for the LSTM.


This involves framing the dataset as a supervised learning problem and normalizing the input variables.


We will frame the supervised learning problem as predicting the pollution at the current hour (t) given the pollution measurement and weather conditions at the prior time step.
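As a minimal sketch of this framing, the prior-hour inputs can be built by shifting every column down one row and pairing them with the current-hour pollution. The toy values and the column names below are made up for illustration and are not taken from pollution.csv.

```python
from pandas import DataFrame

# Toy hourly data (made-up values, not from pollution.csv)
df = DataFrame({
    'pollution': [10.0, 20.0, 30.0, 40.0],
    'temp': [1.0, 2.0, 3.0, 4.0],
})

# Inputs: every variable at the prior hour (t-1)
framed = df.shift(1).add_suffix('(t-1)')
# Output: pollution at the current hour (t)
framed['pollution(t)'] = df['pollution']
# The first row has no prior hour, so it contains NaN and is dropped
framed = framed.dropna()
print(framed)
```

Each remaining row holds one training example: the two inputs observed one hour earlier and the pollution value to predict.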


This formulation is deliberately simple and is used only for this demonstration. Some alternate formulations you could explore include:


  • Predict the pollution for the next hour based on the weather conditions and pollution over the last 24 hours.
  • Predict the pollution for the next hour as above and given the “expected” weather conditions for the next hour.
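The first alternative could be sketched roughly as follows. The helper frame_with_lags() is hypothetical (it is not part of the tutorial), and the data is synthetic; the point is only to show how 24 lagged copies of every variable line up against the value to predict.

```python
import numpy as np
from pandas import DataFrame, concat

def frame_with_lags(df, target, n_lags):
    """Hypothetical helper: lagged copies of every column plus the target at t."""
    # One shifted copy of the whole frame per lag, oldest lag first
    lagged = [df.shift(i).add_suffix('(t-%d)' % i) for i in range(n_lags, 0, -1)]
    out = concat(lagged + [df[[target]].add_suffix('(t)')], axis=1)
    # Rows without a full history contain NaN and are dropped
    return out.dropna()

# Synthetic data: 30 hours, 2 variables (made-up values)
hours = 30
df = DataFrame({
    'pollution': np.arange(hours, dtype='float32'),
    'temp': np.arange(hours, dtype='float32') * 0.5,
})

framed = frame_with_lags(df, 'pollution', n_lags=24)
print(framed.shape)
```

With 30 hours of data and a 24-hour window, only 6 rows have a complete history, and each row has 24 * 2 input columns plus 1 output column. The second alternative would additionally append the weather columns at time t as inputs.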

We can transform the dataset using the series_to_supervised() function developed in an earlier blog post; it is reproduced in the code listing below.

First, the “pollution.csv” dataset is loaded. The wind direction feature is label encoded (integer encoded); it could instead be one-hot encoded if you are interested in exploring that. Label encoding and one-hot encoding are two distinct encoding schemes: the former maps each category to a single integer, while the latter gives each category its own binary column.
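To make the difference between the two encodings concrete, here is a small sketch. The wind direction labels and the 'wnd' column prefix are made-up illustration values; LabelEncoder is the same class used in the listing below, and pandas get_dummies is one possible way to one-hot encode.

```python
from pandas import Series, get_dummies
from sklearn.preprocessing import LabelEncoder

# Made-up wind direction labels for illustration
wind = Series(['SE', 'NW', 'cv', 'SE', 'NE'], name='wnd_dir')

# Label (integer) encoding: each category becomes a single integer
encoder = LabelEncoder()
labels = encoder.fit_transform(wind)
print(list(encoder.classes_))
print(labels.tolist())

# One-hot encoding: each category becomes its own binary column
one_hot = get_dummies(wind, prefix='wnd')
print(one_hot)
```

Label encoding keeps a single column but implies an ordering between categories; one-hot encoding avoids that at the cost of one extra column per category.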

Next, all features are normalized, then the dataset is transformed into a supervised learning problem. The weather variables for the hour to be predicted (t) are then removed.
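The normalization step can be illustrated on a tiny array. This is a sketch with made-up numbers; it shows that MinMaxScaler with feature_range=(0, 1) rescales each column independently using (x - min) / (max - min), and that the fitted scaler can map values back to the original units.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up values: 3 rows, 2 features on very different scales
values = np.array([[10.0, -5.0],
                   [20.0, 0.0],
                   [40.0, 5.0]], dtype='float32')

scaler = MinMaxScaler(feature_range=(0, 1))
scaled = scaler.fit_transform(values)

# The same result computed by hand, column by column
col_min = values.min(axis=0)
col_max = values.max(axis=0)
manual = (values - col_min) / (col_max - col_min)

# The fitted scaler can undo the transform
restored = scaler.inverse_transform(scaled)
print(scaled)
```

Keeping the fitted scaler around matters later, because predictions made on the scaled data need the inverse transform to be read in the original units.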


The complete code listing is provided below.


from pandas import DataFrame
from pandas import concat
from pandas import read_csv
from pandas import set_option
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import MinMaxScaler

# convert series to supervised learning
def series_to_supervised(data, n_in=1, n_out=1, dropnan=True):
    n_vars = 1 if type(data) is list else data.shape[1]
    df = DataFrame(data)
    cols, names = list(), list()
    # input sequence (t-n, ... t-1)
    for i in range(n_in, 0, -1):
        cols.append(df.shift(i))
        names += [('var%d(t-%d)' % (j + 1, i)) for j in range(n_vars)]
    # forecast sequence (t, t+1, ... t+n)
    for i in range(0, n_out):
        cols.append(df.shift(-i))
        if i == 0:
            names += [('var%d(t)' % (j + 1)) for j in range(n_vars)]
        else:
            names += [('var%d(t+%d)' % (j + 1, i)) for j in range(n_vars)]
    # put it all together
    agg = concat(cols, axis=1)
    agg.columns = names
    # drop rows with NaN values
    if dropnan:
        agg.dropna(inplace=True)
    return agg

# load dataset
dataset = read_csv('pollution.csv', header=0, index_col=0)
values = dataset.values
# integer encode direction
encoder = LabelEncoder()
values[:, 4] = encoder.fit_transform(values[:, 4])
# ensure all data is float
values = values.astype('float32')
# normalize features
scaler = MinMaxScaler(feature_range=(0, 1))
scaled = scaler.fit_transform(values)
# frame as supervised learning
reframed = series_to_supervised(scaled, 1, 1)
# drop columns we don't want to predict
reframed.drop(reframed.columns[[9, 10, 11, 12, 13, 14, 15]], axis=1, inplace=True)
# show every column, then print the first 5 rows
set_option('display.max_columns', None)
print(reframed.head())

Running the example prints the first 5 rows of the transformed dataset. We can see the 8 input variables (input series) and the 1 output variable (pollution level at the current hour).


   var1(t-1)  var2(t-1)  var3(t-1)  var4(t-1)  var5(t-1)  var6(t-1)  \
1   0.129779   0.352941   0.245902   0.527273   0.666667   0.002290
2   0.148893   0.367647   0.245902   0.527273   0.666667   0.003811
3   0.159960   0.426471   0.229508   0.545454   0.666667   0.005332
4   0.182093   0.485294   0.229508   0.563637   0.666667   0.008391
5   0.138833   0.485294   0.229508   0.563637   0.666667   0.009912

   var7(t-1)  var8(t-1)   var1(t)
1   0.000000        0.0  0.148893
2   0.000000        0.0  0.159960
3   0.000000        0.0  0.182093
4   0.037037        0.0  0.138833
5   0.074074        0.0  0.109658

This data preparation is simple and there is more we could explore. Some ideas you could look at include:

  • One-hot encoding wind direction.
  • Making all series stationary with differencing and seasonal adjustment.
  • Providing more than 1 hour of input time steps.

This last point is perhaps the most important, given that LSTMs use backpropagation through time when learning sequence prediction problems.
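Two of these ideas can be sketched briefly. The series values and the sizes below are made up, and the [samples, time steps, features] layout is an assumption about how a Keras-style LSTM layer would consume the data, not something fixed by this section.

```python
import numpy as np
from pandas import Series

# 1) First-order differencing on a made-up series
series = Series([10.0, 12.0, 15.0, 14.0, 18.0])
diff = series.diff().dropna()              # value(t) - value(t-1)
restored = series.iloc[0] + diff.cumsum()  # invert the differencing

# 2) Several input hours, reshaped for a recurrent layer
n_samples, n_hours, n_features = 5, 3, 8   # assumed sizes for illustration
# Stand-in for framed inputs whose columns run (t-3) block, (t-2) block, (t-1) block
flat = np.arange(n_samples * n_hours * n_features, dtype='float32')
flat = flat.reshape((n_samples, n_hours * n_features))
# Reshape to [samples, time steps, features]
X = flat.reshape((n_samples, n_hours, n_features))

print(diff.tolist())
print(X.shape)
```

The reshape only works as intended when the flat columns are grouped by time step, oldest first, which is the order series_to_supervised() produces for its input columns.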





