04】tsfresh: a package for extracting time-series features

Install

  • Assuming a Python development environment is already installed on your PC:

## Install directly with pip
pip install tsfresh

## Check that the installation succeeded
from tsfresh import extract_features

## Dependencies pulled in by tsfresh (requirements.txt)
requests>=2.9.1
numpy>=1.10.4
pandas>=0.20.3
scipy>=0.17.0
statsmodels>=0.8.0  ## built on the statsmodels framework
patsy>=0.4.1
scikit-learn>=0.17.1
future>=0.16.0
six>=1.10.0
tqdm>=4.10.0
ipaddress>=1.0.18; python_version <= '2.7'
dask>=0.15.2
distributed>=1.18.3
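
To see which of these packages actually ended up in the environment, a small check with the standard library can help. This is only a sketch: the helper name installed_version and the selection of packages are my own, not something tsfresh provides.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of a distribution, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# A few of the distributions from the dependency list above.
for name in ["tsfresh", "numpy", "pandas", "scipy", "statsmodels", "scikit-learn"]:
    print(name, installed_version(name) or "not installed")
```

Comparing the printed versions against the minimum versions listed above is left to the reader.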

Basic steps

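  • Prepare the data: the time series to be processed; in the women's apparel project this is time versus GMV;

  • Feature extraction: extract_features

  • Feature filtering: drop meaningless values (NaN) and keep the meaningful features; this also reduces dimensionality;

  • Extraction and filtering in one call: extract_relevant_features(timeseries, y, column_id='id', column_sort='time')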
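
The data-preparation step can be sketched as follows. tsfresh reads time series from a "long" frame with one row per observation; the column names id, time and value are the ones passed as column_id, column_sort and column_value later in this post. The GMV numbers and the helper to_long_format are made up for illustration, and dropping missing observations is a simplification.

```python
import pandas as pd


def to_long_format(ds, gmv, series_id=1):
    """Turn parallel (ds, gmv) sequences into a long frame with one row per observation."""
    frame = pd.DataFrame({"ds": pd.to_datetime(ds), "value": gmv})
    # Drop missing observations and put the rest in chronological order.
    frame = frame.dropna(subset=["value"]).sort_values("ds").reset_index(drop=True)
    frame["id"] = series_id            # which series the row belongs to
    frame["time"] = range(len(frame))  # integer position used for sorting
    return frame[["id", "time", "value"]]


# Made-up monthly GMV, deliberately unsorted and with one missing value.
raw_ds = ["2018-03-01", "2018-01-01", "2018-02-01", "2018-04-01"]
raw_gmv = [130.0, 100.0, None, 90.0]

timeseries = to_long_format(raw_ds, raw_gmv)
print(timeseries)
```

A frame shaped like this is what would then be handed to extract_features or extract_relevant_features.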

Examples

Examples in the source repository

  • https://github.com/blue-yonder/tsfresh/tree/master/notebooks

available tasks

  • time series classification

  • compression

  • forecasting

Time Series Forecasting - jupyter notebook

  • tsfresh.utilities.dataframe_functions.make_forecasting_frame(x, kind, max_timeshift, rolling_direction)

  1. x (np.array or pd.Series) – the singular time series, i.e. the historical data;

  2. kind (str) – the kind of the time series;

  3. max_timeshift (int)

    – If not None, shift only up to max_timeshift. If None, shift as often as possible;

  4. rolling_direction (int) – The sign decides, if to roll backwards (if sign is positive) or forwards in "time";

  5. Returns: time series container df, target vector y;

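Note: df_shift, y = make_forecasting_frame(class_df_all['y'], kind="gmv", max_timeshift=24, rolling_direction=1)

The rolling process of make_forecasting_frame() is shown in the figure above. Suppose class_df_all['y'] holds 60 points, so that the returned target vector has len(y) = 59, and max_timeshift = 10. The number of rows in df_shift is then: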

  • (max_timeshift + 1)*(max_timeshift/2) + (len(y) - max_timeshift)*max_timeshift

With rolling_direction = 1, the returned df_shift is a stacked frame of 545 rows, built up as follows:

id = 1: feature_matrix, time = 0

id = 2: feature_matrix, time = 0,1

id = 3: feature_matrix, time = 0,1,2

... ...

id = 10: feature_matrix, time = 0,1,2,3,4,5,6,7,8,9

id = 11: feature_matrix, time = 0,1,2,3,4,5,6,7,8,9 ## max_timeshift = 10 caps the window length at 10

id = 12: feature_matrix, time = 0,1,2,3,4,5,6,7,8,9 ## max_timeshift = 10 caps the window length at 10

... ...

id = 59: feature_matrix, time = 0,1,2,3,4,5,6,7,8,9

So: 545 = (1+10)*10/2 + (59-10)*10
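
The count can be checked with a few lines of plain Python. This is my own sketch of the rolling scheme listed above, not tsfresh's implementation: rolling_windows and expected_rows are hypothetical helpers, the series is made up, and the closed form assumes there are at least max_timeshift targets.

```python
def rolling_windows(x, max_timeshift):
    """Yield (id, window, target); window holds up to max_timeshift values before x[id]."""
    for i in range(1, len(x)):
        start = max(0, i - max_timeshift)
        yield i, x[start:i], x[i]


def expected_rows(n_targets, max_timeshift):
    """Closed form from the text, with len(y) = n_targets >= max_timeshift."""
    m = max_timeshift
    return (m + 1) * m // 2 + (n_targets - m) * m


series = list(range(60))  # 60 made-up points, which gives 59 targets
windows = list(rolling_windows(series, max_timeshift=10))
total_rows = sum(len(window) for _, window, _ in windows)

print(len(windows), total_rows, expected_rows(len(windows), 10))
```

The first ten windows grow from 1 to 10 values (55 rows in total) and each of the remaining 49 holds exactly 10 values, which is where the two terms of the formula come from.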

With rolling_direction = -1, the process is:

id = 1: feature_matrix, time = 0,1,2,3,4,5,6,7,8,9 ## max_timeshift = 10 caps the window length at 10

id = 2: feature_matrix, time = 0,1,2,3,4,5,6,7,8,9

id = 3: feature_matrix, time = 0,1,2,3,4,5,6,7,8,9

... ...

id = 57: feature_matrix, time = 0,1

id = 58: feature_matrix, time = 0

  • extract_features, feature extraction: extract features from the df_shift frame produced by the rolling step above: X = extract_features(df_shift, column_id="id", column_sort="time", column_value="value", impute_function=impute, show_warnings=False) ## did not run under Spyder, but ran under Jupyter Notebook;

  • Resulting features: [59 rows x 794 columns] --> 59 samples (rows), each with a 794-dimensional feature vector

(794 features; see class ComprehensiveFCParameters)

  • Inputs that extract_features accepts:

1) a pandas.DataFrame containing the different time series;

2) a dictionary of pandas.DataFrame each containing one type of time series;

  • extract_relevant_features: extracts features and filters out part of them
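
The two input forms can be illustrated with pandas alone. The sketch below uses made-up numbers, and the second kind "orders" is hypothetical (the project in this post only has GMV). As far as I know, the flat form is passed with column_kind="kind", while the dictionary form is passed without column_kind.

```python
import pandas as pd

# Form 1: one flat frame; the "kind" column tells the types of series apart.
flat = pd.DataFrame({
    "id":    [1, 1, 1, 1, 2, 2, 2, 2],
    "time":  [0, 1, 0, 1, 0, 1, 0, 1],
    "kind":  ["gmv", "gmv", "orders", "orders", "gmv", "gmv", "orders", "orders"],
    "value": [100.0, 120.0, 7.0, 9.0, 80.0, 85.0, 5.0, 6.0],
})

# Form 2: a dictionary with one frame per kind, each holding id, time and value.
by_kind = {
    kind: group.drop(columns="kind").reset_index(drop=True)
    for kind, group in flat.groupby("kind")
}

for kind, frame in by_kind.items():
    print(kind)
    print(frame)
```

Both containers carry the same observations; which one is more convenient depends on how the raw data is stored.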

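Questions about the approach

Regression model

  • Input: a feature vector (feature)

  • Output: the predicted (regressed) value

  • Question: GMV is the target. If the data consists only of (ds, gmv) pairs, is a regression model unsuitable?

  • Analysis: a regression model takes features as input. To predict GMV for the next 2 months, we need the feature vector for each of those 2 months, and feed each one to the model to obtain the corresponding prediction.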
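
One common way around this is to build the features for a future month from the history itself, and to feed each prediction back in as history for the next step. The sketch below illustrates that idea with plain lag features and ordinary least squares. It is a deliberately simpler stand-in, not the tsfresh features or the AdaBoost model used in this post; the helper names and GMV numbers are made up.

```python
import numpy as np


def make_lag_matrix(series, n_lags):
    """Return (X, y): each row of X holds the n_lags values preceding its target in y."""
    series = np.asarray(series, dtype=float)
    X = np.array([series[i - n_lags:i] for i in range(n_lags, len(series))])
    y = series[n_lags:]
    return X, y


def fit_linear(X, y):
    """Least-squares fit of y on the lag features plus an intercept column."""
    A = np.column_stack([X, np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef


def forecast(series, n_lags, horizon):
    """Predict `horizon` future steps, feeding each prediction back in as history."""
    history = [float(v) for v in series]
    X, y = make_lag_matrix(history, n_lags)
    coef = fit_linear(X, y)
    predictions = []
    for _ in range(horizon):
        features = np.append(history[-n_lags:], 1.0)  # latest lags + intercept
        next_value = float(features @ coef)
        predictions.append(next_value)
        history.append(next_value)
    return predictions


gmv = [10.0, 12.0, 14.0, 16.0, 18.0, 20.0, 22.0, 24.0]  # made-up monthly GMV
two_months = forecast(gmv, n_lags=3, horizon=2)
print(two_months)
```

With tsfresh the same loop would apply in principle: extract features from the most recent window, predict, append the prediction, and repeat. The price is that errors accumulate over the horizon.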

Script - 20180717

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from tsfresh import extract_features
from tsfresh.utilities.dataframe_functions import make_forecasting_frame
from sklearn.ensemble import AdaBoostRegressor
from tsfresh.utilities.dataframe_functions import impute

import warnings
warnings.filterwarnings('ignore')

## load dataset

month_list = ["Jan","Feb","Mar","Apr","May","June",
              "July","Aug","Sept","Oct","Nov","Dec"]

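## hypothetical placeholders: set these to the id and name of the category to analyse
cate_id, cate_name = 0, 'category'
all_leaf_class_name_dict = {cate_id: cate_name}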

df = pd.read_csv('./cate_by_month_histroy.csv', header=0, encoding='gbk')
df.columns = ['ds', 'cate_id', 'cate_name', 'y']
class_df_all = df[df.cate_name.str.startswith(cate_name)].reset_index(drop=True)
class_df_all = class_df_all[['ds', 'y']]
class_df_all = class_df_all[:60]
# print(class_df_all.head())

## plot
fig = plt.figure(facecolor='white')
ax = fig.add_subplot(111)
ax.plot(class_df_all['ds'], class_df_all['y'], label='gmv')  # labelled so that legend() has an entry
for tick in ax.get_xticklabels():
    tick.set_rotation(90)
fig.set_size_inches(18, 8)
ax.legend()

## make_forecasting_frame
df_shift, y = make_forecasting_frame(class_df_all['y'], kind="gmv", max_timeshift=24, rolling_direction=1)
# print(df_shift)
# print(y)

## extract_features
X = extract_features(df_shift, column_id="id", column_sort="time", column_value="value", impute_function=impute, 
                     show_warnings=False)

## regression model
ada = AdaBoostRegressor()

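## in-sample fit: the model is trained and evaluated on the same rows,
## so the plot below shows the training fit, not forecast accuracy
ada.fit(X, y)
y_pred = ada.predict(X)
print(X.shape)

## walk-forward alternative: refit on rows [0, i) and predict row i
# y_pred = [0] * len(y)
# y_pred[0] = y.iloc[0]
# for i in range(1, len(y)):
#     ada.fit(X.iloc[:i], y[:i])
#     y_pred[i] = ada.predict(X.iloc[[i]])[0]  # double brackets keep the row 2-D
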
y_pred = pd.Series(data=y_pred, index=y.index)

plt.figure(figsize=(15, 6))
plt.plot(y, label="true")
plt.plot(y_pred, label="predicted")
plt.legend()
plt.show()
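
The commented-out loop in the script hints at a walk-forward evaluation, where each point is predicted by a model that has only seen earlier points. A generic sketch of that scheme follows. LastValueRegressor is a toy stand-in I made up so that the example runs without scikit-learn; any estimator with fit and predict, such as AdaBoostRegressor, could be passed in its place.

```python
import numpy as np


class LastValueRegressor:
    """Toy model with a scikit-learn-like interface: predicts the last target it saw."""

    def fit(self, X, y):
        self.last_ = float(np.asarray(y)[-1])
        return self

    def predict(self, X):
        return np.full(len(X), self.last_)


def walk_forward(make_model, X, y, min_train=1):
    """Predict y[i] from a fresh model trained on rows [0, i) only."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    predictions = np.full(len(y), np.nan)  # rows without training data stay NaN
    for i in range(min_train, len(y)):
        model = make_model().fit(X[:i], y[:i])
        predictions[i] = model.predict(X[i:i + 1])[0]  # slice keeps the row 2-D
    return predictions


features = np.arange(5).reshape(-1, 1)  # made-up feature matrix
targets = [1.0, 2.0, 3.0, 4.0, 5.0]     # made-up targets
preds = walk_forward(LastValueRegressor, features, targets)
print(preds)
```

Unlike the in-sample fit in the script, the errors of such predictions say something about how the model would do on months it has not seen.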

Known issues

  • ImportError: cannot import name 'is_list_like': https://stackoverflow.com/questions/50394873/import-pandas-datareader-gives-importerror-cannot-import-name-is-list-like

  • extract_features: under Anaconda Spyder, execution stalls at the extract_features call (an IDE problem?), as shown in the figure below:

  • extract_features: under Jupyter Notebook the same call runs without trouble, as shown in the figure below:

Reference