
CNTK API Documentation Translation (8): Basic Time Series Analysis with Pandas and Financial Data

This installment shows how to work with time series data in CNTK. The tutorial demonstrates how to prepare time series data for a deep learning algorithm, train a neural network, and evaluate it. Specifically, we investigate whether predicting the movement of an exchange-traded fund (ETF) with a classifier is reliable, and then use that simple classification to decide whether to buy or sell. This tutorial is only an example of analyzing time series data with CNTK; there is no guarantee that the trained model is fit for actual trading decisions. The stock market is far too complex and very hard to predict; so far, human experts in the field still do this best.

This tutorial introduces the pandas data reader and the pandas library; the numpy data structures used in earlier installments also work well inside Pandas DataFrames.

from __future__ import print_function
import datetime
import numpy as np
import os
import pandas as pd
pd.options.mode.chained_assignment = None  # default='warn'
import cntk as C

In the code below, we pick the right device (GPU or CPU) to run on by checking an environment variable defined by CNTK's test infrastructure. Without this check, CNTK applies its default policy of using the best available device (GPU if one is available, otherwise CPU).

# Select the right target device when this notebook is being tested:
if 'TEST_DEVICE' in os.environ:
    if os.environ['TEST_DEVICE'] == 'cpu':
        C.device.try_set_default_device(C.device.cpu())
    else:
        C.device.try_set_default_device(C.device.gpu(0))

Importing stock data

First we obtain stock data with the get_stock_data function. It retrieves daily bars through the pandas data reader; the code below reads from Google Finance, though Yahoo Finance or other sources could be substituted. The code demonstrates several uses of the pandas data reader.

# A method which obtains stock data from Google Finance
# Requires an internet connection to retrieve the stock data

import time
try:
    from pandas_datareader import data
except ImportError:
    !pip install pandas_datareader
    from pandas_datareader import data

# Set a random seed
np.random.seed(123)

def get_stock_data(contract, s_year, s_month, s_day, e_year, e_month, e_day):
    """
    Args:
        contract (str): the name of the stock/etf
        s_year (int): start year for data
        s_month (int): start month
        s_day (int): start day
        e_year (int): end year
        e_month (int): end month
        e_day (int): end day
    Returns:
        Pandas Dataframe: Daily OHLCV bars
    """
    start = datetime.datetime(s_year, s_month, s_day)
    end = datetime.datetime(e_year, e_month, e_day)

    retry_cnt, max_num_retry = 0, 3

    while(retry_cnt < max_num_retry):
        try:
            bars = data.DataReader(contract,"google", start, end)
            return bars
        except:
            retry_cnt += 1
            time.sleep(np.random.randint(1,10)) 

    print("Google Finance is not reachable")
    raise Exception('Google Finance is not reachable')


import pickle as pkl

# Look in the local cache for the stock data set with symbol SPY.
# Check for an environment variable defined in CNTK's test infrastructure
envvar = 'CNTK_EXTERNAL_TESTDATA_SOURCE_DIRECTORY'
def is_test(): return envvar in os.environ

def download(data_file):
    try:
        data = get_stock_data("SPY", 2000, 1,2,2017,1,1)
    except:
        raise Exception("Data could not be downloaded")

    dir = os.path.dirname(data_file)

    if not os.path.exists(dir):
        os.makedirs(dir)

    if not os.path.isfile(data_file):
        print("Saving", data_file )
        with open(data_file, 'wb') as f:
            pkl.dump(data, f, protocol = 2)
    return data

data_file = os.path.join("data", "Stock", "stock_SPY.pkl")

# Check for data in local cache
if os.path.exists(data_file):
    print("File already exists", data_file)
    data = pd.read_pickle(data_file)
else: 
    # If not there we might be running in CNTK's test infrastructure
    if is_test():
        test_file = os.path.join(os.environ[envvar], 'Tutorials','data','stock','stock_SPY.pkl')
        if os.path.isfile(test_file):
            print("Reading data from test data directory")
            data = pd.read_pickle(test_file)
        else:
            print("Test data directory missing file", test_file)
            print("Downloading data from Google Finance")
            data = download(data_file)         
    else:
        # Local cache is not present and not test env
        # download the data from Google Finance and cache it in a local directory
        # Please check if there is trade data for the chosen stock symbol during this period
        data = download(data_file)

Building the training parameters

Stock market ups and downs exhibit a degree of autocorrelation. We use SPY, an ETF covering the 500 largest listed U.S. companies, as our proxy for the market. Our trading decisions will be based on the short-term movements this autocorrelation lets us predict.
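As a quick, informal check of that autocorrelation claim, you can measure the serial correlation of daily returns directly with pandas. This is a minimal sketch, assuming the data DataFrame loaded above; it is not part of the tutorial's pipeline:

# Rough check of serial correlation in SPY daily returns (illustrative only).
# Assumes `data` is the DataFrame loaded above, with a "Close" column.
daily_returns = data["Close"].pct_change().dropna()
for lag in [1, 2, 5, 8]:
    # Series.autocorr computes the Pearson correlation of the series
    # with a copy of itself shifted back by `lag` trading days.
    print("lag {}: autocorr = {:.4f}".format(lag, daily_returns.autocorr(lag=lag)))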

What we predict

  • Whether the market's close on the next day is above or below the current day's close.

Predictors

  • For each of the previous eight days, whether the current close is above that day's close.
  • The percentage change in price from the previous day.
  • The percentage change in volume from the previous day.

Note that we do not feed raw stock prices directly into the neural network. Financial time series are full of noise, and this matters greatly for reducing overfitting. There is a lot we could do before training, such as smoothing the data or adding more features, but to keep the tutorial simple we skip that processing and simply demonstrate CNTK's ability to handle time series data.
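As an example of the preprocessing we are skipping, a rolling mean is a common way to smooth a noisy price series. The sketch below is illustrative only and is not used anywhere else in this tutorial:

# Illustrative smoothing step (not used below): a 5-day rolling mean
# damps the day-to-day noise in the closing price.
smoothed_close = data["Close"].rolling(window=5).mean()
# The first 4 values are NaN because the rolling window is incomplete there.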

# Feature name list
predictor_names = []

# Compute price difference as a feature
data["diff"] = np.abs((data["Close"] - data["Close"].shift(1)) / data["Close"]).fillna(0) 
predictor_names.append("diff")

# Compute the volume difference as a feature
data["v_diff"] = np.abs((data["Volume"] - data["Volume"].shift(1)) / data["Volume"]).fillna(0) 
predictor_names.append("v_diff")

# Compute whether the stock is up (1) or down (0) at different day offsets compared to the current day's closing price
num_days_back = 8

for i in range(1,num_days_back+1):
    data["p_" + str(i)] = np.where(data["Close"] > data["Close"].shift(i), 1, 0) # i: number of look back days
    predictor_names.append("p_" + str(i))

# If you want to save the file to your local drive
#data.to_csv("PATH_TO_SAVE.csv")
data.head(10)

What are we trying to predict?

We want to predict whether the next day's close is up or down: 1 if it is up, 0 otherwise (we ignore the unlikely case of no change).

data["next_day"] = np.where(data["Close"].shift(-1) > data["Close"], 1, 0)
data["next_day_opposite"] = np.where(data["next_day"]==1,0,1) # The label must be one-hot encoded

# Establish the start and end date of our training timeseries (picked 2000 days before the market crash)
training_data = data["2001-02-05":"2009-01-20"] 

# Define the test timeseries; extending the end date lets you include data up to the present
test_data = data["2009-01-20":"2016-12-29"]
training_features = np.asarray(training_data[predictor_names], dtype = "float32")
training_labels = np.asarray(training_data[["next_day","next_day_opposite"]], dtype="float32")

Now we build the neural network. We will use a simple feedforward network. Note that we use CNTK's Layers library instead of writing the layers ourselves.

# Lets build the network
input_dim = 2 + num_days_back
num_output_classes = 2  # 2 classes, since we classify whether the market goes up or down, one-hot encoded
num_hidden_layers = 2
hidden_layers_dim = 2 + num_days_back
input_dynamic_axes = [C.Axis.default_batch_axis()]
input = C.input_variable(input_dim, dynamic_axes=input_dynamic_axes)
label = C.input_variable(num_output_classes, dynamic_axes=input_dynamic_axes)

def create_model(input, num_output_classes):
    h = input
    with C.layers.default_options(init = C.glorot_uniform()):
        for i in range(0,num_hidden_layers):
            h = C.layers.Dense(hidden_layers_dim, 
                               activation = C.relu)(h)
        r = C.layers.Dense(num_output_classes, activation=None)(h)   
    return r

z = create_model(input, num_output_classes)
loss = C.cross_entropy_with_softmax(z, label)
label_error = C.classification_error(z, label)
lr_per_minibatch = C.learning_rate_schedule(0.125,C.UnitType.minibatch)
trainer = C.Trainer(z, (loss, label_error), [C.sgd(z.parameters, lr=lr_per_minibatch)])


# Initialize the parameters for the trainer; we train in large minibatches in sequential order
minibatch_size = 100
num_minibatches = len(training_data.index) // minibatch_size

# Run the trainer and perform model training
training_progress_output_freq = 1

# Visualize the loss over minibatch
plotdata = {"batchsize":[], "loss":[], "error":[]}

How we train on time series data: how many passes to make

Unlike in earlier installments, we do not feed the trainer randomly drawn samples; this time every minibatch is organized in time order. When dealing with time series data, we want more recent data to carry slightly more weight in the updates. You could run through the data several times, but that causes overfitting and, in turn, poor training results. Other techniques can also be used to avoid overfitting, for example L1 regularization (see part six of my Python and Artificial Neural Networks series for details); a sketch of that option follows.
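For reference, if you did reuse the data across several passes, CNTK's built-in learners can apply L1/L2 regularization directly. This is a minimal sketch with illustrative, untuned weights, not something the tutorial itself runs:

# Sketch: an SGD learner with L1/L2 regularization, for the case where
# the data is reused across passes. The weight values are illustrative.
regularized_learner = C.sgd(z.parameters,
                            lr=lr_per_minibatch,
                            l1_regularization_weight=0.001,
                            l2_regularization_weight=0.001)
# trainer = C.Trainer(z, (loss, label_error), [regularized_learner])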

tf = np.split(training_features,num_minibatches)

print("Number of mini batches")
print(len(tf))

print("The shape of the training feature minibatch")
print(tf[0].shape)

tl = np.split(training_labels, num_minibatches)

# It is key that we make only one pass through the data linearly in time
num_passes = 1 

# Defines a utility that prints the training progress
def print_training_progress(trainer, mb, frequency, verbose=1):
    training_loss = "NA"
    eval_error = "NA"
    if mb%frequency == 0:
        training_loss = trainer.previous_minibatch_loss_average
        eval_error = trainer.previous_minibatch_evaluation_average
        if verbose: 
            print ("Minibatch: {0}, Loss: {1:.4f}, Error: {2:.2f}%".format(mb, training_loss, eval_error*100))
    return mb, training_loss, eval_error

# Train our neural network
tf = np.split(training_features,num_minibatches)
tl = np.split(training_labels, num_minibatches)

for i in range(num_minibatches*num_passes): # repeat the minibatch sequence once per pass
    features = np.ascontiguousarray(tf[i%num_minibatches])
    labels = np.ascontiguousarray(tl[i%num_minibatches])

    # Specify the mapping of input variables in the model to actual minibatch data to be trained with
    trainer.train_minibatch({input : features, label : labels})
    batchsize, loss, error = print_training_progress(trainer, i, training_progress_output_freq, verbose=1)
    if not (loss == "NA" or error =="NA"):
        plotdata["batchsize"].append(batchsize)
        plotdata["loss"].append(loss)
        plotdata["error"].append(error)

Training results

Minibatch: 0, Loss: 0.7874, Error: 54.00%
Minibatch: 1, Loss: 0.7570, Error: 51.00%
Minibatch: 2, Loss: 0.7579, Error: 61.00%
Minibatch: 3, Loss: 0.6916, Error: 47.00%
Minibatch: 4, Loss: 0.7127, Error: 54.00%
Minibatch: 5, Loss: 0.7286, Error: 59.00%
Minibatch: 6, Loss: 0.7056, Error: 50.00%
Minibatch: 7, Loss: 0.6975, Error: 48.00%
Minibatch: 8, Loss: 0.7059, Error: 56.00%
Minibatch: 9, Loss: 0.7037, Error: 54.00%
Minibatch: 10, Loss: 0.7567, Error: 60.00%
Minibatch: 11, Loss: 0.8480, Error: 52.00%
Minibatch: 12, Loss: 0.6917, Error: 45.00%
Minibatch: 13, Loss: 0.7526, Error: 58.00%
Minibatch: 14, Loss: 0.6823, Error: 47.00%
Minibatch: 15, Loss: 0.8856, Error: 40.00%
Minibatch: 16, Loss: 0.8299, Error: 48.00%
Minibatch: 17, Loss: 1.1737, Error: 51.00%
Minibatch: 18, Loss: 0.7951, Error: 53.00%
Minibatch: 19, Loss: 0.7809, Error: 48.00%

Visualization

import matplotlib.pyplot as plt

plt.figure(1)
plt.subplot(211)
plt.plot(plotdata["batchsize"], plotdata["loss"], 'b--')
plt.xlabel('Minibatch number')
plt.ylabel('Loss')
plt.title('Minibatch run vs. Training loss ')
plt.show()

plt.subplot(212)
plt.plot(plotdata["batchsize"], plotdata["error"], 'r--')
plt.xlabel('Minibatch number')
plt.ylabel('Label Prediction Error')
plt.title('Minibatch run vs. Label Prediction Error ')
plt.show()

[Figure: minibatch number vs. training loss]
[Figure: minibatch number vs. label prediction error]

The error hovers around 50%. Note that the data changes over time, so the system picks up noise as time moves on, while the model keeps learning from the market data. Given this much noise, reaching an error below 50% is a decent result. Investment firms trade at high frequency, so they can profit even with modest accuracy. Deciding every day's trade from the classifier is expensive from a transaction-cost point of view, so can we instead train a model to judge whether a trade will be profitable?

Let's give it a try:

# Now that we have trained the net, run an out-of-sample test to see how we did,
# and, more importantly, analyze how that test set performed

test_features = np.ascontiguousarray(test_data[predictor_names], dtype = "float32")
test_labels = np.ascontiguousarray(test_data[["next_day","next_day_opposite"]], dtype="float32")

avg_error = trainer.test_minibatch({input : test_features, label : test_labels})
print("Average error: {0:2.2f}%".format(avg_error * 100))

The error printed here should also be around 50%, seemingly no better than a coin flip. Let's do a little more work and see whether there is any predictive power:

out = C.softmax(z)
predicted_label_prob = out.eval({input:test_features})
test_data["p_up"] = pd.Series(predicted_label_prob[:,0], index = test_data.index)
test_data["p_down"] = predicted_label_prob[:,1]
test_data['long_entries'] = np.where((test_data.p_up > 0.55), 1, 0)
test_data['short_entries'] = np.where((test_data.p_down > 0.55) , -1, 0)
test_data['positions'] = test_data['long_entries'].fillna(0) + test_data['short_entries'].fillna(0)

Evaluating our data

The code above takes the model's outputs on the test data and normalizes them into probabilities with a softmax. But probabilities alone are not useful yet; in other words, we need a clear signal that tells us to trade, rather than merely a guess at whether the market will rise or fall tomorrow: the data is very noisy, and frequent trading incurs substantial transaction costs.

In the code we set the threshold at 55%. If the predicted probability that the next day closes above the current day is over 55%, we go long; if the predicted probability that it closes below is over 55%, we go short.

We also evaluate the time series with several other measures: average monthly return, standard deviation of monthly returns, the Sharpe ratio, and maximum drawdown. The Sharpe ratio is the average return minus the risk-free rate, divided by the standard deviation of returns, with all quantities annualized.

In general, a higher Sharpe ratio means less risk is taken to earn the same return. This measure rests on the assumption that the mean and standard deviation of returns are enough to describe their distribution.
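Concretely, for monthly returns the annualized Sharpe ratio works out to sqrt(12) times the mean excess monthly return divided by its standard deviation. A small helper capturing this, assuming a risk-free rate of zero (the same simplification the code below makes):

# Annualized Sharpe ratio from a series of monthly returns.
# Assumes risk_free_rate is quoted per month; this tutorial uses 0.
def annualized_sharpe(monthly_returns, risk_free_rate=0.0):
    excess = monthly_returns - risk_free_rate
    return np.sqrt(12) * np.mean(excess) / np.std(excess)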

def create_drawdowns(equity_curve):
    """
    Calculate the largest peak-to-trough drawdown of the PnL curve
    as well as the duration of the drawdown. Requires that the 
    pnl_returns is a pandas Series.

    Parameters:
    pnl - A pandas Series representing period percentage returns.

    Returns:
    drawdown, duration - Highest peak-to-trough drawdown and duration.
    """

    # Calculate the cumulative returns curve 
    # and set up the High Water Mark
    # Then create the drawdown and duration series
    hwm = [0]
    eq_idx = equity_curve.index
    drawdown = pd.Series(index = eq_idx)
    duration = pd.Series(index = eq_idx)

    # Loop over the index range
    for t in range(1, len(eq_idx)):
        cur_hwm = max(hwm[t-1], equity_curve[t])
        hwm.append(cur_hwm)
        drawdown[t]= (hwm[t] - equity_curve[t]) 
        duration[t]= 0 if drawdown[t] == 0 else duration[t-1] + 1
    return drawdown.max(), duration.max()


plt.figure()
test_data["p_up"].hist(bins=20, alpha=0.4)
test_data["p_down"].hist(bins=20, alpha=0.4)
plt.title("Distribution of Probabilities")
plt.legend(["p_up", "p_down"])
plt.ylabel("Frequency")
plt.xlabel("Probablity")
plt.show()

[Figure: distribution of predicted probabilities p_up and p_down]

test_data["pnl"] = test_data["Close"].diff().shift(-1).fillna(0)*test_data["positions"]/np.where(test_data["Close"]!=0,test_data["Close"],1)
test_data["perc"] = (test_data["Close"] - test_data["Close"].shift(1)) / test_data["Close"].shift(1)
monthly = test_data.pnl.resample("M").sum()
monthly_spy = test_data["perc"].resample("M").sum()
avg_return = np.mean(monthly)
std_return = np.std(monthly)
sharpe = np.sqrt(12) * avg_return / std_return
drawdown = create_drawdowns(monthly.cumsum())
spy_drawdown = create_drawdowns(monthly_spy.cumsum())
print("TRADING STATS")
print("AVG Monthly Return :: " + "{0:.2f}".format(round(avg_return*100,2))+ "%")
print("STD Monthly        :: " + "{0:.2f}".format(round(std_return*100,2))+ "%")
print("SHARPE             :: " + "{0:.2f}".format(round(sharpe,2)))
print("MAX DRAWDOWN       :: " + "{0:.2f}".format(round(drawdown[0]*100,2)) + "%, " + str(drawdown[1]) + " months" )
print("Correlation to SPY :: " + "{0:.2f}".format(round(np.corrcoef(test_data["pnl"], test_data["diff"])[0][1],2)))
print("NUMBER OF TRADES   :: " + str(np.sum(test_data.positions.abs())))
print("TOTAL TRADING DAYS :: " + str(len(data)))
print("SPY MONTHLY RETURN :: " + "{0:.2f}".format(round(monthly_spy.mean()*100,2)) + "%")
print("SPY STD RETURN     :: " + "{0:.2f}".format(round(monthly_spy.std()*100,2)) + "%")
print("SPY SHARPE         :: " + "{0:.2f}".format(round(monthly_spy.mean()/monthly_spy.std()*np.sqrt(12),2)))
print("SPY DRAWDOWN       :: " + "{0:.2f}".format(round(spy_drawdown[0]*100,2)) + "%, "  + str(spy_drawdown[1]) + " months" )

print(drawdown[0])
(monthly.cumsum()*100).plot()
(monthly_spy.cumsum()*100).plot()
plt.legend(["NN", "SPY"],loc=2)
plt.ylabel("% Return")
plt.title("TRADING SPY OUT OF SAMPLE")
plt.show()

Output:

TRADING STATS
AVG Monthly Return :: -0.45%
STD Monthly        :: 3.17%
SHARPE             :: -0.49
MAX DRAWDOWN       :: 48.20%, nan months
Correlation to SPY :: -0.01
NUMBER OF TRADES   :: 1175
TOTAL TRADING DAYS :: 4000
SPY MONTHLY RETURN :: 1.19%
SPY STD RETURN     :: 3.92%
SPY SHARPE         :: 1.05
SPY DRAWDOWN       :: 17.25%, 11.0 months
0.482027152898

[Figure: cumulative % return out of sample, NN strategy vs. SPY]


Feel free to follow my WeChat official account for the latest articles.