
Machine Learning - How RNNs and LSTMs Work

  • Overview

An RNN (recurrent neural network) offers a different way of approaching deep learning problems: the output at each step depends not only on the input at that step but also on earlier inputs (and, in bidirectional variants, later ones). This comes up constantly in NLP applications; for example, each output word depends on the content of the whole sentence, not just on a single word. The LSTM is an upgraded version of the RNN: its core idea is the same, but it avoids some of the RNN's shortcomings through the mechanisms described below. The rest of this post walks through the structures of the RNN and the LSTM step by step and explains how they work.

  • RNN explained

To understand the RNN, let's first look at its structure and then explain how it works.

[Figure: the overall structure of an RNN (left) and the internals of a single RNN cell (right)]

In the figure above, the left half shows the overall structure of an RNN and the right half shows the internals of a single RNN cell. From the left half we can see that no matter how many times the RNN cell is unrolled, the weights are shared: there is only ever one copy of the weights. The hidden state (a<t> and a<t-1> in the right half), on the other hand, differs from step to step; every time step has its own copy of the hidden state. The figure also shows that the input to each step is not just x<t> but also the hidden state a<t-1> carried over from the previous time steps. This is exactly the behaviour we asked for above: the output of each step depends not only on the current input but also on the inputs that came before it. In practice, applying an RNN in TensorFlow is very simple, essentially a single line of code, but to deepen your understanding the code below builds a single RNN cell from scratch.
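Concretely, the forward step that the cell implements (the "formula given above" referenced in the code comments) is:

\[
a^{\langle t \rangle} = \tanh\big(W_{aa}\, a^{\langle t-1 \rangle} + W_{ax}\, x^{\langle t \rangle} + b_a\big),
\qquad
\hat{y}^{\langle t \rangle} = \operatorname{softmax}\big(W_{ya}\, a^{\langle t \rangle} + b_y\big)
\]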

import numpy as np

def softmax(x):
    """Column-wise softmax, used for the output layer."""
    e_x = np.exp(x - np.max(x, axis=0, keepdims=True))
    return e_x / np.sum(e_x, axis=0, keepdims=True)

def sigmoid(x):
    """Element-wise logistic sigmoid (needed later for the LSTM gates)."""
    return 1 / (1 + np.exp(-x))

def rnn_cell_forward(xt, a_prev, parameters):
    """
    Implements a single forward step of the RNN-cell as described
    Arguments:
    xt -- your input data at timestep "t", numpy array of shape (n_x, m).
    a_prev -- Hidden state at timestep "t-1", numpy array of shape (n_a, m)
    parameters -- python dictionary containing:
                        Wax -- Weight matrix multiplying the input, numpy array of shape (n_a, n_x)
                        Waa -- Weight matrix multiplying the hidden state, numpy array of shape (n_a, n_a)
                        Wya -- Weight matrix relating the hidden-state to the output, numpy array of shape (n_y, n_a)
                        ba --  Bias, numpy array of shape (n_a, 1)
                        by -- Bias relating the hidden-state to the output, numpy array of shape (n_y, 1)
    Returns:
    a_next -- next hidden state, of shape (n_a, m)
    yt_pred -- prediction at timestep "t", numpy array of shape (n_y, m)
    cache -- tuple of values needed for the backward pass, contains (a_next, a_prev, xt, parameters)
    """
    
    # Retrieve parameters from "parameters"
    Wax = parameters["Wax"]
    Waa = parameters["Waa"]
    Wya = parameters["Wya"]
    ba = parameters["ba"]
    by = parameters["by"]

    # compute next activation state using the formula given above
    a_next = np.tanh(Wax.dot(xt)+Waa.dot(a_prev)+ba)
    # compute output of the current cell using the formula given above
    yt_pred = softmax(Wya.dot(a_next)+by)   
    
    # store values you need for backward propagation in cache
    cache = (a_next, a_prev, xt, parameters)
    
    return a_next, yt_pred, cache

np.random.seed(1)
xt_tmp = np.random.randn(3,10)
a_prev_tmp = np.random.randn(5,10)
parameters_tmp = {}
parameters_tmp['Waa'] = np.random.randn(5,5)
parameters_tmp['Wax'] = np.random.randn(5,3)
parameters_tmp['Wya'] = np.random.randn(2,5)
parameters_tmp['ba'] = np.random.randn(5,1)
parameters_tmp['by'] = np.random.randn(2,1)

a_next_tmp, yt_pred_tmp, cache_tmp = rnn_cell_forward(xt_tmp, a_prev_tmp, parameters_tmp)
print("a_next[4] = ", a_next_tmp[4])
print("a_next.shape = ", a_next_tmp.shape)
print("yt_pred[1] =", yt_pred_tmp[1])
print("yt_pred.shape = ", yt_pred_tmp.shape)
print( a_next_tmp[:,:])
print( a_next_tmp[:,0])
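To make the weight-sharing point concrete, here is a minimal sketch built on top of the cell above (the function name rnn_forward and the input layout (n_x, m, T_x) are my own assumptions, not from the original post): it reuses the same rnn_cell_forward, and therefore the same parameters, at every time step of a sequence.

def rnn_forward(x, a0, parameters):
    # Unroll the RNN over T_x time steps; the same `parameters` dict is
    # passed at every step, so the weights are shared across time and
    # only the hidden state changes from step to step.
    n_x, m, T_x = x.shape
    n_a = a0.shape[0]
    n_y = parameters["Wya"].shape[0]
    a = np.zeros((n_a, m, T_x))
    y_pred = np.zeros((n_y, m, T_x))
    a_next = a0
    for t in range(T_x):
        a_next, yt_pred, _ = rnn_cell_forward(x[:, :, t], a_next, parameters)
        a[:, :, t] = a_next
        y_pred[:, :, t] = yt_pred
    return a, y_pred

x_tmp = np.random.randn(3, 10, 4)   # 4 time steps with the toy shapes used above
a0_tmp = np.random.randn(5, 10)
a_tmp, y_pred_tmp = rnn_forward(x_tmp, a0_tmp, parameters_tmp)
print("a.shape = ", a_tmp.shape)            # (5, 10, 4)
print("y_pred.shape = ", y_pred_tmp.shape)  # (2, 10, 4)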
  • LSTM explained

Look carefully at the RNN structure above: does it have any weaknesses? If the RNN has to be unrolled many times, information can be lost; this is the vanishing gradient problem. Once the gradients vanish, the early time steps stop contributing to learning and the network degenerates into something close to a standard neural network, so the RNN loses its point. Moreover, the longer the sequence (i.e. the more time steps), the more likely vanishing gradients become. At this point we need to improve the RNN so that the new structure not only keeps learning but also has a memory and can hold on to the important things it has learned. That improvement takes the RNN to the LSTM (Long Short-Term Memory). To explain the LSTM network structure properly, let's again look at the diagram first and then walk through it.
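As a rough illustration of why longer sequences make things worse, consider a toy scalar recurrence h_t = tanh(w * h_{t-1} + x_t) (a hypothetical simplification, not the full backpropagation-through-time derivation): the gradient flowing back through T steps contains a product of T per-step factors of the form w * (1 - tanh(h)^2), and when those factors are below 1 the product shrinks exponentially with T.

w, h = 0.9, 0.5
factor = w * (1 - np.tanh(h) ** 2)   # one step's gradient factor (< 1 here)
for T in (10, 50, 100):
    print(T, factor ** T)            # shrinks rapidly towards 0 as T grows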

[Figure: the structure of an LSTM cell, showing the three gates (forget, update, output) and the memory cell C<t>]
The figure above shows the basic structure of an LSTM cell; a few unimportant elements are omitted so that the most important information stands out. Compared with the RNN, we now have three gates and a memory cell C<t>, also called the internal hidden state. Let's look at what the three gates do.

The first gate is the forget gate. It helps the memory cell delete (or, more precisely, filter out) unimportant information. Its values lie in the interval [0, 1] and are usually produced by a sigmoid; they are then multiplied element-wise with the memory cell. Where the forget gate is close to 0 the corresponding information is erased, and where it is close to 1 the corresponding value is kept.

The second gate is the update gate. It acts together with the candidate memory cell to produce new information: the two are multiplied element-wise, and the result is added element-wise to the memory cell that has already passed through the forget gate. This is how the information learned at the current time step gets written into the memory cell.

The third gate is the output gate. As the name suggests, it filters the hidden state the cell outputs. It is also a sigmoid, computed from the previous hidden state a<t-1> and the current input x<t>. Multiplying it element-wise with the tanh of the memory cell (after the forget and update gates have been applied) gives the current time step's hidden state a<t>, and the current memory cell value C<t> is passed along as well. Note that the output hidden state a<t> and the internal hidden state (the memory cell) therefore have the same dimension.

That covers the structure and function of a single LSTM cell. To deepen your understanding of the LSTM, the code below shows how to build an LSTM cell:
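Written out as formulas (ft, it and ot are the forget, update and output gates and cct is the candidate value, matching the variable names in the code below; [a<t-1>, x<t>] denotes the concatenation of the previous hidden state and the current input):

\begin{aligned}
f_t &= \sigma\big(W_f\,[a^{\langle t-1 \rangle}, x^{\langle t \rangle}] + b_f\big)\\
i_t &= \sigma\big(W_i\,[a^{\langle t-1 \rangle}, x^{\langle t \rangle}] + b_i\big)\\
\tilde{c}_t &= \tanh\big(W_c\,[a^{\langle t-1 \rangle}, x^{\langle t \rangle}] + b_c\big)\\
c^{\langle t \rangle} &= f_t \odot c^{\langle t-1 \rangle} + i_t \odot \tilde{c}_t\\
o_t &= \sigma\big(W_o\,[a^{\langle t-1 \rangle}, x^{\langle t \rangle}] + b_o\big)\\
a^{\langle t \rangle} &= o_t \odot \tanh\big(c^{\langle t \rangle}\big)\\
\hat{y}^{\langle t \rangle} &= \operatorname{softmax}\big(W_y\, a^{\langle t \rangle} + b_y\big)
\end{aligned}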

 

def lstm_cell_forward(xt, a_prev, c_prev, parameters):
    """
    Implement a single forward step of the LSTM-cell described above

    Arguments:
    xt -- your input data at timestep "t", numpy array of shape (n_x, m).
    a_prev -- Hidden state at timestep "t-1", numpy array of shape (n_a, m)
    c_prev -- Memory state at timestep "t-1", numpy array of shape (n_a, m)
    parameters -- python dictionary containing:
                        Wf -- Weight matrix of the forget gate, numpy array of shape (n_a, n_a + n_x)
                        bf -- Bias of the forget gate, numpy array of shape (n_a, 1)
                        Wi -- Weight matrix of the update gate, numpy array of shape (n_a, n_a + n_x)
                        bi -- Bias of the update gate, numpy array of shape (n_a, 1)
                        Wc -- Weight matrix of the first "tanh", numpy array of shape (n_a, n_a + n_x)
                        bc --  Bias of the first "tanh", numpy array of shape (n_a, 1)
                        Wo -- Weight matrix of the output gate, numpy array of shape (n_a, n_a + n_x)
                        bo --  Bias of the output gate, numpy array of shape (n_a, 1)
                        Wy -- Weight matrix relating the hidden-state to the output, numpy array of shape (n_y, n_a)
                        by -- Bias relating the hidden-state to the output, numpy array of shape (n_y, 1)
                        
    Returns:
    a_next -- next hidden state, of shape (n_a, m)
    c_next -- next memory state, of shape (n_a, m)
    yt_pred -- prediction at timestep "t", numpy array of shape (n_y, m)
    cache -- tuple of values needed for the backward pass, contains (a_next, c_next, a_prev, c_prev, xt, parameters)
    
    Note: ft/it/ot stand for the forget/update/output gates, cct stands for the candidate value (c tilde),
          c stands for the cell state (memory)
    """

    # Retrieve parameters from "parameters"
    Wf = parameters["Wf"] # forget gate weight
    bf = parameters["bf"]
    Wi = parameters["Wi"] # update gate weight
    bi = parameters["bi"] # update gate bias
    Wc = parameters["Wc"] # candidate value weight
    bc = parameters["bc"]
    Wo = parameters["Wo"] # output gate weight
    bo = parameters["bo"]
    Wy = parameters["Wy"] # prediction weight
    by = parameters["by"]
    
    # Retrieve dimensions from shapes of xt and Wy
    n_x, m = xt.shape
    n_y, n_a = Wy.shape

    # Concatenate a_prev and xt 
    concat = np.concatenate((a_prev, xt), axis=0)

    # Compute values for ft (forget gate), it (update gate),
    # cct (candidate value), c_next (cell state), 
    # ot (output gate), a_next (hidden state)
    ft = sigmoid(Wf.dot(concat)+bf)        # forget gate
    it = sigmoid(Wi.dot(concat)+bi)        # update gate
    cct = np.tanh(Wc.dot(concat)+bc)       # candidate value
    c_next = c_prev*ft + cct*it   # cell state
    ot = sigmoid(Wo.dot(concat)+bo)        # output gate
    a_next = ot*np.tanh(c_next)    # hidden state
    
    # Compute prediction of the LSTM cell
    yt_pred = softmax(Wy.dot(a_next)+by)    

    # store values needed for backward propagation in cache
    cache = (a_next, c_next, a_prev, c_prev, ft, it, cct, ot, xt, parameters)

    return a_next, c_next, yt_pred, cache

np.random.seed(1)
xt_tmp = np.random.randn(3,10)
a_prev_tmp = np.random.randn(5,10)
c_prev_tmp = np.random.randn(5,10)
parameters_tmp = {}
parameters_tmp['Wf'] = np.random.randn(5, 5+3)
parameters_tmp['bf'] = np.random.randn(5,1)
parameters_tmp['Wi'] = np.random.randn(5, 5+3)
parameters_tmp['bi'] = np.random.randn(5,1)
parameters_tmp['Wo'] = np.random.randn(5, 5+3)
parameters_tmp['bo'] = np.random.randn(5,1)
parameters_tmp['Wc'] = np.random.randn(5, 5+3)
parameters_tmp['bc'] = np.random.randn(5,1)
parameters_tmp['Wy'] = np.random.randn(2,5)
parameters_tmp['by'] = np.random.randn(2,1)

a_next_tmp, c_next_tmp, yt_tmp, cache_tmp = lstm_cell_forward(xt_tmp, a_prev_tmp, c_prev_tmp, parameters_tmp)
print("a_next[4] = \n", a_next_tmp[4])
print("a_next.shape = ", a_next_tmp.shape)
print("c_next[2] = \n", c_next_tmp[2])
print("c_next.shape = ", c_next_tmp.shape)
print("yt[1] =", yt_tmp[1])
print("yt.shape = ", yt_tmp.shape)
print("cache[1][3] =\n", cache_tmp[1][3])
print("len(cache) = ", len(cache_tmp))
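As mentioned at the start, using these cells in practice is far simpler than building them by hand. Here is a sketch of what that looks like in TensorFlow 2.x with the tf.keras API (the layer sizes and input shape are arbitrary choices for illustration, not from the original post):

import tensorflow as tf

# One LSTM layer does what the hand-written cell above does, unrolled
# over the whole sequence, followed by a softmax output layer.
model = tf.keras.Sequential([
    tf.keras.layers.LSTM(64, return_sequences=True, input_shape=(10, 3)),  # 10 time steps, 3 features
    tf.keras.layers.Dense(2, activation="softmax"),
])
model.summary()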
  • Summary

The two sections above introduced the structures of the RNN and the LSTM, analysed the function and flow of the components inside them, and followed each part with Python code showing how to build an RNN cell and an LSTM cell from scratch. The LSTM can be understood as an optimisation of the RNN, and it is worth understanding why that optimisation is needed. More important still is understanding the new way of framing problems that the RNN represents. The clearest difference from the standard neural networks we have seen before is that in those networks, whether regressors or classifiers, each output depends only on the input features, whereas in an RNN the output depends not only on the current input but also on the inputs that came before it. This is exactly what sequence models need, and applications such as language modeling and machine translation all build on the RNN idea.
