CS231n-2017 Assignment 3: RNN, LSTM, and Style Transfer

I. RNN

The required steps are laid out in the RNN_Captioning.ipynb notebook.

The data used in this example comes from the COCO dataset released by Microsoft in 2014. The image-captioning portion of the dataset contains 80,000 training images and 40,000 validation images. Features for these images have already been extracted with a VGG-16 network and are stored in the files train2014_vgg16_fc7.h5 and val2014_vgg16_fc7.h5; each image is represented by a 4096-dimensional vector. To reduce the complexity of the problem, PCA-processed features are also provided, stored in train2014_vgg16_fc7_pca.h5 and val2014_vgg16_fc7_pca.h5, with the feature dimensionality reduced from 4096 to 512.

An image and its captions are shown below. <START> and <END> are the start and end tokens of a caption, <UNK> stands in for rare words that are missing from the vocabulary, and shorter captions are padded with the special <NULL> token so that all captions have the same length.

Figure 1. An image-captioning example
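As a concrete illustration of this encoding (the vocabulary and caption below are made up for this example, not taken from the COCO data), a caption is mapped to a fixed-length row of word indices:

# Hypothetical vocabulary and caption, for illustration only.
word_to_idx = {'<NULL>': 0, '<START>': 1, '<END>': 2, '<UNK>': 3,
               'a': 4, 'cat': 5, 'on': 6, 'mat': 7}
caption = ['<START>', 'a', 'cat', 'on', 'a', 'mat', '<END>']
T = 10  # fixed caption length

# Encode each word (unknown words map to <UNK>), then pad with <NULL> up to length T.
encoded = [word_to_idx.get(w, word_to_idx['<UNK>']) for w in caption]
encoded += [word_to_idx['<NULL>']] * (T - len(encoded))
print(encoded)  # [1, 4, 5, 6, 4, 7, 2, 0, 0, 0]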
1. Single-step forward pass of the RNN

The forward pass is implemented much as in the previous assignment; the difference is that here we implement the logic of a recurrent layer. Each time the network reads in one caption word, it computes a new hidden state from that input and the current hidden state.

The rnn_step_forward() function in rnn_layers.py:

def rnn_step_forward(x, prev_h, Wx, Wh, b):
    next_h, cache = None, None
    # TODO: Implement a single forward step for the vanilla RNN.
    next_h = tanh(np.dot(x, Wx) + np.dot(prev_h, Wh) + b)
    cache = (next_h, Wx, Wh, x, prev_h)
    return next_h, cache

Here tanh() is a helper that computes the hyperbolic tangent; a numerically stable version that guards against overflow looks like this:

def tanh(x):
    # Clip large positive inputs: exp(2x) would overflow there, while tanh(x)
    # is already 1 to within float precision for x > 10. Large negative inputs
    # are safe since exp(2x) simply underflows to 0.
    tmp = x.copy()
    tmp[tmp > 10] = 10
    tmp = np.exp(tmp*2)
    return (tmp - 1)/(tmp + 1)
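A quick sanity check against NumPy's built-in implementation (my own check, not part of the assignment); the clipping only takes effect where tanh has already saturated to 1 within float precision:

import numpy as np

x = np.linspace(-50, 50, 101)
print(np.allclose(tanh(x), np.tanh(x)))  # True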

2. Single-step backward pass of the RNN

The derivative of the hyperbolic tangent:

$$\tanh x = \frac{e^x - e^{-x}}{e^x + e^{-x}} \;\Rightarrow\; \tanh' x = 1 - \tanh^2 x$$
Hence the rnn_step_backward() function in rnn_layers.py:

def rnn_step_backward(dnext_h, cache):
    dx, dprev_h, dWx, dWh, db = None, None, None, None, None
    # TODO: Implement the backward pass for a single step of a vanilla RNN. 
    next_h, Wx, Wh, x, prev_h = cache
    dtanh_h = dnext_h * (1 - next_h**2)
    dx = dtanh_h.dot(Wx.T)
    dprev_h = dtanh_h.dot(Wh.T)
    dWx = x.T.dot(dtanh_h)
    dWh = prev_h.T.dot(dtanh_h)
    db = np.sum(dtanh_h, axis=0)

    return dx, dprev_h, dWx, dWh, db
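The notebook validates this against numerical gradients. Below is a self-contained sketch of the same check, using a central-difference helper written here for illustration (the assignment ships its own eval_numerical_gradient_array; this stand-in is an assumption, not the assignment's code):

import numpy as np

def num_grad(f, x, df, h=1e-5):
    # Central differences: gradient of sum(f(x) * df) w.r.t. x, one coordinate at a time.
    grad = np.zeros_like(x)
    it = np.nditer(x, flags=['multi_index'])
    while not it.finished:
        ix = it.multi_index
        old = x[ix]
        x[ix] = old + h; pos = f(x).copy()
        x[ix] = old - h; neg = f(x).copy()
        x[ix] = old
        grad[ix] = np.sum((pos - neg) * df) / (2 * h)
        it.iternext()
    return grad

N, D, H = 4, 5, 6
x, prev_h = np.random.randn(N, D), np.random.randn(N, H)
Wx, Wh, b = np.random.randn(D, H), np.random.randn(H, H), np.random.randn(H)
dnext_h = np.random.randn(N, H)

_, cache = rnn_step_forward(x, prev_h, Wx, Wh, b)
dx, dprev_h, dWx, dWh, db = rnn_step_backward(dnext_h, cache)
dx_num = num_grad(lambda x: rnn_step_forward(x, prev_h, Wx, Wh, b)[0], x, dnext_h)
print(np.max(np.abs(dx - dx_num)))  # should be tiny, on the order of 1e-10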
3. Forward pass of the RNN

The network reads a minibatch of caption data x (N samples, each caption of length T) and uses the features of the corresponding images as the initial hidden state h0. The forward pass produces each sample's hidden state at every timestep and stores the variables needed for the backward pass.
The rnn_forward() function in rnn_layers.py:

def rnn_forward(x, h0, Wx, Wh, b):
    h, cache = None, None
    # TODO: Implement forward pass for a vanilla RNN running on a sequence of input data.
    N, T, D = x.shape
    _, H = h0.shape
    h = np.zeros((N, T, H))
    prev_h = h0
    for iter_time in range(T):
        h[:, iter_time, :], _ = rnn_step_forward(x[:, iter_time, :], prev_h, Wx, Wh, b)
        prev_h = h[:, iter_time, :]

    # Per-step caches are discarded; the backward pass rebuilds them from h.
    cache = (h0, h, Wx, Wh, x)

    return h, cache
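A quick shape check on random data (the sizes here are arbitrary):

import numpy as np

N, T, D, H = 2, 3, 4, 5
x = np.random.randn(N, T, D)
h0 = np.random.randn(N, H)
Wx, Wh, b = np.random.randn(D, H), np.random.randn(H, H), np.random.randn(H)
h, _ = rnn_forward(x, h0, Wx, Wh, b)
print(h.shape)  # (2, 3, 5): one H-dimensional hidden state per sample per timestep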
4. Backward pass of the RNN

The backward pass is implemented from the stored variables. The rnn_backward() function in rnn_layers.py:

def rnn_backward(dh, cache):

    dx, dh0, dWx, dWh, db = None, None, None, None, None
    # TODO: Implement the backward pass for a vanilla RNN running an entire sequence of data.
    N, T, H = dh.shape
    h0, h, Wx, Wh, x = cache
    dh0 = np.zeros_like(h0)
    dx = np.zeros_like(x)
    dWx = np.zeros_like(Wx)
    dWh = np.zeros_like(Wh)
    db = np.zeros(H)
    h = np.concatenate((h0[:, np.newaxis, :], h), axis=1)
    
    for iter_time in range(T):
        # dh0 currently holds dprev_h flowing back from the later timestep, so the
        # upstream gradient at this step is the sum of both contributions.
        dnext_h = dh[:, -(iter_time+1), :] + dh0
        cache = (h[:, -(iter_time+1), :], Wx, Wh, x[:, -(iter_time+1), :], h[:, -(iter_time+2), :])
        dx_step, dh0, dWx_step, dWh_step, db_step = rnn_step_backward(dnext_h, cache)
        dx[:, -(iter_time+1), :] = dx_step
        dWx += dWx_step
        dWh += dWh_step
        db  += db_step

    return dx, dh0, dWx, dWh, db

Note how the gradients accumulate across timesteps; this accumulation is one manifestation of the RNN's parameter sharing.

5. Vector representations of words

Convert the word indices x of the image captions into vector representations, and update the corresponding word vectors during the backward pass.
The word_embedding_forward() function in rnn_layers.py:

def word_embedding_forward(x, W):
    out, cache = None, None
    # TODO: Implement the forward pass for word embeddings.
    out = W[x, :]
    cache = (x, W.shape)

    return out, cache
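W[x, :] relies on NumPy fancy indexing: with x of shape (N, T) and W of shape (V, D), the output has shape (N, T, D), one row of W per word index. A quick check with toy sizes of my own:

import numpy as np

V, D = 6, 3
W = np.random.randn(V, D)
x = np.array([[0, 2, 5],
              [1, 1, 4]])           # (N, T) = (2, 3)
out, _ = word_embedding_forward(x, W)
print(out.shape)                    # (2, 3, 3)
print(np.allclose(out[1, 2], W[4])) # True: each index picks out a row of W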

The word_embedding_backward() function in rnn_layers.py:

def word_embedding_backward(dout, cache):
    dW = None
    # TODO: Implement the backward pass for word embeddings.
    x, shp = cache
    dW = np.zeros(shp)
    np.add.at(dW, x, dout)
    
    return dW
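np.add.at performs unbuffered in-place addition, which matters when the same word index appears more than once in x; plain fancy-indexed += would silently drop repeated contributions. A small demonstration with made-up numbers:

import numpy as np

dW = np.zeros(3)
idx = np.array([0, 1, 1])          # index 1 appears twice
vals = np.array([1.0, 2.0, 3.0])

dW[idx] += vals
print(dW)                          # [1. 3. 0.]: only the last write to index 1 survives

dW = np.zeros(3)
np.add.at(dW, idx, vals)
print(dW)                          # [1. 5. 0.]: both contributions accumulate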
6. The loss function

The loss() function in rnn.py:

def loss(self, features, captions):

        captions_in = captions[:, :-1]
        captions_out = captions[:, 1:]

        # You'll need this
        mask = (captions_out != self._null)

        # Weight and bias for the affine transform from image features to initial
        # hidden state
        W_proj, b_proj = self.params['W_proj'], self.params['b_proj']

        # Word embedding matrix
        W_embed = self.params['W_embed']

        # Input-to-hidden, hidden-to-hidden, and biases for the RNN
        Wx, Wh, b = self.params['Wx'], self.params['Wh'], self.params['b']

        # Weight and bias for the hidden-to-vocab transformation.
        W_vocab, b_vocab = self.params['W_vocab'], self.params['b_vocab']

        loss, grads = 0.0, {}
        ############################################################################
        # TODO: Implement the forward and backward passes for the CaptioningRNN.
        h0, cache_affine = affine_forward(features, W_proj, b_proj)  # (1)
        captions_in_vec, cache_embed = word_embedding_forward(captions_in, W_embed)  #(2)
        if self.cell_type == "rnn":
          h, cache_rnn = rnn_forward(captions_in_vec, h0, Wx, Wh, b)  # (3)
        elif self.cell_type == "lstm":
          h, cache_lstm = lstm_forward(captions_in_vec, h0, Wx, Wh, b)
        
        scores, cache_score = temporal_affine_forward(h, W_vocab, b_vocab)  # (4)
        loss, dscores = temporal_softmax_loss(scores, captions_out, mask)  # (5)

        dh, dW_vocab, db_vocab = temporal_affine_backward(dscores, cache_score) # (4)
        if self.cell_type == "rnn":
          dcaptions_in_vec, dh0, dWx, dWh, db = rnn_backward(dh, cache_rnn)  # (3)
        elif self.cell_type == "lstm":
          dcaptions_in_vec, dh0, dWx, dWh, db = lstm_backward(dh, cache_lstm)  # (3)
        
        dW_embed = word_embedding_backward(dcaptions_in_vec, cache_embed)  # (2)
        _, dW_proj, db_proj = affine_backward(dh0, cache_affine)  # (1)

        grads = {"W_vocab": dW_vocab, "b_vocab": db_vocab, 
                 "Wx": dWx, "Wh": dWh, "b": db,
                 "W_embed": dW_embed, "W_proj": dW_proj, "b_proj": db_proj}

        return loss, grads
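A quick usage sketch, assuming the CaptioningRNN class from the assignment's cs231n/classifiers/rnn.py; the exact constructor signature is an assumption here, and the shapes are toy values of my own:

import numpy as np
from cs231n.classifiers.rnn import CaptioningRNN

word_to_idx = {'<NULL>': 0, '<START>': 1, '<END>': 2, 'a': 3, 'cat': 4}
model = CaptioningRNN(word_to_idx, input_dim=512, wordvec_dim=32,
                      hidden_dim=64, cell_type='rnn')

features = np.random.randn(2, 512)                       # N = 2 images
captions = np.array([[1, 3, 4, 2, 0], [1, 4, 3, 2, 0]])  # <START> ... <END> <NULL>
loss, grads = model.loss(features, captions)
print(loss, sorted(grads.keys()))

The notebook checks these gradients numerically against the analytic ones before training.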
7. Test-time sampling

The sample() function in rnn.py:

def sample(self, features, max_length=30):
        N = features.shape[0]
        captions = self._null * np.ones((N, max_length), dtype=np.int32)

        # Unpack parameters
        W_proj, b_proj = self.params['W_proj'], self.params['b_proj']
        W_embed = self.params['W_embed']
        Wx, Wh, b = self.params['Wx'], self.params['Wh'], self.params['b']
        W_vocab, b_vocab = self.params['W_vocab'], self.params['b_vocab']

        # TODO: Implement test-time sampling for the model.
        # Initial LSTM cell state: b stacks the four gate biases, so H = len(b) // 4.
        c = np.zeros((N, b.shape[0] // 4))
        h = features.dot(W_proj) + b_proj  # (1) project image features to the initial hidden state
        captions[:, 0] = self._start

        for iter_time in range(1, max_length):
            prev_word = captions[:, iter_time-1]
            captions_in_vec, _ = word_embedding_forward(prev_word, W_embed)  # (2)
            if self.cell_type == "rnn":
                h, _ = rnn_step_forward(captions_in_vec, h, Wx, Wh, b)  # (3)
            else:
                h, c, _ = lstm_step_forward(captions_in_vec, h, c, Wx, Wh, b)  # (3)
            scores = np.dot(h, W_vocab) + b_vocab  # (4)
            # Greedy decoding: take the highest-scoring word at each step.
            captions[:, iter_time] = np.argmax(scores, axis=1)

        return captions
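The sampled index matrix is then decoded back into sentences. A usage sketch, assuming decode_captions() and load_coco_data() from the assignment's cs231n/coco_utils.py (part of the assignment scaffolding, not defined in this post):

from cs231n.coco_utils import load_coco_data, decode_captions

data = load_coco_data(pca_features=True)
captions = model.sample(features, max_length=30)
for sentence in decode_captions(captions, data['idx_to_word']):
    print(sentence)

Because decoding is greedy argmax, the same image always yields the same caption.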

II. LSTM

The required steps are laid out in the LSTM_Captioning.ipynb notebook.

1. Single-step forward pass of the LSTM

The lstm_step_forward() function in rnn_layers.py:

def lstm_step_forward(x, prev_h, prev_c, Wx, Wh, b):

    next_h, next_c, cache = None, None, None
    # TODO: Implement the forward pass for a single timestep of an LSTM.
    # ifog stacks the pre-activations of the four gates, each of width H.
    ifog = x.dot(Wx) + prev_h.dot(Wh) + b
    ifog = getIFOG(ifog, "T")  # apply the gate nonlinearities in place
    next_c = getIFOG(ifog, "f")*prev_c + getIFOG(ifog, "i")*getIFOG(ifog, "g")
    next_h = getIFOG(ifog, "o")*tanh(next_c)
    cache = (x, prev_h, prev_c, Wx, Wh, next_c, ifog)

    return next_h, next_c, cache
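In equation form, with the pre-activation vector split into four H-wide gate blocks (exactly the blocks getIFOG() slices out):

$$a = x W_x + h_{t-1} W_h + b,\qquad i = \sigma(a_i),\quad f = \sigma(a_f),\quad o = \sigma(a_o),\quad g = \tanh(a_g)$$

$$c_t = f \odot c_{t-1} + i \odot g,\qquad h_t = o \odot \tanh(c_t)$$

where σ is the sigmoid and ⊙ denotes elementwise multiplication.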

The getIFOG() helper applies the gate nonlinearities and slices out the four gate outputs:

def getIFOG(ifog, which):
    # which == "t"/"T": apply each gate's nonlinearity to its block in place.
    # Otherwise: return the block belonging to gate `which` (one of "ifog").
    H = ifog.shape[1]//4
    indx = {char: i*H for i, char in enumerate("ifog")}
    if which == "t" or which == "T":
        for char in indx:
            if char == "g":
                ifog[:, indx[char]:indx[char]+H] = tanh(ifog[:, indx[char]:indx[char]+H])
            else:
                ifog[:, indx[char]:indx[char]+H] = sigmoid(ifog[:, indx[char]:indx[char]+H])
        return ifog
    else:
        return ifog[:, indx[which]:indx[which]+H]
2. Single-step backward pass of the LSTM

The lstm_step_backward() function in rnn_layers.py:

def lstm_step_backward(dnext_h, dnext_c, cache):
    dx, dprev_h, dprev_c, dWx, dWh, db = None, None, None, None, None, None
    # TODO: Implement the backward pass for a single timestep of an LSTM.
    N, H = dnext_c.shape
    da = np.zeros((N, 4*H))

    x, prev_h, prev_c, Wx, Wh, next_c, ifog = cache

    tanhc_t = tanh(next_c)
    i = getIFOG(ifog, "i")
    f = getIFOG(ifog, "f")
    o = getIFOG(ifog, "o")
    g = getIFOG(ifog, "g")
    dh_c = dnext_h*o*(1-tanhc_t**2)  # gradient flowing into next_c through next_h
    setIFOG(da, "i", (dnext_c + dh_c)*g*(1-i)*i)       # sigmoid' = s*(1-s)
    setIFOG(da, "f", (dnext_c + dh_c)*prev_c*(1-f)*f)
    setIFOG(da, "o", dnext_h*tanhc_t*(1-o)*o)
    setIFOG(da, "g", (dnext_c + dh_c)*i*(1-g**2))      # tanh' = 1 - tanh^2

    dx = da.dot(Wx.T)
    dprev_h = da.dot(Wh.T)
    dprev_c = (dnext_c + dh_c) * f
    dWx = x.T.dot(da)
    dWh = prev_h.T.dot(da)
    db = np.sum(da, axis=0)
    
    return dx, dprev_h, dprev_c, dWx, dWh, db
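The setIFOG() helper is not shown in the original post; by symmetry with getIFOG() it presumably writes a value into one gate's block. A minimal sketch under that assumption:

def setIFOG(ifog, which, val):
    # Assumed counterpart of getIFOG(): write val into the block of ifog
    # belonging to gate `which` (one of "ifog").
    H = ifog.shape[1] // 4
    indx = {char: i*H for i, char in enumerate("ifog")}
    ifog[:, indx[which]:indx[which]+H] = val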

As the implementation shows, the gradient an LSTM feeds back to the previous timestep contains dprev_c in addition to dprev_h. The dprev_h term involves multiplication by the weight matrix W, so over many timesteps it is prone to exploding or vanishing. The dprev_c term involves only elementwise multiplication, which alleviates this problem.

3. Forward pass of the LSTM

The lstm_forward() function in rnn_layers.py:

def lstm_forward(x, h0, Wx, Wh, b):
    h, cache = None, None

    # TODO: Implement the forward pass for an LSTM over an entire timeseries.
    N, T, D = x.shape
    _, H = h0.shape
    h = np.zeros((N, T, H))
    # Mirrors rnn_forward(), threading a zero-initialized cell state through the steps.
    prev_h, prev_c, cache = h0, np.zeros((N, H)), []
    for iter_time in range(T):
        prev_h, prev_c, cache_step = lstm_step_forward(x[:, iter_time, :], prev_h, prev_c, Wx, Wh, b)
        h[:, iter_time, :] = prev_h
        cache.append(cache_step)

    return h, cache