Others only talk about the Attention Mechanism without getting hands-on, so let me walk you through the code myself.
Attention is a mechanism for improving the performance of RNN-based (LSTM or GRU) Encoder-Decoder models, generally referred to as the Attention Mechanism. It is currently very popular and widely used in machine translation, speech recognition, image captioning, and many other areas. The reason it is so popular is that Attention gives the model the ability to discriminate: in machine translation and speech recognition, for example, it assigns a different weight to each word in the sentence, which makes the neural network's learning more flexible ("soft"). At the same time, Attention can itself be read as an alignment, explaining how the input and output sentences of a translation line up and thus what knowledge the model has actually learned, giving us a window into the black box of deep learning.
Here are the best articles on the attention mechanism that I have collected for you:
Title | Description | Date |
---|---|---|
模型彙總24 - A detailed introduction to the Attention Mechanism in deep learning: principles, taxonomy, and applications | Top pick, Zhihu | 2017 |
What are the current mainstream attention methods? | Detailed explanation of attention mechanisms, Zhihu | 2017 |
Attention_Network_With_Keras: implementation and analysis of the attention model code | Code walkthrough, Jianshu | 2018-06-17 |
Attention_Network_With_Keras | Code implementation, GitHub | 2018 |
A look at various attention mechanisms and the power of deep learning in NLP | Survey, Jiqizhixin (機器之心) | 2018-10-08 |
If you would rather not read a lot of text, scroll straight to the end for my code walkthrough; that should make everything click.
To explain the structure and principle of the Attention Mechanism, we first need to introduce the structure of the Seq2Seq model. The RNN-based Seq2Seq model was introduced mainly in two papers, which differ only in the RNN cell they use. Ilya Sutskever et al. built a Seq2Seq model with LSTMs in *Sequence to Sequence Learning with Neural Networks* (2014), and Kyunghyun Cho et al. proposed a GRU-based encoder-decoder model in *Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation* (2014). Both papers set out to solve the same core problem in machine translation: how to map a variable-length input X to a variable-length output Y. The main structure is shown in the figure below.

The Encoder encodes a variable-length input sequence x1, x2, x3, ..., xt into a fixed-length hidden vector c (the background vector, or context vector). c serves two purposes: 1. it initializes the Decoder model, acting as the initial state from which the decoder predicts y1; 2. it acts as the background vector that guides the generation of y at every step of the output sequence. The Decoder produces the output yt at time step t mainly from the background vector c and the previous output yt-1, until an end-of-sequence marker (e.g. `<EOS>`) is generated.
As noted above, the traditional Seq2Seq model cannot discriminate between different parts of the input sequence X. To address this, Dzmitry Bahdanau, Kyunghyun Cho et al. introduced the Attention Mechanism in their 2015 paper *Neural Machine Translation by Jointly Learning to Align and Translate*; the structure of their model is shown in the figures below.
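To make this plain (attention-free) Seq2Seq structure concrete, here is a minimal Keras sketch of an LSTM encoder-decoder trained with teacher forcing. The sequence lengths, vocabulary sizes, and unit counts here are made up for illustration and are not part of the original notebook.

```python
from keras.layers import Input, LSTM, Dense
from keras.models import Model

# Hypothetical sizes, for illustration only
Tx, Ty = 30, 10            # input / output sequence lengths
x_vocab, y_vocab = 40, 12  # input / output vocabulary sizes
n_units = 64

# Encoder: compress the variable-length input into a fixed-length state (the context c)
enc_inputs = Input(shape=(Tx, x_vocab))
_, state_h, state_c = LSTM(n_units, return_state=True)(enc_inputs)

# Decoder: initialized with the context c, consumes the previous output y_{t-1} at each step
dec_inputs = Input(shape=(Ty, y_vocab))  # teacher-forced previous outputs during training
dec_outputs = LSTM(n_units, return_sequences=True)(dec_inputs, initial_state=[state_h, state_c])
y = Dense(y_vocab, activation='softmax')(dec_outputs)

model = Model([enc_inputs, dec_inputs], y)
model.compile(optimizer='adam', loss='categorical_crossentropy')
```

Note how the entire input sequence is squeezed into the single state pair `[state_h, state_c]`; the attention mechanism described next removes exactly this bottleneck.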



Let's use Attention_Network_With_Keras as an example to walk through one implementation of Attention.
Selected code:
```python
Tx = 50  # Max x sequence length
Ty = 5   # y sequence length

X, Y, Xoh, Yoh = preprocess_data(dataset, human_vocab, machine_vocab, Tx, Ty)

# Split data 80-20 between training and test
# (m, the number of examples, is defined earlier in the notebook)
train_size = int(0.8 * m)
Xoh_train = Xoh[:train_size]
Yoh_train = Yoh[:train_size]
Xoh_test = Xoh[train_size:]
Yoh_test = Yoh[train_size:]
```
To be careful, let's check that the code works:
```python
i = 5
print("Input data point " + str(i) + ".")
print("")
print("The data input is: " + str(dataset[i][0]))
print("The data output is: " + str(dataset[i][1]))
print("")
print("The tokenized input is:" + str(X[i]))
print("The tokenized output is: " + str(Y[i]))
print("")
print("The one-hot input is:", Xoh[i])
print("The one-hot output is:", Yoh[i])
```

```
Input data point 5.

The data input is: 23 min after 20 p.m.
The data output is: 20:23

The tokenized input is:[ 560 25 22 260 14 19 32 18 300530 282 252 40 40 40 40 40 40 40 40
  40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40]
The tokenized output is: [ 20 1023]

The one-hot input is: [[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [1. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 1.]
 [0. 0. 0. ... 0. 0. 1.]
 [0. 0. 0. ... 0. 0. 1.]]
The one-hot output is: [[0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
 [0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0.]]
```
Model
Our next goal is to define our model. The important part will be defining the attention mechanism and then making sure to apply that correctly.
Define some model metadata:
```python
layer1_size = 32
layer2_size = 128  # Attention layer
```
The next two code snippets define the attention mechanism. This is split into two parts:
- Calculating context
- Creating an attention layer
As a refresher, an attention network pays attention to certain parts of the input at each output time step. attention denotes which inputs are most relevant to the current output step. An input step will have attention weight ~1 if it is relevant, and ~0 otherwise. The context is the "summary of the input".
The requirements are as follows: the attention weights for each output step should sum to 1, and the context should be calculated in the same manner at every output time step. Beyond that, there is some flexibility. This notebook calculates the context as:
$$ context = \sum_{i=1}^{T_x} attention_i \cdot x_i $$
Here the attention weights are produced by a small dense network applied to the encoder outputs concatenated with the (repeated) previous decoder hidden state, followed by a softmax over the $T_x$ input steps, so they are guaranteed to sum to 1 (see `one_step_of_attention` below).
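As a minimal NumPy illustration of the formula above (the shapes here are made up, and a real batch would carry an extra leading dimension m):

```python
import numpy as np

Tx, n_a = 4, 3                                          # toy sizes: 4 input steps, 3 features each
x = np.arange(Tx * n_a, dtype=float).reshape(Tx, n_a)   # "encoder outputs", shape (Tx, n_a)

scores = np.array([0.0, 2.0, 0.0, 0.0])                 # unnormalized relevance of each input step
attention = np.exp(scores) / np.exp(scores).sum()       # softmax: one weight per step, sums to 1
context = attention @ x                                 # weighted sum over the Tx axis, shape (n_a,)

print(attention.round(2))   # e.g. [0.1  0.71 0.1  0.1 ]
print(context.shape)        # (3,)
```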
```python
# Define part of the attention layer globally so as to
# share the same layers for each attention step.
def softmax(x):
    return K.softmax(x, axis=1)

at_repeat = RepeatVector(Tx)
at_concatenate = Concatenate(axis=-1)
at_dense1 = Dense(8, activation="tanh")
at_dense2 = Dense(1, activation="relu")
at_softmax = Activation(softmax, name='attention_weights')
at_dot = Dot(axes=1)

def one_step_of_attention(h_prev, a):
    """
    Get the context.

    Input:
    h_prev - Previous hidden state of a RNN layer (m, n_h)
    a - Input data, possibly processed (m, Tx, n_a)

    Output:
    context - Current context (m, 1, n_a)
    """
    # Repeat vector to match a's dimensions
    h_repeat = at_repeat(h_prev)
    # Calculate attention weights
    i = at_concatenate([a, h_repeat])
    i = at_dense1(i)
    i = at_dense2(i)
    attention = at_softmax(i)
    # Calculate the context
    context = at_dot([attention, a])

    return context
```
```python
def attention_layer(X, n_h, Ty):
    """
    Creates an attention layer.

    Input:
    X - Layer input (m, Tx, x_vocab_size)
    n_h - Size of LSTM hidden layer
    Ty - Timesteps in output sequence

    Output:
    output - The output of the attention layer: a list of Ty tensors of shape (m, n_h)
    """
    # Define the default state for the LSTM layer
    # Messy, but the alternative is using more Input()
    h = Lambda(lambda X: K.zeros(shape=(K.shape(X)[0], n_h)), name='h_attention_layer')(X)
    c = Lambda(lambda X: K.zeros(shape=(K.shape(X)[0], n_h)), name='c_attention_layer')(X)

    at_LSTM = LSTM(n_h, return_state=True, name='at_LSTM_attention_layer')

    output = []

    # Run attention step and RNN for each output time step
    for _ in range(Ty):
        context = one_step_of_attention(h, X)
        h, _, c = at_LSTM(context, initial_state=[h, c])
        output.append(h)

    return output
```
The sample model is organized as follows:
- BiLSTM
- Attention Layer
  - Outputs a list of Ty activations
- Dense
  - Necessary to convert the attention layer's output to the correct y dimensions
```python
layer3 = Dense(machine_vocab_size, activation=softmax)

def get_model(Tx, Ty, layer1_size, layer2_size, x_vocab_size, y_vocab_size):
    """
    Creates a model.

    Input:
    Tx - Number of x timesteps
    Ty - Number of y timesteps
    layer1_size - Number of neurons in BiLSTM
    layer2_size - Number of neurons in attention LSTM hidden layer
    x_vocab_size - Number of possible token types for x
    y_vocab_size - Number of possible token types for y

    Output:
    model - A Keras Model.
    """
    # Create layers one by one
    X = Input(shape=(Tx, x_vocab_size), name='X_Input')

    a1 = Bidirectional(LSTM(layer1_size, return_sequences=True), merge_mode='concat', name='Bid_LSTM')(X)
    a2 = attention_layer(a1, layer2_size, Ty)
    a3 = [layer3(timestep) for timestep in a2]

    # Create Keras model
    model = Model(inputs=[X], outputs=a3)

    return model
```
The steps from here on out are for creating the model and training it. Simple as that.
```python
# Obtain a model instance
model = get_model(Tx, Ty, layer1_size, layer2_size, human_vocab_size, machine_vocab_size)
```

```python
plot_model(model, to_file='Attention_tutorial_model_copy.png', show_shapes=True)
```
Model structure and explanation (here comes the key part)


Model evaluation
The final training loss should be in the range of 0.02 to 0.5
The test loss should be at a similar level.
```python
# Evaluate the test performance
outputs_test = list(Yoh_test.swapaxes(0, 1))
score = model.evaluate(Xoh_test, outputs_test)
print('Test loss: ', score[0])
```

```
2000/2000 [==============================] - 2s 1ms/step
Test loss: 0.4966005325317383
```
Now that we've created this beautiful model, let's see how it does in action.
The below code finds a random example and runs it through our model.
```python
# Let's visually check model output.
import random

i = random.randint(0, m - 1)  # pick a random example index (randint is inclusive on both ends)

def get_prediction(model, x):
    prediction = model.predict(x)
    max_prediction = [y.argmax() for y in prediction]
    str_prediction = "".join(ids_to_keys(max_prediction, machine_vocab))
    return (max_prediction, str_prediction)

max_prediction, str_prediction = get_prediction(model, Xoh[i:i+1])

print("Input: " + str(dataset[i][0]))
print("Tokenized: " + str(X[i]))
print("Prediction: " + str(max_prediction))
print("Prediction text: " + str(str_prediction))
```

```
Input: 13.09
Tokenized: [ 4623 12 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40
  40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40]
Prediction: [1, 3, 10, 0, 9]
Prediction text: 13:09
```
Last but not least, no introduction to attention networks is complete without a little visual tour.
The below graph shows what inputs the model was focusing on when writing each individual letter.
Attention map

Attention mechanism essentials




The hidden vectors are first passed through a fully connected layer. The alignment (attention) coefficients are then obtained by comparing the output of that fully connected layer with a trainable, randomly initialized context vector u and normalizing the scores with a softmax. The attention vector s is finally the weighted sum of all hidden vectors. The context vector can be interpreted as a representation of the "optimal word" on average; when the model faces a new sample, it uses this knowledge to decide which word deserves more attention. During training, the model updates the context vector through backpropagation, i.e., it adjusts its internal representation of what the optimal word is. A minimal sketch of this attention pooling follows below.
Self-attention is quite different from the traditional attention mechanism. Traditional attention is computed from the hidden states of the source side and the target side, and the result captures the dependencies between each source word and each target word. Self-attention, by contrast, is applied separately on the source side and on the target side, using only the source input or the target input itself, so it captures the dependencies between the words within the source or within the target; the source-side self-attention is then combined with the target-side attention to capture the dependencies between source and target words. Self-attention therefore tends to outperform the traditional attention mechanism: one main reason is that traditional attention ignores the dependencies among words within the source sentence or within the target sentence, whereas self-attention captures not only the dependencies between source and target words but also the dependencies among the words within the source or the target itself.
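Here is a minimal NumPy sketch of this word-level attention pooling. All shapes and parameter names are made up for illustration; in a real model W, b, and u would be learned by backpropagation rather than sampled at random.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention_pooling(H, W, b, u):
    """
    H    : hidden vectors from the RNN, shape (T, n_h)
    W, b : fully connected layer parameters, shapes (n_h, n_a) and (n_a,)
    u    : trainable context vector, shape (n_a,)
    Returns the attention vector s (weighted sum of H) and the attention weights.
    """
    U = np.tanh(H @ W + b)    # pass the hidden vectors through a dense layer
    scores = U @ u            # compare against the context vector
    alpha = softmax(scores)   # normalize with softmax, shape (T,)
    s = alpha @ H             # weighted sum of hidden vectors, shape (n_h,)
    return s, alpha

# toy usage
rng = np.random.default_rng(0)
T, n_h, n_a = 6, 10, 8
H = rng.normal(size=(T, n_h))
W, b, u = rng.normal(size=(n_h, n_a)), np.zeros(n_a), rng.normal(size=n_a)
s, alpha = attention_pooling(H, W, b, u)
print(s.shape, np.isclose(alpha.sum(), 1.0))  # (10,) True
```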
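The cited articles describe self-attention in general terms; as a concrete reference point, here is a minimal NumPy sketch of the scaled dot-product self-attention used in the Transformer, applied to a single (e.g. source-side) sequence. All sizes and weight matrices are made up for illustration and are not taken from the articles above.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over one sequence X of shape (T, d_model)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv           # queries, keys, values: (T, d_k), (T, d_k), (T, d_v)
    scores = Q @ K.T / np.sqrt(K.shape[-1])    # (T, T): every token attends to every other token
    weights = softmax(scores, axis=-1)         # each row sums to 1
    return weights @ V, weights                # (T, d_v) contextualized representations

# toy usage
rng = np.random.default_rng(0)
T, d_model, d_k = 4, 8, 8
X = rng.normal(size=(T, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, d_k)) for _ in range(3))
out, w = self_attention(X, Wq, Wk, Wv)
print(out.shape, w.shape)  # (4, 8) (4, 4)
```

The (T, T) weight matrix is exactly the word-to-word dependency structure within the sequence that the paragraph above refers to.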

In this architecture, the self-attention mechanism is used twice: at the word level and at the sentence level. This matters for two reasons. First, it matches the natural hierarchical structure of a document (words, then sentences, then the document). Second, when computing the document encoding it lets the model first determine which words are important within each sentence, and then which sentences are important within the document.
If you still have questions after reading this, or want to learn more, follow my blog 望江人工智庫 or reach out to me on GitHub.
I will publish my attention-mechanism experiments on my GitHub once they are finished.