Attention Mechanisms in Natural Language Processing

阿新 • Published: 2018-12-15

Attention in NLP

Advantages:

integrates information over time
handles variable-length sequences
can be parallelized

Seq2seq

Encoder–Decoder framework:

Encoder:

$h_t = f(x_t, h_{t-1})$

$c = q(\{h_1, \ldots, h_{T_x}\})$

Sutskever et al. (2014) used an LSTM as $f$ and

$q(\{h_1, \cdots, h_T\}) = h_T$
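To make the encoder recurrence concrete, here is a minimal sketch in plain Python. It assumes a simple tanh RNN cell for $f$ (not the LSTM used by Sutskever et al.), and the names `rnn_step`, `encode`, `W_x`, `W_h` and the toy weights are made up for illustration. The summary $c$ is taken to be the last hidden state, as in $q(\{h_1, \cdots, h_T\}) = h_T$.

```python
import math

def rnn_step(x_t, h_prev, W_x, W_h):
    """One step h_t = f(x_t, h_{t-1}), assuming f = tanh(W_x x_t + W_h h_{t-1})."""
    return [
        math.tanh(
            sum(w * x for w, x in zip(W_x[i], x_t))
            + sum(w * h for w, h in zip(W_h[i], h_prev))
        )
        for i in range(len(h_prev))
    ]

def encode(xs, W_x, W_h, hidden_size):
    """Run the recurrence over the input and return (all states, c = h_T)."""
    h = [0.0] * hidden_size
    states = []
    for x_t in xs:
        h = rnn_step(x_t, h, W_x, W_h)
        states.append(h)
    return states, states[-1]  # q({h_1, ..., h_T}) = h_T

# Toy weights (hypothetical values): 2-d inputs, 2-d hidden state.
W_x = [[0.5, -0.3], [0.1, 0.8]]
W_h = [[0.2, 0.0], [0.0, 0.2]]
xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

states, c = encode(xs, W_x, W_h, hidden_size=2)
print(len(states), c)
```

Because the loop simply runs once per input element, the same code handles sequences of any length, but everything the decoder sees is squeezed into the single vector `c`.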

Decoder:

$p(y) = \prod_{t=1}^{T} p(y_t \mid \{y_1, \ldots, y_{t-1}\}, c)$

$p(y_t \mid \{y_1, \ldots, y_{t-1}\}, c) = g(y_{t-1}, s_t, c)$
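By the chain rule, the probability of the whole output sequence is the product of the per-step conditionals, which turns into a sum in log space. The small sketch below illustrates this; the per-step probabilities are hypothetical numbers, standing in for whatever $g$ would output.

```python
import math

def sequence_log_prob(step_probs):
    """log p(y) = sum_t log p(y_t | y_<t, c); p(y) itself is the product."""
    return sum(math.log(p) for p in step_probs)

# Hypothetical values of p(y_t | y_1..y_{t-1}, c) for a 3-token output.
step_probs = [0.9, 0.5, 0.8]

log_p = sequence_log_prob(step_probs)
p = math.exp(log_p)  # equals 0.9 * 0.5 * 0.8 up to rounding
print(log_p, p)
```

Working with log-probabilities is the usual choice in practice, since the product of many small numbers underflows quickly.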

LEARNING TO ALIGN AND TRANSLATE

Decoder:

Each conditional probability:

$p(y_i \mid \{y_1, \ldots, y_{i-1}\}, x) = g(y_{i-1}, s_i, c_i)$

$s_i = f(s_{i-1}, y_{i-1}, c_i)$

context vector $c_i$:

$c_i = \sum_{j=1}^{T_x} \alpha_{ij} h_j$

$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})}$

$e_{ij} = a(s_{i-1}, h_j)$

In [1], $a$ is computed by:

$a(s_{i-1}, h_j) = v^\top \tanh(W_a s_{i-1} + U_a h_j)$
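The three formulas above (score $e_{ij}$, softmax weights $\alpha_{ij}$, context vector $c_i$) can be sketched in a few lines of plain Python. This is only an illustration: the helper names and the toy parameter values are assumptions, and a real model would learn $W_a$, $U_a$ and $v$.

```python
import math

def matvec(M, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def additive_attention(s_prev, hs, W_a, U_a, v):
    """Return (alpha_i, c_i) for decoder state s_{i-1} and annotations h_1..h_Tx."""
    Ws = matvec(W_a, s_prev)  # W_a s_{i-1}, shared across all j
    scores = []
    for h_j in hs:
        Uh = matvec(U_a, h_j)
        hidden = [math.tanh(a + b) for a, b in zip(Ws, Uh)]
        scores.append(sum(v_k * t for v_k, t in zip(v, hidden)))  # e_ij
    alpha = softmax(scores)  # alpha_ij
    dim = len(hs[0])
    context = [sum(a * h[d] for a, h in zip(alpha, hs)) for d in range(dim)]  # c_i
    return alpha, context

# Toy parameters (hypothetical values).
s_prev = [0.1, -0.2]
hs = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
W_a = [[0.3, 0.1], [0.0, 0.2]]
U_a = [[1.0, 0.0], [0.0, 1.0]]
v = [1.0, -1.0]

alpha, context = additive_attention(s_prev, hs, W_a, U_a, v)
print(alpha, context)
```

Note that `W_a s_{i-1}` does not depend on $j$, so it is computed once per decoder step and reused for every annotation.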


Kinds of attention

Hard and soft attention

Hard attention focuses on a very small region, whereas soft attention spreads its weight more diffusely.
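One common way to make this distinction concrete, following the formulation in "Show, Attend and Tell", is that soft attention takes the expected annotation under the attention weights, while hard attention samples a single position. The sketch below assumes that reading; the function names and toy values are hypothetical.

```python
import random

def soft_attend(alpha, hs):
    """Soft attention: weighted average of all annotations (differentiable)."""
    dim = len(hs[0])
    return [sum(a * h[d] for a, h in zip(alpha, hs)) for d in range(dim)]

def hard_attend(alpha, hs, rng):
    """Hard attention: sample one position j ~ alpha and return only h_j."""
    j = rng.choices(range(len(hs)), weights=alpha, k=1)[0]
    return hs[j]

alpha = [0.1, 0.7, 0.2]                      # hypothetical attention weights
hs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]    # hypothetical annotations

soft = soft_attend(alpha, hs)                # blends all three annotations
hard = hard_attend(alpha, hs, random.Random(0))  # exactly one annotation
print(soft, hard)
```

Because sampling is not differentiable, hard attention is typically trained with sampling-based gradient estimators, whereas soft attention can be trained with ordinary backpropagation.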

Global and local attention


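Four ways to compute the alignment function, as defined in "Effective Approaches to Attention-based Neural Machine Translation", where $h_t$ is the current target hidden state and $\bar{h}_s$ is a source hidden state:

$\mathrm{score}(h_t, \bar{h}_s) = h_t^\top \bar{h}_s \qquad \qquad dot$

$\mathrm{score}(h_t, \bar{h}_s) = h_t^\top W_a \bar{h}_s \qquad \qquad general$

$\mathrm{score}(h_t, \bar{h}_s) = v_a^\top \tanh(W_a [h_t; \bar{h}_s]) \qquad \qquad concat$

$a_t = \mathrm{softmax}(W_a h_t) \qquad \qquad location$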
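As a rough sketch, the alignment functions from "Effective Approaches to Attention-based Neural Machine Translation" (usually called dot, general, concat and location) can be written as below. The function names, shapes and toy values are assumptions for illustration, and each function has its own $W_a$ with a different shape.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(M, v):
    return [dot(row, v) for row in M]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def score_dot(h_t, h_s):
    """dot: h_t^T h_s"""
    return dot(h_t, h_s)

def score_general(h_t, h_s, W_a):
    """general: h_t^T W_a h_s, with W_a of shape (d, d)"""
    return dot(h_t, matvec(W_a, h_s))

def score_concat(h_t, h_s, W_a, v_a):
    """concat: v_a^T tanh(W_a [h_t; h_s]), with W_a of shape (k, 2d)"""
    return dot(v_a, [math.tanh(z) for z in matvec(W_a, h_t + h_s)])

def location_weights(h_t, W_a):
    """location: a_t = softmax(W_a h_t); depends on the target state only."""
    return softmax(matvec(W_a, h_t))

h_t, h_s = [1.0, 2.0], [3.0, 4.0]            # hypothetical hidden states
identity = [[1.0, 0.0], [0.0, 1.0]]
W_cat = [[0.1, 0.2, 0.3, 0.4], [0.0, -0.1, 0.0, 0.1]]
W_loc = [[0.5, 0.0], [0.0, 0.5], [0.1, 0.1]]  # one row per source position

print(score_dot(h_t, h_s))
print(score_general(h_t, h_s, identity))
print(score_concat(h_t, h_s, W_cat, [1.0, 1.0]))
print(location_weights(h_t, W_loc))
```

The first three are content-based (they compare the target state with each source state), while the location variant produces weights without looking at the source states at all.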


attention in feed-forward NN


simplified version of attention, where the alignment function depends only on $h_t$:

$a(h_t) = \tanh(W_{hc} h_t + b_{hc})$
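A minimal sketch of this simplified attention follows. It assumes the usual setup in which the scores $e_t = a(h_t)$ are normalized with a softmax and the output is the weighted sum $c = \sum_t \alpha_t h_t$, and it treats $W_{hc}$ as a single weight vector so that each score is a scalar. The names and toy values are hypothetical.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def feed_forward_attention(hs, w_hc, b_hc):
    """Score each h_t on its own, normalize, and return (alpha, c)."""
    # e_t = a(h_t) = tanh(w_hc . h_t + b_hc); no decoder state is involved.
    scores = [math.tanh(sum(w * x for w, x in zip(w_hc, h)) + b_hc) for h in hs]
    alpha = softmax(scores)
    dim = len(hs[0])
    c = [sum(a * h[d] for a, h in zip(alpha, hs)) for d in range(dim)]
    return alpha, c

hs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # hypothetical hidden states
w_hc, b_hc = [0.5, -0.5], 0.0                # hypothetical parameters

alpha, c = feed_forward_attention(hs, w_hc, b_hc)
print(alpha, c)
```

Since the score for each position uses only that position's hidden state, all scores can be computed independently of one another.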

Hierarchical Attention

Attention is applied at two levels, word level and sentence level. Both use the same inner attention mechanism:

The annotation $h_t$ is first passed to a dense layer. An alignment coefficient $\alpha_t$ is then derived by comparing the output $u_t$ of the dense layer with a trainable context vector $u$ (initialized randomly) and normalizing with a softmax. The attentional vector $s$ is finally obtained as a weighted sum of the annotations.

The score can in theory be any alignment function. A straightforward choice is the dot product. The context vector can be interpreted as a representation of the optimal word, on average. When faced with a new example, the model uses this knowledge to decide which word it should pay attention to. During training, through backpropagation, the model updates the context vector, i.e., it adjusts its internal representation of what the optimal word is.

Note: The context vector in the definition of inner attention above has nothing to do with the context vector used in seq2seq attention!
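The inner attention described above (dense layer, comparison with a trainable context vector, softmax, weighted sum) can be sketched as follows. It assumes a tanh activation for the dense layer and a dot product as the score; the names `W`, `b`, `u` and the toy values are hypothetical, and in a real model all three would be learned.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def inner_attention(hs, W, b, u):
    """Return (alpha, s) for annotations hs and trainable context vector u."""
    scores = []
    for h_t in hs:
        # Dense layer (tanh assumed): u_t = tanh(W h_t + b)
        u_t = [math.tanh(sum(w * x for w, x in zip(row, h_t)) + b_k)
               for row, b_k in zip(W, b)]
        # Compare u_t with the context vector u using a dot product.
        scores.append(sum(a * c for a, c in zip(u_t, u)))
    alpha = softmax(scores)
    dim = len(hs[0])
    s = [sum(a * h[d] for a, h in zip(alpha, hs)) for d in range(dim)]
    return alpha, s

hs = [[1.0, 0.0], [0.0, 1.0]]        # hypothetical word annotations
W = [[1.0, 0.0], [0.0, 1.0]]         # hypothetical dense-layer weights
b = [0.0, 0.0]
u = [1.0, 0.0]                       # context vector (random at init in practice)

alpha, s = inner_attention(hs, W, b, u)
print(alpha, s)
```

Here `u` plays the role of a fixed query shared by every example, which is why it is unrelated to the per-step context vector $c_i$ of seq2seq attention.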

self-attention

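In self-attention, K = V = Q. For example, given an input sentence, every word attends to all words in that sentence. The goal is to learn the dependencies between words within the sentence and capture its internal structure.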
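A minimal sketch of self-attention with Q = K = V: every row of the input matrix `X` (one vector per word) is used as a query against all rows. The scaling by $1/\sqrt{d}$ is an assumption borrowed from common Transformer practice, not something stated in this post, and the toy vectors are made up.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(X):
    """Dot-product self-attention where queries, keys and values are all X."""
    d = len(X[0])
    scale = 1.0 / math.sqrt(d)  # assumption: Transformer-style scaling
    outputs, weights = [], []
    for q in X:  # each word acts as the query once
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in X]
        alpha = softmax(scores)
        weights.append(alpha)
        outputs.append([sum(a * v[j] for a, v in zip(alpha, X)) for j in range(d)])
    return outputs, weights

X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # hypothetical word vectors
outputs, weights = self_attention(X)
for row in weights:
    print([round(w, 3) for w in row])
```

Each output row is a mixture of all word vectors in the sentence, so every position can draw on every other position in a single step.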


Conclusion

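In essence, an attention function can be described as a mapping from a query and a set of key-value pairs to an output.

Think of the elements of Source as a collection of <Key, Value> pairs. Given an element Query from Target, we compute the similarity or relevance between Query and each Key to obtain the weight of the corresponding Value, then take the weighted sum of the Values to get the final attention value. So attention is essentially a weighted sum over the Values of the elements in Source, with Query and Key used to compute the weight of each Value.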

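Abstracting over most current methods, the computation of attention can be summarized in three stages: first, compute the similarity or relevance between Query and each Key; second, normalize the raw scores from the first stage; third, take the weighted sum of the Values using the resulting weights.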
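The three stages map directly onto three lines of code. The sketch below assumes a dot product as the similarity and a softmax as the normalization, which are only one possible choice; the function name and toy data are hypothetical.

```python
import math

def attention(query, keys, values):
    """Generic attention over <Key, Value> pairs, in three stages."""
    # Stage 1: similarity between the Query and each Key (dot product assumed).
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    # Stage 2: normalize the raw scores (softmax assumed).
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Stage 3: weighted sum of the Values.
    dim = len(values[0])
    output = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return weights, output

keys = [[1.0, 0.0], [0.0, 1.0]]     # hypothetical Keys from Source
values = [[1.0], [2.0]]             # hypothetical Values paired with the Keys

# A query orthogonal to nothing in particular weights both values equally.
w_even, out_even = attention([0.0, 0.0], keys, values)
# A query strongly aligned with the first key returns almost exactly its value.
w_peak, out_peak = attention([10.0, 0.0], keys, values)
print(out_even, out_peak)
```

Passing the same sequence as `keys` and `values` gives the K = V case, and using each element of that sequence as the `query` as well gives self-attention.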

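In a typical Encoder-Decoder task, the input Source and the output Target differ. In English-to-Chinese machine translation, for example, Source is the English sentence and Target is its Chinese translation, and attention operates between a Query element of Target and all elements of Source. Here K = V.
Self-attention is attention among the elements within Source or among the elements within Target. It can also be viewed as attention in the special case Target = Source. Here Q = K = V.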

Paper:

[1] Neural Machine Translation by Jointly Learning to Align and Translate. https://arxiv.org/pdf/1409.0473v7.pdf

[2] Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. http://cn.arxiv.org/pdf/1502.03044v3.pdf

[3] Effective Approaches to Attention-based Neural Machine Translation. http://cn.arxiv.org/pdf/1508.04025v5.pdf

