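RNN-Based Language Models and Neural Machine Translation (NMT)
阿新 • Published: 2019-02-09
Language models built on RNNs have reached state-of-the-art results in machine translation. This article briefly introduces language models, machine translation, the RNN-based seq2seq architecture, and ways to optimize it.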
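Language Models
A language model computes the probability of a sequence of words.
In machine translation, language models are used for:
- Word ordering: p(the cat is small) > p(small the is cat)
- Word choice: p(walking home after school) > p(walking house after school)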
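Traditional Language Models
Traditional language models rely on two assumptions to turn the joint probability of a word sequence into a product of per-word conditional probabilities:
- Each word depends only on the words that precede it.
- Each word depends only on the k words that immediately precede it.
Each conditional probability is estimated from n-gram counts, as shown in the figure below.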
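Written out, the two assumptions give the factorization below. This is the standard n-gram formulation, sketched here on the assumption that probabilities are estimated from raw counts and that the context length is k = n - 1; the bigram case is shown for the count estimate.

```latex
\begin{align}
p(w_1, \dots, w_m) &= \prod_{i=1}^{m} p(w_i \mid w_1, \dots, w_{i-1}) \\
                   &\approx \prod_{i=1}^{m} p(w_i \mid w_{i-k}, \dots, w_{i-1}), \\
p(w_2 \mid w_1)    &= \frac{\mathrm{count}(w_1, w_2)}{\mathrm{count}(w_1)}
\end{align}
```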
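A major drawback of traditional language models, however, is that improving accuracy requires a larger n in the n-gram, and the memory needed to store the counts grows exponentially with n.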
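To make the count-based approach concrete, here is a minimal bigram sketch. The toy corpus, the `<s>`/`</s>` boundary markers, and the add-one smoothing (`alpha`) are assumptions chosen for illustration and do not come from the original article.

```python
from collections import Counter


def train_bigram(sentences):
    """Count contexts and bigrams over whitespace-tokenized sentences."""
    contexts, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        contexts.update(tokens[:-1])                  # every token that has a successor
        bigrams.update(zip(tokens[:-1], tokens[1:]))  # (previous, current) pairs
    return contexts, bigrams


def sentence_prob(sentence, contexts, bigrams, vocab_size, alpha=1.0):
    """Product of smoothed bigram probabilities p(current | previous)."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    prob = 1.0
    for prev, cur in zip(tokens[:-1], tokens[1:]):
        prob *= (bigrams[(prev, cur)] + alpha) / (contexts[prev] + alpha * vocab_size)
    return prob


# Hypothetical toy corpus, for illustration only.
corpus = ["the cat is small", "the dog is small", "the cat is here"]
contexts, bigrams = train_bigram(corpus)
vocab_size = len({w for s in corpus for w in s.split()} | {"</s>"})

p_good = sentence_prob("the cat is small", contexts, bigrams, vocab_size)
p_bad = sentence_prob("small the is cat", contexts, bigrams, vocab_size)
print(p_good > p_bad)
```

The bigram table has one entry per observed word pair. An n-gram table can need up to |V|^n entries, which is the exponential memory growth described above.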
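RNN-Based Language Models
An RNN-based language model exploits the fact that an RNN naturally consumes sequences: a fully connected layer and a softmax layer are placed on top of the hidden units to produce a probability distribution over the output word.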
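A single time step of such a model can be sketched in plain Python as follows. The weight names (`W_xh`, `W_hh`, `W_hy`), the tanh activation, the omission of bias terms, and the tiny random weights are all assumptions made for illustration.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def rnn_lm_step(x_embed, h_prev, W_xh, W_hh, W_hy):
    """One step: new hidden state plus a distribution over the vocabulary."""
    pre = [a + b for a, b in zip(matvec(W_xh, x_embed), matvec(W_hh, h_prev))]
    h = [math.tanh(p) for p in pre]
    probs = softmax(matvec(W_hy, h))  # fully connected layer followed by softmax
    return h, probs


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
embed_size, hidden_size, vocab_size = 4, 3, 5
W_xh = random_matrix(hidden_size, embed_size, rng)
W_hh = random_matrix(hidden_size, hidden_size, rng)
W_hy = random_matrix(vocab_size, hidden_size, rng)

h = [0.0] * hidden_size
for x_embed in random_matrix(2, embed_size, rng):  # two toy "word embeddings"
    h, probs = rnn_lm_step(x_embed, h, W_xh, W_hh, W_hy)
print(sum(probs))
```

The hidden state `h` is carried from one step to the next, which is what lets the predicted distribution depend on the whole prefix rather than on a fixed window.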
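RNNs are hard to train, however. Commonly used tricks include:
- Gradient clipping
- Initialization (identity matrix) + ReLUs
- Class-based word prediction, which factors p(w_t | history) into p(c_t | history) · p(w_t | c_t)
- Optimizers such as RMSProp and Adam
- Gated units such as GRU and LSTM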
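As an illustration of the first trick, the sketch below clips gradients by their global norm. This is one common variant; the function name and the list-of-lists gradient layout are assumptions for the example, not something the article specifies.

```python
import math


def clip_by_global_norm(grads, max_norm):
    """Rescale all gradients together so their joint L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for vec in grads for g in vec))
    if norm <= max_norm:
        return grads, norm
    scale = max_norm / norm
    return [[g * scale for g in vec] for vec in grads], norm


clipped, original_norm = clip_by_global_norm([[3.0, 4.0]], max_norm=1.0)
print(clipped, original_norm)
```

Rescaling keeps the direction of the update and only limits its size, which helps when gradients occasionally explode during backpropagation through time.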
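Machine Translation
Statistical Machine Translation Architecture
Put simply, a statistical machine translation system involves two steps:
- Build an alignment from source to target.
- Use the source-to-target alignment to generate candidate combinations, then use a language model to pick the combination with the highest probability.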
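RNN-Based seq2seq Architecture
seq2seq Structure
An RNN-based seq2seq architecture consists of an encoder and a decoder. The decoder in turn has two modes, training and inference. The structure is shown in the two figures below: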
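Optimizing seq2seq
- In the basic seq2seq decoder, the only inputs are h_{t-1} and x_t. On top of these, extra information can be added: y_{t-1} and s_enc.
- Deepen the network.
- Use a bidirectional RNN, which combines the preceding and following context.
- Train with the input sequence in reverse order, which makes the optimization problem simpler.
- Use GRU or LSTM units.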
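Attention
The core idea of attention is to store the encoder states and let the decoder selectively draw on them at every step. It involves the following steps:
- Compute a score between the decoder state at the current step and the encoder state at each step.
- Normalize the scores (typically with a softmax).
- Use the normalized scores as weights to form a linear combination of the encoder states.
- Feed the combined state into the decoder at the current step, which gives the decoder extra information.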
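The four steps can be sketched in a few lines of plain Python. Dot-product scoring and softmax normalization are assumptions here; the article does not say which scoring function it has in mind.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(dec_state, enc_states):
    """Return (weights, context) for one decoder step."""
    # 1. Score the decoder state against every encoder state (dot product).
    scores = [sum(d * e for d, e in zip(dec_state, h)) for h in enc_states]
    # 2. Normalize the scores into weights that sum to 1.
    weights = softmax(scores)
    # 3. Weighted sum of the encoder states gives the context vector.
    size = len(enc_states[0])
    context = [sum(w * h[i] for w, h in zip(weights, enc_states)) for i in range(size)]
    # 4. The caller feeds `context` to the decoder as extra input.
    return weights, context


weights, context = attention([10.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
print(weights, context)
```

In this toy call the decoder state points along the first encoder state, so almost all of the weight lands on it and the context vector is close to that state.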
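Search in the Decoder
At inference time, the decoder can choose its output sequence in several ways:
- Exhaustive Search: enumerate every possible sequence. Not practical.
- Ancestral Sampling: sample each token from the predicted distribution. It is efficient and unbiased, but has high variance.
- Greedy Search: pick the most probable token at each step. It is efficient, but it can easily miss the best overall sequence.
- Beam Search: the commonly used method. With a beam size of K, the first step keeps K hypotheses; at each later step every hypothesis is expanded (giving K^2 candidates if each contributes its top K continuations) and the best K are kept. It gives good quality at the cost of more computation.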
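The difference between greedy and beam search shows up even on a tiny example. The sketch below is a simplified beam search over a hypothetical lookup-table "model" (`toy_next_probs`): it decodes a fixed number of steps and leaves out EOS handling and length normalization, which a real decoder would need. A beam size of 1 is the same as greedy search.

```python
import math


def beam_search(next_probs, beam_size, max_len):
    """Keep the beam_size best prefixes by summed log-probability."""
    beams = [((), 0.0)]  # (prefix, log-probability)
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for token, prob in next_probs(prefix).items():
                candidates.append((prefix + (token,), score + math.log(prob)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_size]
    return beams


def toy_next_probs(prefix):
    """Hypothetical next-token distributions, for illustration only."""
    table = {
        (): {"a": 0.6, "b": 0.4},
        ("a",): {"x": 0.5, "y": 0.5},
        ("b",): {"x": 0.9, "y": 0.1},
    }
    return table[prefix]


greedy_best = beam_search(toy_next_probs, beam_size=1, max_len=2)[0]
beam_best = beam_search(toy_next_probs, beam_size=2, max_len=2)[0]
print(greedy_best[0], math.exp(greedy_best[1]))
print(beam_best[0], math.exp(beam_best[1]))
```

Greedy search commits to "a" because it is the best first token, and ends with a sequence probability of 0.3. A beam of 2 also keeps "b" and finds ("b", "x") with probability 0.36.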
TensorFlow Example
The example code below is based on TensorFlow 1.1 and draws on sample code from the Udacity Deep Learning Nanodegree.
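Inputs:
- input: a rank-2 Tensor holding the input data (the translation source). Shorter sequences are padded with PAD. This is the encoder input.
- targets: a rank-2 Tensor holding the output data (the translation target). EOS is appended to every sequence and shorter sequences are padded with PAD. This is the output the decoder is expected to produce.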
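The padding convention can be illustrated with a small helper. The function `pad_batch` and the token ids used here (PAD = 0, EOS = 2) are hypothetical and only for illustration; the actual project may build its batches differently.

```python
def pad_batch(sequences, pad_id, eos_id=None):
    """Optionally append EOS, then right-pad every sequence to the batch maximum."""
    if eos_id is not None:
        sequences = [seq + [eos_id] for seq in sequences]
    lengths = [len(seq) for seq in sequences]
    max_len = max(lengths)
    padded = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    return padded, lengths


PAD_ID, EOS_ID = 0, 2  # hypothetical ids

source_batch, source_lengths = pad_batch([[5, 6, 7], [8, 9]], PAD_ID)
target_batch, target_lengths = pad_batch([[11, 12, 13], [14, 15]], PAD_ID, eos_id=EOS_ID)
print(source_batch, source_lengths)
print(target_batch, target_lengths)
```

The unpadded lengths are returned alongside the batch because the model needs them to ignore PAD positions in the encoder and in the loss.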
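The network is built by the following functions:
- model_inputs: creates the network inputs.
- process_decoder_input: removes the last column of target and prepends GO to every sequence, producing the decoder input dec_input.
- encoding_layer: applies an embedding layer and an RNN to input, returning enc_output and enc_state.
- decoding_layer_train: returns the decoder output for training, dec_outputs_train.
- decoding_layer_infer: returns the decoder output for inference, dec_outputs_infer.
- decoding_layer: applies an embedding layer and an RNN to dec_input, returning the training and inference outputs dec_outputs_train and dec_outputs_infer.
- seq2seq_model: wraps the encoder and decoder together.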
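The effect of process_decoder_input is easiest to see on plain lists. The sketch below mirrors the "drop the last column, prepend GO" step without TensorFlow; the function name `make_decoder_input` and the token ids (GO = 1, EOS = 2, PAD = 0) are hypothetical.

```python
def make_decoder_input(target_batch, go_id):
    """Drop the last token of every row and prepend GO (teacher-forcing input)."""
    return [[go_id] + row[:-1] for row in target_batch]


GO_ID = 1  # hypothetical id
targets = [[5, 6, 7, 2],   # 2 = EOS
           [8, 9, 2, 0]]   # 0 = PAD
dec_input = make_decoder_input(targets, GO_ID)
print(dec_input)
```

Shifting the target one position to the right means that at step t the decoder sees the correct token from step t-1 and is trained to predict the token at step t.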
# prepare input
import numpy as np
import tensorflow as tf


def model_inputs():
    """
    Create TF Placeholders for input, targets, learning rate, and lengths of source and target sequences.
    :return: Tuple (input, targets, learning rate, keep probability, target sequence length,
    max target sequence length, source sequence length)
    """
    # input parameters (local name avoids shadowing the built-in `input`)
    input_data = tf.placeholder(tf.int32, [None, None], name="input")
    targets = tf.placeholder(tf.int32, [None, None], name="targets")
    # training parameters
    learning_rate = tf.placeholder(tf.float32, name="learning_rate")
    keep_prob = tf.placeholder(tf.float32, name="keep_prob")
    # sequence length parameters
    target_sequence_length = tf.placeholder(tf.int32, [None], name="target_sequence_length")
    max_target_sequence_length = tf.reduce_max(target_sequence_length)
    source_sequence_length = tf.placeholder(tf.int32, [None], name="source_sequence_length")
    return (input_data, targets, learning_rate, keep_prob, target_sequence_length,
            max_target_sequence_length, source_sequence_length)


def process_decoder_input(target_data, target_vocab_to_int, batch_size):
    """
    Preprocess target data for decoding
    :param target_data: Target Placeholder
    :param target_vocab_to_int: Dictionary to go from the target words to an id
    :param batch_size: Batch Size
    :return: Preprocessed target data
    """
    # drop the last column of every target sequence
    x = tf.strided_slice(target_data, [0, 0], [batch_size, -1], [1, 1])
    # prepend the <GO> id to every sequence
    y = tf.concat([tf.fill([batch_size, 1], target_vocab_to_int['<GO>']), x], 1)
    return y
# Encoding
def encoding_layer(rnn_inputs, rnn_size, num_layers, keep_prob,
                   source_sequence_length, source_vocab_size,
                   encoding_embedding_size):
    """
    Create encoding layer
    :param rnn_inputs: Inputs for the RNN
    :param rnn_size: RNN Size
    :param num_layers: Number of layers
    :param keep_prob: Dropout keep probability (accepted but not applied in this function)
    :param source_sequence_length: a list of the lengths of each sequence in the batch
    :param source_vocab_size: vocabulary size of source data
    :param encoding_embedding_size: embedding size of source data
    :return: tuple (RNN output, RNN state)
    """
    # embedding input
    enc_inputs = tf.contrib.layers.embed_sequence(rnn_inputs, source_vocab_size, encoding_embedding_size)
    # construct rnn cell
    cell = tf.contrib.rnn.MultiRNNCell(
        [tf.contrib.rnn.LSTMCell(rnn_size) for _ in range(num_layers)])
    # rnn forward
    enc_output, enc_state = tf.nn.dynamic_rnn(cell, enc_inputs,
                                              sequence_length=source_sequence_length,
                                              dtype=tf.float32)
    return enc_output, enc_state
# Decoding
## Decoding Training
def decoding_layer_train(encoder_state, dec_cell, dec_embed_input,
                         target_sequence_length, max_summary_length,
                         output_layer, keep_prob):
    """
    Create a decoding layer for training
    :param encoder_state: Encoder State
    :param dec_cell: Decoder RNN Cell
    :param dec_embed_input: Decoder embedded input
    :param target_sequence_length: The lengths of each sequence in the target batch
    :param max_summary_length: The length of the longest sequence in the batch
    :param output_layer: Function to apply the output layer
    :param keep_prob: Dropout keep probability
    :return: BasicDecoderOutput containing training logits and sample_id
    """
    # feed the ground-truth target tokens at every step (teacher forcing)
    training_helper = tf.contrib.seq2seq.TrainingHelper(dec_embed_input, target_sequence_length)
    decoder = tf.contrib.seq2seq.BasicDecoder(dec_cell, training_helper, encoder_state,
                                              output_layer=output_layer)
    dec_outputs, dec_state = tf.contrib.seq2seq.dynamic_decode(
        decoder, impute_finished=True, maximum_iterations=max_summary_length)
    return dec_outputs
## Decoding Inference
def decoding_layer_infer(encoder_state, dec_cell, dec_embeddings, start_of_sequence_id,
                         end_of_sequence_id, max_target_sequence_length,
                         vocab_size, output_layer, batch_size, keep_prob):
    """
    Create a decoding layer for inference
    :param encoder_state: Encoder state
    :param dec_cell: Decoder RNN Cell
    :param dec_embeddings: Decoder embeddings
    :param start_of_sequence_id: GO ID
    :param end_of_sequence_id: EOS Id
    :param max_target_sequence_length: Maximum length of target sequences
    :param vocab_size: Size of decoder/target vocabulary
    :param output_layer: Function to apply the output layer
    :param batch_size: Batch size
    :param keep_prob: Dropout keep probability
    :return: BasicDecoderOutput containing inference logits and sample_id
    """
    # every sequence in the batch starts decoding from the GO token
    start_tokens = tf.tile(tf.constant([start_of_sequence_id], dtype=tf.int32),
                           [batch_size], name='start_tokens')
    # greedy decoding: feed back the most probable token at each step
    inference_helper = tf.contrib.seq2seq.GreedyEmbeddingHelper(dec_embeddings,
                                                                start_tokens, end_of_sequence_id)
    decoder = tf.contrib.seq2seq.BasicDecoder(dec_cell, inference_helper, encoder_state,
                                              output_layer=output_layer)
    dec_outputs, dec_state = tf.contrib.seq2seq.dynamic_decode(
        decoder, impute_finished=True,
        maximum_iterations=max_target_sequence_length)
    return dec_outputs
## Decoding Layer
from tensorflow.python.layers import core as layers_core


def decoding_layer(dec_input, encoder_state,
                   target_sequence_length, max_target_sequence_length,
                   rnn_size,
                   num_layers, target_vocab_to_int, target_vocab_size,
                   batch_size, keep_prob, decoding_embedding_size):
    """
    Create decoding layer
    :param dec_input: Decoder input
    :param encoder_state: Encoder state
    :param target_sequence_length: The lengths of each sequence in the target batch
    :param max_target_sequence_length: Maximum length of target sequences
    :param rnn_size: RNN Size
    :param num_layers: Number of layers
    :param target_vocab_to_int: Dictionary to go from the target words to an id
    :param target_vocab_size: Size of target vocabulary
    :param batch_size: The size of the batch
    :param keep_prob: Dropout keep probability
    :param decoding_embedding_size: Decoding embedding size
    :return: Tuple of (Training BasicDecoderOutput, Inference BasicDecoderOutput)
    """
    # embedding target sequence
    dec_embeddings = tf.Variable(tf.random_uniform([target_vocab_size, decoding_embedding_size]))
    dec_embed_input = tf.nn.embedding_lookup(dec_embeddings, dec_input)
    # construct decoder lstm cell
    dec_cell = tf.contrib.rnn.MultiRNNCell(
        [tf.contrib.rnn.LSTMCell(rnn_size) for _ in range(num_layers)])
    # create output layer to map the outputs of the decoder to the elements of our vocabulary
    output_layer = layers_core.Dense(
        target_vocab_size,
        kernel_initializer=tf.truncated_normal_initializer(mean=0.0, stddev=0.1))
    # decoder train
    with tf.variable_scope("decoding"):
        dec_outputs_train = decoding_layer_train(encoder_state, dec_cell, dec_embed_input,
                                                 target_sequence_length, max_target_sequence_length,
                                                 output_layer, keep_prob)
    # decoder inference (reuses the variables created for training)
    start_of_sequence_id = target_vocab_to_int["<GO>"]
    end_of_sequence_id = target_vocab_to_int["<EOS>"]
    with tf.variable_scope("decoding", reuse=True):
        dec_outputs_infer = decoding_layer_infer(encoder_state, dec_cell, dec_embeddings, start_of_sequence_id,
                                                 end_of_sequence_id, max_target_sequence_length,
                                                 target_vocab_size, output_layer, batch_size, keep_prob)
    # return
    return dec_outputs_train, dec_outputs_infer
# Seq2seq
def seq2seq_model(input_data, target_data, keep_prob, batch_size,
                  source_sequence_length, target_sequence_length,
                  max_target_sentence_length,
                  source_vocab_size, target_vocab_size,
                  enc_embedding_size, dec_embedding_size,
                  rnn_size, num_layers, target_vocab_to_int):
    """
    Build the Sequence-to-Sequence part of the neural network
    :param input_data: Input placeholder
    :param target_data: Target placeholder
    :param keep_prob: Dropout keep probability placeholder
    :param batch_size: Batch Size
    :param source_sequence_length: Sequence Lengths of source sequences in the batch
    :param target_sequence_length: Sequence Lengths of target sequences in the batch
    :param max_target_sentence_length: Length of the longest target sequence in the batch
    :param source_vocab_size: Source vocabulary size
    :param target_vocab_size: Target vocabulary size
    :param enc_embedding_size: Encoder embedding size
    :param dec_embedding_size: Decoder embedding size
    :param rnn_size: RNN Size
    :param num_layers: Number of layers
    :param target_vocab_to_int: Dictionary to go from the target words to an id
    :return: Tuple of (Training BasicDecoderOutput, Inference BasicDecoderOutput)
    """
    # embedding and encoding
    enc_output, enc_state = encoding_layer(input_data, rnn_size, num_layers, keep_prob,
                                           source_sequence_length, source_vocab_size,
                                           enc_embedding_size)
    # process target data
    dec_input = process_decoder_input(target_data, target_vocab_to_int, batch_size)
    # embedding and decoding
    dec_outputs_train, dec_outputs_infer = decoding_layer(dec_input, enc_state,
                                                          target_sequence_length, max_target_sentence_length,
                                                          rnn_size,
                                                          num_layers, target_vocab_to_int, target_vocab_size,
                                                          batch_size, keep_prob, dec_embedding_size)
    return dec_outputs_train, dec_outputs_infer
# Build Graph
# `helper`, `get_batches` and the hyperparameters (batch_size, rnn_size, num_layers,
# encoding_embedding_size, decoding_embedding_size, learning_rate, keep_probability,
# epochs, display_step) are not shown in this article and are assumed to be defined elsewhere.
save_path = 'checkpoints/dev'
(source_int_text, target_int_text), (source_vocab_to_int, target_vocab_to_int), _ = helper.load_preprocess()
max_target_sentence_length = max([len(sentence) for sentence in source_int_text])
train_graph = tf.Graph()
with train_graph.as_default():
    input_data, targets, lr, keep_prob, target_sequence_length, max_target_sequence_length, source_sequence_length = model_inputs()
    input_shape = tf.shape(input_data)
    # Feed the source in reverse order. Note that tf.reverse also moves the padding
    # to the front; tf.reverse_sequence would reverse only the valid tokens.
    train_logits, inference_logits = seq2seq_model(tf.reverse(input_data, [-1]),
                                                   targets,
                                                   keep_prob,
                                                   batch_size,
                                                   source_sequence_length,
                                                   target_sequence_length,
                                                   max_target_sequence_length,
                                                   len(source_vocab_to_int),
                                                   len(target_vocab_to_int),
                                                   encoding_embedding_size,
                                                   decoding_embedding_size,
                                                   rnn_size,
                                                   num_layers,
                                                   target_vocab_to_int)
    training_logits = tf.identity(train_logits.rnn_output, name='logits')
    inference_logits = tf.identity(inference_logits.sample_id, name='predictions')
    # 1.0 for real tokens, 0.0 for padding, so PAD positions do not contribute to the loss
    masks = tf.sequence_mask(target_sequence_length, max_target_sequence_length, dtype=tf.float32, name='masks')
    with tf.name_scope("optimization"):
        # Loss function
        cost = tf.contrib.seq2seq.sequence_loss(
            training_logits,
            targets,
            masks)
        # Optimizer
        optimizer = tf.train.AdamOptimizer(lr)
        # Gradient Clipping
        gradients = optimizer.compute_gradients(cost)
        capped_gradients = [(tf.clip_by_value(grad, -1., 1.), var) for grad, var in gradients if grad is not None]
        train_op = optimizer.apply_gradients(capped_gradients)
# Training
def get_accuracy(target, logits):
    """
    Calculate accuracy
    """
    # pad the shorter of the two arrays so both have the same sequence length
    max_seq = max(target.shape[1], logits.shape[1])
    if max_seq - target.shape[1]:
        target = np.pad(
            target,
            [(0, 0), (0, max_seq - target.shape[1])],
            'constant')
    if max_seq - logits.shape[1]:
        logits = np.pad(
            logits,
            [(0, 0), (0, max_seq - logits.shape[1])],
            'constant')
    return np.mean(np.equal(target, logits))
# Split data to training and validation sets
train_source = source_int_text[batch_size:]
train_target = target_int_text[batch_size:]
valid_source = source_int_text[:batch_size]
valid_target = target_int_text[:batch_size]
(valid_sources_batch, valid_targets_batch,
 valid_sources_lengths, valid_targets_lengths) = next(get_batches(valid_source,
                                                                   valid_target,
                                                                   batch_size,
                                                                   source_vocab_to_int['<PAD>'],
                                                                   target_vocab_to_int['<PAD>']))
with tf.Session(graph=train_graph) as sess:
    sess.run(tf.global_variables_initializer())
    for epoch_i in range(epochs):
        for batch_i, (source_batch, target_batch, sources_lengths, targets_lengths) in enumerate(
                get_batches(train_source, train_target, batch_size,
                            source_vocab_to_int['<PAD>'],
                            target_vocab_to_int['<PAD>'])):
            _, loss = sess.run(
                [train_op, cost],
                {input_data: source_batch,
                 targets: target_batch,
                 lr: learning_rate,
                 target_sequence_length: targets_lengths,
                 source_sequence_length: sources_lengths,
                 keep_prob: keep_probability})
            if batch_i % display_step == 0 and batch_i > 0:
                batch_train_logits = sess.run(
                    inference_logits,
                    {input_data: source_batch,
                     source_sequence_length: sources_lengths,
                     target_sequence_length: targets_lengths,
                     keep_prob: 1.0})
                batch_valid_logits = sess.run(
                    inference_logits,
                    {input_data: valid_sources_batch,
                     source_sequence_length: valid_sources_lengths,
                     target_sequence_length: valid_targets_lengths,
                     keep_prob: 1.0})
                train_acc = get_accuracy(target_batch, batch_train_logits)
                valid_acc = get_accuracy(valid_targets_batch, batch_valid_logits)
                print('Epoch {:>3} Batch {:>4}/{} - Train Accuracy: {:>6.4f}, Validation Accuracy: {:>6.4f}, Loss: {:>6.4f}'
                      .format(epoch_i, batch_i, len(source_int_text) // batch_size, train_acc, valid_acc, loss))
    # Save Model
    saver = tf.train.Saver()
    saver.save(sess, save_path)
    print('Model Trained and Saved')