RNN-Based Language Models and Neural Machine Translation (NMT)

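Language models built on recurrent neural networks (RNNs) have reached state-of-the-art results in machine translation. This article briefly introduces language models, machine translation, the RNN-based seq2seq architecture, and ways to optimize it.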

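Language Models

A language model computes the probability of a sequence of words, P(w1, w2, ..., wT).

In machine translation, a language model helps with:

  • Word ordering: p(the cat is small) > p(small the is cat)
  • Word choice: p(walking home after school) > p(walking house after school)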

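Traditional Language Models

Traditional language models rely on two assumptions to turn the joint probability of a word sequence into a product of per-word conditional probabilities:

  • Each word depends only on the words that precede it.
  • Each word depends only on the k words that immediately precede it.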

The conditional probability of each word is estimated from n-gram counts.
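Written out, the two assumptions lead to the factorization below. This is the standard n-gram formulation, given here as a sketch; count(·) denotes the number of occurrences in the training corpus, and the bigram and trigram estimates are shown as examples.

```latex
\begin{align}
P(w_1, \dots, w_T) &= \prod_{t=1}^{T} P(w_t \mid w_1, \dots, w_{t-1}) \\
                   &\approx \prod_{t=1}^{T} P(w_t \mid w_{t-k}, \dots, w_{t-1}) \\
P(w_2 \mid w_1) &= \frac{\operatorname{count}(w_1, w_2)}{\operatorname{count}(w_1)}, \qquad
P(w_3 \mid w_1, w_2) = \frac{\operatorname{count}(w_1, w_2, w_3)}{\operatorname{count}(w_1, w_2)}
\end{align}
```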

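A major drawback of traditional language models is that better accuracy requires a larger n, and the memory needed to store the n-gram counts grows exponentially with n.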
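To make the counting concrete, here is a minimal bigram (n = 2) sketch in plain Python. The toy corpus, the <s> and </s> boundary markers, and the function names are assumptions chosen for illustration.

```python
from collections import Counter


def train_bigram(sentences):
    """Count unigrams and bigrams over whitespace-tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def sentence_prob(sentence, unigrams, bigrams):
    """P(sentence) as a product of maximum-likelihood bigram probabilities."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    prob = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        if unigrams[prev] == 0:  # unseen context: no estimate available
            return 0.0
        prob *= bigrams[(prev, word)] / unigrams[prev]
    return prob


corpus = ["the cat is small", "the dog is small", "the cat is big"]
unigrams, bigrams = train_bigram(corpus)
p_good = sentence_prob("the cat is small", unigrams, bigrams)
p_bad = sentence_prob("small the is cat", unigrams, bigrams)
```

On this corpus p_good is 4/9 and p_bad is 0, in line with the word-ordering example earlier. With a vocabulary of size |V| there are up to |V|^n distinct n-grams to store, which is where the exponential memory cost comes from.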

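RNN-Based Language Models

An RNN-based language model exploits the fact that an RNN consumes its input as a sequence: a fully connected layer and a softmax layer are placed on top of the hidden units to produce a probability distribution over the output word.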
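The following NumPy sketch shows one such time step for a plain (non-gated) RNN. The layer sizes, the random weights, and the function names are assumptions made for illustration; a trained model would learn the weights and feed in real word embeddings.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(z - np.max(z))
    return e / e.sum()


def rnn_lm_step(x_t, h_prev, params):
    """One time step: update the hidden state, then score every vocabulary word."""
    W_xh, W_hh, W_hy, b_h, b_y = params
    h_t = np.tanh(W_xh.dot(x_t) + W_hh.dot(h_prev) + b_h)  # recurrent hidden layer
    logits = W_hy.dot(h_t) + b_y                           # fully connected layer
    return h_t, softmax(logits)                            # softmax layer


# Hypothetical sizes and random weights, for illustration only.
vocab_size, embed_size, hidden_size = 5, 3, 4
rng = np.random.RandomState(0)
params = (rng.randn(hidden_size, embed_size) * 0.1,
          rng.randn(hidden_size, hidden_size) * 0.1,
          rng.randn(vocab_size, hidden_size) * 0.1,
          np.zeros(hidden_size),
          np.zeros(vocab_size))

h = np.zeros(hidden_size)
for x in rng.randn(3, embed_size):  # three embedded input words
    h, next_word_probs = rnn_lm_step(x, h, params)
```

After the loop, next_word_probs is a distribution over the vocabulary for the word that follows the three inputs.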

RNNs are hard to train, however. Commonly used tricks include:

  • Gradient clipping
  • Identity-matrix initialization combined with ReLUs
  • Class-based word prediction, p(wt|h) = p(ct|h) p(wt|ct)
  • Optimizers such as RMSProp and Adam
  • Gated units such as GRU and LSTM
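Of these, gradient clipping is the easiest to show in isolation. The sketch below rescales gradients by their global L2 norm, which is one common variant; the TensorFlow code later in this article clips each gradient element to [-1, 1] instead. The function name and the toy gradients are assumptions for illustration.

```python
import math


def clip_by_global_norm(grads, max_norm):
    """Rescale a list of gradient vectors so their joint L2 norm is at most max_norm.

    Returns the (possibly rescaled) gradients and the norm before clipping.
    """
    total = math.sqrt(sum(g * g for vec in grads for g in vec))
    if total <= max_norm:
        return [list(vec) for vec in grads], total
    scale = max_norm / total
    return [[g * scale for g in vec] for vec in grads], total


grads = [[3.0, 4.0], [12.0]]  # toy gradients for two parameters
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = math.sqrt(sum(g * g for vec in clipped for g in vec))
```

Rescaling keeps the direction of the update and only limits its size, which is what protects training from exploding gradients.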

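Machine Translation

Statistical Machine Translation Architecture

Put simply, a statistical machine translation system involves two steps:

  1. Build an alignment from source to target.
  2. Use the alignment to generate candidate combinations, then use a language model to pick the most probable one.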
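One common way to formalize these two steps is the noisy-channel decomposition. The notation is an assumption introduced here: f is the source sentence, e is a candidate target sentence, p(f | e) is the translation model estimated from the alignments, and p(e) is the language model.

```latex
\hat{e} = \arg\max_{e} \, p(e \mid f) = \arg\max_{e} \, p(f \mid e)\, p(e)
```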

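RNN-Based seq2seq Architecture

seq2seq Structure

The RNN-based seq2seq architecture consists of an encoder and a decoder. The decoder runs in two modes, training and inference: during training it is fed the ground-truth target sequence, and during inference it is fed its own previous prediction.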

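Optimizing seq2seq

  • In the basic seq2seq decoder, the only inputs at step t are h_{t-1} and x_t; additional information can be fed in as well, such as the previous output y_{t-1} and the encoder state s_enc.
  • Deepen the network.
  • Use a bidirectional RNN, which combines preceding and following context.
  • Train with the input sequence in reverse order, which makes the optimization problem simpler.
  • Use GRU or LSTM units.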

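Attention

The core idea of the attention mechanism is to store the encoder states and let the decoder selectively draw on them at every step. It involves the following steps:

  1. Compute a score between the decoder's state at the current step and the encoder state at each step.
  2. Normalize the scores.
  3. Form a linear combination of the encoder states, weighted by the normalized scores.
  4. Feed the combined state to the decoder as input for the current step, which gives the decoder extra information.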
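The four steps above can be sketched in plain Python as follows. Dot-product scoring and softmax normalization are assumptions made for the sketch (other scoring functions are also used in practice), and the vectors are toy values.

```python
import math


def attention(dec_state, enc_states):
    """Return (weights, context) for one decoder step using dot-product scores."""
    # Step 1: score the decoder state against every encoder state.
    scores = [sum(d * e for d, e in zip(dec_state, enc)) for enc in enc_states]
    # Step 2: normalize the scores with a softmax.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [x / total for x in exps]
    # Step 3: weighted sum of the encoder states.
    dim = len(enc_states[0])
    context = [sum(w * enc[i] for w, enc in zip(weights, enc_states))
               for i in range(dim)]
    # Step 4: the caller feeds `context` to the decoder as extra input.
    return weights, context


enc_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # one vector per source step
dec_state = [2.0, 0.0]                             # current decoder state
weights, context = attention(dec_state, enc_states)
```

Here the first and third encoder states receive equal and larger weights than the second, because they align better with the decoder state.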

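Search in the Decoder

At inference time, the decoder can choose its output sequence in several ways:

  • Exhaustive Search: enumerate every possible sequence; not practical.
  • Ancestral Sampling: sample each word from the predicted distribution. It is efficient and unbiased, but has high variance.
  • Greedy Search: pick the most probable word at each step. It is efficient, but often misses the most probable overall sequence.
  • Beam Search: the commonly used method. With beam size K, the first step keeps K candidates; the second step expands each of them (K^2 candidates if each hypothesis contributes its K best continuations) and keeps the best K of the results, and so on. Quality is good, but the computational cost is higher than for greedy search.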
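A minimal beam search sketch follows. The step_fn callback, the toy probability table, and the <s> and </s> markers are hypothetical; a real decoder would obtain the next-word distribution from the RNN. Setting beam_width to 1 reduces the procedure to greedy search.

```python
import math


def beam_search(step_fn, start, eos, beam_width, max_len):
    """Return (sequence, log-probability) of the best hypothesis found.

    step_fn(prefix) must return a dict {token: probability} for the next token.
    """
    beams = [([start], 0.0)]  # live hypotheses as (sequence, log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            for token, prob in step_fn(seq).items():
                if prob > 0.0:
                    candidates.append((seq + [token], score + math.log(prob)))
        # Keep only the beam_width best expansions.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates[:beam_width]:
            (finished if seq[-1] == eos else beams).append((seq, score))
        if not beams:
            break
    return max(finished + beams, key=lambda c: c[1])


# Toy next-token table, keyed by the last token of the prefix.
TABLE = {
    "<s>": {"a": 0.6, "b": 0.4},
    "a": {"x": 0.5, "y": 0.5},
    "b": {"z": 0.9, "w": 0.1},
    "x": {"</s>": 1.0}, "y": {"</s>": 1.0},
    "z": {"</s>": 1.0}, "w": {"</s>": 1.0},
}


def step_fn(prefix):
    return TABLE[prefix[-1]]


greedy_seq, greedy_score = beam_search(step_fn, "<s>", "</s>", beam_width=1, max_len=5)
best_seq, best_score = beam_search(step_fn, "<s>", "</s>", beam_width=2, max_len=5)
```

On this toy table greedy search commits to "a" (probability 0.6) and ends with a sequence of probability 0.3, while a beam of width 2 finds the sequence through "b" with probability 0.36.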

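TensorFlow Example

The following example code is based on TensorFlow 1.1 and draws on sample code from the Udacity Deep Learning Nanodegree.

Inputs:

  • input: a rank-2 Tensor holding the input data (the translation source); shorter sequences are padded with PAD. It is the encoder's input.
  • targets: a rank-2 Tensor holding the output data (the translation target); EOS is appended to every sequence and shorter sequences are padded with PAD. It is the decoder's expected output.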
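As a small illustration of how such batches might be prepared, the plain-Python sketch below pads a batch and appends EOS to the targets. The token ids and the pad_batch helper are hypothetical and not part of the original code.

```python
def pad_batch(sequences, pad_id):
    """Right-pad every sequence in the batch to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]


PAD, EOS = 0, 1  # hypothetical vocabulary ids
source_ids = [[5, 6, 7], [8, 9]]
target_ids = [[10, 11], [12, 13, 14]]

inputs = pad_batch(source_ids, PAD)                             # encoder input
targets = pad_batch([seq + [EOS] for seq in target_ids], PAD)   # decoder output
```

Both results are rectangular lists of ids, which is the rank-2 shape the placeholders expect.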

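The network is built from the following functions:

  • model_inputs: creates the network's input placeholders.
  • process_decoder_input: removes the last column of targets and prepends GO to every sequence, producing the decoder input dec_input.
  • encoding_layer: applies an embedding layer and an RNN to input, returning enc_output and enc_state.
  • decoding_layer_train: produces the decoder's training output dec_outputs_train.
  • decoding_layer_infer: produces the decoder's inference output dec_outputs_infer.
  • decoding_layer: applies an embedding layer and an RNN to dec_input, returning both dec_outputs_train and dec_outputs_infer.
  • seq2seq_model: wraps the encoder and decoder together.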
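The shift performed by process_decoder_input is easy to see on plain lists. The sketch below mirrors the idea without TensorFlow; the ids and the make_decoder_input name are hypothetical.

```python
def make_decoder_input(targets, go_id):
    """Drop the last column of the padded targets and prepend GO to each row."""
    return [[go_id] + row[:-1] for row in targets]


PAD, EOS, GO = 0, 1, 2  # hypothetical vocabulary ids
targets = [[10, 11, EOS, PAD],
           [12, 13, 14, EOS]]
dec_input = make_decoder_input(targets, GO)
```

Each row of dec_input is the corresponding target row shifted right by one position, so at step t the decoder sees the word it should have produced at step t - 1.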
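
import numpy as np
import tensorflow as tf  # written against the TensorFlow 1.1 API (tf.contrib)

import helper  # assumed: the data-loading module from the Udacity project (not shown here)

# The hyperparameters (epochs, batch_size, rnn_size, num_layers, encoding_embedding_size,
# decoding_embedding_size, learning_rate, keep_probability, display_step) and the
# get_batches generator are assumed to be defined elsewhere; they are not shown here.

# prepare input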
def model_inputs():
    """
    Create TF Placeholders for input, targets, learning rate, and lengths of source and target sequences.
    :return: Tuple (input, targets, learning rate, keep probability, target sequence length,
    max target sequence length, source sequence length)
    """
    # input parameters
    input = tf.placeholder(tf.int32, [None, None], name="input")
    targets = tf.placeholder(tf.int32, [None, None], name="targets")
    # training parameters
    learning_rate = tf.placeholder(tf.float32, name="learning_rate")
    keep_prob = tf.placeholder(tf.float32, name="keep_prob")
    # sequence length parameters
    target_sequence_length = tf.placeholder(tf.int32, [None], name="target_sequence_length")
    max_target_sequence_length = tf.reduce_max(target_sequence_length)
    source_sequence_length = tf.placeholder(tf.int32, [None], name="source_sequence_length")

    return (input, targets, learning_rate, keep_prob, target_sequence_length,
            max_target_sequence_length, source_sequence_length)


def process_decoder_input(target_data, target_vocab_to_int, batch_size):
    """
    Preprocess target data for decoding: drop the last token of each sequence and prepend <GO>
    :param target_data: Target Placeholder
    :param target_vocab_to_int: Dictionary to go from the target words to an id
    :param batch_size: Batch Size
    :return: Preprocessed target data
    """
    x = tf.strided_slice(target_data, [0,0], [batch_size, -1], [1,1])
    y = tf.concat([tf.fill([batch_size, 1], target_vocab_to_int['<GO>']), x], 1)
    return y


# Encoding
def encoding_layer(rnn_inputs, rnn_size, num_layers, keep_prob, 
                   source_sequence_length, source_vocab_size, 
                   encoding_embedding_size):
    """
    Create encoding layer
    :param rnn_inputs: Inputs for the RNN
    :param rnn_size: RNN Size
    :param num_layers: Number of layers
    :param keep_prob: Dropout keep probability
    :param source_sequence_length: a list of the lengths of each sequence in the batch
    :param source_vocab_size: vocabulary size of source data
    :param encoding_embedding_size: embedding size of source data
    :return: tuple (RNN output, RNN state)
    """
    # embedding input
    enc_inputs = tf.contrib.layers.embed_sequence(rnn_inputs, source_vocab_size, encoding_embedding_size)
    # construct rnn cell
    # (keep_prob is accepted but not applied here; no dropout wrapper is used)
    cell = tf.contrib.rnn.MultiRNNCell([
        tf.contrib.rnn.LSTMCell(rnn_size)
        for _ in range(num_layers)])
    # rnn forward
    enc_output, enc_state = tf.nn.dynamic_rnn(cell, enc_inputs, sequence_length=source_sequence_length, dtype=tf.float32)
    return enc_output, enc_state

# Decoding
## Decoding Training
def decoding_layer_train(encoder_state, dec_cell, dec_embed_input, 
                         target_sequence_length, max_summary_length, 
                         output_layer, keep_prob):
    """
    Create a decoding layer for training
    :param encoder_state: Encoder State
    :param dec_cell: Decoder RNN Cell
    :param dec_embed_input: Decoder embedded input
    :param target_sequence_length: The lengths of each sequence in the target batch
    :param max_summary_length: The length of the longest sequence in the batch
    :param output_layer: Function to apply the output layer
    :param keep_prob: Dropout keep probability
    :return: BasicDecoderOutput containing training logits and sample_id
    """
    helper = tf.contrib.seq2seq.TrainingHelper(dec_embed_input, target_sequence_length)
    decoder = tf.contrib.seq2seq.BasicDecoder(dec_cell, helper, encoder_state, output_layer=output_layer)
    dec_outputs, dec_state = tf.contrib.seq2seq.dynamic_decode(decoder, impute_finished=True, maximum_iterations=max_summary_length)
    return dec_outputs

## Decoding Inference
def decoding_layer_infer(encoder_state, dec_cell, dec_embeddings, start_of_sequence_id,
                         end_of_sequence_id, max_target_sequence_length,
                         vocab_size, output_layer, batch_size, keep_prob):
    """
    Create a decoding layer for inference
    :param encoder_state: Encoder state
    :param dec_cell: Decoder RNN Cell
    :param dec_embeddings: Decoder embeddings
    :param start_of_sequence_id: GO ID
    :param end_of_sequence_id: EOS Id
    :param max_target_sequence_length: Maximum length of target sequences
    :param vocab_size: Size of decoder/target vocabulary
    :param output_layer: Function to apply the output layer
    :param batch_size: Batch size
    :param keep_prob: Dropout keep probability
    :return: BasicDecoderOutput containing inference logits and sample_id
    """
    start_tokens = tf.tile(tf.constant([start_of_sequence_id], dtype=tf.int32), [batch_size], name='start_tokens')
    helper = tf.contrib.seq2seq.GreedyEmbeddingHelper(dec_embeddings, 
        start_tokens, end_of_sequence_id)
    decoder = tf.contrib.seq2seq.BasicDecoder(dec_cell, helper, encoder_state, output_layer=output_layer)
    dec_outputs, dec_state = tf.contrib.seq2seq.dynamic_decode(decoder,impute_finished=True,
                                        maximum_iterations=max_target_sequence_length)
    return dec_outputs

## Decoding Layer
from tensorflow.python.layers import core as layers_core
def decoding_layer(dec_input, encoder_state,
                   target_sequence_length, max_target_sequence_length,
                   rnn_size,
                   num_layers, target_vocab_to_int, target_vocab_size,
                   batch_size, keep_prob, decoding_embedding_size):
    """
    Create decoding layer
    :param dec_input: Decoder input
    :param encoder_state: Encoder state
    :param target_sequence_length: The lengths of each sequence in the target batch
    :param max_target_sequence_length: Maximum length of target sequences
    :param rnn_size: RNN Size
    :param num_layers: Number of layers
    :param target_vocab_to_int: Dictionary to go from the target words to an id
    :param target_vocab_size: Size of target vocabulary
    :param batch_size: The size of the batch
    :param keep_prob: Dropout keep probability
    :param decoding_embedding_size: Decoder embedding size
    :return: Tuple of (Training BasicDecoderOutput, Inference BasicDecoderOutput)
    """
    # embedding target sequence
    dec_embeddings = tf.Variable(tf.random_uniform([target_vocab_size, decoding_embedding_size]))
    dec_embed_input = tf.nn.embedding_lookup(dec_embeddings, dec_input)
    # construct decoder lstm cell
    dec_cell = tf.contrib.rnn.MultiRNNCell([
        tf.contrib.rnn.LSTMCell(rnn_size)
        for _ in range(num_layers)])
    # create output layer to map the outputs of the decoder to the elements of our vocabulary
    output_layer = layers_core.Dense(target_vocab_size,
                                    kernel_initializer = tf.truncated_normal_initializer(mean = 0.0, stddev=0.1))
    # decoder train
    with tf.variable_scope("decoding") as decoding_scope:
        dec_outputs_train = decoding_layer_train(encoder_state, dec_cell, dec_embed_input, 
                             target_sequence_length, max_target_sequence_length, 
                             output_layer, keep_prob)
    # decoder inference
    start_of_sequence_id = target_vocab_to_int["<GO>"]
    end_of_sequence_id = target_vocab_to_int["<EOS>"]
    with tf.variable_scope("decoding", reuse=True) as decoding_scope:
        dec_outputs_infer = decoding_layer_infer(encoder_state, dec_cell, dec_embeddings, start_of_sequence_id,
                             end_of_sequence_id, max_target_sequence_length,
                             target_vocab_size, output_layer, batch_size, keep_prob)
    # return
    return dec_outputs_train, dec_outputs_infer

# Seq2seq
def seq2seq_model(input_data, target_data, keep_prob, batch_size,
                  source_sequence_length, target_sequence_length,
                  max_target_sentence_length,
                  source_vocab_size, target_vocab_size,
                  enc_embedding_size, dec_embedding_size,
                  rnn_size, num_layers, target_vocab_to_int):
    """
    Build the Sequence-to-Sequence part of the neural network
    :param input_data: Input placeholder
    :param target_data: Target placeholder
    :param keep_prob: Dropout keep probability placeholder
    :param batch_size: Batch Size
    :param source_sequence_length: Sequence Lengths of source sequences in the batch
    :param target_sequence_length: Sequence Lengths of target sequences in the batch
    :param max_target_sentence_length: Length of the longest target sequence in the batch
    :param source_vocab_size: Source vocabulary size
    :param target_vocab_size: Target vocabulary size
    :param enc_embedding_size: Encoder embedding size
    :param dec_embedding_size: Decoder embedding size
    :param rnn_size: RNN Size
    :param num_layers: Number of layers
    :param target_vocab_to_int: Dictionary to go from the target words to an id
    :return: Tuple of (Training BasicDecoderOutput, Inference BasicDecoderOutput)
    """
    # embedding and encoding
    enc_output, enc_state = encoding_layer(input_data, rnn_size, num_layers, keep_prob, 
                   source_sequence_length, source_vocab_size, 
                   enc_embedding_size)
    # process target data
    dec_input = process_decoder_input(target_data, target_vocab_to_int, batch_size) 
    # embedding and decoding
    dec_outputs_train, dec_outputs_infer = decoding_layer(dec_input, enc_state,
                   target_sequence_length, max_target_sentence_length,
                   rnn_size,
                   num_layers, target_vocab_to_int, target_vocab_size,
                   batch_size, keep_prob, dec_embedding_size)
    return dec_outputs_train, dec_outputs_infer

# Build Graph
save_path = 'checkpoints/dev'
(source_int_text, target_int_text), (source_vocab_to_int, target_vocab_to_int), _ = helper.load_preprocess()
max_target_sentence_length = max([len(sentence) for sentence in source_int_text])

train_graph = tf.Graph()
with train_graph.as_default():
    input_data, targets, lr, keep_prob, target_sequence_length, max_target_sequence_length, source_sequence_length = model_inputs()

    #sequence_length = tf.placeholder_with_default(max_target_sentence_length, None, name='sequence_length')
    input_shape = tf.shape(input_data)

    train_logits, inference_logits = seq2seq_model(tf.reverse(input_data, [-1]),
                                                   targets,
                                                   keep_prob,
                                                   batch_size,
                                                   source_sequence_length,
                                                   target_sequence_length,
                                                   max_target_sequence_length,
                                                   len(source_vocab_to_int),
                                                   len(target_vocab_to_int),
                                                   encoding_embedding_size,
                                                   decoding_embedding_size,
                                                   rnn_size,
                                                   num_layers,
                                                   target_vocab_to_int)


    training_logits = tf.identity(train_logits.rnn_output, name='logits')
    inference_logits = tf.identity(inference_logits.sample_id, name='predictions')

    masks = tf.sequence_mask(target_sequence_length, max_target_sequence_length, dtype=tf.float32, name='masks')

    with tf.name_scope("optimization"):
        # Loss function
        cost = tf.contrib.seq2seq.sequence_loss(
            training_logits,
            targets,
            masks)

        # Optimizer
        optimizer = tf.train.AdamOptimizer(lr)

        # Gradient Clipping
        gradients = optimizer.compute_gradients(cost)
        capped_gradients = [(tf.clip_by_value(grad, -1., 1.), var) for grad, var in gradients if grad is not None]
        train_op = optimizer.apply_gradients(capped_gradients)


# Training
def get_accuracy(target, logits):
    """
    Calculate accuracy
    """
    max_seq = max(target.shape[1], logits.shape[1])
    if max_seq - target.shape[1]:
        target = np.pad(
            target,
            [(0,0),(0,max_seq - target.shape[1])],
            'constant')
    if max_seq - logits.shape[1]:
        logits = np.pad(
            logits,
            [(0,0),(0,max_seq - logits.shape[1])],
            'constant')

    return np.mean(np.equal(target, logits))

# Split data to training and validation sets
train_source = source_int_text[batch_size:]
train_target = target_int_text[batch_size:]
valid_source = source_int_text[:batch_size]
valid_target = target_int_text[:batch_size]
(valid_sources_batch, valid_targets_batch, valid_sources_lengths, valid_targets_lengths ) = next(get_batches(valid_source,
                                                                                                             valid_target,
                                                                                                             batch_size,
                                                                                                             source_vocab_to_int['<PAD>'],
                                                                                                             target_vocab_to_int['<PAD>']))                                                                                                 
with tf.Session(graph=train_graph) as sess:
    sess.run(tf.global_variables_initializer())

    for epoch_i in range(epochs):

        for batch_i, (source_batch, target_batch, sources_lengths, targets_lengths) in enumerate(
                get_batches(train_source, train_target, batch_size,
                            source_vocab_to_int['<PAD>'],
                            target_vocab_to_int['<PAD>'])):

            _, loss = sess.run(
                [train_op, cost],
                {input_data: source_batch,
                 targets: target_batch,
                 lr: learning_rate,
                 target_sequence_length: targets_lengths,
                 source_sequence_length: sources_lengths,
                 keep_prob: keep_probability})


            if batch_i % display_step == 0 and batch_i > 0:


                batch_train_logits = sess.run(
                    inference_logits,
                    {input_data: source_batch,
                     source_sequence_length: sources_lengths,
                     target_sequence_length: targets_lengths,
                     keep_prob: 1.0})


                batch_valid_logits = sess.run(
                    inference_logits,
                    {input_data: valid_sources_batch,
                     source_sequence_length: valid_sources_lengths,
                     target_sequence_length: valid_targets_lengths,
                     keep_prob: 1.0})


                train_acc = get_accuracy(target_batch, batch_train_logits)


                valid_acc = get_accuracy(valid_targets_batch, batch_valid_logits)


                print('Epoch {:>3} Batch {:>4}/{} - Train Accuracy: {:>6.4f}, Validation Accuracy: {:>6.4f}, Loss: {:>6.4f}'
                      .format(epoch_i, batch_i, len(source_int_text) // batch_size, train_acc, valid_acc, loss))

    # Save Model
    saver = tf.train.Saver()
    saver.save(sess, save_path)
    print('Model Trained and Saved')
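
The masks tensor passed to sequence_loss in the graph-building code keeps PAD positions from contributing to the loss. The plain-Python sketch below illustrates that idea only; it is not TensorFlow's implementation, and the function names and per-token loss values are made up for the example.

```python
def sequence_mask(lengths, max_len):
    """1.0 for real positions and 0.0 for padding, one row per sequence."""
    return [[1.0 if t < n else 0.0 for t in range(max_len)] for n in lengths]


def masked_mean_loss(token_losses, mask):
    """Average the per-token losses over the unmasked positions only."""
    total = sum(loss * m
                for loss_row, mask_row in zip(token_losses, mask)
                for loss, m in zip(loss_row, mask_row))
    count = sum(m for row in mask for m in row)
    return total / count


# Hypothetical per-token cross-entropy values for a batch of two sequences.
token_losses = [[0.5, 1.5, 9.0],   # third position is padding
                [1.0, 2.0, 3.0]]
mask = sequence_mask([2, 3], max_len=3)
loss = masked_mean_loss(token_losses, mask)
```

The large value at the padded position is ignored, so the result is the mean over the five real tokens.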