Chinese Speech Recognition Based on seq2seq + Attention

阿新 • Published: 2019-01-03

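It has been a long time since I last wrote a blog post.

While working on my Chinese speech recognition model I was in touch with the author of Xiaomi's speech recognition work (the author of reference paper 1). The goal is to implement a complete Chinese speech recognition model, so let's get started.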

Outline:

1. Data preparation

2. Introduction to seq2seq

3. Introduction to Attention

4. Introduction to BiLSTM

5. BiLSTM + seq2seq + Attention

1. Data preparation

2. Introduction to seq2seq

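Introductions to seq2seq are everywhere (the official TensorFlow tutorial is a good one: https://www.tensorflow.org/tutorials/seq2seq), so this is only a brief overview. Readers already familiar with seq2seq models can skip this section.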

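A seq2seq model translates one sequence into another. In essence it is two RNNs, an encoder and a decoder: the encoder encodes the source sequence into a fixed-length representation, and the decoder decodes that representation into the target sequence. It was first used for machine translation, but it applies to any mapping between variable-length sequences; common uses include machine translation, speech recognition and summarization. The encoder and decoder can also be used separately.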

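The encoder of a seq2seq model is very simple (the part corresponding to A, B, C in the figure above): it is just an RNN (simple RNN, GRU or LSTM), possibly with multiple layers. The decoder behaves slightly differently during training and testing.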

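During training, the decoder's input has two parts: the encoder's last state, and the target sequence, such as <GO> W X Y Z in the figure above, where <GO> and <EOS> are the start-of-sequence and end-of-sequence tokens.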
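As a small illustration of the training input described above, the sketch below builds the decoder input and the matching training target for one sequence. It assumes the usual convention that the target is the input shifted by one step and terminated with <EOS>; the helper names (build_vocab, make_decoder_pair) and the <PAD> token are hypothetical and not taken from the original code.

```python
GO, EOS, PAD = "<GO>", "<EOS>", "<PAD>"

def build_vocab(sequences):
    """Map every token to an integer id, reserving ids for the special tokens."""
    vocab = {PAD: 0, GO: 1, EOS: 2}
    for seq in sequences:
        for tok in seq:
            vocab.setdefault(tok, len(vocab))
    return vocab

def make_decoder_pair(target_tokens, vocab):
    """Return (decoder_input_ids, decoder_target_ids) for one target sequence."""
    ids = [vocab[tok] for tok in target_tokens]
    decoder_input = [vocab[GO]] + ids      # <GO> W X Y Z
    decoder_target = ids + [vocab[EOS]]    # W X Y Z <EOS>
    return decoder_input, decoder_target

vocab = build_vocab([["W", "X", "Y", "Z"]])
dec_in, dec_out = make_decoder_pair(["W", "X", "Y", "Z"], vocab)
print(dec_in)   # [1, 3, 4, 5, 6]
print(dec_out)  # [3, 4, 5, 6, 2]
```

At every step the decoder reads one element of dec_in and is trained to predict the element of dec_out at the same position.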

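During testing, the decoder's input also has two parts: the encoder's last state, and the output of the previous time step (each step's output becomes the next step's input). Decoding continues until some step outputs the end token <EOS>. The network structure is as follows: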

Note ⚠️: at test time the decoder must reuse the cell parameters learned during training!

That is all the theory needed for the encoder and decoder. Next, let's look at how TensorFlow implements the different training and testing behavior of the decoder.

When running the decoder for training and testing, TensorFlow uses the helper classes in tf.contrib.seq2seq to distinguish between the different inputs.

Helper

Commonly used helpers:

TrainingHelper: a helper for training.
InferenceHelper: a helper for testing.
GreedyEmbeddingHelper: a helper for testing that samples with a greedy strategy.
CustomHelper: a user-defined helper.
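To make the role of a helper concrete, here is a minimal pure-Python sketch of the idea: the decoding loop is shared, and the helper only decides what the next input is. The classes TeacherForcingHelper and GreedyHelper, the decode loop and toy_step are hypothetical illustrations, not the tf.contrib.seq2seq API, and the recurrent state is left out for brevity.

```python
GO_ID, EOS_ID = 0, 9

class TeacherForcingHelper:
    """Training: the next input is always the ground-truth token."""
    def __init__(self, target_ids):
        self.target_ids = target_ids

    def next_input(self, step, predicted_id):
        # Ignore the prediction and feed the true token for this step, if any remain.
        if step < len(self.target_ids):
            return self.target_ids[step], False
        return None, True

class GreedyHelper:
    """Testing: the next input is the token the model just predicted."""
    def __init__(self, eos_id):
        self.eos_id = eos_id

    def next_input(self, step, predicted_id):
        # Stop as soon as the model emits the end-of-sequence token.
        return predicted_id, predicted_id == self.eos_id

def decode(step_fn, helper, go_id, max_steps):
    """Shared loop: only the helper differs between training and testing."""
    outputs, token = [], go_id
    for step in range(max_steps):
        predicted = step_fn(token)
        outputs.append(predicted)
        token, finished = helper.next_input(step, predicted)
        if finished:
            break
    return outputs

def toy_step(token):
    """Stand-in for one decoder step: a fixed, made-up token-to-token rule."""
    return token + 1 if token < 3 else EOS_ID

print(decode(toy_step, GreedyHelper(EOS_ID), GO_ID, 10))             # [1, 2, 3, 9]
print(decode(toy_step, TeacherForcingHelper([5, 1, 7]), GO_ID, 10))  # [1, 9, 2, 9]
```

In the greedy run each prediction is fed back in, while in the teacher-forcing run the predictions have no influence on the following inputs.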

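Below are the decoder training and testing functions I usually use (TensorFlow 1.x):

import tensorflow as tf
from tensorflow.python.layers import core as layers_core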

def decoding_layer_train(encoder_state, dec_cell, dec_embed_input,
                         target_sequence_length, max_summary_length,
                         output_layer, keep_prob):
    """
    Create a decoding layer for training
    :param encoder_state: Encoder State
    :param dec_cell: Decoder RNN Cell
    :param dec_embed_input: Decoder embedded input
    :param target_sequence_length: The lengths of each sequence in the target batch
    :param max_summary_length: The length of the longest sequence in the batch
    :param output_layer: Function to apply the output layer
    :param keep_prob: Dropout keep probability
    :return: BasicDecoderOutput containing training logits and sample_id
    """
    # Teacher forcing: feed the embedded ground-truth target sequence as decoder input
    helper = tf.contrib.seq2seq.TrainingHelper(inputs=dec_embed_input,
                                               sequence_length=target_sequence_length,
                                               time_major=False)
    decoder = tf.contrib.seq2seq.BasicDecoder(dec_cell, helper, encoder_state,
                                              output_layer=output_layer)
    # dynamic_decode returns (final_outputs, final_state, final_sequence_lengths)
    dec_outputs, dec_state, dec_sequence_length = tf.contrib.seq2seq.dynamic_decode(
        decoder, impute_finished=True, maximum_iterations=max_summary_length)
    return dec_outputs

def decoding_layer_infer(encoder_state, dec_cell, dec_embeddings, start_of_sequence_id,
                         end_of_sequence_id, max_target_sequence_length,
                         vocab_size, output_layer, batch_size, keep_prob):
    """
    Create a decoding layer for inference
    :param encoder_state: Encoder state
    :param dec_cell: Decoder RNN Cell
    :param dec_embeddings: Decoder embeddings
    :param start_of_sequence_id: GO ID
    :param end_of_sequence_id: EOS Id
    :param max_target_sequence_length: Maximum length of target sequences
    :param vocab_size: Size of decoder/target vocabulary (not used in this function)
    :param output_layer: Function to apply the output layer
    :param batch_size: Batch size
    :param keep_prob: Dropout keep probability
    :return: BasicDecoderOutput containing inference logits and sample_id
    """
    # Start every sequence in the batch with the <GO> token
    start_tokens = tf.tile(tf.constant([start_of_sequence_id], dtype=tf.int32),
                           [batch_size], name='start_tokens')
    # Greedy decoding: embed the argmax of each step's output and feed it back as the next input
    helper = tf.contrib.seq2seq.GreedyEmbeddingHelper(dec_embeddings,
                                                      start_tokens, end_of_sequence_id)
    decoder = tf.contrib.seq2seq.BasicDecoder(dec_cell, helper, encoder_state,
                                              output_layer=output_layer)
    dec_outputs, dec_state, dec_sequence_length = tf.contrib.seq2seq.dynamic_decode(
        decoder, impute_finished=True, maximum_iterations=max_target_sequence_length)
    return dec_outputs

def decoding_layer(dec_input, encoder_state,
                   target_sequence_length, max_target_sequence_length,
                   rnn_size,
                   num_layers, target_vocab_to_int, target_vocab_size,
                   batch_size, keep_prob, decoding_embedding_size):
    """
    Create decoding layer
    :param dec_input: Decoder input
    :param encoder_state: Encoder state
    :param target_sequence_length: The lengths of each sequence in the target batch
    :param max_target_sequence_length: Maximum length of target sequences
    :param rnn_size: RNN Size
    :param num_layers: Number of layers
    :param target_vocab_to_int: Dictionary to go from the target words to an id
    :param target_vocab_size: Size of target vocabulary
    :param batch_size: The size of the batch
    :param keep_prob: Dropout keep probability
    :param decoding_embedding_size: Size of the decoder embeddings
    :return: Tuple of (Training BasicDecoderOutput, Inference BasicDecoderOutput)
    """
    # embedding target sequence
    dec_embeddings = tf.Variable(tf.random_uniform([target_vocab_size, decoding_embedding_size]))
    dec_embed_input = tf.nn.embedding_lookup(dec_embeddings, dec_input)
    # construct decoder lstm cell
    dec_cell = tf.contrib.rnn.MultiRNNCell(
        [tf.contrib.rnn.LSTMCell(rnn_size) for _ in range(num_layers)])
    # create output layer to map the outputs of the decoder to the elements of our vocabulary
    output_layer = layers_core.Dense(target_vocab_size,
                                    kernel_initializer = tf.truncated_normal_initializer(mean = 0.0, stddev=0.1))
    # decoder train
    with tf.variable_scope("decoding") as decoding_scope:
        dec_outputs_train = decoding_layer_train(encoder_state, dec_cell, dec_embed_input,
                             target_sequence_length, max_target_sequence_length,
                             output_layer, keep_prob)
    # decoder inference
    start_of_sequence_id = target_vocab_to_int["<GO>"]
    end_of_sequence_id = target_vocab_to_int["<EOS>"]
    with tf.variable_scope("decoding", reuse=True) as decoding_scope:
        dec_outputs_infer = decoding_layer_infer(encoder_state, dec_cell, dec_embeddings, start_of_sequence_id,
                             end_of_sequence_id, max_target_sequence_length,
                             target_vocab_size, output_layer, batch_size, keep_prob)
    # return both the training and inference outputs
    return dec_outputs_train, dec_outputs_infer

3. Introduction to Attention

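Explanations of how Attention works are also everywhere, so this is only a brief overview; readers already familiar with Attention models can skip this section (the official TensorFlow tutorial covers it as well: https://www.tensorflow.org/tutorials/seq2seq).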

First, why do we need an Attention model?

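Attention was introduced because the seq2seq model described above has a flaw: no matter how long the encoder's context is or how much information it contains, it must ultimately be compressed into a single vector of a few hundred dimensions. The larger the context, the more information is lost from the last state, which is one of the decoder's inputs. For machine translation, this means that as the input sentence gets longer, the quality of the decoder's translation degrades noticeably.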

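In essence, Attention is a content-based addressing mechanism: from a set of states in the network, it selects those most similar to a given state and then extracts information from them. In plain terms: first compute weights from the encoder and decoder features, then take a weighted sum of the encoder features and feed it to the decoder as input. The effect is to present the encoder's features to the decoder in a better form. Not all of the context influences the generation of the next state; Attention selects the appropriate context and uses it to generate the next state.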

The core of the Attention algorithm is captured by a classic slide from Professor Hung-yi Lee (李宏毅):

The figure above already gives a vivid summary of how Attention works; expressed as formulas:
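In the standard notation (assumed here, using the symbols $h_j$, $s_i$ and Score that the following paragraph refers to), the computation at decoder step $i$ can be written as:

```latex
e_{ij} = \mathrm{Score}(s_{i-1}, h_j), \qquad
\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k} \exp(e_{ik})}, \qquad
c_i = \sum_{j} \alpha_{ij}\, h_j
```

Here $\alpha_{ij}$ is the weight placed on encoder output $h_j$ when producing decoder step $i$, and $c_i$ is the weighted sum passed to the decoder.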

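Here h_j is the corresponding encoder output, and the initial value s_0 in the formulas (z_0 in the figure) is the encoder's last state. Match in the figure corresponds to Score in the formulas. There are many ways to compute this Score; a common choice is a multilayer perceptron (MLP). The result is what is usually called the attention context, expressed as follows: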
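As a sketch of an MLP-style (additive) Score, the pure-Python code below computes v^T tanh(W_s s + W_h h_j) for each encoder output, normalizes the scores with a softmax and forms the weighted sum. The parameter names W_s, W_h and v and all numeric values are made up for illustration; in a real model they are learned.

```python
import math

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def additive_score(s_prev, h_j, W_s, W_h, v):
    """MLP score: v^T tanh(W_s s_prev + W_h h_j)."""
    hidden = [math.tanh(a + b)
              for a, b in zip(matvec(W_s, s_prev), matvec(W_h, h_j))]
    return sum(vi * hi for vi, hi in zip(v, hidden))

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_context(s_prev, encoder_outputs, W_s, W_h, v):
    """Return (attention weights, context vector) for one decoder step."""
    scores = [additive_score(s_prev, h, W_s, W_h, v) for h in encoder_outputs]
    weights = softmax(scores)
    dim = len(encoder_outputs[0])
    context = [sum(w * h[d] for w, h in zip(weights, encoder_outputs))
               for d in range(dim)]
    return weights, context

# Toy example: three encoder outputs of dimension 2, made-up parameters.
encoder_outputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
s0 = [0.5, -0.5]                      # plays the role of the encoder's last state
W_s = [[1.0, 0.0], [0.0, 1.0]]
W_h = [[1.0, 0.0], [0.0, 1.0]]
v = [1.0, 1.0]

weights, context = attention_context(s0, encoder_outputs, W_s, W_h, v)
print(weights)   # one weight per encoder output, summing to 1
print(context)   # weighted sum of the encoder outputs
```

The weights form a probability distribution over the encoder time steps, so the context is always a convex combination of the encoder outputs.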

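The attention computation is actually fairly simple, and that covers the principle. Now let's see how TensorFlow implements it.

TensorFlow's attention API has changed considerably between versions; here we only look at how attention is used in tf 1.4.1, the current version at the time of writing. TensorFlow separates attention into its own file, attention_wrapper, which defines several attention mechanisms (BahdanauAttention, LuongAttention, BahdanauMonotonicAttention, LuongMonotonicAttention) together with AttentionWrapper, which wraps an attention mechanism around an RNNCell. It is really simple: just as with DropoutWrapper and OutputProjectionWrapper, we only need to wrap one more layer, attention, around the original RNNCell. Below is the attention function I usually use: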

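def decoder_attn(batch_size, encoder_hidden_size, attention_size, enc_seq_len,
                 encoder_output, encoder_state, dec_cell):
    # Bahdanau (additive) attention over the encoder outputs;
    # memory_sequence_length masks out the padded time steps
    attention_mechanism = tf.contrib.seq2seq.BahdanauAttention(
        num_units=encoder_hidden_size,
        memory=encoder_output,
        memory_sequence_length=enc_seq_len,
        name="BahdanauAttention")
    # Wrap the decoder cell; attention_size is used as attention_layer_size
    attention_cell = tf.contrib.seq2seq.AttentionWrapper(
        dec_cell, attention_mechanism, attention_layer_size=attention_size,
        name="attention_wrapper")
    # Start from a zero attention state, but take the wrapped cell's state from the encoder
    init_state = attention_cell.zero_state(batch_size, tf.float32).clone(cell_state=encoder_state)
    return attention_cell, init_state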

def decoding_layer(dec_input, encoder_state,encoder_output,source_sequence_length,
                   target_sequence_length, max_target_sequence_length,
                   encoder_rnn_hidden_unit,
                   decode_rnn_hidden_unit,
                   attention_size,
                   target_vocab_to_int, target_vocab_size,
                   batch_size, keep_prob, decoding_embedding_size):
    # embedding target sequence
    dec_embeddings = tf.Variable(tf.random_uniform([target_vocab_size, decoding_embedding_size]))
    dec_embed_input = tf.nn.embedding_lookup(dec_embeddings, dec_input)
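    # Note: do not use MultiRNNCell here, otherwise it fails with: 'Tensor' object is not iterable
    # (likely because encoder_state is a single LSTMStateTuple rather than one state per layer).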
    dec_cell = tf.contrib.rnn.LSTMCell(decode_rnn_hidden_unit)
    # create output layer to map the outputs of the decoder to the elements of our vocabulary
    output_layer = layers_core.Dense(target_vocab_size,
                                    kernel_initializer = tf.truncated_normal_initializer(mean = 0.0, stddev=0.1))
    attention_cell,init_state = decoder_attn(batch_size,encoder_rnn_hidden_unit, attention_size,
                                             source_sequence_length, encoder_output, encoder_state, dec_cell)
    # decoder train
    with tf.variable_scope("decoding") as decoding_scope:
        dec_outputs_train = decoding_layer_train(init_state, attention_cell, dec_embed_input,
                             target_sequence_length, max_target_sequence_length,
                             output_layer, keep_prob)
    # decoder inference
    start_of_sequence_id = target_vocab_to_int["<GO>"]
    end_of_sequence_id = target_vocab_to_int["<EOS>"]
    with tf.variable_scope("decoding", reuse=True) as decoding_scope:
        # target_vocab_size was missing from the argument list, which shifted every later argument
        dec_outputs_infer = decoding_layer_infer(init_state, attention_cell, dec_embeddings,
                            start_of_sequence_id,
                            end_of_sequence_id, max_target_sequence_length,
                            target_vocab_size, output_layer, batch_size, keep_prob)
    # return both the training and inference outputs
    return dec_outputs_train, dec_outputs_infer

The decoding_layer function has to be redefined here, but the changes are very small; compare the two versions yourself.

4. Introduction to BiLSTM

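There is not much to say about BiLSTM: it simply turns a unidirectional LSTM into a bidirectional one, so that every time step has access to the complete sequence (both past and future). It suits text, or speech recognition that does not need to run in real time.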
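The idea can be sketched in a few lines of pure Python: run one recurrent pass left to right, another right to left, and concatenate the two states at each time step. The functions run_rnn, bidirectional and running_sum are hypothetical stand-ins; a running sum replaces the LSTM cell so the result is easy to follow.

```python
def run_rnn(inputs, step_fn, initial_state):
    """Run a recurrent step function over the inputs; return the state at every step."""
    states, state = [], initial_state
    for x in inputs:
        state = step_fn(state, x)
        states.append(state)
    return states

def bidirectional(inputs, fw_step, bw_step, initial_state):
    """Concatenate a forward pass and a time-aligned backward pass."""
    fw = run_rnn(inputs, fw_step, initial_state)
    # Process the reversed sequence, then flip back so index t matches input t.
    bw = run_rnn(inputs[::-1], bw_step, initial_state)[::-1]
    outputs = [f + b for f, b in zip(fw, bw)]   # list concatenation per time step
    final_state = fw[-1] + bw[0]                # last forward state + last backward state
    return outputs, final_state

def running_sum(state, x):
    """Toy cell: the state is a one-element list holding the sum seen so far."""
    return [state[0] + x]

outputs, final_state = bidirectional([1, 2, 3], running_sum, running_sum, [0])
print(outputs)      # [[1, 6], [3, 5], [6, 3]]
print(final_state)  # [6, 6]
```

At t = 0 the forward half has only seen the first input, while the backward half has already seen the whole sequence; the concatenated output is also twice as wide as a single direction.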

Here is the encoder code I usually use:

from tensorflow.contrib.rnn import DropoutWrapper, LSTMStateTuple

def encoding_layer(emb_encoder_inputs, rnn_size, encoder_num_layers, keep_prob,
                   source_sequence_length, encode_type="lstm"):
    # inputs are batch-major (time_major=False)
    if encode_type == "lstm":
        # construct the (possibly multi-layer) rnn cell
        cell = tf.contrib.rnn.MultiRNNCell(
            [tf.contrib.rnn.LSTMCell(rnn_size) for _ in range(encoder_num_layers)])
        cell = DropoutWrapper(cell, output_keep_prob=keep_prob)
        # rnn forward
        enc_output, enc_state = tf.nn.dynamic_rnn(cell, emb_encoder_inputs,
                                                  sequence_length=source_sequence_length,
                                                  dtype=tf.float32)
        # MultiRNNCell returns one state per layer (a tuple even for a single layer),
        # so always hand the top layer's state to the decoder
        return enc_output, enc_state[-1]
    elif encode_type == "bilstm":
        cell_fw = tf.nn.rnn_cell.LSTMCell(
            rnn_size,
            initializer=tf.random_uniform_initializer(-0.1, 0.1, seed=123),
            state_is_tuple=True)
        cell_bw = tf.nn.rnn_cell.LSTMCell(
            rnn_size,
            initializer=tf.random_uniform_initializer(-0.1, 0.1, seed=113),
            state_is_tuple=True)
        encoder_f_cell = DropoutWrapper(cell_fw, output_keep_prob=keep_prob)
        encoder_b_cell = DropoutWrapper(cell_bw, output_keep_prob=keep_prob)
        (encoder_fw_outputs, encoder_bw_outputs), (encoder_fw_final_state, encoder_bw_final_state) = \
            tf.nn.bidirectional_dynamic_rnn(
                encoder_f_cell, encoder_b_cell, inputs=emb_encoder_inputs, dtype=tf.float32, sequence_length=source_sequence_length)
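        # concatenate forward and backward outputs along the feature axis: [batch, time, 2 * rnn_size]
        emb_encoder_outputs = tf.concat((encoder_fw_outputs, encoder_bw_outputs), 2)
        # concatenate the final states too, so the decoder cell needs 2 * rnn_size units
        encoder_final_state_c = tf.concat((encoder_fw_final_state.c, encoder_bw_final_state.c), 1)
        encoder_final_state_h = tf.concat((encoder_fw_final_state.h, encoder_bw_final_state.h), 1)
        encoder_final_state = LSTMStateTuple(
            c=encoder_final_state_c,
            h=encoder_final_state_h
        )
        return emb_encoder_outputs, encoder_final_state
    else:
        raise ValueError("unknown encode_type: %s" % encode_type)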

The complete code is on my GitHub.

This post draws on the following papers:

1. Attention-Based End-to-End Speech Recognition on Voice Search

2. Listen, Attend and Spell

3. State-of-the-Art Speech Recognition with Sequence-to-Sequence Models

4. Very Deep Convolutional Networks for End-to-End Speech Recognition
