
A Top-Down Analysis of a Simple Speech Recognition System (Part 8)

Last time we introduced the get_audio_and_transcript, pad_sequences, and sparse_tuple_from functions; this time we analyze what each of these three functions actually does.

1. The get_audio_and_transcript function

This function takes the txt_files and wav_files lists obtained earlier and produces the audio features and transcription labels. The code is as follows:

def get_audio_and_transcript(txt_files, wav_files, n_input, n_context):
    '''
    Loads audio files and text transcriptions from ordered lists of filenames.
    Converts audio to MFCC arrays and text to numerical arrays.
    Returns list of arrays. Returned audio array list can be padded with
    pad_sequences function in this same module.
    '''
    audio = []
    audio_len = []
    transcript = []
    transcript_len = []
    for txt_file, wav_file in zip(txt_files, wav_files):
        # load audio and convert to features
        audio_data = audiofile_to_input_vector(wav_file, n_input, n_context)
        audio_data = audio_data.astype('float32')
        audio.append(audio_data)
        audio_len.append(np.int32(len(audio_data)))

        # load text transcription and convert to numerical array
        target = normalize_txt_file(txt_file)
        target = text_to_char_array(target)
        transcript.append(target)
        transcript_len.append(len(target))

    audio = np.asarray(audio)
    audio_len = np.asarray(audio_len)
    transcript = np.asarray(transcript)
    transcript_len = np.asarray(transcript_len)

    return audio, audio_len, transcript, transcript_len

As the code above shows, the heavy lifting is done by audiofile_to_input_vector, which converts the audio into training vectors that can be fed into the network. This involves some speech-processing background, so let us first look at how the raw audio is handled with some standard signal-processing steps.
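As a quick orientation, here is a minimal usage sketch; txt_files and wav_files are the ordered file lists built in the previous post, and n_input=26 / n_context=9 are the values this project uses (see section 4 below):

audio, audio_len, transcript, transcript_len = get_audio_and_transcript(
    txt_files, wav_files, n_input=26, n_context=9)
# each audio[i] is a float32 array of shape (time_steps_i, 26 + 2*26*9) = (time_steps_i, 494)
# each transcript[i] is the numeric character array for the matching transcription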

2. Reading the wav file

The following code reads the wav file and plots its waveform and spectrum:

import wave  
import numpy as np 
import struct 
import pylab as pl

# Open the wav file
# wave.open returns a Wave_read instance; its methods read the WAV format info and data
f = wave.open(r"777-126732-0068.wav","rb")  

# Read the format information
# getparams() returns all of the WAV file's format information at once as a tuple:
# (nchannels, sampwidth in bytes, framerate, nframes, comptype, compname).
# The wave module only supports uncompressed data, so the last two fields can be ignored.
params = f.getparams()  
nchannels, sampwidth, framerate, nframes = params[:4]  
print("channel",nchannels)  
print("sample_width",sampwidth)  
print("framerate",framerate)  
print("numframes",nframes) 

# Read the waveform data
# readframes takes the number of frames (sample points) to read
str_data  = f.readframes(nframes)  
wave_data = struct.unpack('{n}h'.format(n=nframes), str_data)
wave_data = np.array(wave_data)
f.close()  

time = np.arange(0, nframes) * (1.0 / framerate)  

# Plot the waveform
pl.subplot(211)   
pl.plot(time, wave_data)    
pl.xlabel("time (seconds)")   

# Number of sample points; change N and the start position to analyse
# different positions and lengths of the audio
N = nframes
start = 0  # starting sample position
df = framerate / (N - 1)  # frequency resolution
freq = [df * n for n in range(0, N)]  # N elements
wave_data2 = wave_data[start:start + N]
c = np.fft.fft(wave_data2) * 2 / N
# Conventionally we only display the spectrum up to half the sampling frequency
d=int(len(c)/2)
pl.subplot(212)
pl.plot(freq[:d-1],abs(c[:d-1]),'r')
pl.xlabel("Hz")
pl.show()  

This produces the waveform and spectrum plots shown below:
[Figure: time-domain waveform (top) and spectrum (bottom) of 777-126732-0068.wav]
Speech processing is usually carried out in the frequency domain. Taking the characteristics of human hearing into account, we do not need to feed all of the frequency-domain information into training; it is enough to compute the MFCC coefficients.

3. MFCC coefficients

The cochlea essentially acts as a filter bank whose filtering operates on a logarithmic frequency scale. Below 1000 Hz, human pitch perception is roughly linear in frequency; above 1000 Hz it is no longer linear but closer to logarithmic, which makes the human ear more sensitive to low-frequency signals than to high-frequency ones. MFCC mimics this aspect of human auditory processing to some extent, applying results from research on auditory perception, and speech recognition systems built on it gain a measurable improvement in performance.
The following figures show in more detail how the MFCC coefficients are obtained.
[Figure: speech spectrum, its spectral envelope (formants) and the spectral details]
Research shows that the useful part of human speech lies in the formants of the spectrum above, i.e. in the spectral envelope. Once the envelope is removed, most of what remains is related to environmental noise and is called the spectral details. How, then, can we extract these two pieces of information separately?
We observe that the spectral envelope corresponds to slow (low-frequency) variation while the spectral details correspond to fast (high-frequency) variation, so we can apply another FFT to the spectrum we just obtained. Taking a Fourier transform of the spectrum is equivalent to an inverse FFT (IFFT), as shown below:
[Figure: applying an IFFT to the spectrum separates the spectral envelope from the spectral details]
Finally, applying the formulas below gives us a set of Mel filter banks:
[Formula images: Mel-scale conversion and Mel filter bank definition]
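For reference, the standard Mel-scale conversion and the standard triangular filter definition (most likely what the formula images above show) are:

$$m = 2595 \log_{10}\!\left(1 + \frac{f}{700}\right)$$

$$H_m(k) =
\begin{cases}
0 & k < f(m-1) \\
\dfrac{k - f(m-1)}{f(m) - f(m-1)} & f(m-1) \le k \le f(m) \\
\dfrac{f(m+1) - k}{f(m+1) - f(m)} & f(m) \le k \le f(m+1) \\
0 & k > f(m+1)
\end{cases}$$

where the $f(m)$ are the filter boundary points, equally spaced on the Mel scale.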
The resulting filter bank is shown in the figure below:
[Figure: triangular Mel filter bank]
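In practice we do not implement these steps by hand. Here is a minimal, self-contained sketch of computing MFCC features with python_speech_features, the same package the project's audiofile_to_input_vector relies on (the file name is the example wav from section 2):

import scipy.io.wavfile as wav
from python_speech_features import mfcc

fs, audio = wav.read("777-126732-0068.wav")       # fs is the sample rate read from the file
features = mfcc(audio, samplerate=fs, numcep=26)  # 25 ms windows, 10 ms step by default
print(features.shape)                             # (number_of_frames, 26)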

4. The audiofile_to_input_vector function

With that background in place, let us return to the main thread and analyze the audiofile_to_input_vector function. The code is shown below:

def audiofile_to_input_vector(audio_filename, numcep, numcontext):
    # Load wav files
    fs, audio = wav.read(audio_filename)

    # Get mfcc coefficients
    orig_inputs = mfcc(audio, samplerate=fs, numcep=numcep)
    # fs = 16 kHz, numcep = 26; this calls mfcc from the python_speech_features package to compute the MFCC coefficients

    # We only keep every second feature (BiRNN stride = 2)
    orig_inputs = orig_inputs[::2]

    # For each time slice of the training set, we need to copy the context this makes
    # the numcep dimensions vector into a numcep + 2*numcep*numcontext dimensions
    # because of:
    #  - numcep dimensions for the current mfcc feature set
    #  - numcontext*numcep dimensions for each of the past and future (x2) mfcc feature set
    # => so numcep + 2*numcontext*numcep
    train_inputs = np.array([], np.float32)
    train_inputs.resize((orig_inputs.shape[0], numcep + 2 * numcep * numcontext))

    # Prepare pre-fix post fix context
    empty_mfcc = np.array([])
    empty_mfcc.resize((numcep))

    # Prepare train_inputs with past and future contexts
    time_slices = range(train_inputs.shape[0])
    context_past_min = time_slices[0] + numcontext
    context_future_max = time_slices[-1] - numcontext
    for time_slice in time_slices:
        # Reminder: array[start:stop:step]
        # slices from indice |start| up to |stop| (not included), every |step|

        # Add empty context data of the correct size to the start and end
        # of the MFCC feature matrix

        # Pick up to numcontext time slices in the past, and complete with empty
        # mfcc features
        need_empty_past = max(0, (context_past_min - time_slice))
        empty_source_past = list(empty_mfcc for empty_slots in range(need_empty_past))
        data_source_past = orig_inputs[max(0, time_slice - numcontext):time_slice]
        assert(len(empty_source_past) + len(data_source_past) == numcontext)

        # Pick up to numcontext time slices in the future, and complete with empty
        # mfcc features
        need_empty_future = max(0, (time_slice - context_future_max))
        empty_source_future = list(empty_mfcc for empty_slots in range(need_empty_future))
        data_source_future = orig_inputs[time_slice + 1:time_slice + numcontext + 1]
        assert(len(empty_source_future) + len(data_source_future) == numcontext)

        if need_empty_past:
            past = np.concatenate((empty_source_past, data_source_past))
        else:
            past = data_source_past

        if need_empty_future:
            future = np.concatenate((data_source_future, empty_source_future))
        else:
            future = data_source_future

        past = np.reshape(past, numcontext * numcep)
        now = orig_inputs[time_slice]
        future = np.reshape(future, numcontext * numcep)

        train_inputs[time_slice] = np.concatenate((past, now, future))
        assert(len(train_inputs[time_slice]) == numcep + 2 * numcep * numcontext)

    # Scale/standardize the inputs
    # This can be done more efficiently in the TensorFlow graph
    train_inputs = (train_inputs - np.mean(train_inputs)) / np.std(train_inputs)
    return train_inputs

For each 25 ms speech frame we use 26 MFCC cepstral features. The loop over time_slices concatenates the cepstral coefficients of the current frame with those of the 9 frames before and the 9 frames after it into a single 494-dimensional train_inputs vector (frames that do not exist at the boundaries are padded with zeros).
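A minimal usage sketch as a sanity check (the wav file is the example from section 2; 26 and 9 are the numcep/numcontext values used in this project):

feats = audiofile_to_input_vector("777-126732-0068.wav", numcep=26, numcontext=9)
print(feats.shape)   # (time_steps, 494): 26 current + 2 * 26 * 9 context coefficients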
With this we have the audio features needed for training; next let us see how the training labels are obtained. This part is mainly implemented in text.py.

5. The normalize_txt_file function

From the get_audio_and_transcript code we know that, right after calling audiofile_to_input_vector to obtain the cepstral features, it calls normalize_txt_file. What does this function do? Let us look at the code:

def normalize_txt_file(txt_file, remove_apostrophe=True):
    with codecs.open(txt_file, encoding="utf-8") as open_txt_file:
        return normalize_text(open_txt_file.read(), remove_apostrophe=remove_apostrophe)

As we can see, this function simply opens the file and calls normalize_text, so let us look at that code as well:

def normalize_text(original, remove_apostrophe=True):
    # convert any unicode characters to ASCII equivalent
    # then ignore anything else and decode to a string
    result = unicodedata.normalize("NFKD", original).encode("ascii", "ignore").decode()
    if remove_apostrophe:
        # remove apostrophes to keep contractions together
        result = result.replace("'", "")
    # return lowercase alphabetic characters and apostrophes (if still present)
    return re.sub("[^a-zA-Z']+", ' ', result).strip().lower()

This code removes unsupported characters from the transcription: it converts Unicode characters to their ASCII equivalents, optionally strips apostrophes, and keeps only lowercased letters separated by single spaces.
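A quick illustration of normalize_text on a made-up string:

print(normalize_text("Hello, World! It's 2 PM."))   # -> hello world its pm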

6. The text_to_char_array function

normalize_txt_file removes the unsupported characters from the transcription file; now let us analyze the text_to_char_array function that is called next. The code is as follows:

# Constants
SPACE_TOKEN = '<space>'
SPACE_INDEX = 0
FIRST_INDEX = ord('a') - 1  # 0 is reserved to space

def text_to_char_array(original):
    # Create list of sentence's words w/spaces replaced by ''
    result = original.replace(' ', '  ')
    result = result.split(' ')

    # Tokenize words into letters adding in SPACE_TOKEN where required
    result = np.hstack([SPACE_TOKEN if xt == '' else list(xt) for xt in result])

    # Return characters mapped into indices
    return np.asarray([SPACE_INDEX if xt == SPACE_TOKEN else ord(xt) - FIRST_INDEX for xt in result])

As this code shows, text_to_char_array turns the transcription string into a numeric array: a space is mapped to SPACE_INDEX (0), and each letter is mapped to ord(letter) - FIRST_INDEX, i.e. a=1 through z=26 (rather than the raw ASCII codes).
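A small worked example with a made-up string:

print(text_to_char_array("hi there"))
# -> [ 8  9  0 20  8  5 18  5]   (h i <space> t h e r e; a=1 ... z=26, space=0)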
At this point we have all the vectors needed for the training inputs and outputs. Returning once more to the next_batch function, we still have pad_sequences and sparse_tuple_from to analyze.

7. The pad_sequences function

This code pads each audio input sequence to the length of the longest sequence in the current batch, adding zeros at the beginning or end of each sequence (controlled by the padding argument).

def pad_sequences(sequences, maxlen=None, dtype=np.float32,
                  padding='post', truncating='post', value=0.):
    '''
    Pads each sequence to the same length of the longest sequence.

        If maxlen is provided, any sequence longer than maxlen is truncated to
        maxlen. Truncation happens off either the beginning or the end
        (default) of the sequence. Supports post-padding (default) and
        pre-padding.

        Args:
            sequences: list of lists where each element is a sequence
            maxlen: int, maximum length
            dtype: type to cast the resulting sequence.
            padding: 'pre' or 'post', pad either before or after each sequence.
            truncating: 'pre' or 'post', remove values from sequences larger
            than maxlen either in the beginning or in the end of the sequence
            value: float, value to pad the sequences to the desired value.

        Returns:
            numpy.ndarray: Padded sequences shape = (number_of_sequences, maxlen)
            numpy.ndarray: original sequence lengths
    '''
    lengths = np.asarray([len(s) for s in sequences], dtype=np.int64)

    nb_samples = len(sequences)
    if maxlen is None:
        maxlen = np.max(lengths)

    # take the sample shape from the first non empty sequence
    # checking for consistency in the main loop below.
    sample_shape = tuple()
    for s in sequences:
        if len(s) > 0:
            sample_shape = np.asarray(s).shape[1:]
            break

    x = (np.ones((nb_samples, maxlen) + sample_shape) * value).astype(dtype)
    for idx, s in enumerate(sequences):
        if len(s) == 0:
            continue  # empty list was found
        if truncating == 'pre':
            trunc = s[-maxlen:]
        elif truncating == 'post':
            trunc = s[:maxlen]
        else:
            raise ValueError('Truncating type "%s" not understood' % truncating)

        # check `trunc` has expected shape
        trunc = np.asarray(trunc, dtype=dtype)
        if trunc.shape[1:] != sample_shape:
            raise ValueError('Shape of sample %s of sequence at position %s is different from expected shape %s' %
                             (trunc.shape[1:], idx, sample_shape))

        if padding == 'post':
            x[idx, :len(trunc)] = trunc
        elif padding == 'pre':
            x[idx, -len(trunc):] = trunc
        else:
            raise ValueError('Padding type "%s" not understood' % padding)
    return x, lengths
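
A minimal usage sketch with two toy sequences (post-padding, the default):

x, lengths = pad_sequences([[1, 2, 3], [4, 5]])
# x       -> [[1. 2. 3.]
#             [4. 5. 0.]]
# lengths -> [3 2]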

8. The sparse_tuple_from function

This function produces a sparse representation of the label vectors. The code is shown below:

def sparse_tuple_from(sequences, dtype=np.int32):
    """
    Create a sparse representation of ``sequences``.

    Args:
        sequences: a list of lists of type dtype where each element is a sequence
    Returns:
        A tuple with (indices, values, shape)
    """

    indices = []
    values = []

    for n, seq in enumerate(sequences):
        indices.extend(zip([n] * len(seq), range(len(seq))))
        values.extend(seq)

    indices = np.asarray(indices, dtype=np.int64)
    values = np.asarray(values, dtype=dtype)
    shape = np.asarray([len(sequences), indices.max(0)[1] + 1], dtype=np.int64)

    # return tf.SparseTensor(indices=indices, values=values, shape=shape)
    return indices, values, shape

For example, suppose sequences contains two sequences, [1 3 4 9 2] and [8 5 7 2]. Then indices = [[0 0] [0 1] [0 2] [0 3] [0 4] [1 0] [1 1] [1 2] [1 3]], values = [1 3 4 9 2 8 5 7 2], and shape = [2 5].
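The same example as a runnable check:

indices, values, shape = sparse_tuple_from([[1, 3, 4, 9, 2], [8, 5, 7, 2]])
# indices -> [[0 0] [0 1] [0 2] [0 3] [0 4] [1 0] [1 1] [1 2] [1 3]]
# values  -> [1 3 4 9 2 8 5 7 2]
# shape   -> [2 5]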
With that, we have the inputs and outputs needed for training; next we move on to the model's training code proper.