
Four Ways to Train a Classification Model with Deep Neural Networks in Keras

GitHub code: Keras examples analysis

Welcome to my blog: https://gaussic.github.io/2017/03/03/imdb-sentiment-classification/

(Please credit the source when reposting: https://gaussic.github.io)

The official Keras examples include four ways to train an IMDB text sentiment classifier. Working through these four Python programs gives a good feel for how Keras is used. Below is an analysis of each example.

The IMDB Dataset

The IMDB sentiment classification dataset is a collection of IMDB movie reviews with sentiment labels compiled at Stanford. It contains 25,000 training samples and 25,000 test samples. Here is one of the positive samples:

Bromwell High is a cartoon comedy. It ran at the same time as some other programs about school life, such as "Teachers". My 35 years in the teaching profession lead me to believe that Bromwell High's satire is much closer to reality than is "Teachers". The scramble to survive financially, the insightful students who can see right through their pathetic teachers' pomp, the pettiness of the whole situation, all remind me of the schools I knew and their students. When I saw the episode in which a student repeatedly tried to burn down the school, I immediately recalled ......... at .......... High. A classic line: INSPECTOR: I'm here to sack one of your teachers. STUDENT: Welcome to Bromwell High. I expect that many adults of my age think that Bromwell High is far fetched. What a pity that it isn't!

The Keras examples in this article use a preprocessed, already tokenised pkl file. Its data format is roughly as follows:

from six.moves import cPickle
(x_train, labels_train), (x_test, labels_test) = cPickle.load(open('imdb_full.pkl', 'rb'))
print(x_train[0])
>>> [23022, 309, 6, 3, 1069, 209, 9, 2175, 30, 1, 169, 55, 14, 46, 82, 5869, 41, 393, 110, 138, 14, 5359, 58, 4477, 150, 8, 1, 5032, 5948, 482, 69, 5, 261, 12, 23022, 73935, 2003, 6, 73, 2436, 5, 632, 71, 6, 5359, 1, 25279, 5, 2004, 10471, 1, 5941, 1534, 34, 67, 64, 205, 140, 65, 1232, 63526, 21145, 1, 49265, 4, 1, 223, 901, 29, 3024, 69, 4, 1, 5863, 10, 694, 2, 65, 1534, 51, 10, 216, 1, 387, 8, 60, 3, 1472, 3724, 802, 5,3521, 177, 1, 393, 10, 1238, 14030, 30, 309, 3, 353, 344, 2989, 143, 130, 5, 7804, 28, 4, 126, 5359, 1472, 2375, 5, 23022, 309, 10, 532, 12, 108, 1470, 4, 58, 556, 101, 12, 23022, 309, 6, 227, 4187, 48, 3, 2237, 12, 9, 215]
print(labels_train[0])
>>> 1

For the detailed preprocessing steps, see keras/datasets/imdb.py.
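As a quick sanity check on this format, the integer ids can be mapped back to words. Below is a minimal sketch, assuming the x_train loaded above and the word index that ships with keras.datasets.imdb (note that imdb.load_data() shifts ids by index_from=3 by default, while the ids in the raw imdb_full.pkl are unshifted):

from keras.datasets import imdb

word_index = imdb.get_word_index()                  # word -> id mapping bundled with the dataset
index_word = {i: w for w, i in word_index.items()}  # invert it for decoding

# decode the first training review loaded from imdb_full.pkl (ids used directly, no offset)
decoded = ' '.join(index_word.get(i, '?') for i in x_train[0])
print(decoded[:200])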

FastText

FastText is the fast text classification method described by Joulin et al. in Bags of Tricks for Efficient Text Classification. The authors present it as a strong baseline for many text classification tasks. The overall model structure is shown in the figure below:

[Figure: FastText model architecture]

Given an input sequence, the model first extracts n-gram features to obtain an n-gram feature sequence, then embeds each feature into a word vector, and sums and averages all of the feature vectors in the sequence to form the hidden layer. Finally, a classifier (typically softmax) attached at the output layer performs the classification.

The idea is similar to an averaged sentence embedding: sum all of the word vectors in a sentence and take the mean to obtain a vector representation of the sentence.
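The averaging itself is a one-line operation. Here is a minimal NumPy sketch of the idea, using a randomly initialised stand-in for the learned embedding table:

import numpy as np

vocab_size, embedding_dims = 20000, 50
embedding = np.random.rand(vocab_size, embedding_dims)  # stand-in for a learned embedding table

sentence = [3, 9, 30, 55, 14]                # a sequence of word ids
word_vectors = embedding[sentence]           # look up each word vector, shape (5, 50)
sentence_vector = word_vectors.mean(axis=0)  # average them into one vector, shape (50,)
print(sentence_vector.shape)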

The full Keras implementation of the model is given below, with explanatory comments added to the code:

from __future__ import print_function
import numpy as np
np.random.seed(1337)  # for reproducibility

from keras.preprocessing import sequence
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Embedding
from keras.layers import GlobalAveragePooling1D
from keras.datasets import imdb


# Build the n-gram dataset
def create_ngram_set(input_list, ngram_value=2):
    """
    Extract a set of n-grams from a list of integers.
    >>> create_ngram_set([1, 4, 9, 4, 1, 4], ngram_value=2)
    {(4, 9), (4, 1), (1, 4), (9, 4)}
    >>> create_ngram_set([1, 4, 9, 4, 1, 4], ngram_value=3)
    {(1, 4, 9), (4, 9, 4), (9, 4, 1), (4, 1, 4)}
    """
    return set(zip(*[input_list[i:] for i in range(ngram_value)]))


def add_ngram(sequences, token_indice, ngram_range=2):
    """
    Augment the input list of list (sequences) by appending n-grams values.
    Example: adding bi-gram
    >>> sequences = [[1, 3, 4, 5], [1, 3, 7, 9, 2]]
    >>> token_indice = {(1, 3): 1337, (9, 2): 42, (4, 5): 2017}
    >>> add_ngram(sequences, token_indice, ngram_range=2)
    [[1, 3, 4, 5, 1337, 2017], [1, 3, 7, 9, 2, 1337, 42]]
    Example: adding tri-gram
    >>> sequences = [[1, 3, 4, 5], [1, 3, 7, 9, 2]]
    >>> token_indice = {(1, 3): 1337, (9, 2): 42, (4, 5): 2017, (7, 9, 2): 2018}
    >>> add_ngram(sequences, token_indice, ngram_range=3)
    [[1, 3, 4, 5, 1337], [1, 3, 7, 9, 2, 1337, 2018]]
    """
    new_sequences = []
    for input_list in sequences:
        new_list = input_list[:]
        for i in range(len(new_list) - ngram_range + 1):
            for ngram_value in range(2, ngram_range + 1):
                ngram = tuple(new_list[i:i + ngram_value])
                if ngram in token_indice:
                    new_list.append(token_indice[ngram])
        new_sequences.append(new_list)

    return new_sequences

# Set parameters
# ngram_range = 2 will add bi-gram features
ngram_range = 2
max_features = 20000  # vocabulary size
maxlen = 400          # maximum sequence length
batch_size = 32       # batch size
embedding_dims = 50   # word vector dimension
nb_epoch = 5          # number of training epochs

# Load the IMDB data
print('Loading data...')
(X_train, y_train), (X_test, y_test) = imdb.load_data(nb_words=max_features)
print(len(X_train), 'train sequences')
print(len(X_test), 'test sequences')
print('Average train sequence length: {}'.format(
    np.mean(list(map(len, X_train)), dtype=int)))
print('Average test sequence length: {}'.format(
    np.mean(list(map(len, X_test)), dtype=int)))


if ngram_range > 1:
    print('Adding {}-gram features'.format(ngram_range))
    # Create set of unique n-gram from the training set.
    ngram_set = set()
    for input_list in X_train:
        for i in range(2, ngram_range + 1):
            set_of_ngram = create_ngram_set(input_list, ngram_value=i)
            ngram_set.update(set_of_ngram)

    # Dictionary mapping each n-gram token to a unique integer.
    # Integer values are greater than max_features in order
    # to avoid collision with existing features.
    start_index = max_features + 1
    token_indice = {v: k + start_index for k, v in enumerate(ngram_set)}
    indice_token = {token_indice[k]: k for k in token_indice}

    # max_features is the highest integer that could be found in the dataset.
    max_features = np.max(list(indice_token.keys())) + 1

    # Augmenting X_train and X_test with n-grams features
    X_train = add_ngram(X_train, token_indice, ngram_range)
    X_test = add_ngram(X_test, token_indice, ngram_range)
    print('Average train sequence length: {}'.format(
        np.mean(list(map(len, X_train)), dtype=int)))
    print('Average test sequence length: {}'.format(
        np.mean(list(map(len, X_test)), dtype=int)))

# Pad sequences to a fixed length
print('Pad sequences (samples x time)')
X_train = sequence.pad_sequences(X_train, maxlen=maxlen)
X_test = sequence.pad_sequences(X_test, maxlen=maxlen)
print('X_train shape:', X_train.shape)
print('X_test shape:', X_test.shape)

# Build the model
print('Build model...')
model = Sequential()

# we start off with an efficient embedding layer which maps
# our vocab indices into embedding_dims dimensions
model.add(Embedding(max_features,
                    embedding_dims,
                    input_length=maxlen))

# we add a GlobalAveragePooling1D, which will average the embeddings
# of all words in the document
model.add(GlobalAveragePooling1D())

# We project onto a single unit output layer, and squash it with a sigmoid:
model.add(Dense(1, activation='sigmoid'))

model.summary()  # print a summary of the model

model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

# Train, validating on the test set
model.fit(X_train, y_train,
          batch_size=batch_size,
          nb_epoch=nb_epoch,
          validation_data=(X_test, y_test))

N-gram Feature Extraction

In this example, the two functions create_ngram_set() and add_ngram() are used to add n-gram features to the input.

create_ngram_set() extracts the n-gram features from a sequence; the main script then collects them over the whole training set and adds them to the vocabulary. See the comments in the code for the details.

>>> create_ngram_set([1, 4, 9, 4, 1, 4], ngram_value=2)
{(4, 9), (4, 1), (1, 4), (9, 4)}
>>> create_ngram_set([1, 4, 9, 4, 1, 4], ngram_value=3)
{(1, 4, 9), (4, 9, 4), (9, 4, 1), (4, 1, 4)}

add_ngram() differs slightly from the paper: it appends the id of each n-gram feature (i.e. its index in the extended vocabulary) to the end of the sequence instead of discarding the original sequence, as illustrated in the code:

    Example: adding bi-gram
    >>> sequences = [[1, 3, 4, 5], [1, 3, 7, 9, 2]]
    >>> token_indice = {(1, 3): 1337, (9, 2): 42, (4, 5): 2017}
    >>> add_ngram(sequences, token_indice, ngram_range=2)
    [[1, 3, 4, 5, 1337, 2017], [1, 3, 7, 9, 2, 1337, 42]]
    Example: adding tri-gram
    >>> sequences = [[1, 3, 4, 5], [1, 3, 7, 9, 2]]
    >>> token_indice = {(1, 3): 1337, (9, 2): 42, (4, 5): 2017, (7, 9, 2): 2018}
    >>> add_ngram(sequences, token_indice, ngram_range=3)
    [[1, 3, 4, 5, 1337], [1, 3, 7, 9, 2, 1337, 2018]]
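Putting the two functions together: the main script collects all bi-grams from the training set with create_ngram_set(), assigns each one an id starting at max_features + 1 so it cannot collide with an existing word id, and then appends those ids with add_ngram(). A small sketch of that flow with toy sequences, reusing the two functions defined above:

sequences = [[1, 3, 4, 5], [1, 3, 7, 9, 2]]
max_features = 10                       # pretend the original vocabulary has 10 words

# collect every bi-gram that appears in the corpus
ngram_set = set()
for seq in sequences:
    ngram_set.update(create_ngram_set(seq, ngram_value=2))

# give each bi-gram an id above the existing vocabulary
start_index = max_features + 1
token_indice = {ngram: idx + start_index for idx, ngram in enumerate(ngram_set)}

# each sequence now ends with the ids of the bi-grams it contains
print(add_ngram(sequences, token_indice, ngram_range=2))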

Padding

Padding turns variable-length sequences into fixed-length ones, which makes them convenient for neural networks (recurrent networks in particular) to process. In Keras, pad_sequences pads a sequence with 0 at the front if it is shorter than the maximum length, and keeps only the last maxlen elements if it is longer.

X_train = sequence.pad_sequences(X_train, maxlen=maxlen)
X_test = sequence.pad_sequences(X_test, maxlen=maxlen)
print('X_train shape:', X_train.shape)
print('X_test shape:', X_test.shape)
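A quick demonstration of this behaviour (by default pad_sequences pads with 0 at the front and, for sequences that are too long, keeps the last maxlen elements):

from keras.preprocessing import sequence

short_seq = [1, 2, 3]
long_seq = [1, 2, 3, 4, 5, 6, 7]
print(sequence.pad_sequences([short_seq, long_seq], maxlen=5))
# [[0 0 1 2 3]
#  [3 4 5 6 7]]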

Model Construction

Embedding

First comes an embedding layer, which projects each id in a sample sequence into a fixed-dimensional vector space, so that every id is represented by a word vector of fixed dimension. In other words, the input of shape [n_samples, sequence_length] becomes [n_samples, sequence_length, embedding_dims] after the embedding layer.

model.add(Embedding(max_features, embedding_dims, input_length=maxlen))
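This shape change can be checked in isolation. A minimal sketch with a standalone Embedding layer (the numbers are illustrative only):

from keras.models import Sequential
from keras.layers import Embedding

m = Sequential()
m.add(Embedding(20000, 50, input_length=400))  # 20000-word vocabulary, 50-dim word vectors
print(m.output_shape)                          # (None, 400, 50): one 50-dim vector per position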

GlobalAveragePooling1D

GlobalAveragePooling1D is very simple: it sums the input word vector sequence over the time axis and takes the mean, collapsing it into a single vector.

model.add(GlobalAveragePooling1D())

The official implementation is:

class GlobalAveragePooling1D(_GlobalPooling1D):
    """Global average pooling operation for temporal data.
    # Input shape
        3D tensor with shape: `(samples, steps, features)`.
    # Output shape
        2D tensor with shape: `(samples, features)`.
    """

    def call(self, x, mask=None):
        return K.mean(x, axis=1)
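In NumPy terms the layer just averages over the time axis; a small sketch:

import numpy as np

x = np.random.rand(32, 400, 50)  # (samples, steps, features): 400 word vectors per sample
pooled = x.mean(axis=1)          # average over the 400 steps, the same as K.mean(x, axis=1)
print(pooled.shape)              # (32, 50)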

Dense

Since the IMDB sentiment dataset has only two classes (positive and negative), the fully connected layer is a binary classifier with a single neuron and a sigmoid activation.

model.add(Dense(1, activation='sigmoid'))

The overall structure of the model is as follows:

____________________________________________________________________________________________________
Layer (type)                     Output Shape          Param #     Connected to
====================================================================================================
embedding_1 (Embedding)          (None, 400, 50)       60261500    embedding_input_1[0][0]
____________________________________________________________________________________________________
globalaveragepooling1d_1 (Global (None, 50)            0           embedding_1[0][0]
____________________________________________________________________________________________________
dense_1 (Dense)                  (None, 1)             51          globalaveragepooling1d_1[0][0]
====================================================================================================

Training

This binary classifier uses binary cross-entropy as the loss function, adam as the optimizer, and accuracy as the evaluation metric.

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

During training, the test set is used to validate the results:

model.fit(X_train, y_train,
          batch_size=batch_size,
          nb_epoch=nb_epoch,
          validation_data=(X_test, y_test))

The training results are as follows:

Train on 25000 samples, validate on 25000 samples
Epoch 1/5
25000/25000 [==============================] - 63s - loss: 0.5812 - acc: 0.7871 - val_loss: 0.4320 - val_acc: 0.8593
Epoch 2/5
25000/25000 [==============================] - 58s - loss: 0.2776 - acc: 0.9307 - val_loss: 0.2992 - val_acc: 0.8936
Epoch 3/5
25000/25000 [==============================] - 58s - loss: 0.1370 - acc: 0.9718 - val_loss: 0.2603 - val_acc: 0.9016
Epoch 4/5
25000/25000 [==============================] - 58s - loss: 0.0738 - acc: 0.9886 - val_loss: 0.2428 - val_acc: 0.9040
Epoch 5/5
25000/25000 [==============================] - 58s - loss: 0.0415 - acc: 0.9951 - val_loss: 0.2351 - val_acc: 0.9066

With bi-gram features, the validation accuracy reaches 0.9066.

CNN

This example shows how one-dimensional convolution can be applied to text data, providing a way to bring CNNs, which excel at image processing, into text processing. A Convolution1D layer convolves the sequence and a GlobalMaxPooling1D layer then applies max pooling to the result, a process analogous to the feature-extraction stage of an image CNN, which improves on a plain feed-forward network. The full code is as follows:

from __future__ import print_function
import numpy as np
np.random.seed(1337)  # for reproducibility

from keras.preprocessing import sequence
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation
from keras.layers import Embedding
from keras.layers import Convolution1D, GlobalMaxPooling1D
from keras.datasets import imdb


# Set parameters
max_features = 5000  # maximum number of features (vocabulary size)
maxlen = 400         # maximum sequence length
batch_size = 32      # batch size
embedding_dims = 50  # word embedding dimension
nb_filter = 250      # number of 1D convolution filters
filter_length = 3    # filter (kernel) length
hidden_dims = 250    # hidden layer size
nb_epoch = 10        # number of training epochs

# Load the IMDB data
print('Loading data...')
(X_train, y_train), (X_test, y_test) = imdb.load_data(nb_words=max_features)
print(len(X_train), 'train sequences')
print(len(X_test), 'test sequences')

# Pad every sample to a fixed length maxlen, padding with 0 at the front
print('Pad sequences (samples x time)')
X_train = sequence.pad_sequences(X_train, maxlen=maxlen)
X_test = sequence.pad_sequences(X_test, maxlen=maxlen)
print('X_train shape:', X_train.shape)
print('X_test shape:', X_test.shape)

# Build the model
print('Build model...')
model = Sequential()

# we start off with an efficient embedding layer which maps
# our vocab indices into embedding_dims dimensions
model.add(Embedding(max_features,
                    embedding_dims,
                    input_length=maxlen,
                    dropout=0.2))

# we add a Convolution1D, which will learn nb_filter
# word group filters of size filter_length:
model.add(Convolution1D(nb_filter=nb_filter,
                        filter_length=filter_length,
                        border_mode='valid',
                        activation='relu',
                        subsample_length=1))
# we use max pooling:
model.add(GlobalMaxPooling1D())

# We add a vanilla hidden layer:
model.add(Dense(hidden_dims))
model.add(Dropout(0.2))
model.add(Activation('relu'))

# We project onto a single unit output layer, and squash it with a sigmoid:
model.add(Dense(1))
model.add(Activation('sigmoid'))

model.summary()  # model summary

# Define the loss function, optimizer and evaluation metric
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

# Train for nb_epoch epochs
model.fit(X_train, y_train,
          batch_size=batch_size,
          nb_epoch=nb_epoch,
          validation_data=(X_test, y_test))

Model Construction

Embedding

The word embedding layer is similar to the one in FastText, but it takes an extra dropout argument, which randomly discards part of the data by setting the given fraction of values to 0. This helps prevent overfitting.

For more on Dropout, see the paper Dropout: A Simple Way to Prevent Neural Networks from Overfitting.

model.add(Embedding(max_features,
                    embedding_dims,
                    input_length=maxlen,
                    dropout=0.2))

Convolution1D

After the embedding layer, a one-dimensional convolution is applied. Its input has shape [nb_samples, steps, input_dim] and its output has shape [nb_samples, new_steps, output_dim], so after the convolution the per-step feature dimension becomes nb_filter.

Because the convolution runs along the time axis, the number of steps may change.

model.add(Convolution1D(nb_filter=nb_filter,
                        filter_length=filter_length,
                        border_mode='valid',
                        activation='relu',
                        subsample_length=1))

Convolution1D differs from Convolution2D: the former convolves over time, the latter over space. The distinction takes some getting used to; a separate post may cover it in detail, and there are also good explanations elsewhere online.
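For the 'valid' border mode with subsample_length=1 used here, the output length is simply steps - filter_length + 1. A small sketch confirming the shapes with this example's parameters:

from keras.models import Sequential
from keras.layers import Embedding, Convolution1D

m = Sequential()
m.add(Embedding(5000, 50, input_length=400))
m.add(Convolution1D(nb_filter=250, filter_length=3,
                    border_mode='valid', activation='relu', subsample_length=1))
print(m.output_shape)  # (None, 398, 250): 400 - 3 + 1 steps, one feature per filter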

GlobalMaxPooling1D

model.add(GlobalMaxPooling1D())

This applies 1D global max pooling to the convolved sequence. Take a [10, 50] sequence as an example: the sequence length is 10 (10 rows) and each step has 50 features (50 columns). Global max pooling takes the maximum of every column, so the output becomes a single vector of shape [50].

The official implementation is as follows:

class GlobalMaxPooling1D(_GlobalPooling1D):
    """Global max pooling operation for temporal data.
    # Input shape
        3D tensor with shape: `(samples, steps, features)`.
    # Output shape
        2D tensor with shape: `(samples, features)`.
    """

    def call(self, x, mask=None):
        return K.max(x, axis=1)
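The [10, 50] example above, written out in NumPy:

import numpy as np

x = np.random.rand(1, 10, 50)  # one sample: a sequence of 10 feature vectors of width 50
pooled = x.max(axis=1)         # column-wise maximum over the 10 steps, like K.max(x, axis=1)
print(pooled.shape)            # (1, 50)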

Dense

After the 1D pooling, the output is a single vector. A vanilla fully connected hidden layer is added for further training, so that the features obtained by CNN + max pooling can be exploited more fully.

It is then followed by a single-neuron fully connected layer for classification, just as in FastText.

# We add a vanilla hidden layer:
model.add(Dense(hidden_dims))
model.add(Dropout(0.2))
model.add(Activation('relu'))

# We project onto a single unit output layer, and squash it with a sigmoid:
model.add(Dense(1))
model.add(Activation('sigmoid'))

The overall structure of the model is as follows:

____________________________________________________________________________________________________
Layer (type)                     Output Shape          Param #     Connected to
====================================================================================================
embedding_1 (Embedding)          (None, 400, 50)       250000      embedding_input_1[0][0]
____________________________________________________________________________________________________
convolution1d_1 (Convolution1D)  (None, 398, 250)      37750       embedding_1[0][0]
____________________________________________________________________________________________________
globalmaxpooling1d_1 (GlobalMaxP (None, 250)           0           convolution1d_1[0][0]
____________________________________________________________________________________________________
dense_1 (Dense)                  (None, 250)           62750       globalmaxpooling1d_1[0][0]
____________________________________________________________________________________________________
dropout_1 (Dropout)              (None, 250)           0           dense_1[0][0]
____________________________________________________________________________________________________
activation_1 (Activation)        (None, 250)           0           dropout_1[0][0]
____________________________________________________________________________________________________
dense_2 (Dense)                  (None, 1)             251         activation_1[0][0]
____________________________________________________________________________________________________
activation_2 (Activation)        (None, 1)             0           dense_2[0][0]
====================================================================================================

Training

The training procedure is the same as for FastText and is not repeated here.

The results are as follows:

Train on 25000 samples, validate on 25000 samples
Epoch 1/10
25000/25000 [==============================] - 12s - loss: 0.4323 - acc: 0.7879 - val_loss: 0.3123 - val_acc: 0.8690
Epoch 2/10
25000/25000 [==============================] - 10s - loss: 0.2947 - acc: 0.8759 - val_loss: 0.2831 - val_acc: 0.8820
Epoch 3/10
25000/25000 [==============================] - 10s - loss: 0.2466 - acc: 0.9009 - val_loss: 0.3057 - val_acc: 0.8672
Epoch 4/10
25000/25000 [==============================] - 10s - loss: 0.2124 - acc: 0.9141 - val_loss: 0.2667 - val_acc: 0.8893
Epoch 5/10
25000/25000 [==============================] - 10s - loss: 0.1780 - acc: 0.9297 - val_loss: 0.2696 - val_acc: 0.8883
Epoch 6/10
25000/25000 [==============================] - 10s - loss: 0.1571 - acc: 0.9396 - val_loss: 0.2900 - val_acc: 0.8800
Epoch 7/10
25000/25000 [==============================] - 10s - loss: 0.1321 - acc: 0.9483 - val_loss: 0.2909 - val_acc: 0.8826
Epoch 8/10
25000/25000 [==============================] - 10s - loss: 0.1175 - acc: 0.9552 - val_loss: 0.2924 - val_acc: 0.8866
Epoch 9/10
25000/25000 [==============================] - 10s - loss: 0.1024 - acc: 0.9616 - val_loss: 0.3194 - val_acc: 0.8775
Epoch 10/10
25000/25000 [==============================] - 10s - loss: 0.0933 - acc: 0.9642 - val_loss: 0.3102 - val_acc: 0.8851

After training, the model reaches a validation accuracy of 0.8851.

LSTM

LSTMs have become a fairly standard tool in NLP, but on this task the dataset is too small for them to show their full strength. They are also slow to train, so a faster and simpler algorithm can sometimes be a better choice. The full code is as follows:

from __future__ import print_function
import numpy as np
np.random.seed(1337)  # for reproducibility

from keras.preprocessing import sequence
from keras.models import Sequential
from keras.layers import Dense, Activation, Embedding
from keras.layers import LSTM
from keras.datasets import imdb

# Set parameters
max_features = 20000   # vocabulary size
# cut texts after this number of words (among the top max_features most common words)
maxlen = 80
batch_size = 32   # batch size

# Load the data
print('Loading data...')
(X_train, y_train), (X_test, y_test) = imdb.load_data(nb_words=max_features)
print(len(X_train), 'train sequences')
print(len(X_test), 'test sequences')

# Truncate / pad to length maxlen
print('Pad sequences (samples x time)')
X_train = sequence.pad_sequences(X_train, maxlen=maxlen)
X_test = sequence.pad_sequences(X_test, maxlen=maxlen)
print('X_train shape:', X_train.shape)
print('X_test shape:', X_test.shape)

# Build the model
print('Build model...')
model = Sequential()
# embedding layer, 128-dimensional word vectors
model.add(Embedding(max_features, 128, dropout=0.2))
# LSTM layer with 128 output units; try swapping in a GRU
model.add(LSTM(128, dropout_W=0.2, dropout_U=0.2))  # try using a GRU instead, for fun
model.add(Dense(1))   # fully connected layer with a single neuron
model.add(Activation('sigmoid'))   # sigmoid activation layer

model.summary()   # model summary

# try using different optimizers and different optimizer configs
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

# Train for 15 epochs, using the test set for validation (best avoided in a real experiment)
print('Train...')
model.fit(X_train, y_train, batch_size=batch_size, nb_epoch=15,
          validation_data=(X_test, y_test))

# Evaluate loss and accuracy
score, acc = model.evaluate(X_test, y_test,
                            batch_size=batch_size)
print('Test score:', score)
print('Test accuracy:', acc)

Model Construction

Building the LSTM model for IMDB is very simple and resembles FastText. An overview:

print('Build model...')
model = Sequential()
# embedding layer, 128-dimensional word vectors
model.add(Embedding(max_features, 128, dropout=0.2))
# LSTM layer with 128 output units; try swapping in a GRU
model.add(LSTM(128, dropout_W=0.2, dropout_U=0.2))  # try using a GRU instead, for fun
model.add(Dense(1))   # fully connected layer with a single neuron
model.add(Activation('sigmoid'))   # sigmoid activation layer

As shown above, the LSTM model simply replaces FastText's GlobalAveragePooling1D with an LSTM layer: the input is first turned into a sequence of word vectors by the embedding layer, the LSTM then transforms it into a 128-dimensional vector, and a sigmoid classifier is attached directly on top.

The two dropout arguments of the LSTM follow the same idea as the dropout seen in the earlier examples; see the official documentation for details.
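As the comment in the code suggests, the LSTM layer can be swapped for a GRU with the same call signature. A sketch of that variant, shown only to illustrate the swap (max_features and the rest of the script stay as above):

from keras.models import Sequential
from keras.layers import Embedding, GRU, Dense, Activation

model = Sequential()
model.add(Embedding(max_features, 128, dropout=0.2))
model.add(GRU(128, dropout_W=0.2, dropout_U=0.2))  # drop-in replacement for the LSTM layer
model.add(Dense(1))
model.add(Activation('sigmoid'))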

The overall structure of the LSTM model is as follows:

____________________________________________________________________________________________________
Layer (type)                     Output Shape          Param #     Connected to
====================================================================================================
embedding_1 (Embedding)          (None, None, 128)     2560000     embedding_input_1[0][0]
____________________________________________________________________________________________________
lstm_1 (LSTM)                    (None, 128)           131584      embedding_1[0][0]
____________________________________________________________________________________________________
dense_1 (Dense)                  (None, 1)             129         lstm_1[0][0]
____________________________________________________________________________________________________
activation_1 (Activation)        (None, 1)             0           dense_1[0][0]
====================================================================================================

Training

The training procedure is the same as for FastText and CNN and is not repeated here.

The results are as follows:

Train on 25000 samples, validate on 25000 samples
Epoch 1/15
25000/25000 [==============================] - 113s - loss: 0.5194 - acc: 0.7420 - val_loss: 0.3996 - val_acc: 0.8220
Epoch 2/15
25000/25000 [==============================] - 114s - loss: 0.3758 - acc: 0.8404 - val_loss: 0.3813 - val_acc: 0.8342
Epoch 3/15
25000/25000 [==============================] - 116s - loss: 0.2956 - acc: 0.8784 - val_loss: 0.4136 - val_acc: 0.8278
Epoch 4/15
25000/25000 [==============================] - 116s - loss: 0.2414 - acc: 0.9032 - val_loss: 0.3953 - val_acc: 0.8372
Epoch 5/15
25000/25000 [==============================] - 115s - loss: 0.2006 - acc: 0.9208 - val_loss: 0.4037 - val_acc: 0.8311
Epoch 6/15
25000/25000 [==============================] - 115s - loss: 0.1644 - acc: 0.9376 - val_loss: 0.4361 - val_acc: 0.8358
Epoch 7/15
25000/25000 [==============================] - 133s - loss: 0.1468 - acc: 0.9448 - val_loss: 0.4974 - val_acc: 0.8313
Epoch 8/15
25000/25000 [==============================] - 114s - loss: 0.1230 - acc: 0.9532 - val_loss: 0.5178 - val_acc: 0.8256
Epoch 9/15
25000/25000 [==============================] - 114s - loss: 0.1087 - acc: 0.9598 - val_loss: 0.5401 - val_acc: 0.8232
Epoch 10/15
25000/25000 [==============================] - 117s - loss: 0.0979 - acc: 0.9637 - val_loss: 0.5493 - val_acc: 0.8271
Epoch 11/15
25000/25000 [==============================] - 119s - loss: 0.0867 - acc: 0.9685 - val_loss: 0.6539 - val_acc: 0.8235
Epoch 12/15
25000/25000 [==============================] - 113s - loss: 0.0806 - acc: 0.9710 - val_loss: 0.5976 - val_acc: 0.8170
Epoch 13/15
25000/25000 [==============================] - 115s - loss: 0.0724 - acc: 0.9730 - val_loss: 0.6591 - val_acc: 0.8180
Epoch 14/15
25000/25000 [==============================] - 115s - loss: 0.0697 - acc: 0.9758 - val_loss: 0.6542 - val_acc: 0.8165
Epoch 15/15
25000/25000 [==============================] - 114s - loss: 0.0634 - acc: 0.9771 - val_loss: 0.6644 - val_acc: 0.8200
24992/25000 [============================>.] - ETA: 0sTest score: 0.664366141396
Test accuracy: 0.82

The model reaches an accuracy of 0.82 on the test set.


CNN + LSTM

After reading the analyses of the three approaches above, the CNN + LSTM approach should feel familiar: it is simply a combination of the CNN and the LSTM. The full code is as follows:

from __future__ import print_function
import numpy as np
np.random.seed(1337)  # for reproducibility

from keras.preprocessing import sequence
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation
from keras.layers import Embedding
from keras.layers import LSTM
from keras.layers import Convolution1D, MaxPooling1D
from keras.datasets import imdb


# Embedding
max_features = 20000   # vocabulary size
maxlen = 100           # maximum sequence length
embedding_size = 128   # word vector dimension

# Convolution
filter_length = 5    # filter length
nb_filter = 64       # number of filters
pool_length = 4      # pooling length

# LSTM
lstm_output_size = 70   # LSTM output dimension

# Training
batch_size = 30   # batch size
nb_epoch = 2      # number of epochs

'''
Note:
batch_size is highly sensitive.
Only 2 epochs are needed as the dataset is very small.
'''

# Load the data
print('Loading data...')
(X_train, y_train), (X_test, y_test) = imdb.load_data(nb_words=max_features)
print(len(X_train), 'train sequences')
print(len(X_test), 'test sequences')

# Pad to fixed length maxlen
print('Pad sequences (samples x time)')
X_train = sequence.pad_sequences(X_train, maxlen=maxlen)
X_test = sequence.pad_sequences(X_test, maxlen=maxlen)
print('X_train shape:', X_train.shape)
print('X_test shape:', X_test.shape)

print('Build model...')
# Build the model
model = Sequential()
model.add(Embedding(max_features, embedding_size, input_length=maxlen))  # word embedding layer
model.add(Dropout(0.25))       # dropout layer

# 1D convolution layer applied to the embedding output
model.add(Convolution1D(nb_filter=nb_filter,
                        filter_length=filter_length,
                        border_mode='valid',
                        activation='relu',
                        subsample_length=1))
# pooling layer
model.add(MaxPooling1D(pool_length=pool_length))
# LSTM recurrent layer
model.add(LSTM(lstm_output_size))
# fully connected layer with a single neuron; outputs whether the sentiment is positive
model.add(Dense(1))
model.add(Activation('sigmoid'))  # sigmoid to decide the sentiment

model.summary()   # model summary

model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
# Train
print('Train...')
model.fit(X_train, y_train, batch_size=batch_size, nb_epoch=nb_epoch,
          validation_data=(X_test, y_test))

# Evaluate
score, acc = model.evaluate(X_test, y_test, batch_size=batch_size)
print('Test score:', score)
print('Test accuracy:', acc)

Model Construction

Extracting the model-construction part of the code:

model = Sequential()
model.add(Embedding(max_features, embedding_size, input_length=maxlen))  # word embedding layer
model.add(Dropout(0.25))       # dropout layer

# 1D convolution layer applied to the embedding output
model.add(Convolution1D(nb_filter=nb_filter,
                        filter_length=filter_length,
                        border_mode='valid',
                        activation='relu',
                        subsample_length=1))
# pooling layer
model.add(MaxPooling1D(pool_length=pool_length))
# LSTM recurrent layer
model.add(LSTM(lstm_output_size))
# fully connected layer with a single neuron; outputs whether the sentiment is positive
model.add(Dense(1))
model.add(Activation('sigmoid'))  # sigmoid to decide the sentiment

The operations up to and including the convolution are similar to those in the CNN approach and are not repeated here.

The pooling layer is changed from GlobalMaxPooling1D to MaxPooling1D. The difference is that the former pools the whole sequence into a single vector, while the latter still outputs a sequence, only reduced to 1/pool_length of its original length.

Concretely, a [100, 128] sequence (sequence length 100, word embedding dimension 128) becomes [96, 64] after Convolution1D(nb_filter=64, filter_length=5), and then [24, 64] after MaxPooling1D(pool_length=4).
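That shape walk-through can be verified directly; a sketch using the parameter values above:

from keras.models import Sequential
from keras.layers import Embedding, Convolution1D, MaxPooling1D

m = Sequential()
m.add(Embedding(20000, 128, input_length=100))
m.add(Convolution1D(nb_filter=64, filter_length=5,
                    border_mode='valid', activation='relu', subsample_length=1))
print(m.output_shape)  # (None, 96, 64): 100 - 5 + 1 steps
m.add(MaxPooling1D(pool_length=4))
print(m.output_shape)  # (None, 24, 64): the sequence shrinks by a factor of pool_length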

From the LSTM onwards, the remaining steps are the same as in the LSTM approach and are not repeated here.

The overall structure of the model is as follows (note how the dimensions change):

____________________________________________________________________________________________________
Layer (type)                     Output Shape          Param #     Connected to
====================================================================================================
embedding_1 (Embedding)          (None, 100, 128)      2560000     embedding_input_1[0][0]
____________________________________________________________________________________________________
dropout_1 (Dropout)              (None, 100, 128)      0           embedding_1[0][0]
____________________________________________________________________________________________________
convolution1d_1 (Convolution1D)  (None, 96, 64)        41024       dropout_1[0][0]
____________________________________________________________________________________________________
maxpooling1d_1 (MaxPooling1D)    (None, 24, 64)        0           convolution1d_1[0][0]
____________________________________________________________________________________________________
lstm_1 (LSTM)                    (None, 70)            37800       maxpooling1d_1[0][0]
____________________________________________________________________________________________________
dense_1 (Dense)                  (None, 1)             71          lstm_1[0][0]
____________________________________________________________________________________________________
activation_1 (Activation)        (None, 1)             0           dense_1[0][0]
====================================================================================================

Training

The training procedure is not repeated here.

The results are as follows:

Train on 25000 samples, validate on 25000 samples
Epoch 1/2
25000/25000 [==============================] - 45s - loss: 0.3825 - acc: 0.8208 - val_loss: 0.3418 - val_acc: 0.8499
Epoch 2/2
25000/25000 [==============================] - 42s - loss: 0.1969 - acc: 0.9250 - val_loss: 0.3417 - val_acc: 0.8528
Test score: 0.341682006605
Test accuracy: 0.852760059094

After just two epochs, the model reaches an accuracy of 0.85276.
