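[Translation] A Complete Walkthrough of Text Classification with TensorFlow + Python
This tutorial builds a neural network model that classifies movie reviews as positive or negative by analyzing the review text. This is a typical binary classification problem, an important and widely applicable kind of machine learning problem.
We will use the IMDB (Internet Movie Database) dataset, which contains the text of 50,000 movie reviews, split into a training set (25,000 reviews) and a test set (25,000 reviews). Both sets are balanced, meaning each contains an equal number of positive and negative reviews.
This tutorial uses tf.keras, a high-level API for building and training models in TensorFlow. For a more advanced text classification tutorial using tf.keras, see the MLCC Text Classification Guide. You can import Keras with the following Python code: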
import tensorflow as tf
from tensorflow import keras
import numpy as np
print(tf.__version__)
Output:
1.11.0
1
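Download the IMDB dataset
The IMDB dataset comes packaged with TensorFlow. It has already been preprocessed: the reviews (sequences of words) have been converted to sequences of integers, where each integer represents a specific word in a dictionary.
The following code downloads the IMDB dataset (or reads it from the local cache if you have already downloaded it):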
imdb = keras.datasets.imdb
(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=10000)
Output:
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb.npz
17465344/17464789 [==============================] - 0s 0us/step
The argument num_words=10000 keeps the 10,000 most frequently occurring words in the dataset. Rare words are discarded to keep the size of the data manageable.
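To make the effect of num_words concrete, here is a minimal pure-Python sketch of capping a vocabulary. It is not how Keras implements load_data; the helper name cap_vocabulary and the toy review are made up for illustration. It assumes that out-of-vocabulary words are mapped to index 2 (the `<UNK>` token used later in this tutorial) rather than removed from the sequence.

```python
def cap_vocabulary(encoded_review, num_words, oov_index=2):
    """Replace every index >= num_words with the out-of-vocabulary index.

    Hypothetical helper for illustration only; oov_index=2 is an assumption.
    """
    return [i if i < num_words else oov_index for i in encoded_review]


toy_review = [1, 14, 22, 12045, 43, 530, 25000]  # made-up word indices
capped = cap_vocabulary(toy_review, num_words=10000)
print(capped)  # [1, 14, 22, 2, 43, 530, 2]
```

Because word indices are ranked by frequency, a smaller num_words keeps only the most common words and shrinks the embedding table that the model has to learn.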
2
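Explore the data
Let's take a moment to understand the format of the data. After preprocessing, each review is an array of integers that stand in for the words of the original review. Each review also has a label, an integer value of either 0 or 1, where 0 is a negative review and 1 is a positive review.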
print("Training entries: {}, labels: {}".format(len(train_data), len(train_labels)))
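Output:
Training entries: 25000, labels: 25000
The review text has been converted to arrays of integers, where each integer represents a specific word in a dictionary. Here is what the first review looks like after conversion: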
print(train_data[0])
Output:
[1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65, 458, 4468, 66, 3941, 4, 173, 36, 256, 5, 25, 100, 43, 838, 112, 50, 670, 2, 9, 35, 480, 284, 5, 150, 4, 172, 112, 167, 2, 336, 385, 39, 4, 172, 4536, 1111, 17, 546, 38, 13, 447, 4, 192, 50, 16, 6, 147, 2025, 19, 14, 22, 4, 1920, 4613, 469, 4, 22, 71, 87, 12, 16, 43, 530, 38, 76, 15, 13, 1247, 4, 22, 17, 515, 17, 12, 16, 626, 18, 2, 5, 62, 386, 12, 8, 316, 8, 106, 5, 4, 2223, 5244, 16, 480, 66, 3785, 33, 4, 130, 12, 16, 38, 619, 5, 25, 124, 51, 36, 135, 48, 25, 1415, 33, 6, 22, 12, 215, 28, 77, 52, 5, 14, 407, 16, 82, 2, 8, 4, 107, 117, 5952, 15, 256, 4, 2, 7, 3766, 5, 723, 36, 71, 43, 530, 476, 26, 400, 317, 46, 7, 4, 2, 1029, 13, 104, 88, 4, 381, 15, 297, 98, 32, 2071, 56, 26, 141, 6, 194, 7486, 18, 4, 226, 22, 21, 134, 476, 26, 480, 5, 144, 30, 5535, 18, 51, 36, 28, 224, 92, 25, 104, 4, 226, 65, 16, 38, 1334, 88, 12, 16, 283, 5, 16, 4472, 113, 103, 32, 15, 16, 5345, 19, 178, 32]
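Movie reviews may differ in length, but the inputs to a neural network must all be the same length, so we will need to resolve this later. The code below shows the number of words in the first and second reviews: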
len(train_data[0]), len(train_data[1])
Output:
(218, 189)
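Convert the integers back to words:
It may be useful to know how to convert integers back to text. In the code below, we create a helper function that queries a dictionary object containing the integer-to-string mapping: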
# A dictionary mapping words to an integer index
word_index = imdb.get_word_index()
# The first indices are reserved
word_index = {k:(v+3) for k,v in word_index.items()}
word_index["<PAD>"] = 0
word_index["<START>"] = 1
word_index["<UNK>"] = 2  # unknown
word_index["<UNUSED>"] = 3
# Invert the mapping so that integers can be looked up as words
reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])
def decode_review(text):
    return ' '.join([reverse_word_index.get(i, '?') for i in text])
Output:
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb_word_index.json
1646592/1641221 [==============================] - 0s 0us/step
Now we can use the decode_review function to display the decoded text of the first review:
decode_review(train_data[0])
Output:
"this film was just brilliant casting location scenery story direction everyone's really suited the part they played and you could just imagine being there robert is an amazing actor and now the same being director father came from the same scottish island as myself so i loved the fact there was a real connection with this film the witty remarks throughout the film were great it was just brilliant so much that i bought the film as soon as it was released for and would recommend it to everyone to watch and the fly fishing was amazing really cried at the end it was so sad and you know what they say if you cry at a film it must have been good and this definitely was also to the two little boy's that played the of norman and paul they were just brilliant children are often left out of the list i think because the stars that play them all grown up are such a big profile for the whole film but these children are amazing and should be praised for what they have done don't you think the whole story was so lovely because it was true and was someone's life after all that was shared with us all"
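The index shifting used by decode_review can be tried without downloading anything. The following sketch uses a four-word toy vocabulary (an assumption for illustration, standing in for the result of imdb.get_word_index()) and shows how the reserved indices 0 to 3 and the shifted word indices decode.

```python
# Toy stand-in for imdb.get_word_index(); the real mapping is far larger.
toy_word_index = {"the": 1, "film": 2, "was": 3, "brilliant": 4}

# Shift every index by 3 so that 0..3 are free for the reserved tokens.
shifted = {word: idx + 3 for word, idx in toy_word_index.items()}
shifted.update({"<PAD>": 0, "<START>": 1, "<UNK>": 2, "<UNUSED>": 3})

# Invert the mapping: integer -> word.
reverse_index = {idx: word for word, idx in shifted.items()}


def decode(encoded):
    """Join the words for each index, using '?' for indices with no entry."""
    return " ".join(reverse_index.get(i, "?") for i in encoded)


decoded = decode([1, 4, 5, 6, 7, 2, 0, 0])
print(decoded)  # <START> the film was brilliant <UNK> <PAD> <PAD>
```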
3
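Prepare the data
Before being fed into the neural network, the reviews (arrays of integers) must be converted to tensors. This conversion can be done in two ways:
- Method 1: One-hot encode the arrays to convert them into vectors of 0s and 1s. For example, the sequence [3, 5] would become a 10,000-dimensional vector that is all zeros except for indices 3 and 5, which are ones. Then make the first layer in our network a fully connected (Dense) layer, which can handle floating-point vector data. This approach is memory intensive, though, requiring a matrix of size num_words * num_reviews.
- Method 2: Pad the arrays so that they all have the same length, then create an integer tensor of shape max_length * num_reviews. We can use an embedding layer capable of handling this shape as the first layer in our network.
In this tutorial, we use the second approach.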
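To illustrate the trade-off between the two approaches, here is a small sketch of the first one together with a rough memory estimate. The helper multi_hot is hypothetical, the vector has 10 dimensions instead of 10,000 to stay readable, and the estimate assumes 4-byte entries (float32 for the one-hot matrix, int32 for the padded tensor) and a padded length of 256, the value used later in this tutorial.

```python
def multi_hot(sequence, dimension):
    """Return a list of 0.0/1.0 with 1.0 at every index present in sequence."""
    vector = [0.0] * dimension
    for index in sequence:
        vector[index] = 1.0
    return vector


vec = multi_hot([3, 5], dimension=10)
print(vec)  # ones at positions 3 and 5, zeros elsewhere

# Rough size of each representation, assuming 4 bytes per entry.
num_words, num_reviews, max_length, bytes_per_entry = 10000, 25000, 256, 4
one_hot_bytes = num_words * num_reviews * bytes_per_entry
padded_bytes = max_length * num_reviews * bytes_per_entry
print(one_hot_bytes / 1e9, "GB for the one-hot matrix")    # 1.0 GB
print(padded_bytes / 1e6, "MB for the padded int tensor")  # 25.6 MB
```

Under these assumptions the padded integer tensor is roughly 40 times smaller, which is one reason to prefer padding plus an embedding layer here.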
Since the movie reviews must all be the same length, we use the pad_sequences function to standardize the lengths:
train_data = keras.preprocessing.sequence.pad_sequences(train_data,
                                                        value=word_index["<PAD>"],
                                                        padding='post',
                                                        maxlen=256)
test_data = keras.preprocessing.sequence.pad_sequences(test_data,
                                                       value=word_index["<PAD>"],
                                                       padding='post',
                                                       maxlen=256)
Let's look at the lengths of the reviews now:
len(train_data[0]), len(train_data[1])
Output:
(256, 256)
Inspect the first review after padding:
print(train_data[0])
Output:
[   1   14   22   16   43  530  973 1622 1385   65  458 4468   66 3941
    4  173   36  256    5   25  100   43  838  112   50  670    2    9
   35  480  284    5  150    4  172  112  167    2  336  385   39    4
  172 4536 1111   17  546   38   13  447    4  192   50   16    6  147
 2025   19   14   22    4 1920 4613  469    4   22   71   87   12   16
   43  530   38   76   15   13 1247    4   22   17  515   17   12   16
  626   18    2    5   62  386   12    8  316    8  106    5    4 2223
 5244   16  480   66 3785   33    4  130   12   16   38  619    5   25
  124   51   36  135   48   25 1415   33    6   22   12  215   28   77
   52    5   14  407   16   82    2    8    4  107  117 5952   15  256
    4    2    7 3766    5  723   36   71   43  530  476   26  400  317
   46    7    4    2 1029   13  104   88    4  381   15  297   98   32
 2071   56   26  141    6  194 7486   18    4  226   22   21  134  476
   26  480    5  144   30 5535   18   51   36   28  224   92   25  104
    4  226   65   16   38 1334   88   12   16  283    5   16 4472  113
  103   32   15   16 5345   19  178   32    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0]
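For readers who want to see what the padding step does without TensorFlow, here is a minimal pure-Python sketch. The function pad_post is a hypothetical stand-in, not the Keras implementation. It pads at the end, as padding='post' does above, and for sequences longer than maxlen it keeps the last maxlen items, which is meant to mirror the pad_sequences default of truncating='pre'.

```python
def pad_post(sequence, maxlen, value=0):
    """Pad at the end with `value`; keep the last `maxlen` items if too long."""
    truncated = sequence[-maxlen:]  # drops items from the front when too long
    return truncated + [value] * (maxlen - len(truncated))


short = pad_post([1, 14, 22], maxlen=5)
long = pad_post([1, 14, 22, 16, 43, 530, 973], maxlen=5)
print(short)  # [1, 14, 22, 0, 0]
print(long)   # [22, 16, 43, 530, 973]
```

In the padded review shown above, the 218 original word indices are followed by 38 zeros, giving the fixed length of 256.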
4
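Build the model
A neural network is created by stacking layers, which requires two main architectural decisions:
- How many layers should the model use?
- How many hidden units should each layer use?
In this example, the input data consists of arrays of word indices, and the labels to predict are either 0 or 1. Let's build a model for this problem: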
# input shape is the vocabulary count used for the movie reviews (10,000 words)
vocab_size = 10000
model = keras.Sequential()
model.add(keras.layers.Embedding(vocab_size, 16))
model.add(keras.layers.GlobalAveragePooling1D())
model.add(keras.layers.Dense(16, activation=tf.nn.relu))
model.add(keras.layers.Dense(1, activation=tf.nn.sigmoid))
model.summary()
Output:
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
embedding (Embedding)        (None, None, 16)          160000
_________________________________________________________________
global_average_pooling1d (Gl (None, 16)                0
_________________________________________________________________
dense (Dense)                (None, 16)                272
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 17
=================================================================
Total params: 160,289
Trainable params: 160,289
Non-trainable params: 0
_________________________________________________________________
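The parameter counts in the summary can be reproduced by hand. The short calculation below is only a worked check of the arithmetic, using the layer sizes from the model definition above; the variable names are chosen for illustration.

```python
vocab_size, embedding_dim, hidden_units = 10000, 16, 16

embedding_params = vocab_size * embedding_dim                # one 16-d vector per word
pooling_params = 0                                           # averaging has no weights
dense_params = embedding_dim * hidden_units + hidden_units   # weights + biases
output_params = hidden_units * 1 + 1                         # weights + one bias

total = embedding_params + pooling_params + dense_params + output_params
print(embedding_params, dense_params, output_params, total)  # 160000 272 17 160289
```

Almost all of the parameters sit in the embedding table, so the vocabulary size is the main driver of model size here.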
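The following four layers are stacked sequentially to build the classifier:
- The first layer is an Embedding layer. It takes the integer-encoded vocabulary and looks up the embedding vector for each word index. These vectors are learned as the model trains. The vectors add a dimension to the output array; the resulting dimensions are (batch, sequence, embedding).
- Next, a GlobalAveragePooling1D layer returns a fixed-length output vector for each review by averaging over the sequence dimension. This lets the model handle variable-length input in the simplest way possible.
- This fixed-length output vector is passed through a fully connected (Dense) layer with 16 hidden units.
- The last layer is densely connected to a single output node. With the sigmoid activation function, the output is a float between 0 and 1 that represents a probability or confidence level.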
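The data flow through the four layers can be sketched in plain Python. This is an illustration of the idea only, not the Keras implementation: the sizes are deliberately tiny (they are not the tutorial's 10,000 x 16), the weights are random rather than trained, and every name below is made up for the example.

```python
import math
import random

random.seed(0)
vocab_size, embedding_dim, hidden_units = 20, 4, 3  # toy sizes


def rand_matrix(rows, cols):
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


embedding_table = rand_matrix(vocab_size, embedding_dim)
w_hidden, b_hidden = rand_matrix(embedding_dim, hidden_units), [0.0] * hidden_units
w_out, b_out = rand_matrix(hidden_units, 1), [0.0]


def dense(x, weights, bias):
    """Fully connected layer: x @ weights + bias."""
    return [sum(x[i] * weights[i][j] for i in range(len(x))) + bias[j]
            for j in range(len(bias))]


def predict(review):
    embedded = [embedding_table[i] for i in review]               # (sequence, embedding)
    pooled = [sum(col) / len(review) for col in zip(*embedded)]   # (embedding,)
    hidden = [max(0.0, h) for h in dense(pooled, w_hidden, b_hidden)]  # ReLU
    logit = dense(hidden, w_out, b_out)[0]
    return 1.0 / (1.0 + math.exp(-logit))                         # sigmoid


# Reviews of different lengths both yield a single value between 0 and 1.
p_short = predict([1, 4, 7])
p_long = predict([1, 4, 7, 9, 12, 3, 0, 0])
print(p_short, p_long)
```

Note that the averaging step is what removes the sequence dimension, so the dense layers always receive a vector of the same size regardless of review length.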
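Hidden units:
The above model has two intermediate or "hidden" layers between the input and the output. The number of outputs (units, nodes, or neurons) is the dimension of the layer's representational space. In other words, it is the amount of freedom the network is allowed when learning an internal representation.
If a model has more hidden units (a higher-dimensional representation space) and/or more layers, the network can learn more complex representations. However, this makes the network more computationally expensive and may lead it to learn unwanted patterns: patterns that improve performance on the training data but not on the test data. This is called overfitting, and we will explore it later.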
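Loss function and optimizer:
A model needs a loss function and an optimizer for training. Since this is a binary classification problem and the model outputs a probability (a single-unit layer with a sigmoid activation), we will use the binary_crossentropy loss function.
This is not the only choice of loss function; you could, for example, choose mean_squared_error. In general, though, binary_crossentropy is better suited to probabilities: it measures the "distance" between probability distributions or, in our case, between the ground-truth distribution and the predictions.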
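As a rough illustration of the difference between the two losses, the sketch below evaluates both for a single positive example (label 1) at a few predicted probabilities. The function names are chosen for this example; Keras additionally clips probabilities for numerical stability, which is omitted here.

```python
import math


def binary_crossentropy(y_true, p_pred):
    """Cross-entropy for one example with label 0 or 1 and 0 < p_pred < 1."""
    return -(y_true * math.log(p_pred) + (1 - y_true) * math.log(1 - p_pred))


def squared_error(y_true, p_pred):
    return (y_true - p_pred) ** 2


for p in (0.9, 0.5, 0.01):
    print(p, round(binary_crossentropy(1, p), 4), round(squared_error(1, p), 4))
# 0.9   0.1054  0.01
# 0.5   0.6931  0.25
# 0.01  4.6052  0.9801
```

Cross-entropy grows without bound as a confident prediction turns out wrong, while the squared error can never exceed 1. A prediction of 0.5 gives a cross-entropy of about 0.69, which is close to the loss reported for the first epoch in the training log further down.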
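Later, when we explore regression problems (such as predicting the price of a house), we will see how to use another loss function called mean squared error.
Now, configure the model with an optimizer and a loss function: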
model.compile(optimizer=tf.train.AdamOptimizer(),
              loss='binary_crossentropy',
              metrics=['accuracy'])
5
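Create a validation set
While training, we want to check the model's accuracy on data it has not seen before. We therefore create a validation set by setting apart 10,000 reviews from the original training data. (Why not use the test set now? Our goal is to develop and tune the model using only the training data, and then use the test data just once to evaluate its accuracy.)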
x_val = train_data[:10000]
partial_x_train = train_data[10000:]
y_val = train_labels[:10000]
partial_y_train = train_labels[10000:]
6
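Train the model
This tutorial trains the model with mini-batch gradient descent: each mini-batch contains 512 samples (reviews), and the model is trained for 40 epochs. That means 40 passes over all samples in the partial_x_train and partial_y_train tensors. During training, the model's loss and accuracy on the validation set (10,000 samples) are recorded as well.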
history = model.fit(partial_x_train,
                    partial_y_train,
                    epochs=40,
                    batch_size=512,
                    validation_data=(x_val, y_val),
                    verbose=1)
Output:
Train on 15000 samples, validate on 10000 samples
Epoch 1/40
15000/15000 [==============================] - 1s 57us/step - loss: 0.6914 - acc: 0.5662 - val_loss: 0.6886 - val_acc: 0.6416
Epoch 2/40
15000/15000 [==============================] - 1s 41us/step - loss: 0.6841 - acc: 0.7016 - val_loss: 0.6792 - val_acc: 0.6751
Epoch 3/40
15000/15000 [==============================] - 1s 41us/step - loss: 0.6706 - acc: 0.7347 - val_loss: 0.6627 - val_acc: 0.7228
Epoch 4/40
15000/15000 [==============================] - 1s 41us/step - loss: 0.6481 - acc: 0.7403 - val_loss: 0.6376 - val_acc: 0.7774
Epoch 5/40
15000/15000 [==============================] - 1s 40us/step - loss: 0.6150 - acc: 0.7941 - val_loss: 0.6017 - val_acc: 0.7862
Epoch 6/40
15000/15000 [==============================] - 1s 42us/step - loss: 0.5719 - acc: 0.8171 - val_loss: 0.5596 - val_acc: 0.7996
Epoch 7/40
15000/15000 [==============================] - 1s 43us/step - loss: 0.5230 - acc: 0.8400 - val_loss: 0.5145 - val_acc: 0.8266
Epoch 8/40
15000/15000 [==============================] - 1s 41us/step - loss: 0.4738 - acc: 0.8559 - val_loss: 0.4717 - val_acc: 0.8407
Epoch 9/40
15000/15000 [==============================] - 1s 41us/step - loss: 0.4288 - acc: 0.8671 - val_loss: 0.4343 - val_acc: 0.8500
Epoch 10/40
15000/15000 [==============================] - 1s 42us/step - loss: 0.3889 - acc: 0.8794 - val_loss: 0.4034 - val_acc: 0.8558
Epoch 11/40
15000/15000 [==============================] - 1s 43us/step - loss: 0.3558 - acc: 0.8875 - val_loss: 0.3805 - val_acc: 0.8607
Epoch 12/40
15000/15000 [==============================] - 1s 41us/step - loss: 0.3285 - acc: 0.8942 - val_loss: 0.3585 - val_acc: 0.8675
Epoch 13/40
15000/15000 [==============================] - 1s 42us/step - loss: 0.3039 - acc: 0.9001 - val_loss: 0.3432 - val_acc: 0.8707
Epoch 14/40
15000/15000 [==============================] - 1s 42us/step - loss: 0.2836 - acc: 0.9056 - val_loss: 0.3299 - val_acc: 0.8739
Epoch 15/40
15000/15000 [==============================] - 1s 42us/step - loss: 0.2661 - acc: 0.9102 - val_loss: 0.3197 - val_acc: 0.8766
Epoch 16/40
15000/15000 [==============================] - 1s 42us/step - loss: 0.2512 - acc: 0.9145 - val_loss: 0.3114 - val_acc: 0.8780
Epoch 17/40
15000/15000 [==============================] - 1s 39us/step - loss: 0.2368 - acc: 0.9196 - val_loss: 0.3046 - val_acc: 0.8800
Epoch 18/40
15000/15000 [==============================] - 1s 43us/step - loss: 0.2244 - acc: 0.9235 - val_loss: 0.2991 - val_acc: 0.8820
Epoch 19/40
15000/15000 [==============================] - 1s 44us/step - loss: 0.2129 - acc: 0.9279 - val_loss: 0.2950 - val_acc: 0.8825
Epoch 20/40
15000/15000 [==============================] - 1s 42us/step - loss: 0.2027 - acc: 0.9313 - val_loss: 0.2912 - val_acc: 0.8826
Epoch 21/40
15000/15000 [==============================] - 1s 41us/step - loss: 0.1929 - acc: 0.9357 - val_loss: 0.2884 - val_acc: 0.8836
Epoch 22/40
15000/15000 [==============================] - 1s 41us/step - loss: 0.1840 - acc: 0.9394 - val_loss: 0.2868 - val_acc: 0.8843
Epoch 23/40
15000/15000 [==============================] - 1s 40us/step - loss: 0.1758 - acc: 0.9429 - val_loss: 0.2856 - val_acc: 0.8840
Epoch 24/40
15000/15000 [==============================] - 1s 41us/step - loss: 0.1677 - acc: 0.9475 - val_loss: 0.2842 - val_acc: 0.8850
Epoch 25/40
15000/15000 [==============================] - 1s 41us/step - loss: 0.1606 - acc: 0.9503 - val_loss: 0.2838 - val_acc: 0.8847
Epoch 26/40
15000/15000 [==============================] - 1s 42us/step - loss: 0.1535 - acc: 0.9526 - val_loss: 0.2839 - val_acc: 0.8853
Epoch 27/40
15000/15000 [==============================] - 1s 43us/step - loss: 0.1475 - acc: 0.9547 - val_loss: 0.2851 - val_acc: 0.8841
Epoch 28/40
15000/15000 [==============================] - 1s 42us/step - loss: 0.1414 - acc: 0.9571 - val_loss: 0.2848 - val_acc: 0.8862
Epoch 29/40
15000/15000 [==============================] - 1s 39us/step - loss: 0.1356 - acc: 0.9585 - val_loss: 0.2859 - val_acc: 0.8860
Epoch 30/40
15000/15000 [==============================] - 1s 41us/step - loss: 0.1307 - acc: 0.9617 - val_loss: 0.2877 - val_acc: 0.8864
Epoch 31/40
15000/15000 [==============================] - 1s 41us/step - loss: 0.1248 - acc: 0.9645 - val_loss: 0.2893 - val_acc: 0.8856
Epoch 32/40
15000/15000 [==============================] - 1s 41us/step - loss: 0.1202 - acc: 0.9660 - val_loss: 0.2916 - val_acc: 0.8844
Epoch 33/40
15000/15000 [==============================] - 1s 41us/step - loss: 0.1149 - acc: 0.9685 - val_loss: 0.2936 - val_acc: 0.8853
Epoch 34/40
15000/15000 [==============================] - 1s 41us/step - loss: 0.1107 - acc: 0.9695 - val_loss: 0.2971 - val_acc: 0.8845
Epoch 35/40
15000/15000 [==============================] - 1s 42us/step - loss: 0.1069 - acc: 0.9707 - val_loss: 0.2987 - val_acc: 0.8854
Epoch 36/40
15000/15000 [==============================] - 1s 41us/step - loss: 0.1021 - acc: 0.9731 - val_loss: 0.3019 - val_acc: 0.8842
Epoch 37/40
15000/15000 [==============================] - 1s 43us/step - loss: 0.0984 - acc: 0.9747 - val_loss: 0.3050 - val_acc: 0.8833
Epoch 38/40
15000/15000 [==============================] - 1s 42us/step - loss: 0.0951 - acc: 0.9753 - val_loss: 0.3089 - val_acc: 0.8826
Epoch 39/40
15000/15000 [==============================] - 1s 43us/step - loss: 0.0911 - acc: 0.9773 - val_loss: 0.3111 - val_acc: 0.8829
Epoch 40/40
15000/15000 [==============================] - 1s 41us/step - loss: 0.0876 - acc: 0.9795 - val_loss: 0.3149 - val_acc: 0.8829
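As a quick check on the training setup, the number of weight updates can be worked out from the sample count, batch size, and epoch count used above. This is plain arithmetic and assumes that the final, smaller batch of each epoch is still used for an update.

```python
import math

train_samples, batch_size, epochs = 15000, 512, 40

steps_per_epoch = math.ceil(train_samples / batch_size)  # the last batch is smaller
last_batch_size = train_samples - (steps_per_epoch - 1) * batch_size
total_updates = steps_per_epoch * epochs

print(steps_per_epoch, last_batch_size, total_updates)  # 30 152 1200
```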
7
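Evaluate the model
We check how the model performs on the test set. The evaluation returns two values: the loss (a number representing our error; lower is better) and the accuracy.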
results = model.evaluate(test_data, test_labels)
print(results)
Output:
25000/25000 [==============================] - 1s 36us/step
[0.33615295355796815, 0.87196]
This fairly simple approach achieves an accuracy of about 87%. With more advanced approaches, the model's accuracy should get closer to 95%.
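The accuracy figure is the fraction of reviews whose predicted class matches the label. The sketch below shows that computation on made-up sigmoid outputs; the helper name and the data are hypothetical, and it assumes a decision threshold of 0.5.

```python
def accuracy(probabilities, labels, threshold=0.5):
    """Fraction of examples whose thresholded prediction equals the label."""
    predictions = [1 if p > threshold else 0 for p in probabilities]
    correct = sum(int(pred == label) for pred, label in zip(predictions, labels))
    return correct / len(labels)


toy_probs = [0.91, 0.12, 0.55, 0.40, 0.78]  # made-up sigmoid outputs
toy_labels = [1, 0, 0, 1, 1]
acc_value = accuracy(toy_probs, toy_labels)
print(acc_value)  # 0.6
```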
8
Create a graph of accuracy and loss over time
model.fit() returns a History object that contains a dictionary recording everything that happened during training.
history_dict = history.history
history_dict.keys()
Output:
dict_keys(['acc', 'val_loss', 'loss', 'val_acc'])
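There are four entries, one for each metric monitored during training and validation. We can use them to plot the training and validation loss, as well as the training and validation accuracy, for comparison.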
import matplotlib.pyplot as plt
acc = history.history['acc']
val_acc = history.history['val_acc']
loss = history.history['loss']
val_loss = history.history['val_loss']
epochs = range(1, len(acc) + 1)
# "bo" is for "blue dot"
plt.plot(epochs, loss, 'bo', label='Training loss')
# b is for "solid blue line"
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.show()
Output: [Figure: Training and validation loss]
plt.clf()  # clear figure
acc_values = history_dict['acc']
val_acc_values = history_dict['val_acc']
plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()
plt.show()
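Output: [Figure: Training and validation accuracy]
In the two plots above, the dots represent the training loss and accuracy, and the solid lines represent the validation loss and accuracy.
The training loss decreases with each epoch and the training accuracy increases with each epoch. This is expected when using gradient descent optimization, which should minimize the desired quantity on every iteration.
The validation loss and accuracy behave differently: they seem to peak after about twenty epochs. This is an example of overfitting: the model performs better on the training data than on data it has never seen before. Beyond this point, the model over-optimizes and learns representations specific to the training data that do not generalize to the test data.
In this particular case, we could prevent overfitting by simply stopping the training after twenty or so epochs. A later tutorial shows how to do this automatically with a callback.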
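The idea of stopping once the validation loss no longer improves can be sketched in a few lines of plain Python. Keras ships an EarlyStopping callback for this purpose; the function below only imitates the general patience rule and is not that callback's implementation. The function name and the patience value of 3 are assumptions made for the example.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (stop_epoch, best_epoch), both 1-based, for a patience rule."""
    best_loss, best_epoch, waited = float("inf"), 0, 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                return epoch, best_epoch
    return len(val_losses), best_epoch


# Validation losses for epochs 20 to 30, copied from the training log above.
val_losses = [0.2912, 0.2884, 0.2868, 0.2856, 0.2842, 0.2838,
              0.2839, 0.2851, 0.2848, 0.2859, 0.2877]
stop, best = early_stopping_epoch(val_losses, patience=3)
print(stop + 19, best + 19)  # 28 25
```

With these numbers the validation loss is lowest at epoch 25, and a patience of 3 would end training at epoch 28 instead of running all 40 epochs.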
#@title MIT License
#
# Copyright (c) 2017 François Chollet
#
# Permission is hereby granted, free of charge, to any person obtaining a
# copy of this software and associated documentation files (the "Software"),
# to deal in the Software without restriction, including without limitation
# the rights to use, copy, modify, merge, publish, distribute, sublicense,
# and/or sell copies of the Software, and to permit persons to whom the
# Software is furnished to do so, subject to the following conditions:
#
# The above copyright notice and this permission notice shall be included in
# all copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
# THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
# FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
# DEALINGS IN THE SOFTWARE.
Original title: Text classification with movie reviews
Original URL: https://www.tensorflow.org/tutorials/keras/basic_text_classification
Translation, proofreading, and layout: 李雪明, 朝樂門