Applying CNN and RNN Models to Classification Problems (TensorFlow Implementation)

阿新 • Published: 2018-12-17

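In this article we implement two sentence classification models: a convolutional neural network (CNN) and a recurrent neural network (RNN). Both models achieve good performance on a range of text classification tasks such as sentiment analysis, and because they are simple and easy to implement, they have become common baselines in competitions and in practice.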

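cnn-text-classification-tf is a project, with an accompanying blog post, that uses a CNN for text classification; it has more than 2,000 stars. Reading its source code shows the key steps of structuring a TensorFlow project and helps build good coding habits, which matters a lot for beginners. The original paper is Convolutional Neural Networks for Sentence Classification.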

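rnn-text-classification-tf is my own project, built on the CNN source code, that uses an RNN for text classification and achieves similar classification performance. The rest of this article covers only the CNN and skips the RNN; interested readers can clone the RNN project, run it themselves, and compare the performance of the two code bases.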

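Data processing

The dataset is the Movie Review data from Rotten Tomatoes, one of the datasets used in the original paper. It contains 5,331 positive reviews and 5,331 negative reviews, so the two classes are balanced. The dataset does not come with a train/test split, so we simply hold out 10% of the data as a dev set. Because the dataset is small, the model overfits easily, and 10-fold cross-validation would be a better choice; the GitHub project only crudely splits the data 9:1 into a training set and a validation set.

Steps:

1. Load the two classes of data.
2. Clean the text.
3. Pad every sentence to the maximum sentence length with a special padding token, so that every sentence contains 59 words. Equal lengths allow efficient batching.
4. Build an index over the vocabulary of all words, mapping each word to an integer, so that each sentence is represented by a vector of integers.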
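Steps 3 and 4 can be illustrated without TensorFlow. The following is a minimal sketch in plain Python, not the project's own code: the names `PAD_TOKEN`, `pad_sentences`, `build_vocab`, and `to_ids` are hypothetical, and reserving index 0 for the padding token is an assumption made for this example.

```python
from typing import Dict, List

PAD_TOKEN = "<PAD>"  # hypothetical padding token name, chosen for this sketch


def pad_sentences(sentences: List[List[str]], pad_token: str = PAD_TOKEN) -> List[List[str]]:
    """Pad every tokenized sentence to the length of the longest one."""
    max_len = max(len(s) for s in sentences)
    return [s + [pad_token] * (max_len - len(s)) for s in sentences]


def build_vocab(sentences: List[List[str]], pad_token: str = PAD_TOKEN) -> Dict[str, int]:
    """Map each distinct token to an integer; index 0 is reserved for padding (assumption)."""
    vocab: Dict[str, int] = {pad_token: 0}
    for sentence in sentences:
        for token in sentence:
            if token not in vocab:
                vocab[token] = len(vocab)
    return vocab


def to_ids(sentences: List[List[str]], vocab: Dict[str, int]) -> List[List[int]]:
    """Replace every token by its integer index."""
    return [[vocab[token] for token in sentence] for sentence in sentences]


raw = ["a great movie", "boring", "not a great plot at all"]
tokenized = [text.split(" ") for text in raw]
padded = pad_sentences(tokenized)
vocab = build_vocab(padded)
ids = to_ids(padded, vocab)

for row in ids:
    print(row)
```

Every row has the same length, so the rows can be stacked into one integer matrix and fed to the model in batches.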

    # Imports needed by this excerpt (tf.contrib exists only in TensorFlow 1.x).
    # FLAGS is the project's command-line flags object.
    import numpy as np
    from tensorflow.contrib import learn
    import data_helpers

    # Load data
    print("Loading data...")
    x_text, y = data_helpers.load_data_and_labels(FLAGS.positive_data_file, FLAGS.negative_data_file)

    # Build vocabulary: pad to the longest sentence and map words to integer ids
    max_document_length = max([len(x.split(" ")) for x in x_text])
    vocab_processor = learn.preprocessing.VocabularyProcessor(max_document_length)
    x = np.array(list(vocab_processor.fit_transform(x_text)))

    # Randomly shuffle data (fixed seed for reproducibility)
    np.random.seed(10)
    shuffle_indices = np.random.permutation(np.arange(len(y)))
    x_shuffled = x[shuffle_indices]
    y_shuffled = y[shuffle_indices]

    # Split train/test set
    # TODO: This is very crude, should use cross-validation
    dev_sample_index = -1 * int(FLAGS.dev_sample_percentage * float(len(y)))
    x_train, x_dev = x_shuffled[:dev_sample_index], x_shuffled[dev_sample_index:]
    y_train, y_dev = y_shuffled[:dev_sample_index], y_shuffled[dev_sample_index:]
    print("Vocabulary Size: {:d}".format(len(vocab_processor.vocabulary_)))
    print("Train/Dev split: {:d}/{:d}".format(len(y_train), len(y_dev)))

Model construction

[Figure: model architecture diagram]

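The GitHub project makes some modifications to this model. The first layer embeds words into low-dimensional vectors. The second layer performs convolutions with filters of several sizes (3, 4, and 5 words). The third layer max-pools the outputs of those filters into one long feature vector and applies dropout regularization. The fourth layer classifies the result with softmax.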
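To make the second and third layers concrete, here is a framework-free sketch of what one filter does: slide over the sentence with "VALID" padding, apply ReLU, and keep only the largest activation (max-over-time pooling). With a sentence length of 59 and filter sizes 3, 4, and 5, the convolution yields 57, 56, and 55 positions respectively, and pooling reduces each to a single number per filter. The function names and the toy inputs below are assumptions for illustration, not part of the project.

```python
from typing import Dict, List, Sequence


def conv_relu_maxpool(embedded: Sequence[Sequence[float]],
                      kernel: Sequence[Sequence[float]],
                      bias: float = 0.0) -> float:
    """Apply one filter at every valid position, then ReLU, then take the max."""
    seq_len, filter_size = len(embedded), len(kernel)
    activations = []
    for start in range(seq_len - filter_size + 1):  # "VALID" padding
        total = bias
        for i in range(filter_size):
            for e, w in zip(embedded[start + i], kernel[i]):
                total += e * w
        activations.append(max(0.0, total))  # ReLU
    return max(activations)  # max-over-time pooling


def sentence_features(embedded: Sequence[Sequence[float]],
                      kernels_by_size: Dict[int, List[Sequence[Sequence[float]]]]) -> List[float]:
    """Concatenate the pooled outputs of all filters of all sizes."""
    features = []
    for filter_size in sorted(kernels_by_size):
        for kernel in kernels_by_size[filter_size]:
            features.append(conv_relu_maxpool(embedded, kernel))
    return features


# Toy sentence: 5 words, embedding size 2
embedded = [[1, 0], [0, 1], [1, 1], [0, 0], [2, 0]]
kernels_by_size = {
    2: [[[1, 0], [0, 1]]],          # one filter spanning 2 words
    3: [[[1, 1], [1, 1], [1, 1]]],  # one filter spanning 3 words
}
features = sentence_features(embedded, kernels_by_size)
print(features)  # [2.0, 4.0]
```

The length of the feature vector is the number of filters per size times the number of filter sizes, independent of the sentence length, which is why it can be fed directly into the final classification layer.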

    # Excerpt from the model class constructor (hence the references to self)

    # Embedding layer
    with tf.device('/cpu:0'), tf.name_scope("embedding"):
        self.W = tf.Variable(
            tf.random_uniform([vocab_size, embedding_size], -1.0, 1.0),
            name="W")
        self.embedded_chars = tf.nn.embedding_lookup(self.W, self.input_x)
        self.embedded_chars_expanded = tf.expand_dims(self.embedded_chars, -1)

    # Create a convolution + maxpool layer for each filter size
    pooled_outputs = []
    for i, filter_size in enumerate(filter_sizes):
        with tf.name_scope("conv-maxpool-%s" % filter_size):
            # Convolution Layer
            filter_shape = [filter_size, embedding_size, 1, num_filters]
            W = tf.Variable(tf.truncated_normal(filter_shape, stddev=0.1), name="W")
            b = tf.Variable(tf.constant(0.1, shape=[num_filters]), name="b")
            conv = tf.nn.conv2d(
                self.embedded_chars_expanded,
                W,
                strides=[1, 1, 1, 1],
                padding="VALID",
                name="conv")
            # Apply nonlinearity
            h = tf.nn.relu(tf.nn.bias_add(conv, b), name="relu")
            # Maxpooling over the outputs
            pooled = tf.nn.max_pool(
                h,
                ksize=[1, sequence_length - filter_size + 1, 1, 1],
                strides=[1, 1, 1, 1],
                padding='VALID',
                name="pool")
            pooled_outputs.append(pooled)

    # Combine all the pooled features
    num_filters_total = num_filters * len(filter_sizes)
    self.h_pool = tf.concat(pooled_outputs, 3)
    self.h_pool_flat = tf.reshape(self.h_pool, [-1, num_filters_total])

    # Add dropout
    with tf.name_scope("dropout"):
        self.h_drop = tf.nn.dropout(self.h_pool_flat, self.dropout_keep_prob)

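    # Final (unnormalized) scores and predictions
    # (l2_loss must already be defined before this block)
    with tf.name_scope("output"):
        W = tf.get_variable(
            "W",
            shape=[num_filters_total, num_classes],
            initializer=tf.contrib.layers.xavier_initializer())
        b = tf.Variable(tf.constant(0.1, shape=[num_classes]), name="b")
        l2_loss += tf.nn.l2_loss(W)
        l2_loss += tf.nn.l2_loss(b)
        self.scores = tf.nn.xw_plus_b(self.h_drop, W, b, name="scores")
        self.predictions = tf.argmax(self.scores, 1, name="predictions")

Model training

The training code follows a fixed pattern and becomes easy to write once you are familiar with it.

1. Summaries. TensorFlow provides summaries of many kinds, which make it easy to track and visualize training and evaluation. Summaries are serialized objects that are written to disk through a SummaryWriter.
2. Checkpointing. Checkpoints save the trained parameters so that the best ones can be selected later; they are saved with tf.train.Saver().
3. Variable initialization. sess.run(tf.initialize_all_variables()) initializes all the variables we defined (later TensorFlow 1.x releases deprecate this call in favor of tf.global_variables_initializer()). You can also call the initializer of specific variables manually, for example to load pretrained word vectors.
4. Defining a single training step. Define a function that takes a batch of data, evaluates the model on it, and updates the model parameters. feed_dict holds the data for the placeholders defined in the network; every placeholder must be given a value, otherwise an error is raised. train_op returns no result; it only updates the network parameters.
5. The training loop. Iterate over the data, call train_step on every batch, and periodically print evaluation metrics and save checkpoints.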
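The control flow of step 5 can be sketched independently of TensorFlow. In the sketch below, `batch_iter`, `run_training`, `evaluate_every`, and `checkpoint_every` are illustrative names, and the three callbacks stand in for the real training step, dev-set evaluation, and checkpoint saving; none of this is taken verbatim from the project.

```python
import random
from typing import Callable, Iterator, List, Sequence


def batch_iter(data: Sequence, batch_size: int, num_epochs: int,
               shuffle: bool = True, seed: int = 10) -> Iterator[List]:
    """Yield mini-batches that cover the whole dataset once per epoch."""
    rng = random.Random(seed)
    indices = list(range(len(data)))
    for _ in range(num_epochs):
        if shuffle:
            rng.shuffle(indices)  # new order for every epoch
        for start in range(0, len(indices), batch_size):
            yield [data[i] for i in indices[start:start + batch_size]]


def run_training(data: Sequence, batch_size: int, num_epochs: int,
                 train_step: Callable, dev_step: Callable,
                 save_checkpoint: Callable,
                 evaluate_every: int, checkpoint_every: int) -> int:
    """Call train_step on every batch; evaluate and checkpoint periodically."""
    step = 0
    for batch in batch_iter(data, batch_size, num_epochs):
        train_step(batch)
        step += 1
        if step % evaluate_every == 0:
            dev_step(step)
        if step % checkpoint_every == 0:
            save_checkpoint(step)
    return step


# Stand-in callbacks that only record what happened
seen, evals, checkpoints = [], [], []
total_steps = run_training(
    data=list(range(10)), batch_size=4, num_epochs=2,
    train_step=seen.extend,
    dev_step=evals.append,
    save_checkpoint=checkpoints.append,
    evaluate_every=3, checkpoint_every=4)

print(total_steps, evals, checkpoints)  # 6 [3, 6] [4]
```

In a real run, `train_step` would build a `feed_dict` from the batch and call `sess.run` on the training op, and `save_checkpoint` would call the saver.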

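Training results

The two result plots from the blog post belong here: the upper plot shows how the loss changes and the lower plot shows how the accuracy changes. My own runs matched those plots: accuracy on the test set was between 0.6 and 0.7, which is not very good. The reasons are as follows: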

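1. The training metrics are not smooth, because each batch contains too little data.
2. Accuracy on the training set is much higher than on the test set, which indicates overfitting. Overfitting can be reduced with more data, stronger regularization, or fewer model parameters. For example, adding an L2 penalty on the weights of the last layer raises the accuracy to 76%, close to the original paper.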
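The L2 penalty mentioned in point 2 amounts to adding a scaled sum of squared weights to the classification loss. The sketch below shows the arithmetic in plain Python; `l2_reg_lambda` and the helper names are assumptions for this example, and `l2_loss` mirrors the quantity that `tf.nn.l2_loss` computes (half the sum of squares).

```python
import math
from typing import Sequence


def softmax_cross_entropy(scores: Sequence[float], label: int) -> float:
    """Cross-entropy of softmax(scores) against the true class index."""
    m = max(scores)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_sum - scores[label]


def l2_loss(values: Sequence[float]) -> float:
    """Half the sum of squares, as tf.nn.l2_loss does."""
    return 0.5 * sum(v * v for v in values)


def regularized_loss(batch_scores: Sequence[Sequence[float]],
                     batch_labels: Sequence[int],
                     weights: Sequence[Sequence[float]],
                     biases: Sequence[float],
                     l2_reg_lambda: float) -> float:
    """Mean cross-entropy over the batch plus the L2 penalty on the last layer."""
    data_loss = sum(
        softmax_cross_entropy(s, y) for s, y in zip(batch_scores, batch_labels)
    ) / len(batch_labels)
    flat_weights = [w for row in weights for w in row]
    penalty = l2_loss(flat_weights) + l2_loss(biases)
    return data_loss + l2_reg_lambda * penalty


scores = [[0.0, 0.0]]            # one example, two classes, equal scores
labels = [0]
W = [[1.0, -1.0], [2.0, 0.0]]    # toy last-layer weights
b = [0.0, 0.0]

plain = regularized_loss(scores, labels, W, b, l2_reg_lambda=0.0)
penalized = regularized_loss(scores, labels, W, b, l2_reg_lambda=0.1)
print(plain, penalized)
```

With `l2_reg_lambda` set to 0 the loss is the plain cross-entropy; a larger value pushes the optimizer toward smaller last-layer weights at the cost of a slightly worse fit to the training data.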
