LSTM-Based Sentiment Analysis of Movie Reviews

阿新 • Published: 2018-11-12

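0. Preface

Because an RNN uses the order of the words in a review, it usually reaches higher accuracy on this task than a feed-forward neural network.
Network architecture:
[Figure: network architecture]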

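First, the words are passed into an embedding layer. An embedding layer is used because the vocabulary is very large, and dense word vectors are a far more efficient representation than one-hot vectors. The embedding could be trained with word2vec, but here it is enough to add the embedding layer and let the network learn the embedding matrix on its own.

From the embedding layer, the new word representations are passed to the LSTM cells. These are recurrently connected, so information about the word sequence is carried from one step to the next. Finally, the LSTM cells feed a sigmoid output layer, which predicts whether the text expresses positive or negative sentiment. The output layer has a single unit with a sigmoid activation.

Only the final sigmoid output matters: the loss is computed only between the output of the last step and the label.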

Files:
(1) reviews.txt is the raw text file with 25,000 entries, one English movie review per line
(2) labels.txt is the label file with 25,000 entries, one label per line, either positive or negative

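1. Data preprocessing

The first step in building any model is always data cleaning. Because we use an embedding layer, the words have to be encoded as integers.

We need to remove punctuation. The reviews are separated by the newline character \n, so we first split the text on \n to get the individual reviews, and then join all reviews back into one large text.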

import numpy as np
import tensorflow as tf

with open('./data/reviews.txt', 'r') as f:
    reviews = f.read()
with open('./data/labels.txt', 'r') as f:
    labels = f.read()

from string import punctuation
# Remove all punctuation
all_text = ''.join([c for c in reviews if c not in punctuation])
print(all_text[:1000])
# Split the text into individual reviews on '\n'
reviews = all_text.split('\n')

# Join the reviews back into one large text
all_text = ' '.join(reviews)
# Split the text into a list of individual words
words = all_text.split()

Example of the processed result:
[Figure: sample of the processed text]
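
The same cleaning steps can be tried on a small in-memory string. This is only a sketch: the two-line sample text is made up for illustration and stands in for reviews.txt.

```python
from string import punctuation

# Hypothetical two-review sample standing in for reviews.txt
raw = "a great, great film!\nnot good... at all\n"

# Drop punctuation characters; letters, spaces, and newlines are kept
all_text = "".join(c for c in raw if c not in punctuation)

# One entry per line; the trailing newline produces a final empty review
reviews = all_text.split("\n")

# Join everything back together and split into individual words
words = " ".join(reviews).split()

print(reviews)
print(words)
```

Note that the trailing newline yields an empty last review, which is why a zero-length review shows up later and has to be removed.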

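2. Encoding the words

The embedding lookup requires the network input to be integers. The simplest approach is to build a dictionary {word: integer} and then convert every review, word by word, to integers before feeding it to the network.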

from collections import Counter
count = Counter(words)

# Sort the words by count, most frequent first
vocab = sorted(count,key=count.get,reverse=True)
# Build the dictionary {word: integer}; ids start at 1 so that 0 is left for padding
vocab_to_int = {word:i for i,word in enumerate(vocab,1)}
# Convert each review to a list of integers; reviews_ints has the same length as reviews
reviews_ints = []
for each in reviews:
    reviews_ints.append([vocab_to_int[word] for word in each.split()])

A note on enumerate:
Passing an integer as the second argument to enumerate makes the index start from that number.

a = {"a":4,"b":3}
for i,e in enumerate(a,1):
    print(i,e)

Output:

1 a
2 b
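
To see the whole encoding step at a glance, here is a small sketch on toy data. The three sample reviews are invented for illustration; the variable names follow the code above.

```python
from collections import Counter

# Hypothetical cleaned reviews (the last one is empty on purpose)
reviews = ["a great great film", "not good at all", ""]
words = " ".join(reviews).split()

counts = Counter(words)
# Most frequent word first; ties keep their first-seen order
vocab = sorted(counts, key=counts.get, reverse=True)
# Ids start at 1 so that 0 stays free for padding
vocab_to_int = {word: i for i, word in enumerate(vocab, 1)}

# One list of ids per review
reviews_ints = [[vocab_to_int[w] for w in r.split()] for r in reviews]
print(vocab_to_int)
print(reviews_ints)
```

The most frequent word gets id 1, and the empty review becomes an empty list.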

3. Encoding the labels

Convert the labels "positive" and "negative" to numbers.

# Convert the labels to numbers: positive == 1, negative == 0
labels = labels.split('\n')
labels = np.array([1 if each=='positive' else 0 for each in labels])

Count the lengths of the reviews after they have been converted to word ids:

from collections import Counter
review_lens = Counter([len(x) for x in reviews_ints])
print("Zero-length reviews: {}".format(review_lens[0]))
print("Maximum review length: {}".format(max(review_lens)))

Output:

Zero-length reviews: 1
Maximum review length: 2514

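All reviews are brought to a common length of 200 words:
1. Reviews shorter than 200 words are padded with 0 on the left
2. Reviews longer than 200 words are cut to their first 200 words

One review has length 0, however, so it is removed before the steps above are applied.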
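
The padding rule can be written out directly in NumPy. This is a minimal sketch of the two rules listed above, not the library routine used in the code that follows; pad_or_truncate is a hypothetical helper name.

```python
import numpy as np

def pad_or_truncate(seqs, seq_len):
    """Left-pad short sequences with 0 and keep the first seq_len ids of long ones."""
    features = np.zeros((len(seqs), seq_len), dtype=int)
    for i, row in enumerate(seqs):
        if len(row) == 0:
            continue  # nothing to copy; the row stays all zeros
        kept = row[:seq_len]              # first seq_len ids
        features[i, -len(kept):] = kept   # right-aligned, zeros on the left
    return features

demo = pad_or_truncate([[5, 6], [1, 2, 3, 4, 5, 6]], seq_len=4)
print(demo)
```

With seq_len=4, the short review [5, 6] becomes [0, 0, 5, 6] and the long one is cut to [1, 2, 3, 4].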

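# Remove zero-length reviews from the reviews_ints list
non_zero_idx = [i for i,review in enumerate(reviews_ints) if len(review)>0]
#len(non_zero_idx)
# Empty reviews are filtered out by index to avoid bugs; other approaches exist but are not discussed here.
reviews_ints = [reviews_ints[i] for i in non_zero_idx]
# Keep labels as a NumPy array so it can be indexed with index arrays later
labels = np.array([labels[i] for i in non_zero_idx])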

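# Use a fixed length of 200 words per review
seq_len = 200
from tensorflow.contrib.keras import preprocessing
# Left-pad short reviews with 0 (padding='pre') and keep the first seq_len words
# of long ones (truncating='post'; the default 'pre' would keep the last words)
features = preprocessing.sequence.pad_sequences(reviews_ints,maxlen=seq_len,padding='pre',truncating='post')
features.shape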

Output:

(25000, 200)

4. Train/test split

20% of the data is used as the test set and 80% as the training set.

from sklearn.model_selection import ShuffleSplit
ss = ShuffleSplit(n_splits=1,test_size=0.2,random_state=0)
for train_index,test_index in ss.split(features):
    train_x = features[train_index]
    train_y = labels[train_index]
    test_x = features[test_index]
    test_y = labels[test_index]

print("\t\t\tFeature Shapes:")
print("Train set: \t\t{}".format(train_x.shape), 
      "\nTrain_Y set: \t{}".format(train_y.shape),
      "\nTest set: \t\t{}".format(test_x.shape))

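The idea behind a shuffled 80/20 split can be sketched with NumPy alone. This is an illustration of the principle, not scikit-learn's implementation; shuffle_split is a hypothetical helper and the seed is arbitrary.

```python
import numpy as np

def shuffle_split(n_samples, test_size=0.2, seed=0):
    """Return (train_idx, test_idx) from one random permutation of the sample indices."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    n_test = int(round(n_samples * test_size))
    return order[n_test:], order[:n_test]

train_idx, test_idx = shuffle_split(10, test_size=0.2)
print(len(train_idx), len(test_idx))
```

Indexing features and labels with the same index arrays keeps every review paired with its label.
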
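5. Build the graph

We now build the model graph. The first step is to define the hyperparameters.

lstm_size: number of units in the hidden LSTM layers. Larger values often help, at the cost of more computation; common choices are 128, 256, 512.
lstm_layers: number of hidden LSTM layers. Start with 1 and increase gradually if the model underfits.
batch_size: number of reviews fed to the network per training step. As long as it fits in memory, larger is generally better.
learning_rate: 0.001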

lstm_size = 256
lstm_layers = 1
batch_size = 128
learning_rate = 0.001

# +1 because word ids start at 1 and id 0 is used for padding
n_words = len(vocab_to_int) + 1

tf.reset_default_graph()
X = tf.placeholder(tf.int32,[None,200],name='inputs')
labels_ = tf.placeholder(tf.int32,[None,1],name='labels')
keep_prob = tf.placeholder(tf.float32,name='keep_prob')

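Embedding
Add the embedding layer. The vocabulary contains roughly 72,000 distinct words, so feeding one-hot encoded words into the network would be very inefficient. Instead, an embedding weight matrix is learned together with the rest of the network.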

# Size of the embedding vectors (i.e. the number of units in the embedding layer)
embed_size = 300 

embedding = tf.Variable(tf.random_uniform((n_words,embed_size),-1,1))
embed = tf.nn.embedding_lookup(embedding,X)
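
An embedding lookup is just row selection from the weight matrix, and it gives the same result as multiplying a one-hot vector by that matrix. The NumPy sketch below shows this on a tiny, made-up vocabulary of 6 ids; the sizes are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_size = 6, 3
embedding = rng.uniform(-1, 1, size=(vocab_size, embed_size))

# A batch of 2 reviews with 3 word ids each
ids = np.array([[0, 2, 5], [1, 1, 4]])

# Lookup: pick one row of the matrix per id
embed = embedding[ids]                    # shape (2, 3, 3)

# Equivalent, but wasteful: one-hot encode and multiply
one_hot = np.eye(vocab_size)[ids]         # shape (2, 3, 6)
embed_via_matmul = one_hot @ embedding    # shape (2, 3, 3)

print(embed.shape, np.allclose(embed, embed_via_matmul))
```

With about 72,000 words, the one-hot route would multiply mostly zeros, which is why the lookup is preferred.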

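6. LSTM cell

Next, create the LSTM cells (see the TensorFlow documentation). First define the type of cell.

A basic LSTM cell can be created with tf.contrib.rnn.BasicLSTMCell. Its documented signature is:

tf.contrib.rnn.BasicLSTMCell(num_units, forget_bias=1.0, input_size=None, state_is_tuple=True, activation=None)
The num_units argument is the number of units in the layer; in our code it is lstm_size. For example:

lstm = tf.contrib.rnn.BasicLSTMCell(num_units)
Next, add dropout to the cell with tf.contrib.rnn.DropoutWrapper. This wraps the cell in another cell and applies dropout to its inputs or outputs:

drop = tf.contrib.rnn.DropoutWrapper(cell, output_keep_prob=keep_prob)
More hidden layers let the network learn more complex relationships and often, though not always, improve the results. Several LSTM layers can be stacked with tf.contrib.rnn.MultiRNNCell:

cell = tf.contrib.rnn.MultiRNNCell([drop] * lstm_layers)
Note that [drop] * lstm_layers repeats one and the same cell object. That is fine for a single layer; with several layers, each layer should get its own cell object.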

def build_cell():
    # Create a basic LSTM cell
    lstm = tf.contrib.rnn.BasicLSTMCell(lstm_size)
    # Add dropout to the cell's output
    return tf.contrib.rnn.DropoutWrapper(lstm,output_keep_prob=keep_prob)

# Stack the LSTM layers, building a separate cell for each layer
cell = tf.contrib.rnn.MultiRNNCell([build_cell() for _ in range(lstm_layers)])

# Initialize the state of all cells to zero
initial_state = cell.zero_state(batch_size,tf.float32)

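To actually run the RNN, use tf.nn.dynamic_rnn. It takes two arguments: the multi-layer LSTM cell and the inputs.
outputs, final_state = tf.nn.dynamic_rnn(cell, inputs, initial_state=initial_state)
We also pass the initial_state defined above to the RNN. This is the cell state that is carried from one time step to the next. tf.nn.dynamic_rnn does most of the work for us and returns the output of every step together with the final state of the hidden layers.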

outputs,final_state = tf.nn.dynamic_rnn(cell=cell,inputs=embed,initial_state=initial_state)
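
What the cell and the unrolling loop compute can be sketched in NumPy. This follows the standard LSTM equations with a forget-gate bias of 1.0; the gate ordering, the weight layout, and the helper names lstm_step and run_lstm are assumptions made for this illustration, not TensorFlow's internals.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, W, b, forget_bias=1.0):
    """One LSTM step. x: (batch, in_size); h and c: (batch, units)."""
    z = np.concatenate([x, h], axis=1) @ W + b   # (batch, 4 * units)
    i, j, f, o = np.split(z, 4, axis=1)          # input, candidate, forget, output
    new_c = c * sigmoid(f + forget_bias) + sigmoid(i) * np.tanh(j)
    new_h = np.tanh(new_c) * sigmoid(o)
    return new_h, new_c

def run_lstm(inputs, W, b, units):
    """Unroll over time. inputs: (batch, time, in_size) -> outputs: (batch, time, units)."""
    batch, time, _ = inputs.shape
    h = np.zeros((batch, units))   # zero initial state, like cell.zero_state
    c = np.zeros((batch, units))
    outputs = []
    for t in range(time):
        h, c = lstm_step(inputs[:, t, :], h, c, W, b)
        outputs.append(h)
    return np.stack(outputs, axis=1), (h, c)

rng = np.random.default_rng(0)
batch, time, in_size, units = 2, 5, 3, 4
W = rng.normal(scale=0.1, size=(in_size + units, 4 * units))
b = np.zeros(4 * units)
outputs, (h, c) = run_lstm(rng.normal(size=(batch, time, in_size)), W, b, units)
print(outputs.shape)
```

The output at the last time step equals the final hidden state h, which is the relationship between the two return values of the unrolled network.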

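7. Output

Only one prediction per review is needed. The code below pools the LSTM outputs by taking the maximum over the time axis and passes the result to a single sigmoid unit; taking only the output of the last step, outputs[:, -1], is a common alternative.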

# Maximum over the time axis (axis 1) for every LSTM unit
max_pool = tf.reduce_max(outputs,axis=1)
predictions = tf.contrib.layers.fully_connected(max_pool, 1, activation_fn=tf.sigmoid)
with tf.name_scope('cost'):
    cost = tf.losses.mean_squared_error(labels_, predictions)
tf.summary.scalar('cost',cost)
optimizer = tf.train.AdamOptimizer(learning_rate).minimize(cost)
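
The output head can be traced in NumPy on random stand-in data. The shapes and the random weights below are assumptions for illustration; in the real graph the weights are learned by the fully connected layer.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
outputs = rng.normal(size=(2, 5, 4))   # stand-in LSTM outputs: (batch, time, units)
labels = np.array([[1], [0]])

pooled = outputs.max(axis=1)           # max over time -> (batch, units)
last = outputs[:, -1, :]               # alternative: last time step only

w = rng.normal(size=(4, 1))            # stand-in weights of the single output unit
bias = 0.0
predictions = sigmoid(pooled @ w + bias)       # (batch, 1), values in (0, 1)
cost = np.mean((labels - predictions) ** 2)    # mean squared error

print(pooled.shape, predictions.shape, cost)
```

Because the predictions lie between 0 and 1 and the labels are 0 or 1, the mean squared error also stays between 0 and 1.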

8. Validation accuracy

with tf.name_scope('accuracy'):
    accuracy = tf.reduce_mean(tf.cast(tf.equal(tf.cast(tf.round(predictions), tf.int32), labels_), tf.float32))
tf.summary.scalar('accuracy',accuracy)
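
The accuracy expression rounds each sigmoid output to 0 or 1 and compares it with the label. A NumPy sketch with four made-up predictions:

```python
import numpy as np

# Hypothetical sigmoid outputs and their true labels
predictions = np.array([[0.9], [0.2], [0.6], [0.4]])
labels = np.array([[1], [0], [0], [0]])

# Round to the nearest class, then take the fraction of matches
predicted_class = np.round(predictions).astype(int)
accuracy = np.mean(predicted_class == labels)
print(accuracy)  # 0.75
```

Three of the four predictions match their label; the third one (0.6 rounded to 1) does not.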

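9. Batching

The function below yields batches from the data set. First, it drops the last incomplete batch so that all batches have the same size. Second, it iterates over the x and y arrays in steps of batch_size and returns the matching slices.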

def get_batches(x, y, batch_size=100):
    n_batches = len(x)//batch_size
    x, y = x[:n_batches*batch_size], y[:n_batches*batch_size]
    for ii in range(0, len(x), batch_size):
        yield x[ii:ii+batch_size], y[ii:ii+batch_size]
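
A quick check on toy arrays shows how the incomplete batch is dropped. The function is repeated here so the snippet runs on its own; the 10-sample arrays are made up.

```python
import numpy as np

def get_batches(x, y, batch_size=100):
    # Keep only as many samples as fill complete batches
    n_batches = len(x) // batch_size
    x, y = x[:n_batches * batch_size], y[:n_batches * batch_size]
    for ii in range(0, len(x), batch_size):
        yield x[ii:ii + batch_size], y[ii:ii + batch_size]

x = np.arange(10).reshape(10, 1)   # 10 samples, 1 feature each
y = np.arange(10)

batches = list(get_batches(x, y, batch_size=4))
print(len(batches))                       # 2
print([by.tolist() for _, by in batches])
```

With 10 samples and a batch size of 4, two full batches are produced and the last 2 samples are left out.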

merged = tf.summary.merge_all()
direc = 'C:\\Users\\1\\Desktop\\summary'
# Log the default graph built above
graph = tf.get_default_graph()
train_writer = tf.summary.FileWriter(direc+'\\train',graph)
test_writer = tf.summary.FileWriter(direc+'\\test',graph)

10. Training

epochs = 6
saver = tf.train.Saver()

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    iteration = 1
    for e in range(epochs):
        
        for ii, (x, y) in enumerate(get_batches(train_x, train_y, batch_size), 1):
            feed = {X: x,
                    labels_: y[:,None],
                    keep_prob:0.6}
            loss, _, summary1 = sess.run([cost, optimizer, merged], feed_dict=feed)
            
            if iteration%5==0:
                train_writer.add_summary(summary1,iteration)
                print("Epoch: {}/{}".format(e+1, epochs),
                      "Iteration: {}".format(iteration),
                      "Train loss: {:.3f}".format(loss))

            if iteration%25==0:
                val_acc = []
                for x, y in get_batches(test_x, test_y, batch_size):
                    feed = {X: x,
                            labels_: y[:,None],
                            keep_prob:1.0}
                    batch_acc, summary2 = sess.run([accuracy, merged], feed_dict=feed)
                    val_acc.append(batch_acc)
                test_writer.add_summary(summary2,iteration)
                print("Val acc: {:.3f}".format(np.mean(val_acc)))
            iteration +=1
    # Make sure the checkpoint directory exists before saving
    tf.gfile.MakeDirs("checkpoints")
    saver.save(sess, "checkpoints/sentiment.ckpt")

11. Results

The console output is omitted here; see the figure instead.
Results on the test set:
[Figure: test-set results]
