TensorFlow-RNN Recurrent Neural Network Example 2: Text Sentiment Analysis

Step 1 Data Preparation

import numpy as np
# Read the data
with open('reviews.txt', 'r') as f:
    reviews = f.read()
with open('labels.txt', 'r') as f:
    labels = f.read()
# Each \n marks the end of one review
reviews[:2000]
'bromwell high is a cartoon comedy . it ran at the same time as some other programs about school life  such as  teachers  . my   years in the teaching profession lead me to believe that bromwell high  s satire is much closer to reality than is  teachers  . the scramble to survive financially  the insightful students who can see right through their pathetic teachers  pomp  the pettiness of the whole situation  all remind me of the schools i knew and their students . when i saw the episode in which a student repeatedly tried to burn down the school  i immediately recalled . . . . . . . . . at . . . . . . . . . . high . a classic line inspector i  m here to sack one of your teachers . student welcome to bromwell high . i expect that many adults of my age think that bromwell high is far fetched . what a pity that it isn  t   \nstory of a man who has unnatural feelings for a pig . starts out with a opening scene that is a terrific example of absurd comedy . a formal orchestra audience is turned into an insane  violent mob by the crazy chantings of it  s singers . unfortunately it stays absurd the whole time with no general narrative eventually making it just too off putting . even those from the era should be turned off . the cryptic dialogue would make shakespeare seem easy to a third grader . on a technical level it  s better than you might think with some good cinematography by future great vilmos zsigmond . future stars sally kirkland and frederic forrest can be seen briefly .  \nhomelessness  or houselessness as george carlin stated  has been an issue for years but never a plan to help those on the street that were once considered human who did everything from going to school  work  or vote for the matter . most people think of the homeless as just a lost cause while worrying about things such as racism  the war on iraq  pressuring kids to succeed  technology  the elections  inflation  or worrying if they  ll be next to end up on the streets .  br    br   but what if y'
from string import punctuation

# Remove punctuation
all_text = ''.join([c for c in reviews if c not in punctuation])
# Each \n marks the end of one review
reviews = all_text.split('\n')

all_text = ' '.join(reviews)
# Get the list of all words
words = all_text.split()
all_text[:2000]
'bromwell high is a cartoon comedy  it ran at the same time as some other programs about school life  such as  teachers   my   years in the teaching profession lead me to believe that bromwell high  s satire is much closer to reality than is  teachers   the scramble to survive financially  the insightful students who can see right through their pathetic teachers  pomp  the pettiness of the whole situation  all remind me of the schools i knew and their students  when i saw the episode in which a student repeatedly tried to burn down the school  i immediately recalled          at           high  a classic line inspector i  m here to sack one of your teachers  student welcome to bromwell high  i expect that many adults of my age think that bromwell high is far fetched  what a pity that it isn  t    story of a man who has unnatural feelings for a pig  starts out with a opening scene that is a terrific example of absurd comedy  a formal orchestra audience is turned into an insane  violent mob by the crazy chantings of it  s singers  unfortunately it stays absurd the whole time with no general narrative eventually making it just too off putting  even those from the era should be turned off  the cryptic dialogue would make shakespeare seem easy to a third grader  on a technical level it  s better than you might think with some good cinematography by future great vilmos zsigmond  future stars sally kirkland and frederic forrest can be seen briefly    homelessness  or houselessness as george carlin stated  has been an issue for years but never a plan to help those on the street that were once considered human who did everything from going to school  work  or vote for the matter  most people think of the homeless as just a lost cause while worrying about things such as racism  the war on iraq  pressuring kids to succeed  technology  the elections  inflation  or worrying if they  ll be next to end up on the streets   br    br   but what if you were given a bet to live on the st'
words[:100]
['bromwell',
 'high',
 'is',
 'a',
 'cartoon',
 'comedy',
 'it',
 'ran',
 'at',
 'the',
 'same',
 'time',
 'as',
 'some',
 'other',
 'programs',
 'about',
 .....
 .....
 'at',
 'high']

Step 2 Convert the Text to Integers

  • Neural networks cannot work with strings directly, so the strings must be converted to numbers
  • Concretely, each word is assigned an integer index
  • At the same time, to prepare for training, every review in the training data is converted from a string into a sequence of integers
from collections import Counter
def get_vocab_to_int(words):

    # Count how many times each word occurs
    counts = Counter(words)

    # Sort the words by frequency, from most to least common
    vocab = sorted(counts, key=counts.get, reverse=True)

    # Build the word-to-integer mapping, i.e. give each word an integer index;
    # inside the network a word is represented by this integer label.
    # For example, 'apple' might be represented by the number 500.
    # Integer labels start from 1; 0 is reserved for a special purpose (padding, explained below)
    vocab_to_int = { word : i for i, word in enumerate(vocab, 1)}

    return vocab_to_int
def get_reviews_ints(vocab_to_int, reviews):
    # Convert each review to integers, i.e. map every word in the review to its index via vocab_to_int
    # For example, "I love this movie" might be converted to [5, 36, 45, 12354]
    reviews_ints = []
    for each in reviews:
        reviews_ints.append( [ vocab_to_int[word] for word in each.split()] )

    return reviews_ints
vocab_to_int = get_vocab_to_int(words)

reviews_ints = get_reviews_ints(vocab_to_int, reviews)
# As an example, see what "i love this moive" gets converted to
get_reviews_ints(vocab_to_int, ['i love this moive'])
[[10, 115, 11, 59320]]
# There are 74072 distinct words in total
len(vocab_to_int)
74072
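
A small side note (a minimal sketch, not used later in this post): keeping the reverse mapping makes it easy to decode a numeric review back into words when debugging.
# Reverse mapping from integer index back to word (index 0 is reserved for padding)
int_to_vocab = {i: word for word, i in vocab_to_int.items()}
# Decode the first ten word indices of the first review back into text
print(' '.join(int_to_vocab[i] for i in reviews_ints[0][:10]))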

Step 3 Encode the Output Labels

  • The labels contain two classes, 'negative' and 'positive'; we map 'negative' to 0 and 'positive' to 1
labels = np.array([0 if label=='negative' else 1 for label in labels.split('\n')])
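
As a quick sanity check (a minimal sketch; the exact counts depend on the data files), the two classes should be roughly balanced:
# Count how many negative (0) and positive (1) labels there are
print(np.bincount(labels))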

Step 4 Clean Up Bad Data

  • For some unknown reason there is a zero-length entry in reviews_ints; it carries no information, so we remove it
  • Also, the longest review has 2514 words, which is far too long for our network, so reviews will have to be cut down
review_lens = Counter([len(x) for x in reviews_ints])
print('Zero-length reviews:{}'.format(review_lens[0]))
print("Maximum review length: {}".format(max(review_lens)))
Zero-length reviews:1
Maximum review length: 2514
# Get the indices of the reviews whose length is not 0
non_zeros_idx = [ ii for ii, review in enumerate(reviews_ints) if len(review) != 0]
len(non_zeros_idx)
25000
# Remove the zero-length reviews from reviews_ints, and the corresponding labels
reviews_ints = [ reviews_ints[ii] for ii in non_zeros_idx]
labels = np.array( [ labels[ii] for ii in non_zeros_idx] )

Step 5 Truncate or Pad to a Fixed Length

  • As mentioned above, some reviews are too long and must be truncated, while others are too short and must be padded
  • We fix the input sequence length at 200: reviews longer than 200 words are truncated, and reviews shorter than 200 are padded with 0 on the left
  • For example, 'i love this movie' is [10, 115, 11, 59320], so 196 zeros are added on the left, giving [0, 0, …, 0, 10, 115, 11, 59320]
# Fixed sequence length
seq_len = 200
# Shape: number of reviews x seq_len
features = np.zeros((len(reviews_ints), seq_len), dtype=int)
for i,review in enumerate(reviews_ints):
    features[i, -len(review):] = np.array(review)[:seq_len]
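
A quick check (a minimal sketch) confirms that every row is exactly seq_len long, that short reviews are left-padded with zeros, and that the word indices sit at the right end of each row:
# Every row should be exactly seq_len long
assert features.shape == (len(reviews_ints), seq_len)
# The leftmost columns are mostly padding zeros, the rightmost columns hold word indices
print(features[:2, :10])
print(features[:2, -10:])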

Step 6 Build the Training, Validation and Test Sets

# Fraction of the data used for training
split_frac = 0.8
# Split off the training set
split_index = int(len(features)*split_frac)
train_x, val_x = features[:split_index], features[split_index:]
train_y, val_y = labels[:split_index], labels[split_index:]

# The remaining data (everything after the training set) is split in half into validation and test sets
test_index = int(len(val_x)*0.5)
val_x, test_x = val_x[:test_index], val_x[test_index:]
val_y, test_y = val_y[:test_index], val_y[test_index:]

print("\t\t\tFeature Shapes:")
print("Train set: \t\t{}".format(train_x.shape), 
      "\nValidation set: \t{}".format(val_x.shape),
      "\nTest set: \t\t{}".format(test_x.shape))
            Feature Shapes:
Train set:      (20000, 200) 
Validation set:     (2500, 200) 
Test set:       (2500, 200)

Step 7 Build the Network

Set the Basic Parameters

# Number of hidden units in each LSTM cell
lstm_size = 256
# Number of LSTM layers
lstm_layers = 1
batch_size = 512
learning_rate = 0.001

Define the Inputs and Outputs

import tensorflow as tf

# +1 because word indices start at 1 and index 0 is reserved for padding
n_words = len(vocab_to_int) + 1

# Create the graph object
graph = tf.Graph()
# Add nodes to the graph
with graph.as_default():
    # Input placeholder: a batch of reviews.
    # Shape is [None, None]: the first None is the batch size (it could be set to batch_size),
    # the second None is the review length (it could be set to seq_len)
    inputs_ = tf.placeholder(tf.int32, [None, None], name='inputs')
    # Input labels
    labels_ = tf.placeholder(tf.int32, [None, None], name='labels')
    # Dropout keep probability, e.g. 0.8 means 80% of the units are kept (not dropped)
    keep_prob = tf.placeholder(tf.float32, name='keep_prob')

Add the Embedding Layer

embed_size = 300

with graph.as_default():
    # Embedding matrix: each of the n_words word indices maps to a dense vector of size embed_size
    embedding = tf.Variable(tf.truncated_normal((n_words, embed_size), stddev=0.01))
    # Look up the embedding vectors for the word indices in the input batch
    embed = tf.nn.embedding_lookup(embedding, inputs_)

Build the LSTM Layers

with graph.as_default():

    # Build the LSTM cell; this layer has lstm_size hidden units
    lstm = tf.contrib.rnn.BasicLSTMCell(lstm_size)

    # Add dropout (keep_prob is passed positionally, i.e. as input_keep_prob)
    drop = tf.contrib.rnn.DropoutWrapper(lstm, keep_prob)

    # If one LSTM layer is not enough, stack several
    # (note: for lstm_layers > 1 it is better to create a separate cell per layer instead of reusing the same object)
    cell = tf.contrib.rnn.MultiRNNCell([drop] * lstm_layers)

    # Every input sequence needs an initial state;
    # each batch contains batch_size sequences, so we create batch_size initial states
    initial_state = cell.zero_state(batch_size, tf.float32)

RNN Forward Pass

with graph.as_default():
    outputs, final_state = tf.nn.dynamic_rnn(cell, embed, initial_state=initial_state)
# outputs has shape (512, ?, 256)
# 512 is batch_size
# ? is the sequence length (seq_len)
# 256 is lstm_size, the number of hidden units
outputs
<tf.Tensor 'rnn/transpose:0' shape=(512, ?, 256) dtype=float32>

Define the Output

with graph.as_default():
    # We only care about the LSTM output at the last time step, so outputs[:, -1] takes,
    # for each review, the LSTM output at its last word
    # outputs[:, -1] has shape batch_size x lstm_size
    predictions = tf.contrib.layers.fully_connected(outputs[:, -1], 1, activation_fn=tf.sigmoid)
    cost = tf.losses.mean_squared_error(labels_, predictions)

    optimizer = tf.train.AdamOptimizer(learning_rate).minimize(cost)

Validation Accuracy

with graph.as_default():
    correct_pred = tf.equal( tf.cast(tf.round(predictions), tf.int32), labels_ )
    accuracy = tf.reduce_mean(tf.cast(correct_pred, tf.float32))

Get Batches

def get_batches(x, y, batch_size=100):
    n_batches = len(x) // batch_size
    x, y = x[:n_batches*batch_size], y[:n_batches*batch_size]

    for ii in range(0, len(x), batch_size):
        yield x[ii:ii+batch_size], y[ii:ii+batch_size]
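
For example (a minimal sketch, assuming batch_size = 512 and seq_len = 200 as set above), peeking at the first training batch shows the shapes the network will be fed:
# Grab one batch from the generator and inspect the shapes
x_batch, y_batch = next(get_batches(train_x, train_y, batch_size))
print(x_batch.shape)   # (512, 200)
print(y_batch.shape)   # (512,)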

Training

epochs = 10

# Saver for persisting the trained model
with graph.as_default():
    saver = tf.train.Saver()

with tf.Session(graph=graph) as sess:
    tf.global_variables_initializer().run()
    iteration = 1

    for e in range(epochs):
        state = sess.run(initial_state)

        for ii, (x, y) in enumerate(get_batches(train_x, train_y, batch_size), 1):
            feed = {inputs_ : x,
                    labels_ : y[:,None],
                    keep_prob : 0.5,
                    initial_state : state}

            loss, state, _ = sess.run([cost, final_state, optimizer], feed_dict=feed)

            if iteration % 5 == 0:
                print('Epoch: {}/{}'.format(e, epochs),
                      'Iteration: {}'.format(iteration),
                      'Train loss: {}'.format(loss))

            if iteration % 25 == 0:
                val_acc = []
                val_state = sess.run(cell.zero_state(batch_size, tf.float32))

                for x, y in get_batches(val_x, val_y, batch_size):
                    feed = {inputs_ : x,
                            labels_ : y[:,None], 
                            keep_prob : 1,
                            initial_state : val_state}

                    batch_acc, val_state = sess.run([accuracy, final_state], feed_dict=feed)
                    val_acc.append(batch_acc)

                print('Val acc: {:.3f}'.format(np.mean(val_acc)))

            iteration += 1

    saver.save(sess, "checkpoints/sentiment.ckpt")
Epoch: 0/10 Iteration: 5 Train loss: 0.24799075722694397
Epoch: 0/10 Iteration: 10 Train loss: 0.24164661765098572
Epoch: 0/10 Iteration: 15 Train loss: 0.23779860138893127
Epoch: 0/10 Iteration: 20 Train loss: 0.23155733942985535
Epoch: 0/10 Iteration: 25 Train loss: 0.19295498728752136
Val acc: 0.694
Epoch: 0/10 Iteration: 30 Train loss: 0.16817498207092285
Epoch: 0/10 Iteration: 35 Train loss: 0.14103104174137115
Epoch: 1/10 Iteration: 40 Train loss: 0.4157596230506897
Epoch: 1/10 Iteration: 45 Train loss: 0.25596609711647034
Epoch: 1/10 Iteration: 50 Train loss: 0.14873309433460236
Val acc: 0.759
Epoch: 1/10 Iteration: 55 Train loss: 0.2219633162021637
Epoch: 1/10 Iteration: 60 Train loss: 0.22595466673374176
Epoch: 1/10 Iteration: 65 Train loss: 0.22170156240463257
Epoch: 1/10 Iteration: 70 Train loss: 0.21362364292144775
Epoch: 1/10 Iteration: 75 Train loss: 0.21025851368904114
Val acc: 0.637
Epoch: 2/10 Iteration: 80 Train loss: 0.197884202003479
Epoch: 2/10 Iteration: 85 Train loss: 0.18369686603546143
Epoch: 2/10 Iteration: 90 Train loss: 0.15401005744934082
Epoch: 2/10 Iteration: 95 Train loss: 0.08480044454336166
Epoch: 2/10 Iteration: 100 Train loss: 0.21809038519859314
Val acc: 0.555
Epoch: 2/10 Iteration: 105 Train loss: 0.2156117707490921
Epoch: 2/10 Iteration: 110 Train loss: 0.2078854888677597
Epoch: 2/10 Iteration: 115 Train loss: 0.17866834998130798
Epoch: 3/10 Iteration: 120 Train loss: 0.2278885841369629
Epoch: 3/10 Iteration: 125 Train loss: 0.23644667863845825
Val acc: 0.574
Epoch: 3/10 Iteration: 130 Train loss: 0.15737152099609375
Epoch: 3/10 Iteration: 135 Train loss: 0.2996417284011841
Epoch: 3/10 Iteration: 140 Train loss: 0.3013457655906677
Epoch: 3/10 Iteration: 145 Train loss: 0.29811352491378784
Epoch: 3/10 Iteration: 150 Train loss: 0.29609352350234985
Val acc: 0.539
Epoch: 3/10 Iteration: 155 Train loss: 0.29265934228897095
Epoch: 4/10 Iteration: 160 Train loss: 0.3259274959564209
Epoch: 4/10 Iteration: 165 Train loss: 0.1977640688419342
Epoch: 4/10 Iteration: 170 Train loss: 0.10309533774852753
Epoch: 4/10 Iteration: 175 Train loss: 0.20305077731609344
Val acc: 0.722
Epoch: 4/10 Iteration: 180 Train loss: 0.21348100900650024
Epoch: 4/10 Iteration: 185 Train loss: 0.1976686418056488
Epoch: 4/10 Iteration: 190 Train loss: 0.17928491532802582
Epoch: 4/10 Iteration: 195 Train loss: 0.17746716737747192
Epoch: 5/10 Iteration: 200 Train loss: 0.12238124758005142
Val acc: 0.814
Epoch: 5/10 Iteration: 205 Train loss: 0.07527816295623779
Epoch: 5/10 Iteration: 210 Train loss: 0.05444170534610748
Epoch: 5/10 Iteration: 215 Train loss: 0.028456348925828934
Epoch: 5/10 Iteration: 220 Train loss: 0.02309001237154007
Epoch: 5/10 Iteration: 225 Train loss: 0.02358683943748474
Val acc: 0.544
Epoch: 5/10 Iteration: 230 Train loss: 0.0281759575009346
Epoch: 6/10 Iteration: 235 Train loss: 0.36734506487846375
Epoch: 6/10 Iteration: 240 Train loss: 0.27041739225387573
Epoch: 6/10 Iteration: 245 Train loss: 0.06518629193305969
Epoch: 6/10 Iteration: 250 Train loss: 0.27379676699638367
Val acc: 0.683
Epoch: 6/10 Iteration: 255 Train loss: 0.17366482317447662
Epoch: 6/10 Iteration: 260 Train loss: 0.11729621887207031
Epoch: 6/10 Iteration: 265 Train loss: 0.156696617603302
Epoch: 6/10 Iteration: 270 Train loss: 0.15894444286823273
Epoch: 7/10 Iteration: 275 Train loss: 0.14083260297775269
Val acc: 0.653
Epoch: 7/10 Iteration: 280 Train loss: 0.131819948554039
Epoch: 7/10 Iteration: 285 Train loss: 0.1406235545873642
Epoch: 7/10 Iteration: 290 Train loss: 0.12142431735992432
Epoch: 7/10 Iteration: 295 Train loss: 0.10793609172105789
Epoch: 7/10 Iteration: 300 Train loss: 0.1138591319322586
Val acc: 0.778
Epoch: 7/10 Iteration: 305 Train loss: 0.10069040209054947
Epoch: 7/10 Iteration: 310 Train loss: 0.08547944575548172
Epoch: 8/10 Iteration: 315 Train loss: 0.0743105486035347
Epoch: 8/10 Iteration: 320 Train loss: 0.08303466439247131
Epoch: 8/10 Iteration: 325 Train loss: 0.07770203053951263
Val acc: 0.749
Epoch: 8/10 Iteration: 330 Train loss: 0.05231660231947899
Epoch: 8/10 Iteration: 335 Train loss: 0.05823827162384987
Epoch: 8/10 Iteration: 340 Train loss: 0.06528615206480026
Epoch: 8/10 Iteration: 345 Train loss: 0.06311675161123276
Epoch: 8/10 Iteration: 350 Train loss: 0.07824704796075821
Val acc: 0.809
Epoch: 9/10 Iteration: 355 Train loss: 0.04236128553748131
Epoch: 9/10 Iteration: 360 Train loss: 0.03875266760587692
Epoch: 9/10 Iteration: 365 Train loss: 0.045075297355651855
Epoch: 9/10 Iteration: 370 Train loss: 0.05201151967048645
Epoch: 9/10 Iteration: 375 Train loss: 0.051657453179359436
Val acc: 0.805
Epoch: 9/10 Iteration: 380 Train loss: 0.040323011577129364
Epoch: 9/10 Iteration: 385 Train loss: 0.03481965512037277
Epoch: 9/10 Iteration: 390 Train loss: 0.061715394258499146

Testing

test_acc = []
with tf.Session(graph=graph) as sess:
    saver.restore(sess, tf.train.latest_checkpoint('checkpoints'))
    test_state = sess.run(cell.zero_state(batch_size, tf.float32))
    for ii, (x, y) in enumerate(get_batches(test_x, test_y, batch_size), 1):
        feed = {inputs_: x,
                labels_: y[:, None],
                keep_prob: 1,
                initial_state: test_state}
        batch_acc, test_state = sess.run([accuracy, final_state], feed_dict=feed)
        test_acc.append(batch_acc)
    print("Test accuracy: {:.3f}".format(np.mean(test_acc)))
INFO:tensorflow:Restoring parameters from checkpoints/sentiment.ckpt
Test accuracy: 0.785
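
Finally, here is a minimal sketch (not part of the original code; predict_sentiment is a hypothetical helper) of how the trained graph could score one new review. It reuses the preprocessing above and repeats the review to fill a whole batch, because the graph's initial state was built with a fixed batch_size:
def predict_sentiment(review_text, sess):
    # Same preprocessing as above: lowercase, strip punctuation, map words to ints
    text = ''.join([c for c in review_text.lower() if c not in punctuation])
    ints = [vocab_to_int[w] for w in text.split() if w in vocab_to_int]
    if not ints:
        return None
    # Left-pad (or truncate) to seq_len, exactly like the training data
    padded = np.zeros((1, seq_len), dtype=int)
    padded[0, -len(ints):] = ints[:seq_len]
    # The graph expects batches of size batch_size, so repeat the single review
    batch = np.repeat(padded, batch_size, axis=0)
    state = sess.run(cell.zero_state(batch_size, tf.float32))
    pred = sess.run(predictions, feed_dict={inputs_: batch,
                                            keep_prob: 1,
                                            initial_state: state})
    # Values near 1 mean positive, values near 0 mean negative
    return pred[0][0]

with tf.Session(graph=graph) as sess:
    saver.restore(sess, tf.train.latest_checkpoint('checkpoints'))
    print(predict_sentiment('i love this movie it is wonderful', sess))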