
『TensorFlow』Test Project: Classifying Movie Reviews


The data

  • neg.txt: 5,331 negative movie reviews
  • pos.txt: 5,331 positive movie reviews

Libraries

Natural Language Toolkit (NLTK)

Download the NLTK data:

import nltk
nltk.download()
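
nltk.download() with no arguments opens an interactive downloader. For this project only a few pieces of data are actually needed, so (assuming the standard NLTK package names) you can fetch them directly instead:

import nltk
nltk.download('punkt')     # tokenizer models used by word_tokenize
nltk.download('wordnet')   # lexical database used by WordNetLemmatizer
nltk.download('brown')     # only needed for the installation check below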

Check that the installation works:

from nltk.corpus import brown
print(brown.words())
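# prints something like: ['The', 'Fulton', 'County', 'Grand', 'Jury', ...]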

Two functions are used most often:

from nltk.tokenize import word_tokenize
"""
"I'm super man"
tokenize:
['I', "'m", 'super', 'man']
"""

from nltk.stem import WordNetLemmatizer
"""
Lemmatization (lemmatizer): reduce any inflected form of an English word to its base form.
This differs from stemming (stemmer), which extracts a word's stem.
"""

They are called like this:

words = word_tokenize(line.lower())

lemmatizer = WordNetLemmatizer()
lex = [lemmatizer.lemmatize(word) for word in lex]
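
A quick end-to-end check of the two calls (the exact output depends on the installed WordNet data, but roughly):

lemmatizer = WordNetLemmatizer()
words = word_tokenize("The movies were great".lower())
print(words)                                     # ['the', 'movies', 'were', 'great']
print([lemmatizer.lemmatize(w) for w in words])  # ['the', 'movie', 'were', 'great']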

Program walkthrough

Imports and data file names

import numpy as np
import tensorflow as tf
import random
import pickle
from collections import Counter
import nltk
from nltk.tokenize import word_tokenize   # tokenizer: "I'm super man" -> ['I', "'m", 'super', 'man']
from nltk.stem import WordNetLemmatizer   # lemmatizer: reduce inflected forms to the base form
import os

os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'

pos_file = 'pos.txt'
neg_file = 'neg.txt'

Building the vocabulary

The vocabulary:

  • use the NLTK functions above to tokenize and lemmatize away inflected forms
  • filter out words that appear too often or too rarely; very rare words carry little signal, and the very frequent ones are mostly words like "the" and "is" that say nothing about sentiment
  • keep no duplicate entries

def creat_lexicon(pos_file, neg_file):
    '''Build the vocabulary.'''
    lex = []

    def process_file(f):
        with open(f, 'r') as fh:
            lex = []
            lines = fh.readlines()
            for line in lines:
                words = word_tokenize(line.lower())
                lex += words
            return lex

    lex += process_file(pos_file)
    lex += process_file(neg_file)

    lemmatizer = WordNetLemmatizer()
    lex = [lemmatizer.lemmatize(word) for word in lex]

    word_count = Counter(lex)
    # print(word_count)
    # {'.': 13944, ',': 10536, 'the': 10120, 'a': 9444, 'and': 7108, 'of': 6624, 'it': 4748, 'to': 3940, ...}

    # keep only words that are neither too frequent nor too rare
    lex = []
    for word in word_count:
        if word_count[word] < 2000 and word_count[word] > 20:
            lex.append(word)
    return lex

lex = creat_lexicon(pos_file,neg_file)
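
A quick sanity check on the result (the exact size depends on the frequency thresholds above):

print(len(lex))   # vocabulary size; this is also the input dimension of the network below
print(lex[:10])   # a few of the surviving words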

Converting reviews to vectors

At this point the idea is clear: treat the vocabulary as a set of counters. Different reviews produce different distributions over it, and the network learns from those distributions.

# Convert each review into a vector. The idea:
# suppose lex is ['woman', 'great', 'feel', 'actually', 'looking', 'latest', 'seen', 'is'] (in practice it is much larger)
# the review 'i think this movie is great' becomes [0,1(one 'great'),0,0,0,0,0,1(one 'is')]:
# words from the review that appear in lex are counted at their index; every other position stays 0
def normalize_dataset(lex):
    dataset = []

    def string_to_vector(lex,review):
        words = word_tokenize(review.lower())
        lemmatizer = WordNetLemmatizer()
        words = [lemmatizer.lemmatize(word) for word in words]

        features = np.zeros(len(lex))
        for word in words:
            if word in lex:
                features[lex.index(word)] += 1
        return features

    with open(pos_file, 'r') as f:
        lines = f.readlines()
        for line in lines:
            one_sample = [string_to_vector(lex, line), [1, 0]]   # positive label
            dataset.append(one_sample)
    with open(neg_file, 'r') as f:
        lines = f.readlines()
        for line in lines:
            one_sample = [string_to_vector(lex, line), [0, 1]]   # negative label
            dataset.append(one_sample)
    return dataset

dataset = normalize_dataset(lex)
random.shuffle(dataset)
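
Each entry of dataset is a [bag-of-words vector, one-hot label] pair; a quick look at the layout (the values are of course data-dependent):

sample_vec, sample_label = dataset[0]
print(sample_vec.shape)   # (len(lex),) word counts
print(sample_label)       # [1, 0] for a positive review, [0, 1] for a negative one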

Neural network

Nothing special here.

n_input_layer = len(lex)  # input layer size
n_layer_1 = 1000          # hidden layer
n_layer_2 = 1000          # hidden layer
n_output_layer = 2        # output layer size


def neural_network(data):
    # weights and biases of the first hidden layer
    layer_1_w_b = {'w_': tf.Variable(tf.random_normal([n_input_layer, n_layer_1])),
                   'b_': tf.Variable(tf.random_normal([n_layer_1]))}
    # weights and biases of the second hidden layer
    layer_2_w_b = {'w_': tf.Variable(tf.random_normal([n_layer_1, n_layer_2])),
                   'b_': tf.Variable(tf.random_normal([n_layer_2]))}
    # weights and biases of the output layer
    layer_output_w_b = {'w_': tf.Variable(tf.random_normal([n_layer_2, n_output_layer])),
                        'b_': tf.Variable(tf.random_normal([n_output_layer]))}

    # w·x + b
    layer_1 = tf.add(tf.matmul(data, layer_1_w_b['w_']), layer_1_w_b['b_'])
    layer_1 = tf.nn.relu(layer_1)  # activation
    layer_2 = tf.add(tf.matmul(layer_1, layer_2_w_b['w_']), layer_2_w_b['b_'])
    layer_2 = tf.nn.relu(layer_2)  # activation
    layer_output = tf.add(tf.matmul(layer_2, layer_output_w_b['w_']), layer_output_w_b['b_'])
    return layer_output
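
For reference, the same architecture (initialization aside) can be written more compactly with the TF 1.x layers API; this is just an equivalent sketch, not the code used above:

def neural_network_dense(data):
    # two ReLU hidden layers of 1000 units, linear output logits
    layer_1 = tf.layers.dense(data, n_layer_1, activation=tf.nn.relu)
    layer_2 = tf.layers.dense(layer_1, n_layer_2, activation=tf.nn.relu)
    return tf.layers.dense(layer_2, n_output_layer)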

Train/test split & training

When feeding, each batch is wrapped in list(); without it you get an error. As I recall, this is because the batches are slices of an object array rather than standalone arrays, so list() is needed to turn them into an independent structure. Nothing else to watch out for.
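
A rough illustration of the issue with toy shapes (not the real data): slicing a column out of the object-dtype dataset array yields an object array of vectors, which cannot be fed to a float32 placeholder directly, while list() hands over a plain list of arrays that NumPy/TensorFlow can stack:

toy = np.array([[np.zeros(3), [1, 0]],
                [np.ones(3),  [0, 1]]], dtype=object)   # same layout as `dataset`
col = toy[:, 0]                       # dtype=object, elements are 1-D arrays
print(col.dtype)                      # object
print(np.asarray(list(col)).shape)    # (2, 3) -- the dense shape feed_dict needs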

test_size = int(len(dataset)*0.1)
dataset = np.array(dataset)
train_dataset = dataset[:-test_size]
test_dataset = dataset[-test_size:]

batch_size = 50
X = tf.placeholder(tf.float32,[None,len(train_dataset[0][0])])
Y = tf.placeholder(tf.float32)

def train_neural_network(X, Y):
    predict = neural_network(X)
    loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=predict, labels=Y))
    optimizer = tf.train.AdamOptimizer().minimize(loss)

    epochs = 13
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        epoch_loss = 0

        i = 0
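        # note: epoch_loss and i are initialized once and never reset inside the epoch loop,
        # so only the first epoch actually steps through the data (hence the constant loss printed below)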
        random.shuffle(train_dataset)
        train_x = train_dataset[:,0]
        train_y = train_dataset[:,1]
        for epoch in range(epochs):
            while i < len(train_y):
                start = i
                end = i + batch_size

                batch_x = train_x[start:end]
                batch_y = train_y[start:end]

                _,l = sess.run([optimizer, loss], feed_dict={X:list(batch_x), Y:list(batch_y)})
                epoch_loss += l
                i += batch_size
            print(epoch, ':', epoch_loss)

        test_x = test_dataset[:,0]
        test_y = test_dataset[:,1]
        correct = tf.equal(tf.argmax(predict,axis=1), tf.argmax(Y,axis=1))
        accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))
        print('Accuracy:', accuracy.eval({X: list(test_x), Y: list(test_y)}))

train_neural_network(X, Y)
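
To score a single new review with the trained weights, a rough sketch (this would have to run inside the same Session, e.g. right after the accuracy print; the review text is made up):

review = "a surprisingly moving and well acted film"
lemmatizer = WordNetLemmatizer()
words = [lemmatizer.lemmatize(w) for w in word_tokenize(review.lower())]
features = np.zeros(len(lex))
for word in words:
    if word in lex:
        features[lex.index(word)] += 1
pred = sess.run(tf.argmax(predict, axis=1), feed_dict={X: [features]})
print('positive' if pred[0] == 0 else 'negative')   # [1,0] was the positive label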

Results:

0 : 62576.5236931
1 : 62576.5236931
2 : 62576.5236931
3 : 62576.5236931
4 : 62576.5236931
5 : 62576.5236931
6 : 62576.5236931
7 : 62576.5236931
8 : 62576.5236931
9 : 62576.5236931
10 : 62576.5236931
11 : 62576.5236931
12 : 62576.5236931
Accuracy: 0.603189

The results are mediocre, only a bit better than random guessing...
