『TensorFlow』Test Project: Classifying Movie Reviews
阿新 • Published: 2017-08-20
Data Overview
- neg.txt: 5,331 negative movie reviews
- pos.txt: 5,331 positive movie reviews
Libraries
Natural Language Toolkit (nltk), a natural-language processing library.
Download the nltk data:
import nltk
nltk.download()
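If the interactive downloader is inconvenient, the specific data packages used below can be fetched directly (a small sketch; these are standard nltk package names):

import nltk
nltk.download('punkt')    # tokenizer models required by word_tokenize
nltk.download('wordnet')  # lexical database required by WordNetLemmatizer
nltk.download('brown')    # corpus used in the installation check below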
Check that the installation succeeded:
from nltk.corpus import brown
print(brown.words())
Two functions are used throughout:
from nltk.tokenize import word_tokenize
# "I'm super man" tokenizes to: ['I', "'m", 'super', 'man']

from nltk.stem import WordNetLemmatizer
# Lemmatization (lemmatizer) reduces an English word in any inflected form to its
# base form; it differs from stemming (stemmer), which extracts a word's root.
They are called like this:
words = word_tokenize(line.lower())
lemmatizer = WordNetLemmatizer()
lex = [lemmatizer.lemmatize(word) for word in lex]
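A quick sanity check of both functions (a hypothetical snippet, not from the original post; the sample sentence is arbitrary):

from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
tokens = word_tokenize("the cats are running")
print(tokens)                                     # ['the', 'cats', 'are', 'running']
print([lemmatizer.lemmatize(t) for t in tokens])  # ['the', 'cat', 'are', 'running']
# lemmatize() treats words as nouns by default, so the verb is left untouched;
# lemmatizer.lemmatize('running', pos='v') returns 'run'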
Program Walkthrough
Import the libraries and set the data file names
import numpy as np
import tensorflow as tf
import random
import pickle
from collections import Counter

import nltk
from nltk.tokenize import word_tokenize
# "I'm super man" tokenizes to: ['I', "'m", 'super', 'man']
from nltk.stem import WordNetLemmatizer
# Lemmatization reduces an English word in any inflected form to its base form,
# unlike stemming, which extracts a word's root.

import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'  # silence TensorFlow's info logs

pos_file = 'pos.txt'
neg_file = 'neg.txt'
Building the Vocabulary
The vocabulary is built as follows:
- Use the nltk functions to tokenize and to strip inflections (lemmatize)
- Filter out words that appear too often or too rarely: rare words carry no statistical signal, and very frequent ones are mostly sentiment-neutral function words such as the and is (see the toy illustration after the code below)
- Keep each word only once
def create_lexicon(pos_file, neg_file):
    '''Build the vocabulary.'''
    lex = []

    def process_file(f):
        with open(f, 'r') as fh:
            lex = []
            lines = fh.readlines()
            for line in lines:
                words = word_tokenize(line.lower())
                lex += words
            return lex

    lex += process_file(pos_file)
    lex += process_file(neg_file)

    lemmatizer = WordNetLemmatizer()
    lex = [lemmatizer.lemmatize(word) for word in lex]

    word_count = Counter(lex)
    # print(word_count)
    # {'.': 13944, ',': 10536, 'the': 10120, 'a': 9444, 'and': 7108, 'of': 6624, 'it': 4748, 'to': 3940 ...}
    lex = []
    for word in word_count:
        if word_count[word] < 2000 and word_count[word] > 20:
            lex.append(word)
    return lex

lex = create_lexicon(pos_file, neg_file)
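To see the frequency filter on its own, here is a toy run with shrunken thresholds (assumed values for illustration; the real code uses 20 and 2000):

from collections import Counter

toy_words = ['the'] * 6 + ['good'] * 3 + ['film'] * 4 + ['rare']
word_count = Counter(toy_words)
# keep words that are neither too frequent nor too rare
kept = [w for w in word_count if 1 < word_count[w] < 5]
print(kept)  # ['good', 'film'] -- 'the' is too common, 'rare' is too rare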
Converting Reviews to Vectors
At this point the approach is clear: treat the vocabulary as a bag-of-words counter, so every review maps to its own count vector (a distribution over the vocabulary), and let the network learn that distribution.
# Convert each review into a vector. The idea:
# suppose lex is ['woman', 'great', 'feel', 'actually', 'looking', 'latest', 'seen', 'is']
# (the real vocabulary is far larger, of course). The review
# 'i think this movie is great' becomes [0,1,0,0,0,0,0,1]:
# every word of the review that appears in lex is counted at its index in lex,
# and all other positions stay 0.
def normalize_dataset(lex):
    dataset = []

    def string_to_vector(lex, review):
        words = word_tokenize(review.lower())
        lemmatizer = WordNetLemmatizer()
        words = [lemmatizer.lemmatize(word) for word in words]
        features = np.zeros(len(lex))
        for word in words:
            if word in lex:
                features[lex.index(word)] += 1
        return features

    with open(pos_file, 'r') as f:
        lines = f.readlines()
        for line in lines:
            one_sample = [string_to_vector(lex, line), [1, 0]]  # positive label
            dataset.append(one_sample)
    with open(neg_file, 'r') as f:
        lines = f.readlines()
        for line in lines:
            one_sample = [string_to_vector(lex, line), [0, 1]]  # negative label
            dataset.append(one_sample)
    return dataset

dataset = normalize_dataset(lex)
random.shuffle(dataset)
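The conversion can be tried in isolation on the toy lexicon from the comment above (a standalone sketch; toy_lex and the review string are just the commented example):

import numpy as np
from nltk.tokenize import word_tokenize

toy_lex = ['woman', 'great', 'feel', 'actually', 'looking', 'latest', 'seen', 'is']
review = 'i think this movie is great'

features = np.zeros(len(toy_lex))
for word in word_tokenize(review.lower()):
    if word in toy_lex:
        features[toy_lex.index(word)] += 1  # count the word at its lexicon index
print(features)  # [0. 1. 0. 0. 0. 0. 0. 1.] -> one 'great', one 'is'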
The Neural Network
Nothing fancy here: a plain feed-forward network with two hidden layers.
n_input_layer = len(lex)  # input layer size
n_layer_1 = 1000          # hidden layer 1
n_layer_2 = 1000          # hidden layer 2
n_output_layer = 2        # output layer size

def neural_network(data):
    # weights and biases of the first hidden layer
    layer_1_w_b = {'w_': tf.Variable(tf.random_normal([n_input_layer, n_layer_1])),
                   'b_': tf.Variable(tf.random_normal([n_layer_1]))}
    # weights and biases of the second hidden layer
    layer_2_w_b = {'w_': tf.Variable(tf.random_normal([n_layer_1, n_layer_2])),
                   'b_': tf.Variable(tf.random_normal([n_layer_2]))}
    # weights and biases of the output layer
    layer_output_w_b = {'w_': tf.Variable(tf.random_normal([n_layer_2, n_output_layer])),
                        'b_': tf.Variable(tf.random_normal([n_output_layer]))}

    # w·x + b
    layer_1 = tf.add(tf.matmul(data, layer_1_w_b['w_']), layer_1_w_b['b_'])
    layer_1 = tf.nn.relu(layer_1)  # activation
    layer_2 = tf.add(tf.matmul(layer_1, layer_2_w_b['w_']), layer_2_w_b['b_'])
    layer_2 = tf.nn.relu(layer_2)  # activation
    layer_output = tf.add(tf.matmul(layer_2, layer_output_w_b['w_']), layer_output_w_b['b_'])
    return layer_output
Dataset Split & Training
When feeding, each batch is wrapped in list(), otherwise TensorFlow raises an error. As I recall, the reason is that the batches are slices of the dataset array rather than independent arrays, so list() is needed to turn each one into a standalone structure. Nothing else needs special attention.
test_size = int(len(dataset) * 0.1)
dataset = np.array(dataset)
train_dataset = dataset[:-test_size]
test_dataset = dataset[-test_size:]

batch_size = 50

X = tf.placeholder(tf.float32, [None, len(train_dataset[0][0])])
Y = tf.placeholder(tf.float32)

def train_neural_network(X, Y):
    predict = neural_network(X)
    loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=predict, labels=Y))
    optimizer = tf.train.AdamOptimizer().minimize(loss)

    epochs = 13
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        # note: epoch_loss and i are initialized only once, outside the epoch loop,
        # so only the first epoch actually enters the while loop below
        # (see the corrected loop sketched after this block)
        epoch_loss = 0
        i = 0
        random.shuffle(train_dataset)
        train_x = train_dataset[:, 0]
        train_y = train_dataset[:, 1]
        for epoch in range(epochs):
            while i < len(train_y):
                start = i
                end = i + batch_size
                batch_x = train_x[start:end]
                batch_y = train_y[start:end]
                _, l = sess.run([optimizer, loss],
                                feed_dict={X: list(batch_x), Y: list(batch_y)})
                epoch_loss += l
                i += batch_size
            print(epoch, ':', epoch_loss)

        test_x = test_dataset[:, 0]
        test_y = test_dataset[:, 1]
        correct = tf.equal(tf.argmax(predict, axis=1), tf.argmax(Y, axis=1))
        accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))
        print('Accuracy:', accuracy.eval({X: list(test_x), Y: list(test_y)}))

train_neural_network(X, Y)
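Because `i` and `epoch_loss` are initialized once, outside the epoch loop, every epoch after the first skips the while loop entirely, which is why the per-epoch loss printed below never changes. A sketch of a corrected loop (my fix, not part of the original post), resetting both counters and reshuffling each epoch:

for epoch in range(epochs):
    np.random.shuffle(train_dataset)  # np.random.shuffle handles 2-D arrays safely
    train_x = train_dataset[:, 0]
    train_y = train_dataset[:, 1]
    i = 0           # reset the batch cursor every epoch
    epoch_loss = 0  # reset the running loss every epoch
    while i < len(train_y):
        batch_x = train_x[i:i + batch_size]
        batch_y = train_y[i:i + batch_size]
        _, l = sess.run([optimizer, loss],
                        feed_dict={X: list(batch_x), Y: list(batch_y)})
        epoch_loss += l
        i += batch_size
    print(epoch, ':', epoch_loss)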
Results:
0 : 62576.5236931
1 : 62576.5236931
2 : 62576.5236931
3 : 62576.5236931
4 : 62576.5236931
5 : 62576.5236931
6 : 62576.5236931
7 : 62576.5236931
8 : 62576.5236931
9 : 62576.5236931
10 : 62576.5236931
11 : 62576.5236931
12 : 62576.5236931
Accuracy: 0.603189
A mediocre result, only a bit better than guessing... though since only the first epoch actually trained, the network never got its full 13 epochs.