
Word2vec for Sentiment Analysis in Practice (Part 3): Supervised Learning with Distributed Word Vectors

Introduction

This post builds on the previous one (Part 2) with further exploration and hands-on work.
Demo code and data: [link]

Numeric Representation of Words

In the previous part we trained a model that captures the semantics of words. If we look under the hood, the model trained in Part 2 consists of a feature vector for every word in the vocabulary. These feature vectors are stored in a numpy array called syn0:

# Load the model that we created in Part 2
from gensim.models import Word2Vec
model = Word2Vec.load("300features_40minwords_10context")
type(model.wv.syn0)
model.wv.syn0.shape

[output] numpy.ndarray
[output] (16490, 300)

Clearly this numpy array has shape (16490, 300), corresponding to the number of words in the vocabulary and the number of features per word. An individual word vector can be accessed directly like this:

model["flower"]

From Words to Paragraphs, Attempt 1: Vector Averaging

In the IMDB dataset, every review has a different length, so we first need to turn the individual word vectors into a feature set of equal length for each review. Since each word is a 300-dimensional feature vector, we can use vector operations to combine the words in a review. In this example we simply average the word vectors, removing stop words first since they would only add noise. The code is as follows:

import numpy as np  # Make sure that numpy is imported

def makeFeatureVec(words, model, num_features):
    # Function to average all of the word vectors in a given
    # paragraph
    #
    # Pre-initialize an empty numpy array (for speed)
    featureVec = np.zeros((num_features,),dtype="float32")
    #
    nwords = 0.
    #
    # Index2word is a list that contains the names of the words in
    # the model's vocabulary. Convert it to a set, for speed
    index2word_set = set(model.wv.index2word)
    #
    # Loop over each word in the review and, if it is in the model's
    # vocabulary, add its feature vector to the total
    for word in words:
        if word in index2word_set:
            nwords = nwords + 1.
            featureVec = np.add(featureVec,model[word])
    #
    # Divide the result by the number of words to get the average
    featureVec = np.divide(featureVec,nwords)
    return featureVec


def getAvgFeatureVecs(reviews, model, num_features):
    # Given a set of reviews (each one a list of words), calculate
    # the average feature vector for each one and return a 2D numpy array
    #
    # Initialize a counter
    counter = 0
    #
    # Preallocate a 2D numpy array, for speed
    reviewFeatureVecs = np.zeros((len(reviews),num_features),dtype="float32")
    #
    # Loop through the reviews
    for review in reviews:
        #
        # Print a status message every 1000th review
        if counter%1000 == 0:
            print("Review %d of %d" % (counter, len(reviews)))
        #
        # Call the function (defined above) that makes average feature vectors
        reviewFeatureVecs[counter] = makeFeatureVec(review, model, num_features)
        #
        # Increment the counter
        counter = counter + 1
    return reviewFeatureVecs

Next, we take the training and test sets read in Part 2 and compute the average vectors for each:

# ****************************************************************
# Calculate average feature vectors for training and testing sets,
# using the functions we defined above. Notice that we now use stop word
# removal.
import pandas as pd

# Read data from files 
train = pd.read_csv( "./data/labeledTrainData.tsv", header=0, 
 delimiter="\t", quoting=3 )
test = pd.read_csv( "./data/testData.tsv", header=0, delimiter="\t", quoting=3 )
unlabeled_train = pd.read_csv( "./data/unlabeledTrainData.tsv", header=0, 
 delimiter="\t", quoting=3 )

# Verify the number of reviews that were read (100,000 in total)
print("Read %d labeled train reviews, %d labeled test reviews, " \
 "and %d unlabeled reviews\n" % (train["review"].size,  
 test["review"].size, unlabeled_train["review"].size ))

# Import various modules for string cleaning
from bs4 import BeautifulSoup
import re
from nltk.corpus import stopwords

def review_to_wordlist( review, remove_stopwords=False ):
    # Function to convert a document to a sequence of words,
    # optionally removing stop words.  Returns a list of words.
    #
    # 1. Remove HTML
    review_text = BeautifulSoup(review).get_text()
    #  
    # 2. Remove non-letters
    review_text = re.sub("[^a-zA-Z]"," ", review_text)
    #
    # 3. Convert words to lower case and split them
    words = review_text.lower().split()
    #
    # 4. Optionally remove stop words (false by default)
    if remove_stopwords:
        stops = set(stopwords.words("english"))
        words = [w for w in words if not w in stops]
    #
    # 5. Return a list of words
    return(words)
# Dimensionality must match the Word2Vec model trained in Part 2
num_features = 300    # Word vector dimensionality

clean_train_reviews = []
for review in train["review"]:
    clean_train_reviews.append( review_to_wordlist( review, \
        remove_stopwords=True ))

trainDataVecs = getAvgFeatureVecs( clean_train_reviews, model, num_features )

print("Creating average feature vecs for test reviews")
clean_test_reviews = []
for review in test["review"]:
    clean_test_reviews.append( review_to_wordlist( review, \
        remove_stopwords=True ))

testDataVecs = getAvgFeatureVecs( clean_test_reviews, model, num_features )
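One practical caveat, not covered in the original tutorial: if a cleaned review contains no words from the model's vocabulary, makeFeatureVec divides by zero and fills that row with NaNs, which will break the classifier below. A minimal guard, assuming an all-zero vector is an acceptable stand-in for such reviews:

# Replace NaN rows produced by reviews with no in-vocabulary words
trainDataVecs = np.nan_to_num(trainDataVecs)
testDataVecs = np.nan_to_num(testDataVecs)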

Next, we use a random forest to make predictions. The code is as follows:

# Fit a random forest to the training data, using 100 trees
from sklearn.ensemble import RandomForestClassifier
forest = RandomForestClassifier( n_estimators = 100 )

print "Fitting a random forest to labeled training data..."
forest = forest.fit( trainDataVecs, train["sentiment"] )

# Test & extract results 
result = forest.predict( testDataVecs )

# Write the test results 
output = pd.DataFrame( data={"id":test["id"], "sentiment":result} )
output.to_csv( "Word2Vec_AverageVectors.csv", index=False, quoting=3 )

We find that this result is much better than chance, but a few percentage points lower in accuracy than the bag-of-words model from Part 1.
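To sanity-check that comparison locally rather than relying on the Kaggle leaderboard, one option (not part of the original tutorial) is k-fold cross-validation on the averaged training vectors, for example with scikit-learn's cross_val_score:

from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier

# Estimate accuracy of the averaged word vectors with 5-fold cross-validation
forest = RandomForestClassifier(n_estimators=100)
scores = cross_val_score(forest, trainDataVecs, train["sentiment"], cv=5)
print("Mean cross-validation accuracy: %.3f" % scores.mean())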
Since vector averaging did not produce spectacular results, perhaps we can do something smarter? A standard way of weighting word vectors is to apply tf-idf weights, which measure how important a given word is within a given collection of documents. One way to extract tf-idf weights in Python is scikit-learn's TfidfVectorizer, whose interface is similar to the CountVectorizer we used in Part 1. In our experiments, however, adding such weights made no appreciable difference.
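For reference, a minimal sketch of such a weighted average might look like the following. The helper name makeTfidfWeightedVecs is ours (it is not part of the tutorial), and it assumes the tokenized clean_train_reviews list built above:

from sklearn.feature_extraction.text import TfidfVectorizer

def makeTfidfWeightedVecs(reviews, model, num_features):
    # Reviews are already tokenized, so pass the token lists straight through
    tfidf = TfidfVectorizer(analyzer=lambda tokens: tokens)
    tfidf_matrix = tfidf.fit_transform(reviews)
    vocab = tfidf.vocabulary_
    index2word_set = set(model.wv.index2word)
    result = np.zeros((len(reviews), num_features), dtype="float32")
    for i, review in enumerate(reviews):
        weight_sum = 0.
        for word in review:
            if word in index2word_set and word in vocab:
                weight = tfidf_matrix[i, vocab[word]]
                result[i] += weight * model[word]
                weight_sum += weight
        # Normalize by the total tf-idf weight instead of the word count
        if weight_sum > 0:
            result[i] /= weight_sum
    return result

trainDataVecsTfidf = makeTfidfWeightedVecs(clean_train_reviews, model, num_features)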
Since neither vector averaging nor tf-idf weighting brings a major improvement, let's try clustering next and see whether it does better.

From Words to Paragraphs, Attempt 2: Clustering

Word2Vec creates clusters of semantically related words, so another possible approach is to exploit the similarity of words within a cluster. Grouping vectors in this way is known as "vector quantization". To do this, we first need to find the centers of the word clusters, which we can obtain with a clustering algorithm such as K-means.

In K-means, the one parameter we have to set is "K", the number of clusters. How should we decide how many clusters to create? Trial and error suggested that small clusters, with an average of only about 5 words each, give better results than large clusters containing many words. The clustering code is shown below; we use scikit-learn to run K-means.

from sklearn.cluster import KMeans
import time

start = time.time() # Start time

# Set "k" (num_clusters) to be 1/5th of the vocabulary size, or an
# average of 5 words per cluster
word_vectors = model.wv.syn0
num_clusters = word_vectors.shape[0] // 5   # integer division so KMeans gets an int

# Initalize a k-means object and use it to extract centroids
kmeans_clustering = KMeans( n_clusters = num_clusters )
idx = kmeans_clustering.fit_predict( word_vectors )

# Get the end time and print how long the process took
end = time.time()
elapsed = end - start
print("Time taken for K Means clustering: ", elapsed, "seconds.")

The cluster assignment for each word is now stored in idx, and the vocabulary of our original Word2Vec model is still stored in model.wv.index2word. For convenience, we zip these together into a single dictionary as follows:

# Create a Word / Index dictionary, mapping each vocabulary word to
# a cluster number                                                                                            
word_centroid_map = dict(zip( model.wv.index2word, idx ))

Let's print the first 10 clusters to see what they look like:

# For the first 10 clusters
for cluster in range(0,10):
    #
    # Print the cluster number  
    print("\nCluster %d" % cluster)
    #
    # Find all of the words for that cluster number, and print them out
    words = []
    for i in range(0, len(word_centroid_map.values())):
        if( list(word_centroid_map.values())[i] == cluster ):
            words.append(list(word_centroid_map.keys())[i])
    print(words)

We can see that the cluster quality varies. Some clusters make sense: cluster 3 mostly contains names, and clusters 6-8 contain related adjectives (cluster 6 holds exactly the sentiment adjectives we are after). On the other hand, cluster 5 is a little mysterious: what do a lobster and a deer have in common (besides being animals)? Cluster 0 is even worse: penthouses and suites seem to belong together, but apples and passports do not. Cluster 2 contains war-related words? Perhaps our clustering algorithm works best on adjectives.
In any case, now that we have a cluster (or "centroid") assignment for each word, we can define a function that converts a review into a bag of centroids. This works just like the bag-of-words model, except that it uses semantically related clusters instead of individual words:

def create_bag_of_centroids( wordlist, word_centroid_map ):
    #
    # The number of clusters is equal to the highest cluster index
    # in the word / centroid map
    num_centroids = max( word_centroid_map.values() ) + 1
    #
    # Pre-allocate the bag of centroids vector (for speed)
    bag_of_centroids = np.zeros( num_centroids, dtype="float32" )
    #
    # Loop over the words in the review. If the word is in the vocabulary,
    # find which cluster it belongs to, and increment that cluster count 
    # by one
    for word in wordlist:
        if word in word_centroid_map:
            index = word_centroid_map[word]
            bag_of_centroids[index] += 1
    #
    # Return the "bag of centroids"
    return bag_of_centroids

The function above returns a numpy array for each review, with as many features as there are clusters. Finally, we create bags of centroids for the training and test sets, train a random forest, and extract the results:

from sklearn.ensemble import RandomForestClassifier
# Pre-allocate an array for the training set bags of centroids (for speed)
train_centroids = np.zeros( (train["review"].size, num_clusters), \
    dtype="float32" )

# Transform the training set reviews into bags of centroids
counter = 0
for review in clean_train_reviews:
    train_centroids[counter] = create_bag_of_centroids( review, \
        word_centroid_map )
    counter += 1

# Repeat for test reviews 
test_centroids = np.zeros(( test["review"].size, num_clusters), \
    dtype="float32" )

counter = 0
for review in clean_test_reviews:
    test_centroids[counter] = create_bag_of_centroids( review, \
        word_centroid_map )
    counter += 1
# Fit a random forest and extract predictions 
forest = RandomForestClassifier(n_estimators = 100)

# Fitting the forest may take a few minutes
print("Fitting a random forest to labeled training data...")
forest = forest.fit(train_centroids,train["sentiment"])
result = forest.predict(test_centroids)

# Write the test results 
output = pd.DataFrame(data={"id":test["id"], "sentiment":result})
output.to_csv( "BagOfCentroids.csv", index=False, quoting=3 )

Summary

We find that the code above gives roughly the same results as the bag-of-words model in Part 1. This does not mean Word2vec is useless; it simply means that for this sentiment-analysis task, Google's doc2vec (paragraph vectors) approach works better.
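As a pointer for further experiments, here is a minimal, untested sketch of training paragraph vectors with gensim's Doc2Vec on the same tokenized reviews. The hyperparameters simply mirror the Word2Vec settings from Part 2, and in older gensim versions the vector_size argument is called size:

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Each tokenized review becomes a TaggedDocument with a unique integer tag
tagged_reviews = [TaggedDocument(words=review, tags=[i])
                  for i, review in enumerate(clean_train_reviews)]

# Hyperparameters mirror the Word2Vec model from Part 2
doc_model = Doc2Vec(tagged_reviews, vector_size=300, window=10,
                    min_count=40, workers=4)

# Infer a fixed-length vector for an unseen (test) review
test_vec = doc_model.infer_vector(clean_test_reviews[0])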
Demo code and data: [link]