
Text Sentiment Classification Based on Word2Vec and Doc2Vec

This article introduces text sentiment classification using Word2Vec and Doc2Vec; I'll translate it into Chinese later when I have time:

Sentiment analysis is a common application of Natural Language Processing (NLP) methodologies, particularly classification, whose goal is to extract the emotional content in text. In this way, sentiment analysis can be seen as a method to quantify qualitative data with some sentiment score. While sentiment is largely subjective, sentiment quantification has enjoyed many useful implementations, such as businesses gaining understanding about consumer reactions to a product, or detecting hateful speech in online comments.

The simplest form of sentiment analysis is to use a dictionary of good and bad words. Each word in a sentence has a score, typically +1 for positive sentiment and -1 for negative. Then, we simply add up the scores of all the words in the sentence to get a final sentiment total. Clearly, this has many limitations, the most important being that it neglects context and surrounding words. For example, in our simple model the phrase “not good” may be classified as 0 sentiment, given “not” has a score of -1 and “good” a score of +1. A human would likely classify “not good” as negative, despite the presence of “good”.
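A toy sketch of this dictionary approach might look like the following (the word list here is invented purely for illustration):

#A tiny sentiment dictionary: +1 for positive words, -1 for negative ones
sentiment_scores = {'good': 1, 'great': 1, 'love': 1,
                    'not': -1, 'bad': -1, 'hate': -1}

def dictionary_sentiment(sentence):
    #sum the scores of the words we recognize; unknown words contribute 0
    return sum(sentiment_scores.get(word, 0) for word in sentence.lower().split())

print(dictionary_sentiment('not good'))  #prints 0, even though a human would call this negative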

Another common method is to treat a text as a “bag of words”. We treat each text as a 1 by N vector, where N is the size of our vocabulary. Each column is a word, and the value is the number of times that word appears. For example, the phrase “bag of bag of words” might be encoded as [2, 2, 1]. This could then be fed into a machine learning algorithm for classification, such as logistic regression or SVM, to predict sentiment on unseen data. Note that this requires data with known sentiment to train on in a supervised fashion. While this is an improvement over the previous method, it still ignores context, and the size of the data increases with the size of the vocabulary.
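A small sketch of this encoding, here using scikit-learn's CountVectorizer (any word-counting code would do; newer scikit-learn versions rename get_feature_names to get_feature_names_out):

from sklearn.feature_extraction.text import CountVectorizer

#the vocabulary is built from the corpus; each document becomes a 1 by N count vector
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(['bag of bag of words'])

print(vectorizer.get_feature_names())  #['bag', 'of', 'words']
print(counts.toarray())                #[[2 2 1]]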

Word2Vec and Doc2Vec

Recently, Google developed a method called Word2Vec that captures the context of words, while at the same time reducing the size of the data. Word2Vec is actually two different methods: Continuous Bag of Words (CBOW) and Skip-gram. In the CBOW method, the goal is to predict a word given the surrounding words. Skip-gram is the converse: we want to predict a window of words given a single word (see Figure 1). Both methods use artificial neural networks as their classification algorithm. Initially, each word in the vocabulary is a random N-dimensional vector. During training, the algorithm learns the optimal vector for each word using the CBOW or Skip-gram method.

CBOW and Skip-gram

Figure 1: Architecture of the CBOW and Skip-gram methods, taken from Efficient Estimation of Word Representations in Vector Space. w(t) is the current word, while w(t-2), w(t-1), etc. are the surrounding words.
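In gensim's implementation (used later in this article), the choice between the two architectures is a single constructor flag. A minimal sketch, with a toy corpus and parameters chosen only for illustration (recent gensim versions rename the size argument to vector_size):

from gensim.models.word2vec import Word2Vec

sentences = [['the', 'quick', 'brown', 'fox'], ['jumps', 'over', 'the', 'lazy', 'dog']]

#sg=0 (the default) trains CBOW: predict the current word from its context window
cbow_model = Word2Vec(sentences, size=100, window=5, min_count=1, sg=0)

#sg=1 trains Skip-gram: predict the context window from the current word
skipgram_model = Word2Vec(sentences, size=100, window=5, min_count=1, sg=1)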

These word vectors now capture the context of surrounding words. This can be seen by using basic algebra to find word relations (e.g. "king" – "man" + "woman" = "queen"). These word vectors can be fed into a classification algorithm, as opposed to bag-of-words, to predict sentiment. The advantage is that we now have some word context, and our feature space is much lower (typically ~300 as opposed to ~100,000, which is the size of our vocabulary). We also had to do very little manual feature creation since the neural network was able to extract those features for us. Since texts have varying lengths, one might take the average of all word vectors in a text as the input to a classification algorithm in order to classify whole documents.

However, even with the above method of averaging word vectors, we are ignoring word order. As a way to summarize bodies of text of varying length, Quoc Le and Tomas Mikolov came up with the Doc2Vec method. This method is almost identical to Word2Vec, except we now generalize the method by adding a paragraph/document vector. Like Word2Vec, there are two methods: Distributed Memory (DM) and Distributed Bag of Words (DBOW). DM attempts to predict a word given its previous words and a paragraph vector. Even though the context window moves across the text, the paragraph vector does not (hence distributed memory), which allows some word order to be captured. DBOW predicts a random group of words in a paragraph given only its paragraph vector (see Figure 2).

Doc2Vec

Once the model has been trained, these paragraph vectors can be fed into a sentiment classifier without the need to aggregate word vectors. This method is currently the state-of-the-art when it comes to sentiment classification on the IMDB movie review data set, achieving only a 7.42% error rate. Of course, none of this is useful if we cannot actually implement these methods. Luckily, very well-optimized versions of Word2Vec and Doc2Vec are available in gensim, a Python library.
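As with Word2Vec, gensim lets you choose between the two Doc2Vec variants with a single constructor flag. A rough sketch using the older gensim API that this article relies on (the parameters are illustrative only):

from gensim.models import Doc2Vec

#dm=1 (the default) selects the Distributed Memory model; dm=0 selects DBOW
model_dm = Doc2Vec(size=300, window=10, min_count=1, dm=1)
model_dbow = Doc2Vec(size=300, window=10, min_count=1, dm=0)

Each model is then given a vocabulary with build_vocab and trained with train, much like the Word2Vec examples below.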

Word2Vec Example in Python

In this section we show how one might use word vectors in a sentiment classification task. The gensim library comes standard with the Anaconda distribution or can be installed using pip. From there you can train word vectors on your own corpus (a dataset of text documents) or import pre-trained vectors from C text or binary format:

from gensim.models.word2vec import Word2Vec

model = Word2Vec.load_word2vec_format('vectors.txt', binary=False) #C text format
model = Word2Vec.load_word2vec_format('vectors.bin', binary=True) #C binary format

I find this especially useful when loading Google’s pre-trained word vectors trained over ~100 billion words from the Google News dataset found in the "Pre-trained word and phrase vectors" section here. Note that the file is ~3.5 GB unzipped. Using the Google word vectors we can see some interesting relationships between words:

from gensim.models.word2vec import Word2Vec

model = Word2Vec.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)

model.most_similar(positive=['woman', 'king'], negative=['man'], topn=5)

[(u'queen', 0.711819589138031),
 (u'monarch', 0.618967592716217),
 (u'princess', 0.5902432799339294),
 (u'crown_prince', 0.5499461889266968),
 (u'prince', 0.5377323031425476)]

What's interesting is that it can find grammatical relationships, for example identifying superlatives or verb stems:
"biggest" - "big" + "small" = "smallest"

model.most_similar(positive=['biggest','small'], negative=['big'], topn=5)

[(u'smallest', 0.6086569428443909),
 (u'largest', 0.6007465720176697),
 (u'tiny', 0.5387299656867981),
 (u'large', 0.456944078207016),
 (u'minuscule', 0.43401968479156494)]

"ate" - "eat" + "speak" = "spoke"

model.most_similar(positive=['ate','speak'], negative=['eat'], topn=5)

[(u'spoke', 0.6965223550796509),
 (u'speaking', 0.6261293292045593),
 (u'conversed', 0.5754593014717102),
 (u'spoken', 0.570488452911377),
 (u'speaks', 0.5630602240562439)]

It's clear from the above examples that Word2Vec is able to learn non-trivial relationships between words. This is what makes word vectors powerful for many NLP tasks, and in our case for sentiment analysis. Before we move on to using them in sentiment analysis, let us first examine Word2Vec's ability to separate and cluster words. We will use three example word sets: food, sports, and weather words taken from a wonderful website called Enchanted Learning. Since these vectors are 300-dimensional, we will use Scikit-Learn's implementation of a dimensionality reduction algorithm called t-SNE in order to visualize them in 2D.

First we have to obtain the word vectors as follows:

import numpy as np

with open('food_words.txt', 'r') as infile:
    food_words = infile.readlines()

with open('sports_words.txt', 'r') as infile:
    sports_words = infile.readlines()

with open('weather_words.txt', 'r') as infile:
    weather_words = infile.readlines()

def getWordVecs(words):
    vecs = []
    for word in words:
        word = word.replace('\n', '')
        try:
            vecs.append(model[word].reshape((1,300)))
        except KeyError:
            continue
    vecs = np.concatenate(vecs)
    return np.array(vecs, dtype='float') #TSNE expects float type values

food_vecs = getWordVecs(food_words)
sports_vecs = getWordVecs(sports_words)
weather_vecs = getWordVecs(weather_words)

We can then use TSNE and matplotlib to visualize the clusters with the following code:

from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

ts = TSNE(2)
reduced_vecs = ts.fit_transform(np.concatenate((food_vecs, sports_vecs, weather_vecs)))

#color points by word group to see if Word2Vec can separate them
for i in range(len(reduced_vecs)):
    if i < len(food_vecs):
        #food words colored blue
        color = 'b'
    elif i >= len(food_vecs) and i < (len(food_vecs) + len(sports_vecs)):
        #sports words colored red
        color = 'r'
    else:
        #weather words colored green
        color = 'g'
    plt.plot(reduced_vecs[i,0], reduced_vecs[i,1], marker='o', color=color, markersize=8)

The result is as follows:

t-SNE

Figure 3: T-SNE projected clusters of food words (blue), sports words (red), and weather words (green).

We can see from the above that Word2Vec does a good job of separating unrelated words, as well as clustering together like words.

Analyzing the Sentiment of Emoji Tweets

Now we will move on to an example in sentiment analysis with tweets gathered using emojis as search terms. We use these emojis as "fuzzy" labels for our data: a smiley emoji (:-)) corresponds to positive sentiment, and a frowny one (:-() to negative. The data consists of an even split between positive and negative, with a total of ~400,000 tweets. We randomly sample positive and negative tweets to construct an 80/20 train/test split. To limit information from the test set leaking into the training features, Word2Vec is first trained only on the training tweets; it is updated with the test tweets (an unsupervised step) only later, just before the test-set features are built, so that their vocabulary is covered. To construct inputs for our classifier, we take the average of all word vectors in a tweet. We will be using Scikit-Learn to do a lot of the machine learning.

First we import our data and train the Word2Vec model.

from sklearn.cross_validation import train_test_split
from gensim.models.word2vec import Word2Vec
import numpy as np

with open('twitter_data/pos_tweets.txt', 'r') as infile:
    pos_tweets = infile.readlines()

with open('twitter_data/neg_tweets.txt', 'r') as infile:
    neg_tweets = infile.readlines()

#use 1 for positive sentiment, 0 for negative
y = np.concatenate((np.ones(len(pos_tweets)), np.zeros(len(neg_tweets))))

x_train, x_test, y_train, y_test = train_test_split(np.concatenate((pos_tweets, neg_tweets)), y, test_size=0.2)

#Do some very minor text preprocessing
def cleanText(corpus):
    corpus = [z.lower().replace('\n','').split() for z in corpus]
    return corpus

x_train = cleanText(x_train)
x_test = cleanText(x_test)

n_dim = 300
#Initialize model and build vocab
imdb_w2v = Word2Vec(size=n_dim, min_count=10)
imdb_w2v.build_vocab(x_train)

#Train the model over train_reviews (this may take several minutes)
imdb_w2v.train(x_train)

Next we build a feature vector for each tweet by averaging the vectors of all the words it contains, using the following function:

#Build word vector for training set by using the average value of all word vectors in the tweet, then scale
def buildWordVector(text, size):
    vec = np.zeros(size).reshape((1, size))
    count = 0.
    for word in text:
        try:
            vec += imdb_w2v[word].reshape((1, size))
            count += 1.
        except KeyError:
            continue
    if count != 0:
        vec /= count
    return vec

Scaling is part of standardization: each feature of the dataset is rescaled to have a mean of zero (and, with scikit-learn's scale, unit variance), so that values above the mean become positive and values below it become negative. Many ML models require scaled datasets to perform effectively, especially those with many features, as in text classification.

from sklearn.preprocessing import scale
train_vecs = np.concatenate([buildWordVector(z, n_dim) for z in x_train])
train_vecs = scale(train_vecs)

#Continue training word2vec on the test tweets so that their vocabulary is covered before building the test-set vectors
imdb_w2v.train(x_test)

Finally we have to build our test set vectors and scale them for evaluation.

#Build test tweet vectors then scale
test_vecs = np.concatenate([buildWordVector(z, n_dim) for z in x_test])
test_vecs = scale(test_vecs)

Next we want to validate our classifier by calculating the prediction accuracy on test data, as well as examining its Receiver Operating Characteristic (ROC) curve. ROC curves measure the true-positive rate vs. the false-positive rate of a classifier while adjusting a parameter of the model. In our case, we adjust the cut-off threshold probability for classifying a tweet as positive or negative sentiment. Generally, the larger the Area Under the Curve (AUC), the better our model does at maximizing true positives while minimizing false positives. More on ROC curves can be found here.

To start we'll train our classifier, in this case using Stochastic Gradient Descent for Logistic Regression.

#Use classification algorithm (i.e. Stochastic Logistic Regression) on training set, then assess model performance on test set
from sklearn.linear_model import SGDClassifier

lr = SGDClassifier(loss='log', penalty='l1')
lr.fit(train_vecs, y_train)

print 'Test Accuracy: %.2f'%lr.score(test_vecs, y_test)

We'll then create the ROC curve for evaluation using matplotlib and the roc_curve method of Scikit-Learn's metric package.

#Create ROC curve
from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt

pred_probas = lr.predict_proba(test_vecs)[:,1]

fpr,tpr,_ = roc_curve(y_test, pred_probas)
roc_auc = auc(fpr,tpr)
plt.plot(fpr,tpr,label='area = %.2f' %roc_auc)
plt.plot([0, 1], [0, 1], 'k--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.legend(loc='lower right')

plt.show()

The resulting curve is as follows:

ROC curve

Figure 4: ROC Curve for a logistic classifier on our training data of tweets.

Without any manual feature creation and with only minimal text preprocessing, we can achieve 73% test accuracy using a simple linear model provided by Scikit-Learn. Interestingly, removing punctuation actually causes the accuracy to suffer, suggesting Word2Vec can find interesting features when characters such as "?" and "!" are present. Treating these as individual words, training for longer, doing more preprocessing, and adjusting parameters in both Word2Vec and the classifier could all help in improving accuracy. I have found that using Artificial Neural Networks (ANNs) can improve the accuracy by about 5% when using word vectors. Note that Scikit-Learn does not provide an implementation of ANN classifiers, so I used a custom library I created:

from NNet import NeuralNet

nnet = NeuralNet(100, learn_rate=1e-1, penalty=1e-8)
maxiter = 1000
batch = 150
_ = nnet.fit(train_vecs, y_train, fine_tune=False, maxiter=maxiter, SGD=True, batch=batch, rho=0.9)

print 'Test Accuracy: %.2f'%nnet.score(test_vecs, y_test)

The resulting accuracy is 77%. As with any machine learning task, picking the right model is usually more a matter of art than science. If you'd like to use my custom library you can find it on my github. Be warned, it is likely very messy and not regularly maintained! If you would like to contribute please feel free to fork the repository. It could definitely use some TLC!

Using Doc2Vec to Analyze Movie Reviews

Using the averages of word vectors worked fine in the case of tweets. This is because tweets are typically only a few to tens of words in length, which allows us to preserve the relevant features even when averaging. Once we go to the paragraph scale, however, we risk throwing away rich features when we ignore word order and context. In this case it is better to use Doc2Vec to create our input features. As an example we will use the IMDB movie review dataset to test the usefulness of Doc2Vec in sentiment analysis. The data consists of 25,000 positive movie reviews, 25,000 negative, and 50,000 unlabeled reviews. We first train Doc2Vec over the unlabeled reviews. The methodology then follows that of the Word2Vec example above almost identically, except that now we will use both DM and DBOW vectors as inputs by concatenating them.

import gensim

LabeledSentence = gensim.models.doc2vec.LabeledSentence

from sklearn.cross_validation import train_test_split
import numpy as np

with open('IMDB_data/pos.txt','r') as infile:
    pos_reviews = infile.readlines()

with open('IMDB_data/neg.txt','r') as infile:
    neg_reviews = infile.readlines()

with open('IMDB_data/unsup.txt','r') as infile:
    unsup_reviews = infile.readlines()

#use 1 for positive sentiment, 0 for negative
y = np.concatenate((np.ones(len(pos_reviews)), np.zeros(len(neg_reviews))))

x_train, x_test, y_train, y_test = train_test_split(np.concatenate((pos_reviews, neg_reviews)), y, test_size=0.2)

#Do some very minor text preprocessing
def cleanText(corpus):
    punctuation = """.,?!:;(){}[]"""
    corpus = [z.lower().replace('\n', '') for z in corpus]
    #treat each punctuation mark as an individual word
    for c in punctuation:
        corpus = [z.replace(c, ' %s ' % c) for z in corpus]
    corpus = [z.split() for z in corpus]
    return corpus
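The article's approach is to feed both DM and DBOW paragraph vectors, concatenated, into the classifier. A simplified sketch of how that might look with the gensim API imported above (the tags, parameters, and single-pass training here are assumptions for illustration; newer gensim versions replace LabeledSentence with TaggedDocument, rename size to vector_size, and require total_examples/epochs arguments to train):

x_train = cleanText(x_train)
x_test = cleanText(x_test)
unsup_reviews = cleanText(unsup_reviews)

#gensim's Doc2Vec needs each document wrapped with a unique tag
def labelizeReviews(reviews, label_type):
    return [LabeledSentence(words=review, tags=['%s_%d' % (label_type, i)])
            for i, review in enumerate(reviews)]

train_tagged = labelizeReviews(x_train, 'TRAIN')
test_tagged = labelizeReviews(x_test, 'TEST')
unsup_tagged = labelizeReviews(unsup_reviews, 'UNSUP')

n_dim = 400

#one DM model and one DBOW model, sharing a vocabulary built over all reviews
model_dm = gensim.models.Doc2Vec(min_count=1, window=10, size=n_dim, sample=1e-3, negative=5, dm=1)
model_dbow = gensim.models.Doc2Vec(min_count=1, window=10, size=n_dim, sample=1e-3, negative=5, dm=0)

all_tagged = train_tagged + test_tagged + unsup_tagged
model_dm.build_vocab(all_tagged)
model_dbow.build_vocab(all_tagged)
model_dm.train(all_tagged)
model_dbow.train(all_tagged)

#concatenate the DM and DBOW paragraph vectors to form the classifier inputs
def getVecs(model, tagged_docs):
    return np.array([model.docvecs[doc.tags[0]] for doc in tagged_docs])

train_vecs = np.hstack((getVecs(model_dm, train_tagged), getVecs(model_dbow, train_tagged)))
test_vecs = np.hstack((getVecs(model_dm, test_tagged), getVecs(model_dbow, test_tagged)))

From here, train_vecs and test_vecs can be scaled and fed into the same SGDClassifier and ROC evaluation used for the tweets above.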
