Modern Methods for Sentiment Analysis

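Sentiment analysis is a common application of natural language processing (NLP), in particular of classification methods whose goal is to extract the emotional content of text. With it, qualitative data can be analyzed quantitatively by way of a sentiment score. Although sentiment is highly subjective, quantitative sentiment analysis already has many practical uses, such as helping businesses understand how users react to a product or detecting hate speech in online comments.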

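The simplest form of sentiment analysis relies on a dictionary of positive and negative words. Each word carries a sentiment value, typically +1 for positive and -1 for negative. We then simply add up the values of all the words in a sentence to get a final score. This approach clearly has many shortcomings, the most important being that it ignores context and neighboring words. For example, the simple phrase “not good” receives a final score of 0, because “not” is -1 and “good” is +1. A person would classify this phrase as negative despite the presence of “good”.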
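To make the shortcoming concrete, here is a minimal sketch of dictionary-based scoring. The lexicon contents and the function name `lexicon_score` are made up for illustration and do not come from any particular library.

```python
# Hypothetical toy lexicon: +1 for positive words, -1 for negative words.
LEXICON = {"good": 1, "great": 1, "love": 1, "not": -1, "bad": -1, "terrible": -1}

def lexicon_score(sentence, lexicon=LEXICON):
    """Sum the per-word sentiment values; words missing from the lexicon count as 0."""
    return sum(lexicon.get(word, 0) for word in sentence.lower().split())

print(lexicon_score("good"))      # 1
print(lexicon_score("not good"))  # 0: the negation cancels "good" instead of flipping it
```

Because every word is scored in isolation, no choice of lexicon values can make “not good” negative while keeping “good” positive and “not” neutral elsewhere.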

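Another common approach is to model text as a “bag of words”. We treat each text as a 1 by N vector, where N is the size of the vocabulary. Each column is a word, and its value is the number of times that word appears. For example, the phrase “bag of bag of words” can be encoded as [2, 2, 1]. This vector can then be fed into a machine learning algorithm such as logistic regression or a support vector machine (SVM) for classification, which makes it possible to predict the sentiment of unseen data. Note that this requires data with known sentiment so that the model can be trained in a supervised fashion. Although this is a clear improvement over the previous method, it still ignores context, and the size of the data grows with the size of the vocabulary.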
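The encoding can be sketched in a few lines of plain Python. The helper names `build_vocabulary` and `bag_of_words` are illustrative assumptions; in practice a library vectorizer would normally be used instead.

```python
from collections import Counter

def build_vocabulary(documents):
    """Assign each distinct word a column index, in order of first appearance."""
    vocab = {}
    for doc in documents:
        for word in doc.split():
            vocab.setdefault(word, len(vocab))
    return vocab

def bag_of_words(document, vocab):
    """Encode one document as a 1 x N count vector over the vocabulary."""
    counts = Counter(document.split())
    vector = [0] * len(vocab)
    for word, index in vocab.items():
        vector[index] = counts[word]
    return vector

docs = ["bag of bag of words"]
vocab = build_vocabulary(docs)
print(vocab)                         # {'bag': 0, 'of': 1, 'words': 2}
print(bag_of_words(docs[0], vocab))  # [2, 2, 1]
```

Each new word in the corpus adds one more column, which is why the vector length grows with the vocabulary.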

Word2Vec and Doc2Vec

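In recent years, Google developed a new method called Word2Vec that captures the context of words while also reducing the size of the data. Word2Vec actually comes in two variants: CBOW (Continuous Bag of Words) and Skip-gram. With CBOW, the goal is to predict a single word given its neighboring words. Skip-gram is the reverse: given a single word, we want to predict a range of surrounding words (see Figure 1). Both methods use artificial neural networks as their classification algorithm. Initially, each word in the vocabulary is a random N-dimensional vector. During training, the algorithm learns the optimal vector for each word using CBOW or Skip-gram.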

Figure 1: Architecture of CBOW and Skip-gram, taken from “Efficient Estimation of Word Representations in Vector Space”. w(t) is the current word, while w(t-2), w(t-1), etc. are the neighboring words.
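The difference between the two variants is easiest to see in the training examples each one is built from. The sketch below only constructs those (input, target) pairs and does not train a network; the function names and the window size of 2 are assumptions chosen for illustration.

```python
def context_window(tokens, position, window):
    """Return the neighbors within `window` positions of tokens[position]."""
    left = tokens[max(0, position - window):position]
    right = tokens[position + 1:position + 1 + window]
    return left + right

def cbow_examples(tokens, window=2):
    """CBOW: the neighbors are the input, the current word w(t) is the target."""
    return [(context_window(tokens, t, window), tokens[t]) for t in range(len(tokens))]

def skipgram_examples(tokens, window=2):
    """Skip-gram: the current word w(t) is the input, each neighbor is a target."""
    return [(tokens[t], neighbor)
            for t in range(len(tokens))
            for neighbor in context_window(tokens, t, window)]

tokens = "the movie was really good".split()
print(cbow_examples(tokens)[2])
# (['the', 'movie', 'really', 'good'], 'was')
print(skipgram_examples(tokens)[:2])
# [('the', 'movie'), ('the', 'was')]
```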

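These word vectors now capture context. This can be seen by using basic algebra to uncover relationships between words (for example, “king” - “man” + “woman” = “queen”). Instead of bag-of-words counts, these word vectors can be fed into a classification algorithm to predict sentiment. The advantage is that we take the context of words into account and that our feature space has very low dimensionality (typically around 300, compared with a vocabulary of around 100,000 words). Because the neural network extracts these features for us, we have to hand-craft very few features ourselves. Since texts vary in length, the average of all the word vectors can be used as the classifier input in order to classify a whole document.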
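Averaging gives every document a vector of the same length, however many words it contains. A minimal sketch follows, using made-up 3-dimensional vectors in place of real 300-dimensional ones; `WORD_VECTORS` and `average_vector` are hypothetical names.

```python
# Hypothetical 3-dimensional word vectors; real Word2Vec vectors have ~300 dimensions.
WORD_VECTORS = {
    "good":  [0.9, 0.1, 0.0],
    "movie": [0.1, 0.8, 0.3],
    "bad":   [-0.9, 0.1, 0.0],
}

def average_vector(tokens, word_vectors, n_dim=3):
    """Average the vectors of the known tokens; unknown tokens are skipped."""
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        return [0.0] * n_dim
    return [sum(column) / len(known) for column in zip(*known)]

short_doc = "good movie".split()
long_doc = "a good good movie , not a bad movie".split()
print(average_vector(short_doc, WORD_VECTORS))
print(len(average_vector(long_doc, WORD_VECTORS)))  # 3
```

Both documents end up as 3-dimensional inputs, but the order of the words has no effect on the result.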

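However, even with this averaging of word vectors, we still ignore word order. Quoc Le and Tomas Mikolov proposed Doc2Vec as a way to represent texts of varying length. It is essentially the same as Word2Vec except that it adds a paragraph / document vector, and it also comes in two variants: DM (Distributed Memory) and DBOW (Distributed Bag of Words). DM tries to predict a single word given the preceding words and a paragraph vector. Even though the context window moves across the text, the paragraph vector does not change, which allows it to retain word-order information. DBOW uses the paragraph vector to predict a randomly sampled set of words from the paragraph (see Figure 2).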
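As a rough illustration of how the two Doc2Vec variants differ, the sketch below builds the kind of (input, target) examples each one learns from. It does not train anything, and the function names, the paragraph id `PARA_0`, and the window and sample sizes are all assumptions made for this example.

```python
import random

def dm_examples(paragraph_id, tokens, window=3):
    """DM: (paragraph id, preceding words) -> the word that follows them."""
    return [((paragraph_id, tuple(tokens[i - window:i])), tokens[i])
            for i in range(window, len(tokens))]

def dbow_examples(paragraph_id, tokens, n_samples=3, seed=0):
    """DBOW: paragraph id alone -> words sampled at random from the paragraph."""
    rng = random.Random(seed)
    return [(paragraph_id, rng.choice(tokens)) for _ in range(n_samples)]

tokens = "the cat sat on the mat".split()
for example in dm_examples("PARA_0", tokens):
    print(example)
# (('PARA_0', ('the', 'cat', 'sat')), 'on')
# (('PARA_0', ('cat', 'sat', 'on')), 'the')
# (('PARA_0', ('sat', 'on', 'the')), 'mat')
print(dbow_examples("PARA_0", tokens))
```

In the DM examples the same paragraph id accompanies every sliding window, which is how the paragraph vector can act as a memory of what the individual windows miss.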

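Once trained, the paragraph vectors can be fed into a sentiment classifier without needing the individual words. At the time of writing, this was the state-of-the-art method for sentiment classification on the IMDB movie review dataset, with an error rate of only 7.42%. Of course, none of this matters if the method is not practical to use. Fortunately, gensim, a third-party Python library, provides optimized versions of Word2Vec and Doc2Vec.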

A Word2Vec Example in Python

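In this section we show how to use word vectors in a sentiment classification task. The gensim library is included in the Anaconda distribution, and it can also be installed with pip. With it you can train word vectors on your own corpus (a collection of documents) or load pre-trained vectors in C text or binary format.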

Python

from gensim.models import KeyedVectors

# In gensim 1.0 and later, pre-trained vectors are loaded through KeyedVectors
model = KeyedVectors.load_word2vec_format('vectors.txt', binary=False)  # C text format
model = KeyedVectors.load_word2vec_format('vectors.bin', binary=True)   # binary format


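I have found loading Google's pre-trained word vectors especially useful. They were trained on over 100 billion words from Google News and are published as “Pre-trained word and phrase vectors”. Note that the uncompressed file is 3.5 GB. With the Google word vectors we can discover interesting relationships between words: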

Python

from gensim.models import KeyedVectors

# Load the pre-trained Google News vectors (binary format, 300 dimensions)
model = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)

model.most_similar(positive=['woman', 'king'], negative=['man'], topn=5)

[(u'queen', 0.711819589138031),
 (u'monarch', 0.618967592716217),
 (u'princess', 0.5902432799339294),
 (u'crown_prince', 0.5499461889266968),
 (u'prince', 0.5377323031425476)]
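To give a feel for what such a query computes, here is a simplified sketch of vector arithmetic followed by a cosine-similarity ranking. It is not gensim's implementation, which differs in details such as how vectors are normalized and combined, and the 2-dimensional vectors are invented so that the arithmetic is easy to follow.

```python
import math

# Hypothetical 2-d vectors: axis 0 ~ "royalty", axis 1 ~ "gender".
VECTORS = {
    "king":  [0.9, 0.9],
    "queen": [0.9, -0.9],
    "man":   [0.1, 0.9],
    "woman": [0.1, -0.9],
    "apple": [-0.8, 0.0],
}

def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def analogy(positive, negative, vectors, topn=3):
    """Rank words by cosine similarity to sum(positive) - sum(negative)."""
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    for word in positive:
        query = [q + v for q, v in zip(query, vectors[word])]
    for word in negative:
        query = [q - v for q, v in zip(query, vectors[word])]
    # The query words themselves are left out of the ranking
    exclude = set(positive) | set(negative)
    scored = [(w, cosine(query, v)) for w, v in vectors.items() if w not in exclude]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]

print(analogy(["woman", "king"], ["man"], VECTORS)[0][0])  # queen
```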


Interestingly, it can also find grammatical relationships, such as identifying superlatives and verb stems:

“biggest” – “big” + “small” = “smallest”

Python

model.most_similar(positive=['biggest','small'], negative=['big'], topn=5)

[(u'smallest', 0.6086569428443909),
 (u'largest', 0.6007465720176697),
 (u'tiny', 0.5387299656867981),
 (u'large', 0.456944078207016),
 (u'minuscule', 0.43401968479156494)]


“ate” – “eat” + “speak” = “spoke”

Python

model.most_similar(positive=['ate','speak'], negative=['eat'], topn=5)

[(u'spoke', 0.6965223550796509),
 (u'speaking', 0.6261293292045593),
 (u'conversed', 0.5754593014717102),
 (u'spoken', 0.570488452911377),
 (u'speaks', 0.5630602240562439)]


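The examples above make it clear that Word2Vec can learn meaningful relationships between words. This is why it is so powerful for many NLP tasks, including the sentiment analysis in this article. Before we apply it to sentiment analysis, let us first test how well Word2Vec can separate and cluster words. We will use three example word sets: food, sports, and weather, taken from the excellent website Enchanted Learning. Because these vectors have 300 dimensions, we will use Scikit-Learn's implementation of a dimensionality reduction algorithm called t-SNE to visualize them in a 2D plane.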

First we have to obtain the word vectors as follows:

Python

import numpy as np

with open('food_words.txt', 'r') as infile:
    food_words = infile.readlines()

with open('sports_words.txt', 'r') as infile:
    sports_words = infile.readlines()

with open('weather_words.txt', 'r') as infile:
    weather_words = infile.readlines()

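def getWordVecs(words):
    # Collect the 300-dimensional vector of every word the model knows
    vecs = []
    for word in words:
        word = word.replace('\n', '')  # drop the trailing newline left by readlines()
        try:
            vecs.append(model[word].reshape((1,300)))
        except KeyError:
            # skip words that are missing from the model's vocabulary
            continue
    vecs = np.concatenate(vecs)
    return np.array(vecs, dtype='float') #TSNE expects float type values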

food_vecs = getWordVecs(food_words)
sports_vecs = getWordVecs(sports_words)
weather_vecs = getWordVecs(weather_words)


Next we use TSNE and matplotlib to visualize the clusters with the following code:

Python

from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

ts = TSNE(n_components=2)
reduced_vecs = ts.fit_transform(np.concatenate((food_vecs, sports_vecs, weather_vecs)))

#color points by word group to see if Word2Vec can separate them
for i in range(len(reduced_vecs)):
    if i < len(food_vecs):
        #food words colored blue
        color = 'b'
    elif i < (len(food_vecs) + len(sports_vecs)):
        #sports words colored red
        color = 'r'
    else:
        #weather words colored green
        color = 'g'
    plt.plot(reduced_vecs[i,0], reduced_vecs[i,1], marker='o', color=color, markersize=8)

plt.show()



The result is as follows:

Figure 3: t-SNE clustering of food words (blue), sports words (red), and weather words (green).

As the example above shows, Word2Vec not only separates unrelated words effectively but also clusters similar ones together.

Twitter Emoji Sentiment Analysis

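We now move on to the next example: sentiment analysis of tweets, using emoticons as search terms. We use these emoticons as “fuzzy” labels for our data; a smiley (:-)) corresponds to positive sentiment and a frown (:-() to negative sentiment. The data consists of roughly 400,000 tweets, split evenly between positive and negative. We randomly sampled the positive and negative tweets and divided them into training and test sets at an 80 / 20 ratio. We then train a Word2Vec model on the training tweets. To avoid data leakage, the Word2Vec model is not trained until the training set has been split off. To structure the input for the classifier, we take the average of all the word vectors in a tweet. We will use the Scikit-Learn library for much of the machine learning.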
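The labeling and splitting steps can be sketched as follows. This is an illustration under assumptions rather than the collection pipeline actually used: the helper names, the sample tweets, and the rule of discarding tweets that contain both emoticons or neither are all made up here.

```python
import random

def label_by_emoticon(tweet):
    """Return 1 for a smiley, 0 for a frown, None if neither (or both) appears."""
    has_pos = ":-)" in tweet
    has_neg = ":-(" in tweet
    if has_pos == has_neg:
        return None
    return 1 if has_pos else 0

def split_train_test(items, test_fraction=0.2, seed=0):
    """Shuffle a copy of items and hold out test_fraction of them for testing."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

tweets = ["great game :-)", "missed my bus :-(", "new phone :-)",
          "rainy again :-(", "no emoticon here"]
labeled = [(t, label_by_emoticon(t)) for t in tweets if label_by_emoticon(t) is not None]
train, test = split_train_test(labeled)
print(len(train), len(test))  # 3 1
```

Anything fitted to the data, including the word vectors, should then see only `train`, which is the point of the data leakage remark above.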

We first import our data and train the Word2Vec model:

Python

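import numpy as np
# train_test_split lives in sklearn.model_selection (scikit-learn 0.18 and later)
from sklearn.model_selection import train_test_split
from gensim.models.word2vec import Word2Vec

with open('twitter_data/pos_tweets.txt', 'r') as infile:
    pos_tweets = infile.readlines()

with open('twitter_data/neg_tweets.txt', 'r') as infile:
    neg_tweets = infile.readlines()

# use 1 for positive sentiment, 0 for negative
y = np.concatenate((np.ones(len(pos_tweets)), np.zeros(len(neg_tweets))))

x_train, x_test, y_train, y_test = train_test_split(np.concatenate((pos_tweets, neg_tweets)), y, test_size=0.2)

# Minor preprocessing: lowercase, drop newlines, split on whitespace
def cleanText(corpus):
    corpus = [z.lower().replace('\n','').split() for z in corpus]
    return corpus

x_train = cleanText(x_train)
x_test = cleanText(x_test)

n_dim = 300
# Initialize the model and build the vocabulary
# (gensim 4 and later name the dimension argument vector_size; older releases call it size)
imdb_w2v = Word2Vec(vector_size=n_dim, min_count=10)
imdb_w2v.build_vocab(x_train)

# Train the model on the training tweets only (this may take a few minutes)
imdb_w2v.train(x_train, total_examples=imdb_w2v.corpus_count, epochs=imdb_w2v.epochs)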

