[Natural Language Processing] (4) Word2Vec

阿新 • Published: 2018-12-31

Abstract

Keywords: GloVe, word2vec, NNLM, cosine similarity

Core idea: [represent a word by the words that appear near it]

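In 1954 Harris proposed the distributional hypothesis, which states that the meaning of a word is determined by the contexts in which it appears. Essentially every embedding method today is therefore some way of modeling a word's context. RNN and CNN models do the same thing: they learn from context to obtain a semantic vector for a sentence.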

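Embedding learning currently falls into a few main schools:
1. Matrix-based methods (e.g. LSA, GloVe)
2. Clustering-based methods (e.g. Brown clustering)
3. Neural-network-based methods (there are many: NNLM, CBOW, Skip-gram, RNNLM, etc.)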

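This is similar to how we learned words as children: by making sentence after sentence and placing a word into specific contexts again and again, we come to learn what the word really means.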

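For example:
Sentence 1: Apples from Shandong are red and sweet
Sentence 2: The apple in my hand is round
Sentence 3: Pears and apples are both fruits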

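At first we have no idea what an "apple" is. From the three samples above, and assuming very high confidence (say 100%), we can be quite sure that "apple" has the attributes [from Shandong, red, round, fruit, ...]. As the number of samples grows, this feature vector becomes more accurate. Conversely, we can use the resulting matrix to predict context, which is what makes the various NLP applications possible.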

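Another example:
Sentence 1: I love Beijing Tiananmen
Sentence 2: I like Beijing Tiananmen

In a word-vector representation, "love" and "like" occur in almost identical contexts, so their similarity should be high and they will be judged to be strongly "related" words. Note that they are not necessarily synonyms; all we can say is that the two are very similar.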
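To make the "same context, high similarity" argument concrete, here is a minimal sketch that counts neighbors for each word in the two sentences above and compares the counts with cosine similarity. The English glosses, the window of one word on each side, and the helper names `context_vector` and `cosine` are my own assumptions for illustration.

```python
import numpy as np

# The two example sentences, already segmented (English glosses).
sentences = [
    ["I", "love", "Beijing", "Tiananmen"],
    ["I", "like", "Beijing", "Tiananmen"],
]
vocab = sorted({w for s in sentences for w in s})
index = {w: i for i, w in enumerate(vocab)}

def context_vector(word, window=1):
    """Count the words that appear within `window` positions of `word`."""
    vec = np.zeros(len(vocab))
    for s in sentences:
        for i, w in enumerate(s):
            if w != word:
                continue
            for j in range(max(0, i - window), min(len(s), i + window + 1)):
                if j != i:
                    vec[index[s[j]]] += 1
    return vec

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

sim_love_like = cosine(context_vector("love"), context_vector("like"))
sim_love_beijing = cosine(context_vector("love"), context_vector("Beijing"))
print(sim_love_like, sim_love_beijing)
```

"love" and "like" share exactly the same neighbors here, so their similarity is at the maximum, while "love" and "Beijing" share none.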

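Section 1: Co-occurrence matrix

Word-word co-occurrence matrix

Sentence 1: I like deep learning .
Sentence 2: I like NLP .
Sentence 3: I enjoy flying .

The counts below use a window of one word on each side, and the sentence-final period is counted as a token.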

counts	I	like	enjoy	deep	learning	NLP	flying	.
I	0	2	1	0	0	0	0	0
like	2	0	0	1	0	1	0	0
enjoy	1	0	0	0	0	0	1	0
deep	0	1	0	0	1	0	0	0
learning	0	0	0	1	0	0	0	1
NLP	0	1	0	0	0	0	0	1
flying	0	0	1	0	0	0	0	1
.	0	0	0	0	1	1	1	0

In the first row we can see that I frequently co-occurs with both like and enjoy. Does that suggest like and enjoy are fairly close to each other?
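The table above can be rebuilt from the three sentences with a few lines of code. This is only a sketch: it assumes a window of one word on each side and treats the sentence-final period as a token, which is what the counts in the table imply. The function name `cooccurrence` is mine.

```python
import numpy as np

corpus = ["I like deep learning .", "I like NLP .", "I enjoy flying ."]
words = ["I", "like", "enjoy", "deep", "learning", "NLP", "flying", "."]
index = {w: i for i, w in enumerate(words)}

def cooccurrence(corpus, window=1):
    """Count how often each pair of words appears within `window` positions."""
    X = np.zeros((len(words), len(words)), dtype=int)
    for sentence in corpus:
        tokens = sentence.split()
        for i, w in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    X[index[w], index[tokens[j]]] += 1
    return X

X = cooccurrence(corpus)
print(X)

# Cosine similarity between the rows for "like" and "enjoy"
like, enjoy = X[index["like"]], X[index["enjoy"]]
sim = like @ enjoy / (np.linalg.norm(like) * np.linalg.norm(enjoy))
print(sim)
```

The two rows overlap only in the column for I, so the similarity is positive but well below 1.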

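The drawback is that as the vocabulary grows, the matrix runs into the curse of dimensionality and also becomes sparse.

The simplest fix: reduce the dimensionality with [singular value decomposition (SVD)].

SVD is closely related to eigendecomposition, but SVD can factor any m×n matrix, whereas eigendecomposition applies only to square m×m matrices.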
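A quick numerical check of that relationship, as a sketch: the 2×3 matrix `A` below is an arbitrary example of mine. It has no eigendecomposition because it is not square, but it does have an SVD, and its squared singular values match the eigenvalues of the square matrix A·Aᵀ.

```python
import numpy as np

A = np.array([[3.0, 1.0, 1.0],
              [-1.0, 3.0, 1.0]])            # 2x3, not square

U, s, Vh = np.linalg.svd(A, full_matrices=False)
reconstructed = U @ np.diag(s) @ Vh         # should reproduce A

eigvals = np.linalg.eigvalsh(A @ A.T)       # eigenvalues of the symmetric 2x2 matrix, ascending

print(np.allclose(A, reconstructed))
print(np.allclose(np.sort(s ** 2), eigvals))
```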

import numpy as np
import matplotlib.pyplot as plt


la = np.linalg
words = ["I" , "like" , "enjoy" , "deep" , "learning" , "NLP" , "flying" , "."]
X = np.array([
    [0,2,1,0,0,0,0,0],
    [2,0,0,1,0,1,0,0],
    [1,0,0,0,0,0,1,0],
    [0,1,0,0,1,0,0,0],
    [0,0,0,1,0,0,0,1],
    [0,1,0,0,0,0,0,1],
    [0,0,1,0,0,0,0,1],
    [0,0,0,0,1,1,1,0]
])
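# Factor X = U @ diag(s) @ Vh; each row of U is a dense vector for one word
U, s, Vh = la.svd(X, full_matrices=False)

# Plot every word using the first two columns of U as its 2-D coordinates
for i in range(len(words)):
    plt.scatter(U[i, 0], U[i, 1])
    plt.text(U[i, 0], U[i, 1], words[i])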

plt.show()

The result:

[Figure: scatter plot of the eight words, positioned by the first two columns of U]

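Problem 1: SVD is expensive, O(n^3) for an n×n matrix

Problem 2: the decomposition has to be recomputed whenever new words are added

Problem 3: it cannot be integrated with a deep network

GloVe: learning word vectors from the co-occurrence matrix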

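Section 2: Learning word vectors with neural networks

The ancestor: NNLM

Core idea: slide a fixed-size window across the whole corpus, use the earlier words in each window to predict the last one, and sum the log-likelihood over all windows. At prediction time, apply a window of the same size and choose the word with the maximum likelihood.

Example: suppose "我愛北京天安門" ("I love Beijing Tiananmen") makes up most of the corpus, and word segmentation yields "我 / 愛 / 北京 / 天安門". With a window size of 4, once training is done, whenever the model sees "我 愛 北京" ("I love Beijing") the probability of "天安門" ("Tiananmen") will be high, so it is chosen as the prediction.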

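The steps are:

1. Define a matrix C that projects one-hot vectors to dense vectors
2. Concatenate the dense vectors of the words in the window and feed them into the hidden layer
3. Apply a softmax classifier to the hidden layer's output and backpropagate the gradients to update the parameters
4. Take the learned dense vectors (the rows of C, a by-product of training the language model) as the word vectors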
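The steps above can be sketched as a forward pass in NumPy. This is a simplified illustration, not the original NNLM: biases and the direct input-to-output connections are left out, the sizes are tiny, and the names `H`, `U` and `nnlm_forward` are my own.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, n_ctx, hidden = 10, 4, 3, 8   # vocab size, embedding dim, context words, hidden units

C = rng.normal(scale=0.1, size=(V, d))               # projection matrix: one row per word
H = rng.normal(scale=0.1, size=(n_ctx * d, hidden))  # hidden-layer weights
U = rng.normal(scale=0.1, size=(hidden, V))          # output-layer weights

def nnlm_forward(context_ids):
    """Return a probability distribution over the next word."""
    x = C[context_ids].reshape(-1)   # look up each context word in C and concatenate
    a = np.tanh(x @ H)               # hidden layer
    logits = a @ U
    logits -= logits.max()           # for numerical stability
    p = np.exp(logits)
    return p / p.sum()               # softmax

# Multiplying a one-hot vector by C is the same as selecting a row of C.
one_hot = np.zeros(V)
one_hot[5] = 1.0
same_row = np.allclose(one_hot @ C, C[5])

probs = nnlm_forward([1, 5, 7])
print(same_row, probs.shape, probs.sum())
```

Training would backpropagate the cross-entropy loss through `U`, `H` and `C`; the rows of `C` are what is kept afterwards.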

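Word2Vec

Compared with NNLM, W2V removes the hidden layer

Window size

As described under [Core idea], Word2Vec infers the meaning of a word from the words around it. How many such words should we use? In other words, how many friends should we consult to judge whether someone is telling the truth? That number is the window size.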
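A minimal sketch of what the window size controls: for every position, collect up to `window` words on each side as that word's context. The function name `context_pairs` and the English tokens are assumptions for illustration.

```python
def context_pairs(tokens, window):
    """Pair every center word with the words at most `window` positions away."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        pairs.append((center, context))
    return pairs

tokens = ["I", "love", "Beijing", "Tiananmen"]
pairs = context_pairs(tokens, window=1)
for center, context in pairs:
    print(center, context)
```

A larger window brings in more distant neighbors, which tends to capture topic-level similarity at the cost of more training pairs.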

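Random window in word2vec

In Google's word2vec implementation the window is not the same at every training position: for each center word, the window actually used is a random size no larger than window! I was stunned. I have heard that a paper demonstrates the effectiveness of this; I will post it once I find it.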
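Here is a sketch of that random shrinking, based on my reading of the reference C code: a random offset `b` between 0 and window-1 is drawn for each center word, and the effective window becomes window - b. The function name `sample_context` and the sample sentence are my own.

```python
import random

def sample_context(tokens, i, window, rng):
    """Return (effective_window, context) for the center word tokens[i]."""
    b = rng.randrange(window)      # uniform in 0 .. window-1
    effective = window - b         # uniform in 1 .. window
    lo = max(0, i - effective)
    hi = min(len(tokens), i + effective + 1)
    return effective, [tokens[j] for j in range(lo, hi) if j != i]

rng = random.Random(0)
tokens = "the quick brown fox jumps over the lazy dog again today".split()
samples = [sample_context(tokens, 5, 5, rng) for _ in range(1000)]
print(sorted({effective for effective, _ in samples}))
```

One consequence is that nearby words are included in the context more often than distant ones, so they are effectively weighted more heavily.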

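CBOW (Continuous Bag of Words)

Characteristics
1. No hidden layer
2. Uses a bidirectional context window
3. Ignores the order of the context words
4. The input layer uses low-dimensional dense representations directly
5. The projection layer is simplified to a sum (researchers have also tried averaging)

Compared with NNLM, CBOW drops the hidden layer and the separate projection matrix C. It starts directly from randomly initialized word vectors at the input, projects them with a plain SUM, and optimizes the word vectors as the window slides over the corpus. At the output it uses hierarchical softmax or negative sampling to cut the cost of the output layer.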
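A minimal sketch of the CBOW forward pass described above. For readability it uses a full softmax at the output rather than hierarchical softmax or negative sampling, and the names `W_in`, `W_out` and `cbow_forward` are mine.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 10, 4                                  # vocab size, embedding dimension
W_in = rng.normal(scale=0.1, size=(V, d))     # input word vectors, randomly initialized
W_out = rng.normal(scale=0.1, size=(V, d))    # output vectors, one per word

def cbow_forward(context_ids):
    """Predict the center word from the sum of its context vectors."""
    h = W_in[context_ids].sum(axis=0)   # projection layer: just a sum, no hidden layer
    logits = W_out @ h
    logits -= logits.max()              # for numerical stability
    p = np.exp(logits)
    return p / p.sum()

p_forward = cbow_forward([1, 2, 4, 5])
p_shuffled = cbow_forward([5, 4, 2, 1])
print(np.allclose(p_forward, p_shuffled))
```

Because the projection is a sum, shuffling the context words gives the same prediction, which is the "bag of words" part of the name.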

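Hierarchical Softmax

Suppose the vocabulary has 100,000 words. A conventional softmax has to compute probabilities over all 100,000 dimensions before taking the argmax, which is expensive. Hierarchical softmax assigns each word a Huffman code and performs a logistic regression at each node along the word's path, which yields the same kind of multi-class prediction as softmax. Because the Huffman tree turns the choice among n words into a sequence of binary decisions, the number of classifications needed drops sharply, to about log2(n).

The procedure is:
1. Start at the root and walk down the tree following the current word's code
2. Conventionally a code bit of 1 is the positive class, but Google's implementation does the opposite and treats 0 as the positive class, so the loss function looks slightly different
3. At each level, two things are updated: the weights of that node, and (via the gradient) the word vector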
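Negative sampling

The procedure is:

1. Take the current word, which is clearly a positive example; set its label to 1, compute the gradient, and update
2. Repeatedly draw negative examples from the table; set their label to 0, compute the gradient, and update
3. If a drawn word is the current word itself, skip it and continue the loop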
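The three steps above can be sketched as a single update in NumPy. The word counts are made up, the function name `negative_sampling_step` is mine, and drawing negatives with probability proportional to count^0.75 is how I understand the sampling table in the reference implementation to be built.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def negative_sampling_step(h, W_out, target, negatives, lr=0.05):
    """Update W_out in place and return the gradient to apply to the input vectors."""
    grad_h = np.zeros_like(h)
    # Positive example first, then the negatives; a negative equal to the target is skipped.
    samples = [(target, 1.0)] + [(n, 0.0) for n in negatives if n != target]
    for word, label in samples:
        f = sigmoid(W_out[word] @ h)
        g = (label - f) * lr
        grad_h += g * W_out[word]    # accumulate before W_out[word] changes
        W_out[word] += g * h
    return grad_h

rng = np.random.default_rng(0)
counts = np.array([50, 30, 10, 5, 3, 1, 1], dtype=float)   # hypothetical word frequencies
p_neg = counts ** 0.75
p_neg /= p_neg.sum()                                        # sampling distribution ("table")

W_out = rng.normal(scale=0.1, size=(len(counts), 4))
h = rng.normal(size=4)                                      # projected context vector
target = 3
negatives = rng.choice(len(counts), size=5, p=p_neg)

score_before = W_out[target] @ h
grad_h = negative_sampling_step(h, W_out, target, negatives)
score_after = W_out[target] @ h
print(score_before, score_after)
```

After the step, the score of the true word goes up and the scores of the sampled negatives go down, with only a handful of rows of `W_out` touched instead of the whole vocabulary.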

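Training word2vec with gensim

Problem encountered: data dimensionality

gensim expects its training data as a list of sentences, each of which is a list of tokens.

After sentence splitting and cleaning, however, the column in df is a Series nested three levels deep (reviews, sentences, tokens), so I flatten one level with the sum function: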

sentences = sum(sentences, [])
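The `sum(..., [])` trick works, but it builds a new list on every addition, so it slows down quadratically on a large corpus. As a sketch of an alternative, `itertools.chain.from_iterable` flattens one level in linear time. The tiny `reviews` list below is made-up data standing in for the cleaned Series.

```python
from itertools import chain

# Hypothetical cleaned data: reviews -> sentences -> tokens
reviews = [
    [["this", "movie", "is", "great"], ["loved", "it"]],
    [["terrible", "plot"]],
]

flat_sum = sum(reviews, [])                      # the approach used above
flat_chain = list(chain.from_iterable(reviews))  # same result, linear time

print(flat_chain)
print(flat_sum == flat_chain)
```

Either way the result is a flat list of token lists, which is the shape gensim expects.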

Displaying word vectors in 2-D space

#!/usr/bin/env python
# coding=utf-8
import numpy as np
import matplotlib
import matplotlib.pyplot as plt

from sklearn.decomposition import PCA
import word2vec
# load the word2vec model
model = word2vec.load('corpusWord2Vec.bin')
rawWordVec=model.vectors

# reduce the dimension of word vector
X_reduced = PCA(n_components=2).fit_transform(rawWordVec)

# show some words (center words) and their similar words
index1,metrics1 = model.cosine(u'中國')
index2,metrics2 = model.cosine(u'清華')
index3,metrics3 = model.cosine(u'牛頓')
index4,metrics4 = model.cosine(u'自動化')
index5,metrics5 = model.cosine(u'劉亦菲')

# add the index of center word 
index01=np.where(model.vocab==u'中國')
index02=np.where(model.vocab==u'清華')
index03=np.where(model.vocab==u'牛頓')
index04=np.where(model.vocab==u'自動化')
index05=np.where(model.vocab==u'劉亦菲')

index1=np.append(index1,index01)
index2=np.append(index2,index02)
index3=np.append(index3,index03)
index4=np.append(index4,index04)
index5=np.append(index5,index05)

# plot the result
zhfont = matplotlib.font_manager.FontProperties(fname='/usr/share/fonts/truetype/wqy/wqy-microhei.ttc')
fig = plt.figure()
ax = fig.add_subplot(111)

for i in index1:
    ax.text(X_reduced[i][0],X_reduced[i][1], model.vocab[i], fontproperties=zhfont,color='r')

for i in index2:
    ax.text(X_reduced[i][0],X_reduced[i][1], model.vocab[i], fontproperties=zhfont,color='b')

for i in index3:
    ax.text(X_reduced[i][0],X_reduced[i][1], model.vocab[i], fontproperties=zhfont,color='g')

for i in index4:
    ax.text(X_reduced[i][0],X_reduced[i][1], model.vocab[i], fontproperties=zhfont,color='k')

for i in index5:
    ax.text(X_reduced[i][0],X_reduced[i][1], model.vocab[i], fontproperties=zhfont,color='c')

ax.axis([0,0.8,-0.5,0.5])
plt.show()

[Figure: 2-D PCA projection of the five center words and their most similar words, one color per group]

Sentiment analysis with a random forest (RF) classifier on word2vec vectors

Code

import os
import re
import time
import numpy as np
import pickle
import pandas as pd
import gensim.models.word2vec as word2vec

import nltk
from nltk.corpus import stopwords  # stop words
from bs4 import BeautifulSoup

from sklearn.ensemble import RandomForestClassifier
from sklearn.cluster import KMeans


def load_dataset(name, nrows=None):
    datasets = {
        'unlabeled_train': 'unlabeledTrainData.tsv',
        'labeled_train': 'labeledTrainData.tsv',
        'test': 'testData.tsv'
    }
    if name not in datasets:
        raise ValueError(name)
    datafile = os.path.join('./', 'Data', datasets[name])
    df = pd.read_csv(datafile, sep='\t', escapechar='\\', nrows=nrows)
    print('Number of data: {}'.format(len(df)))
    return df

eng_stopwords = stopwords.words('english')
def clean_text(text, remove_stopwords=False):
    # 1. Strip HTML tags
    text = BeautifulSoup(text, 'html.parser').get_text()
    # 2. Replace everything that is not a letter with a space
    text = re.sub(r'[^a-zA-Z]', ' ', text)
    # 3. Lowercase and split into tokens
    text = text.lower().split()
    # 4. Optionally remove stop words
    if remove_stopwords:
        text = [e for e in text if e not in eng_stopwords]
    return text


# Parameters for word-vector training
num_features = 300    # Word Vector Dimension
min_word_count = 40   # Minimum word count
num_workers = 4       # Number of threads to run in parallel
context = 10          # Context window size
downsampling = 1e-3   # Downsample setting for frequent words
model_name = '{}features_{}minwords_{}context.model'.format(num_features, min_word_count, context)

def word2vec_Training(sentences=None):
    import logging
    logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

    print('Training word2vec model...')
    model = word2vec.Word2Vec(
        sentences,
        workers=num_workers,
        size=num_features,  # this argument is named vector_size in gensim >= 4.0
        min_count=min_word_count,
        window=context,
        sample=downsampling
    )
    model.init_sims(replace=True)
    model.save(os.path.join('./', 'Model', model_name))

    return model



word2vec_model_pkl_filepath = os.path.join('./', 'Model', model_name)
if os.path.exists(word2vec_model_pkl_filepath):
    model = word2vec.Word2Vec.load(word2vec_model_pkl_filepath)
    print(model)
else:
    print('Use the model_2 to do the text cleaning')

def to_review_vector(review):
    words = clean_text(review, remove_stopwords=True)
    # Look vectors up through model.wv and skip words missing from the vocabulary
    array = np.array([model.wv[w] for w in words if w in model.wv])
    return pd.Series(array.mean(axis=0))

def main():
    df = load_dataset('labeled_train')
    train_data_features = df.review.apply(to_review_vector)
    forest = RandomForestClassifier(n_estimators=100, random_state=42)
    forest.fit(train_data_features, df.sentiment)
    test_data = 'This movie is a disaster within a disaster film. It is full of great action scenes, which are only meaningful if you throw away all sense of reality. Let\'s see, word to the wise, lava burns you; steam burns you. You can\'t stand next to lava. Diverting a minor lava flow is difficult, let alone a significant one. Scares me to think that some might actually believe what they saw in this movie.<br /><br />Even worse is the significant amount of talent that went into making this film. I mean the acting is actually very good. The effects are above average. Hard to believe somebody read the scripts for this and allowed all this talent to be wasted. I guess my suggestion would be that if this movie is about to start on TV ... look away! It is like a train wreck: it is so awful that once you know what is coming, you just have to watch. Look away and spend your time on more meaningful content.'
    test_data = to_review_vector(test_data).tolist()
    # predict expects a 2-D input (n_samples, n_features), so wrap the single vector in a list
    result = forest.predict([test_data])
    print(result)

if __name__ == '__main__':
    main()
