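Natural Language Processing (NLP) with TensorFlow, Part 2: Advanced Word2Vec
I. Overview
The previous post introduced Word2Vec (word vectors), covered the basics of word embeddings, and described the two common Word2Vec models: skip-gram and CBOW. This post compares the two algorithms and introduces an extension of them, the GloVe model.
First, we compare the original skip-gram algorithm with the improved skip-gram algorithm. We then look at how skip-gram and CBOW differ, how each behaves as the number of iterations grows, and which of the two the available data favors.
Second, we discuss extensions to Word2Vec that improve its performance, such as negative sampling and ignoring uninformative words. We also cover a newer word embedding technique, Global Vectors (GloVe), and compare it with skip-gram and CBOW.
Finally, we use Word2Vec to solve a real-world problem: document classification.
II. The original skip-gram model versus the improved one
1. Theory
The original skip-gram model has no intermediate hidden layer. Instead, it uses two different embedding layers (also called projection layers) and defines a cost function derived directly from the embeddings. Figure 2-1 shows the original skip-gram model and Figure 2-2 shows the improved skip-gram model (the latter also appeared in Part 1).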
Figure 2-1: The original skip-gram model, without a hidden layer
Figure 2-2: The improved skip-gram model, with a hidden layer
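2. Comparing the TensorFlow implementations
Because the original skip-gram model has no hidden layer, it cannot be implemented as simply as the version in the previous post: the loss function has to be written by hand in TensorFlow instead of being taken from a built-in function. In other words, without a softmax layer there are no softmax weights and softmax biases from which to compute the loss. Two parts of the code therefore differ: (1) the definition of the model parameters and other variables, and (2) the model computations.
The data and the remaining steps are the same as in Part 1, so this section only shows where the two versions differ, followed by a plot comparing their losses as iterations increase.
2.1 Defining model parameters and other variables
Code for the original skip-gram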
# Variables
# Embedding layers, contains the word embeddings
# We define two embedding layers
# in_embeddings is used to lookup embeddings corresponding to target words (inputs)
in_embeddings = tf.Variable(
tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0)
)
# out_embeddings is used to lookup embeddings corresponding to context words (labels)
out_embeddings = tf.Variable(
tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0)
)
Code for the improved (optimized) skip-gram
# Variables
# Embedding layer, contains the word embeddings
embeddings = tf.Variable(tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
# Softmax Weights and Biases
softmax_weights = tf.Variable(
tf.truncated_normal([vocabulary_size, embedding_size],
stddev=0.5 / math.sqrt(embedding_size))
)
softmax_biases = tf.Variable(tf.random_uniform([vocabulary_size],0.0,0.01))
2.2 Defining the model computations
Code for the original skip-gram
# 1. Compute negative samples for a given batch of data
# Returns a [num_sampled] size Tensor
negative_samples, _, _ = tf.nn.log_uniform_candidate_sampler(train_labels, num_true=1, num_sampled=num_sampled,
unique=True, range_max=vocabulary_size)
# 2. Look up embeddings for inputs, outputs and negative samples.
in_embed = tf.nn.embedding_lookup(in_embeddings, train_dataset)
out_embed = tf.nn.embedding_lookup(out_embeddings, tf.reshape(train_labels,[-1]))
negative_embed = tf.nn.embedding_lookup(out_embeddings, negative_samples)
# 3. Manually defining negative sample loss
# As TensorFlow offers limited flexibility in the built-in sampled_softmax_loss function,
# we have to manually define the loss function.
# 3.1. Computing the loss for the positive sample
# Exactly we compute log(sigma(v_o * v_i^T)) with this equation
loss = tf.reduce_mean(
tf.log(
tf.nn.sigmoid(
tf.reduce_sum(
tf.diag([1.0 for _ in range(batch_size)])*
tf.matmul(out_embed,tf.transpose(in_embed)),
axis=0)
)
)
)
# 3.2. Computing loss for the negative samples
# We compute sum(log(sigma(-v_no * v_i^T))) with the following
# Note: The exact way this part is computed in TensorFlow library appears to be
# by taking only the weights corresponding to true samples and negative samples
# and then computing the softmax_cross_entropy_with_logits for that subset of weights.
# More info at: https://github.com/tensorflow/tensorflow/blob/r1.8/tensorflow/python/ops/nn_impl.py
# Though the approach is different, the idea remains the same
loss += tf.reduce_mean(
tf.reduce_sum(
tf.log(tf.nn.sigmoid(-tf.matmul(negative_embed,tf.transpose(in_embed)))),
axis=0
)
)
# The above is the log likelihood.
# We would like to transform this to the negative log likelihood
# to convert this to a loss. This provides us with
# L = - (log(sigma(v_o * v_i^T))+sum(log(sigma(-v_no * v_i^T))))
loss *= -1.0
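To make the hand-written loss above easier to follow, here is a minimal NumPy sketch of the same quantity, L = -(log σ(v_o · v_i) + Σ log σ(-v_n · v_i)), averaged over the batch. The function name `negative_sampling_loss` and the toy inputs are assumptions made for illustration; this is not the TensorFlow code from the post, only a restatement of its arithmetic.

```python
import numpy as np


def sigmoid(x):
    """Logistic function, applied element-wise."""
    return 1.0 / (1.0 + np.exp(-x))


def negative_sampling_loss(in_embed, out_embed, negative_embed):
    """Negative-sampling loss, averaged over the batch (hypothetical helper).

    in_embed:       [batch, dim]        embeddings of the input (target) words
    out_embed:      [batch, dim]        embeddings of the true context words
    negative_embed: [num_sampled, dim]  embeddings of the sampled negative words
    """
    # Row-wise dot product v_o . v_i for each (input, context) pair
    positive_score = np.sum(in_embed * out_embed, axis=1)              # [batch]
    positive_term = np.log(sigmoid(positive_score))
    # Score of every negative sample against every input: [num_sampled, batch]
    negative_score = negative_embed @ in_embed.T
    negative_term = np.sum(np.log(sigmoid(-negative_score)), axis=0)   # [batch]
    # Log likelihood turned into a loss by flipping the sign
    return -(np.mean(positive_term) + np.mean(negative_term))


# Untrained (all-zero) embeddings: every sigmoid evaluates to 0.5
zero_loss = negative_sampling_loss(np.zeros((2, 4)), np.zeros((2, 4)), np.zeros((3, 4)))

# One pair whose context vector agrees with the input and whose negative disagrees
aligned_loss = negative_sampling_loss(
    np.array([[2.0, 0.0]]), np.array([[2.0, 0.0]]), np.array([[-2.0, 0.0]]))

print(zero_loss, aligned_loss)
```

The element-wise product followed by a row sum plays the role of the `tf.diag(...) * tf.matmul(...)` trick in the listing: both keep only the dot product of each input with its own context word.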
Code for the improved (optimized) skip-gram
# Model.
# Look up embeddings for a batch of inputs.
embed = tf.nn.embedding_lookup(embeddings, train_dataset)
# Compute the softmax loss, using a sample of the negative labels each time.
loss = tf.reduce_mean(
    tf.nn.sampled_softmax_loss(
        weights=softmax_weights, biases=softmax_biases, inputs=embed,
        labels=train_labels, num_sampled=num_sampled, num_classes=vocabulary_size)
)
2.3 Comparing the original and improved skip-gram
Code for the comparison
import csv
import os
import numpy as np
from matplotlib import pylab

# Load the skip-gram losses from the calculations we did in Chapter 3
# So you need to make sure you have this csv file before running the code below
skip_loss_path = os.path.join('..','ch3','skip_losses.csv')
with open(skip_loss_path, 'rt') as f:
    reader = csv.reader(f,delimiter=',')
    for r_i,row in enumerate(reader):
        if r_i == 0:
            skip_gram_loss = [float(s) for s in row]
pylab.figure(figsize=(15,5)) # figure in inches
# Define the x axis
x = np.arange(len(skip_gram_loss))*2000
# Plot the skip_gram_loss (loaded from chapter 3)
pylab.plot(x, skip_gram_loss, label="Skip-Gram (Improved)",linestyle='--',linewidth=2)
# Plot the original skip gram loss from what we just ran
pylab.plot(x, skip_gram_loss_original, label="Skip-Gram (Original)",linewidth=2)
# Set some text around the plot
pylab.title('Original vs Improved Skip-Gram Loss Decrease Over Time',fontsize=24)
pylab.xlabel('Iterations',fontsize=22)
pylab.ylabel('Loss',fontsize=22)
pylab.legend(loc=1,fontsize=22)
# use for saving the figure if needed
pylab.savefig('loss_skipgram_original_vs_impr.jpg')
pylab.show()
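The resulting comparison plot
Figure 2-3: The original skip-gram algorithm versus the improved skip-gram algorithm
Figure 2-3 shows that the skip-gram algorithm with a hidden layer performs better than the one without, which also suggests that the improved skip-gram does better when Word2Vec is made deeper.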
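III. Comparing the skip-gram and CBOW models
1. Differences
Figures 2-4 and 2-5 show the skip-gram and CBOW models respectively (both also appeared in Part 1).
As the figures show, given a context and a target word, skip-gram looks at the target word and only one context word in a single input/output tuple, whereas CBOW looks at the target word and all the words of the context in a single sample. For example, for the phrase "dog barked at the mailman", skip-gram sees an input/output tuple of the form ["dog", "at"], while CBOW sees [["dog","barked","the","mailman"],"at"]. For a given dataset, CBOW therefore receives more information about the context of a given word than skip-gram does. The next section looks at how this difference affects the performance of the two algorithms.
Figure 2-4: The skip-gram model
Figure 2-5: The CBOW model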
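The difference in what each model sees per sample can be sketched in a few lines of plain Python. The helper names `skip_gram_pairs` and `cbow_samples` are hypothetical and are not the `generate_batch` functions used in this series; they only illustrate the shape of the training tuples for the example phrase.

```python
def skip_gram_pairs(tokens, window):
    """One (word, neighbouring_word) tuple per word inside the window."""
    pairs = []
    for i, word in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((word, tokens[j]))
    return pairs


def cbow_samples(tokens, window):
    """One (context_words, target) tuple per position that has a full window."""
    samples = []
    for i in range(window, len(tokens) - window):
        context = tokens[i - window:i] + tokens[i + 1:i + window + 1]
        samples.append((context, tokens[i]))
    return samples


sentence = ["dog", "barked", "at", "the", "mailman"]
sg = skip_gram_pairs(sentence, window=2)
cbow = cbow_samples(sentence, window=2)

print(len(sg), "skip-gram tuples, for example", sg[:3])
print(len(cbow), "CBOW sample:", cbow)
```

With a window of two words on each side, the single CBOW sample carries all four context words at once, while skip-gram spreads the same information over many two-word tuples.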
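2. Performance comparison
Let us now plot the loss over time for the skip-gram and CBOW training runs from the previous post, to see which algorithm's loss decreases faster. The result is shown in Figure 2-6.
Implementation code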
# Load the CBOW losses from the calculations we did in Chapter 3
# So you need to make sure you have this csv file before running the code below
cbow_loss_path = os.path.join('..','ch3','cbow_losses.csv')
with open(cbow_loss_path, 'rt') as f:
    reader = csv.reader(f,delimiter=',')
    for r_i,row in enumerate(reader):
        if r_i == 0:
            cbow_loss = [float(s) for s in row]
pylab.figure(figsize=(15,5)) # in inches
# Define the x axis
x = np.arange(len(skip_gram_loss))*2000
# Plot the skip_gram_loss (loaded from chapter 3)
pylab.plot(x, skip_gram_loss, label="Skip-Gram",linestyle='--',linewidth=2)
# Plot the cbow_loss (loaded from chapter 3)
pylab.plot(x, cbow_loss, label="CBOW",linewidth=2)
# Set some text around the plot
pylab.title('Skip-Gram vs CBOW Loss Decrease Over Time',fontsize=24)
pylab.xlabel('Iterations',fontsize=22)
pylab.ylabel('Loss',fontsize=22)
pylab.legend(loc=1,fontsize=22)
# use for saving the figure if needed
pylab.savefig('loss_skipgram_vs_cbow.png')
pylab.show()
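Output
Figure 2-6: Loss decrease: skip-gram versus CBOW
3. Analysis
As Figure 2-6 shows, the loss of the CBOW model decreases faster than that of the skip-gram model, which is consistent with CBOW receiving more information about the context of the target word in each input/output tuple. However, the loss alone is not an adequate measure of model performance, because the loss can also drop quickly when the model is overfitting the training data. We therefore inspect the learned embeddings visually as well, to check whether skip-gram and CBOW differ noticeably in a semantic sense. As before, we use a popular visualization technique: t-Distributed Stochastic Neighbor Embedding (t-SNE).
Part of the code and the final output are given below.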
Plotting the Embeddings
import numpy as np
from matplotlib import pylab
from sklearn.cluster import KMeans

def plot_embeddings_side_by_side(sg_embeddings, cbow_embeddings, sg_labels, cbow_labels):
    ''' Plots word embeddings of skip-gram and CBOW side by side as subplots
    '''
    # number of clusters for each word embedding
    # clustering is used to assign different colors as a visual aid
    n_clusters = 20
    # automatically build a discrete set of colors, each for cluster
    print('Define Label colors for %d',n_clusters)
    # Note: newer matplotlib releases expose this colormap as pylab.cm.nipy_spectral
    label_colors = [pylab.cm.spectral(float(i) /n_clusters) for i in range(n_clusters)]
    # Make sure number of embeddings and their labels are the same
    assert sg_embeddings.shape[0] >= len(sg_labels), 'More labels than embeddings'
    assert cbow_embeddings.shape[0] >= len(cbow_labels), 'More labels than embeddings'
    print('Running K-Means for skip-gram')
    # Define K-Means
    sg_kmeans = KMeans(n_clusters=n_clusters, init='k-means++', random_state=0).fit(sg_embeddings)
    sg_kmeans_labels = sg_kmeans.labels_
    sg_cluster_centroids = sg_kmeans.cluster_centers_
    print('Running K-Means for CBOW')
    cbow_kmeans = KMeans(n_clusters=n_clusters, init='k-means++', random_state=0).fit(cbow_embeddings)
    cbow_kmeans_labels = cbow_kmeans.labels_
    cbow_cluster_centroids = cbow_kmeans.cluster_centers_
    print('K-Means ran successfully')
    print('Plotting results')
    pylab.figure(figsize=(25,20)) # in inches
    # Get the first subplot
    pylab.subplot(1, 2, 1)
    # Plot all the embeddings and their corresponding words for skip-gram
    for i, (label,klabel) in enumerate(zip(sg_labels,sg_kmeans_labels)):
        center = sg_cluster_centroids[klabel,:]
        # Read the point from the skip-gram embeddings (the listing originally
        # indexed cbow_embeddings here, which plotted the wrong points)
        x, y = sg_embeddings[i,:]
        # This is just to spread the data points around a bit
        # So that the labels are clearer
        # We repel datapoints from the cluster centroid
        if x < center[0]:
            x += -abs(np.random.normal(scale=2.0))
        else:
            x += abs(np.random.normal(scale=2.0))
        if y < center[1]:
            y += -abs(np.random.normal(scale=2.0))
        else:
            y += abs(np.random.normal(scale=2.0))
        pylab.scatter(x, y, c=label_colors[klabel])
        x = x if np.random.random()<0.5 else x + 10
        y = y if np.random.random()<0.5 else y - 10
        pylab.annotate(label, xy=(x, y), xytext=(0, 0), textcoords='offset points',
                       ha='right', va='bottom',fontsize=16)
    pylab.title('t-SNE for Skip-Gram',fontsize=24)
    # Get the second subplot
    pylab.subplot(1, 2, 2)
    # Plot all the embeddings and their corresponding words for CBOW
    for i, (label,klabel) in enumerate(zip(cbow_labels,cbow_kmeans_labels)):
        center = cbow_cluster_centroids[klabel,:]
        x, y = cbow_embeddings[i,:]
        # This is just to spread the data points around a bit
        # So that the labels are clearer
        # We repel datapoints from the cluster centroid
        if x < center[0]:
            x += -abs(np.random.normal(scale=2.0))
        else:
            x += abs(np.random.normal(scale=2.0))
        if y < center[1]:
            y += -abs(np.random.normal(scale=2.0))
        else:
            y += abs(np.random.normal(scale=2.0))
        pylab.scatter(x, y, c=label_colors[klabel])
        x = x if np.random.random()<0.5 else x + np.random.randint(0,10)
        y = y + np.random.randint(0,5) if np.random.random()<0.5 else y - np.random.randint(0,5)
        pylab.annotate(label, xy=(x, y), xytext=(0, 0), textcoords='offset points',
                       ha='right', va='bottom',fontsize=16)
    pylab.title('t-SNE for CBOW',fontsize=24)
    # use for saving the figure if needed
    pylab.savefig('tsne_skip_vs_cbow.png')
    pylab.show()

# Run the function
sg_words = [reverse_dictionary[i] for i in sg_selected_ids]
cbow_words = [reverse_dictionary[i] for i in cbow_selected_ids]
plot_embeddings_side_by_side(sg_two_d_embeddings, cbow_two_d_embeddings, sg_words,cbow_words)
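Final output (including the comparison plots).
Note: the actual output shows the two plots side by side; I took separate screenshots of each so that they are easier to read.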
Define Label colors for %d 20
Running K-Means for skip-gram
Running K-Means for CBOW
K-Means ran successfully
Plotting results
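The plots show that the CBOW model groups the words into clearer clusters, so in this particular example CBOW does better than skip-gram.
4. Comparing CBOW with its extensions
I had intended to analyze this in detail, but to keep this post to a reasonable length I only give a brief comparison of CBOW, CBOW (Unigram), and CBOW (Unigram+Subsampling). I have not found an in-depth discussion of this three-way comparison online; interested readers can consult Thushan Ganegedara's "Natural Language Processing with TensorFlow".
Code for plotting the losses of the three variants (unigram-based negative sampling and subsampling)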
pylab.figure(figsize=(15,5)) # in inches
# Define the x axis
x = np.arange(len(skip_gram_loss))*2000
# Plotting standard CBOW loss, CBOW loss with unigram sampling and
# CBOW loss with unigram sampling + subsampling here in one plot
pylab.plot(x, cbow_loss, label="CBOW",linestyle='--',linewidth=2)
pylab.plot(x, cbow_loss_unigram, label="CBOW (Unigram)",linestyle='-.',linewidth=2,marker='^',markersize=5)
pylab.plot(x, cbow_loss_unigram_subsampled, label="CBOW (Unigram+Subsampling)",linewidth=2)
# Some text around the plots
pylab.title('Original CBOW vs Various Improvements Loss Decrease Over-Time',fontsize=24)
pylab.xlabel('Iterations',fontsize=22)
pylab.ylabel('Loss',fontsize=22)
pylab.legend(loc=1,fontsize=22)
# Use for saving the figure if needed
pylab.savefig('loss_cbow_vs_all_improvements.png')
pylab.show()
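Output
There is an interesting observation here: CBOW (Unigram) and CBOW (Unigram+Subsampling) give almost the same loss. This should not be read as subsampling bringing no benefit to the learning problem. The reason for this behavior is as follows. With subsampling we discard many uninformative words, which raises the quality of the text in terms of information content. That in turn makes the learning problem harder. In the earlier problem setting, the word vectors could exploit the abundance of uninformative words during optimization; in the new setting such opportunities are rare. The result is a higher loss, but semantically sound word vectors.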
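The post does not state the formulas behind the two extensions, so the following is only a sketch under assumptions: it uses the rules commonly cited from the original Word2Vec paper, namely keeping a word with probability min(1, sqrt(t / f(w))) for subsampling, and drawing negative samples from the unigram distribution raised to the power 0.75. The threshold `t`, the toy counts, and both function names are hypothetical, and the book's implementation may differ in detail.

```python
import math


def keep_probabilities(counts, threshold=1e-5):
    """Subsampling: probability of keeping each word, min(1, sqrt(t / f(w)))."""
    total = sum(counts.values())
    return {
        word: min(1.0, math.sqrt(threshold / (count / total)))
        for word, count in counts.items()
    }


def negative_sampling_distribution(counts, power=0.75):
    """Unigram-based negative sampling: P(w) proportional to count(w) ** power."""
    weights = {word: count ** power for word, count in counts.items()}
    norm = sum(weights.values())
    return {word: weight / norm for word, weight in weights.items()}


# Toy corpus statistics (assumed): two very frequent words and two rare ones
counts = {"the": 50000, "of": 49989, "dog": 10, "purr": 1}

keep = keep_probabilities(counts)
noise = negative_sampling_distribution(counts)

for word in counts:
    print(word, round(keep[word], 4), round(noise[word], 6))
```

Frequent words such as "the" are kept only rarely, which is the sense in which subsampling removes uninformative words, while the 0.75 power gives rare words a somewhat larger share of the negative samples than their raw frequency would.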
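IV. The GloVe model
Methods for learning word vectors fall into two families: global matrix factorization methods and local context window methods. Latent Semantic Analysis (LSA) is a global matrix factorization method, while skip-gram and CBOW are local context window methods. As a document analysis technique, LSA maps the words in documents to "concepts", where a concept shows up as a common pattern of words across documents. Global matrix factorization methods make efficient use of the global statistics of the corpus (for example, corpus-wide word co-occurrence) but perform only moderately on word analogy tasks. Context window methods, on the other hand, perform well on word analogy tasks but do not make full use of the global statistics of the corpus, which leaves room for improvement. GloVe is designed to combine the strengths of both.
1. Understanding GloVe through an example
Before implementing GloVe, let us look at an example to build some intuition.
1) Take two words: i = "dog" and j = "cat".
2) Define an arbitrary probe word k.
3) Let P_ik denote the probability of word i and word k appearing together, and P_jk the probability of word j and word k appearing together.
Now consider how the ratio P_ik/P_jk behaves for different values of k.
For k = "bark", k is very likely to appear with i and very unlikely to appear with j, so P_ik/P_jk >> 1.
For k = "purr", k is unlikely to appear near i, so P_ik is small, whereas k is strongly related to j, so P_jk is large. P_ik/P_jk is therefore approximately 0.
For a word such as k = "pet", which is strongly related to both i and j, or k = "politics", which has little to do with either, we get P_ik/P_jk ≈ 1.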
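The three cases can be checked numerically with a toy co-occurrence table. All counts below are made up for illustration, and P_ik is taken here to be the count of (i, k) divided by the total count in row i, which is an assumption about how the probability is normalized.

```python
# Toy co-occurrence counts (assumed values, not taken from any real corpus)
cooc = {
    "dog": {"bark": 50, "purr": 1, "pet": 30, "politics": 2},
    "cat": {"bark": 1, "purr": 50, "pet": 30, "politics": 2},
}


def cooc_probability(word, probe):
    """P_ik: share of word i's co-occurrences that involve probe word k."""
    row = cooc[word]
    return row[probe] / sum(row.values())


def probability_ratio(i, j, k):
    """P_ik / P_jk for the word pair (i, j) and the probe word k."""
    return cooc_probability(i, k) / cooc_probability(j, k)


for probe in ["bark", "purr", "pet", "politics"]:
    print(probe, probability_ratio("dog", "cat", probe))
```

Probe words tied to only one of the two words push the ratio far above or far below 1, while probe words related to both or to neither leave it near 1.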
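As these cases show, the quantity P_ik/P_jk measures the relationship between two words through how often each of them co-occurs with nearby words. That makes it a good candidate for learning meaningful word vectors, and a loss function built around it is a natural starting point. Starting from this ratio, a careful derivation in the literature arrives at the GloVe loss function.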
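For reference, the loss function as it is usually written in the GloVe literature is sketched below. The notation is an assumption, since this post does not define its own symbols: X_ij is the co-occurrence count of words i and j, w_i and \tilde{w}_j are the word and context vectors, b_i and \tilde{b}_j are biases, and V is the vocabulary size. The values x_max = 100 and alpha = 0.75 correspond to the weighting used in the training code later in this post.

```latex
J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^{\top} \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^{2},
\qquad
f(x) =
\begin{cases}
  (x / x_{\max})^{\alpha} & \text{if } x < x_{\max}, \\
  1 & \text{otherwise.}
\end{cases}
```

The weighting function f keeps very frequent pairs from dominating the sum and gives rarely co-occurring pairs a small weight.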
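2. Implementing the GloVe model
2.1 Dataset
The data is the same dataset used in the previous post.
2.2 Steps
The workflow is similar to the skip-gram workflow in the previous post.
- Preprocess the data with NLTK;
- Build the dictionaries;
- Generate batches of data;
- Define the hyperparameters, inputs, outputs, model parameters, and other variables;
- Define the model computations and compute word similarities;
- Define the optimizer and run the model.
2.3 Part of the code and the final output
Code for running the model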
num_steps = 100001
glove_loss = []
average_loss = 0
with tf.Session(config=tf.ConfigProto(allow_soft_placement=True)) as session:
    tf.global_variables_initializer().run()
    print('Initialized')
    for step in range(num_steps):
        # generate a single batch (data,labels,co-occurrence weights)
        batch_data, batch_labels, batch_weights = generate_batch(
            batch_size, skip_window)
        # Computing the weights required by the loss function
        batch_weights = [] # weighting used in the loss function
        batch_xij = [] # weighted frequency of finding i near j
        # Compute the weights for each datapoint in the batch
        for inp,lbl in zip(batch_data,batch_labels.reshape(-1)):
            point_weight = (cooc_mat[inp,lbl]/100.0)**0.75 if cooc_mat[inp,lbl]<100.0 else 1.0
            batch_weights.append(point_weight)
            batch_xij.append(cooc_mat[inp,lbl])
        batch_weights = np.clip(batch_weights,-100,1)
        batch_xij = np.asarray(batch_xij)
        # Populate the feed_dict and run the optimizer (minimize loss)
        # and compute the loss. Specifically we provide
        # train_dataset/train_labels: training inputs and training labels
        # weights_x: measures the importance of a data point with respect to how much those two words co-occur
        # x_ij: co-occurrence matrix value for the row and column denoted by the words in a datapoint
        feed_dict = {train_dataset : batch_data.reshape(-1), train_labels : batch_labels.reshape(-1),
                     weights_x:batch_weights,x_ij:batch_xij}
        _, l = session.run([optimizer, loss], feed_dict=feed_dict)
        # Update the average loss variable
        average_loss += l
        if step % 2000 == 0:
            if step > 0:
                average_loss = average_loss / 2000
            # The average loss is an estimate of the loss over the last 2000 batches.
            print('Average loss at step %d: %f' % (step, average_loss))
            glove_loss.append(average_loss)
            average_loss = 0
        # Here we compute the top_k closest words for a given validation word
        # in terms of the cosine distance
        # We do this for all the words in the validation set
        # Note: This is an expensive step
        if step % 10000 == 0:
            sim = similarity.eval()
            for i in range(valid_size):
                valid_word = reverse_dictionary[valid_examples[i]]
                top_k = 8 # number of nearest neighbors
                nearest = (-sim[i, :]).argsort()[1:top_k+1]
                log = 'Nearest to %s:' % valid_word
                for k in range(top_k):
                    close_word = reverse_dictionary[nearest[k]]
                    log = '%s %s,' % (log, close_word)
                print(log)
    final_embeddings = normalized_embeddings.eval()
Output (partial; a section in the middle has been removed)
Initialized
Average loss at step 0: 9.578778
Nearest to it: karol, burgh, destabilise, armchair, crook, roguery, one-sixth, swains,
Nearest to that: wmap, partake, ahmadi, armstrong, memberships, forza, director-general, condo,
Nearest to has: mentality, vastly, approaches, bulwark, enzymes, originally, privatize, reunify,
Nearest to but: inhabited, potrero, trust, memory, curran, philips, p.m.s, pagoda,
Nearest to city: seals, counter-revolution, tubular, kayaking, central, 1568, override, buckland,
Nearest to this: dispersion, intermarriage, dialysis, moguls, aldermen, alcoholic, codes, farallon,
Nearest to UNK: 40.3, tatsam, jupiter, verify, unequal, berliners, march, 1559,
Nearest to by: functionalists, synthesised, palladius, chiapas, synaptic, sumner, raining, valued,
Nearest to or: amherst, 'mother, epiglottis, wen, stanislaus, trafford, cuticle, reminded,
Nearest to been: 640,961., depression-era, uniquely, mami, 375,000, stickiness, medium-sized, amor,
Nearest to with: anti-statist, pitigliano, branches, reparations, acquittal, frowned, pishpek, left-leaning,
Nearest to be: i-20, kevin, greased, rightly, conductors, hypercholesterolemia, pedro, douaumont,
Nearest to as: gabon, horda, mead, protruding, soundtrack, algeria, 48, macon,
Nearest to at: kambula, tisa, spelled, 130,000, 2008, organisers, |jul_rec_lo_°f, arrows,
Nearest to ,: is, of, its, malton, martinů, retiree, reliant, uri,
Nearest to its: of, ,, galleon, gitlow, rugby-playing, varanasi, fono, clusters,
Average loss at step 2000: 0.739107
Average loss at step 4000: 0.091107
Average loss at step 6000: 0.068614
Average loss at step 8000: 0.076040
Average loss at step 10000: 0.058149
Nearest to it: was, is, that, not, a, in, to, .,
Nearest to that: is, was, the, a, ., ,, to, in,
Nearest to has: is, it, that, a, been, was, to, mentality,
Nearest to but: with, said, trust, mating, not, squamous, war—the, r101,
Nearest to city: of, 's, counter-revolution, the, professed, ., equilibrium, seals,
Nearest to this: is, ., for, in, was, the, a, that,
Nearest to UNK: and, ,, (, in, the, ., ), a,
Nearest to by: the, and, ,, ., in, was, of, a,
Nearest to or: UNK, ,, and, a, cuticle, donnchad, ``, 'mother,
Nearest to been: have, had, to, has, be, was, that, it,
Nearest to with: ,, and, a, the, in, of, for, .,
Nearest to by: the, was, ,, in, ., and, a, of,
Nearest to or: (, UNK, ), ``, a, ,, and, with,
Nearest to been: have, has, had, also, be, that, was, to,
Nearest to with: and, ,, a, the, of, in, for, .,
Nearest to be: to, have, can, not, that, from, is, would,
Nearest to as: a, an, ,, such, for, and, is, the,
Nearest to at: of, the, ., in, 's, ,, and, by,
Nearest to ,: and, in, the, ., a, with, of, UNK,
Nearest to its: for, and, their, with, his, ,, the, of,
Average loss at step 92000: 0.019305
Average loss at step 94000: 0.019555
Average loss at step 96000: 0.019266
Average loss at step 98000: 0.018803
Average loss at step 100000: 0.018488
Nearest to it: is, was, also, that, not, has, this, a,
Nearest to that: was, is, to, it, the, a, ., ,,
Nearest to has: it, been, was, had, also, is, that, a,
Nearest to but: which, not, ,, it, with, was, and, a,
Nearest to city: of, 's, the, ., in, is, new, world,
Nearest to this: is, ., was, it, in, for, the, at,
Nearest to UNK: (, and, ), ,, or, a, the, .,
Nearest to by: the, ., was, ,, and, in, of, a,
Nearest to or: UNK, (, ``, a, ), ,, and, with,
Nearest to been: have, has, had, also, be, was, that, to,
Nearest to with: and, ,, a, the, of, in, for, .,
Nearest to be: to, have, can, not, would, from, that, a,
Nearest to as: a, such, an, ,, for, is, and, to,
Nearest to at: of, ., the, in, 's, by, ,, and,
Nearest to ,: and, in, the, ., a, with, UNK, of,
Nearest to its: for, their, and, with, his, ,, to, the,
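The per-pair weighting computed inside the training loop above can be isolated into a small pure-Python sketch. The cap of 100 and the exponent 0.75 are taken from that loop; the function names and the squared-error form of the per-pair loss are assumptions based on the usual GloVe formulation, not the graph definition used in this post.

```python
import math


def glove_weight(x_ij, x_max=100.0, alpha=0.75):
    """Weight of one word pair: grows with the co-occurrence count, capped at 1."""
    return (x_ij / x_max) ** alpha if x_ij < x_max else 1.0


def glove_pair_loss(w_i, w_j, b_i, b_j, x_ij):
    """Weighted squared error between the model score and log(x_ij).

    Assumes x_ij > 0, since log(0) is undefined.
    """
    score = sum(a * b for a, b in zip(w_i, w_j)) + b_i + b_j
    return glove_weight(x_ij) * (score - math.log(x_ij)) ** 2


# A pair seen 10 times whose score already equals log(10) contributes no loss
perfect = glove_pair_loss([1.0, 0.0], [math.log(10.0), 0.0], 0.0, 0.0, 10.0)
# The same pair with zero vectors is penalized in proportion to its weight
untrained = glove_pair_loss([0.0, 0.0], [0.0, 0.0], 0.0, 0.0, 10.0)

print(glove_weight(5.0), glove_weight(50.0), glove_weight(500.0))
print(perfect, untrained)
```

Pairs that co-occur at least 100 times all receive the same weight of 1, so extremely frequent pairs such as function words cannot dominate the loss.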
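V. Document classification with Word2Vec
Although word embeddings offer a very elegant way of learning numerical representations of words, looking at them only quantitatively (loss values) and qualitatively (t-SNE plots), as we have done so far, is not enough to show how useful they are in real-world applications. Word embeddings are used as the word feature representation in many tasks, such as image caption generation and machine translation. Those tasks, however, involve combining different learning models, for example convolutional neural networks (CNNs) with long short-term memory (LSTM) models, or two LSTM models, and will be discussed in later posts. To see a real-world use of word embeddings here, we turn to a simpler task: document classification.
Document classification is one of the most popular tasks in NLP, and it is very useful to anyone who handles large volumes of data, such as news websites, publishers, and universities. We use news articles from the BBC, where each document belongs to one of the following categories: business, entertainment, politics, sport, or tech. We use 250 documents from each category, with a vocabulary size of 25,000. Each document is identified by a tag of the form "<category>-<ID>". For example, the 50th document of the entertainment section is tagged "entertainment-50". Compared with the large text corpora analyzed in real-world applications this is a very small dataset, but this small example is enough to show the power of word embeddings.
1. Dataset
The data comes from the BBC News dataset.
2. Steps
- Preprocess the data with NLTK;
- Build the dictionaries, including word to ID, ID to word, and the list of (word, frequency) pairs;
- Generate batches of data (skip-gram style);
- Define the hyperparameters, inputs, outputs, model parameters, and other variables;
- Compute the loss and word similarities;
- Define the optimizer and run the CBOW model;
- Visualize the results with t-SNE;
- Perform document classification.
3. Part of the code and the output
3.1 Running the CBOW Algorithm on Document Data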
num_steps = 100001
cbow_loss = []
config=tf.ConfigProto(allow_soft_placement=True)
# This is an important setting and with limited GPU memory,
# not using this option might lead to the following error.
# InternalError (see above for traceback): Blas GEMM launch failed : ...
config.gpu_options.allow_growth = True
with tf.Session(config=config) as session:
    # Initialize the variables in the graph
    tf.global_variables_initializer().run()
    print('Initialized')
    average_loss = 0
    # Train the Word2vec model for num_step iterations
    for step in range(num_steps):
        # Generate a single batch of data
        batch_data, batch_labels = generate_batch(data, batch_size, window_size)
        # Populate the feed_dict and run the optimizer (minimize loss)
        # and compute the loss
        feed_dict = {train_dataset : batch_data, train_labels : batch_labels}
        _, l = session.run([optimizer, loss], feed_dict=feed_dict)
        # Update the average loss variable
        average_loss += l
        if (step+1) % 2000 == 0:
            if step > 0:
                average_loss = average_loss / 2000
            # The average loss is an estimate of the loss over the last 2000 batches.
            print('Average loss at step %d: %f' % (step+1, average_loss))
            cbow_loss.append(average_loss)
            average_loss = 0
        # Evaluating validation set word similarities
        if (step+1) % 10000 == 0:
            sim = similarity.eval()
            # Here we compute the top_k closest words for a given validation word
            # in terms of the cosine distance
            # We do this for all the words in the validation set
            # Note: This is an expensive step
            for i in range(valid_size):
                valid_word = reverse_dictionary[valid_examples[i]]
                top_k = 8 # number of nearest neighbors
                nearest = (-sim[i, :]).argsort()[1:top_k+1]
                log = 'Nearest to %s:' % valid_word
                for k in range(top_k):
                    close_word = reverse_dictionary[nearest[k]]
                    log = '%s %s,' % (log, close_word)
                print(log)
    # Computing test documents embeddings by averaging word embeddings
    # We take batch_size*num_test_steps words from each document
    # to compute document embeddings
    num_test_steps = 100
    # Store document embeddings
    # {document_id:embedding} format
    document_embeddings = {}
    print('Testing Phase (Compute document embeddings)')
    # For each test document compute document embeddings
    for k,v in test_data.items():
        print('\tCalculating mean embedding for document ',k,' with ', num_test_steps, ' steps.')
        test_data_index = 0
        topic_mean_batch_embeddings = np.empty((num_test_steps,embedding_size),dtype=np.float32)
        # keep averaging mean word embeddings obtained for each step
        for test_step in range(num_test_steps):
            test_batch_labels = generate_test_batch(test_data[k],batch_size)
            batch_mean = session.run(mean_batch_embedding,feed_dict={test_labels:test_batch_labels})
            topic_mean_batch_embeddings[test_step,:] = batch_mean
        document_embeddings[k] = np.mean(topic_mean_batch_embeddings,axis=0)
Output (partial)
Initialized
Average loss at step 2000: 3.914388
Average loss at step 4000: 3.562556
Average loss at step 6000: 3.545086
Average loss at step 8000: 3.549088
Average loss at step 10000: 3.481902
Nearest to .: ,, and, mp3s, friendlies, -, that, documentary, low,
Nearest to i: we, is, appreciate, outsiders, icann, modelling, onslaught, chennai,
Nearest to which: impromptu, israeli, skills, portuguese, ghanaian, lifetime, innocence, paisley,
Nearest to were: are, cryptography, heaped, 836m, 50mg, pervasively, 28,000, past,
Nearest to we: people, they, enormity, i, ranked, is, jacob, are,
Nearest to they: we, softer, to, not, revisions, 27.24, 'template, be,
Nearest to would: will, to, should, alleges, sleepless, jolie, also, could,
Nearest to that: not, about, it, ., change, get, politicians, gartner,
Nearest to had: has, have, was, streets, bulgaria, directory, nestle, binding,
Nearest to said: added, restriction-free, forgiven, breathing, allardyce, intends, vans, he,
Nearest to this: 2005/06, build, connectotel, it, short, greenback, last, diet,
Nearest to he: it, inaccurate, mr, she, '', 102, was, has,
Nearest to not: that, complained, phenomenon, sourced, they, 10.4, cliques, 'template,
Nearest to it: he, there, everyday, that, ``, 6gb, this, did,
Nearest to ,: ., 's, the, sleeves, and, singer/guitarist, legislative, observed,
Nearest to from: for, hermann, in, and, by, jeep, flights, asher,
Testing Phase (Compute document embeddings)
Calculating mean embedding for document tech-34 with 100 steps.
Calculating mean embedding for document sport-166 with 100 steps.
Calculating mean embedding for document sport-87 with 100 steps.
Calculating mean embedding for document entertainment-119 with 100 steps.
Calculating mean embedding for document business-161 with 100 steps.
Calculating mean embedding for document sport-129 with 100 steps.
Calculating mean embedding for document tech-145 with 100 steps.
Calculating mean embedding for document business-135 with 100 steps.
Calculating mean embedding for document sport-206 with 100 steps.
Calculating mean embedding for document tech-216 with 100 steps.
Calculating mean embedding for document entertainment-216 with 100 steps.
Calculating mean embedding for document politics-184 with 100 steps.
Calculating mean embedding for document sport-184 with 100 steps.
Calculating mean embedding for document sport-45 with 100 steps.
Calculating mean embedding for document sport-32 with 100 steps.
Calculating mean embedding for document politics-247 with 100 steps.
Calculating mean embedding for document business-240 with 100 steps.
Calculating mean embedding for document entertainment-98 with 100 steps.
Calculating mean embedding for document politics-171 with 100 steps.
Calculating mean embedding for document politics-8 with 100 steps.
Calculating mean embedding for document business-165 with 100 steps.
Calculating mean embedding for document politics-16 with 100 steps.
Calculating mean embedding for document business-44 with 100 steps.
Calculating mean embedding for document business-215 with 100 steps.
Calculating mean embedding for document tech-79 with 100 steps.
Calculating mean embedding for document tech-178 with 100 steps.
Calculating mean embedding for document entertainment-163 with 100 steps.
Calculating mean embedding for document entertainment-196 with 100 steps.
Calculating mean embedding for document politics-236 with 100 steps.
Calculating mean embedding for document entertainment-1 with 100 steps.
Calculating mean embedding for document sport-20 with 100 steps.
Calculating mean embedding for document tech-157 with 100 steps.
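The testing phase above represents each document by the average of the embeddings of its words, computed as a mean of per-batch means. A minimal NumPy sketch of that idea follows; the tiny embedding table, the word IDs, and the function name `document_embedding` are assumptions made for illustration.

```python
import numpy as np


def document_embedding(word_ids, embeddings):
    """Represent a document as the mean of the embeddings of its words."""
    return embeddings[word_ids].mean(axis=0)


# Toy embedding table: 3 words, 2 dimensions (assumed values)
embeddings = np.array([[1.0, 0.0],
                       [0.0, 1.0],
                       [1.0, 1.0]])

# A toy document given as word IDs
doc = [0, 1, 2, 2]
doc_vector = document_embedding(doc, embeddings)

# Averaging per-batch means, as the training script does, gives the same
# vector when every batch has the same size
batches = [doc[:2], doc[2:]]
batch_means = np.stack([document_embedding(b, embeddings) for b in batches])
mean_of_batch_means = batch_means.mean(axis=0)

print(doc_vector, mean_of_batch_means)
```

Averaging discards word order, but documents on the same topic tend to share vocabulary and therefore end up with nearby mean vectors, which is what the clustering step relies on.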
3.2 Visualizing the results with t-SNE (see the figure below)
3.3 Document classification
# Create and fit K-means
kmeans = KMeans(n_clusters=5, random_state=43643, max_iter=10000, n_init=100, algorithm='elkan')
kmeans.fit(np.array(list(document_embeddings.values())))
# Compute items fallen within each cluster
document_classes = {}
for inp, lbl in zip(list(document_embeddings.keys()), kmeans.labels_):
    if lbl not in document_classes:
        document_classes[lbl] = [inp]
    else:
        document_classes[lbl].append(inp)
for k,v in document_classes.items():
    print('\nDocuments in Cluster ',k)
    print('\t',v)
Output
Documents in Cluster 0
['entertainment-216', 'business-240', 'business-44', 'tech-178', 'business-165', 'tech-238', 'business-171', 'business-144', 'business-107']
Documents in Cluster 1
['tech-34', 'tech-145', 'business-135', 'sport-206', 'tech-216', 'politics-184', 'politics-247', 'politics-171', 'politics-8', 'politics-78', 'entertainment-163', 'politics-16', 'business-141', 'business-215', 'tech-79', 'tech-157', 'sport-231', 'tech-42', 'politics-197', 'politics-98', 'tech-212']
Documents in Cluster 2
['sport-166', 'entertainment-119', 'business-161', 'sport-129', 'sport-45', 'entertainment-98', 'entertainment-196', 'politics-236', 'sport-26', 'entertainment-1', 'entertainment-74', 'entertainment-244', 'entertainment-154']
Documents in Cluster 3
['sport-184']
Documents in Cluster 4
['sport-87', 'sport-32', 'sport-20']
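One simple way to read the clusters above is to find the majority category in each cluster and the share of documents that belong to it. The sketch below does this for three of the clusters copied from the output; the function name `cluster_summary` is hypothetical, and the category is recovered from the "<category>-<ID>" tag format described earlier.

```python
from collections import Counter


def cluster_summary(document_ids):
    """Return (majority_category, share_of_documents_in_that_category)."""
    categories = [doc_id.rsplit("-", 1)[0] for doc_id in document_ids]
    label, count = Counter(categories).most_common(1)[0]
    return label, count / len(categories)


# Three clusters copied from the output above
clusters = {
    0: ['entertainment-216', 'business-240', 'business-44', 'tech-178',
        'business-165', 'tech-238', 'business-171', 'business-144', 'business-107'],
    3: ['sport-184'],
    4: ['sport-87', 'sport-32', 'sport-20'],
}

summaries = {cluster_id: cluster_summary(docs) for cluster_id, docs in clusters.items()}
for cluster_id, (label, share) in summaries.items():
    print(cluster_id, label, round(share, 2))
```

A share close to 1 means the cluster is dominated by one category; lower values point to clusters that mix several topics.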
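4. Brief analysis
The clustering result is only moderate, but the model does at least produce a rough first grouping of the documents, and on the whole the grouping is reasonable. Where documents were not assigned to the expected cluster, the likely reason is that their content is genuinely related to the cluster they landed in. I only took a quick look at sport-206 and tech-34 and did not have time to check the rest; interested readers can examine the documents in the dataset. I will leave it there for now and may come back to it later.
Note: the t-SNE plot given in the book does not match the result of running the code. In particular, the position of tech-42 in the plot is clearly the opposite of what the book describes. Because the book's discussion of sport-50 and entertainment-115 also differs from what the code produces, I do not go through the book's analysis here; interested readers can verify it themselves.
VI. Summary
First, we analyzed the performance differences between the skip-gram and CBOW algorithms. To bring out the differences more clearly, we used a common visualization technique, t-SNE, which gave a more intuitive picture of how the two differ.
Second, we looked at extensions to Word2Vec and compared the performance of the resulting skip-gram-based and CBOW-based algorithms.
Next, we introduced and analyzed the well-known GloVe model. Because GloVe incorporates global corpus statistics into the optimization, its overall performance improves considerably.
Finally, we worked through a document classification example.
As for the next posts, I have not yet decided whether to cover NN, RNN, and LSTM first and then move on to text generation, image caption generation, and machine translation, or to start directly with text generation. My own knowledge is limited, so corrections and suggestions are welcome.
I have some other matters to attend to in the near future, so blog updates may be delayed.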