PyTorch: Word Embeddings and N-Gram

阿新 • Published: 2018-11-09

These notes are based on the book《深度學習入門之Pytorch》.

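In image classification, class labels are usually encoded as one-hot vectors. In NLP, however, the vocabulary contains so many distinct words that one-hot encoding becomes impractical, and this is where word embeddings come in.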
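To make the problem concrete, here is a minimal sketch in plain Python. The toy vocabulary and the helper names `one_hot` and `dot` are invented for this illustration. Each one-hot vector is as long as the vocabulary, and any two different words have a dot product of zero, so the encoding says nothing about which words are similar.

```python
# Toy vocabulary, invented for this illustration only.
vocab = ["love", "like", "winter", "brow"]
word_to_idx = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return a list with a single 1 at the word's index."""
    vec = [0] * len(vocab)
    vec[word_to_idx[word]] = 1
    return vec

def dot(a, b):
    """Dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))

love, like = one_hot("love"), one_hot("like")
print(len(love))        # the vector length equals the vocabulary size
print(dot(love, like))  # distinct one-hot vectors are always orthogonal
```

With a realistic vocabulary of tens of thousands of words, every vector would have that many entries, almost all of them zero.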

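A word vector, simply put, represents a word as a vector. The vector cannot be arbitrary, because a random vector carries no meaning, so each word needs a specific vector of its own. Some words are close in meaning, such as "love" and "like", and we want the vectors of such words to be close as well. How do we measure closeness between vectors? Use the angle between them: the smaller the angle, the more similar the two words. This gives a well-defined notion of similarity.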
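The angle between two vectors is usually compared through its cosine. The sketch below is a plain-Python illustration; the function name `cosine_similarity` and the three hand-picked vectors are assumptions made for the example, not learned embeddings.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hand-picked 3-dimensional vectors, not learned embeddings.
love = [0.9, 0.8, 0.1]
like = [0.8, 0.9, 0.2]
cold = [-0.7, 0.1, 0.9]

print(cosine_similarity(love, like))  # close to 1: the angle is small
print(cosine_similarity(love, cold))  # negative: the angle exceeds 90 degrees
```

A cosine near 1 means the vectors point in almost the same direction, which is the sense of "close" used above.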

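The N-gram model rests on one assumption: the occurrence of the n-th word depends only on the preceding n-1 words and on no other words. (This is a Markov assumption, the same kind of assumption used in hidden Markov models.) The probability of a whole sentence is then the product of the conditional probabilities of its words, and each of these can be estimated from counts over a corpus. Suppose sentence T consists of the word sequence w_1, w_2, w_3, ..., w_n. Written as a formula, the language model is: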

$$P(T) = P(w_1 w_2 \cdots w_n) = P(w_1)\,P(w_2 \mid w_1)\,P(w_3 \mid w_1 w_2) \cdots P(w_n \mid w_1 w_2 \cdots w_{n-1})$$
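As a rough illustration of estimating these probabilities from corpus counts, the sketch below uses a bigram model, where each conditional is shortened to depend on the previous word only and is estimated by maximum likelihood. The tiny corpus and the function names `bigram_prob` and `sentence_prob` are invented for the example. Real models also need smoothing for word pairs that never occur, which is left out here.

```python
from collections import Counter

# Tiny made-up corpus, already tokenized.
corpus = "the cat sat on the mat the cat ate".split()

unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))

def bigram_prob(prev, word):
    """Maximum-likelihood estimate: count(prev, word) / count(prev)."""
    return bigram_counts[(prev, word)] / unigram_counts[prev]

def sentence_prob(words):
    """P(w1) times the product of P(w_i | w_{i-1}) under the bigram assumption."""
    prob = unigram_counts[words[0]] / len(corpus)
    for prev, word in zip(words, words[1:]):
        prob *= bigram_prob(prev, word)
    return prob

print(bigram_prob("the", "cat"))             # 2 of the 3 occurrences of "the" precede "cat"
print(sentence_prob(["the", "cat", "sat"]))  # (3/9) * (2/3) * (1/2)
```

The neural model in the code that follows replaces these count-based estimates with a small network that learns the conditional distribution from the same kind of (context, next word) pairs.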

Here is an n-gram implementation:

# -*- coding: utf-8 -*-
"""
Created on Thu Oct 11 10:06:50 2018

@author: www
"""
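# CONTEXT_SIZE is the number of preceding words used to predict the next word; two are used here
CONTEXT_SIZE = 2  # number of context words
EMBEDDING_DIM = 10  # dimension of the word vectors
# We use a poem by Shakespeare as the corpus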
test_sentence = """When forty winters shall besiege thy brow,
And dig deep trenches in thy beauty's field,
Thy youth's proud livery so gazed on now,
Will be a totter'd weed of small worth held:
Then being asked, where all thy beauty lies,
Where all the treasure of thy lusty days;
To say, within thine own deep sunken eyes,
Were an all-eating shame, and thriftless praise.
How much more praise deserv'd thy beauty's use,
If thou couldst answer 'This fair child of mine
Shall sum my count, and make my old excuse,'
Proving his beauty by succession thine!
This were to be new made when thou art old,
And see thy blood warm when thou feel'st it cold.""".split()

# Build the training set: each sample pairs two consecutive words with the word that follows them
trigram = [((test_sentence[i], test_sentence[i+1]), test_sentence[i+2])
            for i in range(len(test_sentence)-2)]

# Total number of samples
len(trigram)  # 113

# Look at the first sample
trigram[0]
# (('When', 'forty'), 'winters')

# Map each word to an integer index; the embedding layer is indexed by these numbers
vocb = set(test_sentence)  # a set removes duplicate words
word_to_idx = {word: i for i, word in enumerate(vocb)}
idx_to_word = {word_to_idx[word]: word for word in word_to_idx}

# Each distinct word now maps to a unique index

import torch
from torch import nn
import torch.nn.functional as F
from torch.autograd import Variable


# Define the model
class n_gram(nn.Module):
    def __init__(self, vocab_size, context_size=CONTEXT_SIZE, n_dim=EMBEDDING_DIM):
        super(n_gram, self).__init__()
        
        self.embed = nn.Embedding(vocab_size, n_dim)
        self.classify = nn.Sequential(
            nn.Linear(context_size * n_dim, 128),
            nn.ReLU(True),
            nn.Linear(128, vocab_size)
        )
        
    def forward(self, x):
        voc_embed = self.embed(x)  # look up the word embeddings
        voc_embed = voc_embed.view(1, -1)  # concatenate the two word vectors into one row
        out = self.classify(voc_embed)
        return out
        
net = n_gram(len(word_to_idx))

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(net.parameters(), lr=1e-2, weight_decay=1e-5)

for e in range(100):
    train_loss = 0
    for word, label in trigram:  # iterate over every sample in the training set
        word = torch.LongTensor([word_to_idx[i] for i in word])  # the two context words are the input
        label = torch.LongTensor([word_to_idx[label]])
        # forward pass
        out = net(word)
        loss = criterion(out, label)
        train_loss += loss.item()
        # backward pass
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    if (e + 1) % 20 == 0:
        print('epoch: {}, Loss: {:.6f}'.format(e + 1, train_loss / len(trigram)))

net = net.eval()
# Check the result on one training sample
word, label = trigram[24]
print('input: {}'.format(word))
print('label: {}'.format(label))
print()
word = torch.LongTensor([word_to_idx[i] for i in word])
out = net(word)
pred_label_idx = out.max(1)[1].item()  # index of the highest-scoring word
predict_word = idx_to_word[pred_label_idx]
print('real word is {}, predicted word is {}'.format(label, predict_word))


# The network predicts the training samples almost perfectly, but with so little data it overfits very easily
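The model's forward pass relies on how `nn.Embedding` and `view` shape the data. The standalone sketch below shows this with toy sizes chosen only for illustration. It is separate from the training run above, and the variable names are assumptions of the example.

```python
import torch
from torch import nn

torch.manual_seed(0)  # make the random initial weights repeatable

vocab_size, n_dim = 5, 10  # toy sizes chosen for this example
embed = nn.Embedding(vocab_size, n_dim)

context = torch.tensor([3, 1])  # indices of two context words
vectors = embed(context)        # one row of the weight matrix per index
flat = vectors.view(1, -1)      # concatenate the rows into a single input row

print(vectors.shape)  # torch.Size([2, 10])
print(flat.shape)     # torch.Size([1, 20])
```

The embedding layer is a lookup table: row i of `embed.weight` is the vector for word index i, and those rows are updated by backpropagation like any other parameter.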
