
No.2 Implementing an RNN Language Model in PyTorch

I recently used PyTorch to build an RNN language model (RNNLM). The goal is to turn the one-hot encoding of each word in the vocabulary (a high-dimensional sparse vector) into a dense vector. This post does not explain how RNNs work or why one would use an RNN language model; it only walks through how the PyTorch code is used.
There is still relatively little PyTorch material around, so I mainly got started by reading the PyTorch documentation and asking questions on the official PyTorch forums.
The full code is as follows:

import torch
import torch.nn.functional as F
from torch import nn, optim
from torch.autograd import Variable
from numpy import *
from torch.utils.data import DataLoader
from mydataset import MyDataset

BATCH_SIZE = 5

sentence_set = """When forty winters shall besiege thy brow,
And dig deep trenches in thy beauty's field,
Thy youth's proud livery so gazed on now,
Will be a totter'd weed of small worth held:
Then being asked, where all thy beauty lies,
Where all the treasure of thy lusty days;
To say, within thine own deep sunken eyes,
Were an all-eating shame, and thriftless praise.
How much more praise deserv'd thy beauty's use,
If thou couldst answer 'This fair child of mine
Shall sum my count, and make my old excuse,'
Proving his beauty by succession thine!
This were to be new made when thou art old,
And see thy blood warm when thou feel'st it cold.""".split()

EMBDDING_DIM = len(sentence_set)+1
HIDDEN_UNITS = 200

word_to_ix = {}
for word in sentence_set:
    if word not in word_to_ix:
        word_to_ix[word] = len(word_to_ix)
print(word_to_ix)

def make_word_to_ix(word,word_to_ix):
    vec = torch.zeros(EMBDDING_DIM)
    #vec = torch.LongTensor(EMBDDING_DIM,1).zero_()
    if word in word_to_ix:
        vec[word_to_ix[word]] = 1
    else:
        vec[len(word_to_ix)] = 1
    return vec

data_words = []
data_labels = []
for i in range(len(sentence_set) -2):
    word = sentence_set[i]
    label = sentence_set[i+1]
    data_words.append(make_word_to_ix(word,word_to_ix))
    data_labels.append(make_word_to_ix(label,word_to_ix))

dataset = MyDataset(data_words, data_labels)
train_loader = DataLoader(dataset, batch_size=BATCH_SIZE)

'''
for _,batch in enumerate(train_loader):
    print("word_batch------------>\n")
    print(batch[0])
    print("label batch----------->\n")
    print(batch[1])
'''

class RNNModel(nn.Module):
    def __init__(self, embdding_size, hidden_size):
        super(RNNModel, self).__init__()
        self.rnn = nn.RNN(embdding_size, hidden_size,num_layers=1,nonlinearity='relu')
        self.linear = nn.Linear(hidden_size, embdding_size)

    def forward(self, x, hidden):
        #input = x.view(BATCH_SIZE, -1)
        output1, h_n = self.rnn(x, hidden)
        output2 = self.linear(output1)
        log_prob = F.log_softmax(output2)
        return log_prob, h_n

rnnmodel = RNNModel(EMBDDING_DIM, HIDDEN_UNITS)
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(rnnmodel.parameters(), lr=1e-3)

#testing
#input_hidden = torch.autograd.Variable(torch.randn(BATCH_SIZE, HIDDEN_UNITS))
#x = torch.autograd.Variable(torch.rand(BATCH_SIZE,EMBDDING_DIM))
#y,_ = rnnmodel(x,input_hidden)
#print(y)

for epoch in range(3):
    print('epoch: {}'.format(epoch + 1))
    print('*' * 10)
    running_loss = 0
    input_hidden = torch.autograd.Variable(torch.randn(BATCH_SIZE, HIDDEN_UNITS))
    for _,batch in enumerate(train_loader):
        x = torch.autograd.Variable(batch[0])
        y = torch.autograd.Variable(batch[1])
        # forward
        out, input_hidden = rnnmodel(x, input_hidden)
        trgt = torch.max(y, 1)[1]
        loss = criterion(out, trgt)
        running_loss += loss.data[0]
        # backward
        optimizer.zero_grad()
        loss.backward(retain_graph=True)
        optimizer.step()
    print('Loss: {:.6f}'.format(running_loss / len(word_to_ix)))

#print(rnnmodel.state_dict().keys())
f = open("res-0104-rnn.txt","w+")
alpha = rnnmodel.state_dict()['rnn.weight_ih_l0']
for word in sentence_set:
    #print(word,torch.unsqueeze(alpha[word_to_ix[word]],0).numpy())
    line = word + " " + str(torch.unsqueeze(alpha[word_to_ix[word]],0).numpy().tolist()[0]) + "\n"
    #print(line)
    f.write(line)
f.close()

Preprocessing the corpus

Preprocessing here mainly means splitting the string into a list of words, building the vocabulary, and generating a one-hot encoding for every word in the vocabulary. Corpora of other types or for other purposes would need a proper tokenizer; here we simply split into a list, which is straightforward, so I won't go into detail.

Building the vocabulary

word_to_ix = {}
for word in sentence_set:
    if word not in word_to_ix:
        word_to_ix[word] = len(word_to_ix)
print(word_to_ix)

First declare a dict; then, for every word in the list, add it to the dict if it is not already there. Each new word gets the next available index as its value.

Generating the one-hot encodings


def make_word_to_ix(word,word_to_ix):
    # one-hot encode a single word as a FloatTensor of length EMBDDING_DIM
    vec = torch.zeros(EMBDDING_DIM)
    #vec = torch.LongTensor(EMBDDING_DIM,1).zero_()
    if word in word_to_ix:
        vec[word_to_ix[word]] = 1
    else:
        # words that are not in the vocabulary share the extra last slot
        vec[len(word_to_ix)] = 1
    return vec

EMBDDING_DIM is the dimensionality of the one-hot encoding. It is meant to be the vocabulary size plus one, with the extra slot reserved for words that are not in the vocabulary. (Note that the listing above actually computes it from the length of the token list rather than the vocabulary, which only makes the vectors longer than strictly necessary.)

For example, given the vocabulary { Apple: 0, Banana: 1, Orange: 2 }:
Apple's one-hot vector is [1,0,0,0]
Banana's is [0,1,0,0]
Orange's is [0,0,1,0]
Lemon (not in the vocabulary) gets [0,0,0,1]

The return value is a torch.FloatTensor; in PyTorch, tensors (Tensor) are used to represent vectors and matrices.
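
As a quick sanity check (a small sketch, assuming the code above has already been run; 'winters' is a word from the corpus):

vec = make_word_to_ix('winters', word_to_ix)
print(vec.sum())                     # 1.0 -- exactly one entry is set
print(vec[word_to_ix['winters']])    # 1.0 -- the entry at the word's own index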

The dataset (Dataset)

torch provides an abstract Dataset class: subclass torch.utils.data.Dataset to implement your own dataset, then load it with a DataLoader.
For a detailed introduction to Dataset and DataLoader, see the PyTorch English documentation or the PyTorch Chinese documentation.
In short, we wrap the samples (data and labels) in a Dataset and use a DataLoader to read the dataset during training.

data_words = []
data_labels = []
for i in range(len(sentence_set) -2):
    word = sentence_set[i]
    label = sentence_set[i+1]
    data_words.append(make_word_to_ix(word,word_to_ix))
    data_labels.append(make_word_to_ix(label,word_to_ix))

dataset = MyDataset(data_words, data_labels)
train_loader = DataLoader(dataset, batch_size=BATCH_SIZE)

Using the one-hot encodings from before, we generate the samples: each word in the list is used as the sample data, and the word that follows it is used as the sample label. The MyDataset class used here comes from the mydataset import and is not shown in the listing; a minimal sketch of it follows.
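
A minimal sketch of what MyDataset could look like (the attribute names here are my own, not taken from the original post):

from torch.utils.data import Dataset

class MyDataset(Dataset):
    def __init__(self, data_words, data_labels):
        # keep the one-hot word vectors and their label vectors side by side
        self.data_words = data_words
        self.data_labels = data_labels

    def __getitem__(self, idx):
        # the DataLoader calls this to fetch one (data, label) pair
        return self.data_words[idx], self.data_labels[idx]

    def __len__(self):
        return len(self.data_words)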

The neural network model

torch provides torch.nn.Module as the base class for all neural network models. A custom network should subclass nn.Module and implement both the __init__() and forward() methods: __init__() defines the structure of the network, and forward() defines how the forward pass is computed.

class RNNModel(nn.Module):

    def __init__(self, embdding_size, hidden_size):
        super(RNNModel, self).__init__()
        # a single-layer RNN with a ReLU nonlinearity: one-hot input -> hidden state
        self.rnn = nn.RNN(embdding_size, hidden_size, num_layers=1, nonlinearity='relu')
        # project the hidden state back to the vocabulary dimension
        self.linear = nn.Linear(hidden_size, embdding_size)

    def forward(self, x, hidden):
        #input = x.view(BATCH_SIZE, -1)
        output1, h_n = self.rnn(x, hidden)
        output2 = self.linear(output1)
        log_prob = F.log_softmax(output2)
        return log_prob, h_n

rnnmodel = RNNModel(EMBDDING_DIM, HIDDEN_UNITS)
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(rnnmodel.parameters(), lr=1e-3)

Here we define the loss function and the optimizer. The optimizer comes from torch.optim; its job is to update the parameters using the computed gradients. When constructing the optimizer we pass it the parameters to update and the learning rate lr. SGD means stochastic gradient descent is used.

The training loop

for epoch in range(3):
    print('epoch: {}'.format(epoch + 1))
    print('*' * 10)
    running_loss = 0
    input_hidden = torch.autograd.Variable(torch.randn(BATCH_SIZE, HIDDEN_UNITS))
    for _,batch in enumerate(train_loader):
        x = torch.autograd.Variable(batch[0])
        y = torch.autograd.Variable(batch[1])
        # forward
        out, input_hidden = rnnmodel(x, input_hidden)
        trgt = torch.max(y, 1)[1]
        loss = criterion(out, trgt)
        running_loss += loss.data[0]
        # backward
        optimizer.zero_grad()
        loss.backward(retain_graph=True)
        optimizer.step()
    print('Loss: {:.6f}'.format(running_loss / len(word_to_ix)))

epoch is the number of passes we make over the whole dataset; in the code it is 3.
torch.autograd.Variable is the basic unit of PyTorch's computation graph: every tensor that takes part in the computation has to be wrapped in a Variable. Here I will mainly explain the loss-function part.

criterion = nn.CrossEntropyLoss()
…….
trgt = torch.max(y, 1)[1]
loss = criterion(out, trgt)

When we created the DataLoader we specified batch_size, the batch size. This means we train with mini-batches: instead of feeding a single sample into the network and immediately computing its loss, we feed a whole batch of data into the network and compute the loss over that batch.
For example, if each data point (say, a one-hot vector) is 100-dimensional and batch_size is 3, then the x actually fed into the RNN is 3 x 100.
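
You can check the shapes coming out of train_loader directly (a quick sketch; the commented-out loop in the full listing prints the batch contents themselves):

for _, batch in enumerate(train_loader):
    print(batch[0].size())   # torch.Size([BATCH_SIZE, EMBDDING_DIM]) for a full batch
    print(batch[1].size())   # the labels have the same shape
    break                    # only inspect the first batch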
The cross-entropy loss CrossEntropyLoss is commonly used for problems that classify data into C classes. Its two arguments take the following forms (a short sketch follows the list):

input of shape (N, C), where N is the batch size and C is the number of classes
target of shape (N), with 0 <= target[i] <= C-1; that is, target is a length-N vector giving the class index of each sample
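
A standalone sketch of these shapes (the numbers here are arbitrary):

import torch
from torch import nn
from torch.autograd import Variable

criterion = nn.CrossEntropyLoss()
scores = Variable(torch.randn(3, 5))            # input of shape (N=3, C=5): raw class scores
target = Variable(torch.LongTensor([1, 0, 4]))  # target of shape (N,): one class index per sample
print(criterion(scores, target))                # a single scalar loss value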

So here we need torch.max(y, 1), which picks out the maximum of each row of y and also returns the column index where that maximum sits; the [1] selects the index part, which is exactly the class index CrossEntropyLoss expects.
For example:

>>> a = torch.randn(4, 4)
>>> a

 0.0692  0.3142  1.2513 -0.5428
 0.9288  0.8552 -0.2073  0.6409
 1.0695 -0.0101 -2.4507 -1.2230
 0.7426 -0.7666  0.4862 -0.6628
[torch.FloatTensor of size 4x4]

>>> torch.max(a, 1)
(
 1.2513
 0.9288
 1.0695
 0.7426
[torch.FloatTensor of size 4]
,
 2
 0
 0
 0
[torch.LongTensor of size 4]
)

The second argument dim of torch.max(input, dim) depends on the number of dimensions of input: if input is 2-D, use dim=1; if input is 3-D, use dim=2 (i.e. reduce along the last dimension).

In[2]: import torch
In[3]: a = torch.randn(2,2,2)
In[4]: a
Out[4]: 

(0 ,.,.) = 
  0.4905 -0.2557
 -0.4251  0.1878

(1 ,.,.) = 
 -0.4327  0.0734
 -1.2723 -0.1210
[torch.FloatTensor of size 2x2x2]
In[5]: torch.max(a,2)
Out[5]: 
(
  0.4905  0.1878
  0.0734 -0.1210
 [torch.FloatTensor of size 2x2], 
  0  1
  1  1
 [torch.LongTensor of size 2x2])

What follows is the backward pass. Note that the gradients must be zeroed on every iteration to avoid accumulation, and optimizer.step() is what actually updates the parameters.
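
A tiny sketch of why the zeroing matters, using a single toy parameter (not part of the original code):

import torch
from torch.autograd import Variable

w = Variable(torch.ones(1), requires_grad=True)

loss = (w * 2).sum()
loss.backward()
print(w.grad)        # 2 -- gradient from the first backward pass

loss = (w * 2).sum()
loss.backward()
print(w.grad)        # 4 -- gradients accumulate when they are not cleared

w.grad.data.zero_()  # roughly what optimizer.zero_grad() does for every parameter
print(w.grad)        # back to 0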