
Character-level word2vec

For its sequence-labeling tasks (e.g., POS tagging), the paper "End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF" encodes character-level information with a convolutional neural network over the characters of each word.

The paper's experiments use character embeddings of dimension 30, initialized uniformly in the range [-sqrt(3/dim), sqrt(3/dim)]. As a first step, I experimented with training character embeddings using word2vec.
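The paper's initialization scheme is straightforward to reproduce. A minimal sketch with NumPy (the vocabulary size of 52 is a hypothetical value, not from the paper):

```python
import numpy as np

dim = 30                      # character-embedding dimension from the paper
bound = np.sqrt(3.0 / dim)    # bound of the uniform range [-sqrt(3/dim), sqrt(3/dim)]
vocab_size = 52               # hypothetical character-vocabulary size

# Draw each embedding entry uniformly from [-bound, bound]
char_embeddings = np.random.uniform(-bound, bound, size=(vocab_size, dim))
print(char_embeddings.shape)  # (52, 30)
```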

# Train character-level vectors with word2vec
from gensim.models import Word2Vec

alphabet = 'abcdefghijklmnopqrstuvwxyz0123456789,.)(; '
with open('text') as f:
    text = f.read().replace('\n', ' ').lower()
# Keep only characters from the alphabet, then split back into words
chars = [ch for ch in text if ch in alphabet]
filtered = ''.join(chars)
tokens = filtered.split(' ')
words = [t for t in tokens if len(t) >= 2]
# Each word becomes one "sentence" whose tokens are its characters
char_sequences = [list(w) for w in words]
print(char_sequences)
model = Word2Vec(char_sequences, size=30, window=5, min_count=1)
model.save('char_embeddings.vec')
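The preprocessing pipeline is easy to check on a small inline string instead of a file; a self-contained sketch (the sample text is made up for illustration):

```python
# Same preprocessing as above, on a hard-coded string
alphabet = 'abcdefghijklmnopqrstuvwxyz0123456789,.)(; '
text = 'Hello,\nworld! a'.replace('\n', ' ').lower()

# Characters outside the alphabet (here '!') are dropped; ',' is kept
filtered = ''.join(ch for ch in text if ch in alphabet)
# Single-character tokens (here 'a') are filtered out
tokens = [t for t in filtered.split(' ') if len(t) >= 2]
char_sequences = [list(w) for w in tokens]
print(char_sequences)  # [['h','e','l','l','o',','], ['w','o','r','l','d']]
```

Note that punctuation in the alphabet survives as part of a word's character sequence, which matches the alphabet used in the training script.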

The preprocessing produces character sequences as shown below:

(figure: sample of the printed character sequences)

Testing the resulting model:

print(model['a'])
print(model.most_similar('a',topn=5))
---------------------------------------
array([-0.01051879,  0.00305209,  0.00773612,  0.01362684,  0.01594807,
        0.01029609,  0.00346048,  0.00261297, -0.01034051,  0.00964036,
       -0.00509238,  0.0021358 , -0.00605083,  0.0087046 ,  0.00930654,
        0.01411205,  0.00340451, -0.0071094 , -0.00138468,  0.00443402,
        0.00809182, -0.00498053, -0.00288919,  0.01092559, -0.01460177,
       -0.00596451, -0.00200858, -0.01376272,  0.00229289,  0.01006972],
      dtype=float32)
[('w', 0.5829492211341858), ('c', 0.34324681758880615), ('k', 0.3245270252227783), ('u', 0.20812581479549408), ('i', 0.15292495489120483)]
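The scores returned by most_similar are cosine similarities between embedding vectors. A minimal sketch of the underlying computation (the toy vectors are made up for illustration):

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity: dot product of the vectors divided by
    # the product of their Euclidean norms
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-dimensional vectors standing in for two character embeddings
v1 = np.array([1.0, 0.0, 1.0])
v2 = np.array([1.0, 1.0, 0.0])
print(cosine_similarity(v1, v2))  # close to 0.5
```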