Character-level word2vec
阿新 • Published: 2019-01-05
The paper "End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF" encodes character-level information with a convolutional neural network as part of its sequence-labeling (part-of-speech tagging) model.
The paper uses character embeddings of dimension 30, initialized uniformly in the range [-sqrt(3/dim), sqrt(3/dim)]. So I first experimented with character embeddings using word2vec.
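For reference, the paper's initialization scheme can be sketched with numpy; the alphabet size of 42 below is an assumption for illustration (26 letters, 10 digits, a few punctuation marks, and space):

```python
import numpy as np

# Sketch of the paper's character-embedding initialization:
# uniform in [-sqrt(3/dim), +sqrt(3/dim)], with dim = 30.
dim = 30
alphabet_size = 42  # illustrative: 26 letters + 10 digits + punctuation + space
bound = np.sqrt(3.0 / dim)
rng = np.random.default_rng(0)
char_embeddings = rng.uniform(-bound, bound, size=(alphabet_size, dim))
print(char_embeddings.shape)  # (42, 30)
```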
# Train character-level vectors with word2vec
from gensim.models import Word2Vec

alphabet = 'abcdefghijklmnopqrstuvwxyz0123456789,.)(; '
text = open('text').read().replace('\n', ' ').lower()
# Keep only characters that appear in the alphabet
filtered = ''.join(ch for ch in text if ch in alphabet)
tokens = filtered.split(' ')
# Drop single-character tokens
words = [t for t in tokens if len(t) >= 2]
#print(words)
# Each "sentence" fed to word2vec is the list of characters of one word
char_sequences = [list(w) for w in words]
print(char_sequences)
model = Word2Vec(char_sequences, vector_size=30, window=5, min_count=1)  # `vector_size` was `size` before gensim 4
model.save('char_embeddings.vec')
The processed character sequences look like this:
Then I tested the resulting model:
print(model.wv['a'])
print(model.wv.most_similar('a', topn=5))
---------------------------------------
array([-0.01051879,  0.00305209,  0.00773612,  0.01362684,  0.01594807,
        0.01029609,  0.00346048,  0.00261297, -0.01034051,  0.00964036,
       -0.00509238,  0.0021358 , -0.00605083,  0.0087046 ,  0.00930654,
        0.01411205,  0.00340451, -0.0071094 , -0.00138468,  0.00443402,
        0.00809182, -0.00498053, -0.00288919,  0.01092559, -0.01460177,
       -0.00596451, -0.00200858, -0.01376272,  0.00229289,  0.01006972], dtype=float32)
[('w', 0.5829492211341858), ('c', 0.34324681758880615), ('k', 0.3245270252227783), ('u', 0.20812581479549408), ('i', 0.15292495489120483)]
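Under the hood, `most_similar` ranks characters by cosine similarity between their vectors. A minimal re-implementation (the toy vectors below are random and purely illustrative, not the trained embeddings):

```python
import numpy as np

# Minimal sketch of what most_similar computes: cosine similarity of the
# query vector against every other vector, sorted descending.
def most_similar(vectors, query, topn=5):
    q = vectors[query]
    scores = {
        ch: float(np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v)))
        for ch, v in vectors.items() if ch != query
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:topn]

# Random toy vectors standing in for the learned character embeddings
rng = np.random.default_rng(0)
vectors = {ch: rng.normal(size=30) for ch in 'abcdefgh'}
print(most_similar(vectors, 'a', topn=3))
```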