
Training word vectors with gensim word2vec in Python

Setup

Once Anaconda is installed, gensim can be installed from the command prompt with:

conda install gensim

About gensim

gensim is a powerful natural-language-processing toolkit that ships with many common models. Here is an overview:

- interfaces – Core gensim interfaces
- utils – Various utility functions
- matutils – Math utils
- corpora.bleicorpus – Corpus in Blei's LDA-C format
- corpora.dictionary – Construct word<->id mappings
- corpora.hashdictionary – Construct word<->id mappings
- corpora.lowcorpus – Corpus in List-of-Words format
- corpora.mmcorpus – Corpus in Matrix Market format
- corpora.svmlightcorpus – Corpus in SVMlight format
- corpora.wikicorpus – Corpus from a Wikipedia dump
- corpora.textcorpus – Building corpora with dictionaries
- corpora.ucicorpus – Corpus in UCI bag-of-words format
- corpora.indexedcorpus – Random access to corpus documents
- models.ldamodel – Latent Dirichlet Allocation
- models.ldamulticore – parallelized Latent Dirichlet Allocation
- models.ldamallet – Latent Dirichlet Allocation via Mallet
- models.lsimodel – Latent Semantic Indexing
- models.tfidfmodel – TF-IDF model
- models.rpmodel – Random Projections
- models.hdpmodel – Hierarchical Dirichlet Process
- models.logentropy_model – LogEntropy model
- models.lsi_dispatcher – Dispatcher for distributed LSI
- models.lsi_worker – Worker for distributed LSI
- models.lda_dispatcher – Dispatcher for distributed LDA
- models.lda_worker – Worker for distributed LDA
- models.word2vec – Deep learning with word2vec
- models.doc2vec – Deep learning with paragraph2vec
- models.dtmmodel – Dynamic Topic Models (DTM) and Dynamic Influence Models (DIM)
- models.phrases – Phrase (collocation) detection
- similarities.docsim – Document similarity queries
- simserver – Document similarity server

As you can see, it includes:
- basic corpus-processing tools
- LSI
- LDA
- HDP
- DTM
- DIM
- TF-IDF
- word2vec、paragraph2vec

We'll cover the other models when we need them; today let's try out:

word2vec

# encoding=utf-8
from gensim.models import word2vec

# corpus: a file of word-segmented toner reviews
sentences = word2vec.Text8Corpus(u'分詞後的爽膚水評論.txt')
model = word2vec.Word2Vec(sentences, size=50)

# cosine similarity between two words
y2 = model.similarity(u"好", u"還行")
print(y2)

for i in model.most_similar(u"滋潤"):
    print(i[0], i[1])

The txt file holds 50,000 reviews that have already been word-segmented. Training the model takes a single line:

model=word2vec.Word2Vec(sentences,min_count=5,size=50)

The first argument is the training corpus; the second, min_count, drops any word that occurs fewer than that many times (default 5);
the third, size, is the number of hidden-layer units, i.e. the dimensionality of the word vectors (default 100).

model.similarity(u"好", u"還行")  # cosine similarity between the two words

model.most_similar(u"滋潤")  # the 10 words closest in cosine similarity to "滋潤" (moisturizing)
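For intuition, `similarity` is just the cosine of the angle between the two words' vectors. A minimal standard-library sketch of that computation (the two example vectors below are made up, not taken from the trained model):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

v1 = [1.0, 2.0, 0.5]  # invented example vectors
v2 = [0.8, 1.9, 0.7]
print(cosine_similarity(v1, v2))   # close to 1.0 for similar directions
print(cosine_similarity(v1, v1))   # a vector compared with itself gives 1.0
```

Values near 1 mean the two words appear in similar contexts; values near 0 mean they are unrelated.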

Output:

0.642981583608
保溼 0.995047152042
溫和 0.985100984573
高 0.978088200092
舒服 0.969187200069
補水 0.967649161816
清爽 0.960570812225
水水 0.958645284176
一般 0.928643763065
一款 0.911774456501
真的 0.90943980217

Not bad, even with a corpus of only 50,000 reviews.

Of course, you can also save and reload the model you worked so hard to train:

model.save('/model/word2vec_model')

new_model = word2vec.Word2Vec.load('/model/word2vec_model')

You can also look up the vector for any individual word:

model['computer'] 
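This returns the word's learned vector, an array of length `size`. Under the hood, `most_similar` simply ranks every other word in the vocabulary by cosine similarity to the query word's vector. A toy standard-library sketch with an invented four-word vocabulary and made-up vectors (not from the trained model):

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# Hypothetical word vectors for illustration only
vectors = {
    "moisturizing": [0.9, 0.1, 0.3],
    "hydrating":    [0.8, 0.2, 0.35],
    "cheap":        [0.1, 0.9, 0.2],
    "refreshing":   [0.7, 0.3, 0.4],
}

def most_similar(word, topn=3):
    """Rank all other vocabulary words by cosine similarity to `word`."""
    query = vectors[word]
    scores = [(w, cosine(query, v)) for w, v in vectors.items() if w != word]
    return sorted(scores, key=lambda t: t[1], reverse=True)[:topn]

for w, score in most_similar("moisturizing"):
    print(w, round(score, 3))
```

With these made-up vectors, "hydrating" ranks first because its vector points in nearly the same direction as "moisturizing", while "cheap" ranks last.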

The parameters passed at training time also have a big effect on the result, so choose them to fit your corpus. Good word vectors matter for downstream NLP tasks such as classification, clustering, and similarity judgments!