Python | Training word2vec with gensim: understanding the related functions and features

阿新 • Published: 2019-01-21

I. Introduction to gensim

gensim is a powerful natural language processing toolkit that includes many common models:

Basic corpus-processing tools
LSI
LDA
HDP
DTM
DIM
TF-IDF
word2vec, paragraph2vec

II. Training a model

1. Training

The simplest way to train:

# The simplest start
import gensim
sentences = [['first', 'sentence'], ['second', 'sentence','is']]

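# Train the model
model = gensim.models.Word2Vec(sentences, min_count=1)
    # min_count: frequency threshold; words occurring at least this many times (here 1) are kept
    # size: number of units in the NN layer, i.e. the vector dimensionality; it also corresponds to the degrees of freedom of the training algorithm (renamed vector_size in gensim 4.0+)
    # workers=4: number of worker threads (1 worker means no parallelization); this only has an effect if Cython is installed, otherwise training runs on a single core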

The second way to train:

# Second way to train
new_model = gensim.models.Word2Vec(min_count=1)  # start with an empty model
new_model.build_vocab(sentences)                 # can be a non-repeatable, 1-pass generator     
new_model.train(sentences, total_examples=new_model.corpus_count, epochs=new_model.iter)                       
# can be a non-repeatable, 1-pass generator
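
The two-step form makes two separate passes over the corpus: one in build_vocab and one in train. Each call can take a one-pass generator, but the same generator object cannot be reused for both, because it is exhausted after the first pass. A minimal stdlib-only sketch of the difference follows; `RestartableCorpus` is a hypothetical helper name for illustration, not part of gensim.

```python
def one_pass(sentences):
    """A generator: it can be consumed exactly once."""
    for sentence in sentences:
        yield sentence


class RestartableCorpus:
    """Hypothetical wrapper: every iter() call starts a fresh pass."""

    def __init__(self, sentences):
        self.sentences = sentences

    def __iter__(self):
        return iter(self.sentences)


data = [['first', 'sentence'], ['second', 'sentence', 'is']]

gen = one_pass(data)
first_pass = sum(1 for _ in gen)    # consumes the generator
second_pass = sum(1 for _ in gen)   # nothing is left to read

corpus = RestartableCorpus(data)
passes = [sum(1 for _ in corpus) for _ in range(2)]

print(first_pass, second_pass)  # 2 0
print(passes)                   # [2, 2]
```

An object with its own `__iter__`, like the restartable wrapper here, is the safer choice whenever one corpus object is handed to several calls.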

Example:

#encoding=utf-8
from gensim.models import word2vec
sentences=word2vec.Text8Corpus(u'分詞後的爽膚水評論.txt')
model=word2vec.Word2Vec(sentences, size=50)

y2=model.similarity(u"好", u"還行")
print(y2)

for i in model.most_similar(u"滋潤"):
    print(i[0], i[1])

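The txt file contains 50,000 reviews that have already been word-segmented. Training the model takes a single line:

model=word2vec.Word2Vec(sentences,min_count=5,size=50)

The first argument is the training corpus. The second (min_count) drops words that occur fewer times than this value; the default is 5.
The third (size) is the number of hidden-layer units in the network; the default is 100.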
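
To make the effect of min_count concrete, here is a small sketch in plain Python. It only imitates the pruning rule described above (words below the threshold are dropped); `prune_vocab` is a hypothetical helper, not a gensim function.

```python
from collections import Counter


def prune_vocab(sentences, min_count):
    """Keep only words whose corpus frequency is at least min_count."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return {word: n for word, n in counts.items() if n >= min_count}


sentences = [['first', 'sentence'], ['second', 'sentence', 'is']]

kept_all = prune_vocab(sentences, min_count=1)   # every word survives
kept_some = prune_vocab(sentences, min_count=2)  # only 'sentence' occurs twice

print(kept_all)
print(kept_some)  # {'sentence': 2}
```

With the default of 5, a tiny toy corpus like this one would end up with an empty vocabulary, which is why the earlier examples pass min_count=1.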

2. Using the model

# Similarity from word vectors
model.similarity('first','is')    # similarity between two words
model.most_similar(positive=['first', 'second'], negative=['sentence'], topn=1)  # analogy-style query
model.doesnt_match("input is lunch he sentence cat".split())                   # find the word that does not fit
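
The similarity between two words is the cosine of the angle between their vectors. The sketch below computes it by hand in plain Python; the three-dimensional vectors are invented for illustration and do not come from a trained model.

```python
import math


def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Invented toy vectors (assumption: real vectors would come from model[word])
toy_vectors = {
    'first': [1.0, 0.0, 0.0],
    'second': [1.0, 1.0, 0.0],
    'is': [0.0, 0.0, 1.0],
}

sim_close = cosine(toy_vectors['first'], toy_vectors['second'])
sim_far = cosine(toy_vectors['first'], toy_vectors['is'])

print(round(sim_close, 4))  # 0.7071
print(sim_far)              # 0.0
```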

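To inspect the word vectors inside the model:

# Look up a word vector
model['first']

3. Exporting and importing models

The simplest export and import:

# Save and load the model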
model.save('/tmp/mymodel')
new_model = gensim.models.Word2Vec.load('/tmp/mymodel')
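model = gensim.models.Word2Vec.load_word2vec_format('/tmp/vectors.txt', binary=False)  # load a .txt file (deprecated since gensim 1.0; prefer KeyedVectors.load_word2vec_format)
# using gzipped/bz2 input works too, no need to unzip:
model = gensim.models.Word2Vec.load_word2vec_format('/tmp/vectors.bin.gz', binary=True)  # load a .bin file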

word2vec = gensim.models.word2vec.Word2Vec(sentences(), size=256, window=10, min_count=64, sg=1, hs=1, iter=10, workers=25)
word2vec.save('word2vec_wx')

word2vec.save exports the model to a file; here it is not exported as .bin.

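import pandas as pd
model = gensim.models.Word2Vec.load('xxx/word2vec_wx')
pd.Series(model.most_similar(u'微信',topn = 360000))

The saved model is loaded back with gensim.models.Word2Vec.load.

The NumPy arrays stored alongside the model can be read with numpy.load: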

import numpy
word_2x = numpy.load('xxx/word2vec_wx.wv.syn0.npy')

Other ways to import:

from gensim.models.keyedvectors import KeyedVectors
word_vectors = KeyedVectors.load_word2vec_format('/tmp/vectors.txt', binary=False)  # C text format
word_vectors = KeyedVectors.load_word2vec_format('/tmp/vectors.bin', binary=True)  # C binary format

This loads both the txt format and the bin format.
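
For orientation, the C text format is simple: a header line with the vocabulary size and the vector dimensionality, then one line per word holding the word followed by its values. The sketch below writes and parses that layout with the standard library only. `write_text_format` and `read_text_format` are hypothetical helpers and the vectors are invented; real files should go through KeyedVectors.

```python
import io


def write_text_format(vectors, stream):
    """Write 'vocab_size dim' then one 'word v1 v2 ...' line per word."""
    dim = len(next(iter(vectors.values())))
    stream.write('%d %d\n' % (len(vectors), dim))
    for word, vec in vectors.items():
        stream.write(word + ' ' + ' '.join('%f' % x for x in vec) + '\n')


def read_text_format(stream):
    """Parse the layout produced by write_text_format."""
    vocab_size, dim = (int(x) for x in stream.readline().split())
    vectors = {}
    for _ in range(vocab_size):
        parts = stream.readline().rstrip('\n').split(' ')
        values = [float(x) for x in parts[1:]]
        if len(values) != dim:
            raise ValueError('unexpected vector length for %r' % parts[0])
        vectors[parts[0]] = values
    return vectors


toy = {'first': [0.1, 0.2], 'is': [0.3, 0.4]}  # invented vectors
buf = io.StringIO()
write_text_format(toy, buf)
header = buf.getvalue().splitlines()[0]

buf.seek(0)
loaded = read_text_format(buf)
print(header)   # 2 2
print(loaded)
```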

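Other ways to export:

from gensim.models.keyedvectors import KeyedVectors
# save
model.save(fname) # only a model saved this way can continue training!
model.wv.save_word2vec_format(outfile + '.model.bin', binary=True)  # C binary format; takes about half the disk space of the previous method
model.wv.save_word2vec_format(outfile + '.model.txt', binary=False) # C text format; takes a lot of disk space, about the same as the first method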

# load
model = gensim.models.Word2Vec.load(fname)  
word_vectors = KeyedVectors.load_word2vec_format('/tmp/vectors.txt', binary=False)
word_vectors = KeyedVectors.load_word2vec_format('/tmp/vectors.bin', binary=True)

# Most memory-efficient way to load
model = gensim.models.Word2Vec.load('model path')
word_vectors = model.wv
del model
word_vectors.init_sims(replace=True)

Source: Jianshu (簡書). Note: if you do not plan to train the model any further, calling init_sims makes the model's storage more efficient.
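
The saving comes from L2-normalizing the vectors and, with replace=True, keeping only the normalized copies. Once every vector has unit length, cosine similarity reduces to a plain dot product. The sketch below shows that idea in plain Python with invented vectors; it is an illustration of the principle, not gensim's implementation.

```python
import math


def unit_normalize(vec):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


# Invented raw vectors (assumption: stand-ins for trained word vectors)
raw = {'first': [3.0, 4.0], 'second': [6.0, 8.0], 'is': [4.0, -3.0]}

# Keep only the normalized copies, as replace=True does
normed = {word: unit_normalize(vec) for word, vec in raw.items()}

same_direction = dot(normed['first'], normed['second'])
orthogonal = dot(normed['first'], normed['is'])

print(normed['first'])  # [0.6, 0.8]
print(same_direction, orthogonal)
```

Because the raw lengths are discarded, a model treated this way is suitable for querying only, which matches the note about not training further.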

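4. Incremental training

model = gensim.models.Word2Vec.load('/tmp/mymodel')
model.train(more_sentences)  # bare call from older gensim; newer releases also require total_examples and epochs, as in the fuller example further down

Models generated by the original C tool cannot be trained further.

# Incremental training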
model = gensim.models.Word2Vec.load(temp_path)
more_sentences = [['Advanced', 'users', 'can', 'load', 'a', 'model', 'and', 'continue', 'training', 'it', 'with', 'more', 'sentences']]
model.build_vocab(more_sentences, update=True)
model.train(more_sentences, total_examples=model.corpus_count, epochs=model.iter)

5. bow2vec + TF-IDF models

5.1 Bow2vec

The main steps are:
split sentences into word-level tokens (tokenization);
build a dictionary;
build the sparse document matrix

documents = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time",
             "The EPS user interface management system",
             "System and human system engineering testing of EPS",              
             "Relation of user perceived response time to error measurement",
             "The generation of random binary unordered trees",
             "The intersection graph of paths in trees",
             "Graph minors IV Widths of trees and well quasi ordering",
             "Graph minors A survey"]

# remove common words (stop words) and tokenize
stoplist = set('for a of the and to in'.split())
texts = [[word for word in document.lower().split() if word not in stoplist]
         for document in documents]

The resulting token lists:

[['human', 'machine', 'interface', 'lab', 'abc', 'computer', 'applications'],
 ['survey', 'user', 'opinion', 'computer', 'system', 'response', 'time'],
 ['eps', 'user', 'interface', 'management', 'system'],
 ['system', 'human', 'system', 'engineering', 'testing', 'eps'],
 ['relation', 'user', 'perceived', 'response', 'time', 'error', 'measurement'],
 ['generation', 'random', 'binary', 'unordered', 'trees'],
 ['intersection', 'graph', 'paths', 'trees'],
 ['graph', 'minors', 'iv', 'widths', 'trees', 'well', 'quasi', 'ordering'],
 ['graph', 'minors', 'survey']]

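Building the dictionary:

# Build the dictionary
import os
import tempfile
from gensim import corpora
TEMP_FOLDER = tempfile.gettempdir()  # assumed here; any writable directory works
dictionary = corpora.Dictionary(texts)
dictionary.save(os.path.join(TEMP_FOLDER, 'deerwester.dict'))  # store the dictionary, for future reference
print(dictionary)
print(dictionary.token2id)  # show every token in the dictionary with its id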

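Building the sparse document matrix:

# Bag-of-words for a single sentence
new_doc = "Human computer interaction Human"
new_vec = dictionary.doc2bow(new_doc.lower().split())
print(new_vec)  
    # the word "interaction" does not appear in the dictionary and is ignored
    # the result is a list of (token_id, count) pairs for the current sentence:
    # "computer" has count 1 and "human" has count 2, because it occurs twice
    
# Bag-of-words for several sentences
[dictionary.doc2bow(text) for text in texts]  # token id + count for each sentence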
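
The bag-of-words step can be imitated in a few lines of plain Python, which may help when reading the (token_id, count) pairs. In this sketch `build_token2id` and `doc2bow` are hypothetical stand-ins, and assigning ids in alphabetical order is this sketch's own convention, so the ids need not match what gensim assigns.

```python
from collections import Counter


def build_token2id(texts):
    """Map every distinct token to an integer id (alphabetical order here)."""
    vocab = sorted({token for text in texts for token in text})
    return {token: idx for idx, token in enumerate(vocab)}


def doc2bow(token2id, tokens):
    """Count known tokens; unknown tokens are silently ignored."""
    counts = Counter(t for t in tokens if t in token2id)
    return sorted((token2id[t], n) for t, n in counts.items())


texts = [['human', 'machine', 'interface'], ['computer', 'system', 'human']]
token2id = build_token2id(texts)
bow = doc2bow(token2id, "human computer interaction human".split())

print(token2id)
print(bow)  # [(0, 1), (1, 2)]
```

Here "computer" gets id 0 and "human" gets id 1, so the pair (1, 2) records that "human" occurs twice, while "interaction" is dropped because it is not in the dictionary.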

5.2 tfidf

from gensim import corpora, models, similarities
corpus = [[(0, 1.0), (1, 1.0), (2, 1.0)],
          [(2, 1.0), (3, 1.0), (4, 1.0), (5, 1.0), (6, 1.0), (8, 1.0)],
          [(1, 1.0), (3, 1.0), (4, 1.0), (7, 1.0)],
          [(0, 1.0), (4, 2.0), (7, 1.0)],
          [(3, 1.0), (5, 1.0), (6, 1.0)],
          [(9, 1.0)],
          [(9, 1.0), (10, 1.0)],
          [(9, 1.0), (10, 1.0), (11, 1.0)],
          [(8, 1.0), (10, 1.0), (11, 1.0)]]
tfidf = models.TfidfModel(corpus)

# Apply the model to a bag-of-words vector
vec = [(0, 1), (4, 1),(9, 1)]
print(tfidf[vec])
>>>  [(0, 0.695546419520037), (4, 0.5080429008916749), (9, 0.5080429008916749)]

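This looks up the TF-IDF weights of tokens 0, 4 and 9 in vec. The same transformation turns the term counts of the earlier document matrix into TF-IDF weights.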
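
The printed weights can be reproduced by hand, which shows what the transformation does. The sketch below assumes the weighting is raw term count times log2(N / document frequency), followed by L2 normalization; that assumption matches the numbers printed above for this corpus. The helper names are hypothetical.

```python
import math

corpus = [[(0, 1.0), (1, 1.0), (2, 1.0)],
          [(2, 1.0), (3, 1.0), (4, 1.0), (5, 1.0), (6, 1.0), (8, 1.0)],
          [(1, 1.0), (3, 1.0), (4, 1.0), (7, 1.0)],
          [(0, 1.0), (4, 2.0), (7, 1.0)],
          [(3, 1.0), (5, 1.0), (6, 1.0)],
          [(9, 1.0)],
          [(9, 1.0), (10, 1.0)],
          [(9, 1.0), (10, 1.0), (11, 1.0)],
          [(8, 1.0), (10, 1.0), (11, 1.0)]]


def document_frequencies(docs):
    """Number of documents in which each term id occurs."""
    df = {}
    for doc in docs:
        for term_id, _ in doc:
            df[term_id] = df.get(term_id, 0) + 1
    return df


def tfidf_vector(bow, df, num_docs):
    """tf * log2(N / df), then scale the vector to unit length."""
    weights = [(t, tf * math.log2(num_docs / df[t])) for t, tf in bow if t in df]
    norm = math.sqrt(sum(w * w for _, w in weights))
    return [(t, w / norm) for t, w in weights]


df = document_frequencies(corpus)
weights = tfidf_vector([(0, 1), (4, 1), (9, 1)], df, len(corpus))
print(weights)
```

Token 0 occurs in 2 of the 9 documents while tokens 4 and 9 each occur in 3, so token 0 is rarer and receives the larger weight.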

Using tfidf to compute similarity:

# Compute similarity
index = similarities.SparseMatrixSimilarity(tfidf[corpus], num_features=12)
vec = [(0, 1), (4, 1),(9, 1)]
sims = index[tfidf[vec]]
print(list(enumerate(sims)))
>>>[(0, 0.40157393), (1, 0.16485332), (2, 0.21189235), (3, 0.70710677), (4, 0.0), (5, 0.5080429), (6, 0.35924056), (7, 0.25810757), (8, 0.0)]

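An index is built over the 9 documents in corpus. vec is the bag-of-words of a new document, and sims holds the similarity of vec to each of the nine documents in corpus.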
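
Each similarity score is the cosine between the query's TF-IDF vector and a document's TF-IDF vector. Since both are unit length, that is a sparse dot product. The following stdlib-only sketch recomputes the nine scores under the same assumed weighting (raw count times log2 idf, L2-normalized); `tfidf_unit` and `sparse_dot` are hypothetical helpers.

```python
import math

corpus = [[(0, 1.0), (1, 1.0), (2, 1.0)],
          [(2, 1.0), (3, 1.0), (4, 1.0), (5, 1.0), (6, 1.0), (8, 1.0)],
          [(1, 1.0), (3, 1.0), (4, 1.0), (7, 1.0)],
          [(0, 1.0), (4, 2.0), (7, 1.0)],
          [(3, 1.0), (5, 1.0), (6, 1.0)],
          [(9, 1.0)],
          [(9, 1.0), (10, 1.0)],
          [(9, 1.0), (10, 1.0), (11, 1.0)],
          [(8, 1.0), (10, 1.0), (11, 1.0)]]

# Document frequency of every term id
df = {}
for doc in corpus:
    for term_id, _ in doc:
        df[term_id] = df.get(term_id, 0) + 1


def tfidf_unit(bow):
    """Unit-length TF-IDF vector as a {term_id: weight} dict."""
    raw = {t: tf * math.log2(len(corpus) / df[t]) for t, tf in bow if t in df}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: w / norm for t, w in raw.items()}


def sparse_dot(a, b):
    """Dot product over the term ids the two vectors share."""
    return sum(w * b[t] for t, w in a.items() if t in b)


query = tfidf_unit([(0, 1), (4, 1), (9, 1)])
sims = [sparse_dot(query, tfidf_unit(doc)) for doc in corpus]
print([round(s, 4) for s in sims])
```

Documents 4 and 8 share no token with the query, so their score is exactly zero, while document 3 contains tokens 0 and 4 and scores highest.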

Saving and loading the index:

index.save('/tmp/deerwester.index')
index = similarities.SparseMatrixSimilarity.load('/tmp/deerwester.index')  # load with the same class that built the index

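5.3 Further transformations

Latent Semantic Indexing (LSI) transforms the Tf-Idf corpus into a latent 2-D space

lsi = models.LsiModel(tfidf[corpus], id2word=dictionary, num_topics=2) # initialize an LSI transformation
corpus_lsi = lsi[tfidf[corpus]] # create a double wrapper over the original corpus: bow->tfidf->fold-in-lsi

With num_topics=2 set, use models.LsiModel.print_topics() to see what the transformation actually produced:

lsi.print_topics(2)

According to LSI, "trees", "graph" and "minors" are all related words (and contribute the most to the direction of the first topic), while the second topic is in practice related to all the words. As expected, the first five documents are more strongly related to the second topic, while the remaining four documents are most strongly related to the first topic: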

>>> for doc in corpus_lsi: # both bow->tfidf and tfidf->lsi transformations are actually executed here, on the fly
...     print(doc)
[(0, -0.066), (1, 0.520)] # "Human machine interface for lab abc computer applications"
[(0, -0.197), (1, 0.761)] # "A survey of user opinion of computer system response time"
[(0, -0.090), (1, 0.724)] # "The EPS user interface management system"
[(0, -0.076), (1, 0.632)] # "System and human system engineering testing of EPS"
[(0, -0.102), (1, 0.574)] # "Relation of user perceived response time to error measurement"
[(0, -0.703), (1, -0.161)] # "The generation of random binary unordered trees"
[(0, -0.877), (1, -0.168)] # "The intersection graph of paths in trees"
[(0, -0.910), (1, -0.141)] # "Graph minors IV Widths of trees and well quasi ordering"
[(0, -0.617), (1, 0.054)] # "Graph minors A survey"

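III. Using a trained gensim word2vec model

1. Similarity

Several word-similarity tasks are supported:
similar words with similarity scores (model.most_similar), model.doesnt_match, and model.similarity (pairwise similarity)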

model.most_similar(positive=['woman', 'king'], negative=['man'], topn=1)
[('queen', 0.50882536)]

model.most_similar(positive='woman', topn=topn, restrict_vocab=restrict_vocab)  # pass a word directly
model.most_similar(positive=[vector], topn=topn, restrict_vocab=restrict_vocab)  # pass a vector directly

model.doesnt_match("breakfast cereal dinner lunch".split())
'cereal'

model.similarity('woman', 'man')
0.73723527
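
Roughly speaking, the analogy query adds the unit vectors of the positive words, subtracts those of the negative words, and returns the closest remaining word by cosine similarity. doesnt_match returns the word least similar to the mean of the group. The sketch below imitates both in plain Python with invented toy vectors; `analogy` and `odd_one_out` are hypothetical helpers, and the library's implementation differs in detail.

```python
import math


def unit(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cosine(u, v):
    return sum(a * b for a, b in zip(unit(u), unit(v)))


def analogy(vectors, positive, negative):
    """Closest word to sum(positive) - sum(negative), inputs excluded."""
    dim = len(next(iter(vectors.values())))
    target = [0.0] * dim
    for word in positive:
        target = [t + x for t, x in zip(target, unit(vectors[word]))]
    for word in negative:
        target = [t - x for t, x in zip(target, unit(vectors[word]))]
    candidates = [w for w in vectors if w not in positive and w not in negative]
    return max(candidates, key=lambda w: cosine(vectors[w], target))


def odd_one_out(vectors, words):
    """Word whose vector is least similar to the mean of the group."""
    units = {w: unit(vectors[w]) for w in words}
    dim = len(units[words[0]])
    mean = [sum(units[w][i] for w in words) / len(words) for i in range(dim)]
    return min(words, key=lambda w: cosine(units[w], mean))


# Invented toy vectors: dimensions are (royal, male, female)
royal = {'king': [1.0, 1.0, 0.0], 'queen': [1.0, 0.0, 1.0],
         'man': [0.0, 1.0, 0.0], 'woman': [0.0, 0.0, 1.0],
         'prince': [1.0, 1.0, 0.2]}
# Invented toy vectors: dimensions are (meal, food item)
meals = {'breakfast': [1.0, 0.1], 'dinner': [1.0, 0.0],
         'lunch': [1.0, 0.2], 'cereal': [0.1, 1.0]}

best = analogy(royal, positive=['woman', 'king'], negative=['man'])
odd = odd_one_out(meals, "breakfast cereal dinner lunch".split())
print(best)  # queen
print(odd)   # cereal
```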

2. Word vectors

A word's vector can be obtained as follows:

model['computer']  # raw NumPy vector of a word
array([-0.00449447, -0.00310097,  0.02421786, ...], dtype=float32)

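3. Vocabulary list

model.wv.vocab.keys()  # gensim 3.x; in gensim 4.0+ the vocabulary lives in model.wv.key_to_index

Case 1: training on a corpus of 8 million WeChat articles

Training procedure: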

import gensim, logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

import pymongo
import hashlib

db = pymongo.MongoClient('172.16.0.101').weixin.text_articles_words
md5 = lambda s: hashlib.md5(s).hexdigest()
class sentences:
    def __iter__(self):
        texts_set = set()
        for a in db.find(no_cursor_timeout=True):
            if md5(a['text'].encode('utf-8')) in texts_set:
                continue
            else:
                texts_set.add(md5(a['text'].encode('utf-8')))
                yield a['words']
        print('Processed %s articles in total' % len(texts_set))

word2vec = gensim.models.word2vec.Word2Vec(sentences(), size=256, window=10, min_count=64, sg=1, hs=1, iter=10, workers=25)
word2vec.save('word2vec_wx')

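hashlib.md5 is used here to deduplicate the articles (there were originally 10 million articles, and 8 million remained after deduplication). This step is not strictly necessary.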
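
The deduplication idea can be tried without a database. The sketch below applies the same MD5-of-text check to an in-memory list; the sample articles and the `unique_articles` helper are invented for illustration.

```python
import hashlib


def unique_articles(articles):
    """Yield the token list of each article whose text was not seen before."""
    seen = set()
    for article in articles:
        digest = hashlib.md5(article['text'].encode('utf-8')).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        yield article['words']


# Invented sample records with the same 'text' and 'words' fields as above
articles = [
    {'text': 'hello world', 'words': ['hello', 'world']},
    {'text': 'hello world', 'words': ['hello', 'world']},
    {'text': 'another text', 'words': ['another', 'text']},
]

unique = list(unique_articles(articles))
print(unique)  # [['hello', 'world'], ['another', 'text']]
```

Storing the fixed-length digests instead of the full texts keeps the in-memory set small, which matters at the scale of millions of articles.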

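Case 2: training character vectors and word vectors

# Train word vectors
import pandas as pd
from tqdm import tqdm
from gensim.models import Word2Vec

# train_df and test_df are assumed to be pandas DataFrames loaded elsewhere
def train_w2v_model(type='char', min_freq=5, size=100):
    sentences = []

    if type == 'char':
        corpus = pd.concat((train_df['article'], test_df['article']))
    elif type == 'word':
        corpus = pd.concat((train_df['word_seg'], test_df['word_seg']))
    else:
        # the original default 'article' matched neither branch and left corpus undefined
        raise ValueError("type must be 'char' or 'word'")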
    for e in tqdm(corpus):
        sentences.append([i for i in e.strip().split() if i])
    print('Training corpus size:', len(corpus))
    print('Total sentences: ', len(sentences))
    model = Word2Vec(sentences, size=size, window=5, min_count=min_freq)
    model.itos = {}
    model.stoi = {}
    model.embedding = {}
    
    print('Saving model...')
    for k in tqdm(model.wv.vocab.keys()):
        model.itos[model.wv.vocab[k].index] = k
        model.stoi[k] = model.wv.vocab[k].index
        model.embedding[model.wv.vocab[k].index] = model.wv[k]

    model.save('../../data/word2vec-models/word2vec.{}.{}d.mfreq{}.model'.format(type, size, min_freq))
    return model
model = train_w2v_model(type='char', size=100)
model = train_w2v_model(type='word', size=100)
# model.wv.save_word2vec_format('../../data/laozhu-word-300d', binary=False)
# train_df[:3]
print('OK')
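
The loop that fills itos, stoi and embedding builds three lookup tables: index to token, token to index, and index to vector. A minimal sketch of the same bookkeeping on invented data follows; `build_lookups` is a hypothetical helper and the vectors are made up.

```python
def build_lookups(index_to_word, vectors):
    """Return (itos, stoi, embedding) for an ordered vocabulary."""
    itos = dict(enumerate(index_to_word))             # index -> token
    stoi = {word: idx for idx, word in itos.items()}  # token -> index
    embedding = {idx: vectors[word] for idx, word in itos.items()}
    return itos, stoi, embedding


# Invented vocabulary and vectors (assumption: stand-ins for model.wv contents)
index_to_word = ['the', 'cat', 'sat']
vectors = {'the': [0.1, 0.2], 'cat': [0.3, 0.4], 'sat': [0.5, 0.6]}

itos, stoi, embedding = build_lookups(index_to_word, vectors)
print(itos)                     # {0: 'the', 1: 'cat', 2: 'sat'}
print(stoi['cat'])              # 1
print(embedding[stoi['sat']])   # [0.5, 0.6]
```

Tables like these are handy when the vectors are later copied into an embedding layer that addresses tokens by integer index.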


Related transformations

Term Frequency * Inverse Document Frequency (Tf-Idf)

Latent Semantic Indexing (LSI, sometimes called LSA)

Random Projections (RP)

Latent Dirichlet Allocation (LDA)