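Understanding the gensim word2vec Python source code (Part 1)

阿新 • Published: 2018-12-28

Understanding the gensim word2vec Python source code (Part 1): building the vocabulary with Hierarchical Softmax
Understanding the gensim word2vec Python source code (Part 2): training the Skip-gram model

This post covers my understanding of the part of the word2vec Python source code packaged in gensim that builds the vocabulary when Hierarchical Softmax is used.
The paper I had been reading extends the Skip-gram model with Hierarchical Softmax, so while reading the code I focused on how the vocabulary is built for Hierarchical Softmax and on how the Skip-gram model is trained. I will keep studying the code for the negative sampling method and the CBOW model.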

init

Initialize a model (in practice, an instance of the Word2Vec class):

model = Word2Vec(sentences, size=100, window=5, min_count=5, workers=4)

This enters the class's __init__ method, which initializes the attribute values.
When the training sentences passed in are not empty, it mainly calls two methods:

self.build_vocab(sentences, trim_rule=trim_rule)
self.train(
    sentences, total_examples=self.corpus_count, epochs=self.iter,
    start_alpha=self.alpha, end_alpha=self.min_alpha
)

build_vocab

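This method builds the vocabulary from a sequence of sentences, where each sentence is a list of strings. It calls three methods in turn: scan_vocab, scale_vocab, and finalize_vocab.
The three methods are described in turn below:

scan_vocab: scans the sentences and counts how often each word occurs

Code walkthrough (with omissions):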

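sentence_no = -1  # index of the last sentence scanned (-1 means none yet)
total_words = 0  # total number of word occurrences (duplicates included)
min_reduce = 1
vocab = defaultdict(int)  # raw vocabulary: word -> count
checked_string_types = 0

# scan every sentence
for sentence_no, sentence in enumerate(sentences):  # each sentence with its position in the corpus
    for word in sentence:
        vocab[word] += 1  # count the occurrences of each word
    total_words += len(sentence)  # running total of words in the sentences scanned so far
    if self.max_vocab_size and len(vocab) > self.max_vocab_size:  # a size limit is set and currently exceeded
        # remove all words whose count is below min_reduce (initially 1)
        utils.prune_vocab(vocab, min_reduce, trim_rule=trim_rule)
        min_reduce += 1  # keep raising min_reduce so later passes bring the vocabulary back under max_vocab_size

self.corpus_count = sentence_no + 1  # number of sentences in the corpus
self.raw_vocab = vocab  # save the raw vocabulary
return total_words  # return the total word count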
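The scanning and pruning logic above can be tried out in isolation. The following is a minimal standalone sketch, not gensim's own code: `count_words` and the simplified `prune_vocab` are hypothetical helpers written for illustration, and the pruning rule (drop counts below `min_reduce`) is assumed from the comments above.

```python
from collections import defaultdict


def prune_vocab(vocab, min_reduce):
    """Simplified stand-in: delete every word whose count is below min_reduce."""
    for word in list(vocab):
        if vocab[word] < min_reduce:
            del vocab[word]


def count_words(sentences, max_vocab_size=None):
    """Count raw word frequencies; return (vocab, total_words, corpus_count)."""
    sentence_no = -1
    total_words = 0
    min_reduce = 1
    vocab = defaultdict(int)
    for sentence_no, sentence in enumerate(sentences):
        for word in sentence:
            vocab[word] += 1
        total_words += len(sentence)
        # prune only when a limit is set and currently exceeded
        if max_vocab_size and len(vocab) > max_vocab_size:
            prune_vocab(vocab, min_reduce)
            min_reduce += 1
    return dict(vocab), total_words, sentence_no + 1


sentences = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "cat", "ran"]]
vocab, total_words, corpus_count = count_words(sentences)
pruned_vocab, _, _ = count_words(sentences, max_vocab_size=3)
```

In this toy corpus the first pruning pass (min_reduce = 1) removes nothing, because every counted word already has a count of at least 1. Only the second pass, with min_reduce = 2, drops the words that occur once.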

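scale_vocab: applies the vocabulary settings min_count (discard less frequent words) and sample (control the downsampling of more frequent words).

Code walkthrough (with omissions):
Loading a new vocabulary: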

if not update:  # load a new vocabulary
    retain_total, retain_words = 0, []  # total count retained, words retained
    # iterate over words and their counts; raw_vocab is the dict saved by scan_vocab
    for word, v in iteritems(self.raw_vocab):
        # decide whether the current word is kept; trim_rule is the pruning rule, None by default
        if keep_vocab_item(word, v, min_count, trim_rule=trim_rule):
            retain_words.append(word)  # keep the word
            retain_total += v  # add its count
            if not dry_run:
                # build a Vocab instance for each word, passing in its frequency and index
                self.wv.vocab[word] = Vocab(count=v, index=len(self.wv.index2word))
                self.wv.index2word.append(word)
        else:  # does not meet the condition, so drop it
            drop_unique += 1
            drop_total += v
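To make the min_count filtering concrete, here is a small sketch under simplifying assumptions: `apply_min_count` is a hypothetical function, plain dicts stand in for the Vocab class, and the keep rule is assumed to be simply `count >= min_count` (no trim_rule).

```python
def apply_min_count(raw_vocab, min_count=5):
    """Keep words with count >= min_count and assign them consecutive indices."""
    vocab, index2word = {}, []
    retain_total = 0
    drop_unique = drop_total = 0
    for word, count in raw_vocab.items():
        if count >= min_count:
            # the index is the word's position in index2word at insertion time
            vocab[word] = {"count": count, "index": len(index2word)}
            index2word.append(word)
            retain_total += count
        else:
            drop_unique += 1
            drop_total += count
    return vocab, index2word, retain_total, drop_unique, drop_total


raw_vocab = {"the": 3, "cat": 2, "sat": 2, "dog": 1, "ran": 1}
vocab, index2word, retain_total, drop_unique, drop_total = apply_min_count(raw_vocab, min_count=2)
```

With min_count=2, the three words that occur at least twice are kept and indexed in insertion order, while "dog" and "ran" are counted as dropped.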

Adding new words to update the model:

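else:
    new_total = pre_exist_total = 0
    # note: this chained assignment binds both names to the same list object
    new_words = pre_exist_words = []
    for word, v in iteritems(self.raw_vocab):  # iterate over the vocabulary used for the update
        if keep_vocab_item(word, v, min_count, trim_rule=trim_rule):  # decide whether the current word is kept
            if word in self.wv.vocab:  # the word already exists in the previous vocabulary
                pre_exist_words.append(word)  # add it to the list of pre-existing words
                pre_exist_total += v  # add its count
                if not dry_run:
                    self.wv.vocab[word].count += v  # update the count in the existing vocabulary
            else:  # the word is not in the previous vocabulary (a new word)
                new_words.append(word)
                new_total += v
                if not dry_run:
                    # build a Vocab instance for the word
                    self.wv.vocab[word] = Vocab(count=v, index=len(self.wv.index2word))
                    self.wv.index2word.append(word)  # the word's position here is its index
        else:  # does not meet the condition, so drop it
            drop_unique += 1
            drop_total += v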

Computing the sampling threshold:

# precompute the sampling threshold for each vocabulary item
if not sample:
    # no words downsampled: the threshold equals the total retained word count
    threshold_count = retain_total
elif sample < 1.0:
    # traditional meaning: set parameter as proportion of total
    threshold_count = sample * retain_total
else:
    # new shorthand: sample >= 1 means downsample all words with higher count than sample
    threshold_count = int(sample * (3 + sqrt(5)) / 2)

downsample_total, downsample_unique = 0, 0
for w in retain_words:
    v = self.raw_vocab[w]  # v is the number of occurrences of the current word
    word_probability = (sqrt(v / threshold_count) + 1) * (threshold_count / v)
    if word_probability < 1.0:
        downsample_unique += 1
        downsample_total += word_probability * v
    else:  # when sample is not set, word_probability is always > 1
        word_probability = 1.0
        downsample_total += v
    if not dry_run:
        # store the keep probability as an integer scaled by 2**32; round() rounds the float to the nearest integer
        self.wv.vocab[w].sample_int = int(round(word_probability * 2**32))
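A worked example helps show what the formula does. The sketch below copies the threshold and probability arithmetic from the excerpt into a hypothetical standalone function, `keep_probability`; the corpus size and counts are made-up numbers for illustration.

```python
from math import sqrt


def keep_probability(count, retain_total, sample=1e-3):
    """Probability of keeping one occurrence of a word that appears `count` times."""
    if not sample:
        threshold_count = retain_total
    elif sample < 1.0:
        threshold_count = sample * retain_total
    else:
        threshold_count = int(sample * (3 + sqrt(5)) / 2)
    p = (sqrt(count / threshold_count) + 1) * (threshold_count / count)
    return min(p, 1.0)  # probabilities above 1 are clipped to 1, as in the excerpt


retain_total = 1_000_000            # hypothetical corpus size (retained words)
p_frequent = keep_probability(10_000, retain_total)   # word far above the threshold of 1000
p_rare = keep_probability(100, retain_total)          # word below the threshold
p_no_sample = keep_probability(10_000, retain_total, sample=0)
sample_int = int(round(p_frequent * 2**32))           # integer form stored on the Vocab entry
```

With sample=1e-3 the threshold is 1000 occurrences. A word seen 10,000 times is kept with probability (sqrt(10) + 1) * 0.1, roughly 0.416, while a word seen 100 times is always kept. With sample unset, the threshold equals the whole corpus, so no word is downsampled.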

finalize_vocab: builds the tables and model weights based on the final vocabulary settings.

Code walkthrough (with omissions):

if not self.wv.index2word:
    self.scale_vocab()
if self.sorted_vocab and not update:
    self.sort_vocab()  # sort by descending frequency so that more frequent words get smaller indices
if self.hs:
    # add Huffman coding information for each word
    self.create_binary_tree()
if self.negative:
    # negative sampling: build the cumulative table
    self.make_cum_table()
if self.null_word:
    # create null pseudo-word for padding when using concatenative L1 (run-of-words)
    # this word is only ever input, never predicted, so count, huffman-point, etc doesn't matter
    word, v = '\0', Vocab(count=1, sample_int=0)
    v.index = len(self.wv.vocab)
    self.wv.index2word.append(word)
    self.wv.vocab[word] = v
# set initial input/projection and hidden weights
if not update:  # if not updating with new words, reset the weight matrices
    self.reset_weights()
else:
    self.update_weights()
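The sort_vocab step mentioned in the excerpt is not shown in this post. A minimal sketch of what "sort by descending frequency and reassign indices" could look like follows; `sort_by_frequency` is a hypothetical helper and plain dicts again stand in for Vocab instances, so this illustrates the idea rather than reproducing gensim's implementation.

```python
def sort_by_frequency(vocab, index2word):
    """Sort index2word by descending count, then rewrite each word's index in place."""
    index2word.sort(key=lambda word: vocab[word]["count"], reverse=True)
    for i, word in enumerate(index2word):
        vocab[word]["index"] = i


vocab = {
    "cat": {"count": 2, "index": 0},
    "the": {"count": 5, "index": 1},
    "dog": {"count": 1, "index": 2},
}
index2word = ["cat", "the", "dog"]
sort_by_frequency(vocab, index2word)
```

After the call, the most frequent word sits at index 0, which keeps the rows of the weight matrices for frequent words close together.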

As finalize_vocab shows, Hierarchical Softmax and negative sampling each have their own table-building method: create_binary_tree and make_cum_table respectively.

create_binary_tree

For Hierarchical Softmax, this builds a binary Huffman tree from the stored vocabulary words and their frequencies. More frequent words get shorter codes.

# build the huffman tree
heap = list(itervalues(self.wv.vocab))  # the dict's values as a list; each value is a Vocab instance
heapq.heapify(heap)
for i in xrange(len(self.wv.vocab) - 1):  # create the inner nodes
    min1, min2 = heapq.heappop(heap), heapq.heappop(heap)  # pop the two smallest nodes
    # push the parent of the two smallest nodes: its index continues after the vocabulary size,
    # its count is the sum of the two children's counts, and left/right point to the children
    heapq.heappush(
        heap, Vocab(count=min1.count + min2.count, index=i + len(self.wv.vocab), left=min1, right=min2)
    )  # in the end only the root node remains in the heap

# recurse over the tree, assigning a binary code to each vocabulary word
# and recording the inner nodes passed on the path to each node
if heap:
    max_depth, stack = 0, [(heap[0], [], [])]  # a maximum depth and a stack that starts with the root
    while stack:
        node, codes, points = stack.pop()
        # node is a Vocab instance (one tree node), codes is the node's code so far,
        # points holds the inner nodes passed on the way to this node
        if node.index < len(self.wv.vocab):
            # an index below the vocabulary size means the node is a word, i.e. a leaf
            # leaf node => store its path from the root
            node.code, node.point = codes, points
            max_depth = max(len(codes), max_depth)
        else:
            # inner node => continue recursion
            # record this inner node on the path
            points = array(list(points) + [node.index - len(self.wv.vocab)], dtype=uint32)
            # push the left and right children onto the stack
            stack.append((node.left, array(list(codes) + [0], dtype=uint8), points))
            stack.append((node.right, array(list(codes) + [1], dtype=uint8), points))
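Once the vocabulary is built, each word corresponds to an instance of the Vocab class. Once the Huffman tree is built, each inner node of the binary tree is also a Vocab instance, whose left and right attributes hold its left and right children. For each word (a leaf node), point stores the path from the root to that node, given as the indices of the inner nodes passed through, and code stores the node's binary code.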
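The tree construction can be reproduced on a toy vocabulary with only the standard library. This is a sketch under stated assumptions: `Node` and `build_huffman_codes` are hypothetical names, plain lists replace the NumPy arrays, and nodes are ordered by count alone.

```python
import heapq


class Node:
    """Minimal stand-in for a Vocab entry; heapq orders nodes by count."""

    def __init__(self, count, index, left=None, right=None):
        self.count, self.index = count, index
        self.left, self.right = left, right
        self.code, self.point = [], []

    def __lt__(self, other):
        return self.count < other.count


def build_huffman_codes(counts):
    """counts: dict word -> frequency. Returns dict word -> leaf Node with code/point set."""
    leaves = {word: Node(c, i) for i, (word, c) in enumerate(counts.items())}
    n = len(leaves)
    heap = list(leaves.values())
    heapq.heapify(heap)
    for i in range(n - 1):
        # merge the two least frequent nodes under a new inner node
        min1, min2 = heapq.heappop(heap), heapq.heappop(heap)
        heapq.heappush(heap, Node(min1.count + min2.count, i + n, left=min1, right=min2))
    stack = [(heap[0], [], [])] if heap else []
    while stack:
        node, codes, points = stack.pop()
        if node.index < n:  # leaf: store its code and the inner nodes on its path
            node.code, node.point = codes, points
        else:  # inner node: extend the path and descend into both children
            points = points + [node.index - n]
            stack.append((node.left, codes + [0], points))
            stack.append((node.right, codes + [1], points))
    return leaves


leaves = build_huffman_codes({"the": 5, "cat": 2, "sat": 2, "dog": 1})
code_lengths = {word: len(node.code) for word, node in leaves.items()}
```

In this example the most frequent word, "the", ends up directly under the root with a one-bit code, and the rarest word, "dog", gets a three-bit code. Every path starts at the root, whose inner-node index is n - 2 because it is the last inner node created.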

reset_weights

Resets the model weights: the word vectors and the hidden-layer weights.

# syn0 is the word-vector matrix
# one row per word, one column per vector dimension; empty() creates an array without initializing its values
self.wv.syn0 = empty((len(self.wv.vocab), self.vector_size), dtype=REAL)
# initialize a vector for each word separately, rather than materializing a huge random matrix in RAM at once
for i in xrange(len(self.wv.vocab)):  # for every word in the vocabulary
    # initialize the word vector
    self.wv.syn0[i] = self.seeded_vector(self.wv.index2word[i] + str(self.seed))
if self.hs:
    # syn1 is the matrix of inner-node vectors of the binary tree, initialized to all zeros
    self.syn1 = zeros((len(self.wv.vocab), self.layer1_size), dtype=REAL)
if self.negative:
    self.syn1neg = zeros((len(self.wv.vocab), self.layer1_size), dtype=REAL)
self.wv.syn0norm = None

self.syn0_lockf = ones(len(self.wv.vocab), dtype=REAL)  # zeros suppress learning
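The idea of seeding each word vector from the word itself plus the model seed can be sketched without NumPy. Everything here is an assumption made for illustration: `seeded_vector` and `init_weights` are hypothetical functions, `zlib.crc32` stands in for the hash function, and the scaling of the random values to a small range around zero is a choice of this sketch, not a quotation of gensim's code.

```python
import random
import zlib


def seeded_vector(seed_string, vector_size):
    """Deterministic small random vector derived from a string seed."""
    rng = random.Random(zlib.crc32(seed_string.encode("utf-8")) & 0xFFFFFFFF)
    return [(rng.random() - 0.5) / vector_size for _ in range(vector_size)]


def init_weights(index2word, vector_size, seed=1, hs=True):
    """Return (syn0, syn1): random word vectors and all-zero inner-node vectors."""
    syn0 = [seeded_vector(word + str(seed), vector_size) for word in index2word]
    syn1 = [[0.0] * vector_size for _ in index2word] if hs else None
    return syn0, syn1


index2word = ["the", "cat", "sat"]
syn0, syn1 = init_weights(index2word, vector_size=4)
syn0_again, _ = init_weights(index2word, vector_size=4)
```

Because each row depends only on the word and the seed, rebuilding the weights for the same vocabulary gives the same starting vectors, and a word keeps its initial vector even if its position in the vocabulary changes.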

This completes the construction of the vocabulary.
