
LDA Topic Mining in Python (Part 2): Training an LDA Model with gensim

As of March 7, 2018, all three articles in this series have been written; I may add new content here later.

This is the second post in my LDA topic mining series. It covers how to train a model on your own preprocessed corpus with the classes gensim provides.
gensim offers several options:

The slower, single-process version:

from gensim.models.ldamodel import LdaModel
# train a model on the preprocessed corpus
lda = LdaModel(corpus, num_topics=10)
# infer the topic distribution of a new document (doc_bow is a bag-of-words vector)
doc_lda = lda[doc_bow]
# update the model with additional documents
lda.update(other_corpus)

The faster, multicore version:

>>> from gensim.models import LdaMulticore
>>> lda = LdaMulticore(corpus, id2word=id2word, num_topics=100)  # train model
>>> print(lda[doc_bow]) # get topic probability distribution for a document
>>> lda.update(corpus2) # update the LDA model with additional documents
>>> print(lda[doc_bow])

The speed-up from using multiple processes:

Wall-clock performance on the English Wikipedia (2G corpus positions, 3.5M documents, 100K features, 0.54G non-zero entries in the final bag-of-words matrix), requesting 100 topics:
(Measured on this i7 server with 4 physical cores, so that optimal workers=3, one less than the number of cores.)

algorithm                                           training time
LdaMulticore(workers=1)                             2h30m
LdaMulticore(workers=2)                             1h24m
LdaMulticore(workers=3)                             1h6m
old LdaModel (single core)                          3h44m
simply iterating over input corpus (I/O overhead)   20m

Set workers to one less than the number of physical CPU cores on your machine, as the benchmark above suggests (see the sketch below for one way to pick a value).
The code in this post uses the multicore version.
Questions are welcome in the comments.
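As a rough aid for choosing that value, here is a minimal sketch (my own addition, not from gensim) using only the standard library. Note that multiprocessing.cpu_count() reports logical cores, which on hyper-threaded CPUs is usually double the physical count:

import multiprocessing

# cpu_count() counts *logical* cores; halving it is a crude stand-in for the
# physical core count on hyper-threaded machines, so treat this as a default only
logical_cores = multiprocessing.cpu_count()
workers = max(1, logical_cores // 2 - 1) # roughly "physical cores - 1"
print('using %s worker processes' % workers)
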
After converting the documents into a corpus, this post performs the following operation:

tfidf = models.TfidfModel(corpus)  # fit IDF statistics on the corpus
corpusTfidf = tfidf[corpus]        # re-weight every document by TF-IDF

This step rescales word weights across the corpus, lowering the weight of high-frequency words that appear in nearly every document. For the underlying theory, see Ruan Yifeng's blog post "TF-IDF與餘弦相似性的應用(一):自動提取關鍵詞" (Applications of TF-IDF and Cosine Similarity, Part 1: Automatic Keyword Extraction). In my experiments this step brought no obvious improvement and cost a fair amount of extra time, so I do not recommend it; you can pass corpus straight into the LDA model as training data instead.
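To see the effect concretely, here is a small self-contained sketch with toy data of my own (not this post's corpus), showing how a word that occurs in every document is driven to zero weight and dropped:

from gensim import corpora, models

# 'cat' appears in every document, so its IDF, and hence its TF-IDF weight, is zero
docs = [['cat', 'fish'], ['cat', 'dog'], ['cat', 'bird']]
dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]

tfidf = models.TfidfModel(corpus)
for doc in tfidf[corpus]:
    print([(dictionary[wid], round(weight, 3)) for wid, weight in doc])
# gensim drops zero-weight terms, so 'cat' vanishes from every vector and only
# the rarer word of each document remains, normalized to weight 1.0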

#-*-coding:utf-8-*-
import sys
reload(sys) # Python 2 only: needed to reset the default encoding below
sys.setdefaultencoding('utf-8')
import os
import codecs
from gensim.corpora import Dictionary
from gensim import corpora, models
from datetime import datetime
import platform
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

platform_info = platform.platform().lower()
if 'windows' in platform_info:
    code = 'gbk'
elif 'linux' in platform_info:
    code = 'utf-8'
else:
    code = 'utf-8' # fall back to utf-8 on other platforms
path = sys.path[0]

class GLDA(object):
    """docstring for GdeltGLDA"""

    def __init__(self, stopfile=None):
        super(GLDA, self).__init__()
        if stopfile:
            with codecs.open(stopfile, 'r', code) as f:
                self.stopword_list = f.read().split(' ')
            print ('the num of stopwords is : %s'%len(self.stopword_list))
        else:
            self.stopword_list = None

    def lda_train(self, num_topics, datafolder, middatafolder, dictionary_path=None, corpus_path=None, iterations=5000, passes=1, workers=3):       
        time1 = datetime.now()
        num_docs = 0
        doclist = []
        if not corpus_path or not dictionary_path: # without a saved dictionary or corpus, read the preprocessed docword files; the first run always needs this, while later tuning runs can pass the saved paths directly
            for filename in os.listdir(datafolder): # read every corpus file under datafolder
                with codecs.open(datafolder+filename, 'r', code) as source_file:
                    for line in source_file:
                        num_docs += 1
                        if num_docs%100000==0:
                            print ('%s, %s'%(filename, num_docs))
                        words = line.strip().split(' ') # strip the trailing newline before splitting
                        # optional stop-word filtering:
                        # words = [w for w in words if w not in self.stopword_list]
                        doclist.append(words)
                print ('%s, %s'%(filename, num_docs))
        if dictionary_path:
            dictionary = corpora.Dictionary.load(dictionary_path) # load a previously built dictionary
        else:
            # build the word-id mapping from the documents and save it
            dictionary = corpora.Dictionary(doclist)
            dictionary.save(middatafolder + 'dictionary.dictionary')
        if corpus_path:
            corpus = corpora.MmCorpus(corpus_path) # load a previously built corpus
        else:
            corpus = [dictionary.doc2bow(doc) for doc in doclist]
            corpora.MmCorpus.serialize(middatafolder + 'corpus.mm', corpus) # save the corpus
        tfidf = models.TfidfModel(corpus) # TF-IDF re-weighting; optional, see the discussion above
        corpusTfidf = tfidf[corpus]
        time2 = datetime.now()
        lda_multi = models.ldamulticore.LdaMulticore(corpus=corpusTfidf, id2word=dictionary, num_topics=num_topics, \
            iterations=iterations, workers=workers, batch=True, passes=passes) # start training
        lda_multi.print_topics(num_topics, 30) # log the topic-word matrix (top 30 words per topic)
        print ('lda training time cost is : %s, all time cost is : %s '%(datetime.now()-time2, datetime.now()-time1))
        # saving / loading the model
        lda_multi.save(middatafolder + 'lda_tfidf_%s_%s.model'%(num_topics, iterations)) # save the model
        # lda = models.ldamodel.LdaModel.load('zhwiki_lda.model') # load a saved model
        # save the most probable topic id of every document
        with codecs.open(middatafolder + 'topic.json', 'w', 'utf-8') as topic_id_file:
            for doc in corpusTfidf:
                topic_id = max(lda_multi[doc], key=lambda x: x[1])[0] # take the highest-probability topic, not just the first entry returned
                topic_id_file.write(str(topic_id) + ' ')

if __name__ == '__main__':
    datafolder = path + os.sep + 'docword' + os.sep # folder holding the preprocessed corpus; every file in it will be read
    middatafolder = path + os.sep + 'middata' + os.sep
    dictionary_path = middatafolder + 'dictionary.dictionary' # path of a previously built dictionary; set to False if there is none
    corpus_path = middatafolder + 'corpus.mm' # path of a previously built corpus; set to False if there is none
    # stopfile = path + os.sep + 'rest_stopwords.txt' # extra stop-word file
    num_topics = 50
    passes = 2 # number of full passes over the whole corpus; larger values mean more parameter updates and longer training
    iterations = 6000
    workers = 3 # number of extra worker processes
    lda = GLDA()
    lda.lda_train(num_topics, datafolder, middatafolder, dictionary_path=dictionary_path, corpus_path=corpus_path, iterations=iterations, passes=passes, workers=workers)
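
For completeness, here is a minimal sketch of loading the saved artifacts afterwards and inferring the topic distribution of a new document. The model filename assumes the num_topics=50, iterations=6000 settings above; adjust it to whatever your run actually saved:

from gensim import corpora, models

dictionary = corpora.Dictionary.load('middata/dictionary.dictionary')
lda = models.ldamulticore.LdaMulticore.load('middata/lda_tfidf_50_6000.model')

new_doc = ['word1', 'word2', 'word3'] # an already-tokenized document
bow = dictionary.doc2bow(new_doc)     # convert to a bag-of-words vector
print(lda[bow])                       # [(topic_id, probability), ...]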

That's all; comments and corrections are welcome.