A First Try at the LDA Topic Model, Using Python's gensim Package

阿新 • Published: 2017-07-07


http://blog.csdn.net/a_step_further/article/details/51176959

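LDA is a topic model widely used in text mining. Given a large collection of documents, it extracts the keywords that best characterize each topic; for the underlying algorithm, see the related articles on KM. For work I needed to extract topics from the messages posted by a number of Tencent Weibo accounts, so I gave the algorithm a try and put together a simple analysis with Python's gensim package.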

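Install jieba, the Chinese word segmentation module for Python.
Install gensim, the topic modeling module for Python (official site: https://radimrehurek.com/gensim/). Installing it pulls in a long list of other packages, which have to be installed patiently one at a time.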
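
Before running the script, it can help to confirm that both modules are actually visible to the interpreter. The snippet below is a minimal sketch of such a check; the helper name `missing_modules` is made up for illustration and is not part of jieba or gensim. On a current setup, `pip install jieba gensim` will normally resolve the dependencies by itself, so the one-by-one installation described above may no longer be necessary.

```python
import importlib.util


def missing_modules(names):
    """Return the names in `names` that this interpreter cannot import.

    Hypothetical helper for illustration; it only looks the modules up,
    it does not import them.
    """
    return [name for name in names if importlib.util.find_spec(name) is None]


# jieba and gensim are the two packages the post relies on.
missing = missing_modules(["jieba", "gensim"])
if missing:
    print("Not installed yet: " + ", ".join(missing))
else:
    print("jieba and gensim are both importable")
```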

[python]

#!/usr/bin/python
#coding:utf-8
# Python 2 script (uses reload(sys) and the print statement)
import sys
reload(sys)
sys.setdefaultencoding("utf8")
import jieba
from gensim import corpora, models

def get_stop_words_set(file_name):
    # One stop word per line
    with open(file_name, 'r') as f:
        return set([line.strip() for line in f])

def get_words_list(file_name, stop_word_file):
    stop_words_set = get_stop_words_set(stop_word_file)
    print "Loaded %d stop words" % len(stop_words_set)
    word_list = []
    with open(file_name, 'r') as f:
        for line in f:
            tmp_list = list(jieba.cut(line.strip(), cut_all=False))
            # Note: term is unicode here; without the str() conversion the membership test comes out False
            word_list.append([term for term in tmp_list if str(term) not in stop_words_set])
    return word_list

if __name__ == '__main__':
    if len(sys.argv) < 3:
        print "Usage: %s <raw_msg_file> <stop_word_file>" % sys.argv[0]
        sys.exit(1)
    raw_msg_file = sys.argv[1]
    stop_word_file = sys.argv[2]
    # A list whose elements are themselves lists: the words produced by segmenting each line
    word_list = get_words_list(raw_msg_file, stop_word_file)
    # Build the dictionary for the documents; each word maps to an integer index
    word_dict = corpora.Dictionary(word_list)
    # Count word frequencies and convert each document to a sparse vector
    corpus_list = [word_dict.doc2bow(text) for text in word_list]
    lda = models.ldamodel.LdaModel(corpus=corpus_list, id2word=word_dict, num_topics=10, alpha='auto')
    output_file = './lda_output.txt'
    with open(output_file, 'w') as f:
        for pattern in lda.show_topics():
            print >> f, "%s" % str(pattern)
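
The two preprocessing steps in the script, building the dictionary and converting each document to word counts, can be illustrated without gensim. The following is a rough pure-Python sketch of the idea, not gensim's implementation: `build_vocab` and `to_bow` are names invented here, the sample documents are made up, and the integer ids gensim assigns to words may differ from the first-appearance order used below.

```python
from collections import Counter


def build_vocab(docs):
    """Map each distinct token to an integer id, in order of first appearance."""
    vocab = {}
    for doc in docs:
        for token in doc:
            if token not in vocab:
                vocab[token] = len(vocab)
    return vocab


def to_bow(doc, vocab):
    """Convert one token list to sorted (token_id, count) pairs.

    Tokens missing from the vocabulary are skipped.
    """
    counts = Counter(vocab[token] for token in doc if token in vocab)
    return sorted(counts.items())


# Made-up, already segmented documents standing in for word_list.
docs = [["weather", "rain", "rain"], ["rain", "umbrella"]]
vocab = build_vocab(docs)
corpus = [to_bow(doc, vocab) for doc in docs]

print(vocab)
print(corpus)
```

Each document ends up as a short list of (id, count) pairs rather than a full-length vector, which is the sparse format that `LdaModel` takes as its `corpus` argument.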

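Some further learning material: https://yq.aliyun.com/articles/26029, "[python] LDA處理文檔主題分布代碼入門筆記" (introductory notes on code for computing document-topic distributions with LDA).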
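
The script in this post targets Python 2. For readers on Python 3, the same pipeline might look roughly like the sketch below. This is an adaptation rather than the original author's code: the function names are made up, files are assumed to be UTF-8, and jieba and gensim are imported inside the functions so that the stop word filtering can be tried without them. In Python 3 the tokens from jieba and the stop words read from the file are both `str`, so the `str(term)` workaround is no longer needed.

```python
def load_stop_words(path):
    """Read one stop word per line from a file assumed to be UTF-8."""
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f}


def tokenize_lines(lines, stop_words, cut=None):
    """Segment each line and drop stop words.

    `cut` is any callable returning an iterable of tokens; it defaults to
    jieba.cut, imported lazily so this function works without jieba when
    another tokenizer is passed in.
    """
    if cut is None:
        import jieba
        cut = jieba.cut
    return [[t for t in cut(line.strip()) if t not in stop_words] for line in lines]


def train_and_dump(docs, output_path, num_topics=10):
    """Fit an LDA model on tokenized documents and write its topics to a file."""
    from gensim import corpora, models

    dictionary = corpora.Dictionary(docs)               # word <-> integer id
    corpus = [dictionary.doc2bow(doc) for doc in docs]  # sparse word counts
    lda = models.LdaModel(corpus=corpus, id2word=dictionary,
                          num_topics=num_topics, alpha="auto")
    with open(output_path, "w", encoding="utf-8") as f:
        for topic in lda.show_topics():
            print(topic, file=f)
```

A typical call would read the message file line by line, pass the lines and the result of `load_stop_words` to `tokenize_lines`, and hand the resulting list to `train_and_dump` together with an output path such as `./lda_output.txt`.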
