
王小草 [Machine Learning] Notes -- The LDA Topic Model in Practice

Tags (space-separated): 王小草機器學習筆記

Notes compiled on: December 30, 2016
Compiled by: 王小草

1. Tools for Implementing LDA

If the long formulas and derivations in the LDA theory post felt tiresome, and you would rather not implement the model from scratch yourself, you can simply use one of the existing open-source, well-packaged libraries.

Scikit-learn:
sklearn.decomposition.LatentDirichletAllocation
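
The walkthroughs below use gensim and the standalone lda package; for completeness, here is a minimal scikit-learn sketch (my addition, not part of the original course material; note that newer scikit-learn releases name the topic-count parameter n_components, while older releases used n_topics):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["human machine interface for lab abc computer applications",
        "a survey of user opinion of computer system response time"]

# LDA works on term counts, so vectorize the raw text first
vec = CountVectorizer(stop_words='english')
X = vec.fit_transform(docs)

lda_sk = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topic = lda_sk.fit_transform(X)   # document-topic distribution
print doc_topic
print lda_sk.components_              # unnormalized topic-word weights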

2. LDA Implementation Examples

2.1 Implementing LDA with gensim – a small-sample test

Environment setup:
Install gensim:

pip install gensim

Once the download and installation finish, pip reports that gensim has been installed successfully.

Data:
First, let's take a look at the data.
The corpus contains 9 documents, one per line, stored in a text file named 16.LDA_test.txt.

Human machine interface for lab abc computer applications
A survey of user opinion of computer system response time
The EPS user interface management system
System and human system engineering testing of EPS
Relation of user perceived response time to error measurement
The generation of random binary unordered trees
The intersection graph of paths in trees
Graph minors IV Widths of trees and well quasi ordering
Graph minors A survey

Code:
(1) First, import what we need and read the file in:

from gensim import corpora, models, similarities
from pprint import pprint

f = open('/home/cc/下載/深度學習筆記/主題模型/16(1)/16.LDA_test.txt')

(2) Then split each document into words and remove stop words:

stop_list = set('for a of the and to in'.split())
texts = [[word for word in line.strip().lower().split() if word not in stop_list] for line in f]
print 'Text = '
pprint(texts)

The printed result:

Text = 
[['human', 'machine', 'interface', 'lab', 'abc', 'computer', 'applications'],
 ['survey', 'user', 'opinion', 'computer', 'system', 'response', 'time'],
 ['eps', 'user', 'interface', 'management', 'system'],
 ['system', 'human', 'system', 'engineering', 'testing', 'eps'],
 ['relation', 'user', 'perceived', 'response', 'time', 'error', 'measurement'],
 ['generation', 'random', 'binary', 'unordered', 'trees'],
 ['intersection', 'graph', 'paths', 'trees'],
 ['graph', 'minors', 'iv', 'widths', 'trees', 'well', 'quasi', 'ordering'],
 ['graph', 'minors', 'survey']]

(3) Build the dictionary:

dictionary = corpora.Dictionary(texts)
print dictionary

V = len(dictionary)  # size of the dictionary

Printing the dictionary shows 35 unique tokens in total:

Dictionary(35 unique tokens: [u'minors', u'generation', u'testing', u'iv', u'engineering']...)
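
To see how each token is mapped to the integer id used below, you can inspect the dictionary's token2id attribute. A small sketch (my addition, assuming the dictionary built above is still in scope):

# token2id maps each token to the integer id that doc2bow uses below
for token, token_id in sorted(dictionary.token2id.items(), key=lambda kv: kv[1])[:5]:
    print token_id, token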

(4) Compute TF-IDF values for each document:

# Use the dictionary to convert each document into (token id, count) pairs
corpus = [dictionary.doc2bow(text) for text in texts]
# print one document per line
for line in corpus:
    print line

After conversion each line is still one document, but the original words are now (token id, count) pairs, where the id comes from the dictionary's token-to-id mapping. The printed result:

[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1)]
[(4, 1), (7, 1), (8, 1), (9, 1), (10, 1), (11, 1), (12, 1)]
[(6, 1), (7, 1), (9, 1), (13, 1), (14, 1)]
[(5, 1), (7, 2), (14, 1), (15, 1), (16, 1)]
[(9, 1), (10, 1), (12, 1), (17, 1), (18, 1), (19, 1), (20, 1)]
[(21, 1), (22, 1), (23, 1), (24, 1), (25, 1)]
[(25, 1), (26, 1), (27, 1), (28, 1)]
[(25, 1), (26, 1), (29, 1), (30, 1), (31, 1), (32, 1), (33, 1), (34, 1)]
[(8, 1), (26, 1), (29, 1)]

Now compute the tf-idf value of every word in each document:

corpus_tfidf = models.TfidfModel(corpus)[corpus]

# print one document per line
print 'TF-IDF:'
for c in corpus_tfidf:
    print c

Each line is still one document, but the counts of 1 from the previous step have been replaced by each token's tf-idf value.

TF-IDF:
[(0, 0.4301019571350565), (1, 0.4301019571350565), (2, 0.4301019571350565), (3, 0.4301019571350565), (4, 0.2944198962221451), (5, 0.2944198962221451), (6, 0.2944198962221451)]
[(4, 0.3726494271826947), (7, 0.27219160459794917), (8, 0.3726494271826947), (9, 0.27219160459794917), (10, 0.3726494271826947), (11, 0.5443832091958983), (12, 0.3726494271826947)]
[(6, 0.438482464916089), (7, 0.32027755044706185), (9, 0.32027755044706185), (13, 0.6405551008941237), (14, 0.438482464916089)]
[(5, 0.3449874408519962), (7, 0.5039733231394895), (14, 0.3449874408519962), (15, 0.5039733231394895), (16, 0.5039733231394895)]
[(9, 0.21953536176370683), (10, 0.30055933182961736), (12, 0.30055933182961736), (17, 0.43907072352741366), (18, 0.43907072352741366), (19, 0.43907072352741366), (20, 0.43907072352741366)]
[(21, 0.48507125007266594), (22, 0.48507125007266594), (23, 0.48507125007266594), (24, 0.48507125007266594), (25, 0.24253562503633297)]
[(25, 0.31622776601683794), (26, 0.31622776601683794), (27, 0.6324555320336759), (28, 0.6324555320336759)]
[(25, 0.20466057569885868), (26, 0.20466057569885868), (29, 0.2801947048062438), (30, 0.40932115139771735), (31, 0.40932115139771735), (32, 0.40932115139771735), (33, 0.40932115139771735), (34, 0.40932115139771735)]
[(8, 0.6282580468670046), (26, 0.45889394536615247), (29, 0.6282580468670046)]
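
As a sanity check (my addition, assuming gensim's default TfidfModel settings of idf = log2(N / df) with L2 normalization per document), the first row can be reproduced by hand: document 0 contains four words that occur in only one of the nine documents and three words that occur in two of them.

import math

idf_rare = math.log(9.0 / 1, 2)    # idf of words 0-3 (each appears in 1 document)
idf_common = math.log(9.0 / 2, 2)  # idf of words 4-6 (each appears in 2 documents)
norm = math.sqrt(4 * idf_rare ** 2 + 3 * idf_common ** 2)
print idf_rare / norm, idf_common / norm   # ~0.4301 and ~0.2944, as in the first row above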

(5) Apply the LDA model
The previous four steps were essentially feature preparation: here we use each document's tf-idf vector as the feature input to the LDA model.

Train the model:

print '\nLDA Model:'
# set the number of topics
num_topics = 2
# train the model
lda = models.LdaModel(corpus_tfidf, num_topics=num_topics, id2word=dictionary,
                      alpha='auto', eta='auto', minimum_probability=0.001)

Print the probability that each document belongs to each topic:

doc_topic = [a for a in lda[corpus_tfidf]]
print 'Document-Topic:\n'
pprint(doc_topic)
LDA Model:
Document-Topic:

[[(0, 0.25865201763870671), (1, 0.7413479823612934)],
 [(0, 0.6704214035190138), (1, 0.32957859648098625)],
 [(0, 0.34722886288787302), (1, 0.65277113711212698)],
 [(0, 0.64268836524831052), (1, 0.35731163475168948)],
 [(0, 0.67316053818546506), (1, 0.32683946181453505)],
 [(0, 0.37897103968594514), (1, 0.62102896031405486)],
 [(0, 0.6244681672561716), (1, 0.37553183274382845)],
 [(0, 0.74840501728867792), (1, 0.25159498271132213)],
 [(0, 0.65364678163446832), (1, 0.34635321836553179)]]

Print the probability of each word within each topic.
Because minimum_probability=0.001 was passed when training the model, words with a probability below this threshold are not output.

for topic_id in range(num_topics):
    print 'Topic', topic_id
    pprint(lda.show_topic(topic_id))
Topic 0
[(u'system', 0.041635423550867606),
 (u'survey', 0.040429107770606001),
 (u'graph', 0.038913672197129358),
 (u'minors', 0.038613604352799001),
 (u'trees', 0.035093470419085344),
 (u'time', 0.034314182442026844),
 (u'user', 0.032712431543062859),
 (u'response', 0.032562733895067024),
 (u'eps', 0.032317332054789358),
 (u'intersection', 0.031074066863528784)]
Topic 1
[(u'interface', 0.038423961073724748),
 (u'system', 0.036616390857180062),
 (u'management', 0.03585869312482335),
 (u'graph', 0.034776623890248701),
 (u'user', 0.03448476247382859),
 (u'survey', 0.033892977987880241),
 (u'eps', 0.033683486487186061),
 (u'computer', 0.032741732328417393),
 (u'minors', 0.031949259380969104),
 (u'human', 0.03156868862825063)]

Compute the similarity between documents:
The similarity is computed over the documents' LDA topic distributions obtained from the tf-idf corpus.

similarity = similarities.MatrixSimilarity(lda[corpus_tfidf])
print 'Similarity:'
pprint(list(similarity))
Similarity:
[array([ 0.99999994,  0.71217406,  0.98829806,  0.74671113,  0.70895636,
        0.97756702,  0.76893044,  0.61318189,  0.73319417], dtype=float32),
 array([ 0.71217406,  1.        ,  0.81092042,  0.99872446,  0.99998957,
        0.8440569 ,  0.99642557,  0.99123365,  0.99953747], dtype=float32),
 array([ 0.98829806,  0.81092042,  1.        ,  0.83943164,  0.808236  ,
        0.99825525,  0.85745317,  0.72650033,  0.82834125], dtype=float32),
 array([ 0.74671113,  0.99872446,  0.83943164,  0.99999994,  0.99848306,
        0.87005669,  0.99941987,  0.98329824,  0.99979806], dtype=float32),
 array([ 0.70895636,  0.99998957,  0.808236  ,  0.99848306,  1.        ,
        0.84159577,  0.99602884,  0.99182749,  0.99938792], dtype=float32),
 array([ 0.97756702,  0.8440569 ,  0.99825525,  0.87005669,  0.84159577,
        0.99999994,  0.88634008,  0.76580745,  0.85997516], dtype=float32),
 array([ 0.76893044,  0.99642557,  0.85745317,  0.99941987,  0.99602884,
        0.88634008,  1.        ,  0.9765296 ,  0.99853373], dtype=float32),
 array([ 0.61318189,  0.99123365,  0.72650033,  0.98329824,  0.99182749,
        0.76580745,  0.9765296 ,  0.99999994,  0.9867571 ], dtype=float32),
 array([ 0.73319417,  0.99953747,  0.82834125,  0.99979806,  0.99938792,
        0.85997516,  0.99853373,  0.9867571 ,  1.        ], dtype=float32)]
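
Beyond printing the full matrix, the index can also be queried for a single document. A small sketch (my addition, reusing doc_topic and similarity from the steps above) that ranks the documents most similar to document 0:

# similarity of document 0's topic distribution against all nine documents
sims = similarity[doc_topic[0]]
# sort document indices by descending similarity
ranked = sorted(enumerate(sims), key=lambda item: -item[1])
print ranked[:3]   # document 0 itself should come first with similarity ~1.0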

2.2 Implementing LDA with gensim – a NetEase news test

The example above used a handful of small hand-made samples; now we use a relatively large corpus crawled from NetEase news.
The data format is the same as before: one document per line.

The complete code is given below, with comments and explanations annotated in the code:

# !/usr/bin/python
# -*- coding:utf-8 -*-

import numpy as np
from gensim import corpora, models, similarities
from pprint import pprint
import time

# import logging
# logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)


def load_stopword():
    '''
    Load the stop-word list
    :return: a list of stop words
    '''
    f_stop = open('/home/cc/下載/深度學習筆記/主題模型/16(1)/16.stopword.txt')
    sw = [line.strip() for line in f_stop]
    f_stop.close()
    return sw


if __name__ == '__main__':

    print '1.初始化停止詞列表 ------'
    # start time
    t_start = time.time()
    # load the stop-word list
    stop_words = load_stopword()

    print '2.開始讀入語料資料 ------ '
    # read in the corpus
    f = open('/home/cc/下載/深度學習筆記/主題模型/16(1)/16.news.dat')
    # tokenize each line and remove stop words
    texts = [[word for word in line.strip().lower().split() if word not in stop_words] for line in f]

    print '讀入語料資料完成,用時%.3f秒' % (time.time() - t_start)
    f.close()
    M = len(texts)
    print '文字數目:%d個' % M

    print '3.正在建立詞典 ------'
    # build the dictionary
    dictionary = corpora.Dictionary(texts)
    V = len(dictionary)

    print '4.正在計算文字向量 ------'
    # convert each document to (token id, count) pairs
    corpus = [dictionary.doc2bow(text) for text in texts]

    print '5.正在計算文件TF-IDF ------'
    t_start = time.time()
    # compute tf-idf values
    corpus_tfidf = models.TfidfModel(corpus)[corpus]
    print '建立文件TF-IDF完成,用時%.3f秒' % (time.time() - t_start)

    print '6.LDA模型擬合推斷 ------'
    # train the model
    num_topics = 30
    t_start = time.time()
    lda = models.LdaModel(corpus_tfidf, num_topics=num_topics, id2word=dictionary,
                            alpha=0.01, eta=0.01, minimum_probability=0.001,
                            update_every = 1, chunksize = 100, passes = 1)
    print 'LDA模型完成,訓練時間為\t%.3f秒' % (time.time() - t_start)

    # print the topic distributions of 10 randomly chosen documents
    num_show_topic = 10  # how many top topics to show per document
    print '7.結果:10個文件的主題分佈:--'
    doc_topics = lda.get_document_topics(corpus_tfidf)  # topic distributions of all documents
    idx = np.arange(M)
    np.random.shuffle(idx)
    idx = idx[:10]
    for i in idx:
        topic = np.array(doc_topics[i])
        topic_distribute = np.array(topic[:, 1])
        # print topic_distribute
        topic_idx = topic_distribute.argsort()[:-num_show_topic-1:-1]
        print ('第%d個文件的前%d個主題:' % (i, num_show_topic)), topic_idx
        print topic_distribute[topic_idx]

    num_show_term = 7   # how many words to show per topic
    print '8.結果:每個主題的詞分佈:--'
    for topic_id in range(num_topics):
        print '主題#%d:\t' % topic_id
        term_distribute_all = lda.get_topic_terms(topicid=topic_id)
        term_distribute = term_distribute_all[:num_show_term]
        term_distribute = np.array(term_distribute)
        term_id = term_distribute[:, 0].astype(np.int)
        print '詞:\t',
        for t in term_id:
            print dictionary.id2token[t],
        print '\n概率:\t', term_distribute[:, 1]
The output is as follows:
1.初始化停止詞列表 ------
2.開始讀入語料資料 ------ 
讀入語料資料完成,用時13.080秒
文字數目:2043個
3.正在建立詞典 ------
4.正在計算文字向量 ------
5.正在計算文件TF-IDF ------
建立文件TF-IDF完成,用時0.167秒
6.LDA模型擬合推斷 ------
LDA模型完成,訓練時間為   15.519秒
7.結果:10個文件的主題分佈:--
第254個文件的前10個主題: [ 1 25 19 24 10 16  5 18  6 13]
[ 0.44393518  0.21888814  0.07659828  0.04933759  0.03008161  0.02684303
  0.02502133  0.02427699  0.01859238  0.0161121 ]
第965個文件的前10個主題: [16 25 19 15 22 24 21  0  3 11]
[ 0.25308559  0.19305099  0.13224413  0.06946441  0.05560054  0.03552766
  0.03370245  0.03150339  0.03144675  0.03016056]
第1333個文件的前10個主題: [22  3 16 13  6 10 12 24 20 18]
[ 0.25266869  0.2490401   0.07103551  0.04863435  0.04555946  0.04091868
  0.03917051  0.03540118  0.02700782  0.02312102]
第1506個文件的前10個主題: [22 25 19 27 15  5 11 10 17 24]
[ 0.23239891  0.18421203  0.11338873  0.08901366  0.06068372  0.03291609
  0.02791245  0.02776886  0.02628468  0.02576567]
第1688個文件的前10個主題: [20 21  4 10 13 15  8  2 18 12]
[ 0.31712662  0.17080875  0.05704131  0.05687532  0.04982736  0.04717193
  0.03782856  0.03099684  0.02884255  0.02425099]
第235個文件的前10個主題: [ 4 25 27  6 22 17 19 24 23 10]
[ 0.44620112  0.12010051  0.09887277  0.06796208  0.05403353  0.04259883
  0.03434136  0.02677677  0.01700118  0.01343408]
第1024個文件的前10個主題: [24 18  6 17 26  3 22 20 16  5]
[ 0.48215461  0.0692606   0.04261769  0.04186652  0.03632035  0.03460434
  0.03403761  0.02318917  0.02228433  0.02180922]
第110個文件的前10個主題: [19 17 16  1 12 15 18  3 14  4]
[ 0.49900829  0.15936266  0.09035218  0.03383237  0.03352359  0.03146543
  0.03134898  0.02021716  0.01569431  0.01184131]
第734個文件的前10個主題: [25  4  1 19 20  6 14 17 21  8]
[ 0.45159054  0.06032324  0.05592239  0.05458899  0.05271094  0.04798047
  0.03560804  0.03439629  0.03002685  0.02939322]
第210個文件的前10個主題: [10 25  6 19  3 24 23 18 27 17]
[ 0.47695906  0.11626224  0.10508662  0.06398151  0.04230729  0.03933725
  0.03031681  0.02468957  0.01852041  0.01462922]
8.結果:每個主題的詞分佈:--
主題#0:   
詞:  村民 男生 歐盟 巴西 老人 文化 雲南 
概率: [ 0.0072346   0.00541034  0.00524958  0.00447916  0.00360527  0.00354822
  0.0034284 ]
主題#1:   
詞:  王某 劉某 塞夫 張某 參議院 勞動黨 核試驗 
概率: [ 0.00618227  0.00581704  0.00565214  0.00555373  0.00479161  0.00463324
  0.00461892]
主題#2:   
詞:  網民 城 加快 現任 明星 鏡頭 人生 
概率: [ 0.00370845  0.00354329  0.00332598  0.00316524  0.00291984  0.00277072
  0.00233122]
主題#3:   
詞:  女士 李 廣東省 俄羅斯 普京 睪丸 行動 
概率: [ 0.00804048  0.0076339   0.00539693  0.00406451  0.00405289  0.00392766
  0.00386771]
主題#4:   
詞:  李某 公民 超級 工資 大規模 小杰 合理 
概率: [ 0.00964112  0.00334459  0.00301747  0.00277193  0.00274856  0.00260549
  0.00246133]
主題#5:   
詞:  暴力 嫖娼 大媽 昌平 雷某 療店 執法 
概率: [ 0.00771442  0.00567275  0.00558628  0.00369355  0.0034754   0.00302371
  0.00284976]
主題#6:   
詞:  經濟 政府 投資 孩子 我 改革 副 
概率: [ 0.00641846  0.004049    0.00381057  0.00374396  0.00366088  0.00344332
  0.00338472]
主題#7:   
詞:  林 妥協 車主 收取 國土資源部 划船 科技 
概率: [ 0.00704328  0.00389549  0.00373839  0.00314698  0.00291387  0.00280308
  0.00273133]
主題#8:   
詞:  星巴克 草原 收費 冷飲 點選 青年 張北縣 
概率: [ 0.00821544  0.0060041   0.00506269  0.00350584  0.00309426  0.002562
  0.00253879]
主題#9:   
詞:  戰機 發射 魯榮漁 船上 解放軍報 東 catalina 
概率: [ 0.00461292  0.00418864  0.00383498  0.002784    0.00258228  0.00228425
  0.00228091]
主題#10:  
詞:  特朗普 候選人 休斯敦 共和黨 拘留 受訪者 穆斯林 
概率: [ 0.0175363   0.00661245  0.00599092  0.0058532   0.00525441  0.00505672
  0.00469219]
主題#11:  
詞:  印度 江面 司機 贏得 褲子 文明 反應 
概率: [ 0.00637539  0.00342981  0.00337622  0.00328188  0.00323266  0.00294272
  0.00268347]
主題#12:  
詞:  寺廟 南部 釋持忠 阿勒頗 孔 敘 國際足聯 
概率: [ 0.00664495  0.00412984  0.00384755  0.00307015  0.00284295  0.002689
  0.00233919]
主題#13:  
詞:  軍事 參議員 航母 充值 身份證 眾議員 被捕 
概率: [ 0.00523342  0.00466948  0.0045969   0.00320847  0.0031927   0.00310526
  0.00306481]
主題#14:  
詞:  閱讀 日本 讀書 韓國 書籍 電子 意思 
概率: [ 0.01650023  0.01262877  0.00942729  0.00878824  0.00664274  0.00633476
  0.00579329]
主題#15:  
詞:  越南 陳某 漁船 行凶 公約 海域 捕魚 
概率: [ 0.00725566  0.00571377  0.00457485  0.00433582  0.00429894  0.00414336
  0.00396644]
主題#16:  
詞:  面板病 逃生 青島市 鎮 結果顯示 入境 社團 
概率: [ 0.00631313  0.00460248  0.0043726   0.00362972  0.00359586  0.00331288
  0.00304227]
主題#17:  
詞:  發展 腐敗 遊客 創新 規劃 調圖 民間 
概率: [ 0.0078954   0.00529992  0.00481503  0.00461509  0.00411728  0.00373398
  0.00361064]
主題#18:  
詞:  朝鮮 金正恩 漁民 朝鮮勞動黨 導彈 偷拍 平壤 
概率: [ 0.02084754  0.00747336  0.00493015  0.00463835  0.0043983   0.00437807
  0.00421136]
主題#19:  
詞:  醫院 患者 醫生 我 拆違 治療 已經 
概率: [ 0.00858462  0.00447255  0.00411326  0.00395962  0.00362965  0.00359018
  0.00347641]
主題#20:  
詞:  導遊 遊客 黨校 旅行社 購物 旅客 購買 
概率: [ 0.0048018   0.00457637  0.00403571  0.00375318  0.00347159  0.00297388
  0.00261786]
主題#21:  
詞:  總統 希拉里 結婚 選民 品質 大使 小 
概率: [ 0.00738952  0.00560565  0.00380818  0.003547    0.00350746  0.00349869
  0.00347796]
主題#22:  
詞:  總理 縣級市 事故 安全 縣 救援 市 
概率: [ 0.00411836  0.00409722  0.00352026  0.00329144  0.00320824  0.00303015
  0.00302035]
主題#23:  
詞:  督察 環保 醫療 魏則西 督察組 中央 父母 
概率: [ 0.00677646  0.00570995  0.00463713  0.00401606  0.00395448  0.00371315
  0.00354444]
主題#24:  
詞:  中國 美國 國家 生態 支援 環球網 報道 
概率: [ 0.00821696  0.00702561  0.00583064  0.00455489  0.00425351  0.00352046
  0.00350395]
主題#25:  
詞:  警方 他 她 了 是 人 不 
概率: [ 0.00409868  0.00382622  0.00376524  0.00340385  0.00330583  0.00308222
  0.00295222]
主題#26:  
詞:  南海 仲裁 領土 案 習近平 菲律賓 大熊貓 
概率: [ 0.008451    0.00776209  0.00545014  0.00529817  0.00526152  0.00523184
  0.00459266]
主題#27:  
詞:  臺灣 旅遊 大陸 蔡 路 澳大利亞 英文 
概率: [ 0.00697881  0.00535573  0.00427379  0.00418468  0.00406187  0.00392741
  0.00390961]
主題#28:  
詞:  雷洋 老太太 持刀 潘 搶劫 老 幼兒園 
概率: [ 0.01397124  0.00443968  0.00388552  0.0034777   0.00311585  0.00298354
  0.00231442]
主題#29:  
詞:  地震 度 北緯 級 萌 東經 震源 
概率: [ 0.00819944  0.00660477  0.00532019  0.00473097  0.00449198  0.0043948
  0.00421006]

Process finished with exit code 0

2.3 Implementing LDA with the lda package – a Reuters data test

The previous sections used gensim to implement the LDA model.
Here we switch to another tool, the lda package.

Installation:
Install it with pip:

pip install lda

The complete code is as follows:

# -*- coding:utf8 -*-

import numpy as np
import matplotlib.pyplot as plt
import lda
import lda.datasets


if __name__ == "__main__":
    # # 1. Load the data
    # document-term matrix
    X = lda.datasets.load_reuters()  # load the corpus; each row is one document
    print("type(X): {}".format(type(X)))  # the data is an ndarray
    print("shape: {}\n".format(X.shape))  # shape of the data
    print(X[:10, :10])  # print the first 10 rows and 10 columns

    # the vocab  
    vocab = lda.datasets.load_reuters_vocab()  # load the vocabulary
    print("type(vocab): {}".format(type(vocab)))  # the vocabulary is a tuple
    print("len(vocab): {}\n".format(len(vocab)))  # number of unique words
    print(vocab[:10])

    # titles for each story
    titles = lda.datasets.load_reuters_titles()  # load the titles
    print("type(titles): {}".format(type(titles)))  # also a tuple
    print("len(titles): {}\n".format(len(titles)))  # number of titles (should equal the number of documents)
    print(titles[:10])

    # # 2. Train the model
    print 'LDA start ----'
    topic_num = 20
    model = lda.LDA(n_topics=topic_num, n_iter=500, random_state=1)
    model.fit(X)

    # # 3. Output the results
    # topic-word
    topic_word = model.topic_word_  # word probabilities for each topic
    print("type(topic_word): {}".format(type(topic_word)))
    print("shape: {}".format(topic_word.shape))
    print(vocab[:5])  # first 5 words of the vocabulary
    print(topic_word[:, :5])  # their probabilities under each topic

    # Print Topic distribution
    n = 7
    for i, topic_dist in enumerate(topic_word):
        topic_words = np.array(vocab)[np.argsort(topic_dist)][:-(n + 1):-1]
        print('*Topic {}\n- {}'.format(i, ' '.join(topic_words)))

    # Document - topic: take the most probable topic for each document
    doc_topic = model.doc_topic_
    print("type(doc_topic): {}".format(type(doc_topic)))
    print("shape: {}".format(doc_topic.shape))
    for i in range(10):
        topic_most_pr = doc_topic[i].argmax()
        print("doc: {} topic: {} value: {}".format(i, topic_most_pr, doc_topic[i][topic_most_pr]))

    # # 4. Visualize the results
    # Topic - word: word distribution within selected topics
    plt.figure(figsize=(8, 12))
    # f, ax = plt.subplots(5, 1, sharex=True)
    for i, k in enumerate([0, 5, 9, 14, 19]):
        ax = plt.subplot(5, 1, i+1)
        ax.plot(topic_word[k, :], 'r-')
        ax.set_xlim(-50, 4350)   # [0,4258]
        ax.set_ylim(0, 0.08)
        ax.set_ylabel("Prob")
        ax.set_title("topic {}".format(k))
    plt.xlabel("word", fontsize=14)
    plt.tight_layout()
    plt.show()

    # Document - Topic: topic distribution of selected documents
    plt.figure(figsize=(8, 12))
    # f, ax= plt.subplots(5, 1, figsize=(8, 6), sharex=True)
    for i, k in enumerate([1, 3, 4, 8, 9]):
        ax = plt.subplot(5, 1, i+1)
        ax.stem(doc_topic[k, :], linefmt='g-', markerfmt='ro')
        ax.set_xlim(-1, topic_num+1)
        ax.set_ylim(0, 1)
        ax.set_ylabel("Prob")
        ax.set_title("Document {}".format(k))
    plt.xlabel("Topic", fontsize=14)
    plt.tight_layout()
    plt.show()

The output from running the code:

/home/cc/anaconda2/bin/python /home/cc/PycharmProjects/mltest/LDA/16.3.reuters.py
type(X): <type 'numpy.ndarray'>
shape: (395, 4258)

[[ 1  0  1  0  0  0  1  0  0  1]
 [ 7  0  2  0  0  0  0  1  0  0]
 [ 0  0  0  1 10  0  4  1  1  0]
 [ 6  0  1  0  0  0  1  1  1  0]
 [ 0  0  0  2 14  1  1  0  2  1]
 [ 0  0  2  2 24  0  2  0  2  1]
 [ 0  0  0  2  7  1  1  0  1  0]
 [ 0  0  2  2 20  0  2  0  3  1]
 [ 0  1  0  2 17  2  2  0  0  0]
 [ 2  0  2  0  0  2  0  1  0  3]]
type(vocab): <type 'tuple'>
len(vocab): 4258

('church', 'pope', 'years', 'people', 'mother', 'last', 'told', 'first', 'world', 'year')
type(titles): <type 'tuple'>
len(titles): 395

('0 UK: Prince Charles spearheads British royal revolution. LONDON 1996-08-20', '1 GERMANY: Historic Dresden church rising from WW2 ashes. DRESDEN, Germany 1996-08-21', "2 INDIA: Mother Teresa's condition said still unstable. CALCUTTA 1996-08-23", '3 UK: Palace warns British weekly over Charles pictures. LONDON 1996-08-25', '4 INDIA: Mother Teresa, slightly stronger, blesses nuns. CALCUTTA 1996-08-25', "5 INDIA: Mother Teresa's condition unchanged, thousands pray. CALCUTTA 1996-08-25", '6 INDIA: Mother Teresa shows signs of strength, blesses nuns. CALCUTTA 1996-08-26', "7 INDIA: Mother Teresa's condition improves, many pray. CALCUTTA, India 1996-08-25", '8 INDIA: Mother Teresa improves, nuns pray for "miracle". CALCUTTA 1996-08-26', '9 UK: Charles under fire over prospect of Queen Camilla. LONDON 1996-08-26')
LDA start ----
INFO:lda:n_documents: 395
INFO:lda:vocab_size: 4258
INFO:lda:n_words: 84010
INFO:lda:n_topics: 20
INFO:lda:n_iter: 500
INFO:lda:<0> log likelihood: -1051748
INFO:lda:<10> log likelihood: -719800
INFO:lda:<20> log likelihood: -699115
INFO:lda:<30> log likelihood: -689370
INFO:lda:<40> log likelihood: -684918
INFO:lda:<50> log likelihood: -681322
INFO:lda:<60> log likelihood: -678979
INFO:lda:<70> log likelihood: -676598
INFO:lda:<80> log likelihood: -675383
INFO:lda:<90> log likelihood: -673316
INFO:lda:<100> log likelihood: -672761
INFO:lda:<110> log likelihood: -671320
INFO:lda:<120> log likelihood: -669744
INFO:lda:<130> log likelihood: -669292
INFO:lda:<140> log likelihood: -667940
INFO:lda:<150> log likelihood: -668038
INFO:lda:<160> log likelihood: -667429
INFO:lda:<170> log likelihood: -666475
INFO:lda:<180> log likelihood: -665562
INFO:lda:<190> log likelihood: -664920
INFO:lda:<200> log likelihood: -664979
INFO:lda:<210> log likelihood: -664722
INFO:lda:<220> log likelihood: -664459
INFO:lda:<230> log likelihood: -664360
INFO:lda:<240> log likelihood: -663600
INFO:lda:<250> log likelihood: -664164
INFO:lda:<260> log likelihood: -663826
INFO:lda:<270> log likelihood: -663458
INFO:lda:<280> log likelihood: -663393
INFO:lda:<290> log likelihood: -662904
INFO:lda:<300> log likelihood: -662294
INFO:lda:<310> log likelihood: -662031
INFO:lda:<320> log likelihood: -662430
INFO:lda:<330> log likelihood: -661601
INFO:lda:<340> log likelihood: -662108
INFO:lda:<350> log likelihood: -662152
INFO:lda:<360> log likelihood: -661899
INFO:lda:<370> log likelihood: -661012
INFO:lda:<380> log likelihood: -661278
INFO:lda:<390> log likelihood: -661085
INFO:lda:<400> log likelihood: -660418
INFO:lda:<410> log likelihood: -660510
INFO:lda:<420> log likelihood: -660343
INFO:lda:<430> log likelihood: -659789
INFO:lda:<440> log likelihood: -659336
INFO:lda:<450> log likelihood: -659039
INFO:lda:<460> log likelihood: -659329
INFO:lda:<470> log likelihood: -658707
INFO:lda:<480> log likelihood: -658879
INFO:lda:<490> log likelihood: -658819
INFO:lda:<499> log likelihood: -658407
type(topic_word): <type 'numpy.ndarray'>
shape: (20, 4258)
('church', 'pope', 'years', 'people', 'mother')
[[  2.72436509e-06   2.72436509e-06   2.72708945e-03   2.72436509e-06
    2.72436509e-06]
 [  2.29518860e-02   1.08771556e-06   7.83263973e-03   1.15308726e-02
    1.08771556e-06]
 [  3.97404221e-03   4.96135108e-06   2.98177200e-03   4.96135108e-06
    4.96135108e-06]
 [  3.27374625e-03   2.72585033e-06   2.72585033e-06   2.45599115e-03
    2.72585033e-06]
 [  8.26262882e-03   8.56893407e-02   1.61980569e-06   4.87561512e-04
    1.61980569e-06]
 [  1.30107788e-02   2.95632328e-06   2.95632328e-06   2.95632328e-06
    2.95632328e-06]
 [  2.80145003e-06   2.80145003e-06   2.80145003e-06   2.80145003e-06
    2.80145003e-06]
 [  2.42858077e-02   4.66944966e-06   4.66944966e-06   4.66944966e-06
    2.42858077e-02]
 [  6.84655429e-03   1.90129250e-06   6.84655429e-03   1.90129250e-06
    1.90129250e-06]
 [  3.48361655e-06   3.48361655e-06   3.48361655e-06   3.48361655e-06
    3.48361655e-06]
 [  2.98781661e-03   3.31611166e-06   3.31611166e-06   8.29359526e-03
    3.31611166e-06]
 [  4.27062069e-06   4.27062069e-06   4.27062069e-06   1.19620086e-02
    4.27062069e-06]
 [  1.50994982e-02   1.64107142e-06   1.64107142e-06   1.59200339e-02
    2.95556963e-03]
 [  7.73480150e-07   7.73480150e-07   1.70946848e-02   7.73480150e-07
    7.73480150e-07]
 [  2.82280146e-06   2.82280146e-06   2.82280146e-06   6.77754631e-03
    7.28311005e-02]
 [  5.15309856e-06   5.15309856e-06   4.64294180e-03   5.15309856e-06
    5.15309856e-06]
 [  3.41695768e-06   3.41695768e-06   3.41695768e-06   1.29878561e-02
    3.41695768e-06]
 [  3.90980357e-02   1.70316633e-03   4.42279319e-03   3.39953358e-06
    3.39953358e-06]
 [  2.39373034e-06   2.39373034e-06   2.39373034e-06   2.39612407e-03
    2.39373034e-06]
 [  3.32493234e-06   3.32493234e-06   3.32493234e-06   3.32493234e-06
    3.32493234e-06]]
*Topic 0
- government british minister west group letters party
*Topic 1
- church first during people political country ceremony
*Topic 2
- elvis king wright fans presley concert life
*Topic 3
- yeltsin russian russia president kremlin michael romania
*Topic 4
- pope vatican paul surgery pontiff john hospital
*Topic 5
- family police miami versace cunanan funeral home
*Topic 6
- south simpson born york white north african
*Topic 7
- order church mother successor since election religious
*Topic 8
- charles prince diana royal queen king parker
*Topic 9
- film france french against actor paris bardot
*Topic 10
- germany german war nazi christian letter book
*Topic 11
- east prize peace timor quebec belo indonesia
*Topic 12
- n't told life people church show very
*Topic 13
- years world time year last say three
*Topic 14
- mother teresa heart charity calcutta missionaries sister
*Topic 15
- city salonika exhibition buddhist byzantine vietnam swiss
*Topic 16
- music first people tour including off opera
*Topic 17
- church catholic bernardin cardinal bishop death cancer
*Topic 18
- harriman clinton u.s churchill paris president ambassador
*Topic 19
- century art million museum city churches works
type(doc_topic): <type 'numpy.ndarray'>
shape: (395, 20)
doc: 0 topic: 8 value: 0.483043478261
doc: 1 topic: 1 value: 0.290579710145
doc: 2 topic: 14 value: 0.665690376569
doc: 3 topic: 8 value: 0.507655502392
doc: 4 topic: 14 value: 0.778966789668
doc: 5 topic: 14 value: 0.844097222222
doc: 6 topic: 14 value: 0.803535353535
doc: 7 topic: 14 value: 0.87747440273
doc: 8 topic: 14 value: 0.819615384615
doc: 9 topic: 8 value: 0.534210526316

Process finished with exit code 0

The two distribution plots produced by the code:

(Figure: word distribution within each selected topic)

(Figure: topic distribution within each selected document)

Source material:
小象學院 machine learning course