
Topic Model: Implementing LDA with gensim

Implement an LDA model easily with Python's gensim.

An introduction to gensim

gensim is a free Python library that automatically extracts semantic topics from documents. Its algorithms include LSA (Latent Semantic Analysis), LDA (Latent Dirichlet Allocation) and RP (Random Projections). By examining word co-occurrence statistics in a training corpus of documents, these algorithms discover the semantic structure of the documents. They are unsupervised and work directly on raw, unstructured text ("plain text").

Gensim is a fairly specialized topic-modeling toolkit for Python. In text processing, for example when mining product reviews, you sometimes want to know how similar each review is to the product description, as a proxy for the review's objectivity. The higher the similarity between a review and the product description, the more formal and less emotional the review's wording tends to be: it focuses on the product's attributes and features, which is a more objective perspective.

Gensim: implementation language Python; implemented models: LDA, Dynamic Topic Model, Dynamic Influence Model, HDP, LSI, Random Projections, and the deep-learning-based word2vec and paragraph2vec.

gensim features

  • Memory independence: the entire training corpus never needs to reside fully in RAM at any one time.
  • Efficient implementations of several popular vector space algorithms, including tf-idf, distributed LSA, distributed LDA and RP; adding new algorithms is easy.
  • I/O wrappers and converters for popular data formats.
  • Similarity queries over documents in their semantic representation.
  • gensim was created because scalable software frameworks that implement topic modeling simply were lacking (the Java alternatives are complex).

gensim design principles

  • A simple interface with a shallow learning curve, convenient for prototyping.
  • Memory independence with respect to the size of the input corpus: the algorithms are streamed and access one document at a time (see the sketch below).
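
To make the streaming principle concrete, here is a minimal sketch (the file name corpus.txt and the class name StreamingCorpus are made up for this example) of a corpus that never holds all documents in memory: each pass over it reads one document per line and yields its bag-of-words vector.

from gensim import corpora

class StreamingCorpus:
    """Yields one bag-of-words vector at a time instead of keeping all documents in RAM."""
    def __init__(self, path, dictionary):
        self.path = path
        self.dictionary = dictionary

    def __iter__(self):
        with open(self.path) as f:
            for line in f:  # one document per line
                yield self.dictionary.doc2bow(line.lower().split())

# build the dictionary with one streaming pass, then stream the corpus itself
dictionary = corpora.Dictionary(line.lower().split() for line in open('corpus.txt'))
stream_corpus = StreamingCorpus('corpus.txt', dictionary)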

gensim core concepts

The whole gensim package revolves around three concepts: corpus, vector, and model.

  • Corpus
    A collection of documents, used to automatically infer their structure, topics, and so on; also called the training corpus.
  • Vector

    In the vector space model (VSM), each document is represented as an array of features. A single feature can be thought of as a question-answer pair, for example:

    [1] How many times does the word "splonge" appear in the document? 0
    [2] How many sentences does the document contain? 2
    [3] How many fonts does the document use? 5
    Each question can be represented by an integer id (e.g. 1, 2, 3), so the document above becomes (1, 0.0), (2, 2.0), (3, 5.0). If we know all the questions in advance, we may leave their ids implicit and simply write (0.0, 2.0, 5.0). This sequence of answers can be viewed as a vector (here 3-dimensional). For practical purposes, only questions whose answer is (or can be converted to) a single real number are allowed.

    The questions are the same for every document, so given two vectors (representing two documents) we would like to conclude: "if the numbers in the two vectors are similar, then the original documents are similar, too." Of course, whether such a conclusion holds depends on how well we chose our questions.

  • Sparse vector

    Typically, the answer to most questions is 0.0. To save space, we omit these from the document representation and write only (2, 2.0), (3, 5.0) (note that (1, 0.0) is dropped). Since the full set of questions is known in advance, all features missing from a sparse representation can be assumed to be 0.0.

    gensim is distinctive in that it does not prescribe any specific corpus format; a corpus can be anything that, when iterated over, yields these sparse vectors. For example, [[(2, 2.0), (3, 5.0)], [(0, -1.0), (3, -1.0)]] is a corpus of two documents, each with two non-zero feature pairs.

  • Model

    For our purposes, a model is a transformation that maps one document representation into another. Both the initial and the target representations are vectors; they differ only in which questions and answers they use. The transformation is learned automatically from the training corpus, without human supervision, and the resulting representation is more compact and more useful: similar documents end up with similar representations. (A minimal sketch tying these three concepts together follows this list.)
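
A minimal sketch tying the three concepts together (the two toy documents are made up for this example): build a Dictionary and a bag-of-words corpus, then apply a TfidfModel, i.e. a transformation from one vector representation into another.

from gensim import corpora, models

docs = [["human", "computer", "interface"],
        ["graph", "trees", "computer"]]
dictionary = corpora.Dictionary(docs)                 # maps each word to an integer id
corpus = [dictionary.doc2bow(doc) for doc in docs]    # sparse (id, count) vectors
tfidf = models.TfidfModel(corpus)                     # a model = a learned transformation
print(tfidf[corpus[0]])                               # the same document in tf-idf space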

[Gensim - topic modeling with Python]

Installing gensim

gensim depends on the NumPy and SciPy scientific computing packages, which must be installed first.

Then install gensim itself: pip install gensim

Official gensim tutorial

It is divided into the parts below.

Quick LDA with gensim

Vector representation of documents: Corpora and Vector Spaces

Convert documents represented as strings into document vectors represented by ids:

from gensim import corpora

documents = ["Human machine interface for lab abc computer applications",
    "A survey of user opinion of computer system response time",
    "The EPS user interface management system",
    "System and human system engineering testing of EPS",
    "Relation of user perceived response time to error measurement",
    "The generation of random binary unordered trees",
    "The intersection graph of paths in trees",
    "Graph minors IV Widths of trees and well quasi ordering",
    "Graph minors A survey"]
"""
# use StemmedCountVectorizer to get a stemmed corpus without stop words
Vectorizer = StemmedCountVectorizer
# Vectorizer = CountVectorizer
vectorizer = Vectorizer(stop_words='english')
vectorizer.fit_transform(documents)
texts = vectorizer.get_feature_names()
# print(texts)
"""
texts = [doc.lower().split() for doc in documents]
# print(texts)
dict = corpora.Dictionary(texts)    # build the dictionary (word -> id)
# print(dict, dict.token2id)
# use the dictionary to convert the string documents into id-based document vectors
corpus = [dict.doc2bow(text) for text in texts]
print(corpus)
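
With the id-based corpus in hand, training an LDA model and inspecting its topics takes only a couple more lines. A minimal sketch (num_topics=2 is an arbitrary choice for this toy corpus):

from gensim import models

lda_model = models.LdaModel(corpus, id2word=dict, num_topics=2)  # num_topics chosen arbitrarily here
for topic_id in range(lda_model.num_topics):
    print(lda_model.print_topic(topic_id))  # top words of each topic with their weights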

Two ways to look up a document's topics

That is, finding the topics a given document belongs to and their probabilities.

topics = [lda_model[c] for c in corpus_tfidf]  # not recommended for bulk queries: too slow, only suitable for small collections
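
The other common way is to query one document at a time with get_document_topics, which gensim's LdaModel provides. A small sketch, reusing the toy corpus and lda_model from above:

bow = corpus[0]  # bag-of-words vector of the document to inspect
doc_topics = lda_model.get_document_topics(bow, minimum_probability=0.01)
print(doc_topics)  # list of (topic_id, probability) pairs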

A complete example

Using the gensim Python package.

#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""
__title__ = 'topic model - build lda - 20news dataset'
__author__ = 'pi'
__mtime__ = '12/26/2014-026'
"""
from Colors import *  # local helper module providing terminal color constants (RED, REDH, REDL, GREENL, DEFAULT, ...)
from collections import defaultdict
import re
import datetime
from sklearn import datasets
import nltk
from gensim import corpora
from gensim import models
import numpy as np
from scipy import spatial
from CorePyPro.Fun.TimeStump import totalTime  # local helper module (timing utilities)


def load_texts(dataset_type='train', groups=None):
    """
    load datasets to bytes list
    :return: train_dataset_bunch.data bytes list
    """
    if groups == 'small':
        groups = ['comp.graphics', 'comp.os.ms-windows.misc']  # small subset for quick tests  #1368 docs
    elif groups == 'medium':
        groups = ['comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware',
                  'comp.windows.x', 'sci.space']  # medium-sized subset  #3414 docs
    train_dataset_bunch = datasets.load_mlcomp('20news-18828', dataset_type, mlcomp_root='./datasets',
                                               categories=groups)  # full set: 13180 docs
    return train_dataset_bunch.data
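

Note that sklearn.datasets.load_mlcomp has been removed from recent scikit-learn releases. If it is unavailable, a roughly equivalent loader based on fetch_20newsgroups might look like the sketch below (the function name is made up; fetch_20newsgroups downloads the data itself, so no mlcomp_root is needed, and it returns str documents, so the bytes-to-str decode step in preprocess_texts becomes unnecessary):

def load_texts_20newsgroups(dataset_type='train', groups=None):
    """Alternative loader built on fetch_20newsgroups; returns a list of str documents."""
    if groups == 'small':
        groups = ['comp.graphics', 'comp.os.ms-windows.misc']
    bunch = datasets.fetch_20newsgroups(subset=dataset_type, categories=groups,
                                        remove=('headers', 'footers', 'quotes'))
    return bunch.data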


def preprocess_texts(texts, test_doc_id=1):
    """
    texts preprocessing
    :param texts: bytes list
    :return: list of token lists
    """
    texts = [t.decode(errors='ignore') for t in texts]  # bytes2str
    # print(REDH, 'original texts[%d]: ' % test_doc_id, DEFAULT, '\n', texts[test_doc_id])
    # split_texts = [t.lower().split() for t in texts]
    # print(REDH, 'split texts[%d]: #%d' % (test_doc_id, len(split_texts)), DEFAULT, '\n', split_texts[test_doc_id])

    # lowercase & split each str into a word list on the separators below & drop empty tokens
    SEPS = r'[\s()\-/,:.?!]\s*'
    texts = [re.split(SEPS, t.lower()) for t in texts]
    for t in texts:
        while '' in t:
            t.remove('')
    # print(REDH, 'texts[%d] lower & split(seps= %s) & delete None: #%d' % (test_doc_id, SEPS, len(texts[test_doc_id])), DEFAULT, '\n', texts[test_doc_id])

    # nltk.download()   # then choose the corpus stopwords
    stopwords = set(nltk.corpus.stopwords.words('english'))  # #127
    stopwords.update(['from', 'subject', 'writes'])  # #129
    word_usage = defaultdict(int)
    for t in texts:
        for w in t:
            word_usage[w] += 1
    COMMON_LINE = len(texts) / 10
    # words whose total count exceeds len(texts)/10 are treated as too common and added to the stopwords
    too_common_words = [w for w in word_usage if word_usage[w] > COMMON_LINE]
    # print('too_common_words: #', len(too_common_words), '\n', too_common_words)   #68
    stopwords.update(too_common_words)
    # print('stopwords: #', len(stopwords), '\n', stopwords)  #   #147

    english_stemmer = nltk.SnowballStemmer('english')
    MIN_WORD_LEN = 3  # 4
    texts = [[english_stemmer.stem(w) for w in t if
              not set(w) & set('@+>0123456789*') and w not in stopwords and len(w) >= MIN_WORD_LEN] for t in
             texts]  # set('+-.?!()>@0123456789*/')
    # print(REDH, 'texts[%d] delete ^alphanum & stopwords & len<%d & stemmed: #' % (test_doc_id, MIN_WORD_LEN),
    # len(texts[test_doc_id]), DEFAULT, '\n', texts[test_doc_id])
    return texts


def build_corpus(texts):
    """
    build corpora
    :param texts: list of token lists
    :return: corpus DirectTextCorpus(corpora.TextCorpus)
    """

    class DirectTextCorpus(corpora.TextCorpus):
        """TextCorpus subclass that takes already-tokenized texts directly as its input."""
        def get_texts(self):
            return self.input

        def __len__(self):
            return len(self.input)

    corpus = DirectTextCorpus(texts)
    return corpus


def build_id2word(corpus):
    """
    build the id2word mapping (a gensim Dictionary) from the corpus
    :param corpus:
    :return: dict = corpus.dictionary
    """
    dict = corpus.dictionary  # gensim.corpora.dictionary.Dictionary
    # print(dict.id2token)
    try:
        # accessing any key forces the Dictionary to build its id2token mapping as a side effect
        dict['anything']
    except KeyError:
        pass
        # print("dict.id2token is not {} now")
    # print(dict.id2token)
    return dict


def save_corpus_dict(dict, corpus, dictDir='./LDA/id_word.dict', corpusDir='./LDA/corpus.mm'):
    dict.save(dictDir)
    print(GREENL, 'dict saved into %s successfully ...' % dictDir, DEFAULT)
    corpora.MmCorpus.serialize(corpusDir, corpus)
    print(GREENL, 'corpus saved into %s successfully ...' % corpusDir, DEFAULT)
    # corpus.save(fname='./LDA/corpus.mm')  # stores only the (tiny) iteration object


def load_ldamodel(modelDir='./lda.pkl'):
    model = models.LdaModel.load(fname=modelDir)
    print(GREENL, 'ldamodel load from %s successfully ...' % modelDir, DEFAULT)
    return model


def load_corpus_dict(dictDir='./LDA/id_word.dict', corpusDir='./LDA/corpus.mm'):
    dict = corpora.Dictionary.load(fname=dictDir)
    print(GREENL, 'dict load from %s successfully ...' % dictDir, DEFAULT)
    # dict = corpora.Dictionary.load_from_text('./id_word.txt')
    corpus = corpora.MmCorpus(corpusDir)  # corpora.mmcorpus.MmCorpus
    print(GREENL, 'corpus load from %s successfully ...' % corpusDir, DEFAULT)
    return dict, corpus


def build_doc_word_mat(corpus, model, num_topics):
    """
    build the document-topic matrix in topic space
    :param corpus:
    :param model:
    :param num_topics: int
    :return: doc_word_mat np.array (len(topics) * num_topics)
    """
    topics = [model[c] for c in corpus]  # each item is a list of (topic_id, weight) pairs
    doc_word_mat = np.zeros((len(topics), num_topics))
    for doc, topic in enumerate(topics):
        for topic_id, weight in topic:
            doc_word_mat[doc, topic_id] += weight
    return doc_word_mat


def compute_pairwise_dist(doc_word_mat):
    """
    compute pairwise distances between documents in topic space
    :param doc_word_mat: np.array (len(topics) * num_topics)
    :return: pairwise_dist <class 'numpy.ndarray'>
    """
    pairwise_dist = spatial.distance.squareform(spatial.distance.pdist(doc_word_mat))
    # set the diagonal to a value larger than any real distance so a document is never its own nearest neighbour
    max_weight = pairwise_dist.max() + 1
    for i in list(range(len(pairwise_dist))):
        pairwise_dist[i, i] = max_weight
    return pairwise_dist


def closest_texts(corpus, model, num_topics, original_texts, test_doc_id=1, topn=5):
    """
    find the closest_doc_ids for doc[test_doc_id]
    :param corpus:
    :param model:
    :param num_topics:
    :param original_texts: the raw documents, used only for printing
    :param test_doc_id:
    :param topn:
    :return:
    """
    doc_word_mat = build_doc_word_mat(corpus, model, num_topics)
    pairwise_dist = compute_pairwise_dist(doc_word_mat)
    # print(REDH, 'original texts[%d]: ' % test_doc_id, DEFAULT, '\n', original_texts[test_doc_id])
    closest_doc_ids = pairwise_dist[test_doc_id].argsort()
    # return closest_doc_ids[:topn]
    for closest_doc_id in closest_doc_ids[:topn]:
        print(RED, 'closest doc[%d]' % closest_doc_id, DEFAULT, '\n', original_texts[closest_doc_id])
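

# An alternative to building the distance matrix by hand is gensim's own similarity index.
# A sketch using similarities.MatrixSimilarity over the LDA topic vectors (cosine similarity,
# so the highest scores are the most similar documents; the function name is made up here):
from gensim import similarities

def closest_texts_gensim(corpus, model, original_texts, test_doc_id=1, topn=5):
    index = similarities.MatrixSimilarity(model[corpus], num_features=model.num_topics)
    sims = index[model[corpus[test_doc_id]]]  # cosine similarity to every document
    best = sims.argsort()[::-1][1:topn + 1]   # descending order, skipping the document itself
    for doc_id in best:
        print('closest doc[%d]' % doc_id, '\n', original_texts[doc_id])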


def evaluate_model(model):
    """
    compute the model's perplexity (log_perplexity bound) on the test data
    :param model:
    :return: model.log_perplexity float
    """
    test_texts = load_texts(dataset_type='test', groups='small')
    test_texts = preprocess_texts(test_texts)
    test_corpus = build_corpus(test_texts)
    # note: for the perplexity to be meaningful, the test corpus should use the same
    # dictionary (word -> id mapping) as the training corpus
    return model.log_perplexity(test_corpus)
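

# Besides perplexity, topic coherence is another common way to compare models; gensim provides
# models.CoherenceModel for this. A sketch (the function name is made up; test_texts are the
# preprocessed token lists and dict is the training dictionary):
def evaluate_coherence(model, test_texts, dict):
    coherence_model = models.CoherenceModel(model=model, texts=test_texts,
                                            dictionary=dict, coherence='c_v')
    return coherence_model.get_coherence()  # higher is better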


def test_num_topics():
    dict, corpus = load_corpus_dict()
    print("#corpus_items:", len(corpus))
    for num_topics in [3, 5, 10, 30, 50, 100, 150, 200, 300]:
        start_time = datetime.datetime.now()
        model = models.LdaModel(corpus, num_topics=num_topics, id2word=dict)
        end_time = datetime.datetime.now()
        print("total running time = ", end_time - start_time)
        print(REDL, 'model.log_perplexity for test_texts with num_topics=%d : ' % num_topics, evaluate_model(model),
              DEFAULT)


def test():
    texts = load_texts(dataset_type='train', groups='small')
    original_texts = texts
    test_doc_id = 1

    # texts = preprocess_texts(texts, test_doc_id=test_doc_id)
    # corpus = build_corpus(texts=texts)  # corpus DirectTextCorpus(corpora.TextCorpus)
    # dict = build_id2word(corpus)
    # save_corpus_dict(dict, corpus)
    dict, corpus = load_corpus_dict()
    # print(len(corpus))

    num_topics = 100
    model = models.LdaModel(corpus, num_topics=num_topics, id2word=dict)  # results differ between runs (random initialization)
    print(model.show_topic(0))
    # model.save(fname='./lda.pkl')

    # model = load_ldamodel()
    # closest_texts(corpus, model, num_topics, original_texts, test_doc_id=1, topn=3)

    print(REDL, 'model.log_perplexity for test_texts', evaluate_model(model), DEFAULT)


if __name__ == '__main__':
    test()
    # test_num_topics()