
Topic Model (TopicModel): LDA's limitations and improvements


1. Short texts and LDA

An ICML paper gives a theoretical analysis: documents that are too short are indeed bad for training LDA, but an average length on the order of 10 words should be workable; for example, Peacock trains its model on queries.

There are some empirical tricks for preparing the data, such as concatenating the queries of the same session, or concatenating the tweets of the same person. LDA has also been trained with a small context window in the style of word2vec.


The reason performance is poor on short texts is that document-level word co-occurrences are very sparse.

Ways to address this problem:

1. As word2vec does, exploit local context-level word co-occurrences: treat each word as a document and the words that appear around it as that document's content. This way the model is no longer limited by document length (see the sketch after this list).

2. Short texts have more concentrated, explicit semantics, so LDA is suitable for them, and you can also do text-expansion work. With a query log: (1) query sessions, (2) clickstream. Without a query log: (1) treat the short text as a query and fetch the top relevant pages from a search engine (or corpus), (2) use the words surrounding the short text in the corpus, (3) use synonyms, hypernyms/hyponyms, etc. from a knowledge base.

3. KBTM (knowledge-based topic models)

[http://weibo.com/1991303247/CltoOaSTN?type=repost#_rnd1433930168895]
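To make option 1 above concrete, here is a minimal sketch of the context-window idea. It assumes gensim is available; the toy corpus, window size, and topic count are all illustrative, not from the original post.

```python
# Sketch: turn local context windows into pseudo-documents, then train plain LDA.
# Assumes gensim; corpus, window size, and topic count are illustrative only.
from collections import defaultdict
from gensim import corpora, models

short_texts = [
    ["apple", "iphone", "battery"],
    ["battery", "charger", "usb"],
    ["apple", "macbook", "keyboard"],
]

window = 2  # words to the left/right that count as "context"
pseudo_docs = defaultdict(list)
for doc in short_texts:
    for i, w in enumerate(doc):
        left = doc[max(0, i - window):i]
        right = doc[i + 1:i + 1 + window]
        # each word accumulates the words seen around it across all texts
        pseudo_docs[w].extend(left + right)

docs = list(pseudo_docs.values())
dictionary = corpora.Dictionary(docs)
bow = [dictionary.doc2bow(d) for d in docs]
lda = models.LdaModel(bow, num_topics=2, id2word=dictionary, passes=10)
print(lda.print_topics())
```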

2. LDA limitations: what’s next?

Although LDA is a great algorithm for topic modelling, it still has some limitations, mainly because it has only recently become popular and widely available.

One major limitation is perhaps given by its underlying unigram text model: LDA doesn't consider the mutual position of the words in the document. Documents like "Man, I love this can" and "I can love this man" are probably modelled the same way. It's also true that for longer documents, mismatching topics is harder. To overcome this limitation, at the cost of almost squaring the complexity, you can use 2-grams (or N-grams) along with 1-grams.
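As an illustration of the n-gram workaround (a sketch only, not from the original article; the toy corpus and parameters are invented), scikit-learn lets you feed unigrams and bigrams together into its LDA implementation. The extra cost shows up as a much larger vocabulary:

```python
# Sketch: use 1-grams together with 2-grams as LDA features (scikit-learn).
# The corpus and parameter values are illustrative only.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["Man, I love this can", "I can love this man", "this can of beans"]

# ngram_range=(1, 2) keeps unigrams and adds bigrams such as "love this"
vectorizer = CountVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)
print(doc_topics)  # per-document topic proportions
```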

Another weakness of LDA is in the composition of the topics: they overlap. In fact, you can find the same word in multiple topics (the example above, of the word "can", is obvious). The generated topics, therefore, are not independent and orthogonal like in a PCA-decomposed basis, for example. This implies that you must pay lots of attention while dealing with them (e.g. don't use cosine similarity).
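If you do need to compare topics, one commonly used alternative to cosine similarity for probability distributions is the Hellinger distance; this is my own illustrative sketch, not something prescribed by the article:

```python
# Sketch: compare two topic-word distributions with Hellinger distance
# instead of cosine similarity. The two distributions are made up.
import numpy as np

def hellinger(p, q):
    """Hellinger distance between two discrete probability distributions."""
    return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

topic_a = np.array([0.5, 0.3, 0.1, 0.1])   # P(word | topic A)
topic_b = np.array([0.4, 0.1, 0.1, 0.4])   # P(word | topic B)
print(hellinger(topic_a, topic_b))          # 0 = identical, 1 = disjoint
```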

For a more structured approach - especially if the topic composition is very misleading - you might consider the hierarchical variation of LDA, named H-LDA (or simply hierarchical LDA). In H-LDA, topics are joined together in a hierarchy by using a Nested Chinese Restaurant Process (NCRP). This model is more complex than LDA, and the description is beyond the goal of this blog entry, but if you would like to get an idea of the possible output, here it is. Don't forget that we're still in the probabilistic world: each node of the H-LDA tree is a topic distribution.

[http://engineering.intenthq.com/2015/02/automatic-topic-modelling-with-lda/]
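To give a flavour of the Nested Chinese Restaurant Process mentioned above, here is a toy sketch (my own illustration, not code from the cited article): each document samples a root-to-leaf path of fixed depth through the topic tree, sitting at an existing branch with probability proportional to how many documents already chose it, or opening a new branch with probability proportional to a concentration parameter gamma.

```python
# Toy sketch of nested-CRP path sampling as used by hierarchical LDA (H-LDA).
# Illustration only: each document picks a root-to-leaf path in a topic tree.
import random

def sample_ncrp_path(tree, depth, gamma=1.0):
    """tree maps a path tuple -> number of documents that chose that node."""
    path = ()
    for _ in range(depth):
        children = [p for p in tree if len(p) == len(path) + 1 and p[:len(path)] == path]
        weights = [tree[c] for c in children] + [gamma]   # existing branches + a new one
        choice = random.choices(children + ["NEW"], weights=weights)[0]
        if choice == "NEW":
            choice = path + (len(children),)               # open a new branch
            tree[choice] = 0
        tree[choice] += 1
        path = choice
    return path

tree = {}
paths = [sample_ncrp_path(tree, depth=3) for _ in range(10)]
print(paths)  # each path is a sequence of topic nodes from root to leaf
```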

LDA is an unsupervised machine-learning technique that can be used to discover latent topic information in a large document collection or corpus. It takes a bag-of-words approach, treating each document as a word-frequency vector, which turns text into numerical information that is easy to model. However, the bag-of-words approach ignores the order between words; this simplifies the problem, and it also leaves an opening for improving the model. Each document represents a probability distribution over some topics, and each topic in turn represents a probability distribution over many words. Because the components of a Dirichlet-distributed random vector are only weakly correlated (they are still slightly "correlated" because the components must sum to 1), the latent topics we posit are also almost uncorrelated, which does not match many real problems; this is another of LDA's open issues.
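The generative story in the paragraph above can be written down in a few lines. The following is a toy simulation (vocabulary, hyperparameters, and sizes are invented) that draws a topic distribution per document and a word distribution per topic from Dirichlet priors, then samples words:

```python
# Toy simulation of LDA's generative process described above.
# Vocabulary, hyperparameters, and sizes are illustrative only.
import numpy as np

rng = np.random.default_rng(0)
vocab = ["apple", "orange", "stock", "market", "game", "team"]
n_topics, n_docs, doc_len = 3, 4, 8
alpha, beta = 0.5, 0.1   # Dirichlet hyperparameters

# each topic is a distribution over words; each document over topics
topic_word = rng.dirichlet([beta] * len(vocab), size=n_topics)
doc_topic = rng.dirichlet([alpha] * n_topics, size=n_docs)

for d in range(n_docs):
    words = []
    for _ in range(doc_len):
        z = rng.choice(n_topics, p=doc_topic[d])     # pick a topic for this word slot
        w = rng.choice(len(vocab), p=topic_word[z])  # pick a word from that topic
        words.append(vocab[w])
    print(f"doc {d}:", " ".join(words))
```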

3. Big data text analysis: inconsistent, inaccurate

LDA is also inaccurate enough at some tasks that the results of any topic model created with it are essentially meaningless, according to Luis Amaral.

Applied to messy, inconsistently scrubbed data from many sources in many formats – the kind of data that big data is often praised for its ability to manage – the results would be far less accurate and far less reproducible.

"Our systematic analysis clearly demonstrates that current implementations of LDA have low validity," the paper reports (full text PDFhere).

Improvement: TopicMapping

1. Break words down into their bases (treating "stars" and "star" as the same word), then eliminate conjunctions, pronouns and other "stop words" that modify the meaning but not the topic, using a standardized list (a sketch of this step appears after the list).

2. Then the algorithm builds a model identifying words that often appear together in the same document and uses the Infomap community-detection algorithm to assign those clusters of words to groups, identified as a "community", that define the topic. Words can appear in more than one topic area.
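A minimal sketch of step 1, assuming NLTK and its standard stop-word list; the Porter stemmer is my stand-in, and the paper's exact preprocessing may differ:

```python
# Sketch of TopicMapping-style preprocessing: stem words and drop stop words.
# Assumes NLTK; the stemmer and stop-word list stand in for the paper's choices.
import nltk
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)
stop = set(stopwords.words("english"))
stemmer = PorterStemmer()

def preprocess(text):
    tokens = [t.lower() for t in text.split()]
    return [stemmer.stem(t) for t in tokens if t not in stop and t.isalpha()]

print(preprocess("The stars and the star were shining"))
# -> ['star', 'star', 'shine'] (roughly; exact output depends on the stemmer)
```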

The new approach delivered results that were 92 percent accurate and 98 percent reproducible, though, according to the paper, it only moderately improved the likelihood that any given result would be accurate.

The best way to improve those analyses is to apply techniques common in community detection algorithms – which identify connections among specific variables and use those to help categorize or verify the classification of those that aren't clearly in one group or another.
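As a rough illustration of the co-occurrence-plus-community idea (not the actual TopicMapping or Infomap code; networkx's modularity-based communities stand in for Infomap), you can build a word co-occurrence graph and let a community-detection algorithm group words into topic-like clusters:

```python
# Rough illustration of grouping co-occurring words via community detection.
# networkx's greedy modularity communities stand in for Infomap here.
from itertools import combinations
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

docs = [
    ["star", "galaxy", "telescope"],
    ["star", "telescope", "orbit"],
    ["market", "stock", "trade"],
    ["stock", "trade", "price"],
]

G = nx.Graph()
for doc in docs:
    for a, b in combinations(set(doc), 2):
        # edge weight = number of documents in which the two words co-occur
        w = G[a][b]["weight"] + 1 if G.has_edge(a, b) else 1
        G.add_edge(a, b, weight=w)

communities = greedy_modularity_communities(G, weight="weight")
for i, words in enumerate(communities):
    print(f"topic-like community {i}: {sorted(words)}")
```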

4. Parallel LDA computation

Spark MLlib's LDA is implemented on GraphX: document-to-word links are edges, word frequencies are the edge attributes, and the corpus is built as a graph, so the per-word operations on each document in the corpus become operations on the graph's edges; processing an edge RDD is one of the most common processing patterns in GraphX.
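A minimal PySpark sketch of using MLlib's LDA (parameters and toy data are illustrative; optimizer="em" selects the EM optimizer, which is the GraphX-based implementation described above, while "online" uses variational Bayes):

```python
# Sketch: training LDA with Spark MLlib (DataFrame API); values are illustrative.
# optimizer="em" selects the GraphX-based EM optimizer mentioned above.
from pyspark.sql import SparkSession
from pyspark.ml.feature import CountVectorizer
from pyspark.ml.clustering import LDA

spark = SparkSession.builder.appName("lda-sketch").getOrCreate()

df = spark.createDataFrame(
    [(0, ["apple", "iphone", "battery"]),
     (1, ["stock", "market", "trade"]),
     (2, ["battery", "charger", "usb"])],
    ["id", "words"])

cv = CountVectorizer(inputCol="words", outputCol="features")
vectorized = cv.fit(df).transform(df)

lda = LDA(k=2, maxIter=20, optimizer="em", featuresCol="features")
model = lda.fit(vectorized)
model.describeTopics().show(truncate=False)
```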

For a Gibbs-sampling LDA implemented on GraphX, define a bipartite graph of documents and words: the vertex attribute is the topic-count vector of the corresponding document or word, and the edge attribute is the new topic assignment produced by the Gibbs sampler in that round. In each iteration the sampler generates topics, and the mapReduceTriplets function accumulates the corresponding topic counts for each document or word. This looks like Pregel's processing style? LDA has also been implemented with Pregel.

ref: