
gensim study notes (3): Computing similarity between documents

Similarity interface

First, configure logging:

>>> import logging
>>> logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

In the previous notes we introduced the vector space model and transformations between vector spaces, such as the TF-IDF and LSI models. With these transformations we can compute the similarity between documents, or between one particular document and every document in a collection, for example between a query and a set of documents. Let's walk through this in practice.

>>> from gensim import corpora, models, similarities
>>> dictionary = corpora.Dictionary.load('/tmp/deerwester.dict')
>>> corpus = corpora.MmCorpus('/tmp/deerwester.mm')  # comes from the first tutorial, "From strings to vectors"
>>> print(corpus)
MmCorpus(9 documents, 12 features, 28 non-zero entries)
>>> lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=2)

First we load the data from disk, then transform it with the LSI model.
Now suppose we have the query "Human computer interaction" and want to compute its similarity to each of the other nine documents:

>>> doc = "Human computer interaction"
>>> vec_bow = dictionary.doc2bow(doc.lower().split())
>>> vec_lsi = lsi[vec_bow]  # convert the query to LSI space
>>> print(vec_lsi)
[(0, -0.461821), (1, 0.070028)]

Here we use cosine similarity to measure the similarity between two vectors.
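As a quick refresher, cosine similarity is the dot product of two vectors divided by the product of their norms. A minimal pure-Python sketch of the formula (just an illustration, not gensim's actual implementation):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# A vector is maximally similar to itself...
print(cosine_similarity([1.0, 2.0], [1.0, 2.0]))  # ≈ 1.0
# ...and orthogonal vectors score 0.
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0
```

Scores range from -1 (opposite directions) through 0 (orthogonal) to 1 (same direction), which is exactly the range we see in the query results below.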

To prepare for similarity queries, we need to enter all documents which we want to compare against subsequent queries. In our case, they are the same nine documents used for training LSI, converted to 2-D LSA space. But that’s only incidental, we might also be indexing a different corpus altogether.

>>> index = similarities.MatrixSimilarity(lsi[corpus]) # transform corpus to LSI space and index it

One caveat about similarities.MatrixSimilarity: it holds the entire index in memory, so it will raise an error when the corpus is too large to fit in RAM. In that case, use the similarities.Similarity class instead; see http://radimrehurek.com/gensim/similarities/docsim.html for details.
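The core idea behind the memory-friendly class is to process the index in chunks rather than all at once. The chunking idea can be sketched in plain NumPy, assuming dense vectors (this illustrates the concept only, not gensim's actual sharding code, which also spills shards to disk):

```python
import numpy as np

def chunked_similarities(query, docs, chunk_size=2):
    """Cosine similarity of `query` against every row of `docs`,
    processed `chunk_size` rows at a time so that only one chunk
    needs to be resident in memory at once."""
    q = query / np.linalg.norm(query)
    sims = []
    for start in range(0, len(docs), chunk_size):
        chunk = docs[start:start + chunk_size]
        # L2-normalize each document vector in the chunk
        norms = np.linalg.norm(chunk, axis=1, keepdims=True)
        sims.append((chunk / norms) @ q)
    return np.concatenate(sims)

docs = np.array([[1.0, 0.0],
                 [0.0, 1.0],
                 [1.0, 1.0]])
query = np.array([1.0, 1.0])
print(chunked_similarities(query, docs))  # ~[0.7071, 0.7071, 1.0]
```

In gensim's real Similarity class, each chunk (shard) additionally lives on disk and is streamed in during a query, so memory use stays bounded regardless of corpus size.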

>>> index.save('/tmp/deerwester.index')
>>> index = similarities.MatrixSimilarity.load('/tmp/deerwester.index')

The index, like other gensim objects, can be saved and loaded.
With the index in place, we can compute the similarity between the query "Human computer interaction" and each document:

>>> sims = index[vec_lsi] # perform a similarity query against the corpus
>>> print(list(enumerate(sims))) # print (document_number, document_similarity) 2-tuples
[(0, 0.99809301), (1, 0.93748635), (2, 0.99844527), (3, 0.9865886), (4, 0.90755945),
(5, -0.12416792), (6, -0.1063926), (7, -0.098794639), (8, 0.05004178)]

>>> sims = sorted(enumerate(sims), key=lambda item: -item[1])
>>> print(sims) # print sorted (document number, similarity score) 2-tuples
[(2, 0.99844527), # The EPS user interface management system
(0, 0.99809301), # Human machine interface for lab abc computer applications
(3, 0.9865886), # System and human system engineering testing of EPS
(1, 0.93748635), # A survey of user opinion of computer system response time
(4, 0.90755945), # Relation of user perceived response time to error measurement
(8, 0.050041795), # Graph minors A survey
(7, -0.098794639), # Graph minors IV Widths of trees and well quasi ordering
(6, -0.1063926), # The intersection graph of paths in trees
(5, -0.12416792)] # The generation of random binary unordered trees

Here we observe an interesting result: documents no. 2 ("The EPS user interface management system") and no. 4 ("Relation of user perceived response time to error measurement") share not a single word with the query "Human computer interaction", yet after the LSI transformation both score very highly against it (no. 2 is the most similar of all!). This matches our intuition, since both documents are about the computer-human topic that the query asks about. This is exactly why we applied the LSI transformation in the first place: a plain bag-of-words model would give much poorer results here.
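The point about word overlap is easy to verify directly. Using the document texts quoted in the comments above, and the same lowercase whitespace tokenization that fed doc2bow earlier:

```python
query = "Human computer interaction"
doc2 = "The EPS user interface management system"
doc4 = "Relation of user perceived response time to error measurement"

def tokens(text):
    """Lowercase whitespace tokenization, as used for doc2bow above."""
    return set(text.lower().split())

# Neither top-ranked document shares a single token with the query,
# so a pure bag-of-words match would score both near zero.
print(tokens(query) & tokens(doc2))  # set()
print(tokens(query) & tokens(doc4))  # set()
```

LSI ranks them first anyway because the transformation maps related terms (e.g. "computer", "user", "interface") onto the same latent topic dimensions.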