Inverted Index

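An inverted index (倒排索引) is also commonly called a reverse index, a postings file, or an inverted file. It is an indexing method used in full-text search to store a mapping from a word to its locations in a document or a set of documents. It is the most commonly used data structure in document retrieval systems.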

For example:

The texts to be indexed are:

T0 = "it is what it is"

T1 = "what is it"

T2 = "it is a banana"

The resulting inverted index can be written as follows:

"a" = {(2,2)}

"banana" = {(2,3)}

"is" = {(0,1),(0,4),(1,1),(2,1)}

"it" = {(0,0),(0,3),(1,2),(2,0)}

"what" = {(0,2),(1,0)}

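This gives a full inverted index, made up of pairs of (document number, position of the word within that document).

Both the document number and the word position are counted from zero.

So "banana" = {(2,3)} means that banana is the fourth word of the third document.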
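
The listing above can be reproduced with a few lines of code. The following is a minimal sketch, assuming Python 3 and whitespace tokenization; the function name build_positional_index is hypothetical and is not part of the original script.

```python
from collections import defaultdict


def build_positional_index(documents):
    """Map each word to a list of (document number, word position) pairs."""
    index = defaultdict(list)
    for doc_id, text in enumerate(documents):
        # Both doc_id and position start at zero, as in the listing above.
        for position, word in enumerate(text.split()):
            index[word].append((doc_id, position))
    return dict(index)


documents = ["it is what it is", "what is it", "it is a banana"]
index = build_positional_index(documents)

# Print the entries in alphabetical order of the word.
for word in sorted(index):
    print('"%s" = %s' % (word, index[word]))
```

Each word maps to the same pairs as in the listing, stored here in a list rather than a set so that the order of occurrence is kept.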

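===== Example:

DATA: stores the forward index (line number to line text).

word_index: stores the inverted index. Each space-separated word is a key;

      the value is a list, filled with list.append, that holds the positions of that word in the text file.

      A word position is written as (index within the line + "." + line number).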

# coding:utf-8
# Python 2 script (uses print statements and dict.has_key)
import sys

DATA = {}        # forward index: line number -> line text
word_index = {}  # inverted index: query -> (line_no, word_index)


# Look up each query word using the inverted index
def check_index(sentense):
    query = sentense.split(' ')
    for v in query:
        if word_index.has_key(v) == True:
            # print word_index[v], "####", v
            for index_lineno in word_index[v]:  # e.g. ['0.0', '2.1', '2.3']
                # print index_lineno
                # The part after the dot is the line number
                print DATA[int(index_lineno.split('.')[1])]


if __name__ == "__main__":
    # Build the inverted index
    line_num = 0
    for line in sys.stdin:
        line = line.strip(' \r\n')
        fields = line.split(' ')
        DATA[line_num] = line
        for i, val in enumerate(fields):
            if word_index.has_key(val) == False:
                word_index[val] = []
            word_index[val].append(".".join([str(i), str(line_num)]))
        line_num += 1
    print word_index
    print DATA
    print "=====test query"
    queries = "it is example"
    # Note: Python 2 prints the next line as a tuple, not a formatted string
    print ("####input search sentense:%s", queries)
    print "####search result is :"
    check_index(queries)
    print "done=========="
    sys.exit(0)

=====

input.data text file:

it is what it is
what is it
it is a banana
from your second example
When I run the algo using some sample
What am I doing wrong ?

====== Output:

{'What': ['0.5'], 'doing': ['3.5'], 'is': ['1.0', '4.0', '1.1', '1.2'],
'some': ['6.4'], 'it': ['0.0', '3.0', '2.1', '0.2'], 'sample': ['7.4'],
'second': ['2.3'], 'your': ['1.3'], 'what': ['2.0', '0.1'], 'from': ['0.3'],
'banana': ['3.2'], '?': ['5.5'], 'run': ['2.4'], 'I': ['1.4', '2.5'],
'When': ['0.4'], 'wrong': ['4.5'], 'using': ['5.4'], 'a': ['2.2'],
'am': ['1.5'], 'algo': ['4.4'], 'the': ['3.4'], 'example': ['3.3']}
{0: 'it is what it is', 1: 'what is it', 2: 'it is a banana',
  3: 'from your second example', 4: 'When I run the algo using some sample',
  5: 'What am I doing wrong ?'}
=====test query
('####input search sentense:%s', 'it is example')
####search result is :
it is what it is
it is what it is
what is it
it is a banana
it is what it is
it is what it is
what is it
it is a banana
from your second example
done==========
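
The query prints a line once for every occurrence of every query word, which is why the same lines appear several times in the result. The following is a minimal sketch, assuming Python 3, that keeps the same "index within the line.line number" encoding but returns each matching line only once. The function names build_index and search are hypothetical and do not appear in the original script.

```python
def build_index(lines):
    """Return (data, word_index) laid out like DATA and word_index above."""
    data = {}
    word_index = {}
    for line_no, line in enumerate(lines):
        line = line.strip(" \r\n")
        data[line_no] = line
        for i, word in enumerate(line.split(" ")):
            # Position is stored as "<index within the line>.<line number>".
            word_index.setdefault(word, []).append("%d.%d" % (i, line_no))
    return data, word_index


def search(query, data, word_index):
    """Return the lines that contain any query word, each line at most once."""
    seen = set()
    results = []
    for word in query.split(" "):
        for position in word_index.get(word, []):
            # The part after the dot is the line number.
            line_no = int(position.split(".")[1])
            if line_no not in seen:
                seen.add(line_no)
                results.append(data[line_no])
    return results


lines = [
    "it is what it is",
    "what is it",
    "it is a banana",
    "from your second example",
]
data, word_index = build_index(lines)
for hit in search("it is example", data, word_index):
    print(hit)
```

Tracking the line numbers already seen in a set keeps the results in first-match order while dropping the repeats.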
