程式人生 > 【Python】倒排索引

【Python】倒排索引

程式碼連結

預處理

word stemming

一個單詞可能有不同的形式,在英語中比如動詞的主被動、單複數等,例如 live、lives、lived。
雖然英文的處理看起來已經很複雜了,但實際上中文的處理要複雜得多。

stop words

比如a、the這種詞在處理的時候沒有實際意義。在這裡處理的時候先對詞頻進行統計,人為界定停詞,簡單的全部替換為空格。但是這種方式並不適用於所有的情況,對於比如,To be or not to be,這種就很難處理。

具體實現

Index.txt 記錄所出現的檔案
這裡將建立倒排索引分為三步

thefile.txt 所有出現過的詞(詞頻由高到低)
stop_word.txt 停詞
data.pkl 所建立的索引

1 count.py 確定停詞
2 index.py 建立倒排索引
3 query.py 用於查詢

這裡在建立倒排索引的時候只記錄了出現的檔名,並沒有記錄在檔案中出現的位置。

圖為count.py生成的詞頻統計

這裡寫圖片描述

count.py

#-*- coding:utf-8 -*-
'''
@author birdy qian
'''
import sys
from nltk import *                                                                                          #import natural-language-toolkit
from operator import itemgetter #for sort def output_count(fdist): #output the relative information #vocabulary =fdist.items() vocabulary =fdist.items() #get all the vocabulary
vocabulary=sorted(vocabulary, key=itemgetter(1),reverse=True) #sort the vocabulary in decreasing order print vocabulary[:250] #print top 250 vocabulary and its count on the screen print 'drawing plot.....' #show process fdist.plot(120 , cumulative=False) #print the plot #output in file file_object = open('thefile.txt', 'w') #prepare the file for writing for j in vocabulary: file_object.write( j[0] + ' ') #put put all the vocabulary in decreasing order file_object.close( ) #close the file def pre_file(filename): print("read file %s.txt....."%filename) #show process content = open( str(filename) + '.txt', "r").read() content = content.lower() for ch in '!"#$%&()*+,-./:;<=>[email protected][\\]^_‘{|}~' : #cancel the punction content = content.replace(ch, " ") plurals = content.split() #split the file at '\n' or ' ' stemmer = PorterStemmer() #prepare for stemming singles = [stemmer.stem(plural) for plural in plurals] #handling stemming return singles #main function def main(): print "read index....." #show process input = open('index.txt', 'r') #titles that need to be handled all_the_file =input.read( ) file=all_the_file.split() input.close() #close the file fdist1=FreqDist() #create a new dist for x in range( 0, len(file) ): #print file[x] txt = pre_file( file[x] ) #pre handing the txt for words in txt : words =words.decode('utf-8').encode(sys.getfilesystemencoding()) #change string typt from utf-8 to gbk fdist1[words] +=1 #add it to the dist output_count(fdist1) #runfile if __name__ == '__main__': main()

index.py

#-*- coding:utf-8 -*-
'''
@author birdy qian
'''

import sys
import pickle                   
from nltk import *                                                                                          #import natural-language-toolkit
from operator import itemgetter                                                                 #for sort


STOPWORDS = []                                                                                          #grobal variable

def output_index(result):
    """Persist the inverted index *result* (word -> list of titles) to 'data.pkl'."""
    with open('data.pkl', 'wb') as output:
        pickle.dump(result, output)         # default pickle protocol, matches query.py's load


def pre_file(filename):
    """Read <filename>.txt and return its stemmed tokens with stop words removed.

    Lower-cases the text, replaces punctuation with spaces, splits on
    whitespace, drops STOPWORDS tokens, and Porter-stems what remains.
    """
    global STOPWORDS
    print("read file %s.txt....." % filename)                       # show progress
    with open(str(filename) + '.txt', "r") as handle:               # close the file deterministically
        content = handle.read().lower()

    # NOTE(review): the published punctuation string was corrupted by the blog
    # platform ('?@' became '[email protected]', the quote became '��'); restored here.
    for ch in '!"#$%&()*+,-./:;<=>?@[\\]^_‘{|}~':
        content = content.replace(ch, " ")

    plurals = content.split()                                       # split at whitespace

    # Filter stop words as whole tokens. The original replaced them as raw
    # substrings, which also destroyed the interiors of longer words
    # (e.g. stop word 'a' turned 'cat' into 'c t').
    stopset = set(STOPWORDS)
    plurals = [word for word in plurals if word not in stopset]

    stemmer = PorterStemmer()                                       # prepare for stemming
    singles = [stemmer.stem(plural) for plural in plurals]          # handle stemming

    return singles

def readfile(filename):
    """Return the whitespace-separated tokens of the file *filename*."""
    with open(filename, 'r') as handle:         # closed automatically
        text = handle.read()
    return text.split()                         # splits at '\n' or ' '



#main function
def main():
    """Build the inverted index and pickle it to data.pkl.

    For every title listed in index.txt, reads and stems <title>.txt and
    records which titles contain each word (file names only — positions
    within the file are not recorded).
    """
    global STOPWORDS
    print("read index.....")                                            # show progress
    file = readfile('index.txt')
    print("read stopwords.....")
    STOPWORDS = readfile('stop_word.txt')
    # NOTE(review): the original also read thefile.txt into an unused local
    # ('word'); that dead read has been removed.

    result = {}                                                         # word -> list of titles containing it

    for x in range(0, len(file)):
        txt = pre_file(file[x])                                         # file[x] is the title
        txt = {}.fromkeys(txt).keys()                                   # de-duplicate words within one file

        for words in txt:
            # Python 2 only: re-encode utf-8 bytes to the filesystem encoding (gbk)
            words = words.decode('utf-8').encode(sys.getfilesystemencoding())
            # Append this title to the word's posting list, creating it on first sight
            result.setdefault(words, []).append(file[x])

    output_index(result)



#runfile
if __name__ == '__main__': 
    main()

query.py

#-*- coding:utf-8 -*-
'''
@author birdy qian
'''
import os 
import sys
import pprint, pickle
from nltk import PorterStemmer

def readfile(filename):
    """Read *filename* and return its contents split at whitespace."""
    with open(filename, 'r') as src:
        return src.read().split()       # split at '\n' or ' '; file closed by 'with'

def getdata():
    """Load and return the pickled inverted index from 'data.pkl'."""
    with open('data.pkl', 'rb') as pkl_file:    # index is saved in 'data.pkl'
        return pickle.load(pkl_file)            # unpickle the word -> titles mapping

def output( result ):
    """Print query results ten records at a time.

    *result* is None (single unknown term), [] (empty intersection), or a
    list of titles. For more than ten records the user is prompted after
    each page; any input other than 'N' stops the paging.
    Python 2 only: relies on print statements, raw_input, and integer
    division in 'len(result) / 10'.
    """
    #print result
    if result == None:                                              #if the words is not in the index (one word return None)
        print None
        return
    if len(result) == 0 :                                           #if the words is not in the index (more than one words return [] )
        print None
        return 

    if len(result) < 10 :                                               #if the records is less than 10
        print result

    else:                                                                   #if the records is more than 10
        print 'get '+ str(len(result)) + ' records'                                                                         #the record number
        for i in range( 0 , len(result) / 10 +1):                       # integer division (Python 2)
            print '10 records start from ' +str(i*10+1)

            if 10 * i + 9 < len(result) :                                                                                           #print from 10 * i to 10 * i + 10
                print result[ 10 * i : 10 * i + 10 ]
            else:                                                                                                                           #print from 10 * i to end
                print result[ 10 * i :  len(result) ]
                break
            getstr = raw_input("Enter 'N' for next ten records & other input to quit : ")
            if getstr != 'N':
                break



#main function
def _clean_query(raw, stopwords, stemmer):
    """Lower-case *raw*, drop stop-word tokens, and return the stemmed tokens.

    Stop words are removed as whole tokens; the original replaced them as raw
    substrings, which also mangled the interiors of longer words.
    """
    tokens = raw.lower().split()
    stopset = set(stopwords)
    return [stemmer.stem(tok) for tok in tokens if tok not in stopset]


# main function
def main():
    """Interactive query loop: AND together the posting lists of each query term."""
    data_list = getdata()                                               # read the pickled index
    STOPWORDS = readfile('stop_word.txt')
    stemmer = PorterStemmer()                                           # prepare for stemming

    while True:
        get_str = raw_input("Enter your query('\\'to quit): ")
        if get_str == '\\':                                             # leave the loop
            break

        query_list = _clean_query(get_str, STOPWORDS, stemmer)
        while not query_list:                                           # everything was a stop word
            get_str = raw_input("Please enter more information: ")
            query_list = _clean_query(get_str, STOPWORDS, stemmer)

        result = []
        for k in range(0, len(query_list)):
            # Missing term -> empty posting list. The original passed None
            # into set.intersection and crashed on unknown terms.
            postings = data_list.get(query_list[k]) or []
            if k == 0:                                                  # first term seeds the result
                result = postings
            else:                                                       # later terms narrow it
                result = list(set(result).intersection(postings))
        output(result)


#runfile
if __name__ == '__main__': 
    main()