Keyword Search over File Contents with Lucene

阿新 • Published: 2019-01-02

1. Overview

For a detailed introduction to Lucene, please search online (for example on Baidu).

2. Worked Example

Before implementing anything, create the directories and files you need.

For example, I created the following paths and files:

D:/tools/LearningByMyself/lucene/source/demo1.txt

D:/tools/LearningByMyself/lucene/source/demo2.txt

D:/tools/LearningByMyself/lucene/index
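
If you would rather create this layout from code than by hand, a minimal sketch using only the JDK could look like the following. The class name DemoLayout and its create method are hypothetical helpers, not part of the original article or of Lucene. The sketch assumes the demo files should be written as UTF-8 text.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;

// Hypothetical helper: creates <base>/source and <base>/index,
// then writes one UTF-8 text file into <base>/source.
final class DemoLayout {

    static void create(File base, String fileName, String content) throws IOException {
        File sourceDir = new File(base, "source");
        File indexDir = new File(base, "index");
        ensureDirectory(sourceDir);
        ensureDirectory(indexDir);

        Writer out = new OutputStreamWriter(
                new FileOutputStream(new File(sourceDir, fileName)), "UTF-8");
        try {
            out.write(content);
        } finally {
            out.close();
        }
    }

    // Creates the directory (and any missing parents) unless it already exists.
    private static void ensureDirectory(File dir) throws IOException {
        if (!dir.isDirectory() && !dir.mkdirs()) {
            throw new IOException("Could not create directory: " + dir);
        }
    }
}
```

For the layout above, a call such as DemoLayout.create(new File("D:/tools/LearningByMyself/lucene"), "demo1.txt", text) would create both directories and the first demo file.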

Step 1: build the index. The code targets the Lucene 3.6 API (note Version.LUCENE_36) and is as follows:

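// Imports for the LuceneIndex class (Lucene 3.6), which holds the methods in steps 1 and 2.
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.IndexWriterConfig.OpenMode;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

/**
 * Builds a fresh index from every .txt file directly inside a directory.
 *
 * @param sourceFile directory whose .txt files are added to the index
 * @param indexFile  directory in which the index is stored
 * @throws Exception if the source directory cannot be listed or indexing fails
 */
public static void textFileIndexer(String sourceFile, String indexFile) throws Exception {
    File sourceDir = new File(sourceFile);
    File indexDir = new File(indexFile);

    File[] textFiles = sourceDir.listFiles();
    if (textFiles == null) {
        // listFiles() returns null when the path is not a readable directory.
        throw new IOException("Not a readable directory: " + sourceDir);
    }

    Directory dir = FSDirectory.open(indexDir);
    Analyzer luceneAnalyzer = new StandardAnalyzer(Version.LUCENE_36);
    IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_36, luceneAnalyzer);
    iwc.setOpenMode(OpenMode.CREATE); // overwrite any existing index
    IndexWriter indexWriter = new IndexWriter(dir, iwc);

    long startTime = System.currentTimeMillis();
    try {
        for (File textFile : textFiles) {
            if (textFile.isFile() && textFile.getName().endsWith(".txt")) {
                String path = textFile.getCanonicalPath();
                System.out.println("File--->" + path + " is being indexed.....");
                String body = fileReaderAll(path, "UTF-8");
                System.out.println("File content: " + body);

                Document document = new Document();
                // The path is stored for display only; it is not searchable.
                document.add(new Field("path", path, Field.Store.YES, Field.Index.NO));
                // The body is analyzed so that it can be searched by keyword.
                document.add(new Field("body", body, Field.Store.YES,
                        Field.Index.ANALYZED, Field.TermVector.WITH_POSITIONS_OFFSETS));
                indexWriter.addDocument(document);
            }
        }
    } finally {
        indexWriter.close(); // always release the index write lock
    }
    long endTime = System.currentTimeMillis();

    System.out.println("Took " + (endTime - startTime) + " ms to add the files in "
            + sourceDir.getPath() + " to the index.....");
}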

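/** Reads a whole text file into one string, joining its lines with '\n'. */
private static String fileReaderAll(String filename, String charset) throws IOException {
    BufferedReader buffer_read = new BufferedReader(
            new InputStreamReader(new FileInputStream(filename), charset));
    StringBuilder content = new StringBuilder();
    try {
        String line;
        while ((line = buffer_read.readLine()) != null) {
            if (content.length() > 0) {
                content.append('\n'); // keep a separator so words on adjacent lines do not merge
            }
            content.append(line);
        }
    } finally {
        buffer_read.close(); // close the file even if reading fails
    }
    return content.toString();
}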
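
On Java 7 or later, the same whole-file read can be sketched with java.nio.file instead of a readLine() loop. This is an alternative illustration, not the article's code: the class name FileText and the method name readAll are hypothetical. Unlike a readLine() loop, it keeps line terminators exactly as they appear in the file.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper: reads a whole file into a String (requires Java 7+).
final class FileText {

    static String readAll(String filename, String charset) throws IOException {
        // Read every byte of the file, then decode with the requested charset.
        byte[] bytes = Files.readAllBytes(Paths.get(filename));
        return new String(bytes, Charset.forName(charset));
    }
}
```

Reading all bytes at once is only reasonable for small files such as the demo texts here; for large inputs a streaming approach is the safer choice.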

Step 2: search the index for a keyword

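/**
 * Searches the "body" field of the index and reports how many documents matched.
 *
 * @param indexFile directory that holds the index
 * @param keyWords  keywords to search for
 * @throws IOException    if the index cannot be read
 * @throws ParseException if the keywords cannot be parsed as a query
 */
public static void queryKeyWords(String indexFile, String keyWords) throws IOException, ParseException {
    IndexReader reader = IndexReader.open(FSDirectory.open(new File(indexFile)));
    IndexSearcher index_search = new IndexSearcher(reader);
    try {
        // Use the same analyzer as at index time so that query terms match indexed terms.
        Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_36);
        QueryParser query_parser = new QueryParser(Version.LUCENE_36, "body", analyzer);
        Query query = query_parser.parse(keyWords);

        TopDocs result = index_search.search(query, 10); // return at most 10 hits
        ScoreDoc[] hits = result.scoreDocs;

        // hits.length is capped at 10; result.totalHits holds the total number of matches.
        if (hits.length > 0) {
            System.out.println("Keyword: " + keyWords + ", in " + indexFile + ": "
                    + hits.length + " hit(s) found.");
        } else {
            System.out.println("Keyword: " + keyWords + ", in " + indexFile + ": no hits found.");
        }
    } finally {
        index_search.close();
        reader.close(); // the searcher does not close a reader that was passed to it
    }
}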
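
Conceptually, step 1 builds a map from each term to the documents that contain it, and step 2 looks the query terms up in that map. The toy class below illustrates only that idea in plain Java. ToyIndex is a hypothetical name, the tokenization (lower-casing and splitting on anything that is not a letter or digit) is a deliberate simplification, and none of it reflects how Lucene actually analyzes, stores, or scores data.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical toy inverted index: term -> sorted set of document paths.
final class ToyIndex {

    private final Map<String, Set<String>> postings = new HashMap<String, Set<String>>();

    // "Indexing": split the body into terms and record the path under each term.
    void add(String path, String body) {
        for (String token : body.toLowerCase().split("[^\\p{L}\\p{N}]+")) {
            if (token.isEmpty()) {
                continue; // split() yields an empty first token when the body starts with a separator
            }
            Set<String> paths = postings.get(token);
            if (paths == null) {
                paths = new TreeSet<String>();
                postings.put(token, paths);
            }
            paths.add(path);
        }
    }

    // "Searching": a single lookup returns every document that contains the term.
    List<String> search(String term) {
        Set<String> paths = postings.get(term.toLowerCase());
        if (paths == null) {
            return Collections.<String>emptyList();
        }
        return new ArrayList<String>(paths);
    }
}
```

The point of the sketch is that a search never rescans the files: all the expensive work happens once at indexing time, which is why the index is built as a separate first step.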

Step 3: write a test class that exercises the two methods above. For example, my test class looks like this:

public class LuceneTest {
    public static void main(String[] args) throws Exception {
        String sourcePath = "D:/tools/LearningByMyself/lucene/source";
        String indexPath = "D:/tools/LearningByMyself/lucene/index";
        String key_words = "伺服器"; // "server"; the keyword must occur in the demo files

        // Build the index first, then search it.
        LuceneIndex.textFileIndexer(sourcePath, indexPath);
        LuceneIndex.queryKeyWords(indexPath, key_words);
    }
}

Step 4: check the result on the console. For example, my test run printed the following:

File--->D:\tools\LearningByMyself\lucene\source\demo1.txt is being indexed.....

File content: 為了保證機房的網路安全，IDC內所有伺服器不被允許從辦公網直接ssh登入，必須通過跳板機進行間接登入。使用者通過跳板機執行的所有命令（包括通過跳板機登入的其他機器後的命令）都會被儲存並審計。

File--->D:\tools\LearningByMyself\lucene\source\demo2.txt is being indexed.....

File content: Relay是我們登入IDC伺服器的跳板機，在Relay上使用者只能執行ssh、passwd等簡單命令，Relay只做ssh跳板機兒不做日常工具機使用。

Took 235 ms to add the files in D:\tools\LearningByMyself\lucene\source to the index.....

Keyword: 伺服器, in D:/tools/LearningByMyself/lucene/index: 2 hit(s) found.
