A Simple Apache Lucene Example

阿新 • Published: 2019-01-03

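Lucene is an open-source full-text search toolkit supported and distributed by the Apache Software Foundation. It started as a subproject of the Apache Jakarta project and is now a top-level Apache project. Lucene is not a complete search application. It is a library that provides a full indexing engine and query engine, plus text analyzers (initially for English and German; analyzers for other languages, such as the Chinese one used below, ship as separate modules). Its goal is to give developers a simple, easy-to-use toolkit for adding full-text search to a target system, or a foundation on which to build a complete search engine. Lucene exposes a simple yet powerful API for full-text indexing and searching. It is a mature, free, open-source tool in the Java ecosystem and has for years been among the most popular free Java information-retrieval libraries. An information-retrieval library is related to a search engine, but the two should not be confused.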
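
To make the terms "indexing engine" and "query engine" a little more concrete, here is a toy inverted index in plain Java. It is only a sketch of the general idea and not how Lucene is implemented: the class name ToyInvertedIndex, its methods, and the simplified tokenizer are all made up for illustration. The escape \u4e2d\u570b in the demo stands for the string 中國.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Toy inverted index: maps each token to the ids of the documents containing it.
// Illustration only; Lucene's real index format and analyzers are far more elaborate.
class ToyInvertedIndex {
    private final Map<String, TreeSet<Integer>> postings = new HashMap<>();
    private final List<String> names = new ArrayList<>();

    // "Indexing": tokenize the text and record which document each token came from.
    int add(String name, String text) {
        int docId = names.size();
        names.add(name);
        for (String token : tokenize(text)) {
            postings.computeIfAbsent(token, k -> new TreeSet<>()).add(docId);
        }
        return docId;
    }

    // "Searching": return the names of documents that contain every query token.
    List<String> search(String query) {
        TreeSet<Integer> hits = null;
        for (String token : tokenize(query)) {
            TreeSet<Integer> docs = postings.getOrDefault(token, new TreeSet<>());
            if (hits == null) {
                hits = new TreeSet<>(docs);
            } else {
                hits.retainAll(docs);
            }
        }
        List<String> result = new ArrayList<>();
        if (hits != null) {
            for (int id : hits) {
                result.add(names.get(id));
            }
        }
        return result;
    }

    // Simplified analysis: runs of letters/digits become lower-cased tokens,
    // except that every Han ideograph becomes a token of its own.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder word = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            boolean han = Character.UnicodeScript.of(cp) == Character.UnicodeScript.HAN;
            if (!han && Character.isLetterOrDigit(cp)) {
                word.appendCodePoint(Character.toLowerCase(cp));
                continue;
            }
            if (word.length() > 0) {
                tokens.add(word.toString());
                word.setLength(0);
            }
            if (han) {
                tokens.add(new String(Character.toChars(cp)));
            }
        }
        if (word.length() > 0) {
            tokens.add(word.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        ToyInvertedIndex index = new ToyInvertedIndex();
        index.add("1.txt", "Lucene full-text search \u4e2d\u570b");
        index.add("2.txt", "plain text without the keyword");
        System.out.println(index.search("\u4e2d\u570b"));
        System.out.println(index.search("text"));
    }
}
```

The point of the sketch is the split of responsibilities: add() does the analysis and writes the postings once, so that search() only has to intersect small sets instead of rescanning every file.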

Official website: http://lucene.apache.org

A simple example

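1. Add the Maven dependencies
JDK version: 1.8.0_181
Lucene version: 4.0.0
POI version: 3.17, which can handle Word and Excel files from 2016 and later
The latest versions can be looked up on mvnrepository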

    <properties>
        <lucene.version>4.0.0</lucene.version>
        <poi.version>3.17</poi.version>
    </properties> 

    
    <dependencies>
        <!-- Lucene core packages START -->
        <dependency>
            <groupId>org.apache.lucene</groupId>
            <artifactId>lucene-core</artifactId>
            <version>${lucene.version}</version>
        </dependency> 

        <!-- General-purpose analyzers, suitable for English text -->
        <dependency>
            <groupId>org.apache.lucene</groupId>
            <artifactId>lucene-analyzers-common</artifactId>
            <version>${lucene.version}</version>
        </dependency>
        <!-- Chinese analyzer -->
        <dependency>
            <groupId>org.apache.lucene</groupId>
            <artifactId>lucene-analyzers-smartcn</artifactId>
            <version>${lucene.version}</version>
        </dependency>
        <!-- Query parser for searching the analyzed index -->
        <dependency>
            <groupId>org.apache.lucene</groupId>
            <artifactId>lucene-queryparser</artifactId>
            <version>${lucene.version}</version>
        </dependency>
        <!-- Highlighting of matched search keywords -->
        <dependency>
            <groupId>org.apache.lucene</groupId>
            <artifactId>lucene-highlighter</artifactId>
            <version>${lucene.version}</version>
        </dependency>
        <!-- Lucene core packages END -->

        <!-- Excel and Word file handling dependencies START -->
        <dependency>
            <groupId>org.apache.poi</groupId>
            <artifactId>poi</artifactId>
            <version>${poi.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.poi</groupId>
            <artifactId>poi-ooxml</artifactId>
            <version>${poi.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.poi</groupId>
            <artifactId>poi-scratchpad</artifactId>
            <version>${poi.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.poi</groupId>
            <artifactId>poi-ooxml-schemas</artifactId>
            <version>${poi.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.commons</groupId>
            <artifactId>commons-lang3</artifactId>
            <version>3.8</version>
        </dependency>
        <!-- Excel and Word file handling dependencies END -->
    </dependencies>

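2. Create the files to search
In the folder D:\luceneData\, manually create three files, 1.txt, 2.docx, and 3.xlsx, each containing text that includes the two Chinese characters “中國”.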
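
If you would rather script this step for the plain-text file, something like the following should work. It is a minimal sketch: the class SampleDataWriter, its methods, and the sample sentence are made up for illustration, and it writes the file in GBK on the assumption that it will later be read back with that charset. The .docx and .xlsx files are easier to create by hand in Word and Excel. The escape \u4e2d\u570b is the string 中國.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper that prepares the plain-text sample file for the example.
class SampleDataWriter {
    // Assumption: the indexing code reads .txt files as GBK, so write them as GBK too.
    private static final Charset GBK = Charset.forName("GBK");

    // Creates dataDir if needed and writes 1.txt with a sentence containing the keyword.
    static Path writeSampleTxt(Path dataDir) {
        try {
            Files.createDirectories(dataDir);
            Path file = dataDir.resolve("1.txt");
            String text = "Lucene sample text containing \u4e2d\u570b";
            return Files.write(file, text.getBytes(GBK));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads the file back with the same charset, to check that the text survived.
    static String readBack(Path file) {
        try {
            return new String(Files.readAllBytes(file), GBK);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Pass the data folder as the first argument; defaults to ./luceneData
        Path dir = Paths.get(args.length > 0 ? args[0] : "luceneData");
        Path file = writeSampleTxt(dir);
        System.out.println("wrote " + file.toAbsolutePath());
        System.out.println(readBack(file));
    }
}
```

Writing and reading with the same explicit charset is the important part; mixing GBK and UTF-8 between the two steps is the usual cause of garbled Chinese text.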

3. Build the index for the file directory

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;
import org.apache.poi.POIXMLTextExtractor;
import org.apache.poi.hssf.usermodel.HSSFWorkbook;
import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.ss.usermodel.Cell;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.ss.usermodel.Workbook;
import org.apache.poi.xssf.usermodel.XSSFWorkbook;
import org.apache.poi.xwpf.extractor.XWPFWordExtractor;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.io.*;
import java.util.ArrayList;
import java.util.Date;
import java.util.Iterator;
import java.util.List;

public class CreateLuceneIndex {
    private final static Logger log = LoggerFactory.getLogger(CreateLuceneIndex.class);

    private static String content = "";// text extracted from the current file
    private static String INDEX_DIR = "D:\\luceneIndex";// where the index is stored
    private static String DATA_DIR = "D:\\luceneData";// where the files to index are stored
    private static Analyzer analyzer = null;
    private static Directory directory = null;
    private static IndexWriter indexWriter = null;

    /**
     * Builds the index for the files in the given directory.
     *
     * @param path the directory to index
     * @return true on success, false if any file fails to index
     */
    public static boolean createIndex(String path) {
        Date date1 = new Date();
        File indexFile = new File(INDEX_DIR);
        if (!indexFile.exists()) {
            indexFile.mkdirs();
        }
        List<File> fileList = getFileList(path);
        for (File file : fileList) {
            content = "";
            // Get the file extension
            String type = file.getName().substring(file.getName().lastIndexOf(".") + 1);
            if ("txt".equalsIgnoreCase(type)) {
                content += txt2String(file);
            } else if ("doc".equalsIgnoreCase(type) || "docx".equalsIgnoreCase(type)) {
                content += doc2String(file);
            } else if ("xls".equalsIgnoreCase(type) || "xlsx".equalsIgnoreCase(type)) {
                content += xls2String(file);
            }

            System.out.println("name :" + file.getName());
            System.out.println("path :" + file.getPath());
            System.out.println("content :" + content);
            System.out.println("=======================");

            try {
                // StandardAnalyzer splits Chinese text into single-character tokens;
                // SmartChineseAnalyzer (lucene-analyzers-smartcn) segments it into words instead
                analyzer = new StandardAnalyzer(Version.LUCENE_40);
                directory = FSDirectory.open(new File(INDEX_DIR));

                IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_40, analyzer);
                indexWriter = new IndexWriter(directory, config);
                Document document = new Document();
                document.add(new TextField("filename", file.getName(), Field.Store.YES));
                document.add(new TextField("content", content, Field.Store.YES));
                document.add(new TextField("path", file.getPath(), Field.Store.YES));
                indexWriter.addDocument(document);// add the document
                indexWriter.commit();
                closeWriter();// close the writer once the document has been committed
            } catch (Exception e) {
                log.error("Failed to build the directory index: " + e.getMessage(), e);
                return false;
            }
        }
        Date date2 = new Date();
        System.out.println("Index built in " + (date2.getTime() - date1.getTime()) + "ms");
        return true;
    }

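    /**
     * Filters the files in a directory.
     *
     * @param dirPath the directory to list
     * @return the list of matching files
     */
    private static List<File> getFileList(String dirPath) {
        File[] files = new File(dirPath).listFiles();
        List<File> fileList = new ArrayList<File>();
        if (files == null) {
            // listFiles() returns null when dirPath is not a readable directory
            return fileList;
        }
        for (File file : files) {
            if (isTxtFile(file.getName())) {
                fileList.add(file);
            }
        }
        return fileList;
    }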

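    /**
     * Reads the content of a txt file.
     *
     * @param file the file to read
     * @return the file content
     */
    private static String txt2String(File file) {
        StringBuilder result = new StringBuilder();
        // Read through a BufferedReader with an explicit charset (GBK here) to avoid garbled Chinese;
        // try-with-resources closes the reader even if reading fails
        try (BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream(file), "GBK"))) {
            String s;
            // readLine() returns one line at a time, without the line terminator
            while ((s = br.readLine()) != null) {
                // keep a separator so words on adjacent lines are not glued together
                result.append(s).append('\n');
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
        return result.toString();
    }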

    /**
     * Reads the content of a Word file (.doc or .docx).
     *
     * @param file the file to read
     * @return the file content
     */
    public static String doc2String(File file) {
        String result = "";
        if (file.exists() && file.isFile()) {
            // Declared outside the try block so the stream can be closed in finally
            FileInputStream fis = null;
            HWPFDocument doc = null;
            XWPFDocument docx = null;
            POIXMLTextExtractor extractor = null;
            try {
                fis = new FileInputStream(file);
                // Distinguish the two Word formats, doc and docx
                if (file.getPath().toLowerCase().endsWith(".doc")) {
                    doc = new HWPFDocument(fis);
                    // text content of the document
                    result = doc.getDocumentText();
                } else if (file.getPath().toLowerCase().endsWith(".docx")) {
                    docx = new XWPFDocument(fis);
                    extractor = new XWPFWordExtractor(docx);
                    // text content of the document
                    result = extractor.getText();
                } else {
                    log.error("Not a Word document: " + file.getPath());
                }
            } catch (Exception e) {
                log.error("Failed to read Word file: " + e.getMessage(), e);
            } finally {
                try {
                    if (doc != null) {
                        doc.close();
                    }
                    if (extractor != null) {
                        extractor.close();
                    }
                    if (docx != null) {
                        docx.close();
                    }
                    if (fis != null) {
                        fis.close();
                    }
                } catch (Exception e) {
                    log.error("Failed to close IO resources: " + e.getMessage(), e);
                }
            }
        }
        return result;
    }

    /**
     * Reads the content of an Excel file (.xls or .xlsx).
     *
     * @param file the file to read
     * @return the file content
     */
    public static String xls2String(File file) {
        String result = "";
        FileInputStream fis = null;
        try {
            fis = new FileInputStream(file);
            Workbook workbook = null;
            // Distinguish the two Excel formats, xls and xlsx
            if (file.getPath().toLowerCase().endsWith("xlsx")) {
                workbook = new XSSFWorkbook(fis);
            } else if (file.getPath().toLowerCase().endsWith("xls")) {
                workbook = new HSSFWorkbook(fis);
            }
            // Total number of sheets
            int numberOfSheets = workbook.getNumberOfSheets();
            //System.out.println(numberOfSheets + " sheet(s) in total");
            // Loop over every sheet
            for (int i = 0; i < numberOfSheets; i++) {
                // Get the i-th sheet
                Sheet sheet = workbook.getSheetAt(i);
                //System.out.println(sheet.getSheetName() + "  sheet");
                // Get the row iterator
                Iterator<Row> rowIterator = sheet.iterator();
                int rowCount = 0;
                // Loop over every row
                while (rowIterator.hasNext()) {
                    //System.out.print("row " + (++rowCount) + "  ");
                    // Get the row object
                    Row row = rowIterator.next();
                    // Get the cell iterator for this row
                    Iterator<Cell> cellIterator = row.cellIterator();
                    int columnCount = 0;
                    // Loop over every column
                    while (cellIterator.hasNext()) {
                        //System.out.print("column " + (++columnCount) + ":");
                        // Get the cell object
                        Cell cell = cellIterator.next

 
 
              
           
              
              
            