Search Engine Series, Part 4: Lucene's Built-in Analyzers and Integrating the IKAnalyzer Chinese Analyzer

阿新 • Published: 2018-05-05


I. Lucene's built-in analyzers: StandardAnalyzer and SmartChineseAnalyzer

1. Create a Maven project named LuceneAnalyzer for testing the analyzers that ship with Lucene

2. Add the following dependencies to pom.xml

        <!-- Lucene core module -->
        <dependency>
            <groupId>org.apache.lucene</groupId>
            <artifactId>lucene-core</artifactId>
            <version>7.3.0</version>
        </dependency>

        <!-- lucene-analyzers-smartcn: Lucene's Chinese analyzer module, which provides SmartChineseAnalyzer -->
        <dependency>
            <groupId>org.apache.lucene</groupId>
            <artifactId>lucene-analyzers-smartcn</artifactId>
            <version>7.3.0</version>
        </dependency>

3. Create a test class for StandardAnalyzer, LuceneStandardAnalyzerTest

package com.luceneanalyzer.use.standardanalyzer;

import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

/**
 * Using StandardAnalyzer from the Lucene core module.
 * It segments English well but handles Chinese poorly.
 * @author THINKPAD
 *
 */
public class LuceneStandardAnalyzerTest {

    // Consume the stream in the order TokenStream requires:
    // reset(), incrementToken() until it returns false, end(), close().
    private static void doToken(TokenStream ts) throws IOException {
        ts.reset();
        CharTermAttribute cta = ts.getAttribute(CharTermAttribute.class);
        while (ts.incrementToken()) {
            System.out.print(cta.toString() + "|");
        }
        System.out.println();
        ts.end();
        ts.close();
    }

    public static void main(String[] args) throws IOException {
        String etext = "Analysis is one of the main causes of slow indexing. Simply put, the more you analyze the slower analyze the indexing (in most cases).";
        String chineseText = "張三說的確實在理。";
        // StandardAnalyzer from the Lucene core module
        try (Analyzer ana = new StandardAnalyzer()) {
            TokenStream ts = ana.tokenStream("content", etext);
            System.out.println("StandardAnalyzer, English text:");
            doToken(ts);
            ts = ana.tokenStream("content", chineseText);
            System.out.println("StandardAnalyzer, Chinese text:");
            doToken(ts);
        }
    }
}

Output:

StandardAnalyzer, English text:
analysis|one|main|causes|slow|indexing|simply|put|more|you|analyze|slower|analyze|indexing|most|cases|
StandardAnalyzer, Chinese text:
張|三|說|的|確|實|在|理|
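
The English result above is all lowercase and contains no "is", "of", "the" or "in". That is consistent with StandardAnalyzer in Lucene 7.x lowercasing each token and dropping a default set of English stop words. The sketch below is only a rough plain-Java approximation of that pipeline, not Lucene code: the class name StopWordSketch is made up for illustration, the stop-word set is a small illustrative subset, and splitting on non-alphanumeric characters is a stand-in for Lucene's real tokenizer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper for illustration; not part of Lucene.
class StopWordSketch {

    // Illustrative subset only, not Lucene's full default stop-word set.
    static final Set<String> STOP_WORDS = Set.of("is", "of", "the", "in");

    // Lowercase the text, split on runs of non-alphanumeric characters,
    // and drop empty strings and stop words.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!raw.isEmpty() && !STOP_WORDS.contains(raw)) {
                tokens.add(raw);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String etext = "Analysis is one of the main causes of slow indexing. "
                + "Simply put, the more you analyze the slower analyze the indexing (in most cases).";
        // Print the tokens in the same "token|" style as the test class.
        System.out.println(String.join("|", analyze(etext)) + "|");
    }
}
```

On the sample sentence this toy pipeline yields the same sixteen tokens as the StandardAnalyzer run, which suggests that the missing words are the effect of stop-word filtering rather than a tokenization error.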

4. Create a test class for SmartChineseAnalyzer, the Chinese analyzer provided by Lucene

package com.luceneanalyzer.use.smartchineseanalyzer;

import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

/**
 * Using SmartChineseAnalyzer from Lucene's Chinese analyzer module, lucene-analyzers-smartcn.
 * In this test the results are unsatisfactory for both languages: English tokens
 * come out stemmed (e.g. "analysi"), and the Chinese sentence is still split
 * mostly into single characters.
 *
 * @author THINKPAD
 *
 */
public class LuceneSmartChineseAnalyzerTest {

    // Consume the stream in the order TokenStream requires:
    // reset(), incrementToken() until it returns false, end(), close().
    private static void doToken(TokenStream ts) throws IOException {
        ts.reset();
        CharTermAttribute cta = ts.getAttribute(CharTermAttribute.class);
        while (ts.incrementToken()) {
            System.out.print(cta.toString() + "|");
        }
        System.out.println();
        ts.end();
        ts.close();
    }

    public static void main(String[] args) throws IOException {
        String etext = "Analysis is one of the main causes of slow indexing. Simply put, the more you analyze the slower analyze the indexing (in most cases).";
        String chineseText = "張三說的確實在理。";
        // Lucene's Chinese analyzer, SmartChineseAnalyzer
        try (Analyzer smart = new SmartChineseAnalyzer()) {
            TokenStream ts = smart.tokenStream("content", etext);
            System.out.println("SmartChineseAnalyzer, English text:");
            doToken(ts);
            ts = smart.tokenStream("content", chineseText);
            System.out.println("SmartChineseAnalyzer, Chinese text:");
            doToken(ts);
        }
    }
}

Output:

SmartChineseAnalyzer, English text:
analysi|is|on|of|the|main|caus|of|slow|index|simpli|put|the|more|you|analyz|the|slower|analyz|the|index|in|most|case|
SmartChineseAnalyzer, Chinese text:
張|三|說|的|確實|在|理|

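II. Integrating the IKAnalyzer Chinese analyzer

IKAnalyzer is an open-source, lightweight Chinese analyzer that is widely used.

It was originally developed for use with Lucene and later grew into a standalone segmentation component. It only supports Lucene up to version 4.0, so using it with later Lucene versions requires a small amount of integration work.

The integration is needed because the API of Analyzer's createComponents method changed.

IKAnalyzer offers two segmentation modes: fine-grained and smart.

Integration steps

1. Locate the Lucene support classes shipped in the IKAnalyzer package, and compare IKAnalyzer's createComponents method with the current signature.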

The createComponents method as written for Lucene 4.0 and earlier:

@Override
protected TokenStreamComponents createComponents(String fieldName, final Reader in) {
    Tokenizer _IKTokenizer = new IKTokenizer(in, this.useSmart());
    return new TokenStreamComponents(_IKTokenizer);
}

The current createComponents method, which no longer receives a Reader:

protected abstract TokenStreamComponents createComponents(String fieldName);

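2. Using these two classes (the analyzer and the tokenizer) as a model, create new versions: copy the code from each class directly and adjust the parameters.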
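
Dropping the Reader parameter means a tokenizer is now constructed without its input, and the text to analyze is attached afterwards. The plain-Java sketch below only illustrates that pattern; DeferredInputTokenizer is a made-up toy class, not Lucene or IKAnalyzer API, and whitespace splitting stands in for real segmentation.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical toy class for illustration; not part of Lucene or IKAnalyzer.
class DeferredInputTokenizer {

    private Reader input; // stays null until setReader is called

    // No Reader parameter, mirroring the shape of createComponents(String fieldName).
    DeferredInputTokenizer() {
    }

    // The owner supplies the text later, once per document.
    void setReader(Reader reader) {
        this.input = reader;
    }

    // Read the current input completely and split it on whitespace.
    List<String> tokens() {
        if (input == null) {
            throw new IllegalStateException("setReader must be called first");
        }
        StringBuilder text = new StringBuilder();
        try {
            for (int c = input.read(); c != -1; c = input.read()) {
                text.append((char) c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        input = null; // each input is consumed exactly once
        List<String> out = new ArrayList<>();
        for (String t : text.toString().split("\\s+")) {
            if (!t.isEmpty()) {
                out.add(t);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        DeferredInputTokenizer tk = new DeferredInputTokenizer(); // built once
        tk.setReader(new StringReader("first document"));
        System.out.println(tk.tokens());
        tk.setReader(new StringReader("second document")); // same instance, new input
        System.out.println(tk.tokens());
    }
}
```

The rewritten tokenizer later in this article follows the same idea: its constructor takes no Reader, and it hands the actual input to the IK segmenter in reset().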

Now for the integration itself:

1. Create a Maven project named IkanalyzerIntegrated

2. Add the following dependencies to pom.xml

        <!-- Lucene core module -->
        <dependency>
            <groupId>org.apache.lucene</groupId>
            <artifactId>lucene-core</artifactId>
            <version>7.3.0</version>
        </dependency>

        <!-- ikanalyzer Chinese analyzer -->
        <dependency>
            <groupId>com.janeluo</groupId>
            <artifactId>ikanalyzer</artifactId>
            <version>2012_u6</version>
            <!-- Exclude the old Lucene artifacts it pulls in, because we rewrite its analyzer and tokenizer -->
            <exclusions>
                <exclusion>
                    <groupId>org.apache.lucene</groupId>
                    <artifactId>lucene-core</artifactId>
                </exclusion>
                <exclusion>
                    <groupId>org.apache.lucene</groupId>
                    <artifactId>lucene-queryparser</artifactId>
                </exclusion>
                <exclusion>
                    <groupId>org.apache.lucene</groupId>
                    <artifactId>lucene-analyzers-common</artifactId>
                </exclusion>
            </exclusions>
        </dependency>

3. Rewrite the analyzer

package com.study.lucene.ikanalyzer.Integrated;

import org.apache.lucene.analysis.Analyzer;

/**
 * The analyzer has to be reimplemented because the API of
 * Analyzer's createComponents method changed.
 * @author THINKPAD
 *
 */
public class IKAnalyzer4Lucene7 extends Analyzer {

    // false = fine-grained segmentation, true = smart segmentation
    private boolean useSmart = false;

    public IKAnalyzer4Lucene7() {
        this(false);
    }

    public IKAnalyzer4Lucene7(boolean useSmart) {
        super();
        this.useSmart = useSmart;
    }

    public boolean isUseSmart() {
        return useSmart;
    }

    public void setUseSmart(boolean useSmart) {
        this.useSmart = useSmart;
    }

    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        // No Reader is passed in here; the tokenizer picks up its input in reset()
        IKTokenizer4Lucene7 tk = new IKTokenizer4Lucene7(this.useSmart);
        return new TokenStreamComponents(tk);
    }

}

4. Rewrite the tokenizer

package com.study.lucene.ikanalyzer.Integrated;

import java.io.IOException;

import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;
import org.wltea.analyzer.core.IKSegmenter;
import org.wltea.analyzer.core.Lexeme;

/**
 * The tokenizer has to be reimplemented because the API of
 * Analyzer's createComponents method changed.
 * @author THINKPAD
 *
 */
public class IKTokenizer4Lucene7 extends Tokenizer {

    // The IK segmenter implementation
    private IKSegmenter _IKImplement;

    // Token text attribute
    private final CharTermAttribute termAtt;
    // Token offset attribute
    private final OffsetAttribute offsetAtt;
    // Token type attribute (see the type constants in org.wltea.analyzer.core.Lexeme)
    private final TypeAttribute typeAtt;
    // End position of the last token
    private int endPosition;

    /**
     * @param useSmart true for smart segmentation, false for fine-grained segmentation
     */
    public IKTokenizer4Lucene7(boolean useSmart) {
        super();
        offsetAtt = addAttribute(OffsetAttribute.class);
        termAtt = addAttribute(CharTermAttribute.class);
        typeAtt = addAttribute(TypeAttribute.class);
        // input is only a placeholder at this point; the real Reader is passed in reset()
        _IKImplement = new IKSegmenter(input, useSmart);
    }

    /*
     * (non-Javadoc)
     * 
     * @see org.apache.lucene.analysis.TokenStream#incrementToken()
     */
    @Override
    public boolean incrementToken() throws IOException {
        // Clear all token attributes
        clearAttributes();
        Lexeme nextLexeme = _IKImplement.next();
        if (nextLexeme != null) {
            // Convert the Lexeme into attributes
            // Set the token text
            termAtt.append(nextLexeme.getLexemeText());
            // Set the token length
            termAtt.setLength(nextLexeme.getLength());
            // Set the token offsets
            offsetAtt.setOffset(nextLexeme.getBeginPosition(),
                    nextLexeme.getEndPosition());
            // Remember where the last token ended
            endPosition = nextLexeme.getEndPosition();
            // Record the token type
            typeAtt.setType(nextLexeme.getLexemeTypeString());
            // Return true: a token was produced and more may follow
            return true;
        }
        // Return false: all tokens have been emitted
        return false;
    }

    /*
     * (non-Javadoc)
     * 
     * @see org.apache.lucene.analysis.Tokenizer#reset()
     */
    @Override
    public void reset() throws IOException {
        super.reset();
        // After super.reset(), input refers to the Reader for the current text
        _IKImplement.reset(input);
    }

    @Override
    public final void end() throws IOException {
        // Let the base class set its end-of-stream attribute state first
        super.end();
        // set final offset
        int finalOffset = correctOffset(this.endPosition);
        offsetAtt.setOffset(finalOffset, finalOffset);
    }
}

5. Create a test class for IKAnalyzer, IKAnalyzerTest

package com.study.lucene.ikanalyzer.Integrated;

import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

/**
 * IKAnalyzer integration test:
 * fine-grained mode splits the text into the smallest words it can find;
 * smart mode splits by dictionary in a way that matches natural reading.
 *
 * @author THINKPAD
 *
 */
public class IKAnalyzerTest {

    // Consume the stream in the order TokenStream requires:
    // reset(), incrementToken() until it returns false, end(), close().
    private static void doToken(TokenStream ts) throws IOException {
        ts.reset();
        CharTermAttribute cta = ts.getAttribute(CharTermAttribute.class);
        while (ts.incrementToken()) {
            System.out.print(cta.toString() + "|");
        }
        System.out.println();
        ts.end();
        ts.close();
    }

    public static void main(String[] args) throws IOException {

        String etext = "Analysis is one of the main causes of slow indexing. Simply put, the more you analyze the slower analyze the indexing (in most cases).";
        String chineseText = "張三說的確實在理。";
        /*
         * Because the API of Analyzer's createComponents method changed, we use our own
         * analyzer IKAnalyzer4Lucene7 and tokenizer IKTokenizer4Lucene7.
         */
        // IKAnalyzer, fine-grained mode
        try (Analyzer ik = new IKAnalyzer4Lucene7()) {
            TokenStream ts = ik.tokenStream("content", etext);
            System.out.println("IKAnalyzer, fine-grained mode, English text:");
            doToken(ts);
            ts = ik.tokenStream("content", chineseText);
            System.out.println("IKAnalyzer, fine-grained mode, Chinese text:");
            doToken(ts);
        }

        // IKAnalyzer, smart mode
        try (Analyzer ik = new IKAnalyzer4Lucene7(true)) {
            TokenStream ts = ik.tokenStream("content", etext);
            System.out.println("IKAnalyzer, smart mode, English text:");
            doToken(ts);
            ts = ik.tokenStream("content", chineseText);
            System.out.println("IKAnalyzer, smart mode, Chinese text:");
            doToken(ts);
        }
    }
}

Output:

IKAnalyzer, fine-grained mode, English text:
analysis|is|one|of|the|main|causes|of|slow|indexing.|indexing|simply|put|the|more|you|analyze|the|slower|analyze|the|indexing|in|most|cases|
IKAnalyzer, fine-grained mode, Chinese text:
張三|三|說的|的確|的|確實|實在|在理|
IKAnalyzer, smart mode, English text:
analysis|is|one|of|the|main|causes|of|slow|indexing.|simply|put|the|more|you|analyze|the|slower|analyze|the|indexing|in|most|cases|
IKAnalyzer, smart mode, Chinese text:
張三|說的|確實|在理|
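
The two Chinese results above differ in one main way: fine-grained mode emits overlapping words, while smart mode picks a single non-overlapping split. The toy below sketches that difference in plain Java. It is not IK's algorithm: ToySegmenter and its tiny dictionary are hypothetical and hand-picked for this one sentence, and forward maximum matching is swapped in as a simplified stand-in for IK's own disambiguation logic.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical toy for illustration; not part of IKAnalyzer or Lucene.
class ToySegmenter {

    // Hand-picked dictionary, an assumption chosen to fit the sample sentence.
    static final Set<String> DICT =
            Set.of("張三", "三", "說的", "的確", "的", "確實", "實在", "在理");
    static final int MAX_WORD_LEN = 2; // longest entry in DICT

    // Fine-grained: at every position, emit every dictionary word
    // starting there, longest first. Overlaps are kept.
    static List<String> fineGrained(String text) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            for (int len = Math.min(MAX_WORD_LEN, text.length() - i); len >= 1; len--) {
                String candidate = text.substring(i, i + len);
                if (DICT.contains(candidate)) {
                    out.add(candidate);
                }
            }
        }
        return out;
    }

    // Simplified "smart" split: forward maximum matching. Take the longest
    // dictionary word at the current position, then jump past it.
    static List<String> smart(String text) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            int matched = 0;
            for (int len = Math.min(MAX_WORD_LEN, text.length() - i); len >= 1; len--) {
                if (DICT.contains(text.substring(i, i + len))) {
                    matched = len;
                    break;
                }
            }
            if (matched == 0) {
                i++; // character not in the dictionary (e.g. punctuation): skip it
            } else {
                out.add(text.substring(i, i + matched));
                i += matched;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String chineseText = "張三說的確實在理。";
        System.out.println(String.join("|", fineGrained(chineseText)) + "|");
        System.out.println(String.join("|", smart(chineseText)) + "|");
    }
}
```

With this dictionary the toy reproduces both Chinese results for the sample sentence. That agreement is by construction and says nothing about how IK behaves on other text.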

Source code:

https://github.com/leeSmall/SearchEngineDemo
