Lucene Notes 17 - Lucene Analysis - Introduction to Chinese Word Segmentation

阿新 • Published: 2018-11-04

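1. What an Analyzer Does

An analyzer's job is to produce a TokenStream. The stream carries information about each token, and you read those details through the stream's attributes.

2. A Custom Stop Analyzer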

package com.wsy;

import org.apache.lucene.analysis.*;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;
import org.apache.lucene.util.Version;

import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.Set;

public class MyStopAnalyzer extends Analyzer {
    private Set set;

    public MyStopAnalyzer(String[] stopWords) {
        // Print the English stop words built into StopAnalyzer
        System.out.println(StopAnalyzer.ENGLISH_STOP_WORDS_SET);
        set = StopFilter.makeStopSet(Version.LUCENE_35, stopWords, true);
        // Also include the default English stop words
        set.addAll(StopAnalyzer.ENGLISH_STOP_WORDS_SET);
    }

    public MyStopAnalyzer() {
        set = StopAnalyzer.ENGLISH_STOP_WORDS_SET;
    }

    @Override
    public TokenStream tokenStream(String fieldName, Reader reader) {
        return new StopFilter(Version.LUCENE_35, new LowerCaseFilter(Version.LUCENE_35, new LetterTokenizer(Version.LUCENE_35, reader)), set);
    }

    public static void displayAllToken(String string, Analyzer analyzer) {
        try {
            TokenStream tokenStream = analyzer.tokenStream("content", new StringReader(string));
            // Register the attributes so we can inspect what the stream carries
            // Position increment: the distance between this token and the previous one
            PositionIncrementAttribute positionIncrementAttribute = tokenStream.addAttribute(PositionIncrementAttribute.class);
            // Start and end character offsets of each token
            OffsetAttribute offsetAttribute = tokenStream.addAttribute(OffsetAttribute.class);
            // The text of each token
            CharTermAttribute charTermAttribute = tokenStream.addAttribute(CharTermAttribute.class);
            // The type assigned to each token by the tokenizer
            TypeAttribute typeAttribute = tokenStream.addAttribute(TypeAttribute.class);
            // Follow the documented consumer workflow: reset, consume, end, close
            tokenStream.reset();
            while (tokenStream.incrementToken()) {
                System.out.println(positionIncrementAttribute.getPositionIncrement() + ":" + charTermAttribute + "[" + offsetAttribute.startOffset() + "-" + offsetAttribute.endOffset() + "]-->" + typeAttribute.type());
            }
            tokenStream.end();
            tokenStream.close();
            System.out.println("----------------------------");
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    public static void main(String[] args) {
        // Custom stop analyzer
        Analyzer analyzer1 = new MyStopAnalyzer(new String[]{"I", "you", "hate"});
        // Default stop analyzer
        Analyzer analyzer2 = new StopAnalyzer(Version.LUCENE_35);
        String string = "how are you, thank you. I hate you.";
        MyStopAnalyzer.displayAllToken(string, analyzer1);
        MyStopAnalyzer.displayAllToken(string, analyzer2);
    }
}

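The statement below is the important one: it sets up the analyzer's Tokenizer and filter chain. To add more filters, wrap the chain in further filters in the same way.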

new StopFilter(Version.LUCENE_35, new LowerCaseFilter(Version.LUCENE_35, new LetterTokenizer(Version.LUCENE_35, reader)), set);
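To make the nesting easier to follow, here is a minimal sketch of the same three stages (split on non-letters, lowercase, drop stop words) written with the Java standard library only. It is not Lucene's implementation: the class `FilterChainSketch`, its `Token` type, and the four-word stop set are all made up for illustration, and the position increment is computed the way it appears in the output above, as one plus the number of stop words dropped since the last kept token.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical stand-in for the Tokenizer -> LowerCaseFilter -> StopFilter chain.
public class FilterChainSketch {
    /** One token: its text, character offsets, and distance from the previous kept token. */
    static final class Token {
        final String term;
        final int start;
        final int end;
        final int positionIncrement;

        Token(String term, int start, int end, int positionIncrement) {
            this.term = term;
            this.start = start;
            this.end = end;
            this.positionIncrement = positionIncrement;
        }

        @Override
        public String toString() {
            return positionIncrement + ":" + term + "[" + start + "-" + end + "]";
        }
    }

    /** Splits on non-letters, lowercases each run of letters, then drops stop words. */
    static List<Token> analyze(String text, Set<String> stopWords) {
        List<Token> tokens = new ArrayList<>();
        int skipped = 0; // stop words dropped since the last kept token
        int i = 0;
        while (i < text.length()) {
            if (!Character.isLetter(text.charAt(i))) {
                i++;
                continue;
            }
            int start = i;
            while (i < text.length() && Character.isLetter(text.charAt(i))) {
                i++;
            }
            String term = text.substring(start, i).toLowerCase(Locale.ROOT);
            if (stopWords.contains(term)) {
                skipped++;
            } else {
                tokens.add(new Token(term, start, i, skipped + 1));
                skipped = 0;
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Assumed stop set for this sketch only; Lucene's default English set is larger.
        Set<String> stopWords = new HashSet<>(Arrays.asList("i", "you", "hate", "are"));
        // Prints 1:how[0-3] and then 3:thank[13-18]
        for (Token token : analyze("how are you, thank you. I hate you.", stopWords)) {
            System.out.println(token);
        }
    }
}
```

Note the increment of 3 on "thank": two stop words ("are" and "you") were removed in front of it, so the gap is preserved rather than closed up.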

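3. Chinese Analyzers

There are quite a few Chinese analyzers to choose from: paoding, mmseg, IK, and others. Some of them are no longer maintained, though.

This post uses mmseg for the demo. mmseg's dictionary is based on the Sogou lexicon. Download the mmseg4j-1.8.5 archive and look inside: the data directory holds the segmentation dictionaries. Add the jar to your project. mmseg-all comes as two jars, one bundled with the dictionary (dic) and one without; here we use the one without.

Testing the default dictionary on the sentence “我來自山東聊城，我叫王劭陽。” ("I am from Liaocheng, Shandong; my name is Wang Shaoyang.") showed “聊城” (Liaocheng) split into “聊” and “城”. What, no entry for my great Liaocheng? So let's add it ourselves: open words-my.dic in the data directory, add “聊城”, and run the analyzer again. Now “聊城” comes out as a single token.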

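// Requires: import java.io.File; and import com.chenlb.mmseg4j.analysis.MMSegAnalyzer;
public static void main(String[] args) {
    // mmseg analyzers
    // With no dictionary specified, the text is split one character at a time
    Analyzer analyzer3 = new MMSegAnalyzer();
    // With a local dictionary directory specified, the text is split according to the dictionary
    Analyzer analyzer4 = new MMSegAnalyzer(new File("E:\\Lucene\\mmseg4j-1.8.5\\data"));
    String string2 = "我來自山東聊城，我叫王劭陽。";
    MyStopAnalyzer.displayAllToken(string2, analyzer3);
    MyStopAnalyzer.displayAllToken(string2, analyzer4);
}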
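
Why does adding one line to words-my.dic change the result? A rough way to see it is plain forward maximum matching, sketched below with the standard library only. This is a simplification and an assumption on my part, not mmseg4j's code: mmseg4j's real algorithm is more elaborate, and the class `MaxMatchSketch`, the tiny dictionary, and the `maxWordLength` parameter are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration of dictionary-driven segmentation; not mmseg4j's implementation.
public class MaxMatchSketch {
    /** At each position, take the longest dictionary word; fall back to a single character. */
    static List<String> segment(String text, Set<String> dictionary, int maxWordLength) {
        List<String> words = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            int end = Math.min(text.length(), i + maxWordLength);
            // Shrink the candidate until it is a dictionary word or a single character.
            while (end > i + 1 && !dictionary.contains(text.substring(i, end))) {
                end--;
            }
            words.add(text.substring(i, end));
            i = end;
        }
        return words;
    }

    public static void main(String[] args) {
        String text = "我來自山東聊城";
        // Assumed starting dictionary that lacks the city name.
        Set<String> dictionary = new HashSet<>(Arrays.asList("來自", "山東"));
        System.out.println(segment(text, dictionary, 4)); // [我, 來自, 山東, 聊, 城]
        // Adding the word plays the role of the new line in words-my.dic.
        dictionary.add("聊城");
        System.out.println(segment(text, dictionary, 4)); // [我, 來自, 山東, 聊城]
    }
}
```

With an empty dictionary the same function falls back to one character per token, which matches the behavior described for the analyzer created without a dictionary.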
