Elasticsearch plugin development: a custom tokenization method

Implementing a custom Elasticsearch plugin

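1 Plugin project structure

This is a conventional Maven project layout, with a few extra directories and files that the plugin needs.
[Figure: project directory structure]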

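plugin.xml and plugin-descriptor.properties hold the plugin's main packaging configuration and description; pom.xml also contains some plugin-related configuration.

The pom.xml file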

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <name>analysis-gridsum</name>
    <groupId>org.elasticsearch</groupId>
    <artifactId>gridsum-plugin</artifactId>
    <version>0.0.1</version>
    <description>gridsum elasticsearch plugin 國雙elasticsearch自定義分詞外掛</description>

    <properties>
        <elasticsearch.version>6.4.1</elasticsearch.version>
        <lucene.version>7.5.0</lucene.version>
        <maven.compiler.target>1.8</maven.compiler.target>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    </properties>

    <dependencies>
        <dependency>
            <groupId>org.elasticsearch</groupId>
            <artifactId>elasticsearch</artifactId>
            <version>${elasticsearch.version}</version>
            <scope>provided</scope>
        </dependency>
        <!-- Testing -->
        <dependency>
            <groupId>org.apache.logging.log4j</groupId>
            <artifactId>log4j-api</artifactId>
            <version>2.7</version>
            <scope>provided</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.logging.log4j</groupId>
            <artifactId>log4j-core</artifactId>
            <version>2.7</version>
            <scope>test</scope>
        </dependency>
        <dependency>
            <groupId>org.elasticsearch.test</groupId>
            <artifactId>framework</artifactId>
            <version>${elasticsearch.version}</version>
            <scope>test</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.lucene</groupId>
            <artifactId>lucene-test-framework</artifactId>
            <version>${lucene.version}</version>
            <scope>test</scope>
        </dependency>
    </dependencies>

    <build>
        <resources>
            <resource>
                <directory>src/main/resources</directory>
                <filtering>false</filtering>
                <excludes>
                    <exclude>*.properties</exclude>
                </excludes>
            </resource>
        </resources>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-assembly-plugin</artifactId>
                <version>2.6</version>
                <configuration>
                    <appendAssemblyId>false</appendAssemblyId>
                    <outputDirectory>${project.build.directory}/releases/</outputDirectory>
                    <descriptors>
                        <descriptor>${basedir}/src/main/assemblies/plugin.xml</descriptor>
                    </descriptors>
                </configuration>
                <executions>
                    <execution>
                        <phase>package</phase>
                        <goals>
                            <goal>single</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.5.1</version>
                <configuration>
                    <source>${maven.compiler.target}</source>
                    <target>${maven.compiler.target}</target>
                </configuration>
            </plugin>
        </plugins>
    </build>
</project>

The plugin.xml file

<?xml version="1.0"?>
<assembly>
    <id>analysis-gridsum</id>
    <formats>
        <format>zip</format>
    </formats>
    <includeBaseDirectory>false</includeBaseDirectory>
    <fileSets>
        <fileSet>
            <directory>${project.basedir}/config</directory>
            <outputDirectory>config</outputDirectory>
        </fileSet>
    </fileSets>

    <files>
        <file>
            <source>${project.basedir}/src/main/resources/plugin-descriptor.properties</source>
            <outputDirectory/>
            <filtered>true</filtered>
        </file>
    </files>
    <dependencySets>
        <dependencySet>
            <outputDirectory/>
            <useProjectArtifact>true</useProjectArtifact>
            <useTransitiveFiltering>true</useTransitiveFiltering>
            <excludes>
                <exclude>org.elasticsearch:elasticsearch</exclude>
            </excludes>
        </dependencySet>
    </dependencySets>
</assembly>

The plugin-descriptor.properties file

description=${project.description}
version=${project.version}
name=${project.name}
classname=org.elasticsearch.gridsum.plugin.GridsumPlugin
java.version=${maven.compiler.target}
elasticsearch.version=${elasticsearch.version}
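
The values above are Maven placeholders; they are only replaced with real values because plugin.xml marks the file as filtered. As a minimal sketch, the packaged descriptor can be sanity-checked with plain java.util.Properties. DescriptorCheck is a hypothetical helper written for this article, not part of Elasticsearch, and it only checks the six keys shown above.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Properties;

// Hypothetical helper: verifies a filtered plugin-descriptor.properties.
final class DescriptorCheck {
    // The keys that appear in the descriptor shown above.
    static final List<String> REQUIRED_KEYS = Arrays.asList(
            "description", "version", "name", "classname",
            "java.version", "elasticsearch.version");

    /** Returns the keys that are missing, blank, or still hold an unresolved ${...} placeholder. */
    static List<String> findProblems(String descriptorText) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(descriptorText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        List<String> problems = new ArrayList<>();
        for (String key : REQUIRED_KEYS) {
            String value = props.getProperty(key);
            if (value == null || value.trim().isEmpty() || value.contains("${")) {
                problems.add(key);
            }
        }
        return problems;
    }

    public static void main(String[] args) {
        String filtered = String.join("\n",
                "description=gridsum elasticsearch plugin",
                "version=0.0.1",
                "name=analysis-gridsum",
                "classname=org.elasticsearch.gridsum.plugin.GridsumPlugin",
                "java.version=1.8",
                "elasticsearch.version=6.4.1");
        // An empty list means every key is present and resolved.
        System.out.println(findProblems(filtered));
    }
}
```

If a key shows up in the returned list, the usual suspect is that filtering was not applied to the descriptor during assembly.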

Once the project structure and these files are in place, you can start writing the plugin.

2 Main plugin classes and methods

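2.1 Developing the plugin only requires extending Plugin and implementing AnalysisPlugin

GridsumTokenizer is the tokenizer. It extends Tokenizer and overrides incrementToken to implement the custom tokenization logic.

GridsumAnalyzer is the analyzer. It extends Analyzer and wraps a tokenizer.

GridsumAnalyzerProvider is the analyzer provider. It extends AbstractIndexAnalyzerProvider and overrides get to return the custom analyzer.

GridsumTokenizerFactory is the tokenizer factory. It extends AbstractTokenizerFactory and overrides create to return the custom tokenizer.

GridsumPlugin is the plugin's entry point. It extends Plugin and implements AnalysisPlugin, overriding getTokenizers to put the tokenizer factory into a map and getAnalyzers to put the analyzer provider into a map (the keys used here are needed later, when the analyzer is referenced by name).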
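
The role of the map keys can be illustrated without any Elasticsearch classes. The following is a plain-Java analogue of the registration idea only: it does not use the real AnalysisPlugin API, SimpleTokenizer is a stand-in interface, and the key "gridsum_tokenizer" is a hypothetical name chosen for this sketch.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Supplier;

// Plain-Java analogue of "register a factory under a name, look it up later".
final class AnalysisRegistrySketch {
    /** Stand-in for a tokenizer; the real class would extend Lucene's Tokenizer. */
    interface SimpleTokenizer {
        String describe();
    }

    /** Mirrors the idea behind getTokenizers(): expose factories under a lookup key. */
    static Map<String, Supplier<SimpleTokenizer>> tokenizers() {
        Map<String, Supplier<SimpleTokenizer>> map = new LinkedHashMap<>();
        // Hypothetical key; requests must use exactly this string.
        map.put("gridsum_tokenizer", () -> () -> "splits on space, -, (, ), /");
        return map;
    }

    /** What the host does later: resolve the configured name to a factory and call it. */
    static SimpleTokenizer create(String name) {
        Supplier<SimpleTokenizer> factory = tokenizers().get(name);
        if (factory == null) {
            throw new IllegalArgumentException("unknown tokenizer: " + name);
        }
        return factory.get();
    }

    public static void main(String[] args) {
        System.out.println(create("gridsum_tokenizer").describe());
    }
}
```

The point of the sketch is that the string used at registration time is the only handle callers have, so a typo in the key surfaces as an "unknown" error rather than a compile error.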

The class structure is shown below.

[Figure: class structure diagram of the plugin]

First, the custom Tokenizer. Its most important method is incrementToken.

[Screenshot: GridsumTokenizer source]

Next, the custom Tokenizer factory. Its main method is create, which returns the custom Tokenizer.
[Screenshot: GridsumTokenizerFactory source]

Now the custom Analyzer.
[Screenshot: GridsumAnalyzer source]

Its createComponents method returns a TokenStreamComponents that wraps our custom Tokenizer.

Next, the Analyzer provider.
[Screenshot: GridsumAnalyzerProvider source]

It mainly returns the custom Analyzer.

Finally, the custom Plugin class.

[Screenshot: GridsumPlugin source]

That completes the overall structure of the plugin.

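2.2 Implementing your own tokenization logic

The key to the whole custom tokenizer is the incrementToken method of GridsumTokenizer; overriding it is how the custom tokenization is implemented.

Below is a delimiter-splitting implementation adapted from one found online. It splits on spaces and on the characters -, (, ) and /, and lowercases ASCII letters.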

package org.elasticsearch.gridsum.plugin.extend;

import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;

import java.io.IOException;

public class GridsumTokenizer extends Tokenizer {

    // Characters that separate tokens: space, hyphen, parentheses, slash.
    private static final String PUNCTUATION = " -()/";
    private final StringBuilder buffer = new StringBuilder();
    // Number of characters consumed from the input so far.
    private int position = 0;
    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
    private final OffsetAttribute offsetAtt = addAttribute(OffsetAttribute.class);

    @Override
    public final boolean incrementToken() throws IOException {
        clearAttributes();
        buffer.setLength(0);
        int tokenStart = position;
        int ci;
        while ((ci = input.read()) != -1) {
            position++;
            char ch = (char) ci;
            if (PUNCTUATION.indexOf(ch) != -1) {
                if (buffer.length() > 0) {
                    // A delimiter ends the current token; the end offset excludes the delimiter.
                    return emit(tokenStart, position - 1);
                }
                // Skip leading or repeated delimiters so the next token starts after them.
                tokenStart = position;
            } else {
                // Lowercase ASCII A-Z; all other characters are kept unchanged.
                buffer.append(ch >= 'A' && ch <= 'Z' ? (char) (ch + 32) : ch);
            }
        }
        // End of input: emit the last token, if there is one.
        return buffer.length() > 0 && emit(tokenStart, position);
    }

    // Copies the buffered term and its offsets into the token attributes.
    private boolean emit(int start, int end) {
        termAtt.setEmpty().append(buffer);
        offsetAtt.setOffset(correctOffset(start), correctOffset(end));
        return true;
    }

    @Override
    public final void end() throws IOException {
        super.end();
        // The final offset is the total number of characters read.
        final int finalOffset = correctOffset(position);
        offsetAtt.setOffset(finalOffset, finalOffset);
    }

    @Override
    public void reset() throws IOException {
        super.reset();
        position = 0;
    }

}
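
The splitting rule itself (split on space, -, (, ), /; lowercase ASCII A-Z) can be tried out without Lucene on the classpath. The following is a minimal standalone sketch of that rule, not the plugin's tokenizer: DelimiterSplitSketch and its Token class are hypothetical names, and the offsets it records are start-inclusive and end-exclusive, with delimiters left out.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Standalone sketch of the delimiter-splitting rule, using only the JDK.
final class DelimiterSplitSketch {
    static final String DELIMITERS = " -()/";

    /** One token with its character offsets in the input (end is exclusive). */
    static final class Token {
        final String term;
        final int start;
        final int end;

        Token(String term, int start, int end) {
            this.term = term;
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return term + "[" + start + "," + end + ")";
        }
    }

    static List<Token> tokenize(String text) {
        List<Token> tokens = new ArrayList<>();
        StringBuilder buffer = new StringBuilder();
        int position = 0;   // index of the character being examined
        int tokenStart = 0; // index where the current token began
        try (Reader input = new StringReader(text)) {
            int ci;
            while ((ci = input.read()) != -1) {
                char ch = (char) ci;
                if (DELIMITERS.indexOf(ch) != -1) {
                    if (buffer.length() > 0) {
                        tokens.add(new Token(buffer.toString(), tokenStart, position));
                        buffer.setLength(0);
                    }
                    // The next token can start no earlier than the character after this delimiter.
                    tokenStart = position + 1;
                } else {
                    // Lowercase ASCII A-Z; keep everything else unchanged.
                    buffer.append(ch >= 'A' && ch <= 'Z' ? (char) (ch + 32) : ch);
                }
                position++;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        if (buffer.length() > 0) {
            tokens.add(new Token(buffer.toString(), tokenStart, position));
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Foo-Bar (baz)/qux"));
    }
}
```

Unlike incrementToken, which hands back one token per call, this sketch collects all tokens into a list, which makes the rule easier to check in isolation.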

3 Verification and installation

A local test looks like this

package org.elasticsearch.gridsum.plugin;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.elasticsearch.gridsum.plugin.extend.GridsumAnalyzer;
import org.junit.Test;

public class GridsumAnalyzerTest {
    @Test
    public void testAnalyzer() throws Exception {
        GridsumAnalyzer analyzer = new GridsumAnalyzer();
        TokenStream ts = analyzer.tokenStream("text", "我愛北京 天安門");
        CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
        ts.reset();
        while (ts.incrementToken()) {
            System.out.println(term.toString());
        }
        ts.end();
        ts.close();
    }

}

Output

我愛北京
天安門

Once the code is written, package it

mvn clean package 

Packaging produces a local zip file under target/releases, which is used to install the plugin into Elasticsearch.
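
The installer takes the location of that zip as a file: URL. As a small sketch, the URL for a given path can be derived with java.nio; PluginZipLocation is a hypothetical helper, and the relative path in main assumes the command is run from the project root.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper: turns a local path into a file: URL string.
final class PluginZipLocation {
    /** Builds an absolute file: URL for a local plugin zip. */
    static String toInstallUrl(String zipPath) {
        Path absolute = Paths.get(zipPath).toAbsolutePath().normalize();
        return absolute.toUri().toString();
    }

    public static void main(String[] args) {
        // The printed value depends on the current working directory.
        System.out.println(toInstallUrl("target/releases/gridsum-plugin-0.0.1.zip"));
    }
}
```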

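Install command (Windows)

<elasticsearch bin directory> elasticsearch-plugin.bat install file:D:/elastic-gridsum-plugin/target/releases/gridsum-plugin-0.0.1.zip

If the installer reports that the plugin already exists, uninstall it first. The plugin name must also not collide with any other installed plugin. The name comes from the name property in plugin-descriptor.properties (here ${project.name}, that is, the <name> element in pom.xml, analysis-gridsum); the <assembly><id> in plugin.xml uses the same value in this project.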

To uninstall

<elasticsearch bin directory> elasticsearch-plugin.bat remove analysis-gridsum

After a successful installation, start Elasticsearch

<elasticsearch bin directory> elasticsearch.bat

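Once Elasticsearch is running, the plugin can be verified with Postman.

Note that the analyzer name used in the request is the key that was put into the map when overriding getAnalyzers in step 2.

[Screenshot: Postman request and response]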
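
The same check can be scripted. The sketch below assumes the verification goes through Elasticsearch's _analyze endpoint, that the node listens on http://localhost:9200, and that the analyzer was registered under the hypothetical key "gridsum_analyzer"; substitute the key actually used in getAnalyzers. AnalyzeRequestSketch is an illustrative name, and its JSON escaping is deliberately minimal.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;

// Sketch: send an _analyze request with the JDK's built-in HTTP client classes.
final class AnalyzeRequestSketch {
    /** Builds the JSON body for _analyze from an analyzer name and sample text. */
    static String buildBody(String analyzer, String text) {
        return "{\"analyzer\":\"" + escape(analyzer) + "\",\"text\":\"" + escape(text) + "\"}";
    }

    /** Minimal escaping (backslash and double quote only); enough for plain sample text. */
    static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    /** Posts the body to baseUrl/_analyze and returns the raw JSON response. */
    static String analyze(String baseUrl, String analyzer, String text) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(baseUrl + "/_analyze").openConnection();
        conn.setRequestMethod("POST");
        conn.setRequestProperty("Content-Type", "application/json");
        conn.setDoOutput(true);
        try (OutputStream out = conn.getOutputStream()) {
            out.write(buildBody(analyzer, text).getBytes(StandardCharsets.UTF_8));
        }
        try (InputStream in = conn.getInputStream();
             Scanner scanner = new Scanner(in, "UTF-8").useDelimiter("\\A")) {
            return scanner.hasNext() ? scanner.next() : "";
        }
    }

    public static void main(String[] args) throws IOException {
        // Requires a running node with the plugin installed; both values are assumptions.
        System.out.println(analyze("http://localhost:9200", "gridsum_analyzer", "foo bar"));
    }
}
```

If the key does not match what the plugin registered, Elasticsearch rejects the request instead of falling back to another analyzer, which makes this a quick way to confirm the name.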

