Elasticsearch (ES) Analyzers: What You Need to Know

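1. Overview

The analyzer is one of the most important components in Elasticsearch. It breaks a piece of text into individual terms, and Elasticsearch then builds its inverted index from those terms.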
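To make the link between analysis and the inverted index concrete, here is a minimal sketch in Python. It is a toy model, not how Elasticsearch is implemented: the `analyze` function simply lowercases and splits on whitespace, and the two sample documents are made up for illustration.

```python
from collections import defaultdict

def analyze(text):
    """Toy analyzer: lowercase the text and split it on whitespace."""
    return text.lower().split()

def build_inverted_index(docs):
    """Map each term to the sorted list of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

# Hypothetical sample documents
docs = {1: "Quick brown fox", 2: "Lazy brown dog"}
index = build_inverted_index(docs)
print(index["brown"])  # [1, 2]
print(index["fox"])    # [1]
```

A search then becomes a lookup of the analyzed query terms in this map, which is why the terms an analyzer produces decide what can be found.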

This article walks through the basics of analyzers.

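2. Built-in Analyzers

2.1 Overview

Elasticsearch ships with several built-in analyzers. They are designed for languages such as English and cannot recognize Chinese words; the standard analyzer, for example, breaks Chinese text into single characters.

2.2 The built-in analyzers

standard: the default analyzer in Elasticsearch. It splits English text into words and converts uppercase letters to lowercase.

simple: splits the text at every non-letter character (digits, punctuation, special characters and so on), discards those non-letter characters, and converts uppercase letters to lowercase.

whitespace: splits on whitespace only, much like calling split on spaces. Uppercase letters are not converted to lowercase.

stop: works like simple and additionally removes stop words that carry little meaning, such as the, a and an. Uppercase letters are converted to lowercase.

keyword: does not split at all; the whole text is treated as a single term.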
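The differences between these analyzers are easier to see side by side. The functions below are rough Python approximations of the behavior described above, written only for illustration; they are not the Elasticsearch implementations, the stop word set is a small assumed subset, and the sample sentence is made up.

```python
import re

STOP_WORDS = {"the", "a", "an"}  # illustrative subset, not the full default list

def standard(text):
    # Approximation: runs of letters/digits, lowercased
    return re.findall(r"\w+", text.lower())

def simple(text):
    # Runs of letters only, lowercased; digits and punctuation are dropped
    return re.findall(r"[^\W\d_]+", text.lower())

def whitespace(text):
    # Split on whitespace only; case and punctuation are kept
    return text.split()

def stop(text):
    # Like simple, then remove stop words
    return [t for t in simple(text) if t not in STOP_WORDS]

def keyword(text):
    # The whole input is one term
    return [text]

sample = "The 2 QUICK Brown-Foxes."
print(standard(sample))    # ['the', '2', 'quick', 'brown', 'foxes']
print(simple(sample))      # ['the', 'quick', 'brown', 'foxes']
print(whitespace(sample))  # ['The', '2', 'QUICK', 'Brown-Foxes.']
print(stop(sample))        # ['quick', 'brown', 'foxes']
print(keyword(sample))     # ['The 2 QUICK Brown-Foxes.']
```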

2.3 A general-purpose API for checking analyzer output

GET http://192.168.1.11:9200/_analyze

Request body:

{
    "analyzer": "standard",
    "text": "I am a man."
}

Response:

{
    "tokens": [
        {
            "token": "i",
            "start_offset": 0,
            "end_offset": 1,
            "type": "<ALPHANUM>",
            "position": 0
        },
        {
            "token": "am",
            "start_offset": 2,
            "end_offset": 4,
            "type": "<ALPHANUM>",
            "position": 1
        },
        {
            "token": "a",
            "start_offset": 5,
            "end_offset": 6,
            "type": "<ALPHANUM>",
            "position": 2
        },
        {
            "token": "man",
            "start_offset": 7,
            "end_offset": 10,
            "type": "<ALPHANUM>",
            "position": 3
        }
    ]
}
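The same call can be scripted. The sketch below uses only the Python standard library; the helper names (`build_analyze_request`, `token_spans`, `analyze`) are made up for this example, and it sends the body with POST, which the _analyze endpoint accepts alongside GET. `token_spans` shows how start_offset and end_offset point back into the original text.

```python
import json
import urllib.request

ES_URL = "http://192.168.1.11:9200"  # host used in this article; adjust for your node

def build_analyze_request(analyzer, text, base_url=ES_URL):
    """Build (but do not send) a request for the _analyze endpoint."""
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def token_spans(response, text):
    """Pair each token with the slice of the source text its offsets cover."""
    return [
        (t["token"], text[t["start_offset"]:t["end_offset"]])
        for t in response["tokens"]
    ]

def analyze(analyzer, text, base_url=ES_URL):
    """Send the request; this needs a reachable Elasticsearch node."""
    request = build_analyze_request(analyzer, text, base_url)
    with urllib.request.urlopen(request) as resp:
        return json.load(resp)

# Offline demonstration with two tokens taken from the response above
text = "I am a man."
sample_response = {
    "tokens": [
        {"token": "i", "start_offset": 0, "end_offset": 1},
        {"token": "man", "start_offset": 7, "end_offset": 10},
    ]
}
print(token_spans(sample_response, text))  # [('i', 'I'), ('man', 'man')]
```

Note that the token is the lowercased form, while the offsets still refer to the untouched original text.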

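3. The IK Analyzer

3.1 Overview

The analyzers built into Elasticsearch cannot segment Chinese text into words, so we need to install an additional analyzer that supports Chinese. The IK analyzer is a good choice.

3.2 Downloading the IK analyzer

Download page: https://github.com/medcl/elasticsearch-analysis-ik

3.3 Installing the IK analyzer

1) Create a directory for the IK analyzer

# cd /usr/local/elasticsearch-7.14.1/plugins

# mkdir ik

2) Copy the IK analyzer archive to a directory on the CentOS 7 machine, for example /home

3) From that directory, extract the archive into the directory you just created

# unzip elasticsearch-analysis-ik-7.14.1.zip -d /usr/local/elasticsearch-7.14.1/plugins/ik/

4) Restart Elasticsearch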
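The commands above use 7.14.1 for both Elasticsearch and the plugin, and that is not a coincidence: Elasticsearch generally refuses to load a plugin built for a different version. A small pre-flight check might look like the sketch below. The function names are made up, and the code assumes the archive and the install directory follow the naming pattern used in this article.

```python
import re

def plugin_version(zip_name):
    """Extract the version from a name like elasticsearch-analysis-ik-7.14.1.zip."""
    m = re.fullmatch(r"elasticsearch-analysis-ik-(\d+\.\d+\.\d+)\.zip", zip_name)
    return m.group(1) if m else None

def es_version(install_dir):
    """Extract the version from a path like /usr/local/elasticsearch-7.14.1."""
    m = re.search(r"elasticsearch-(\d+\.\d+\.\d+)/?$", install_dir)
    return m.group(1) if m else None

def versions_match(zip_name, install_dir):
    """True only if both versions were found and are identical."""
    pv, ev = plugin_version(zip_name), es_version(install_dir)
    return pv is not None and pv == ev

print(versions_match("elasticsearch-analysis-ik-7.14.1.zip",
                     "/usr/local/elasticsearch-7.14.1"))  # True
print(versions_match("elasticsearch-analysis-ik-7.14.0.zip",
                     "/usr/local/elasticsearch-7.14.1"))  # False
```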

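3.4 The IK analyzers

ik_max_word: splits the text at the finest granularity; suited to Term Query.

ik_smart: splits the text at the coarsest granularity; suited to Phrase queries.

These descriptions come from the project's GitHub page: https://github.com/medcl/elasticsearch-analysis-ik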
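One common way to combine the two, which also appears in the plugin's README, is to index with ik_max_word and search with ik_smart. The sketch below only builds the mapping body as a Python dict; the field name `content` and the helper name `ik_text_mapping` are assumptions for illustration, and the result would still need to be sent to Elasticsearch when creating an index.

```python
import json

def ik_text_mapping(field):
    """Mapping body: fine-grained analysis at index time, coarse at search time."""
    return {
        "properties": {
            field: {
                "type": "text",
                "analyzer": "ik_max_word",      # used when documents are indexed
                "search_analyzer": "ik_smart",  # used when queries are analyzed
            }
        }
    }

mapping = ik_text_mapping("content")  # "content" is a hypothetical field name
print(json.dumps(mapping, indent=2))
```

Indexing with the finer split stores more terms per document, while the coarser split at search time keeps queries focused on whole words.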

3.5 Analysis result

GET http://192.168.1.11:9200/_analyze

Request body (the sample text means "I am a senior Java programmer"):

{
    "analyzer": "ik_max_word",
    "text": "我是一名Java高级程序员"
}

Response:

{
    "tokens": [
        {
            "token": "我",
            "start_offset": 0,
            "end_offset": 1,
            "type": "CN_CHAR",
            "position": 0
        },
        {
            "token": "是",
            "start_offset": 1,
            "end_offset": 2,
            "type": "CN_CHAR",
            "position": 1
        },
        {
            "token": "一名",
            "start_offset": 2,
            "end_offset": 4,
            "type": "CN_WORD",
            "position": 2
        },
        {
            "token": "一",
            "start_offset": 2,
            "end_offset": 3,
            "type": "TYPE_CNUM",
            "position": 3
        },
        {
            "token": "名",
            "start_offset": 3,
            "end_offset": 4,
            "type": "COUNT",
            "position": 4
        },
        {
            "token": "java",
            "start_offset": 4,
            "end_offset": 8,
            "type": "ENGLISH",
            "position": 5
        },
        {
            "token": "高级",
            "start_offset": 8,
            "end_offset": 10,
            "type": "CN_WORD",
            "position": 6
        },
        {
            "token": "程序员",
            "start_offset": 10,
            "end_offset": 13,
            "type": "CN_WORD",
            "position": 7
        },
        {
            "token": "程序",
            "start_offset": 10,
            "end_offset": 12,
            "type": "CN_WORD",
            "position": 8
        },
        {
            "token": "员",
            "start_offset": 12,
            "end_offset": 13,
            "type": "CN_CHAR",
            "position": 9
        }
    ]
}

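4. Custom Dictionaries

4.1 Overview

When segmenting Chinese text, the analyzer often produces terms that are not the ones we want. In that case we need to define our own dictionary for the IK analyzer.

For example, 追風人 is split into 追風 and 人, and the term 追風人 itself is never produced. Once the inverted index is built, a user searching for 追風 or 人 can find 追風人, but a term-level search for 追風人 finds nothing, because that term does not exist in the index.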
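Why a missing dictionary entry leads to this result can be shown with a toy dictionary-based segmenter. The sketch below uses greedy forward maximum matching, a textbook technique chosen here for illustration; it is not a description of IK's actual algorithm, and the two-word base dictionary is made up.

```python
def segment(text, dictionary):
    """Greedy forward maximum matching.

    At each position, take the longest dictionary word that starts there;
    fall back to a single character when nothing matches.
    """
    max_len = max((len(w) for w in dictionary), default=1)
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens

base = {"追風", "人"}  # hypothetical dictionary without the full word
print(segment("追風人", base))               # ['追風', '人']
print(segment("追風人", base | {"追風人"}))  # ['追風人']
```

The segmenter can only emit words it knows, so adding the word to the dictionary is what makes the full term appear.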

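4.2 Defining a custom dictionary

# cd /usr/local/elasticsearch-7.14.1/plugins/ik/config

# vi IKAnalyzer.cfg.xml

In this configuration file, register your own dictionary file (my.dic here) in the ext_dict entry.

# vi my.dic

Add 追風人 to the file, one word per line, and save it.

Then restart Elasticsearch.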
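For reference, the two pieces of text involved can be sketched as follows. The `ext_dict` key is the entry the IK plugin's shipped IKAnalyzer.cfg.xml provides for extension dictionaries; the helper names are made up for this example, and the code only builds strings rather than editing files on disk.

```python
def dictionary_text(words):
    """Dictionary file content: one word per line (save the file as UTF-8)."""
    return "\n".join(words) + "\n"

def ext_dict_entry(file_names):
    """The <entry> element listing extension dictionaries, separated by semicolons."""
    return f'<entry key="ext_dict">{";".join(file_names)}</entry>'

print(ext_dict_entry(["my.dic"]))          # <entry key="ext_dict">my.dic</entry>
print(repr(dictionary_text(["追風人"])))   # '追風人\n'
```

The file name in the entry is resolved relative to the plugin's config directory, which is why my.dic is created next to IKAnalyzer.cfg.xml in the steps above.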

5. Summary

This article gave a brief tour of Elasticsearch (ES) analyzers. I hope it proves useful in your work.

Likes, comments and follows are all welcome :)

Follow 追風人聊Java for daily Java content.
