Elasticsearch in Practice (4): IK Tokenization

阿新 • Published: 2018-12-01


Environment: Elasticsearch 6.2.4 + Kibana 6.2.4 + IK 6.2.4

Elasticsearch can tokenize Chinese text out of the box.

Let's first look at how the built-in tokenization handles Chinese:

curl -XGET "http://localhost:9200/_analyze" -H 'Content-Type: application/json' -d '{"analyzer": "default","text": "今天天氣真好"}'

GET /_analyze
{
  "analyzer": "default",
  "text": "今天天氣真好"
}

Result:

{
  "tokens": [
    {
      "token": "今",
      "start_offset": 0,
      "end_offset": 1,
      "type": "<IDEOGRAPHIC>",
      "position": 0
    },
    {
      "token": "天",
      "start_offset": 1,
      "end_offset": 2,
      "type": "<IDEOGRAPHIC>",
      "position": 1
    },
    {
      "token": "天",
      "start_offset": 2,
      "end_offset": 3,
      "type": "<IDEOGRAPHIC>",
      "position": 2
    },
    {
      "token": "氣",
      "start_offset": 3,
      "end_offset": 4,
      "type": "<IDEOGRAPHIC>",
      "position": 3
    },
    {
      "token": "真",
      "start_offset": 4,
      "end_offset": 5,
      "type": "<IDEOGRAPHIC>",
      "position": 4
    },
    {
      "token": "好",
      "start_offset": 5,
      "end_offset": 6,
      "type": "<IDEOGRAPHIC>",
      "position": 5
    }
  ]
}

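As the output shows, the text is split into individual characters. For real-world Chinese search this will not give the results you want. For log search, however, the built-in analyzer is good enough.

analyzer=default actually invokes the standard analyzer.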
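
The same _analyze call can also be made from a script. The following is a minimal Python sketch using only the standard library, not part of the original setup. The function names are made up for illustration, and it assumes the node listens on http://localhost:9200 as in the curl command above. It uses POST rather than GET, since sending a body with POST is the more portable choice for HTTP clients.

```python
import json
import urllib.request


def build_analyze_request(text, analyzer="standard", base_url="http://localhost:9200"):
    """Build (but do not send) a request for the _analyze endpoint."""
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    return urllib.request.Request(
        base_url + "/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def token_strings(response_body):
    """Extract just the token strings from an _analyze JSON response."""
    return [t["token"] for t in json.loads(response_body)["tokens"]]


def analyze(text, analyzer="standard", base_url="http://localhost:9200"):
    """Send the request and return the tokens; needs a running Elasticsearch node."""
    request = build_analyze_request(text, analyzer, base_url)
    with urllib.request.urlopen(request) as resp:
        return token_strings(resp.read().decode("utf-8"))
```

With a node running, analyze("今天天氣真好") should return the same six single-character tokens shown in the result above.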

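Next, we install the IK analysis plugin and use it for tokenization.

Installing IK

IK project page: https://github.com/medcl/elasticsearch-analysis-ik

First of all, note that the IK plugin version must match the Elasticsearch version; otherwise the two are incompatible.

Installation method 1:
Download the archive from https://github.com/medcl/elasticsearch-analysis-ik/releases, create an analysis-ik subdirectory under the ES plugins directory, and copy the contents of the archive into it. The plugins/analysis-ik/ directory should end up containing: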

plugins/analysis-ik/
    commons-codec-1.9.jar
    commons-logging-1.2.jar
    elasticsearch-analysis-ik-6.2.4.jar
    httpclient-4.5.2.jar
    httpcore-4.4.4.jar
    plugin-descriptor.properties

Then restart Elasticsearch.

Installation method 2:

/usr/local/elk/elasticsearch-6.2.4/bin/elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v6.2.4/elasticsearch-analysis-ik-6.2.4.zip

If you have already downloaded the archive, install it from the local file instead:

/usr/local/elk/elasticsearch-6.2.4/bin/elasticsearch-plugin install file:///tmp/elasticsearch-analysis-ik-6.2.4.zip

Then restart Elasticsearch.

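IK Tokenization

IK supports two tokenization modes:

ik_max_word: splits the text at the finest granularity, exhausting every possible combination
ik_smart: splits the text at the coarsest granularity

Next, let's test how IK's output differs from the built-in analyzer's: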

curl -XGET "http://localhost:9200/_analyze" -H 'Content-Type: application/json' -d'{"analyzer": "ik_smart","text": "今天天氣真好"}'

Result:

{
  "tokens": [
    {
      "token": "今天天氣",
      "start_offset": 0,
      "end_offset": 4,
      "type": "CN_WORD",
      "position": 0
    },
    {
      "token": "真好",
      "start_offset": 4,
      "end_offset": 6,
      "type": "CN_WORD",
      "position": 1
    }
  ]
}

Now try the same text with ik_max_word:

{
  "tokens": [
    {
      "token": "今天天氣",
      "start_offset": 0,
      "end_offset": 4,
      "type": "CN_WORD",
      "position": 0
    },
    {
      "token": "今天",
      "start_offset": 0,
      "end_offset": 2,
      "type": "CN_WORD",
      "position": 1
    },
    {
      "token": "天天",
      "start_offset": 1,
      "end_offset": 3,
      "type": "CN_WORD",
      "position": 2
    },
    {
      "token": "天氣",
      "start_offset": 2,
      "end_offset": 4,
      "type": "CN_WORD",
      "position": 3
    },
    {
      "token": "真好",
      "start_offset": 4,
      "end_offset": 6,
      "type": "CN_WORD",
      "position": 4
    }
  ]
}
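
The difference between the two outputs above can be mimicked with a toy dictionary matcher. The Python sketch below is an illustration only. It is not IK's actual algorithm, which has its own and more elaborate disambiguation logic, and the five-word DICTIONARY is a hypothetical stand-in for IK's bundled main.dic.

```python
# Hypothetical mini-dictionary; IK's real dictionary is far larger.
DICTIONARY = {"今天天氣", "今天", "天天", "天氣", "真好"}
MAX_WORD_LEN = max(len(word) for word in DICTIONARY)


def all_matches(text):
    """Fine-grained: every dictionary word found at every offset."""
    found = []
    for start in range(len(text)):
        last_end = min(len(text), start + MAX_WORD_LEN)
        for end in range(start + 1, last_end + 1):
            if text[start:end] in DICTIONARY:
                found.append((start, end, text[start:end]))
    # Earlier start first; for equal starts, the longer word first.
    found.sort(key=lambda m: (m[0], -(m[1] - m[0])))
    return [word for _, _, word in found]


def longest_match(text):
    """Coarse-grained: greedy forward longest match, single characters as fallback."""
    tokens, pos = [], 0
    while pos < len(text):
        for end in range(min(len(text), pos + MAX_WORD_LEN), pos, -1):
            if text[pos:end] in DICTIONARY or end == pos + 1:
                tokens.append(text[pos:end])
                pos = end
                break
    return tokens
```

For the sample sentence and this dictionary, all_matches reproduces the token list of the ik_max_word result and longest_match reproduces the ik_smart result. On other inputs the toy matcher and the real plugin may well disagree.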

Setting the default analyzer in the mapping

Example:

{
    "properties": {
        "content": {
            "type": "text",
            "analyzer": "ik_max_word",
            "search_analyzer": "ik_max_word"
        }
    }
}

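Note: search_analyzer is set to the same value as analyzer here so that the same analyzer is used at search time and at index time, which ensures that the terms in the query have the same format as the terms in the inverted index. If search_analyzer is not set, it defaults to the value of analyzer. For details, see: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-analyzer.html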
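
The properties fragment above has to be wrapped in an index-creation request before it can be used. Below is a minimal Python sketch of one way to do that on Elasticsearch 6.x, where mappings are still nested under a type name. The index name passed by the caller and the type name _doc are assumptions for illustration, not taken from the original setup, and the function only builds the request without sending it.

```python
import json
import urllib.request


def build_create_index_request(index, doc_type="_doc", base_url="http://localhost:9200"):
    """Build a PUT request that creates an index whose content field uses IK."""
    body = {
        "mappings": {
            doc_type: {  # 6.x mappings are keyed by type name (assumed "_doc" here)
                "properties": {
                    "content": {
                        "type": "text",
                        "analyzer": "ik_max_word",
                        "search_analyzer": "ik_max_word",
                    }
                }
            }
        }
    }
    return urllib.request.Request(
        "%s/%s" % (base_url, index),
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
```

Passing the returned object to urllib.request.urlopen would send it to a running node, for example with a hypothetical index name such as "articles".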

Custom dictionaries

We can also define our own dictionaries for IK to use. For example:

curl -XGET "http://localhost:9200/_analyze" -H 'Content-Type: application/json' -d'{"analyzer": "ik_smart","text": "去朝陽公園"}'

Result:

{
  "tokens": [
    {
      "token": "去",
      "start_offset": 0,
      "end_offset": 1,
      "type": "CN_CHAR",
      "position": 0
    },
    {
      "token": "朝陽",
      "start_offset": 1,
      "end_offset": 3,
      "type": "CN_WORD",
      "position": 1
    },
    {
      "token": "公園",
      "start_offset": 3,
      "end_offset": 5,
      "type": "CN_WORD",
      "position": 2
    }
  ]
}

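We want 朝陽公園 to be kept as a single token. To get that, add the word to a dictionary of our own.

Creating a custom dictionary takes only a few steps:
1. Add a my.dic file to the elasticsearch-6.2.4/config/analysis-ik/ directory: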

$ touch my.dic
$ echo 朝陽公園 > my.dic

$ cat my.dic
朝陽公園

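A .dic file is a dictionary file, which is simply a plain text file with one word per line. Note that it must be UTF-8 encoded. Here is what the bundled dictionary file looks like: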

$ head -n 5 main.dic
一一列舉
一一對應
一一道來
一丁
一丁不識
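
Maintaining such a file by hand gets tedious once the word list grows. The following Python sketch shows one possible helper that appends words while keeping the format described above (UTF-8, one word per line) and skipping duplicates. The function name add_words is hypothetical.

```python
from pathlib import Path


def add_words(dic_path, words):
    """Append new words to an IK-style .dic file: UTF-8, one word per line."""
    path = Path(dic_path)
    lines = path.read_text(encoding="utf-8").splitlines() if path.exists() else []
    merged = [line.strip() for line in lines if line.strip()]
    seen = set(merged)
    for word in words:
        word = word.strip()
        if word and word not in seen:  # skip blanks and duplicates
            seen.add(word)
            merged.append(word)
    path.write_text("".join(word + "\n" for word in merged), encoding="utf-8")
    return merged
```

A call such as add_words("my.dic", ["朝陽公園"]) has the same effect as the echo command above, except that running it twice does not duplicate the entry.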

2. Then edit the elasticsearch-6.2.4/config/analysis-ik/IKAnalyzer.cfg.xml file:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <comment>IK Analyzer extension configuration</comment>
    <!-- Configure your own extension dictionaries here -->
    <entry key="ext_dict">my.dic</entry>
    <!-- Configure your own extension stop-word dictionaries here -->
    <entry key="ext_stopwords"></entry>
    <!-- Configure a remote extension dictionary here -->
    <!-- <entry key="remote_ext_dict">words_location</entry> -->
    <!-- Configure a remote extension stop-word dictionary here -->
    <!-- <entry key="remote_ext_stopwords">words_location</entry> -->
</properties>

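With my.dic added, restart ES and check the result again. The query-string form of _analyze (?analyzer=...&text=...) was removed in Elasticsearch 6.0, so the parameters go in the JSON body as before:

curl -XGET "http://localhost:9200/_analyze" -H 'Content-Type: application/json' -d'{"analyzer": "ik_smart","text": "去朝陽公園"}'

Result: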

{
  "tokens": [
    {
      "token": "去",
      "start_offset": 0,
      "end_offset": 1,
      "type": "CN_CHAR",
      "position": 0
    },
    {
      "token": "朝陽公園",
      "start_offset": 1,
      "end_offset": 5,
      "type": "CN_WORD",
      "position": 1
    }
  ]
}

This shows that the custom dictionary has taken effect. To use several dictionaries, separate them with semicolons:

<entry key="ext_dict">my.dic;custom/single_word_low_freq.dic</entry>
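
To double-check which dictionaries a given IKAnalyzer.cfg.xml actually references, the file can be read with a few lines of Python. This is a sketch using the standard library XML parser. The function name is hypothetical, and it relies only on the entry/key layout and the semicolon separator shown above.

```python
import xml.etree.ElementTree as ET


def extension_dictionaries(cfg_xml, key="ext_dict"):
    """Return the dictionary paths listed under the given entry key."""
    root = ET.fromstring(cfg_xml)  # cfg_xml: the file contents as bytes
    for entry in root.iter("entry"):
        if entry.get("key") == key:
            # Multiple dictionaries are separated by semicolons.
            return [p.strip() for p in (entry.text or "").split(";") if p.strip()]
    return []
```

For example, extension_dictionaries(open("IKAnalyzer.cfg.xml", "rb").read()) would list the configured extension dictionaries, and passing key="ext_stopwords" would list the stop-word dictionaries instead.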

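The configuration also has an entry for an extension stop-word dictionary, which helps the tokenizer decide where to break the text. Here is one of the bundled extension stop-word dictionaries: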

$ head -n 5 extra_stopword.dic
也
了
仍
從
以

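In other words, when the IK tokenizer encounters one of these words, it assumes that the preceding text does not combine with it to form a word.

IK also supports remote dictionaries, whose advantage is that they can be hot-updated. The format is the same as for local dictionaries: one word per line (with \n as the line break). In addition, the configured URL must satisfy the following requirement:

The HTTP request must return two headers, Last-Modified and ETag. Both are strings, and whenever either one changes, the plugin fetches the new word list and updates its dictionary.

For details, see the section on hot-updating IK dictionaries in the README at https://github.com/medcl/elasticsearch-analysis-ik.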
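
As an illustration of the two header requirements, here is a minimal Python sketch of a server that could act as a remote dictionary source. It is a toy under stated assumptions, not a production setup: the in-memory WORDS list, the handler class, and the start_server helper are all made up, and it answers both HEAD and GET because the exact request sequence the plugin uses is not described here.

```python
import hashlib
import threading
from email.utils import formatdate
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-memory word list; a real setup would read a file or a database.
WORDS = ["朝陽公園"]
LAST_MODIFIED = formatdate(usegmt=True)  # when WORDS was last changed


def dictionary_body():
    """One word per line, "\\n" line breaks, UTF-8 encoded."""
    return "".join(word + "\n" for word in WORDS).encode("utf-8")


class DictHandler(BaseHTTPRequestHandler):
    def _send_headers(self, body):
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        # The two headers the plugin watches for changes.
        self.send_header("Last-Modified", LAST_MODIFIED)
        self.send_header("ETag", '"%s"' % hashlib.sha256(body).hexdigest())
        self.end_headers()

    def do_HEAD(self):
        self._send_headers(dictionary_body())

    def do_GET(self):
        body = dictionary_body()
        self._send_headers(body)
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example quiet


def start_server(port=0):
    """Start the server in a background thread; port 0 picks a free port."""
    server = HTTPServer(("127.0.0.1", port), DictHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

Because the ETag is derived from the response body, it changes automatically whenever the word list changes. The URL of such a server would go into the remote_ext_dict entry of IKAnalyzer.cfg.xml.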

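Note: the examples above edit files under elasticsearch-6.2.4/config/analysis-ik/ because IK was installed with elasticsearch-plugin (method 2). If you installed it by unpacking the archive, the IK configuration lives in the plugins directory instead, that is, elasticsearch-6.2.4/plugins/analysis-ik/config. In other words, a plugin's configuration can be placed either in the plugin's own directory or in Elasticsearch's config directory.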

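Built-in ES analyzers

ES ships with many built-in analyzers that can be used in any index without further configuration:

Standard analyzer (standard): splits text into terms on word boundaries, as defined by the Unicode Text Segmentation algorithm. It removes most punctuation, lowercases the terms, and supports removing stop words.
Simple analyzer (simple): splits text whenever it encounters a non-letter character and lowercases all terms.
Whitespace analyzer (whitespace): splits text whenever it encounters a whitespace character.
Stop analyzer (stop): like the simple analyzer, but also supports removing stop words.
Keyword analyzer (keyword): a no-op analyzer that outputs the input unchanged as a single term.
Pattern analyzer (pattern): uses a regular expression to split text into terms. It supports lowercasing and stop words.
Language analyzers (language): a set of analyzers for specific languages, such as english or french.
Fingerprint analyzer (fingerprint): a specialist analyzer that produces a fingerprint which can be used for duplicate detection.
Custom analyzer: if the built-in analyzers do not meet your needs, you can define a custom analyzer by combining character filters, a tokenizer, and token filters.

See the documentation for details: https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-analyzers.html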
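
To make the differences between a few of these analyzers concrete, the Python sketch below gives rough approximations of four of them. They are simplified stand-ins written for illustration, not the Lucene implementations, and the small stop-word set is a hypothetical subset rather than the list Elasticsearch actually uses.

```python
import re

# Hypothetical, deliberately tiny stop-word set for illustration.
STOPWORDS = frozenset({"a", "an", "and", "of", "the"})


def whitespace_analyzer(text):
    """Split on whitespace only; case and punctuation are left untouched."""
    return text.split()


def simple_analyzer(text):
    """Split at any non-letter character and lowercase every term."""
    return [term.lower() for term in re.findall(r"[^\W\d_]+", text)]


def stop_analyzer(text, stopwords=STOPWORDS):
    """Like the simple analyzer, but drop stop words afterwards."""
    return [term for term in simple_analyzer(text) if term not in stopwords]


def keyword_analyzer(text):
    """No-op: the whole input becomes a single term."""
    return [text]
```

For the input "The 2 QUICK Brown-Foxes", the whitespace version keeps "Brown-Foxes" and "2" as they are, the simple version yields the, quick, brown, foxes, the stop version additionally drops "the", and the keyword version returns the input as one term.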

References

1. medcl/elasticsearch-analysis-ik: The IK Analysis plugin integrates Lucene IK analyzer into elasticsearch, support customized dictionary.
https://github.com/medcl/elasticsearch-analysis-ik
2. "ElesticSearch IK中文分詞使用詳解" (a detailed guide to using IK Chinese tokenization), xsdxs's blog, CSDN
https://blog.csdn.net/xsdxs/article/details/72853288
