IK Analyzer: Download, Usage, and Testing

These steps apply to both Windows 10 (x86) and Ubuntu.

1. Download

Why use the IK analyzer? A comparison with the built-in analyzer appears at the end of this post.

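If you choose to download the source code and build it yourself, use Maven.
In the project directory, first run mvn compile, which creates a target directory. Then run mvn package, which creates a releases directory under target. That directory contains a zip archive: this is the built plugin, identical to the prebuilt download.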

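Alternatively, open the project in IntelliJ IDEA and run the Maven clean, compile, and package goals from the IDE. The result is the same, although building from the command line seemed slightly faster.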

2. Usage

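In the plugins directory under the Elasticsearch (es) installation directory, create a new folder named ik, then extract the contents of the archive from step 1 into it.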

For example, on Ubuntu the extracted contents go into es/plugins/ik.

After that, restart Elasticsearch so that it loads the plugin.
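
The extraction step can also be scripted. The following is a minimal Python sketch, not part of the IK project: the function name install_ik_plugin and both path arguments are hypothetical, and it assumes the release archive is a plain zip file whose contents belong directly under plugins/ik.

```python
import zipfile
from pathlib import Path


def install_ik_plugin(archive: Path, es_home: Path) -> Path:
    """Extract a built IK release archive into <es_home>/plugins/ik.

    Both paths are supplied by the caller; nothing here is specific
    to a particular Elasticsearch or IK version.
    """
    target = es_home / "plugins" / "ik"
    # Create plugins/ik if it does not exist yet.
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(target)
    return target
```

Elasticsearch still has to be restarted afterwards; the sketch only places the files.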

3. Testing

1). Create an index and set its tokenizer to "ik_max_word"

PUT index
{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 1,
    "analysis": {
      "analyzer": {
        "ik": {
          "tokenizer": "ik_max_word"
        }
      }
    }
  },
  "mappings": {
    "test1": {
      "properties": {
        "content": {
          "type": "text",
          "analyzer": "ik",
          "search_analyzer": "ik_max_word"
        }
      }
    }
  }
}
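
The same request can be prepared from a script instead of the console. This is a rough Python sketch using only the standard library; the helper build_request is hypothetical, and ES_URL assumes a local node listening on the default port 9200, which may not match your setup.

```python
import json
import urllib.request

# Assumption: a local node on the default HTTP port; adjust as needed.
ES_URL = "http://localhost:9200"

# Same settings and mappings as the console request above.
INDEX_BODY = {
    "settings": {
        "number_of_shards": 3,
        "number_of_replicas": 1,
        "analysis": {"analyzer": {"ik": {"tokenizer": "ik_max_word"}}},
    },
    "mappings": {
        "test1": {
            "properties": {
                "content": {
                    "type": "text",
                    "analyzer": "ik",
                    "search_analyzer": "ik_max_word",
                }
            }
        }
    },
}


def build_request(method, path, body):
    """Build (but do not send) an HTTP request for a console-style call."""
    data = json.dumps(body, ensure_ascii=False).encode("utf-8")
    return urllib.request.Request(
        f"{ES_URL}/{path}",
        data=data,
        headers={"Content-Type": "application/json"},
        method=method,
    )
```

To actually create the index, pass build_request("PUT", "index", INDEX_BODY) to urllib.request.urlopen while the node is running.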

2). Write documents into the index

POST index/test1/1
{
  "content": "裡皮是一位牌足夠大、支援率足夠高的教練"
}

POST index/test1/2
{
  "content": "他不僅在義大利國家隊取得過成功"
}

POST index/test1/3
{
  "content": "教練還帶領廣州恆大稱霸中超並首次奪得亞冠聯賽"
}

3). Run a search, for example for documents that contain the word "教練" ("coach")

GET index/_search
{
  "query": {
    "match": {
      "content": "教練"
    }
  },
  "highlight": {
    "pre_tags": ["<span style = 'color:red'>"],
    "post_tags": ["</span>"],
    "fields": {"content": {}}
  }
}

4). Search results

{
  "took": 8,
  "timed_out": false,
  "_shards": {
    "total": 3,
    "successful": 3,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 0.18232156,
    "hits": [
      {
        "_index": "index",
        "_type": "test1",
        "_id": "1",
        "_score": 0.18232156,
        "_source": {
          "content": "裡皮是一位牌足夠大、支援率足夠高的教練"
        },
        "highlight": {
          "content": [
            "裡皮是一位牌足夠大、支援率足夠高的<span style = 'color:red'>教練</span>"
          ]
        }
      },
      {
        "_index": "index",
        "_type": "test1",
        "_id": "3",
        "_score": 0.16203022,
        "_source": {
          "content": "教練還帶領廣州恆大稱霸中超並首次奪得亞冠聯賽"
        },
        "highlight": {
          "content": [
            "<span style = 'color:red'>教練</span>還帶領廣州恆大稱霸中超並首次奪得亞冠聯賽"
          ]
        }
      }
    ]
  }
}

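It works as expected: documents 1 and 3 both contain "教練" and are returned with the term highlighted, while document 2 is not.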
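
To check a response like this in a script rather than by eye, the matched terms can be pulled out of the highlight fragments. The sketch below is plain Python over a trimmed copy of the response; the helper names summarize_hits and highlighted_terms are hypothetical, and the tags are the ones set in the search request.

```python
PRE_TAG = "<span style = 'color:red'>"
POST_TAG = "</span>"

# Trimmed copy of the search response (only the fields used here).
RESPONSE = {
    "hits": {
        "total": 2,
        "hits": [
            {
                "_id": "1",
                "_score": 0.18232156,
                "highlight": {"content": [
                    "裡皮是一位牌足夠大、支援率足夠高的"
                    "<span style = 'color:red'>教練</span>"
                ]},
            },
            {
                "_id": "3",
                "_score": 0.16203022,
                "highlight": {"content": [
                    "<span style = 'color:red'>教練</span>"
                    "還帶領廣州恆大稱霸中超並首次奪得亞冠聯賽"
                ]},
            },
        ],
    }
}


def highlighted_terms(fragment, pre_tag=PRE_TAG, post_tag=POST_TAG):
    """Return every substring wrapped in pre_tag ... post_tag."""
    terms = []
    start = fragment.find(pre_tag)
    while start != -1:
        begin = start + len(pre_tag)
        end = fragment.find(post_tag, begin)
        if end == -1:
            break  # unbalanced tags: stop rather than guess
        terms.append(fragment[begin:end])
        start = fragment.find(pre_tag, end + len(post_tag))
    return terms


def summarize_hits(response):
    """Map each hit's _id to the terms highlighted in its content field."""
    summary = {}
    for hit in response["hits"]["hits"]:
        fragments = hit.get("highlight", {}).get("content", [])
        summary[hit["_id"]] = [
            term for frag in fragments for term in highlighted_terms(frag)
        ]
    return summary
```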

4. Notes on the IK analyzer

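The IK analyzer has good support for Chinese. Compared with the analyzer that ships with Elasticsearch, it copes much better with how Chinese words are actually formed.
It provides two analyzers, ik_max_word and ik_smart. How do they differ?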

ik_max_word splits the text at the finest granularity;
ik_smart splits the text at the coarsest granularity.

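You can see the difference for yourself with the test below. The test sentence is "這是一個對分詞器的測試" ("This is a test of the analyzer"), and the results are as follows: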

1). ik_max_word

GET index/_analyze?analyzer=ik_max_word
{
  "text": "這是一個對分詞器的測試"
}

Tokenization result:

{
  "tokens": [
    {
      "token": "這是",
      "start_offset": 0,
      "end_offset": 2,
      "type": "CN_WORD",
      "position": 0
    },
    {
      "token": "一個",
      "start_offset": 2,
      "end_offset": 4,
      "type": "CN_WORD",
      "position": 1
    },
    {
      "token": "一",
      "start_offset": 2,
      "end_offset": 3,
      "type": "TYPE_CNUM",
      "position": 2
    },
    {
      "token": "個",
      "start_offset": 3,
      "end_offset": 4,
      "type": "COUNT",
      "position": 3
    },
    {
      "token": "對分",
      "start_offset": 4,
      "end_offset": 6,
      "type": "CN_WORD",
      "position": 4
    },
    {
      "token": "分詞器",
      "start_offset": 5,
      "end_offset": 8,
      "type": "CN_WORD",
      "position": 5
    },
    {
      "token": "分詞",
      "start_offset": 5,
      "end_offset": 7,
      "type": "CN_WORD",
      "position": 6
    },
    {
      "token": "詞",
      "start_offset": 6,
      "end_offset": 7,
      "type": "CN_WORD",
      "position": 7
    },
    {
      "token": "器",
      "start_offset": 7,
      "end_offset": 8,
      "type": "CN_CHAR",
      "position": 8
    },
    {
      "token": "測試",
      "start_offset": 9,
      "end_offset": 11,
      "type": "CN_WORD",
      "position": 9
    }
  ]
}
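
The start_offset and end_offset fields in the ik_max_word result can be read as character positions in the input sentence. The following Python sketch checks that reading against the ten tokens of that result and lists any characters that no token covers. It assumes the offsets count characters the same way Python indexes a str, which holds for this sentence; the function names are hypothetical.

```python
TEXT = "這是一個對分詞器的測試"

# (token, start_offset, end_offset) copied from the ik_max_word result.
IK_MAX_WORD_TOKENS = [
    ("這是", 0, 2), ("一個", 2, 4), ("一", 2, 3), ("個", 3, 4),
    ("對分", 4, 6), ("分詞器", 5, 8), ("分詞", 5, 7), ("詞", 6, 7),
    ("器", 7, 8), ("測試", 9, 11),
]


def offsets_match(text, tokens):
    """True if every token equals text[start_offset:end_offset]."""
    return all(text[start:end] == token for token, start, end in tokens)


def uncovered_characters(text, tokens):
    """Characters of text that fall inside no token's offset range."""
    covered = set()
    for _, start, end in tokens:
        covered.update(range(start, end))
    return [ch for i, ch in enumerate(text) if i not in covered]
```

For this result, offsets_match returns True and the only uncovered character is "的". Several tokens overlap (for example "對分" at 4-6 and "分詞器" at 5-8), which is what the finest-granularity split looks like in practice.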

2). ik_smart

GET index/_analyze?analyzer=ik_smart
{
  "text": "這是一個對分詞器的測試"
}

Tokenization result:

{
  "tokens": [
    {
      "token": "這是",
      "start_offset": 0,
      "end_offset": 2,
      "type": "CN_WORD",
      "position": 0
    },
    {
      "token": "一個",
      "start_offset": 2,
      "end_offset": 4,
      "type": "CN_WORD",
      "position": 1
    },
    {
      "token": "分詞器",
      "start_offset": 5,
      "end_offset": 8,
      "type": "CN_WORD",
      "position": 2
    },
    {
      "token": "測試",
      "start_offset": 9,
      "end_offset": 11,
      "type": "CN_WORD",
      "position": 3
    }
  ]
}

3). The built-in standard analyzer

GET index/_analyze?analyzer=standard
{
  "text": "這是一個對分詞器的測試"
}

Tokenization result:

{
  "tokens": [
    {
      "token": "這",
      "start_offset": 0,
      "end_offset": 1,
      "type": "<IDEOGRAPHIC>",
      "position": 0
    },
    {
      "token": "是",
      "start_offset": 1,
      "end_offset": 2,
      "type": "<IDEOGRAPHIC>",
      "position": 1
    },
    {
      "token": "一",
      "start_offset": 2,
      "end_offset": 3,
      "type": "<IDEOGRAPHIC>",
      "position": 2
    },
    {
      "token": "個",
      "start_offset": 3,
      "end_offset": 4,
      "type": "<IDEOGRAPHIC>",
      "position": 3
    },
    {
      "token": "對",
      "start_offset": 4,
      "end_offset": 5,
      "type": "<IDEOGRAPHIC>",
      "position": 4
    },
    {
      "token": "分",
      "start_offset": 5,
      "end_offset": 6,
      "type": "<IDEOGRAPHIC>",
      "position": 5
    },
    {
      "token": "詞",
      "start_offset": 6,
      "end_offset": 7,
      "type": "<IDEOGRAPHIC>",
      "position": 6
    },
    {
      "token": "器",
      "start_offset": 7,
      "end_offset": 8,
      "type": "<IDEOGRAPHIC>",
      "position": 7
    },
    {
      "token": "的",
      "start_offset": 8,
      "end_offset": 9,
      "type": "<IDEOGRAPHIC>",
      "position": 8
    },
    {
      "token": "測",
      "start_offset": 9,
      "end_offset": 10,
      "type": "<IDEOGRAPHIC>",
      "position": 9
    },
    {
      "token": "試",
      "start_offset": 10,
      "end_offset": 11,
      "type": "<IDEOGRAPHIC>",
      "position": 10
    }
  ]
}

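In summary, tokenizing the same sentence, "這是一個對分詞器的測試", gives a different result with each analyzer:
ik_max_word: 這是/一個/一/個/對分/分詞器/分詞/詞/器/測試
ik_smart: 這是/一個/分詞器/測試
standard: 這/是/一/個/對/分/詞/器/的/測/試
Compare the three to see the difference: the standard analyzer emits one token per character, while both IK analyzers produce multi-character words.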
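
The three results can also be compared numerically. This short Python sketch takes the slash-separated summaries as input and reports, per analyzer, the number of tokens and the length of the longest token; token_stats is a hypothetical helper, not an Elasticsearch API.

```python
# Slash-separated token lists copied from the summary.
RESULTS = {
    "ik_max_word": "這是/一個/一/個/對分/分詞器/分詞/詞/器/測試",
    "ik_smart": "這是/一個/分詞器/測試",
    "standard": "這/是/一/個/對/分/詞/器/的/測/試",
}


def token_stats(results):
    """Return {analyzer: (token_count, longest_token_length)}."""
    stats = {}
    for name, joined in results.items():
        tokens = joined.split("/")
        stats[name] = (len(tokens), max(len(t) for t in tokens))
    return stats


if __name__ == "__main__":
    for name, (count, longest) in token_stats(RESULTS).items():
        print(f"{name}: {count} tokens, longest token has {longest} characters")
```

On this sentence the counts are 10, 4, and 11 tokens respectively, and only the standard analyzer never produces a token longer than one character.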
