Integrating the IK Chinese analyzer with Elasticsearch 6.x

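Elasticsearch is an open-source, real-time, distributed search and analytics engine built on Apache Lucene(TM). It is used for full-text search, structured search, analytics, and any combination of the three. The IK Analysis plugin integrates the Lucene IK analyzer into Elasticsearch and supports custom dictionaries.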

1. Choosing the IK version

The IK version to install is determined by the Elasticsearch version, as shown in the table below.

IK version    ES version
master        6.x -> master
6.3.0         6.3.0
6.2.4         6.2.4
6.1.3         6.1.3
5.6.8         5.6.8
5.5.3         5.5.3
5.4.3         5.4.3
5.3.3         5.3.3
5.2.2         5.2.2
5.1.2         5.1.2
1.10.6        2.4.6
1.9.5         2.3.5
1.8.1         2.2.1
1.7.0         2.1.1
1.5.0         2.0.0
1.2.6         1.0.0
1.2.5         0.90.x
1.1.3         0.20.x
1.0.0         0.16.2 -> 0.19.0

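The earlier article "ELK 6.3.1 Installation and Deployment" covered installing and deploying elasticsearch 6.3.1, so the matching IK version chosen here is also 6.3.1.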
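Since the IK version has to follow the Elasticsearch version, it can help to read the version from the running node rather than from memory. The sketch below is one way to do that; the function name `es_version_from_root` is made up for this example, and it assumes the root endpoint's JSON contains a single `"number"` field, as it does in 6.x.

```shell
# Print the version number found in the JSON returned by the Elasticsearch
# root endpoint. Reads the response on stdin.
# Assumption: the response contains exactly one "number" field (6.x layout).
es_version_from_root() {
  sed -n 's/.*"number"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' | head -n 1
}

# Usage against a live node (host "lee" is the one used in this article):
#   curl -s 'http://lee:9200/' | es_version_from_root
```

The printed value is the version to look up in the right-hand column of the table.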

2. Online installation

bin/elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v6.3.1/elasticsearch-analysis-ik-6.3.1.zip
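The release URL above contains the version twice, so a small helper avoids editing it by hand when the Elasticsearch version changes. This is only a sketch: `ik_plugin_url` is a hypothetical name, and the URL pattern is simply copied from the command above, so it may not hold for every release.

```shell
# Build the IK release URL for a given Elasticsearch version.
# $1: Elasticsearch version, e.g. 6.3.1
# Assumption: releases follow the same URL pattern as the 6.3.1 link above.
ik_plugin_url() {
  printf 'https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v%s/elasticsearch-analysis-ik-%s.zip\n' "$1" "$1"
}

# Usage from the Elasticsearch home directory:
#   bin/elasticsearch-plugin install "$(ik_plugin_url 6.3.1)"
#   bin/elasticsearch-plugin list    # the IK plugin should now appear in the list
```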

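3. Restart ES

ps -ef | grep elasticsearch   # find the ES process ID

kill -9 <pid>   # kill the ES process; replace <pid> with the ID found above

bin/elasticsearch -d && tail -f logs/elasticsearch.log   # restart ES as a daemon and follow its log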
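As a gentler alternative to `kill -9`, the node can be started with a pid file (`-p`) and stopped with the default SIGTERM, which gives it a chance to shut down cleanly. The sketch below assumes a pid file named `es.pid`; that file name and the function `stop_by_pidfile` are hypothetical and not part of the original setup.

```shell
# Stop the process whose PID is stored in a pid file, using SIGTERM.
# $1: pid file path; $2: seconds to wait for exit (default 30).
# Returns non-zero if the pid file is missing, the signal cannot be sent,
# or the process is still alive after the wait.
stop_by_pidfile() {
  pidfile=$1
  timeout=${2:-30}
  [ -r "$pidfile" ] || { echo "pid file not found: $pidfile" >&2; return 1; }
  pid=$(cat "$pidfile")
  kill "$pid" 2>/dev/null || return 1      # SIGTERM, not SIGKILL
  while kill -0 "$pid" 2>/dev/null; do     # poll until the process is gone
    [ "$timeout" -gt 0 ] || return 1
    sleep 1
    timeout=$((timeout - 1))
  done
}

# Usage from the Elasticsearch home directory:
#   bin/elasticsearch -d -p es.pid
#   stop_by_pidfile es.pid && bin/elasticsearch -d -p es.pid
#   tail -f logs/elasticsearch.log
```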

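4. Testing IK

The IK analyzer offers two modes, ik_smart and ik_max_word, which differ as follows:

ik_max_word: splits the text at the finest granularity. For example, it splits “內地港澳同胞:港珠澳大橋讓港澳與國家融合更緊密” into “內地、港澳同胞、港澳、同胞、港、珠、澳、大橋、讓、港澳、與國、國家、融合、更緊、緊密”, exhausting every possible combination;

ik_smart: splits the text at the coarsest granularity. For example, it splits “內地港澳同胞:港珠澳大橋讓港澳與國家融合更緊密” into “內地、港澳同胞、港、珠、澳、大橋、讓、港澳、與、國家、融合、更、緊密”.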
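To compare the two modes on the same text, the `_analyze` request can be wrapped in a small helper that takes the analyzer name as a parameter. This is a minimal sketch: `analyze_body`, `ik_analyze`, and the `ES_URL` variable are names invented here, and no JSON escaping is done.

```shell
# Build the JSON body for the _analyze API.
# $1: analyzer name, $2: text to analyze.
# Assumption: the text contains no double quotes or backslashes (no escaping is done).
analyze_body() {
  printf '{"analyzer":"%s","text":"%s"}' "$1" "$2"
}

# Send the text through one analyzer and print the pretty-printed response.
# ES_URL is a hypothetical override; it defaults to the host used in this article.
ik_analyze() {
  curl -s -XGET "${ES_URL:-http://lee:9200}/_analyze?pretty" \
    -H 'Content-Type: application/json' \
    -d "$(analyze_body "$1" "$2")"
}

# Usage against a live node:
#   for a in ik_max_word ik_smart; do ik_analyze "$a" "港珠澳大橋"; done
```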

4.1 ik_max_word analysis

Input text (JSON):

curl -XGET 'http://lee:9200/_analyze?pretty' -H 'Content-Type:application/json' -d '
{
  "analyzer": "ik_max_word",
  "text": "內地港澳同胞:港珠澳大橋讓港澳與國家融合更緊密"
}'

Resulting tokens:

{
  "tokens" : [
    {
      "token" : "內地",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "港澳同胞",
      "start_offset" : 2,
      "end_offset" : 6,
      "type" : "CN_WORD",
      "position" : 1
    },
    {
      "token" : "港澳",
      "start_offset" : 2,
      "end_offset" : 4,
      "type" : "CN_WORD",
      "position" : 2
    },
    {
      "token" : "同胞",
      "start_offset" : 4,
      "end_offset" : 6,
      "type" : "CN_WORD",
      "position" : 3
    },
    {
      "token" : "港",
      "start_offset" : 7,
      "end_offset" : 8,
      "type" : "CN_CHAR",
      "position" : 4
    },
    {
      "token" : "珠",
      "start_offset" : 8,
      "end_offset" : 9,
      "type" : "CN_CHAR",
      "position" : 5
    },
    {
      "token" : "澳",
      "start_offset" : 9,
      "end_offset" : 10,
      "type" : "CN_CHAR",
      "position" : 6
    },
    {
      "token" : "大橋",
      "start_offset" : 10,
      "end_offset" : 12,
      "type" : "CN_WORD",
      "position" : 7
    },
    {
      "token" : "讓",
      "start_offset" : 12,
      "end_offset" : 13,
      "type" : "CN_CHAR",
      "position" : 8
    },
    {
      "token" : "港澳",
      "start_offset" : 13,
      "end_offset" : 15,
      "type" : "CN_WORD",
      "position" : 9
    },
    {
      "token" : "與國",
      "start_offset" : 15,
      "end_offset" : 17,
      "type" : "CN_WORD",
      "position" : 10
    },
    {
      "token" : "國家",
      "start_offset" : 16,
      "end_offset" : 18,
      "type" : "CN_WORD",
      "position" : 11
    },
    {
      "token" : "融合",
      "start_offset" : 18,
      "end_offset" : 20,
      "type" : "CN_WORD",
      "position" : 12
    },
    {
      "token" : "更緊",
      "start_offset" : 20,
      "end_offset" : 22,
      "type" : "CN_WORD",
      "position" : 13
    },
    {
      "token" : "緊密",
      "start_offset" : 21,
      "end_offset" : 23,
      "type" : "CN_WORD",
      "position" : 14
    }
  ]
}
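A response like this is long to read when only the token text matters. The sketch below pulls out just the `"token"` values, one per line; `tokens_from_analyze` is a hypothetical helper name, and it assumes the pretty-printed layout shown above, with one field per line.

```shell
# Print the "token" values from an _analyze response read on stdin, one per line.
# Assumption: the response is pretty-printed (requested with ?pretty), so each
# "token" field sits on its own line.
tokens_from_analyze() {
  sed -n 's/.*"token"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

# Usage against a live node:
#   curl -s -XGET 'http://lee:9200/_analyze?pretty' -H 'Content-Type: application/json' \
#     -d '{"analyzer":"ik_max_word","text":"港珠澳大橋"}' | tokens_from_analyze
```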

4.2 ik_smart analysis

Input text (JSON):

curl -XGET 'http://lee:9200/_analyze?pretty' -H 'Content-Type:application/json' -d '
{
  "analyzer": "ik_smart",
  "text": "內地港澳同胞:港珠澳大橋讓港澳與國家融合更緊密"
}'

Resulting tokens:

{
  "tokens" : [
    {
      "token" : "內地",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "港澳同胞",
      "start_offset" : 2,
      "end_offset" : 6,
      "type" : "CN_WORD",
      "position" : 1
    },
    {
      "token" : "港",
      "start_offset" : 7,
      "end_offset" : 8,
      "type" : "CN_CHAR",
      "position" : 2
    },
    {
      "token" : "珠",
      "start_offset" : 8,
      "end_offset" : 9,
      "type" : "CN_CHAR",
      "position" : 3
    },
    {
      "token" : "澳",
      "start_offset" : 9,
      "end_offset" : 10,
      "type" : "CN_CHAR",
      "position" : 4
    },
    {
      "token" : "大橋",
      "start_offset" : 10,
      "end_offset" : 12,
      "type" : "CN_WORD",
      "position" : 5
    },
    {
      "token" : "讓",
      "start_offset" : 12,
      "end_offset" : 13,
      "type" : "CN_CHAR",
      "position" : 6
    },
    {
      "token" : "港澳",
      "start_offset" : 13,
      "end_offset" : 15,
      "type" : "CN_WORD",
      "position" : 7
    },
    {
      "token" : "與",
      "start_offset" : 15,
      "end_offset" : 16,
      "type" : "CN_CHAR",
      "position" : 8
    },
    {
      "token" : "國家",
      "start_offset" : 16,
      "end_offset" : 18,
      "type" : "CN_WORD",
      "position" : 9
    },
    {
      "token" : "融合",
      "start_offset" : 18,
      "end_offset" : 20,
      "type" : "CN_WORD",
      "position" : 10
    },
    {
      "token" : "更",
      "start_offset" : 20,
      "end_offset" : 21,
      "type" : "CN_CHAR",
      "position" : 11
    },
    {
      "token" : "緊密",
      "start_offset" : 21,
      "end_offset" : 23,
      "type" : "CN_WORD",
      "position" : 12
    }
  ]
}
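With both results in hand, the difference between the two modes can be listed mechanically. The sketch below compares the token texts from sections 4.1 and 4.2; `tokens_only_in` is a hypothetical helper, and it compares text only, so the extra overlapping 港澳 at offsets 2 to 4 in the ik_max_word result is not reported.

```shell
# Print the lines of the first file that do not occur in the second file.
# $1, $2: files with one token per line.
tokens_only_in() {
  awk 'NR == FNR { seen[$0] = 1; next } !($0 in seen)' "$2" "$1"
}

# Token texts copied from the two responses in this article.
max_word=$(mktemp)
smart=$(mktemp)
printf '%s\n' 內地 港澳同胞 港澳 同胞 港 珠 澳 大橋 讓 港澳 與國 國家 融合 更緊 緊密 > "$max_word"
printf '%s\n' 內地 港澳同胞 港 珠 澳 大橋 讓 港澳 與 國家 融合 更 緊密 > "$smart"

echo "only in ik_max_word:"
tokens_only_in "$max_word" "$smart"    # prints 同胞, 與國, 更緊
echo "only in ik_smart:"
tokens_only_in "$smart" "$max_word"    # prints 與, 更
rm -f "$max_word" "$smart"
```

The extra ik_max_word tokens are overlapping candidates (同胞 inside 港澳同胞, 與國 and 更緊 straddling neighbouring words), which is what "finest granularity" means in practice.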

4.3 Searching with the analyzer

4.3.1 Create the index

curl -XPUT http://lee:9200/index

4.3.2 Index mapping

curl -XPOST http://lee:9200/index/fulltext/_mapping -H 'Content-Type:application/json' -d '
{
  "properties": {
    "content": {
      "type": "text",
      "analyzer": "ik_max_word",
      "search_analyzer": "ik_max_word"
    }
  }
}'
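In Elasticsearch 6.x the index and its mapping can also be created in a single request by nesting the mapping under `mappings` and the type name. The sketch below prints such a body; `index_body` is a hypothetical helper, and the field settings are the same as in the mapping above.

```shell
# Print a create-index body that declares the "fulltext" mapping up front.
# Elasticsearch 6.x syntax: the mapping is nested under its type name.
index_body() {
  cat <<'EOF'
{
  "mappings": {
    "fulltext": {
      "properties": {
        "content": {
          "type": "text",
          "analyzer": "ik_max_word",
          "search_analyzer": "ik_max_word"
        }
      }
    }
  }
}
EOF
}

# Usage against a live node, as an alternative to steps 4.3.1 and 4.3.2:
#   index_body | curl -s -XPUT 'http://lee:9200/index' -H 'Content-Type: application/json' -d @-
```

This only works when the index does not exist yet; for an existing index the separate `_mapping` call is still needed.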

4.3.3 Index documents

curl -XPOST http://lee:9200/index/fulltext/1 -H 'Content-Type:application/json' -d '
{"content":"美國留給伊拉克的是個爛攤子嗎"}
'

curl -XPOST http://lee:9200/index/fulltext/2 -H 'Content-Type:application/json' -d '
{"content":"公安部:各地校車將享最高路權"}
'

curl -XPOST http://lee:9200/index/fulltext/3 -H 'Content-Type:application/json' -d '
{"content":"中韓漁警衝突調查:韓警平均每天扣1艘中國漁船"}
'

curl -XPOST http://lee:9200/index/fulltext/4 -H 'Content-Type:application/json' -d '
{"content":"中國駐洛杉磯領事館遭亞裔男子槍擊 嫌犯已自首"}
'
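The four requests above can also be sent in one round trip through the `_bulk` API, which expects newline-delimited JSON: an action line followed by a source line for each document. The sketch below generates those lines; `bulk_lines` is a hypothetical helper, IDs are assigned sequentially from 1, and no JSON escaping is done.

```shell
# Emit _bulk NDJSON for the documents given as arguments.
# IDs are assigned 1, 2, 3, ... in argument order.
# Assumption: the texts contain no double quotes or backslashes (no escaping is done).
bulk_lines() {
  id=1
  for content in "$@"; do
    printf '{"index":{"_index":"index","_type":"fulltext","_id":"%s"}}\n' "$id"
    printf '{"content":"%s"}\n' "$content"
    id=$((id + 1))
  done
}

# Usage against a live node:
#   bulk_lines "美國留給伊拉克的是個爛攤子嗎" "公安部:各地校車將享最高路權" |
#     curl -s -XPOST 'http://lee:9200/_bulk' -H 'Content-Type: application/x-ndjson' --data-binary @-
```

`--data-binary` is used instead of `-d` because `-d` strips the newlines that the bulk format depends on.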

4.3.4 Query

curl -XPOST http://lee:9200/index/fulltext/_search -H 'Content-Type:application/json' -d '
{
  "query" : { "match" : { "content" : "中國" }},
  "highlight" : {
    "pre_tags" : ["<tag1>", "<tag2>"],
    "post_tags" : ["</tag1>", "</tag2>"],
    "fields" : {
      "content" : {}
    }
  }
}
'

4.3.5 Query result

{
  "took": 136,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 0.6489038,
    "hits": [{
      "_index": "index",
      "_type": "fulltext",
      "_id": "4",
      "_score": 0.6489038,
      "_source": {
        "content": "中國駐洛杉磯領事館遭亞裔男子槍擊 嫌犯已自首"
      },
      "highlight": {
        "content": ["<tag1>中國</tag1>駐洛杉磯領事館遭亞裔男子槍擊 嫌犯已自首"]
      }
    }, {
      "_index": "index",
      "_type": "fulltext",
      "_id": "3",
      "_score": 0.2876821,
      "_source": {
        "content": "中韓漁警衝突調查:韓警平均每天扣1艘中國漁船"
      },
      "highlight": {
        "content": ["中韓漁警衝突調查:韓警平均每天扣1艘<tag1>中國</tag1>漁船"]
      }
    }]
  }
}
