Elasticsearch: custom dictionaries for the es-ik Chinese word-segmentation plugin


[hadoop@HadoopMaster elasticsearch-2.4.3]$ curl '' -d '{"text":"好記性不如爛筆頭感嘆號博客園"}'
"tokens" : [ {
"token" : "好記",
"start_offset" : 0,
"end_offset" : 2,
"type" : "CN_WORD",
"position" : 0
}, {
"token" : "記性",
"start_offset" : 1,
"end_offset" : 3,
"type" : "CN_WORD",
"position" : 1
}, {
"token" : "不如",
"start_offset" : 3,
"end_offset" : 5,
"type" : "CN_WORD",
"position" : 2
}, {
"token" : "爛",
"start_offset" : 5,
"end_offset" : 6,
"type" : "CN_CHAR",
"position" : 3
}, {
"token" : "筆頭",
"start_offset" : 6,
"end_offset" : 8,
"type" : "CN_WORD",
"position" : 4
}, {
"token" : "筆",
"start_offset" : 6,
"end_offset" : 7,
"type" : "CN_WORD",
"position" : 5
}, {
"token" : "頭",
"start_offset" : 7,
"end_offset" : 8,
"type" : "CN_CHAR",
"position" : 6
}, {
"token" : "感嘆號",
"start_offset" : 8,
"end_offset" : 11,
"type" : "CN_WORD",
"position" : 7
}, {
"token" : "感嘆",
"start_offset" : 8,
"end_offset" : 10,
"type" : "CN_WORD",
"position" : 8
}, {
"token" : "嘆號",
"start_offset" : 9,
"end_offset" : 11,
"type" : "CN_WORD",
"position" : 9
}, {
"token" : "嘆",
"start_offset" : 9,
"end_offset" : 10,
"type" : "CN_WORD",
"position" : 10
}, {
"token" : "號",
"start_offset" : 10,
"end_offset" : 11,
"type" : "CN_CHAR",
"position" : 11
}, {
"token" : "博客園",
"start_offset" : 11,
"end_offset" : 14,
"type" : "CN_WORD",
"position" : 12
}, {
"token" : "博客",
"start_offset" : 11,
"end_offset" : 13,
"type" : "CN_WORD",
"position" : 13
}, {
"token" : "園",
"start_offset" : 13,
"end_offset" : 14,
"type" : "CN_CHAR",
"position" : 14
} ]
[hadoop@HadoopMaster elasticsearch-2.4.3]$








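[Steps for adding an ik custom dictionary]
1: First, create a file named zhouls.dic in the ik plugin's config/custom directory (the name is up to you; my.dic works just as well).
Add your words to the file, one word per line.
Note: you can create this file directly on Linux with vi, or create it on Windows and upload it here.
A file created directly on Linux with vi can be used as-is.
A file created on Windows must be saved as UTF-8 without BOM.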
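
The BOM requirement in step 1 can be checked from the shell. The following is a minimal sketch, not part of the original walkthrough: it uses a scratch directory in place of the plugin's config/custom directory, the dictionary entry is only an example, and the helper names has_bom and strip_bom are made up for illustration.

```shell
# Scratch directory standing in for config/custom (an assumption, so the
# sketch can be tried without touching a real installation).
dict_dir="$(mktemp -d)"
dict_file="$dict_dir/zhouls.dic"

# Simulate a file saved on Windows: UTF-8 BOM bytes (EF BB BF), then one word per line.
printf '\357\273\277%s\n' '感嘆號' > "$dict_file"

# has_bom FILE: succeed if FILE starts with the three BOM bytes.
has_bom() {
  [ "$(head -c 3 "$1" | od -An -tx1 | tr -d ' \n')" = "efbbbf" ]
}

# strip_bom FILE: rewrite FILE without its first three bytes if they are a BOM.
strip_bom() {
  if has_bom "$1"; then
    tail -c +4 "$1" > "$1.tmp" && mv "$1.tmp" "$1"
  fi
}

if has_bom "$dict_file"; then before=yes; else before=no; fi
strip_bom "$dict_file"
if has_bom "$dict_file"; then after=yes; else after=no; fi
echo "BOM before: $before, after: $after"
```

On a real installation, the same two helpers could be pointed at the uploaded .dic file before restarting es.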



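2: Edit the ik configuration file
By default the ik configuration file, IKAnalyzer.cfg.xml, is in the ik plugin's config directory.
Add the path of the file you just created to it.
vi config/IKAnalyzer.cfg.xml
<comment>IK Analyzer extension configuration</comment>
<!-- Users can configure their own extension dictionaries here -->
<entry key="ext_dict">custom/mydict.dic;custom/single_word_low_freq.dic;custom/zhouls.dic</entry>
<entry key="ext_stopwords">custom/ext_stopword.dic</entry>
<!-- Users can configure remote extension dictionaries here -->
<!-- <entry key="remote_ext_dict">words_location</entry> -->
<!-- <entry key="remote_ext_stopwords">words_location</entry> -->
Note: the path of your dictionary file (zhouls.dic here) must go into the existing key="ext_dict" entry, with paths separated by semicolons. Do not add entries of your own; ik does not recognize them.
Do not rename the entry keys either, or they will not be recognized.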
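
The edit in step 2 can also be scripted instead of done by hand in vi. This is a rough sketch under stated assumptions: it works on a scratch copy of the file rather than the real IKAnalyzer.cfg.xml, and the <properties> root element is assumed from the Java properties DTD named in the file's DOCTYPE (the listings in this post do not show it).

```shell
# Scratch copy of the configuration (an assumption; point cfg at the real
# file only once the result looks right).
cfg="$(mktemp)"
cat > "$cfg" <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
<comment>IK Analyzer extension configuration</comment>
<entry key="ext_dict">custom/mydict.dic;custom/single_word_low_freq.dic</entry>
<entry key="ext_stopwords">custom/ext_stopword.dic</entry>
</properties>
EOF

new_dict="custom/zhouls.dic"

# Append to the existing ext_dict entry only if the path is not listed yet,
# so running the script twice does not add it twice.
if ! grep -qF "$new_dict" "$cfg"; then
  sed "s|\(<entry key=\"ext_dict\">[^<]*\)</entry>|\1;$new_dict</entry>|" "$cfg" > "$cfg.new" \
    && mv "$cfg.new" "$cfg"
fi

# Show the resulting entry.
grep 'key="ext_dict"' "$cfg"
```

The substitution only extends the value of the existing ext_dict entry; it never adds a new entry or renames a key, in line with the note above.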



3: Restart es and check the tokenization result






[hadoop@HadoopMaster elasticsearch-2.4.3]$ ll
total 56
drwxrwxr-x. 2 hadoop hadoop 4096 Feb 22 01:37 bin
drwxrwxr-x. 3 hadoop hadoop 4096 Feb 22 18:46 config
drwxrwxr-x. 3 hadoop hadoop 4096 Feb 22 06:05 data
drwxrwxr-x. 2 hadoop hadoop 4096 Feb 22 01:37 lib
-rw-rw-r--. 1 hadoop hadoop 11358 Aug 24 2016 LICENSE.txt
drwxrwxr-x. 2 hadoop hadoop 4096 Feb 25 05:15 logs
drwxrwxr-x. 5 hadoop hadoop 4096 Dec 8 00:41 modules
-rw-rw-r--. 1 hadoop hadoop 150 Aug 24 2016 NOTICE.txt
drwxrwxr-x. 5 hadoop hadoop 4096 Feb 25 06:31 plugins
-rw-rw-r--. 1 hadoop hadoop 8700 Aug 24 2016 README.textile
[hadoop@HadoopMaster elasticsearch-2.4.3]$ cd plugins/
[hadoop@HadoopMaster plugins]$ ll
total 12
drwxrwxr-x. 5 hadoop hadoop 4096 Feb 22 05:28 head
drwxrwxr-x. 3 hadoop hadoop 4096 Feb 25 06:32 ik
drwxrwxr-x. 8 hadoop hadoop 4096 Feb 22 05:34 kopf
[hadoop@HadoopMaster plugins]$ cd ik/
[hadoop@HadoopMaster ik]$ ll
total 5828
-rw-r--r--. 1 hadoop hadoop 263965 Dec 1 2015 commons-codec-1.9.jar
-rw-r--r--. 1 hadoop hadoop 61829 Dec 1 2015 commons-logging-1.2.jar
drwxr-xr-x. 3 hadoop hadoop 4096 Jan 1 12:46 config
-rw-r--r--. 1 hadoop hadoop 55998 Jan 1 13:27 elasticsearch-analysis-ik-1.10.3.jar
-rw-r--r--. 1 hadoop hadoop 4505518 Jan 15 08:59 elasticsearch-analysis-ik-1.10.3.zip
-rw-r--r--. 1 hadoop hadoop 736658 Jan 1 13:26 httpclient-4.5.2.jar
-rw-r--r--. 1 hadoop hadoop 326724 Jan 1 13:07 httpcore-4.4.4.jar
-rw-r--r--. 1 hadoop hadoop 2667 Jan 1 13:27 plugin-descriptor.properties
[hadoop@HadoopMaster ik]$ cd config/

[hadoop@HadoopMaster config]$ ll
total 3016
drwxr-xr-x. 2 hadoop hadoop 4096 Jan 1 12:46 custom
-rw-r--r--. 1 hadoop hadoop 697 Dec 14 10:34 IKAnalyzer.cfg.xml
-rw-r--r--. 1 hadoop hadoop 3058510 Dec 14 10:34 main.dic
-rw-r--r--. 1 hadoop hadoop 123 Dec 14 10:34 preposition.dic
-rw-r--r--. 1 hadoop hadoop 1824 Dec 14 10:34 quantifier.dic
-rw-r--r--. 1 hadoop hadoop 164 Dec 14 10:34 stopword.dic
-rw-r--r--. 1 hadoop hadoop 192 Dec 14 10:34 suffix.dic
-rw-r--r--. 1 hadoop hadoop 752 Dec 14 10:34 surname.dic
[hadoop@HadoopMaster config]$ cd custom/
[hadoop@HadoopMaster custom]$ ll
total 5252
-rw-r--r--. 1 hadoop hadoop 156 Dec 14 10:34 ext_stopword.dic
-rw-r--r--. 1 hadoop hadoop 130 Dec 14 10:34 mydict.dic
-rw-r--r--. 1 hadoop hadoop 63188 Dec 14 10:34 single_word.dic
-rw-r--r--. 1 hadoop hadoop 63188 Dec 14 10:34 single_word_full.dic
-rw-r--r--. 1 hadoop hadoop 10855 Dec 14 10:34 single_word_low_freq.dic
-rw-r--r--. 1 hadoop hadoop 5225922 Dec 14 10:34 sougou.dic
[hadoop@HadoopMaster custom]$ vim zhouls.dic




[hadoop@HadoopMaster custom]$ pwd
[hadoop@HadoopMaster custom]$ vim zhouls.dic
[hadoop@HadoopMaster custom]$ cat zhouls.dic
[hadoop@HadoopMaster custom]$





[hadoop@HadoopMaster custom]$ ll
total 5256
-rw-r--r--. 1 hadoop hadoop 156 Dec 14 10:34 ext_stopword.dic
-rw-r--r--. 1 hadoop hadoop 130 Dec 14 10:34 mydict.dic
-rw-r--r--. 1 hadoop hadoop 63188 Dec 14 10:34 single_word.dic
-rw-r--r--. 1 hadoop hadoop 63188 Dec 14 10:34 single_word_full.dic
-rw-r--r--. 1 hadoop hadoop 10855 Dec 14 10:34 single_word_low_freq.dic
-rw-r--r--. 1 hadoop hadoop 5225922 Dec 14 10:34 sougou.dic
-rw-rw-r--. 1 hadoop hadoop 43 Feb 25 17:16 zhouls.dic
[hadoop@HadoopMaster custom]$ cd ..
[hadoop@HadoopMaster config]$ ll
total 3016
drwxr-xr-x. 2 hadoop hadoop 4096 Feb 25 17:16 custom
-rw-r--r--. 1 hadoop hadoop 697 Dec 14 10:34 IKAnalyzer.cfg.xml
-rw-r--r--. 1 hadoop hadoop 3058510 Dec 14 10:34 main.dic
-rw-r--r--. 1 hadoop hadoop 123 Dec 14 10:34 preposition.dic
-rw-r--r--. 1 hadoop hadoop 1824 Dec 14 10:34 quantifier.dic
-rw-r--r--. 1 hadoop hadoop 164 Dec 14 10:34 stopword.dic
-rw-r--r--. 1 hadoop hadoop 192 Dec 14 10:34 suffix.dic
-rw-r--r--. 1 hadoop hadoop 752 Dec 14 10:34 surname.dic
[hadoop@HadoopMaster config]$ vim IKAnalyzer.cfg.xml




<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<comment>IK Analyzer extension configuration</comment>
<!-- Users can configure their own extension dictionaries here -->
<entry key="ext_dict">custom/mydict.dic;custom/single_word_low_freq.dic</entry>
<entry key="ext_stopwords">custom/ext_stopword.dic</entry>
<!-- Users can configure remote extension dictionaries here -->
<!-- <entry key="remote_ext_dict">words_location</entry> -->
<!-- <entry key="remote_ext_stopwords">words_location</entry> -->




[hadoop@HadoopMaster config]$ vim IKAnalyzer.cfg.xml
[hadoop@HadoopMaster config]$ cat IKAnalyzer.cfg.xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<comment>IK Analyzer extension configuration</comment>
<!-- Users can configure their own extension dictionaries here -->
<entry key="ext_dict">custom/mydict.dic;custom/single_word_low_freq.dic;custom/zhouls.dic</entry>
<entry key="ext_stopwords">custom/ext_stopword.dic</entry>
<!-- Users can configure remote extension dictionaries here -->
<!-- <entry key="remote_ext_dict">words_location</entry> -->
<!-- <entry key="remote_ext_stopwords">words_location</entry> -->
[hadoop@HadoopMaster config]$





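To make the effect easier to see, I start the es process in the foreground with bin/elasticsearch. Running bin/elasticsearch -d (as a daemon) is generally recommended. In a production environment you can also register es as a system service, which is more convenient.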
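
Before restarting, the running node has to be stopped. The following is a small sketch of one way to find its pid from jps output; the helper name es_pid is made up for illustration, and the sample text is copied from the session below. A plain kill sends SIGTERM, which generally gives the node a chance to shut down cleanly, whereas the session below uses kill -9.

```shell
# es_pid: read `jps` output on stdin and print the pid of the Elasticsearch JVM.
es_pid() {
  awk '$2 == "Elasticsearch" { print $1 }'
}

# Sample jps output; on a live host use: jps | es_pid
sample='1974 Elasticsearch
2137 Jps'

pid="$(printf '%s\n' "$sample" | es_pid)"

# Only print the command here; run it for real once the pid is confirmed.
echo "would run: kill $pid"
```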

[hadoop@HadoopMaster plugins]$ cd ..
[hadoop@HadoopMaster elasticsearch-2.4.3]$ jps
1974 Elasticsearch
2137 Jps
[hadoop@HadoopMaster elasticsearch-2.4.3]$ kill -9 1974
[hadoop@HadoopMaster elasticsearch-2.4.3]$ jps
2148 Jps
[hadoop@HadoopMaster elasticsearch-2.4.3]$ bin/elasticsearch
[2017-02-25 17:27:56,301][WARN ][bootstrap ] unable to install syscall filter: seccomp unavailable: requires kernel 3.5+ with CONFIG_SECCOMP and CONFIG_SECCOMP_FILTER compiled in
[2017-02-25 17:27:57,741][INFO ][node ] [Tethlam] version[2.4.3], pid[2158], build[d38a34e/2016-12-07T16:28:56Z]
[2017-02-25 17:27:57,741][INFO ][node ] [Tethlam] initializing ...
[2017-02-25 17:27:59,504][INFO ][plugins ] [Tethlam] modules [lang-groovy, reindex, lang-expression], plugins [analysis-ik, kopf, head], sites [kopf, head]
[2017-02-25 17:27:59,553][INFO ][env ] [Tethlam] using [1] data paths, mounts [[/home (/dev/sda5)]], net usable_space [23.4gb], net total_space [26.1gb], spins? [possibly], types [ext4]
[2017-02-25 17:27:59,553][INFO ][env ] [Tethlam] heap size [1015.6mb], compressed ordinary object pointers [true]
[2017-02-25 17:27:59,553][WARN ][env ] [Tethlam] max file descriptors [4096] for elasticsearch process likely too low, consider increasing to at least [65536]
[2017-02-25 17:28:02,922][INFO ][ik-analyzer ] try load config from /home/hadoop/app/elasticsearch-2.4.3/config/analysis-ik/IKAnalyzer.cfg.xml
[2017-02-25 17:28:02,923][INFO ][ik-analyzer ] try load config from /home/hadoop/app/elasticsearch-2.4.3/plugins/ik/config/IKAnalyzer.cfg.xml
[2017-02-25 17:28:03,748][INFO ][ik-analyzer ] [Dict Loading] custom/mydict.dic
[2017-02-25 17:28:03,749][INFO ][ik-analyzer ] [Dict Loading] custom/single_word_low_freq.dic
[2017-02-25 17:28:03,755][INFO ][ik-analyzer ] [Dict Loading] custom/zhouls.dic
[2017-02-25 17:28:03,760][INFO ][ik-analyzer ] [Dict Loading] custom/ext_stopword.dic
[2017-02-25 17:28:06,914][INFO ][node ] [Tethlam] initialized
[2017-02-25 17:28:06,915][INFO ][node ] [Tethlam] starting ...
[2017-02-25 17:28:07,168][INFO ][transport ] [Tethlam] publish_address {}, bound_addresses {[::]:9300}
[2017-02-25 17:28:07,203][INFO ][discovery ] [Tethlam] elasticsearch/dXjRTwNJRdyzQWPbHIzGiQ
[2017-02-25 17:28:10,589][INFO ][cluster.service ] [Tethlam] detected_master {Peter Parker}{3TwJeRfnRH-EttHntj0OdQ}{}{}, added {{Peter Parker}{3TwJeRfnRH-EttHntj0OdQ}{}{},{Living Laser}{bqV_F5bLRdq9AGtv3WLx4A}{}{},}, reason: zen-disco-receive(from master [{Peter Parker}{3TwJeRfnRH-EttHntj0OdQ}{}{}])
[2017-02-25 17:28:10,920][INFO ][http ] [Tethlam] publish_address {}, bound_addresses {[::]:9200}
[2017-02-25 17:28:10,923][INFO ][node ] [Tethlam] started
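
The [Dict Loading] lines in the startup log are the quickest confirmation that the new dictionary was picked up. As a minimal sketch, the check can be scripted as below; the helper name dict_loaded is hypothetical, and the log path in the closing note is an assumption based on the cluster name elasticsearch shown in the discovery line.

```shell
# dict_loaded PATH: read log text on stdin and succeed if ik reported loading PATH.
dict_loaded() {
  grep -qF "[Dict Loading] $1"
}

# Two lines copied from the startup log above.
log_sample='[2017-02-25 17:28:03,749][INFO ][ik-analyzer ] [Dict Loading] custom/single_word_low_freq.dic
[2017-02-25 17:28:03,755][INFO ][ik-analyzer ] [Dict Loading] custom/zhouls.dic'

if printf '%s\n' "$log_sample" | dict_loaded custom/zhouls.dic; then
  echo "custom/zhouls.dic was loaded"
else
  echo "custom/zhouls.dic was NOT loaded; check the ext_dict entry"
fi
```

When es runs as a daemon, the same check could read the log file instead, for example dict_loaded custom/zhouls.dic < logs/elasticsearch.log.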




[hadoop@HadoopMaster elasticsearch-2.4.3]$ jps
2280 Jps
2231 Elasticsearch
[hadoop@HadoopMaster elasticsearch-2.4.3]$ curl '' -d '{"text":"好記性不如爛筆頭感嘆號博客園"}'
"tokens" : [ {
"token" : "好記性不如爛筆頭感嘆號博客園",
"start_offset" : 0,
"end_offset" : 14,
"type" : "CN_WORD",
"position" : 0
}, {
"token" : "好記",
"start_offset" : 0,
"end_offset" : 2,
"type" : "CN_WORD",
"position" : 1
}, {
"token" : "記性",
"start_offset" : 1,
"end_offset" : 3,
"type" : "CN_WORD",
"position" : 2
}, {
"token" : "不如",
"start_offset" : 3,
"end_offset" : 5,
"type" : "CN_WORD",
"position" : 3
}, {
"token" : "爛",
"start_offset" : 5,
"end_offset" : 6,
"type" : "CN_CHAR",
"position" : 4
}, {
"token" : "筆頭",
"start_offset" : 6,
"end_offset" : 8,
"type" : "CN_WORD",
"position" : 5
}, {
"token" : "筆",
"start_offset" : 6,
"end_offset" : 7,
"type" : "CN_WORD",
"position" : 6
}, {
"token" : "頭",
"start_offset" : 7,
"end_offset" : 8,
"type" : "CN_CHAR",
"position" : 7
}, {
"token" : "感嘆號",
"start_offset" : 8,
"end_offset" : 11,
"type" : "CN_WORD",
"position" : 8
}, {
"token" : "感嘆",
"start_offset" : 8,
"end_offset" : 10,
"type" : "CN_WORD",
"position" : 9
}, {
"token" : "嘆號",
"start_offset" : 9,
"end_offset" : 11,
"type" : "CN_WORD",
"position" : 10
}, {
"token" : "嘆",
"start_offset" : 9,
"end_offset" : 10,
"type" : "CN_WORD",
"position" : 11
}, {
"token" : "號",
"start_offset" : 10,
"end_offset" : 11,
"type" : "CN_CHAR",
"position" : 12
}, {
"token" : "博客園",
"start_offset" : 11,
"end_offset" : 14,
"type" : "CN_WORD",
"position" : 13
}, {
"token" : "博客",
"start_offset" : 11,
"end_offset" : 13,
"type" : "CN_WORD",
"position" : 14
}, {
"token" : "園",
"start_offset" : 13,
"end_offset" : 14,
"type" : "CN_CHAR",
"position" : 15
} ]
[hadoop@HadoopMaster elasticsearch-2.4.3]$
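
Comparing the two responses by eye is tedious: the only difference is the extra first token covering the whole sentence (offsets 0 to 14), which now comes from the custom dictionary. Here is a rough sketch of how the comparison could be scripted. The helper name list_tokens is made up, and the two samples are abbreviated copies of the responses in this post, not complete output.

```shell
# list_tokens: read an analyze response on stdin and print one token per line.
list_tokens() {
  sed -n 's/.*"token" *: *"\([^"]*\)".*/\1/p'
}

# Abbreviated response before zhouls.dic was added (first two tokens only).
before='"tokens" : [ {
"token" : "好記",
}, {
"token" : "記性",
} ]'

# Abbreviated response after the restart (first three tokens only).
after='"tokens" : [ {
"token" : "好記性不如爛筆頭感嘆號博客園",
}, {
"token" : "好記",
}, {
"token" : "記性",
} ]'

before_tokens="$(mktemp)"
after_tokens="$(mktemp)"
printf '%s\n' "$before" | list_tokens > "$before_tokens"
printf '%s\n' "$after" | list_tokens > "$after_tokens"

# Print the tokens that appear only after the dictionary was added.
new_tokens="$(grep -vxFf "$before_tokens" "$after_tokens")"
echo "$new_tokens"
```

With real responses saved to two files, list_tokens can read them directly in place of the inline samples.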

