A Summary of Elasticsearch Analyzers

I. The ik and pinyin analyzers

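Today I used an address book to demonstrate search in Elasticsearch. When searching on names, I wanted both Chinese characters and pinyin to match, so in addition to ik, the Chinese analyzer I normally use, I also downloaded the pinyin analyzer. My notes on using the two follow.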

1. Download

ik: https://github.com/medcl/elasticsearch-analysis-ik
pinyin: https://github.com/medcl/elasticsearch-analysis-pinyin

2. Installation

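Unzip the downloaded files and place them under the plugins folder of the Elasticsearch installation; you can create ik and pinyin subfolders there. Restart Elasticsearch afterwards so the plugins are loaded.
For the pinyin analyzer, I could not download the zip file directly for some reason, so I downloaded the source code, built the package myself, and then unzipped the result into plugins/pinyin.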
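
The manual unpacking step described above could be scripted roughly as follows. This is only a sketch: the function name `install_plugin` and the paths in the usage comment are assumptions, and the archive is expected to be a plugin build that matches your Elasticsearch version.

```python
import zipfile
from pathlib import Path


def install_plugin(archive: Path, es_home: Path, name: str) -> Path:
    """Unpack a plugin zip into <es_home>/plugins/<name> and return that folder."""
    target = es_home / "plugins" / name
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(target)
    return target


# Hypothetical usage (both paths are placeholders, adjust to your setup):
# install_plugin(Path("elasticsearch-analysis-pinyin.zip"),
#                Path("/opt/elasticsearch"), "pinyin")
```

The node still has to be restarted after the files are in place.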

3. Testing the pinyin analyzer

GET _analyze?pretty
{
  "analyzer": "pinyin",
  "text": "劉德華"
}

Result:

{
  "tokens": [
    {
      "token": "liu",
      "start_offset": 0,
      "end_offset": 0,
      "type": "word",
      "position": 0
    },
    {
      "token": "de",
      "start_offset": 0,
      "end_offset": 0,
      "type": "word",
      "position": 1
    },
    {
      "token": "hua",
      "start_offset": 0,
      "end_offset": 0,
      "type": "word",
      "position": 2
    },
    {
      "token": "ldh",
      "start_offset": 0,
      "end_offset": 0,
      "type": "word",
      "position": 2
    }
  ]
}
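
To make the result above easier to read: the analyzer emitted the pinyin of each character plus one token built from the first letters. The sketch below reproduces that shape in plain Python. It is not the plugin's implementation; the lookup table `PINYIN` is a hypothetical three-entry stand-in for a real pronunciation dictionary, and `pinyin_tokens` is an illustrative name.

```python
from typing import List

# Hypothetical, hand-written lookup table covering only this example.
PINYIN = {"劉": "liu", "德": "de", "華": "hua"}


def pinyin_tokens(text: str) -> List[str]:
    """Return per-character pinyin followed by the first-letter abbreviation."""
    # Characters missing from the table are skipped in this sketch.
    syllables = [PINYIN[ch] for ch in text if ch in PINYIN]
    tokens = list(syllables)
    if syllables:
        # Join the first letter of every syllable, e.g. liu/de/hua -> ldh.
        tokens.append("".join(s[0] for s in syllables))
    return tokens


print(pinyin_tokens("劉德華"))  # ['liu', 'de', 'hua', 'ldh']
```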

4. Analyzer configuration in the index template

The analyzers are configured in the template's settings as follows:

"analysis" : {
            "analyzer" : {
                "ik" : {
                    "tokenizer" : "ik_max_word"
                },
                "pinyin_analyzer" : {
                    "tokenizer" : "my_pinyin"
                }
            },
            "tokenizer" : {
                "my_pinyin" : {
                    "keep_separate_first_letter" : "false",
                    "lowercase" : "true",
                    "type" : "pinyin",
                    "limit_first_letter_length" : "16",
                    "remove_duplicated_term" : "true",
                    "keep_original" : "true",
                    "keep_full_pinyin" : "true",
                    "keep_joined_full_pinyin" : "true",
                    "keep_none_chinese_in_joined_full_pinyin" : "true"
                }
            }
}

The options under my_pinyin are documented in the README at https://github.com/medcl/elasticsearch-analysis-pinyin and can be adjusted to suit your needs.
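
For context, the analysis fragment has to sit inside a "settings" object when it is sent to Elasticsearch. A minimal sketch of that wrapping, built as a Python dictionary, follows. The helper names `build_analysis_settings` and `undefined_tokenizers` are assumptions for illustration; the second one is just a local sanity check that every analyzer points at a tokenizer that is either defined here or provided by a plugin (here `ik_max_word`, from ik).

```python
import json
from typing import Dict, List, Sequence


def build_analysis_settings() -> Dict:
    """Return the analysis block from this section wrapped in "settings"."""
    pinyin_tokenizer = {
        "type": "pinyin",
        "keep_separate_first_letter": "false",
        "lowercase": "true",
        "limit_first_letter_length": "16",
        "remove_duplicated_term": "true",
        "keep_original": "true",
        "keep_full_pinyin": "true",
        "keep_joined_full_pinyin": "true",
        "keep_none_chinese_in_joined_full_pinyin": "true",
    }
    return {
        "settings": {
            "analysis": {
                "analyzer": {
                    "ik": {"tokenizer": "ik_max_word"},
                    "pinyin_analyzer": {"tokenizer": "my_pinyin"},
                },
                "tokenizer": {"my_pinyin": pinyin_tokenizer},
            }
        }
    }


def undefined_tokenizers(body: Dict,
                         provided: Sequence[str] = ("ik_max_word",)) -> List[str]:
    """List tokenizers that analyzers reference but nothing defines."""
    analysis = body["settings"]["analysis"]
    known = set(analysis.get("tokenizer", {})) | set(provided)
    return [a["tokenizer"] for a in analysis["analyzer"].values()
            if a["tokenizer"] not in known]


body = build_analysis_settings()
print(json.dumps(body, indent=2))
print(undefined_tokenizers(body))  # [] when every reference resolves
```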

5. Defining the type in the mapping

A single property can carry several sub-fields (fields), each with its own analyzer:

"mappings": {
  "doc": {
    "properties": {
      "PERSON_ENAME": {
        "type": "text",
        "fields": {
          "ik": { "type": "text", "analyzer": "ik" },
          "english": { "type": "text", "analyzer": "english" },
          "standard": { "type": "text" }
        }
      },
      "CONTACTER_NAME": {
        "type": "text",
        "fields": {
          "ik": { "type": "text", "analyzer": "ik" },
          "pinyin": { "type": "text", "analyzer": "pinyin_analyzer" },
          "standard": { "type": "text" }
        }
      }
    }
  }
}
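
At query time each sub-field is addressed as `<property>.<sub-field>`, for example `CONTACTER_NAME.pinyin`. The sketch below walks a dictionary shaped like the "properties" part of the mapping above and lists those paths; the helper name `subfield_paths` is an assumption.

```python
from typing import Dict, List

# The "properties" part of the mapping, copied into a Python dict.
MAPPING = {
    "properties": {
        "PERSON_ENAME": {
            "type": "text",
            "fields": {
                "ik": {"type": "text", "analyzer": "ik"},
                "english": {"type": "text", "analyzer": "english"},
                "standard": {"type": "text"},
            },
        },
        "CONTACTER_NAME": {
            "type": "text",
            "fields": {
                "ik": {"type": "text", "analyzer": "ik"},
                "pinyin": {"type": "text", "analyzer": "pinyin_analyzer"},
                "standard": {"type": "text"},
            },
        },
    }
}


def subfield_paths(mapping: Dict) -> List[str]:
    """Return every "<property>.<sub-field>" path declared in the mapping."""
    paths = []
    for prop, spec in mapping["properties"].items():
        for sub in spec.get("fields", {}):
            paths.append(f"{prop}.{sub}")
    return paths


for path in subfield_paths(MAPPING):
    print(path)
```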

6. Testing

Query across multiple fields:

POST sim/doc/_search
{
  "query": {
    "multi_match": {
      "query": "dfbb",
      "fields": [
        "PERSON_ENAME.ik",
        "PERSON_ENAME.standard",
        "PERSON_ENAME.english",
        "CONTACTER_NAME.ik",
        "CONTACTER_NAME.standard",
        "CONTACTER_NAME.pinyin"
      ]
    }
  }
}
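
The same request body can also be assembled programmatically. A minimal sketch, assuming a hypothetical helper named `build_multi_match`; it only builds the JSON and does not contact a cluster.

```python
import json
from typing import Dict, List


def build_multi_match(text: str, fields: List[str]) -> Dict:
    """Build a multi_match search body that runs `text` against `fields`."""
    return {"query": {"multi_match": {"query": text, "fields": list(fields)}}}


# Sub-field paths taken from the mapping in step 5.
FIELDS = [
    "PERSON_ENAME.ik",
    "PERSON_ENAME.standard",
    "PERSON_ENAME.english",
    "CONTACTER_NAME.ik",
    "CONTACTER_NAME.standard",
    "CONTACTER_NAME.pinyin",
]

body = build_multi_match("dfbb", FIELDS)
print(json.dumps(body, indent=2))
```

Keeping the field list in one place makes it easy to add or drop a sub-field without editing the query by hand.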