Defining Multiple Custom Analyzers in Elasticsearch

Analyzer

Every Elasticsearch analyzer, whether built-in or custom, is made up of three parts: character filters, a tokenizer, and token filters.

Analyzer workflow

Input Text => Character Filters (applied in order if there are several) => Tokenizer => Token Filters (applied in order if there are several) => Output Tokens
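The workflow above can be sketched in a few lines of Python. This is only an illustration of the ordering, not Elasticsearch code: the function name `analyze` and the three lambda/stand-in stages are assumptions made for this example.

```python
from typing import Callable, List

CharFilter = Callable[[str], str]
Tokenizer = Callable[[str], List[str]]
TokenFilter = Callable[[List[str]], List[str]]


def analyze(text: str,
            char_filters: List[CharFilter],
            tokenizer: Tokenizer,
            token_filters: List[TokenFilter]) -> List[str]:
    """Hypothetical pipeline: char filters -> tokenizer -> token filters."""
    # Character filters rewrite the raw string, one after another.
    for char_filter in char_filters:
        text = char_filter(text)
    # Exactly one tokenizer turns the string into tokens.
    tokens = tokenizer(text)
    # Token filters transform the token list, one after another.
    for token_filter in token_filters:
        tokens = token_filter(tokens)
    return tokens


tokens = analyze(
    "Tom & Jerry",
    char_filters=[lambda s: s.replace("&", "and")],
    tokenizer=str.split,
    token_filters=[lambda ts: [t.lower() for t in ts]],
)
print(tokens)  # ['tom', 'and', 'jerry']
```

Note that an analyzer has any number of character filters and token filters, but only one tokenizer, which is why the tokenizer is a single argument here.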

Character Filters

Character filters preprocess the raw text before it is tokenized, for example by stripping HTML tags or converting "&" to "and".

Note: when an analyzer has several character filters, they are applied in the order in which they are listed.
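To see why the order matters, here is a rough Python sketch. The `html_strip` function below is a simplified stand-in (a regex plus `html.unescape`), not the real Elasticsearch implementation, and `map_ampersand` is a hypothetical equivalent of a "& => and" mapping filter.

```python
import html
import re


def html_strip(text: str) -> str:
    # Simplified stand-in: drop tags, then decode entities such as &amp;.
    return html.unescape(re.sub(r"<[^>]+>", "", text))


def map_ampersand(text: str) -> str:
    # Stand-in for a mapping character filter with "& => and".
    return text.replace("&", "and")


def apply_char_filters(text, filters):
    # Filters run strictly in list order; each sees the previous output.
    for char_filter in filters:
        text = char_filter(text)
    return text


raw = "<b>Tom</b> &amp; Jerry"
strip_first = apply_char_filters(raw, [html_strip, map_ampersand])
map_first = apply_char_filters(raw, [map_ampersand, html_strip])
print(strip_first)  # Tom and Jerry
print(map_first)    # Tom andamp; Jerry
```

With the mapping applied first, the "&" inside the entity is rewritten before the entity can be decoded, so the result is corrupted.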

Tokenizer

The tokenizer splits the string into a sequence of tokens, for example by splitting English text on whitespace.
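As a minimal sketch, a tokenizer that splits on a regular expression can be imitated with Python's `re.split`. The function name `pattern_tokenizer` is made up for this example, and the choice to discard empty strings is a property of this sketch.

```python
import re


def pattern_tokenizer(text: str, pattern: str = r"\s+"):
    # Split wherever the pattern matches; this sketch discards the empty
    # strings that re.split produces for leading or trailing separators.
    return [token for token in re.split(pattern, text) if token]


tokens = pattern_tokenizer("  No dreams, why bother Beijing !")
print(tokens)  # ['No', 'dreams,', 'why', 'bother', 'Beijing', '!']
```

A pure whitespace split keeps punctuation attached to words ("dreams,"), which is one reason the choice of tokenizer matters.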

Token Filters

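Token filters further process the tokens produced by the tokenizer, for example by changing case, removing stop words, normalizing singular and plural forms, or applying synonyms.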

Note: when an analyzer has several token filters, they are applied in the order in which they are listed.
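The following Python sketch shows how the order of token filters can change the result. The helper names are hypothetical, and the stop filter here is case-sensitive, which matches the default behavior of the Elasticsearch `stop` filter.

```python
def lowercase(tokens):
    return [token.lower() for token in tokens]


def make_stop_filter(stopwords):
    stop = set(stopwords)

    def stop_filter(tokens):
        # Case-sensitive comparison: "The" is not the same as "the".
        return [token for token in tokens if token not in stop]

    return stop_filter


def apply_token_filters(tokens, filters):
    # Filters run in list order; each sees the previous filter's output.
    for token_filter in filters:
        tokens = token_filter(tokens)
    return tokens


stop = make_stop_filter(["the", "a", "an"])
tokens = ["The", "Room"]
lower_then_stop = apply_token_filters(tokens, [lowercase, stop])
stop_then_lower = apply_token_filters(tokens, [stop, lowercase])
print(lower_then_stop)  # ['room']
print(stop_then_lower)  # ['the', 'room']
```

Placing `lowercase` before the stop filter lets capitalized stop words be removed as well.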

Using the analyze API

The analyze API lets you check what an analyzer produces and, optionally, explains each step of the analysis.

# text: the text to analyze
# explain: explain the analysis process
# char_filter: character filters
# tokenizer: the tokenizer
# filter: token filters
GET _analyze
{
  "char_filter": ["html_strip"],
  "tokenizer": "standard",
  "filter": ["lowercase"],
  "text": "<p><em>No <b>dreams</b>, why bother <b>Beijing</b> !</em></p>",
  "explain": true
}
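To get a feel for what the per-stage explanation looks like, the request above can be approximated offline in Python. This is a rough sketch under clear assumptions: the tag-stripping regex stands in for `html_strip`, and `\w+` is only a crude substitute for the `standard` tokenizer. The function name and the dictionary keys are invented for this example and do not mirror the real response format.

```python
import re


def explain_analysis(text: str) -> dict:
    """Hypothetical helper: record the output of every analysis stage."""
    stages = {}
    # Stage 1: character filter (simplified html_strip).
    filtered = re.sub(r"<[^>]+>", "", text)
    stages["char_filter"] = filtered
    # Stage 2: tokenizer (rough approximation of "standard").
    tokens = re.findall(r"\w+", filtered)
    stages["tokenizer"] = tokens
    # Stage 3: token filter (lowercase).
    stages["filter"] = [token.lower() for token in tokens]
    return stages


stages = explain_analysis(
    "<p><em>No <b>dreams</b>, why bother <b>Beijing</b> !</em></p>"
)
for name, output in stages.items():
    print(name, "->", output)
```

Inspecting intermediate stages this way makes it easier to tell which component is responsible for an unexpected token.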

Defining multiple custom analyzers

Create an index with multiple custom analyzers

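Here a single index defines several analyzers at once, each built from its own character filters, tokenizer, and token filters.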

PUT my_index
{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 1, 
    "analysis": { 
      "char_filter": { // multiple custom character filters
        "my_charfilter1": {
          "type": "mapping",
          "mappings": ["& => and"]
        },
        "my_charfilter2": {
          "type": "pattern_replace",
          "pattern": "(\\d+)-(?=\\d)",
          "replacement": "$1_"
        }
      },
      "tokenizer": { // multiple custom tokenizers
        "my_tokenizer1": {
          "type": "pattern",
          "pattern": "\\s+"
        },
        "my_tokenizer2": {
          "type": "pattern",
          "pattern": "_"
        }
      },
      "filter": { // multiple custom token filters
        "my_tokenfilter1": {
          "type": "stop",
          "stopwords": ["the", "a","an"]
        },
        "my_tokenfilter2": {
          "type": "stop",
          "stopwords": ["info", "debug"]
        }
      },
      "analyzer": { // multiple custom analyzers
         "my_analyzer1":{  // analyzer my_analyzer1
           "char_filter": ["html_strip", "my_charfilter1","my_charfilter2"],
           "tokenizer":"my_tokenizer1",
           "filter": ["lowercase", "my_tokenfilter1"]
         },
         "my_analyzer2":{  // analyzer my_analyzer2
           "char_filter": ["html_strip"],
           "tokenizer":"my_tokenizer2",
           "filter": ["my_tokenfilter2"]
         }
      }
    }
  }
}
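To reason about what these two analyzers should produce before sending anything to a cluster, the settings above can be mirrored in plain Python. This is a sketch under assumptions: `html_strip` is a simplified regex-based stand-in, the function names simply reuse the analyzer names from the settings, and the `$1` replacement syntax is written as `\1` for Python's `re` module.

```python
import html
import re


def html_strip(text):
    # Simplified stand-in for the built-in html_strip character filter.
    return html.unescape(re.sub(r"<[^>]+>", "", text))


def my_analyzer1(text):
    text = html_strip(text)
    text = text.replace("&", "and")                    # my_charfilter1
    text = re.sub(r"(\d+)-(?=\d)", r"\1_", text)       # my_charfilter2
    tokens = [t for t in re.split(r"\s+", text) if t]  # my_tokenizer1
    tokens = [t.lower() for t in tokens]               # lowercase
    return [t for t in tokens if t not in {"the", "a", "an"}]  # my_tokenfilter1


def my_analyzer2(text):
    text = html_strip(text)
    tokens = [t for t in re.split(r"_", text) if t]    # my_tokenizer2
    return [t for t in tokens if t not in {"info", "debug"}]  # my_tokenfilter2


tokens1 = my_analyzer1("<b>Tom </b> & <b>jerry</b> in the room number 1-1-1")
tokens2 = my_analyzer2("<b>debug_192.168.113.1_971213863506812928</b>")
print(tokens1)
print(tokens2)
```

The sketch only reproduces token text. It does not model offsets or positions, which Elasticsearch also tracks.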

Verify the analyzers of index my_index

Verify the output of analyzer my_analyzer1
GET /my_index/_analyze
{
  "text": "<b>Tom </b> & <b>jerry</b> in the room number 1-1-1",
  "analyzer": "my_analyzer1"//,
  //"explain": true
}

# Response
{
  "tokens": [
    {
      "token": "tom",
      "start_offset": 3,
      "end_offset": 6,
      "type": "word",
      "position": 0
    },
    {
      "token": "and",
      "start_offset": 12,
      "end_offset": 13,
      "type": "word",
      "position": 1
    },
    {
      "token": "jerry",
      "start_offset": 17,
      "end_offset": 26,
      "type": "word",
      "position": 2
    },
    {
      "token": "in",
      "start_offset": 27,
      "end_offset": 29,
      "type": "word",
      "position": 3
    },
    {
      "token": "room",
      "start_offset": 34,
      "end_offset": 38,
      "type": "word",
      "position": 5
    },
    {
      "token": "number",
      "start_offset": 39,
      "end_offset": 45,
      "type": "word",
      "position": 6
    },
    {
      "token": "1_1_1",
      "start_offset": 46,
      "end_offset": 51,
      "type": "word",
      "position": 7
    }
  ]
}
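In the response above the positions jump from 3 to 5: the stop word "the" was removed, but the slot it occupied is still counted. A minimal Python sketch of that bookkeeping, with a hypothetical function name, might look like this:

```python
def stop_filter_with_positions(tokens, stopwords):
    """Drop stop words but keep each surviving token's original position."""
    kept = []
    for position, token in enumerate(tokens):
        if token in stopwords:
            # The token is dropped, yet the position counter still advances,
            # leaving a gap in the sequence of positions.
            continue
        kept.append((token, position))
    return kept


tokens = ["tom", "and", "jerry", "in", "the", "room", "number", "1_1_1"]
positioned = stop_filter_with_positions(tokens, {"the", "a", "an"})
print(positioned)
# [('tom', 0), ('and', 1), ('jerry', 2), ('in', 3),
#  ('room', 5), ('number', 6), ('1_1_1', 7)]
```

Keeping the gap preserves the original distance between words, which matters for queries that depend on token positions, such as phrase queries.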
Verify the output of analyzer my_analyzer2
GET /my_index/_analyze
{
  "text": "<b>debug_192.168.113.1_971213863506812928</b>",
  "analyzer": "my_analyzer2"//,
  //"explain": true
}


# Response
{
  "tokens": [
    {
      "token": "192.168.113.1",
      "start_offset": 9,
      "end_offset": 22,
      "type": "word",
      "position": 1
    },
    {
      "token": "971213863506812928",
      "start_offset": 23,
      "end_offset": 45,
      "type": "word",
      "position": 2
    }
  ]
}

Add a mapping and assign a different analyzer to each field

PUT my_index/_mapping/my_type
{
  "properties": {
    "my_field1": {
      "type": "text",
      "analyzer": "my_analyzer1",
      "fields": {
        "keyword": {
          "type": "keyword"
        }
      }
    },
    "my_field2": {
      "type": "text",
      "analyzer": "my_analyzer2",
      "fields": {
        "keyword": {
          "type": "keyword"
        }
      }
    }
  }
}

Create a document

PUT my_index/my_type/1
{
  "my_field1":"<b>Tom </b> & <b>jerry</b> in the room number 1-1-1",
  "my_field2":"<b>debug_192.168.113.1_971213863506812928</b>"
}

Full-text search with a match query

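At query time, ES analyzes the query text with the analyzer configured for the target field and then searches for the resulting terms.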

# Find documents whose my_field2 field contains the IP 192.168.113.1
GET my_index/_search
{
  "query": {
    "match": {
      "my_field2": "192.168.113.1"
    }
  }
}

# Response
{
  "took": 22,
  "timed_out": false,
  "_shards": {
    "total": 3,
    "successful": 3,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.2876821,
    "hits": [
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "1",
        "_score": 0.2876821,
        "_source": {
          "my_field1": "<b>Tom </b> & <b>jerry</b> in the room number 1-1-1",
          "my_field2": "<b>debug_192.168.113.1_971213863506812928</b>"
        }
      }
    ]
  }
}
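Why this query matches can be illustrated with a toy inverted index in Python. This is a simplified model, not how Elasticsearch is implemented: `my_analyzer2` is a hand-written imitation of the analyzer defined earlier, `field_analyzers` plays the role of the mapping, and `match` is a hypothetical helper that returns any document sharing at least one term with the analyzed query.

```python
import html
import re
from collections import defaultdict


def my_analyzer2(text):
    # Imitation of my_analyzer2: strip HTML, split on "_", drop stop words.
    text = html.unescape(re.sub(r"<[^>]+>", "", text))
    tokens = [t for t in re.split(r"_", text) if t]
    return [t for t in tokens if t not in {"info", "debug"}]


# Plays the role of the mapping: which analyzer each field uses.
field_analyzers = {"my_field2": my_analyzer2}

docs = {"1": {"my_field2": "<b>debug_192.168.113.1_971213863506812928</b>"}}

# Index time: analyze each field value and record term -> document ids.
index = defaultdict(set)
for doc_id, source in docs.items():
    for field, value in source.items():
        for term in field_analyzers[field](value):
            index[(field, term)].add(doc_id)


def match(field, query_text):
    # Query time: analyze the query with the same field analyzer,
    # then look up every resulting term.
    hits = set()
    for term in field_analyzers[field](query_text):
        hits |= index.get((field, term), set())
    return sorted(hits)


full_ip = match("my_field2", "192.168.113.1")
partial_ip = match("my_field2", "192.168")
print(full_ip)     # ['1']
print(partial_ip)  # []
```

Because the tokenizer splits only on "_", the whole IP address is a single term, so a partial address such as "192.168" finds nothing in this model.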