Elasticsearch Built-in Analyzers and Custom Analyzers

阿新 • Published: 2018-12-27

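I. Built-in analyzers:
(1) standard analyzer
The standard analyzer is the default analyzer for full-text fields.
It consists of the following components:
standard tokenizer, which splits the input text on word boundaries.
standard token filter, which is intended to tidy up the tokens emitted by the tokenizer (but currently does nothing, and was removed in Elasticsearch 7.0).
lowercase token filter, which converts all tokens to lowercase.
stop token filter, which removes stop words, common words that add little to search relevance, such as a, the, and, is. This filter is disabled by default and only takes effect when the stopwords parameter is set.
(2) keyword analyzer, which emits the entire input as a single token.
(3) whitespace analyzer, which splits the input on whitespace only.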
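
To make the standard analyzer's steps concrete, here is a minimal Python sketch that imitates them without an Elasticsearch server. It is only an approximation: the function name `standard_like_analyze` is made up for this illustration, and the regular expression stands in for the Unicode word-boundary rules that the real standard tokenizer applies.

```python
import re

def standard_like_analyze(text, stopwords=()):
    """Rough imitation of the standard analyzer (not the real implementation)."""
    # Assumption: this regex approximates word-boundary tokenization,
    # keeping apostrophes inside words such as "it's".
    tokens = re.findall(r"\w+(?:'\w+)*", text)
    # lowercase token filter
    tokens = [t.lower() for t in tokens]
    # stop token filter: does nothing unless stop words are supplied
    return [t for t in tokens if t not in stopwords]

sample = "The QUICK brown-fox"
print(standard_like_analyze(sample))                     # ['the', 'quick', 'brown', 'fox']
print(standard_like_analyze(sample, stopwords={"the"}))  # ['quick', 'brown', 'fox']
```

With no stop words supplied, "the" survives, which mirrors the stop filter being disabled by default.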

1. Built-in character filters:
(1) The html_strip character filter removes all HTML tags and converts HTML entities to the corresponding Unicode characters, for example &Aacute; to Á.
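
The effect of html_strip can be approximated in a few lines of Python. This is a simplified sketch, not the actual filter: the helper name `html_strip` is reused here only for readability, the tag-matching regex is naive, and details of the real filter (such as how it treats block-level tags) are not reproduced.

```python
import html
import re

def html_strip(text):
    """Approximate html_strip: drop tags, then decode HTML entities."""
    # Assumption: anything between < and > counts as a tag.
    without_tags = re.sub(r"<[^>]+>", "", text)
    # html.unescape turns entities such as &Aacute; into Unicode characters.
    return html.unescape(without_tags)

print(html_strip("<p>I&apos;m so <b>happy</b>! &Aacute;</p>"))  # I'm so happy! Á
```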

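2. Built-in tokenizers:
(1) [keyword tokenizer] outputs exactly the string it receives, without any tokenization.
(2) [whitespace tokenizer] splits text on whitespace only.
(3) [pattern tokenizer] splits text wherever a regular expression matches.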
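
The difference between these three tokenizers is easiest to see side by side. The following Python sketch imitates each one; the function names are hypothetical, and the default pattern `\W+` is used on the assumption that it matches the pattern tokenizer's default.

```python
import re

def keyword_tokenizer(text):
    # The whole input becomes a single token.
    return [text]

def whitespace_tokenizer(text):
    # Split on runs of whitespace only; punctuation stays attached.
    return text.split()

def pattern_tokenizer(text, pattern=r"\W+"):
    # Split wherever the pattern matches; empty tokens are discarded.
    # Note: avoid capturing groups here, since re.split would return them too.
    return [t for t in re.split(pattern, text) if t]

sample = "foo-bar, baz qux"
print(keyword_tokenizer(sample))       # ['foo-bar, baz qux']
print(whitespace_tokenizer(sample))    # ['foo-bar,', 'baz', 'qux']
print(pattern_tokenizer(sample))       # ['foo', 'bar', 'baz', 'qux']
print(pattern_tokenizer(sample, ","))  # ['foo-bar', ' baz qux']
```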

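3. Built-in token filters:
(1) [lowercase token filter] converts tokens to lowercase.
(2) [stop token filter] removes stop words.
(3) [stemmer token filter] reduces words to their root form.
(4) [asciifolding token filter] removes diacritics, for example turning très into tres.
(5) [ngram] and [edge_ngram] token filters produce tokens suited to partial matching or autocomplete.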
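
As a rough illustration of what the folding and n-gram filters produce, here is a Python sketch. The function names are invented for this example, and `ascii_fold` only strips combining marks via Unicode decomposition, so it covers fewer characters than the real asciifolding filter. The `min_gram` and `max_gram` parameters are named after the corresponding filter settings.

```python
import unicodedata

def ascii_fold(token):
    # Decompose accented characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def edge_ngrams(token, min_gram=1, max_gram=3):
    # Prefixes of the token, useful for search-as-you-type.
    upper = min(max_gram, len(token))
    return [token[:n] for n in range(min_gram, upper + 1)]

def ngrams(token, min_gram=1, max_gram=2):
    # All substrings whose length lies between min_gram and max_gram.
    grams = []
    for start in range(len(token)):
        for size in range(min_gram, max_gram + 1):
            if start + size <= len(token):
                grams.append(token[start:start + size])
    return grams

print(ascii_fold("très"))    # tres
print(edge_ngrams("quick"))  # ['q', 'qu', 'qui']
print(ngrams("fox"))         # ['f', 'fo', 'o', 'ox', 'x']
```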

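II. Creating a custom analyzer
(Character filters (char_filter), tokenizers (tokenizer), and token filters (filter) are configured under the analysis setting):
An analyzer combines three kinds of components that run in order: character filters, a tokenizer, and token filters.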
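
The ordering can be sketched as a small Python function that chains the three stages. This is only a conceptual model: `analyze` and its parameters are hypothetical names, and the stand-in components are ordinary Python callables rather than Elasticsearch components.

```python
def analyze(text, char_filters, tokenizer, token_filters):
    """Run the three analysis stages in their fixed order."""
    # 1. Character filters transform the raw string.
    for char_filter in char_filters:
        text = char_filter(text)
    # 2. Exactly one tokenizer splits the string into tokens.
    tokens = tokenizer(text)
    # 3. Token filters transform the token list, one after another.
    for token_filter in token_filters:
        tokens = token_filter(tokens)
    return tokens

tokens = analyze(
    "  Hello  WORLD ",
    char_filters=[str.strip],                            # stand-in character filter
    tokenizer=str.split,                                 # stand-in tokenizer
    token_filters=[lambda ts: [t.lower() for t in ts]],  # stand-in lowercase filter
)
print(tokens)  # ['hello', 'world']
```

An analyzer can have zero or more character filters and token filters, but always exactly one tokenizer, which is why the sketch takes lists for the first and last stages and a single callable for the middle one.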

--Syntax for creating a custom analyzer
PUT /testindex
{
    "settings": {
        "analysis": {
            "char_filter": { ... custom character filters ... },
            "tokenizer":   { ...    custom tokenizers     ... },
            "filter":      { ...   custom token filters   ... },
            "analyzer":    { ...    custom analyzers      ... }
        }
    }
}

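--demo1: create a new analyzer, es_std, that uses the predefined Spanish stop word list:
(Note: the es_std analyzer is not global; it exists only in the testindex index where it is defined.)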

PUT /testindex
{
    "settings": {
        "analysis": {
            "analyzer": {
                "es_std": {
                    "type":      "standard",
                    "stopwords": "_spanish_"
                }
            }
        }
    }
}

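--demo2: create a custom analyzer
It does the following:
Strip all HTML tags with the html_strip character filter.
Replace & with and, using a custom mapping character filter.
Split the text into words with the standard tokenizer.
Lowercase the tokens with the lowercase token filter.
Remove a custom list of stop words with a stop token filter.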


PUT /testindex
{
    "settings": {
        "analysis": {
            "char_filter": {
                "&_to_and": {
                    "type":       "mapping",
                    "mappings": [ "&=> and "]
            }},
            "filter": {
                "my_stopwords": {
                    "type":       "stop",
                    "stopwords": [ "the", "a" ]
            }},
            "analyzer": {
                "my_analyzer": {
                    "type":         "custom",
                    "char_filter":  [ "html_strip", "&_to_and" ],
                    "tokenizer":    "standard",
                    "filter":       [ "lowercase", "my_stopwords" ]
            }}
}}}
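
To see what tokens a configuration like my_analyzer should yield, the pipeline can be imitated in plain Python. This is a sketch under assumptions, not the analyzer itself: the regexes approximate html_strip and the standard tokenizer, and the spacing around the mapping replacement is simplified.

```python
import html
import re

def my_analyzer(text):
    """Imitation of my_analyzer; each step mirrors one configured component."""
    # char_filter "html_strip" (approximation): drop tags, decode entities
    text = html.unescape(re.sub(r"<[^>]+>", "", text))
    # char_filter "&_to_and": mapping of & to and
    text = text.replace("&", " and ")
    # tokenizer "standard" (approximated with a simple word regex)
    tokens = re.findall(r"\w+", text)
    # filter "lowercase"
    tokens = [t.lower() for t in tokens]
    # filter "my_stopwords": remove "the" and "a"
    return [t for t in tokens if t not in {"the", "a"}]

print(my_analyzer("The quick & brown fox"))             # ['quick', 'and', 'brown', 'fox']
print(my_analyzer("<b>The</b> quick &amp; brown fox"))  # ['quick', 'and', 'brown', 'fox']
```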

III. Testing the new analyzer:


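--demo1: analyze with the built-in standard analyzer (use my_analyzer instead to test the custom analyzer defined above)
GET testindex/_analyze
{
  "analyzer": "standard",
  "text": "The quick & brown fox."
}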

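--demo2: analyze with the analyzer mapped to the name field (requires the name field mapping shown later in this article)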
GET testindex/_analyze 
{
  "field": "name",
  "text": "The quick & Brown Foxes."
}

--demo3: analyze with the analyzer mapped to the name.english sub-field
GET testindex/_analyze 
{
  "field": "name.english",
  "text": "The quick & Brown Foxes."
}
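
The same requests can be issued from a script. The sketch below only builds the HTTP request with Python's standard library and does not send it; the helper name `build_analyze_request` and the base URL `http://localhost:9200` are assumptions for illustration. It uses POST, which the _analyze endpoint accepts in addition to GET.

```python
import json
from urllib.request import Request

def build_analyze_request(base_url, index, text, analyzer=None, field=None):
    """Build (but do not send) an _analyze request for the given index."""
    body = {"text": text}
    if analyzer is not None:
        body["analyzer"] = analyzer  # name of a built-in or index-level analyzer
    if field is not None:
        body["field"] = field        # use the analyzer mapped to this field
    return Request(
        url=f"{base_url}/{index}/_analyze",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_analyze_request(
    "http://localhost:9200",  # assumed local node
    "testindex",
    "The quick & Brown Foxes.",
    field="name.english",
)
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```

Passing the request to `urllib.request.urlopen` against a running node would return a JSON document whose tokens array lists each token with its position and offsets.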

IV. Configuring an analyzer for a specific field

--demo1: configure an analyzer for the message field
PUT /testindex/_mapping/testtable
{
    "properties": {
        "message": {
            "type":      "text",
            "analyzer":  "my_analyzer"
        }
    }
}

--demo2: map the name field with an english sub-field that uses the english analyzer
PUT /testindex
{
  "mappings": {
    "testtable": {
      "properties": {
        "name": { 
          "type": "text",
          "fields": {
            "english": { 
              "type":     "text",
              "analyzer": "english"
            }
          }
        }
      }
    }
  }
}

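V. Applying the analyzer to an index

When creating the mapping for the target index, set the analyzer of the field to be analyzed to the analyzer we built. For example: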

PUT /testindex/_mapping/testtable
{
  "testtable": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "my_analyzer"
      }
    }
  }
}
An analyzer can also be specified at query time. For example:

POST /testindex/testtable/_search
{
  "query": {
    "match": {
      "name": {
        "query": "it's brown",
        "analyzer": "standard"
      }
    }
  }
}
Alternatively, specify the index-time and search-time analyzers separately in the mapping. For example:

PUT /testindex/_mapping/testtable
{
  "testtable": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "my_analyzer",
        "search_analyzer": "standard"
      }
    }
  }
}
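
Why the index-time and search-time analyzers matter can be shown with a toy inverted index in Python. Everything here is a stand-in: `index_analyzer` imitates an analyzer that removes the stop words "the" and "a", `search_analyzer` imitates one that only lowercases, and `match` uses OR semantics like a default match query.

```python
import re
from collections import defaultdict

def index_analyzer(text):
    # Stand-in for the index-time analyzer: lowercase, drop "the" and "a".
    return [t for t in re.findall(r"\w+", text.lower()) if t not in {"the", "a"}]

def search_analyzer(text):
    # Stand-in for the search-time analyzer: lowercase only.
    return re.findall(r"\w+", text.lower())

def build_index(docs):
    inverted = defaultdict(set)
    for doc_id, text in docs.items():
        for token in index_analyzer(text):
            inverted[token].add(doc_id)
    return inverted

def match(inverted, query):
    # A document matches if it contains any of the query tokens.
    hits = set()
    for token in search_analyzer(query):
        hits |= inverted.get(token, set())
    return hits

inverted = build_index({1: "The quick brown fox", 2: "A lazy dog"})
print(match(inverted, "the brown"))  # {1}
print(match(inverted, "the"))        # set()
```

The query token "the" never matches because it was removed at index time; only tokens that survive both analyzers in the same form can produce hits.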

Then index some documents and check the results with a simple match query. If something looks wrong, inspect the query with the Validate API. For example:

POST /testindex/testtable/_validate/query?explain
{
  "query": {
    "match": {
      "name": "it's brown"
    }
  }
}
