Defining Multiple Custom Analyzers in Elasticsearch
阿新 · Published: 2019-02-18
Analyzers
In Elasticsearch, every analyzer, whether built-in or custom, is composed of three parts: character filters, a tokenizer, and token filters.
How an analyzer works:
Input Text => Character Filters (applied in order if there are several) => Tokenizer => Token Filters (applied in order if there are several) => Output Tokens
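The pipeline above can be sketched in plain Python. This is a toy illustration of the three stages, not the real Lucene implementation; the filter functions are simplified stand-ins:

```python
import re

def analyze(text, char_filters, tokenizer, token_filters):
    # Character filters run first, in order, on the raw string.
    for cf in char_filters:
        text = cf(text)
    # The tokenizer splits the filtered string into tokens.
    tokens = tokenizer(text)
    # Token filters run last, in order, on the token stream.
    for tf in token_filters:
        tokens = tf(tokens)
    return tokens

# Toy stand-ins for html_strip, the standard tokenizer, and lowercase:
strip_html = lambda s: re.sub(r"<[^>]+>", "", s)
split_words = lambda s: re.findall(r"\w+", s)
lowercase = lambda toks: [t.lower() for t in toks]

print(analyze("<b>No Dreams</b>", [strip_html], split_words, [lowercase]))
# -> ['no', 'dreams']
```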
Character Filters
A character filter preprocesses the raw text, e.g. stripping HTML tags or mapping "&" to "and".
Note: when an analyzer has multiple character filters, they are applied in order.
Tokenizer
A tokenizer breaks the string into a stream of tokens, e.g. splitting English text on whitespace.
Token Filters
A token filter post-processes the tokens emitted by the tokenizer, e.g. lowercasing, removing stop words, stemming singular/plural forms, or expanding synonyms.
Note: when an analyzer has multiple token filters, they are applied in order.
Using the _analyze API
The _analyze API lets you verify an analyzer's output and explain each step of the analysis.
# text: the text to analyze
# explain: explain the analysis process
# char_filter: character filters
# tokenizer: tokenizer
# filter: token filters
GET _analyze
{
  "char_filter": ["html_strip"],
  "tokenizer": "standard",
  "filter": ["lowercase"],
  "text": "<p><em>No <b>dreams</b>, why bother <b>Beijing</b> !</em></p>",
  "explain": true
}
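For reference, the three stages of this request can be mimicked with plain Python regexes. This is only a rough approximation of html_strip, standard, and lowercase; the real analyzers' offsets and Unicode rules are not reproduced:

```python
import re

text = "<p><em>No <b>dreams</b>, why bother <b>Beijing</b> !</em></p>"

no_html = re.sub(r"<[^>]+>", "", text)   # char_filter: html_strip (toy version)
tokens = re.findall(r"\w+", no_html)     # tokenizer: standard (roughly)
tokens = [t.lower() for t in tokens]     # filter: lowercase

print(tokens)
# -> ['no', 'dreams', 'why', 'bother', 'beijing']
```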
Defining Multiple Custom Analyzers
Create an index with multiple custom analyzers
Here we define several analyzers on a single index.
PUT my_index
{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 1,
    "analysis": {
      "char_filter": { // multiple custom character filters
        "my_charfilter1": {
          "type": "mapping",
          "mappings": ["& => and"]
        },
        "my_charfilter2": {
          "type": "pattern_replace",
          "pattern": "(\\d+)-(?=\\d)",
          "replacement": "$1_"
        }
      },
      "tokenizer": { // multiple custom tokenizers
        "my_tokenizer1": {
          "pattern": "\\s+",
          "type": "pattern"
        },
        "my_tokenizer2": {
          "pattern": "_",
          "type": "pattern"
        }
      },
      "filter": { // multiple custom token filters
        "my_tokenfilter1": {
          "type": "stop",
          "stopwords": ["the", "a", "an"]
        },
        "my_tokenfilter2": {
          "type": "stop",
          "stopwords": ["info", "debug"]
        }
      },
      "analyzer": { // multiple custom analyzers
        "my_analyzer1": { // analyzer my_analyzer1
          "char_filter": ["html_strip", "my_charfilter1", "my_charfilter2"],
          "tokenizer": "my_tokenizer1",
          "filter": ["lowercase", "my_tokenfilter1"]
        },
        "my_analyzer2": { // analyzer my_analyzer2
          "char_filter": ["html_strip"],
          "tokenizer": "my_tokenizer2",
          "filter": ["my_tokenfilter2"]
        }
      }
    }
  }
}
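The two custom character filters are easy to sanity-check outside Elasticsearch, because pattern_replace uses Java-style regex syntax that translates almost directly to Python's re (the `$1` backreference becomes `\1`):

```python
import re

# my_charfilter1: the "mapping" filter rewrites "&" to "and".
mapped = "Tom & jerry".replace("&", "and")
print(mapped)  # -> Tom and jerry

# my_charfilter2: pattern_replace with "(\d+)-(?=\d)" -> "$1_" (Python: \1).
# The lookahead (?=\d) consumes only the hyphen, so every hyphen sitting
# between digits is rewritten and "1-1-1" becomes "1_1_1".
replaced = re.sub(r"(\d+)-(?=\d)", r"\1_", "room number 1-1-1")
print(replaced)  # -> room number 1_1_1
```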
Verifying the analyzers of my_index
Verify analyzer my_analyzer1
GET /my_index/_analyze
{
  "text": "<b>Tom </b> & <b>jerry</b> in the room number 1-1-1",
  "analyzer": "my_analyzer1"
  // "explain": true
}
# Response
{
  "tokens": [
    {
      "token": "tom",
      "start_offset": 3,
      "end_offset": 6,
      "type": "word",
      "position": 0
    },
    {
      "token": "and",
      "start_offset": 12,
      "end_offset": 13,
      "type": "word",
      "position": 1
    },
    {
      "token": "jerry",
      "start_offset": 17,
      "end_offset": 26,
      "type": "word",
      "position": 2
    },
    {
      "token": "in",
      "start_offset": 27,
      "end_offset": 29,
      "type": "word",
      "position": 3
    },
    {
      "token": "room",
      "start_offset": 34,
      "end_offset": 38,
      "type": "word",
      "position": 5
    },
    {
      "token": "number",
      "start_offset": 39,
      "end_offset": 45,
      "type": "word",
      "position": 6
    },
    {
      "token": "1_1_1",
      "start_offset": 46,
      "end_offset": 51,
      "type": "word",
      "position": 7
    }
  ]
}
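The token stream above can be retraced step by step with a small Python sketch of my_analyzer1 (toy regexes standing in for the real filters; offsets are not simulated). Note how the stop filter drops "the", which is why position 4 is missing in the response:

```python
import re

text = "<b>Tom </b> & <b>jerry</b> in the room number 1-1-1"

# Character filters of my_analyzer1, in order:
s = re.sub(r"<[^>]+>", "", text)             # html_strip (toy version)
s = s.replace("&", "and")                    # my_charfilter1: "& => and"
s = re.sub(r"(\d+)-(?=\d)", r"\1_", s)       # my_charfilter2: pattern_replace

tokens = re.split(r"\s+", s.strip())         # my_tokenizer1: pattern "\s+"
tokens = [t.lower() for t in tokens]         # lowercase
tokens = [t for t in tokens if t not in {"the", "a", "an"}]  # my_tokenfilter1: stop

print(tokens)
# -> ['tom', 'and', 'jerry', 'in', 'room', 'number', '1_1_1']
```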
Verify analyzer my_analyzer2
GET /my_index/_analyze
{
  "text": "<b>debug_192.168.113.1_971213863506812928</b>",
  "analyzer": "my_analyzer2"
  // "explain": true
}
# Response
{
  "tokens": [
    {
      "token": "192.168.113.1",
      "start_offset": 9,
      "end_offset": 22,
      "type": "word",
      "position": 1
    },
    {
      "token": "971213863506812928",
      "start_offset": 23,
      "end_offset": 45,
      "type": "word",
      "position": 2
    }
  ]
}
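This result can likewise be reproduced with a simplified Python sketch of my_analyzer2: strip the tags, split on "_", then drop the stop words "info" and "debug" (which is why the "debug" token at position 0 is gone):

```python
import re

text = "<b>debug_192.168.113.1_971213863506812928</b>"

no_html = re.sub(r"<[^>]+>", "", text)   # char_filter: html_strip (toy version)
tokens = no_html.split("_")              # tokenizer: my_tokenizer2 (pattern "_")
stopwords = {"info", "debug"}            # filter: my_tokenfilter2 (stop)
tokens = [t for t in tokens if t not in stopwords]

print(tokens)
# -> ['192.168.113.1', '971213863506812928']
```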
Add a mapping and assign a different analyzer to each field
Note: the my_type mapping type below uses pre-7.0 syntax; in Elasticsearch 7+ the mapping type is omitted.
PUT my_index/_mapping/my_type
{
  "properties": {
    "my_field1": {
      "type": "text",
      "analyzer": "my_analyzer1",
      "fields": {
        "keyword": {
          "type": "keyword"
        }
      }
    },
    "my_field2": {
      "type": "text",
      "analyzer": "my_analyzer2",
      "fields": {
        "keyword": {
          "type": "keyword"
        }
      }
    }
  }
}
Index a document
PUT my_index/my_type/1
{
  "my_field1": "<b>Tom </b> & <b>jerry</b> in the room number 1-1-1",
  "my_field2": "<b>debug_192.168.113.1_971213863506812928</b>"
}
Full-text search with a match query
At query time, Elasticsearch analyzes the query text with the analyzer of the target field, then runs the search against the inverted index.
# Find documents whose my_field2 field contains the IP 192.168.113.1
GET my_index/_search
{
  "query": {
    "match": {
      "my_field2": "192.168.113.1"
    }
  }
}
# Response
{
  "took": 22,
  "timed_out": false,
  "_shards": {
    "total": 3,
    "successful": 3,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.2876821,
    "hits": [
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "1",
        "_score": 0.2876821,
        "_source": {
          "my_field1": "<b>Tom </b> & <b>jerry</b> in the room number 1-1-1",
          "my_field2": "<b>debug_192.168.113.1_971213863506812928</b>"
        }
      }
    ]
  }
}
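The hit makes sense if you trace the query-time analysis: the match query runs the query string through my_field2's analyzer (my_analyzer2), and a document matches when a query token equals an indexed token. A toy Python check, reusing the same simplified pipeline as above:

```python
import re

def my_analyzer2(text):
    # Simplified stand-in for my_analyzer2: html_strip,
    # pattern tokenizer on "_", stop filter {"info", "debug"}.
    text = re.sub(r"<[^>]+>", "", text)
    return [t for t in text.split("_") if t not in {"info", "debug"}]

indexed = my_analyzer2("<b>debug_192.168.113.1_971213863506812928</b>")
query = my_analyzer2("192.168.113.1")

# The match succeeds because the query token appears among the indexed tokens.
print(set(query) & set(indexed))
# -> {'192.168.113.1'}
```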