Elasticsearch Tutorial: Custom Analyzers
阿新 • Published: 2018-11-08
For an overview of the whole Elasticsearch series, see: Elasticsearch Tutorial: Summary
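Analyzers
Elasticsearch ships with many built-in analyzers, for example standard (the standard analyzer), english (English analysis), and chinese (Chinese analysis, present only in older releases; cjk is the built-in option for Chinese, Japanese, and Korean text in newer ones). The default is standard, which is made up of the following components:
standard tokenizer: splits text on word boundaries
standard token filter: does nothing
lowercase token filter: converts all letters to lowercase
stop token filter (disabled by default): removes stopwords such as a, the, and it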
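The component chain above can be sketched in a few lines of Python. This is only an illustration of the order of operations, not how Elasticsearch implements it: the function name standard_analyze is hypothetical, and the \w+ regular expression is a rough stand-in for the Unicode word-boundary rules the real standard tokenizer follows.

```python
import re

# Assumption: \w+ only approximates the word-boundary rules of the
# real standard tokenizer.
WORD = re.compile(r"\w+")


def standard_analyze(text, stopwords=None):
    """Tokenize, lowercase, and optionally drop stopwords (hypothetical helper)."""
    # Tokenizer step followed by the lowercase token filter.
    tokens = [m.group().lower() for m in WORD.finditer(text)]
    # The stop filter is off unless a stopword list is supplied.
    if stopwords:
        tokens = [t for t in tokens if t not in stopwords]
    return tokens


print(standard_analyze("A Dog is in THE house"))
print(standard_analyze("A Dog is in THE house", stopwords={"a", "the", "it"}))
```

With no stopword list every word survives in lowercase; passing a list switches on the last stage, which mirrors the stop filter being disabled by default.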
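Changing the analyzer settings
Enable the English stopword token filter by defining an analyzer of type standard that uses the _english_ stopword list: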
PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "es_std": {
          "type": "standard",
          "stopwords": "_english_"
        }
      }
    }
  }
}
Test request using the standard analyzer
GET /my_index/_analyze
{
  "analyzer": "standard",
  "text": "a dog is in the house"
}
Result
{
  "tokens": [
    { "token": "a", "start_offset": 0, "end_offset": 1, "type": "<ALPHANUM>", "position": 0 },
    { "token": "dog", "start_offset": 2, "end_offset": 5, "type": "<ALPHANUM>", "position": 1 },
    { "token": "is", "start_offset": 6, "end_offset": 8, "type": "<ALPHANUM>", "position": 2 },
    { "token": "in", "start_offset": 9, "end_offset": 11, "type": "<ALPHANUM>", "position": 3 },
    { "token": "the", "start_offset": 12, "end_offset": 15, "type": "<ALPHANUM>", "position": 4 },
    { "token": "house", "start_offset": 16, "end_offset": 21, "type": "<ALPHANUM>", "position": 5 }
  ]
}
Test request using the configured es_std analyzer
GET /my_index/_analyze
{
  "analyzer": "es_std",
  "text": "a dog is in the house"
}
Result
{
  "tokens": [
    {
      "token": "dog",
      "start_offset": 2,
      "end_offset": 5,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "house",
      "start_offset": 16,
      "end_offset": 21,
      "type": "<ALPHANUM>",
      "position": 5
    }
  ]
}
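Note that the surviving tokens keep their original positions (1 and 5) and character offsets even though the stopwords around them were dropped. The sketch below reproduces that bookkeeping in Python. It is a simplified model: the helper name analyze_with_positions is hypothetical, the tokenizer is a \w+ approximation, and the stopword set is what the _english_ list is assumed to contain, so check the documentation for your version before relying on it.

```python
import re

# Assumption: this is the word list that "_english_" is understood to
# select by default; verify it against your Elasticsearch version.
ENGLISH_STOPWORDS = {
    "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if",
    "in", "into", "is", "it", "no", "not", "of", "on", "or", "such", "that",
    "the", "their", "then", "there", "these", "they", "this", "to", "was",
    "will", "with",
}


def analyze_with_positions(text, stopwords=ENGLISH_STOPWORDS):
    """Return token dicts shaped like the entries of an _analyze response."""
    result = []
    for position, match in enumerate(re.finditer(r"\w+", text)):
        token = match.group().lower()
        if token in stopwords:
            # The token is dropped, but its position number stays consumed.
            continue
        result.append({
            "token": token,
            "start_offset": match.start(),
            "end_offset": match.end(),
            "position": position,
        })
    return result


for entry in analyze_with_positions("a dog is in the house"):
    print(entry)
```

Keeping the gaps in the position sequence is what lets position-sensitive queries still see that dog and house were four positions apart in the original text.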
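Custom analyzer
Creating an index that already exists fails, so if my_index is left over from the previous section, delete it first with DELETE /my_index.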
PUT /my_index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "&_to_and": {
          "type": "mapping",
          "mappings": ["&=> and"]
        }
      },
      "filter": {
        "my_stopwords": {
          "type": "stop",
          "stopwords": ["the", "a"]
        }
      },
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "char_filter": ["html_strip", "&_to_and"],
          "tokenizer": "standard",
          "filter": ["lowercase", "my_stopwords"]
        }
      }
    }
  }
}
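Breakdown
char_filter &_to_and: a mapping character filter that replaces every & with and before tokenization
filter my_stopwords: a stop token filter that removes only the and a
analyzer my_analyzer: a custom analyzer that runs the html_strip and &_to_and character filters, then the standard tokenizer, then the lowercase and my_stopwords token filters, in that order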
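To make the order of the stages in my_analyzer concrete, here is a minimal Python sketch of the same pipeline: character filters first, then the tokenizer, then the token filters. It is an approximation under stated assumptions, not the real implementation: the functions html_strip, ampersand_to_and, and my_analyzer are hypothetical stand-ins, tags are removed with a simple regular expression, and \w+ replaces the standard tokenizer.

```python
import re


def html_strip(text):
    # Rough stand-in for the html_strip char filter: blank out tag-shaped text.
    return re.sub(r"<[^>]+>", " ", text)


def ampersand_to_and(text):
    # Mirrors the "&_to_and" mapping char filter from the settings.
    return text.replace("&", "and")


MY_STOPWORDS = {"the", "a"}


def my_analyzer(text):
    # 1. Character filters run on the raw string, in the configured order.
    for char_filter in (html_strip, ampersand_to_and):
        text = char_filter(text)
    # 2. Tokenizer (approximation of the standard tokenizer).
    tokens = re.findall(r"\w+", text)
    # 3. Token filters: lowercase first, then the custom stopword filter.
    tokens = [t.lower() for t in tokens]
    return [t for t in tokens if t not in MY_STOPWORDS]


print(my_analyzer("tom&jerry are a friend in the house, <a>, HAHA!!"))
```

Because the & replacement happens before tokenization, tom&jerry reaches the tokenizer as a single word, and because lowercase is part of the filter chain, HAHA comes out as haha.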
Test request
GET /my_index/_analyze
{
  "text": "tom&jerry are a friend in the house, <a>, HAHA!!",
  "analyzer": "my_analyzer"
}
Test result
{
  "tokens": [
    {
      "token": "tomandjerry",
      "start_offset": 0,
      "end_offset": 9,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "are",
      "start_offset": 10,
      "end_offset": 13,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "friend",
      "start_offset": 16,
      "end_offset": 22,
      "type": "<ALPHANUM>",
      "position": 3
    },
    {
      "token": "in",
      "start_offset": 23,
      "end_offset": 25,
      "type": "<ALPHANUM>",
      "position": 4
    },
    {
      "token": "house",
      "start_offset": 30,
      "end_offset": 35,
      "type": "<ALPHANUM>",
      "position": 6
    },
    {
      "token": "haha",
      "start_offset": 42,
      "end_offset": 46,
      "type": "<ALPHANUM>",
      "position": 7
    }
  ]
}
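Using the analyzer in a type mapping
The request below uses the typed mapping endpoint of Elasticsearch 6.x and earlier. From 7.x on, where mapping types are deprecated and later removed, the typeless form PUT /my_index/_mapping is used instead.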
PUT /my_index/_mapping/my_type
{
  "properties": {
    "content": {
      "type": "text",
      "analyzer": "my_analyzer"
    }
  }
}
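Once a text field is mapped to an analyzer, that analyzer is applied to the field when documents are indexed and, unless a separate search_analyzer is set, to the query text of a match query as well. The toy inverted index below sketches why this matters. Everything in it is hypothetical and heavily simplified: TinyIndex is not an Elasticsearch API, and my_analyzer is a regex-based approximation of the custom analyzer defined earlier.

```python
import re
from collections import defaultdict

STOPWORDS = {"the", "a"}


def my_analyzer(text):
    """Simplified stand-in for the custom analyzer defined in the settings."""
    text = re.sub(r"<[^>]+>", " ", text).replace("&", "and")
    tokens = [t.lower() for t in re.findall(r"\w+", text)]
    return [t for t in tokens if t not in STOPWORDS]


class TinyIndex:
    """Hypothetical toy index holding a single analyzed "content" field."""

    def __init__(self, analyzer):
        self.analyzer = analyzer
        self.postings = defaultdict(set)  # term -> ids of documents containing it

    def index(self, doc_id, content):
        # Index time: store the analyzed terms, not the raw text.
        for term in self.analyzer(content):
            self.postings[term].add(doc_id)

    def match(self, query):
        # Search time: analyze the query the same way, then match any term.
        hits = set()
        for term in self.analyzer(query):
            hits |= self.postings.get(term, set())
        return sorted(hits)


idx = TinyIndex(my_analyzer)
idx.index(1, "Tom&Jerry are friends")
idx.index(2, "The house is <b>empty</b>")
print(idx.match("TOMANDJERRY"))  # the query is lowercased before lookup
print(idx.match("the"))          # a stopword-only query produces no terms
```

Since both sides pass through the same analysis, a query written in a different case still finds document 1, while a query made up only of stopwords has nothing left to look up.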