How to implement filtering in Elasticsearch (keeping a field un-analyzed and running aggregations)
0 Background
A commonly used analyzer for Chinese word segmentation is es-ik. An index that uses it is created as follows.
Here we create two fields, name and district, for the index person_list (note that index names must be lowercase).
(All of the requests below are run in the Kibana console.)
PUT /person_list { "mappings": { "info": { "properties": { "name": { "type": "text", "analyzer": "ik_max_word", "search_analyzer": "ik_max_word" }, "district": { "type": "text", "analyzer": "ik_max_word", "search_analyzer": "ik_max_word" } } } } }'
View the index details and individual index properties:
GET person_list
GET /person_list/_settings
GET /person_list/_mapping
Add some data for testing.
You can add documents in bulk (recommended):
POST /person_list/info/_bulk {"index":{"_id":"1"}} {"name":"李明","district":"上海市"} {"index":{"_id":"2"}} {"name":"李明","district":"上海市"} {"index":{"_id":"3"}} {"name":"李明","district":"北京市"} {"index":{"_id":"4"}} {"name":"張偉","district":"上海市"} {"index":{"_id":"5"}} {"name":"張偉","district":"北京市"} {"index":{"_id":"6"}} {"name":"張偉","district":"北京市"}
Or add them one at a time:
POST /person_list/info
{
"name": "李明",
"district":"上海"
}
Now let's look at the requirements.
0.1 Requirement 1: search on name
This one is simple: both fuzzy and exact search can be done, and we can also set the offset (from) and size.
GET person_list/info/_search
{
"query": {
"match_phrase_prefix": {"name": "張偉"}
},
"size": 10,
"from": 0
}
Search result:
{ "took": 0, "timed_out": false, "_shards": { "total": 5, "successful": 5, "failed": 0 }, "hits": { "total": 1, "max_score": 0.35667494, "hits": [ { "_index": "person_list", "_type": "info", "_id": "4", "_score": 0.35667494, "_source": { "name": "張偉", "district": "北京" } } ] } }
0.2 Requirement 2: aggregate on name, so that when we search for a person's name we see how many people with that name are in each district
The aggregation query is as follows; we want the number of 張偉 in each district.
GET person_list/info/_search
{
"query":{
"match_phrase_prefix":{"name":"張偉"}
},
"aggs":{
"result":{
"terms":{"field":"district"}
}
}
}
The response this time is:
{
"error": {
"root_cause": [
{
"type": "illegal_argument_exception",
"reason": "Fielddata is disabled on text fields by default. Set fielddata=true on [district] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory. Alternatively use a keyword field instead."
}
],
"type": "search_phase_execution_exception",
"reason": "all shards failed",
"phase": "query",
"grouped": true,
"failed_shards": [
{
"shard": 0,
"index": "person_list",
"node": "SOK5mAntQ8SYv6BuOGYuMg",
"reason": {
"type": "illegal_argument_exception",
"reason": "Fielddata is disabled on text fields by default. Set fielddata=true on [district] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory. Alternatively use a keyword field instead."
}
}
]
},
"status": 400
}
We got an error telling us to set fielddata=true. How do we change that?
1 Configuring an index that supports filtering
We can modify the mapping directly with a request:
POST /person_list/_mapping/info
{
"properties": {
"district": {
"type": "text",
"analyzer": "ik_max_word",
"search_analyzer": "ik_max_word",
"fielddata": true
}
}
}
It succeeds:
{
"acknowledged": true
}
Now run the aggregation again:
GET person_list/info/_search
{
"query":{
"match_phrase_prefix":{"name":"張偉"}
},
"aggs":{
"result":{
"terms":{"field":"district"}
}
}
}
And the result:
{
"took": 2,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 3,
"max_score": 0.47000363,
"hits": [
{
"_index": "person_list",
"_type": "info",
"_id": "4",
"_score": 0.47000363,
"_source": {
"name": "張偉",
"district": "上海市"
}
},
{
"_index": "person_list",
"_type": "info",
"_id": "6",
"_score": 0.47000363,
"_source": {
"name": "張偉",
"district": "北京市"
}
},
{
"_index": "person_list",
"_type": "info",
"_id": "5",
"_score": 0.2876821,
"_source": {
"name": "張偉",
"district": "北京市"
}
}
]
},
"aggregations": {
"result": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "北京",
"doc_count": 2
},
{
"key": "北京市",
"doc_count": 2
},
{
"key": "市",
"doc_count": 2
},
{
"key": "上海",
"doc_count": 1
},
{
"key": "上海市",
"doc_count": 1
},
{
"key": "海市",
"doc_count": 1
}
]
}
}
}
Something is wrong: the district values have been split into tokens!
Thinking about it, this makes sense: because district is analyzed with ik, the aggregation first segments district into terms and then groups and counts those terms, not the whole values.
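This is easy to verify with the _analyze API: asking ik_max_word to analyze one of the district values produces exactly the terms that showed up as buckets above (北京市, 北京, 市), and each of those terms is counted separately. A quick check, assuming the ik plugin is installed as in the mapping above:
GET _analyze
{
  "analyzer": "ik_max_word",
  "text": "北京市"
}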
Now the direction is clear: for district, the analyzer should not split the value at all.
Following solutions found online, we modify the mapping once more:
POST /person_list/_mapping/info
{
"properties": {
"district": {
"type": "text",
"fielddata": true,
"fields": {"raw": {"type": "keyword"}}
}
}
}
No luck: this fails with an analyzer conflict error. What now?
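The conflict arises because the analyzer of an existing field cannot simply be changed by a mapping update. For the record, the alternative suggested by the error message itself ("use a keyword field instead") would also work if we were willing to rebuild the index: declare a keyword sub-field and aggregate on that sub-field, which needs no fielddata at all. A sketch under that assumption, with the sub-field name raw chosen arbitrarily:
# recreate the index with a keyword sub-field on district
PUT /person_list
{
  "mappings": {
    "info": {
      "properties": {
        "name": {
          "type": "text",
          "analyzer": "ik_max_word",
          "search_analyzer": "ik_max_word"
        },
        "district": {
          "type": "text",
          "analyzer": "ik_max_word",
          "fields": {"raw": {"type": "keyword"}}
        }
      }
    }
  }
}
# then aggregate on the un-analyzed sub-field
GET person_list/info/_search
{
  "query": {"match_phrase_prefix": {"name": "張偉"}},
  "aggs": {
    "result": {"terms": {"field": "district.raw"}}
  }
}
In this article, however, we take a different route and keep the field itself un-analyzed.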
Either way, the core problem is the same: stop ik-analyzing district and the aggregation should behave.
To recap, there are two kinds of analyzers to choose from:
- Built-in analyzers: standard, simple, whitespace, the language analyzers and so on; see the official documentation for details
- Custom / third-party analyzers
After some thought: none of the district values contain spaces, so the whitespace analyzer will keep each value as a single token, which is exactly the "no segmentation" behaviour we need.
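The same _analyze check as before can confirm this; with the whitespace analyzer, a value without spaces comes back as a single token:
GET _analyze
{
  "analyzer": "whitespace",
  "text": "上海市"
}
# expected: a single token, 上海市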
So we rebuild the index from scratch:
# Delete the documents
POST person_list/_delete_by_query
{
"query": {
"match_all": {
}
}
}
# Delete the index
DELETE /person_list
# Create the index
PUT /person_list
{
"mappings": {
"info": {
"properties": {
"name": {
"type": "text",
"analyzer": "ik_max_word",
"search_analyzer": "ik_max_word"
},
"district": {
"type": "text",
"analyzer": "whitespace",
"search_analyzer": "whitespace",
"fielddata": true
}
}
}
}
}
# Import the data
POST /person_list/info/_bulk
{"index":{"_id":"1"}}
{"name":"李明","district":"上海市"}
{"index":{"_id":"2"}}
{"name":"李明","district":"上海市"}
{"index":{"_id":"3"}}
{"name":"李明","district":"北京市"}
{"index":{"_id":"4"}}
{"name":"張偉","district":"上海市"}
{"index":{"_id":"5"}}
{"name":"張偉","district":"北京市"}
{"index":{"_id":"6"}}
{"name":"張偉","district":"北京市"}
Query the aggregation result:
GET person_list/info/_search
{
"query":{
"match_phrase_prefix":{"name":"張偉"}
},
"aggs":{
"result":{
"terms":{"field":"district"}
}
}
}
The result is exactly what we want:
{
"took": 3,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 3,
"max_score": 0.47000363,
"hits": [
{
"_index": "person_list",
"_type": "info",
"_id": "4",
"_score": 0.47000363,
"_source": {
"name": "張偉",
"district": "上海市"
}
},
{
"_index": "person_list",
"_type": "info",
"_id": "6",
"_score": 0.47000363,
"_source": {
"name": "張偉",
"district": "北京市"
}
},
{
"_index": "person_list",
"_type": "info",
"_id": "5",
"_score": 0.2876821,
"_source": {
"name": "張偉",
"district": "北京市"
}
}
]
},
"aggregations": {
"result": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "北京市",
"doc_count": 2
},
{
"key": "上海市",
"doc_count": 1
}
]
}
}
}
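With district now indexed as whole values, filtering on an exact district also behaves as expected. A minimal sketch that combines the name search with a district filter (values taken from the test data above; because whitespace leaves the whole string as a single term, a term query matches it directly):
GET person_list/info/_search
{
  "query": {
    "bool": {
      "must": [
        {"match_phrase_prefix": {"name": "張偉"}}
      ],
      "filter": [
        {"term": {"district": "北京市"}}
      ]
    }
  }
}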
With that, we have covered how to implement basic filtering with Elasticsearch.