How to implement filtering in Elasticsearch (non-analyzed fields and aggregations)

0 Background

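A commonly used analyzer for Chinese text is IK (es-ik). The index is created as follows.
Here we define two fields on the index person_list: name and district. Note that index names must be lowercase.
(All requests below are written in Kibana console format.)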

PUT /person_list
{
  "mappings": {
    "info": {
      "properties": {
        "name": {
          "type": "text",
          "analyzer": "ik_max_word",
          "search_analyzer": "ik_max_word"
        },
        "district": {
          "type": "text",
          "analyzer": "ik_max_word",
          "search_analyzer": "ik_max_word"
        }
      }
    }
  }
}

View the index details, or just its settings or mapping:

GET person_list
GET /person_list/_settings
GET /person_list/_mapping

Add some data for testing.
You can add it in bulk (recommended):

POST /person_list/info/_bulk
{"index":{"_id":"1"}}
{"name":"李明","district":"上海市"}
{"index":{"_id":"2"}}
{"name":"李明","district":"上海市"}
{"index":{"_id":"3"}}
{"name":"李明","district":"北京市"}
{"index":{"_id":"4"}}
{"name":"張偉","district":"上海市"}
{"index":{"_id":"5"}}
{"name":"張偉","district":"北京市"}
{"index":{"_id":"6"}}
{"name":"張偉","district":"北京市"}

Or add documents one at a time:

POST /person_list/info
{
  "name": "李明",
  "district": "上海"
}

Now let's look at the requirements.

0.1 Requirement 1: search on name

This one is simple: both fuzzy and exact search can be done. The query below also sets the offset (from) and the page size (size).

GET person_list/info/_search
{
  "query": {
    "match_phrase_prefix": {"name": "張偉"}
  },
  "size": 10,
  "from": 0
}

Search result:

{
  "took": 0,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.35667494,
    "hits": [
      {
        "_index": "person_list",
        "_type": "info",
        "_id": "4",
        "_score": 0.35667494,
        "_source": {
          "name": "張偉",
          "district": "北京"
        }
      }
    ]
  }
}
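The `from` and `size` values in the query above are what paging is built on. As a minimal sketch (the helper names `page_to_from` and `build_name_search` are made up for illustration and are not part of any Elasticsearch client), the request body for a given 1-based page number could be built like this:

```python
import json


def page_to_from(page: int, size: int) -> int:
    """Return the zero-based `from` offset for a 1-based page number."""
    if page < 1 or size < 1:
        raise ValueError("page and size must be positive")
    return (page - 1) * size


def build_name_search(name: str, page: int = 1, size: int = 10) -> dict:
    """Build the same search body as above, with the paging fields filled in."""
    return {
        "query": {"match_phrase_prefix": {"name": name}},
        "size": size,
        "from": page_to_from(page, size),
    }


# Print a body that can be pasted after "GET person_list/info/_search".
print(json.dumps(build_name_search("張偉", page=3), ensure_ascii=False, indent=2))
```

With the default size of 10, page 3 starts at offset 20.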

0.2 Requirement 2: aggregate on a name search, so that searching for a name shows how many people with that name are in each district

The aggregation request is shown below; we want the number of people named 張偉 in each district.

GET person_list/info/_search
{
  "query": {
    "match_phrase_prefix": {"name": "張偉"}
  },
  "aggs": {
    "result": {
      "terms": {"field": "district"}
    }
  }
}

This time the response is:

{
  "error": {
    "root_cause": [
      {
        "type": "illegal_argument_exception",
        "reason": "Fielddata is disabled on text fields by default. Set fielddata=true on [district] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory. Alternatively use a keyword field instead."
      }
    ],
    "type": "search_phase_execution_exception",
    "reason": "all shards failed",
    "phase": "query",
    "grouped": true,
    "failed_shards": [
      {
        "shard": 0,
        "index": "person_list",
        "node": "SOK5mAntQ8SYv6BuOGYuMg",
        "reason": {
          "type": "illegal_argument_exception",
          "reason": "Fielddata is disabled on text fields by default. Set fielddata=true on [district] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory. Alternatively use a keyword field instead."
        }
      }
    ]
  },
  "status": 400
}

The request fails, and the error tells us to set fielddata=true. How do we do that?

1 Configuring an index that can be filtered

We can update the mapping directly with a request:

POST /person_list/_mapping/info
{
  "properties": {
    "district": {
      "type": "text",
      "analyzer": "ik_max_word",
      "search_analyzer": "ik_max_word",
      "fielddata": true
    }
  }
}

It succeeds:

{
  "acknowledged": true
}

Now run the aggregation again:

GET person_list/info/_search
{
  "query": {
    "match_phrase_prefix": {"name": "張偉"}
  },
  "aggs": {
    "result": {
      "terms": {"field": "district"}
    }
  }
}

Here is the result:

{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 3,
    "max_score": 0.47000363,
    "hits": [
      {
        "_index": "person_list",
        "_type": "info",
        "_id": "4",
        "_score": 0.47000363,
        "_source": {
          "name": "張偉",
          "district": "上海市"
        }
      },
      {
        "_index": "person_list",
        "_type": "info",
        "_id": "6",
        "_score": 0.47000363,
        "_source": {
          "name": "張偉",
          "district": "北京市"
        }
      },
      {
        "_index": "person_list",
        "_type": "info",
        "_id": "5",
        "_score": 0.2876821,
        "_source": {
          "name": "張偉",
          "district": "北京市"
        }
      }
    ]
  },
  "aggregations": {
    "result": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "北京",
          "doc_count": 2
        },
        {
          "key": "北京市",
          "doc_count": 2
        },
        {
          "key": "市",
          "doc_count": 2
        },
        {
          "key": "上海",
          "doc_count": 1
        },
        {
          "key": "上海市",
          "doc_count": 1
        },
        {
          "key": "海市",
          "doc_count": 1
        }
      ]
    }
  }
}

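Something is wrong: the district values have been split apart!
Thinking it through, district is analyzed with IK, so the terms aggregation runs over the tokens IK produced for each value and counts documents per token, not per original value.
That makes the fix clear: the district field should not be tokenized.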
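To make the effect concrete, here is a small pure-Python sketch of how a terms aggregation buckets documents. It does not call Elasticsearch or IK: the token table is an assumption copied from the bucket keys in the response above, and `terms_agg` is a hypothetical helper that only mimics the counting.

```python
from collections import Counter

# Hypothetical token table: copied from the bucket keys in the response
# above. This is NOT a real IK tokenizer.
IK_MAX_WORD_TOKENS = {
    "北京市": ["北京", "北京市", "市"],
    "上海市": ["上海", "上海市", "海市"],
}


def terms_agg(values, tokenize):
    """Count documents per term, the way a terms aggregation fills buckets."""
    counts = Counter()
    for value in values:
        # A document is counted once for each distinct term it contains.
        for term in set(tokenize(value)):
            counts[term] += 1
    return dict(counts)


# District values of documents 4, 5 and 6 (the three hits for 張偉).
districts = ["上海市", "北京市", "北京市"]

analyzed = terms_agg(districts, lambda v: IK_MAX_WORD_TOKENS[v])
unanalyzed = terms_agg(districts, lambda v: [v])

print(analyzed)    # six buckets, one per token
print(unanalyzed)  # two buckets, one per original value
```

The analyzed run reproduces the six buckets and counts seen in the response, while treating each value as a single term leaves only 北京市 (2) and 上海市 (1), which is the result the requirement asks for.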
Following a solution found online, we update the mapping again:

POST /person_list/_mapping/info
{
  "properties": {
    "district": {
      "type": "text",
      "fielddata": true,
      "fields": {"raw": {"type": "keyword"}}
    }
  }
}

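That does not work: the request fails with an analyzer conflict, because district was already mapped with ik_max_word and the analyzer of an existing field cannot be changed. What now?
The remaining problem is the IK tokenization on district; once that is removed, the aggregation should work.
To recap, there are two kinds of analyzer:

  1. Built-in: standard, simple, whitespace, language and so on; see the official documentation for what each one does
  2. Custom or third-party analyzers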

After some thought: none of the district values contain spaces, so the whitespace analyzer leaves each value as a single token.
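The reasoning above depends on the data containing no spaces. The sketch below approximates the whitespace analyzer with Python's `str.split()`, which is an assumption made for illustration and not the actual Lucene tokenizer; "New York" is a hypothetical value that does not appear in the test data.

```python
from typing import List


def whitespace_tokens(text: str) -> List[str]:
    """Roughly mimic the whitespace analyzer: split on whitespace only."""
    return text.split()


# A value without spaces stays in one piece.
print(whitespace_tokens("上海市"))    # ['上海市']

# A value with a space (hypothetical) would be split into two terms,
# and a terms aggregation would then produce two separate buckets.
print(whitespace_tokens("New York"))  # ['New', 'York']
```

So this approach holds only as long as district values never contain whitespace.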
So we recreate the index from scratch:

# Delete the data
POST person_list/_delete_by_query
{
  "query": { 
    "match_all": {
    }
  }
}

# Delete the index
DELETE /person_list

# Create the index
PUT /person_list
{
  "mappings": {
    "info": {
      "properties": {
        "name": {
          "type": "text",
          "analyzer": "ik_max_word",
          "search_analyzer": "ik_max_word"
        },
        "district": {
          "type": "text",
          "analyzer": "whitespace",
          "search_analyzer": "whitespace",
          "fielddata": true
        }
      }
    }
  }
}


# Import the data
POST /person_list/info/_bulk
{"index":{"_id":"1"}}
{"name":"李明","district":"上海市"}
{"index":{"_id":"2"}}
{"name":"李明","district":"上海市"}
{"index":{"_id":"3"}}
{"name":"李明","district":"北京市"}
{"index":{"_id":"4"}}
{"name":"張偉","district":"上海市"}
{"index":{"_id":"5"}}
{"name":"張偉","district":"北京市"}
{"index":{"_id":"6"}}
{"name":"張偉","district":"北京市"}

Query the aggregation:

GET person_list/info/_search
{
  "query": {
    "match_phrase_prefix": {"name": "張偉"}
  },
  "aggs": {
    "result": {
      "terms": {"field": "district"}
    }
  }
}

The result is exactly what we wanted:

{
  "took": 3,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 3,
    "max_score": 0.47000363,
    "hits": [
      {
        "_index": "person_list",
        "_type": "info",
        "_id": "4",
        "_score": 0.47000363,
        "_source": {
          "name": "張偉",
          "district": "上海市"
        }
      },
      {
        "_index": "person_list",
        "_type": "info",
        "_id": "6",
        "_score": 0.47000363,
        "_source": {
          "name": "張偉",
          "district": "北京市"
        }
      },
      {
        "_index": "person_list",
        "_type": "info",
        "_id": "5",
        "_score": 0.2876821,
        "_source": {
          "name": "張偉",
          "district": "北京市"
        }
      }
    ]
  },
  "aggregations": {
    "result": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "北京市",
          "doc_count": 2
        },
        {
          "key": "上海市",
          "doc_count": 1
        }
      ]
    }
  }
}
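On the application side, the buckets usually need to be turned into something a filter UI can render. A minimal sketch, assuming the response has already been parsed from JSON (the `buckets_to_counts` helper is hypothetical, and the sample below keeps only the `aggregations` part of the response above):

```python
import json

# Only the "aggregations" part of the response above, as a JSON string.
RESPONSE_JSON = """
{
  "aggregations": {
    "result": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {"key": "北京市", "doc_count": 2},
        {"key": "上海市", "doc_count": 1}
      ]
    }
  }
}
"""


def buckets_to_counts(response: dict, agg_name: str) -> dict:
    """Map each bucket key to its doc_count for the named terms aggregation."""
    buckets = response["aggregations"][agg_name]["buckets"]
    return {bucket["key"]: bucket["doc_count"] for bucket in buckets}


counts = buckets_to_counts(json.loads(RESPONSE_JSON), "result")
for district, count in counts.items():
    print(f"{district}: {count}")
```

The aggregation name `result` is the one chosen in the request; a different name in `aggs` would need the same name here.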

With that, we have a basic filtering feature working on Elasticsearch.
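As a side note, the error message earlier also suggested a second option: "Alternatively use a keyword field instead." A keyword field is indexed as a single term without analysis, so it can be aggregated without fielddata. The sketch below is not what the steps above did; it only builds the request bodies for that variant as Python dicts (the function names are made up for illustration) so they can be printed and pasted into Kibana.

```python
import json


def keyword_index_body() -> dict:
    """Index body where district is a keyword field instead of analyzed text."""
    return {
        "mappings": {
            "info": {
                "properties": {
                    "name": {
                        "type": "text",
                        "analyzer": "ik_max_word",
                        "search_analyzer": "ik_max_word",
                    },
                    # keyword: stored as one term, no analyzer, no fielddata
                    "district": {"type": "keyword"},
                }
            }
        }
    }


def district_counts_query(name: str) -> dict:
    """Same search and terms aggregation as used throughout this post."""
    return {
        "query": {"match_phrase_prefix": {"name": name}},
        "aggs": {"result": {"terms": {"field": "district"}}},
    }


print("PUT /person_list")
print(json.dumps(keyword_index_body(), ensure_ascii=False, indent=2))
print("GET person_list/info/_search")
print(json.dumps(district_counts_query("張偉"), ensure_ascii=False, indent=2))
```

Unlike the whitespace analyzer, this variant would also keep values that contain spaces in one bucket, at the cost of district no longer being searchable by partial text.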