Elasticsearch match and match_phrase Queries

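Problem:

The index contains a field whose value is 『第十人民醫院』 (Tenth People's Hospital). Analyzing it with the IK analyzer yields the following tokens: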

POST http://localhost:9200/development_hospitals/_analyze?pretty&field=hospital.names&analyzer=ik

{
  "tokens": [
    {
      "token": "第十",
      "start_offset": 0,
      "end_offset": 2,
      "type": "CN_WORD",
      "position": 0
    },
    {
      "token": "十人",
      "start_offset": 1,
      "end_offset": 3,
      "type": "CN_WORD",
      "position": 1
    },
    {
      "token": "十",
      "start_offset": 1,
      "end_offset": 2,
      "type": "TYPE_CNUM",
      "position": 2
    },
    {
      "token": "人民醫院",
      "start_offset": 2,
      "end_offset": 6,
      "type": "CN_WORD",
      "position": 3
    },
    {
      "token": "人民",
      "start_offset": 2,
      "end_offset": 4,
      "type": "CN_WORD",
      "position": 4
    },
    {
      "token": "人",
      "start_offset": 2,
      "end_offset": 3,
      "type": "COUNT",
      "position": 5
    },
    {
      "token": "民醫院",
      "start_offset": 3,
      "end_offset": 6,
      "type": "CN_WORD",
      "position": 6
    },
    {
      "token": "醫院",
      "start_offset": 4,
      "end_offset": 6,
      "type": "CN_WORD",
      "position": 7
    }
  ]
}

A match query built in Postman returns results, but a match_phrase query for 『第十』 returns nothing.
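
The exact request bodies are not shown here, so the following is only a sketch of what the two queries might look like. It assumes the query text is 『第十』 in both cases and reuses the index and field names from the _analyze request above; the helper `build_query` is hypothetical, and the real field path may differ depending on the mapping.

```python
import json

# Assumptions: both names are copied from the _analyze URL above
# and may not match the real search mapping.
INDEX = "development_hospitals"
FIELD = "hospital.names"


def build_query(query_type, text):
    """Return a search body for a `match` or `match_phrase` query on FIELD."""
    if query_type not in ("match", "match_phrase"):
        raise ValueError(f"unsupported query type: {query_type}")
    return {"query": {query_type: {FIELD: text}}}


match_body = build_query("match", "第十")
phrase_body = build_query("match_phrase", "第十")

# Each body would be sent with: POST http://localhost:9200/<INDEX>/_search
print(json.dumps(match_body, ensure_ascii=False, indent=2))
print(json.dumps(phrase_body, ensure_ascii=False, indent=2))
```

The two bodies differ only in the query type, which is why the difference in results has to come from how each query type treats the analyzed tokens.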

Analysis:

Reference: The Definitive Guide [2.x] | Elastic

A phrase search depends on the positions of the terms. Analyzing 『第十』 with ik_max_word yields:

{
  "tokens": [
    {
      "token": "第十",
      "start_offset": 0,
      "end_offset": 2,
      "type": "CN_WORD",
      "position": 0
    },
    {
      "token": "十",
      "start_offset": 1,
      "end_offset": 2,
      "type": "TYPE_CNUM",
      "position": 1
    }
  ]
}
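Both 『第十』 and 『十』 match as individual terms, but match_phrase also requires the relative positions of the analyzed terms to match exactly. In the query, 『第十』 and 『十』 sit at adjacent positions 0 and 1. After 『第十人民醫院』 is analyzed with ik_max_word, however, the token 『十人』 falls between them, which puts 『第十』 at position 0 and 『十』 at position 2, so the phrase does not match.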
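
The position check can be illustrated with a small simulation. This is a simplified model of an exact (slop 0) phrase match over (term, position) pairs, not Elasticsearch's actual implementation; the function `phrase_matches` is hypothetical, and the token lists are copied from the analyzer output above.

```python
from collections import defaultdict


def phrase_matches(doc_tokens, query_tokens):
    """Simplified slop-0 phrase match over (term, position) pairs.

    Returns True if every query term occurs in the document at the same
    relative position it has in the query.
    """
    if not query_tokens:
        return False
    positions = defaultdict(set)
    for term, pos in doc_tokens:
        positions[term].add(pos)

    first_term, first_pos = query_tokens[0]
    # Try every occurrence of the first query term as the phrase start.
    for start in positions.get(first_term, ()):
        if all(
            start + (qpos - first_pos) in positions.get(term, ())
            for term, qpos in query_tokens
        ):
            return True
    return False


# ik_max_word tokens for the document 『第十人民醫院』
doc_max_word = [
    ("第十", 0), ("十人", 1), ("十", 2), ("人民醫院", 3),
    ("人民", 4), ("人", 5), ("民醫院", 6), ("醫院", 7),
]
# ik_max_word tokens for the query 『第十』
query_max_word = [("第十", 0), ("十", 1)]

print(phrase_matches(doc_max_word, query_max_word))  # False
print(phrase_matches(doc_max_word, [("第十", 0)]))    # True
```

The first call fails because 『十』 is expected at position 1 but is indexed at position 2. The second call shows that a query that analyzes to the single token 『第十』 would match under the same model.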

Solution:

Using the ik_smart analyzer avoids this problem. The ik_smart results for 『第十人民醫院』 and 『第十』 are, respectively:

{
  "tokens": [
    {
      "token": "第十",
      "start_offset": 0,
      "end_offset": 2,
      "type": "CN_WORD",
      "position": 0
    },
    {
      "token": "人民醫院",
      "start_offset": 2,
      "end_offset": 6,
      "type": "CN_WORD",
      "position": 1
    }
  ]
}
{
  "tokens": [
    {
      "token": "第十",
      "start_offset": 0,
      "end_offset": 2,
      "type": "CN_WORD",
      "position": 0
    }
  ]
}

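The phrase now matches reliably: 『第十』 is at position 0 in both the document and the query, and the query contains no other token.

Best practice:

match_phrase is very strict, so it also misses some relevant results. In my view it is better to combine both approaches in a bool query and to boost the match_phrase clause. For example, index name twice with two different analyzers, then use match_phrase on the ik_smart field and match on the ik_max_word field: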

{
  "query": {
    "bool": {
      "should": [
          {"match_phrase": {"name1": {"query": "第十", "boost": 2}}},
          {"match": {"name2": "第十"}}
      ]
    }
  },
  "explain": true
}
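
For the query above to work, name has to be indexed twice, once per analyzer. The following sketch builds a possible mapping fragment together with the combined query. It assumes Elasticsearch 5.x or later, where analyzed fields use the `text` type (on 2.x the type would be `string`), and it assumes the IK plugin is installed. The helper names `build_mapping_properties` and `build_search_body` are hypothetical.

```python
import json


def build_mapping_properties():
    """Mapping fragment: the same name indexed with two IK analyzers."""
    return {
        # Coarse-grained tokens, used for the strict phrase match.
        "name1": {"type": "text", "analyzer": "ik_smart"},
        # Fine-grained tokens, used for the looser match query.
        "name2": {"type": "text", "analyzer": "ik_max_word"},
    }


def build_search_body(text, phrase_boost=2):
    """Bool query that ranks phrase hits above plain term hits."""
    return {
        "query": {
            "bool": {
                "should": [
                    {"match_phrase": {"name1": {"query": text, "boost": phrase_boost}}},
                    {"match": {"name2": text}},
                ]
            }
        },
        "explain": True,
    }


print(json.dumps(build_mapping_properties(), ensure_ascii=False, indent=2))
print(json.dumps(build_search_body("第十"), ensure_ascii=False, indent=2))
```

Because both clauses sit under should, a document that only satisfies the match clause is still returned, while a document that also satisfies the boosted match_phrase clause scores higher.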

Reposted from: https://zhuanlan.zhihu.com/p/25970549