
elasticsearch (11): implementing search suggestions with the ngram analysis mechanism

Reposted from Jianshu. Original article: Elasticsearch通過ngram分詞機制實現搜尋推薦

1. What is an ngram?

Take the English word quick. Its ngrams at the 5 possible lengths are:

ngram length=1: q u i c k
ngram length=2: qu ui ic ck
ngram length=3: qui uic ick
ngram length=4: quic uick
ngram length=5: quick
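
The table above can be reproduced with a short sketch (Python is used here purely for illustration; this is not an Elasticsearch API):

```python
def ngrams(word, n):
    """Return every contiguous substring of length n (the ngrams of word)."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]

# Print the same table as above for the word "quick".
for n in range(1, 6):
    print(f"ngram length={n}:", " ".join(ngrams("quick", n)))
```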

2. What is an edge ngram?

For the word quick, edge ngrams are anchored at the first letter, producing only prefixes:

q
qu
qui
quic
quick
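
The prefix list above can be generated with a small sketch (`edge_ngrams` is a hypothetical helper for illustration, not an Elasticsearch function):

```python
def edge_ngrams(word, min_gram=1, max_gram=None):
    """Return the prefixes of word from length min_gram up to max_gram."""
    if max_gram is None:
        max_gram = len(word)
    upper = min(max_gram, len(word))
    return [word[:n] for n in range(min_gram, upper + 1)]

# Edge ngrams of "quick": q, qu, qui, quic, quick
print(edge_ngrams("quick"))
```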

With edge ngram, every word is further split into these prefix grams, and the resulting grams are what powers prefix search suggestions (search-as-you-type).

Suppose two documents are indexed:

hello world    (doc1)
hello we       (doc2)

The edge ngrams of their tokens, with the posting list for each gram:

h        doc1, doc2
he       doc1, doc2
hel      doc1, doc2
hell     doc1, doc2
hello    doc1, doc2

w        doc1, doc2
wo       doc1
wor      doc1
worl     doc1
world    doc1
we       doc2

For example, search for hello w.

Both doc1 and doc2 contain the terms hello and w, and the positions also line up, so doc1 and doc2 are both returned.

At search time, there is no longer any need to take a prefix and scan the whole inverted index. The prefix is simply looked up in the inverted index like any ordinary term; if it is found, the match is complete.
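
The lookup described above can be simulated with a tiny in-memory inverted index (a hypothetical illustration; position matching is omitted for brevity, and the doc ids and titles are the ones used in this article):

```python
# Build an inverted index from the edge ngrams of each document's tokens.
docs = {1: "hello world", 2: "hello we"}

def edge_ngrams(word):
    return [word[:n] for n in range(1, len(word) + 1)]

index = {}
for doc_id, title in docs.items():
    for token in title.split():
        for gram in edge_ngrams(token):
            index.setdefault(gram, set()).add(doc_id)

# Searching "hello w": each query term is looked up directly as a key --
# no index scan -- and the posting lists are intersected.
hits = index["hello"] & index["w"]
print(sorted(hits))
```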

3. The min and max gram settings

min ngram = 1
max ngram = 3

These set the minimum and maximum gram lengths (here: minimum 1, maximum 3).

For example, given the word helloworld, the edge ngrams produced are:

h
he
hel

Gram generation stops at the maximum length of 3.
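
The truncation effect of the two settings can be sketched like this (again a hypothetical helper, mirroring what the filter does):

```python
def edge_ngrams(word, min_gram, max_gram):
    """Edge ngrams of word, limited to lengths min_gram..max_gram."""
    upper = min(max_gram, len(word))
    return [word[:n] for n in range(min_gram, upper + 1)]

# min_gram=1, max_gram=3: generation stops at length 3.
print(edge_ngrams("helloworld", 1, 3))
```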

4. Trying out edge ngram

PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "autocomplete_filter": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 20
        }
      },
      "analyzer": {
        "autocomplete": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "autocomplete_filter"
          ]
        }
      }
    }
  }
}

PUT /my_index/_mapping/my_type
{
  "properties": {
      "title": {
          "type":     "string",
          "analyzer": "autocomplete",
          "search_analyzer": "standard"
      }
  }
}

Note: why is search_analyzer set to standard here rather than autocomplete?

Because at search time there is no need to split the query letter by letter again. For a search like hello w, splitting it into the terms hello and w is enough; there is no point expanding it into:

h
he
hel
hell
hello   

w

Analyzing the query that way would only make it slower.
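
The asymmetry between the two analyzers can be sketched as follows (a simplified simulation; `index_analyze` and `search_analyze` are hypothetical names, and the real standard tokenizer does more than a whitespace split):

```python
def edge_ngrams(word, min_gram=1, max_gram=20):
    upper = min(max_gram, len(word))
    return [word[:n] for n in range(min_gram, upper + 1)]

def index_analyze(text):
    """Index time (autocomplete): split, lowercase, then edge-ngram each token."""
    return [g for tok in text.lower().split() for g in edge_ngrams(tok)]

def search_analyze(text):
    """Search time (standard): whole terms only, no ngram expansion."""
    return text.lower().split()

print(index_analyze("hello world"))  # every prefix of every token
print(search_analyze("hello w"))     # just the two query terms
```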

Index 4 documents:

PUT /my_index/my_type/1
{
  "title" : "hello world"
}

PUT /my_index/my_type/2
{
  "title" : "hello we"
}

PUT /my_index/my_type/3
{
  "title" : "hello win"
}

PUT /my_index/my_type/4
{
  "title" : "hello dog"
}

Run a search:

GET /my_index/my_type/_search
{
  "query": {
    "match_phrase": {
      "title": "hello w"
    }
  }
}

Result:

{
  "took": 6,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 3,
    "max_score": 1.1983768,
    "hits": [
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "2",
        "_score": 1.1983768,
        "_source": {
          "title": "hello we"
        }
      },
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "1",
        "_score": 0.8271048,
        "_source": {
          "title": "hello world"
        }
      },
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "3",
        "_score": 0.797104,
        "_source": {
          "title": "hello win"
        }
      }
    ]
  }
}

match_phrase is supposed to match only exact phrases, not loose terms. So why does it match three documents here?

Because when we created the mapping, we configured analysis on title and used edge ngram to split it into prefix grams at index time, while at search time the standard analyzer splits the query into whole terms. The terms hello and w are both present in the index with matching positions, so the lookup is both correct and blazing fast!
是因為我們建立mapping的時候對title進行了分詞設定,運用了ngram將他進行了拆分,而搜尋的時候按照標準的standard分詞器去拆分term,這樣效率槓槓的!!