Elasticsearch實踐（二）：搜尋

ElasticSearch 中文分詞 · 發表 2018-11-25 11:07:00

摘要：本文以 Elasticsearch 6.2.4為例。經過前面的基礎入門，我們對ES的基本操作也會了。現在來學習ES最強大的部分：全文檢索。準備工作批量匯入資料先需要準備點資料，然後匯入： wget https://raw.githubusercontent.com/...

本文以 Elasticsearch 6.2.4為例。

經過前面的基礎入門，我們對ES的基本操作也會了。現在來學習ES最強大的部分：全文檢索。

準備工作

批量匯入資料

先需要準備點資料，然後匯入：

wget https://raw.githubusercontent.com/elastic/elasticsearch/master/docs/src/test/resources/accounts.json

curl -H "Content-Type: application/json" -XPOST "localhost:9200/bank/account/_bulk" --data-binary "@accounts.json"

這樣我們就匯入了1000條資料到ES。index是bank。我們可以檢視現在有哪些index：

curl "localhost:9200/_cat/indices?format=json&pretty"

結果：

[
{
"health" : "yellow",
"status" : "open",
"index" : "bank",
"uuid" : "IhyOzz3WTFuO5TNgPJUZsw",
"pri" : "5",
"rep" : "1",
"docs.count" : "1000",
"docs.deleted" : "0",
"store.size" : "640.3kb",
"pri.store.size" : "640.3kb"
},
{
"health" : "yellow",
"status" : "open",
"index" : "customer",
"uuid" : "f_nzBLypSUK2SVjL2AoKxQ",
"pri" : "5",
"rep" : "1",
"docs.count" : "9",
"docs.deleted" : "0",
"store.size" : "31kb",
"pri.store.size" : "31kb"
},
{
"health" : "yellow",
"status" : "open",
"index" : ".kibana",
"uuid" : "tnWbNLSMT7273UEh6RfcBg",
"pri" : "1",
"rep" : "1",
"docs.count" : "5",
"docs.deleted" : "0",
"store.size" : "29.4kb",
"pri.store.size" : "29.4kb"
}
]

使用kibana視覺化資料

該小節是可選的，如果不感興趣，可以跳過。

該小節要求你已經搭建好了ElasticSearch + Kibana。

開啟kibana web地址：http://127.0.0.1:5601，依次開啟： Management
-> Kibana -> Index Patterns ,選擇 Create Index Pattern ：

a. Index pattern 輸入： bank ；

b. 點選Create。

然後開啟Discover，選擇 bank 就能看到剛才匯入的資料了。

我們在視覺化介面裡檢索資料：

是不是很酷！

接下來我們使用API來實現檢索。

查詢

模糊檢索

GET /bank/_search?q="Virginia"&pretty

解釋：檢索關鍵字為"Virginia"的結果。結果示例：

{
"took": 4,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 4.631368,
"hits": [
{
"_index": "bank",
"_type": "account",
"_id": "298",
"_score": 4.631368,
"_source": {
"account_number": 298,
"balance": 34334,
"firstname": "Bullock",
"lastname": "Marsh",
"age": 20,
"gender": "M",
"address": "589 Virginia Place",
"employer": "Renovize",
"email": "[email protected]",
"city": "Coinjock",
"state": "UT"
}
},
{
"_index": "bank",
"_type": "account",
"_id": "25",
"_score": 4.6146765,
"_source": {
"account_number": 25,
"balance": 40540,
"firstname": "Virginia",
"lastname": "Ayala",
"age": 39,
"gender": "F",
"address": "171 Putnam Avenue",
"employer": "Filodyne",
"email": "[email protected]",
"city": "Nicholson",
"state": "PA"
}
}
]
}
}

返回欄位含義：

took – Elasticsearch執行搜尋的時間（以毫秒為單位）
timed_out – 搜尋是否超時
_shards – 搜尋了多少個分片，以及搜尋成功/失敗分片的計數
hits – 搜尋結果，是個物件
hits.total – 符合我們搜尋條件的文件總數
hits.hits – 實際的搜尋結果陣列（預設為前10個文件）
hits.sort - 對結果進行排序（如果按score排序則沒有該欄位）
hits._score、max_score - 暫時忽略這些欄位

GET /bank/_search?q=*&sort=account_number:asc&pretty

解釋：所有結果通過account_number欄位升序排列。預設只返回前10條。

下面的查詢與上面的含義一致：

GET /bank/_search
{
"query": {
"multi_match" : {
"query" : "Virginia",
"fields" : ["_all"]
}
}
}

GET /bank/_search
{
"query": { "match_all": {} },
"sort": [
{ "account_number": "asc" }
]
}

通常我們會採用傳JSON方式查詢。Elasticsearch提供了一種JSON樣式的特定於域的語言，可用於執行查詢。這被稱為查詢DSL。

注意：上述的查詢裡面我們僅指定了index，並沒有指定type，那麼ES將不會區分type。如果想區分，請在URI後面追加type。示例： GET /bank/account/_search 。

欄位檢索

再看按欄位查詢：

GET /bank/_search
{
"query": {
"multi_match" : {
"query" : "Virginia",
"fields" : ["firstname"]
}
}
}

GET /bank/_search
{
"query": {
"match" : {
"firstname" : "Virginia"
}
}
}

上面2種查詢是等效的，都是查詢 firstname 為 Virginia 的結果。

不分詞

預設檢索都是分詞的，如果我們希望精確匹配，可以這樣實現：

1、首先在mapping裡設定欄位不分詞

GET /bank/_search
{
"query": {
"match" : {
"address.keyword" : "171 Putnam Avenue"
}
}
}

在欄位後面加上 .keyword 表示不分詞，使用精確匹配。大家可以測試下面2種查詢結果的區別:

GET /bank/_search
{
"query": {
"match" : {
"address" : "Putnam"
}
}
}

GET /bank/_search
{
"query": {
"match" : {
"address.keyword" : "Putnam"
}
}
}

第二種將查不到任何結果。

分頁

分頁使用關鍵字from、size，分別表示偏移量、分頁大小。

GET /bank/_search
{
"query": { "match_all": {} },
"from": 0,
"size": 2
}

from預設是0，size預設是10。

欄位排序

欄位排序關鍵字是sort。支援升序(asc)、降序(desc)。

GET /bank/_search
{
"query": { "match_all": {} },
"sort": [
{ "account_number": "asc" }
],
"from":0,
"size":10
}

過濾欄位

預設情況下，ES返回所有欄位。這被稱為源（ _source 搜尋命中中的欄位）。如果我們不希望返回所有欄位，我們可以只請求返回源中的幾個欄位。

GET /bank/_search
{
"query": { "match_all": {} },
"_source": ["account_number", "balance"]
}

通過 _source 關鍵字可以實現欄位過濾。

AND查詢

如果我們想同時查詢符合A和B欄位的結果，該怎麼查呢？可以使用must關鍵字組合。

GET /bank/_search
{
"query": {
"bool": {
"must": [
{ "match": { "address": "mill" } },
{ "match": { "address": "lane" } }
]
}
}
}


GET /bank/_search
{
"query": {
"bool": {
"must": [
{ "match": { "account_number":136 } },
{ "match": { "address": "lane" } },
{ "match": { "city": "Urie" } }
]
}
}
}

must也等價於：

GET /bank/_search
{
"query": {
"bool": {
"must": [
{ "match": { "address": "mill" } }
],
"must": [
{ "match": { "address": "lane" } }
]
}
}
}

這種相當於先查詢A再查詢B，而上面的則是同時查詢符合A和B，但結果是一樣的，執行效率可能有差異。有知道原因的朋友可以告知。

OR查詢

ES使用should關鍵字來實現OR查詢。

GET /bank/_search
{
"query": {
"bool": {
"should": [
{ "match": { "account_number":136 } },
{ "match": { "address": "lane" } },
{ "match": { "city": "Urie" } }
]
}
}
}

AND取反查

must_not 關鍵字實現了既不包含A也不包含B的查詢。

GET /bank/_search
{
"query": {
"bool": {
"must_not": [
{ "match": { "address": "mill" } },
{ "match": { "address": "lane" } }
]
}
}

表示 address 欄位需要符合既不包含 mill 也不包含 lane。

布林組合查詢

我們可以組合 must 、should 、must_not 進行復雜的查詢。

A AND NOT B

GET /bank/_search
{
"query": {
"bool": {
"must": [
{ "match": { "age": 40 } }
],
"must_not": [
{ "match": { "state": "ID" } }
]
}
}
}

相當於SQL：

select * from bank where age=40 and state!= "ID";

A AND (B OR C)

GET /bank/_search
{
"query":{
"bool":{
"must":[
{"match":{"age":39}},
{"bool":{"should":[
{"match":{"city":"Nicholson"}},
{"match":{"city":"Yardville"}}
]}
}
]
}
}
}

相當於SQL：

select * from bank where age=39 and (city="Nicholson" or city="Yardville");

範圍查詢

GET /bank/_search
{
"query": {
"bool": {
"must": { "match_all": {} },
"filter": {
"range": {
"balance": {
"gte": 20000,
"lte": 30000
}
}
}
}
}
}

相當於SQL：

select * from bank where balance between 20000 and 30000;

多欄位範圍查詢：

GET /bank/_search
{
"query": {
"bool": {
"must": { "match_all": {} },
"filter": {
"bool":{
"must":[
{"range": {"balance": {"gte": 20000,"lte": 30000}}},
{"range": {"age": {"gte": 30}}}
]
}
}
}
}
}

聚合查詢

GET /bank/_search
{
"size": 0,
"aggs": {
"group_by_state": {
"terms": {
"field": "state.keyword"
}
}
}
}

結果：

{
"took": 29,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped" : 0,
"failed": 0
},
"hits" : {
"total" : 1000,
"max_score" : 0.0,
"hits" : [ ]
},
"aggregations" : {
"group_by_state" : {
"doc_count_error_upper_bound": 20,
"sum_other_doc_count": 770,
"buckets" : [ {
"key" : "ID",
"doc_count" : 27
}, {
"key" : "TX",
"doc_count" : 27
}, {
"key" : "AL",
"doc_count" : 25
}, {
"key" : "MD",
"doc_count" : 25
}, {
"key" : "TN",
"doc_count" : 23
}, {
"key" : "MA",
"doc_count" : 21
}, {
"key" : "NC",
"doc_count" : 21
}, {
"key" : "ND",
"doc_count" : 21
}, {
"key" : "ME",
"doc_count" : 20
}, {
"key" : "MO",
"doc_count" : 20
} ]
}
}
}

查詢結果返回了ID州(Idaho)有27個賬戶，TX州(Texas)有27個賬戶。

相當於SQL：

SELECT state, COUNT(*) FROM bank GROUP BY state ORDER BY COUNT(*) DESC

該查詢意思是按照欄位state分組，返回前10個聚合結果。

其中size設定為0意思是不返回文件內容，僅返回聚合結果。 state.keyword 表示欄位精確匹配，因為使用模糊匹配效能很低，所以不支援。

多重聚合

我們可以在聚合的基礎上再進行聚合，例如求和、求平均值等等。

GET /bank/_search
{
"size": 0,
"aggs": {
"group_by_state": {
"terms": {
"field": "state.keyword"
},
"aggs": {
"average_balance": {
"avg": {
"field": "balance"
}
}
}
}
}
}

上述查詢實現了在前一個聚合的基礎上，按州計算平均帳戶餘額（同樣僅針對按降序排序的前10個州）。

我們可以在聚合中任意巢狀聚合，以從資料中提取所需的統計資料。

在前一個聚合的基礎上，我們現在按降序排列平均餘額：

GET /bank/_search
{
"size": 0,
"aggs": {
"group_by_state": {
"terms": {
"field": "state.keyword",
"order": {
"average_balance": "desc"
}
},
"aggs": {
"average_balance": {
"avg": {
"field": "balance"
}
}
}
}
}
}

這裡基於第二個聚合結果進行倒序排列。其實上一個例子隱藏了預設排序，也就是預設按照 _sort (分值)倒序：

GET /bank/_search
{
"size": 0,
"aggs": {
"group_by_state": {
"terms": {
"field": "state.keyword",
"order": {
"_sort": "desc"
}
},
"aggs": {
"average_balance": {
"avg": {
"field": "balance"
}
}
}
}
}
}

此示例演示了我們如何按年齡段（20-29歲，30-39歲和40-49歲）進行分組，然後按性別分組，最後得到每個年齡段的平均帳戶餘額：

GET /bank/_search
{
"size": 0,
"aggs": {
"group_by_age": {
"range": {
"field": "age",
"ranges": [
{
"from": 20,
"to": 30
},
{
"from": 30,
"to": 40
},
{
"from": 40,
"to": 50
}
]
},
"aggs": {
"group_by_gender": {
"terms": {
"field": "gender.keyword"
},
"aggs": {
"average_balance": {
"avg": {
"field": "balance"
}
}
}
}
}
}
}
}

這個結果就複雜了，屬於巢狀分組，結果也是巢狀的：

{
"took": 5,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 1000,
"max_score": 0,
"hits": []
},
"aggregations": {
"group_by_age": {
"buckets": [
{
"key": "20.0-30.0",
"from": 20,
"to": 30,
"doc_count": 451,
"group_by_gender": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "M",
"doc_count": 232,
"average_balance": {
"value": 27374.05172413793
}
},
{
"key": "F",
"doc_count": 219,
"average_balance": {
"value": 25341.260273972603
}
}
]
}
},
{
"key": "30.0-40.0",
"from": 30,
"to": 40,
"doc_count": 504,
"group_by_gender": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "F",
"doc_count": 253,
"average_balance": {
"value": 25670.869565217392
}
},
{
"key": "M",
"doc_count": 251,
"average_balance": {
"value": 24288.239043824702
}
}
]
}
},
{
"key": "40.0-50.0",
"from": 40,
"to": 50,
"doc_count": 45,
"group_by_gender": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "M",
"doc_count": 24,
"average_balance": {
"value": 26474.958333333332
}
},
{
"key": "F",
"doc_count": 21,
"average_balance": {
"value": 27992.571428571428
}
}
]
}
}
]
}
}
}

term與match查詢

首先大家看下面的例子有什麼區別：

已知條件：ES裡 address 為 171 Putnam Avenue 的資料有1條； address 為 Putnam 的資料有0條。index為bank，type為account，文件ID為25。

GET /bank/_search
{
"query": {
"match" : {
"address" : "Putnam"
}
}
}

GET /bank/_search
{
"query": {
"match" : {
"address.keyword" : "Putnam"
}
}
}

GET /bank/_search
{
"query": {
"term" : {
"address" : "Putnam"
}
}
}

結果：

1、第一個能匹配到資料，因為會分詞查詢。

2、第二個不能匹配到資料，因為不分詞的話沒有該條資料。

3、結果不確定。需要看實際是怎麼分詞的。

我們通過下列查詢可以知曉該條資料欄位 address 的分詞情況:

GET /bank/account/25/_termvectors?fields=address

結果：

{
"_index": "bank",
"_type": "account",
"_id": "25",
"_version": 1,
"found": true,
"took": 0,
"term_vectors": {
"address": {
"field_statistics": {
"sum_doc_freq": 591,
"doc_count": 197,
"sum_ttf": 591
},
"terms": {
"171": {
"term_freq": 1,
"tokens": [
{
"position": 0,
"start_offset": 0,
"end_offset": 3
}
]
},
"avenue": {
"term_freq": 1,
"tokens": [
{
"position": 2,
"start_offset": 11,
"end_offset": 17
}
]
},
"putnam": {
"term_freq": 1,
"tokens": [
{
"position": 1,
"start_offset": 4,
"end_offset": 10
}
]
}
}
}
}
}

可以看出該條資料欄位 address 一共分了3個詞：

171
avenue
putnam

現在可以得出第三個查詢的答案：匹配不到！但值改成小寫的 putnam 又能匹配到了！

原因是：

term query 查詢的是倒排索引中確切的term
match query 會對filed進行分詞操作，然後再查詢

由於 Putnam 不在分詞裡（大小寫敏感），所以匹配不到。match query先對filed進行分詞，也就是分成 putnam ，再去匹配倒排索引中的term,所以能匹配到。

standard analyzer 分詞器分詞預設會將大寫字母全部轉為小寫字母。

參考

1、Getting Started | Elasticsearch Reference [6.2] | Elastic

https://www.elastic.co/guide/en/elasticsearch/reference/6.2/getting-started.html

2、Elasticsearch 5.x 關於term query和match query的認識 - wangchuanfu - 部落格園

https://www.cnblogs.com/wangchuanfu/p/7444253.html

Elasticsearch實踐（二）：搜尋

準備工作

批量匯入資料

使用kibana視覺化資料

該小節要求你已經搭建好了ElasticSearch + Kibana。

查詢

模糊檢索

欄位檢索

不分詞

分頁

欄位排序

過濾欄位

AND查詢

OR查詢

AND取反查

布林組合查詢

範圍查詢

聚合查詢

多重聚合

term與match查詢

參考

您可能也會喜歡…