Elasticsearch Search Engine, Part 12: Aggregation Analysis
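Introduction to Aggregation Analysis
Aggregation analysis is an important database feature: it performs aggregate computations over the data set returned by a query, such as finding the maximum or minimum of a field (or of a computed expression), or computing sums and averages. As both a search engine and a data store, ES also offers powerful aggregation capabilities. They fall into the following categories:
- Metric aggregations (metric): compute a metric such as max, min, sum, or average over a data set.
- Bucket aggregations (bucketing): besides aggregate functions, relational databases can group query results with group by and then compute metrics on each group. In ES, group by is called bucketing.
- ES also provides matrix aggregations (matrix) and pipeline aggregations (pipeline), though these were still being refined at the time of writing.
Aggregations are defined in the search request body under the aggregations node, using the following syntax (aggregations can be abbreviated to aggs):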
"aggregations" : {
"<aggregation_name>" : {
"<aggregation_type>" : {
<aggregation_body>
}
[,"meta" : { [<meta_data_body>] } ]?
[,"aggregations" : { [<sub_aggregation>]+ } ]?
}
[,"<aggregation_name_2>" : { ... } ]*
}
The values being aggregated can be taken from a field or computed by a script.
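To make the nesting in the grammar above concrete, here is a minimal Python sketch that assembles a request body as a plain dictionary. The helper name build_agg is hypothetical and not part of any Elasticsearch client; the sketch only produces the JSON you would send to _search.

```python
import json

def build_agg(agg_type, body, sub_aggs=None, meta=None):
    """Build one aggregation definition: type + body, optional meta and sub-aggregations."""
    node = {agg_type: body}
    if meta is not None:
        node["meta"] = meta
    if sub_aggs:
        node["aggs"] = sub_aggs  # "aggs" is the short form of "aggregations"
    return node

# A terms bucket aggregation with a nested max metric aggregation.
request_body = {
    "size": 0,
    "aggs": {
        "age_terms": build_agg(
            "terms",
            {"field": "age"},
            sub_aggs={"max_balance": build_agg("max", {"field": "balance"})},
        )
    },
}

print(json.dumps(request_body, indent=2))
```

The aggregation names (age_terms, max_balance) are user-chosen keys; they reappear in the response so that each result can be matched to its definition.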
Metric Aggregations
max min sum avg
Find the maximum balance across all customers (size=0 means no search hits are returned, only the aggregation result):
POST /bank/_search?
{
"size": 0,
"aggs": {
"max_balance": {
"max": {
"field": "balance"
}
}
}
}
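Find the maximum balance among customers aged 24 (with size=2 and a sort, the request also returns the two hits with the highest balance):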
POST /bank/_search?
{
"size": 2,
"query": {
"match": {
"age": 24
}
},
"sort": [
{
"balance": {
"order": "desc"
}
}
],
"aggs": {
"max_balance": {
"max": {
"field": "balance"
}
}
}
}
Compute the average age of all customers (the values come from a script):
POST /bank/_search?size=0
{
"aggs" : {
"avg_age" : {
"avg" : {
"script" : {
"source" : "doc.age.value"
}
}
},
"avg_age10" : {
"avg" : {
"script" : {
"source" : "doc.age.value + 10"
}
}
}
}
}
Specify a field, then read that field's value inside the script through _value:
POST /bank/_search?size=0
{
"aggs": {
"sum_balance": {
"sum": {
"field": "balance",
"script": {
"source": "_value * 1.03"
}
}
}
}
}
Specify a value to use for documents that lack the field. If missing is not set, documents without the field are ignored:
POST /bank/_search?size=0
{
"aggs": {
"avg_age": {
"avg": {
"field": "age",
"missing": 18
}
}
}
}
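The effect of missing on an average can be sketched in plain Python. This is an illustration on made-up sample data, not how ES computes the value internally; None stands in for a document that has no age field.

```python
def avg_with_missing(values, missing=None):
    """Average the values; None entries are skipped unless a missing default is given."""
    if missing is None:
        present = [v for v in values if v is not None]
    else:
        present = [missing if v is None else v for v in values]
    return sum(present) / len(present) if present else None

ages = [20, None, 40]  # hypothetical documents; the second one has no age

print(avg_with_missing(ages))              # the document without age is ignored
print(avg_with_missing(ages, missing=18))  # the document without age counts as 18
```

With the default behaviour the average is taken over two documents; with missing set to 18 it is taken over all three.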
Document Counts
Count matching documents with _count:
POST /bank/_doc/_count
{
"query": {
"match": {
"age" : 24
}
}
}
Count distinct values with cardinality:
POST /bank/_search?size=0
{
"aggs": {
"age_count": {
"cardinality": {
"field": "age"
}
},
"state_count": {
"cardinality": {
"field": "state.keyword"
}
}
}
}
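Count how many values a field has across the matching documents with value_count (for a single-valued field this equals the number of documents that have a value):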
POST /bank/_search?size=0
{
"aggs" : {
"age_count" : { "value_count" : { "field" : "age" } }
}
}
stats computes five values at once: count, max, min, avg, and sum:
POST /bank/_search?size=0
{
"aggs": {
"age_stats": {
"stats": {
"field": "age"
}
}
}
}
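extended_stats adds four more results on top of stats: sum of squares, variance, standard deviation, and the interval of the mean plus/minus two standard deviations: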
POST /bank/_search?size=0
{
"aggs": {
"age_stats": {
"extended_stats": {
"field": "age"
}
}
}
}
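As a rough guide to what the extra fields mean, the following Python sketch computes them for a small list. It assumes population variance (dividing by the count) and a bound width of two standard deviations; the function name and sample values are illustrative only.

```python
import math

def extended_stats(values, sigma=2.0):
    """Compute stats plus sum of squares, variance, std deviation, and mean +/- sigma*std."""
    count = len(values)
    total = sum(values)
    avg = total / count
    sum_of_squares = sum(v * v for v in values)
    # Population variance: mean squared deviation from the average.
    variance = sum((v - avg) ** 2 for v in values) / count
    std = math.sqrt(variance)
    return {
        "count": count,
        "min": min(values),
        "max": max(values),
        "avg": avg,
        "sum": total,
        "sum_of_squares": sum_of_squares,
        "variance": variance,
        "std_deviation": std,
        "std_deviation_bounds": {"upper": avg + sigma * std, "lower": avg - sigma * std},
    }

result = extended_stats([2, 4, 4, 4, 5, 5, 7, 9])
print(result)
```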
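Percentiles: Values at Given Percentages
The values of the given field (or script) are taken in ascending order, and for each value the cumulative share of documents (as a percentage of all matching documents) is tracked; the aggregation returns the value found at each requested percentage. By default it returns the values at the [ 1, 5, 25, 50, 75, 95, 99 ] percentiles. The middle entry in the result below can be read as: 50% of the documents have age <= 31, or conversely, documents with age <= 31 make up 50% of all matching documents.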
POST /bank/_search?size=0
{
"aggs": {
"age_percents": {
"percentiles": {
"field": "age"
}
}
}
}
# Response
"aggregations": {
"age_percents": {
"values": {
"1.0": 20,
"5.0": 21,
"25.0": 25,
"50.0": 31,
"75.0": 35,
"95.0": 39,
"99.0": 40
}
}
}
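The idea of "the value at a given percentage" can be sketched with an exact nearest-rank calculation in Python. This is only an illustration on made-up ages: ES computes percentiles approximately (TDigest by default), so its results on real data may differ slightly from an exact method like this one.

```python
import math

def percentile_nearest_rank(values, percent):
    """Return the smallest value such that at least `percent` % of values are <= it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(percent / 100 * len(ordered)))
    return ordered[rank - 1]

ages = [20, 22, 25, 28, 31, 31, 33, 35, 38, 40]  # hypothetical sample

for p in (1, 25, 50, 99):
    print(p, percentile_nearest_rank(ages, p))
```

On this sample the 50th percentile is 31, meaning half of the sample has an age of 31 or less.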
You can also specify which percentiles to compute:
POST /bank/_search?size=0
{
"aggs": {
"age_percents": {
"percentiles": {
"field": "age",
"percents" : [95, 99, 99.9]
}
}
}
}
# Response
"aggregations": {
"age_percents": {
"values": {
"95.0": 39,
"99.0": 40,
"99.9": 40
}
}
}
Percentile Ranks: Share of Documents with Values at or Below a Given Value
POST /bank/_search?size=0
{
"aggs": {
"gge_perc_rank": {
"percentile_ranks": {
"field": "age",
"values": [
25,
30
]
}
}
}
}
# Response
"aggregations": {
"gge_perc_rank": {
"values": {
"25.0": 26.1,
"30.0": 49.3
}
}
}
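percentile_ranks is the inverse question of percentiles: given a value, what percentage of documents are at or below it. A minimal exact sketch in Python, on made-up ages (ES itself returns an approximation):

```python
def percentile_rank(values, threshold):
    """Percentage of values that are less than or equal to the threshold."""
    at_or_below = sum(1 for v in values if v <= threshold)
    return 100 * at_or_below / len(values)

ages = [20, 22, 25, 28, 31, 31, 33, 35, 38, 40]  # hypothetical sample

for value in (25, 30):
    print(value, percentile_rank(ages, value))
```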
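Bounding box of the geo points in the document set (geo_bounds aggregation)
Centroid of the geo points (geo_centroid aggregation)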
Bucket Aggregations
Terms Aggregation: Group by Field Terms
POST /bank/_search?size=0
{
"aggs": {
"age_terms": {
"terms": {
"field": "age" # one bucket per distinct age value
}
}
}
}
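# Response
"aggregations": {
"age_terms": {
"doc_count_error_upper_bound": 0, # upper bound of the error on the document counts
"sum_other_doc_count": 463, # number of documents in buckets that were not returned
"buckets": [
{
"key": 31, # the age value
"doc_count": 61 # number of documents with this value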
},
{
"key": 39,
"doc_count": 60
},
{
"key": 26,
"doc_count": 59
},
….
]
}
}
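The shape of this response can be reproduced on a single machine with a counter. The Python sketch below is an illustration only: it ignores sharding (so there is no counting error), and it assumes ties in doc_count are broken by ascending key.

```python
from collections import Counter

def terms_buckets(values, size=10):
    """Group values into buckets, keep the `size` largest, and sum up the rest."""
    counts = Counter(values)
    # Highest document count first; ties broken by ascending key (an assumption).
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {
        "sum_other_doc_count": sum(c for _, c in ordered[size:]),
        "buckets": [{"key": k, "doc_count": c} for k, c in ordered[:size]],
    }

ages = [31, 31, 31, 39, 39, 26, 40]  # hypothetical sample
result = terms_buckets(ages, size=2)
print(result)
```

Here sum_other_doc_count is the number of documents that fell into buckets beyond the requested size.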
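By default, the top 10 buckets are returned, ordered by document count from highest to lowest.
size sets how many buckets are returned.
shard_size sets how many buckets each shard returns. Its default value is:
- If the index has a single shard, shard_size=size
- If the index has multiple shards, shard_size=size*1.5+10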
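The default described above can be written as a small Python function. The function name is hypothetical, and truncating the result to an integer is an assumption about how the fractional part is handled.

```python
def default_shard_size(size, num_shards):
    """Default number of buckets requested from each shard, per the rule above."""
    if num_shards == 1:
        return size
    # Assumption: the fractional part of size * 1.5 + 10 is truncated.
    return int(size * 1.5 + 10)

print(default_shard_size(10, 1))  # single shard
print(default_shard_size(10, 5))  # multiple shards
```

Requesting more buckets per shard than the final size reduces the chance that a term which is frequent overall, but not in the top few on every shard, gets miscounted or dropped.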
show_term_doc_count_error controls whether the counting error is shown for each bucket:
POST /bank/_search?size=0
{
"aggs": {
"age_terms": {
"terms": {
"field": "age",
"size": 5,
"shard_size":20,
"show_term_doc_count_error": true
}
}
}
}
order controls whether buckets are sorted by document count or by bucket key:
POST /bank/_search?size=0
{
"aggs": {
"age_terms": {
"terms": {
"field": "age",
"order" : { "_count" : "asc" } # sort by document count
}
}
}
}
POST /bank/_search?size=0
{
"aggs": {
"age_terms": {
"terms": {
"field": "age",
"order" : { "_key" : "asc" } # sort by bucket key
}
}
}
}
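Compute metrics per bucket. For example, group by age, then show the minimum and maximum balance for each age (here the buckets are also ordered by the maximum balance):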
POST /bank/_search?size=0
{
"aggs": {
"age_terms": {
"terms": {
"field": "age",
"order": {
"max_balance": "asc"
}
},
"aggs": {
"max_balance": {
"max": {
"field": "balance"
}
},
"min_balance": {
"min": {
"field": "balance"
}
}
}
}
}
}
# Response
"aggregations": {
"age_terms": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 511,
"buckets": [
{
"key": 27,
"doc_count": 39,
"min_balance": {
"value": 1110
},
"max_balance": {
"value": 46868
}
},
{
"key": 39,
"doc_count": 60,
"min_balance": {
"value": 3589
},
"max_balance": {
"value": 47257
}
},
.....
]
}
}
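The combination of bucketing, per-bucket metrics, and ordering by a metric can be sketched in Python as group, then reduce, then sort. The documents below are made up for illustration and the function name is hypothetical.

```python
from collections import defaultdict

docs = [  # hypothetical sample documents
    {"age": 27, "balance": 1110},
    {"age": 27, "balance": 46868},
    {"age": 39, "balance": 3589},
    {"age": 39, "balance": 47257},
    {"age": 31, "balance": 20000},
]

def terms_with_metrics(documents):
    """Bucket by age, compute min/max balance per bucket, order by max balance ascending."""
    groups = defaultdict(list)
    for doc in documents:
        groups[doc["age"]].append(doc["balance"])
    buckets = [
        {
            "key": age,
            "doc_count": len(balances),
            "min_balance": {"value": min(balances)},
            "max_balance": {"value": max(balances)},
        }
        for age, balances in groups.items()
    ]
    buckets.sort(key=lambda b: b["max_balance"]["value"])
    return buckets

buckets = terms_with_metrics(docs)
for bucket in buckets:
    print(bucket)
```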
Sort buckets by a sub-aggregation metric, for example by the maximum balance:
POST /bank/_search?size=0
{
"aggs": {
"age_terms": {
"terms": {
"field": "age",
"order": {
"max_balance": "asc"
}
},
"aggs": {
"max_balance": {
"max": {
"field": "balance"
}
}
}
}
}
}
You can also compute the max, min, avg, and sum of balance with stats, and sort by any one of those values:
POST /bank/_search?size=0
{
"aggs": {
"age_terms": {
"terms": {
"field": "age",
"order": {
"stats_balance.max": "asc"
}
},
"aggs": {
"stats_balance": {
"stats": {
"field": "balance"
}
}
}
}
}
}
Filter buckets: require a minimum document count, or restrict the result to a given list of keys:
POST /bank/_search?size=0
{
"aggs": {
"age_terms": {
"terms": {
"field": "age",
"min_doc_count": 60 # only return buckets with 60 or more documents
}
}
}
}
POST /bank/_search?size=0
{
"aggs": {
"age_terms": {
"terms": {
"field": "age",
"include": [20,24] # only return the buckets for ages 20 and 24
}
}
}
}
You can also include or exclude exact terms, or match terms with a regular expression:
GET /_search
{
"aggs" : {
"JapaneseCars" : {
"terms" : {
"field" : "make",
"include" : ["mazda", "honda"] # only buckets whose make is one of these terms
}
},
"ActiveCarManufacturers" : {
"terms" : {
"field" : "make",
"exclude" : ["rover", "jensen"] # all buckets except those whose make is one of these terms
}
}
}
}
GET /_search
{
"aggs" : {
"tags" : {
"terms" : {
"field" : "tags",
"include" : ".*sport.*",
"exclude" : "water_.*"
}
}
}
}
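Handling missing values: some documents may lack the tags field or have no value for it. The missing parameter specifies the bucket key under which those documents are counted: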
GET /_search
{
"aggs" : {
"tags" : {
"terms" : {
"field" : "tags",
"missing": "N/A"
}
}
}
}
Filter Aggregation: Aggregate Documents Matching a Filter
Among the documents hit by the query, only those that also match the filter are aggregated.
POST /bank/_search?size=0
{
"aggs": {
"age_terms": {
"filter": {"match":{"gender":"F"}},
"aggs": {
"avg_age": {
"avg": {
"field": "age"
}
}
}
}
}
}
Filters Aggregation: Multiple Named Filter Buckets
Index some sample data:
PUT /logs/_doc/_bulk?refresh
{ "index" : { "_id" : 1 } }
{ "body" : "warning: page could not be rendered" }
{ "index" : { "_id" : 2 } }
{ "body" : "authentication error" }
{ "index" : { "_id" : 3 } }
{ "body" : "warning: connection timed out" }
Then run a query that counts documents in several filter buckets:
GET logs/_search
{
"size": 0,
"aggs" : {
"messages" : {
"filters" : {
"filters" : {
"errors" : { "match" : { "body" : "error" }},
"warnings" : { "match" : { "body" : "warning" }}
}
}
}
}
}
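What this request computes can be sketched in Python using the three log lines indexed above. The tokenizer here is only a rough stand-in for the analysis ES applies to a text field (lowercasing and splitting on non-word characters); the function names are hypothetical.

```python
import re

logs = [
    "warning: page could not be rendered",
    "authentication error",
    "warning: connection timed out",
]

def tokens(text):
    """Rough stand-in for text analysis: lowercase and split on non-word characters."""
    return set(re.findall(r"\w+", text.lower()))

def filters_agg(bodies, named_terms):
    """One bucket per named filter, holding the count of bodies containing the term."""
    return {
        name: {"doc_count": sum(1 for body in bodies if term in tokens(body))}
        for name, term in named_terms.items()
    }

result = filters_agg(logs, {"errors": "error", "warnings": "warning"})
print(result)
```

Each named filter yields its own bucket, and a document may be counted in more than one bucket if it matches several filters.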
Range Aggregation: Group by Ranges
POST /bank/_search?size=0
{
"aggs": {
"age_range": {
"range": {
"field": "age",
"ranges": [
{"to":25},
{"from": 25,"to": 35},
{"from": 35}
]
},
"aggs": {
"bmax": {
"max": {
"field": "balance"
}
}
}
}
}
}
# Response: three buckets, one each for the to-only, from/to, and from-only ranges
"aggregations": {
"age_range": {
"buckets": [
{
"key": "*-25.0",
"to": 25,
"doc_count": 225,
"bmax": {
"value": 49587
}
},
{
"key": "25.0-35.0",
"from"