Elasticsearch Search Engine, Part 12: Aggregation Analysis

Introduction to Aggregation Analysis

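Aggregation analysis is an important database feature: it performs aggregate computations over the data set returned by a query, such as finding the maximum or minimum of a field (or of a computed expression), or computing a sum or an average. As both a search engine and a database, ES also provides powerful aggregation capabilities.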

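  • Metric aggregations (metric): compute indicators such as the maximum, minimum, sum, or average over a data set.
  • Bucket aggregations (bucketing):
    Relational databases offer not only aggregate functions but also GROUP BY, which groups the queried rows so that metric aggregations can then be applied per group. In ES, GROUP BY is called bucketing.
  • ES also provides matrix aggregations (matrix) and pipeline aggregations (pipeline), which were still being refined at the time of writing.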
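The difference between the two main families can be illustrated outside ES. The following is a minimal Python sketch, not ES code: the sample documents and the helper names `metric_max` and `bucket_by` are hypothetical, chosen only to mirror the bank examples used in this article.

```python
from collections import defaultdict

# Hypothetical sample documents; the field names mirror the bank examples.
accounts = [
    {"age": 24, "balance": 1000},
    {"age": 24, "balance": 3000},
    {"age": 31, "balance": 2000},
]

def metric_max(docs, field):
    """Metric aggregation: reduce a whole data set to a single number."""
    return max(doc[field] for doc in docs)

def bucket_by(docs, field):
    """Bucket aggregation: split a data set into groups keyed by a field value."""
    buckets = defaultdict(list)
    for doc in docs:
        buckets[doc[field]].append(doc)
    return dict(buckets)

# A metric over the whole set, then the same metric inside each bucket.
overall_max = metric_max(accounts, "balance")
max_per_age = {
    age: metric_max(group, "balance")
    for age, group in bucket_by(accounts, "age").items()
}
```

Here `overall_max` plays the role of a top-level max aggregation, while `max_per_age` corresponds to a max sub-aggregation nested inside a bucket aggregation on age.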

Aggregations are defined in the search request body under the aggregations node, using the following syntax (aggregations can be abbreviated to aggs):

"aggregations" : {
    "<aggregation_name>" : {
        "<aggregation_type>" : {
            <aggregation_body>
        }
        [,"meta" : { [<meta_data_body>] } ]?
        [,"aggregations" : { [<sub_aggregation>]+ } ]?
    }
    [,"<aggregation_name_2>" : { ... } ]*
}

The values being aggregated can come from a field or from the result of a script.

Metric Aggregations

max min sum avg

Find the maximum balance across all customers (size=0 means no search hits are returned, only the aggregation result):

POST /bank/_search
{
  "size": 0, 
  "aggs": {
    "max_balance": {
      "max": {
        "field": "balance"
      }
    }
  }
}

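Find the maximum balance among customers aged 24 (with size=2 and the sort, the response also includes the two hits with the highest balance):

POST /bank/_search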
{
  "size": 2, 
  "query": {
    "match": {
      "age": 24
    }
  },
  "sort": [
    {
      "balance": {
        "order": "desc"
      }
    }
  ],
  "aggs": {
    "max_balance": {
      "max": {
        "field": "balance"
      }
    }
  }
}

Find the average age of all customers (the values come from a script):

POST /bank/_search?size=0
{
    "aggs" : {
        "avg_age" : {
            "avg" : {
                "script" : {
                    "source" : "doc.age.value"
                }
            }
        },
        "avg_age10" : {
            "avg" : {
                "script" : {
                    "source" : "doc.age.value + 10"
                }
            }
        }
    }
}

Specify the field with field, then use _value in the script to read the field's value:

POST /bank/_search?size=0
{
  "aggs": {
    "sum_balance": {
      "sum": {
        "field": "balance",
        "script": {
            "source": "_value * 1.03"
        }
      }
    }
  }
}

Specify a value to use for documents that lack the field; if missing is not specified, those documents are ignored:

POST /bank/_search?size=0
{
  "aggs": {
    "avg_age": {
      "avg": {
        "field": "age",
        "missing": 18
      }
    }
  }
}
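The effect of missing on an average can be sketched in plain Python. This is an illustration only, with a hypothetical helper `avg_with_missing` and made-up documents; it is not how ES implements the option.

```python
def avg_with_missing(docs, field, missing=None):
    """Average a field; documents lacking it are skipped unless `missing` is given."""
    values = []
    for doc in docs:
        if doc.get(field) is not None:
            values.append(doc[field])
        elif missing is not None:
            # Treat the document as if it carried the default value.
            values.append(missing)
    return sum(values) / len(values) if values else None

# Hypothetical documents: the third one has no age field.
docs = [{"age": 20}, {"age": 40}, {}]

avg_ignoring = avg_with_missing(docs, "age")              # third document skipped
avg_defaulted = avg_with_missing(docs, "age", missing=18) # third document counts as 18
```

With these three documents, ignoring the gap averages two values, while supplying 18 averages three, so the two results differ.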

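Document Counting

Count matching documents with the _count API (the typed path below follows the Elasticsearch version this article was written for; on versions without mapping types the path would be /bank/_count):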

POST /bank/_doc/_count
{
  "query": {
    "match": {
      "age" : 24
    }
  }
}

cardinality counts distinct values (the result is approximate):

POST /bank/_search?size=0
{
  "aggs": {
    "age_count": {
      "cardinality": {
        "field": "age"
      }
    },
    "state_count": {
      "cardinality": {
        "field": "state.keyword"
      }
    }
  }
}

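value_count counts the values of a field; for a single-valued field this is the number of documents that have a value: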

POST /bank/_search?size=0
{
    "aggs" : {
        "age_count" : { "value_count" : { "field" : "age" } }
    }
}

stats computes five values at once: count, max, min, avg, and sum:

POST /bank/_search?size=0
{
  "aggs": {
    "age_stats": {
      "stats": {
        "field": "age"
      }
    }
  }
}

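extended_stats returns advanced statistics. On top of stats it adds four results: the sum of squares, the variance, the standard deviation, and the interval of the mean plus/minus two standard deviations: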

POST /bank/_search?size=0
{
  "aggs": {
    "age_stats": {
      "extended_stats": {
        "field": "age"
      }
    }
  }
}
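How the four extra results relate to the basic ones can be shown with a short Python sketch. It assumes the population variance (dividing by the count) and a default of two standard deviations for the bounds; the function name and sample values are hypothetical, and the key names simply mirror the extended_stats response.

```python
import math

def extended_stats(values, sigma=2.0):
    """Sketch of the extra statistics, assuming population variance."""
    count = len(values)
    total = sum(values)
    avg = total / count
    sum_of_squares = sum(v * v for v in values)
    # Population variance: mean of the squares minus the square of the mean.
    variance = max(0.0, sum_of_squares / count - avg * avg)
    std_deviation = math.sqrt(variance)
    return {
        "count": count,
        "sum": total,
        "avg": avg,
        "sum_of_squares": sum_of_squares,
        "variance": variance,
        "std_deviation": std_deviation,
        "std_deviation_bounds": {
            "upper": avg + sigma * std_deviation,
            "lower": avg - sigma * std_deviation,
        },
    }

# Hypothetical values chosen so the results come out as round numbers.
result = extended_stats([2, 4, 4, 4, 5, 5, 7, 9])
```

For this input the mean is 5 and the standard deviation is 2, so the bounds span from 1 to 9.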

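Percentiles: the value found at a given percentage

For the values of the given field (or script), documents are accumulated in ascending order of value while tracking what share of all matched documents has been covered so far; the aggregation returns the value at each requested percentage. By default it returns the values at the [ 1, 5, 25, 50, 75, 95, 99 ] percentiles. In the result below, the middle entry can be read as: 50% of the documents have age <= 31, or conversely, documents with age <= 31 make up 50% of all matched documents.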

POST /bank/_search?size=0
{
  "aggs": {
    "age_percents": {
      "percentiles": {
        "field": "age"
      }
    }
  }
}

# Response
 "aggregations": {
    "age_percents": {
      "values": {
        "1.0": 20,
        "5.0": 21,
        "25.0": 25,
        "50.0": 31,
        "75.0": 35,
        "95.0": 39,
        "99.0": 40
      }
    }
  }

You can also specify which percentiles to return:

POST /bank/_search?size=0
{
  "aggs": {
    "age_percents": {
      "percentiles": {
        "field": "age",
        "percents" : [95, 99, 99.9] 
      }
    }
  }
}

# Result
"aggregations": {
    "age_percents": {
      "values": {
        "95.0": 39,
        "99.0": 40,
        "99.9": 40
      }
    }
}

Percentile ranks: the percentage of documents whose value is less than or equal to a given value

POST /bank/_search?size=0
{
  "aggs": {
    "age_perc_rank": {
      "percentile_ranks": {
        "field": "age",
        "values": [
          25,
          30
        ]
      }
    }
  }
}

# Result
"aggregations": {
    "age_perc_rank": {
      "values": {
        "25.0": 26.1,
        "30.0": 49.3
      }
    }
  }
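Percentiles and percentile ranks are inverse questions, which a small Python sketch can make concrete. It uses the exact nearest-rank definition on an in-memory list; ES computes these values approximately, so its numbers may differ slightly. The function names and the sample ages are hypothetical.

```python
import math

def percentile(values, percent):
    """Nearest-rank percentile: smallest value with at least percent% of values <= it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(percent / 100 * len(ordered)))
    return ordered[rank - 1]

def percentile_rank(values, threshold):
    """Percentage of values that are less than or equal to threshold."""
    return 100 * sum(1 for v in values if v <= threshold) / len(values)

# Hypothetical ages, not taken from the bank data set.
ages = [20, 22, 25, 25, 30, 31, 35, 40]

median_age = percentile(ages, 50)        # value -> answers "what is at 50%?"
rank_of_25 = percentile_rank(ages, 25)   # percentage -> answers "how many are <= 25?"
```

In this sample the two answers line up: the value at the 50th percentile is 25, and the share of ages at or below 25 is 50%.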

Geo bounds: the bounding box of the coordinate points in a set of documents

See the official documentation:
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-metrics-geobounds-aggregation.html

Geo centroid: the center point of the coordinates

See the official documentation:
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-metrics-geocentroid-aggregation.html

Bucket Aggregations

Terms Aggregation: group by the term values of a field

POST /bank/_search?size=0
{
  "aggs": {
    "age_terms": {
      "terms": {
        "field": "age"  # group by the distinct values of age
      }
    }
  }
}

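# Response
"aggregations": {
	"age_terms": {
	  "doc_count_error_upper_bound": 0,  # upper bound of the error on the document counts
	  "sum_other_doc_count": 463,  # number of documents in the buckets that were not returned
	  "buckets": [
		{
		  "key": 31,  # the age value
		  "doc_count": 61  # number of documents with this value
		},
		{
		  "key": 39,
		  "doc_count": 60
		},
		{
		  "key": 26,
		  "doc_count": 59
		},
		...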
	   ]
	}
}
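The meaning of buckets and sum_other_doc_count can be sketched with a counter in plain Python. This is a single-process illustration with hypothetical data and a hypothetical `terms_agg` helper; the tie-break by ascending key is an assumption, and real results across several shards may be approximate.

```python
from collections import Counter

def terms_agg(values, size=10):
    """Group by value, keep the `size` most frequent buckets, sum up the rest."""
    counts = Counter(values)
    # Highest doc_count first; ties broken by ascending key (an assumption).
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return {
        "sum_other_doc_count": sum(count for _, count in ordered[size:]),
        "buckets": [
            {"key": key, "doc_count": count} for key, count in ordered[:size]
        ],
    }

# Hypothetical ages: three documents aged 31, two aged 39, one aged 26.
ages = [31, 31, 31, 39, 39, 26]
top_two = terms_agg(ages, size=2)
```

With size=2 only the buckets for 31 and 39 are returned, and the single document aged 26 is accounted for in sum_other_doc_count.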

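By default, the top 10 buckets are returned, ordered by document count from high to low.


size sets how many buckets are returned.

shard_size sets how many buckets each shard returns; its default is:

  • when the index has a single shard, shard_size=size
  • when the index has multiple shards, shard_size=size*1.5+10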
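The default rule above can be written as a small function. This is a sketch of the stated formula only; the function name is hypothetical, and truncating the result to an integer is an assumption about how fractional values are handled.

```python
def default_shard_size(size, number_of_shards):
    """Default shard_size as described above: size for one shard, else size*1.5+10."""
    if number_of_shards == 1:
        return size
    # Assumption: a fractional result is truncated to an integer.
    return int(size * 1.5 + 10)

single_shard = default_shard_size(10, 1)
multi_shard = default_shard_size(10, 5)
```

Asking each shard for more buckets than the final size reduces the chance that a term that is frequent overall, but not in the top few on one shard, is missed when the per-shard results are merged.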

show_term_doc_count_error controls whether the error bound is shown for each bucket:

POST /bank/_search?size=0
{
  "aggs": {
    "age_terms": {
      "terms": {
        "field": "age",
        "size": 5,
        "shard_size":20,
        "show_term_doc_count_error": true
      }
    }
  }
}

order sorts the buckets either by document count or by bucket key:

POST /bank/_search?size=0
{
  "aggs": {
    "age_terms": {
      "terms": {
        "field": "age",
        "order" : { "_count" : "asc" }  # order by document count
      }
    }
  }
}

POST /bank/_search?size=0
{
  "aggs": {
    "age_terms": {
      "terms": {
        "field": "age",
        "order" : { "_key" : "asc" }  # order by bucket key
      }
    }
  }
}

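Compute metrics per bucket: for example, group by age and show the minimum and maximum balance for each age (the buckets here are also ordered by max_balance):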

POST /bank/_search?size=0
{
  "aggs": {
    "age_terms": {
      "terms": {
        "field": "age",
        "order": {
          "max_balance": "asc"
        }
      },
      "aggs": {
        "max_balance": {
          "max": {
            "field": "balance"
          }
        },
        "min_balance": {
          "min": {
            "field": "balance"
          }
        }      
      }    
    }  
  }
}

# Response
"aggregations": {
    "age_terms": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 511,
      "buckets": [
        {
          "key": 27,
          "doc_count": 39,
          "min_balance": {
            "value": 1110
          },
          "max_balance": {
            "value": 46868
          }
        },
        {
          "key": 39,
          "doc_count": 60,
          "min_balance": {
            "value": 3589
          },
          "max_balance": {
            "value": 47257
          }
        },
        .....
      ]
    }
  }

Order buckets by a metric value, for example by the maximum balance:

POST /bank/_search?size=0
{
  "aggs": {
    "age_terms": {
      "terms": {
        "field": "age",
        "order": {
          "max_balance": "asc"
        }
      },
      "aggs": {
        "max_balance": {
          "max": {
            "field": "balance"
          }
        }
      }
    }  
  }
}

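You can also compute the max, min, avg, and sum of balance in a single stats sub-aggregation and order the buckets by any one of those values: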

POST /bank/_search?size=0
{
  "aggs": {
    "age_terms": {
      "terms": {
        "field": "age",
        "order": {
          "stats_balance.max": "asc"
        }
      },
      "aggs": {
        "stats_balance": {
          "stats": {
            "field": "balance"
          }
        }
      }
    }  
  }
}
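Ordering buckets by a value inside a multi-value sub-aggregation, as in "stats_balance.max", can be sketched in Python. The helper name and the third group of sample documents are hypothetical; the first two groups reuse the min and max balances shown in the response earlier in this section.

```python
from collections import defaultdict

def terms_ordered_by_stat(docs, group_field, metric_field, stat="max", ascending=True):
    """Group docs, compute stats per bucket, then sort buckets by one stat."""
    groups = defaultdict(list)
    for doc in docs:
        groups[doc[group_field]].append(doc[metric_field])
    buckets = []
    for key, values in groups.items():
        stats = {
            "count": len(values),
            "min": min(values),
            "max": max(values),
            "avg": sum(values) / len(values),
            "sum": sum(values),
        }
        buckets.append({"key": key, "doc_count": len(values), "stats_balance": stats})
    # Equivalent in spirit to "order": { "stats_balance.<stat>": "asc" | "desc" }.
    buckets.sort(key=lambda bucket: bucket["stats_balance"][stat], reverse=not ascending)
    return buckets

docs = [
    {"age": 39, "balance": 3589}, {"age": 39, "balance": 47257},
    {"age": 27, "balance": 1110}, {"age": 27, "balance": 46868},
    {"age": 20, "balance": 100}, {"age": 20, "balance": 50000},  # hypothetical group
]
ordered_keys = [b["key"] for b in terms_ordered_by_stat(docs, "age", "balance", "max")]
```

Switching the stat argument to "min" or "avg" reorders the same buckets without recomputing the groups differently, which is the convenience the stats sub-aggregation offers.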

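Filter buckets: min_doc_count sets the minimum document count a bucket needs in order to be returned, and include restricts the buckets to a given list of keys:

POST /bank/_search?size=0
{
  "aggs": {
    "age_terms": {
      "terms": {
        "field": "age",
        "min_doc_count": 60  # only return buckets with 60 or more documents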
      }
    }
  }
}

POST /bank/_search?size=0
{
  "aggs": {
    "age_terms": {
      "terms": {
        "field": "age",
        "include": [20,24]  # only return the buckets for ages 20 and 24
      }
    }
  }
}

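For string fields, include and exclude can list the exact values to keep or drop, or take a regular expression to match values against:

GET /_search
{
    "aggs" : {
        "JapaneseCars" : {
             "terms" : {
                 "field" : "make",
                 "include" : ["mazda", "honda"]  # keep only buckets whose make is one of these values
             }
         },
        "ActiveCarManufacturers" : {
             "terms" : {
                 "field" : "make",
                 "exclude" : ["rover", "jensen"]  # drop buckets whose make is one of these values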
             }
         }
    }
}

GET /_search
{
    "aggs" : {
        "tags" : {
            "terms" : {
                "field" : "tags",
                "include" : ".*sport.*",
                "exclude" : "water_.*"
            }
        }
    }
}

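Handling missing values: some documents may lack the tags field or have no value for it; missing specifies the value under which such documents are bucketed: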

GET /_search
{
    "aggs" : {
        "tags" : {
             "terms" : {
                 "field" : "tags",
                 "missing": "N/A" 
             }
         }
    }
}

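Filter Aggregation: aggregate over the documents that match a filter

Among the documents matched by the query, select those that satisfy the filter and aggregate over them: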

POST /bank/_search?size=0
{
  "aggs": {
    "age_terms": {
      "filter": {"match":{"gender":"F"}},
      "aggs": {
        "avg_age": {
          "avg": {
            "field": "age"
          }
        }
      }
    }
  }
}

Filters Aggregation: aggregate over several named filter groups

Index some sample data:

PUT /logs/_doc/_bulk?refresh
{ "index" : { "_id" : 1 } }
{ "body" : "warning: page could not be rendered" }
{ "index" : { "_id" : 2 } }
{ "body" : "authentication error" }
{ "index" : { "_id" : 3 } }
{ "body" : "warning: connection timed out" }

Then run a query that counts the documents in each filter group:

GET logs/_search
{
  "size": 0,
  "aggs" : {
    "messages" : {
      "filters" : {
        "filters" : {
          "errors" :   { "match" : { "body" : "error"   }},
          "warnings" : { "match" : { "body" : "warning" }}
        }
      }   
    }  
  }
}
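What this request counts can be sketched in Python using the three log lines indexed above. The tokenizer here is a simplification (lowercase, split on non-alphanumeric characters) standing in for the analyzer that a match query would use, and the helper names are hypothetical.

```python
import re

def tokens(text):
    """Simplified stand-in for text analysis: lowercase alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def filters_agg(docs, named_terms, field="body"):
    """One bucket per named filter, holding the number of matching documents."""
    return {
        name: {"doc_count": sum(1 for doc in docs if term in tokens(doc[field]))}
        for name, term in named_terms.items()
    }

# The three documents from the bulk request above.
logs = [
    {"body": "warning: page could not be rendered"},
    {"body": "authentication error"},
    {"body": "warning: connection timed out"},
]

messages = filters_agg(logs, {"errors": "error", "warnings": "warning"})
```

Each named filter produces its own bucket, and a document may fall into several buckets or none, since the filters are evaluated independently.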

Range Aggregation: group by ranges

POST /bank/_search?size=0
{
  "aggs": {
    "age_range": {
      "range": {
        "field": "age",
        "ranges": [
          {"to":25},
          {"from": 25,"to": 35},
          {"from": 35}
        ]
      },
      "aggs": {
        "bmax": {
          "max": {
            "field": "balance"
          }
        }
      }    
    }  
  }
}

# Response: three buckets (to only, from-to, from only); each range includes its from value and excludes its to value
"aggregations": {
    "age_range": {
      "buckets": [
        {
          "key": "*-25.0",
          "to": 25,
          "doc_count": 225,
          "bmax": {
            "value": 49587
          }
        },
        {
          "key": "25.0-35.0",
          "from"