Elastic Stack Notes (7): Aggregation Analysis in Elasticsearch 5.6

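Blog: http://www.moonxy.com

I. Preface

Elasticsearch is a distributed full-text search engine, and indexing and searching are its basic functions. Its aggregations feature is also very powerful, allowing complex statistical analysis over the data. ES provides four main kinds of aggregation: metric, bucket, pipeline, and matrix aggregations. The first two, metric and bucket aggregations, are the ones that most need to be mastered.

Official documentation for aggregations: Aggregations

II. Aggregation Analysis

2.1 Metric Aggregations

Official documentation for metric aggregations: Metric

Metric aggregations include the following: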

  • Avg Aggregation
  • Cardinality Aggregation
  • Extended Stats Aggregation
  • Geo Bounds Aggregation
  • Geo Centroid Aggregation
  • Max Aggregation
  • Min Aggregation
  • Percentiles Aggregation
  • Percentile Ranks Aggregation
  • Scripted Metric Aggregation
  • Stats Aggregation
  • Sum Aggregation
  • Top Hits Aggregation
  • Value Count Aggregation

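The most commonly used metric aggregations are min, max, sum, avg, stats, extended_stats, and value_count.

The official documentation defines metric aggregations as "aggregations that keep track and compute metrics over a set of documents." The max aggregation serves as the first example below.

Max Aggregation

Official documentation for the max aggregation: Max Aggregation

The max aggregation returns the maximum value of a field, much like the SQL aggregate function max(). In the request below, "max_price" is a user-defined aggregation name.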

##Max Aggregation
GET books/_search
{
  "size": 0,
  "aggs": {
    "max_price": {
      "max": {
        "field": "price"
      }
    }
  }
}

The response is as follows:

{
  "took": 6,
  "timed_out": false,
  "_shards": {
    "total": 3,
    "successful": 3,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 5,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "max_price": {
      "value": 81.4
    }
  }
}
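
To make the semantics concrete, here is a minimal local sketch of what the max aggregation computes. The price list is hypothetical: the article does not list the documents in the books index, so the values were chosen only to be consistent with the responses shown here. The function name max_aggregation is likewise an illustration, not part of any Elasticsearch client.

```python
# Hypothetical sample prices, chosen to be consistent with the responses in this article.
prices = [46.5, 70.2, 81.4, 54.5, 66.4]


def max_aggregation(values):
    """Return the largest value, or None when no document carries the field."""
    return max(values) if values else None


max_price = max_aggregation(prices)
print({"max_price": {"value": max_price}})
```

As in the real response, the result is a single value keyed by the user-defined aggregation name.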

Cardinality Aggregation

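Official documentation for the cardinality aggregation: Cardinality Aggregation

The Cardinality Aggregation counts distinct values. It first performs an operation similar to distinct in SQL, removing duplicates from the set, and then returns the size of the deduplicated set.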

##Cardinality Aggregation
GET books/_search
{
  "size": 0, 
  "aggs": {
    "all_language": {
      "cardinality":  {
        "field": "language"
      }
    }
  }
}

The response is as follows:

{
  "took": 41,
  "timed_out": false,
  "_shards": {
    "total": 3,
    "successful": 3,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 5,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "all_language": {
      "value": 3
    }
  }
}
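
The distinct-then-count idea can be sketched locally as follows. Note one difference: Elasticsearch computes cardinality approximately (using the HyperLogLog++ algorithm), whereas this sketch is an exact count over a small in-memory list. The language values are hypothetical sample data that agree with the response above.

```python
# Hypothetical language values for the five books (consistent with the response above).
languages = ["java", "java", "python", "python", "javascript"]


def cardinality(values):
    """Exact distinct count: deduplicate with a set, then measure its size."""
    return len(set(values))


all_language = cardinality(languages)
print(all_language)  # 3
```

For small sets like this one, the approximate and exact counts are expected to coincide; the difference only matters on high-cardinality fields.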

Stats Aggregation

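Official documentation for the stats aggregation: Stats Aggregation

The Stats Aggregation computes basic statistics, returning five metrics in a single request: count, max, min, avg, and sum. For example: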

##Stats Aggregation
GET books/_search
{
  "size": 0, 
  "aggs": {
    "stats_pirce": {
      "stats":  {
        "field": "price"
      }
    }
  }
}

The response is as follows:

{
  "took": 5,
  "timed_out": false,
  "_shards": {
    "total": 3,
    "successful": 3,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 5,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "stats_pirce": {
      "count": 5,
      "min": 46.5,
      "max": 81.4,
      "avg": 63.8,
      "sum": 319
    }
  }
}
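
The five metrics can be reproduced locally with a short sketch. The prices are hypothetical values chosen to match the response above, and the helper name stats is only illustrative.

```python
import math

# Hypothetical sample prices, chosen to be consistent with the response above.
prices = [46.5, 70.2, 81.4, 54.5, 66.4]


def stats(values):
    """Compute the five stats metrics for a non-empty list of numbers."""
    count = len(values)
    total = math.fsum(values)  # fsum limits floating-point rounding error
    return {
        "count": count,
        "min": min(values),
        "max": max(values),
        "avg": total / count,
        "sum": total,
    }


stats_price = stats(prices)
print(stats_price)
```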

Extended Stats Aggregation

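Official documentation for the extended stats aggregation: Extended Stats Aggregation

The Extended Stats Aggregation works like the stats aggregation but returns four additional results: the sum of squares, the variance, the standard deviation, and the interval of the mean plus/minus two standard deviations.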

##Extended Stats Aggregation
GET books/_search
{
  "size": 0, 
  "aggs": {
    "extend_stats_pirce": {
      "extended_stats":  {
        "field": "price"
      }
    }
  }
}

The response is as follows:

{
  "took": 14,
  "timed_out": false,
  "_shards": {
    "total": 3,
    "successful": 3,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 5,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "extend_stats_pirce": {
      "count": 5,
      "min": 46.5,
      "max": 81.4,
      "avg": 63.8,
      "sum": 319,
      "sum_of_squares": 21095.46,
      "variance": 148.65199999999967,
      "std_deviation": 12.19229264740638,
      "std_deviation_bounds": {
        "upper": 88.18458529481276,
        "lower": 39.41541470518724
      }
    }
  }
}
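
The additional fields follow from count, sum, and sum of squares. The sketch below assumes the population variance formula (mean of squares minus square of the mean), which reproduces the numbers in the response above; the prices are hypothetical values consistent with that response. The sigma parameter mirrors the option of the same name on extended_stats, which defaults to 2.

```python
import math

# Hypothetical sample prices, chosen to be consistent with the response above.
prices = [46.5, 70.2, 81.4, 54.5, 66.4]


def extended_stats(values, sigma=2.0):
    """Derive variance, standard deviation and bounds for a non-empty list."""
    count = len(values)
    total = math.fsum(values)
    sum_of_squares = math.fsum(v * v for v in values)
    avg = total / count
    # Population variance: mean of the squares minus the square of the mean.
    variance = sum_of_squares / count - avg * avg
    std_deviation = math.sqrt(variance)
    return {
        "sum_of_squares": sum_of_squares,
        "variance": variance,
        "std_deviation": std_deviation,
        "std_deviation_bounds": {
            "upper": avg + sigma * std_deviation,
            "lower": avg - sigma * std_deviation,
        },
    }


result = extended_stats(prices)
print(result)
```

Checking by hand: 21095.46 / 5 - 63.8 * 63.8 = 4219.092 - 4070.44 = 148.652, whose square root is about 12.19, and 63.8 plus or minus twice that gives roughly 88.18 and 39.42.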

Value Count Aggregation

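Official documentation for the value count aggregation: Value Count Aggregation

The Value Count Aggregation counts the values of a given field across the matching documents. When every document holds exactly one value for the field, the result equals the number of documents that have the field.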

##Value Count Aggregation
GET books/_search
{
  "size": 0, 
  "aggs": {
    "doc_count": {
      "value_count":  {
        "field": "author"
      }
    }
  }
}

The response is as follows:

{
  "took": 6,
  "timed_out": false,
  "_shards": {
    "total": 3,
    "successful": 3,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 5,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "doc_count": {
      "value": 5
    }
  }
}
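
A small sketch may help separate "number of values" from "number of documents". The documents and author names below are entirely hypothetical; they only illustrate that a multi-valued field contributes one count per value, that documents without the field contribute nothing, and that values are not deduplicated.

```python
# Hypothetical documents; author names are placeholders, not data from the books index.
docs = [
    {"author": "author_a"},
    {"author": "author_b"},
    {"author": ["author_a", "author_c"]},  # multi-valued field: counts twice
    {"title": "a document without an author field"},
]


def value_count(documents, field):
    """Count every value of `field`, without deduplication."""
    total = 0
    for doc in documents:
        value = doc.get(field)
        if value is None:
            continue  # documents missing the field contribute nothing
        total += len(value) if isinstance(value, list) else 1
    return total


doc_count = value_count(docs, "author")
print(doc_count)  # 4
```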

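Note:

By default, fields of type text cannot be used for sorting or aggregations, because fielddata is disabled on them. The following request aggregates on the title field, which is mapped as text: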

GET books/_search
{
  "size": 0, 
  "aggs": {
    "doc_count": {
      "value_count":  {
        "field": "title"
      }
    }
  }
}

The response is as follows:

{
  "error": {
    "root_cause": [
      {
        "type": "illegal_argument_exception",
        "reason": "Fielddata is disabled on text fields by default. Set fielddata=true on [title] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory. Alternatively use a keyword field instead."
      }
    ],
    "type": "search_phase_execution_exception",
    "reason": "all shards failed",
    "phase": "query",
    "grouped": true,
    "failed_shards": [
      {
        "shard": 0,
        "index": "books",
        "node": "6n3douACShiPmlA9j2soBw",
        "reason": {
          "type": "illegal_argument_exception",
          "reason": "Fielddata is disabled on text fields by default. Set fielddata=true on [title] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory. Alternatively use a keyword field instead."
        }
      }
    ]
  },
  "status": 400
}

2.2 Bucket Aggregations

Official documentation for bucket aggregations: Bucket Aggregations

Bucket aggregations include the following:

  • Adjacency Matrix Aggregation
  • Children Aggregation
  • Composite Aggregation
  • Date Histogram Aggregation
  • Date Range Aggregation
  • Diversified Sampler Aggregation
  • Filter Aggregation
  • Filters Aggregation
  • Geo Distance Aggregation
  • GeoHash grid Aggregation
  • Global Aggregation
  • Histogram Aggregation
  • IP Range Aggregation
  • Missing Aggregation
  • Nested Aggregation
  • Range Aggregation
  • Reverse nested Aggregation
  • Sampler Aggregation
  • Significant Terms Aggregation
  • Significant Text Aggregation
  • Terms Aggregation

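A bucket can be thought of as a container: the aggregation goes through the documents and places every document that meets a given criterion into the corresponding bucket. Bucketing is comparable to group by in SQL.

The terms Aggregation groups documents by field value. The following request counts the number of books for each programming language: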

GET books/_search
{
  "size": 0, 
  "aggs": {
    "terms_count": {
      "terms":  {
        "field": "language"
      }
    }
  }
}

The response is as follows:

{
  "took": 31,
  "timed_out": false,
  "_shards": {
    "total": 3,
    "successful": 3,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 5,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "terms_count": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "java",
          "doc_count": 2
        },
        {
          "key": "python",
          "doc_count": 2
        },
        {
          "key": "javascript",
          "doc_count": 1
        }
      ]
    }
  }
}
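
The group-and-count behaviour can be sketched with collections.Counter. The language values are hypothetical sample data consistent with the response above. Sorting by descending doc_count matches the default order of the terms aggregation; breaking ties alphabetically by key is an assumption of this sketch, chosen because it reproduces the bucket order shown. The size parameter mirrors the terms option of the same name (default 10).

```python
from collections import Counter

# Hypothetical language values for the five books (consistent with the response above).
languages = ["java", "python", "javascript", "python", "java"]


def terms_aggregation(values, size=10):
    """Group equal values and return the top `size` buckets by document count."""
    counts = Counter(values)
    # Highest count first; ties broken alphabetically (an assumption of this sketch).
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [{"key": key, "doc_count": n} for key, n in ordered[:size]]


buckets = terms_aggregation(languages)
print(buckets)
```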

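On top of terms bucketing, a metric aggregation can be applied to each bucket. For example, to compute the average price of the books in each language, first run a Terms Aggregation on the language field and then nest an Avg Aggregation inside it. The query is as follows: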

GET books/_search
{
  "size": 0, 
  "aggs": {
    "terms_count": {
      "terms":  {
        "field": "language"
      },
      "aggs": {
        "avg_price": {
          "avg": {
            "field": "price"
          }
        }
      }
    }
  }
}

The response is as follows:

{
  "took": 8,
  "timed_out": false,
  "_shards": {
    "total": 3,
    "successful": 3,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 5,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "terms_count": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "java",
          "doc_count": 2,
          "avg_price": {
            "value": 58.35
          }
        },
        {
          "key": "python",
          "doc_count": 2,
          "avg_price": {
            "value": 67.95
          }
        },
        {
          "key": "javascript",
          "doc_count": 1,
          "avg_price": {
            "value": 66.4
          }
        }
      ]
    }
  }
}
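
Nesting a metric inside a bucket aggregation amounts to "group first, then compute per group". The sketch below shows this with hypothetical documents whose prices were chosen to agree with the averages in the response above; the helper name and the hard-coded "avg_price" key are illustrative only.

```python
from collections import defaultdict

# Hypothetical documents, chosen to be consistent with the response above.
books = [
    {"language": "java", "price": 46.5},
    {"language": "java", "price": 70.2},
    {"language": "python", "price": 81.4},
    {"language": "python", "price": 54.5},
    {"language": "javascript", "price": 66.4},
]


def terms_with_avg(docs, group_field, metric_field):
    """Bucket documents by `group_field`, then average `metric_field` per bucket."""
    groups = defaultdict(list)
    for doc in docs:
        groups[doc[group_field]].append(doc[metric_field])
    buckets = [
        {
            "key": key,
            "doc_count": len(values),
            "avg_price": {"value": sum(values) / len(values)},
        }
        for key, values in groups.items()
    ]
    # Highest count first; ties broken alphabetically (an assumption of this sketch).
    buckets.sort(key=lambda b: (-b["doc_count"], b["key"]))
    return buckets


buckets = terms_with_avg(books, "language", "price")
for bucket in buckets:
    print(bucket["key"], bucket["doc_count"], round(bucket["avg_price"]["value"], 2))
```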

Range Aggregation

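The Range Aggregation groups documents into ranges and is useful for showing how data is distributed. For example, the following request buckets the books in the books index into three price ranges: below 50, from 50 to 80, and 80 and above: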

GET books/_search
{
  "size": 0, 
  "aggs": {
    "price_range": {
      "range": {
        "field": "price",
        "ranges": [
          {"to": 50},
          {"from": 50, "to": 80},
          {"from": 80}
        ]
      }
    }
  }
}

The response is as follows:

{
  "took": 16,
  "timed_out": false,
  "_shards": {
    "total": 3,
    "successful": 3,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 5,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "price_range": {
      "buckets": [
        {
          "key": "*-50.0",
          "to": 50,
          "doc_count": 1
        },
        {
          "key": "50.0-80.0",
          "from": 50,
          "to": 80,
          "doc_count": 3
        },
        {
          "key": "80.0-*",
          "from": 80,
          "doc_count": 1
        }
      ]
    }
  }
}
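
One detail worth spelling out is the boundary handling: in the range aggregation each range includes its from value and excludes its to value, so a price of exactly 50 falls into the 50 to 80 bucket. A minimal local sketch, using hypothetical prices consistent with the response above:

```python
# Hypothetical sample prices, chosen to be consistent with the response above.
prices = [46.5, 70.2, 81.4, 54.5, 66.4]

ranges = [{"to": 50}, {"from": 50, "to": 80}, {"from": 80}]


def range_aggregation(values, range_specs):
    """Count values per range; `from` is inclusive and `to` is exclusive."""
    buckets = []
    for spec in range_specs:
        low, high = spec.get("from"), spec.get("to")
        count = sum(
            1
            for v in values
            if (low is None or v >= low) and (high is None or v < high)
        )
        low_label = "*" if low is None else str(float(low))
        high_label = "*" if high is None else str(float(high))
        buckets.append({"key": f"{low_label}-{high_label}", "doc_count": count})
    return buckets


price_range = range_aggregation(prices, ranges)
print(price_range)
```

Because the ranges in this request share their boundaries, every price lands in exactly one bucket.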

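The Range Aggregation works not only on numeric fields but also on date fields. The Date Range Aggregation is dedicated to range aggregation on dates; it differs from the Range Aggregation in that the from and to values can be written as date math expressions.

2.3 Pipeline Aggregations

Official documentation for pipeline aggregations: Pipeline Aggregations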

  • Avg Bucket Aggregation
  • Derivative Aggregation
  • Max Bucket Aggregation
  • Min Bucket Aggregation
  • Sum Bucket Aggregation
  • Stats Bucket Aggregation
  • Extended Stats Bucket Aggregation
  • Percentiles Bucket Aggregation
  • Moving Average Aggregation
  • Cumulative Sum Aggregation
  • Bucket Script Aggregation
  • Bucket Selector Aggregation
  • Bucket Sort Aggregation
  • Serial Differencing Aggregation

Pipeline Aggregations operate on the output of other aggregations rather than on documents.
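
As an illustration, the sketch below mimics what an Avg Bucket Aggregation would do with the per-language avg_price buckets from section 2.2: it never sees the documents, only the bucket values. In a real request the input would be referenced through a buckets_path such as "terms_count>avg_price"; that path is an assumption built from the aggregation names used earlier. Skipping buckets without a value corresponds to the default gap policy (skip).

```python
# Bucket output copied from the terms + avg response in section 2.2.
buckets = [
    {"key": "java", "doc_count": 2, "avg_price": {"value": 58.35}},
    {"key": "python", "doc_count": 2, "avg_price": {"value": 67.95}},
    {"key": "javascript", "doc_count": 1, "avg_price": {"value": 66.4}},
]


def avg_bucket(bucket_list, metric_name):
    """Average one metric across sibling buckets, skipping buckets with no value."""
    values = [
        b[metric_name]["value"]
        for b in bucket_list
        if b[metric_name]["value"] is not None
    ]
    return sum(values) / len(values) if values else None


avg_of_avgs = avg_bucket(buckets, "avg_price")
print(round(avg_of_avgs, 2))  # 64.23
```

Note that the result (about 64.23) differs from the overall average price of 63.8, because each bucket counts once regardless of how many documents it holds.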

2.4 Matrix Aggregations

Official documentation for matrix aggregations: Matrix Aggregations

  • Matrix Stats

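The Matrix Stats aggregation is a numeric aggregation that computes the following statistics over a set of document fields:

Count: the number of samples per field included in the calculation;

Mean: the average value of each field;

Variance: per field, a measure of how far the samples spread from the mean;

Skewness: per field, a measure of how asymmetrically the samples are distributed around the mean;

Kurtosis: per field, a measure of the shape of the sample distribution;

Covariance: a matrix that quantifies how changes in one field are associated with changes in another;

Correlation: the covariance matrix scaled to the range [-1, 1], describing the relationship between the distributions of two fields.

It is mainly used to compute the relationship between two numeric fields, for example between log record size and HTTP status code.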

GET /_search
{
    "aggs": {
        "statistics": {
            "matrix_stats": {
                "fields": ["log_size", "status_code"]
            }
        }
    }
}
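
To show how the covariance and correlation entries relate, here is a minimal sketch in plain Python. The log_size and status_code values are invented for illustration. The sketch uses the sample form of covariance (dividing by n - 1); which normalization Elasticsearch applies is not stated in this article, so treat that as an assumption. The correlation is unaffected by this choice, since the factor cancels out.

```python
import math


def mean(xs):
    return math.fsum(xs) / len(xs)


def covariance(xs, ys):
    """Sample covariance of two equally long sequences (divides by n - 1)."""
    mx, my = mean(xs), mean(ys)
    return math.fsum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)


def correlation(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))


# Hypothetical log records: sizes in bytes and HTTP status codes.
log_size = [120, 340, 560, 980]
status_code = [200, 200, 404, 500]

corr = correlation(log_size, status_code)
print(round(corr, 3))
```

The covariance of a field with itself is its variance, which is why the diagonal of a covariance matrix holds the per-field variances.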
