Custom Metrics: Letting Prometheus Monitor Your Application

阿新 • Published: 2018-11-11

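Introduction

The Prometheus community provides a large number of official and third-party Exporters, which let Prometheus adopters quickly set up monitoring for critical business services and infrastructure.

Consider a simple application and its environment architecture. We typically collect monitoring metrics at several layers:

Ingress gateway: this can be a load balancer such as Nginx or HAProxy, or a microservice entry point provided by a framework such as Spring Cloud Zuul. Here we generally collect metrics for every HTTP request, such as the request path, HTTP method, response status code, and response time. The history of these metrics can then be used to analyze business load, service health, and so on.
Application services: at a minimum, the resource usage of the application itself. For a Java program this can be derived directly from JVM information; for an application deployed in a container, it can be derived from the container's resource usage. Beyond resource usage, in some cases we may also collect business metrics from within the application.
Infrastructure: resource usage of the virtual machines or physical machines.
Other: the state of middleware used in the cluster, such as databases, caches, and message queues.

In the scenarios above, besides using the Exporters provided by the Prometheus community directly, a project may also need to implement custom Exporters to collect and monitor metrics for its own specific purposes.

This article uses Spring Boot/Spring Cloud as an example to show how to define and expose custom metrics with the Prometheus SDK, and walks through practical use cases for the four Prometheus metric types (Counter, Gauge, Histogram, Summary).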

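Extending a Spring application to support Prometheus scraping

Adding the Prometheus Java Client dependency

> Version 0.0.24 is used here. In earlier versions, the monitoring endpoint exposed by Spring Boot could not correctly handle requests from the Prometheus Server. Details: https://github.com/prometheus/ ... s/265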

build.gradle

dependencies {
    ...
    compile 'io.prometheus:simpleclient:0.0.24'
    compile 'io.prometheus:simpleclient_spring_boot:0.0.24'
    compile 'io.prometheus:simpleclient_hotspot:0.0.24'
}

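Enabling the Prometheus metrics endpoint

Add the @EnablePrometheusEndpoint annotation to enable the Prometheus endpoint. The example also uses DefaultExports from simpleclient_hotspot, which makes the metrics endpoint return JVM information for the running application.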

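import io.prometheus.client.hotspot.DefaultExports;
import io.prometheus.client.spring.boot.EnablePrometheusEndpoint;
import org.springframework.boot.CommandLineRunner;
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;

@SpringBootApplication
@EnablePrometheusEndpoint
public class GatewayApplication implements CommandLineRunner {

    public static void main(String[] args) {
        SpringApplication.run(GatewayApplication.class, args);
    }

    @Override
    public void run(String... strings) throws Exception {
        // Register the JVM collectors (GC, memory, class loading, ...) from simpleclient_hotspot
        DefaultExports.initialize();
    }
}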

By default the metrics endpoint exposed for Prometheus is /prometheus; it can be changed through the endpoints configuration:

endpoints:
  prometheus:
    id: metrics
  metrics:
    id: springmetrics
    sensitive: false
    enabled: true
Start the application and visit http://localhost:8080/metrics to see output like the following:

# HELP jvm_gc_collection_seconds Time spent in a given JVM garbage collector in seconds.
# TYPE jvm_gc_collection_seconds summary
jvm_gc_collection_seconds_count{gc="PS Scavenge",} 11.0
jvm_gc_collection_seconds_sum{gc="PS Scavenge",} 0.18
jvm_gc_collection_seconds_count{gc="PS MarkSweep",} 2.0
jvm_gc_collection_seconds_sum{gc="PS MarkSweep",} 0.121
# HELP jvm_classes_loaded The number of classes that are currently loaded in the JVM
# TYPE jvm_classes_loaded gauge
jvm_classes_loaded 8376.0
# HELP jvm_classes_loaded_total The total number of classes that have been loaded since the JVM has started execution
# TYPE jvm_classes_loaded_total counter
...

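Adding an interceptor to prepare for instrumentation

Besides the JVM state of the application, we may also want to add custom metrics that capture system performance and business state, to provide supporting data for later optimization. First, we use an interceptor to handle every request to the application.

Extend WebMvcConfigurerAdapter and override addInterceptors to register an interceptor for all requests (/**):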

@SpringBootApplication
@EnablePrometheusEndpoint
public class GatewayApplication extends WebMvcConfigurerAdapter implements CommandLineRunner {

    @Override
    public void addInterceptors(InterceptorRegistry registry) {
        // Apply the metrics interceptor to every request path
        registry.addInterceptor(new PrometheusMetricsInterceptor()).addPathPatterns("/**");
    }
}

PrometheusMetricsInterceptor extends HandlerInterceptorAdapter and overrides the parent methods to hook in before a request is handled and after it completes.

public class PrometheusMetricsInterceptor extends HandlerInterceptorAdapter {

    @Override
    public boolean preHandle(HttpServletRequest request, HttpServletResponse response, Object handler) throws Exception {
        return super.preHandle(request, response, handler);
    }

    @Override
    public void afterCompletion(HttpServletRequest request, HttpServletResponse response, Object handler, Exception ex) throws Exception {
        super.afterCompletion(request, response, handler, ex);
    }
}

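Custom metrics

Prometheus provides four metric types: Counter, Gauge, Histogram, and Summary.

1) Counter: a counter that only goes up

A counter records values that only ever increase, such as the total number of requests served (http_requests_total) or CPU time consumed (process_cpu_seconds_total).

A Counter essentially offers a single operation, inc(), which increments the counter by 1 (inc(double) adds a larger non-negative amount).

By convention, the names of Counter metrics end with _total.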

public class PrometheusMetricsInterceptor extends HandlerInterceptorAdapter {

    // One time series is created per distinct (path, method, code) combination
    static final Counter requestCounter = Counter.build()
            .name("io_namespace_http_requests_total").labelNames("path", "method", "code")
            .help("Total requests.").register();

    @Override
    public void afterCompletion(HttpServletRequest request, HttpServletResponse response, Object handler, Exception ex) throws Exception {
        String requestURI = request.getRequestURI();
        String method = request.getMethod();
        int status = response.getStatus();

        // Count the request once it has completed and the status code is known
        requestCounter.labels(requestURI, method, String.valueOf(status)).inc();
        super.afterCompletion(request, response, handler, ex);
    }
}

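Counter.build() creates a Counter metric: name() sets the metric name and labelNames() declares the label dimensions the metric carries. In afterCompletion we read the request path, method, and status code of the current request and call inc(), so the count goes up by 1 for every request.

Counter.build()...register() registers the metric with the default CollectorRegistry, so its current value is returned whenever /metrics is requested.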
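
To make the role of labels concrete, here is a minimal plain-Java sketch of a labeled, increase-only counter. It is not the Prometheus client implementation; the class LabeledCounterSketch and its methods are hypothetical names used only to show that every distinct combination of label values becomes its own independent series.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.DoubleAdder;

// Hypothetical stand-in for a labeled counter; NOT the Prometheus client implementation.
class LabeledCounterSketch {
    // One independent, monotonically increasing value per label combination.
    private final Map<List<String>, DoubleAdder> children = new ConcurrentHashMap<>();

    // Increment the series identified by the given label values by 1.
    void inc(String... labelValues) {
        children.computeIfAbsent(Arrays.asList(labelValues.clone()), k -> new DoubleAdder()).add(1);
    }

    // Current value of one series; 0 if that label combination was never seen.
    double get(String... labelValues) {
        DoubleAdder adder = children.get(Arrays.asList(labelValues));
        return adder == null ? 0.0 : adder.sum();
    }

    // Number of distinct series, i.e. the label cardinality.
    int seriesCount() {
        return children.size();
    }

    public static void main(String[] args) {
        LabeledCounterSketch counter = new LabeledCounterSketch();
        counter.inc("/", "GET", "200");
        counter.inc("/", "GET", "200");
        counter.inc("/orders", "POST", "500");
        System.out.println(counter.get("/", "GET", "200")); // prints 2.0
        System.out.println(counter.seriesCount());          // prints 2
    }
}
```

One consequence worth keeping in mind: because each new label value creates a new series, using raw request URIs that embed IDs as the path label can make the number of series grow without bound.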

With the io_namespace_http_requests_total metric we can:

Query the total number of requests to the application

PromQL

sum(io_namespace_http_requests_total)

Query the number of HTTP requests per second

PromQL

sum(rate(io_namespace_http_requests_total[5m]))

Query the top N URIs by request count

PromQL

topk(10, sum(io_namespace_http_requests_total) by (path))
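
The rate() function used above turns an ever-growing counter into a per-second figure. The sketch below shows the core idea in plain Java under simplifying assumptions: it divides the total increase by the time between the first and last sample, and treats any drop in value as a counter reset. The real PromQL rate() additionally extrapolates to the window boundaries, which is deliberately left out here. RateSketch and its method are hypothetical names.

```java
// Simplified illustration of a per-second rate over counter samples; not PromQL's exact algorithm.
class RateSketch {
    // timestampsSeconds and values are parallel arrays of samples inside the query window.
    static double rate(double[] timestampsSeconds, double[] values) {
        if (values.length < 2) {
            return Double.NaN; // at least two samples are needed to measure an increase
        }
        double increase = 0.0;
        for (int i = 1; i < values.length; i++) {
            double delta = values[i] - values[i - 1];
            // A counter never decreases, so a drop means the process restarted from 0:
            // the new value itself is then the increase since the reset.
            increase += delta >= 0 ? delta : values[i];
        }
        double elapsed = timestampsSeconds[timestampsSeconds.length - 1] - timestampsSeconds[0];
        return increase / elapsed;
    }

    public static void main(String[] args) {
        // Counter sampled once a minute, growing by 60 requests per minute
        System.out.println(rate(new double[]{0, 60, 120}, new double[]{0, 60, 120})); // prints 1.0
    }
}
```

This is also why counters are preferred over gauges for request totals: a restart does not corrupt the computed rate, because the reset can be detected and compensated for.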

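2) Gauge: a gauge that can go up and down

Metrics that can both rise and fall reflect the current state of an application. When monitoring a host, for example, this includes the amount of free memory (node_memory_MemFree) or available memory (node_memory_MemAvailable), or the current CPU and memory usage of a container.

A Gauge object has two main methods, inc() and dec(), which increase or decrease the value. Here we use a Gauge to record the number of HTTP requests currently being processed.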

public class PrometheusMetricsInterceptor extends HandlerInterceptorAdapter {

    // ... code omitted
    // No "code" label here: the status code is not known until the request completes,
    // and inc() and dec() must address the same label values.
    static final Gauge inprogressRequests = Gauge.build()
            .name("io_namespace_http_inprogress_requests").labelNames("path", "method")
            .help("Inprogress requests.").register();

    @Override
    public boolean preHandle(HttpServletRequest request, HttpServletResponse response, Object handler) throws Exception {
        // ... code omitted
        // Gauge +1
        inprogressRequests.labels(requestURI, method).inc();
        return super.preHandle(request, response, handler);
    }

    @Override
    public void afterCompletion(HttpServletRequest request, HttpServletResponse response, Object handler, Exception ex) throws Exception {
        // ... code omitted
        // Gauge -1
        inprogressRequests.labels(requestURI, method).dec();

        super.afterCompletion(request, response, handler, ex);
    }
}

With the io_namespace_http_inprogress_requests metric we can directly query the number of HTTP requests the application is currently processing:

PromQL

io_namespace_http_inprogress_requests{}
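
The important property of an in-progress gauge is that every increment is matched by exactly one decrement, even when the handler fails. A minimal plain-Java sketch of that pairing, assuming a hypothetical InProgressGaugeSketch class rather than the Prometheus Gauge itself:

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Hypothetical illustration of inc()/dec() pairing; not the Prometheus Gauge class.
class InProgressGaugeSketch {
    private final AtomicInteger inProgress = new AtomicInteger();

    // Runs the handler while it is counted as "in progress".
    <T> T track(Supplier<T> handler) {
        inProgress.incrementAndGet();       // corresponds to inc() in preHandle
        try {
            return handler.get();
        } finally {
            inProgress.decrementAndGet();   // corresponds to dec() in afterCompletion
        }
    }

    int current() {
        return inProgress.get();
    }

    public static void main(String[] args) {
        InProgressGaugeSketch gauge = new InProgressGaugeSketch();
        Integer during = gauge.track(gauge::current);
        System.out.println(during);          // prints 1: the request sees itself in progress
        System.out.println(gauge.current()); // prints 0: back to idle afterwards
    }
}
```

In the interceptor, afterCompletion plays the role of the finally block, since Spring calls it whether or not the handler threw an exception.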

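3) Histogram: a distribution histogram with built-in buckets

A Histogram counts how many observed values, such as HTTP request sizes in bytes or event durations, fall within specified ranges (buckets).

Take the request latency requests_latency_seconds as an example: suppose we want to record how many HTTP requests have a response time that falls within each of the buckets {.005, .01, .025, .05, .075, .1, .25, .5, .75, 1, 2.5, 5, 7.5, 10}.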

public class PrometheusMetricsInterceptor extends HandlerInterceptorAdapter {

    static final Histogram requestLatencyHistogram = Histogram.build()
            .name("io_namespace_http_requests_latency_seconds_histogram")
            .labelNames("path", "method", "code")
            .help("Request latency in seconds.")
            .register();

    // The interceptor is a shared singleton, so per-request state is kept in a
    // request attribute instead of an instance field.
    private static final String HISTOGRAM_START_NANOS = "histogram.startNanos";

    @Override
    public boolean preHandle(HttpServletRequest request, HttpServletResponse response, Object handler) throws Exception {
        // ... code omitted
        request.setAttribute(HISTOGRAM_START_NANOS, System.nanoTime());
        // ... code omitted
    }

    @Override
    public void afterCompletion(HttpServletRequest request, HttpServletResponse response, Object handler, Exception ex) throws Exception {
        // ... code omitted
        long startNanos = (Long) request.getAttribute(HISTOGRAM_START_NANOS);
        // The status code is only reliable here, after the request has completed
        requestLatencyHistogram.labels(requestURI, method, String.valueOf(status))
                .observe((System.nanoTime() - startNanos) / 1e9);
        // ... code omitted
    }
}

The Histogram builder creates a Histogram metric. The default buckets are {.005, .01, .025, .05, .075, .1, .25, .5, .75, 1, 2.5, 5, 7.5, 10}. To override them, use .buckets(double... buckets).

A Histogram automatically creates three kinds of time series:

Total number of events: basename_count

Meaning: 2 HTTP requests have occurred so far

io_namespace_http_requests_latency_seconds_histogram_count{path="/",method="GET",code="200",} 2.0

Sum of all observed values: basename_sum

Meaning: the total response time of these 2 HTTP requests is 13.107670803000001 seconds

io_namespace_http_requests_latency_seconds_histogram_sum{path="/",method="GET",code="200",} 13.107670803000001

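Number of observations falling into each bucket: basename_bucket{le="<upper bound, inclusive>"}. Bucket counts are cumulative: each bucket counts every observation less than or equal to its upper bound.

Of the 2 requests, 0 had a response time <= 0.005 seconds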

io_namespace_http_requests_latency_seconds_histogram_bucket{path="/",method="GET",code="200",le="0.005",} 0.0

Of the 2 requests, 0 had a response time <= 0.01 seconds

io_namespace_http_requests_latency_seconds_histogram_bucket{path="/",method="GET",code="200",le="0.01",} 0.0

Of the 2 requests, 0 had a response time <= 0.025 seconds

io_namespace_http_requests_latency_seconds_histogram_bucket{path="/",method="GET",code="200",le="0.025",} 0.0
io_namespace_http_requests_latency_seconds_histogram_bucket{path="/",method="GET",code="200",le="0.05",} 0.0
io_namespace_http_requests_latency_seconds_histogram_bucket{path="/",method="GET",code="200",le="0.075",} 0.0
io_namespace_http_requests_latency_seconds_histogram_bucket{path="/",method="GET",code="200",le="0.1",} 0.0
io_namespace_http_requests_latency_seconds_histogram_bucket{path="/",method="GET",code="200",le="0.25",} 0.0
io_namespace_http_requests_latency_seconds_histogram_bucket{path="/",method="GET",code="200",le="0.5",} 0.0
io_namespace_http_requests_latency_seconds_histogram_bucket{path="/",method="GET",code="200",le="0.75",} 0.0
io_namespace_http_requests_latency_seconds_histogram_bucket{path="/",method="GET",code="200",le="1.0",} 0.0
io_namespace_http_requests_latency_seconds_histogram_bucket{path="/",method="GET",code="200",le="2.5",} 0.0
io_namespace_http_requests_latency_seconds_histogram_bucket{path="/",method="GET",code="200",le="5.0",} 0.0
io_namespace_http_requests_latency_seconds_histogram_bucket{path="/",method="GET",code="200",le="7.5",} 2.0

Of the 2 requests, 2 had a response time <= 10 seconds

io_namespace_http_requests_latency_seconds_histogram_bucket{path="/",method="GET",code="200",le="10.0",} 2.0

Of the 2 requests, 2 fall within the +Inf bucket, which has no upper limit and therefore always equals the total count

io_namespace_http_requests_latency_seconds_histogram_bucket{path="/",method="GET",code="200",le="+Inf",} 2.0
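
The bucket output above can be reproduced with a small plain-Java sketch of cumulative bucket counting. This is an illustration only, not the client library's code: HistogramSketch is a hypothetical class, it is not thread-safe, and the two latencies in main (6.0 s and 7.1 s) are assumed values chosen so that both land between the 5 and 7.5 second bounds, as in the sample output.

```java
import java.util.Arrays;

// Hypothetical, single-threaded illustration of histogram buckets; not the Prometheus Histogram class.
class HistogramSketch {
    private final double[] upperBounds; // sorted upper bounds, with +Inf appended as the last bucket
    private final long[] bucketCounts;  // observations per individual bucket (non-cumulative)
    private long count;
    private double sum;

    HistogramSketch(double... finiteBounds) {
        upperBounds = Arrays.copyOf(finiteBounds, finiteBounds.length + 1);
        upperBounds[finiteBounds.length] = Double.POSITIVE_INFINITY;
        bucketCounts = new long[upperBounds.length];
    }

    void observe(double value) {
        // Find the first bucket whose upper bound is >= value (bounds are inclusive)
        for (int i = 0; i < upperBounds.length; i++) {
            if (value <= upperBounds[i]) {
                bucketCounts[i]++;
                break;
            }
        }
        count++;
        sum += value;
    }

    // What is exposed as basename_bucket{le="..."}: running totals over the buckets
    long[] cumulativeCounts() {
        long[] cumulative = new long[bucketCounts.length];
        long running = 0;
        for (int i = 0; i < bucketCounts.length; i++) {
            running += bucketCounts[i];
            cumulative[i] = running;
        }
        return cumulative;
    }

    long count() { return count; }

    double sum() { return sum; }

    public static void main(String[] args) {
        HistogramSketch h = new HistogramSketch(
                .005, .01, .025, .05, .075, .1, .25, .5, .75, 1, 2.5, 5, 7.5, 10);
        h.observe(6.0); // assumed latency in seconds
        h.observe(7.1); // assumed latency in seconds
        System.out.println(Arrays.toString(h.cumulativeCounts()));
    }
}
```

Every bucket up to le="5.0" stays at 0, and every bucket from le="7.5" onward, including +Inf, reports 2, matching the shape of the output above.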

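4) Summary: a client-side defined distribution summary

Summary and Histogram are very similar: both can record the number or size of events and their distribution.

Both Summary and Histogram provide an event count (_count) and a sum of values (_sum), so the same quantities can be computed from either, for example the average HTTP response time over the last 5 minutes: rate(basename_sum[5m]) / rate(basename_count[5m]).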
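
Because both rates are taken over the same window, the window length cancels and the expression reduces to "increase in sum divided by increase in count". As a worked example under an assumption: suppose that within the 5-minute window basename_sum grew by 51.029495508 seconds and basename_count grew by 12 (figures borrowed from the Summary sample output later in this article, and ignoring the extrapolation that rate() applies at the window edges).

```latex
\frac{\mathrm{rate}(\mathit{sum}[5m])}{\mathrm{rate}(\mathit{count}[5m])}
  = \frac{\Delta \mathit{sum} / 300\,\mathrm{s}}{\Delta \mathit{count} / 300\,\mathrm{s}}
  = \frac{\Delta \mathit{sum}}{\Delta \mathit{count}}
  = \frac{51.029495508\,\mathrm{s}}{12}
  \approx 4.25\,\mathrm{s\ per\ request}
```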

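Both can also be used to compute the distribution of samples, such as the median or the 0.9 quantile (90th percentile), where 0.0 <= quantile <= 1.0.

The difference is that a Histogram lets the server compute quantiles with the histogram_quantile function, whereas a Summary's quantiles are defined and computed directly on the client. As a result, quantile queries against a Summary are cheaper in PromQL, while a Histogram costs more at query time. Conversely, a Histogram consumes fewer resources on the client.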
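
To show what server-side estimation from buckets looks like, the sketch below estimates a quantile by linear interpolation inside the bucket that contains the target rank. It roughly mirrors the approach histogram_quantile takes, but it is a simplified plain-Java illustration with hypothetical names, not the Prometheus implementation. It assumes sorted bounds ending in +Inf, cumulative counts, and a lower bound of 0 for the first bucket.

```java
// Simplified quantile estimation from cumulative histogram buckets; not Prometheus source code.
class HistogramQuantileSketch {
    // upperBounds: sorted bucket bounds, last one +Inf; cumulativeCounts: observations <= each bound.
    static double quantile(double q, double[] upperBounds, double[] cumulativeCounts) {
        if (q < 0 || q > 1) {
            throw new IllegalArgumentException("q must be within [0, 1]");
        }
        int last = upperBounds.length - 1;
        if (last < 1 || cumulativeCounts[last] == 0) {
            return Double.NaN; // needs at least one finite bucket and one observation
        }
        double rank = q * cumulativeCounts[last];
        int b = 0;
        while (b < last && cumulativeCounts[b] < rank) {
            b++; // first bucket whose cumulative count reaches the rank
        }
        if (b == last) {
            return upperBounds[last - 1]; // rank lies in the +Inf bucket: report the highest finite bound
        }
        double bucketStart = b == 0 ? 0.0 : upperBounds[b - 1];
        double below = b == 0 ? 0.0 : cumulativeCounts[b - 1];
        double inBucket = cumulativeCounts[b] - below;
        if (inBucket == 0) {
            return bucketStart; // only reachable for q == 0 with an empty first bucket
        }
        // Assume observations are spread evenly across the bucket
        return bucketStart + (upperBounds[b] - bucketStart) * ((rank - below) / inBucket);
    }

    public static void main(String[] args) {
        double[] bounds = {.005, .01, .025, .05, .075, .1, .25, .5, .75, 1, 2.5, 5, 7.5, 10,
                Double.POSITIVE_INFINITY};
        // Cumulative counts taken from the Histogram sample output in this article
        double[] cumulative = {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 2};
        System.out.println(quantile(0.5, bounds, cumulative)); // prints 6.25
    }
}
```

The result is an estimate whose accuracy depends on the bucket layout: all that is known is that both requests took between 5 and 7.5 seconds, so the median is placed halfway through that bucket. A Summary, by contrast, reports a value computed from the actual observations on the client.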

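public class PrometheusMetricsInterceptor extends HandlerInterceptorAdapter {

    static final Summary requestLatency = Summary.build()
            .name("io_namespace_http_requests_latency_seconds_summary")
            .quantile(0.5, 0.05)   // median, with 5% tolerated error
            .quantile(0.9, 0.01)   // 0.9 quantile, with 1% tolerated error
            .labelNames("path", "method", "code")
            .help("Request latency in seconds.").register();

    // The interceptor is a shared singleton, so per-request state is kept in a
    // request attribute instead of an instance field.
    private static final String SUMMARY_START_NANOS = "summary.startNanos";

    @Override
    public boolean preHandle(HttpServletRequest request, HttpServletResponse response, Object handler) throws Exception {
        // ... code omitted
        request.setAttribute(SUMMARY_START_NANOS, System.nanoTime());
        // ... code omitted
    }

    @Override
    public void afterCompletion(HttpServletRequest request, HttpServletResponse response, Object handler, Exception ex) throws Exception {
        // ... code omitted
        long startNanos = (Long) request.getAttribute(SUMMARY_START_NANOS);
        requestLatency.labels(requestURI, method, String.valueOf(status))
                .observe((System.nanoTime() - startNanos) / 1e9);
        // ... code omitted
    }
}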

A Summary metric automatically creates several time series:

Total number of events

Meaning: 12 HTTP requests have occurred so far

io_namespace_http_requests_latency_seconds_summary_count{path="/",method="GET",code="200",} 12.0

Sum of the observed values

Meaning: the total response time of these 12 HTTP requests is 51.029495508s

io_namespace_http_requests_latency_seconds_summary_sum{path="/",method="GET",code="200",} 51.029495508

Distribution of the observed values

Meaning: the median response time of these 12 HTTP requests is 3.052404983s

io_namespace_http_requests_latency_seconds_summary{path="/",method="GET",code="200",quantile="0.5",} 3.052404983

Meaning: the 0.9 quantile (90th percentile) of the response time of these 12 HTTP requests is 8.003261666s

io_namespace_http_requests_latency_seconds_summary{path="/",method="GET",code="200",quantile="0.9",} 8.003261666

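Exposing business metrics with a Collector

Besides building metrics in the interceptor with the Counter, Summary, and Gauge types Prometheus provides, we can also expose business metrics through a custom Collector.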

@SpringBootApplication
@EnablePrometheusEndpoint
public class GatewayApplication extends WebMvcConfigurerAdapter implements CommandLineRunner {

    @Autowired
    private CustomExporter customExporter;

    // ... code omitted

    @Override
    public void run(String... args) throws Exception {
        // ... code omitted
        // Register the custom collector so it is included in every scrape
        customExporter.register();
    }
}

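CustomExporter extends io.prometheus.client.Collector. Once the Collector's register() method has been called, every request to /metrics automatically obtains the collected metrics from the Collector's collect() method.

Because CustomExporter lives in the Spring IoC container, it can call business code directly and return whatever business metrics are needed.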

import io.prometheus.client.Collector;
import io.prometheus.client.GaugeMetricFamily;
import org.springframework.stereotype.Component;

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

@Component
public class CustomExporter extends Collector {

    @Override
    public List<MetricFamilySamples> collect() {
        List<MetricFamilySamples> mfs = new ArrayList<>();

        // Create the metric
        GaugeMetricFamily labeledGauge =
                new GaugeMetricFamily("io_namespace_custom_metrics", "custom metrics", Collections.singletonList("labelname"));

        // Set the metric's label value and sample value
        labeledGauge.addMetric(Collections.singletonList("labelvalue"), 1);

        mfs.add(labeledGauge);
        return mfs;
    }
}

Other metric types can of course be declared here with CounterMetricFamily or SummaryMetricFamily.
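
Whatever a Collector returns from collect() ends up as lines in the text format shown earlier in this article. The plain-Java sketch below approximates how the single gauge sample from CustomExporter would be rendered, following the layout of the sample output above (including the trailing comma after the last label). ExpositionSketch and renderGauge are hypothetical names; the real client performs this step internally and also escapes special characters, which is omitted here.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical approximation of the text exposition layout; not the Prometheus client's formatter.
class ExpositionSketch {
    static String renderGauge(String name, String help, Map<String, String> labels, double value) {
        StringBuilder sb = new StringBuilder();
        sb.append("# HELP ").append(name).append(' ').append(help).append('\n');
        sb.append("# TYPE ").append(name).append(" gauge\n");
        sb.append(name);
        if (!labels.isEmpty()) {
            sb.append('{');
            for (Map.Entry<String, String> label : labels.entrySet()) {
                // label="value", with a trailing comma as in the sample output of this article
                sb.append(label.getKey()).append("=\"").append(label.getValue()).append("\",");
            }
            sb.append('}');
        }
        sb.append(' ').append(value).append('\n');
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, String> labels = new LinkedHashMap<>();
        labels.put("labelname", "labelvalue");
        System.out.print(renderGauge("io_namespace_custom_metrics", "custom metrics", labels, 1));
    }
}
```

Because collect() is invoked on every scrape, the value is read at scrape time, which is what makes a Collector a good fit for business figures that already live in a database or service.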

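Conclusion

That's it. Start the application and visit http://localhost:8080/metrics to see the resulting metrics.

This article covered two ways of defining custom metrics in a Spring application:

Interceptor/filter: for measuring all requests to the application
Custom Collector: for measuring business-level aspects of the application

It also covered the four metric types and their use cases:

Counter: a counter that only goes up
Gauge: a gauge that can go up and down
Histogram: a distribution histogram with built-in buckets
Summary: a client-side defined distribution summary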
