Prometheus in Practice and Source Code Analysis: API and Federation

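Before walking through the source code, a few points about Prometheus configuration and usage need explaining. The first is the API: Prometheus exposes a set of HTTP endpoints.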

curl 'http://localhost:9090/api/v1/query?query=go_goroutines' | python -m json.tool

{
    "data": {
        "result": [
            {
                "metric": {
                    "__name__": "go_goroutines",
                    "instance": "localhost:9090",
                    "job": "prometheus"
                },
                "value": [
                    1493347106.901,
                    "119"
                ]
            },
            {
                "metric": {
                    "__name__": "go_goroutines",
                    "instance": "10.39.0.45:9100",
                    "job": "node"
                },
                "value": [
                    1493347106.901,
                    "13"
                ]
            },
            {
                "metric": {
                    "__name__": "go_goroutines",
                    "instance": "10.39.0.53:9100",
                    "job": "node"
                },
                "value": [
                    1493347106.901,
                    "11"
                ]
            }
        ],
        "resultType": "vector"
    },
    "status": "success"
}
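
A response of this shape is easy to consume from a script. The following is a minimal Python 3 sketch, not part of Prometheus itself: the helper name `vector_to_dict` and the trimmed `sample` payload are illustrative assumptions, and it only handles the `vector` result type shown above.

```python
import json


def vector_to_dict(response_text):
    """Map each instance label to its sample value (hypothetical helper)."""
    body = json.loads(response_text)
    if body.get("status") != "success":
        raise ValueError("query failed: %r" % (body,))
    data = body["data"]
    if data["resultType"] != "vector":
        raise ValueError("expected a vector, got %s" % data["resultType"])
    # Each sample is [timestamp, "value"]; the value arrives as a string.
    return {s["metric"]["instance"]: float(s["value"][1]) for s in data["result"]}


# Trimmed copy of the response shown above, used as offline test data.
sample = """
{
    "data": {
        "result": [
            {"metric": {"__name__": "go_goroutines",
                        "instance": "localhost:9090", "job": "prometheus"},
             "value": [1493347106.901, "119"]},
            {"metric": {"__name__": "go_goroutines",
                        "instance": "10.39.0.45:9100", "job": "node"},
             "value": [1493347106.901, "13"]}
        ],
        "resultType": "vector"
    },
    "status": "success"
}
"""

print(vector_to_dict(sample))
```

The sample values are strings in the JSON, so the sketch converts them to floats before returning them.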

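The curl example above queries a single metric, go_goroutines. Queries bounded by a start time and an end time are of course also possible, but a more powerful feature is OR-style matching, where several selectors are passed in one request: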


[root@slave3 ~]# curl -g 'http://localhost:9090/api/v1/series?match[]=up&match[]=process_start_time_seconds{job="prometheus"}'|python -m json.tool
{
    "data": [
        {
            "__name__": "up",
            "instance": "10.39.0.53:9100",
            "job": "node"
        },
        {
            "__name__": "up",
            "instance": "localhost:9090",
            "job": "prometheus"
        },
        {
            "__name__": "up",
            "instance": "10.39.0.45:9100",
            "job": "node"
        },
        {
            "__name__": "process_start_time_seconds",
            "instance": "localhost:9090",
            "job": "prometheus"
        }
    ],
    "status": "success"
}
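
Both request styles mentioned here, a query bounded by start and end times and a lookup with several `match[]` selectors, can be built programmatically. This is a hedged Python 3 sketch: the function names and the `BASE` address are assumptions chosen to mirror the curl examples, and it only constructs URLs without contacting a server.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

BASE = "http://localhost:9090"  # assumption: same server as the curl examples


def range_query_url(query, start, end, step, base=BASE):
    """Build a /api/v1/query_range URL for a time-bounded query."""
    params = {"query": query, "start": start, "end": end, "step": step}
    return base + "/api/v1/query_range?" + urlencode(params)


def series_url(selectors, base=BASE):
    """Build a /api/v1/series URL; a series matching any selector is returned."""
    # doseq=True repeats the match[] key once per selector.
    return base + "/api/v1/series?" + urlencode({"match[]": selectors}, doseq=True)


url = series_url(["up", 'process_start_time_seconds{job="prometheus"}'])
print(url)
# Decoding the query string recovers the original selectors.
print(parse_qs(urlsplit(url).query)["match[]"])
```

`urlencode` percent-encodes the brackets, braces and quotes, which avoids the need for curl's `-g` flag and shell quoting.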

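The /api/v1/series call above looks up a set of series; series can also be removed with the DELETE method. Remember the job and targets settings from the previous article? Those can be queried through the API as well: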

 curl http://localhost:9090/api/v1/label/job/values
{"status":"success","data":["node","prometheus"]}

The monitored targets can be queried too:

curl http://localhost:9090/api/v1/targets|python -m json.tool
{
    "data": {
        "activeTargets": [
            {
                "discoveredLabels": {
                    "__address__": "10.39.0.53:9100",
                    "__metrics_path__": "/metrics",
                    "__scheme__": "http",
                    "job": "node"
                },
                "health": "up",
                "labels": {
                    "instance": "10.39.0.53:9100",
                    "job": "node"
                },
                "lastError": "",
                "lastScrape": "2017-04-28T02:47:40.871586825Z",
                "scrapeUrl": "http://10.39.0.53:9100/metrics"
            },
            {
                "discoveredLabels": {
                    "__address__": "10.39.0.45:9100",
                    "__metrics_path__": "/metrics",
                    "__scheme__": "http",
                    "job": "node"
                },
                "health": "up",
                "labels": {
                    "instance": "10.39.0.45:9100",
                    "job": "node"
                },
                "lastError": "",
                "lastScrape": "2017-04-28T02:47:45.144032466Z",
                "scrapeUrl": "http://10.39.0.45:9100/metrics"
            },
            {
                "discoveredLabels": {
                    "__address__": "localhost:9090",
                    "__metrics_path__": "/metrics",
                    "__scheme__": "http",
                    "job": "prometheus"
                },
                "health": "up",
                "labels": {
                    "instance": "localhost:9090",
                    "job": "prometheus"
                },
                "lastError": "",
                "lastScrape": "2017-04-28T02:47:44.079111193Z",
                "scrapeUrl": "http://localhost:9090/metrics"
            }
        ]
    },
    "status": "success"
}
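
One practical use of this endpoint is spotting targets that are not healthy. The sketch below is a Python 3 illustration under stated assumptions: `targets_by_health` is a hypothetical helper, and the `sample` payload is a shortened version of the output above with one target changed to "down" purely for demonstration.

```python
import json


def targets_by_health(response_text):
    """Group scrape URLs by their reported health (hypothetical helper)."""
    body = json.loads(response_text)
    grouped = {}
    for target in body["data"]["activeTargets"]:
        grouped.setdefault(target["health"], []).append(target["scrapeUrl"])
    return grouped


# Shortened, partly made-up payload: the "down" entry is not from the original output.
sample = """
{
    "data": {
        "activeTargets": [
            {"health": "up", "labels": {"instance": "10.39.0.53:9100", "job": "node"},
             "lastError": "", "scrapeUrl": "http://10.39.0.53:9100/metrics"},
            {"health": "down", "labels": {"instance": "10.39.0.45:9100", "job": "node"},
             "lastError": "connection refused",
             "scrapeUrl": "http://10.39.0.45:9100/metrics"}
        ]
    },
    "status": "success"
}
"""

print(targets_by_health(sample).get("down", []))
```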

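That is how targets are queried. Alertmanagers can be listed the same way through /api/v1/alertmanagers. For Prometheus's local storage there are also some key metrics worth watching:
prometheus_local_storage_memory_series: the current number of series held in memory.
prometheus_local_storage_open_head_chunks: the number of open head chunks.
prometheus_local_storage_chunks_to_persist: the number of memory chunks that still need to be persisted to disk.
prometheus_local_storage_memory_chunks: the current number of chunks held in memory. Subtracting the previous two gives the number of persisted chunks (which are evictable if no query is currently using them).
prometheus_local_storage_series_chunks_persisted: a histogram of the number of chunks persisted per batch.
prometheus_local_storage_rushed_mode: 1 if Prometheus is in "rushed mode", 0 otherwise. Can be used to calculate the percentage of time Prometheus spends in rushed mode.
prometheus_local_storage_checkpoint_last_duration_seconds: how long the last checkpoint took.
prometheus_local_storage_checkpoint_last_size_bytes: the size of the last checkpoint, in bytes.
prometheus_local_storage_checkpointing: 1 while Prometheus is checkpointing, 0 otherwise. Can be used to calculate the percentage of time Prometheus spends checkpointing.
prometheus_local_storage_inconsistencies_total: a counter of storage inconsistencies found. If it is greater than 0, restart the server for recovery.
prometheus_local_storage_persist_errors_total: a counter of persist errors.
prometheus_local_storage_memory_dirty_series: the current number of dirty series.
process_resident_memory_bytes: broadly speaking, the physical memory occupied by the Prometheus process.
go_memstats_alloc_bytes: the Go heap size (allocated objects in use plus allocated objects no longer in use but not yet garbage-collected).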
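
Two of the derived figures mentioned in the list can be worked out with simple arithmetic. The following Python 3 sketch uses made-up numbers and hypothetical helper names; it only illustrates the calculation, not how Prometheus computes anything internally.

```python
def persisted_chunks(memory_chunks, open_head_chunks, chunks_to_persist):
    """Chunks in memory that are already on disk, hence candidates for eviction."""
    return memory_chunks - open_head_chunks - chunks_to_persist


def fraction_of_time(flag_samples):
    """Share of samples where a 0/1 gauge (e.g. rushed_mode) was 1."""
    if not flag_samples:
        raise ValueError("need at least one sample")
    return sum(flag_samples) / len(flag_samples)


# Illustrative values only.
print(persisted_chunks(memory_chunks=1000, open_head_chunks=50, chunks_to_persist=150))
print(fraction_of_time([0, 1, 1, 0]))
```

The second helper assumes evenly spaced samples of the gauge, so the mean of the 0/1 values approximates the share of time spent in that state.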

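Another advanced use of Prometheus is federation. By defining slaves, you can deploy one Prometheus server in each data center and then aggregate their data through federation.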

scrape_configs:
  - job_name: dc_prometheus
    honor_labels: true
    metrics_path: /federate
    params:
      match[]:
        - '{__name__=~"^job:.*"}'   # Request all job-level time series
    static_configs:
      - targets:
        - dc1-prometheus:9090
        - dc2-prometheus:9090

If a single server does not have enough capacity, scraping can also be sharded:

global:
  external_labels:
    slave: 1  # This is the 2nd slave. This prevents clashes between slaves.
scrape_configs:
  - job_name: some_job
    # Add usual service discovery here, such as static_configs
    relabel_configs:
    - source_labels: [__address__]
      modulus:       4    # 4 slaves
      target_label:  __tmp_hash
      action:        hashmod
    - source_labels: [__tmp_hash]
      regex:         ^1$  # This is the 2nd slave
      action:        keep

The configuration above uses a hash of the target address to decide which targets each Prometheus server is responsible for.
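
To make the sharding idea concrete, here is a small Python 3 sketch of hash-based target assignment. It assumes an MD5 digest whose low 8 bytes are reduced modulo the number of slaves; the exact hash Prometheus applies for `hashmod` may differ, and the function names are hypothetical, so the shard numbers printed here should not be expected to match a real server.

```python
import hashlib


def hashmod(value, modulus):
    """Reduce a label value to a shard number in [0, modulus)."""
    digest = hashlib.md5(value.encode("utf-8")).digest()
    # Interpret the low 8 bytes of the digest as an unsigned 64-bit integer.
    return int.from_bytes(digest[8:], "big") % modulus


def targets_for_shard(targets, shard, modulus):
    """Keep only the targets whose hash falls on this shard (like action: keep)."""
    return [t for t in targets if hashmod(t, modulus) == shard]


# Hypothetical target addresses for illustration.
targets = ["10.39.0.%d:9100" % i for i in range(1, 41)]
for shard in range(4):
    print(shard, len(targets_for_shard(targets, shard, 4)))
```

Because the assignment depends only on the address and the modulus, every slave can evaluate it independently and each target ends up on exactly one slave.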

scrape_configs:
  - job_name: slaves
    honor_labels: true
    metrics_path: /federate
    params:
      match[]:
        - '{__name__=~"^slave:.*"}'   # Request all slave-level time series
    static_configs:
      - targets:
        - slave0:9090
        - slave1:9090
        - slave2:9090
        - slave3:9090

The configuration above lists the slaves for the master to federate from, so the data can be stored in shards.