Prometheus in Practice and Source Code Analysis: the API and Federation
Before walking through the source code, a few aspects of Prometheus configuration and usage need explaining. First, the API: Prometheus exposes a set of HTTP endpoints, for example:
curl 'http://localhost:9090/api/v1/query?query=go_goroutines' | python -m json.tool
{
    "data": {
        "result": [
            {
                "metric": {
                    "__name__": "go_goroutines",
                    "instance": "localhost:9090",
                    "job": "prometheus"
                },
                "value": [
                    1493347106.901,
                    "119"
                ]
            },
            {
                "metric": {
                    "__name__": "go_goroutines",
                    "instance": "10.39.0.45:9100",
                    "job": "node"
                },
                "value": [
                    1493347106.901,
                    "13"
                ]
            },
            {
                "metric": {
                    "__name__": "go_goroutines",
                    "instance": "10.39.0.53:9100",
                    "job": "node"
                },
                "value": [
                    1493347106.901,
                    "11"
                ]
            }
        ],
        "resultType": "vector"
    },
    "status": "success"
}
The above demonstrates an instant query for the go_goroutines metric; the query_range endpoint additionally accepts start and end timestamps to query over a time range.
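A range query against /api/v1/query_range takes query, start, end, and step parameters. A minimal sketch of building such a request URL in Python (the host and time values are placeholders, not from a live server):

```python
from urllib.parse import urlencode

def build_range_query_url(base, expr, start, end, step):
    """Build a /api/v1/query_range URL for a PromQL expression."""
    params = urlencode({
        "query": expr,
        "start": start,  # Unix timestamp
        "end": end,      # Unix timestamp
        "step": step,    # query resolution, e.g. "15s"
    })
    return "%s/api/v1/query_range?%s" % (base, params)

url = build_range_query_url("http://localhost:9090",
                            "go_goroutines", 1493347000, 1493347600, "15s")
print(url)
```

The resulting URL can be fetched with curl exactly like the instant query above.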
More powerfully, the series endpoint accepts multiple match[] selectors, which are ORed together:
[root@slave3 ~]# curl -g 'http://localhost:9090/api/v1/series?match[]=up&match[]=process_start_time_seconds{job="prometheus"}' | python -m json.tool
{
    "data": [
        {
            "__name__": "up",
            "instance": "10.39.0.53:9100",
            "job": "node"
        },
        {
            "__name__": "up",
            "instance": "localhost:9090",
            "job": "prometheus"
        },
        {
            "__name__": "up",
            "instance": "10.39.0.45:9100",
            "job": "node"
        },
        {
            "__name__": "process_start_time_seconds",
            "instance": "localhost:9090",
            "job": "prometheus"
        }
    ],
    "status": "success"
}
This queries which series exist; in Prometheus 1.x, series can also be deleted by sending DELETE to the same endpoint. Remember the jobs and targets configured in the previous post? They can be queried through the API as well:
curl http://localhost:9090/api/v1/label/job/values
{"status":"success","data":["node","prometheus"]}
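For the series deletion mentioned above, Prometheus 1.x accepts a DELETE on the same /api/v1/series endpoint. A sketch that only constructs the request, without sending it (the match[] selector is an example value):

```python
from urllib.parse import urlencode
from urllib.request import Request

# Build (but do not send) a DELETE request against the series endpoint.
params = urlencode({"match[]": 'up{job="node"}'})
req = Request("http://localhost:9090/api/v1/series?" + params,
              method="DELETE")
print(req.get_method(), req.full_url)
```

Sending it with `urlopen(req)` would drop the matched series from local storage.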
The scrape targets themselves can also be listed:
curl http://localhost:9090/api/v1/targets | python -m json.tool
{
    "data": {
        "activeTargets": [
            {
                "discoveredLabels": {
                    "__address__": "10.39.0.53:9100",
                    "__metrics_path__": "/metrics",
                    "__scheme__": "http",
                    "job": "node"
                },
                "health": "up",
                "labels": {
                    "instance": "10.39.0.53:9100",
                    "job": "node"
                },
                "lastError": "",
                "lastScrape": "2017-04-28T02:47:40.871586825Z",
                "scrapeUrl": "http://10.39.0.53:9100/metrics"
            },
            {
                "discoveredLabels": {
                    "__address__": "10.39.0.45:9100",
                    "__metrics_path__": "/metrics",
                    "__scheme__": "http",
                    "job": "node"
                },
                "health": "up",
                "labels": {
                    "instance": "10.39.0.45:9100",
                    "job": "node"
                },
                "lastError": "",
                "lastScrape": "2017-04-28T02:47:45.144032466Z",
                "scrapeUrl": "http://10.39.0.45:9100/metrics"
            },
            {
                "discoveredLabels": {
                    "__address__": "localhost:9090",
                    "__metrics_path__": "/metrics",
                    "__scheme__": "http",
                    "job": "prometheus"
                },
                "health": "up",
                "labels": {
                    "instance": "localhost:9090",
                    "job": "prometheus"
                },
                "lastError": "",
                "lastScrape": "2017-04-28T02:47:44.079111193Z",
                "scrapeUrl": "http://localhost:9090/metrics"
            }
        ]
    },
    "status": "success"
}
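The activeTargets payload is easy to consume programmatically, for example to flag targets that are not healthy. A minimal sketch, using a trimmed sample of the response above (with one target artificially marked down for illustration):

```python
import json

# Trimmed sample of an /api/v1/targets response; the "down" entry is
# fabricated here to demonstrate filtering.
sample = json.loads("""
{
  "status": "success",
  "data": {
    "activeTargets": [
      {"health": "up",   "labels": {"instance": "10.39.0.53:9100", "job": "node"}},
      {"health": "down", "labels": {"instance": "10.39.0.45:9100", "job": "node"}}
    ]
  }
}
""")

def unhealthy(targets_payload):
    """Return the instance labels of targets whose health is not 'up'."""
    return [t["labels"]["instance"]
            for t in targets_payload["data"]["activeTargets"]
            if t["health"] != "up"]

print(unhealthy(sample))  # ['10.39.0.45:9100']
```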
Alertmanagers can be queried in the same way, via /api/v1/alertmanagers. For Prometheus's (1.x) local storage, there are also some key self-metrics worth monitoring:
prometheus_local_storage_memory_series: the number of series currently held in memory.
prometheus_local_storage_open_head_chunks: the number of open head chunks.
prometheus_local_storage_chunks_to_persist: the number of memory chunks still waiting to be persisted to disk.
prometheus_local_storage_memory_chunks: the number of chunks currently in memory. Subtracting the previous two gives the number of persisted chunks (which are evictable if no query is currently using them).
prometheus_local_storage_series_chunks_persisted: a histogram of the number of chunks persisted per batch.
prometheus_local_storage_rushed_mode: 1 if Prometheus is in "rushed mode", 0 otherwise. Can be used to calculate the percentage of time Prometheus spends in rushed mode.
prometheus_local_storage_checkpoint_last_duration_seconds: how long the last checkpoint took.
prometheus_local_storage_checkpoint_last_size_bytes: the size of the last checkpoint in bytes.
prometheus_local_storage_checkpointing: 1 while Prometheus is checkpointing, 0 otherwise. Can be used to calculate the percentage of time Prometheus spends checkpointing.
prometheus_local_storage_inconsistencies_total: a counter of storage inconsistencies found. If it is greater than 0, restart the server to trigger recovery.
prometheus_local_storage_persist_errors_total: a counter of persist errors.
prometheus_local_storage_memory_dirty_series: the current number of dirty series.
process_resident_memory_bytes: broadly speaking, the physical memory occupied by the Prometheus process.
go_memstats_alloc_bytes: the Go heap size (allocated objects in use, plus allocated objects no longer in use but not yet garbage-collected).
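The relationships described above can be expressed directly in PromQL; a sketch, using the 1.x local-storage metric names listed above:

# Number of persisted (evictable) chunks:
prometheus_local_storage_memory_chunks
  - prometheus_local_storage_chunks_to_persist
  - prometheus_local_storage_open_head_chunks

# Fraction of time spent in rushed mode over the last hour:
avg_over_time(prometheus_local_storage_rushed_mode[1h])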
Another advanced Prometheus feature is federation: deploy one Prometheus in each data center as a slave, then aggregate their data through a federating master.
scrape_configs:
  - job_name: dc_prometheus
    honor_labels: true
    metrics_path: /federate
    params:
      'match[]':
        - '{__name__=~"^job:.*"}' # Request all job-level time series
    static_configs:
      - targets:
        - dc1-prometheus:9090
        - dc2-prometheus:9090
And if a single server cannot store the full volume, scraping itself can be sharded:
global:
external_labels:
slave: 1 # This is the 2nd slave. This prevents clashes between slaves.
scrape_configs:
- job_name: some_job
# Add usual service discovery here, such as static_configs
relabel_configs:
- source_labels: [__address__]
modulus: 4 # 4 slaves
target_label: __tmp_hash
action: hashmod
- source_labels: [__tmp_hash]
regex: ^1$ # This is the 2nd slave
action: keep
The configuration above uses hashmod relabeling to decide which targets each Prometheus slave is responsible for.
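The hashmod action can be sketched in Python. Prometheus hashes the concatenated source label values with MD5 and takes the modulus, so the following is an illustrative approximation of how targets fan out across 4 slaves (the target addresses are sample values):

```python
import hashlib

def hashmod(value, modulus):
    """Approximate Prometheus's hashmod relabel action: MD5-hash the
    source label value and take the result modulo `modulus`."""
    digest = hashlib.md5(value.encode()).digest()
    # Prometheus derives a uint64 from the MD5 sum before taking the mod.
    return int.from_bytes(digest[8:], "big") % modulus

targets = ["10.39.0.45:9100", "10.39.0.53:9100", "10.39.0.60:9100"]
slave_number = 1  # corresponds to the `regex: ^1$` keep rule above
mine = [t for t in targets if hashmod(t, 4) == slave_number]
print(mine)
```

Every slave evaluates the same hash, so each target is kept by exactly one of them.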
scrape_configs:
  - job_name: slaves
    honor_labels: true
    metrics_path: /federate
    params:
      'match[]':
        - '{__name__=~"^slave:.*"}' # Request all slave-level time series
    static_configs:
      - targets:
        - slave0:9090
        - slave1:9090
        - slave2:9090
        - slave3:9090
This master configuration federates from the multiple slaves defined above, so the data is stored in shards across them.