This content is based on the official Longhorn 1.1.2 English technical manual.

Table of Contents

  1. Setting up Prometheus and Grafana to monitor Longhorn
  2. Integrating Longhorn metrics into the Rancher monitoring system
  3. Longhorn monitoring metrics
  4. Support for Kubelet Volume metrics
  5. Longhorn alert rule examples

Setting up Prometheus and Grafana to monitor Longhorn

Overview

Longhorn natively exposes metrics in Prometheus text format on the REST endpoint http://LONGHORN_MANAGER_IP:PORT/metrics.
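
If you want to eyeball the raw metrics before wiring up any tooling, you can query the endpoint directly. A minimal sketch, assuming the default longhorn-system namespace and that the manager listens on port 9500 (verify the port against your longhorn-backend service):

# List the Longhorn manager pods and their IPs
kubectl -n longhorn-system get pods -l app=longhorn-manager -o wide
# Fetch the raw metrics from one manager pod; POD_IP is a placeholder
curl http://POD_IP:9500/metrics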

For a description of all available metrics, see Longhorn's metrics.

You can use any collecting tool, such as Prometheus, Graphite, or Telegraf, to scrape these metrics, then visualize the collected data with a tool such as Grafana.

This document presents an example setup for monitoring Longhorn. The monitoring system uses Prometheus for collecting data and alerting, and Grafana for visualizing/dashboarding the collected data. At a high level, the monitoring system contains:

  • The Prometheus server scrapes and stores time-series data from the Longhorn metrics endpoint. Prometheus is also responsible for generating alerts based on the configured rules and the collected data, and then sends the alerts to the Alertmanager.
  • The AlertManager then manages those alerts, including silencing, inhibition, aggregation, and sending out notifications via methods such as email, on-call notification systems, and chat platforms.
  • Grafana queries data from the Prometheus server and draws dashboards for visualization.

The following diagram describes the detailed architecture of the monitoring system.

There are 2 unmentioned components in the above figure:

  • The Longhorn backend service is the service pointing to the set of Longhorn manager pods. Longhorn's metrics are exposed in the Longhorn manager pods at the endpoint http://LONGHORN_MANAGER_IP:PORT/metrics.
  • The Prometheus operator makes running Prometheus on top of Kubernetes very easy. The operator watches 3 custom resources: ServiceMonitor, Prometheus, and AlertManager. When users create those custom resources, the Prometheus operator deploys and manages the Prometheus server and AlertManager with the user-specified configurations.

Installation

These instructions install all components into the monitoring namespace. To install them into a different namespace, change the field namespace: OTHER_NAMESPACE.

Create the monitoring namespace

apiVersion: v1
kind: Namespace
metadata:
  name: monitoring
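
Save the manifest to a file and apply it with kubectl; the file name below is arbitrary:

kubectl apply -f monitoring-namespace.yaml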

Install the Prometheus Operator

Deploy the Prometheus Operator together with its required ClusterRole, ClusterRoleBinding, and Service Account:

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  labels:
    app.kubernetes.io/component: controller
    app.kubernetes.io/name: prometheus-operator
    app.kubernetes.io/version: v0.38.3
  name: prometheus-operator
  namespace: monitoring
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus-operator
subjects:
- kind: ServiceAccount
  name: prometheus-operator
  namespace: monitoring
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  labels:
    app.kubernetes.io/component: controller
    app.kubernetes.io/name: prometheus-operator
    app.kubernetes.io/version: v0.38.3
  name: prometheus-operator
  namespace: monitoring
rules:
- apiGroups:
  - apiextensions.k8s.io
  resources:
  - customresourcedefinitions
  verbs:
  - create
- apiGroups:
  - apiextensions.k8s.io
  resourceNames:
  - alertmanagers.monitoring.coreos.com
  - podmonitors.monitoring.coreos.com
  - prometheuses.monitoring.coreos.com
  - prometheusrules.monitoring.coreos.com
  - servicemonitors.monitoring.coreos.com
  - thanosrulers.monitoring.coreos.com
  resources:
  - customresourcedefinitions
  verbs:
  - get
  - update
- apiGroups:
  - monitoring.coreos.com
  resources:
  - alertmanagers
  - alertmanagers/finalizers
  - prometheuses
  - prometheuses/finalizers
  - thanosrulers
  - thanosrulers/finalizers
  - servicemonitors
  - podmonitors
  - prometheusrules
  verbs:
  - '*'
- apiGroups:
  - apps
  resources:
  - statefulsets
  verbs:
  - '*'
- apiGroups:
  - ""
  resources:
  - configmaps
  - secrets
  verbs:
  - '*'
- apiGroups:
  - ""
  resources:
  - pods
  verbs:
  - list
  - delete
- apiGroups:
  - ""
  resources:
  - services
  - services/finalizers
  - endpoints
  verbs:
  - get
  - create
  - update
  - delete
- apiGroups:
  - ""
  resources:
  - nodes
  verbs:
  - list
  - watch
- apiGroups:
  - ""
  resources:
  - namespaces
  verbs:
  - get
  - list
  - watch
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app.kubernetes.io/component: controller
    app.kubernetes.io/name: prometheus-operator
    app.kubernetes.io/version: v0.38.3
  name: prometheus-operator
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/component: controller
      app.kubernetes.io/name: prometheus-operator
  template:
    metadata:
      labels:
        app.kubernetes.io/component: controller
        app.kubernetes.io/name: prometheus-operator
        app.kubernetes.io/version: v0.38.3
    spec:
      containers:
      - args:
        - --kubelet-service=kube-system/kubelet
        - --logtostderr=true
        - --config-reloader-image=jimmidyson/configmap-reload:v0.3.0
        - --prometheus-config-reloader=quay.io/prometheus-operator/prometheus-config-reloader:v0.38.3
        image: quay.io/prometheus-operator/prometheus-operator:v0.38.3
        name: prometheus-operator
        ports:
        - containerPort: 8080
          name: http
        resources:
          limits:
            cpu: 200m
            memory: 200Mi
          requests:
            cpu: 100m
            memory: 100Mi
        securityContext:
          allowPrivilegeEscalation: false
      nodeSelector:
        beta.kubernetes.io/os: linux
      securityContext:
        runAsNonRoot: true
        runAsUser: 65534
      serviceAccountName: prometheus-operator
---
apiVersion: v1
kind: ServiceAccount
metadata:
  labels:
    app.kubernetes.io/component: controller
    app.kubernetes.io/name: prometheus-operator
    app.kubernetes.io/version: v0.38.3
  name: prometheus-operator
  namespace: monitoring
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app.kubernetes.io/component: controller
    app.kubernetes.io/name: prometheus-operator
    app.kubernetes.io/version: v0.38.3
  name: prometheus-operator
  namespace: monitoring
spec:
  clusterIP: None
  ports:
  - name: http
    port: 8080
    targetPort: http
  selector:
    app.kubernetes.io/component: controller
    app.kubernetes.io/name: prometheus-operator
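
Before moving on, it may be worth confirming that the operator pod came up; a quick check using the labels from the manifest above:

kubectl -n monitoring get pods -l app.kubernetes.io/name=prometheus-operator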

Install the Longhorn ServiceMonitor

The Longhorn ServiceMonitor has a label selector app: longhorn-manager to select the Longhorn backend service.

Later on, the Prometheus CRD can include the Longhorn ServiceMonitor so that the Prometheus server can discover all Longhorn manager pods and their endpoints.

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: longhorn-prometheus-servicemonitor
  namespace: monitoring
  labels:
    name: longhorn-prometheus-servicemonitor
spec:
  selector:
    matchLabels:
      app: longhorn-manager
  namespaceSelector:
    matchNames:
    - longhorn-system
  endpoints:
  - port: manager
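
After applying the manifest, you can verify that the ServiceMonitor exists:

kubectl -n monitoring get servicemonitors longhorn-prometheus-servicemonitor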

Install and configure Prometheus AlertManager

  1. Create a highly available Alertmanager deployment with 3 instances:

    apiVersion: monitoring.coreos.com/v1
    kind: Alertmanager
    metadata:
      name: longhorn
      namespace: monitoring
    spec:
      replicas: 3
  2. The Alertmanager instances will not start unless a valid configuration is given. See here for more explanation of the Alertmanager configuration. The code below gives an example configuration:

    global:
      resolve_timeout: 5m
    route:
      group_by: [alertname]
      receiver: email_and_slack
    receivers:
    - name: email_and_slack
      email_configs:
      - to: <the email address to send notifications to>
        from: <the sender address>
        smarthost: <the SMTP host through which emails are sent>
        # SMTP authentication information.
        auth_username: <the username>
        auth_identity: <the identity>
        auth_password: <the password>
        headers:
          subject: 'Longhorn-Alert'
        text: |-
          {{ range .Alerts }}
          *Alert:* {{ .Annotations.summary }} - `{{ .Labels.severity }}`
          *Description:* {{ .Annotations.description }}
          *Details:*
          {{ range .Labels.SortedPairs }} • *{{ .Name }}:* `{{ .Value }}`
          {{ end }}
          {{ end }}
      slack_configs:
      - api_url: <the Slack webhook URL>
        channel: <the channel or user to send notifications to>
        text: |-
          {{ range .Alerts }}
          *Alert:* {{ .Annotations.summary }} - `{{ .Labels.severity }}`
          *Description:* {{ .Annotations.description }}
          *Details:*
          {{ range .Labels.SortedPairs }} • *{{ .Name }}:* `{{ .Value }}`
          {{ end }}
          {{ end }}

    Save the above Alertmanager config in a file called alertmanager.yaml and create a secret from it using kubectl.

    Alertmanager instances require the secret resource naming to follow the format alertmanager-{ALERTMANAGER_NAME}.

    In the previous step, the name of the Alertmanager is longhorn, so the secret name must be alertmanager-longhorn:

    $ kubectl create secret generic alertmanager-longhorn --from-file=alertmanager.yaml -n monitoring
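
    Optionally, confirm that the secret was created:

    $ kubectl -n monitoring get secret alertmanager-longhorn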
  3. To be able to view the web UI of the Alertmanager, expose it through a Service. A simple way is to use a Service of type NodePort:

    apiVersion: v1
    kind: Service
    metadata:
      name: alertmanager-longhorn
      namespace: monitoring
    spec:
      type: NodePort
      ports:
      - name: web
        nodePort: 30903
        port: 9093
        protocol: TCP
        targetPort: web
      selector:
        alertmanager: longhorn

    After creating the above service, you can access the web UI of the Alertmanager via a node's IP and the port 30903.

    Use the above NodePort service for quick verification only, because it doesn't communicate over a TLS connection. You may want to change the service type to ClusterIP and set up an Ingress controller to expose the web UI of the Alertmanager over a TLS connection.

Install and configure the Prometheus server

  1. Create a PrometheusRule custom resource that defines the alert conditions:

    apiVersion: monitoring.coreos.com/v1
    kind: PrometheusRule
    metadata:
      labels:
        prometheus: longhorn
        role: alert-rules
      name: prometheus-longhorn-rules
      namespace: monitoring
    spec:
      groups:
      - name: longhorn.rules
        rules:
        - alert: LonghornVolumeUsageCritical
          annotations:
            description: Longhorn volume {{$labels.volume}} on {{$labels.node}} is at {{$value}}% used for
              more than 5 minutes.
            summary: Longhorn volume capacity is over 90% used.
          expr: 100 * (longhorn_volume_actual_size_bytes / longhorn_volume_capacity_bytes) > 90
          for: 5m
          labels:
            issue: Longhorn volume {{$labels.volume}} usage on {{$labels.node}} is critical.
            severity: critical

    See https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/#alerting-rules for more information on how to define alert rules.

  2. If RBAC authorization is activated, create a ClusterRole and ClusterRoleBinding for the Prometheus Pods:

    apiVersion: v1
    kind: ServiceAccount
    metadata:
      name: prometheus
      namespace: monitoring
    ---
    apiVersion: rbac.authorization.k8s.io/v1beta1
    kind: ClusterRole
    metadata:
      name: prometheus
      namespace: monitoring
    rules:
    - apiGroups: [""]
      resources:
      - nodes
      - services
      - endpoints
      - pods
      verbs: ["get", "list", "watch"]
    - apiGroups: [""]
      resources:
      - configmaps
      verbs: ["get"]
    - nonResourceURLs: ["/metrics"]
      verbs: ["get"]
    ---
    apiVersion: rbac.authorization.k8s.io/v1beta1
    kind: ClusterRoleBinding
    metadata:
      name: prometheus
    roleRef:
      apiGroup: rbac.authorization.k8s.io
      kind: ClusterRole
      name: prometheus
    subjects:
    - kind: ServiceAccount
      name: prometheus
      namespace: monitoring
  3. Create a Prometheus custom resource. Notice that we select the Longhorn service monitor and Longhorn rules in the spec:

    apiVersion: monitoring.coreos.com/v1
    kind: Prometheus
    metadata:
      name: prometheus
      namespace: monitoring
    spec:
      replicas: 2
      serviceAccountName: prometheus
      alerting:
        alertmanagers:
        - namespace: monitoring
          name: alertmanager-longhorn
          port: web
      serviceMonitorSelector:
        matchLabels:
          name: longhorn-prometheus-servicemonitor
      ruleSelector:
        matchLabels:
          prometheus: longhorn
          role: alert-rules
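
    Once this resource is applied, the operator spins up the Prometheus server pods, which carry the prometheus: prometheus label (the same label the NodePort service in the next step selects). A quick check:

    $ kubectl -n monitoring get pods -l prometheus=prometheus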
  4. To be able to view the web UI of the Prometheus server, expose it through a Service. A simple way is to use a Service of type NodePort:

    apiVersion: v1
    kind: Service
    metadata:
      name: prometheus
      namespace: monitoring
    spec:
      type: NodePort
      ports:
      - name: web
        nodePort: 30904
        port: 9090
        protocol: TCP
        targetPort: web
      selector:
        prometheus: prometheus

    After creating the above service, you can access the web UI of the Prometheus server via a node's IP and the port 30904.

    At this point, you should be able to see all Longhorn manager targets as well as the Longhorn rules in the targets and rules sections of the Prometheus server UI.

    Use the above NodePort service for quick verification only, because it doesn't communicate over a TLS connection. You may want to change the service type to ClusterIP and set up an Ingress controller to expose the web UI of the Prometheus server over a TLS connection.
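
    Alternatively, the discovered targets can be inspected from the command line through the Prometheus HTTP API; NODE_IP below is a placeholder for one of your node IPs:

    $ curl http://NODE_IP:30904/api/v1/targets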

Install Grafana

  1. Create the Grafana datasource config:

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: grafana-datasources
      namespace: monitoring
    data:
      prometheus.yaml: |-
        {
            "apiVersion": 1,
            "datasources": [
                {
                    "access": "proxy",
                    "editable": true,
                    "name": "prometheus",
                    "orgId": 1,
                    "type": "prometheus",
                    "url": "http://prometheus:9090",
                    "version": 1
                }
            ]
        }
  2. Create the Grafana deployment:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: grafana
      namespace: monitoring
      labels:
        app: grafana
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: grafana
      template:
        metadata:
          name: grafana
          labels:
            app: grafana
        spec:
          containers:
          - name: grafana
            image: grafana/grafana:7.1.5
            ports:
            - name: grafana
              containerPort: 3000
            resources:
              limits:
                memory: "500Mi"
                cpu: "300m"
              requests:
                memory: "500Mi"
                cpu: "200m"
            volumeMounts:
            - mountPath: /var/lib/grafana
              name: grafana-storage
            - mountPath: /etc/grafana/provisioning/datasources
              name: grafana-datasources
              readOnly: false
          volumes:
          - name: grafana-storage
            emptyDir: {}
          - name: grafana-datasources
            configMap:
              defaultMode: 420
              name: grafana-datasources
  3. Expose Grafana on NodePort 32000:

    apiVersion: v1
    kind: Service
    metadata:
      name: grafana
      namespace: monitoring
    spec:
      selector:
        app: grafana
      type: NodePort
      ports:
      - port: 3000
        targetPort: 3000
        nodePort: 32000

    Use the above NodePort service for quick verification only, because it doesn't communicate over a TLS connection. You may want to change the service type to ClusterIP and set up an Ingress controller to expose Grafana over a TLS connection.

  4. Access the Grafana dashboard using any node IP on port 32000. The default credentials are:

    User: admin
    Pass: admin
  5. Install the Longhorn dashboard.

    Once inside Grafana, import the prebuilt dashboard: https://grafana.com/grafana/dashboards/13032

    See https://grafana.com/docs/grafana/latest/reference/export_import/ for instructions on how to import a Grafana dashboard.

    Upon success, you should see the imported Longhorn dashboard.

Integrating Longhorn metrics into the Rancher monitoring system

About the Rancher monitoring system

Using Rancher, you can monitor the state and processes of your cluster nodes, Kubernetes components, and software deployments through its integration with Prometheus, a leading open-source monitoring solution.

See https://rancher.com/docs/rancher/v2.x/en/monitoring-alerting/ for instructions on how to deploy/enable the Rancher monitoring system.

Longhorn 指標新增到 Rancher 監控系統

If you use Rancher to manage your Kubernetes and have already enabled Rancher monitoring, you can add Longhorn metrics to Rancher monitoring by simply deploying the following ServiceMonitor:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: longhorn-prometheus-servicemonitor
  namespace: longhorn-system
  labels:
    name: longhorn-prometheus-servicemonitor
spec:
  selector:
    matchLabels:
      app: longhorn-manager
  namespaceSelector:
    matchNames:
    - longhorn-system
  endpoints:
  - port: manager

Once the ServiceMonitor is created, Rancher will automatically discover all Longhorn metrics.

You can then set up a Grafana dashboard for visualization.
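
As a quick sanity check, you can run a simple PromQL query in the Rancher-managed Prometheus or Grafana UI, for example:

longhorn_volume_robustness

If the query returns time series, the Longhorn metrics are being collected.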

Longhorn monitoring metrics

Volume

| Metric name | Description | Example |
| --- | --- | --- |
| longhorn_volume_actual_size_bytes | The actual space used by each replica of the volume on the corresponding node | longhorn_volume_actual_size_bytes{node="worker-2",volume="testvol"} 1.1917312e+08 |
| longhorn_volume_capacity_bytes | The configured size of this volume, in bytes | longhorn_volume_capacity_bytes{node="worker-2",volume="testvol"} 6.442450944e+09 |
| longhorn_volume_state | The state of this volume: 1=creating, 2=attached, 3=detached, 4=attaching, 5=detaching, 6=deleting | longhorn_volume_state{node="worker-2",volume="testvol"} 2 |
| longhorn_volume_robustness | The robustness of this volume: 0=unknown, 1=healthy, 2=degraded, 3=faulted | longhorn_volume_robustness{node="worker-2",volume="testvol"} 1 |
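
These metrics combine naturally in PromQL. For example, the expression below (the same pattern the alert rules at the end of this document use) computes how full each volume is, as a percentage of its configured capacity:

100 * longhorn_volume_actual_size_bytes / longhorn_volume_capacity_bytes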

Node

| Metric name | Description | Example |
| --- | --- | --- |
| longhorn_node_status | The status of this node: 1=true, 0=false | longhorn_node_status{condition="ready",condition_reason="",node="worker-2"} 1 |
| longhorn_node_count_total | The total number of nodes in the Longhorn system | longhorn_node_count_total 4 |
| longhorn_node_cpu_capacity_millicpu | The maximum allocatable CPU on this node | longhorn_node_cpu_capacity_millicpu{node="worker-2"} 2000 |
| longhorn_node_cpu_usage_millicpu | The CPU usage on this node | longhorn_node_cpu_usage_millicpu{node="worker-2"} 186 |
| longhorn_node_memory_capacity_bytes | The maximum allocatable memory on this node | longhorn_node_memory_capacity_bytes{node="worker-2"} 4.031229952e+09 |
| longhorn_node_memory_usage_bytes | The memory usage on this node | longhorn_node_memory_usage_bytes{node="worker-2"} 1.833582592e+09 |
| longhorn_node_storage_capacity_bytes | The storage capacity of this node | longhorn_node_storage_capacity_bytes{node="worker-3"} 8.3987283968e+10 |
| longhorn_node_storage_usage_bytes | The used storage of this node | longhorn_node_storage_usage_bytes{node="worker-3"} 9.060941824e+09 |
| longhorn_node_storage_reservation_bytes | The storage reserved for other applications and the system on this node | longhorn_node_storage_reservation_bytes{node="worker-3"} 2.519618519e+10 |

Disk

| Metric name | Description | Example |
| --- | --- | --- |
| longhorn_disk_capacity_bytes | The storage capacity of this disk | longhorn_disk_capacity_bytes{disk="default-disk-8b28ee3134628183",node="worker-3"} 8.3987283968e+10 |
| longhorn_disk_usage_bytes | The used storage of this disk | longhorn_disk_usage_bytes{disk="default-disk-8b28ee3134628183",node="worker-3"} 9.060941824e+09 |
| longhorn_disk_reservation_bytes | The storage reserved for other applications and the system on this disk | longhorn_disk_reservation_bytes{disk="default-disk-8b28ee3134628183",node="worker-3"} 2.519618519e+10 |

Instance Manager

| Metric name | Description | Example |
| --- | --- | --- |
| longhorn_instance_manager_cpu_usage_millicpu | The CPU usage of this Longhorn instance manager | longhorn_instance_manager_cpu_usage_millicpu{instance_manager="instance-manager-e-2189ed13",instance_manager_type="engine",node="worker-2"} 80 |
| longhorn_instance_manager_cpu_requests_millicpu | The requested CPU resources in Kubernetes of this Longhorn instance manager | longhorn_instance_manager_cpu_requests_millicpu{instance_manager="instance-manager-e-2189ed13",instance_manager_type="engine",node="worker-2"} 250 |
| longhorn_instance_manager_memory_usage_bytes | The memory usage of this Longhorn instance manager | longhorn_instance_manager_memory_usage_bytes{instance_manager="instance-manager-e-2189ed13",instance_manager_type="engine",node="worker-2"} 2.4072192e+07 |
| longhorn_instance_manager_memory_requests_bytes | The requested memory in Kubernetes of this Longhorn instance manager | longhorn_instance_manager_memory_requests_bytes{instance_manager="instance-manager-e-2189ed13",instance_manager_type="engine",node="worker-2"} 0 |

Manager

| Metric name | Description | Example |
| --- | --- | --- |
| longhorn_manager_cpu_usage_millicpu | The CPU usage of this Longhorn manager | longhorn_manager_cpu_usage_millicpu{manager="longhorn-manager-5rx2n",node="worker-2"} 27 |
| longhorn_manager_memory_usage_bytes | The memory usage of this Longhorn manager | longhorn_manager_memory_usage_bytes{manager="longhorn-manager-5rx2n",node="worker-2"} 2.6144768e+07 |

Support for Kubelet Volume metrics

About Kubelet Volume metrics

The kubelet exposes the following metrics:

  1. kubelet_volume_stats_capacity_bytes
  2. kubelet_volume_stats_available_bytes
  3. kubelet_volume_stats_used_bytes
  4. kubelet_volume_stats_inodes
  5. kubelet_volume_stats_inodes_free
  6. kubelet_volume_stats_inodes_used

Those metrics measure information related to the PVC's filesystem inside a Longhorn block device.

They are different from the longhorn_volume_* metrics, which measure information specific to a Longhorn block device.

You can set up a monitoring system that scrapes the kubelet metrics endpoint to obtain a PVC's status and set up alerts for abnormal events, such as a PVC being about to run out of storage space.

A popular monitoring setup is prometheus-operator/kube-prometheus-stack, which scrapes kubelet_volume_stats_* metrics and provides dashboards and alert rules for them.
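
For example, a PromQL expression in the same spirit (a sketch; the 90% threshold is illustrative) that flags a PVC about to run out of space is:

100 * kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes > 90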

Longhorn CSI plugin support

v1.1.0 中,Longhorn CSI 外掛根據 CSI spec 支援 NodeGetVolumeStats RPC。

This allows the kubelet to query the Longhorn CSI plugin for a PVC's status.

The kubelet then exposes that information in the kubelet_volume_stats_* metrics.
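
You can inspect these metrics directly through the API server's node proxy; NODE_NAME below is a placeholder:

kubectl get --raw /api/v1/nodes/NODE_NAME/proxy/metrics | grep kubelet_volume_stats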

Longhorn alert rule examples

We provide a couple of example Longhorn alert rules below for your reference. See here for a list of all available Longhorn metrics to build your own alert rules.

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    prometheus: longhorn
    role: alert-rules
  name: prometheus-longhorn-rules
  namespace: monitoring
spec:
  groups:
  - name: longhorn.rules
    rules:
    - alert: LonghornVolumeActualSpaceUsedWarning
      annotations:
        description: The actual space used by Longhorn volume {{$labels.volume}} on {{$labels.node}} is at {{$value}}% capacity for
          more than 5 minutes.
        summary: The actual used space of Longhorn volume is over 90% of the capacity.
      expr: (longhorn_volume_actual_size_bytes / longhorn_volume_capacity_bytes) * 100 > 90
      for: 5m
      labels:
        issue: The actual used space of Longhorn volume {{$labels.volume}} on {{$labels.node}} is high.
        severity: warning
    - alert: LonghornVolumeStatusCritical
      annotations:
        description: Longhorn volume {{$labels.volume}} on {{$labels.node}} is Faulted for
          more than 5 minutes.
        summary: Longhorn volume {{$labels.volume}} is Faulted
      expr: longhorn_volume_robustness == 3
      for: 5m
      labels:
        issue: Longhorn volume {{$labels.volume}} is Faulted.
        severity: critical
    - alert: LonghornVolumeStatusWarning
      annotations:
        description: Longhorn volume {{$labels.volume}} on {{$labels.node}} is Degraded for
          more than 5 minutes.
        summary: Longhorn volume {{$labels.volume}} is Degraded
      expr: longhorn_volume_robustness == 2
      for: 5m
      labels:
        issue: Longhorn volume {{$labels.volume}} is Degraded.
        severity: warning
    - alert: LonghornNodeStorageWarning
      annotations:
        description: The used storage of node {{$labels.node}} is at {{$value}}% capacity for
          more than 5 minutes.
        summary: The used storage of node is over 70% of the capacity.
      expr: (longhorn_node_storage_usage_bytes / longhorn_node_storage_capacity_bytes) * 100 > 70
      for: 5m
      labels:
        issue: The used storage of node {{$labels.node}} is high.
        severity: warning
    - alert: LonghornDiskStorageWarning
      annotations:
        description: The used storage of disk {{$labels.disk}} on node {{$labels.node}} is at {{$value}}% capacity for
          more than 5 minutes.
        summary: The used storage of disk is over 70% of the capacity.
      expr: (longhorn_disk_usage_bytes / longhorn_disk_capacity_bytes) * 100 > 70
      for: 5m
      labels:
        issue: The used storage of disk {{$labels.disk}} on node {{$labels.node}} is high.
        severity: warning
    - alert: LonghornNodeDown
      annotations:
        description: There are {{$value}} Longhorn nodes which have been offline for more than 5 minutes.
        summary: Longhorn nodes are offline
      expr: longhorn_node_count_total - (count(longhorn_node_status{condition="ready"}==1) OR on() vector(0))
      for: 5m
      labels:
        issue: There are {{$value}} Longhorn nodes offline
        severity: critical
    - alert: LonghornInstanceManagerCPUUsageWarning
      annotations:
        description: The CPU usage / CPU request of Longhorn instance manager {{$labels.instance_manager}} on {{$labels.node}} is at {{$value}}% for
          more than 5 minutes.
        summary: The CPU usage / CPU request of Longhorn instance manager {{$labels.instance_manager}} on {{$labels.node}} is over 300%.
      expr: (longhorn_instance_manager_cpu_usage_millicpu/longhorn_instance_manager_cpu_requests_millicpu) * 100 > 300
      for: 5m
      labels:
        issue: Longhorn instance manager {{$labels.instance_manager}} on {{$labels.node}} consumes 3 times the CPU request.
        severity: warning
    - alert: LonghornNodeCPUUsageWarning
      annotations:
        description: The CPU usage / CPU capacity of Longhorn node {{$labels.node}} is at {{$value}}% for
          more than 5 minutes.
        summary: Longhorn node {{$labels.node}} experiences high CPU pressure for more than 5m.
      expr: (longhorn_node_cpu_usage_millicpu / longhorn_node_cpu_capacity_millicpu) * 100 > 90
      for: 5m
      labels:
        issue: Longhorn node {{$labels.node}} experiences high CPU pressure.
        severity: warning

See https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/#alerting-rules for more information on how to define alert rules.
