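Preface

In the previous post I covered how Prometheus monitors MySQL.

This post shows how to send alert emails through Alertmanager (WeChat and DingTalk work in a similar way, but they require an enterprise account, so I did not try them) and how to view the alerts in Grafana.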

Walkthrough

Test machines

Prometheus: 192.168.56.140

Host01: 192.168.56.103

Install Alertmanager

Download the release package

wget https://github.com/prometheus/alertmanager/releases/download/v0.22.2/alertmanager-0.22.2.linux-amd64.tar.gz

Create the directories

mkdir -p /etc/alertmanager/

mkdir -p /etc/alertmanager/data

mkdir -p /etc/alertmanager/template/

Download the email template

[root@prometheus-server template]# pwd

/etc/alertmanager/template

[root@prometheus-server template]# wget https://raw.githubusercontent.com/prometheus/alertmanager/master/template/default.tmpl

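Extract the archive and copy the files to /etc/alertmanager

[root@prometheus-server ftpusr]# tar -xzf alertmanager-0.22.2.linux-amd64.tar.gz

[root@prometheus-server ftpusr]# cp ./alertmanager-0.22.2.linux-amd64/alertmanager* /etc/alertmanager/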

Configure the systemd service

[root@prometheus-server alertmanager]# cat /etc/systemd/system/alertmanager.service
[Unit]
Description=Alertmanager
After=network.target

[Service]
Type=simple
User=prometheus
ExecStart=/etc/alertmanager/alertmanager \
  --config.file=/etc/alertmanager/alertmanager.yml \
  --storage.path=/etc/alertmanager/data
Restart=on-failure

[Install]
WantedBy=multi-user.target

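Configure Alertmanager to send email

I use a 163.com mailbox to send the alert emails.

SMTP access has to be enabled on the mailbox first. Once it is enabled, generate an authorization code. The smtp_auth_password field in the configuration file below takes that authorization code, not the mailbox login password.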

[root@prometheus-server alertmanager]# cat alertmanager.yml

global:
  smtp_smarthost: 'smtp.163.com:25'
  smtp_from: 'xxxx@163.com'
  smtp_auth_username: 'xxxx@163.com'
  smtp_auth_password: 'xxxxxxxxxxx'
  smtp_require_tls: false

templates:
  - '/etc/alertmanager/template/*.tmpl'

route:
  group_by: ['alertname','cluster','service']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 10m
  receiver: 'default-receiver'

receivers:
  - name: 'default-receiver'
    email_configs:
      - to: '20889922@qq.com'
        html: '{{ template "email.default.html" . }}'
        headers: { Subject: "Prometheus alert test email" }
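
Before starting the service, it can help to validate the configuration file. The following is a minimal sketch, assuming the amtool binary shipped in the same release tarball is available. The cp command earlier copies only files matching alertmanager*, so amtool may still sit in the extracted directory. The function name and the AMTOOL variable are hypothetical helpers, not part of Alertmanager.

```shell
# Validate an Alertmanager configuration file with amtool.
# Argument: [config path]; defaults to the path used in this post.
# AMTOOL (hypothetical variable) may point at the amtool binary if it is not on PATH.
check_alertmanager_config() {
  local cfg=${1:-/etc/alertmanager/alertmanager.yml}
  local amtool_bin=${AMTOOL:-amtool}
  if ! command -v "$amtool_bin" >/dev/null 2>&1; then
    echo "amtool not found: $amtool_bin" >&2
    return 127
  fi
  # check-config parses the file and the templates it references.
  "$amtool_bin" check-config "$cfg"
}

# Example, using the extracted release directory from the install step:
#   AMTOOL=./alertmanager-0.22.2.linux-amd64/amtool check_alertmanager_config
```

A non-zero exit status means the file should be fixed before the service is started.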

Start the service

service alertmanager start
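
Since the unit file is new, systemd normally needs a daemon-reload before it sees it. After that, a quick poll of Alertmanager's /-/healthy endpoint confirms the process is up. This is a rough sketch, assuming curl is installed; the function name and the default values are my own choices.

```shell
# Poll Alertmanager's health endpoint until it answers or the attempts run out.
# Arguments: [base URL] [attempts]; both defaults are assumptions for a local install.
wait_for_alertmanager() {
  local base_url=${1:-http://localhost:9093}
  local attempts=${2:-10}
  local i=1
  while [ "$i" -le "$attempts" ]; do
    if curl -fsS --max-time 2 "$base_url/-/healthy" >/dev/null 2>&1; then
      echo "alertmanager is healthy"
      return 0
    fi
    i=$((i + 1))
    # Pause only when another attempt will follow.
    if [ "$i" -le "$attempts" ]; then
      sleep 1
    fi
  done
  echo "alertmanager did not become healthy" >&2
  return 1
}

# Typical use after creating the unit file (run as root):
#   systemctl daemon-reload
#   systemctl enable --now alertmanager
#   wait_for_alertmanager
```

If the check keeps failing, journalctl -u alertmanager usually shows why; a common cause is that the prometheus user cannot write to /etc/alertmanager/data.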

Configure Prometheus to use Alertmanager

prometheus.yml configuration

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets: ["localhost:9093"]
          # - alertmanager:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  - "rules.yml"
  # - "first_rules.yml"
  # - "second_rules.yml"

rules.yml configuration

[root@prometheus-server prometheus]# cat rules.yml
# hostStatsAlert
groups:
  - name: hostStatsAlert
    rules:
      - alert: NodeDown
        expr: up == 0
        for: 1m
        labels:
          severity: "Critical"
        annotations:
          summary: "Instance {{$labels.instance}} down"
          description: "{{$labels.instance}} of job {{$labels.job}} has been down for more than 1 minute."
      - alert: NodeCPUUsage
        expr: sum(avg without (cpu)(irate(node_cpu_seconds_total{mode!='idle'}[5m]))) by (instance) > 0.85
        for: 1m
        labels:
          severity: "Warning"
        annotations:
          summary: "Instance {{ $labels.instance }} CPU usage high"
          description: "{{ $labels.instance }} CPU usage above 85% (current value: {{ $value }})"
      - alert: NodeMemoryUsage
        expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)/node_memory_MemTotal_bytes > 0.85
        for: 1m
        labels:
          severity: "Warning"
        annotations:
          summary: "Instance {{ $labels.instance }} MEM usage high"
          description: "{{ $labels.instance }} MEM usage above 85% (current value: {{ $value }})"
      - alert: filesystemUsageAlert
        expr: 100 - ((node_filesystem_avail_bytes{mountpoint="/",fstype=~"ext4|xfs"} * 100) / node_filesystem_size_bytes{mountpoint="/",fstype=~"ext4|xfs"}) > 85
        for: 1m
        labels:
          severity: "Warning"
        annotations:
          summary: "Instance {{ $labels.instance }} root DISK usage high"
          description: "{{ $labels.instance }} root DISK usage above 85% (current value: {{ $value }})"
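
Indentation mistakes in these YAML files are easy to make, so it is worth checking them before the restart. The following sketch relies on promtool, which ships with Prometheus; promtool check config also checks the rule files listed under rule_files. The function name and the default path are assumptions, since the location of prometheus.yml depends on the install.

```shell
# Validate prometheus.yml together with the rule files it references.
# Argument: [config path]; the default is an assumed location, adjust as needed.
check_prometheus_config() {
  local cfg=${1:-/etc/prometheus/prometheus.yml}
  if ! command -v promtool >/dev/null 2>&1; then
    echo "promtool not found on PATH" >&2
    return 127
  fi
  promtool check config "$cfg"
}

# Example: restart only when the check passes.
#   check_prometheus_config && service prometheus restart
```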

Restart Prometheus so the changes take effect

service prometheus restart

View the alert email

After a few minutes, the alert email arrives.

The alerts can also be viewed in the Alertmanager web UI:

http://192.168.56.140:9093/
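
To exercise the email path without waiting for a real rule to fire, an alert can be pushed by hand through Alertmanager's v2 HTTP API. This is a sketch, assuming curl is installed; the function names, the AM_URL variable, and the label values are illustrative only.

```shell
# Build a minimal JSON payload for POST /api/v2/alerts.
# Arguments: <alertname> <severity>; values must not contain double quotes.
build_test_alert() {
  printf '[{"labels":{"alertname":"%s","severity":"%s","instance":"manual-test"}}]' "$1" "$2"
}

# Send the payload to Alertmanager. AM_URL (hypothetical variable) overrides
# the default, which is the Prometheus test machine from this post.
send_test_alert() {
  local am_url=${AM_URL:-http://192.168.56.140:9093}
  build_test_alert "$1" "$2" |
    curl -sS -X POST -H 'Content-Type: application/json' \
      --data-binary @- "$am_url/api/v2/alerts"
}

# Example (left commented out so that sourcing this file sends nothing):
#   send_test_alert ManualTest Warning
```

With the route shown earlier, the email should go out after the 10s group_wait. Because the payload carries no end time, Alertmanager treats the alert as resolved once its resolve timeout passes without an update.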

Displaying Alertmanager alerts in Grafana

Install the plugin

grafana-cli plugins install camptocamp-prometheus-alertmanager-datasource

After the installation, restart grafana-server

service grafana-server restart

Add the Alertmanager data source

Import the dashboard

Result

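Problems encountered and how I solved them

In the dashboard, the alerts panel showed two alerts, but the downnode panel showed none.

After downloading the dashboard JSON file, I found that the alertname in the rules file did not match the one used in the JSON file. Once the two matched, the panel worked.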
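
A mismatch like this can be spotted by listing the alert names in rules.yml and searching the exported dashboard JSON for each one. The following is a rough sketch; both function names and file paths are hypothetical, and the match is a plain case-sensitive substring search, so it only flags names that never appear at all.

```shell
# Print the alert names defined in a Prometheus rules file, one per line.
list_rule_alertnames() {
  awk '$1 == "-" && $2 == "alert:" { print $3 }' "$1"
}

# Report rule names that never appear in the exported dashboard JSON.
# Arguments: <rules.yml> <dashboard.json>
report_missing_alertnames() {
  local rules_file=$1 dashboard_file=$2 name
  for name in $(list_rule_alertnames "$rules_file"); do
    if ! grep -qF -- "$name" "$dashboard_file"; then
      echo "not referenced in dashboard: $name"
    fi
  done
}

# Example:
#   report_missing_alertnames rules.yml dashboard.json
```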

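The severity label displays correctly in the email but not in Grafana. I have not worked out why yet.

I will probably have to Google it. But can you imagine the pain of not being able to reach Google from China?

References: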

https://www.cnblogs.com/danny-djy/p/11097726.html

https://medium.com/devops-dudes/prometheus-alerting-with-alertmanager-e1bbba8e6a8e