Building Professional Business Status Monitoring with Nagios

Nagios · Published 2018-09-19 07:16:00

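Most companies have presumably deployed a monitoring system such as Zabbix to track server resource usage and the running state of each service. But is that kind of monitoring enough? Have you ever had the monitoring system report that everything is normal, only to find that the project cannot actually serve its users? This article describes how we use Nagios to monitor business status in a simple way.

In this article, "business" means the products that users actually reach: website pages, public API endpoints, mobile apps, and so on.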

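Thinking about monitoring

We usually deploy a monitoring system in the data center that hosts the project to watch our servers and shared services such as MySQL, define alerting policies, and send us an email or SMS when something goes wrong so that we can deal with it promptly.

This kind of monitoring focuses on two things:

Resource usage, such as load, memory, and disk space
Service status, such as Nginx status and MySQL master-slave replication status

It also has two main problems:

There is no monitoring of business status, so we cannot see directly how the business is doing right now. The servers and services may all be healthy while the business itself is down
The monitoring server and the business servers sit in the same data center. A failure of the monitoring network or congestion at the data center's ingress can prevent alerts from reaching us, and only conditions inside the data center can be monitored; the path from users to the data center's ingress cannot

So how do we solve these two problems?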

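Business status monitoring should reflect, as directly as possible, whether the business is currently working or broken. How do we monitor that? Take a web project as an example. First, we need to determine the response status of specific URLs: is it 200 OK, 404 Not Found, or something else? Second, we need to consider whether the content of the page is correct. What the user finally sees is an HTML page assembled from static resources and data from backend APIs, and verifying that this content is actually right is difficult. As a fallback, we assume the page is fine as long as all static resources and backend APIs return a normal status, which is much easier to monitor.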
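As a rough illustration of the status-first check described above, the following shell sketch maps an HTTP status code to the exit codes that Nagios plugins conventionally return (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). The function names http_state and probe_url are made up for this example, and treating redirects as a warning is an assumption. In practice check_http already covers this job.

```shell
#!/bin/sh
# Map an HTTP status code to a Nagios-style status line and exit code.
# Exit codes follow the plugin convention: 0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN.
http_state() {
    code="$1"
    case "$code" in
        000)                     echo "CRITICAL - no HTTP response"; return 2 ;;
        2[0-9][0-9])             echo "OK - HTTP $code"; return 0 ;;
        3[0-9][0-9])             echo "WARNING - HTTP $code"; return 1 ;;   # assumption: redirects only warn
        4[0-9][0-9]|5[0-9][0-9]) echo "CRITICAL - HTTP $code"; return 2 ;;
        *)                       echo "UNKNOWN - unexpected status '$code'"; return 3 ;;
    esac
}

# Fetch only the status code of a URL; curl prints 000 when no response arrives.
probe_url() {
    curl -s -o /dev/null -m 10 -w '%{http_code}' "$1"
}

# Usage (hypothetical URL):
#   http_state "$(probe_url https://ops-coffee.cn/)"
```

Keeping the classification separate from the network call makes it easy to check the mapping without touching a real site.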

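Static resources can be served directly by nginx. nginx handles concurrency very well and is generally not the performance bottleneck, and for static resources we can look at the monitoring together with ELK. Backend APIs perform far worse, so business status monitoring is mainly about monitoring the status of backend APIs. Do we need to monitor every API? That is tedious to implement and, in my view, not really necessary: monitoring a few representative APIs is enough. For example, in all of our projects we have the developers add a dedicated health check API. It connects to every service the project uses and performs an operation against each one: it queries MySQL to confirm that MySQL is serving requests, and runs get and set against Redis to confirm that Redis is healthy. Monitoring this one API therefore covers the services along the whole request path.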
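The idea behind such a health check can be sketched in shell: run one probe per dependency and report a single overall result. This is only an analogue, since the real check lives behind an HTTP endpoint in the application's own language. The function name health_check and the example probe commands are assumptions made for illustration.

```shell
#!/bin/sh
# Run every dependency probe and report one overall result.
# Each argument has the form name=command; the command's exit status
# decides whether that dependency counts as healthy.
health_check() {
    failed=""
    for probe in "$@"; do
        name="${probe%%=*}"     # text before the first '='
        cmd="${probe#*=}"       # text after the first '='
        if ! sh -c "$cmd" >/dev/null 2>&1; then
            failed="$failed $name"
        fi
    done
    if [ -n "$failed" ]; then
        echo "FAIL:$failed"
        return 1
    fi
    echo "OK"
}

# Usage (hypothetical probes; hosts and credentials are omitted):
#   health_check "mysql=mysqladmin ping" "redis=redis-cli ping"
```

Because every probe runs even after one fails, a single call shows all broken dependencies at once rather than only the first.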

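For the problem caused by having the monitoring server and the business servers in the same data center (the second problem above), we can deploy an independent status monitor in a different network environment, for example Nagios in the office network. Monitoring from a different network is also closer to the network conditions that users experience. This status monitor is separate from the resource monitoring deployed in the data center and is mainly used to monitor business status, that is, the URL and API status mentioned above.

Could we save a monitoring system by deploying all monitoring outside the data center, for example in the office or in another data center? That is not a good solution, because monitoring across networks performs poorly: latency between networks is much higher than within one data center, and the frequent data transfer for a large number of monitored items puts considerable pressure on bandwidth.

Nagios monitoring

We use Nagios for business status monitoring. Nagios is simple to deploy and flexible to configure, which makes it a very good fit for this scenario.

System environment: Debian 8
Architecture: nginx + Nagios

Deploying Nagios

1. Install the base environment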

# apt-get update
# apt-get install -y build-essential libgd2-xpm-dev autoconf gcc libc6 make wget
# apt-get install -y nginx php5-fpm spawn-fcgi fcgiwrap

2. Download, extract, and build Nagios

# wget https://assets.nagios.com/downloads/nagioscore/releases/nagios-4.0.8.tar.gz
# tar -zxvf nagios-4.0.8.tar.gz 
# cd nagios-4.0.8

# ./configure && make all

# make install-groups-users
# usermod -a -G nagios www-data

# make install
# make install-init
# make install-config
# make install-commandmode
# cd ..

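Once Nagios is installed it can be started, but the web pages are not yet accessible, and the log shows the error (No output on stdout) stderr: execvp(/usr/local/nagios/libexec/check_ping, ...) failed. errno is 2: No such file or directory . This is because we have installed only the Nagios core and not the Nagios plugins; the plugins must be installed for the core to do its work.

3. Install nagios-plugins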

# wget https://nagios-plugins.org/download/nagios-plugins-2.2.1.tar.gz
# tar -zxvf nagios-plugins-2.2.1.tar.gz
# cd nagios-plugins-2.2.1
# ./configure
# make
# make install
# cd ..

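The Nagios plugins mainly add check programs such as check_ping and check_http, installed by default under /usr/local/nagios/libexec/ . We can use them to monitor our HTTP APIs as well as host and service status.

4. Create the account and password for Nagios web access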

# vi /usr/local/bin/htpasswd.pl
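#!/usr/bin/perl

use strict;

# Print "<username>:<crypt hash>" for an htpasswd-style file.
if ( @ARGV != 2 ) {
    print "usage: /usr/local/bin/htpasswd.pl <username> <password>\n";
}
else {
    # crypt() takes its salt from the second argument; the password is reused as the salt here
    print $ARGV[0].":".crypt($ARGV[1],$ARGV[1])."\n";
}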
# chmod +x /usr/local/bin/htpasswd.pl

#Use the Perl script to write the account and password to the htpasswd.users file
# /usr/local/bin/htpasswd.pl nagiosadmin nagios@ops-coffee > /usr/local/nagios/htpasswd.users
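
The Perl script reuses the password as the crypt() salt. As an alternative sketch, not the method used in this article, the same kind of line can be produced with openssl passwd and the salted apr1 scheme, which nginx's auth_basic also accepts. The function name make_htpasswd_line is made up for this example, and it assumes the openssl command is installed.

```shell
#!/bin/sh
# Print an htpasswd-style "user:hash" line using the salted apr1 (MD5) scheme.
# The password is passed to openssl on stdin rather than as a command-line argument.
make_htpasswd_line() {
    user="$1"
    pass="$2"
    hash=$(printf '%s\n' "$pass" | openssl passwd -apr1 -stdin) || return 1
    printf '%s:%s\n' "$user" "$hash"
}

# Usage (the password is a placeholder):
#   make_htpasswd_line nagiosadmin 'change-me' > /usr/local/nagios/htpasswd.users
```

Because apr1 uses a random salt, running the function twice with the same password gives two different hashes, and both remain valid.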

/usr/local/nagios/etc/cgi.cfg

5. Add a server block to nginx so that the pages can be reached from a browser

server {
    listen 80;
    server_name ngs.domain.com;

    access_log /var/log/nginx/nagios.access.log;
    error_log /var/log/nginx/nagios.error.log;

    auth_basic "Private";
    auth_basic_user_file /usr/local/nagios/htpasswd.users;

    root /usr/local/nagios/share;
    index index.php index.html;

    location / {
        try_files $uri $uri/ index.php /nagios;
    }

    location /nagios {
        alias /usr/local/nagios/share;
    }

    location ~ \.php$ {
        include /etc/nginx/fastcgi_params;
        fastcgi_pass unix:/var/run/php5-fpm.sock;
        fastcgi_param SCRIPT_FILENAME $document_root$fastcgi_script_name;
    }

    location ~ ^/nagios/(.*\.php)$ {
        alias /usr/local/nagios/share/$1;
        include /etc/nginx/fastcgi_params;
        fastcgi_pass unix:/var/run/php5-fpm.sock;
    }

    location ~ \.cgi$ {
        root /usr/local/nagios/sbin/;
        rewrite ^/nagios/cgi-bin/(.*)\.cgi /$1.cgi break;
        fastcgi_param AUTH_USER $remote_user;
        fastcgi_param REMOTE_USER $remote_user;
        include /etc/nginx/fastcgi_params;
        fastcgi_pass unix:/var/run/fcgiwrap.socket;
    }
}

6. Check the configuration file and start the services

#Check the configuration file for syntax errors
# /usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg 

#Start the Nagios service
# /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg

#Start the fcgiwrap and php5-fpm services
# service fcgiwrap restart
# service php5-fpm restart

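7. Open the server's IP address or domain name in a browser to see the Nagios pages. Monitoring data for the local machine is present by default; if you do not need it, remove it in the configuration file localhost.cfg

Configuring Nagios

The main Nagios configuration file is /usr/local/nagios/etc/nagios.cfg . It already lists the paths of several configuration files by default: every cfg_file= entry names a configuration file that the Nagios process reads. We can add a new configuration file dedicated to monitoring HTTP APIs: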

cfg_file=/usr/local/nagios/etc/objects/check_api.cfg

The contents of check_api.cfg are as follows:

define service{
    use                     generic-service
    host_name               localhost
    service_description     web_project_01
    check_command           check_http!ops-coffee.cn -S
}

define service{
    use                     generic-service
    host_name               localhost
    service_description     web_project_02
    check_command           check_http!ops-coffee.cn -S -u / -e 200
}

define service{
    use                     generic-service
    host_name               localhost
    service_description     web_project_03
    check_command           check_http!ops-coffee.cn -S -u /action/health -k "sign:e5dhn"
}

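define service : defines a service; each page or API is one service
use : the template the service uses; the template configuration file is /usr/local/nagios/etc/objects/templates.cfg
host_name : the host the service belongs to; distinguishing hosts matters little here, so we assign everything to localhost
service_description : the description of the service; this value is shown in the Service column of the web pages, so keep it short and meaningful
check_command : the command used to check the service; the command configuration file is /usr/local/nagios/etc/objects/commands.cfg
When checking HTTPS endpoints, check_http can be given the -S option. If it reports the error SSL is not available , install the libssl-dev package first, then rebuild ( ./configure --with-openssl=/usr/bin/openssl ) and reinstall nagios-plugins to add SSL support

We set check_command to check_http, so the default check_http definition in commands.cfg needs to be changed to the following: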

define command {
    command_name    check_http
    command_line    $USER1$/check_http -H $ARG1$
}

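define command : defines a command
command_name : the name of the command, which host and service configuration files can reference
command_line : the path of the command and how it is run. This check_http is the one produced by installing nagios-plugins and lives under /usr/local/nagios/libexec/ . Run check_http -h for detailed usage; it supports a wide range of options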
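
To see how a service's check_command and the command_line template fit together, here is a small shell sketch of the expansion: everything after the first ! in check_command becomes $ARG1$, and $USER1$ is the plugin directory. The helper names replace_first and expand_check_command are made up for illustration, the default value of $USER1$ is assumed to be /usr/local/nagios/libexec, and only the first occurrence of each macro is replaced.

```shell
#!/bin/sh
# Replace the first occurrence of a literal needle ($2) in a string ($1) with $3.
replace_first() {
    case "$1" in
        *"$2"*) printf '%s' "${1%%"$2"*}$3${1#*"$2"}" ;;
        *)      printf '%s' "$1" ;;
    esac
}

# Expand a service's check_command against a command_line template.
#   $1: check_command value, e.g. 'check_http!ops-coffee.cn -S'
#   $2: command_line template, e.g. '$USER1$/check_http -H $ARG1$'
#   $3: value of $USER1$ (assumed default: the plugin directory)
expand_check_command() {
    user1="${3:-/usr/local/nagios/libexec}"
    case "$1" in
        *'!'*) arg1="${1#*!}"; arg1="${arg1%%!*}" ;;   # text between the first and second '!'
        *)     arg1="" ;;
    esac
    line=$(replace_first "$2" '$USER1$' "$user1")
    replace_first "$line" '$ARG1$' "$arg1"
    echo
}

# Usage:
#   expand_check_command 'check_http!ops-coffee.cn -S -u / -e 200' '$USER1$/check_http -H $ARG1$'
```

The printed line is the command that would be run for web_project_02, which is handy to paste into a terminal when a check misbehaves.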

We set use to generic-service. A service template lets us define many defaults, as follows:

define service {
    name                            generic-service     ; The 'name' of this service template
    active_checks_enabled           1                   ; Active service checks are enabled
    passive_checks_enabled          1                   ; Passive service checks are enabled/accepted
    parallelize_check               1                   ; Active service checks should be parallelized (disabling this can lead to major performance problems)
    obsess_over_service             1                   ; We should obsess over this service (if necessary)
    check_freshness                 0                   ; Default is to NOT check service 'freshness'
    notifications_enabled           1                   ; Service notifications are enabled
    event_handler_enabled           1                   ; Service event handler is enabled
    flap_detection_enabled          1                   ; Flap detection is enabled
    process_perf_data               1                   ; Process performance data
    retain_status_information       1                   ; Retain status information across program restarts
    retain_nonstatus_information    1                   ; Retain non-status information across program restarts
    is_volatile                     0                   ; The service is not volatile
    check_period                    24x7                ; The service can be checked at any time of the day
    max_check_attempts              2                   ; Check the service up to 2 times in order to determine its final (hard) state
    check_interval                  1                   ; Check the service every minute under normal conditions
    retry_interval                  1                   ; Re-check the service every minute until a hard state can be determined
    contact_groups                  admins              ; Notifications get sent out to everyone in the 'admins' group
    notification_options            w,u,c,r             ; Send notifications about warning, unknown, critical, and recovery events
    notification_interval           60                  ; Re-notify about service problems every hour
    notification_period             24x7                ; Notifications can be sent out at any time
    register                        0                   ; DON'T REGISTER THIS DEFINITION - ITS NOT A REAL SERVICE, JUST A TEMPLATE!
}

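There are too many settings to explain one by one, and the English comments should make them understandable, so here are just the important ones:

max_check_attempts : how many times a check is attempted before the service state is considered final. For example, with a value of 3, a failing service has to fail three checks in a row before Nagios treats it as really down and notifies us by email or SMS
check_interval : the check frequency, that is, how often the service is polled while it is healthy. We set it to once a minute to get results sooner
retry_interval : how often the service is re-checked after its state has changed and the final state is not yet determined. We also set this to once a minute
contact_groups : the contact group that receives alerts when a failure needs to be reported. The configuration file for this group is /usr/local/nagios/etc/objects/contacts.cfg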
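
To make the interaction of the first three settings concrete, here is a small arithmetic sketch. It assumes the default interval_length of 60 seconds (so the values are minutes), ignores check execution time and scheduling jitter, and uses a made-up function name hard_state_delay. In the worst case a failure happens just after a successful check, so it is first seen one check_interval later, and each further attempt follows one retry_interval after the previous one.

```shell
#!/bin/sh
# Worst-case minutes between a failure and the hard state that triggers a notification.
#   $1: check_interval   $2: retry_interval   $3: max_check_attempts
hard_state_delay() {
    check_interval="$1"
    retry_interval="$2"
    max_attempts="$3"
    echo $(( check_interval + (max_attempts - 1) * retry_interval ))
}

# Usage:
#   hard_state_delay 1 1 2
```

With check_interval 1, retry_interval 1, and max_check_attempts 2, a failure is confirmed at most about 2 minutes after it happens, whereas values of 10, 2, and 3 would give about 14 minutes.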

We set contact_groups to admins. Next, let's look at the configuration in contacts.cfg:

define contact{
    contact_name                    sa                  ; Short name of user
    use                             generic-contact     ; Inherit default values from generic-contact template (defined above)
    alias                           Nagios Admin        ; Full name of user

    service_notification_period     24x7
    host_notification_period        24x7
    service_notification_options    w,u,c,r
    host_notification_options       d,u,r
    host_notification_commands      notify-host-by-email,notify-host-by-sms
    service_notification_commands   notify-service-by-email,notify-service-by-sms

    email                           [email protected]
    pager                           15821212121,15822222222
}

define contactgroup{
    contactgroup_name       admins
    alias                   Nagios Administrators
    members                 sa
}

/usr/local/nagios/etc/objects/commands.cfg

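After all the configuration is done, restart the Nagios service and the monitoring will show up as working.

The Nagstamon tool

Nagstamon is an excellent companion to Nagios. It is a desktop widget for Nagios (in fact it now works not only with Nagios but also with Zabbix and others). Once started it stays in the system tray, and when the Nagios monitoring status changes it pops up immediately and plays a warning sound, so we learn about business status even sooner.

The configuration is as follows:

Update interval sets how often the Nagios status is fetched; we set it to 1s
When an alert fires, the desktop immediately turns red, which gets your heart racing

Final thoughts

Business status monitoring supplements process monitoring systems such as Zabbix; it cannot replace them. It is very useful while our process monitoring is still incomplete: at present a considerable share of our alerts are first discovered by this business status monitoring
We chose Nagios mainly because it is fairly pure and focused on status monitoring (process recording is available through plugins), and because we already know it well. Nagios configuration looks complicated, with several interlocking configuration files, but once the relationships between the files are clear, the configuration turns out to be sensible and simple
The more status monitoring nodes we deploy and the more regions they cover, the more accurately we capture what users experience. But networks are complex, and we cannot deploy a monitoring system in every province, city, and network node. If necessary, consider a commercial monitoring service that offers monitoring from nodes worldwide; the cost will rise accordingly, so weigh the trade-offs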
