Corosync Overview:
Corosync is part of a cluster management suite; a simple configuration file defines how cluster messages are passed and which protocols are used. It is a relatively young piece of software, released in 2008, yet it is not truly new: back in 2002 there was a project called OpenAIS, which grew so large that it was split into two sub-projects, and the part that implements HA heartbeat messaging became Corosync; roughly 60% of its code comes from OpenAIS. Corosync provides complete HA messaging functionality on its own, but for richer and more complex features you still need OpenAIS. Corosync is the direction future development is taking, and new projects generally adopt it. For administration, hb_gui offers good graphical HA management; other graphical options include the RHCS tools luci+ricci and the Java-based LCMC cluster management tool. Like heartbeat, Corosync is a tool for building highly available clusters. That covers the basics of corosync and pacemaker; next let's look at how to install them.
Installing Corosync and Pacemaker:
1. Environment
(1) Operating system
CentOS 6.5, x86_64
(2) Software
corosync-1.4.1-17.el6.x86_64
crmsh-1.2.6-4.el6.x86_64.rpm
pssh-2.3.1-2.el6.x86_64.rpm
(3) Topology
Number of nodes: 3, namely node1, node2, and nfs
node1: 172.16.100.6   node2: 172.16.100.7   nfs: 172.16.100.9   TestHost: 172.16.100.88
The topology is shown in the figure below (the original image is not reproduced here):
2. Installation and configuration steps:
1. Preparation
To configure a Linux host as an HA node, the following preparation is usually required:
1) Hostname resolution must work on all nodes, and each node's hostname must match the output of "uname -n". Therefore /etc/hosts on both nodes must contain the following entries:
# vim /etc/hosts
172.16.100.6 node1.samlee.com node1
172.16.100.7 node2.samlee.com node2
To keep these hostnames after a reboot, also run commands like the following on each node:
Node1:
# sed -i 's@\(HOSTNAME=\).*@\1node1.samlee.com@g' /etc/sysconfig/network
# hostname node1.samlee.com
Node2:
# sed -i 's@\(HOSTNAME=\).*@\1node2.samlee.com@g' /etc/sysconfig/network
# hostname node2.samlee.com
2) Set up key-based ssh authentication between the two nodes, which can be done with the following commands:
Node1:
# ssh-keygen -t rsa -P ''
# ssh-copy-id -i ~/.ssh/id_rsa.pub root@node2
# ssh node2.samlee.com 'date'; date
Node2:
# ssh-keygen -t rsa -P ''
# ssh-copy-id -i ~/.ssh/id_rsa.pub root@node1
# ssh node1.samlee.com 'date'; date
3) Synchronize the time every 5 minutes via cron (configure on both node1 and node2):
# crontab -e
*/5 * * * * /sbin/ntpdate 172.16.100.10 &> /dev/null
2. Installing and configuring the Corosync cluster manager
1) Install Corosync (via yum)
# yum -y install corosync
Install crmsh (via rpm)
Since 6.4, RHEL no longer ships crmsh, the command-line cluster configuration tool, and uses pcs instead. If you are used to the crm command, download the relevant packages and install them yourself. crmsh depends on pssh, so download that as well.
# cd /root/corosync_packages/
# yum -y --nogpgcheck localinstall crmsh*.rpm pssh*.rpm
2) Configure corosync (run on node1.samlee.com)
# cd /etc/corosync/
# cp corosync.conf.example corosync.conf
# vim corosync.conf

# Please read the corosync.conf.5 manual page
compatibility: whitetank

totem {
    version: 2
    secauth: on                  # enable authentication
    threads: 0                   # number of worker threads (0 = default)
    interface {
        ringnumber: 0
        bindnetaddr: 172.16.0.0  # network address the cluster nodes run on
        mcastaddr: 226.96.6.17   # multicast address for cluster traffic
        mcastport: 5405          # heartbeat/messaging port
        ttl: 1
    }
}

logging {
    fileline: off
    to_stderr: no
    to_logfile: yes
    to_syslog: yes
    logfile: /var/log/cluster/corosync.log
    debug: off
    timestamp: on
    logger_subsys {
        subsys: AMF
        debug: off
    }
}

amf {
    mode: disabled
}

# service to start together with corosync
service {
    ver: 0
    name: pacemaker
}

# identity that aisexec runs as
aisexec {
    user: root
    group: root
}

Set bindnetaddr to the network address of the network your NICs are on. Our two nodes are on the 172.16.0.0 network, so here it is set as follows:
bindnetaddr: 172.16.0.0
3) Generate the authentication key used for inter-node communication:
# corosync-keygen
If there is not enough entropy, you may need to log in at the console and type randomly on the keyboard until enough is gathered.
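If the node is a virtual machine with little console activity, the entropy pool can take a very long time to fill. A common workaround (a suggestion added here, not part of the original walkthrough) is to generate disk I/O from a second terminal so the kernel collects more entropy:

# find / -type f -exec cat {} + > /dev/null 2>&1   ## run in a second terminal while corosync-keygen waits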
4) Copy corosync.conf and authkey to node2:
# scp -p corosync.conf authkey node2:/etc/corosync/
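The -p flag matters here: corosync-keygen creates /etc/corosync/authkey with mode 0400, and scp -p preserves that mode. As a quick sanity check (an extra step added here, not in the original post), confirm the permissions on both nodes:

# ls -l /etc/corosync/authkey
# ssh node2 'ls -l /etc/corosync/authkey'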
5) On both node1 and node2, create the directory where corosync writes its logs:
# mkdir /var/log/cluster
# ssh node2 'mkdir /var/log/cluster'
6) Start the corosync service:
# service corosync start
# ssh node2 '/etc/init.d/corosync start'
7) Check that the corosync cluster engine started correctly:
# grep -e "Corosync Cluster Engine" -e "configuration file" /var/log/cluster/corosync.log
# ssh node2 'grep -e "Corosync Cluster Engine" -e "configuration file" /var/log/cluster/corosync.log'
Output like the following indicates a normal start:
Aug 13 11:26:58 corosync [MAIN ] Corosync Cluster Engine ('1.4.1'): started and ready to provide service.
Aug 13 11:26:58 corosync [MAIN ] Successfully read main configuration file '/etc/corosync/corosync.conf'.
8) Check that the initial membership notifications were sent correctly:
# grep TOTEM /var/log/cluster/corosync.log
# ssh node2 'grep TOTEM /var/log/cluster/corosync.log'
Output like the following indicates success:
Aug 13 13:19:20 corosync [TOTEM ] Initializing transport (UDP/IP Multicast).
Aug 13 13:19:20 corosync [TOTEM ] Initializing transmit/receive security: libtomcrypt SOBER128/SHA1HMAC (mode 0).
Aug 13 13:19:20 corosync [TOTEM ] The network interface [172.16.100.6] is now up.
Aug 13 13:19:20 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
Aug 13 11:26:59 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
9) Check whether any errors occurred during startup. The error messages below say that pacemaker will soon no longer be supported as a corosync plugin and recommend cman as the cluster infrastructure instead; they can safely be ignored here.
# grep ERROR: /var/log/cluster/corosync.log | grep -v unpack_resources
Aug 13 13:19:20 corosync [pcmk ] ERROR: process_ais_conf: You have configured a cluster using the Pacemaker plugin for Corosync. The plugin is not supported in this environment and will be removed very soon.
Aug 13 13:19:20 corosync [pcmk ] ERROR: process_ais_conf: Please see Chapter 8 of 'Clusters from Scratch' (http://www.clusterlabs.org/doc) for details on using Pacemaker with CMAN
10) Check that pacemaker started correctly:
# grep pcmk_startup /var/log/cluster/corosync.log
Aug 13 13:19:20 corosync [pcmk ] info: pcmk_startup: CRM: Initialized
Aug 13 13:19:20 corosync [pcmk ] Logging: Initialized pcmk_startup
Aug 13 13:19:20 corosync [pcmk ] info: pcmk_startup: Maximum core file size is: 18446744073709551615
Aug 13 13:19:20 corosync [pcmk ] info: pcmk_startup: Service: 9
Aug 13 13:19:20 corosync [pcmk ] info: pcmk_startup: Local hostname: node1.samlee.com
11) If crmsh is installed, the startup state of the cluster nodes can be checked with:
# crm status
Last updated: Sat Aug 13 13:42:26 2016
Last change: Sat Aug 13 13:19:58 2016 by hacluster via crmd on node1.samlee.com
Stack: classic openais (with plugin)
Current DC: node1.samlee.com (version 1.1.14-8.el6-70404b0) - partition with quorum
2 nodes and 0 resources configured, 2 expected votes
Online: [ node1.samlee.com node2.samlee.com ]
12) Check that the corosync port is listening:
# ss -tunlp | grep 5405
udp  UNCONN  0  0  172.16.100.6:5405  *:*  users:(("corosync",5879,15))
udp  UNCONN  0  0  226.96.6.17:5405   *:*  users:(("corosync",5879,11))
# ssh node2 'ss -tunlp | grep 5405'
udp  UNCONN  0  0  172.16.100.7:5405  *:*  users:(("corosync",5047,15))
udp  UNCONN  0  0  226.96.6.17:5405   *:*  users:(("corosync",5047,11))
The output above shows that both nodes have started normally and the cluster is in a healthy working state.
13) Run ps auxf to see the processes that corosync started:
# ps auxf
root  5879  0.9  0.9  545200  4648  ?  Ssl  13:19  0:17  corosync
496   5884  0.0  2.1   94608 10672  ?  S<   13:19  0:00  \_ /usr/libexec/pacemaker/cib
root  5885  0.0  0.8   95148  3968  ?  S<   13:19  0:00  \_ /usr/libexec/pacemaker/stonithd
root  5886  0.0  0.5   62932  2788  ?  S<   13:19  0:00  \_ /usr/libexec/pacemaker/lrmd
496   5887  0.0  0.6   85936  3196  ?  S<   13:19  0:00  \_ /usr/libexec/pacemaker/attrd
496   5888  0.0  3.7  117468 18504  ?  S<   13:19  0:00  \_ /usr/libexec/pacemaker/pengine
496   5889  0.0  0.8  135988  4228  ?  S<   13:19  0:01  \_ /usr/libexec/pacemaker/crmd
3. Managing cluster resources
crmsh basics
[root@node1 ~]# crm                ## enter crmsh
crm(live)# help                    ## show help

This is crm shell, a Pacemaker command line interface.

Available commands:

    cib              manage shadow CIBs                  ## CIB management
    resource         resources management                ## resource management
    configure        CRM cluster configuration           ## CRM configuration: stickiness, resource types, constraints, etc.
    node             nodes management                    ## node management
    options          user preferences                    ## user preferences
    history          CRM cluster history                 ## CRM history
    site             Geo-cluster support                 ## geo-cluster support
    ra               resource agents information center  ## resource agent (RA) information
    status           show cluster status                 ## show cluster status
    help,?           show help (help topics for list of topics)
    end,cd,up        go back one level
    quit,bye,exit    exit the program

crm(live)# configure               ## enter configure mode
crm(live)configure# show           ## show the current configuration
node node1.samlee.com
node node2.samlee.com
property $id="cib-bootstrap-options" \
    dc-version="1.1.10-14.el6-368c726" \
    cluster-infrastructure="classic openais (with plugin)" \
    expected-quorum-votes="2"
crm(live)configure# verify         ## syntax check; it fails because no STONITH device is defined, so we disable STONITH
error: unpack_resources: Resource start-up disabled since no STONITH resources have been defined
error: unpack_resources: Either configure some or disable STONITH with the stonith-enabled option
error: unpack_resources: NOTE: Clusters with shared data need STONITH to ensure data integrity
Errors found during check: config not valid
crm(live)configure# property stonith-enabled=false   ## disable stonith, then verify again: no errors
crm(live)configure# verify
crm(live)configure# commit         ## commit the configuration
crm(live)configure# cd
crm(live)# ra                      ## enter RA (resource agent) mode
crm(live)ra# help

This level contains commands which show various information about the installed resource agents. It is available both at the top level and at the `configure` level.

Available commands:

    classes          list classes and providers           ## list RA classes
    list             list RA for a class (and provider)   ## list the RAs of a given class (and provider)
    meta             show meta data for a RA              ## show detailed information for an RA
    providers        show providers for a RA and a class  ## show the providers and class of a given RA
    help             show help (help topics for list of topics)
    end              go back one level
    quit             exit the program

crm(live)ra# classes
lsb
ocf / heartbeat pacemaker
service
stonith
crm(live)ra# list ocf pacemaker
ClusterMon  Dummy  HealthCPU  HealthSMART  Stateful  SysInfo  SystemHealth  controld  ping  pingd  remote
crm(live)ra# info ocf:heartbeat:IPaddr
crm(live)ra# cd
crm(live)# status                  ## show cluster status
Last updated: Sat Aug 13 15:51:13 2016
Last change: Sat Aug 13 15:46:19 2016 via cibadmin on node1.samlee.com
Stack: classic openais (with plugin)
Current DC: node2.samlee.com - partition with quorum
Version: 1.1.10-14.el6-368c726
2 Nodes configured, 2 expected votes
0 Resources configured
Online: [ node1.samlee.com node2.samlee.com ]
The quorum problem:
In a two-node cluster the number of votes is even. When the heartbeat fails (split brain), neither node can reach quorum, and the default quorum policy shuts down cluster services. To avoid this, either add nodes so that the vote count is odd, or change the default quorum policy to "ignore":
crm(live)# configure
crm(live)configure# property no-quorum-policy=ignore
crm(live)configure# show
node node1.samlee.com
node node2.samlee.com
property $id="cib-bootstrap-options" \
    dc-version="1.1.10-14.el6-368c726" \
    cluster-infrastructure="classic openais (with plugin)" \
    expected-quorum-votes="2" \
    stonith-enabled="false" \
    no-quorum-policy="ignore"
crm(live)configure# verify
crm(live)configure# commit
Preventing resources from moving after a node recovers:
When a failure occurs, resources migrate to a healthy node; when the failed node recovers, the resources may move back to their original node. This is not always the best behavior, because every migration incurs downtime, and for complex applications such as an Oracle database that downtime can be long. Setting a resource stickiness policy avoids this.
crm(live)configure# rsc_defaults resource-stickiness=100   ## set the default resource stickiness to 100
Example: configuring a highly available web cluster
(1) Define the VIP:
crm(live)# configure
crm(live)configure# primitive webip ocf:heartbeat:IPaddr params ip=172.16.100.99 nic=eth0 cidr_netmask=16
crm(live)configure# verify
crm(live)configure# commit
crm(live)configure# cd
crm(live)# status
Last updated: Sat Aug 13 17:46:25 2016
Last change: Sat Aug 13 17:46:17 2016 via cibadmin on node1.samlee.com
Stack: classic openais (with plugin)
Current DC: node2.samlee.com - partition with quorum
Version: 1.1.10-14.el6-368c726
2 Nodes configured, 2 expected votes
1 Resources configured
Online: [ node1.samlee.com node2.samlee.com ]
webip (ocf::heartbeat:IPaddr): Started node1.samlee.com
The last line shows that the newly defined resource has started on node1. Running ip addr show confirms that the VIP is active:
# ip addr show
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 16436 qdisc noqueue state UNKNOWN
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether 00:0c:29:07:45:da brd ff:ff:ff:ff:ff:ff
    inet 172.16.100.6/16 brd 172.16.255.255 scope global eth0
    inet 172.16.100.99/16 brd 172.16.255.255 scope global secondary eth0   ## the VIP is active!
    inet6 fe80::20c:29ff:fe07:45da/64 scope link
       valid_lft forever preferred_lft forever
(2) Configure the httpd resource
node1 web server setup:
# yum -y install httpd
# echo "<h1>node1.samlee.com</h1>" > /var/www/html/index.html
# service httpd start
# chkconfig httpd off
# service httpd stop
node2 web server setup:
# yum -y install httpd
# echo "<h1>node2.samlee.com</h1>" > /var/www/html/index.html
# service httpd start
# chkconfig httpd off
# service httpd stop

crm(live)# configure
crm(live)configure# primitive webserver lsb:httpd
crm(live)configure# show
node node1.samlee.com
node node2.samlee.com
primitive webip ocf:heartbeat:IPaddr \
    params ip="172.16.100.99"
primitive webserver lsb:httpd
property $id="cib-bootstrap-options" \
    dc-version="1.1.10-14.el6-368c726" \
    cluster-infrastructure="classic openais (with plugin)" \
    expected-quorum-votes="2" \
    stonith-enabled="false" \
    no-quorum-policy="ignore"
rsc_defaults $id="rsc-options" \
    resource-stickiness="100"
crm(live)configure# verify
crm(live)configure# commit
crm(live)configure# cd
crm(live)# status
Last updated: Sat Aug 13 17:55:46 2016
Last change: Sat Aug 13 17:55:19 2016 via cibadmin on node1.samlee.com
Stack: classic openais (with plugin)
Current DC: node2.samlee.com - partition with quorum
Version: 1.1.10-14.el6-368c726
2 Nodes configured, 2 expected votes
2 Resources configured
Online: [ node1.samlee.com node2.samlee.com ]
webip (ocf::heartbeat:IPaddr): Started node1.samlee.com
webserver (lsb:httpd): Started node2.samlee.com
The status above shows that webip and webserver may run on different nodes. For an application serving the web through this IP that is unworkable: the two resources must run on the same node. How can we make them do so?
(1) Manually switch a resource to another node (when automatic placement does not do what you want; for testing only)
crm(live)# resource
crm(live)resource# list
 webip (ocf::heartbeat:IPaddr): Started
 webserver (lsb:httpd): Started
crm(live)resource# migrate webserver
crm(live)# status
Last updated: Mon Aug 15 09:57:34 2016
Last change: Mon Aug 15 09:57:09 2016 via crm_resource on node1.samlee.com
Stack: classic openais (with plugin)
Current DC: node1.samlee.com - partition with quorum
Version: 1.1.10-14.el6-368c726
2 Nodes configured, 2 expected votes
2 Resources configured
Online: [ node1.samlee.com node2.samlee.com ]
webip (ocf::heartbeat:IPaddr): Started node1.samlee.com
webserver (lsb:httpd): Started node1.samlee.com
The result of the switch can be seen in the status output above.
(2) Create a resource group (place resources that must start together in the same group)
crm(live)# configure
crm(live)configure# group webservice webip webserver
crm(live)configure# verify
crm(live)configure# commit
crm(live)configure# cd
crm(live)# resource
crm(live)resource# list
 Resource Group: webservice
     webip (ocf::heartbeat:IPaddr): Started
     webserver (lsb:httpd): Started
crm(live)# status
Last updated: Mon Aug 15 10:06:17 2016
Last change: Mon Aug 15 10:04:33 2016 via cibadmin on node1.samlee.com
Stack: classic openais (with plugin)
Current DC: node1.samlee.com - partition with quorum
Version: 1.1.10-14.el6-368c726
2 Nodes configured, 2 expected votes
2 Resources configured
Online: [ node1.samlee.com node2.samlee.com ]
 Resource Group: webservice
     webip (ocf::heartbeat:IPaddr): Started node1.samlee.com
     webserver (lsb:httpd): Started node1.samlee.com
Test the result as follows:
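The original screenshots are not reproduced here; as a substitute, and assuming the topology above, the service can be checked from TestHost (172.16.100.88) with curl, which should return node1's page while the group runs there:

# curl http://172.16.100.99
<h1>node1.samlee.com</h1>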
When testing is finished, delete the group resource:
crm(live)# resource
crm(live)resource# stop webservice
crm(live)resource# cleanup webservice
crm(live)resource# cd
crm(live)# configure
crm(live)configure# delete webservice
crm(live)configure# verify
crm(live)configure# commit
crm(live)# status
Last updated: Mon Aug 15 10:31:30 2016
Last change: Mon Aug 15 10:26:21 2016 via cibadmin on node1.samlee.com
Stack: classic openais (with plugin)
Current DC: node1.samlee.com - partition with quorum
Version: 1.1.10-14.el6-368c726
2 Nodes configured, 2 expected votes
2 Resources configured
Online: [ node1.samlee.com node2.samlee.com ]
webip (ocf::heartbeat:IPaddr): Started node1.samlee.com
webserver (lsb:httpd): Started node2.samlee.com

## stopping resources and clearing state
# crm
crm(live)# resource
crm(live)resource# stop webservice
crm(live)resource# list
crm(live)resource# cleanup webservice
crm(live)resource# cleanup webip
crm(live)resource# cleanup webserver
crm(live)resource# cd
crm(live)# node
crm(live)node# clearstate node1.samlee.com
crm(live)node# clearstate node2.samlee.com
crm(live)node# cd
crm(live)# resource
crm(live)resource# start webservice
crm(live)resource# reprobe
crm(live)resource# refresh
crm(live)resource# cd
crm(live)# configure
crm(live)configure# show
crm(live)configure# edit
crm(live)configure# verify
crm(live)configure# commit
(3) Fine-grained resource management with constraints
Even when the cluster has every resource it needs, it may still not place them correctly. Resource constraints specify which cluster nodes a resource may run on, the order in which resources are loaded, and which other resources a given resource depends on. Pacemaker provides three kinds of constraints:
1) Resource Location: defines the nodes on which a resource may, may not, or should preferably run;
2) Resource Collocation: defines whether cluster resources may or may not run together on the same node;
3) Resource Order: defines the order in which cluster resources are started on a node.
When defining constraints you also assign scores. Scores of all kinds are central to how the cluster works: everything from migrating a resource to deciding which resources to stop in a degraded cluster is done by manipulating scores. Scores are calculated per resource, and any node with a negative score for a resource cannot run it. After calculating the scores, the cluster places the resource on the node with the highest score. INFINITY is currently defined as 1,000,000, and adding or subtracting it follows three basic rules:
1) any value + INFINITY = INFINITY
2) any value - INFINITY = -INFINITY
3) INFINITY - INFINITY = -INFINITY
Each constraint can also be given its own score; constraints with higher scores are applied before those with lower scores. By creating several location constraints with different scores for a given resource, you control the order of the nodes the resource fails over to, as the sketch after this paragraph illustrates.
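As a hedged illustration of how the scores interact (the constraint names below are hypothetical, not from the original configuration): with the default stickiness of 100 set earlier, the two location constraints below make webip prefer node1. If node1 fails, webip moves to node2; when node1 comes back, node1's score (200) beats node2's score plus the accumulated stickiness (50 + 100 = 150), so the resource moves back to node1.

crm(live)configure# location webip_prefers_node1 webip 200: node1.samlee.com
crm(live)configure# location webip_fallback_node2 webip 50: node2.samlee.com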
So the earlier problem of webip and webserver possibly running on different nodes can be solved by defining a collocation constraint:
crm(live)# configure
crm(live)configure# colocation webserver_with_webip inf: webserver webip
crm(live)configure# verify
crm(live)configure# commit
crm(live)configure# cd
crm(live)# status
Last updated: Mon Aug 15 11:03:31 2016
Last change: Mon Aug 15 11:02:47 2016 via cibadmin on node1.samlee.com
Stack: classic openais (with plugin)
Current DC: node1.samlee.com - partition with quorum
Version: 1.1.10-14.el6-368c726
2 Nodes configured, 2 expected votes
2 Resources configured
Online: [ node1.samlee.com node2.samlee.com ]
webip (ocf::heartbeat:IPaddr): Started node1.samlee.com
webserver (lsb:httpd): Started node1.samlee.com
Both resources now run on the same node. Next, define the start order with an order constraint:
## start the webip resource first, then the webserver resource
crm(live)configure# order webip_before_webserver mandatory: webip webserver
crm(live)configure# verify
crm(live)configure# commit
Check the result as before with crm status.
In addition, since an HA cluster does not require every node to have equal or similar capacity, you may want a service to run on a particular, more powerful node whenever it is healthy; a location constraint achieves this:
crm(live)# configure
crm(live)configure# location webip_on_node1 webip 200: node1.samlee.com
crm(live)configure# verify
crm(live)configure# commit
Define resource monitoring, so that the cluster learns of it when the service stops or restarts:
crm(live)configure# primitive vip ocf:heartbeat:IPaddr params ip=172.16.100.100 op monitor interval=30s timeout=20s
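The same monitor operation can be attached to any resource. A minimal sketch for the webserver resource (my addition, not in the original post; if webserver already exists, add the op via configure edit instead of redefining the primitive). on-fail=restart is pacemaker's default recovery action for a failed monitor and is spelled out here only for clarity:

crm(live)configure# primitive webserver lsb:httpd op monitor interval=30s timeout=20s on-fail=restart
crm(live)configure# verify
crm(live)configure# commit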
-- That concludes Part 1 of this detailed look at corosync for high-availability clustering.
This article was originally published on the "Opensamlee" blog; please retain this attribution: http://gzsamlee.blog.51cto.com/9976612/1838084