Redis Cluster Implementation (6): Disaster Tolerance and Crash Recovery

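A key guarantee of any cluster is high availability: it must keep serving requests through a range of software and hardware failures. There are generally two approaches. In the first, every node exchanges data with and monitors every other node, so any node can take on the coordination work when a failure occurs. In the second, a separate coordination component monitors the cluster in real time and handles failures. The second approach is the more widely used today: it keeps the modules loosely coupled and is simpler to engineer than the first. The previous installment introduced the Raft protocol, and with that background Sentinel should be fairly easy to follow. Redis Sentinel continuously checks the nodes; when it finds one that has gone down, it carries out failover steps such as electing a new master. Let's walk through the process.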
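First we start a Sentinel cluster of three nodes. This requires editing the Sentinel configuration file, where the following options need attention:
port: must be changed, because we are starting three Sentinel instances and each one needs its own port.
dir: Sentinel's working directory at run time.
sentinel monitor <master-name> <ip> <redis-port> <quorum>: monitors the master named <master-name>. Slaves do not need to be listed; once a master is monitored, Sentinel discovers its slaves automatically. The trailing quorum is the minimum number of Sentinels that must agree the master is unreachable before it is treated as down.
sentinel down-after-milliseconds <master-name> <milliseconds>: if a monitored node gives no valid reply for <milliseconds>, the Sentinel considers it subjectively down. Once at least quorum Sentinels consider it down, it is marked objectively down.
sentinel parallel-syncs <master-name> <numslaves>: during a failover, at most numslaves slaves are reconfigured to sync with the new master at the same time.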
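The relationship between down-after-milliseconds, quorum, and the majority needed to start a failover can be sketched in a few lines of Python. This is a simplified model for illustration, not Sentinel's actual code: the MasterView class and all three function names are hypothetical, and the real implementation tracks considerably more state.

```python
from dataclasses import dataclass


@dataclass
class MasterView:
    """One Sentinel's view of a monitored master (hypothetical structure)."""
    last_ok_reply_ms: int  # timestamp of the last acceptable PING reply


def is_subjectively_down(view: MasterView, now_ms: int, down_after_ms: int) -> bool:
    # S_DOWN: this Sentinel alone has seen no valid reply for longer
    # than down-after-milliseconds.
    return now_ms - view.last_ok_reply_ms > down_after_ms


def is_objectively_down(sdown_votes: int, quorum: int) -> bool:
    # O_DOWN: at least `quorum` Sentinels report the master as S_DOWN.
    return sdown_votes >= quorum


def can_authorize_failover(votes_for_leader: int, known_sentinels: int) -> bool:
    # Whatever the quorum is, the Sentinel that leads the failover must be
    # elected by a majority of the known Sentinels (simplified here).
    return votes_for_leader >= known_sentinels // 2 + 1


if __name__ == "__main__":
    now = 40_000
    views = [MasterView(5_000), MasterView(9_000), MasterView(20_000)]
    votes = sum(is_subjectively_down(v, now, 30_000) for v in views)
    print("S_DOWN votes:", votes)
    print("O_DOWN with quorum 2:", is_objectively_down(votes, 2))
    print("failover allowed with 2 of 3 votes:", can_authorize_failover(2, 3))
```

With three Sentinels and a quorum of 2, as in the configuration used in this article, two Sentinels are enough both to mark a master objectively down and to form the majority that authorizes the failover.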
Our three edited files are sentinel1.conf, sentinel2.conf, and sentinel3.conf, with the following contents:
sentinel1.conf:

# Example sentinel.conf

# port <sentinel-port>
# The port that this sentinel instance will run on
port 27000

# dir <working-directory>
# Every long running process should have a well-defined working directory.
# For Redis Sentinel to chdir to /tmp at startup is the simplest thing
# for the process to don't interfere with administrative tasks such as
# unmounting filesystems.
dir /tmp

# sentinel monitor <master-name> <ip> <redis-port> <quorum>
#
# Tells Sentinel to monitor this master, and to consider it in O_DOWN
# (Objectively Down) state only if at least <quorum> sentinels agree.
#
# Note that whatever is the ODOWN quorum, a Sentinel will require to
# be elected by the majority of the known Sentinels in order to
# start a failover, so no failover can be performed in minority.
#
# Slaves are auto-discovered, so you don't need to specify slaves in
# any way. Sentinel itself will rewrite this configuration file adding
# the slaves using additional configuration options.
# Also note that the configuration file is rewritten when a
# slave is promoted to master.
#
# Note: master name should not include special characters or spaces.
# The valid charset is A-z 0-9 and the three characters ".-_".
sentinel monitor master1 127.0.0.1 7000 2
sentinel monitor master2 127.0.0.1 7004 2
sentinel monitor master3 127.0.0.1 7005 2

# sentinel down-after-milliseconds <master-name> <milliseconds>
#
# Number of milliseconds the master (or any attached slave or sentinel) should
# be unreachable (as in, not acceptable reply to PING, continuously, for the
# specified period) in order to consider it in S_DOWN state (Subjectively
# Down).
#
# Default is 30 seconds.
sentinel down-after-milliseconds master1 30000
sentinel down-after-milliseconds master2 30000
sentinel down-after-milliseconds master3 30000

# sentinel parallel-syncs <master-name> <numslaves>
#
# How many slaves we can reconfigure to point to the new slave simultaneously
# during the failover. Use a low number if you use the slaves to serve query
# to avoid that all the slaves will be unreachable at about the same
# time while performing the synchronization with the master.
sentinel parallel-syncs master1 1
sentinel parallel-syncs master2 1
sentinel parallel-syncs master3 1

# sentinel failover-timeout <master-name> <milliseconds>
#
# Specifies the failover timeout in milliseconds. It is used in many ways:
#
# - The time needed to re-start a failover after a previous failover was
#   already tried against the same master by a given Sentinel, is two
#   times the failover timeout.
#
# - The time needed for a slave replicating to a wrong master according
#   to a Sentinel current configuration, to be forced to replicate
#   with the right master, is exactly the failover timeout (counting since
#   the moment a Sentinel detected the misconfiguration).
#
# - The time needed to cancel a failover that is already in progress but
#   did not produced any configuration change (SLAVEOF NO ONE yet not
#   acknowledged by the promoted slave).
#
# - The maximum time a failover in progress waits for all the slaves to be
#   reconfigured as slaves of the new master. However even after this time
#   the slaves will be reconfigured by the Sentinels anyway, but not with
#   the exact parallel-syncs progression as specified.
#
# Default is 3 minutes.
sentinel failover-timeout master1 180000
sentinel failover-timeout master2 180000
sentinel failover-timeout master3 180000
sentinel2.conf:
# Example sentinel.conf

# port <sentinel-port>
# The port that this sentinel instance will run on
port 27001

# dir <working-directory>
# Every long running process should have a well-defined working directory.
# For Redis Sentinel to chdir to /tmp at startup is the simplest thing
# for the process to don't interfere with administrative tasks such as
# unmounting filesystems.
dir /tmp

# sentinel monitor <master-name> <ip> <redis-port> <quorum>
#
# Tells Sentinel to monitor this master, and to consider it in O_DOWN
# (Objectively Down) state only if at least <quorum> sentinels agree.
#
# Note that whatever is the ODOWN quorum, a Sentinel will require to
# be elected by the majority of the known Sentinels in order to
# start a failover, so no failover can be performed in minority.
#
# Slaves are auto-discovered, so you don't need to specify slaves in
# any way. Sentinel itself will rewrite this configuration file adding
# the slaves using additional configuration options.
# Also note that the configuration file is rewritten when a
# slave is promoted to master.
#
# Note: master name should not include special characters or spaces.
# The valid charset is A-z 0-9 and the three characters ".-_".
sentinel monitor master1 127.0.0.1 7000 2
sentinel monitor master2 127.0.0.1 7004 2
sentinel monitor master3 127.0.0.1 7005 2

# sentinel down-after-milliseconds <master-name> <milliseconds>
#
# Number of milliseconds the master (or any attached slave or sentinel) should
# be unreachable (as in, not acceptable reply to PING, continuously, for the
# specified period) in order to consider it in S_DOWN state (Subjectively
# Down).
#
# Default is 30 seconds.
sentinel down-after-milliseconds master1 30000
sentinel down-after-milliseconds master2 30000
sentinel down-after-milliseconds master3 30000

# sentinel parallel-syncs <master-name> <numslaves>
#
# How many slaves we can reconfigure to point to the new slave simultaneously
# during the failover. Use a low number if you use the slaves to serve query
# to avoid that all the slaves will be unreachable at about the same
# time while performing the synchronization with the master.
sentinel parallel-syncs master1 1
sentinel parallel-syncs master2 1
sentinel parallel-syncs master3 1

# sentinel failover-timeout <master-name> <milliseconds>
#
# Specifies the failover timeout in milliseconds. It is used in many ways:
#
# - The time needed to re-start a failover after a previous failover was
#   already tried against the same master by a given Sentinel, is two
#   times the failover timeout.
#
# - The time needed for a slave replicating to a wrong master according
#   to a Sentinel current configuration, to be forced to replicate
#   with the right master, is exactly the failover timeout (counting since
#   the moment a Sentinel detected the misconfiguration).
#
# - The time needed to cancel a failover that is already in progress but
#   did not produced any configuration change (SLAVEOF NO ONE yet not
#   acknowledged by the promoted slave).
#
# - The maximum time a failover in progress waits for all the slaves to be
#   reconfigured as slaves of the new master. However even after this time
#   the slaves will be reconfigured by the Sentinels anyway, but not with
#   the exact parallel-syncs progression as specified.
#
# Default is 3 minutes.
sentinel failover-timeout master1 180000
sentinel failover-timeout master2 180000
sentinel failover-timeout master3 180000
sentinel3.conf:
# Example sentinel.conf

# port <sentinel-port>
# The port that this sentinel instance will run on
port 27002

# dir <working-directory>
# Every long running process should have a well-defined working directory.
# For Redis Sentinel to chdir to /tmp at startup is the simplest thing
# for the process to don't interfere with administrative tasks such as
# unmounting filesystems.
dir /tmp

# sentinel monitor <master-name> <ip> <redis-port> <quorum>
#
# Tells Sentinel to monitor this master, and to consider it in O_DOWN
# (Objectively Down) state only if at least <quorum> sentinels agree.
#
# Note that whatever is the ODOWN quorum, a Sentinel will require to
# be elected by the majority of the known Sentinels in order to
# start a failover, so no failover can be performed in minority.
#
# Slaves are auto-discovered, so you don't need to specify slaves in
# any way. Sentinel itself will rewrite this configuration file adding
# the slaves using additional configuration options.
# Also note that the configuration file is rewritten when a
# slave is promoted to master.
#
# Note: master name should not include special characters or spaces.
# The valid charset is A-z 0-9 and the three characters ".-_".
sentinel monitor master1 127.0.0.1 7000 2
sentinel monitor master2 127.0.0.1 7004 2
sentinel monitor master3 127.0.0.1 7005 2

# sentinel down-after-milliseconds <master-name> <milliseconds>
#
# Number of milliseconds the master (or any attached slave or sentinel) should
# be unreachable (as in, not acceptable reply to PING, continuously, for the
# specified period) in order to consider it in S_DOWN state (Subjectively
# Down).
#
# Default is 30 seconds.
sentinel down-after-milliseconds master1 30000
sentinel down-after-milliseconds master2 30000
sentinel down-after-milliseconds master3 30000

# sentinel parallel-syncs <master-name> <numslaves>
#
# How many slaves we can reconfigure to point to the new slave simultaneously
# during the failover. Use a low number if you use the slaves to serve query
# to avoid that all the slaves will be unreachable at about the same
# time while performing the synchronization with the master.
sentinel parallel-syncs master1 1
sentinel parallel-syncs master2 1
sentinel parallel-syncs master3 1

# sentinel failover-timeout <master-name> <milliseconds>
#
# Specifies the failover timeout in milliseconds. It is used in many ways:
#
# - The time needed to re-start a failover after a previous failover was
#   already tried against the same master by a given Sentinel, is two
#   times the failover timeout.
#
# - The time needed for a slave replicating to a wrong master according
#   to a Sentinel current configuration, to be forced to replicate
#   with the right master, is exactly the failover timeout (counting since
#   the moment a Sentinel detected the misconfiguration).
#
# - The time needed to cancel a failover that is already in progress but
#   did not produced any configuration change (SLAVEOF NO ONE yet not
#   acknowledged by the promoted slave).
#
# - The maximum time a failover in progress waits for all the slaves to be
#   reconfigured as slaves of the new master. However even after this time
#   the slaves will be reconfigured by the Sentinels anyway, but not with
#   the exact parallel-syncs progression as specified.
#
# Default is 3 minutes.
sentinel failover-timeout master1 180000
sentinel failover-timeout master2 180000
sentinel failover-timeout master3 180000
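Since the three files above differ only in the port line, they could also be generated rather than edited by hand. The following is a minimal sketch under that assumption; render_sentinel_conf and write_confs are hypothetical helper names, and the output omits the explanatory comments of the stock sentinel.conf.

```python
from pathlib import Path
import tempfile

# Master names and ports as they appear in the configuration files above.
MASTERS = {"master1": 7000, "master2": 7004, "master3": 7005}


def render_sentinel_conf(port, masters=MASTERS, host="127.0.0.1", quorum=2,
                         down_after_ms=30000, parallel_syncs=1,
                         failover_timeout_ms=180000):
    """Return the text of one sentinel config; only `port` varies per instance."""
    lines = [f"port {port}", "dir /tmp"]
    for name, redis_port in masters.items():
        # The monitor line comes first so the later options refer to a known master.
        lines.append(f"sentinel monitor {name} {host} {redis_port} {quorum}")
        lines.append(f"sentinel down-after-milliseconds {name} {down_after_ms}")
        lines.append(f"sentinel parallel-syncs {name} {parallel_syncs}")
        lines.append(f"sentinel failover-timeout {name} {failover_timeout_ms}")
    return "\n".join(lines) + "\n"


def write_confs(directory, base_port=27000, count=3):
    """Write sentinel1.conf .. sentinelN.conf into `directory`."""
    paths = []
    for i in range(count):
        path = Path(directory) / f"sentinel{i + 1}.conf"
        path.write_text(render_sentinel_conf(base_port + i))
        paths.append(path)
    return paths


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        for conf in write_confs(tmp):
            print(conf.name, "->", conf.read_text().splitlines()[0])
```

Keep in mind that Sentinel rewrites its own configuration file at run time, so generated files are only a starting point.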
Then we run:
	redis-sentinel sentinel1.conf
	redis-sentinel sentinel2.conf
	redis-sentinel sentinel3.conf

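This brings up a pseudo-cluster of three Sentinels. Each one prints output like the following, which shows that the three masters, their slaves, and the other Sentinels have all been detected. (In the run captured here, the masters were registered as master0, master1, and master2 at 192.168.39.153.)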

56161:X 04 Dec 09:23:09.855 # Sentinel runid is 4dd7b82766f7faac95c251235682e42079e0a701
56161:X 04 Dec 09:23:09.855 # +monitor master master0 192.168.39.153 7000 quorum 2
56161:X 04 Dec 09:23:09.855 # +monitor master master2 192.168.39.153 7005 quorum 2
56161:X 04 Dec 09:23:09.856 # +monitor master master1 192.168.39.153 7004 quorum 2
56161:X 04 Dec 09:23:10.842 * +slave slave 192.168.39.153:7003 192.168.39.153 7003 @ master0 192.168.39.153 7000
56161:X 04 Dec 09:23:10.842 * +slave slave 192.168.39.153:7002 192.168.39.153 7002 @ master2 192.168.39.153 7005
56161:X 04 Dec 09:23:10.843 * +slave slave 192.168.39.153:7001 192.168.39.153 7001 @ master1 192.168.39.153 7004
56161:X 04 Dec 09:23:19.505 * +sentinel sentinel 192.168.39.153:27001 192.168.39.153 27001 @ master0 192.168.39.153 7000
56161:X 04 Dec 09:23:19.506 * +sentinel sentinel 192.168.39.153:27001 192.168.39.153 27001 @ master2 192.168.39.153 7005
56161:X 04 Dec 09:23:19.508 * +sentinel sentinel 192.168.39.153:27001 192.168.39.153 27001 @ master1 192.168.39.153 7004
56161:X 04 Dec 09:23:25.240 * +sentinel sentinel 192.168.39.153:27002 192.168.39.153 27002 @ master1 192.168.39.153 7004
56161:X 04 Dec 09:23:25.241 * +sentinel sentinel 192.168.39.153:27002 192.168.39.153 27002 @ master2 192.168.39.153 7005
56161:X 04 Dec 09:23:25.242 * +sentinel sentinel 192.168.39.153:27002 192.168.39.153 27002 @ master0 192.168.39.153 7000
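To check a startup log like this one without reading it line by line, the +monitor, +slave, and +sentinel events can be collected into a per-master summary. The sketch below assumes the line layout shown above (pid and role, date, time, a marker, then the event); the regular expression and the summarize function are illustrative and not part of Redis.

```python
import re
from collections import defaultdict

# Matches lines such as:
#   56161:X 04 Dec 09:23:10.842 * +slave slave <name> <ip> <port> @ <master> <ip> <port>
EVENT_RE = re.compile(
    r"^\d+:X \d+ \w+ [\d:.]+ [#*] "
    r"(?P<event>\+\S+) (?P<type>\S+) (?P<name>\S+) (?P<ip>\S+) (?P<port>\d+)"
    r"(?: @ (?P<master>\S+) \S+ \d+)?"
)


def summarize(log_lines):
    """Group discovered slaves and sentinels under the master they belong to."""
    topology = defaultdict(lambda: {"slaves": [], "sentinels": []})
    for line in log_lines:
        m = EVENT_RE.match(line)
        if m is None:
            continue  # e.g. the "Sentinel runid is ..." line
        addr = f'{m["ip"]}:{m["port"]}'
        if m["event"] == "+monitor":
            topology[m["name"]]  # register the master even if nothing else is found
        elif m["event"] == "+slave":
            topology[m["master"]]["slaves"].append(addr)
        elif m["event"] == "+sentinel":
            topology[m["master"]]["sentinels"].append(addr)
    return dict(topology)


SAMPLE = """\
56161:X 04 Dec 09:23:09.855 # Sentinel runid is 4dd7b82766f7faac95c251235682e42079e0a701
56161:X 04 Dec 09:23:09.855 # +monitor master master0 192.168.39.153 7000 quorum 2
56161:X 04 Dec 09:23:10.842 * +slave slave 192.168.39.153:7003 192.168.39.153 7003 @ master0 192.168.39.153 7000
56161:X 04 Dec 09:23:19.505 * +sentinel sentinel 192.168.39.153:27001 192.168.39.153 27001 @ master0 192.168.39.153 7000
""".splitlines()

if __name__ == "__main__":
    for master, info in summarize(SAMPLE).items():
        print(master, info)
```

Each master should end up with one slave and two peer Sentinels in this setup; anything less suggests an instance that has not been discovered yet.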
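Ordinarily, once a master goes offline, the cluster becomes unavailable. With Sentinel in place, a failover is carried out as soon as the master is judged to be down, so service is restored within a short time.
At the start there are six nodes, three masters and three slaves, in the following state: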
127.0.0.1:7000> cluster info
cluster_state:ok
cluster_slots_assigned:16384
cluster_slots_ok:16384
cluster_slots_pfail:0
cluster_slots_fail:0
cluster_known_nodes:6
cluster_size:3
cluster_current_epoch:12
cluster_my_epoch:10
cluster_stats_messages_sent:2424257
cluster_stats_messages_received:2423717
127.0.0.1:7000> cluster nodes
930daea84150b5fabd32a95592781b27ceab1b71 192.168.39.153:7001 slave 81c884ebfc919ad293f02d797aff1033025ac27e 0 1480817793875 9 connected
8a6707d5b9269b6260315b47f300c1ab599733b7 192.168.39.153:7005 master - 0 1480817794879 11 connected 10923-16383
bdb62bb6ffce71588961f513c74b0d5a1a7145ea 192.168.39.153:7002 slave 8a6707d5b9269b6260315b47f300c1ab599733b7 0 1480817793372 11 connected
81c884ebfc919ad293f02d797aff1033025ac27e 192.168.39.153:7004 master - 0 1480817794378 9 connected 5461-10922
099cfc6fbb785449a8bf5369a53d21a9e127fa42 192.168.39.153:7000 myself,master - 0 0 10 connected 0-5460
a8081e97862d9cf76c72d364f9a173187376f215 192.168.39.153:7003 slave 099cfc6fbb785449a8bf5369a53d21a9e127fa42 0 1480817792868 10 connected
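We manually send an INT signal (SIGINT) to terminate the redis-server process listening on port 7004. The process list below confirms that it is gone: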
~/redis-3.0.0/src$ ps aux | grep redis
ubuntu     6067  0.0  0.4  33148  4080 ?        Ss   11月27   0:00 SCREEN -S redis
ubuntu     7192  0.0  0.8  42300  8392 ?        Ssl  11月27   7:22 redis-server *:7000 [cluster]          
ubuntu     7196  0.0  1.0  42300 10632 ?        Ssl  11月27   7:19 redis-server *:7001 [cluster]          
ubuntu     7200  0.0  1.0  42300 10504 ?        Ssl  11月27   7:21 redis-server *:7002 [cluster]          
ubuntu     7205  0.0  1.0  42300 10524 ?        Ssl  11月27   7:21 redis-server *:7003 [cluster]          
ubuntu     7218  0.0  0.8  42300  8556 ?        Ssl  11月27   7:21 redis-server *:7005 [cluster]          
ubuntu    56036  0.0  0.3  31128  3232 pts/6    S+   09:15   0:00 screen -r redis
ubuntu    56161  0.2  0.7  42304  7532 pts/25   Sl+  09:23   0:10 redis-sentinel *:27000 [sentinel]
ubuntu    56176  0.2  0.7  42304  7444 pts/26   Sl+  09:23   0:10 redis-sentinel *:27001 [sentinel]
ubuntu    56192  0.2  0.9  42304  9424 pts/27   Sl+  09:23   0:10 redis-sentinel *:27002 [sentinel]
ubuntu    56536  0.0  0.2  15944  2396 pts/12   R+   10:29   0:00 grep --color=auto redis
~/redis-3.0.0/src$ redis-cli -p 7000
127.0.0.1:7000> cluster info
cluster_state:ok
cluster_slots_assigned:16384
cluster_slots_ok:16384
cluster_slots_pfail:0
cluster_slots_fail:0
cluster_known_nodes:6
cluster_size:3
cluster_current_epoch:13
cluster_my_epoch:10
cluster_stats_messages_sent:2433366
cluster_stats_messages_received:2433005
127.0.0.1:7000> cluster nodes
930daea84150b5fabd32a95592781b27ceab1b71 192.168.39.153:7001 master - 0 1480818606296 13 connected 5461-10922
8a6707d5b9269b6260315b47f300c1ab599733b7 192.168.39.153:7005 master - 0 1480818606797 11 connected 10923-16383
bdb62bb6ffce71588961f513c74b0d5a1a7145ea 192.168.39.153:7002 slave 8a6707d5b9269b6260315b47f300c1ab599733b7 0 1480818608306 11 connected
81c884ebfc919ad293f02d797aff1033025ac27e 192.168.39.153:7004 master,fail - 1480818583889 1480818583084 9 disconnected
099cfc6fbb785449a8bf5369a53d21a9e127fa42 192.168.39.153:7000 myself,master - 0 0 10 connected 0-5460
a8081e97862d9cf76c72d364f9a173187376f215 192.168.39.153:7003 slave 099cfc6fbb785449a8bf5369a53d21a9e127fa42 0 1480818607301 10 connected
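The same check can be done programmatically by parsing the cluster nodes output. The sketch below assumes the field order shown above (id, address, flags, master id, ping-sent, pong-recv, config epoch, link state, slots); ClusterNode and the helper functions are hypothetical names, and newer Redis versions print the address in a slightly different form.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ClusterNode:
    node_id: str
    addr: str
    flags: List[str]
    master_id: Optional[str]  # None for masters ("-" in the output)
    config_epoch: int
    link_state: str
    slots: List[str]


def parse_cluster_nodes(text: str) -> List[ClusterNode]:
    """Parse the textual output of the CLUSTER NODES command."""
    nodes = []
    for line in text.strip().splitlines():
        f = line.split()
        if len(f) < 8:
            continue  # not a node line
        nodes.append(ClusterNode(
            node_id=f[0],
            addr=f[1],
            flags=f[2].split(","),
            master_id=None if f[3] == "-" else f[3],
            config_epoch=int(f[6]),
            link_state=f[7],
            slots=f[8:],
        ))
    return nodes


def failed_nodes(nodes):
    return [n for n in nodes if "fail" in n.flags]


def serving_masters(nodes):
    return [n for n in nodes if "master" in n.flags and n.slots]


# Three lines taken from the output after the failover.
SAMPLE = """
930daea84150b5fabd32a95592781b27ceab1b71 192.168.39.153:7001 master - 0 1480818606296 13 connected 5461-10922
81c884ebfc919ad293f02d797aff1033025ac27e 192.168.39.153:7004 master,fail - 1480818583889 1480818583084 9 disconnected
099cfc6fbb785449a8bf5369a53d21a9e127fa42 192.168.39.153:7000 myself,master - 0 0 10 connected 0-5460
"""

if __name__ == "__main__":
    nodes = parse_cluster_nodes(SAMPLE)
    print("failed:", [n.addr for n in failed_nodes(nodes)])
    print("serving:", [(n.addr, n.slots) for n in serving_masters(nodes)])
```

In the sample, 7004 is flagged as failed and owns no slots, while 7001 now serves 5461-10922 with a higher config epoch (13) than the node it replaced (9).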
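After one master was stopped, the failover completed within a short time and the cluster became available again: the former slave on port 7001 is now a master serving slots 5461-10922, and 7004 is marked master,fail. The failure handling worked as intended, and in a distributed cluster, guaranteeing high availability is one of the most important requirements. It is worth keeping in mind that Redis Cluster nodes also perform their own failure detection and slave promotion, so the cluster nodes output alone does not show which component triggered the promotion; the Sentinel logs are the place to confirm that. In the next installment we will look at the source code to see how Sentinel implements failover.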