Case study: Redis Cluster fails to come back after its node servers crash
阿新 · Published 2018-12-25
Here is a situation I ran into myself.

The redis cluster consists of three node servers running 6 Redis instances in total; each node opens 2 ports, giving three masters and three slaves. Redis is deployed under /data/redis-4.0.1, and the cluster looks like this:

172.16.50.245:7000  master
172.16.50.245:7001  slave of 172.16.50.246:7002
172.16.50.246:7002  master
172.16.50.246:7003  slave of 172.16.50.247:7004
172.16.50.247:7004  master
172.16.50.247:7005  slave of 172.16.50.245:7000

As the list shows, the three master nodes are spread across three different servers, and so are the three slave nodes. These three node servers, however, are virtual machines that all lived on the same physical host. One day that host shut down abruptly because of a hardware failure, which restarted the Redis services on all three nodes at once. Below is how the restart played out on each node.
1) The commands used to restart Redis on the three nodes:
172.16.50.245
[root@172.16.50.245 ~]# ps -ef|grep redis|grep -v grep
[root@172.16.50.245 ~]#
[root@172.16.50.245 ~]# for((i=0;i<=1;i++)); do /data/redis-4.0.1/src/redis-server /data/redis-4.0.1/redis-cluster/700$i/redis.conf; done
[root@172.16.50.245 ~]# ps -ef|grep redis
root 2059 1 0 22:29 ? 00:00:00 /data/redis-4.0.1/src/redis-server 172.16.50.245:7000 [cluster]
root 2061 1 0 22:29 ? 00:00:00 /data/redis-4.0.1/src/redis-server 172.16.50.245:7001 [cluster]
root 2092 1966 0 22:29 pts/0 00:00:00 grep redis
[root@172.16.50.245 ~]# /data/redis-4.0.1/src/redis-cli -h 172.16.50.245 -c -p 7000
172.16.50.245:7000> cluster nodes
678211b78a4eb15abf27406d057900554ff70d4d :7000@17000 myself,master - 0 0 0 connected
172.16.50.246
[root@172.16.50.246 ~]# ps -ef|grep redis|grep -v grep
[root@172.16.50.246 ~]#
[root@172.16.50.246 ~]# for((i=2;i<=3;i++)); do /data/redis-4.0.1/src/redis-server /data/redis-4.0.1/redis-cluster/700$i/redis.conf; done
[root@172.16.50.246 ~]# ps -ef|grep redis
root 1985 1 0 22:29 ? 00:00:00 /data/redis-4.0.1/src/redis-server 172.16.50.246:7002 [cluster]
root 1987 1 0 22:29 ? 00:00:00 /data/redis-4.0.1/src/redis-server 172.16.50.246:7003 [cluster]
root 2016 1961 0 22:29 pts/0 00:00:00 grep redis
[root@172.16.50.246 ~]# /data/redis-4.0.1/src/redis-cli -h 172.16.50.246 -c -p 7002
172.16.50.246:7002> cluster nodes
2ebe8bbecddae0ba0086d1b8797f52556db5d3fd 172.16.50.246:7002@17002 myself,master - 0 0 0 connected
172.16.50.247
[root@172.16.50.247 ~]# ps -ef|grep redis|grep -v grep
[root@172.16.50.247 ~]#
[root@172.16.50.247 ~]# for((i=4;i<=5;i++)); do /data/redis-4.0.1/src/redis-server /data/redis-4.0.1/redis-cluster/700$i/redis.conf; done
[root@172.16.50.247 ~]# ps -ef|grep redis
root 1987 1 0 22:29 ? 00:00:00 /data/redis-4.0.1/src/redis-server 172.16.50.247:7004 [cluster]
root 1989 1 0 22:29 ? 00:00:00 /data/redis-4.0.1/src/redis-server 172.16.50.247:7005 [cluster]
root 2018 1966 0 22:29 pts/0 00:00:00 grep redis
[root@172.16.50.247 ~]# /data/redis-4.0.1/src/redis-cli -h 172.16.50.247 -c -p 7004
172.16.50.247:7004> cluster nodes
ccd4ed6ad27eeb9151ab52eb5f04bcbd03980dc6 172.16.50.247:7004@17004 myself,master - 0 0 0 connected
172.16.50.247:7004>
As the output above shows, after the restart the three Redis nodes did not rejoin the redis cluster on their own; each instance only knows about itself.
2) Because the redis cluster nodes went down (or, equivalently, their Redis services restarted), part of the slot shards was lost, and checking the cluster's state produced the error below.
Note: the redis cluster creation/repair commands must be run on a machine with the gem tooling installed; here that is the 172.16.50.245 node server (the other two node servers do not have gem installed, so the commands cannot be run there).
[root@172.16.50.245 ~]# /data/redis-4.0.1/src/redis-trib.rb check 172.16.50.245:7000
........
[ERR] Not all 16384 slots are covered by nodes.
================================================================================
Cause: usually the slot total fails to reach 16384, which is to say the slot distribution is wrong. The Redis instances on the other two node servers were in the same state.
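For context on where the 16384 figure comes from: Redis Cluster maps every key to a slot via HASH_SLOT = CRC16(key) mod 16384 (CRC16-XMODEM), and check fails when the live masters do not cover all of them. A minimal Python reimplementation of the mapping, for illustration only:

```python
def crc16(data: bytes) -> int:
    # CRC16-XMODEM as used by Redis Cluster: poly 0x1021, init 0.
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x1021) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc

def key_slot(key: bytes) -> int:
    # Honor hash tags: if the key contains a non-empty {...} section,
    # only that section is hashed, so related keys land on one slot.
    start = key.find(b"{")
    if start != -1:
        end = key.find(b"}", start + 1)
        if end > start + 1:
            key = key[start + 1:end]
    return crc16(key) % 16384

print(key_slot(b"foo"))  # same value that CLUSTER KEYSLOT foo returns
```

With slot 0-5460 on the first master, 5461-10922 on the second and 10923-16383 on the third, every key computed this way has exactly one owner; a missing range is what triggers the [ERR] above.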
Solution:
The official recommendation is to repair the cluster with redis-trib.rb fix. cluster nodes showed that the 7001 instance had been dropped. The repair goes as follows:
[root@172.16.50.245 ~]# /data/redis-4.0.1/src/redis-trib.rb fix 172.16.50.245:7000
......
Fix these slots by covering with a random node? (type 'yes' to accept): yes
After the fix, running check again comes back clean:
[root@172.16.50.245 ~]# /data/redis-4.0.1/src/redis-trib.rb check 172.16.50.245:7000
>>> Performing Cluster Check (using node 172.16.50.245:7000)
M: 678211b78a4eb15abf27406d057900554ff70d4d 172.16.50.245:7000
slots:0-16383 (16384 slots) master
0 additional replica(s)
[OK] All nodes agree about slots configuration.
>>> Check for open slots...
>>> Check slots coverage...
[OK] All 16384 slots covered.
Repair the other 5 Redis instances the same way:
[root@172.16.50.245 ~]# /data/redis-4.0.1/src/redis-trib.rb fix 172.16.50.245:7001
[root@172.16.50.245 ~]# /data/redis-4.0.1/src/redis-trib.rb fix 172.16.50.246:7002
[root@172.16.50.245 ~]# /data/redis-4.0.1/src/redis-trib.rb fix 172.16.50.246:7003
[root@172.16.50.245 ~]# /data/redis-4.0.1/src/redis-trib.rb fix 172.16.50.247:7004
[root@172.16.50.245 ~]# /data/redis-4.0.1/src/redis-trib.rb fix 172.16.50.247:7005
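The six fix invocations above can be driven from one loop instead of being typed by hand. A sketch; the helper names and the dry_run flag are my own, the path and addresses are this cluster's:

```python
import subprocess

TRIB = "/data/redis-4.0.1/src/redis-trib.rb"
INSTANCES = [
    "172.16.50.245:7000", "172.16.50.245:7001",
    "172.16.50.246:7002", "172.16.50.246:7003",
    "172.16.50.247:7004", "172.16.50.247:7005",
]

def fix_commands(instances=INSTANCES, trib=TRIB):
    """Build the argv for `redis-trib.rb fix <addr>` for every instance."""
    return [[trib, "fix", addr] for addr in instances]

def fix_all(dry_run=True):
    for cmd in fix_commands():
        print(" ".join(cmd))
        if not dry_run:
            # Only works on the node where the gem tooling is installed.
            subprocess.run(cmd, check=True)
```

Run with dry_run=True first to eyeball the commands, then flip the flag on the gem-equipped node.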
=================================================================================
3) Next, join the three Redis nodes back into a cluster (that is, re-create the redis cluster; here the master and slave nodes are added by hand).
Note: the redis cluster creation commands must be run on a machine with the gem tooling installed; here that is the 172.16.50.245 node server (the other two node servers do not have gem installed, so the commands cannot be run there).
First create the Redis master nodes:
[root@172.16.50.245 ~]# /data/redis-4.0.1/src/redis-trib.rb create 172.16.50.245:7000 172.16.50.246:7002 172.16.50.247:7004
Using 3 masters:
172.16.50.245:7000
172.16.50.246:7002
172.16.50.247:7004
M: 678211b78a4eb15abf27406d057900554ff70d4d 172.16.50.245:7000
slots:0-16383 (16384 slots) master
M: 2ebe8bbecddae0ba0086d1b8797f52556db5d3fd 172.16.50.246:7002
slots:0-16383 (16384 slots) master
M: ccd4ed6ad27eeb9151ab52eb5f04bcbd03980dc6 172.16.50.247:7004
slots:0-16383 (16384 slots) master
Can I set the above configuration? (type 'yes' to accept): yes    # type yes here
==============================================================================
If you see one of the following errors:
[ERR] Node 172.16.50.245:7000 is not empty. Either the node already knows other nodes (check with CLUSTER NODES) or
contains some key in database 0.
or
[ERR] Node 172.16.50.246:7002 is not empty. Either the node already knows other nodes (check with CLUSTER NODES) or
contains some key in database 0.
or
[ERR] Node 172.16.50.247:7004 is not empty. Either the node already knows other nodes (check with CLUSTER NODES) or
contains some key in database 0.
or
Can I set the above configuration? (type 'yes' to accept): yes
/usr/local/rvm/gems/ruby-2.3.1/gems/redis-4.0.1/lib/redis/client.rb:119:in `call': ERR Slot 0 is already busy (Redis::CommandError)
from /usr/local/rvm/gems/ruby-2.3.1/gems/redis-4.0.1/lib/redis.rb:2764:in `block in method_missing'
from /usr/local/rvm/gems/ruby-2.3.1/gems/redis-4.0.1/lib/redis.rb:45:in `block in synchronize'
from /usr/local/rvm/rubies/ruby-2.3.1/lib/ruby/2.3.0/monitor.rb:214:in `mon_synchronize'
from /usr/local/rvm/gems/ruby-2.3.1/gems/redis-4.0.1/lib/redis.rb:45:in `synchronize'
from /usr/local/rvm/gems/ruby-2.3.1/gems/redis-4.0.1/lib/redis.rb:2763:in `method_missing'
from /data/redis-4.0.1/src/redis-trib.rb:212:in `flush_node_config'
from /data/redis-4.0.1/src/redis-trib.rb:776:in `block in flush_nodes_config'
from /data/redis-4.0.1/src/redis-trib.rb:775:in `each'
from /data/redis-4.0.1/src/redis-trib.rb:775:in `flush_nodes_config'
from /data/redis-4.0.1/src/redis-trib.rb:1296:in `create_cluster_cmd'
from /data/redis-4.0.1/src/redis-trib.rb:1700:in `<main>'
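What "not empty" means here: redis-trib only accepts a node into a new cluster if it neither knows any other node nor holds any key. A hypothetical helper illustrating the two conditions (the function name and signature are my own, not redis-trib's):

```python
def node_is_empty(cluster_nodes_output: str, db0_keys: int) -> bool:
    """A node is 'empty' for cluster creation when its CLUSTER NODES
    output lists only itself (a single line) and database 0 has no keys."""
    known = [l for l in cluster_nodes_output.strip().splitlines() if l.strip()]
    return len(known) == 1 and db0_keys == 0
```

The restarted instances fail this test because they still carry their old nodes_*.conf cluster state and their old dataset, which is exactly what the cleanup steps a) and b) remove.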
Solution:
a) On the three node machines 172.16.50.245, 172.16.50.246 and 172.16.50.247, remove all local aof/rdb dumps and nodes files under the Redis directory. Back them up before deleting (or simply mv them elsewhere):
[root@172.16.50.245 ~]# cd /data/redis-4.0.1/redis-cluster/
[root@172.16.50.245 redis-cluster]# ls
7000 7001 appendonly.aof dump.rdb nodes_7000.conf nodes_7001.conf
[root@172.16.50.245 redis-cluster]# mv appendonly.aof /opt/
[root@172.16.50.245 redis-cluster]# mv dump.rdb /opt/
[root@172.16.50.245 redis-cluster]# mv nodes_7000.conf /opt/
[root@172.16.50.245 redis-cluster]# mv nodes_7001.conf /opt/
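Step a) can also be scripted so it is repeatable on all three nodes. A sketch that moves the aof/rdb/nodes files into a backup directory; the function name is mine, and the glob patterns match the files listed above:

```python
import shutil
from pathlib import Path

def backup_cluster_files(cluster_dir, backup_dir):
    """Move appendonly/rdb/nodes files out of the cluster directory so
    the instances come back empty; returns the file names moved."""
    cluster_dir, backup_dir = Path(cluster_dir), Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    moved = []
    for pattern in ("*.aof", "*.rdb", "nodes_*.conf"):
        for f in sorted(cluster_dir.glob(pattern)):
            shutil.move(str(f), str(backup_dir / f.name))
            moved.append(f.name)
    return moved
```

The per-instance config directories (7000, 7001, ...) match none of the patterns and stay in place.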
b) Log in to redis and run "flushdb" to clear the data:
[root@172.16.50.245 ~]# /data/redis-4.0.1/src/redis-cli -c -h 172.16.50.245 -p 7000
172.16.50.245:7000> flushdb
OK
172.16.50.245:7000>
c) Restart the redis services:
[root@172.16.50.245 ~]# pkill -9 redis
[root@172.16.50.245 ~]# for((i=0;i<=1;i++)); do /data/redis-4.0.1/src/redis-server /data/redis-4.0.1/redis-cluster/700$i/redis.conf; done
d) Now the cluster create command runs without errors:
[root@172.16.50.245 ~]# /data/redis-4.0.1/src/redis-trib.rb create 172.16.50.245:7000 172.16.50.246:7002 172.16.50.247:7004
=========================================================================
Then add the three slave nodes belonging to the three masters above:
[root@172.16.50.245 ~]# /data/redis-4.0.1/src/redis-trib.rb add-node --slave 172.16.50.247:7005 172.16.50.245:7000
[root@172.16.50.245 ~]# /data/redis-4.0.1/src/redis-trib.rb add-node --slave 172.16.50.245:7001 172.16.50.246:7002
[root@172.16.50.245 ~]# /data/redis-4.0.1/src/redis-trib.rb add-node --slave 172.16.50.246:7003 172.16.50.247:7004
Then look at the redis cluster status (this works from any of the Redis node servers):
[root@172.16.50.245 redis-cluster]# /data/redis-4.0.1/src/redis-cli -c -h 172.16.50.245 -p 7000
172.16.50.245:7000> cluster nodes
1032fedd0c1ca7ac12be3041a34232593bd82343 172.16.50.245:7000@17000 myself,master - 0 1531150844000 1 connected 0-5460
1f389da1c7db857a7b72986289d7a2132e30b879 172.16.50.247:7004@17004 master - 0 1531150846000 3 connected 10923-16383
778c92cdf73232864e1455edb7c2e1b07c6d067e 172.16.50.247:7005@17005 slave 1032fedd0c1ca7ac12be3041a34232593bd82343 0 1531150846781 1 connected
e6649e6abf3ab496ca8a32896c5d72858d36dce9 172.16.50.246:7002@17002 master - 0 1531150845779 2 connected 5461-10922
d0f4c92f0b6bcc5f257a7546d4d121c602c78df8 172.16.50.246:7003@17003 slave 1f389da1c7db857a7b72986289d7a2132e30b879 0 1531150844777 3 connected
846943fd218aeafe6e2b4ec96505714fab2e1861 172.16.50.245:7001@17001 slave e6649e6abf3ab496ca8a32896c5d72858d36dce9 0 1531150847784 2 connected
172.16.50.245:7000> cluster info
cluster_state:ok
cluster_slots_assigned:16384
cluster_slots_ok:16384
cluster_slots_pfail:0
cluster_slots_fail:0
cluster_known_nodes:6
cluster_size:3
cluster_current_epoch:3
cluster_my_epoch:1
cluster_stats_messages_ping_sent:769
cluster_stats_messages_pong_sent:711
cluster_stats_messages_meet_sent:1
cluster_stats_messages_sent:1481
cluster_stats_messages_ping_received:706
cluster_stats_messages_pong_received:770
cluster_stats_messages_meet_received:5
cluster_stats_messages_received:1481
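The CLUSTER INFO output above is easy to check by machine, which is handy during the rolling restarts discussed later. A hedged sketch of such a health check; the helper names are mine:

```python
def parse_cluster_info(text: str) -> dict:
    """Parse CLUSTER INFO's key:value lines into a dict of strings."""
    info = {}
    for line in text.strip().splitlines():
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info

def cluster_healthy(info: dict) -> bool:
    # Healthy: state is ok and every one of the 16384 slots is assigned and ok.
    return (info.get("cluster_state") == "ok"
            and info.get("cluster_slots_assigned") == "16384"
            and info.get("cluster_slots_ok") == "16384")
```

Feeding it the output captured above would report the cluster healthy; right after the crash, cluster_state:fail would make it report the opposite.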
======================================================
A few takeaways (note: after a restart, a former master instance comes back as a slave, and the slave that used to replicate it becomes the new master):
1) Master instances are best spread across different servers, and so are slave instances.
2) A master instance and its own slave instance(s) should also live on different servers, never the same one.
If the redis cluster machines are virtual machines, those VMs had better not share one physical host either.
This guards against a single server failure, or the loss of a whole master/slave pair, causing data loss and a cluster that cannot come back up.
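Rules 1) and 2) can be verified mechanically before going live. A sketch, with the topology given as a master-to-replica mapping of "ip:port" strings (the representation and function names are my own):

```python
def host(addr: str) -> str:
    """'ip:port' -> 'ip'."""
    return addr.rsplit(":", 1)[0]

def placement_ok(master_to_replica: dict) -> bool:
    """True when all masters sit on distinct hosts and no replica
    shares a host with its own master."""
    master_hosts = [host(m) for m in master_to_replica]
    if len(set(master_hosts)) != len(master_hosts):
        return False  # two masters share one server
    return all(host(m) != host(r) for m, r in master_to_replica.items())
```

The topology in this article passes the check; a layout with a master and its replica on one box (which is what made this outage unrecoverable without a rebuild) fails it.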
3) If no more than half of the cluster's machines go down, the redis cluster keeps serving and can still be used.
But once a failed node machine restarts and its Redis instances rejoin the cluster, the instance that used to be a master comes back as a slave (its former slave having been promoted to master in the meantime);
in other words, both Redis instances on the restarted node end up as slaves (one master plus one slave before the restart, two slaves after).
4) If more than half of the cluster's machines go down, it is not enough to restart the redis services; the nodes have to be joined back into a redis cluster.
When re-creating the cluster, keep the master/slave relationships the same as before.
5) Be very careful: when a redis cluster node restarts, it is first kicked out of the cluster, meaning its master and slave instances disappear from it, and the slave of its former master becomes the new master. Once the node is back up
it rejoins the cluster automatically, with all of its instances now slaves. Because there is a window between being kicked out and rejoining automatically, the cluster's nodes
must never be restarted at the same time, nor in quick succession; doing so can collapse the cluster and lose its data. As a rule, restart one node, confirm it has rejoined the cluster successfully, and only then
restart the next one.
When Redis acts as a pure cache, losing its data is usually invisible to the business: nothing breaks, and the cache simply gets repopulated. When it acts as a storage service (that is, as the backing datastore), data loss hits the business hard.
In most scenarios, though, the backing datastore is mysql, oracle or mongodb.