Case study: Redis Cluster fails to come back after its node servers crash
阿新 · Published 2018-12-25
Here is a situation I ran into myself.

The redis cluster consists of three node servers running 6 Redis instances in total; each node opens 2 ports, giving three masters and three slaves. Redis is deployed under /data/redis-4.0.1, and the cluster looks like this:

172.16.50.245:7000  master
172.16.50.245:7001  slave of 172.16.50.246:7002
172.16.50.246:7002  master
172.16.50.246:7003  slave of 172.16.50.247:7004
172.16.50.247:7004  master
172.16.50.247:7005  slave of 172.16.50.245:7000

As the list shows, the three master nodes are spread across three different servers, and so are the three slave nodes. These three node servers, however, are virtual machines that all lived on the same physical host. One day that host shut down abruptly because of a hardware failure, which restarted the Redis services on all three nodes at once. Below is how the restart played out on each node.
1) The commands used to restart Redis on the three nodes:
172.16.50.245
[root@172.16.50.245 ~]# ps -ef|grep redis|grep -v grep
[root@172.16.50.245 ~]#
[root@172.16.50.245 ~]# for((i=0;i<=1;i++)); do /data/redis-4.0.1/src/redis-server /data/redis-4.0.1/redis-cluster/700$i/redis.conf; done
[root@172.16.50.245 ~]# ps -ef|grep redis
root 2059 1 0 22:29 ? 00:00:00 /data/redis-4.0.1/src/redis-server 172.16.50.245:7000 [cluster]
root 2061 1 0 22:29 ? 00:00:00 /data/redis-4.0.1/src/redis-server 172.16.50.245:7001 [cluster]
root 2092 1966 0 22:29 pts/0 00:00:00 grep redis
[root@172.16.50.245 ~]# /data/redis-4.0.1/src/redis-cli -h 172.16.50.245 -c -p 7000
172.16.50.245:7000> cluster nodes
678211b78a4eb15abf27406d057900554ff70d4d :7000@17000 myself,master - 0 0 0 connected
172.16.50.246
[root@172.16.50.246 ~]# ps -ef|grep redis|grep -v grep
[root@172.16.50.246 ~]#
[root@172.16.50.246 ~]# for((i=2;i<=3;i++)); do /data/redis-4.0.1/src/redis-server /data/redis-4.0.1/redis-cluster/700$i/redis.conf; done
[root@172.16.50.246 ~]# ps -ef|grep redis
root 1985 1 0 22:29 ? 00:00:00 /data/redis-4.0.1/src/redis-server 172.16.50.246:7002 [cluster]
root 1987 1 0 22:29 ? 00:00:00 /data/redis-4.0.1/src/redis-server 172.16.50.246:7003 [cluster]
root 2016 1961 0 22:29 pts/0 00:00:00 grep redis
[root@172.16.50.246 ~]# /data/redis-4.0.1/src/redis-cli -h 172.16.50.246 -c -p 7002
172.16.50.246:7002> cluster nodes
2ebe8bbecddae0ba0086d1b8797f52556db5d3fd 172.16.50.246:7002@17002 myself,master - 0 0 0 connected
172.16.50.247
[root@172.16.50.247 ~]# ps -ef|grep redis|grep -v grep
[root@172.16.50.247 ~]#
[root@172.16.50.247 ~]# for((i=4;i<=5;i++)); do /data/redis-4.0.1/src/redis-server /data/redis-4.0.1/redis-cluster/700$i/redis.conf; done
[root@172.16.50.247 ~]# ps -ef|grep redis
root 1987 1 0 22:29 ? 00:00:00 /data/redis-4.0.1/src/redis-server 172.16.50.247:7004 [cluster]
root 1989 1 0 22:29 ? 00:00:00 /data/redis-4.0.1/src/redis-server 172.16.50.247:7005 [cluster]
root 2018 1966 0 22:29 pts/0 00:00:00 grep redis
[root@172.16.50.247 ~]# /data/redis-4.0.1/src/redis-cli -h 172.16.50.247 -c -p 7004
172.16.50.247:7004> cluster nodes
ccd4ed6ad27eeb9151ab52eb5f04bcbd03980dc6 172.16.50.247:7004@17004 myself,master - 0 0 0 connected
172.16.50.247:7004>
As the output above shows, after the restart the three Redis nodes did not rejoin the redis cluster on their own; each instance only knows about itself.
2) Because the redis cluster nodes went down (or, equivalently, their Redis services restarted), part of the slot shards was lost, and checking the cluster's state produced the error below.
Note: the redis cluster creation/repair commands must be run on a machine with the gem tooling installed; here that is the 172.16.50.245 node server (the other two node servers do not have gem installed, so the commands cannot be run there).
[root@172.16.50.245 ~]# /data/redis-4.0.1/src/redis-trib.rb check 172.16.50.245:7000
........
[ERR] Not all 16384 slots are covered by nodes.
================================================================================
Cause: usually the slot total fails to reach 16384, which is to say the slot distribution is wrong. The Redis instances on the other two node servers were in the same state.
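For context on where the 16384 figure comes from: Redis Cluster maps every key to a slot via HASH_SLOT = CRC16(key) mod 16384 (CRC16-XMODEM), and check fails when the live masters do not cover all of them. A minimal Python reimplementation of the mapping, for illustration only:

```python
def crc16(data: bytes) -> int:
    # CRC16-XMODEM as used by Redis Cluster: poly 0x1021, init 0.
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x1021) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc

def key_slot(key: bytes) -> int:
    # Honor hash tags: if the key contains a non-empty {...} section,
    # only that section is hashed, so related keys land on one slot.
    start = key.find(b"{")
    if start != -1:
        end = key.find(b"}", start + 1)
        if end > start + 1:
            key = key[start + 1:end]
    return crc16(key) % 16384

print(key_slot(b"foo"))  # same value that CLUSTER KEYSLOT foo returns
```

With slot 0-5460 on the first master, 5461-10922 on the second and 10923-16383 on the third, every key computed this way has exactly one owner; a missing range is what triggers the [ERR] above.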
Solution:
The official recommendation is to repair the cluster with redis-trib.rb fix. cluster nodes showed that the 7001 instance had been dropped. The repair goes as follows:
[root@172.16.50.245 ~]# /data/redis-4.0.1/src/redis-trib.rb fix 172.16.50.245:7000
......
Fix these slots by covering with a random node? (type 'yes' to accept): yes
After the fix, running check again comes back clean:
[root@172.16.50.245 ~]# /data/redis-4.0.1/src/redis-trib.rb check 172.16.50.245:7000
>>> Performing Cluster Check (using node 172.16.50.245:7000)
M: 678211b78a4eb15abf27406d057900554ff70d4d 172.16.50.245:7000
slots:0-16383 (16384 slots) master
0 additional replica(s)
[OK] All nodes agree about slots configuration.
>>> Check for open slots...
>>> Check slots coverage...
[OK] All 16384 slots covered.
Repair the other 5 Redis instances the same way:
[root@172.16.50.245 ~]# /data/redis-4.0.1/src/redis-trib.rb fix 172.16.50.245:7001
[root@172.16.50.245 ~]# /data/redis-4.0.1/src/redis-trib.rb fix 172.16.50.246:7002
[root@172.16.50.245 ~]# /data/redis-4.0.1/src/redis-trib.rb fix 172.16.50.246:7003
[root@172.16.50.245 ~]# /data/redis-4.0.1/src/redis-trib.rb fix 172.16.50.247:7004
[root@172.16.50.245 ~]# /data/redis-4.0.1/src/redis-trib.rb fix 172.16.50.247:7005
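The six fix invocations above can be driven from one loop instead of being typed by hand. A sketch; the helper names and the dry_run flag are my own, the path and addresses are this cluster's:

```python
import subprocess

TRIB = "/data/redis-4.0.1/src/redis-trib.rb"
INSTANCES = [
    "172.16.50.245:7000", "172.16.50.245:7001",
    "172.16.50.246:7002", "172.16.50.246:7003",
    "172.16.50.247:7004", "172.16.50.247:7005",
]

def fix_commands(instances=INSTANCES, trib=TRIB):
    """Build the argv for `redis-trib.rb fix <addr>` for every instance."""
    return [[trib, "fix", addr] for addr in instances]

def fix_all(dry_run=True):
    for cmd in fix_commands():
        print(" ".join(cmd))
        if not dry_run:
            # Only works on the node where the gem tooling is installed.
            subprocess.run(cmd, check=True)
```

Run with dry_run=True first to eyeball the commands, then flip the flag on the gem-equipped node.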
=================================================================================
3) Next, join the three Redis nodes back into a cluster (that is, re-create the redis cluster; here the master and slave nodes are added by hand).
Note: the redis cluster creation commands must be run on a machine with the gem tooling installed; here that is the 172.16.50.245 node server (the other two node servers do not have gem installed, so the commands cannot be run there).
First create the Redis master nodes:
[root@172.16.50.245 ~]# /data/redis-4.0.1/src/redis-trib.rb create 172.16.50.245:7000 172.16.50.246:7002 172.16.50.247:7004
Using 3 masters:
172.16.50.245:7000
172.16.50.246:7002
172.16.50.247:7004
M: 678211b78a4eb15abf27406d057900554ff70d4d 172.16.50.245:7000
slots:0-16383 (16384 slots) master
M: 2ebe8bbecddae0ba0086d1b8797f52556db5d3fd 172.16.50.246:7002
slots:0-16383 (16384 slots) master
M: ccd4ed6ad27eeb9151ab52eb5f04bcbd03980dc6 172.16.50.247:7004
slots:0-16383 (16384 slots) master
Can I set the above configuration? (type 'yes' to accept): yes    # type yes here
==============================================================================
If you see one of the following errors:
[ERR] Node 172.16.50.245:7000 is not empty. Either the node already knows other nodes (check with CLUSTER NODES) or
contains some key in database 0.
or
[ERR] Node 172.16.50.246:7002 is not empty. Either the node already knows other nodes (check with CLUSTER NODES) or
contains some key in database 0.
or
[ERR] Node 172.16.50.247:7004 is not empty. Either the node already knows other nodes (check with CLUSTER NODES) or
contains some key in database 0.
or
Can I set the above configuration? (type 'yes' to accept): yes
/usr/local/rvm/gems/ruby-2.3.1/gems/redis-4.0.1/lib/redis/client.rb:119:in `call': ERR Slot 0 is already busy (Redis::CommandError)
from /usr/local/rvm/gems/ruby-2.3.1/gems/redis-4.0.1/lib/redis.rb:2764:in `block in method_missing'
from /usr/local/rvm/gems/ruby-2.3.1/gems/redis-4.0.1/lib/redis.rb:45:in `block in synchronize'
from /usr/local/rvm/rubies/ruby-2.3.1/lib/ruby/2.3.0/monitor.rb:214:in `mon_synchronize'
from /usr/local/rvm/gems/ruby-2.3.1/gems/redis-4.0.1/lib/redis.rb:45:in `synchronize'
from /usr/local/rvm/gems/ruby-2.3.1/gems/redis-4.0.1/lib/redis.rb:2763:in `method_missing'
from /data/redis-4.0.1/src/redis-trib.rb:212:in `flush_node_config'
from /data/redis-4.0.1/src/redis-trib.rb:776:in `block in flush_nodes_config'
from /data/redis-4.0.1/src/redis-trib.rb:775:in `each'
from /data/redis-4.0.1/src/redis-trib.rb:775:in `flush_nodes_config'
from /data/redis-4.0.1/src/redis-trib.rb:1296:in `create_cluster_cmd'
from /data/redis-4.0.1/src/redis-trib.rb:1700:in `<main>'
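What "not empty" means here: redis-trib only accepts a node into a new cluster if it neither knows any other node nor holds any key. A hypothetical helper illustrating the two conditions (the function name and signature are my own, not redis-trib's):

```python
def node_is_empty(cluster_nodes_output: str, db0_keys: int) -> bool:
    """A node is 'empty' for cluster creation when its CLUSTER NODES
    output lists only itself (a single line) and database 0 has no keys."""
    known = [l for l in cluster_nodes_output.strip().splitlines() if l.strip()]
    return len(known) == 1 and db0_keys == 0
```

The restarted instances fail this test because they still carry their old nodes_*.conf cluster state and their old dataset, which is exactly what the cleanup steps a) and b) remove.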
Solution:
a) On the three node machines 172.16.50.245, 172.16.50.246 and 172.16.50.247, remove all local aof/rdb dumps and nodes files under the Redis directory. Back them up before deleting (or simply mv them elsewhere):
[root@172.16.50.245 ~]# cd /data/redis-4.0.1/redis-cluster/
[root@172.16.50.245 redis-cluster]# ls
7000 7001 appendonly.aof dump.rdb nodes_7000.conf nodes_7001.conf
[root@172.16.50.245 redis-cluster]# mv appendonly.aof /opt/
[root@172.16.50.245 redis-cluster]# mv dump.rdb /opt/
[root@172.16.50.245 redis-cluster]# mv nodes_7000.conf /opt/
[root@172.16.50.245 redis-cluster]# mv nodes_7001.conf /opt/
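Step a) can also be scripted so it is repeatable on all three nodes. A sketch that moves the aof/rdb/nodes files into a backup directory; the function name is mine, and the glob patterns match the files listed above:

```python
import shutil
from pathlib import Path

def backup_cluster_files(cluster_dir, backup_dir):
    """Move appendonly/rdb/nodes files out of the cluster directory so
    the instances come back empty; returns the file names moved."""
    cluster_dir, backup_dir = Path(cluster_dir), Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    moved = []
    for pattern in ("*.aof", "*.rdb", "nodes_*.conf"):
        for f in sorted(cluster_dir.glob(pattern)):
            shutil.move(str(f), str(backup_dir / f.name))
            moved.append(f.name)
    return moved
```

The per-instance config directories (7000, 7001, ...) match none of the patterns and stay in place.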
b) Log in to redis and run "flushdb" to clear the data:
[root@172.16.50.245 ~]# /data/redis-4.0.1/src/redis-cli -c -h 172.16.50.245 -p 7000
172.16.50.245:7000> flushdb
OK
172.16.50.245:7000>
c) Restart the redis services:
[root@172.16.50.245 ~]# pkill -9 redis
[root@172.16.50.245 ~]# for((i=0;i<=1;i++)); do /data/redis-4.0.1/src/redis-server /data/redis-4.0.1/redis-cluster/700$i/redis.conf; done
d) Now the cluster create command runs without errors:
[root@172.16.50.245 ~]# /data/redis-4.0.1/src/redis-trib.rb create 172.16.50.245:7000 172.16.50.246:7002 172.16.50.247:7004
=========================================================================
Then add the three slave nodes belonging to the three masters above:
[root@172.16.50.245 ~]# /data/redis-4.0.1/src/redis-trib.rb add-node --slave 172.16.50.247:7005 172.16.50.245:7000
[root@172.16.50.245 ~]# /data/redis-4.0.1/src/redis-trib.rb add-node --slave 172.16.50.245:7001 172.16.50.246:7002
[root@172.16.50.245 ~]# /data/redis-4.0.1/src/redis-trib.rb add-node --slave 172.16.50.246:7003 172.16.50.247:7004
Then look at the redis cluster status (this works from any of the Redis node servers):
[root@172.16.50.245 redis-cluster]# /data/redis-4.0.1/src/redis-cli -c -h 172.16.50.245 -p 7000
172.16.50.245:7000> cluster nodes
1032fedd0c1ca7ac12be3041a34232593bd82343 172.16.50.245:7000@17000 myself,master - 0 1531150844000 1 connected 0-5460
1f389da1c7db857a7b72986289d7a2132e30b879 172.16.50.247:7004@17004 master - 0 1531150846000 3 connected 10923-16383
778c92cdf73232864e1455edb7c2e1b07c6d067e 172.16.50.247:7005@17005 slave 1032fedd0c1ca7ac12be3041a34232593bd82343 0 1531150846781 1 connected
e6649e6abf3ab496ca8a32896c5d72858d36dce9 172.16.50.246:7002@17002 master - 0 1531150845779 2 connected 5461-10922
d0f4c92f0b6bcc5f257a7546d4d121c602c78df8 172.16.50.246:7003@17003 slave 1f389da1c7db857a7b72986289d7a2132e30b879 0 1531150844777 3 connected
846943fd218aeafe6e2b4ec96505714fab2e1861 172.16.50.245:7001@17001 slave e6649e6abf3ab496ca8a32896c5d72858d36dce9 0 1531150847784 2 connected
172.16.50.245:7000> cluster info
cluster_state:ok
cluster_slots_assigned:16384
cluster_slots_ok:16384
cluster_slots_pfail:0
cluster_slots_fail:0
cluster_known_nodes:6
cluster_size:3
cluster_current_epoch:3
cluster_my_epoch:1
cluster_stats_messages_ping_sent:769
cluster_stats_messages_pong_sent:711
cluster_stats_messages_meet_sent:1
cluster_stats_messages_sent:1481
cluster_stats_messages_ping_received:706
cluster_stats_messages_pong_received:770
cluster_stats_messages_meet_received:5
cluster_stats_messages_received:1481
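The CLUSTER INFO output above is easy to check by machine, which is handy during the rolling restarts discussed later. A hedged sketch of such a health check; the helper names are mine:

```python
def parse_cluster_info(text: str) -> dict:
    """Parse CLUSTER INFO's key:value lines into a dict of strings."""
    info = {}
    for line in text.strip().splitlines():
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info

def cluster_healthy(info: dict) -> bool:
    # Healthy: state is ok and every one of the 16384 slots is assigned and ok.
    return (info.get("cluster_state") == "ok"
            and info.get("cluster_slots_assigned") == "16384"
            and info.get("cluster_slots_ok") == "16384")
```

Feeding it the output captured above would report the cluster healthy; right after the crash, cluster_state:fail would make it report the opposite.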
======================================================
A few takeaways (note: after a restart, a former master instance comes back as a slave, and the slave that used to replicate it becomes the new master):
1) Master instances are best spread across different servers, and so are slave instances.
2) A master instance and its own slave instance(s) should also live on different servers, never the same one.
If the redis cluster machines are virtual machines, those VMs had better not share one physical host either.
This guards against a single server failure, or the loss of a whole master/slave pair, causing data loss and a cluster that cannot come back up.
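Rules 1) and 2) can be verified mechanically before going live. A sketch, with the topology given as a master-to-replica mapping of "ip:port" strings (the representation and function names are my own):

```python
def host(addr: str) -> str:
    """'ip:port' -> 'ip'."""
    return addr.rsplit(":", 1)[0]

def placement_ok(master_to_replica: dict) -> bool:
    """True when all masters sit on distinct hosts and no replica
    shares a host with its own master."""
    master_hosts = [host(m) for m in master_to_replica]
    if len(set(master_hosts)) != len(master_hosts):
        return False  # two masters share one server
    return all(host(m) != host(r) for m, r in master_to_replica.items())
```

The topology in this article passes the check; a layout with a master and its replica on one box (which is what made this outage unrecoverable without a rebuild) fails it.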
3) If no more than half of the cluster's machines go down, the redis cluster keeps serving and can still be used.
But once a failed node machine restarts and its Redis instances rejoin the cluster, the instance that used to be a master comes back as a slave (its former slave having been promoted to master in the meantime);
in other words, both Redis instances on the restarted node end up as slaves (one master plus one slave before the restart, two slaves after).
4) If more than half of the cluster's machines go down, it is not enough to restart the redis services; the nodes have to be joined back into a redis cluster.
When re-creating the cluster, keep the master/slave relationships the same as before.
5) Be very careful: when a redis cluster node restarts, it is first kicked out of the cluster, meaning its master and slave instances disappear from it, and the slave of its former master becomes the new master. Once the node is back up
it rejoins the cluster automatically, with all of its instances now slaves. Because there is a window between being kicked out and rejoining automatically, the cluster's nodes
must never be restarted at the same time, nor in quick succession; doing so can collapse the cluster and lose its data. As a rule, restart one node, confirm it has rejoined the cluster successfully, and only then
restart the next one.
When Redis acts as a pure cache, losing its data is usually invisible to the business: nothing breaks, and the cache simply gets repopulated. When it acts as a storage service (that is, as the backing datastore), data loss hits the business hard.
In most scenarios, though, the backing datastore is mysql, oracle or mongodb.