
Redis Cluster restores its cluster state automatically when restarted after a total outage

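Yesterday, all three virtual machines hosting a Redis Cluster of 3 masters and 3 slaves in our test environment went down.

(The 3 masters and 3 slaves were deployed in an interleaved layout across the 3 VMs.)

After restarting every node, I found that the cluster had recovered on its own. I had expected to have to run the create command again.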

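My guess is that the cluster rebuilt itself from the master/slave information that each node persists in its node configuration file, combined with heartbeat detection between the nodes.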

The file that records the node relationships, node-7001.conf:

197d2e50893487f54d3cbbf07f746abc1fa08318 10.166.15.36:[email protected] master - 0 1530527218141 2 connected 5461-10922
052b9df159281988af20d8b5d0690e5defa42c47 10.166.15.37:[email protected] slave 197d2e50893487f54d3cbbf07f746abc1fa08318 0 1530527220145 4 connected
ad20d73e76c72579be410a458c74bde554bf85bc 10.166.15.36:[email protected] master - 0 1530527218000 7 connected 0-5460
6d6f6fbeff41f5a85a7a7ebe15565a448fb0a69f 10.166.15.37:[email protected] slave ad20d73e76c72579be410a458c74bde554bf85bc 0 1530527220431 7 connected
b98f368b250360dbb4d74a1b831d7fa939ea4f8d 10.166.15.35:[email protected] myself,slave 1e7a4a6b7376a89726211ce04acdaf97afbe2f13 0 1530527217000 1 connected
1e7a4a6b7376a89726211ce04acdaf97afbe2f13 10.166.15.35:[email protected] master - 0 1530527219143 8 connected 10923-16383
vars currentEpoch 8 lastVoteEpoch 7
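
Each node record in this file uses the same field order that CLUSTER NODES prints: node ID, address, flags, master ID, ping-sent, pong-received, config epoch, link state, then any slot ranges. The following is a minimal Python sketch of reading such records and checking that the masters still cover all 16384 hash slots. The helper names (parse_node_line, covered_slots) are hypothetical, and the node IDs, addresses, and bus ports in the sample are placeholders, because the real ports are obscured in the dump above. Only the slot ranges and epochs follow the file.

```python
from typing import NamedTuple


class Node(NamedTuple):
    node_id: str
    addr: str
    flags: frozenset
    master_id: str      # "-" when the node is a master
    config_epoch: int
    link_state: str
    slots: tuple        # inclusive (start, end) ranges


def parse_node_line(line):
    """Parse one node record; return None for blank or 'vars ...' lines."""
    parts = line.split()
    if len(parts) < 8 or parts[0] == "vars":
        return None
    slots = []
    for token in parts[8:]:
        if token.startswith("["):      # migrating/importing marker, skipped here
            continue
        start, _, end = token.partition("-")
        slots.append((int(start), int(end or start)))
    return Node(parts[0], parts[1], frozenset(parts[2].split(",")),
                parts[3], int(parts[6]), parts[7], tuple(slots))


def covered_slots(nodes):
    """Count the distinct hash slots owned by the masters in `nodes`."""
    owned = set()
    for node in nodes:
        if "master" in node.flags:
            for start, end in node.slots:
                owned.update(range(start, end + 1))
    return len(owned)


# Placeholder records: IDs, addresses and bus ports are made up for illustration.
SAMPLE = """\
aaaa 10.0.0.1:7000@17000 master - 0 1530527218141 2 connected 5461-10922
bbbb 10.0.0.1:7001@17001 master - 0 1530527218000 7 connected 0-5460
cccc 10.0.0.2:7002@17002 master - 0 1530527219143 8 connected 10923-16383
dddd 10.0.0.2:7001@17001 myself,slave cccc 0 1530527217000 1 connected
vars currentEpoch 8 lastVoteEpoch 7
"""

nodes = [n for n in map(parse_node_line, SAMPLE.splitlines()) if n]
print(len(nodes), covered_slots(nodes))
```

Because every node reloads a file like this at startup (the "Node configuration loaded" line in the logs below), it already knows its own ID, its role, its master, and the slot layout before it talks to any peer, which fits the guess that no new create step is needed.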

Log analysis

1. Below is the restart log of one slave node, 10.166.15.35:7001, after the abrupt power loss:

4446:M 02 Jul 18:07:55.091 * Node configuration loaded, I'm b98f368b250360dbb4d74a1b831d7fa939ea4f8d
                _._                                                  
           _.-``__ ''-._                                             
      _.-``    `.  `_.  ''-._           Redis 4.0.6 (00000000/0) 64 bit
  .-`` .-```.  ```\/    _.,_ ''-._                                   
 (    '      ,       .-`  | `,    )     Running in cluster mode
 |`-._`-...-` __...-.``-._|'` _.-'|     Port: 7001
 |    `-._   `._    /     _.-'    |     PID: 4446
  `-._    `-._  `-./  _.-'    _.-'                                   
 |`-._`-._    `-.__.-'    _.-'_.-'|                                  
 |    `-._`-._        _.-'_.-'    |           http://redis.io        
  `-._    `-._`-.__.-'_.-'    _.-'                                   
 |`-._`-._    `-.__.-'    _.-'_.-'|                                  
 |    `-._`-._        _.-'_.-'    |                                  
  `-._    `-._`-.__.-'_.-'    _.-'                                   
      `-._    `-.__.-'    _.-'                                       
          `-._        _.-'                                           
              `-.__.-'                                               


4446:M 02 Jul 18:07:55.092 # Server initialized
4446:M 02 Jul 18:07:55.092 # WARNING overcommit_memory is set to 0! Background save may fail under low memory condition. To fix this issue add 'vm.overcommit_memory = 1' to /etc/sysctl.conf and then reboot or run the command 'sysctl vm.overcommit_memory=1' for this to take effect.
4446:M 02 Jul 18:07:55.092 # WARNING you have Transparent Huge Pages (THP) support enabled in your kernel. This will create latency and memory usage issues with Redis. To fix this issue run the command 'echo never > /sys/kernel/mm/transparent_hugepage/enabled' as root, and add it to your /etc/rc.local in order to retain the setting after a reboot. Redis must be restarted after THP is disabled.
4446:M 02 Jul 18:07:55.716 * DB loaded from append only file: 0.624 seconds
4446:M 02 Jul 18:07:55.716 * Ready to accept connections
4446:S 02 Jul 18:07:55.717 * Before turning into a slave, using my master parameters to synthesize a cached master: I may be able to synchronize with the new master with just a partial transfer.
4446:S 02 Jul 18:07:55.717 # Cluster state changed: ok
4446:S 02 Jul 18:07:56.720 * Connecting to MASTER 10.166.15.35:7002  ---ps: this slave node is 10.166.15.35:7001; here it starts connecting to its master
4446:S 02 Jul 18:07:56.720 * MASTER <-> SLAVE sync started
4446:S 02 Jul 18:07:56.720 * Non blocking connect for SYNC fired the event.
4446:S 02 Jul 18:07:56.720 * Master replied to PING, replication can continue...
4446:S 02 Jul 18:07:56.721 * Trying a partial resynchronization (request eda94ebd6d85f2f9dfb8337358ab14bdaee896bd:1).
4446:S 02 Jul 18:07:56.723 * Full resync from master: 79be914765bfde6e102bfae1b106847c8dcaea36:0
4446:S 02 Jul 18:07:56.723 * Discarding previously cached master state.
4446:S 02 Jul 18:07:56.772 * MASTER <-> SLAVE sync: receiving 318259 bytes from master
4446:S 02 Jul 18:07:56.773 * MASTER <-> SLAVE sync: Flushing old data
4446:S 02 Jul 18:07:56.776 * MASTER <-> SLAVE sync: Loading DB in memory
4446:S 02 Jul 18:07:56.784 * MASTER <-> SLAVE sync: Finished with success
4446:S 02 Jul 18:07:56.785 * Background append only file rewriting started by pid 4451
4446:S 02 Jul 18:07:56.812 * AOF rewrite child asks to stop sending diffs.
4451:C 02 Jul 18:07:56.812 * Parent agreed to stop sending diffs. Finalizing AOF...
4451:C 02 Jul 18:07:56.812 * Concatenating 0.00 MB of AOF diff received from parent.
4451:C 02 Jul 18:07:56.812 * SYNC append only file rewrite performed
4451:C 02 Jul 18:07:56.812 * AOF rewrite: 6 MB of memory used by copy-on-write
4446:S 02 Jul 18:07:56.820 * Background AOF rewrite terminated with success
4446:S 02 Jul 18:07:56.820 * Residual parent diff successfully flushed to the rewritten AOF (0.00 MB)
4446:S 02 Jul 18:07:56.820 * Background AOF rewrite finished successfully
4446:S 02 Jul 18:08:10.776 # Cluster state changed: fail
4446:S 02 Jul 18:08:13.650 # Cluster state changed: ok

4446:S 02 Jul 18:08:29.586 * FAIL message received from 1e7a4a6b7376a89726211ce04acdaf97afbe2f13 about 052b9df159281988af20d8b5d0690e5defa42c47

Note: 1e7a4a6b7376a89726211ce04acdaf97afbe2f13 is 10.166.15.35:7002.

4446:S 02 Jul 18:08:29.586 * FAIL message received from 1e7a4a6b7376a89726211ce04acdaf97afbe2f13 about 6d6f6fbeff41f5a85a7a7ebe15565a448fb0a69f
4446:S 02 Jul 18:08:56.087 * 10000 changes in 60 seconds. Saving...
4446:S 02 Jul 18:08:56.088 * Background saving started by pid 4472
4472:C 02 Jul 18:08:56.106 * DB saved on disk
4472:C 02 Jul 18:08:56.107 * RDB: 6 MB of memory used by copy-on-write
4446:S 02 Jul 18:08:56.188 * Background saving terminated with success
4446:S 02 Jul 18:26:56.159 * Clear FAIL state for node 052b9df159281988af20d8b5d0690e5defa42c47: slave is reachable again.
4446:S 02 Jul 18:27:00.431 * Clear FAIL state for node 6d6f6fbeff41f5a85a7a7ebe15565a448fb0a69f: slave is reachable again.
4446:S 02 Jul 18:27:11.969 * 1 changes in 900 seconds. Saving...
4446:S 02 Jul 18:27:11.971 * Background saving started by pid 5248
5248:C 02 Jul 18:27:11.995 * DB saved on disk
5248:C 02 Jul 18:27:11.996 * RDB: 6 MB of memory used by copy-on-write
4446:S 02 Jul 18:27:12.071 * Background saving terminated with success
4446:S 02 Jul 18:42:13.029 * 1 changes in 900 seconds. Saving...
4446:S 02 Jul 18:42:13.031 * Background saving started by pid 6014
6014:C 02 Jul 18:42:13.051 * DB saved on disk
6014:C 02 Jul 18:42:13.051 * RDB: 8 MB of memory used by copy-on-write
4446:S 02 Jul 18:42:13.134 * Background saving terminated with success
4446:S 02 Jul 19:21:15.352 * 1 changes in 900 seconds. Saving...
4446:S 02 Jul 19:21:15.354 * Background saving started by pid 7287
7287:C 02 Jul 19:21:15.363 * DB saved on disk
7287:C 02 Jul 19:21:15.363 * RDB: 8 MB of memory used by copy-on-write
4446:S 02 Jul 19:21:15.455 * Background saving terminated with success
4446:S 02 Jul 19:26:16.093 * 10 changes in 300 seconds. Saving...
4446:S 02 Jul 19:26:16.094 * Background saving started by pid 7615
7615:C 02 Jul 19:26:16.113 * DB saved on disk
7615:C 02 Jul 19:26:16.114 * RDB: 8 MB of memory used by copy-on-write
4446:S 02 Jul 19:26:16.197 * Background saving terminated with success
4446:S 02 Jul 19:41:17.032 * 1 changes in 900 seconds. Saving...
4446:S 02 Jul 19:41:17.033 * Background saving started by pid 8074
8074:C 02 Jul 19:41:17.055 * DB saved on disk
8074:C 02 Jul 19:41:17.055 * RDB: 8 MB of memory used by copy-on-write
4446:S 02 Jul 19:41:17.134 * Background saving terminated with success
4446:S 02 Jul 22:03:58.651 * 1 changes in 900 seconds. Saving...
4446:S 02 Jul 22:03:58.653 * Background saving started by pid 13916
13916:C 02 Jul 22:03:58.667 * DB saved on disk
13916:C 02 Jul 22:03:58.667 * RDB: 8 MB of memory used by copy-on-write
4446:S 02 Jul 22:03:58.754 * Background saving terminated with success
4446:S 03 Jul 00:14:43.282 * 1 changes in 900 seconds. Saving...
4446:S 03 Jul 00:14:43.284 * Background saving started by pid 19046
19046:C 03 Jul 00:14:43.307 * DB saved on disk

2. Below is the restart log, after the same power loss, of 10.166.15.35:7002, the master of the slave node in part 1:

4440:C 02 Jul 18:07:51.105 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
4440:C 02 Jul 18:07:51.109 # Redis version=4.0.6, bits=64, commit=00000000, modified=0, pid=4440, just started
4440:C 02 Jul 18:07:51.109 # Configuration loaded
4441:M 02 Jul 18:07:51.122 * Node configuration loaded, I'm 1e7a4a6b7376a89726211ce04acdaf97afbe2f13
                _._                                                  
           _.-``__ ''-._                                             
      _.-``    `.  `_.  ''-._           Redis 4.0.6 (00000000/0) 64 bit
  .-`` .-```.  ```\/    _.,_ ''-._                                   
 (    '      ,       .-`  | `,    )     Running in cluster mode
 |`-._`-...-` __...-.``-._|'` _.-'|     Port: 7002
 |    `-._   `._    /     _.-'    |     PID: 4441
  `-._    `-._  `-./  _.-'    _.-'                                   
 |`-._`-._    `-.__.-'    _.-'_.-'|                                  
 |    `-._`-._        _.-'_.-'    |           http://redis.io        
  `-._    `-._`-.__.-'_.-'    _.-'                                   
 |`-._`-._    `-.__.-'    _.-'_.-'|                                  
 |    `-._`-._        _.-'_.-'    |                                  
  `-._    `-._`-.__.-'_.-'    _.-'                                   
      `-._    `-.__.-'    _.-'                                       
          `-._        _.-'                                           
              `-.__.-'                                               


4441:M 02 Jul 18:07:51.127 # Server initialized
4441:M 02 Jul 18:07:51.127 # WARNING overcommit_memory is set to 0! Background save may fail under low memory condition. To fix this issue add 'vm.overcommit_memory = 1' to /etc/sysctl.conf and then reboot or run the command 'sysctl vm.overcommit_memory=1' for this to take effect.
4441:M 02 Jul 18:07:51.127 # WARNING you have Transparent Huge Pages (THP) support enabled in your kernel. This will create latency and memory usage issues with Redis. To fix this issue run the command 'echo never > /sys/kernel/mm/transparent_hugepage/enabled' as root, and add it to your /etc/rc.local in order to retain the setting after a reboot. Redis must be restarted after THP is disabled.
4441:M 02 Jul 18:07:51.752 * DB loaded from append only file: 0.625 seconds
4441:M 02 Jul 18:07:51.752 * Ready to accept connections
4441:M 02 Jul 18:07:53.759 # Cluster state changed: ok
4441:M 02 Jul 18:07:56.721 * Slave 10.166.15.35:7001 asks for synchronization
4441:M 02 Jul 18:07:56.721 * Partial resynchronization not accepted: Replication ID mismatch (Slave asked for 'eda94ebd6d85f2f9dfb8337358ab14bdaee896bd', my replication IDs are 'd64784b6f440cfc58eb490428541b43c1cc7eecf' and '0000000000000000000000000000000000000000')
4441:M 02 Jul 18:07:56.721 * Starting BGSAVE for SYNC with target: disk
4441:M 02 Jul 18:07:56.722 * Background saving started by pid 4450
4450:C 02 Jul 18:07:56.740 * DB saved on disk
4450:C 02 Jul 18:07:56.741 * RDB: 6 MB of memory used by copy-on-write
4441:M 02 Jul 18:07:56.771 * Background saving terminated with success
4441:M 02 Jul 18:07:56.772 * Synchronization with slave 10.166.15.35:7001 succeeded
4441:M 02 Jul 18:08:06.807 # Cluster state changed: fail
4441:M 02 Jul 18:08:18.647 # Cluster state changed: ok
4441:M 02 Jul 18:08:29.585 * Marking node 052b9df159281988af20d8b5d0690e5defa42c47 as failing (quorum reached).

4441:M 02 Jul 18:08:29.586 * Marking node 6d6f6fbeff41f5a85a7a7ebe15565a448fb0a69f as failing (quorum reached).

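-- Note: this node is a master. Together with the two masters started on the 10.166.15.36 VM, it confirmed that the two slave nodes on 10.166.15.37 were down (agreed by more than half of the masters; the two nodes on 10.166.15.37 were started roughly 20 minutes later).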
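
The "quorum reached" wording reflects the failure-detection rule in Redis Cluster: a node that one master merely suspects is flagged PFAIL, and it is promoted to FAIL only once a majority of the slot-serving masters report it as unreachable. A small Python sketch of that arithmetic for this 3-master cluster follows. The function names are hypothetical, the shortened node IDs are used only as labels, and the sketch ignores the time window within which the reports must arrive.

```python
def failure_quorum(master_count):
    """Smallest number of masters that forms a majority."""
    return master_count // 2 + 1


def can_mark_fail(reporting_masters, master_count):
    """True once enough distinct masters report the node as unreachable."""
    return len(set(reporting_masters)) >= failure_quorum(master_count)


# Three masters in this cluster: two on 10.166.15.36 and 7002 on 10.166.15.35.
MASTERS = 3

print(failure_quorum(MASTERS))
print(can_mark_fail({"1e7a4a6b"}, MASTERS))
print(can_mark_fail({"1e7a4a6b", "197d2e50"}, MASTERS))
```

With three masters the quorum is two, so this master's own observation plus a report from one of the masters on 10.166.15.36 is enough to mark each absent slave as failing.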

4441:M 02 Jul 18:26:56.159 * Clear FAIL state for node 052b9df159281988af20d8b5d0690e5defa42c47: slave is reachable again.

4441:M 02 Jul 18:27:00.431 * Clear FAIL state for node 6d6f6fbeff41f5a85a7a7ebe15565a448fb0a69f: slave is reachable again.

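-- Note: once the two slave nodes on 10.166.15.37 were started and became reachable again, this master cleared their FAIL state and they rejoined the cluster. Unlike marking a node as failing, clearing FAIL for a slave does not need agreement from a majority of masters: each node clears the flag on its own as soon as it can reach the slave again, which is why the slave log in part 1 shows the same two "Clear FAIL state" lines.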
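
To see how long the two slaves were actually considered failed, the "Marking node ... as failing" and "Clear FAIL state" lines can be paired up. The Python sketch below does this for the four lines taken from the master log above; fail_windows and parse_ts are hypothetical helper names, and the regular expressions assume the Redis 4.0 log format shown in this post.

```python
import re
from datetime import datetime

FAIL_RE = re.compile(r"(\d{2} \w{3} [\d:.]+) \* Marking node (\w+) as failing")
CLEAR_RE = re.compile(r"(\d{2} \w{3} [\d:.]+) \* Clear FAIL state for node (\w+)")


def parse_ts(text):
    # These log timestamps carry no year; strptime defaults it to 1900,
    # which is harmless here because only differences are used.
    return datetime.strptime(text, "%d %b %H:%M:%S.%f")


def fail_windows(log_lines):
    """Map node ID -> seconds between being marked FAIL and being cleared."""
    marked, windows = {}, {}
    for line in log_lines:
        m = FAIL_RE.search(line)
        if m:
            marked[m.group(2)] = parse_ts(m.group(1))
            continue
        m = CLEAR_RE.search(line)
        if m and m.group(2) in marked:
            delta = parse_ts(m.group(1)) - marked.pop(m.group(2))
            windows[m.group(2)] = delta.total_seconds()
    return windows


LOG = [
    "4441:M 02 Jul 18:08:29.585 * Marking node 052b9df159281988af20d8b5d0690e5defa42c47 as failing (quorum reached).",
    "4441:M 02 Jul 18:08:29.586 * Marking node 6d6f6fbeff41f5a85a7a7ebe15565a448fb0a69f as failing (quorum reached).",
    "4441:M 02 Jul 18:26:56.159 * Clear FAIL state for node 052b9df159281988af20d8b5d0690e5defa42c47: slave is reachable again.",
    "4441:M 02 Jul 18:27:00.431 * Clear FAIL state for node 6d6f6fbeff41f5a85a7a7ebe15565a448fb0a69f: slave is reachable again.",
]

for node, seconds in fail_windows(LOG).items():
    print(node[:8], round(seconds / 60, 1), "minutes")
```

For these lines the two windows come out at about 18.4 and 18.5 minutes, which matches the timestamps in the log (18:08:29 to roughly 18:27:00).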

4441:M 02 Jul 18:27:11.969 * 1 changes in 900 seconds. Saving...
4441:M 02 Jul 18:27:11.970 * Background saving started by pid 5249
5249:C 02 Jul 18:27:11.979 * DB saved on disk
5249:C 02 Jul 18:27:11.979 * RDB: 6 MB of memory used by copy-on-write
4441:M 02 Jul 18:27:12.072 * Background saving terminated with success
4441:M 02 Jul 18:42:13.032 * 1 changes in 900 seconds. Saving...
4441:M 02 Jul 18:42:13.034 * Background saving started by pid 6015
6015:C 02 Jul 18:42:13.068 * DB saved on disk
6015:C 02 Jul 18:42:13.069 * RDB: 8 MB of memory used by copy-on-write
4441:M 02 Jul 18:42:13.135 * Background saving terminated with success
4441:M 02 Jul 19:21:15.353 * 1 changes in 900 seconds. Saving...
4441:M 02 Jul 19:21:15.354 * Background saving started by pid 7288
7288:C 02 Jul 19:21:15.363 * DB saved on disk
7288:C 02 Jul 19:21:15.363 * RDB: 8 MB of memory used by copy-on-write
4441:M 02 Jul 19:21:15.455 * Background saving terminated with success

4441:M 02 Jul 19:26:16.093 * 10 changes in 300 seconds. Saving...
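
One detail in this master log deserves a closer look: "Partial resynchronization not accepted: Replication ID mismatch", followed by a full resync. The restarted master was running with a replication ID different from the one the slave had cached, so the slave's cached history could not be continued. The Python sketch below is a simplified model of how a master decides between partial and full resynchronization. The class and function names are hypothetical, the backlog values in the example are assumptions (the log does not show them), and real Redis reports more detailed reasons than these two strings.

```python
from dataclasses import dataclass


@dataclass
class MasterState:
    replid: str                 # current replication ID
    replid2: str                # previous ID, kept after a failover
    second_replid_offset: int   # highest offset still valid for replid2
    backlog_start: int          # first offset held in the backlog
    backlog_len: int            # bytes currently held in the backlog


def psync_decision(master, asked_replid, asked_offset):
    """Return 'partial', or the reason a full resync is needed."""
    same_history = asked_replid == master.replid or (
        asked_replid == master.replid2
        and asked_offset <= master.second_replid_offset
    )
    if not same_history:
        return "full: replication ID mismatch"
    in_backlog = (master.backlog_start <= asked_offset
                  <= master.backlog_start + master.backlog_len)
    if not in_backlog:
        return "full: offset outside backlog"
    return "partial"


# IDs taken from the log above; the offsets and backlog sizes are assumed.
master = MasterState(
    replid="d64784b6f440cfc58eb490428541b43c1cc7eecf",
    replid2="0" * 40,
    second_replid_offset=-1,
    backlog_start=0,
    backlog_len=0,
)
print(psync_decision(master, "eda94ebd6d85f2f9dfb8337358ab14bdaee896bd", 1))
```

In this outage the full resync moved only 318259 bytes, so it was cheap. On a large dataset the same path means a complete RDB transfer for every slave after a full-cluster restart.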

Reference: https://www.cnblogs.com/yjmyzz/p/redis-cluster-turotial.html

P.S. (added 2018-07-12)

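The above is the ideal case, in which the cluster recovers automatically. In extreme cases (mostly under high concurrency), master and slave data may end up inconsistent.

Some companies have reported that, after some of the cluster's nodes went down, a restart could not restore the cluster to its original state, or even left the cluster unavailable.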
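
Since automatic recovery is not guaranteed, it seems worth checking the cluster explicitly after a restart instead of assuming it came back. The Python sketch below parses the text reply of the CLUSTER INFO command and lists anything that looks wrong. The helper names are hypothetical, and the sample reply is illustrative only; it was not captured from the cluster in this post.

```python
TOTAL_SLOTS = 16384


def parse_cluster_info(reply):
    """Turn the 'key:value' lines of a CLUSTER INFO reply into a dict."""
    info = {}
    for line in reply.splitlines():
        key, sep, value = line.strip().partition(":")
        if sep:
            info[key] = value
    return info


def cluster_problems(info, expected_nodes):
    """Return a list of human-readable problems; empty means healthy."""
    problems = []
    if info.get("cluster_state") != "ok":
        problems.append("cluster_state is %s" % info.get("cluster_state"))
    if int(info.get("cluster_slots_ok", 0)) != TOTAL_SLOTS:
        problems.append("only %s slots are ok" % info.get("cluster_slots_ok", 0))
    if int(info.get("cluster_known_nodes", 0)) != expected_nodes:
        problems.append("knows %s nodes, expected %d"
                        % (info.get("cluster_known_nodes", 0), expected_nodes))
    return problems


# Illustrative reply for a healthy cluster of 3 masters and 3 slaves.
SAMPLE_REPLY = (
    "cluster_state:ok\r\n"
    "cluster_slots_assigned:16384\r\n"
    "cluster_slots_ok:16384\r\n"
    "cluster_slots_pfail:0\r\n"
    "cluster_slots_fail:0\r\n"
    "cluster_known_nodes:6\r\n"
    "cluster_size:3\r\n"
)

print(cluster_problems(parse_cluster_info(SAMPLE_REPLY), expected_nodes=6))
```

The reply text can be obtained with redis-cli against any node, for example `redis-cli -p 7001 cluster info`. An empty list here only says that the topology looks healthy; it does not prove that master and slave data are consistent.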