
【Troubleshooting】Tracking down single-instance latency jitter in a redis-cluster on overseas AWS


【Background】

Our overseas business does not run in self-built data centers; it runs on AWS, on r4.4xlarge instances.

However, with the Redis clusters deployed on AWS we frequently saw a single instance whose latency jittered far more than its peers, even though CPU, memory, and network metrics were all low.

Thus began a long journey of tracking the problem down.

【Symptoms】

  • Some Redis instances in the cluster show clearly visible latency jitter

[Figure: per-instance latency curves for the cluster]

As shown in the figure above, instances 10.100.14.19, 10.100.15.206, and 10.100.7.237 jitter noticeably.

  • Every machine still had ample CPU and memory headroom, and there was no network packet loss

【Investigation】

  • Monitored the jittery Redis instances with the built-in LATENCY tooling (a sketch of the commands follows the doctor output below) => no definitive conclusion, though suggestion 3 hints at a possible "noisy neighbour"

10.100.14.19:6870> LATENCY DOCTOR

Dave, I have observed latency spikes in this Redis instance. You don't mind talking about it, do you Dave?

1. fast-command: 2 latency spikes (average 1ms, mean deviation 0ms, period 184.50 sec). Worst all time event 1ms.

2. command: 10 latency spikes (average 8ms, mean deviation 10ms, period 77.70 sec). Worst all time event 39ms.

I have a few advices for you:

- Check your Slow Log to understand what are the commands you are running which are too slow to execute. Please check http://redis.io/commands/slowlog for more information.

- The system is slow to execute Redis code paths not containing system calls. This usually means the system does not provide Redis CPU time to run for long periods. You should try to:

1) Lower the system load.

2) Use a computer / VM just for Redis if you are running other softawre in the same system.

3) Check if you have a "noisy neighbour" problem.

4) Check with 'redis-cli --intrinsic-latency 100' what is the intrinsic latency in your system.

5) Check if the problem is allocator-related by recompiling Redis with MALLOC=libc, if you are using Jemalloc. However this may create fragmentation problems.

- Deleting, expiring or evicting (because of maxmemory policy) large objects is a blocking operation. If you have very large objects that are often deleted, expired, or evicted, try to fragment those objects into multiple smaller objects.
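For reference, a minimal sketch of how the latency monitor behind LATENCY DOCTOR can be enabled and queried; the 100 ms threshold is an illustrative value, and the host/port are taken from the session above:

# record any event that blocks the server longer than the threshold (in ms)
redis-cli -h 10.100.14.19 -p 6870 CONFIG SET latency-monitor-threshold 100
# latest and worst spike per event class ("command", "fast-command", ...)
redis-cli -h 10.100.14.19 -p 6870 LATENCY LATEST
# raw time series of spikes for one event class
redis-cli -h 10.100.14.19 -p 6870 LATENCY HISTORY command
# human-readable analysis, as quoted above
redis-cli -h 10.100.14.19 -p 6870 LATENCY DOCTOR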

  • Used the software watchdog to log anything blocking the server for more than 20 ms (enabling it is sketched after the backtrace below) =>

155384:signal-handler (1515198055)
--- WATCHDOG TIMER EXPIRED ---
EIP:
/lib/x86_64-linux-gnu/libc.so.6(rename+0x7)[0x7fe53f018727]

Backtrace:
/opt/tiger/cache_manager/bin/redis-server-3.2 10.100.14.19:6870 [cluster](logStackTrace+0x34)[0x462d24]
/opt/tiger/cache_manager/bin/redis-server-3.2 10.100.14.19:6870 [cluster](watchdogSignalHandler+0x1b)[0x462dcb]
/lib/x86_64-linux-gnu/libpthread.so.0(+0xf890)[0x7fe53f36b890]
/lib/x86_64-linux-gnu/libc.so.6(rename+0x7)[0x7fe53f018727]
/opt/tiger/cache_manager/bin/redis-server-3.2 10.100.14.19:6870 [cluster](readSyncBulkPayload+0x1ed)[0x441bbd]
/opt/tiger/cache_manager/bin/redis-server-3.2 10.100.14.19:6870 [cluster](aeProcessEvents+0x228)[0x426128]
/opt/tiger/cache_manager/bin/redis-server-3.2 10.100.14.19:6870 [cluster](aeMain+0x2b)[0x4263bb]
/opt/tiger/cache_manager/bin/redis-server-3.2 10.100.14.19:6870 [cluster](main+0x405)[0x423125]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5)[0x7fe53efd2b45]
/opt/tiger/cache_manager/bin/redis-server-3.2 10.100.14.19:6870 [cluster][0x423395]
155384:signal-handler (1515198055) --------
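The backtrace pinpoints this particular stall: the event loop was blocked inside the glibc rename() call that readSyncBulkPayload() makes to move a freshly received RDB file into place during a replication sync. For reference, a minimal sketch of enabling the software watchdog at runtime; note that stock Redis documents a minimum watchdog-period of 200 ms, so the 20 ms threshold above may have come from a patched or otherwise adjusted build:

# dump a stack trace to the log whenever the main event loop appears
# blocked for longer than the period (in milliseconds)
redis-cli -h 10.100.14.19 -p 6870 CONFIG SET watchdog-period 500
# turn it off again afterwards; the watchdog is intended for debugging only
redis-cli -h 10.100.14.19 -p 6870 CONFIG SET watchdog-period 0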

  • Found another service on the same machine consuming a lot of CPU and migrated it away => effective, but the jitter returned after a while

[Figure: instance latency after migrating the CPU-heavy service]

  • Found another service on the same machine intermittently saturating the CPU => a bug in node_keeper.py
  • Moved all instances of the entire cluster to new, dedicated machines => the jitter has not recurred

[Figure: instance latency after redeploying on dedicated machines]

【Preliminary Conclusion】

Other services on the same machine were interfering with the Redis instances:

  • The other services' CPU usage was normally very low, so the spikes were invisible in regular monitoring
  • When those services spiked they could saturate the CPU; with the core Redis runs on busy, the single-threaded Redis inevitably suffers => latency jitter

Because Redis is a single-threaded, asynchronous service, it is easily and significantly affected by other services running on the same machine.
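A quick way to confirm this kind of host-level interference, and suggestion 4 from the LATENCY DOCTOR output above, is to measure the intrinsic latency of the machine itself: redis-cli runs a busy loop and reports the largest scheduling stalls it experiences, so any spikes come from the host or a noisy neighbour, not from Redis:

# run the intrinsic-latency test for 100 seconds on the Redis host,
# ideally while the suspected noisy neighbour is active
redis-cli --intrinsic-latency 100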
