
【Troubleshooting】Tracking down single-instance latency jitter in a redis-cluster on overseas AWS


【Background】

Our overseas business does not run in self-built data centers; it runs on AWS, on r4.4xlarge instances.

However, with the Redis clusters deployed on AWS we frequently saw a single instance whose latency jittered far more than its peers, even though CPU, memory, and network metrics were all low.

Thus began a long journey of tracking the problem down.

【Symptoms】

  • Some Redis instances in the cluster show clearly visible latency jitter

[Figure: per-instance latency curves for the cluster]

As shown in the figure above, instances 10.100.14.19, 10.100.15.206, and 10.100.7.237 jitter noticeably.

  • Every machine still had ample CPU and memory headroom, and there was no network packet loss

【Investigation】

  • Monitored the jittery Redis instances with the built-in LATENCY tooling (a sketch of the commands follows the doctor output below) => no definitive conclusion, though suggestion 3 hints at a possible "noisy neighbour"

10.100.14.19:6870> LATENCY DOCTOR

Dave, I have observed latency spikes in this Redis instance. You don't mind talking about it, do you Dave?

1. fast-command: 2 latency spikes (average 1ms, mean deviation 0ms, period 184.50 sec). Worst all time event 1ms.

2. command: 10 latency spikes (average 8ms, mean deviation 10ms, period 77.70 sec). Worst all time event 39ms.

I have a few advices for you:

- Check your Slow Log to understand what are the commands you are running which are too slow to execute. Please check http://redis.io/commands/slowlog for more information.

- The system is slow to execute Redis code paths not containing system calls. This usually means the system does not provide Redis CPU time to run for long periods. You should try to:

1) Lower the system load.

2) Use a computer / VM just for Redis if you are running other softawre in the same system.

3) Check if you have a "noisy neighbour" problem.

4) Check with 'redis-cli --intrinsic-latency 100' what is the intrinsic latency in your system.

5) Check if the problem is allocator-related by recompiling Redis with MALLOC=libc, if you are using Jemalloc. However this may create fragmentation problems.

- Deleting, expiring or evicting (because of maxmemory policy) large objects is a blocking operation. If you have very large objects that are often deleted, expired, or evicted, try to fragment those objects into multiple smaller objects.
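For reference, a minimal sketch of how the latency monitor behind LATENCY DOCTOR can be enabled and queried; the 100 ms threshold is an illustrative value, and the host/port are taken from the session above:

# record any event that blocks the server longer than the threshold (in ms)
redis-cli -h 10.100.14.19 -p 6870 CONFIG SET latency-monitor-threshold 100
# latest and worst spike per event class ("command", "fast-command", ...)
redis-cli -h 10.100.14.19 -p 6870 LATENCY LATEST
# raw time series of spikes for one event class
redis-cli -h 10.100.14.19 -p 6870 LATENCY HISTORY command
# human-readable analysis, as quoted above
redis-cli -h 10.100.14.19 -p 6870 LATENCY DOCTOR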

  • Used the software watchdog to log anything blocking the server for more than 20 ms (enabling it is sketched after the backtrace below) =>

155384:signal-handler (1515198055)
--- WATCHDOG TIMER EXPIRED ---
EIP:
/lib/x86_64-linux-gnu/libc.so.6(rename+0x7)[0x7fe53f018727]

Backtrace:
/opt/tiger/cache_manager/bin/redis-server-3.2 10.100.14.19:6870 [cluster](logStackTrace+0x34)[0x462d24]
/opt/tiger/cache_manager/bin/redis-server-3.2 10.100.14.19:6870 [cluster](watchdogSignalHandler+0x1b)[0x462dcb]
/lib/x86_64-linux-gnu/libpthread.so.0(+0xf890)[0x7fe53f36b890]
/lib/x86_64-linux-gnu/libc.so.6(rename+0x7)[0x7fe53f018727]
/opt/tiger/cache_manager/bin/redis-server-3.2 10.100.14.19:6870 [cluster](readSyncBulkPayload+0x1ed)[0x441bbd]
/opt/tiger/cache_manager/bin/redis-server-3.2 10.100.14.19:6870 [cluster](aeProcessEvents+0x228)[0x426128]
/opt/tiger/cache_manager/bin/redis-server-3.2 10.100.14.19:6870 [cluster](aeMain+0x2b)[0x4263bb]
/opt/tiger/cache_manager/bin/redis-server-3.2 10.100.14.19:6870 [cluster](main+0x405)[0x423125]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5)[0x7fe53efd2b45]
/opt/tiger/cache_manager/bin/redis-server-3.2 10.100.14.19:6870 [cluster][0x423395]
155384:signal-handler (1515198055) --------
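The backtrace pinpoints this particular stall: the event loop was blocked inside the glibc rename() call that readSyncBulkPayload() makes to move a freshly received RDB file into place during a replication sync. For reference, a minimal sketch of enabling the software watchdog at runtime; note that stock Redis documents a minimum watchdog-period of 200 ms, so the 20 ms threshold above may have come from a patched or otherwise adjusted build:

# dump a stack trace to the log whenever the main event loop appears
# blocked for longer than the period (in milliseconds)
redis-cli -h 10.100.14.19 -p 6870 CONFIG SET watchdog-period 500
# turn it off again afterwards; the watchdog is intended for debugging only
redis-cli -h 10.100.14.19 -p 6870 CONFIG SET watchdog-period 0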

  • Found another service on the same machine consuming a lot of CPU and migrated it away => effective, but the jitter returned after a while

[Figure: instance latency after migrating the CPU-heavy service]

  • Found another service on the same machine intermittently saturating the CPU => a bug in node_keeper.py
  • Moved all instances of the entire cluster to new, dedicated machines => the jitter has not recurred

[Figure: instance latency after redeploying on dedicated machines]

【Preliminary Conclusion】

Other services on the same machine were interfering with the Redis instances:

  • The other services' CPU usage was normally very low, so the spikes were invisible in regular monitoring
  • When those services spiked they could saturate the CPU; with the core Redis runs on busy, the single-threaded Redis inevitably suffers => latency jitter

Because Redis is a single-threaded, asynchronous service, it is easily and significantly affected by other services running on the same machine.
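A quick way to confirm this kind of host-level interference, and suggestion 4 from the LATENCY DOCTOR output above, is to measure the intrinsic latency of the machine itself: redis-cli runs a busy loop and reports the largest scheduling stalls it experiences, so any spikes come from the host or a noisy neighbour, not from Redis:

# run the intrinsic-latency test for 100 seconds on the Redis host,
# ideally while the suspected noisy neighbour is active
redis-cli --intrinsic-latency 100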
