MySQL Database NIC Softirq Imbalance: Problem and Solution
Recently our MySQL servers moved to high-speed PCIe flash cards and large amounts of memory, and during last year's stress tests we suddenly found that database traffic could saturate a full gigabit NIC. As the database was tuned further, traffic can now reach about 150 MB/s, so we switched to dual NICs, bonded on the switch side in a load-balancing setup, to raise system throughput.
But on a database we stress-tested recently, mpstat showed that one core's CPU was being eaten up entirely by softirqs:
MySQL QPS was around 20,000:
-------- -----load-avg---- ---cpu-usage--- ---swap--- -QPS- -TPS- -Hit%-
time | 1m 5m 15m |usr sys idl iow| si so| ins upd del sel iud| lor hit|
13:43:46| 0.00 0.00 0.00| 67 27 3 3| 0 0| 0 0 0 0 0| 0 100.00|
13:43:47| 0.00 0.00 0.00| 30 10 60 0| 0 0| 0 0 0 19281 0| 326839 100.00|
13:43:48| 0.00 0.00 0.00| 28 10 63 0| 0 0| 0 0 0 19083 0| 323377 100.00|
13:43:49| 0.00 0.00 0.00| 28 10 63 0| 0 0| 0 0 0 19482 0| 330185 100.00|
13:43:50| 0.00 0.00 0.00| 26 9 65 0| 0 0| 0 0 0 19379 0| 328575 100.00|
13:43:51| 0.00 0.00 0.00| 27 9 64 0| 0 0| 0 0 0 19723 0| 334378 100.00|
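For reference, the per-core view that exposes this kind of problem comes from mpstat; a minimal sketch of the check (the exact columns depend on your sysstat version):

$ mpstat -P ALL 1 5    # a single core pegged in %soft while the rest sit idle is the symptom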
To tackle this problem, we used tooling, in particular systemtap, to investigate and resolve it step by step.
First, let's confirm the NIC configuration:
$ uname -r
2.6.32-131.21.1.tb399.el6.x86_64

$ lspci -vvvv
01:00.0 Ethernet controller: Broadcom Corporation NetXtreme II BCM5709 Gigabit Ethernet (rev 20)
        Subsystem: Broadcom Corporation NetXtreme II BCM5709 Gigabit Ethernet
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR-
        Latency: 0, Cache Line Size: 256 bytes
        Interrupt: pin A routed to IRQ 114
        Region 0: Memory at f6000000 (64-bit, non-prefetchable) [size=32M]
        Capabilities: <access denied>
01:00.1 Ethernet controller: Broadcom Corporation NetXtreme II BCM5709 Gigabit Ethernet (rev 20)
        Subsystem: Broadcom Corporation NetXtreme II BCM5709 Gigabit Ethernet
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR-
        Latency: 0, Cache Line Size: 256 bytes
        Interrupt: pin B routed to IRQ 122
        Region 0: Memory at f8000000 (64-bit, non-prefetchable) [size=32M]
        Capabilities: <access denied>

$ cat /proc/net/bonding/bond0
Ethernet Channel Bonding Driver: v3.6.0 (September 26, 2009)

Bonding Mode: fault-tolerance (active-backup)
Primary Slave: None
Currently Active Slave: em1
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0

Slave Interface: em1
MII Status: up
Link Failure Count: 0
Permanent HW addr: 78:2b:cb:1f:eb:c9
Slave queue ID: 0

Slave Interface: em2
MII Status: up
Link Failure Count: 0
Permanent HW addr: 78:2b:cb:1f:eb:ca
Slave queue ID: 0
From the output above we can confirm that two Broadcom Corporation NetXtreme II BCM5709 Gigabit Ethernet (rev 20) NICs are bonded together.
The kernel we run is maintained by our kernel team and is based on RHEL 6.1, so interrupt and softirq statistics are readily available from /proc/interrupts and /proc/softirqs.
Let's look at the softirqs in particular. Because the machine has many CPUs and the full output is noisy, I only show the first 7 cores:
$ cat /proc/softirqs|tr -s ' ' '\t'|cut -f 1-8
                    CPU0        CPU1        CPU2        CPU3        CPU4        CPU5        CPU6
          HI:          0           0           0           0           0           0
       TIMER:  401626149   366513734   274660062   286091775   252287943   258932438
      NET_TX:     136905       10428       17269       25080       16613       17876
      NET_RX: 1898437808  2857018450   580117978       26443       11820       15545
       BLOCK:  716495491   805780859   113853932   132589667   106297189   104629321
BLOCK_IOPOLL:          0           0           0           0           0           0           0
     TASKLET:  190643874   775861235           0           0           1           0
       SCHED:   61726009    66994763   102590355    83277433   144588168   154635009
     HRTIMER:    1883420     1837160     2316722     2369920     1721755     1666867
         RCU:  391610041   365150626   275741153   287074106   253401636   260389306
From the above we can roughly see that both the receive (NET_RX) and transmit (NET_TX) softirqs of the NIC are unbalanced across cores.
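To quantify the imbalance instead of eyeballing the raw counters, a small helper along these lines (my own sketch; the file name net_rx_rate.sh is just for illustration) diffs the per-CPU NET_RX counters over one second:

$ cat net_rx_rate.sh
#!/bin/bash
# Diff the per-CPU NET_RX counters in /proc/softirqs over a 1-second window.
read -a a <<< "$(grep NET_RX /proc/softirqs)"
sleep 1
read -a b <<< "$(grep NET_RX /proc/softirqs)"
# Field 0 is the "NET_RX:" label; the remaining fields are per-CPU totals.
for ((i = 1; i < ${#a[@]}; i++)); do
    printf "CPU%d: %d NET_RX/s\n" $((i - 1)) $(( ${b[$i]} - ${a[$i]} ))
done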
These numbers alone are not enough to tell why one core is being crushed, because this machine also has another heavy interrupt producer: a fusionIO PCIe card. In earlier tests that card also consumed a lot of CPU, so we cannot yet conclude that the NIC is the culprit. Let's use stap to double-check:
$ cat i.stp
global hard, soft, wq

probe irq_handler.entry {
        hard[irq, dev_name]++;
}

probe timer.s(1) {
        println("==irq number:dev_name")
        foreach ([irq, dev_name] in hard- limit 5) {
                printf("%d,%s->%d\n", irq, kernel_string(dev_name), hard[irq, dev_name]);
        }

        println("==softirq cpu:h:vec:action")
        foreach ([c, h, vec, action] in soft- limit 5) {
                printf("%d:%x:%x:%s->%d\n", c, h, vec, symdata(action), soft[c, h, vec, action]);
        }

        println("==workqueue wq_thread:work_func")
        foreach ([wq_thread, work_func] in wq- limit 5) {
                printf("%x:%x->%d\n", wq_thread, work_func, wq[wq_thread, work_func]);
        }

        println("\n")
        delete hard
        delete soft
        delete wq
}

probe softirq.entry {
        soft[cpu(), h, vec, action]++;
}

probe workqueue.execute {
        wq[wq_thread, work_func]++
}

probe begin {
        println("~")
}

$ sudo stap i.stp
==irq number:dev_name
73,em1-6->7150
50,iodrive-fct0->7015
71,em1-4->6985
74,em1-7->6680
69,em1-2->6557
==softirq cpu:h:vec:action
1:ffffffff81a23098:ffffffff81a23080:0xffffffff81411110->36627
1:ffffffff81a230b0:ffffffff81a23080:0xffffffff8106f950->2169
1:ffffffff81a230a0:ffffffff81a23080:0xffffffff81237100->1736
0:ffffffff81a230a0:ffffffff81a23080:0xffffffff81237100->1308
1:ffffffff81a23088:ffffffff81a23080:0xffffffff81079ee0->941
==workqueue wq_thread:work_func
ffff880c14268a80:ffffffffa026b390->51
ffff880c1422e0c0:ffffffffa026b390->30
ffff880c1425f580:ffffffffa026b390->25
ffff880c1422f540:ffffffffa026b390->24
ffff880c14268040:ffffffffa026b390->23

# Symbol information for the softirq actions above:
$ addr2line -e /usr/lib/debug/lib/modules/2.6.32-131.21.1.tb411.el6.x86_64/vmlinux ffffffff81411110
/home/ads/build22_6u0_x64/workspace/kernel-el6/origin/taobao-kernel-build/kernel-2.6.32-131.21.1.el6/linux-2.6.32-131.21.1.el6.x86_64/net/core/ethtool.c:653
$ addr2line -e /usr/lib/debug/lib/modules/2.6.32-131.21.1.tb411.el6.x86_64/vmlinux ffffffff810dc3a0
/home/ads/build22_6u0_x64/workspace/kernel-el6/origin/taobao-kernel-build/kernel-2.6.32-131.21.1.el6/linux-2.6.32-131.21.1.el6.x86_64/kernel/relay.c:466
$ addr2line -e /usr/lib/debug/lib/modules/2.6.32-131.21.1.tb411.el6.x86_64/vmlinux ffffffff81079ee0
/home/ads/build22_6u0_x64/workspace/kernel-el6/origin/taobao-kernel-build/kernel-2.6.32-131.21.1.el6/linux-2.6.32-131.21.1.el6.x86_64/include/trace/events/timer.h:118
$ addr2line -e /usr/lib/debug/lib/modules/2.6.32-131.21.1.tb411.el6.x86_64/vmlinux ffffffff8105d120
/home/ads/build22_6u0_x64/workspace/kernel-el6/origin/taobao-kernel-build/kernel-2.6.32-131.21.1.el6/linux-2.6.32-131.21.1.el6.x86_64/kernel/sched.c:2460
This time we can easily see that the hardware interrupts are essentially balanced, while the softirqs are almost all piled onto core 1, and the symbol lookups confirm that the NIC is responsible.
Good. Now that the problem is located, fixing it is straightforward:
1. Use a multi-queue 10-gigabit NIC.
2. Use Google's RPS (Receive Packet Steering) patch to spread the softirq load across different cores.
We went with the poor man's option and wrote a small shell script to do it:
$ cat em.sh
#! /bin/bash

for i in `seq 0 7`
do
    echo f|sudo tee /sys/class/net/em1/queues/rx-$i/rps_cpus >/dev/null
    echo f|sudo tee /sys/class/net/em2/queues/rx-$i/rps_cpus >/dev/null
done

$ sudo ./em.sh
$ mpstat -P ALL 1
The NIC softirqs are now successfully spread over two cores instead of dragging a single core down.
A short summary: it pays to keep a close eye on your systems.
Postscript:
--------------------------------------------------------------------------------
A reader commented:
According to our tests, the BCM5709 should support multiple queues.
Judging by the hardware interrupt counts it is indeed balanced, which means multi-queue is working. But then why are the softirqs unbalanced, and what other work is piling up on CPU1? The investigation continues!
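A quick way to double-check that multiple queue vectors exist and are actually firing is to look at the NIC's rows in /proc/interrupts (the per-queue naming, e.g. em1-0, em1-1, depends on the driver):

$ grep em1 /proc/interrupts     # one row per queue vector; the per-CPU columns show where each fires
$ grep -c em1 /proc/interrupts  # count the vectors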
Several months went by, and today we hit the softirq problem again in production!
This time the NICs look like this:
Network: em1 (igb): Intel I350 Gigabit, bc:30:5b:ee:b8:60, 1Gb/s
Network: em2 (igb): Intel I350 Gigabit, bc:30:5b:ee:b8:60, 1Gb/s
Network: em3 (igb): Intel I350 Gigabit, bc:30:5b:ee:b8:62, no carrier
Network: em4 (igb): Intel I350 Gigabit, bc:30:5b:ee:b8:63, no carrier
These are Intel multi-queue NICs; from dmesg we can see that this NIC has 8 hardware interrupts:
# dmesg|grep igb
[ 6.467025] igb 0000:01:00.0: irq 108 for MSI/MSI-X
[ 6.467031] igb 0000:01:00.0: irq 109 for MSI/MSI-X
[ 6.467037] igb 0000:01:00.0: irq 110 for MSI/MSI-X
[ 6.467043] igb 0000:01:00.0: irq 111 for MSI/MSI-X
[ 6.467050] igb 0000:01:00.0: irq 112 for MSI/MSI-X
[ 6.467056] igb 0000:01:00.0: irq 113 for MSI/MSI-X
[ 6.467062] igb 0000:01:00.0: irq 114 for MSI/MSI-X
[ 6.467068] igb 0000:01:00.0: irq 115 for MSI/MSI-X
[ 6.467074] igb 0000:01:00.0: irq 116 for MSI/MSI-X
The same softirq imbalance again. With help from @普空 and @炳天, we learned, roughly, that with this NIC each hardware interrupt can only be bound to a single core when setting interrupt affinity.
普空 also shared a script, set_irq_affinity.sh, which I modified so that the binding can start from a specified core. The script is as follows:
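In other words, each queue vector gets an affinity mask with exactly one CPU bit set. As a minimal hand-done illustration (IRQ 109 is only an example number; check /proc/interrupts for the real ones), binding a single vector to core 2 means writing the hex mask 4 (bit 2):

# echo 4 > /proc/irq/109/smp_affinity
# cat /proc/irq/109/smp_affinity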
# cat set_irq_affinity.sh
# setting up irq affinity according to /proc/interrupts
# 2008-11-25 Robert Olsson
# 2009-02-19 updated by Jesse Brandeburg
#
# > Dave Miller:
# (To get consistent naming in /proc/interrupts)
# I would suggest that people use something like:
#       char buf[IFNAMSIZ+6];
#
#       sprintf(buf, "%s-%s-%d",
#               netdev->name,
#               (RX_INTERRUPT ? "rx" : "tx"),
#               queue->index);
#
# Assuming a device with two RX and TX queues.
# This script will assign:
#
#       eth0-rx-0  CPU0
#       eth0-rx-1  CPU1
#       eth0-tx-0  CPU0
#       eth0-tx-1  CPU1
#

set_affinity()
{
    if [ $VEC -ge 32 ]
    then
        MASK_FILL=""
        MASK_ZERO="00000000"
        let "IDX = $VEC / 32"
        for ((i=1; i<=$IDX; i++))
        do
            MASK_FILL="${MASK_FILL},${MASK_ZERO}"
        done

        let "VEC -= 32 * $IDX"
        MASK_TMP=$((1<<$VEC))
        MASK=`printf "%X%s" $MASK_TMP $MASK_FILL`
    else
        MASK_TMP=$((1<<(`expr $VEC + $CORE`)))
        MASK=`printf "%X" $MASK_TMP`
    fi

    printf "%s mask=%s for /proc/irq/%d/smp_affinity\n" $DEV $MASK $IRQ
    printf "%s" $MASK > /proc/irq/$IRQ/smp_affinity
}

# Need at least a starting core and one interface; more interfaces may follow.
if [ $# -lt 2 ] ; then
    echo "Description:"
    echo "    This script attempts to bind each queue of a multi-queue NIC"
    echo "    to the same numbered core, ie tx0|rx0 --> cpu0, tx1|rx1 --> cpu1"
    echo "usage:"
    echo "    $0 core eth0 [eth1 eth2 eth3]"
    exit
fi

CORE=$1

# check for irqbalance running
IRQBALANCE_ON=`ps ax | grep -v grep | grep -q irqbalance; echo $?`
if [ "$IRQBALANCE_ON" == "0" ] ; then
    echo " WARNING: irqbalance is running and will"
    echo "          likely override this script's affinitization."
    echo "          Please stop the irqbalance service and/or execute"
    echo "          'killall irqbalance'"
fi

#
# Set up the desired devices.
#
shift 1

for DEV in $*
do
    for DIR in rx tx TxRx
    do
        MAX=`grep $DEV-$DIR /proc/interrupts | wc -l`
        if [ "$MAX" == "0" ] ; then
            MAX=`egrep -i "$DEV:.*$DIR" /proc/interrupts | wc -l`
        fi
        if [ "$MAX" == "0" ] ; then
            echo no $DIR vectors found on $DEV
            continue
        fi
        for VEC in `seq 0 1 $MAX`
        do
            IRQ=`cat /proc/interrupts | grep -i $DEV-$DIR-$VEC"$" | cut -d: -f1 | sed "s/ //g"`
            if [ -n "$IRQ" ]; then
                set_affinity
            else
                IRQ=`cat /proc/interrupts | egrep -i $DEV:v$VEC-$DIR"$" | cut -d: -f1 | sed "s/ //g"`
                if [ -n "$IRQ" ]; then
                    set_affinity
                fi
            fi
        done
    done
done
The script is invoked as: set_irq_affinity.sh core eth0 [eth1 eth2 eth3]
Several NICs can be configured in one run; core is the CPU number to start binding from, incrementing from there.
Let's see it in action:
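Concretely, the mask the script writes is just a single bit at position VEC + CORE, printed in hexadecimal; a tiny worked example for the first vector (VEC=0) when starting from core 8:

# Same arithmetic as set_affinity(): one bit at position VEC + CORE, formatted as hex.
VEC=0
CORE=8
printf "%X\n" $((1 << (VEC + CORE)))    # prints 100, i.e. the mask for CPU8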
# ./set_irq_affinity.sh 0 em1
no rx vectors found on em1
no tx vectors found on em1
em1 mask=1 for /proc/irq/109/smp_affinity
em1 mask=2 for /proc/irq/110/smp_affinity
em1 mask=4 for /proc/irq/111/smp_affinity
em1 mask=8 for /proc/irq/112/smp_affinity
em1 mask=10 for /proc/irq/113/smp_affinity
em1 mask=20 for /proc/irq/114/smp_affinity
em1 mask=40 for /proc/irq/115/smp_affinity
em1 mask=80 for /proc/irq/116/smp_affinity

# ./set_irq_affinity.sh 8 em2
no rx vectors found on em2
no tx vectors found on em2
em2 mask=100 for /proc/irq/118/smp_affinity
em2 mask=200 for /proc/irq/119/smp_affinity
em2 mask=400 for /proc/irq/120/smp_affinity
em2 mask=800 for /proc/irq/121/smp_affinity
em2 mask=1000 for /proc/irq/122/smp_affinity
em2 mask=2000 for /proc/irq/123/smp_affinity
em2 mask=4000 for /proc/irq/124/smp_affinity
em2 mask=8000 for /proc/irq/125/smp_affinity
炳天 also pointed out that, since the hardware interrupts are now balanced across cores, we no longer need RPS, so we disable it with the following script:
# cat em.sh
#! /bin/bash

for i in `seq 0 7`
do
    echo 0|sudo tee /sys/class/net/em1/queues/rx-$i/rps_cpus >/dev/null
    echo 0|sudo tee /sys/class/net/em2/queues/rx-$i/rps_cpus >/dev/null
done

# ./em.sh
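To confirm that RPS really is off, the rps_cpus masks can be read back; they should all show an all-zero mask afterwards (a quick check of my own, using the same sysfs paths as the script above):

# grep . /sys/class/net/em1/queues/rx-*/rps_cpus
# grep . /sys/class/net/em2/queues/rx-*/rps_cpus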
Finally, be sure to turn off irqbalance, which does nothing useful here except get in the way:
# service irqbalance stop
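Stopping the service only lasts until the next reboot; on a RHEL 6 style system you can also keep it from coming back at boot (assuming the SysV chkconfig tooling is in use):

# chkconfig irqbalance off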
--------------------------------------------------------------------------------
Addendum:
There was a lively discussion on Weibo; here is a summary of the useful points raised:
1. @孺風 pointed out that when binding NIC interrupts it is best to bind them, one after another, to cores of the same physical CPU, to avoid L1/L2/L3 cache thrashing. So how do we know which core belongs to which physical CPU? lscpu can help:
$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 16
On-line CPU(s) list: 0-15
Thread(s) per core: 2
Core(s) per socket: 4
CPU socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 44
Stepping: 2
CPU MHz: 2134.000
BogoMIPS: 4266.61
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 12288K
NUMA node0 CPU(s): 4-7,12-15
NUMA node1 CPU(s): 0-3,8-11
So cores 4-7 and 12-15 belong to one physical CPU, and cores 0-3 and 8-11 to the other (a sysfs cross-check is sketched after this list).
2. Why can't the NIC spread its interrupts across multiple cores automatically? Our colleague @炳天 is following up on this, with the goal of solving the problem once and for all.
3. Is RPS useful at all? When the NIC supports multiple queues, RPS does not really buy you anything; RPS is the poor man's solution.
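Back to point 1: to cross-check the core-to-socket mapping beyond lscpu, sysfs exposes it directly; a small sketch using the standard topology files:

# Print the physical package (socket) each logical CPU belongs to.
for c in /sys/devices/system/cpu/cpu[0-9]*; do
    echo "$(basename $c) -> package $(cat $c/topology/physical_package_id)"
done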
Have fun!