
Linux Kernel BUG: soft lockup CPU Problem (Part 2)

In a large-customer deployment, the product reported the following error after running for some time:

Aug  7 19:19:58 localhost kernel: NMI watchdog: BUG: soft lockup - CPU#7 stuck for 23s! [xxxx:80779]

The call stack in the kernel log showed a failure to acquire a lock. The lock is a read/write lock, taken with read_lock or write_lock, and it protects against races on insertions into and deletions from the linked lists hanging off a 256-bucket hash table (sketched below). My first reaction, based on user-space experience, was: if the lock is unavailable, isn't the task simply suspended and woken up again once the lock becomes free? How could it hog a CPU for more than 20 seconds? Reading the kernel's read/write lock implementation shows that kernel rwlocks busy-wait, just like spin_lock. So the lock must have been held by someone else for a long time, and the caller in the stack trace spun on it for more than 20 seconds until that CPU's watchdog noticed (the watchdog runs from IRQ context, so it is not blocked by a task spinning on a lock). This also confirms that a pure user-space deadlock cannot cause a soft lockup.
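As a point of reference, here is a minimal sketch of the structure described above, with hypothetical names (conn_table, conn_lock, struct conn_node) standing in for the actual product code:

```c
#include <linux/list.h>
#include <linux/spinlock.h>   /* rwlock_t, read_lock()/write_lock() */

#define CONN_HASH_SIZE 256    /* 256-bucket hash table, as in the product */

struct conn_node {
	struct hlist_node link;   /* linkage into one bucket's list */
	u32 key;
	/* ... payload ... */
};

static struct hlist_head conn_table[CONN_HASH_SIZE];
static DEFINE_RWLOCK(conn_lock);   /* protects every bucket list */

/* Insert takes the lock exclusively.  Note: write_lock() spins until the
 * lock is free; the caller is never put to sleep. */
static void conn_insert(struct conn_node *node)
{
	write_lock(&conn_lock);
	hlist_add_head(&node->link, &conn_table[node->key % CONN_HASH_SIZE]);
	write_unlock(&conn_lock);
}

/* Lookup takes the lock shared, and also busy-waits while a writer holds it. */
static struct conn_node *conn_lookup(u32 key)
{
	struct conn_node *node;

	read_lock(&conn_lock);
	hlist_for_each_entry(node, &conn_table[key % CONN_HASH_SIZE], link) {
		if (node->key == key) {
			read_unlock(&conn_lock);
			return node;
		}
	}
	read_unlock(&conn_lock);
	return NULL;
}
```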

Combining this with the product's kernel code, the possible causes were:
1) Incorrect ordering of P/V operations on multiple kernel locks, leading to deadlock;
2) Too many nodes on a hash chain, so the task holding the lock needs more than 20s to finish;
3) A task sleeping while holding the lock (as with spin_lock, sleeping under the lock is forbidden);
4) Deadlock for some other reason.

Code review ruled out 1). Adding debug instrumentation to print the length of every hash chain ruled out 2). Reviewing the code also showed that a kmalloc is performed while holding the lock during hash-table operations, but its second argument is GFP_ATOMIC, which never sleeps (GFP_KERNEL could sleep), so 3) was ruled out as well (see the sketch below), and the investigation stalled. I then looked for a kernel-provided debug facility to help and found lockdep. lockdep requires changing the default kernel configuration and rebuilding the kernel; how to configure and enable it is described in the linked post.
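Continuing with the hypothetical structures from the sketch above, the allocation pattern found in the code looked roughly like this; GFP_ATOMIC never sleeps, which is why this allocation could be ruled out as an instance of cause 3):

```c
#include <linux/errno.h>
#include <linux/slab.h>    /* kmalloc(), GFP_ATOMIC */

static int conn_add(u32 key)
{
	struct conn_node *node;

	write_lock(&conn_lock);
	/* GFP_ATOMIC may fail under memory pressure but never sleeps, so
	 * allocating while holding the spinning rwlock is legal here.
	 * GFP_KERNEL would be a bug: it may sleep while the lock is held. */
	node = kmalloc(sizeof(*node), GFP_ATOMIC);
	if (!node) {
		write_unlock(&conn_lock);
		return -ENOMEM;
	}
	node->key = key;
	hlist_add_head(&node->link, &conn_table[key % CONN_HASH_SIZE]);
	write_unlock(&conn_lock);
	return 0;
}
```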

After rebuilding the kernel with that configuration and running until the soft lockup reproduced, the moment "Interrupt" showed up in dmesg it suddenly made sense: the task holding the lock was interrupted, and the interrupt path needed the same lock, hence the deadlock. The read/write lock calls read_lock/write_lock should therefore be changed to the interrupt-disabling variants read_lock_irqsave/write_lock_irqsave, as sketched below.
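A sketch of that fix, again using the hypothetical names from above: the process-context users must disable local interrupts while holding the lock, so the network receive path cannot interrupt them on the same CPU and spin on the same rwlock:

```c
static void conn_insert_fixed(struct conn_node *node)
{
	unsigned long flags;

	/* Disabling local interrupts while holding the lock prevents the
	 * softirq receive path (udp_rcv -> ... -> read_lock in the trace
	 * below) from preempting this CPU and deadlocking on the rwlock. */
	write_lock_irqsave(&conn_lock, flags);
	hlist_add_head(&node->link, &conn_table[node->key % CONN_HASH_SIZE]);
	write_unlock_irqrestore(&conn_lock, flags);
}
```

Since the lockdep report below shows the in-interrupt user is a softirq (IN-SOFTIRQ-R), the _bh variants (read_lock_bh/write_lock_bh) would in principle also be sufficient; the irqsave variants are simply the more conservative choice.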

dmesg output:

[ 4773.122536] =================================
[ 4773.122539] [ INFO: inconsistent lock state ]
[ 4773.122542] 3.10.101 #9 Tainted: GF          O
[ 4773.122544] ---------------------------------
[ 4773.122547] inconsistent {IN-SOFTIRQ-R} -> {SOFTIRQ-ON-W} usage.
[ 4773.122551] xxx/17184 [HC0[0]:SC0[0]:HE1:SE1] takes:
[ 4773.122554]  (
[ 4773.122566] {IN-SOFTIRQ-R} state was registered at:
[ 4773.122568]   [<ffffffff810bf9bb>] __lock_acquire+0x5ab/0xc30
[ 4773.122576]   [<ffffffff810c00f0>] lock_acquire+0xb0/0x160
[ 4773.122580]   [<ffffffff815cf7d9>] _raw_read_lock+0x39/0x70
[ 4773.122586]   [<ffffffffa08a891a>] xxxx
[ 4773.122590]   [<ffffffffa08ab2c0>] xxxx
[ 4773.122595]   [<ffffffff81519f09>] udp_queue_rcv_skb+0x1b9/0x570
[ 4773.122600]   [<ffffffff8151a911>] __udp4_lib_rcv+0x181/0xa20
[ 4773.122604]   [<ffffffff8151b1ca>] udp_rcv+0x1a/0x20
[ 4773.122609]   [<ffffffff814e8512>] ip_local_deliver_finish+0xe2/0x300
[ 4773.122615]   [<ffffffff814e8ef7>] ip_local_deliver+0x47/0x80
[ 4773.122619]   [<ffffffff814e8890>] ip_rcv_finish+0x160/0x630
[ 4773.122624]   [<ffffffff814e9159>] ip_rcv+0x229/0x370
[ 4773.122628]   [<ffffffff814af0a6>] __netif_receive_skb_core+0x6c6/0x8a0
[ 4773.122634]   [<ffffffff814af298>] __netif_receive_skb+0x18/0x60
[ 4773.122639]   [<ffffffff814af363>] netif_receive_skb+0x83/0xf0
[ 4773.122644]   [<ffffffff814afe80>] napi_gro_receive+0x80/0xb0
[ 4773.122649]   [<ffffffffa004e0b4>] rtl8169_poll+0x174/0x68f [r8169]
[ 4773.122668]   [<ffffffff814af76a>] net_rx_action+0x15a/0x2a0
[ 4773.122673]   [<ffffffff81063e57>] __do_softirq+0x107/0x2a0
[ 4773.122678]   [<ffffffff815d9e3c>] call_softirq+0x1c/0x30
[ 4773.122683]   [<ffffffff810156ed>] do_softirq+0x8d/0xc0
[ 4773.122688]   [<ffffffff810641cd>] irq_exit+0xdd/0xf0
[ 4773.122692]   [<ffffffff815da766>] do_IRQ+0x56/0xc0
[ 4773.122697]   [<ffffffff815d01f2>] ret_from_intr+0x0/0x1a
[ 4773.122701]   [<ffffffff8109378b>] try_to_wake_up+0x1cb/0x2e0
[ 4773.122707]   [<ffffffff810938b5>] wake_up_process+0x15/0x20
[ 4773.122711]   [<ffffffff81237c91>] do_msgsnd+0x401/0x430
[ 4773.122717]   [<ffffffff81237d0d>] SyS_msgsnd+0x4d/0x60
[ 4773.122721]   [<ffffffff815d8ba0>] tracesys+0xdd/0xe2
[ 4773.122726] irq event stamp: 12549
[ 4773.122728] hardirqs last  enabled at (12549): [<ffffffff81142105>] user_exit+0x55/0xb0
[ 4773.122732] hardirqs last disabled at (12548): [<ffffffff811420eb>] user_exit+0x3b/0xb0
[ 4773.122736] softirqs last  enabled at (12468): [<ffffffff8149b463>] release_sock+0x173/0x1d0
[ 4773.122741] softirqs last disabled at (12464): [<ffffffff815cf478>] _raw_spin_lock_bh+0x18/0x70
[ 4773.122746] 
other info that might help us debug this:
[ 4773.122748]  Possible unsafe locking scenario:
[ 4773.122751]        CPU7
[ 4773.122752]        ----
[ 4773.122754]   lock(&xxx);
[ 4773.122757]   <Interrupt>
[ 4773.122759]     lock(&xxx);
[ 4773.122762]  *** DEADLOCK ***