被sleep開了個小玩笑
Ⅰ、問題背景
探活腳本連續8此探測,判斷主庫異常,觸發切換(判斷備機是否有延遲,kill原主,VIP飄到備機,設置新主可寫)
切換後,業務還是異常,SQL查詢沒返回,DB連接數耗完,為了恢復業務,重啟新主後業務恢復
兩個問題
- 主庫為什麽會切換
- 切換後新主庫為什麽還是不可用
Ⅱ、問題排查
1、根據系統運行時間、系統日誌以及服務器帶外日誌可排除服務器和數據句OOM
2、從每分鐘的縣城快照中發現故障時大量線程處於sending data和statistics狀態,但前一分鐘快照中未看到任何阻塞與鎖等待
3、被阻塞的sql都是基於主鍵或者業務索引訪問,理論上沒問題,提取sql在從庫執行一遍,很快,且當時沒有產生慢日誌,固排除sql執行效率慢導致阻塞
4、statistics是為sql生成執行計劃的,會觸發表的統計操作,而統計操作需要對表中page進行采樣,會觸發io,分析當時磁盤iops、吞吐量、cpu負載等,和前一天基本吻合,排除系統負載導致性能下降
-----------------
思路中斷~~~
復盤,開發反映業務切到新db,業務各個接口耗時變大,詢問是否新庫服務器性能不如老庫
會後上機器發現實例上存在20個處於user sleep狀態的sql,大概模型都是where id = ‘+(select ‘rbzd‘ where 6910=6910 and sleep(300)+)‘,比較可疑,因為開發不會在程序中調用sleep函數
重點分析此sql
此sql不占用系統資源,但是寫法可以,類似sql註入
經查,sleep操作和innodb_thread_concurrency參數互斥,這樣每秒只能處理24個sql
換言之當有24個線程進入引擎並處於sleep狀態的話,其他線程是不能進入innodb引擎層,這裏的24是和線上MySQL參數innodb_thread_concurrency被設置為24有關
Ⅱ、模擬故障
2.1 準備數據
CREATE TABLE `test` (
`id` int(11) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
insert into test select 1;
2.2 開24個線程訪問此表
提前設置innodb_thread_concurrency設置為24模擬線上情況
[root@VM_42_63_centos ~]# for i in `seq 1 24`; do nohup mysql -S /tmp/mysql.sock3306 -e "select id from test.test where id=‘‘+(select ‘rbzd‘ where 6910=6910 and sleep(300))+‘‘" & done
2.3 觀察一波
執行上一步腳本後,再開一個session訪問test表會發現線程hang住,沒返回,狀態為sending data,如下:
(root@localhost) [test]> show processlist;
+----+------+-----------------+------+---------+------+--------------+----------------------------------------------------------------------------------------+
| Id | User | Host | db | Command | Time | State | Info |
+----+------+-----------------+------+---------+------+--------------+----------------------------------------------------------------------------------------+
| 2 | root | localhost:38204 | NULL | Sleep | 0 | | NULL |
| 3 | root | localhost | test | Query | 0 | starting | show processlist |
| 4 | root | localhost | NULL | Query | 23 | User sleep | select id from test.test where id=‘‘+(select ‘rbzd‘ where 6910=6910 and sleep(300))+‘‘ |
| 5 | root | localhost | NULL | Query | 23 | User sleep | select id from test.test where id=‘‘+(select ‘rbzd‘ where 6910=6910 and sleep(300))+‘‘ |
| 6 | root | localhost | NULL | Query | 23 | User sleep | select id from test.test where id=‘‘+(select ‘rbzd‘ where 6910=6910 and sleep(300))+‘‘ |
| 7 | root | localhost | NULL | Query | 23 | User sleep | select id from test.test where id=‘‘+(select ‘rbzd‘ where 6910=6910 and sleep(300))+‘‘ |
| 8 | root | localhost | NULL | Query | 23 | User sleep | select id from test.test where id=‘‘+(select ‘rbzd‘ where 6910=6910 and sleep(300))+‘‘ |
| 9 | root | localhost | NULL | Query | 23 | User sleep | select id from test.test where id=‘‘+(select ‘rbzd‘ where 6910=6910 and sleep(300))+‘‘ |
| 10 | root | localhost | NULL | Query | 23 | User sleep | select id from test.test where id=‘‘+(select ‘rbzd‘ where 6910=6910 and sleep(300))+‘‘ |
| 11 | root | localhost | NULL | Query | 23 | User sleep | select id from test.test where id=‘‘+(select ‘rbzd‘ where 6910=6910 and sleep(300))+‘‘ |
| 12 | root | localhost | NULL | Query | 23 | User sleep | select id from test.test where id=‘‘+(select ‘rbzd‘ where 6910=6910 and sleep(300))+‘‘ |
| 13 | root | localhost | NULL | Query | 23 | User sleep | select id from test.test where id=‘‘+(select ‘rbzd‘ where 6910=6910 and sleep(300))+‘‘ |
| 14 | root | localhost | NULL | Query | 23 | User sleep | select id from test.test where id=‘‘+(select ‘rbzd‘ where 6910=6910 and sleep(300))+‘‘ |
| 15 | root | localhost | NULL | Query | 23 | User sleep | select id from test.test where id=‘‘+(select ‘rbzd‘ where 6910=6910 and sleep(300))+‘‘ |
| 16 | root | localhost | NULL | Query | 23 | User sleep | select id from test.test where id=‘‘+(select ‘rbzd‘ where 6910=6910 and sleep(300))+‘‘ |
| 17 | root | localhost | NULL | Query | 23 | User sleep | select id from test.test where id=‘‘+(select ‘rbzd‘ where 6910=6910 and sleep(300))+‘‘ |
| 18 | root | localhost | NULL | Query | 23 | User sleep | select id from test.test where id=‘‘+(select ‘rbzd‘ where 6910=6910 and sleep(300))+‘‘ |
| 19 | root | localhost | NULL | Query | 23 | User sleep | select id from test.test where id=‘‘+(select ‘rbzd‘ where 6910=6910 and sleep(300))+‘‘ |
| 20 | root | localhost | NULL | Query | 23 | User sleep | select id from test.test where id=‘‘+(select ‘rbzd‘ where 6910=6910 and sleep(300))+‘‘ |
| 21 | root | localhost | NULL | Query | 23 | User sleep | select id from test.test where id=‘‘+(select ‘rbzd‘ where 6910=6910 and sleep(300))+‘‘ |
| 22 | root | localhost | NULL | Query | 23 | User sleep | select id from test.test where id=‘‘+(select ‘rbzd‘ where 6910=6910 and sleep(300))+‘‘ |
| 23 | root | localhost | NULL | Query | 23 | User sleep | select id from test.test where id=‘‘+(select ‘rbzd‘ where 6910=6910 and sleep(300))+‘‘ |
| 24 | root | localhost | NULL | Query | 23 | User sleep | select id from test.test where id=‘‘+(select ‘rbzd‘ where 6910=6910 and sleep(300))+‘‘ |
| 25 | root | localhost | NULL | Query | 23 | User sleep | select id from test.test where id=‘‘+(select ‘rbzd‘ where 6910=6910 and sleep(300))+‘‘ |
| 26 | root | localhost | NULL | Query | 23 | User sleep | select id from test.test where id=‘‘+(select ‘rbzd‘ where 6910=6910 and sleep(300))+‘‘ |
| 27 | root | localhost | NULL | Query | 23 | User sleep | select id from test.test where id=‘‘+(select ‘rbzd‘ where 6910=6910 and sleep(300))+‘‘ |
| 28 | root | localhost | NULL | Query | 11 | Sending data | select id from test.test |
+----+------+-----------------+------+---------+------+--------------+----------------------------------------------------------------------------------------+
27 rows in set (0.00 sec)
Ⅲ、故障回顧
1、sql註入,user sleep狀態的sql累計到24條之後,其他sql就不能進入innodb進行操作,包括高可用探測程序,由於線程快照中過濾了sleep,導致沒抓到註入的sql,加大了後續排查難度
2、主庫的存活探測程序檢查失敗(探測方式為update一個innodb表),連續8次失敗後,ha認為實例異常,則kill主,觸發切換流程
3、切換到主庫後,註入還在繼續,所以同樣的故障在新主上重演
4、重啟主庫後,僅有20個註入進入innodb且一直為user sleep,由於沒達到24個觸發閾值,所以業務表現正常,只是性能不及老主庫,原因就是已經有20個線程在innodb層一直沒退出,kill掉這些線程,業務恢復正常
Ⅳ、問題解決
- 由於線程快照過濾了sleep關鍵字,導致排查過程較長,後續要對user sleep狀態線程保留在線程快照中
- 開發優化程序,防止sql註入
被sleep開了個小玩笑