1. 程式人生 > >被sleep開了個小玩笑

被sleep開了個小玩笑

update mysq 比較 roc 基於 流程 stat 問題排查 oom

Ⅰ、問題背景

探活腳本連續8此探測,判斷主庫異常,觸發切換(判斷備機是否有延遲,kill原主,VIP飄到備機,設置新主可寫)
切換後,業務還是異常,SQL查詢沒返回,DB連接數耗完,為了恢復業務,重啟新主後業務恢復

兩個問題

  • 主庫為什麽會切換
  • 切換後新主庫為什麽還是不可用

Ⅱ、問題排查

1、根據系統運行時間、系統日誌以及服務器帶外日誌可排除服務器和數據句OOM
2、從每分鐘的縣城快照中發現故障時大量線程處於sending data和statistics狀態,但前一分鐘快照中未看到任何阻塞與鎖等待
3、被阻塞的sql都是基於主鍵或者業務索引訪問,理論上沒問題,提取sql在從庫執行一遍,很快,且當時沒有產生慢日誌,固排除sql執行效率慢導致阻塞
4、statistics是為sql生成執行計劃的,會觸發表的統計操作,而統計操作需要對表中page進行采樣,會觸發io,分析當時磁盤iops、吞吐量、cpu負載等,和前一天基本吻合,排除系統負載導致性能下降
-----------------
思路中斷~~~

復盤,開發反映業務切到新db,業務各個接口耗時變大,詢問是否新庫服務器性能不如老庫
會後上機器發現實例上存在20個處於user sleep狀態的sql,大概模型都是where id = ‘+(select ‘rbzd‘ where 6910=6910 and sleep(300)+)‘,比較可疑,因為開發不會在程序中調用sleep函數

重點分析此sql
此sql不占用系統資源,但是寫法可以,類似sql註入
經查,sleep操作和innodb_thread_concurrency參數互斥,這樣每秒只能處理24個sql
換言之當有24個線程進入引擎並處於sleep狀態的話,其他線程是不能進入innodb引擎層,這裏的24是和線上MySQL參數innodb_thread_concurrency被設置為24有關

Ⅱ、模擬故障

2.1 準備數據

CREATE TABLE `test` (
`id` int(11) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

insert into test select 1;

2.2 開24個線程訪問此表

提前設置innodb_thread_concurrency設置為24模擬線上情況

[root@VM_42_63_centos ~]# for i in `seq 1 24`; do nohup mysql -S /tmp/mysql.sock3306 -e "select id from test.test where id=‘‘+(select ‘rbzd‘ where 6910=6910 and sleep(300))+‘‘" & done

2.3 觀察一波

執行上一步腳本後,再開一個session訪問test表會發現線程hang住,沒返回,狀態為sending data,如下:

(root@localhost) [test]> show processlist;
+----+------+-----------------+------+---------+------+--------------+----------------------------------------------------------------------------------------+
| Id | User | Host            | db   | Command | Time | State        | Info                                                                                   |
+----+------+-----------------+------+---------+------+--------------+----------------------------------------------------------------------------------------+
|  2 | root | localhost:38204 | NULL | Sleep   |    0 |              | NULL                                                                                   |
|  3 | root | localhost       | test | Query   |    0 | starting     | show processlist                                                                       |
|  4 | root | localhost       | NULL | Query   |   23 | User sleep   | select id from test.test where id=‘‘+(select ‘rbzd‘ where 6910=6910 and sleep(300))+‘‘ |
|  5 | root | localhost       | NULL | Query   |   23 | User sleep   | select id from test.test where id=‘‘+(select ‘rbzd‘ where 6910=6910 and sleep(300))+‘‘ |
|  6 | root | localhost       | NULL | Query   |   23 | User sleep   | select id from test.test where id=‘‘+(select ‘rbzd‘ where 6910=6910 and sleep(300))+‘‘ |
|  7 | root | localhost       | NULL | Query   |   23 | User sleep   | select id from test.test where id=‘‘+(select ‘rbzd‘ where 6910=6910 and sleep(300))+‘‘ |
|  8 | root | localhost       | NULL | Query   |   23 | User sleep   | select id from test.test where id=‘‘+(select ‘rbzd‘ where 6910=6910 and sleep(300))+‘‘ |
|  9 | root | localhost       | NULL | Query   |   23 | User sleep   | select id from test.test where id=‘‘+(select ‘rbzd‘ where 6910=6910 and sleep(300))+‘‘ |
| 10 | root | localhost       | NULL | Query   |   23 | User sleep   | select id from test.test where id=‘‘+(select ‘rbzd‘ where 6910=6910 and sleep(300))+‘‘ |
| 11 | root | localhost       | NULL | Query   |   23 | User sleep   | select id from test.test where id=‘‘+(select ‘rbzd‘ where 6910=6910 and sleep(300))+‘‘ |
| 12 | root | localhost       | NULL | Query   |   23 | User sleep   | select id from test.test where id=‘‘+(select ‘rbzd‘ where 6910=6910 and sleep(300))+‘‘ |
| 13 | root | localhost       | NULL | Query   |   23 | User sleep   | select id from test.test where id=‘‘+(select ‘rbzd‘ where 6910=6910 and sleep(300))+‘‘ |
| 14 | root | localhost       | NULL | Query   |   23 | User sleep   | select id from test.test where id=‘‘+(select ‘rbzd‘ where 6910=6910 and sleep(300))+‘‘ |
| 15 | root | localhost       | NULL | Query   |   23 | User sleep   | select id from test.test where id=‘‘+(select ‘rbzd‘ where 6910=6910 and sleep(300))+‘‘ |
| 16 | root | localhost       | NULL | Query   |   23 | User sleep   | select id from test.test where id=‘‘+(select ‘rbzd‘ where 6910=6910 and sleep(300))+‘‘ |
| 17 | root | localhost       | NULL | Query   |   23 | User sleep   | select id from test.test where id=‘‘+(select ‘rbzd‘ where 6910=6910 and sleep(300))+‘‘ |
| 18 | root | localhost       | NULL | Query   |   23 | User sleep   | select id from test.test where id=‘‘+(select ‘rbzd‘ where 6910=6910 and sleep(300))+‘‘ |
| 19 | root | localhost       | NULL | Query   |   23 | User sleep   | select id from test.test where id=‘‘+(select ‘rbzd‘ where 6910=6910 and sleep(300))+‘‘ |
| 20 | root | localhost       | NULL | Query   |   23 | User sleep   | select id from test.test where id=‘‘+(select ‘rbzd‘ where 6910=6910 and sleep(300))+‘‘ |
| 21 | root | localhost       | NULL | Query   |   23 | User sleep   | select id from test.test where id=‘‘+(select ‘rbzd‘ where 6910=6910 and sleep(300))+‘‘ |
| 22 | root | localhost       | NULL | Query   |   23 | User sleep   | select id from test.test where id=‘‘+(select ‘rbzd‘ where 6910=6910 and sleep(300))+‘‘ |
| 23 | root | localhost       | NULL | Query   |   23 | User sleep   | select id from test.test where id=‘‘+(select ‘rbzd‘ where 6910=6910 and sleep(300))+‘‘ |
| 24 | root | localhost       | NULL | Query   |   23 | User sleep   | select id from test.test where id=‘‘+(select ‘rbzd‘ where 6910=6910 and sleep(300))+‘‘ |
| 25 | root | localhost       | NULL | Query   |   23 | User sleep   | select id from test.test where id=‘‘+(select ‘rbzd‘ where 6910=6910 and sleep(300))+‘‘ |
| 26 | root | localhost       | NULL | Query   |   23 | User sleep   | select id from test.test where id=‘‘+(select ‘rbzd‘ where 6910=6910 and sleep(300))+‘‘ |
| 27 | root | localhost       | NULL | Query   |   23 | User sleep   | select id from test.test where id=‘‘+(select ‘rbzd‘ where 6910=6910 and sleep(300))+‘‘ |
| 28 | root | localhost       | NULL | Query   |   11 | Sending data | select id from test.test                                                               |
+----+------+-----------------+------+---------+------+--------------+----------------------------------------------------------------------------------------+
27 rows in set (0.00 sec)

Ⅲ、故障回顧

1、sql註入,user sleep狀態的sql累計到24條之後,其他sql就不能進入innodb進行操作,包括高可用探測程序,由於線程快照中過濾了sleep,導致沒抓到註入的sql,加大了後續排查難度

2、主庫的存活探測程序檢查失敗(探測方式為update一個innodb表),連續8次失敗後,ha認為實例異常,則kill主,觸發切換流程

3、切換到主庫後,註入還在繼續,所以同樣的故障在新主上重演

4、重啟主庫後,僅有20個註入進入innodb且一直為user sleep,由於沒達到24個觸發閾值,所以業務表現正常,只是性能不及老主庫,原因就是已經有20個線程在innodb層一直沒退出,kill掉這些線程,業務恢復正常

Ⅳ、問題解決

  • 由於線程快照過濾了sleep關鍵字,導致排查過程較長,後續要對user sleep狀態線程保留在線程快照中
  • 開發優化程序,防止sql註入

被sleep開了個小玩笑