
Walkthrough and Analysis of an AlwaysOn Failover Caused by a Disk I/O Fault

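Below is a failover case we ran into while using AlwaysOn. It happened in August 2014, which is some time ago, but it is still useful for understanding how and why AlwaysOn fails over. The failover was triggered by a system I/O problem. Keep in mind that an I/O problem at the operating-system level does not necessarily make SQL Server fail over immediately, because the bad sectors may not lie in an area the SQL Server instance uses. As time passes and data accumulates, however, the damaged area can grow and the problem can escalate. In this case, 16 minutes passed between the first I/O error and the SQL Server failover, so a delay like this should not be surprising. The parts to focus on are the failover process and the settings that define the trigger conditions, that is, Parts 2 and 3 of this article.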

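1. Tracing the System I/O Errors in the Logs

1.1  10:36:12  I/O errors first detected

1.2  10:45:43  Individual reads and writes reported as taking a long time

1.3  10:45:28  The I/O problem appears severe

1.4  10:52:20  Individual connection failures start to appear

(The last record in the table is timestamped 10:53:17.)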
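The screenshots that accompanied these log entries are not reproduced here. As a rough sketch, and not part of the original investigation, per-file I/O latency on an instance could be checked with the `sys.dm_io_virtual_file_stats` function. The figures are cumulative since the instance started, so they only hint at a problem and do not pinpoint when it began.

```sql
-- Average I/O stall per read and per write for every database file.
-- Values are cumulative since instance startup, not a live measurement.
SELECT DB_NAME(vfs.database_id)                              AS database_name,
       mf.physical_name,
       vfs.num_of_reads,
       vfs.io_stall_read_ms  / NULLIF(vfs.num_of_reads, 0)   AS avg_read_stall_ms,
       vfs.num_of_writes,
       vfs.io_stall_write_ms / NULLIF(vfs.num_of_writes, 0)  AS avg_write_stall_ms
FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
JOIN sys.master_files AS mf
  ON mf.database_id = vfs.database_id
 AND mf.file_id     = vfs.file_id
ORDER BY avg_write_stall_ms DESC;
```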

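2. The AlwaysOn Failover Process

2.1 The system reports that a failover is required.

2.2 The local replica of the availability group has to go offline.

(Background: Lease expired event from the cluster. Possible causes include loss of lease, possible network issues and sp_server_diagnostics query timeout.)

2.3 The error message shows that the connection between the SQL Server instance and WSFC is broken.

2.4 The role of the availability replica changes.

2.5 While the role is RESOLVING, the databases cannot be accessed.

(Background: When the role of an availability replica is indeterminate, such as during a failover, its databases are temporarily in a NOT SYNCHRONIZING state. Their role is set to RESOLVING until the role of the availability replica has resolved.)

At this point the databases cannot be reached through SSMS either, and their state is shown as Not Synchronizing.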
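The same role and synchronization information can also be read with T-SQL instead of SSMS. The following is a minimal sketch using the standard AlwaysOn DMVs. It was not part of the original incident analysis, and a replica that is still RESOLVING may only report on itself.

```sql
-- Role and connection state of each availability replica known to this instance.
SELECT ar.replica_server_name,
       ars.role_desc,                 -- PRIMARY, SECONDARY or RESOLVING
       ars.operational_state_desc,
       ars.connected_state_desc,
       ars.synchronization_health_desc
FROM sys.dm_hadr_availability_replica_states AS ars
JOIN sys.availability_replicas AS ar
  ON ar.replica_id = ars.replica_id;

-- Synchronization state of the local availability databases.
SELECT DB_NAME(drs.database_id)       AS database_name,
       drs.synchronization_state_desc, -- e.g. NOT SYNCHRONIZING during a failover
       drs.synchronization_health_desc
FROM sys.dm_hadr_database_replica_states AS drs
WHERE drs.is_local = 1;
```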

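3. Background Knowledge

3.1  What is the resource DLL, and what is it for?

Because AlwaysOn availability groups are built on top of Windows Server Failover Clustering (WSFC), an availability group needs a cluster resource DLL to connect the Windows cluster with the SQL Server instance. The availability group is a cluster resource, so the Windows cluster relies on the AlwaysOn resource DLL to bring the resource online or offline, check whether it has failed, change its state and properties, and send commands to the availability replica instance. (The resource type of an AlwaysOn availability group is "SQL Server Availability Group".)

AlwaysOn checks the health of the availability group through sp_server_diagnostics, which continuously returns diagnostic information. The results are compared with the availability group's FailureConditionLevel setting to determine whether the conditions for a failover are met. Once they are met, the availability group is switched to a new availability replica.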
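To see the kind of information the resource DLL receives, the procedure can also be run by hand. This is only an illustration of the output, not a reproduction of how the DLL itself calls it. Each row describes one component (for example system, resource, query_processing and io_subsystem) with its state in the state_desc column.

```sql
-- Single run (default @repeat_interval = 0): returns the current health
-- state of each component once and then exits.
EXEC sp_server_diagnostics;

-- Continuous run: emits a new result set every 10 seconds until the
-- batch is cancelled. Uncomment to try it.
-- EXEC sp_server_diagnostics @repeat_interval = 10;
```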

3.2  HealthCheckTimeout

The HealthCheckTimeout setting is used to specify the length of time, in milliseconds, that the SQL Server resource DLL should wait for information returned by the sp_server_diagnostics stored procedure before reporting the AlwaysOn Failover Cluster Instance (FCI) as unresponsive. Changes that are made to the timeout settings are effective immediately and do not require a restart of the SQL Server resource.

The resource DLL determines the responsiveness of the SQL instance using a health check timeout. The HealthCheckTimeout property defines how long the resource DLL should wait for the sp_server_diagnostics stored procedure before it reports the SQL instance as unresponsive to the WSFC service.

The following items describe how this property affects timeout and repeat interval settings:

  • The resource DLL calls the sp_server_diagnostics stored procedure and sets the repeat interval to one-third of the HealthCheckTimeout setting.
  • If the sp_server_diagnostics stored procedure is slow or is not returning information, the resource DLL will wait for the interval specified by HealthCheckTimeout before it reports to the WSFC service that the SQL instance is unresponsive.
  • If the dedicated connection is lost, the resource DLL will retry the connection to the SQL instance for the interval specified by HealthCheckTimeout before it reports to the WSFC service that the SQL instance is unresponsive.
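For an availability group, the configured timeout is exposed in `sys.availability_groups`. The sketch below reads it and applies the one-third rule described above to show the approximate polling interval. The column alias `approx_repeat_interval_ms` is introduced here purely for illustration. With a timeout of 30,000 ms the computed interval is 10,000 ms.

```sql
-- Health-check timeout per availability group, plus the polling interval
-- implied by the "one-third of HealthCheckTimeout" rule.
SELECT name,
       health_check_timeout,                          -- milliseconds
       health_check_timeout / 3 AS approx_repeat_interval_ms
FROM sys.availability_groups;
```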

3.3  FailureConditionLevel

The SQL Server Database Engine resource DLL determines whether the detected health status is a condition for failure using the FailureConditionLevel property. The FailureConditionLevel property defines which detected health statuses cause restarts or failovers.

Review sp_server_diagnostics (Transact-SQL), as this system stored procedure plays an important role in the failure condition levels.

Level 0: No automatic failover or restart

  • Indicates that no failover or restart will be triggered automatically on any failure conditions. This level is for system maintenance purposes only.

Level 1: Failover or restart on server down

Indicates that a server restart or failover will be triggered if the following condition is raised:

  • SQL Server service is down.

Level 2: Failover or restart on server unresponsive

Indicates that a server restart or failover will be triggered if any of the following conditions are raised:

  • SQL Server service is down.
  • SQL Server instance is not responsive (Resource DLL cannot receive data from sp_server_diagnostics within the HealthCheckTimeout settings).

Level 3: Failover or restart on critical server errors

Indicates that a server restart or failover will be triggered if any of the following conditions are raised:

  • SQL Server service is down.
  • SQL Server instance is not responsive (Resource DLL cannot receive data from sp_server_diagnostics within the HealthCheckTimeout settings).
  • System stored procedure sp_server_diagnostics returns ‘system error’.

Level 4: Failover or restart on moderate server errors

Indicates that a server restart or failover will be triggered if any of the following conditions are raised:

  • SQL Server service is down.
  • SQL Server instance is not responsive (Resource DLL cannot receive data from sp_server_diagnostics within the HealthCheckTimeout settings).
  • System stored procedure sp_server_diagnostics returns ‘system error’.
  • System stored procedure sp_server_diagnostics returns ‘resource error’.

Level 5: Failover or restart on any qualified failure conditions

Indicates that a server restart or failover will be triggered if any of the following conditions are raised:

  • SQL Server service is down.
  • SQL Server instance is not responsive (Resource DLL cannot receive data from sp_server_diagnostics within the HealthCheckTimeout settings).
  • System stored procedure sp_server_diagnostics returns ‘system error’.
  • System stored procedure sp_server_diagnostics returns ‘resource error’.
  • System stored procedure sp_server_diagnostics returns ‘query_processing error’.
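To check which level an availability group is currently using, the value can be read from `sys.availability_groups`. The sketch below labels each level with the condition names listed above. The alias `policy` is only illustrative, and the query assumes the usual availability group range of 1 to 5, so level 0 is not mapped.

```sql
-- Current failure condition level of each availability group,
-- labelled with the condition names from the level list.
SELECT name,
       failure_condition_level,
       CASE failure_condition_level
            WHEN 1 THEN 'Failover or restart on server down'
            WHEN 2 THEN 'Failover or restart on server unresponsive'
            WHEN 3 THEN 'Failover or restart on critical server errors'
            WHEN 4 THEN 'Failover or restart on moderate server errors'
            WHEN 5 THEN 'Failover or restart on any qualified failure conditions'
            ELSE 'Unmapped level'
       END AS policy
FROM sys.availability_groups;
```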

3.4  Changing the Settings with T-SQL

  The following example sets the HealthCheckTimeout option to 15,000 milliseconds (15 seconds).

ALTER SERVER CONFIGURATION 
SET FAILOVER CLUSTER PROPERTY HealthCheckTimeout = 15000;

  The following example sets the FailureConditionLevel property to 0, indicating that failover or restart will not be triggered automatically on any failure conditions.

ALTER SERVER CONFIGURATION SET FAILOVER CLUSTER PROPERTY FailureConditionLevel = 0;
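The two statements above change the properties of a failover cluster instance resource. For an availability group, the corresponding settings are, to my understanding, configured per group with ALTER AVAILABILITY GROUP. A minimal sketch follows. `[MyAG]` is a hypothetical group name, and the values are examples only, not recommendations.

```sql
-- [MyAG] is a hypothetical availability group name; run on the primary replica.

-- Raise the health-check timeout to 60 seconds (value is in milliseconds).
ALTER AVAILABILITY GROUP [MyAG] SET (HEALTH_CHECK_TIMEOUT = 60000);

-- Use level 1, the least sensitive of the five availability group levels.
ALTER AVAILABILITY GROUP [MyAG] SET (FAILURE_CONDITION_LEVEL = 1);
```

The new values can be confirmed afterwards by selecting health_check_timeout and failure_condition_level from sys.availability_groups.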

4. Conclusion

  Whether an availability replica fails over depends not only on the Availability Mode and the Failover Mode; it is also constrained by the FailureConditionLevel.