Troubleshooting high “gc current grant 2-way” and “gc cr grant 2-way” (Case 2)
A long time ago I ran into a case where a RAC environment also showed very high gc cr/current grant 2-way wait events: “Performance tuning ‘gc cr&current grant 2-way’ event (after the hosts’ CPUs were expanded)”. That case was caused by the two instances of the same cluster running on hosts with different CPU counts, which gave them different numbers of LMS processes and therefore unbalanced serving capacity. The case shared this time is a different kind of special situation.

From SMARTDBOPS
Persistently high GC-class waits:
NODE1
Event | Waits | Time(s) | Avg wait (ms) | % DB time | Wait Class |
---|---|---|---|---|---|
enq: TX – index contention | 27,057 | 32,531 | 1202 | 25.09 | Concurrency |
DB CPU | | 26,320 | | 20.30 | |
enq: TX – row lock contention | 165,277 | 9,237 | 56 | 7.12 | Application |
gc current grant 2-way | 61,769 | 9,208 | 149 | 7.10 | Cluster |
log file sync | 1,937,689 | 6,747 | 3 | 5.20 | Commit |
NODE2
Top 5 Timed Foreground Events
Event | Waits | Time(s) | Avg wait (ms) | % DB time | Wait Class |
---|---|---|---|---|---|
gc buffer busy acquire | 84,112 | 31,365 | 373 | 35.95 | Cluster |
gc current grant 2-way | 24,906 | 11,217 | 450 | 12.86 | Cluster |
gc current block 2-way | 71,984 | 8,190 | 114 | 9.39 | Cluster |
gc cr grant 2-way | 34,748 | 6,061 | 174 | 6.95 | Cluster |
enq: TX – index contention | 3,067 | 4,735 | 1544 | 5.43 | Concurrency |
Looking only at node 1, you would conclude that the problem is concurrency contention (enq: TX – index contention / row lock contention), plus some gc cr/current grant 2-way. The grant 2-way events mean the block is not in any instance's cache and the master simply grants permission to read it from disk.
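As a first sanity check you can see how much time each instance is actually spending on these grant events; a minimal sketch against GV$SYSTEM_EVENT (standard columns, thresholds left out):

-- Per-instance totals and implied average latency for the grant events
SELECT inst_id,
       event,
       total_waits,
       ROUND(time_waited_micro / 1e6, 1)                          AS time_waited_s,
       ROUND(time_waited_micro / 1000 / NULLIF(total_waits, 0), 2) AS avg_wait_ms
  FROM gv$system_event
 WHERE event IN ('gc cr grant 2-way', 'gc current grant 2-way')
 ORDER BY inst_id, event;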
There are a few golden rules when diagnosing RAC problems:
1. Don't be fooled by the top events
In a RAC environment the root cause may be hidden on another node and never appear in that node's top 5 events; the wait events it induces elsewhere can be even larger than the root-cause wait itself. Instance performance is tightly interlinked: a problem in one node's background processes (LMS, for example) can make all the other nodes show Application-class wait events.
The gc buffer busy xx wait events are mostly a symptom seen before you reach the real root cause in RAC performance analysis. Simply put, a session is waiting for the BL lock on a block, and while that BL lock is unavailable the time spent by every other process waiting for the same block is charged to gc buffer busy xx.
ASH and block dump trace files can be used to pin down gc buffer busy waits, mainly by identifying the object, the object type and the block type (an ASH sketch follows the table below):
Object type | Block type | Possible issue(s) |
---|---|---|
Table | Segment header | Freelists, freelist groups, deletes from one node, ASSM bugs, etc |
Table | Segment blocks | Heavy insert workload, heavy update to few blocks, SQL performance issues scanning few objects aggressively. |
Index | Leaf block | Numerous inserts on indexed columns, sequence or sysdate generated keys |
Index | Root / branch block | Inserts + numerous index scans, concurrent inserts and deletes, etc |
Undo | Undo header block | Numerous short transactions |
Undo | Undo block | Long pending transaction and CR block generation |
Dictionary | SEQ$ (object) | Sequences with nocache or order set and excessive access to sequence. |
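As an example of the ASH approach, a hedged sketch that maps gc buffer busy samples back to segments through CURRENT_OBJ# (rows that do not resolve to a dictionary object, such as undo, will drop out of the join):

-- Which segments sit behind the gc buffer busy waits (in-memory ASH)
SELECT o.owner, o.object_name, o.object_type, COUNT(*) AS samples
  FROM gv$active_session_history a
  JOIN dba_objects o ON o.object_id = a.current_obj#
 WHERE a.event LIKE 'gc buffer busy%'
 GROUP BY o.owner, o.object_name, o.object_type
 ORDER BY samples DESC;

The wait parameters of those samples (file# and block# on 11.2) can then be used for a block dump to confirm the block type: segment header, leaf block, undo header and so on.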
2. Eliminate infrastructure problems
gc cr/current grant 2-way: when the block is not in any instance's buffer cache, the time is charged to this event. The foreground process sends a request to the remote LMS and the remote LMS replies 'read from disk'. High gc 2-way/3-way waits imply congestion caused by high concurrency; the minimum time to transfer a block or message is normally < 2 ms (< 0.5 ms on Exadata). Check the Wait Event Histogram section of AWR to see how the 2-way/3-way events spread across the time buckets, and rule out network problems (a quick interconnect check is sketched below).
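A quick way to rule out the most basic infrastructure mistake (cache fusion traffic running over the wrong interface) is to check what each instance is really using; a minimal sketch:

-- Which interface each instance uses for the interconnect, and where the
-- setting came from (OCR, init.ora, OS dependent, ...)
SELECT inst_id, name, ip_address, is_public, source
  FROM gv$cluster_interconnects
 ORDER BY inst_id;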
3. Identify the problem instance
SELECT instance || '->' || inst_id transfer,
       class,
       cr_block cr_blk,
       Trunc(cr_block_time / cr_block / 1000, 2) avg_cr,
       current_block cur_blk,
       Trunc(current_block_time / current_block / 1000, 2) avg_cur
  FROM gv$instance_cache_transfer
 WHERE cr_block > 0
   AND current_block > 0
 ORDER BY instance, inst_id, class
/
4. Check the sender-side metrics, not just the receiver's
On the sender side:
gc cr block receive time=
Time to send message to a remote LMS process by FG
+ Time taken by LMS to build block (statistics: gc cr block build time)
+ LMS wait for LGWR latency ( statistics: gc cr block flush time)
+ LMS send time (Statistics: gc cr block send time)
+ Wire latency.
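The components above map to instance statistics, so the sender side can be checked directly on the serving node; a hedged sketch using the 11.2 statistic names:

-- Sender-side components of CR block serving; divide the time statistics by
-- 'gc cr blocks served' to get a per-block figure comparable with AWR
SELECT inst_id, name, value
  FROM gv$sysstat
 WHERE name IN ('gc cr blocks served',
                'gc cr block build time',
                'gc cr block flush time',
                'gc cr block send time')
 ORDER BY inst_id, name;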
5. Use histograms to spot outliers, not just averages
That is, the Wait Event Histogram section in AWR (a live GV$EVENT_HISTOGRAM sketch follows).
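Outside AWR, the same distribution is visible live; a minimal sketch for the gc 2-way events:

-- How the 2-way waits distribute across latency buckets (ms)
SELECT inst_id, event, wait_time_milli, wait_count
  FROM gv$event_histogram
 WHERE event IN ('gc cr grant 2-way', 'gc current grant 2-way',
                 'gc cr block 2-way', 'gc current block 2-way')
 ORDER BY inst_id, event, wait_time_milli;

A healthy system keeps almost all samples in the lowest buckets; a long tail points at the interconnect, the LMS processes or CPU starvation rather than the application.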
Case analysis
Back to the case that opened this post, where AAS showed the GC wait problem: this is an 11.2.0.3 two-node RAC database on HP-UX 11.31.
# instance 1
Global Cache Load Profile
 | Per Second | Per Transaction |
---|---|---|
Global Cache blocks received: | 1,018.74 | 1.08 |
Global Cache blocks served: | 78.53 | 0.08 |
GCS/GES messages received: | 904.33 | 0.96 |
GCS/GES messages sent: | 2,532.56 | 2.68 |
DBWR Fusion writes: | 3.07 | 0.00 |
Estd Interconnect traffic (KB) | 18,227.67 |
Global Cache Efficiency Percentages (Target local+remote 100%)
Buffer access – local cache %: | 99.48 |
Buffer access – remote cache %: | 0.39 |
Buffer access – disk %: | 0.14 |
Global Cache and Enqueue Services – Workload Characteristics
Avg global enqueue get time (ms): | 2.9 |
Avg global cache cr block receive time (ms): | 6.8 |
Avg global cache current block receive time (ms): | 3.8 |
Avg global cache cr block build time (ms): | 0.0 |
Avg global cache cr block send time (ms): | 0.0 |
Global cache log flushes for cr blocks served %: | 0.5 |
Avg global cache cr block flush time (ms): | 29.3 |
Avg global cache current block pin time (ms): | 1.4 |
Avg global cache current block send time (ms): | 0.0 |
Global cache log flushes for current blocks served %: | 2.1 |
Avg global cache current block flush time (ms): | 30.8 |
Global Cache and Enqueue Services – Messaging Statistics
Avg message sent queue time (ms): | 250.4 (issue is on this instance) |
Avg message sent queue time on ksxp (ms): | 6.8 |
Avg message received queue time (ms): | 3.0 |
Avg GCS message process time (ms): | 1.6 |
Avg GES message process time (ms): | 0.0 |
% of direct sent messages: | 43.52 |
% of indirect sent messages: | 45.21 |
% of flow controlled messages: | 11.27 |
TIPS:
The Average message sent queue time in AWR comes from (G)V$GES_STATISTICS. It is the time needed to put an indirect message onto the message queue and send it (the GCS/GES communication layer). We then check the KSXP message timing to decide whether the problem sits in the IPC layer; IPC time is normally < 1 ms, and X$KSXPCLIENT can be queried for more detail. A very high value can also mean the CPU was completely busy (100%) during the interval or the interconnect was saturated, which needs to be confirmed against OSWatcher data.
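For reference, the raw numbers behind that AWR line can be pulled directly; a hedged sketch (GV$GES_STATISTICS is simple name/value pairs, and X$KSXPCLIENT must be queried as SYS):

-- GES message-queue statistics that feed the AWR averages above
SELECT inst_id, name, value
  FROM gv$ges_statistics
 WHERE LOWER(name) LIKE '%queue%'
 ORDER BY inst_id, name;

-- As SYS: per-client IPC (KSXP) counters for more detail on the IPC layer
SELECT * FROM x$ksxpclient;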
# instance 2
Global Cache Load Profile
 | Per Second | Per Transaction |
---|---|---|
Global Cache blocks received: | 96.24 | 0.75 |
Global Cache blocks served: | 1,043.19 | 8.17 |
GCS/GES messages received: | 3,607.20 | 28.26 |
GCS/GES messages sent: | 1,342.88 | 10.52 |
DBWR Fusion writes: | 3.09 | 0.02 |
Estd Interconnect traffic (KB) | 19,197.66 |
Global Cache Efficiency Percentages (Target local+remote 100%)
Buffer access – local cache %: | 99.43 |
Buffer access – remote cache %: | 0.20 |
Buffer access – disk %: | 0.37 |
Global Cache and Enqueue Services – Workload Characteristics
Avg global enqueue get time (ms): | 3.5 |
Avg global cache cr block receive time (ms): | 509.7 |
Avg global cache current block receive time (ms): | 96.6 |
Avg global cache cr block build time (ms): | 0.0 |
Avg global cache cr block send time (ms): | 0.0 |
Global cache log flushes for cr blocks served %: | 1.0 |
Avg global cache cr block flush time (ms): | 1.3 |
Avg global cache current block pin time (ms): | 0.2 |
Avg global cache current block send time (ms): | 0.0 |
Global cache log flushes for current blocks served %: | 0.1 |
Avg global cache current block flush time (ms): | 1.7 |
Global Cache and Enqueue Services – Messaging Statistics
Avg message sent queue time (ms): | 0.0 |
Avg message sent queue time on ksxp (ms): | 142.6 |
Avg message received queue time (ms): | 0.0 |
Avg GCS message process time (ms): | 0.0 |
Avg GES message process time (ms): | 0.0 |
% of direct sent messages: | 73.94 |
% of indirect sent messages: | 18.43 |
% of flow controlled messages: | 7.63 |
Tips:
Avg global cache cr block receive time (ms): the time from when the requesting instance sends its message to the mastering instance until the CR block arrives. It includes the time the holding instance spends building the consistent-read image. CR block receive time should not normally exceed 15 ms.
Avg global cache current block receive time (ms): the time from the requesting instance's message to the mastering instance (2-way get), and in some cases on to the holding instance (3-way get), until the current block arrives. It includes the log flush time on the holding instance. Current block receive time should not normally exceed 30 ms.
Avg message sent queue time on ksxp (ms): the time until the remote IPC layer acknowledges the message and returns an ACK. It is normally a small number of milliseconds, generally < 1 ms, and directly reflects network latency. The send time is measured when a block is handed to the IPC layer to be sent. The requests are queued in a completion queue, and the send time is measured from when a block is handed off to the IPC layer until the system call returns a completion, i.e. a status indicating that the send request has been successfully passed on to the network and control returns to the cache layer.
If this value is high, there may be a problem with the cluster interconnect or with the LMS processes.
To run an IPC diagnostic dump:
oradebug setmypid
oradebug ipc
1) If the network configuration looks fine, check whether the TCP and UDP buffers are undersized:
root# netstat -s | grep overflow   (the overflow counters should normally be 0)
root# netstat -i
2) Check the LMS process load: ps aux | grep lms
Check whether certain segments are generating excessive global CR requests:
SELECT statistic_name stat, owner, object_name obj, SUM(value) val
  FROM v$segment_statistics
 WHERE statistic_name LIKE 'global%'
   AND value > 10000
 GROUP BY statistic_name, owner, object_name
 ORDER BY val DESC;
Also check whether the CR request counts in V$CR_BLOCK_SERVER grow to large values over a period of time:
select CR_REQUESTS cr, CURRENT_REQUESTS cur, DATA_REQUESTS data, UNDO_REQUESTS undo, TX_REQUESTS tx from v$cr_block_server;
Confirming the problem node
WAIT time | NODE1 | NODE2 |
---|---|---|
Avg global cache cr block receive time (ms) | 6.8 | 509.7 |
Avg global cache current block receive time (ms) | 3.8 | 96.6 |
Note:
In a RAC environment a slow node causes gc performance problems on the other nodes. The inflated receive times are reported on node 2, which means the blocks it requests are being served slowly; by that logic, node 1 is the problem node.
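The AWR averages can be cross-checked live from the instance statistics; a hedged sketch (compare the time/count ratio between the instances rather than the absolute values):

-- Per-instance GC receive counters: the instance with the inflated ratio is
-- the one waiting on blocks served by the slow node
SELECT inst_id, name, value
  FROM gv$sysstat
 WHERE name IN ('gc cr blocks received',      'gc cr block receive time',
                'gc current blocks received', 'gc current block receive time')
 ORDER BY inst_id, name;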
# LMS process CPU usage at the time
# node1
oracle@qdty1:/home/oracle:11G> ps -efl|head -n 1;ps -efl|grep lms
F S UID PID PPID C PRI NI ADDR SZ WCHAN STIME TTY TIME COMD
1401 S oracle 5117 1 237 178 20 e000001c934ba340 94816 e000000792a7f050 Jul 13 ? 15776:29 ora_lms1_tyjc1
1401 S grid 23810 1 0 178 20 e000001c9de0e040 90895 e000000792a7f1d0 Jan 25 ? 1550:30 asm_lms0_+ASM1
1401 R oracle 5119 1 254 178 20 e000001ce7c1b9c0 94816 - Jul 13 ? 16214:17 ora_lms2_tyjc1
401 S oracle 936677300 154 20 e000001cdb304c40 30 e000000da0da6472 10:50:38 pts/4 0:00 grep lms
1401 R oracle 5115 1 252 178 20 e000001cf7042a00 94752 - Jul 13 ? 16329:36 ora_lms0_tyjc1
# node 2
oracle@qdty2:/home/oracle:11G> ps -efl|head -n 1;ps -efl|grep lms|grep -v grep
F S UID PID PPID C PRI NI ADDR SZ WCHAN STIME TTY TIME COMD
1401 S grid 12516 1 0 178 20 e000001b43146400 90854 e000000792d71348 Jan 25 ? 1461:04 asm_lms0_+ASM2
1401 S oracle 6446 1 24 178 20 e000001ef09b0d00 95666 e000000792d710d0 Jul 13 ? 16447:18 ora_lms2_tyjc2
1401 S oracle 6444 1 30 178 20 e000001ebe6c7d00 95666 e000000792d71248 Jul 13 ? 17583:22 ora_lms1_tyjc2
1401 S oracle 6442 1 43 178 20 e000001e7bc97c80 95666 e000000792d71240 Jul 13 ? 17764:57 ora_lms0_tyjc2
Note:
The platform is HP-UX 11.31, so the C (CPU) column reported by ps can exceed 100; even so, it shows that the LMS processes on node 1 were burning an enormous amount of CPU at the time, while node 2 was normal.
# node1
Instance Activity Stats
Statistic | Total | per Second | per Trans |
---|---|---|---|
data blocks consistent reads – undo records applied | 3.7 TB | 1.1 GB | 1.1 MB |
rollback changes – undo records applied | 4,074,774 | 1,124.49 | 1.19 |
transaction tables consistent reads – undo records applied | 100.1 MB | 28.3 KB | 0 B |
# node 2
Instance Activity Stats
Statistic | Total | per Second | per Trans |
---|---|---|---|
data blocks consistent reads – undo records applied | 4.6 GB | 2.7 MB | 21.6 KB |
rollback changes – undo records applied | 1,043 | 0.59 | 0.00 |
transaction tables consistent reads – undo records applied | 112.0 KB | 0 B | 0 B |
If the application SQL and the network are not the problem, why were the LMS processes using so much CPU? A quick search on MOS shows that we had hit yet another bug: 14672057.
The cause: starting with 11.2.0.2, to address the earlier bug 9569316, the parameter _max_cr_rollbacks was introduced, with a default of 0. It represents the maximum number of CR rollback blocks; there is not much published explanation of it. When a large amount of time is being spent rolling back undo to build CR blocks, setting it to a non-zero value enables the light-work rule, which helps relieve pressure on the LMS processes. The bug affects only 11.2.0.2 and 11.2.0.3; since the workaround is simply to change "_max_cr_rollbacks" dynamically, no one-off patch was provided, and the bug is fixed in 11.2.0.4. After making the change below, the GC waits and the enqueue waits disappeared almost instantly and the system returned to normal, so the real problem was not the index split contention suggested by node 1's top events.
SQL> alter system set "_max_cr_rollbacks"=200;

System altered.
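The current setting of the hidden parameter can be verified as SYS before and after the change; a commonly used sketch over the fixed tables:

-- Check the value and default flag of "_max_cr_rollbacks"
SELECT a.ksppinm  AS parameter,
       b.ksppstvl AS current_value,
       b.ksppstdf AS is_default
  FROM x$ksppi a, x$ksppcv b
 WHERE a.indx = b.indx
   AND a.ksppinm = '_max_cr_rollbacks';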
# LMS processes on node1 after the change
oracle@qdty1:/home/oracle:11G> ps -efl|head -n 1;ps -efl|grep lms|grep -v grep
F S UID PID PPID C PRI NI ADDR SZ WCHAN STIME TTY TIME COMD
1401 S oracle 5117 1 55 178 20 e000001c934ba340 94816 e000000792a7f0d0 Jul 13 ? 15778:04 ora_lms1_tyjc1
1401 S grid 23810 1 0 178 20 e000001c9de0e040 90895 e000000792a7f350 Jan 25 ? 1550:30 asm_lms0_+ASM1
1401 S oracle 5119 1 76 178 20 e000001ce7c1b9c0 94816 e000000792a7f1c8 Jul 13 ? 16216:40 ora_lms2_tyjc1
1401 S oracle 5115 1 80 178 20 e000001cf7042a00 94752 e000000792a7f3c0 Jul 13 ? 16331:29 ora_lms0_tyjc1