Troubleshooting an ora.asm Resource That Would Not Start and a GPnP Get Item Hang
Copyright notice: this is an original article by Buddy Yuan and may not be reproduced without permission.
Last night I went out to apply a DB PSU to an 11.2.0.3 RAC. I quickly stopped the database and the clusterware and installed the patch, thinking everything had gone smoothly, but when I tried to restart the clusterware it would not come up. `crsctl check crs` showed that the crsd service could not start, yet when I looked at crsd.log there were no entries at all; it was completely empty. Running `ocrcheck` hung indefinitely, while `crsctl query css votedisk` worked fine and returned quickly. I then ran `kfed` against the ASM disks and they were readable too. Further checking showed that the ora.asm resource was failing to start. Logging in to the ASM instance and examining its alert log, I could see that the spfile was readable and the instance itself started normally, but the diskgroups never reached the final step of a successful mount. As shown below:
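The triage steps above can be sketched as a short sequence of Grid Infrastructure commands (run from the grid home as root or the grid owner; the device path follows this environment's naming and is otherwise an assumption):

```
# Clusterware health -- here crsd was reported offline while css was fine
crsctl check crs

# ocrcheck hung, but the voting disks could still be listed quickly
ocrcheck
crsctl query css votedisk

# kfed reads an ASM disk header directly, bypassing the ASM instance,
# so it confirms the storage itself is readable
kfed read /dev/diskgroup/dg_ocr
```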
SQL> ALTER DISKGROUP ALL MOUNT /* asm agent call crs *//* {0:0:2} */
NOTE: Diskgroups listed in ASM_DISKGROUPS are DG_FRA DG_DATA
NOTE: Diskgroup used for Voting files is: DG_OCR
Diskgroup with spfile:DG_OCR
Diskgroup used for OCR is:DG_OCR
NOTE: cache registered group DG_DATA number=1 incarn=0x50ec3256
NOTE: cache began mount (not first) of group DG_DATA number=1 incarn=0x50ec3256
NOTE: cache registered group DG_FRA number=2 incarn=0x50fc3257
NOTE: cache began mount (not first) of group DG_FRA number=2 incarn=0x50fc3257
NOTE: cache registered group DG_OCR number=3 incarn=0x50fc3258
NOTE: cache began mount (not first) of group DG_OCR number=3 incarn=0x50fc3258
NOTE: Assigning number (1,3) to disk (/dev/diskgroup/dg_data4)
NOTE: Assigning number (1,2) to disk (/dev/diskgroup/dg_data3)
NOTE: Assigning number (1,1) to disk (/dev/diskgroup/dg_data2)
NOTE: Assigning number (1,0) to disk (/dev/diskgroup/dg_data1)
NOTE: Assigning number (2,0) to disk (/dev/diskgroup/dg_fra)
NOTE: Assigning number (3,0) to disk (/dev/diskgroup/dg_ocr)
GMON querying group 1 at 4 for pid 23, osid 128640
NOTE: cache opening disk 0 of grp 1: DG_DATA_0000 path:/dev/diskgroup/dg_data1
NOTE: F1X0 found on disk 0 au 2 fcn 0.0
NOTE: cache opening disk 1 of grp 1: DG_DATA_0001 path:/dev/diskgroup/dg_data2
NOTE: cache opening disk 2 of grp 1: DG_DATA_0002 path:/dev/diskgroup/dg_data3
NOTE: cache opening disk 3 of grp 1: DG_DATA_0003 path:/dev/diskgroup/dg_data4
NOTE: cache mounting (not first) external redundancy group 1/0x50EC3256 (DG_DATA)
As you can see, the beginning of the log shows the disks being read normally, with some other work done in between; then, at the very end, the following errors appear:
NOTE: detached from domain 3
NOTE: cache dismounted group 3/0x50FC3258 (DG_OCR)
NOTE: cache ending mount (fail) of group DG_OCR number=3 incarn=0x50fc3258
NOTE: cache deleting context for group DG_OCR 3/0x50fc3258
GMON dismounting group 3 at 9 for pid 23, osid 128640
NOTE: Diskin mode 0x8 marked for de-assignment
ERROR: diskgroup DG_OCR was not mounted
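Since an ASM alert log can be long, a quick grep for failed mounts narrows things down. A minimal sketch, using sample lines taken from the excerpt above (the sample file path is hypothetical; in practice you would grep the real alert_+ASM2.log):

```shell
# Write a few lines from the alert log excerpt above into a sample file
cat > /tmp/asm_alert_sample.log <<'EOF'
NOTE: cache ending mount (fail) of group DG_OCR number=3 incarn=0x50fc3258
GMON dismounting group 3 at 9 for pid 23, osid 128640
ERROR: diskgroup DG_OCR was not mounted
EOF

# Extract the names of the diskgroups that failed to mount
grep -oE 'ERROR: diskgroup [A-Za-z0-9_]+ was not mounted' /tmp/asm_alert_sample.log \
  | awk '{print $3}'
```

Against the sample above this prints `DG_OCR`, the diskgroup whose mount failed.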
Next I connected to the ASM instances and checked: on the healthy instance the diskgroups were in MOUNTED state, while on the problem instance they were in MOUNTING state, that is, the mount was still in progress. The open question was why the mount could not complete.
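To compare diskgroup state across both nodes in one shot, you can query gv$asm_diskgroup from either ASM instance (a sketch; connecting as sysasm is assumed):

```
SQL> SELECT inst_id, name, state
  2  FROM   gv$asm_diskgroup
  3  ORDER  BY inst_id, name;
```

In this case the rows for instance 1 showed MOUNTED while the problem instance's rows were stuck at MOUNTING.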
To get to the bottom of it, I ran a global hang analysis. It has to be global (cluster-wide); only then can you see why the diskgroup cannot mount.
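A cluster-wide hang analysis like the one below can be produced with oradebug from a SQL*Plus session on any ASM instance (11.2 syntax; level 3 is a judgment call that includes short stacks in the dump):

```
SQL> oradebug setmypid
SQL> oradebug unlimit
SQL> oradebug -g all hanganalyze 3
```

The `-g all` flag runs the analysis on every instance; the output lands in each instance's diag trace directory.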
*** 2018-10-26 04:36:34.280
===============================================================================
HANG ANALYSIS:
  instances (db_name.oracle_sid): +asm.+asm2, +asm.+asm1
  oradebug_node_dump_level: 3
  analysis initiated by oradebug
  os thread scheduling delay history: (sampling every 1.000000 secs)
    0.000000 secs at [ 04:36:34 ]
    NOTE: scheduling delay has not been sampled for 0.184529 secs
    0.000000 secs from [ 04:36:30 - 04:36:35 ], 5 sec avg
    0.000000 secs from [ 04:35:34 - 04:36:35 ], 1 min avg
    0.000000 secs from [ 04:31:34 - 04:36:35 ], 5 min avg
===============================================================================
Chains most likely to have caused the hang:
 [a] Chain 1 Signature: 'GPnP Get Item'<='DFS lock handle'
     Chain 1 Signature Hash: 0xe08712d6
 [b] Chain 2 Signature: 'GPnP Get Item'<='rdbms ipc reply'<='enq: DD - contention'
     Chain 2 Signature Hash: 0xec6b634f
 [c] Chain 3 Signature: 'GPnP Get Item'<='rdbms ipc reply'<='enq: DD - contention'
     Chain 3 Signature Hash: 0xec6b634f
===============================================================================
Non-intersecting chains:

-------------------------------------------------------------------------------
Chain 1:
-------------------------------------------------------------------------------
    Oracle session identified by:
    {
                instance: 2 (+asm.+asm2)
                   os id: 88049
              process id: 23, oracle@KSPX-DB2 (TNS V1-V3)
              session id: 1427
        session serial #: 3
    }
    is waiting for 'DFS lock handle' with wait info:
    {
                      p1: 'type|mode'=0x43490005
                      p2: 'id1'=0x3c
                      p3: 'id2'=0x2
            time in wait: 49 min 17 sec
           timeout after: never
                 wait id: 107
                blocking: 0 sessions
             current sql: ALTER DISKGROUP ALL MOUNT /* asm agent call crs *//* {0:0:2} */
             short stack: ksedsts()+461<-ksdxfstk()+32<-ksdxcb()+1876<-sspuser()+112<-__sighandler()<-semtimedop()+10<-skgpwwait()+160<-ksliwat()+1865<-kslwaitctx()+163<-kjusuc()+3619<-ksigeti()+3781<-ksbcic_int()+21066<-ksbcic()+12<-kfgbSendPeelWait()+53<-kfgFinalizeMount()+4736<-kfgscFinalize()+1405<-kfgForEachKfgsc()+193<-kfgsoFinalize()+135<-kfgFinalize()+396<-kfxdrvMount()+4781<-kfxdrvEntry()+2218<-opiexe()+20368<-opiosq0()+2948<-kpooprx()+274<-kpoal8()+829<-opiodr()+916<-ttcpip()+2242<-opitsk()+1673<-opiino()+966<-opiodr()+916<-opidr
            wait history:
              * time between current wait and wait #1: 0.000149 sec
              1.       event: 'rdbms ipc reply'
                  time waited: 0.000338 sec
                      wait id: 106             p1: 'from_process'=0x12
                                               p2: 'timeout'=0x7fffffff
              * time between wait #1 and #2: 0.000054 sec
              2.       event: 'kfk: async disk IO'
                  time waited: 0.000182 sec
                      wait id: 105             p1: 'count'=0x1
                                               p2: 'intr'=0x0
                                               p3: 'timeout'=0xffffffff
              * time between wait #2 and #3: 0.000157 sec
              3.       event: 'rdbms ipc reply'
                  time waited: 0.000372 sec
                      wait id: 104             p1: 'from_process'=0x12
                                               p2: 'timeout'=0x7fffffff
    }
    and is blocked by
 => Oracle session identified by:
    {
                instance: 1 (+asm.+asm1)
                   os id: 15643
              process id: 18, oracle@KSPX-DB1 (RBAL)
              session id: 1117
        session serial #: 1
    }
    which is waiting for 'GPnP Get Item' with wait info:
    {
            time in wait: 221831 min 33 sec
           timeout after: never
                 wait id: 124844244
                blocking: 44 sessions
             current sql:
             short stack: ksedsts()+461<-ksdxfstk()+32<-ksdxcb()+1876<-sspuser()+112<-__sighandler()<-__poll()+47<-sgipcwWaitHelper()+5271<-sgipcwWait()+475<-gipcWaitOsd()+281<-gipcInternalWait()+14965<-gipcWaitF()+3234<-gipcInternalRecvSync()+8749<-gipcRecvSyncF()+3343<-clsgpnpm_gipcGets()+297<-clsgpnpm_receiveMsg()+408<-clsgpnpm_exchange()+1059<-clsgpnp_profileCallUrlInt()+3568<-clsgpnp_getProfileEx()+325<-clsgpnp_dbmsGetItem()+263<-kggpnpAttrGet()+1272<-kfdParseProfileString()+885<-kfdDiscoverShallow()+2023<-kfgbDriver()+1745<-ksbabs()+
            wait history:
              * time between current wait and wait #1: 0.000007 sec
              1.       event: 'GPnP Initialization'
                  time waited: 0.018700 sec
                      wait id: 124844243
              * time between wait #1 and #2: 0.000207 sec
              2.       event: 'CSS initialization'
                  time waited: 0.000027 sec
                      wait id: 124844242
              * time between wait #2 and #3: 0.000062 sec
              3.       event: 'rdbms ipc message'
                  time waited: 0.002787 sec
                      wait id: 124844241       p1: 'timeout'=0x12c
    }
The global hang analysis shows that after the ASM instance on node 2 started, it ran "ALTER DISKGROUP ALL MOUNT", and that command never completed: it had been blocked for more than 40 minutes. It was blocked by the RBAL process on node 1, which was itself stuck on a wait event called "GPnP Get Item" (note its time in wait: 221831 minutes, so RBAL had been stuck for a very long time). Now that we knew the cause, we could search MOS for related material. The note "Diskgroup Mount Hangs with RBAL Waiting on 'GPnP Get Item' and 'enq: DD - contention' (Doc ID 1375505.1)" shows that this is caused by unpublished bug 12398300. The root cause is that the gpnpd process gets stuck, which in turn leaves RBAL waiting. The fix is actually simple: kill the gpnpd process on node 1 and the lock is released. Killing gpnpd does not restart the clusterware stack; only that one daemon is restarted.
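Finding and killing the stuck daemon can be sketched as follows (the binary name gpnpd.bin is standard for Grid Infrastructure; the pid placeholder is left for you to fill in, and ohasd's agent respawns the daemon on its own):

```
# On node 1, find the GPnP daemon
ps -ef | grep gpnpd.bin | grep -v grep

# Kill it; only this daemon is restarted, not the whole stack
kill -9 <gpnpd_pid>

# Verify the daemon has been respawned
ps -ef | grep gpnpd.bin | grep -v grep
```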
After killing gpnpd, the fault cleared and the clusterware on node 2 started successfully.
One takeaway here: whenever the ora.asm resource fails to start in the future, a global hang analysis is a good way to diagnose it.