Troubleshooting an ora.asm Resource That Would Not Start and a GPnP Get Item Hang
Copyright notice: this is an original article by Buddy Yuan and may not be reproduced without permission.
Last night I went out to apply a DB PSU to an 11.2.0.3 RAC. I quickly stopped the database and the clusterware and installed the patch, thinking everything had gone smoothly, but when I tried to restart the clusterware it would not come up. `crsctl check crs` showed that the crsd service could not start, yet when I looked at crsd.log there were no entries at all; it was completely empty. Running `ocrcheck` hung indefinitely, while `crsctl query css votedisk` worked fine and returned quickly. I then ran `kfed` against the ASM disks and they were readable too. Further checking showed that the ora.asm resource was failing to start. Logging in to the ASM instance and examining its alert log, I could see that the spfile was readable and the instance itself started normally, but the diskgroups never reached the final step of a successful mount. As shown below:
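The triage steps above can be sketched as a short sequence of Grid Infrastructure commands (run from the grid home as root or the grid owner; the device path follows this environment's naming and is otherwise an assumption):

```
# Clusterware health -- here crsd was reported offline while css was fine
crsctl check crs

# ocrcheck hung, but the voting disks could still be listed quickly
ocrcheck
crsctl query css votedisk

# kfed reads an ASM disk header directly, bypassing the ASM instance,
# so it confirms the storage itself is readable
kfed read /dev/diskgroup/dg_ocr
```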
SQL> ALTER DISKGROUP ALL MOUNT /* asm agent call crs *//* {0:0:2} */
NOTE: Diskgroups listed in ASM_DISKGROUPS are DG_FRA DG_DATA
NOTE: Diskgroup used for Voting files is: DG_OCR
Diskgroup with spfile:DG_OCR
Diskgroup used for OCR is:DG_OCR
NOTE: cache registered group DG_DATA number=1 incarn=0x50ec3256
NOTE: cache began mount (not first) of group DG_DATA number=1 incarn=0x50ec3256
NOTE: cache registered group DG_FRA number=2 incarn=0x50fc3257
NOTE: cache began mount (not first) of group DG_FRA number=2 incarn=0x50fc3257
NOTE: cache registered group DG_OCR number=3 incarn=0x50fc3258
NOTE: cache began mount (not first) of group DG_OCR number=3 incarn=0x50fc3258
NOTE: Assigning number (1,3) to disk (/dev/diskgroup/dg_data4)
NOTE: Assigning number (1,2) to disk (/dev/diskgroup/dg_data3)
NOTE: Assigning number (1,1) to disk (/dev/diskgroup/dg_data2)
NOTE: Assigning number (1,0) to disk (/dev/diskgroup/dg_data1)
NOTE: Assigning number (2,0) to disk (/dev/diskgroup/dg_fra)
NOTE: Assigning number (3,0) to disk (/dev/diskgroup/dg_ocr)
GMON querying group 1 at 4 for pid 23, osid 128640
NOTE: cache opening disk 0 of grp 1: DG_DATA_0000 path:/dev/diskgroup/dg_data1
NOTE: F1X0 found on disk 0 au 2 fcn 0.0
NOTE: cache opening disk 1 of grp 1: DG_DATA_0001 path:/dev/diskgroup/dg_data2
NOTE: cache opening disk 2 of grp 1: DG_DATA_0002 path:/dev/diskgroup/dg_data3
NOTE: cache opening disk 3 of grp 1: DG_DATA_0003 path:/dev/diskgroup/dg_data4
NOTE: cache mounting (not first) external redundancy group 1/0x50EC3256 (DG_DATA)
As you can see, the beginning of the log shows the disks being read normally, with some other work done in between; then, at the very end, the following errors appear:
NOTE: detached from domain 3
NOTE: cache dismounted group 3/0x50FC3258 (DG_OCR)
NOTE: cache ending mount (fail) of group DG_OCR number=3 incarn=0x50fc3258
NOTE: cache deleting context for group DG_OCR 3/0x50fc3258
GMON dismounting group 3 at 9 for pid 23, osid 128640
NOTE: Diskin mode 0x8 marked for de-assignment
ERROR: diskgroup DG_OCR was not mounted
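Since an ASM alert log can be long, a quick grep for failed mounts narrows things down. A minimal sketch, using sample lines taken from the excerpt above (the sample file path is hypothetical; in practice you would grep the real alert_+ASM2.log):

```shell
# Write a few lines from the alert log excerpt above into a sample file
cat > /tmp/asm_alert_sample.log <<'EOF'
NOTE: cache ending mount (fail) of group DG_OCR number=3 incarn=0x50fc3258
GMON dismounting group 3 at 9 for pid 23, osid 128640
ERROR: diskgroup DG_OCR was not mounted
EOF

# Extract the names of the diskgroups that failed to mount
grep -oE 'ERROR: diskgroup [A-Za-z0-9_]+ was not mounted' /tmp/asm_alert_sample.log \
  | awk '{print $3}'
```

Against the sample above this prints `DG_OCR`, the diskgroup whose mount failed.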
Next I connected to the ASM instances and checked: on the healthy instance the diskgroups were in MOUNTED state, while on the problem instance they were in MOUNTING state, that is, the mount was still in progress. The open question was why the mount could not complete.
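To compare diskgroup state across both nodes in one shot, you can query gv$asm_diskgroup from either ASM instance (a sketch; connecting as sysasm is assumed):

```
SQL> SELECT inst_id, name, state
  2  FROM   gv$asm_diskgroup
  3  ORDER  BY inst_id, name;
```

In this case the rows for instance 1 showed MOUNTED while the problem instance's rows were stuck at MOUNTING.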
To get to the bottom of it, I ran a global hang analysis. It has to be global (cluster-wide); only then can you see why the diskgroup cannot mount.
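A cluster-wide hang analysis like the one below can be produced with oradebug from a SQL*Plus session on any ASM instance (11.2 syntax; level 3 is a judgment call that includes short stacks in the dump):

```
SQL> oradebug setmypid
SQL> oradebug unlimit
SQL> oradebug -g all hanganalyze 3
```

The `-g all` flag runs the analysis on every instance; the output lands in each instance's diag trace directory.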
*** 2018-10-26 04:36:34.280
===============================================================================
HANG ANALYSIS:
  instances (db_name.oracle_sid): +asm.+asm2, +asm.+asm1
  oradebug_node_dump_level: 3
  analysis initiated by oradebug
  os thread scheduling delay history: (sampling every 1.000000 secs)
    0.000000 secs at [ 04:36:34 ]
    NOTE: scheduling delay has not been sampled for 0.184529 secs
    0.000000 secs from [ 04:36:30 - 04:36:35 ], 5 sec avg
    0.000000 secs from [ 04:35:34 - 04:36:35 ], 1 min avg
    0.000000 secs from [ 04:31:34 - 04:36:35 ], 5 min avg
===============================================================================
Chains most likely to have caused the hang:
 [a] Chain 1 Signature: 'GPnP Get Item'<='DFS lock handle'
     Chain 1 Signature Hash: 0xe08712d6
 [b] Chain 2 Signature: 'GPnP Get Item'<='rdbms ipc reply'<='enq: DD - contention'
     Chain 2 Signature Hash: 0xec6b634f
 [c] Chain 3 Signature: 'GPnP Get Item'<='rdbms ipc reply'<='enq: DD - contention'
     Chain 3 Signature Hash: 0xec6b634f
===============================================================================
Non-intersecting chains:

-------------------------------------------------------------------------------
Chain 1:
-------------------------------------------------------------------------------
    Oracle session identified by:
    {
                instance: 2 (+asm.+asm2)
                   os id: 88049
              process id: 23, oracle@KSPX-DB2 (TNS V1-V3)
              session id: 1427
        session serial #: 3
    }
    is waiting for 'DFS lock handle' with wait info:
    {
                      p1: 'type|mode'=0x43490005
                      p2: 'id1'=0x3c
                      p3: 'id2'=0x2
            time in wait: 49 min 17 sec
           timeout after: never
                 wait id: 107
                blocking: 0 sessions
             current sql: ALTER DISKGROUP ALL MOUNT /* asm agent call crs *//* {0:0:2} */
             short stack: ksedsts()+461<-ksdxfstk()+32<-ksdxcb()+1876<-sspuser()+112<-__sighandler()<-semtimedop()+10<-skgpwwait()+160<-ksliwat()+1865<-kslwaitctx()+163<-kjusuc()+3619<-ksigeti()+3781<-ksbcic_int()+21066<-ksbcic()+12<-kfgbSendPeelWait()+53<-kfgFinalizeMount()+4736<-kfgscFinalize()+1405<-kfgForEachKfgsc()+193<-kfgsoFinalize()+135<-kfgFinalize()+396<-kfxdrvMount()+4781<-kfxdrvEntry()+2218<-opiexe()+20368<-opiosq0()+2948<-kpooprx()+274<-kpoal8()+829<-opiodr()+916<-ttcpip()+2242<-opitsk()+1673<-opiino()+966<-opiodr()+916<-opidr
            wait history:
              * time between current wait and wait #1: 0.000149 sec
              1.       event: 'rdbms ipc reply'
                  time waited: 0.000338 sec
                      wait id: 106             p1: 'from_process'=0x12
                                               p2: 'timeout'=0x7fffffff
              * time between wait #1 and #2: 0.000054 sec
              2.       event: 'kfk: async disk IO'
                  time waited: 0.000182 sec
                      wait id: 105             p1: 'count'=0x1
                                               p2: 'intr'=0x0
                                               p3: 'timeout'=0xffffffff
              * time between wait #2 and #3: 0.000157 sec
              3.       event: 'rdbms ipc reply'
                  time waited: 0.000372 sec
                      wait id: 104             p1: 'from_process'=0x12
                                               p2: 'timeout'=0x7fffffff
    }
    and is blocked by
 => Oracle session identified by:
    {
                instance: 1 (+asm.+asm1)
                   os id: 15643
              process id: 18, oracle@KSPX-DB1 (RBAL)
              session id: 1117
        session serial #: 1
    }
    which is waiting for 'GPnP Get Item' with wait info:
    {
            time in wait: 221831 min 33 sec
           timeout after: never
                 wait id: 124844244
                blocking: 44 sessions
             current sql:
             short stack: ksedsts()+461<-ksdxfstk()+32<-ksdxcb()+1876<-sspuser()+112<-__sighandler()<-__poll()+47<-sgipcwWaitHelper()+5271<-sgipcwWait()+475<-gipcWaitOsd()+281<-gipcInternalWait()+14965<-gipcWaitF()+3234<-gipcInternalRecvSync()+8749<-gipcRecvSyncF()+3343<-clsgpnpm_gipcGets()+297<-clsgpnpm_receiveMsg()+408<-clsgpnpm_exchange()+1059<-clsgpnp_profileCallUrlInt()+3568<-clsgpnp_getProfileEx()+325<-clsgpnp_dbmsGetItem()+263<-kggpnpAttrGet()+1272<-kfdParseProfileString()+885<-kfdDiscoverShallow()+2023<-kfgbDriver()+1745<-ksbabs()+
            wait history:
              * time between current wait and wait #1: 0.000007 sec
              1.       event: 'GPnP Initialization'
                  time waited: 0.018700 sec
                      wait id: 124844243
              * time between wait #1 and #2: 0.000207 sec
              2.       event: 'CSS initialization'
                  time waited: 0.000027 sec
                      wait id: 124844242
              * time between wait #2 and #3: 0.000062 sec
              3.       event: 'rdbms ipc message'
                  time waited: 0.002787 sec
                      wait id: 124844241       p1: 'timeout'=0x12c
    }
The global hang analysis shows that after the ASM instance on node 2 started, it ran "ALTER DISKGROUP ALL MOUNT", and that command never completed: it had been blocked for more than 40 minutes. It was blocked by the RBAL process on node 1, which was itself stuck on a wait event called "GPnP Get Item" (note its time in wait: 221831 minutes, so RBAL had been stuck for a very long time). Now that we knew the cause, we could search MOS for related material. The note "Diskgroup Mount Hangs with RBAL Waiting on 'GPnP Get Item' and 'enq: DD - contention' (Doc ID 1375505.1)" shows that this is caused by unpublished bug 12398300. The root cause is that the gpnpd process gets stuck, which in turn leaves RBAL waiting. The fix is actually simple: kill the gpnpd process on node 1 and the lock is released. Killing gpnpd does not restart the clusterware stack; only that one daemon is restarted.
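Finding and killing the stuck daemon can be sketched as follows (the binary name gpnpd.bin is standard for Grid Infrastructure; the pid placeholder is left for you to fill in, and ohasd's agent respawns the daemon on its own):

```
# On node 1, find the GPnP daemon
ps -ef | grep gpnpd.bin | grep -v grep

# Kill it; only this daemon is restarted, not the whole stack
kill -9 <gpnpd_pid>

# Verify the daemon has been respawned
ps -ef | grep gpnpd.bin | grep -v grep
```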
After killing gpnpd, the fault cleared and the clusterware on node 2 started successfully.
One takeaway here: whenever the ora.asm resource fails to start in the future, a global hang analysis is a good way to diagnose it.