1. 程式人生 > >記一次ORA-07445[opiaba()+639],ORA-00600[17147]資料庫異常中止故障案例

記一次ORA-07445[opiaba()+639],ORA-00600[17147]資料庫異常中止故障案例

故障描述:

資料庫版本:11.2.0.3.0(單例項) 作業系統版本:redhat linux 6.5

資料庫在凌晨3點自動關閉,同事發現故障後,立馬手動startup開啟了資料庫,開啟後資料庫執行正常。

故障時資料庫alert日誌如下:

Wed Oct 10 02:51:47 2018
Thread 1 advanced to log sequence 75978 (LGWR switch)
  Current log# 5 seq# 75978 mem# 0: /u01/app/oracle/oradata/XXX/redo05.log
Wed Oct 10 02:51:52 2018
Archived Log entry 149289 added for thread 1 sequence 75977 ID 0xbca5e435 dest 1:
Wed Oct 10 03:00:05 2018
ALTER SYSTEM ARCHIVE LOG
Wed Oct 10 03:00:05 2018
Thread 1 advanced to log sequence 75979 (LGWR switch)
  Current log# 2 seq# 75979 mem# 0: /u01/app/oracle/oradata/XXX/redo02.log
Archived Log entry 149291 added for thread 1 sequence 75978 ID 0xbca5e435 dest 1:
Wed Oct 10 03:06:34 2018
Exception [type: SIGSEGV, SI_KERNEL(general_protection)] [ADDR:0x0] [PC:0x17E86CF, opiaba()+639] [flags: 0x0, count: 1]
Errors in file /u01/app/oracle/diag/rdbms/XXX1/XXX/trace/XXX_ora_48752.trc  (incident=113956):
ORA-07445: 出現異常錯誤: 核心轉儲 [opiaba()+639] [SIGSEGV] [ADDR:0x0] [PC:0x17E86CF] [SI_KERNEL(general_protection)] []
Incident details in: /u01/app/oracle/diag/rdbms/XXX1/XXX/incident/incdir_113956/XXX_ora_48752_i113956.trc
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Wed Oct 10 03:06:48 2018
Dumping diagnostic data in directory=[cdmp_20181010030648], requested by (instance=1, osid=48752), summary=[incident=113956].
Wed Oct 10 03:12:03 2018
Thread 1 advanced to log sequence 75980 (LGWR switch)
  Current log# 6 seq# 75980 mem# 0: /u01/app/oracle/oradata/XXX/redo06.log
Wed Oct 10 03:12:15 2018
Archived Log entry 149293 added for thread 1 sequence 75979 ID 0xbca5e435 dest 1:
Wed Oct 10 03:17:09 2018
Process 0x0x26c1c99fc8 appears to be hung while dumping
Current time = 1550270160, process death time = 1550208043 interval = 60000
Attempting to kill process 0x0x26c1c99fc8 with OS pid = 48752
OSD kill succeeded for process 0x26c1c99fc8
Errors in file /u01/app/oracle/diag/rdbms/XXX1/XXX/trace/XXX_pmon_4152.trc  (incident=112148):
ORA-00600: internal error code, arguments: [17147], [0x2581886088], [], [], [], [], [], [], [], [], [], []
Incident details in: /u01/app/oracle/diag/rdbms/XXX1/XXX/incident/incdir_112148/XXX_pmon_4152_i112148.trc
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Errors in file /u01/app/oracle/diag/rdbms/XXX1/XXX/trace/XXX_pmon_4152.trc:
ORA-00600: internal error code, arguments: [17147], [0x2581886088], [], [], [], [], [], [], [], [], [], []
PMON (ospid: 4152): terminating the instance due to error 472
Wed Oct 10 03:17:12 2018
System state dump requested by (instance=1, osid=4152 (PMON)), summary=[abnormal instance termination].
System State dumped to trace file /u01/app/oracle/diag/rdbms/XXX1/XXX/trace/XXX_diag_4164.trc
Wed Oct 10 03:17:12 2018
opiodr aborting process unknown ospid (50967) as a result of ORA-1092
Wed Oct 10 03:17:12 2018
opiodr aborting process unknown ospid (50979) as a result of ORA-1092
Wed Oct 10 03:17:12 2018
opiodr aborting process unknown ospid (53178) as a result of ORA-1092
Wed Oct 10 03:17:12 2018
ORA-1092 : opitsk aborting process
Wed Oct 10 03:17:12 2018
ORA-1092 : opitsk aborting process
Wed Oct 10 03:17:13 2018
opiodr aborting process unknown ospid (53176) as a result of ORA-1092
Wed Oct 10 03:17:13 2018
ORA-1092 : opitsk aborting process
Wed Oct 10 03:17:13 2018
opiodr aborting process unknown ospid (53174) as a result of ORA-1092
Wed Oct 10 03:17:13 2018
ORA-1092 : opitsk aborting process
Wed Oct 10 03:17:14 2018
ORA-1092 : opitsk aborting process
Instance terminated by PMON, pid = 4152
Wed Oct 10 03:57:17 2018
Adjusting the default value of parameter parallel_max_servers
from 1600 to 985 due to the value of parameter processes (1000)
Starting ORACLE instance (normal)
****************** Large Pages Information *****************

通過alert日誌我們可以看出,

資料庫在3:00:05的時候還是正常執行的。

在03:06:34報了第一個錯誤ORA-07445[opiaba()+639],

在03:17:09報了第二個錯誤ORA-00600[17147],緊接著pmon程序中止了例項(PMON (ospid: 4152): terminating the instance due to error 472),

在03:57:17,同事發現數據庫故障後,手動開啟了資料庫Starting ORACLE instance (normal)。開啟後資料庫執行正常,alert日誌也未出現報錯。

故障分析:

1,資料庫是因為pmon程序中止的,先檢視pmon程序的trace檔案。(/u01/app/oracle/diag/rdbms/XXX1/XXX/trace/XXX_pmon_4152.trc)

…………
Process 0x0x26c1c99fc8 appears to be hung while dumping
Current time = 1550270160, process death time = 1550208043 interval = 60000
Attempting to kill process 0x0x26c1c99fc8 with OS pid = 48752
OSD kill succeeded for process 0x26c1c99fc8
KGX cleanup...
KGX Atomic Operation Log 0x1ef541bc0
 Mutex 0x25a26eace0(1095, 0) idn 96fd9677 oper EXCL
 Cursor Parent uid 1095 efd 11 whr 17 slp 0
 oper=OPERATION_DEFAULT pt1=0x25afcf4400 pt2=0x25a26eac68 pt3=(nil)
 pt4=(nil) u41=0 stt=0
Incident 112148 created, dump file: /u01/app/oracle/diag/rdbms/XXX1/XXX/incident/incdir_112148/XXX_pmon_4152_i112148.trc
ORA-00600: internal error code, arguments: [17147], [0x2581886088], [], [], [], [], [], [], [], [], [], []

PMON: fatal error while deleting s.o. 0x1782b9760 in this tree:
----------------------------------------
…………
…………
    SO: 0x1782b9760, type: 78, owner: 0x26e1f01fb8, flag: INIT/-/-/0x00 if: 0x3 c: 0x3
     proc=0x26c1c99fc8, name=LIBRARY OBJECT LOCK, file=kgl.h LINE:8547, pg=0

    LibraryObjectLock:  Address=0x1782b9760 Handle=0x280f602980 Mode=N CanBeBrokenCount=1 Incarnation=1 ExecutionCount=0       
      ClusterLock=0x178588100 Context=0x7fb6cc0cb790 User=0x26e1f01fb8 Session=0x26e1f01fb8 ReferenceCount=1       
      Flags=BRO/CBK/[0020] SavepointNum=0 
    LibraryHandle:  Address=0x280f602980 Hash=0 LockMode=N PinMode=X LoadLockMode=0 Status=INVL 
      Name:  Namespace=SQL AREA(00) Type=CURSOR(00) 
      Statistics:  InvalidationCount=1 ExecutionCount=0 LoadCount=1 ActiveLocks=1 TotalLockCount=1 TotalPinCount=2 
      Counters:  BrokenCount=2 RevocablePointer=2 KeepDependency=0 BucketInUse=0 HandleInUse=0 HandleReferenceCount=0 
      Concurrency:  DependencyMutex=0x280f602a30(0, 0, 0, 0) Mutex=0x1b0564720(0, 24, 0, 0) 
      Flags=RON/PIN/PN0/EXP/CHD/[10010111] 
      WaitersLists:  
        Lock=0x280f602a10[0x280f602a10,0x280f602a10] 
        Pin=0x280f6029f0[0x280f6029f0,0x280f6029f0] 
        LoadLock=0x280f602a68[0x280f602a68,0x280f602a68] 
      LibraryObject:  Address=0x25bba25230 HeapMask=0000-0001-0001-0000 
        DataBlocks:  
          Block:  #='0' name=KGLH0^96fd9677 pins=0 Change=NONE   
            Heap=0x280a78ad78 Pointer=0x25bba252d0 Extent=0x25bba251b0 Flags=I/-/P/A/-/- 
            FreedLocation=0 Alloc=1.875000 Size=4.000000 LoadTime=15502013660 
      NamespaceDump:  
        Child Cursor:  Heap0=0x25bba252d0 Heap6=0x280f602980 Heap0 Load Time=10-10-2018 03:05:41 Heap6 Load Time=10-10-2018 03:05:41     ----------------------------------------
    KGX Atomic Operation Log 0x178588100
     Mutex 0x25a26eb1c0(1095, 0) idn 96fd9677 oper LONG_EXCL
     Cursor Pin uid 1095 efd 0 whr 1 slp 0
     opr=3 pso=0x1782b9760 flg=0
     pcs=0x25a26eb128 nxt=(nil) flg=31 cld=0 hd=0x280f602980 par=0x25a26eac20
     ct=0 hsh=0 unp=(nil) unn=0 hvl=a26ebac8 nhv=0 ses=(nil)
     hep=0x25a26eb1c0 flg=80 ld=1 ob=0x25bba25230 ptr=0x2446d95de0 fex=0x2446d95188
…………
…………

這裡可以看到,pmon程序kill process(OS pid = 48752),在做KGX cleanup...的時候發生致命的錯誤:PMON: fatal error while deleting s.o. 0x1782b9760 in this tree。也就是ORA-00600: internal error code, arguments: [17147], [0x2581886088], [], [], [], [], [], [], [], [], [], []該錯誤。

由於pmon是為了回收OS pid = 48752程序資源時,出現的錯誤。檢視OS pid = 48752程序出現dead的原因,檢視報錯的trace檔案:/u01/app/oracle/diag/rdbms/XXX1/XXX/incident/incdir_113956/XXX_ora_48752_i113956.trc

*** 2018-10-10 03:06:34.474
*** SESSION ID:(1095.5747) 2018-10-10 03:06:34.474
*** CLIENT ID:() 2018-10-10 03:06:34.474
*** SERVICE NAME:(XXX) 2018-10-10 03:06:34.474
*** MODULE NAME:(XXXX.Service.Console.exe) 2018-10-10 03:06:34.474
*** ACTION NAME:() 2018-10-10 03:06:34.474

Dump continued from file: /u01/app/oracle/diag/rdbms/XXX1/XXX/trace/XXX_ora_48752.trc
ORA-07445: 出現異常錯誤: 核心轉儲 [opiaba()+639] [SIGSEGV] [ADDR:0x0] [PC:0x17E86CF] [SI_KERNEL(general_protection)] []

========= Dump for incident 113956 (ORA 7445 [opiaba()+639]) ========
----- Beginning of Customized Incident Dump(s) -----
Exception [type: SIGSEGV, SI_KERNEL(general_protection)] [ADDR:0x0] [PC:0x17E86CF, opiaba()+639] [flags: 0x0, count: 1]
Registers:
%rax: 0x00000025a26eb540 %rbx: 0x0000002449f5fd00 %rcx: 0x0000000000005000
%rdx: 0x0000000000000000 %rdi: 0x0000000000000400 %rsi: 0x0000000000000400
%rsp: 0x00007fff86989b50 %rbp: 0x00007fff86989bc0  %r8: 0x0000002581884168
 %r9: 0x0000000000000059 %r10: 0x0000000000000b33 %r11: 0x00000025a26eb4c0
%r12: 0x00000025afe69330 %r13: 0x0000000000000000 %r14: 0x0200000025a26eb5
%r15: 0x0000000000000005 %rip: 0x00000000017e86cf %efl: 0x0000000000010202

*** 2018-10-10 03:06:34.664
  opiaba()+625 (0x17e86c1) jmp 0x17e86fe
  opiaba()+627 (0x17e86c3) movzbl 0x2c(%rbx),%r15d
  opiaba()+632 (0x17e86c8) mov %r14,-0x60(%rbp)
  opiaba()+636 (0x17e86cc) mov %rax,%r14
> opiaba()+639 (0x17e86cf) movswq 0xa(%r14),%rcx
  opiaba()+644 (0x17e86d4) cmp %ecx,%r15d
  opiaba()+647 (0x17e86d7) jne 0x17e86f2
  opiaba()+649 (0x17e86d9) mov 0x18(%rbx),%rdi
  opiaba()+653 (0x17e86dd) lea 0xc(%r14),%rsi

*** 2018-10-10 03:06:34.664
dbkedDefDump(): Starting a non-incident diagnostic dump (flags=0x3, level=3, mask=0x0)
----- Current SQL Statement for this session (sql_id=fpthk5ubgv5mr) -----
BEGIN insert into TB_1 (col1,…………,col100) values (:p1,…… ……,:p100);
insert into TB_1 (col1,…… ……,col100) values (:p101,…… ……,:p200);
…… ……
…… ……
insert into TB_1 (col1,…… ……,col100) values (:p29901,…… ……,:p30000);
insert into TB_2 (col1,…… ……,col100) values (:p30001,…… ……,:p30100);
…… ……
…… ……
insert into TB_2 (col1,…… ……,col100) values (:p65901,…… ……,:p66000);END;
----- Call Stack Trace -----
calling              call     entry                argument values in hex    
…… ……
…… ……

這時在trace檔案中,我們可以看到故障程序的SESSION ID,MODULE NAME,sql_id=fpthk5ubgv5mr等資訊。通過sql_id這裡可以看出非常奇怪的現象,(案例中所有的資訊都脫敏處理,但關鍵資訊保留)從begin 到end的執行體中,執行的是帶有繫結變數的insert寫法。但是繫結變數竟然高達6萬多個。這裡不禁懷疑是否是因為觸發了資料庫bug造成問題,(正常情況下sql語句不會有6萬多個繫結變數)

通過查詢MOS,發現和如下bug相匹配:

這裡提供的解決辦法是打patch 12578873補丁包,但是補丁包僅僅是解決超過65535個繫結變數不會造成例項的中止,資料庫仍然不支援大於65535個繫結變數。

最終解決

找出出現問題時的sql語句,和程式名。反饋給研發同事。研發同事改寫應用程式,不適用超過65535個繫結變數,問題得到解決。