
x86 Server MCE (Machine Check Exception) Problems


MCE symptoms

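Intel implements the machine-check architecture in the Pentium 4, Xeon, and P6 family processors. It provides a mechanism for detecting and reporting hardware (machine) errors, such as system bus errors, ECC errors, parity errors, cache errors, and TLB errors. It consists of a set of MSRs (Model-Specific Registers) used to set up machine checking, plus additional banks of MSRs used to record errors.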

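When the processor detects an uncorrected machine-check error, it raises a machine-check exception (MCE). The machine-check architecture does not generally allow the processor to be restarted reliably after an MCE, but the MCE handler can collect information about the error from the MSRs. A typical kernel log looks like this: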

CPU 7: Machine Check Exception: 5 Bank 0: b200004010000400

RIP !INEXACT! 10:<ffffffff8010f16e> {mwait_idle+0x5e/0x90}

TSC 1952dbeebcc8

Kernel panic: Machine check

Reconfiguring memory bank information….

This may take a while….

done waiting: 3 cpus not responding

Warning: Non-empty request queue

I/O requests in flight at dump time

CPU 7: Machine Check Exception: 4 Bank 0: f200004040000400

RIP !INEXACT! 10:<ffffffff8011ef69>


How to identify an MCE error

Any kernel crash that prints "Machine Check Exception", or whose kernel stack trace contains the do_machine_check() function, is an MCE problem.
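The rule above can be sketched as a small log check. This is only an illustration: the function name looks_like_mce and the sample text are made up for this example, and a real crash log may need more context than a plain substring match.

```python
# The two markers named in the rule above. Matching is a plain,
# case-sensitive substring search.
MCE_MARKERS = ("Machine Check Exception", "do_machine_check")


def looks_like_mce(log_text: str) -> bool:
    """Return True if the kernel log text contains either MCE marker."""
    return any(marker in log_text for marker in MCE_MARKERS)


if __name__ == "__main__":
    sample = (
        "CPU 7: Machine Check Exception: 5 Bank 0: b200004010000400\n"
        "Kernel panic: Machine check\n"
    )
    print(looks_like_mce(sample))
```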


Sources of MCE errors

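  • PCI-E device signal quality / clock
  • Damaged CPU chip / design bug

    Damaged CPU cache or other faults

  • Possible CPU defects

    For example, defects introduced during CPU manufacturing

  • Bad memory / poor contact
  • Improper BIOS configuration
  • Bug in the OS / MCE interrupt handler
  • Environmental factors, such as temperature / humidity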



Decoding MCE error codes


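Taking the MCE above as an example, the value printed after "Machine Check Exception" and the value printed after "Bank 0" (or "Bank 5") correspond to the IA32_MCG_STATUS MSR and the IA32_MCi_STATUS register, respectively.

The corresponding register values are:

IA32_MCG_STATUS MSR value: 0000000000000004

IA32_MC0_STATUS MSR value: f200000410000800

IA32_MC5_STATUS MSR value: f200001044100e0f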
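As a rough sketch of that mapping, the panic line can be split into its fields with a regular expression. The names MceRecord, parse_mce_line, and decode_mcg_status are hypothetical helpers for this example. The three IA32_MCG_STATUS flags (RIPV in bit 0, EIPV in bit 1, MCIP in bit 2) follow the Intel manual and should be checked against the edition for your CPU.

```python
import re
from typing import NamedTuple, Optional

# Matches lines such as:
#   CPU 7: Machine Check Exception: 5 Bank 0: b200004010000400
MCE_LINE = re.compile(
    r"CPU\s+(?P<cpu>\d+):\s+Machine Check Exception:\s+(?P<mcg>[0-9a-fA-F]+)"
    r"\s+Bank\s+(?P<bank>\d+):\s+(?P<status>[0-9a-fA-F]+)"
)


class MceRecord(NamedTuple):
    cpu: int
    mcg_status: int  # value of IA32_MCG_STATUS
    bank: int        # the i in IA32_MCi_STATUS
    mci_status: int  # value of IA32_MCi_STATUS


def parse_mce_line(line: str) -> Optional[MceRecord]:
    """Extract the CPU, bank, and both status values (printed in hex)."""
    m = MCE_LINE.search(line)
    if m is None:
        return None
    return MceRecord(
        cpu=int(m.group("cpu")),
        mcg_status=int(m.group("mcg"), 16),
        bank=int(m.group("bank")),
        mci_status=int(m.group("status"), 16),
    )


def decode_mcg_status(value: int) -> dict:
    """Decode the three low flag bits of IA32_MCG_STATUS."""
    return {
        "RIPV": bool(value & 0x1),  # restart IP valid
        "EIPV": bool(value & 0x2),  # error IP valid
        "MCIP": bool(value & 0x4),  # machine check in progress
    }


if __name__ == "__main__":
    rec = parse_mce_line("CPU 7: Machine Check Exception: 5 Bank 0: b200004010000400")
    print(rec)
    print(decode_mcg_status(rec.mcg_status))
```

With this decoding, an IA32_MCG_STATUS value of 4 has only MCIP set.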


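Given the MSR values, checking them against the Intel programming manual and other Intel documentation makes it fairly easy to find the cause of the MCE.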
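A minimal sketch of the first step of that lookup: splitting an IA32_MCi_STATUS value into its architectural fields. The bit positions used here (VAL 63, OVER 62, UC 61, EN 60, MISCV 59, ADDRV 58, PCC 57, model-specific code in bits 31:16, MCA error code in bits 15:0) follow the Intel manual's layout. The function name decode_mci_status is made up for this example, and the remaining bits are model dependent, so they are left undecoded.

```python
def decode_mci_status(status: int) -> dict:
    """Split a 64-bit IA32_MCi_STATUS value into its architectural fields."""
    return {
        "VAL": bool((status >> 63) & 1),    # register contains a valid error
        "OVER": bool((status >> 62) & 1),   # an earlier error was overwritten
        "UC": bool((status >> 61) & 1),     # error was not corrected
        "EN": bool((status >> 60) & 1),     # reporting of this error was enabled
        "MISCV": bool((status >> 59) & 1),  # IA32_MCi_MISC is valid
        "ADDRV": bool((status >> 58) & 1),  # IA32_MCi_ADDR is valid
        "PCC": bool((status >> 57) & 1),    # processor context corrupted
        "MSCOD": (status >> 16) & 0xFFFF,   # model-specific error code
        "MCACOD": status & 0xFFFF,          # architectural MCA error code
    }


if __name__ == "__main__":
    for value in (0xF200000410000800, 0x8C00004000010093):
        fields = decode_mci_status(value)
        print(f"{value:016x}", fields)
```

For the Bank 5 value 8c00004000010093 from the dmesg example, this gives UC clear, MSCOD 0x0001, and MCACOD 0x0093, which lines up with the "Err=0001:0093" printed by EDAC.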

dmesg output

...

sbridge: HANDLING MCE MEMORY ERROR
CPU 0: Machine Check Exception: 0 Bank 5: 8c00004000010093
TSC 0 ADDR 67081b300 MISC 2140040486 PROCESSOR 0:206d7 TIME 1441181676 SOCKET 0 APIC 0
EDAC MC0: CE row 2, channel 0, label "CPU_SrcID#0_Channel#3_DIMM#0": 1 Unknown error(s): memory read on FATAL area : cpu=0 Err=0001:0093 (ch=3), addr= 0x67081b300 => socket=0, Channel=3(mask=8), rank=0

...

Save these four log lines to a file named mlog (here /tmp/mlog) and feed it to mcelog:

# mcelog --ascii < /tmp/mlog
WARNING: with --dmi mcelog --ascii must run on the same machine with the
	 same BIOS/memory configuration as where the machine check occurred.
sbridge: HANDLING MCE MEMORY ERROR
CPU 0: Machine Check Exception: 0 Bank 5: 8c00004000010093
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
Wed Sep  2 16:14:36 2015
CPU 0 BANK 5 MISC 2140040486 ADDR 67081b300
STATUS 8c00004000010093 MCGSTATUS 0
CPUID Vendor Intel Family 6 Model 45
WARNING: SMBIOS data is often unreliable. Take with a grain of salt!
<24> DIMM 1333 Mhz Res13 Width 72 Data Width 64 Size 16 GB
Device Locator: Node0_Channel2_Dimm0
Bank Locator: Node0_Bank0
Manufacturer: Hynix Semiconducto
Serial Number: 40743B5A
Asset Tag: Dimm2_AssetTag
Part Number: HMT42GR7BFR4A-PB
TSC 0 ADDR 67081b300 MISC 2140040486 PROCESSOR 0:206d7 TIME 1441181676 SOCKET 0 APIC 0
EDAC MC0: CE row 2, channel 0, label "CPU_SrcID#0_Channel#3_DIMM#0": 1 Unknown error(s): memory read on FATAL area : cpu=0 Err=0001:0093 (ch=3), addr = 0x67081b300 => socket=0, Channel=3(mask=8), rank=0
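The low 16 bits of the STATUS value above (0x0093) can also be decoded by hand. The sketch below assumes the memory controller compound error code layout described in the Intel manual, 000F 0000 1MMM CCCC, where MMM is the transaction type and CCCC the channel. Treat the table as an assumption to verify against the manual for your CPU model. The names MEMORY_OPS and decode_memory_controller_code are hypothetical.

```python
from typing import Optional, Tuple

# MMM field values, assuming the Intel manual's memory controller encoding.
MEMORY_OPS = {
    0b000: "generic undefined request",
    0b001: "memory read error",
    0b010: "memory write error",
    0b011: "address/command error",
    0b100: "memory scrubbing error",
}


def decode_memory_controller_code(mcacod: int) -> Optional[Tuple[str, Optional[int]]]:
    """Decode a 16-bit MCA error code of the form 000F 0000 1MMM CCCC.

    Returns (operation, channel), with channel None when it is not
    specified (CCCC == 1111), or None if the code is not a memory
    controller error. Bit 12 (F) is ignored by the pattern check.
    """
    if (mcacod & 0xEF80) != 0x0080:
        return None
    mmm = (mcacod >> 4) & 0x7
    cccc = mcacod & 0xF
    channel = None if cccc == 0xF else cccc
    return MEMORY_OPS.get(mmm, "reserved"), channel


if __name__ == "__main__":
    print(decode_memory_controller_code(0x0093))
```

Under that assumption, 0x0093 decodes to a memory read error on channel 3, which agrees with the "memory read" and "ch=3" fields in the EDAC line.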

Using these two fields from the mcelog output:
Part Number: HMT42GR7BFR4A-PB
Serial Number: 40743B5A

look up the matching hardware in the lshw output:

...

	 *-memory:0
	      description: System Memory
	      physical id: 2d
	      slot: System board or motherboard
	    *-bank:0
	         description: DIMM 1333 MHz (0.8 ns)
	         product: HMT42GR7BFR4A-PB
	         vendor: Hynix Semiconducto
	         physical id: 0
	         serial: 905D21AE
	         slot: Node0_Channel1_Dimm0
	         size: 16GiB
	         width: 64 bits
	         clock: 1333MHz (0.8ns)
	    *-bank:1
	         description: DIMM Synchronous [empty]
	         product: A1_Dimm1_PartNumber
	         vendor: Dimm1_Manufacturer
	         physical id: 1
	         serial: Dimm1_SerNum
	         slot: Node0_Channel1_Dimm1
	         width: 64 bits
	    *-bank:2
	         description: DIMM 1333 MHz (0.8 ns)
	         product: HMT42GR7BFR4A-PB
	         vendor: Hynix Semiconducto
	         physical id: 2
	         serial: 40743B5A
	         slot: Node0_Channel2_Dimm0
	         size: 16GiB
	         width: 64 bits
	         clock: 1333MHz (0.8ns)

		...
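The lookup in the lshw listing can be sketched as a small text parser that finds the bank with a given serial number and reports its slot. This assumes the plain-text lshw layout shown above ("*-bank:N" headers followed by "key: value" lines). The names parse_lshw_banks and find_slot_by_serial are made up for this example, and SAMPLE is an abbreviated copy of the listing.

```python
from typing import Dict, List, Optional

# Abbreviated copy of the lshw listing above, for illustration only.
SAMPLE = """\
 *-memory:0
      description: System Memory
      slot: System board or motherboard
    *-bank:0
         product: HMT42GR7BFR4A-PB
         serial: 905D21AE
         slot: Node0_Channel1_Dimm0
    *-bank:2
         product: HMT42GR7BFR4A-PB
         serial: 40743B5A
         slot: Node0_Channel2_Dimm0
"""


def parse_lshw_banks(text: str) -> List[Dict[str, str]]:
    """Collect the key/value lines under each '*-bank' node."""
    banks: List[Dict[str, str]] = []
    current: Optional[Dict[str, str]] = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("*-"):
            # A new node starts; only bank nodes are of interest here.
            current = {"node": line[2:]} if line.startswith("*-bank") else None
            if current is not None:
                banks.append(current)
        elif current is not None and ": " in line:
            key, value = line.split(": ", 1)
            current[key] = value
    return banks


def find_slot_by_serial(text: str, serial: str) -> Optional[str]:
    """Return the slot of the DIMM with the given serial, if present."""
    for bank in parse_lshw_banks(text):
        if bank.get("serial") == serial:
            return bank.get("slot")
    return None


if __name__ == "__main__":
    print(find_slot_by_serial(SAMPLE, "40743B5A"))
```

For the serial number reported by mcelog, this returns Node0_Channel2_Dimm0, the same slot that mcelog printed as its Device Locator.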

