
Hardware Error: Memory Errors


On 192.168.219.90, running dmesg | grep -i error showed that this machine has a memory problem, as shown below:
[Hardware Error]: MC4 Error (node 1): L3 cache tag error.
[Hardware Error]: Error Status: Corrected error, no action required.
[Hardware Error]: MC4_ADDR: 0x00000018edfd9100
[Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: SNP

[Hardware Error]: MC4 Error (node 2): DRAM ECC error detected on the NB.
EDAC amd64 MC2: CE ERROR_ADDRESS= 0x8cf6cb900
[Hardware Error]: Error Status: Corrected error, no action required.
[Hardware Error]: MC4_ADDR: 0x00000008cf6cb900
[Hardware Error]: cache level: L3/GEN, mem/io: MEM, mem-tx: RD, part-proc: RES (no timeout)

[Hardware Error]: MC4 Error (node 2): DRAM ECC error detected on the NB.
EDAC amd64 MC2: CE ERROR_ADDRESS= 0x8cf6cb900
[Hardware Error]: Error Status: Corrected error, no action required.
[Hardware Error]: MC4_ADDR: 0x00000008cf6cb900
[Hardware Error]: cache level: L3/GEN, mem/io: MEM, mem-tx: RD, part-proc: RES (no timeout)
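
As a rough sketch of how output like the above could be condensed, the shell function below counts the "MCn Error (node N)" lines per node and error description. The name summarize_mce is made up for illustration, and the function assumes the log lines have the same shape as the ones shown above.

```shell
#!/bin/sh
# Summarize machine-check lines from kernel log text read on stdin.
# Hypothetical helper; assumes lines of the form
#   [Hardware Error]: MC4 Error (node 2): DRAM ECC error detected on the NB.
summarize_mce() {
    # Keep only the "MCn Error (node N): ..." lines.
    grep 'MC[0-9][0-9]* Error (node ' |
        # Reduce each line to "node N: description".
        sed 's/.*MC[0-9][0-9]* Error (node \([0-9][0-9]*\)): \(.*\)$/node \1: \2/' |
        # Count identical lines and strip the padding uniq adds.
        sort | uniq -c | sed 's/^ *//'
}

# Typical use on a live machine (commented out so that sourcing this file
# does not read the kernel log):
#   dmesg | summarize_mce
```

Each output line is a count followed by the node and description, which makes it easy to see whether the errors cluster on one node.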

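Further investigation with the EDAC error counters showed that the fifth memory module is faulty, so it needs to be reported to the private cloud team for repair.
grep "[0-9]" /sys/devices/system/edac/mc/mc*/csrow*/ch*_ce_count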
/sys/devices/system/edac/mc/mc0/csrow2/ch0_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow2/ch1_ce_count:0
/sys/devices/system/edac/mc/mc1/csrow2/ch0_ce_count:0
/sys/devices/system/edac/mc/mc1/csrow2/ch1_ce_count:0
/sys/devices/system/edac/mc/mc2/csrow2/ch0_ce_count:146
/sys/devices/system/edac/mc/mc2/csrow2/ch1_ce_count:0
/sys/devices/system/edac/mc/mc3/csrow2/ch0_ce_count:0
/sys/devices/system/edac/mc/mc3/csrow2/ch1_ce_count:0
/sys/devices/system/edac/mc/mc4/csrow2/ch0_ce_count:0
/sys/devices/system/edac/mc/mc4/csrow2/ch1_ce_count:0
/sys/devices/system/edac/mc/mc5/csrow2/ch0_ce_count:0
/sys/devices/system/edac/mc/mc5/csrow2/ch1_ce_count:0
/sys/devices/system/edac/mc/mc6/csrow2/ch0_ce_count:0
/sys/devices/system/edac/mc/mc6/csrow2/ch1_ce_count:0
/sys/devices/system/edac/mc/mc7/csrow2/ch0_ce_count:0
/sys/devices/system/edac/mc/mc7/csrow2/ch1_ce_count:0

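A line with a non-zero count indicates memory errors (ce_count counts corrected errors).
mc*: the memory controller index, i.e. which CPU (node) the memory belongs to.
csrow*: the chip-select row behind that controller.
ch*: the channel within that row; together with csrow it identifies the memory module.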
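
With many controllers, most counters are zero, so it can help to print only the non-zero ones. The following is a minimal sketch under the assumption that the machine exposes the mc*/csrow*/ch*_ce_count layout shown above (some kernels use a different layout). The function name nonzero_ce_counts and its optional root argument are illustrative, not part of EDAC.

```shell
#!/bin/sh
# Print every chN_ce_count file whose value is greater than zero, as "path:count".
# The EDAC root is an optional argument (defaulting to the sysfs path used above)
# so the function can also be pointed at a saved copy of the tree.
nonzero_ce_counts() {
    root=${1:-/sys/devices/system/edac/mc}
    for f in "$root"/mc*/csrow*/ch*_ce_count; do
        # Skip the literal pattern when the glob matched nothing, and unreadable files.
        [ -r "$f" ] || continue
        count=$(cat "$f")
        # Non-numeric content makes the test fail quietly, so such files are skipped.
        [ "$count" -gt 0 ] 2>/dev/null && printf '%s:%s\n' "$f" "$count"
    done
    return 0
}

# Typical use (commented out so that sourcing this file has no side effects):
#   nonzero_ce_counts
```

On the machine in this article this would be expected to leave only the mc2/csrow2/ch0_ce_count line.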

Then list the memory slots with dmidecode:

[root@customer log]# dmidecode -t memory | grep 'Locator: DIMM'
Locator: DIMM01
Locator: DIMM02
Locator: DIMM03
Locator: DIMM04
Locator: DIMM05
Locator: DIMM06
Locator: DIMM07
Locator: DIMM08
Locator: DIMM09
Locator: DIMM10
Locator: DIMM11
Locator: DIMM12
Locator: DIMM13
Locator: DIMM14
Locator: DIMM15
Locator: DIMM16
Locator: DIMM17
Locator: DIMM18
Locator: DIMM19
Locator: DIMM20
Locator: DIMM21
Locator: DIMM22
Locator: DIMM23
Locator: DIMM24
Locator: DIMM25
Locator: DIMM26
Locator: DIMM27
Locator: DIMM28
Locator: DIMM29
Locator: DIMM30
Locator: DIMM31
Locator: DIMM32
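
The Locator list alone does not show which slots are populated. As a small sketch, the function below pairs each Locator with the Size field that dmidecode prints just before it in the same Memory Device record. The name dimm_sizes is made up, and the function assumes the usual "Size:" then "Locator:" field order in dmidecode -t memory output.

```shell
#!/bin/sh
# Read `dmidecode -t memory` text on stdin and print "<Locator>: <Size>" per slot.
# Hypothetical helper; "Bank Locator:" lines are ignored because the patterns
# are anchored to the start of the (indented) line.
dimm_sizes() {
    awk '
        # Remember the size of the current Memory Device record.
        /^[ \t]*Size:/ {
            sub(/^[ \t]*Size:[ \t]*/, "")
            size = $0
        }
        # Locator follows Size in the record, so print the pair here.
        /^[ \t]*Locator:/ {
            sub(/^[ \t]*Locator:[ \t]*/, "")
            print $0 ": " size
            size = ""
        }
    '
}

# Typical use, as root (commented out so that sourcing this file runs nothing):
#   dmidecode -t memory | dimm_sizes
```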
Check the memory in the server's management console:
[image]

Layout of the memory slots on the motherboard:
[image]

Combining this with the error log: kernel: EDAC MC1: 16107 CE error on CPU#1Channel#2_DIMM#1 (channel:2slot:1
the problem should be in memory slot DIMM_F1.
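
To make the CPU, channel and DIMM numbers in such a log line easier to compare with the slot layout, they can be pulled out with sed. This is only a sketch that assumes the exact label format quoted above (CPU#…Channel#…_DIMM#…). The name parse_edac_label is hypothetical, and translating the numbers into a silkscreen label such as DIMM_F1 still depends on the motherboard documentation.

```shell
#!/bin/sh
# Extract the CPU, channel and DIMM indices from EDAC log lines read on stdin.
# Lines that do not carry a CPU#..Channel#.._DIMM#.. label produce no output.
parse_edac_label() {
    sed -n 's/.*CPU#\([0-9][0-9]*\)Channel#\([0-9][0-9]*\)_DIMM#\([0-9][0-9]*\).*/cpu=\1 channel=\2 dimm=\3/p'
}

# Typical use (commented out so that sourcing this file reads no logs):
#   dmesg | parse_edac_label | sort | uniq -c
```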

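Solution:
Finally, pull the memory module out of the faulty F1 slot, or move it to another memory slot. After that, the system boots without reporting the error again.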
