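We have a few DELL machines (running RHEL 7.6). We replaced the DIMMs on one of them because of errors we saw in the kernel messages.
Some time later we checked the kernel messages again and found the following. We can still see errors about RAM (this also relates to the Red Hat solution at https://access.redhat.com/solutions/6961932):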
[Mon May 8 21:08:01 2023] EDAC sbridge MC0: PROCESSOR 0:406f1 TIME 1683580080 SOCKET 0 APIC 0
[Mon May 8 21:08:01 2023] EDAC MC0: 0 CE memory read error on CPU_SrcID#0_Ha#0_Chan#1_DIMM#1 (channel:1 slot:1 page:0x6f3c77 offset:0xc80 grain:32 syndrome:0x0 - area:DRAM err_code:0000:009f socket:0 ha:0 channel_mask:2 rank:4)
[Mon May 8 21:08:21 2023] mce: [Hardware Error]: Machine check events logged
[Tue May 9 05:30:29 2023] {13}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
[Tue May 9 05:30:29 2023] {13}[Hardware Error]: It has been corrected by h/w and requires no further action
[Tue May 9 05:30:29 2023] {13}[Hardware Error]: event severity: corrected
[Tue May 9 05:30:29 2023] {13}[Hardware Error]: Error 0, type: corrected
[Tue May 9 05:30:29 2023] {13}[Hardware Error]: fru_text: B6
[Tue May 9 05:30:29 2023] {13}[Hardware Error]: section_type: memory error
[Tue May 9 05:30:29 2023] {13}[Hardware Error]: error_status: 0x0000000000000400
[Tue May 9 05:30:29 2023] {13}[Hardware Error]: physical_address: 0x000000446e0d5f00
[Tue May 9 05:30:29 2023] {13}[Hardware Error]: node: 1 card: 1 module: 1 rank: 0 bank: 3 row: 64982 column: 888
[Tue May 9 05:30:29 2023] {13}[Hardware Error]: error_type: 2, single-bit ECC
[Tue May 9 05:30:29 2023] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[Tue May 9 05:30:29 2023] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 1: 940000000000009f
[Tue May 9 05:30:29 2023] EDAC sbridge MC0: TSC 30d2ef7e9bfda
[Tue May 9 05:30:29 2023] EDAC sbridge MC0: ADDR 446e0d5f00
[Tue May 9 05:30:29 2023] EDAC sbridge MC0: MISC 0
[Tue May 9 05:30:29 2023] EDAC sbridge MC0: PROCESSOR 0:406f1 TIME 1683610228 SOCKET 0 APIC 0
[Tue May 9 05:30:29 2023] EDAC MC1: 0 CE memory read error on CPU_SrcID#1_Ha#0_Chan#1_DIMM#1 (channel:1 slot:1 page:0x446e0d5 offset:0xf00 grain:32 syndrome:0x0 - area:DRAM err_code:0000:009f socket:1 ha:0 channel_mask:2 rank:4)
[Tue May 9 05:30:51 2023] mce: [Hardware Error]: Machine check events logged
[Tue May 9 17:52:21 2023] perf: interrupt took too long (380026 > 7861), lowering kernel.perf_event_max_sample_rate to 1000
[Wed May 10 06:27:17 2023] warning: `lshw' uses legacy ethtool link settings API, link modes are only partially reported
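To make sure these messages were not a one-off, we rebooted the machine to see whether the memory errors would come back.
They did: the RAM error messages are still there.
So we are confused by what the kernel messages are telling us.
How can we still get errors about RAM even though we replaced the DIMMs?
I should add some information about what we see in iDRAC, since iDRAC gives complete information about the DIMMs and RAM.
So the question is about dmesg:
How can the kernel messages complain about RAM even though all the DIMMs were replaced?
Is it possible that something other than the DIMMs is faulty, for example the motherboard of the DELL machine?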
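The errors you are seeing are single-bit ECC correctable memory errors that were corrected by hardware. They will not cause a component to be listed as failed in iDRAC, at least not until their count exceeds some internally defined threshold, but you should see this memory error in the iDRAC SEL (System Event Log).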
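To see which DIMM labels the corrected errors land on, and to confirm that an MCE status word really describes a corrected error, something like the following Python sketch may help. It is only an illustration: the regular expression assumes the `EDAC MCn: ... CE ... on <label> (` line format shown in the log above, which may differ on other kernel versions, and the function names `count_ce_by_dimm` and `decode_mci_status` are made up for this example. The bit positions follow the Intel machine-check status register layout (bit 63 valid, bit 61 uncorrected, bit 58 address valid, low 16 bits MCA error code).

```python
import re
from collections import Counter

# Assumed line format, taken from the log in the question, e.g.
# "EDAC MC1: 0 CE memory read error on CPU_SrcID#1_Ha#0_Chan#1_DIMM#1 (channel:1 ..."
CE_LINE = re.compile(r"EDAC (MC\d+): \d+ CE .*? on (\S+) \(")


def count_ce_by_dimm(lines):
    """Count corrected-error (CE) log lines per (memory controller, DIMM label)."""
    counts = Counter()
    for line in lines:
        match = CE_LINE.search(line)
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts


def decode_mci_status(status):
    """Extract a few fields from a 64-bit IA32_MCi_STATUS value."""
    return {
        "valid": bool((status >> 63) & 1),        # VAL: register holds a valid error
        "uncorrected": bool((status >> 61) & 1),  # UC: 0 means the error was corrected
        "addr_valid": bool((status >> 58) & 1),   # ADDRV: the ADDR value is meaningful
        "mca_error_code": status & 0xFFFF,        # 0x009f here: memory read, channel unspecified
    }


if __name__ == "__main__":
    # Two lines copied from the log above (parenthesised details shortened).
    sample = [
        "[Mon May 8 21:08:01 2023] EDAC MC0: 0 CE memory read error on "
        "CPU_SrcID#0_Ha#0_Chan#1_DIMM#1 (channel:1 slot:1)",
        "[Tue May 9 05:30:29 2023] EDAC MC1: 0 CE memory read error on "
        "CPU_SrcID#1_Ha#0_Chan#1_DIMM#1 (channel:1 slot:1)",
    ]
    for (mc, label), n in sorted(count_ce_by_dimm(sample).items()):
        print(mc, label, n)
    # Status word from the "Machine Check Event: 0 Bank 1" line above.
    print(decode_mci_status(0x940000000000009F))
```

In practice the lines could come from saved `dmesg -T` output rather than a hard-coded list. A tally that keeps growing on the same label after a DIMM swap would be worth comparing against the iDRAC SEL entries.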
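Mixing single-rank and dual-rank modules is not recommended, but your mileage may vary depending on the processor and motherboard revision.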