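PowerFlex: DIMM Hardware Issue Causes High CPU Usage and SDS Disconnection
Summary: Certain memory problems (that is, DIMM issues) can cause a CMCI storm and ultimately cause the SDS to disconnect. This particular problem occurs because the operating system does not respond appropriately to routine correctable-memory notifications. It can occur when a RAM DIMM module in the server is failing, but other hardware problems can produce the same condition.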
Symptoms
kernel: {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
kernel: {1}[Hardware Error]: It has been corrected by h/w and requires no further action
kernel: {1}[Hardware Error]: event severity: corrected
kernel: {1}[Hardware Error]: Error 0, type: corrected
kernel: {1}[Hardware Error]: fru_text: A1
kernel: {1}[Hardware Error]: section_type: memory error
kernel: {1}[Hardware Error]: error_status: 0x0000000000000400
kernel: {1}[Hardware Error]: physical_address: 0x0000000ad6a38ac0
kernel: {1}[Hardware Error]: node: 0 card: 0 module: 0 rank: 0 bank: 1 device: 1 row: 58311 column: 712
kernel: {1}[Hardware Error]: error_type: 13, scrub corrected error
kernel: {1}[Hardware Error]: DIMM location: not present. DMI handle: 0x0000
kernel: {2}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 65534
kernel: {2}[Hardware Error]: It has been corrected by h/w and requires no further action
kernel: {2}[Hardware Error]: event severity: corrected
kernel: {2}[Hardware Error]: Error 0, type: corrected
kernel: {2}[Hardware Error]: section type: unknown, xxxxxxxx-xxxx-xxxx-xxxx-000xxxxxxx1b
kernel: {2}[Hardware Error]: Error 1, type: corrected
kernel: {2}[Hardware Error]: section type: unknown, xxxxxxxx-xxxx-xxxx-xxxx-000xxxxxxx1b
kernel: EDAC skx MC0: HANDLING MCE MEMORY ERROR
kernel: EDAC skx MC0: CPU 0: Machine Check Event: 0x0 Bank 1: 0x940000000000009f
kernel: EDAC skx MC0: TSC 0xcdaff277a3653a
kernel: EDAC skx MC0: ADDR 0xad6a38ac0
kernel: EDAC skx MC0: MISC 0x0
kernel: EDAC skx MC0: PROCESSOR 0:0x50654 TIME 1669993821 SOCKET 0 APIC 0x0
kernel: EDAC MC0: 0 CE memory read error on CPU_SrcID#0_MC#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0xad6a38 offset:0xac0 grain:32 syndrome:0x0 - err_code:0x0000:0x009f socket:0 imc:0 rank:0 bg:1 ba:1 row:0xe3c7 col:0x2c8)
kernel: mce: [Hardware Error]: Machine check events logged
mcelog: Hardware event. This is not a software error.
mcelog: MCE 0
mcelog: CPU 0 BANK 1 TSC cdaff277a3653a
mcelog: ADDR ad6a38ac0
mcelog: TIME 1669993821 Fri Dec 2 15:10:21 2022
mcelog: MCG status:
mcelog: MCi status:
mcelog: Corrected error
mcelog: Error enabled
mcelog: MCi_ADDR register valid
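To see which DIMM the correctable errors point at, the EDAC "CE" lines can be tallied per DIMM label. The following is a minimal sketch, assuming the log lines follow the same layout as the EDAC excerpt above; the function name count_ce_by_dimm and the regular expression are illustrative and not part of any PowerFlex or kernel tooling.

```python
import re
from collections import Counter

# Matches EDAC correctable-error lines shaped like the excerpt above, e.g.
#   kernel: EDAC MC0: 0 CE memory read error on CPU_SrcID#0_MC#0_Chan#0_DIMM#0 (...)
# Assumption: the DIMM label is the first token after " on ".
CE_LINE = re.compile(r"EDAC MC\d+: \d+ CE .*? on (\S+)")


def count_ce_by_dimm(lines):
    """Return a Counter mapping DIMM label -> number of CE log lines seen.

    This counts log lines; it does not sum the numeric field printed
    before "CE".
    """
    counts = Counter()
    for line in lines:
        match = CE_LINE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

A label that accumulates many more lines than the others is a reasonable candidate to raise with the hardware vendor.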
Dec 8 08:28:51 node01 kernel: CMCI storm detected: switching to poll mode
Dec 8 08:33:50 node01 kernel: CMCI storm subsided: switching to interrupt mode
(...)
Dec 10 03:19:03 node01 kernel: CMCI storm subsided: switching to interrupt mode
Dec 10 03:19:03 node01 kernel: CMCI storm detected: switching to poll mode
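High CPU usage can cause the SDS process to stall IO (which affects overall system IO latency) or even disconnect the SDS from the MDM. If this happens while a rebuild is in progress, or while another SDS is experiencing a similar problem, it can lead to a data unavailability (DU) situation.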
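To judge how long a node spends in CMCI storms, the "storm detected" and "storm subsided" messages can be paired into intervals. The following is a minimal sketch, assuming syslog-style lines like the ones shown above, English month abbreviations, and a year supplied by the caller (syslog timestamps do not carry one); the names STORM_LINE and storm_intervals are illustrative.

```python
import re
from datetime import datetime

# Assumed line layout: "<Mon> <day> <HH:MM:SS> <host> kernel: CMCI storm detected|subsided: ..."
STORM_LINE = re.compile(
    r"^(\w{3}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2})\s+\S+\s+kernel: "
    r"CMCI storm (detected|subsided)"
)


def storm_intervals(lines, year):
    """Pair each 'detected' message with the next 'subsided' message.

    Returns a list of (start, end) datetime tuples. A storm still in
    progress at the end of the log is returned with end set to None.
    A 'subsided' message with no preceding 'detected' is ignored.
    """
    intervals = []
    start = None
    for line in lines:
        match = STORM_LINE.match(line)
        if not match:
            continue
        # The year is an assumption passed in by the caller.
        stamp = datetime.strptime(f"{year} {match.group(1)}", "%Y %b %d %H:%M:%S")
        if match.group(2) == "detected":
            if start is None:
                start = stamp
        elif start is not None:
            intervals.append((start, stamp))
            start = None
    if start is not None:
        intervals.append((start, None))
    return intervals
```

Frequent or long intervals suggest the error source is persistent, which is the situation in which SDS IO stalls are most likely.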
Cause
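While a DIMM is producing errors, Intel CPUs can be affected by an "interrupt storm." According to a Red Hat knowledge base article:
Starting with the 45 nm Intel 64 processors for which CPUID reports DisplayFamily_DisplayModel as 06H_1AH, the processor can report information about corrected machine-check errors and deliver a programmable interrupt so that software can respond to MC errors. This is called the corrected machine-check error interrupt (CMCI). Intel hardware can deliver an interrupt when the error level exceeds a programmable threshold. If the error persists, the CPU receives a continuous stream, or storm, of interrupts at a rate high enough to affect its ability to do useful work. When this happens, the kernel disables the CMCI mechanism and reverts to the more classic method of periodically polling for machine-check errors. Once the error rate drops, the kernel re-enables CMCI.
For more information, see: https://access.redhat.com/solutions/2710451
CMCI storms can also be triggered by operating system features and software that intercept correctable errors instead of allowing the Dell iDRAC to capture and handle them. This typically happens when EDAC and CMCI are both enabled.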
Resolution
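Place the affected SDS in maintenance mode and/or remove it from the cluster to reduce the impact on the overall system.
Contact the hardware vendor to check for any underlying hardware problems. If no hardware problem is detected (particularly when the errors are correctable), contact the operating system vendor and request help with disabling EDAC and CMCI.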
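Before opening a case with the operating system vendor, it can help to record whether EDAC modules are loaded and whether CMCI is already disabled at boot. The following is a read-only sketch under two assumptions: EDAC driver modules have "edac" in their name (for example skx_edac, which matches the "EDAC skx" messages above), and CMCI is disabled through the Linux boot option mce=no_cmci. The function name edac_cmci_status is illustrative; the function changes nothing on the host.

```python
def edac_cmci_status(cmdline_text, modules_text):
    """Summarize EDAC/CMCI state from the text of two /proc files.

    cmdline_text: contents of /proc/cmdline
    modules_text: contents of /proc/modules
    """
    # The first whitespace-separated field of each /proc/modules line
    # is the module name.
    modules = [line.split()[0] for line in modules_text.splitlines() if line.strip()]
    # Assumption: EDAC drivers are identifiable by "edac" in the module name.
    edac_modules = sorted(name for name in modules if "edac" in name)
    # Assumption: CMCI is disabled via the boot option mce=no_cmci.
    cmci_disabled = "mce=no_cmci" in cmdline_text.split()
    return {
        "edac_modules": edac_modules,
        "cmci_disabled_at_boot": cmci_disabled,
    }
```

On a Linux host, the two arguments can be read from /proc/cmdline and /proc/modules. Any actual change to EDAC or CMCI settings should follow the operating system vendor's guidance.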
Additional Information
Impacted Versions
Not applicable (not a PowerFlex issue)
Fixed In Version
Not applicable (hardware issue)