PowerFlex: DIMM Hardware Issues Cause High CPU Usage and SDS Decoupling

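Summary: Certain memory problems (that is, DIMM problems) can cause a CMCI storm and, in effect, lead to SDS decoupling. This particular problem arises because the operating system does not respond appropriately to routine correctable-memory notifications. It can occur when a RAM DIMM module in the server is failing, but other hardware problems can lead to the same condition.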


Symptoms

A failing DIMM can be observed in the iDRAC or operating system logs, for example:
 kernel: {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
 kernel: {1}[Hardware Error]: It has been corrected by h/w and requires no further action
 kernel: {1}[Hardware Error]: event severity: corrected
 kernel: {1}[Hardware Error]:  Error 0, type: corrected
 kernel: {1}[Hardware Error]:  fru_text: A1
 kernel: {1}[Hardware Error]:   section_type: memory error
 kernel: {1}[Hardware Error]:   error_status: 0x0000000000000400
 kernel: {1}[Hardware Error]:   physical_address: 0x0000000ad6a38ac0
 kernel: {1}[Hardware Error]:   node: 0 card: 0 module: 0 rank: 0 bank: 1 device: 1 row: 58311 column: 712
 kernel: {1}[Hardware Error]:   error_type: 13, scrub corrected error
 kernel: {1}[Hardware Error]:   DIMM location: not present. DMI handle: 0x0000
 kernel: {2}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 65534
 kernel: {2}[Hardware Error]: It has been corrected by h/w and requires no further action
 kernel: {2}[Hardware Error]: event severity: corrected
 kernel: {2}[Hardware Error]:  Error 0, type: corrected
 kernel: {2}[Hardware Error]:   section type: unknown, xxxxxxxx-xxxx-xxxx-xxxx-000xxxxxxx1b
 kernel: {2}[Hardware Error]:  Error 1, type: corrected
 kernel: {2}[Hardware Error]:   section type: unknown, xxxxxxxx-xxxx-xxxx-xxxx-000xxxxxxx1b
 kernel: EDAC skx MC0: HANDLING MCE MEMORY ERROR
 kernel: EDAC skx MC0: CPU 0: Machine Check Event: 0x0 Bank 1: 0x940000000000009f
 kernel: EDAC skx MC0: TSC 0xcdaff277a3653a
 kernel: EDAC skx MC0: ADDR 0xad6a38ac0
 kernel: EDAC skx MC0: MISC 0x0
 kernel: EDAC skx MC0: PROCESSOR 0:0x50654 TIME 1669993821 SOCKET 0 APIC 0x0
 kernel: EDAC MC0: 0 CE memory read error on CPU_SrcID#0_MC#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0xad6a38 offset:0xac0 grain:32 syndrome:0x0 -  err_code:0x0000:0x009f socket:0 imc:0 rank:0 bg:1 ba:1 row:0xe3c7 col:0x2c8)
 kernel: mce: [Hardware Error]: Machine check events logged
 mcelog: Hardware event. This is not a software error.
 mcelog: MCE 0
 mcelog: CPU 0 BANK 1 TSC cdaff277a3653a
 mcelog: ADDR ad6a38ac0
 mcelog: TIME 1669993821 Fri Dec  2 15:10:21 2022
 mcelog: MCG status:
 mcelog: MCi status:
 mcelog: Corrected error
 mcelog: Error enabled
 mcelog: MCi_ADDR register valid
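
To see which DIMM a run of corrected errors points at, the EDAC summary lines in a log excerpt like the one above can be tallied per DIMM label. The following is a minimal sketch, not a Dell or PowerFlex tool: the function name `count_corrected_errors` and the regular expression are illustrative assumptions that only match the `EDAC MCn: ... CE ... on <label>` line shape shown above, and the result counts log lines rather than individual hardware events.

```python
import re
from collections import Counter

# Assumed line shape, taken from the excerpt above:
#   kernel: EDAC MC0: 0 CE memory read error on CPU_SrcID#0_MC#0_Chan#0_DIMM#0 (channel:0 ...
# Lines such as "EDAC skx MC0: HANDLING MCE MEMORY ERROR" are deliberately not matched.
EDAC_CE = re.compile(r"EDAC MC\d+: \d+ CE .*? on (\S+)")


def count_corrected_errors(lines):
    """Return a Counter mapping DIMM label -> number of corrected-error (CE) log lines."""
    counts = Counter()
    for line in lines:
        match = EDAC_CE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

Calling `count_corrected_errors(open(path))` on a saved kernel log (the path is whatever file holds the messages on the host) gives a quick view of whether the errors concentrate on one DIMM.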
 
Shortly after the hardware problem is detected, CMCI storms are reported:
Dec  8 08:28:51 node01 kernel: CMCI storm detected: switching to poll mode
Dec  8 08:33:50 node01 kernel: CMCI storm subsided: switching to interrupt mode
(...)                
Dec 10 03:19:03 node01 kernel: CMCI storm subsided: switching to interrupt mode
Dec 10 03:19:03 node01 kernel: CMCI storm detected: switching to poll mode
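
The storm messages above can be paired up to see how long the node spent in poll mode. The sketch below is one possible way to do that under stated assumptions: syslog timestamps carry no year, so the caller supplies one; month names are assumed to be English abbreviations; and the name `storm_intervals` is hypothetical. A "subsided" line with no preceding "detected" line (as happens when the log is elided) is ignored.

```python
import re
from datetime import datetime

# Assumed syslog shape, taken from the excerpt above:
#   Dec  8 08:28:51 node01 kernel: CMCI storm detected: switching to poll mode
CMCI_EVENT = re.compile(
    r"^(\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+\S+\s+kernel: CMCI storm (detected|subsided)"
)


def storm_intervals(lines, year):
    """Pair 'detected' and 'subsided' events.

    Returns (intervals, open_start): a list of (start, end) datetimes for
    completed storms, and the start time of a storm that has not subsided
    yet (or None).
    """
    intervals = []
    started = None
    for line in lines:
        match = CMCI_EVENT.match(line)
        if not match:
            continue
        # The year is supplied by the caller because syslog omits it.
        when = datetime.strptime(f"{year} {match.group(1)}", "%Y %b %d %H:%M:%S")
        if match.group(2) == "detected":
            if started is None:
                started = when
        elif started is not None:
            intervals.append((started, when))
            started = None
    return intervals, started
```

Frequent, long, or never-closing intervals suggest that the error source is persistent rather than a one-off event.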


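High CPU usage can cause the SDS process to stall I/O (which affects overall system I/O latency) or even decouple the SDS from the MDM. If this happens while a rebuild is in progress, or while another SDS is experiencing a similar problem, it can result in a data unavailability (DU) situation.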

Cause

While a DIMM is producing errors, Intel CPUs may suffer from an "interrupt storm." According to the Red Hat Knowledgebase article:

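Beginning with 45 nm Intel 64 processors on which CPUID reports DisplayFamily_DisplayModel as 06H_1AH, the processor can report information on corrected machine-check errors and deliver a programmable interrupt for software to respond to MC errors, referred to as the corrected machine-check error interrupt (CMCI). Intel's hardware can deliver an interrupt when an error level exceeds a programmable threshold. If the error persists, the CPU can receive a continual flood, or storm, of interrupts at a rate high enough to impact the CPU's ability to do useful work. When that happens, the kernel disables the CMCI mechanism and reverts to the more classical method of periodically polling for machine check errors. When the error rate drops, the kernel re-enables CMCI.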
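
The switch between interrupt mode and poll mode described above can be illustrated with a toy model. This is a sketch of the idea only, not the Linux kernel's implementation: the class name `CmciModeModel`, the threshold, and the window length are hypothetical values chosen for illustration.

```python
class CmciModeModel:
    """Toy model of the interrupt/poll switch described above (not kernel code)."""

    def __init__(self, storm_threshold=15, window=1.0):
        self.storm_threshold = storm_threshold  # hypothetical: errors per window that count as a storm
        self.window = window                    # hypothetical window length, in seconds
        self.mode = "interrupt"
        self._recent = []                       # timestamps of errors still inside the window

    def observe(self, now, errors=1):
        """Record `errors` corrected errors at time `now` and return the resulting mode."""
        # Forget errors that have aged out of the window.
        self._recent = [t for t in self._recent if now - t < self.window]
        self._recent.extend([now] * errors)
        if self.mode == "interrupt" and len(self._recent) > self.storm_threshold:
            self.mode = "poll"       # storm detected: stop taking one interrupt per error
        elif self.mode == "poll" and not self._recent:
            self.mode = "interrupt"  # storm subsided: re-enable interrupts
        return self.mode
```

In this model a persistently failing DIMM keeps refilling the window, so the mode either stays in "poll" or flips back and forth, which matches the alternating "detected" and "subsided" messages seen in the symptoms.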

For more information, see: https://access.redhat.com/solutions/2710451
 

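This problem, which can cause a CMCI storm, can also be triggered by operating system features and software that intercept correctable errors instead of allowing the Dell iDRAC to capture and handle them. This typically occurs when both EDAC and CMCI are enabled.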
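
Whether a Linux host currently has EDAC drivers loaded and CMCI left enabled can be checked without changing anything. The sketch below is a read-only illustration under stated assumptions: the function names are hypothetical, EDAC drivers are assumed to be loadable modules with "edac" in their name (drivers built into the kernel do not appear in the module list), and CMCI is treated as disabled only when the kernel command line contains the `mce=no_cmci` boot parameter.

```python
def edac_modules(proc_modules_text):
    """Return the names of loaded kernel modules whose name contains 'edac'.

    `proc_modules_text` is the content of /proc/modules, where the first
    whitespace-separated field of each line is the module name.
    """
    names = [line.split()[0] for line in proc_modules_text.splitlines() if line.strip()]
    return [name for name in names if "edac" in name]


def cmci_disabled(cmdline_text):
    """Return True if the kernel command line carries the mce=no_cmci parameter."""
    return "mce=no_cmci" in cmdline_text.split()


def edac_and_cmci_enabled(proc_modules_text, cmdline_text):
    """True when at least one EDAC module is loaded and CMCI has not been disabled."""
    return bool(edac_modules(proc_modules_text)) and not cmci_disabled(cmdline_text)
```

On a Linux host the two inputs would come from reading `/proc/modules` and `/proc/cmdline`. Any actual change to these settings should follow the operating system vendor's guidance.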

 

Resolution

Place the affected SDS into maintenance mode and/or remove it from the cluster to reduce the impact on the overall system.

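Contact the hardware vendor to check for any underlying hardware problems. If no hardware problem is detected (particularly when the errors are correctable), contact the operating system vendor and request help with disabling EDAC and CMCI.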

Additional Information

Affected versions

Not applicable: this is not a PowerFlex issue.

Fixed in version

Not applicable: this is a hardware issue.

Affected Products

PowerFlex rack, VxFlex Ready Nodes, ScaleIO
Article Properties
Article Number: 000197735
Article Type: Solution
Last Modified: 08 Apr 2025
Version:  5