PowerFlex: DIMM Hardware Issue Causes High CPU Usage And SDS Decoupling

Summary: Certain memory issues (for example, DIMM problems) can cause CMCI storms and, as a result, SDS decoupling. This particular issue arose because the operating system did not respond appropriately to routine correctable memory notifications. It may also occur when a DIMM is failing on a server, but other hardware problems can potentially cause the same scenario. ...


Symptoms

A failing DIMM can be observed in iDRAC or in the operating system logs, for example:
 kernel: {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
 kernel: {1}[Hardware Error]: It has been corrected by h/w and requires no further action
 kernel: {1}[Hardware Error]: event severity: corrected
 kernel: {1}[Hardware Error]:  Error 0, type: corrected
 kernel: {1}[Hardware Error]:  fru_text: A1
 kernel: {1}[Hardware Error]:   section_type: memory error
 kernel: {1}[Hardware Error]:   error_status: 0x0000000000000400
 kernel: {1}[Hardware Error]:   physical_address: 0x0000000ad6a38ac0
 kernel: {1}[Hardware Error]:   node: 0 card: 0 module: 0 rank: 0 bank: 1 device: 1 row: 58311 column: 712
 kernel: {1}[Hardware Error]:   error_type: 13, scrub corrected error
 kernel: {1}[Hardware Error]:   DIMM location: not present. DMI handle: 0x0000
 kernel: {2}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 65534
 kernel: {2}[Hardware Error]: It has been corrected by h/w and requires no further action
 kernel: {2}[Hardware Error]: event severity: corrected
 kernel: {2}[Hardware Error]:  Error 0, type: corrected
 kernel: {2}[Hardware Error]:   section type: unknown, xxxxxxxx-xxxx-xxxx-xxxx-000xxxxxxx1b
 kernel: {2}[Hardware Error]:  Error 1, type: corrected
 kernel: {2}[Hardware Error]:   section type: unknown, xxxxxxxx-xxxx-xxxx-xxxx-000xxxxxxx1b
 kernel: EDAC skx MC0: HANDLING MCE MEMORY ERROR
 kernel: EDAC skx MC0: CPU 0: Machine Check Event: 0x0 Bank 1: 0x940000000000009f
 kernel: EDAC skx MC0: TSC 0xcdaff277a3653a
 kernel: EDAC skx MC0: ADDR 0xad6a38ac0
 kernel: EDAC skx MC0: MISC 0x0
 kernel: EDAC skx MC0: PROCESSOR 0:0x50654 TIME 1669993821 SOCKET 0 APIC 0x0
 kernel: EDAC MC0: 0 CE memory read error on CPU_SrcID#0_MC#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0xad6a38 offset:0xac0 grain:32 syndrome:0x0 -  err_code:0x0000:0x009f socket:0 imc:0 rank:0 bg:1 ba:1 row:0xe3c7 col:0x2c8)
 kernel: mce: [Hardware Error]: Machine check events logged
 mcelog: Hardware event. This is not a software error.
 mcelog: MCE 0
 mcelog: CPU 0 BANK 1 TSC cdaff277a3653a
 mcelog: ADDR ad6a38ac0
 mcelog: TIME 1669993821 Fri Dec  2 15:10:21 2022
 mcelog: MCG status:
 mcelog: MCi status:
 mcelog: Corrected error
 mcelog: Error enabled
 mcelog: MCi_ADDR register valid
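
As a rough aid for triage, the EDAC summary lines in an excerpt like the one above can be tallied per DIMM label. The following is a minimal sketch, assuming the log uses the same "EDAC MC<n>: <count> CE ... on <label> (" layout shown here; the function name count_ce_by_dimm and the regular expression are illustrative and not part of any Dell or kernel tooling. Other kernel versions may format this line differently.

```python
import re
from collections import Counter

# Matches the EDAC summary line layout from the excerpt above, e.g.
# "EDAC MC0: 0 CE memory read error on CPU_SrcID#0_MC#0_Chan#0_DIMM#0 (channel:0 ..."
# Assumption: the DIMM label is the whitespace-free token before " (".
CE_LINE = re.compile(r"EDAC MC\d+: \d+ CE .*? on (?P<label>\S+) \(")


def count_ce_by_dimm(lines):
    """Count correctable-error (CE) log lines per DIMM label.

    This counts matching log lines; it does not interpret the numeric
    field that precedes "CE" in the message.
    """
    counts = Counter()
    for line in lines:
        match = CE_LINE.search(line)
        if match:
            counts[match.group("label")] += 1
    return counts
```

The function accepts any iterable of strings, so it can be fed an open log file directly. A label that dominates the tally is a reasonable candidate to raise with the hardware vendor.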
 
Shortly after the hardware issue is detected, a CMCI storm is reported in the kernel log:
Dec  8 08:28:51 node01 kernel: CMCI storm detected: switching to poll mode
Dec  8 08:33:50 node01 kernel: CMCI storm subsided: switching to interrupt mode
(...)                
Dec 10 03:19:03 node01 kernel: CMCI storm subsided: switching to interrupt mode
Dec 10 03:19:03 node01 kernel: CMCI storm detected: switching to poll mode


High CPU usage can cause the SDS process to stall I/Os, which increases overall system I/O latency, or can even decouple the SDS from the MDM. If this happens during an ongoing rebuild, or while another SDS is experiencing similar problems, it can lead to a data unavailability (DU) situation.

Cause

Intel CPUs can suffer from "interrupt storms" during DIMM errors. As per Red Hat KB:

Starting with a 45 nm Intel 64 processor on which CPUID reports DisplayFamily_DisplayModel as 06H_1AH, the processor can report information about corrected machine-check errors and deliver a programmable interrupt for software to respond to MC errors, seen as corrected machine-check error interrupt (CMCI). The Intel's hardware can deliver interrupts when the level of errors exceeds a programmable threshold. If the error is persistent, the CPU will then receive a constant influx or storm of interrupts at a high enough rate that it impacts the CPU's ability to do useful work. When that happens the kernel disables the CMCI mechanism and reverts to a more classical approach of polling regularly for machine check errors. Once the rate of errors decreases the kernel re-enables CMCI back.

See more information at: https://access.redhat.com/solutions/2710451
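
The switching behavior described in the quoted passage can be illustrated with a toy model. This is only a sketch of the idea, not the kernel's implementation: the class name CmciModeModel, the threshold, the window length, and the rule for leaving poll mode are all assumptions chosen for illustration.

```python
class CmciModeModel:
    """Toy model of interrupt/poll switching for corrected errors.

    Assumptions (illustrative only): a storm is declared when more than
    `threshold` interrupts arrive within one `window` (seconds), and it
    ends after a polling pass that finds no new errors.
    """

    def __init__(self, threshold=15, window=1.0):
        self.threshold = threshold
        self.window = window
        self.mode = "interrupt"
        self._window_start = None
        self._count = 0

    def on_error(self, t):
        """Record a corrected error at time t and return the current mode."""
        if self.mode == "poll":
            # In poll mode no interrupt is delivered; the poller finds the error.
            return self.mode
        if self._window_start is None or t - self._window_start >= self.window:
            # Start a fresh counting window.
            self._window_start, self._count = t, 0
        self._count += 1
        if self._count > self.threshold:
            self.mode = "poll"
        return self.mode

    def on_poll(self, errors_seen):
        """Called once per polling pass; a quiet pass ends the storm."""
        if self.mode == "poll" and errors_seen == 0:
            self.mode = "interrupt"
            self._window_start, self._count = None, 0
        return self.mode
```

In this model a persistent error source keeps every polling pass non-empty, so the node stays in poll mode, while an intermittent source flips between the two modes, which is the pattern visible in the kernel messages shown under Symptoms.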
 

A DIMM problem of this kind may result in a CMCI storm. A storm can also be triggered by OS features and software that intercept correctable errors instead of letting Dell iDRAC capture and handle them. This typically occurs when both EDAC and CMCI are enabled.

 

Resolution

Put the affected SDS into Maintenance Mode and/or remove it from the cluster to alleviate the impact on the entire system.

Contact the hardware vendor to check for hardware problems. If no hardware issue is found, particularly when only correctable errors are reported, contact the OS vendor for assistance with disabling EDAC and CMCI.

Additional Information

Impacted Versions

N/A - not a PowerFlex issue

Fixed In Version

N/A - hardware problem

Affected Products

PowerFlex rack, VxFlex Ready Nodes, ScaleIO

Article Properties

Article Number: 000197735
Article Type: Solution
Last Modified: 08 Apr 2025
Version: 5