PowerFlex: DIMM Hardware Issue Causes High CPU Usage And SDS Decoupling
Summary: Certain memory issues (such as DIMM problems) can cause CMCI storms and, as a result, SDS decoupling. In this case, the issue arose because the operating system did not respond appropriately to routine correctable memory notifications. The same behavior can occur when a DIMM is failing on a server, and other hardware problems can potentially cause the same scenario. ...
Symptoms
kernel: {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
kernel: {1}[Hardware Error]: It has been corrected by h/w and requires no further action
kernel: {1}[Hardware Error]: event severity: corrected
kernel: {1}[Hardware Error]: Error 0, type: corrected
kernel: {1}[Hardware Error]: fru_text: A1
kernel: {1}[Hardware Error]: section_type: memory error
kernel: {1}[Hardware Error]: error_status: 0x0000000000000400
kernel: {1}[Hardware Error]: physical_address: 0x0000000ad6a38ac0
kernel: {1}[Hardware Error]: node: 0 card: 0 module: 0 rank: 0 bank: 1 device: 1 row: 58311 column: 712
kernel: {1}[Hardware Error]: error_type: 13, scrub corrected error
kernel: {1}[Hardware Error]: DIMM location: not present. DMI handle: 0x0000
kernel: {2}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 65534
kernel: {2}[Hardware Error]: It has been corrected by h/w and requires no further action
kernel: {2}[Hardware Error]: event severity: corrected
kernel: {2}[Hardware Error]: Error 0, type: corrected
kernel: {2}[Hardware Error]: section type: unknown, xxxxxxxx-xxxx-xxxx-xxxx-000xxxxxxx1b
kernel: {2}[Hardware Error]: Error 1, type: corrected
kernel: {2}[Hardware Error]: section type: unknown, xxxxxxxx-xxxx-xxxx-xxxx-000xxxxxxx1b
kernel: EDAC skx MC0: HANDLING MCE MEMORY ERROR
kernel: EDAC skx MC0: CPU 0: Machine Check Event: 0x0 Bank 1: 0x940000000000009f
kernel: EDAC skx MC0: TSC 0xcdaff277a3653a
kernel: EDAC skx MC0: ADDR 0xad6a38ac0
kernel: EDAC skx MC0: MISC 0x0
kernel: EDAC skx MC0: PROCESSOR 0:0x50654 TIME 1669993821 SOCKET 0 APIC 0x0
kernel: EDAC MC0: 0 CE memory read error on CPU_SrcID#0_MC#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0xad6a38 offset:0xac0 grain:32 syndrome:0x0 - err_code:0x0000:0x009f socket:0 imc:0 rank:0 bg:1 ba:1 row:0xe3c7 col:0x2c8)
kernel: mce: [Hardware Error]: Machine check events logged
mcelog: Hardware event. This is not a software error.
mcelog: MCE 0
mcelog: CPU 0 BANK 1 TSC cdaff277a3653a
mcelog: ADDR ad6a38ac0
mcelog: TIME 1669993821 Fri Dec 2 15:10:21 2022
mcelog: MCG status:
mcelog: MCi status:
mcelog: Corrected error
mcelog: Error enabled
mcelog: MCi_ADDR register valid
Dec 8 08:28:51 node01 kernel: CMCI storm detected: switching to poll mode
Dec 8 08:33:50 node01 kernel: CMCI storm subsided: switching to interrupt mode
(...)
Dec 10 03:19:03 node01 kernel: CMCI storm subsided: switching to interrupt mode
Dec 10 03:19:03 node01 kernel: CMCI storm detected: switching to poll mode
High CPU usage can cause the SDS process to stall I/O, which increases overall system I/O latency, or even decouple the SDS from the MDM. If this happens during an ongoing rebuild, or while another SDS is experiencing similar problems, it can lead to data unavailability (DU).
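To judge how often and for how long a node stays in a CMCI storm, the "CMCI storm detected" and "CMCI storm subsided" messages shown above can be paired up. The following is a minimal sketch, assuming syslog-style lines in the format of the example above; the function names and the `year` parameter are illustrative (syslog timestamps carry no year), and month parsing assumes an English locale.

```python
import re
from datetime import datetime

# Matches lines such as:
#   Dec 8 08:28:51 node01 kernel: CMCI storm detected: switching to poll mode
STORM_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2})\s+(?P<host>\S+)\s+kernel:\s+"
    r"CMCI storm (?P<event>detected|subsided)"
)


def storm_events(lines, year=2022):
    """Return (timestamp, host, event) tuples for CMCI storm messages.

    `year` is an assumption supplied by the caller because the log
    lines themselves do not include one.
    """
    events = []
    for line in lines:
        m = STORM_RE.match(line.strip())
        if m:
            ts = datetime.strptime(f"{year} {m['ts']}", "%Y %b %d %H:%M:%S")
            events.append((ts, m["host"], m["event"]))
    return events


def storm_durations(events):
    """Pair each 'detected' with the next 'subsided' on the same host.

    Returns (host, start_time, duration_in_seconds) tuples, in log order.
    A 'subsided' without a preceding 'detected' is ignored.
    """
    started = {}
    durations = []
    for ts, host, event in events:
        if event == "detected":
            started.setdefault(host, ts)
        elif host in started:
            begin = started.pop(host)
            durations.append((host, begin, (ts - begin).total_seconds()))
    return durations
```

Feeding the lines of the OS log through `storm_events` and then `storm_durations` gives one entry per completed storm. Frequent or long-lasting entries on the same node suggest that the node is a candidate for the steps under Resolution.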
Cause
Intel CPUs can suffer from "interrupt storms" during DIMM errors. As per Red Hat KB:
Starting with a 45 nm Intel 64 processor on which CPUID reports DisplayFamily_DisplayModel as 06H_1AH, the processor can report information about corrected machine-check errors and deliver a programmable interrupt for software to respond to MC errors, seen as corrected machine-check error interrupt (CMCI). The Intel's hardware can deliver interrupts when the level of errors exceeds a programmable threshold. If the error is persistent, the CPU will then receive a constant influx or storm of interrupts at a high enough rate that it impacts the CPU's ability to do useful work. When that happens the kernel disables the CMCI mechanism and reverts to a more classical approach of polling regularly for machine check errors. Once the rate of errors decreases the kernel re-enables CMCI back.
See more information at: https://access.redhat.com/solutions/2710451
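The switching behavior described in the quote can be illustrated with a toy model. This is a simplified sketch and not the kernel's implementation: the function name, the per-interval error counts, and the `storm_threshold` value are all hypothetical and chosen only to show the idea of falling back to polling while the error rate is high and returning to interrupts once it drops.

```python
def cmci_mode_trace(errors_per_interval, storm_threshold=15):
    """Toy model of CMCI storm handling.

    For each interval, record the reporting mode that was in effect,
    then decide the mode for the next interval:
      - interrupt -> poll when the error count exceeds the threshold
      - poll -> interrupt when the error count drops back to or below it
    The threshold is a hypothetical value, not a kernel constant.
    """
    mode = "interrupt"
    trace = []
    for errors in errors_per_interval:
        trace.append(mode)
        if mode == "interrupt" and errors > storm_threshold:
            mode = "poll"  # corresponds to "CMCI storm detected"
        elif mode == "poll" and errors <= storm_threshold:
            mode = "interrupt"  # corresponds to "CMCI storm subsided"
    return trace


if __name__ == "__main__":
    # A burst of corrected errors in the second and third intervals.
    print(cmci_mode_trace([2, 40, 50, 3, 1]))
```

In this model, a persistently failing DIMM keeps the error count above the threshold, so the system stays in poll mode; an intermittent fault makes it flip back and forth, which matches the alternating "detected" and "subsided" messages seen in the logs.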
A failing DIMM may therefore result in a CMCI storm. A storm can also be triggered by OS features and software that intercept correctable errors instead of allowing Dell iDRAC to capture and handle them. This typically occurs when both EDAC and CMCI are enabled.
Resolution
Put the affected SDS into Maintenance Mode and/or remove it from the cluster to alleviate the impact on the entire system.
Contact the hardware vendor to check for hardware problems. If no hardware issue is detected, particularly when only Correctable Errors are reported, contact the OS vendor and request assistance with disabling EDAC and CMCI.
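When opening a case with the hardware vendor, it can help to state which DIMM the correctable errors point to. The sketch below counts the EDAC "CE" lines per DIMM label, assuming the line format shown under Symptoms; the function name is illustrative, and the script counts log lines rather than relying on the numeric field that precedes "CE".

```python
import re
from collections import Counter

# Matches lines such as:
#   kernel: EDAC MC0: 0 CE memory read error on CPU_SrcID#0_MC#0_Chan#0_DIMM#0 (channel:0 ...
# The "EDAC skx MC0: ..." detail lines are intentionally not matched.
EDAC_CE_RE = re.compile(r"EDAC MC\d+: \d+ CE .*? on (?P<label>\S+) \(")


def count_ce_by_dimm(lines):
    """Count correctable-error (CE) log lines per DIMM label."""
    counts = Counter()
    for line in lines:
        m = EDAC_CE_RE.search(line)
        if m:
            counts[m["label"]] += 1
    return counts


def report(lines):
    """Print DIMM labels ordered by number of CE lines, highest first."""
    for label, n in count_ce_by_dimm(lines).most_common():
        print(f"{n:6d}  {label}")
```

A single label that dominates the output is consistent with one failing DIMM; errors spread across many labels may point to a different hardware cause, which is a question for the hardware vendor.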
Additional Information
Impacted Versions
N/A - not a PowerFlex issue
Fixed In Version
N/A - hardware problem