PowerFlex: DIMM Hardware Issues Cause High CPU Usage and SDS Disconnection

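Summary: Certain memory problems (for example, DIMM faults) can trigger a CMCI storm and ultimately cause an SDS to disconnect. The problem arises when the operating system does not respond appropriately to routine, correctable memory notifications. It can occur when a RAM DIMM module in the server fails, but other hardware problems can produce the same scenario. ...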

Symptoms

A failed DIMM can be identified in the iDRAC or operating system logs. For example:

 kernel: {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
 kernel: {1}[Hardware Error]: It has been corrected by h/w and requires no further action
 kernel: {1}[Hardware Error]: event severity: corrected
 kernel: {1}[Hardware Error]:  Error 0, type: corrected
 kernel: {1}[Hardware Error]:  fru_text: A1
 kernel: {1}[Hardware Error]:   section_type: memory error
 kernel: {1}[Hardware Error]:   error_status: 0x0000000000000400
 kernel: {1}[Hardware Error]:   physical_address: 0x0000000ad6a38ac0
 kernel: {1}[Hardware Error]:   node: 0 card: 0 module: 0 rank: 0 bank: 1 device: 1 row: 58311 column: 712
 kernel: {1}[Hardware Error]:   error_type: 13, scrub corrected error
 kernel: {1}[Hardware Error]:   DIMM location: not present. DMI handle: 0x0000
 kernel: {2}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 65534
 kernel: {2}[Hardware Error]: It has been corrected by h/w and requires no further action
 kernel: {2}[Hardware Error]: event severity: corrected
 kernel: {2}[Hardware Error]:  Error 0, type: corrected
 kernel: {2}[Hardware Error]:   section type: unknown, xxxxxxxx-xxxx-xxxx-xxxx-000xxxxxxx1b
 kernel: {2}[Hardware Error]:  Error 1, type: corrected
 kernel: {2}[Hardware Error]:   section type: unknown, xxxxxxxx-xxxx-xxxx-xxxx-000xxxxxxx1b
 kernel: EDAC skx MC0: HANDLING MCE MEMORY ERROR
 kernel: EDAC skx MC0: CPU 0: Machine Check Event: 0x0 Bank 1: 0x940000000000009f
 kernel: EDAC skx MC0: TSC 0xcdaff277a3653a
 kernel: EDAC skx MC0: ADDR 0xad6a38ac0
 kernel: EDAC skx MC0: MISC 0x0
 kernel: EDAC skx MC0: PROCESSOR 0:0x50654 TIME 1669993821 SOCKET 0 APIC 0x0
 kernel: EDAC MC0: 0 CE memory read error on CPU_SrcID#0_MC#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0xad6a38 offset:0xac0 grain:32 syndrome:0x0 -  err_code:0x0000:0x009f socket:0 imc:0 rank:0 bg:1 ba:1 row:0xe3c7 col:0x2c8)
 kernel: mce: [Hardware Error]: Machine check events logged
 mcelog: Hardware event. This is not a software error.
 mcelog: MCE 0
 mcelog: CPU 0 BANK 1 TSC cdaff277a3653a
 mcelog: ADDR ad6a38ac0
 mcelog: TIME 1669993821 Fri Dec  2 15:10:21 2022
 mcelog: MCG status:
 mcelog: MCi status:
 mcelog: Corrected error
 mcelog: Error enabled
 mcelog: MCi_ADDR register valid
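
When many of these messages accumulate, it can help to tally corrected errors per DIMM label. The following is a minimal Python sketch, assuming the kernel log uses the same EDAC summary line format as the excerpt above; the function name `count_corrected_errors` and the regular expression are illustrative assumptions, not part of any Dell or PowerFlex tooling.

```python
import re
from collections import Counter

# Assumed pattern, derived only from the EDAC summary line in the excerpt above:
# "EDAC MC0: 0 CE memory read error on <label> (channel:0 slot:0 ..."
EDAC_CE = re.compile(
    r"EDAC MC(?P<mc>\d+): \d+ CE .*? on (?P<label>\S+) "
    r"\(channel:(?P<channel>\d+) slot:(?P<slot>\d+)"
)

def count_corrected_errors(lines):
    """Return a Counter mapping DIMM label -> number of CE summary lines."""
    counts = Counter()
    for line in lines:
        match = EDAC_CE.search(line)
        if match:
            counts[match.group("label")] += 1
    return counts

# Usage with two lines taken from the excerpt above.
sample_lines = [
    "kernel: EDAC skx MC0: HANDLING MCE MEMORY ERROR",
    "kernel: EDAC MC0: 0 CE memory read error on "
    "CPU_SrcID#0_MC#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0xad6a38 "
    "offset:0xac0 grain:32 syndrome:0x0 -  err_code:0x0000:0x009f "
    "socket:0 imc:0 rank:0 bg:1 ba:1 row:0xe3c7 col:0x2c8)",
]
counts = count_corrected_errors(sample_lines)
for label, n in counts.most_common():
    print(label, n)
```

A label that dominates the tally is a reasonable candidate to cross-check against the iDRAC log, but the count alone does not prove that the DIMM is faulty.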

Shortly after the hardware problem is detected, a CMCI storm is reported:

Dec  8 08:28:51 node01 kernel: CMCI storm detected: switching to poll mode
Dec  8 08:33:50 node01 kernel: CMCI storm subsided: switching to interrupt mode
(...)                
Dec 10 03:19:03 node01 kernel: CMCI storm subsided: switching to interrupt mode
Dec 10 03:19:03 node01 kernel: CMCI storm detected: switching to poll mode
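
To judge how long the kernel spends in poll mode, the storm messages can be paired up and timed. The sketch below assumes syslog-style lines like those above; because these lines carry no year, the `year` argument is an assumption supplied by the caller, and the function names are hypothetical.

```python
import re
from datetime import datetime

# Assumed pattern, based only on the two message forms shown above.
STORM = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d\d:\d\d:\d\d)\s+(?P<host>\S+)\s+kernel: "
    r"CMCI storm (?P<state>detected|subsided)"
)

def storm_events(lines, year):
    """Extract (timestamp, host, state) tuples from syslog-style lines."""
    events = []
    for line in lines:
        match = STORM.match(line)
        if match:
            # Collapse the padded day field ("Dec  8") before parsing.
            stamp = " ".join(match.group("ts").split())
            ts = datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
            events.append((ts, match.group("host"), match.group("state")))
    return events

def storm_durations(events):
    """Pair each 'detected' with the next 'subsided'; return seconds per storm."""
    durations = []
    start = None
    for ts, _host, state in events:
        if state == "detected":
            start = ts
        elif start is not None:
            durations.append((ts - start).total_seconds())
            start = None
    return durations

log_lines = [
    "Dec  8 08:28:51 node01 kernel: CMCI storm detected: switching to poll mode",
    "Dec  8 08:33:50 node01 kernel: CMCI storm subsided: switching to interrupt mode",
    "(...)",
    "Dec 10 03:19:03 node01 kernel: CMCI storm subsided: switching to interrupt mode",
    "Dec 10 03:19:03 node01 kernel: CMCI storm detected: switching to poll mode",
]
events = storm_events(log_lines, year=2022)  # year is an assumption
durations = storm_durations(events)
print(len(events), durations)
```

Storms that recur within the same second they subside, as in the last two lines, suggest that the error rate is hovering around the kernel's threshold rather than clearing.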

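High CPU usage can cause the SDS process to delay IO (affecting overall system IO latency) or can cause the SDS to disconnect from the MDM. If this happens during an ongoing rebuild, or while another SDS is experiencing a similar problem, a data unavailability (DU) situation can result.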

Cause

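Intel CPUs can experience an "interrupt storm" during DIMM errors. According to the Red Hat KB:

Starting with 45nm Intel 64 processors on which CPUID reports DisplayFamily_DisplayModel as 06H_1AH, the processor can report information on corrected machine-check errors and deliver a programmable interrupt for software to respond to MC errors, referred to as the Corrected Machine-Check Error Interrupt (CMCI). Intel hardware can deliver an interrupt when the error level exceeds a programmable threshold. If the errors persist, the CPU receives a continuous influx, or storm, of interrupts at a rate fast enough to affect its ability to do useful work. When this happens, the kernel disables the CMCI mechanism and falls back to the more classic approach of periodically polling for machine-check errors. When the error rate decreases, the kernel re-enables CMCI.

More information: https://access.redhat.com/solutions/2710451

Persistent DIMM errors can therefore produce a CMCI storm. A storm can also be triggered by OS features and software that intercept correctable errors instead of allowing the Dell iDRAC to capture and handle them. This typically happens when both EDAC and CMCI are enabled.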
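
One read-only way to see whether the OS-level EDAC driver is active and counting errors is to read its sysfs counters. The sketch below assumes the standard Linux EDAC layout under `/sys/devices/system/edac/mc`, with `ce_count` and `ue_count` files per memory controller; the function name `read_edac_counts` is hypothetical, and the path may differ or be absent on a given distribution.

```python
from pathlib import Path

def read_edac_counts(root="/sys/devices/system/edac/mc"):
    """Return {controller: {"ce_count": int|None, "ue_count": int|None}}.

    An empty dict means the directory was not found, which usually
    indicates that no EDAC memory-controller driver is loaded.
    """
    counts = {}
    base = Path(root)
    if not base.is_dir():
        return counts
    for mc in sorted(base.glob("mc[0-9]*")):
        entry = {}
        for name in ("ce_count", "ue_count"):
            try:
                entry[name] = int((mc / name).read_text().strip())
            except (OSError, ValueError):
                entry[name] = None  # counter missing or unreadable
        counts[mc.name] = entry
    return counts

if __name__ == "__main__":
    for controller, values in read_edac_counts().items():
        print(controller, values)
```

This only reports state. It does not change any EDAC or CMCI setting, which should be done with guidance from the OS vendor.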

Resolution

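Place the affected SDS in maintenance mode or remove it from the cluster to mitigate the impact on the overall system.

Contact the hardware vendor to check for potential hardware problems. If no hardware problem is found, particularly when the errors are correctable, contact the OS vendor for assistance with disabling EDAC and CMCI.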

Additional Information

Affected versions

N/A - not a PowerFlex issue

Fixed in version

N/A - hardware issue

Affected Products

PowerFlex rack, VxFlex Ready Nodes, ScaleIO

Article Number: 000197735

Article Type: Solution

Last Modified: 08 Apr 2025

Version: 5
