PowerFlex: DIMM Hardware Issues Cause High CPU Utilization and SDS Disconnection

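Summary: Certain memory issues (that is, DIMM issues) can cause a CMCI storm, which in turn can cause the SDS to disconnect. This specific issue occurs because the operating system does not respond appropriately to routine correctable-memory notifications. It can occur when a RAM DIMM module in the server fails, but other hardware issues can lead to the same condition.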


Symptoms

DIMM failures can be found in the iDRAC or operating system logs, for example:
 kernel: {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
 kernel: {1}[Hardware Error]: It has been corrected by h/w and requires no further action
 kernel: {1}[Hardware Error]: event severity: corrected
 kernel: {1}[Hardware Error]:  Error 0, type: corrected
 kernel: {1}[Hardware Error]:  fru_text: A1
 kernel: {1}[Hardware Error]:   section_type: memory error
 kernel: {1}[Hardware Error]:   error_status: 0x0000000000000400
 kernel: {1}[Hardware Error]:   physical_address: 0x0000000ad6a38ac0
 kernel: {1}[Hardware Error]:   node: 0 card: 0 module: 0 rank: 0 bank: 1 device: 1 row: 58311 column: 712
 kernel: {1}[Hardware Error]:   error_type: 13, scrub corrected error
 kernel: {1}[Hardware Error]:   DIMM location: not present. DMI handle: 0x0000
 kernel: {2}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 65534
 kernel: {2}[Hardware Error]: It has been corrected by h/w and requires no further action
 kernel: {2}[Hardware Error]: event severity: corrected
 kernel: {2}[Hardware Error]:  Error 0, type: corrected
 kernel: {2}[Hardware Error]:   section type: unknown, xxxxxxxx-xxxx-xxxx-xxxx-000xxxxxxx1b
 kernel: {2}[Hardware Error]:  Error 1, type: corrected
 kernel: {2}[Hardware Error]:   section type: unknown, xxxxxxxx-xxxx-xxxx-xxxx-000xxxxxxx1b
 kernel: EDAC skx MC0: HANDLING MCE MEMORY ERROR
 kernel: EDAC skx MC0: CPU 0: Machine Check Event: 0x0 Bank 1: 0x940000000000009f
 kernel: EDAC skx MC0: TSC 0xcdaff277a3653a
 kernel: EDAC skx MC0: ADDR 0xad6a38ac0
 kernel: EDAC skx MC0: MISC 0x0
 kernel: EDAC skx MC0: PROCESSOR 0:0x50654 TIME 1669993821 SOCKET 0 APIC 0x0
 kernel: EDAC MC0: 0 CE memory read error on CPU_SrcID#0_MC#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0xad6a38 offset:0xac0 grain:32 syndrome:0x0 -  err_code:0x0000:0x009f socket:0 imc:0 rank:0 bg:1 ba:1 row:0xe3c7 col:0x2c8)
 kernel: mce: [Hardware Error]: Machine check events logged
 mcelog: Hardware event. This is not a software error.
 mcelog: MCE 0
 mcelog: CPU 0 BANK 1 TSC cdaff277a3653a
 mcelog: ADDR ad6a38ac0
 mcelog: TIME 1669993821 Fri Dec  2 15:10:21 2022
 mcelog: MCG status:
 mcelog: MCi status:
 mcelog: Corrected error
 mcelog: Error enabled
 mcelog: MCi_ADDR register valid
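
To judge whether the corrected errors cluster on one DIMM, the EDAC summary lines in a log like the one above can be tallied per DIMM label. The following is a minimal sketch, assuming the log uses the same `EDAC MC<n>: ... CE ... on <label> (` line shape shown in the excerpt; the function name `count_corrected_errors` and the regular expression are illustrative and not part of any Dell or kernel tooling.

```python
import re
from collections import Counter

# Matches the EDAC summary line from the excerpt, for example:
#   kernel: EDAC MC0: 0 CE memory read error on CPU_SrcID#0_MC#0_Chan#0_DIMM#0 (channel:0 ...
# Assumption: other kernels may format this line differently.
EDAC_CE = re.compile(r"EDAC MC\d+: \d+ CE .*? on (?P<label>\S+) \(")


def count_corrected_errors(lines):
    """Count EDAC corrected-error (CE) log lines per DIMM label."""
    counts = Counter()
    for line in lines:
        match = EDAC_CE.search(line)
        if match:
            counts[match.group("label")] += 1
    return counts
```

Feeding the function the lines of a saved kernel log (for example, the output of `open(path)`) returns a `Counter` keyed by DIMM label; a single label with a much higher count than the others points at the module to raise with the hardware vendor.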
 
After the hardware issue is detected, a CMCI storm is reported:
Dec  8 08:28:51 node01 kernel: CMCI storm detected: switching to poll mode
Dec  8 08:33:50 node01 kernel: CMCI storm subsided: switching to interrupt mode
(...)                
Dec 10 03:19:03 node01 kernel: CMCI storm subsided: switching to interrupt mode
Dec 10 03:19:03 node01 kernel: CMCI storm detected: switching to poll mode


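High CPU utilization can cause the SDS process to stall IO (which affects overall system IO latency), or even cause the SDS to disconnect from the MDM. If this happens during an ongoing rebuild, or while another SDS is experiencing a similar issue, it can lead to a DU (data unavailability) situation.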

Cause

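Intel CPUs can suffer an "interrupt storm" during DIMM errors. According to the Red Hat KB:

Starting with 45 nm Intel 64 processors on which CPUID reports DisplayFamily_DisplayModel as 06H_1AH, the processor can report information on corrected machine-check errors and deliver a programmable interrupt for software to respond to MC errors, referred to as the corrected machine-check error interrupt (CMCI). Intel hardware can raise an interrupt when the error level exceeds a programmable threshold. If the errors persist, the CPU receives interrupts continuously, or a storm, at a rate high enough to affect the CPU's ability to do useful work. When this happens, the kernel disables the CMCI mechanism and reverts to the more classic method of periodically polling for machine-check errors. When the error rate drops, the kernel re-enables CMCI.

For more information, see: https://access.redhat.com/solutions/2710451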
 

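This issue can cause a CMCI storm. The storm can also be triggered when operating system features and software intercept the correctable errors, rather than Dell iDRAC capturing and handling them. This usually happens when EDAC and CMCI are both enabled.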
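
As a read-only check of whether the operating system is set up to handle correctable errors itself, the loaded EDAC modules and any `mce=` boot options can be listed. This is a sketch under assumptions: it treats any loaded module whose name contains `edac` (such as `skx_edac`) as an EDAC driver, it cannot see EDAC support built directly into the kernel, and the helper names are hypothetical. The Linux kernel documentation lists `mce=no_cmci` and `mce=ignore_ce` as boot options, but any change should be made with guidance from the operating system vendor.

```python
def loaded_edac_modules(proc_modules_text):
    """Return names of loaded modules whose name contains 'edac'.

    Expects the text of /proc/modules, where the module name is the
    first whitespace-separated field on each line.
    """
    names = []
    for line in proc_modules_text.splitlines():
        fields = line.split()
        if fields and "edac" in fields[0]:
            names.append(fields[0])
    return names


def mce_options(cmdline_text):
    """Return the values of any mce= parameters on the kernel command line.

    Expects the text of /proc/cmdline.
    """
    prefix = "mce="
    return [tok[len(prefix):] for tok in cmdline_text.split() if tok.startswith(prefix)]
```

On a Linux host, the inputs can be read with `open("/proc/modules").read()` and `open("/proc/cmdline").read()`. An empty list from `mce_options` means no `mce=` option was passed at boot, so CMCI handling is at the kernel default.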

 

Resolution

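Place the affected SDS into maintenance mode and/or remove it from the cluster to mitigate the impact on the whole system.

Contact the hardware vendor to check for any underlying hardware issues. If no hardware issue is detected, and in particular if only correctable errors are occurring, contact the operating system vendor and request assistance with disabling EDAC and CMCI.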

Additional Information

Impacted Versions

N/A - not a PowerFlex issue

Fixed In Version

N/A - hardware issue

Affected Products

PowerFlex rack, VxFlex Ready Nodes, ScaleIO
Article Properties
Article Number: 000197735
Article Type: Solution
Last Modified: 08 Apr 2025
Version:  5