Avamar: Gen4T Hardware: Memory Errors

Summary: This article reviews memory errors reported by Avamar Gen4T nodes.

Symptoms

A suspected memory problem can be confirmed with the following checks:

 

The system log (/var/log/messages) reports the following memory errors:

grep -i "mcelog.*error:" /var/log/messages
[log-messages:109]  ERROR: <0001> kernel error: Jan 17 13:27:08 test-ava-03 mcelog: Running trigger `socket-memory-error-trigger'
[log-messages:109]  ERROR: <0001> kernel error: Jan 17 13:27:40 test-ava-03 mcelog: Running trigger `page-error-trigger'
[log-messages:109]  ERROR: <0001> kernel error: Jan 17 13:31:53 test-ava-03 mcelog: SOCKET Fallback Socket memory error count 6474 exceeded threshold: 776460088 in 24h
...
[log-messages:109]  ERROR: <0001> kernel error: Jan 18 00:05:03 test-ava-03 mcelog: Corrected memory errors on page 6f58f8000 exceed threshold 10 in 24h: 10 in 24h
[log-messages:109]  <0007> kernel info: Jan 18 00:05:04 test-ava-03 kernel: [7199506.363919] mce_notify_irq: 6232 callbacks suppressed
[log-messages:109]  ERROR: <0001> kernel error: Jan 18 00:05:04 test-ava-03 kernel: [7199506.363925] [Hardware Error]: Machine check events logged
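
To gauge how often mcelog fires each trigger, the "Running trigger" lines in the excerpt above can be tallied by trigger name. The following is a minimal sketch, assuming the log format shown above; the function name count_mcelog_triggers is hypothetical and is not part of Avamar or mcelog.

```shell
# Hypothetical helper: tally mcelog trigger names found in a messages-style log.
# Reads the file given as $1, or stdin when no file is given.
count_mcelog_triggers() {
    cat "${1:--}" \
        | grep -o "mcelog: Running trigger \`[^']*'" \
        | sed "s/.*\`\([^']*\)'/\1/" \
        | sort | uniq -c \
        | awk '{print $2, $1}'   # print "trigger-name count"
}
```

On a node this could be run as count_mcelog_triggers /var/log/messages. Each output line is a trigger name followed by the number of times it appears.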
 

The ipmitool command may report no errors on the DIMM slots (four DIMMs shown as two slots):

ipmitool sdr entity 32 
DIMM_Bank0       | 30h | ok  | 32.0 | 23 degrees C
DIMM_Bank1       | 31h | ok  | 32.1 | 24 degrees C
DIMM_Bank2       | 32h | ns  | 32.2 | No Reading
DIMM_Bank3       | 33h | ns  | 32.3 | No Reading
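
To list only the sensor rows whose status column is something other than "ok", the ipmitool output can be filtered on its third pipe-separated field. This is a minimal sketch, assuming the column layout shown above; the function name dimm_banks_not_ok is hypothetical. In the sample output it would list the two "ns" (No Reading) rows, which this article does not treat as errors.

```shell
# Hypothetical helper: print "name status" for every sensor row whose
# status column is not "ok". Expects `ipmitool sdr entity 32` output on stdin.
dimm_banks_not_ok() {
    awk -F'|' '
        NF >= 3 {
            name = $1; status = $3
            # Trim the padding around each column.
            gsub(/^[ \t]+|[ \t]+$/, "", name)
            gsub(/^[ \t]+|[ \t]+$/, "", status)
            if (status != "ok") print name, status
        }'
}
```

A possible invocation is ipmitool sdr entity 32 | dimm_banks_not_ok.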
 

The dmesg output may show signs of memory corruption:

dmesg |grep -i "memory corruption" 
[7689715.473298] mce_notify_irq: 7109 callbacks suppressed
[7689715.473303] [Hardware Error]: Machine check events logged
[7689715.481284] [Hardware Error]: Machine check events logged
[7689723.508392] soft_offline: 0x812b4f: unknown non LRU page type 20000000000100
[7689723.514500] get_any_page: 0x4360d9: unknown zero refcount page type 20000000000000
[7689728.554720] MCE: Killing sudo:18667 due to hardware memory corruption fault at 7f732745a750
[7689728.559849] MCE: Killing sudo:18676 due to hardware memory corruption fault at 7feabc119750
[7689728.564050] MCE: Killing sudo:18678 due to hardware memory corruption fault at 7fe3f0b37750 
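
To see which processes the kernel terminated because of memory corruption faults, the "MCE: Killing" lines above can be summarized by process name. This is a minimal sketch, assuming the message format shown above; the function name mce_killed_processes is hypothetical.

```shell
# Hypothetical helper: count "MCE: Killing" events per process name.
# Expects dmesg output on stdin.
mce_killed_processes() {
    sed -n 's/.*MCE: Killing \([^:]*\):[0-9]* due to hardware memory corruption.*/\1/p' \
        | sort | uniq -c \
        | awk '{print $2, $1}'   # print "process-name count"
}
```

A possible invocation is dmesg | mce_killed_processes. For the three lines in the excerpt above, the result is a single line for sudo with a count of 3.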
 

All memory modules are shown as online:

cat /sys/devices/system/memory/*/state |grep -v online
Note: This command should return no output.
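
The same check can be expressed as a count, which is easier to use in a script than empty output. This is a minimal sketch, assuming the sysfs layout used by the command above (one memoryN directory per block, each with a state file); the function name count_non_online_blocks is hypothetical.

```shell
# Hypothetical helper: count memory blocks whose state is not "online".
# $1 is the sysfs memory directory (defaults to /sys/devices/system/memory).
# The body runs in a subshell so its variables do not leak into the caller.
count_non_online_blocks() (
    dir=${1:-/sys/devices/system/memory}
    count=0
    for f in "$dir"/memory*/state; do
        [ -r "$f" ] || continue          # skip unreadable or unmatched paths
        [ "$(cat "$f")" = "online" ] || count=$((count + 1))
    done
    echo "$count"
)
```

A result of 0 corresponds to the command above returning no output.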
 
 

Querying mcelog on successive days shows a growing number of corrected memory errors:

Day 1:

mcelog --client
Memory errors
SOCKET 0 CHANNEL any DIMM any
corrected memory errors:
        9 total
        9 in 24h
uncorrected memory errors:
        0 total
        0 in 24h 

Day 2:

mcelog --client
Memory errors
SOCKET 0 CHANNEL any DIMM any
corrected memory errors:
        30 total
        21 in 24h
uncorrected memory errors:
        0 total
        0 in 24h 
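
In the two snapshots above, the corrected total grows from 9 to 30, a difference of 21 that matches the "21 in 24h" value in the second snapshot. The comparison can be sketched as follows, assuming each mcelog --client output has been saved to a file; the function names corrected_total and corrected_growth and the file names are hypothetical.

```shell
# Hypothetical helper: print the corrected-error "total" from one saved
# `mcelog --client` snapshot (summed if several sockets are listed).
corrected_total() {
    awk '
        /^corrected memory errors:/   { in_corrected = 1; next }
        /^uncorrected memory errors:/ { in_corrected = 0; next }
        in_corrected && $2 == "total" { sum += $1 }
        END { print sum + 0 }' "$1"
}

# Hypothetical helper: growth in corrected errors from snapshot $1 to snapshot $2.
corrected_growth() {
    echo $(( $(corrected_total "$2") - $(corrected_total "$1") ))
}
```

For example, after mcelog --client > day1.txt on one day and mcelog --client > day2.txt on the next, corrected_growth day1.txt day2.txt prints the increase. A value that keeps rising from day to day is the pattern this article describes.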
 

The arcconf command may report a bus error:

arcconf getconfig 1 
Bus error 

Cause

This indicates a predictive failure of a DIMM.
 

Resolution

Contact Dell Technologies Avamar Support to open a service request (SR) for further investigation of this issue. Reference this KB in the SR.
 

Affected Products

Avamar

Products

Avamar, Avamar Data Store Gen4T, Avamar Server

Article Properties
Article Number: 000063609
Article Type: Solution
Last Modified: 01 May 2025
Version: 5