Avamar: Gen4T Hardware: Memory Errors

Summary: This article describes memory errors reported on Avamar Gen4T nodes.


Symptoms

A suspected memory problem can be confirmed with the following checks:

 

The system log (/var/log/messages) reports the following memory errors:

grep -i "mcelog.*error:" /var/log/messages
[log-messages:109]  ERROR: <0001> kernel error: Jan 17 13:27:08 test-ava-03 mcelog: Running trigger `socket-memory-error-trigger'
[log-messages:109]  ERROR: <0001> kernel error: Jan 17 13:27:40 test-ava-03 mcelog: Running trigger `page-error-trigger'
[log-messages:109]  ERROR: <0001> kernel error: Jan 17 13:31:53 test-ava-03 mcelog: SOCKET Fallback Socket memory error count 6474 exceeded threshold: 776460088 in 24h
...
[log-messages:109]  ERROR: <0001> kernel error: Jan 18 00:05:03 test-ava-03 mcelog: Corrected memory errors on page 6f58f8000 exceed threshold 10 in 24h: 10 in 24h
[log-messages:109]  <0007> kernel info: Jan 18 00:05:04 test-ava-03 kernel: [7199506.363919] mce_notify_irq: 6232 callbacks suppressed
[log-messages:109]  ERROR: <0001> kernel error: Jan 18 00:05:04 test-ava-03 kernel: [7199506.363925] [Hardware Error]: Machine check events logged
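
As a rough aid for reading this output, the sketch below tallies how often each mcelog trigger fired. The function name count_mcelog_triggers is hypothetical, and the pattern assumes the "Running trigger" line format shown in the sample above.

```shell
# Hypothetical helper (not part of Avamar or mcelog): reads syslog lines on
# stdin and prints "<trigger-name> <count>" for every mcelog trigger seen.
# Assumes the "mcelog: Running trigger `name'" format shown in the sample.
count_mcelog_triggers() {
    sed -n "s/.*mcelog: Running trigger \`\([^']*\)'.*/\1/p" |
        sort | uniq -c | awk '{ print $2, $1 }'
}
```

It could be run as `count_mcelog_triggers < /var/log/messages`; a count that keeps growing for the same trigger is consistent with the symptom described here.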
 

The ipmitool command does not report any errors on the DIMM banks (four DIMMs as two banks):

ipmitool sdr entity 32 
DIMM_Bank0       | 30h | ok  | 32.0 | 23 degrees C
DIMM_Bank1       | 31h | ok  | 32.1 | 24 degrees C
DIMM_Bank2       | 32h | ns  | 32.2 | No Reading
DIMM_Bank3       | 33h | ns  | 32.3 | No Reading
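
To make this check repeatable, the sketch below filters the ipmitool output and prints only sensor rows whose status is something other than ok or ns. The function name dimm_banks_not_ok is hypothetical, and treating ns ("No Reading" in the sample above) as a non-error is an assumption drawn from that sample, not a statement about every platform.

```shell
# Hypothetical helper: reads `ipmitool sdr entity 32` output on stdin and
# prints "<sensor-name> <status>" for rows whose status is not "ok" or "ns".
# Assumption: "ns" (no reading) is not an error, as in the sample above.
dimm_banks_not_ok() {
    awk -F'|' '
        NF < 3 { next }                       # skip lines that are not sensor rows
        {
            name = $1; status = $3
            gsub(/^[[:space:]]+|[[:space:]]+$/, "", name)
            gsub(/[[:space:]]/, "", status)
            if (status != "ok" && status != "ns") print name, status
        }
    '
}
```

Run as `ipmitool sdr entity 32 | dimm_banks_not_ok`; no output matches the situation this article describes, where the sensors look healthy even though mcelog reports errors.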
 

The dmesg output shows signs of memory corruption:

dmesg |grep -i "memory corruption" 
[7689715.473298] mce_notify_irq: 7109 callbacks suppressed
[7689715.473303] [Hardware Error]: Machine check events logged
[7689715.481284] [Hardware Error]: Machine check events logged
[7689723.508392] soft_offline: 0x812b4f: unknown non LRU page type 20000000000100
[7689723.514500] get_any_page: 0x4360d9: unknown zero refcount page type 20000000000000
[7689728.554720] MCE: Killing sudo:18667 due to hardware memory corruption fault at 7f732745a750
[7689728.559849] MCE: Killing sudo:18676 due to hardware memory corruption fault at 7feabc119750
[7689728.564050] MCE: Killing sudo:18678 due to hardware memory corruption fault at 7fe3f0b37750 
 

All memory modules are online:

cat /sys/devices/system/memory/*/state |grep -v online
Note: This command should not return any output.
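
The command above prints only the state values, not which memory block they belong to. The sketch below is one possible variant that names each block that is not online; the function name offline_memory_blocks and its optional directory argument are hypothetical and exist only to make the check easy to try out.

```shell
# Hypothetical helper: prints "<block> <state>" for every memory block whose
# state file does not read "online".
# $1: sysfs memory directory (defaults to the path used in this article).
offline_memory_blocks() {
    dir=${1:-/sys/devices/system/memory}
    for f in "$dir"/*/state; do
        [ -r "$f" ] || continue               # skip unreadable or missing files
        state=$(cat "$f")
        [ "$state" = "online" ] ||
            printf '%s %s\n' "$(basename "$(dirname "$f")")" "$state"
    done
}
```

As with the original command, empty output means that every memory block is online.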
 
 

Querying mcelog on consecutive days shows an increasing number of corrected memory errors:

Day 1:

mcelog --client
Memory errors
SOCKET 0 CHANNEL any DIMM any
corrected memory errors:
        9 total
        9 in 24h
uncorrected memory errors:
        0 total
        0 in 24h 

Day 2:

mcelog --client
Memory errors
SOCKET 0 CHANNEL any DIMM any
corrected memory errors:
        30 total
        21 in 24h
uncorrected memory errors:
        0 total
        0 in 24h 
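
A minimal sketch for tracking this trend: the function below extracts the lifetime count of corrected errors from the mcelog client output so that the value can be recorded once a day and compared. The name corrected_total is hypothetical, and the parsing assumes the output layout shown in the two samples above.

```shell
# Hypothetical helper: reads `mcelog --client` output on stdin and prints the
# "total" count of corrected memory errors, summed over all reported sockets.
# Assumes the section headers and "<n> total" lines shown in the samples.
corrected_total() {
    awk '
        /^[[:space:]]*corrected memory errors:/   { in_corrected = 1; next }
        /^[[:space:]]*uncorrected memory errors:/ { in_corrected = 0; next }
        in_corrected && $2 == "total"             { sum += $1 }
        END { print sum + 0 }
    '
}
```

Run as `mcelog --client | corrected_total`. For the two samples above it yields 9 and 30, an increase of 21 corrected errors between the two days.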
 

Querying the controller with the arcconf command may report a bus error:

arcconf getconfig 1 
Bus error 

Cause

This indicates a predictive failure of a DIMM.
 

Resolution

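Contact Dell Technologies Avamar Support by creating a service request so that this issue can be investigated further. Reference this knowledge base article in the SR.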
 

Affected Products

Avamar

Products

Avamar, Avamar Data Store Gen4T, Avamar Server
Article Properties
Article Number: 000063609
Article Type: Solution
Last Modified: 01 May 2025
Version:  5