Avamar: Gen4T Hardware: Memory Errors
Summary: This article provides an overview of memory errors reported by Avamar Gen4T nodes.
Symptoms
A suspected memory issue can be confirmed by the following:
The system log (/var/log/messages) reports memory errors such as these:
grep -i "mcelog.*error:" /var/log/messages
[log-messages:109] ERROR: <0001> kernel error: Jan 17 13:27:08 test-ava-03 mcelog: Running trigger `socket-memory-error-trigger'
[log-messages:109] ERROR: <0001> kernel error: Jan 17 13:27:40 test-ava-03 mcelog: Running trigger `page-error-trigger'
[log-messages:109] ERROR: <0001> kernel error: Jan 17 13:31:53 test-ava-03 mcelog: SOCKET Fallback Socket memory error count 6474 exceeded threshold: 776460088 in 24h
...
[log-messages:109] ERROR: <0001> kernel error: Jan 18 00:05:03 test-ava-03 mcelog: Corrected memory errors on page 6f58f8000 exceed threshold 10 in 24h: 10 in 24h
[log-messages:109] <0007> kernel info: Jan 18 00:05:04 test-ava-03 kernel: [7199506.363919] mce_notify_irq: 6232 callbacks suppressed
[log-messages:109] ERROR: <0001> kernel error: Jan 18 00:05:04 test-ava-03 kernel: [7199506.363925] [Hardware Error]: Machine check events logged
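To see whether the mcelog messages are spread over several days, the entries can be tallied per day. The following is a minimal sketch, not a supported Avamar tool: the function name count_mcelog_by_day is hypothetical, and it assumes each relevant line carries a "Mon DD HH:MM:SS host mcelog:" stamp as in the excerpt above.

```shell
#!/bin/sh
# Hypothetical helper: count mcelog lines per calendar day.
# Assumption: lines contain a "Mon DD HH:MM:SS host mcelog:" stamp;
# lines without that stamp (for example kernel: lines) are ignored.
count_mcelog_by_day() {
    awk '
        match($0, /[A-Z][a-z][a-z] [ 0-9][0-9] [0-9:]+ [^ ]+ mcelog:/) {
            day = substr($0, RSTART, 6)          # e.g. "Jan 17"
            if (!(day in n)) order[++k] = day    # remember first-seen order
            n[day]++
        }
        END { for (i = 1; i <= k; i++) print order[i] ": " n[order[i]] }
    ' "${1:-/var/log/messages}"
}

# Usage on a node (reads /var/log/messages by default):
#   count_mcelog_by_day
```

A count that keeps growing from one day to the next is consistent with the trend described later in this article.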
The ipmitool command does not report any errors on the DIMM banks (four DIMMs as two banks):
ipmitool sdr entity 32
DIMM_Bank0 | 30h | ok | 32.0 | 23 degrees C
DIMM_Bank1 | 31h | ok | 32.1 | 24 degrees C
DIMM_Bank2 | 32h | ns | 32.2 | No Reading
DIMM_Bank3 | 33h | ns | 32.3 | No Reading
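The status column can also be checked mechanically. The sketch below is illustrative only: flag_dimm_sensors is a hypothetical name, and it assumes the pipe-separated layout shown above, with "ok" and "ns" (no reading) treated as non-error states.

```shell
#!/bin/sh
# Hypothetical helper: read `ipmitool sdr entity 32` output on stdin and
# print every sensor whose status column is neither "ok" nor "ns".
# Returns 1 if at least one such sensor was found, 0 otherwise.
flag_dimm_sensors() {
    awk -F'|' '
        NF >= 3 {
            name = $1; status = $3
            gsub(/^ +| +$/, "", name)      # trim surrounding spaces
            gsub(/^ +| +$/, "", status)
            if (status != "ok" && status != "ns") {
                print name ": " status
                bad = 1
            }
        }
        END { exit bad + 0 }
    '
}

# Usage on a node:
#   ipmitool sdr entity 32 | flag_dimm_sensors
```

For the output shown above, the function prints nothing, which matches the symptom that ipmitool reports no DIMM errors.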
The dmesg output shows signs of memory corruption:
dmesg |grep -i "memory corruption"
[7689715.473298] mce_notify_irq: 7109 callbacks suppressed
[7689715.473303] [Hardware Error]: Machine check events logged
[7689715.481284] [Hardware Error]: Machine check events logged
[7689723.508392] soft_offline: 0x812b4f: unknown non LRU page type 20000000000100
[7689723.514500] get_any_page: 0x4360d9: unknown zero refcount page type 20000000000000
[7689728.554720] MCE: Killing sudo:18667 due to hardware memory corruption fault at 7f732745a750
[7689728.559849] MCE: Killing sudo:18676 due to hardware memory corruption fault at 7feabc119750
[7689728.564050] MCE: Killing sudo:18678 due to hardware memory corruption fault at 7fe3f0b37750
All memory modules are online:
cat /sys/devices/system/memory/*/state |grep -v online
Note: This command should not return any output.
Querying mcelog on consecutive days shows an increasing number of corrected memory errors:
Day 1:
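If the command above does return output, it shows the state but not which memory block it belongs to. The following sketch names the affected blocks; list_offline_memory_blocks is a hypothetical name, and the optional directory argument exists only so the function can be tried against a test directory.

```shell
#!/bin/sh
# Hypothetical helper: print each memory block whose state is not "online".
# Assumption: blocks appear as memoryN directories with a "state" file,
# as under /sys/devices/system/memory on Linux.
list_offline_memory_blocks() {
    base=${1:-/sys/devices/system/memory}
    for f in "$base"/memory*/state; do
        [ -r "$f" ] || continue            # skip if the glob matched nothing
        state=$(cat "$f")
        [ "$state" = "online" ] || echo "${f%/state}: $state"
    done
}

# Usage on a node:
#   list_offline_memory_blocks
```

As with the original command, no output means that every memory block is online.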
mcelog --client
Memory errors
SOCKET 0 CHANNEL any DIMM any
corrected memory errors:
9 total
9 in 24h
uncorrected memory errors:
0 total
0 in 24h
Day 2:
mcelog --client
Memory errors
SOCKET 0 CHANNEL any DIMM any
corrected memory errors:
30 total
21 in 24h
uncorrected memory errors:
0 total
0 in 24h
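The day-over-day growth can be computed from saved copies of the output. This is a rough sketch under stated assumptions: corrected_total is a hypothetical name, the snapshot file names are placeholders, and the parser expects the "corrected memory errors:" heading followed by an "N total" line, as in the output above.

```shell
#!/bin/sh
# Hypothetical helper: read `mcelog --client` output on stdin and print the
# sum of the "total" counts listed under "corrected memory errors:".
corrected_total() {
    awk '
        /^[ \t]*corrected memory errors:/   { want = 1; next }
        /^[ \t]*uncorrected memory errors:/ { want = 0; next }
        want && $2 == "total"               { sum += $1; want = 0 }
        END { print sum + 0 }
    '
}

# Usage with two saved snapshots (file names are placeholders):
#   mcelog --client > mcelog-day1.txt      # first day
#   mcelog --client > mcelog-day2.txt      # following day
#   d1=$(corrected_total < mcelog-day1.txt)
#   d2=$(corrected_total < mcelog-day2.txt)
#   echo "new corrected errors: $((d2 - d1))"
```

With the two outputs above, the totals are 9 and 30, a difference of 21, which agrees with the "21 in 24h" line reported on Day 2.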
Querying the controller with the arcconf command may report a bus error:
arcconf getconfig 1
Bus error
Cause
This indicates a predictive failure of a DIMM.
Resolution
Create a service request with Dell Technologies Avamar Support to investigate this issue further. Reference this knowledge base article in the SR.
Affected Products
Avamar
Products
Avamar, Avamar Data Store Gen4T, Avamar Server
Article Properties
Article Number: 000063609
Article Type: Solution
Last Modified: 01 May 2025
Version: 5