IDPA: DP4400 Disk Errors Cause Data Domain File System Instability
Summary: A disk drive in the DP4400 that logs excessive errors can cause the Data Domain file system (FS) to restart and become unstable.
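Symptoms
The following symptoms may be seen:
- The Data Domain file system may be reported as unavailable or may restart repeatedly
- Logs and alerts in Data Domain may report "vol1 is unavailable"
- The Avamar maintenance services fail with MSG_ERR_DDR_ERROR
- Capacity usage is unexpectedly high because Avamar maintenance or Data Domain cleaning fails repeatedly
- iDRAC may show all disks as healthy, while the controller logs may show otherwise
Examples:
Data Domain may log alerts such as: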
ALERT Filesystem EVT-FILESYS-00002: Problem is preventing filesystem from running.
EVT-STORAGE-00020: The Active tier is unavailable.
EVT-FILESYS-00011: DDFS process died; restarting
In the log file /ddr/var/log/debug/ddfs.info, you may see errors such as:
Jun 30 11:48:28 idpa-dd ddfs[8504]: ERROR: MSG-SL-00004: Volume vol1 is unavailable. err:Missing storage device.
Jun 30 11:58:20 idpa-dd ddfs[15962]: ERROR: MSG-SL-00004: Volume vol1 is unavailable. err:Missing storage device.
The log file /ddr/var/log/debug/kern.info may report disk group errors such as:
Jun 30 18:51:08 idpa-dd kernel: [10002271.298276] (E4)DD_RAID: Array [dg2/ppart14] encountered READ I/O errors [57.57 dm-10p5 6000c290ea0836a3178bab0785368300] [dev idx: 0] [stripe: 516562] [gs:ffff880ce56ed210, request:ffff880ce9ebeb40] faults:1
Jun 30 18:51:08 idpa-dd kernel: [10002271.298302] (E4)ERROR: dd_dgrp.c:5731 dd_dgrp_array_internal_notification:: Too many disks failed [1, 14, 0]
Jun 30 18:51:08 idpa-dd kernel: [10002271.298305] (E4)DD_RAID: DiskGroup [dg2] has total failure!
Or further errors, such as:
idpa-dd kernel: [56127713.299919] (E4)sd 2:0:1:0: [sds] tag#0 Sense Key : Medium Error [current]
idpa-dd kernel: [56127713.299921] (E4)sd 2:0:1:0: [sds] tag#0 Add. Sense: No additional sense information
idpa-dd kernel: [56127713.299924] (E4)sd 2:0:1:0: [sds] tag#0 CDB: Read(16) 88 00 00 00 00 01 ed 7c 57 42 00 00 02 01 00 00
idpa-dd kernel: [56127713.299926] (E4)dd_blk_update_request: I/O error, dev sds, sector 8279316290
idpa-dd kernel: [56127713.299949] (E4)DEBUG: dd_array_error.c:512 dd_array_handle_fault:: nr_faults:1 array->level_info.nr_disks:1
idpa-dd kernel: [56127713.299956] (E4)DD_RAID: Array [dg2/ppart8] encountered READ I/O errors [57.57 dm-18p5 6000c2963d6777f9dc56d52993b4f044] [dev idx: 0] [stripe: 806949] [gs:ffff880c10e92220, request:ffff880ce4ec4ca8] faults:1
idpa-dd kernel: [56128442.963940] (E4)DD_RAID: DiskGroup [dg2] has total failure!
idpa-dd kernel: [56128442.963964] (E4)DD_RAID: Array [dg2/ext3]: Suspended
idpa-dd kernel: [56128442.963988] (E4)DD_RAID: Array [dg2/ext3_1]: Suspended
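When kern.info is large, a short script can pull out the affected arrays and disk groups. The following is a minimal sketch and not a Dell tool: the function name `summarize_dd_raid` is hypothetical, the regular expressions are modeled only on the excerpts above, and other DDOS releases may word these messages differently.

```python
import re

# Patterns modeled on the kern.info excerpts in this article (assumption:
# other DDOS releases may use different wording).
ARRAY_ERROR = re.compile(r"DD_RAID: Array \[([^\]]+)\] encountered \w+ I/O errors")
TOTAL_FAILURE = re.compile(r"DD_RAID: DiskGroup \[([^\]]+)\] has total failure")


def summarize_dd_raid(lines):
    """Return (arrays that logged I/O errors, disk groups reporting total failure)."""
    arrays, groups = set(), set()
    for line in lines:
        match = ARRAY_ERROR.search(line)
        if match:
            arrays.add(match.group(1))
        match = TOTAL_FAILURE.search(line)
        if match:
            groups.add(match.group(1))
    return sorted(arrays), sorted(groups)


# Sample lines abbreviated from the excerpt above (device details dropped).
SAMPLE = [
    "Jun 30 18:51:08 idpa-dd kernel: [10002271.298276] (E4)DD_RAID: Array [dg2/ppart14] encountered READ I/O errors faults:1",
    "Jun 30 18:51:08 idpa-dd kernel: [10002271.298305] (E4)DD_RAID: DiskGroup [dg2] has total failure!",
    "idpa-dd kernel: [56128442.963964] (E4)DD_RAID: Array [dg2/ext3]: Suspended",
]

if __name__ == "__main__":
    print(summarize_dd_raid(SAMPLE))
```

To run it against a real log, pass an open file handle for a copy of kern.info in place of `SAMPLE`.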
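Cause
In the IDPA DP4400, the Data Domain virtual machine uses datastores built from volumes and disk drives inside the appliance. If any disk drive in VD02 or VD03 logs errors at a high rate, datastore performance can degrade enough for DDOS to mark the volume as unavailable and attempt to restart the file system.
The physical disk to volume mapping for the DP4400 is as follows: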
| Virtual Disk | RAID Level | Physical Disks | Datastore Name | Description |
|---|---|---|---|---|
| VD01 | RAID 1 | Disks 00:01:00 and 00:01:01 (disks 0 and 1) | DP-appliance-datastore | Datastore location for the virtual machines |
| VD02 | RAID 6 | Disks 00:01:02 to 00:01:09 (disks 2 - 9) | DP-appliance-ddve1 | Location of the DDVE1 datastore for the DDVE file system (found in DP4400S and DP4400 models) |
| VD03 | RAID 6 | Disks 00:01:10 to 00:01:17 (disks 10 - 17) | DP-appliance-ddve2 | Location of the DDVE2 datastore for the DDVE file system (found in DP4400 models only) |
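Once a failing slot number is known, the table above tells you which virtual disk and datastore it belongs to. As an illustration only, the lookup can be written as a small function; `DP4400_LAYOUT` and `locate_slot` are hypothetical names, and the slot ranges are copied from the table.

```python
# Slot ranges copied from the mapping table in this article:
# (first slot, last slot, virtual disk, datastore name).
DP4400_LAYOUT = [
    (0, 1, "VD01", "DP-appliance-datastore"),
    (2, 9, "VD02", "DP-appliance-ddve1"),
    (10, 17, "VD03", "DP-appliance-ddve2"),
]


def locate_slot(slot):
    """Return (virtual disk, datastore) for a decimal slot number."""
    for first, last, virtual_disk, datastore in DP4400_LAYOUT:
        if first <= slot <= last:
            return virtual_disk, datastore
    raise ValueError(f"slot {slot} is outside the DP4400 range 0-17")


if __name__ == "__main__":
    print(locate_slot(11))
```

A disk in slot 11 therefore maps to VD03, which backs the DDVE2 datastore and exists only on the full DP4400 model.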
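Resolution
- Collect logs from the RAID controller (PERC) using one of the following options:
  - Access the DP4400 iDRAC and review the health of the storage subsystem:
    - Review the component status of the volumes and of each physical disk.
    - Review the event log and the Lifecycle Controller log for signs of repeated disk messages.
    - Perform a TSR collection and make sure that the storage logs are selected. See Data Domain: How to Collect TSR Logs on PowerProtect DD3300, DD6900, DD9400, DD9900, and DP4400
  - Use SSH to access the ACM and run the following commands:
    Display the status of each disk:
    Idpa-acm# showfru disk
    Collect the PERC logs from the ACM:
    Idpa-acm# dpacli -host 192.168.100.101 -logs Perc -output perc_logs.tgz
  - Use the CLI to access the ESXi host and run the following commands:
    Idpa-esx# perccli /c0 show termlog > /tmp/ttylog.txt
    Idpa-esx# perccli /c0 show events > /tmp/events.txt
- From these logs, review events such as those shown in the following examples: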
06/17/23 5:02:22: C0:EVT#97309-06/17/23 5:02:22: 113=Unexpected sense: PD 03(e0x20/s3) Path 50000399c882671a, CDB: 88 00 00 00 00 00 7e b4 72 29 00 00 01 d7 00 00, Sense: 3/11/01
06/17/23 5:02:22: C0:Raw Sense for PD 3: 72 03 11 01 00 00 00 34 00 0a 80 00 00 00 00 00 7e b4 72 29 02 06 00 00 80 00 3f 00 80 1e 00 88 81 07 02 0f 01 13 00 00 7f cd 01 38 00 02 00 22 1a 40 00 14 c0 c0 0f 00 7f d2 ff ff
06/17/23 5:02:22: C0:DM_PerformSenseDataRecovery:Medium Error DevId[3] devHandle d RDM=40d47600 retries=0 callback=c0358e30
06/17/23 5:02:22: C0:DM_PerformSenseDataRecovery: Medium Error is for: cmdId=427, ld=1, src=7, cmd=2, lba=2f83aac00, cnt=400, rmwOp=0
06/21/23 5:30:01: C0:EVT#97500-06/21/23 5:30:01: 110=Corrected medium error during recovery on PD 03(e0x20/s3) at d05a2e0a
06/21/23 5:30:01: C0:Issuing write verify pd=03 physArm=1 span=0 startBlk=d05a2e13 numBlks=1
06/21/23 5:30:01: C0:EVT#97501-06/21/23 5:30:01: 110=Corrected medium error during recovery on PD 03(e0x20/s3) at d05a2e13
06/21/23 5:30:01: C0:Issuing write verify pd=03 physArm=1 span=0 startBlk=d05a2e14 numBlks=1

seqNum: 0x00002999
Time: Mon Mar 20 17:53:50 2023
Code: 0x0000005d
Class: 0
Locale: 0x02
Event Description: Patrol Read corrected medium error on PD 0a(e0x20/s10) at 8912fa1c
Event Data:
===========
Device ID: 10
Enclosure Index: 32
Slot Number: 10
LBA: 2299722268

seqNum: 0x0000299a
Time: Mon Mar 20 17:53:50 2023
Code: 0x00000071
Class: 0
Locale: 0x02
Event Description: Unexpected sense: PD 0a(e0x20/s10) Path 50000399e8429da2, CDB: 8f 00 00 00 00 00 89 12 fa 1d 00 00 10 00 00 00, Sense: 3/11/01
Event Data:
===========
Device ID: 10
Enclosure Index: 32
Slot Number: 10
CDB Length: 16
CDB Data: 008f 0000 0000 0000 0000 0000 0089 0012 00fa 001d 0000 0000 0010 0000 0000 0000
Sense Length: 60
Sense Data: 0072 0003 0011 0001 0000 0000 0000 0034 0000 000a 0080 0000 0000 0000 0000 0000 0089 0012 00fa 001d 0002 0006 0000 0000 0080 0000 0000 0000 0080 001e 0000 008f 0081 0007 0002 000a 0000 00d6 0000 0000
008d 003e 0000 00ef 0000 0002 0000 0022 001f 0040 0000 0000 00fd 00fd 000a 0000 008d 003e 00ff 00ff 0000 0000 0000 0000
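Check for patterns and repeated errors. You may see many events logged from a single drive, which indicates the device that is causing the problem: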
$ grep -i "medium error" ttylog.txt
05/08/23 17:30:18: C0:DM_PerformSenseDataRecovery:Medium Error DevId[b] devHandle 15 RDM=40da6800 retries=0 callback=c0358e2c
05/08/23 17:30:18: C0:DM_PerformSenseDataRecovery: Medium Error is for: cmdId=ae, ld=2, src=1, cmd=1, lba=26ca06f8b, cnt=200, rmwOp=0
05/08/23 17:30:21: C0:DM_PerformSenseDataRecovery:Medium Error DevId[b] devHandle 15 RDM=40da6800 retries=0 callback=c0358e2c
05/08/23 17:30:21: C0:DM_PerformSenseDataRecovery: Medium Error is for: cmdId=ae, ld=2, src=1, cmd=1, lba=26ca06f8b, cnt=200, rmwOp=0
05/08/23 17:30:24: C0:DM_PerformSenseDataRecovery:Medium Error DevId[b] devHandle 15 RDM=40da6800 retries=0 callback=c0358e2c
05/08/23 17:30:24: C0:DM_PerformSenseDataRecovery: Medium Error is for: cmdId=ae, ld=2, src=1, cmd=1, lba=26ca06f8b, cnt=200, rmwOp=0
05/08/23 17:30:26: C0:DM_PerformSenseDataRecovery:Medium Error DevId[b] devHandle 15 RDM=40da6800 retries=0 callback=c0358e2c
05/08/23 17:30:26: C0:DM_PerformSenseDataRecovery: Medium Error is for: cmdId=ae, ld=2, src=1, cmd=1, lba=26ca06f8b, cnt=200, rmwOp=0
05/08/23 17:30:28: C0:DM_PerformSenseDataRecovery:Medium Error DevId[b] devHandle 15 RDM=40da6800 retries=0 callback=c0358e2c
05/08/23 17:30:28: C0:DM_PerformSenseDataRecovery: Medium Error is for: cmdId=ae, ld=2, src=1, cmd=1, lba=26ca06f8b, cnt=200, rmwOp=0
05/08/23 17:30:31: C0:DM_PerformSenseDataRecovery:Medium Error DevId[b] devHandle 15 RDM=40da6800 retries=0 callback=c0358e2c
05/08/23 17:30:31: C0:DM_PerformSenseDataRecovery: Medium Error is for: cmdId=ae, ld=2, src=1, cmd=1, lba=26ca06f8b, cnt=200, rmwOp=0
.
.
$ grep -i "medium error" ttylog.txt | wc -l
2168
$ grep -i "command timeout" ttylog.txt
05/16/23 5:36:54: C0:EVT#06386-05/16/23 5:36:54: 267=Command timeout on PD 0b(e0x20/s11) Path 5000039aa853e82e, CDB: 88 00 00 00 00 03 7b 82 d6 49 00 00 00 68 00 00
05/16/23 5:36:54: C0:EVT#06387-05/16/23 5:36:54: 267=Command timeout on PD 0b(e0x20/s11) Path 5000039aa853e82e, CDB: 88 00 00 00 00 02 e9 7e 90 f2 00 00 00 3f 00 00
05/16/23 5:36:54: C0:EVT#06388-05/16/23 5:36:54: 267=Command timeout on PD 0b(e0x20/s11) Path 5000039aa853e82e, CDB: 88 00 00 00 00 02 e9 7e 8e 7e 00 00 00 6d 00 00
05/16/23 5:36:54: C0:EVT#06389-05/16/23 5:36:54: 267=Command timeout on PD 0b(e0x20/s11) Path 5000039aa853e82e, CDB: 88 00 00 00 00 03 7b 82 d9 5e 00 00 00 61 00 00
05/16/23 5:36:54: C0:EVT#06390-05/16/23 5:36:54: 267=Command timeout on PD 0b(e0x20/s11) Path 5000039aa853e82e, CDB: 88 00 00 00 00 03 7b 82 d9 33 00 00 00 2b 00 00
05/16/23 5:36:54: C0:EVT#06391-05/16/23 5:36:54: 267=Command timeout on PD 0b(e0x20/s11) Path 5000039aa853e82e, CDB: 88 00 00 00 00 03 7b 82 e6 c3 00 00 00 70 00 00
05/16/23 5:36:54: C0:EVT#06392-05/16/23 5:36:54: 267=Command timeout on PD 0b(e0x20/s11) Path 5000039aa853e82e, CDB: 88 00 00 00 00 03 7b 82 e5 55 00 00 00 60 00 00
05/16/23 5:36:54: C0:EVT#06393-05/16/23 5:36:54: 267=Command timeout on PD 0b(e0x20/s11) Path 5000039aa853e82e, CDB: 88 00 00 00 00 02 e9 7e 8e f0 00 00 00 7f 00 00
05/16/23 5:36:54: C0:EVT#06394-05/16/23 5:36:54: 267=Command timeout on PD 0b(e0x20/s11) Path 5000039aa853e82e, CDB: 88 00 00 00 00 03 81 91 08 00 00 00 00 4e 00 00
.
.
$ grep -i "command timeout" ttylog.txt | wc -l
58
In the example above, the disk in slot 11 (DevID b) is logging medium errors and command timeouts at a high rate.
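The grep counts above give totals for the whole log, not per drive. A per-device tally makes the offending disk stand out when more than one drive appears. The sketch below is an assumption-laden example rather than a supported utility: `count_errors_by_device` is a hypothetical name and the patterns are modeled only on the ttylog lines shown in this article.

```python
import re
from collections import Counter

# Patterns modeled on the ttylog excerpts in this article (assumption:
# other PERC firmware versions may format these lines differently).
MEDIUM = re.compile(r"Medium Error DevId\[([0-9a-fA-F]+)\]")
TIMEOUT = re.compile(r"Command timeout on PD ([0-9a-fA-F]+)\(")


def count_errors_by_device(lines):
    """Count events per (decimal device ID, error kind)."""
    counts = Counter()
    for line in lines:
        for kind, pattern in (("medium", MEDIUM), ("timeout", TIMEOUT)):
            match = pattern.search(line)
            if match:
                # Device IDs in the PERC log are hexadecimal.
                counts[(int(match.group(1), 16), kind)] += 1
    return counts


# Sample lines taken from the grep output above.
SAMPLE = [
    "05/08/23 17:30:18: C0:DM_PerformSenseDataRecovery:Medium Error DevId[b] devHandle 15 RDM=40da6800 retries=0 callback=c0358e2c",
    "05/08/23 17:30:18: C0:DM_PerformSenseDataRecovery: Medium Error is for: cmdId=ae, ld=2, src=1, cmd=1, lba=26ca06f8b, cnt=200, rmwOp=0",
    "05/08/23 17:30:21: C0:DM_PerformSenseDataRecovery:Medium Error DevId[b] devHandle 15 RDM=40da6800 retries=0 callback=c0358e2c",
    "05/16/23 5:36:54: C0:EVT#06386-05/16/23 5:36:54: 267=Command timeout on PD 0b(e0x20/s11) Path 5000039aa853e82e, CDB: 88 00 00 00 00 03 7b 82 d6 49 00 00 00 68 00 00",
]

if __name__ == "__main__":
    for (device, kind), total in sorted(count_errors_by_device(SAMPLE).items()):
        print(device, kind, total)
```

Only the lines that carry a DevId are counted, so the medium error total is roughly half of what `grep -i "medium error" | wc -l` reports, because each event also produces a "Medium Error is for:" detail line.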
Note: In the PERC logs, the DevID is shown in hexadecimal. DevID "0b" is decimal 11, so this refers to slot 11.
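The hexadecimal conversion is easy to get wrong by eye, especially for IDs such as 0a through 0f. As a small sketch (the helper name `parse_pd` is hypothetical), the `PD 0b(e0x20/s11)` token from the log can be decoded as follows; it assumes the token layout seen in the excerpts above, where the device ID and enclosure are hexadecimal and the slot is decimal.

```python
import re

# Token layout as seen in this article's logs: PD <hex dev>(e<hex encl>/s<dec slot>)
PD_TOKEN = re.compile(r"PD ([0-9a-fA-F]+)\(e(0x[0-9a-fA-F]+)/s(\d+)\)")


def parse_pd(text):
    """Decode the first PD token in a log line, or return None if absent."""
    match = PD_TOKEN.search(text)
    if match is None:
        return None
    return {
        "dev_id": int(match.group(1), 16),     # hexadecimal in the log
        "enclosure": int(match.group(2), 16),  # e0x20 is enclosure 32
        "slot": int(match.group(3)),           # already decimal
    }


if __name__ == "__main__":
    print(parse_pd("267=Command timeout on PD 0b(e0x20/s11) Path 5000039aa853e82e"))
```

The decoded enclosure and slot values are the ones that appear as "Enclosure Index" and "Slot Number" in the events.txt records.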
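The following examples show other disk drive problems, such as disk resets logged by the controller.
This example shows a drive that resets continuously and eventually degrades the affected virtual disk: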
2022-01-21 01:58:39 PDR87 Disk 12 in Backplane 1 of Integrated RAID Controller 1 was reset.
2022-01-21 01:58:39 LOG007 The previous log entry was repeated 27 times.
2022-01-21 01:56:05 PDR87 Disk 12 in Backplane 1 of Integrated RAID Controller 1 was reset.
2022-01-21 01:56:05 LOG007 The previous log entry was repeated 988 times.
.
.
2022-01-21 04:00:36 545196 PDR87 Disk 12 in Backplane 1 of Integrated RAID Controller 1 was reset.
2022-01-21 03:58:39 545193 PDR87 Disk 12 in Backplane 1 of Integrated RAID Controller 1 was reset.
2022-01-21 03:56:05 545190 PDR87 Disk 12 in Backplane 1 of Integrated RAID Controller 1 was reset.
.
.
2022-01-25 19:21:49 545547 PDR3 Disk 12 in Backplane 1 of Integrated RAID Controller 1 is not functioning correctly.
2022-01-25 19:21:49 545548 VDR56 Redundancy of Virtual Disk 1 on Integrated RAID Controller 1 has been degraded.
2022-01-25 19:21:49 545549 PDR87 Disk 12 in Backplane 1 of Integrated RAID Controller 1 was reset.
A drive flagged with a predictive failure can also cause problems:
2022-09-05 23:01:56 11008 PDR87 Disk 1 in Backplane 1 of RAID Controller in Slot 8 was reset.
2022-09-05 22:55:28 11003 PDR87 Disk 1 in Backplane 1 of RAID Controller in Slot 8 was reset
2022-09-05 23:02:23 11010 PDR87 Disk 1 in Backplane 1 of RAID Controller in Slot 8 was reset.
2022-09-05 23:01:56 11009 PDR16 Predictive failure reported for Disk 1 in Backplane 1 of RAID Controller in Slot 8.
2022-09-05 23:03:28 11012 PDR54 A disk media error on Disk 1 in Backplane 1 of RAID Controller in Slot 8 was corrected during recovery.
2022-09-05 23:02:28 11011 PDR16 Predictive failure reported for Disk 1 in Backplane 1 of RAID Controller in Slot 8.
2022-09-06 10:22:26 11034 PDR54 A disk media error on Disk 1 in Backplane 1 of RAID Controller in Slot 8 was corrected during recovery.
2022-09-06 00:11:27 11029 PDR54 A disk media error on Disk 1 in Backplane 1 of RAID Controller in Slot 8 was corrected during recovery.
2022-09-05 23:18:32 11015 PDR54 A disk media error on Disk 1 in Backplane 1 of RAID Controller in Slot 8 was corrected during recovery.
2022-09-05 23:06:26 11014 PDR16 Predictive failure reported for Disk 1 in Backplane 1 of RAID Controller in Slot 8.
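Lifecycle Controller entries like the ones above can be tallied per disk and message ID to show which drive is generating resets (PDR87), predictive failure reports (PDR16), or corrected media errors (PDR54). This is a rough sketch under stated assumptions: `count_pdr_events` is a hypothetical name, the line layout is inferred from the two excerpts in this article (with or without a sequence number after the timestamp), and LOG007 "repeated N times" lines are ignored, so the totals are a lower bound.

```python
import re
from collections import Counter

# Layout inferred from the excerpts in this article: timestamp, optional
# sequence number, PDR message ID, then text containing "Disk N in Backplane".
ENTRY = re.compile(
    r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2} (?:\d+ )?(PDR\d+) .*?Disk (\d+) in Backplane"
)


def count_pdr_events(lines):
    """Count physical disk (PDR) events per (disk number, message ID)."""
    counts = Counter()
    for line in lines:
        match = ENTRY.search(line)
        if match:
            counts[(int(match.group(2)), match.group(1))] += 1
    return counts


# Sample lines taken from the excerpts above.
SAMPLE = [
    "2022-01-21 01:58:39 PDR87 Disk 12 in Backplane 1 of Integrated RAID Controller 1 was reset.",
    "2022-01-21 01:58:39 LOG007 The previous log entry was repeated 27 times.",
    "2022-01-25 19:21:49 545548 VDR56 Redundancy of Virtual Disk 1 on Integrated RAID Controller 1 has been degraded.",
    "2022-09-05 23:01:56 11009 PDR16 Predictive failure reported for Disk 1 in Backplane 1 of RAID Controller in Slot 8.",
    "2022-09-05 23:03:28 11012 PDR54 A disk media error on Disk 1 in Backplane 1 of RAID Controller in Slot 8 was corrected during recovery.",
]

if __name__ == "__main__":
    for (disk, message_id), total in sorted(count_pdr_events(SAMPLE).items()):
        print(disk, message_id, total)
```

Virtual disk messages such as VDR56 are deliberately skipped, because the goal here is to identify the physical drive.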
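- Use one of the following methods to view and identify the appliance disk details:
  - Use iDRAC or the TSR data to review the drive details
  - In the ACM operating system, display the disk details with the command: showfru disk
- Contact Dell Support to create a service request, and reference this article to confirm the disk replacement.
Note: To reduce the risk of further problems, it is recommended to disable the Data Domain file system before replacing the disk.
This can be done by running the following command from the Data Domain CLI: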
filesys disable
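Caution: If multiple disk drives appear as failed or show excessive errors, do not proactively replace any disk until you have contacted Dell Support. Too many disk failures can cause data loss.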
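Additional Information
Caution: Any questions or problems with obtaining or interpreting the logs should be referred to the IDPA SYS team or the PowerEdge server team.
Note: If the problem is confirmed to be a faulty disk drive and the Data Domain file system cannot start as a result, a workaround may be to physically pull or unseat the faulty disk from its slot. Another option is to try taking the disk offline with the perccli utility. This causes the controller to mark the disk as missing, so the excessive error logging should stop and allow the Data Domain file system to stabilize.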
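Steps to take the disk offline:
- Log in to the ESXi host as the root user
- Run the command: perccli /c0 show
- In the output, find the affected drive and note its enclosure and slot IDs
- Run this command to set the drive offline, using the values from the output above: perccli /c0[/ex]/sx set offline
  For example, to take offline the disk in slot 2 of enclosure 32: perccli /c0/e32/s2 set offline
- After the disk is replaced, the drive is automatically marked online again.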
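The disk should still be replaced as soon as possible, but this can provide stability and restore service while allowing time for the part to be delivered and replaced.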
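To avoid typing the wrong enclosure or slot, the offline command from the steps above can be assembled from the decimal values noted in the perccli output. This is only an illustrative helper: `offline_command` is a hypothetical name, it builds the command string and does not run it, and the command form is the one given in this article.

```python
def offline_command(enclosure, slot, controller=0):
    """Build the perccli 'set offline' command string for one drive.

    The values are the decimal enclosure and slot IDs noted from
    'perccli /c0 show'. Nothing is executed here.
    """
    for name, value in (("controller", controller), ("enclosure", enclosure), ("slot", slot)):
        if not isinstance(value, int) or value < 0:
            raise ValueError(f"{name} must be a non-negative integer, got {value!r}")
    return f"perccli /c{controller}/e{enclosure}/s{slot} set offline"


if __name__ == "__main__":
    print(offline_command(32, 2))
```

Review the generated command against the perccli output before running it on the ESXi host, because taking the wrong drive offline in a degraded RAID 6 group can lead to data loss.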
Affected Products
PowerProtect Data Protection Appliance, PowerProtect DP4400, Integrated Data Protection Appliance Family, PowerProtect Data Protection Hardware, Integrated Data Protection Appliance Software
Article Properties
Article Number: 000216674
Article Type: Solution
Last Modified: 07 May 2026
Version: 3