Data Domain: Multiple Drives Fail During Failover of HA Systems | CA01 CA02 FW issue
Summary: Data Domain High Availability (HA) systems can experience drive failures during an HA failover due to a known drive firmware issue. On some systems, the file system may also fail to start up after a failover or reboot. ...
Symptoms
The component failure is limited to 8 TB drives with the CA01 or CA02 firmware revision. Multiple RAID groups (disk groups) are usually impacted; each group can be single degraded (one disk failure) or double degraded (two disk failures).
Systems that do not reboot encounter a file system interruption but should recover on their own, with multiple disk rebuilds ongoing and pending. Systems that do reboot force a failover, which may leave the file system stuck during startup.
Applies to:
- DD9400 and DD9900 Models Only
- 8 TB hard drives in external storage enclosures with firmware (FW) version CA01 or CA02
Fix:
- Fixed DDOS versions: DDOS 7.11.x, 7.10.1.0, 7.7.5.1, and later include the CA04 firmware.
- A Minimal Disruptive Upgrade (MDU) is available for all other DDOS 7.x versions.
- Link to MDU: Read me + Download: DDOS 7.X hard drive Firmware Minimally Disruptive Upgrade (MDU) Package - November 2022 (Logging in as a registered Dell Support user is required to view the document)
Note: The file system must be disabled when applying the MDU.
Disk Failure Symptoms:
- Disk logs report sense key 4/0x29/0xcd during a SCSI WRITE (cdb 0x8a); a single write command error causes the DD_RAID module to fail the disk.
Sep 16 06:17:59 DD9900 kernel: [11647695.019070] (E4)scsi16: (ffff88fe1522d800) (0/5/10000) chnl/tgt/lun 0/232/0 result 0x2, cdb 0x8a:00000001498b4728:00000308, sense 4/0x29/0xcd - Unit Attention Retry-able
Sep 16 06:20:58 DD9900 kernel: [11647874.161940] (E4)scsi16: (ffff88b96b72cc00) (1/5/10000) chnl/tgt/lun 0/246/0 result 0x2, cdb 0x8a:0000000149adb300:00000308, sense 4/0x29/0xcd - Unit Attention Retry-able
Sep 16 06:20:58 DD9900 kernel: [11647874.161997] (E4)scsi16: (ffff88b946a08e00) (1/5/10000) chnl/tgt/lun 0/237/0 result 0x2, cdb 0x8a:000000014a777478:00000308, sense 4/0x29/0xcd - Unit Attention Retry-able
- DD_RAID fails disks due to "WRITE I/O" errors.
Sep 16 06:17:59 DD9900 kernel: [11647695.020655] (E4)DD_RAID: Failing working disk [6.35 dm-27p3 WSD48SRA 254:3635] from DiskGroup dg19
Sep 16 06:20:59 DD9900 kernel: [11647875.122961] (E4)DD_RAID: Failing working disk [2.32 dm-25p3 WSD49GCR 254:403] from DiskGroup dg4
Sep 16 06:21:54 DD9900 kernel: [11647930.659786] (E4)DD_RAID: Failing working disk [2.39 dm-46p3 WSD48TEG 254:739] from DiskGroup dg2
Sep 16 06:21:58 DD9900 kernel: [11647934.612549] (E4)DD_RAID: Failing working disk [6.43 dm-233p3 WSD49GG6 254:3731] from DiskGroup dg18
Sep 16 06:22:04 DD9900 kernel: [11647940.363248] (E4)DD_RAID: Failing working disk [6.21 dm-219p3 WSD47KYS 254:3507] from DiskGroup dg18
Sep 16 06:22:04 DD9900 kernel: [11647940.477630] (E4)DD_RAID: Failing working disk [6.5 dm-242p3 WSD4B13V 254:3875] from DiskGroup dg17
Sep 16 06:22:04 DD9900 kernel: [11647940.651261] (E4)DD_RAID: Failing working disk [6.18 dm-259p3 WSD47EWA 254:4147] from DiskGroup dg17
Sep 16 06:22:04 DD9900 kernel: [11647940.726575] (E4)DD_RAID: Failing working disk [6.15 dm-265p3 WSD49BGL 254:4243] from DiskGroup dg16
Sep 16 06:22:05 DD9900 kernel: [11647941.100980] (E4)DD_RAID: Failing working disk [6.26 dm-257p3 WSD49ART 254:4115] from DiskGroup dg16
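To confirm that a system is hitting this signature, the kernel messages above can be searched for the 4/0x29/0xcd sense key and the DD_RAID disk-failure events. The lines below are a minimal sketch, assuming the kernel logs (for example, kern.info from a support bundle) have been copied to a Linux host; the file name is an assumption about which log is in use.
grep "sense 4/0x29/0xcd" kern.info
grep "DD_RAID: Failing working disk" kern.info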
DDFS Unable to Start Up Symptoms:
- In ddfs.info, the messages below repeat for a long period during DDFS startup. Startup does not progress until the DDFS process is terminated, forcing a failover to the peer node.
09/15 21:49:21.962018 [a0cc980] SYSTEM_STARTUP: ===== completed <SegStore> - time (1663292961) =====
09/15 21:49:21.962028 [a0cc980] SYSTEM_STARTUP: ===== starting <CC-Agent> - time (1663292961) =====
09/15 21:57:11.699754 [7fc335f111f0] cp_scrub_partitions: cc is not initialized yet,Skipping scheduling of CP to scrub
09/15 21:59:11.819754 [7fc335f111f0] cp_scrub_partitions: cc is not initialized yet,Skipping scheduling of CP to scrub
09/15 22:01:11.939754 [7fc335f111f0] cp_scrub_partitions: cc is not initialized yet,Skipping scheduling of CP to scrub
...
09/16 02:01:26.339755 [7fc335f111f0] cp_scrub_partitions: cc is not initialized yet,Skipping scheduling of CP to scrub
09/16 02:03:26.459755 [7fc335f111f0] cp_scrub_partitions: cc is not initialized yet,Skipping scheduling of CP to scrub
09/16 02:05:26.579755 [7fc335f111f0] cp_scrub_partitions: cc is not initialized yet,Skipping scheduling of CP to scrub
Cause
Root cause of the disk failure condition:
The drive’s DRAM cache buffer encounters a false data integrity error under a random I/O workload. The drive manufacturer has provided a firmware fix to resolve this issue.
Resolution
Workaround
If a reboot or failover has occurred, DD_RAID cannot "failback" the failed drives. In this case, allow the traditional (parity) disk rebuilds to complete, and disable garbage collection (GC, the filesys clean process) until all disk rebuilds are completed. If the file system has a problem starting up after a reboot or a failover, consult a DDFS TSE before terminating the DDFS process.
If a reboot or failover has not occurred, DD_RAID can perform a "failback" rebuild of the failed disks. This is a manual operation using "dd_raidtool" in bash. Before initiating a disk "failback," the failed disk slots must be power cycled; contact Dell Technical Support to have this done.
Based on practical experience, allow disk rebuilds that are more than 50% complete to finish rather than switching to failback reconstruction.
After all rebuilds are completed, drives that are still failed can be "unfailed" if their respective slots have been power cycled.
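As a rough sketch of the monitoring side of this workaround from the DDOS CLI: the "filesys clean" commands also appear in the MDU steps below, while "filesys clean set schedule never" and "disk show state" are assumed to be available on the installed DDOS release.
# filesys clean status
# filesys clean stop
# filesys clean set schedule never    (keeps GC from starting again while rebuilds run)
# disk show state                     (watch for disks leaving the reconstructing state)
Restore the original cleaning schedule once all rebuilds have completed.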
Fix
The disk firmware version CA04 is now available as an MDU patch in the form of a DDOS upgrade package. Contact Support for a copy of the RPM. It is located in /auto/cores/drive-firmware-rpm/drive-firmware-CA04.rpm.
FAQs:
- Can the upgrade be done with the Data Domain file system online?
No, DDFS must be offline (disabled) during the FW upgrade.
- Is a reboot required?
No, a reboot is not required.
- Can the FW be applied on the passive node?
No, it cannot be applied on the passive node. It must be run on the active node only.
- How long does it take to complete the drive-firmware-CA04.rpm upgrade for 180-240 drives (four DS60 enclosures)?
The upgrade runs in parallel mode and takes 10 - 15 minutes if there are no issues.
- If all drives are not updated, does the system automatically update the non-CA04 drives?
No, the update has to be rerun or can be done manually. See "Additional Information" below.
- Does the drive-firmware-CA04.rpm patch also have to be applied on non-HA Data Domain systems, since those may also have drives on the older FW?
Yes, it is recommended to also apply drive-firmware-CA04.rpm to non-HA Data Domain systems with the 8 TB drives.
- If a drive fails during the update, can it be unfailed?
If a drive fails before or during the FW update, check the disk for error history (for example, "disk show reliability-data"). If the disk has any errors, it must remain failed and must be replaced. If the disk does not have any errors, run the command "disk unfail <disk.id>" to unfail the drive and make it a spare. If a failed drive is replaced with a drive that has the older CA01/CA02 firmware, the drive must be manually updated. See "How to manually update replacement drives" in the "Additional Information" section below.
- Why is it necessary to suspend activity in disk group 2 (dg2), even when DDFS is down?
DDOS continues to have access to the EXT3 mounts that exist in dg2.
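As an illustration of the check-then-unfail flow described in the FAQ above (disk 2.14 is a hypothetical disk ID used only for this example):
# disk show reliability-data
# disk unfail 2.14
Only unfail the disk if its row in the reliability data shows no errors; otherwise leave it failed and replace it.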
Applying the MDU package:
- Connect to the Active Node of the HA system. The update does not work from the passive or standby node.
- Schedule downtime with the user, as DDFS must be disabled during the FW update. Run the following commands to check current alerts and address them as needed, stop the cleaning process if it is running, and disable the file system.
# alerts show current
# filesys clean status
# filesys clean stop
# filesys disable
- Check the autosupport for CA01/CA02 disks that are part of >=dg2 for systems that did not undergo storage migration. For systems that did undergo storage migration, the disk group with the ext3 arrays may not be dg2.
Search for ext3; it appears under "DD_RAID Histogram for dgXX", where XX is the DG number. See the example in the "Additional Information" section below. If the dg2/dgXX disks have CA01/CA02 firmware, the array must be suspended temporarily during the MDU upgrade process; failure to suspend all I/O activity in the disk group can trigger another failure. Open a Support case for assistance with suspending the disk group. If dg2/dgXX does not contain CA01/CA02 disks, a support case is not necessary; go to step 4.
- Upload the drive-firmware-CA04.rpm from the DD System Manager UI.
- Upgrade the disks. Run the following command and wait for it to finish.
# system upgrade start drive-firmware-CA04.rpm
- Wait ~10 minutes.
- Verify that all disks are upgraded. If disks still show up with CA01/CA02 firmware, repeat steps 5 and 6.
# disk show hardware
- Check the current disk state. If there are multiple disk failures, contact Support for assistance. For a single disk failure, check the disk for any error history and, if there are no errors, unfail the disk.
# disk unfail <disk.id>
- Verify alerts and enable DDFS.
# alerts show current
# filesys enable
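For convenience, the CLI portion of the procedure above condenses to the following sequence, run on the active node. The disk group suspension from the autosupport-check step, when required, is still handled by Support and is not shown.
# alerts show current
# filesys clean status
# filesys clean stop
# filesys disable
# system upgrade start drive-firmware-CA04.rpm
# disk show hardware
# disk unfail <disk.id>    (only for a single failed disk with no error history)
# alerts show current
# filesys enable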
Additional Information
How to check for dg2/dgXX disks that hold the ext3 array and have CA01/CA02 firmware.
For non-storage migrated systems, dg2 has the ext3 arrays.
Search the autosupport for the following sections:
- DD_RAID Histogram
DD_RAID Histogram for dg2 0xafc318cb033dc226
DG:dg2 UUID:0xafc318cb033dc226 MajorNr:61 MajorNrEx:62 GrpNr:10 State:Complete Status:Reassembled Index:0
Total Disks:14 Working Disks:14 Max Disk Failures:2 Sectors:148681617408 Options:0x10000100 Type:StandAlone
Primary Pool UUID:0x0:0 Tier Type:0x2 Host SN:CNWS300198000G MG UUID:0x0
Array [ext3] (active): [raid-type 0] [(0x0, 0x0) options] [NVR:N/N] [256KB stripe] [117458432 sectors] ID[0xb6fbb5a5a61ecf9]
Array [ppart] (active): [raid-type 6] [(0x10, 0x3) options] [NVR:N/N] [4608KB stripe] [148681617408 sectors] ID[0xfb32c1339fafc87b]
- Storage Show All (This command can also be run on the DD CLI)
dg2    2.1-2.3, 2.13-2.15, 2.25-2.27,    14    7.2 TiB
       2.37-2.39, 2.49-2.50
- Disk Show Hardware (This command can be run on the DD CLI)
2.1   A0  SEAGATE  STCRSEI1CLAR8000  CA01  WSD4R8GS  7.2 TiB  SASe  DG118000919
2.2   A1  SEAGATE  STCRSEI1CLAR8000  CA01  WSD4REMW  7.2 TiB  SASe  DG118000919
2.3   A2  SEAGATE  STCRSEI1CLAR8000  CA01  WSD4LM5C  7.2 TiB  SASe  DG118000919
2.13  B0  SEAGATE  STCRSEI1CLAR8000  CA01  WSD4SMHX  7.2 TiB  SASe  DG118000919
2.14  B1  SEAGATE  STCRSEI1CLAR8000  CA01  WSD4RF04  7.2 TiB  SASe  DG118000919
2.15  B2  SEAGATE  STCRSEI1CLAR8000  CA01  WSD4QHQE  7.2 TiB  SASe  DG118000919
2.25  C0  SEAGATE  STCRSEI1CLAR8000  CA01  WSD4RE9Y  7.2 TiB  SASe  DG118000919
2.26  C1  SEAGATE  STCRSEI1CLAR8000  CA01  WSD4LMME  7.2 TiB  SASe  DG118000919
2.27  C2  SEAGATE  STCRSEI1CLAR8000  CA01  WSD4REW8  7.2 TiB  SASe  DG118000919
2.37  D0  SEAGATE  STCRSEI1CLAR8000  CA01  WSD4SMHM  7.2 TiB  SASe  DG118000919
2.38  D1  SEAGATE  STCRSEI1CLAR8000  CA01  WSD4QHWR  7.2 TiB  SASe  DG118000919
2.39  D2  SEAGATE  STCRSEI1CLAR8000  CA01  WSD4R862  7.2 TiB  SASe  DG118000919
2.49  E0  SEAGATE  STCRSEI1CLAR8000  CA01  WSD4SSKK  7.2 TiB  SASe  DG118000919
2.50  E1  SEAGATE  STCRSEI1CLAR8000  CA01  WSD4SV53  7.2 TiB  SASe  DG118000919
2.51  E2  SEAGATE  STCRSEI1CLAR8000  CA01  WSD4R944  7.2 TiB  SASe  DG118000919
In the example above, the drives have CA01 firmware. A Support case must be opened so Dell Technologies can assist with suspending the disk group (dg2) containing the ext3 array before the MDU upgrade is applied.
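When the autosupport has been saved as a text file on a Linux host, the same check can be done with standard text tools; the following is a minimal sketch, where autosupport.txt is a placeholder file name.
grep -n "DD_RAID Histogram for dg" autosupport.txt
grep -n "Array \[ext3\]" autosupport.txt
grep -nE "CA01|CA02" autosupport.txt
Match the ext3 array to its disk group by line proximity, then check whether the disks in that group report CA01/CA02 firmware in the Disk Show Hardware section.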
For storage-migrated systems, the disk group containing the ext3 array may not be dg2. Search the autosupport for the following sections:
Licenses (STORAGE-MIGRATION-FOR-DATADOMAIN-SYSTEMS)
Licenses
--------
System locking-id: APX00123456789
Licensing scheme: EMC Electronic License Management System (ELMS) node-locked mode

Capacity licenses:
##  Feature          Shelf Model   Capacity     Type       State   Expiration Date  Note
--  ---------------  ------------  -----------  ---------  ------  ---------------  ----
1   CAPACITY-ACTIVE  HIGH_DENSITY  1396.98 TiB  permanent  active  n/a
--  ---------------  ------------  -----------  ---------  ------  ---------------  ----
Licensed Active Tier capacity: 1396.98 TiB*
* Depending on the hardware platform, usable filesystem capacities may vary.

Feature licenses:
##  Feature                                   Count  Type        State   Expiration Date  Note
--  ----------------------------------------  -----  ----------  ------  ---------------  ---------------
1   REPLICATION                               1      permanent   active  n/a
2   VTL                                       1      permanent   active  n/a
3   DDBOOST                                   1      permanent   active  n/a
4   RETENTION-LOCK-GOVERNANCE                 1      permanent   active  n/a
5   ENCRYPTION                                1      permanent   active  n/a
6   I/OS                                      1      permanent   active  n/a
7   RETENTION-LOCK-COMPLIANCE                 1      permanent   active  n/a
8   STORAGE-MIGRATION-FOR-DATADOMAIN-SYSTEMS  6      evaluation  grace   2023-11-20
--  ----------------------------------------  -----  ----------  ------  ---------------  ---------------
License file last modified at : 2022/08/29 11:02:13.
- DD_RAID Histogram
DD_RAID Histogram for dg23 0x323d6b863ae21b8f
DG:dg23 UUID:0x323d6b863ae21b8f MajorNr:61 MajorNrEx:62 GrpNr:18 State:Complete Status:Reassembled Index:0
Total Disks:14 Working Disks:14 Max Disk Failures:2 Sectors:161373947904 Options:0x10000100 Type:StandAlone
Primary Pool UUID:0x0:0 Tier Type:0x2 Host SN:CNWS30021O001N MG UUID:0x0
Array [ext3] (active): [raid-type 0] [(0x0, 0x0) options] [NVR:N/N] [256KB stripe] [117458432 sectors] ID[0x16222e80737dc6bf]
Array [ppart] (active): [raid-type 6] [(0x10, 0x3) options] [NVR:N/N] [4608KB stripe] [161373947904 sectors] ID[0x8febacd8140b2c05]
Storage Show All (This command can be run on the DD CLI)
dg23   6.1-6.3, 6.13-6.15, 6.25-6.27,    14    7.2 TiB
       6.37-6.39, 6.49-6.50
Disk Show Hardware (This command can be run on the DD CLI)
6.1   A0   HITACHI  H04728T8CLAR8000  A430  VYH2S3SS  7.2 TiB  SASe  DG118000785
6.2   A1   HITACHI  H04728T8CLAR8000  A430  VYH2RVSS  7.2 TiB  SASe  DG118000785
6.3   A2   HITACHI  H04728T8CLAR8000  A430  VYH2K9KS  7.2 TiB  SASe  DG118000785
6.13  B0   HITACHI  H04728T8CLAR8000  A430  VYH2JJBS  7.2 TiB  SASe  DG118000785
6.14  B1   HITACHI  H04728T8CLAR8000  A430  VYH1Y83S  7.2 TiB  SASe  DG118000785
6.15  B2   HITACHI  H04728T8CLAR8000  A430  VYH2RNGS  7.2 TiB  SASe  DG118000785
6.25  C0   HITACHI  H04728T8CLAR8000  A430  VYH1DN8S  7.2 TiB  SASe  DG118000785
6.26  C1   HITACHI  H04728T8CLAR8000  A430  VYH2124S  7.2 TiB  SASe  DG118000785
6.27  C2   HITACHI  H04728T8CLAR8000  A430  VYH0ZM6S  7.2 TiB  SASe  DG118000785
6.47  D10  HITACHI  H04728T8CLAR8000  A430  VYH1XGJS  7.2 TiB  SASe  DG118000785
6.48  D11  HITACHI  H04728T8CLAR8000  A430  VYH20VHS  7.2 TiB  SASe  DG118000785
6.49  E0   HITACHI  H04728T8CLAR8000  A430  VYH2G5XS  7.2 TiB  SASe  DG118000785
Since these drives do not have CA01 or CA02 firmware, a support case is not required. Go to step 4 of the MDU upgrade steps in the "Resolution" section above.