Dell iDRAC reports fault on incorrect NVMe drive on vSAN cluster with deduplication enabled
Summary: On a Dell vSAN Original Storage Architecture (OSA) cluster with deduplication enabled, drive faults may be reported against the wrong device. The cluster is configured with NVMe devices in the cache tier and SAS devices in the capacity tier. When a SAS drive in the capacity tier fails, Dell iDRAC incorrectly reports the fault on the NVMe drive in the cache tier. ...
Symptoms
The server BMC/Dell iDRAC may report an event similar to the following in the System Event Log (SEL).

The Lifecycle Log (LCL) in Dell iDRAC may report the drive fault as event PDR1001, as noted below.

From the vSAN OSA cluster perspective, you may notice that the entire disk group containing the faulty drive is marked as Permanent Disk Loss (PDL). This is expected behavior when deduplication is enabled on a vSAN OSA cluster: when one of the participating drives in a disk group fails, the entire disk group is by design marked as PDL and Unhealthy.

However, from the iDRAC perspective, the corresponding NVMe device in the cache tier is the one reported as faulty.
Cause
The symptoms noted above are expected behaviors from both the vSAN OSA and the Dell iDRAC perspective. By design, Dell iDRAC reports PDR1001 only for NVMe devices, not for SAS/SATA devices. On the ESXi side, vSAN OSA uses the lsud daemon to write the drive fault LED status to iDRAC over IPMI. Because a single drive failure affects the entire disk group when deduplication is enabled, ESXi sends the drive fault for all drives in the disk group to the BMC/iDRAC, even though the original fault is on one specific capacity-tier drive.
Resolution
When deduplication is enabled on a vSAN OSA cluster, any disk failure in a disk group is expected to fail the entire disk group. Several methods are available to identify the faulty drive that caused the disk group failure, so verify which drive is actually at fault before replacing hardware, using any of the following methods.
1. Log in to the ESXi host using SSH.
2. Open vsandevicemonitord.log under /var/run/log and look for entries like the following. The faulty disk is reported as DISK_UNDER_PERM_ERROR, while the remaining disks in the disk group are marked as DISKGROUP_UNDER_PERM_ERROR.
2025-07-14T09:58:44Z In(14) vsandevicemonitord[2104122]: [768345735872]: Device t10.NVMe____Dell_Express_Flash_PM1725a_800GB_SFF____8302B071E7382500 state is DISKGROUP_UNDER_PERM_ERROR
2025-07-14T09:58:44Z In(14) vsandevicemonitord[2104122]: [768345735872]: Device naa.5002538a486f34a0 state is DISKGROUP_UNDER_PERM_ERROR
2025-07-14T09:58:44Z In(14) vsandevicemonitord[2104122]: [768345735872]: Device naa.5002538a47abb360 state is DISK_UNDER_PERM_ERROR
2025-07-14T09:58:44Z In(14) vsandevicemonitord[2104122]: [768345735872]: Device naa.5002538a47abb400 state is DISKGROUP_UNDER_PERM_ERROR
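As a minimal sketch of this step, the faulty drive can be isolated by filtering the log for the exact DISK_UNDER_PERM_ERROR state, which excludes the DISKGROUP_UNDER_PERM_ERROR lines of the healthy drives. The sample below writes the log lines shown above to a temporary file for illustration only; on a live host you would grep /var/run/log/vsandevicemonitord.log directly.

```shell
# Sample of the vsandevicemonitord.log entries shown above, written to a
# temporary file so the filter can be demonstrated offline (illustrative path).
cat > /tmp/vsandevicemonitord.sample.log <<'EOF'
2025-07-14T09:58:44Z In(14) vsandevicemonitord[2104122]: [768345735872]: Device t10.NVMe____Dell_Express_Flash_PM1725a_800GB_SFF____8302B071E7382500 state is DISKGROUP_UNDER_PERM_ERROR
2025-07-14T09:58:44Z In(14) vsandevicemonitord[2104122]: [768345735872]: Device naa.5002538a486f34a0 state is DISKGROUP_UNDER_PERM_ERROR
2025-07-14T09:58:44Z In(14) vsandevicemonitord[2104122]: [768345735872]: Device naa.5002538a47abb360 state is DISK_UNDER_PERM_ERROR
2025-07-14T09:58:44Z In(14) vsandevicemonitord[2104122]: [768345735872]: Device naa.5002538a47abb400 state is DISKGROUP_UNDER_PERM_ERROR
EOF

# Match only "state is DISK_UNDER_PERM_ERROR" at end of line (this cannot
# match the DISKGROUP_UNDER_PERM_ERROR lines) and print the device identifier.
grep 'state is DISK_UNDER_PERM_ERROR$' /tmp/vsandevicemonitord.sample.log \
  | awk '{print $(NF-3)}'
```

Running this against the sample prints the faulty capacity-tier device, naa.5002538a47abb360.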
3. Locate the physical drive with the following command, using the device identifier from the log.

~# esxcli storage core device physical get -d <NAA ID of the device>
Physical Location: enclosure 3 slot 0
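If the enclosure and slot numbers need to be extracted for scripting (for example, to record them in a change ticket), the "Physical Location" line can be parsed as sketched below. The sample output is written to an illustrative temporary file here; on a live host you would pipe the esxcli command output into awk instead.

```shell
# Sample output of `esxcli storage core device physical get`, saved to an
# illustrative file for offline demonstration (the real command must run on
# the ESXi host).
cat > /tmp/physical_get.sample <<'EOF'
Physical Location: enclosure 3 slot 0
EOF

# Pull the enclosure and slot numbers out of the "Physical Location" line.
awk '/Physical Location:/ {print "enclosure=" $4, "slot=" $6}' /tmp/physical_get.sample
```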
4. Alternatively, run the following command to identify the faulty drive in the disk group. Note the difference in Overall Health: the faulty drive is reported as red(FAILED), while the remaining drives in the disk group are reported as red(FAILED,PROPAGATED).
~# esxcli vsan debug disk list
UUID: 52faedac-87fe-8a16-5117-222bd24dac8a
   Name: t10.NVMe____Dell_Express_Flash_PM1725a_800GB_SFF____8302B071E7382500
   Owner: he-dhcp-pnw-192-168-28-213.helab.in
   Version: 20
   Disk Group: 52faedac-87fe-8a16-5117-222bd24dac8a
   Disk Tier: Cache
   SSD: true
   In Cmmds: true
   In Vsi: true
   Fault Domain: N/A
   Model: Dell Express Flash PM1725a 800GB SFF
   Encryption: false
   Compression: true
   Deduplication: true
   Dedup Ratio: N/A
   Overall Health: red(FAILED,PROPAGATED)
   Metadata Health: green
   Operational Health: red
   Congestion Health:
      State: green
      Congestion Value: 0
      Congestion Area: none
      All Congestion Fields:
   Space Health:

UUID: 52ec5051-d32b-1dca-08eb-49bb5e29d2b4
   Name: naa.5002538a47abb360
   Owner: he-dhcp-pnw-192-168-28-213.helab.in
   Version: 20
   Disk Group: 52faedac-87fe-8a16-5117-222bd24dac8a
   Disk Tier: Capacity
   SSD: true
   In Cmmds: true
   In Vsi: true
   Fault Domain: N/A
   Model: MZILS3T8HMLH0D3
   Encryption: false
   Compression: true
   Deduplication: true
   Dedup Ratio: 0.61x
   Overall Health: red(FAILED)
   Metadata Health: green
   Operational Health: red
   Congestion Health:
      State: green
      Congestion Value: 0
      Congestion Area: none
      All Congestion Fields:
   Space Health:
      State: green
      Capacity: 3387.72 GB
      Used: 121.89 GB
      Reserved: 20.23 GB

UUID: 52be568b-fca4-3494-492c-b6273f7100f7
   Name: naa.5002538a47abb400
   Owner: he-dhcp-pnw-192-168-28-213.helab.in
   Version: 20
   Disk Group: 52faedac-87fe-8a16-5117-222bd24dac8a
   Disk Tier: Capacity
   SSD: true
   In Cmmds: true
   In Vsi: true
   Fault Domain: N/A
   Model: MZILS3T8HMLH0D3
   Encryption: false
   Compression: true
   Deduplication: true
   Dedup Ratio: 0.61x
   Overall Health: red(FAILED,PROPAGATED)
   Metadata Health: green
   Operational Health: red
   Congestion Health:
      State: green
      Congestion Value: 0
      Congestion Area: none
      All Congestion Fields:
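Since the only distinguishing field is Overall Health, a short awk filter can pair each Name with its Overall Health and print only the drive whose failure is not propagated. The sketch below runs against an abbreviated sample of the output (Name and Overall Health lines only) written to an illustrative temporary file; on a live host you would pipe `esxcli vsan debug disk list` into the same awk program.

```shell
# Abbreviated sample of `esxcli vsan debug disk list` output (Name /
# Overall Health pairs only), saved to an illustrative file for offline use.
cat > /tmp/vsan_debug_disk_list.sample <<'EOF'
   Name: t10.NVMe____Dell_Express_Flash_PM1725a_800GB_SFF____8302B071E7382500
   Overall Health: red(FAILED,PROPAGATED)
   Name: naa.5002538a47abb360
   Overall Health: red(FAILED)
   Name: naa.5002538a47abb400
   Overall Health: red(FAILED,PROPAGATED)
EOF

# Remember the most recent Name, then print it when the following
# Overall Health line shows FAILED without PROPAGATED: that is the
# drive that originally failed, not one that inherited the failure.
awk '/^ *Name:/ {name=$2}
     /^ *Overall Health:/ && /FAILED/ && !/PROPAGATED/ {print name}' \
    /tmp/vsan_debug_disk_list.sample
```

Against the sample this prints naa.5002538a47abb360, matching the DISK_UNDER_PERM_ERROR device from the vsandevicemonitord.log method.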
5. Using Skyline Health, the Operational health tab reflects both the permanent disk failure and the propagated permanent disk failure, distinguishing the faulty drive from the rest of the disk group.