Dell iDRAC reports fault on incorrect NVMe drive on vSAN cluster with deduplication enabled

Summary: On a Dell vSAN Original Storage Architecture (OSA) cluster with deduplication enabled, drive faults may be reported on the NVMe devices. The cluster is configured with NVMe devices as the cache tier and SAS drives as the capacity tier. The drive faults are incorrectly reported on the NVMe cache-tier drives even though the failure is on a SAS drive in the capacity tier. ...

Symptoms

The server BMC (Dell iDRAC) may report a drive fault event in the System Event Log (SEL), as shown below.

Dell PowerEdge iDRAC System Event Log (SEL) reporting a drive fault

The Lifecycle Log (LCL) in Dell iDRAC may report the drive fault as PDR1001 as noted below.

Dell PowerEdge iDRAC Lifecycle Log (LCL) reporting a fault on a drive

From the vSAN OSA cluster perspective, the entire disk group that contains the faulty drive may be marked as Permanent Device Loss (PDL). This is expected behavior when deduplication is enabled on vSAN OSA: when one of the drives in a disk group fails, the entire disk group is, by design, marked as PDL and Unhealthy.

vSAN cluster disk group showing PDL

However, from the iDRAC perspective, the corresponding NVMe device from the cache tier is reported to be faulty.   

Cause

The symptoms noted above are expected behavior from both the vSAN OSA and the Dell iDRAC perspective. By design, Dell iDRAC reports PDR1001 only on NVMe devices and not on SAS/SATA devices. vSAN OSA uses the lsud daemon to write the drive fault LED status to iDRAC over IPMI. In this case, even though the original fault is on a specific drive in the capacity tier, it affects the entire disk group because deduplication is enabled. Hence, ESXi sends the drive fault for all the drives in the disk group to the BMC/iDRAC.
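The reasoning above can be illustrated with a small toy model. This is only a sketch of the behavior described in this article, not of how ESXi, lsud, or iDRAC are implemented: the `Drive` class and both function names are hypothetical, and the bus types are assumed from the cluster description (NVMe cache, SAS capacity).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Drive:
    name: str  # device identifier as seen by ESXi
    bus: str   # "NVMe" or "SAS" (assumed from the cluster description)
    tier: str  # "Cache" or "Capacity"


# Disk group members taken from the example output in this article.
DISK_GROUP = [
    Drive("t10.NVMe____Dell_Express_Flash_PM1725a_800GB_SFF____8302B071E7382500", "NVMe", "Cache"),
    Drive("naa.5002538a47abb360", "SAS", "Capacity"),
    Drive("naa.5002538a47abb400", "SAS", "Capacity"),
]


def drives_flagged_by_esxi(disk_group, failed_name):
    """Hypothetical model: with deduplication enabled, one failed member
    causes the fault status to be sent for every drive in the disk group."""
    if not any(d.name == failed_name for d in disk_group):
        raise ValueError(f"{failed_name} is not a member of this disk group")
    return list(disk_group)


def drives_logged_as_pdr1001(flagged):
    """Hypothetical model: iDRAC records PDR1001 only for NVMe devices."""
    return [d for d in flagged if d.bus == "NVMe"]


# The SAS capacity drive fails, yet only the NVMe cache drive ends up in the LCL.
flagged = drives_flagged_by_esxi(DISK_GROUP, "naa.5002538a47abb360")
for drive in drives_logged_as_pdr1001(flagged):
    print(drive.tier, drive.name)
```

In this model the failed SAS drive never appears in the PDR1001 list, which matches the symptom: the fault is visible in iDRAC only on the NVMe cache device.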

Resolution

When deduplication is enabled on a vSAN OSA cluster, any disk failure in a disk group is expected to fail the entire disk group. Several methods are available to identify the faulty drive that caused the disk group failure, so it is worth double-checking the drive faults before replacement using any of the following methods.

  1. Log in to the ESXi host using SSH.
  2. Open vsandevicemonitord.log under /var/run/log and look for entries like the following. The faulty disk is reported as DISK_UNDER_PERM_ERROR and the remaining disks in the disk group are marked as DISKGROUP_UNDER_PERM_ERROR.

2025-07-14T09:58:44Z In(14) vsandevicemonitord[2104122]: [768345735872]: Device t10.NVMe____Dell_Express_Flash_PM1725a_800GB_SFF____8302B071E7382500 state is DISKGROUP_UNDER_PERM_ERROR
2025-07-14T09:58:44Z In(14) vsandevicemonitord[2104122]: [768345735872]: Device naa.5002538a486f34a0 state is DISKGROUP_UNDER_PERM_ERROR
2025-07-14T09:58:44Z In(14) vsandevicemonitord[2104122]: [768345735872]: Device naa.5002538a47abb360 state is DISK_UNDER_PERM_ERROR
2025-07-14T09:58:44Z In(14) vsandevicemonitord[2104122]: [768345735872]: Device naa.5002538a47abb400 state is DISKGROUP_UNDER_PERM_ERROR

  3. Locate the drive with the command below, using its device identifier.
~# esxcli storage core device physical get -d <NAA ID of the device>
Physical Location: enclosure 3 slot 0

  4. Alternatively, run the following command to identify the faulty drive in the disk group. The Overall Health field is reported as red(FAILED) for the faulty drive and as red(FAILED,PROPAGATED) for the remaining drives in the disk group.

~# esxcli vsan debug disk list
UUID: 52faedac-87fe-8a16-5117-222bd24dac8a
   Name: t10.NVMe____Dell_Express_Flash_PM1725a_800GB_SFF____8302B071E7382500
   Owner: he-dhcp-pnw-192-168-28-213.helab.in
   Version: 20
   Disk Group: 52faedac-87fe-8a16-5117-222bd24dac8a
   Disk Tier: Cache
   SSD: true
   In Cmmds: true
   In Vsi: true
   Fault Domain: N/A
   Model: Dell Express Flash PM1725a 800GB SFF
   Encryption: false
   Compression: true
   Deduplication: true
   Dedup Ratio: N/A
   Overall Health: red(FAILED,PROPAGATED)
   Metadata Health: green
   Operational Health: red
   Congestion Health:
         State: green
         Congestion Value: 0
         Congestion Area: none
         All Congestion Fields:
   Space Health:


UUID: 52ec5051-d32b-1dca-08eb-49bb5e29d2b4
   Name: naa.5002538a47abb360
   Owner: he-dhcp-pnw-192-168-28-213.helab.in
   Version: 20
   Disk Group: 52faedac-87fe-8a16-5117-222bd24dac8a
   Disk Tier: Capacity
   SSD: true
   In Cmmds: true
   In Vsi: true
   Fault Domain: N/A
   Model: MZILS3T8HMLH0D3
   Encryption: false
   Compression: true
   Deduplication: true
   Dedup Ratio: 0.61x
   Overall Health: red(FAILED)
   Metadata Health: green
   Operational Health: red
   Congestion Health:
         State: green
         Congestion Value: 0
         Congestion Area: none
         All Congestion Fields:
   Space Health:
         State: green
         Capacity: 3387.72 GB
         Used: 121.89 GB
         Reserved: 20.23 GB
 
UUID: 52be568b-fca4-3494-492c-b6273f7100f7
   Name: naa.5002538a47abb400
   Owner: he-dhcp-pnw-192-168-28-213.helab.in
   Version: 20
   Disk Group: 52faedac-87fe-8a16-5117-222bd24dac8a
   Disk Tier: Capacity
   SSD: true
   In Cmmds: true
   In Vsi: true
   Fault Domain: N/A
   Model: MZILS3T8HMLH0D3
   Encryption: false
   Compression: true
   Deduplication: true
   Dedup Ratio: 0.61x
   Overall Health: red(FAILED,PROPAGATED)
   Metadata Health: green
   Operational Health: red
   Congestion Health:
         State: green
         Congestion Value: 0
         Congestion Area: none
         All Congestion Fields:

 

  5. In Skyline Health, the Operation health tab reflects the permanent disk failure and the propagated permanent disk failure.

vSAN Skyline Health Diagnostics reflecting device health 
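The command-line checks in steps 2 and 4 can also be scripted. The following is a minimal sketch in Python, assuming the log line format and the `esxcli vsan debug disk list` field layout shown in this article. The function names are hypothetical, the sample disk list is abbreviated to the two fields the parser uses, and other ESXi versions may format these outputs differently.

```python
import re

# Matches the "Device <id> state is <STATE>" part of a vsandevicemonitord.log line.
LOG_STATE_RE = re.compile(r"Device\s+(\S+)\s+state is\s+(\S+)")

SAMPLE_LOG = """\
2025-07-14T09:58:44Z In(14) vsandevicemonitord[2104122]: [768345735872]: Device t10.NVMe____Dell_Express_Flash_PM1725a_800GB_SFF____8302B071E7382500 state is DISKGROUP_UNDER_PERM_ERROR
2025-07-14T09:58:44Z In(14) vsandevicemonitord[2104122]: [768345735872]: Device naa.5002538a486f34a0 state is DISKGROUP_UNDER_PERM_ERROR
2025-07-14T09:58:44Z In(14) vsandevicemonitord[2104122]: [768345735872]: Device naa.5002538a47abb360 state is DISK_UNDER_PERM_ERROR
2025-07-14T09:58:44Z In(14) vsandevicemonitord[2104122]: [768345735872]: Device naa.5002538a47abb400 state is DISKGROUP_UNDER_PERM_ERROR
"""

# Abbreviated to the Name and Overall Health fields of each disk entry.
SAMPLE_DISK_LIST = """\
UUID: 52faedac-87fe-8a16-5117-222bd24dac8a
   Name: t10.NVMe____Dell_Express_Flash_PM1725a_800GB_SFF____8302B071E7382500
   Overall Health: red(FAILED,PROPAGATED)
UUID: 52ec5051-d32b-1dca-08eb-49bb5e29d2b4
   Name: naa.5002538a47abb360
   Overall Health: red(FAILED)
UUID: 52be568b-fca4-3494-492c-b6273f7100f7
   Name: naa.5002538a47abb400
   Overall Health: red(FAILED,PROPAGATED)
"""


def classify_log_lines(lines):
    """Split devices into (faulty, propagated) lists based on the logged state."""
    faulty, propagated = [], []
    for line in lines:
        match = LOG_STATE_RE.search(line)
        if not match:
            continue
        device, state = match.groups()
        if state == "DISK_UNDER_PERM_ERROR" and device not in faulty:
            faulty.append(device)
        elif state == "DISKGROUP_UNDER_PERM_ERROR" and device not in propagated:
            propagated.append(device)
    return faulty, propagated


def overall_health_by_device(text):
    """Map each device Name to its Overall Health value."""
    health, name = {}, None
    for raw in text.splitlines():
        key, sep, value = raw.strip().partition(":")
        if not sep:
            continue
        if key == "Name":
            name = value.strip()
        elif key == "Overall Health" and name is not None:
            health[name] = value.strip()
    return health


def failed_not_propagated(health):
    """Devices marked FAILED without the PROPAGATED qualifier."""
    return [n for n, h in health.items()
            if "FAILED" in h.upper() and "PROPAGATED" not in h.upper()]


faulty, propagated = classify_log_lines(SAMPLE_LOG.splitlines())
print("From log:", faulty)
print("From disk list:", failed_not_propagated(overall_health_by_device(SAMPLE_DISK_LIST)))
```

Both checks should point at the same device. The resulting identifier can then be passed to the `esxcli storage core device physical get -d` command from step 3 to find the physical slot before replacing the drive.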

Affected Products

Dell EMC vSAN Ready Nodes, VMware VSAN
Article Properties
Article Number: 000348652
Article Type: Solution
Last Modified: 25 Jul 2025
Version:  3