Dell Unity: LUN appears online but is unavailable for I/O
Summary: A LUN is unavailable for I/O even though it appears to be online.
Symptoms
No faults are present in Unisphere.
Initiators and paths are good.
The host is unable to access the LUN.
The host logs show continuous failed attempts to access the LUN, for example on an ESX host:
cpu17:5647140)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237: NMP device "<LUN_naa_id>" state in doubt; requested fast path state update...
cpu32:2098231)WARNING: NMP: nmp_DeviceRetryCommand:133: Device "<LUN_naa_id>": awaiting fast path state update for failover with I/O blocked. No prior reservation exists on the device.
cpu18:5635448)WARNING: NMP: nmp_DeviceStartLoop:729: NMP Device "<LUN_naa_id>" is blocked. Not starting I/O from device.
cpu45:5646660)WARNING: lpfc: lpfc_sli_issue_abort:10919: 0:(0):3169 Abort failed: Abort INP: Data: x0 xd54 x7 x98
cpu24:2098231)WARNING: NMP: nmp_DeviceStartLoop:729: NMP Device "<LUN_naa_id>" is blocked. Not starting I/O from device.
cpu24:2098231)WARNING: NMP: nmp_DeviceStartLoop:729: NMP Device "<LUN_naa_id>" is blocked. Not starting I/O from device.
cpu28:2098107)WARNING: NMP: nmpDeviceAttemptFailover:603: Retry world restore device "<LUN_naa_id>" - no more commands to retry
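To gauge how many devices are affected, the "is blocked" warnings can be counted per device. The following is a minimal sketch, not a Dell or VMware tool: it assumes the ESX log has already been read into lines of text, and the function name count_blocked_devices is hypothetical. The pattern is taken only from the message format shown in the excerpt above.

```python
import re
from collections import Counter

# Pattern based on the 'NMP Device "<id>" is blocked' warning in the excerpt above.
BLOCKED_RE = re.compile(r'NMP Device "([^"]+)" is blocked')


def count_blocked_devices(lines):
    """Return a Counter mapping device ID -> number of 'is blocked' warnings."""
    counts = Counter()
    for line in lines:
        match = BLOCKED_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

Passing an open log file object (or any iterable of lines) to count_blocked_devices returns the per-device totals. A device that accumulates these warnings while Unisphere shows no faults matches the symptom described here.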
Ktrace logs show LV::STATUS_CANCELLED on one or both SPs:
mlu 24b7470e c4_safe_ktrace E CancelIrpFromNotReadyQ[2520] LV::STATUS_CANCELLED Irp=0x7f99de29b3b0 Dev=0xA000000XX
mlu 24b7470e c4_safe_ktrace E CancelIrpFromNotReadyQ[2520] LV::STATUS_CANCELLED Irp=0x7f99de280d10 Dev=0xA000000XX
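The cancelled IRPs can be grouped by device to see which MLU devices are affected. This is a sketch under the assumption that the ktrace output is available as text lines in the format shown above; the function name cancelled_irps_by_device is hypothetical and not part of any Unity tooling.

```python
import re
from collections import defaultdict

# Matches the 'LV::STATUS_CANCELLED Irp=<hex> Dev=<id>' portion of a ktrace line.
CANCELLED_RE = re.compile(r'LV::STATUS_CANCELLED\s+Irp=(0x[0-9A-Fa-f]+)\s+Dev=(\S+)')


def cancelled_irps_by_device(lines):
    """Return a dict mapping Dev value -> list of cancelled Irp addresses, in log order."""
    by_dev = defaultdict(list)
    for line in lines:
        match = CANCELLED_RE.search(line)
        if match:
            irp, dev = match.groups()
            by_dev[dev].append(irp)
    return dict(by_dev)
```

Running this against the ktrace logs from each SP separately shows whether the cancellations occur on one SP or both.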
Cause
During an SP reboot, "newSP" can trigger a second restart of the safe container, which can expose a rare trespass race condition in the Mapped LUN (MLU) driver. This race condition affects the activity flags on MLU objects when the leader and followers are in different activity states (running versus disabled/ready). The LUN is left in a ready state, but the flags on its family objects are left as pending activate, which causes a problem during the next trespass of the LUN. The object can get stuck immediately during the second restart of the safe container, or the condition can remain latent until the next LUN trespass exposes it.
Logs showing a newSP second restart:
Platform_Basic 30018 [NOTICE] Audit: Service user executed the following service script command: svc_shutdown --reboot
newSP 1510000 [ERROR] System: Info: Starting. @ newSP.cpp:2101
SP A 30000 [INFO] Audit: SP is rebooting(0x13)
newSP 1510000 [ERROR] System: Info: newSP reset array WWN, restarting safe @ newSP.cpp:2116
newSP 1510000 [ERROR] System: Info: Another component indicated that a restart of safe was required @ newSP.cpp:2137
newSP 1510000 [ERROR] System: Info: newSP will inform Linux that a restart of safe is required. @ newSP.cpp:2185
newSP 1510000 [ERROR] System: Info: Normal Exit. @ newSP.cpp:2893
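The newSP entries that mention a restart of safe can be picked out of a larger log with a simple filter. This is only a sketch: it assumes the log is available as text lines in the format shown above, the function name find_safe_restart_requests is hypothetical, and a match is an indicator to investigate rather than confirmation of the race condition.

```python
import re

# Matches newSP lines containing "restarting safe" or "restart of safe",
# the two phrasings that appear in the excerpt above.
SAFE_RESTART_RE = re.compile(r'newSP\b.*\brestart(?:ing)?\s+(?:of\s+)?safe\b')


def find_safe_restart_requests(lines):
    """Return the log lines in which newSP reports or requests a restart of safe."""
    return [line.rstrip("\n") for line in lines if SAFE_RESTART_RE.search(line)]
```

Applied to the excerpt above, the filter keeps the three newSP entries that mention a restart of safe and skips the "Starting." and "Normal Exit." entries.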
Resolution
Reboot the SPs one at a time to resolve this issue.
A fix for this issue is expected to be available in the next major code release (Unity OE 5.1).