Dell Unity: Single SP Panic due to SPCv Controller crash (Dell Correctable)

Summary: The SPCv controller can hang while processing a PHY_STOP (Physical Layer Stop) command with I/O outstanding. The hang can crash the SPCv controller, which in turn causes an SP panic.


Symptoms

A single SP panics because of an SPCv controller crash.

This issue is typically associated with specific code upgrades that include a SAS controller firmware update. To date, it has only been observed roughly an hour after the upgrade completes.

From Ktrace:
07:13:27.284 511 7F581BED6700 SASPMC 0 (BE99) API INFO Timeout occurred, canceling cmd timer oc: 0xf htag: 0x8ff8920
07:13:27.284 2 7F581BED6700 SASPMC 0 (BE99) API ERRO SAS CTLR RECOVER ASSERT (PANIC CODE 0x0340402e, recoverable 1, ACTION_CHIP_RESET)
07:13:27.284 2 7F581BED6700 SASPMC 0 (BE99) API ERRO SPC FATAL ERROR, RESETTING SAS CTLR TO RECOVER, panic_code 0x0340402e
07:13:27.284 2 7F581BED6700 SASPMC 0 (BE99) TPM INFO Service Thread - Scheduling thread to process Ctlr System Error arrival on processor 19.
07:13:27.284 8 7F581B7FF700 SASPMC 0 (BE99) API INFO Timeout occurred, canceling cmd timer oc: 0xf htag: 0x8fe8920
07:13:27.284 1 7F581B7FF700 SASPMC 0 (BE99) API INFO SAS CTLR MPI Fatal Error processing in progress panic_code 0x0340402e
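
The following is a minimal sketch, not a Dell-supplied tool, for scanning a saved Ktrace text file for lines like the ones above. The regular expression is derived only from the sample lines in this article, so the field layout it assumes may not hold for other Ktrace versions. The function name `find_spcv_panics` and the `PanicHit` type are hypothetical names chosen for illustration.

```python
import re
from typing import Iterable, List, NamedTuple

# Assumption: a relevant line starts with an HH:MM:SS.mmm timestamp, contains
# the SASPMC driver tag, and reports a panic code as either
# "PANIC CODE 0x..." or "panic_code 0x...", as in the sample above.
PANIC_RE = re.compile(
    r"^(?P<time>\d{2}:\d{2}:\d{2}\.\d{3})\s+.*\bSASPMC\b.*"
    r"(?:PANIC CODE|panic_code)\s+(?P<code>0x[0-9a-fA-F]+)"
)


class PanicHit(NamedTuple):
    time: str  # timestamp exactly as printed in the trace
    code: str  # panic code, lower-cased for easy comparison
    line: str  # the full matching line, stripped of surrounding whitespace


def find_spcv_panics(lines: Iterable[str]) -> List[PanicHit]:
    """Return every SASPMC line that reports a SAS controller panic code."""
    hits = []
    for raw in lines:
        line = raw.strip()
        match = PANIC_RE.search(line)
        if match:
            hits.append(
                PanicHit(match.group("time"), match.group("code").lower(), line)
            )
    return hits
```

Run against the six sample lines above, this matches the three lines that carry panic code 0x0340402e and skips the timeout and thread-scheduling lines. A match only flags lines for review; confirming that a given panic is this issue still requires Dell Technical Support.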

Cause

A race condition between the PHY_STOP (Physical Layer Stop) command and a PHY_UP event in the SPCv controller causes the controller to get stuck, which leads to the panic.

Resolution

Fix:
A fix is being investigated at this time. Please watch this Knowledgebase article for updates.

There is no viable workaround at present (June 2024). If you are experiencing this condition, please contact Dell Technical Support or your Authorized Service Professional and quote this Knowledgebase article ID.

Additional Information

While this SPCv controller crash normally triggers only an SP panic, in a very small number of cases a dual failure has led to uncorrectable errors. In those cases, some drives were offline when the SP returned from the panic and attempted to complete the partial writes that the panic left behind. If a partial write targets an offline drive, that stripe is marked “uncorrectable”, resulting in data loss.

In those cases, the SP will still likely panic around an hour after the NDU (non-disruptive upgrade), but after the panic, LUNs or file systems may remain offline because of corruption, including uncorrectable errors reported by find_bad_blocks. The bad blocks can be zeroed, but if they contained data, zeroing may destroy the affected file, and such files must be restored from backup.

Affected Products

Dell EMC Unity, Dell EMC Unity XT 380, Dell EMC Unity XT 380F, Dell EMC Unity XT 480, Dell EMC Unity XT 480F, Dell EMC Unity XT 680, Dell EMC Unity XT 680F, Dell EMC Unity XT 880, Dell EMC Unity XT 880F, Dell EMC Unity Family, Dell EMC Unity All Flash, Dell EMC Unity Hybrid ...

Article Properties

Article Number: 000216606
Article Type: Solution
Last Modified: 07 Jun 2024
Version: 4