PowerFlex 3.x: SDS service continuously panics with function drl_IsClean
Summary: In rare scenarios, the SDS service may continuously panic with the function drl_IsClean. This issue has been observed when the SDS devices are larger than 2 TB in size.
Symptoms
SDS service continuously panics with the following stack trace:
/opt/emc/scaleio/sds/logs/exp.0
2024/07/22 21:54:33.819866 Panic in file /data/build/workspace/ScaleIO-Common-Job/src/tgt/bm/drl.c, line 1238, function drl_IsClean, PID 17253.Panic Expression !(offsetInLbs < pDrl->protectedOffsetInLbs) PANIC_ID_tgt_1497349762194.
/opt/emc/scaleio/sds/bin/sds-3.6.400.107(mosDbg_PanicPrepare+0x13a) [0x93ab8a]
/opt/emc/scaleio/sds/bin/sds-3.6.400.107(drl_IsClean+0x5e) [0x9346ae]
/opt/emc/scaleio/sds/bin/sds-3.6.400.107(mgPhyDev_IsDrlGroupClean+0x4b) [0x93476b]
/opt/emc/scaleio/sds/bin/sds-3.6.400.107(mgPhyComb_ReadIntegrityBits+0x130) [0x906040]
/opt/emc/scaleio/sds/bin/sds-3.6.400.107(mgStorageRegion_ReadRegionIntegrity+0xb4) [0x906224]
/opt/emc/scaleio/sds/bin/sds-3.6.400.107(storageRegion_ReadDirtyRegion+0xad) [0x740f4d]
/opt/emc/scaleio/sds/bin/sds-3.6.400.107(raidComb_ReadDrl+0x7d) [0x74105d]
/opt/emc/scaleio/sds/bin/sds-3.6.400.107(ioh_ReadCombDrl+0x758) [0x5eb368]
/opt/emc/scaleio/sds/bin/sds-3.6.400.107(ioh_NewRequest+0x2084) [0x5fb4a4]
/opt/emc/scaleio/sds/bin/sds-3.6.400.107(contNet_RecvIORequest+0x2c4) [0x601534]
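To confirm that a node shows this specific signature, the exception log can be searched for the panicking function name. The following is a minimal sketch, not an official tool: the helper name count_drl_panics is hypothetical, and the default path is the exp.0 file shown above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count drl_IsClean panic entries in an SDS exception log.
# Usage: count_drl_panics [LOG_FILE]
count_drl_panics() {
    # Default to the exception log path shown in the Symptoms section.
    local log_file="${1:-/opt/emc/scaleio/sds/logs/exp.0}"
    # grep -c prints the number of matching lines. It exits non-zero when
    # there are no matches, so the exit status is deliberately ignored.
    grep -c 'function drl_IsClean' "$log_file" || true
}
```

Running count_drl_panics with no argument on the affected SDS prints the number of matching panic lines in exp.0. A count that keeps growing across service restarts is consistent with the continuous panic described above.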
Impact
User data unavailability may occur if any other SDS decouples because it is in one of the following states:
- Instant Maintenance Mode (IMM)
- Error state
- Ongoing rebuild
Cause
The SDS service panics are caused by large device offsets.
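Because the issue has been observed with SDS devices larger than 2 TB, it can help to list which block devices on a node exceed that size. The sketch below is an illustration under stated assumptions: it interprets 2 TB as 2 TiB (2 * 1024^4 bytes), the function names are hypothetical, and it does not know which block devices are actually configured as SDS devices.

```shell
#!/usr/bin/env bash
# Assumption: "2 TB" is interpreted as 2 TiB = 2 * 1024^4 bytes.
TWO_TB_BYTES=$((2 * 1024 * 1024 * 1024 * 1024))

# Succeeds (exit 0) if the given size in bytes is strictly above the threshold.
is_larger_than_2tb() {
    [ "$1" -gt "$TWO_TB_BYTES" ]
}

# Reads "NAME SIZE_IN_BYTES" pairs on stdin and prints the names of the
# devices whose size is above the threshold.
filter_large_devices() {
    local name size
    while read -r name size; do
        # Skip lines that do not carry a numeric size.
        [[ "$size" =~ ^[0-9]+$ ]] || continue
        if is_larger_than_2tb "$size"; then
            printf '%s\n' "$name"
        fi
    done
}
```

A possible invocation is lsblk -b -d -n -o NAME,SIZE | filter_large_devices, which feeds whole-disk names and byte sizes into the filter. The result still has to be compared manually against the devices assigned to the SDS.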
Resolution
Fix:
- PowerFlex 3.6.5 and above (end of support)
- PowerFlex 4.5 and above
Workaround:
Follow one of the options.
If Option 1 does not resolve the issue, go to Option 2.
Option 1:
- Place the SDS node into IMM from the scli command line or the Presentation Server UI.
- If the SDS node cannot enter IMM, stop the SDS daemon by running the script
/opt/emc/scaleio/sds/bin/delete_service.sh
Take necessary precautions to prevent the cluster from entering a Data Unavailability (DU) state. Before stopping the SDS daemon, verify that no Rebuild is in progress. If you're unsure about the DU state, consult L2 or an SME.
- Stop the SDS service once the SDS is placed in IMM
/opt/emc/scaleio/sds/bin/delete_service.sh
- Remove the shared memory on the SDS (including CloudLink shared memory).
- Move the files listed by the following commands to a temporary directory
ls -l /dev/shm | egrep -i 'EMC_sds'
ls -l /dev/shm | egrep 'emc_scaleio_'
- Start the SDS service
/opt/emc/scaleio/sds/bin/create_service.sh
- Exit the SDS from IMM using scli or the Presentation Server UI. A rebuild is expected to start. If the SDS was not in IMM, go to the next step.
- Check the output of the following command to ensure that the SDS is connected:
scli --query_all_sds
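The shared memory step in Option 1 can be sketched as a small helper that moves the matching files instead of deleting them, so they can be restored if needed. This is an illustrative sketch only: the function name move_sds_shm and the backup directory are hypothetical, the name filters mirror the two listing commands in the step (names containing EMC_sds in any case, or emc_scaleio_), and it assumes the SDS service has already been stopped.

```shell
#!/usr/bin/env bash
# Hypothetical helper: move SDS shared-memory files out of SHM_DIR into BACKUP_DIR.
# Usage: move_sds_shm SHM_DIR BACKUP_DIR
move_sds_shm() {
    local shm_dir="$1" backup_dir="$2" path name
    mkdir -p "$backup_dir" || return 1
    for path in "$shm_dir"/*; do
        # If the directory is empty the glob stays unexpanded; skip it.
        [ -e "$path" ] || continue
        name=$(basename "$path")
        # Match "EMC_sds" case-insensitively, or "emc_scaleio_" as written.
        if printf '%s\n' "$name" | grep -Eiq 'EMC_sds' ||
           printf '%s\n' "$name" | grep -Eq 'emc_scaleio_'; then
            mv -- "$path" "$backup_dir"/ && printf 'moved %s\n' "$name"
        fi
    done
}
```

For example, move_sds_shm /dev/shm /tmp/sds_shm_backup (the backup path is an assumption) moves the matching files and prints one line per moved file. Files that do not match either pattern stay in place.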
Option 2:
If the system is not in a Data Failure state and sufficient free or spare capacity is available, remove the SDS node from the PowerFlex cluster. Once the rebalance is complete, re-add the SDS node with all of its SDS devices.
IMPORTANT:
Background Scanner (BGS) and Partial Device Error (PDE) could potentially cause the issue to recur. If possible, disable BGS or use BGS in "report only" mode.
Persistent Checksum should not trigger the issue on its own. However, a checksum mismatch initiates a small rebuild, which may cause the issue to recur. If possible, disable Persistent Checksum.