PowerFlex Many Fixed Read Errors After An SDS Server Cold Boot
Summary: After an unexpected SDS server power cycle (cold boot), the MDM reports many fixed read errors. This issue arises from the Persistent Checksum (PC) feature with devices larger than 2 TB, particularly when coupled with a cold boot of an SDS server.
Symptoms
Scenario
PowerFlex System that uses a medium granularity storage pool and has the persistent checksum feature enabled.
Devices that are larger than 2 TB.
A single SDS server unexpectedly experiences a power cycle (cold boot).
Two or more SDS servers unexpectedly experience a power cycle (cold boot).
Symptoms
MDM event logs report many fixed read errors:
2023-12-05 12:01:42.634000:0031658:SCANNER_NEW_FIXED_ERRORS__INFO INFO SDS <name> encountered one or more read errors on device /dev/disk/by-id/scsi-<id>, and they were all fixed (Found: 29443, Fixed: 29443) ...
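To see how many errors each device reported, the Found and Fixed counters can be extracted from the event text. The following is a minimal sketch, assuming the MDM events are available as plain text in the format shown above. The function name fixed_read_error_counts is hypothetical and is not part of PowerFlex.

```shell
# Hypothetical helper (not a PowerFlex tool): reads MDM event text on stdin
# and prints "<device> <found> <fixed>" for each SCANNER_NEW_FIXED_ERRORS event.
# Assumes the event wording matches the sample shown in this article.
fixed_read_error_counts() {
    sed -n 's/.*SCANNER_NEW_FIXED_ERRORS.* on device \([^ ,]*\),.*(Found: \([0-9]*\), Fixed: \([0-9]*\)).*/\1 \2 \3/p'
}
```

If Found and Fixed match for every device, all detected errors were corrected.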
SDS trace logs show checksum mismatches:
2023/12/05 12:01:39.643280 7ff09dd3ddb0:mosT10Dif_VerifyContT10DIFBuffer:00381: (T10DIF) DIF Verification Failed: blk=0, blkSize=8, pData=0x7fedddbff000, pDif=0x7ff09dd38820, computed_guard=b5c2, DIF_guard=58e1, difGranularity=8
2023/12/05 12:01:39.643288 7ff09dd3ddb0:mgPhyDevPersChksm_IO_ReadValidate:03647: data Validation (state: PROTECTED) failed, devId 0xddd77b550046000e, combId 4716801282c6, combOffsetInLbs 16609280, dataOffsetInLbs 4306157568, dataSizeInLbs 2048, chksmRelativeOffsetBytes 1075099648, chksmSizeBytes 512, rc IO_ERR_PERS_CHECKSUM_MISMATCH (Pers. Checksum)
2023/12/05 12:01:39.643298 7ff09dd3ddb0:mgStorageRegion_ReadSync:03646: Reading tooth data failed: IO_ERR_PERS_CHECKSUM_MISMATCH. combId:4716801282c6,vTree:0xda6ddd6400000022,offsetVol:0x374ba9000,offsetInComb:16609280,sizeInLbs:2048,phyToothIdx:2101592,srcToothIdx:inv,dstToothIdx:inv New:(0,0) Requested:(37,1) volId:0
2023/12/05 12:01:39.643372 7ff09dd3ddb0:mgPhyDev_IncreaseInaccessibleCapacity:06587: PDE - devId ddd77b550046000e toothIndex 2101592 Increased inaccessible capacity to 1
2023/12/05 12:01:39.643383 7ff09dd3ddb0:raidComb_ReportCorruptionIfShould:19441: PDE - Comb 4716801282c6 Reported CORRUPT integrity result SUCCESS combId:4716801282c6,vTree:0xda6ddd6400000022,offsetVol:0x374ba9000,offsetInComb:16609280,sizeInLbs:2048,phyToothIdx:2101592,srcToothIdx:inv,dstToothIdx:inv New:(0,0) Requested:(37,1) volId:0
2023/12/05 12:01:39.643390 7ff09dd3ddb0:ioh_NewRequest:10209: Check for scan error on comb 4716801282c6 - Done rc is IO_ERR_PERS_CHECKSUM_MISMATCH (Lba 16609280 2048) (0 ms)
2023/12/05 12:01:39.647175 7ff098be4db0:storageRegion_PostIntegrityCorrection:04647: PDE - Clearing corruption in comb 4716801282c6 offsetInComb 16609280 extentSize 2048 after raidComb_WriteCombLocal combId:4716801282c6,vTree:0xda6ddd6400000022,offsetVol:0x374ba9000,offsetInComb:16609280,sizeInLbs:2048,phyToothIdx:2101592,srcToothIdx:inv,dstToothIdx:inv New:(0,0) Requested:(37,1) volId:0
2023/12/05 12:01:39.647259 7ff098be4db0:mgPhyDev_DecreaseInaccessibleCapacity:06604: PDE - devId ddd77b550046000e toothIndex 2101592 Decreased inaccessible capacity to 0
2023/12/05 12:01:39.647350 7ff098be4db0:ioh_NewRequest:09688: comb:4716801282c6,vTree:0x0,offsetVol:0xffffffffffffffff,offsetTooth:0x0, Succeeded to fix comb 4716801282c6, offset 16609280, by its primary
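To find out which devices are affected, the checksum validation failures in an SDS trace file can be counted per device ID. The sketch below assumes the trace file holds one log entry per line and that the entries look like the mgPhyDevPersChksm_IO_ReadValidate sample above. The function name checksum_mismatches_per_device is hypothetical, and the location of the trace file is left to the reader because it is not stated here.

```shell
# Hypothetical helper (not a PowerFlex tool): reads SDS trace text on stdin and
# prints "<count> <devId>" for each device that logged a persistent checksum
# validation failure. Assumes one log entry per line.
checksum_mismatches_per_device() {
    grep 'mgPhyDevPersChksm_IO_ReadValidate.*IO_ERR_PERS_CHECKSUM_MISMATCH' |
        sed -n 's/.*devId \(0x[0-9a-fA-F]*\),.*/\1/p' |
        sort | uniq -c | awk '{print $1, $2}'
}

# Example (the trace file path is a placeholder):
#   checksum_mismatches_per_device < /path/to/sds/trace.log
```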
If multiple SDSs experience a cold boot, inaccessible capacity may be observed. This is visible in the output of the SCLI query_all command:
Number of devices with inaccessible capacity: 367
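To check this value without reading the full query_all output, the line can be filtered out of the command output. The following is a minimal sketch, assuming the line is printed exactly as shown above. The function name inaccessible_device_count is hypothetical.

```shell
# Hypothetical helper (not a PowerFlex tool): reads query_all output on stdin
# and prints the number of devices with inaccessible capacity.
# Returns non-zero if the line is not present in the input.
inaccessible_device_count() {
    awk -F': *' '
        /Number of devices with inaccessible capacity/ { n = $2 + 0; found = 1 }
        END { if (!found) exit 1; print n }
    '
}

# Example, run on the Primary MDM:
#   scli --query_all | inaccessible_device_count
```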
Impact:
MDM alerts indicating fixed read errors that were corrected by the mirrored copy.
MDM event logs fill up with events about the fixed read errors.
Cause
Following a cold boot of an SDS server, a software defect prevents the persistent checksums from being fully rebuilt on devices larger than 2 TB. Because of the missing checksums, the background scanner (BGS) detects discrepancies between the primary and secondary copies of data and reports them as fixed read errors. There is no risk to data integrity and no risk of data loss: BGS automatically corrects the identified differences by rectifying the checksums, and the observed fixed read errors are a byproduct of that process.
Resolution
These events and alerts can safely be ignored. They eventually clear on their own once BGS has scanned all the devices.
If the MDM alerts and events are problematic, the SDS that experienced a cold boot can be removed from the system and added back in.
If multiple SDSs experience a cold boot and inaccessible capacity is seen, the persistent checksum feature must be temporarily disabled.
To disable the persistent checksum feature, BGS needs to be disabled first:
1) Connect to the Primary MDM server.
2) Disable BGS:
scli --disable_background_device_scanner --protection_domain_name <pd> --storage_pool_name <sp>
3) Disable persistent checksum:
scli --disable_persistent_checksum --protection_domain_name <pd> --storage_pool_name <sp>
The inaccessible areas should stop increasing and start decreasing. This may take some time. Running the SCLI test_inaccessible_capacity command for every impacted device can sometimes speed up this process.
4) If the inaccessible areas do not decrease after the above actions, place the SDSs that are flagged with PDE into Instant Maintenance Mode (IMM) and restart the SDS service.
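Waiting for the inaccessible capacity to drain can be automated with a simple polling loop. The sketch below is one possible approach, not a Dell-provided procedure. The function name wait_for_no_inaccessible_capacity is hypothetical, and it is an assumption that query_all still prints the "Number of devices with inaccessible capacity" line with a value of 0 once the capacity is gone. If the line is missing, the loop stops with an error instead of reporting success.

```shell
# Hypothetical helper (not a PowerFlex tool): repeatedly runs the given command,
# which must print query_all-style output, until the number of devices with
# inaccessible capacity reaches 0.
# Usage: wait_for_no_inaccessible_capacity <interval_seconds> <command...>
wait_for_no_inaccessible_capacity() {
    interval=$1
    shift
    while :; do
        count=$("$@" | awk -F': *' '
            /Number of devices with inaccessible capacity/ { n = $2 + 0; found = 1 }
            END { if (found) print n }
        ')
        # An empty count means the line was not found. Treat this as an error
        # so that a failed query is never mistaken for "no inaccessible capacity".
        if [ -z "$count" ]; then
            echo "could not read the inaccessible capacity count" >&2
            return 1
        fi
        echo "devices with inaccessible capacity: $count"
        [ "$count" -eq 0 ] && return 0
        sleep "$interval"
    done
}

# Example, run on the Primary MDM, polling every 5 minutes:
#   wait_for_no_inaccessible_capacity 300 scli --query_all
```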
Once all the inaccessible capacity is gone, enable persistent checksum and BGS.
1) Enable persistent checksum:
scli --enable_persistent_checksum --protection_domain_name <pd> --storage_pool_name <sp>
This may take a long time because the checksums for all the data must be rebuilt. Progress can be tracked using the SCLI query_all command. BGS can be enabled only after the persistent checksums have been calculated and protected.
2) Enable BGS:
scli --enable_background_device_scanner --protection_domain_name <pd> --storage_pool_name <sp>
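The order of the commands matters in both directions: BGS is disabled before persistent checksum, and persistent checksum is enabled before BGS. The sketch below only prints the scli commands from this article in the required order so they can be reviewed before being run by hand. The function name checksum_recovery_commands is hypothetical, and the sketch assumes protection domain and storage pool names without spaces.

```shell
# Hypothetical helper (not a PowerFlex tool): prints, but does not run, the scli
# commands from this article in the required order for the given phase.
# Usage: checksum_recovery_commands <disable|enable> <protection_domain> <storage_pool>
checksum_recovery_commands() {
    phase=$1
    pd=$2
    sp=$3
    target="--protection_domain_name $pd --storage_pool_name $sp"
    case $phase in
        disable)
            # BGS must be disabled before persistent checksum.
            echo "scli --disable_background_device_scanner $target"
            echo "scli --disable_persistent_checksum $target"
            ;;
        enable)
            # Persistent checksum must be enabled, and its rebuild finished,
            # before BGS is enabled again.
            echo "scli --enable_persistent_checksum $target"
            echo "scli --enable_background_device_scanner $target"
            ;;
        *)
            echo "usage: checksum_recovery_commands <disable|enable> <pd> <sp>" >&2
            return 2
            ;;
    esac
}
```

Printing rather than executing keeps the long wait between the two enable commands under operator control.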
Additional Information
Impacted Versions
PowerFlex 3.x
PowerFlex 4.x
Fixed In Version
PowerFlex 3.6.3
PowerFlex 4.5.2