PowerFlex Device has fixed read errors
Summary: SDS devices have errors stating "Device has fixed read errors."
Symptoms
Scenario
This can occur when an SDS device has read errors that have been corrected by the SDS.
This can occur whether the background scanner is enabled or disabled.
Symptoms
The fixed errors on a device can be shown in the following places:
- The GUI shows the error "Device has fixed read errors."
- The "scli --query_sds --sds_id <SDS_ID>" output shows a counter (Error-fixes) for each device with corrected read errors:
15: Name: /dev/sdr Path: /dev/sdr Original-path: /dev/sdr ID: 2d63f7c80003000e
Storage Pool: SAS_pool1, Capacity: 1116 GB Error-fixes: 6 scanned 0 MB, Compare errors: 0 State: Normal
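When an SDS has many devices, the Error-fixes counter can be extracted from the query output with a short script. The following is a minimal sketch, assuming the output keeps the layout shown above. The function name `parse_error_fixes` is hypothetical, and the field layout may differ between PowerFlex versions.

```python
import re

# Sample text copied from the query output shown above.
SAMPLE = """\
15: Name: /dev/sdr Path: /dev/sdr Original-path: /dev/sdr ID: 2d63f7c80003000e
Storage Pool: SAS_pool1, Capacity: 1116 GB Error-fixes: 6 scanned 0 MB, Compare errors: 0 State: Normal
"""

# Assumed layout: "Path: <path> ... ID: <16 hex digits> ... Error-fixes: <n>".
# "Original-path:" is not matched because the pattern is case-sensitive.
DEVICE_RE = re.compile(
    r"Path:\s*(?P<path>\S+).*?ID:\s*(?P<id>[0-9a-f]{16}).*?Error-fixes:\s*(?P<fixes>\d+)",
    re.DOTALL,
)

def parse_error_fixes(text):
    """Return (path, device_id, error_fixes) for every device entry found."""
    return [(m["path"], m["id"], int(m["fixes"])) for m in DEVICE_RE.finditer(text)]

for path, dev_id, fixes in parse_error_fixes(SAMPLE):
    if fixes > 0:
        print(f"{path} ({dev_id}) has {fixes} corrected read errors")
```

The pattern assumes every device entry carries an Error-fixes field; an entry without one would be paired with the next entry's counter.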
- The counters_dump.txt in MDM getInfoDump shows the FIXED_READ_ERROR_COUNT on different objects:
ID: df7700a600120012 DEVICE_TYPE READ_ERR FIXED_READ_ERROR_COUNT 1
ID: 1d1e4e5500000012 SDS_TYPE READ_ERR FIXED_READ_ERROR_COUNT 1
ID: 1c34e1f700000007 STORAGE_POOL_TYPE READ_ERR FIXED_READ_ERROR_COUNT 1
ID: b9b286df00000001 PROTECTION_DOMAIN_TYPE READ_ERR FIXED_READ_ERROR_COUNT 1
ID: 49b6b8057d1fc84b SYSTEM_TYPE READ_ERR FIXED_READ_ERROR_COUNT 1
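The same counter can be collected from counters_dump.txt for each object type. This is a rough sketch that assumes the entry format shown above; `fixed_read_errors` is a hypothetical helper name. It works whether the entries appear one per line or run together on a single line.

```python
import re

# Sample entries copied from the counters_dump.txt excerpt above.
SAMPLE = (
    "ID: df7700a600120012 DEVICE_TYPE READ_ERR FIXED_READ_ERROR_COUNT 1 "
    "ID: 1d1e4e5500000012 SDS_TYPE READ_ERR FIXED_READ_ERROR_COUNT 1 "
    "ID: 1c34e1f700000007 STORAGE_POOL_TYPE READ_ERR FIXED_READ_ERROR_COUNT 1 "
    "ID: b9b286df00000001 PROTECTION_DOMAIN_TYPE READ_ERR FIXED_READ_ERROR_COUNT 1 "
    "ID: 49b6b8057d1fc84b SYSTEM_TYPE READ_ERR FIXED_READ_ERROR_COUNT 1"
)

# Assumed entry format: "ID: <hex> <OBJECT>_TYPE READ_ERR FIXED_READ_ERROR_COUNT <n>".
COUNTER_RE = re.compile(
    r"ID:\s*([0-9a-f]{16})\s+(\w+)_TYPE\s+READ_ERR\s+FIXED_READ_ERROR_COUNT\s+(\d+)"
)

def fixed_read_errors(text):
    """Return (object_type, object_id, count) for every nonzero counter."""
    return [
        (obj_type, obj_id, int(count))
        for obj_id, obj_type, count in COUNTER_RE.findall(text)
        if int(count) > 0
    ]

for obj_type, obj_id, count in fixed_read_errors(SAMPLE):
    print(f"{obj_type:<18} {obj_id} fixed read errors: {count}")
```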
Other possible symptoms:
The device may be in an Error state.
There may be errors on the block device in the system messages or syslog:
blk_update_request: critical medium error, dev sdr, sector 94390272
sd 0:2:15:0: [sdr] tag#1 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
sd 0:2:15:0: [sdr] tag#1 Sense Key : Medium Error [current]
sd 0:2:15:0: [sdr] tag#1 Add. Sense: Unrecovered read error
There may be long inflight IO messages in SDS trc:
contDevMngr_HandleLongInflightIoViolation:02998: IO on devId: 2d63f7c80003000e (/dev/sdr) took too long, Low threshold exceeded - waited for reaper 12250 millis
contDevMngr_HandleLongInflightIoViolation:02998: IO on devId: 2d63f7c80003000e (/dev/sdr) took too long, Low threshold exceeded - waited for reaper 13250 millis
contDevMngr_HandleLongInflightIoViolation:02998: IO on devId: 2d63f7c80003000e (/dev/sdr) took too long, Low threshold exceeded - waited for reaper 14250 millis
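To see which devices log these messages most often, the trace lines can be summarized per device. This is a sketch only, assuming the message wording shown above; `summarize_long_inflight` is a hypothetical name, and the wording may vary between versions.

```python
import re

# Sample messages copied from the SDS trc excerpt above.
SAMPLE = (
    "contDevMngr_HandleLongInflightIoViolation:02998: IO on devId: 2d63f7c80003000e (/dev/sdr) "
    "took too long, Low threshold exceeded - waited for reaper 12250 millis "
    "contDevMngr_HandleLongInflightIoViolation:02998: IO on devId: 2d63f7c80003000e (/dev/sdr) "
    "took too long, Low threshold exceeded - waited for reaper 13250 millis "
    "contDevMngr_HandleLongInflightIoViolation:02998: IO on devId: 2d63f7c80003000e (/dev/sdr) "
    "took too long, Low threshold exceeded - waited for reaper 14250 millis"
)

# Assumed wording: "IO on devId: <hex> (<path>) took too long ... waited for reaper <n> millis".
INFLIGHT_RE = re.compile(
    r"IO on devId:\s*([0-9a-f]{16})\s+\((\S+)\)\s+took too long.*?waited for reaper (\d+) millis"
)

def summarize_long_inflight(text):
    """Map device ID to (path, message count, longest wait in milliseconds)."""
    summary = {}
    for dev_id, path, millis in INFLIGHT_RE.findall(text):
        _, count, longest = summary.get(dev_id, (path, 0, 0))
        summary[dev_id] = (path, count + 1, max(longest, int(millis)))
    return summary

for dev_id, (path, count, longest) in summarize_long_inflight(SAMPLE).items():
    print(f"{path} ({dev_id}): {count} long inflight IOs, longest wait {longest} ms")
```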
There may be nonzero Errors values in the device's I/O counters in the SDS sdbg_out.txt:
13: Dev path:/dev/sdr Size(lbs):0 Time grn:520577464
Io Counters:
GENERAL
Writes: 4852 Lbs: 2160443 MBs: 1054 Errors: 0
Reads: 49283 Lbs: 111376 MBs: 54 Errors: 12744
BM
Writes: 0 Lbs: 0 MBs: 0 Errors: 0
Reads: 0 Lbs: 0 MBs: 0 Errors: 0
COMB_MAP
Writes: 5 Lbs: 1390 MBs: 0 Errors: 2
Reads: 0 Lbs: 0 MBs: 0 Errors: 0
TOOTH_MAP
Writes: 426 Lbs: 688528 MBs: 336 Errors: 424
Reads: 0 Lbs: 0 MBs: 0 Errors: 0
IO
Writes: 4319 Lbs: 603064 MBs: 294 Errors: 16
Reads: 2076 Lbs: 16608 MBs: 8 Errors: 22
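The counter groups with nonzero Errors can be listed with a small parser. The sketch below assumes the group and line layout shown above; `io_errors` is a hypothetical helper name.

```python
import re

# Sample counters copied from the sdbg_out.txt excerpt above.
SAMPLE = """\
Io Counters:
GENERAL
Writes: 4852 Lbs: 2160443 MBs: 1054 Errors: 0
Reads: 49283 Lbs: 111376 MBs: 54 Errors: 12744
BM
Writes: 0 Lbs: 0 MBs: 0 Errors: 0
Reads: 0 Lbs: 0 MBs: 0 Errors: 0
COMB_MAP
Writes: 5 Lbs: 1390 MBs: 0 Errors: 2
Reads: 0 Lbs: 0 MBs: 0 Errors: 0
TOOTH_MAP
Writes: 426 Lbs: 688528 MBs: 336 Errors: 424
Reads: 0 Lbs: 0 MBs: 0 Errors: 0
IO
Writes: 4319 Lbs: 603064 MBs: 294 Errors: 16
Reads: 2076 Lbs: 16608 MBs: 8 Errors: 22
"""

# Assumed layout: an upper-case group name on its own line, followed by
# one "Writes:" line and one "Reads:" line.
GROUP_RE = re.compile(r"^[A-Z_]+$")
LINE_RE = re.compile(
    r"^(Writes|Reads):\s*(\d+)\s+Lbs:\s*\d+\s+MBs:\s*\d+\s+Errors:\s*(\d+)$"
)

def io_errors(text):
    """Return (group, operation, total, errors) for every line with errors."""
    group = None
    found = []
    for raw in text.splitlines():
        line = raw.strip()
        if GROUP_RE.match(line):
            group = line
            continue
        m = LINE_RE.match(line)
        if m and int(m.group(3)) > 0:
            found.append((group, m.group(1), int(m.group(2)), int(m.group(3))))
    return found

for group, op, total, errors in io_errors(SAMPLE):
    print(f"{group} {op}: {errors} errors out of {total} operations")
```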
The device's latency may be high according to counters_dump.txt:
ID: 2d63f7c60003000c DEVICE_TYPE DEV_LATENCY AVG_WRITE_LATENCY_IN_MICROSEC 0
ID: 2d63f7c70003000d DEVICE_TYPE DEV_LATENCY AVG_WRITE_LATENCY_IN_MICROSEC 0
ID: 2d63f7c80003000e DEVICE_TYPE DEV_LATENCY AVG_WRITE_LATENCY_IN_MICROSEC 11424
ID: 2d63f7c90003000f DEVICE_TYPE DEV_LATENCY AVG_WRITE_LATENCY_IN_MICROSEC 0
ID: 2d63f7ca00030010 DEVICE_TYPE DEV_LATENCY AVG_WRITE_LATENCY_IN_MICROSEC 0
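Devices whose average write latency stands out can be picked from counters_dump.txt in the same way. This is a minimal sketch: the helper name `slow_devices` and the 10000-microsecond default threshold are assumptions made for illustration, not PowerFlex values.

```python
import re

# Sample entries copied from the counters_dump.txt excerpt above.
SAMPLE = (
    "ID: 2d63f7c60003000c DEVICE_TYPE DEV_LATENCY AVG_WRITE_LATENCY_IN_MICROSEC 0 "
    "ID: 2d63f7c70003000d DEVICE_TYPE DEV_LATENCY AVG_WRITE_LATENCY_IN_MICROSEC 0 "
    "ID: 2d63f7c80003000e DEVICE_TYPE DEV_LATENCY AVG_WRITE_LATENCY_IN_MICROSEC 11424 "
    "ID: 2d63f7c90003000f DEVICE_TYPE DEV_LATENCY AVG_WRITE_LATENCY_IN_MICROSEC 0 "
    "ID: 2d63f7ca00030010 DEVICE_TYPE DEV_LATENCY AVG_WRITE_LATENCY_IN_MICROSEC 0"
)

LATENCY_RE = re.compile(
    r"ID:\s*([0-9a-f]{16})\s+DEVICE_TYPE\s+DEV_LATENCY\s+AVG_WRITE_LATENCY_IN_MICROSEC\s+(\d+)"
)

def slow_devices(text, threshold_us=10000):
    """Return (device_id, latency_us) for devices at or above the threshold.

    The default threshold is an arbitrary illustrative value.
    """
    return [
        (dev_id, int(us))
        for dev_id, us in LATENCY_RE.findall(text)
        if int(us) >= threshold_us
    ]

for dev_id, us in slow_devices(SAMPLE):
    print(f"device {dev_id}: average write latency {us} microseconds")
```

Comparing a device against its peers in the same storage pool is often more informative than a fixed threshold, since normal latency depends on the media type.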
Impact
The "Fixed Read Errors" counter does not have any direct impact on the system.
However, it may indicate an underlying condition that could cause SDS disconnections, rebuild activities, etc.
Cause
This can be seen when an SDS device has read errors that have been corrected, or fixed, by using the mirrored copy. The correction can happen in the following cases:
- The background scanner fails to read from one copy of the data, and uses the other copy to overwrite it.
- An SDS cannot serve an SDC read request because the read from the local disk fails, so it uses the secondary copy to serve the I/O and overwrites the local data.
The warning indicates that the disk may be slowing down, going bad, or having bad blocks. The mechanisms described above re-write the blocks, which can fix "soft" bad blocks.
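The correction flow described above can be illustrated with a toy model. This is not PowerFlex code: the `Copy` class, the `read_with_fix` function, and the counter name are all hypothetical, and the model only shows the idea of serving a failed read from the other copy, rewriting the local block, and counting the fix.

```python
class ReadError(Exception):
    """Raised by a copy that cannot return the requested block."""

class Copy:
    """Toy stand-in for one mirrored copy of the data (hypothetical)."""

    def __init__(self, blocks, bad=()):
        self.blocks = dict(blocks)
        self.bad = set(bad)  # block numbers that currently fail to read

    def read(self, n):
        if n in self.bad:
            raise ReadError(n)
        return self.blocks[n]

    def write(self, n, data):
        self.blocks[n] = data
        self.bad.discard(n)  # models a "soft" bad block that a rewrite clears

def read_with_fix(local, mirror, n, counters):
    """Read block n locally; on failure, use the mirror and rewrite locally."""
    try:
        return local.read(n)
    except ReadError:
        data = mirror.read(n)   # serve the I/O from the other copy
        local.write(n, data)    # overwrite the unreadable local block
        counters["fixed_read_errors"] += 1
        return data

counters = {"fixed_read_errors": 0}
local = Copy({0: b"a", 1: b"b"}, bad={1})
mirror = Copy({0: b"a", 1: b"b"})
read_with_fix(local, mirror, 1, counters)  # fixed from the mirror
read_with_fix(local, mirror, 1, counters)  # now served locally
print(counters)
```

In the model, a rewrite always clears the bad block. On real hardware a rewrite may not clear it, which is why the counter should prompt a check of the disk.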
Resolution
- Examine the disk. If necessary, contact the hardware vendor to replace it.
The counter usually indicates an underlying condition, such as a failing disk. The SDS rewrite described above attempts to fix soft bad blocks, but it may not succeed in every scenario.
- Clear the counter:
scli --reset_scanner_error_counters --protection_domain_id <pd id> --storage_pool_id <sp id> --reset_corrected_read_error_counter
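When the reset has to be repeated for several storage pools, the command line can be assembled by a script. The sketch below only builds and prints the command from the flags shown above and does not run it. `reset_counter_command` is a hypothetical helper, and the example IDs are the protection domain and storage pool IDs from the counters_dump.txt excerpt earlier in this article.

```python
import shlex

def reset_counter_command(pd_id, sp_id):
    """Build the argument list for the counter reset command shown above."""
    return [
        "scli",
        "--reset_scanner_error_counters",
        "--protection_domain_id", pd_id,
        "--storage_pool_id", sp_id,
        "--reset_corrected_read_error_counter",
    ]

# Example IDs taken from the counters_dump.txt excerpt in this article.
argv = reset_counter_command("b9b286df00000001", "1c34e1f700000007")
print(shlex.join(argv))
```

Keeping the command as an argument list avoids shell-quoting problems if it is later passed to `subprocess.run`.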