PowerFlex 3.x Performance Issue With Potential I/O Errors During Backward Rebuild
Summary: I/O errors may be reported during backward rebuild events. In severe cases, volumes may become read-only and cause data unavailability. During a backward rebuild, comb roles (Primary/Secondary) are switched. A backward rebuild may occur 1) when an SDS is removed from maintenance mode, or 2) when an SDS goes down and comes back promptly. ...
Symptoms
Scenario
The MDM sends many control commands to an SDS during a rebuild, but the SDS is unable to process them promptly.
Symptoms
In the following log entries, the MDM tries to perform a comb role switch, but the SDS takes about 30 seconds to complete the operation. In this example, that delay was long enough for the volume to become read-only.
15/06 04:18:18.362502 0x7f3cc590ddb0:multiHeadMgr_HandleNetRPCResponse:02653: TgtId: 5ce02d7e00000003 RC: TIMEOUT CombID: 71e0000480e3 msgType: TGT_MSG_TYPE__SET_COMB_STATE
15/06 04:18:23.364575 0x7f3cc5904db0:multiHeadMgr_HandleNetRPCResponse:02653: TgtId: 5ce02d7e00000003 RC: TIMEOUT CombID: 71e0000480e3 msgType: TGT_MSG_TYPE__SET_COMB_STATE
15/06 04:18:28.370493 0x7f3cc5928db0:multiHeadMgr_HandleNetRPCResponse:02653: TgtId: 5ce02d7e00000003 RC: TIMEOUT CombID: 71e0000480e3 msgType: TGT_MSG_TYPE__SET_COMB_STATE
15/06 04:18:33.379554 0x7f3cc5928db0:multiHeadMgr_HandleNetRPCResponse:02653: TgtId: 5ce02d7e00000003 RC: TIMEOUT CombID: 71e0000480e3 msgType: TGT_MSG_TYPE__SET_COMB_STATE
15/06 04:18:38.380498 0x7f3cc5955db0:multiHeadMgr_HandleNetRPCResponse:02653: TgtId: 5ce02d7e00000003 RC: TIMEOUT CombID: 71e0000480e3 msgType: TGT_MSG_TYPE__SET_COMB_STATE
15/06 04:18:43.380564 0x7f3cc590ddb0:multiHeadMgr_HandleNetRPCResponse:02653: TgtId: 5ce02d7e00000003 RC: TIMEOUT CombID: 71e0000480e3 msgType: TGT_MSG_TYPE__SET_COMB_STATE
(...)
15/06 04:18:43.380573 0x7f3cc590ddb0:mdmTgtMsg_SendAsyncSetCombState:06228: devId: ff7fb6fc00030006 CombId: 71e0000480e3 CombState: PRI->SEC RaidState: 0x1->0x1 ProtType: SECONDARY Switch roles (subtask) GenNums: Primary: 71707 Cmd: 9 MH: 71746 Connection: 1357
15/06 04:18:46.920989 0x7f3cc5943db0:mdmTgtMsg_SendAsyncSetCombState:06228: devId: ffd7690d00070009 CombId: 71e0000480e3 CombState: SEC->PRI RaidState: 0x21->0x1 ProtType: SECONDARY Switch roles (subtask) GenNums: Primary: 71707 Cmd: 11 MH: 71758 Connection: 1353
15/06 04:18:46.921312 0x7f3cc5955db0:multiHeadRow_MoveState_Inner:03054: [multiHead_HandleNormStateFlow:1359]: MultiHead: e3c00009 Row: 227 NORMAL->NORMAL (NORM2NORM_ROLE_BALANCE)
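When a trace file is large, it may help to summarize the repeated SET_COMB_STATE timeouts per comb rather than reading them one by one. The following is a minimal sketch, not a supported tool: the function name summarize_comb_timeouts is hypothetical, and it assumes the trace has one entry per line with whitespace-separated fields in the layout shown above (time of day in the second field, the comb ID following the "CombID:" label).

```shell
# Hypothetical helper: reads trace lines on stdin and prints, for each comb,
# how many SET_COMB_STATE timeouts were logged and when the first and last occurred.
summarize_comb_timeouts() {
    awk '
        /RC: TIMEOUT/ && /TGT_MSG_TYPE__SET_COMB_STATE/ {
            id = ""
            # Locate the value that follows the "CombID:" label.
            for (i = 1; i < NF; i++) if ($i == "CombID:") id = $(i + 1)
            if (id == "") next
            if (!(id in count)) first[id] = $2   # $2 is the time of day
            count[id]++
            last[id] = $2
        }
        END {
            for (id in count)
                printf "%s timeouts=%d first=%s last=%s\n", id, count[id], first[id], last[id]
        }
    '
}

# Example usage (the file name is an assumption):
#   summarize_comb_timeouts < trc.0
```

A comb whose timeouts span tens of seconds, as in the example above, is a candidate for the condition described in this article.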
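To check for this condition, run the following command. On a healthy system, the 2nd and 3rd columns are the same and both set to 100; the output below, with 10 in the 2nd column, shows an affected system: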
scli --query_performance_parameters --all_sds --print_all | grep SDS_NUMBER_SDS_CONTROL_UMT
SDS_NUMBER_SDS_CONTROL_UMT 10 100
SDS_NUMBER_SDS_CONTROL_UMT 10 100
SDS_NUMBER_SDS_CONTROL_UMT 10 100
This will also be reported in the query_all output as follows:
grep CONTROL_UMT query_all.txt
SDS_NUMBER_SDS_CONTROL_UMT 10 100
SDS_NUMBER_SDS_CONTROL_UMT 10 100
SDS_NUMBER_SDS_CONTROL_UMT 10 100
SDS_NUMBER_SDS_CONTROL_UMT 10 100
SDS_NUMBER_SDS_CONTROL_UMT 10 100
SDS_NUMBER_SDS_CONTROL_UMT 10 100
SDS_NUMBER_SDS_CONTROL_UMT 10 100
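On systems with many SDSs, scanning this output by eye is error-prone. The check described above can be sketched as a small filter. This is an illustration under stated assumptions: the function name check_control_umt is hypothetical, and it assumes each line has the three-column layout shown above (parameter name followed by two numeric values).

```shell
# Hypothetical helper: reads SDS_NUMBER_SDS_CONTROL_UMT lines on stdin,
# counts entries whose 2nd and 3rd columns are not both 100,
# and returns a non-zero status if any such entry is found.
check_control_umt() {
    awk '
        $1 == "SDS_NUMBER_SDS_CONTROL_UMT" {
            total++
            if ($2 != 100 || $3 != 100) bad++
        }
        END {
            printf "%d of %d entries not at 100\n", bad + 0, total + 0
            exit (bad > 0 ? 1 : 0)
        }
    '
}

# Example usage with the commands from this article:
#   scli --query_performance_parameters --all_sds --print_all | check_control_umt
#   check_control_umt < query_all.txt
```

A non-zero exit status indicates that at least one SDS reports a value other than 100 and may be affected.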
Impact
I/O errors may be reported, and volumes may become read-only in severe cases.
Cause
The SDS cannot keep up with the high number of control commands because SDS_NUMBER_SDS_CONTROL_UMT is set to only 10. This can happen in two cases: the SDS performance profile is set to Compact, or the profile is set to High Performance but the system was upgraded from PowerFlex 2.x to 3.x.
The high_performance profile sets SDS_NUMBER_SDS_CONTROL_UMT to 100; however, this may be incorrectly changed to 10 when upgrading from any 2.x version to any 3.x version below 3.0.1.5/3.5.1.3.
10 is the expected setting for SDS_NUMBER_SDS_CONTROL_UMT when the compact performance profile is set (in version 3.x, the high_performance profile is the default).
Resolution
If using a high_performance SDS profile, run the following command to correct this issue:
scli --set_performance_parameters --tech --all_sds --sds_number_sds_control_umt 100
If using a compact SDS profile, change to high_performance.
Impacted Versions
All versions of 3.x below 3.0.1.5 and 3.5.1.3
Fixed In Version
3.0.1.5 and 3.5.1.3