PowerFlex: SDS Process Instability Causes I/O Error
Summary: MDM events show an SDS repeatedly disconnecting and reconnecting (repeated decoupling), and the application and SDC may report I/O errors.
Symptoms
SDS crashes cause repeated I/O errors because the SDS is not evacuated from the system even though it keeps crashing.
- SDS instability is observed in the MDM events:
# grep ee9b4eb200000002 events.txt | egrep -v "(OSC|SDC_CON|SDC_DISC)"
4284507 2020-10-26 23:38:02.330 SDS_RECONNECTED INFO SDS: sds-2li-dcwipph21v004 (ID ee9b4eb200000002) reconnected
4284546 2020-10-26 23:38:17.103 SDS_RECONNECTED INFO SDS: sds-2li-dcwipph21v004 (ID ee9b4eb200000002) reconnected
4284568 2020-10-26 23:38:33.123 SDS_RECONNECTED INFO SDS: sds-2li-dcwipph21v004 (ID ee9b4eb200000002) reconnected
4284587 2020-10-26 23:38:47.353 SDS_RECONNECTED INFO SDS: sds-2li-dcwipph21v004 (ID ee9b4eb200000002) reconnected
4284614 2020-10-26 23:39:05.371 SDS_RECONNECTED INFO SDS: sds-2li-dcwipph21v004 (ID ee9b4eb200000002) reconnected
4284635 2020-10-26 23:39:22.910 SDS_RECONNECTED INFO SDS: sds-2li-dcwipph21v004 (ID ee9b4eb200000002) reconnected
4284655 2020-10-26 23:39:58.008 SDS_RECONNECTED INFO SDS: sds-2li-dcwipph21v004 (ID ee9b4eb200000002) reconnected
4284674 2020-10-26 23:40:12.318 SDS_RECONNECTED INFO SDS: sds-2li-dcwipph21v004 (ID ee9b4eb200000002) reconnected
- The SDC also disconnects from the SDS, for example on ESX:
vmkernel.0:2020-10-27T04:01:01.193Z cpu56:66319)WARNING: [14896504445] Disconnected from SDS with ID ee9b4eb200000002
vmkernel.0:2020-10-27T04:01:02.296Z cpu32:66320)WARNING: [14896505547] Connected to SDS with ID ee9b4eb200000002
vmkernel.0:2020-10-27T04:01:18.232Z cpu35:66319)WARNING: [14896521482] Disconnected from SDS with ID ee9b4eb200000002
vmkernel.0:2020-10-27T04:01:19.332Z cpu35:66319)WARNING: [14896522582] Connected to SDS with ID ee9b4eb200000002
vmkernel.0:2020-10-27T04:01:34.769Z cpu53:66320)WARNING: [14896538017] Disconnected from SDS with ID ee9b4eb200000002
- I/O errors appear on the SDC:
2020-10-27T03:38:02.752Z cpu32:66313)WARNING: ScaleIO mapVolIO_ReportIOErrorIfNeeded:491 :[14895126141] IO-ERROR Type TEST_AND_SET. comb: 55880098015. offsetInComb 2721096. SizeInLB 1. SDS_ID 0. Comb Gen 4619. Head Gen 4b30. StartLB ad48.
2020-10-27T03:38:02.752Z cpu32:66313)WARNING: ScaleIO mapVolIO_ReportIOErrorIfNeeded:512 :Vol ID 0x735105ff0000001c. Last vol network error status NOT_CONN(4) Reason (ABORTED) RC (ABORTED) Retry count (5) chan (0)
. . .
2020-10-27T04:08:20.234Z cpu35:66313)WARNING: ScaleIO netCon_IsKaNeeded:3761 :CON 0x439dc29f6700 didn't receive message for 30 iterations. Marking as down
2020-10-27T04:08:20.234Z cpu18:66894)WARNING: ScaleIO netSock_RcvIntrn:1920 :Error: Failed to receive 128 data PTR 0x439dc29f5efc socket 0x439dc29f6418
2020-10-27T04:08:20.234Z cpu33:66806)WARNING: ScaleIO netSock_RcvIntrn:1920 :Error: Failed to receive 128 data PTR 0x439dc29f817c socket 0x439dc29f8698
2020-10-27T04:08:20.234Z cpu0:66879)WARNING: ScaleIO netSock_RcvIntrn:1920 :Error: Failed to receive 128 data PTR 0x439dc29f6a7c socket 0x439dc29f6f98
2020-10-27T04:08:20.234Z cpu23:66319)WARNING: [14896943442] Disconnected from SDS with ID ee9b4eb200000002
2020-10-27T04:08:23.246Z cpu37:65868)Res6: 2346: All helpers quiesced (12 cancelled) for vol 'SD4W21AVxFlexCU03': 1280 LFBCs, 20/1 buckets allocated (4 KB), 1 flush, 0 helpers
- The SDS may fail for different reasons. As long as the behavior is repeated disconnection and reconnection, the issue described in this KB can occur. In the example below, an NVDIMM hardware issue leads to a SIGBUS error (bad memory access) and crashes the SDS with signal 7:
exp.0:
26/10 23:37:55.305617 Termination due to signal 7. PID 2601 Faulting address 0x7efb85004000. errno 0
26/10 23:37:55.306321 Writing backtraces for all UMTs:
26/10 23:38:10.132585 Termination due to signal 7. PID 99889 Faulting address 0x7f5485004000. errno 0
26/10 23:38:10.133167 Writing backtraces for all UMTs:
messages:
Oct 26 23:37:55 dcwipph21v004 kernel: mce: Uncorrected hardware memory error in user-access at 3d84e04440
Oct 26 23:37:55 dcwipph21v004 kernel: MCE 0x3d84e04: Killing sds-3.0.1000.20:2601 due to hardware memory corruption
Oct 26 23:37:55 dcwipph21v004 kernel: MCE 0x3d84e04: dax page page recovery: Recovered
Oct 26 23:37:55 dcwipph21v004 kernel: sds-3.0.1000.20:4006 conflicting memory types 3d84e04000-3d84e05000 uncached-minus<->write-back
Oct 26 23:37:55 dcwipph21v004 kernel: reserve_memtype failed [mem 0x3d84e04000-0x3d84e04fff], track uncached-minus, req uncached-minus
Oct 26 23:37:55 dcwipph21v004 kernel: Could not invalidate pfn=0x3d84e04 from 1:1 map
Oct 26 23:37:56 dcwipph21v004 sh: abrt-dump-oops: Found oopses: 1
Oct 26 23:37:56 dcwipph21v004 sh: abrt-dump-oops: Creating problem directories
Oct 26 23:37:56 dcwipph21v004 sh: abrt-dump-oops: Not going to make dump directories world readable because PrivateReports is on
Oct 26 23:37:56 dcwipph21v004 systemd: Configuration file /opt/nsr/admin/networker.service is marked executable. Please
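The manual grep over the MDM events can be approximated with a small script. The following is a minimal sketch, not a supported tool: it assumes the event line layout shown in the sample above, and the function name `find_flapping_sds`, the 60-second window, and the threshold of three reconnects are illustrative choices rather than PowerFlex defaults.

```python
import re
from collections import defaultdict
from datetime import datetime, timedelta

# Assumed layout, taken from the sample events in this article:
# <seq> <date> <time> SDS_RECONNECTED <severity> SDS: <name> (ID <hex id>) reconnected
EVENT_RE = re.compile(
    r"^\s*\d+\s+(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3})\s+"
    r"SDS_RECONNECTED\s+\S+\s+SDS:\s+(\S+)\s+\(ID\s+([0-9a-f]+)\)"
)


def find_flapping_sds(lines, min_reconnects=3, window=timedelta(seconds=60)):
    """Map each SDS ID to its highest reconnect count inside any window.

    Only SDSs that reach min_reconnects are returned. Both thresholds are
    hypothetical values picked for illustration.
    """
    times = defaultdict(list)
    for line in lines:
        match = EVENT_RE.match(line)
        if match:
            stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S.%f")
            times[match.group(3)].append(stamp)

    flapping = {}
    for sds_id, stamps in times.items():
        stamps.sort()
        start = 0
        best = 0
        # Sliding window over the sorted reconnect timestamps.
        for end, stamp in enumerate(stamps):
            while stamp - stamps[start] > window:
                start += 1
            best = max(best, end - start + 1)
        if best >= min_reconnects:
            flapping[sds_id] = best
    return flapping


# First four events from the sample above.
SAMPLE = """
4284507 2020-10-26 23:38:02.330 SDS_RECONNECTED INFO SDS: sds-2li-dcwipph21v004 (ID ee9b4eb200000002) reconnected
4284546 2020-10-26 23:38:17.103 SDS_RECONNECTED INFO SDS: sds-2li-dcwipph21v004 (ID ee9b4eb200000002) reconnected
4284568 2020-10-26 23:38:33.123 SDS_RECONNECTED INFO SDS: sds-2li-dcwipph21v004 (ID ee9b4eb200000002) reconnected
4284587 2020-10-26 23:38:47.353 SDS_RECONNECTED INFO SDS: sds-2li-dcwipph21v004 (ID ee9b4eb200000002) reconnected
""".strip()

print(find_flapping_sds(SAMPLE.splitlines()))  # {'ee9b4eb200000002': 4}
```

To check a real system, the lines of an exported events file would be passed in place of `SAMPLE.splitlines()`.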
Cause
This issue can occur when the following sequence repeats every few seconds:
1. A hardware or software failure causes the SDS process to crash and disconnect from the MDM.
2. The SDS recovers from the crash and successfully passes the "reconfig stage", which flags the SDS as available from the MDM's point of view, and therefore from the point of view of all other system components, such as the SDC.
3. After 15 seconds (the default), the SDC retries the I/O, but in the meantime the SDS has crashed again (as described in step 1).
4. The I/O fails on timeout, and the SDC and application report an I/O error.
5. Steps 2 through 4 can repeat until the SDS is evacuated from the system.
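To see why the retry keeps landing on a failed SDS, it helps to compare the crash-and-recover cadence with the 15-second retry delay. The sketch below is only an illustration: it uses the first four SDS_RECONNECTED timestamps from the Symptoms section, the helper name `reconnect_gaps` is hypothetical, and it does not model the SDC's internal retry logic.

```python
from datetime import datetime

SDC_RETRY_SECONDS = 15  # default retry delay stated in step 3


def reconnect_gaps(timestamps):
    """Return the seconds between consecutive reconnect timestamps."""
    parsed = [datetime.strptime(t, "%Y-%m-%d %H:%M:%S.%f") for t in timestamps]
    return [round((b - a).total_seconds(), 3) for a, b in zip(parsed, parsed[1:])]


# Reconnect times of SDS ee9b4eb200000002 from the MDM events in Symptoms.
RECONNECTS = [
    "2020-10-26 23:38:02.330",
    "2020-10-26 23:38:17.103",
    "2020-10-26 23:38:33.123",
    "2020-10-26 23:38:47.353",
]

gaps = reconnect_gaps(RECONNECTS)
mean_gap = sum(gaps) / len(gaps)

print(gaps)  # [14.773, 16.02, 14.23]
print(f"mean gap {mean_gap:.1f}s vs retry delay {SDC_RETRY_SECONDS}s")
```

In this sample the SDS completes one crash-and-reconnect cycle roughly every 15 seconds, which is about the same as the retry delay, so each retry can meet an SDS that has already failed again.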
Resolution
The system is working as designed.
Option 1:
Remove the SDS from the cluster. You can remove an SDS at any time, with no downtime required. During removal, the associated data is replicated to different nodes. The removal process is asynchronous and may take a long time.
Fix the hardware or software issues that caused the SDS instability, then return the SDS to the cluster.
Option 2:
Monitor the system. If the SDS begins flapping under similar circumstances, stop the SDS service by running the following command on the SDS node:
/opt/emc/scaleio/sds/bin/delete_service.sh
After the underlying issue is fixed, start the SDS service again:
/opt/emc/scaleio/sds/bin/create_service.sh