PowerFlex SDS Process Instability Causes I/O Error
Summary: An SDS process repeatedly becomes unresponsive, causing I/O errors, until the SDS is evacuated from the system.
This article is not tied to any specific product.
Not all product versions are identified in this article.
Symptoms
In the MDM events, repeated SDS disconnections and reconnections (flapping) may be observed, and the application and SDC may report I/O errors. SDS instability is visible in the MDM events:
# grep ee9b4eb200000002 events.txt | egrep -v "(OSC|SDC_CON|SDC_DISC)"
4284507 2020-10-26 23:38:02.330 SDS_RECONNECTED INFO SDS: sds-********v004 (ID ee9b4eb200000002) reconnected
4284546 2020-10-26 23:38:17.103 SDS_RECONNECTED INFO SDS: sds-********v004 (ID ee9b4eb200000002) reconnected
4284674 2020-10-26 23:40:12.318 SDS_RECONNECTED INFO SDS: sds-********v004 (ID ee9b4eb200000002) reconnected
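The flapping pattern above can also be detected programmatically. The following is a minimal sketch, assuming MDM event lines formatted as in the excerpt (the sample lines and the `count_reconnects` helper are illustrative, not part of the product):

```python
import re
from collections import Counter

# Hypothetical sample lines in the same format as the MDM events excerpt above
events = [
    "4284507 2020-10-26 23:38:02.330 SDS_RECONNECTED INFO SDS: sds-v004 (ID ee9b4eb200000002) reconnected",
    "4284546 2020-10-26 23:38:17.103 SDS_RECONNECTED INFO SDS: sds-v004 (ID ee9b4eb200000002) reconnected",
    "4284674 2020-10-26 23:40:12.318 SDS_RECONNECTED INFO SDS: sds-v004 (ID ee9b4eb200000002) reconnected",
]

def count_reconnects(lines):
    """Count SDS_RECONNECTED events per SDS ID."""
    counts = Counter()
    for line in lines:
        if "SDS_RECONNECTED" in line:
            m = re.search(r"\(ID ([0-9a-f]+)\)", line)
            if m:
                counts[m.group(1)] += 1
    return counts

# Several reconnects for one SDS within a few minutes indicates flapping
print(count_reconnects(events))
```

In practice, the input would come from `events.txt`, and a threshold (for example, more than two reconnects in five minutes) would decide when to flag the SDS.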
SDC disconnects from SDS, for example from ESXi:
vmkernel.0:2020-10-27T04:01:01.193Z cpu56:66319)WARNING: [14896504445] Disconnected from SDS with ID ee9b4eb200000002
vmkernel.0:2020-10-27T04:01:02.296Z cpu32:66320)WARNING: [14896505547] Connected to SDS with ID ee9b4eb200000002
vmkernel.0:2020-10-27T04:01:18.232Z cpu35:66319)WARNING: [14896521482] Disconnected from SDS with ID ee9b4eb200000002
vmkernel.0:2020-10-27T04:01:19.332Z cpu35:66319)WARNING: [14896522582] Connected to SDS with ID ee9b4eb200000002
vmkernel.0:2020-10-27T04:01:34.769Z cpu53:66320)WARNING: [14896538017] Disconnected from SDS with ID ee9b4eb200000002
An I/O error appears on the SDC:
2020-10-27T03:38:02.752Z cpu32:66313)WARNING: ScaleIO mapVolIO_ReportIOErrorIfNeeded:491 :[14895126141] IO-ERROR Type TEST_AND_SET. comb: 55880098015. offsetInComb 2721096. SizeInLB 1. SDS_ID 0. Comb Gen 4619. Head Gen 4b30. StartLB ad48.
2020-10-27T03:38:02.752Z cpu32:66313)WARNING: ScaleIO mapVolIO_ReportIOErrorIfNeeded:512 :Vol ID 0x735105ff0000001c. Last vol network error status NOT_CONN(4) Reason (ABORTED) RC (ABORTED) Retry count (5) chan (0)
...
2020-10-27T04:08:20.234Z cpu35:66313)WARNING: ScaleIO netCon_IsKaNeeded:3761 :CON 0x439dc29f6700 didn't receive message for 30 iterations. Marking as down
2020-10-27T04:08:20.234Z cpu18:66894)WARNING: ScaleIO netSock_RcvIntrn:1920 :Error: Failed to receive 128 data PTR 0x439dc29f5efc socket 0x439dc29f6418
2020-10-27T04:08:20.234Z cpu33:66806)WARNING: ScaleIO netSock_RcvIntrn:1920 :Error: Failed to receive 128 data PTR 0x439dc29f817c socket 0x439dc29f8698
2020-10-27T04:08:20.234Z cpu0:66879)WARNING: ScaleIO netSock_RcvIntrn:1920 :Error: Failed to receive 128 data PTR 0x439dc29f6a7c socket 0x439dc29f6f98
2020-10-27T04:08:20.234Z cpu23:66319)WARNING: [14896943442] Disconnected from SDS with ID ee9b4eb200000002
2020-10-27T04:08:23.246Z cpu37:65868)Res6: 2346: All helpers quiesced (12 cancelled) for vol 'SD4W21AVxFlexCU03': 1280 LFBCs, 20/1 buckets allocated (4 KB), 1 flush, 0 helpers
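The volume ID and exhausted retry count in these lines identify which volume is affected. A small sketch of extracting them, assuming the vmkernel line format shown above (the `parse_io_error` helper is illustrative):

```python
import re

# Hypothetical sample in the same format as the vmkernel excerpt above
line = ("2020-10-27T03:38:02.752Z cpu32:66313)WARNING: ScaleIO "
        "mapVolIO_ReportIOErrorIfNeeded:512 :Vol ID 0x735105ff0000001c. "
        "Last vol network error status NOT_CONN(4) Reason (ABORTED) "
        "RC (ABORTED) Retry count (5) chan (0)")

def parse_io_error(line):
    """Extract the volume ID and retry count from an SDC I/O-error log line."""
    vol = re.search(r"Vol ID (0x[0-9a-f]+)", line)
    retries = re.search(r"Retry count \((\d+)\)", line)
    if vol and retries:
        return vol.group(1), int(retries.group(1))
    return None

print(parse_io_error(line))  # ('0x735105ff0000001c', 5)
```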
If repeated SDS disconnections and reconnections occur, the issue described in this KB is probably occurring. In the example below, an NVDIMM hardware (HW) issue leads to a SIGBUS error (bad memory access), which crashes the SDS with signal 7. In exp.0:
26/10 23:37:55.305617 Termination due to signal 7. PID 2601 Faulting address 0x7efb85004000. errno 0
26/10 23:37:55.306321 Writing backtraces for all UMTs:
26/10 23:38:10.132585 Termination due to signal 7. PID 99889 Faulting address 0x7f5485004000. errno 0
26/10 23:38:10.133167 Writing backtraces for all UMTs:
Messages:
Oct 26 23:37:55 kernel: mce: Uncorrected hardware memory error in user-access at 3d84e04440
Oct 26 23:37:55 kernel: MCE 0x3d84e04: Killing sds-3.0.1000.20:2601 due to hardware memory corruption
Oct 26 23:37:55 kernel: MCE 0x3d84e04: dax page page recovery: Recovered
Oct 26 23:37:55 kernel: sds-3.0.1000.20:4006 conflicting memory types 3d84e04000-3d84e05000 uncached-minus<->write-back
Oct 26 23:37:55 kernel: reserve_memtype failed [mem 0x3d84e04000-0x3d84e04fff], track uncached-minus, req uncached-minus
Oct 26 23:37:55 kernel: Could not invalidate pfn=0x3d84e04 from 1:1 map
Oct 26 23:37:56 sh: abrt-dump-oops: Found oopses: 1
Oct 26 23:37:56 sh: abrt-dump-oops: Creating problem directories
Oct 26 23:37:56 sh: abrt-dump-oops: Not going to make dump directories world readable because PrivateReports is on
Oct 26 23:37:56 systemd: Configuration file /opt/nsr/admin/networker.service is marked executable.
Cause
1. A software (SW) or hardware (HW) failure causes the SDS process to become unresponsive and disconnect from the MDM.
2. The SDS recovers from the crash and passes the "re-configuration stage," which marks the SDS as generally available from the MDM's point of view and to all other system components, including the SDC.
3. After 15 seconds (the default), the SDC retries the I/O; meanwhile, the SDS has become unresponsive again, as described in step 1.
4. The I/O fails upon timeout, and the application on the SDC reports an I/O error.
5. Steps 2 through 4 may repeat until the SDS is evacuated from the system.
Resolution
The system is working as designed.
Option 1:
Remove the SDS from the cluster. You can remove an SDS at any time, with no downtime required. During removal, the associated data is replicated to different nodes. The removal process is asynchronous and may take a long time.
Note: If volumes use the capacity of this SDS, and the capacity cannot be replaced due to lack of available free space, the removal fails.
Fix the HW and SW issues that caused the SDS instability and return the SDS to the cluster.
Option 2:
Monitor the system, and if the SDS begins flapping again under similar circumstances, stop the SDS service by running the following command on the SDS:
/opt/emc/scaleio/sds/bin/delete_service.sh
Note: Stopping the SDS service triggers a rebuild. Once the issue is resolved, restart the SDS service by running the following command on the SDS:
/opt/emc/scaleio/sds/bin/create_service.sh
Additional Information
Resiliency to this type of event is planned for PowerFlex Software version 4.0.
Affected Products
PowerFlex rack, VxRack
Article Properties
Article Number: 000181511
Article Type: Solution
Last Modified: 30 Oct 2024
Version: 2