PowerFlex SDS panic caused by Linux Kernel bug
Summary: A Linux kernel bug that affects only Intel Haswell CPUs can cause an SDS to panic. A single SDS panic can result in Data Unavailability (DU), and long I/O service times on the affected SDS cause SDC I/O failures.
Symptoms
Scenario
- Intel Haswell CPU is being used.
- One of the SDSs reports a "data degraded" state and SDCs lose their connection to volumes for no obvious reason.
- SDS panic
Symptoms
- ScaleIO system events report "data degraded":
205466 2015-12-10 08:11:49.450 MDM_DATA_DEGRADED ERROR The system is now in DEGRADED state.
205468 2015-12-10 08:12:04.688 MDM_DATA_DEGRADED ERROR The system is now in DEGRADED state.
205470 2015-12-10 08:12:06.699 MDM_DATA_DEGRADED ERROR The system is now in DEGRADED state.
205472 2015-12-10 08:12:16.931 MDM_DATA_DEGRADED ERROR The system is now in DEGRADED state.
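To check how often the degraded event repeats in a saved event listing, a small parser can help. The following is a minimal sketch, not a ScaleIO tool: it assumes each event line follows the layout of the sample above (event ID, date, time, event name, severity, message), and the names `parse_events` and `count_by_name` are hypothetical helpers introduced for illustration.

```python
import re
from collections import Counter

# Assumed line layout, inferred from the sample events in this article:
# <event id> <date> <time> <event name> <severity> <message>
EVENT_RE = re.compile(
    r"^(?P<id>\d+)\s+"
    r"(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3})\s+"
    r"(?P<name>\S+)\s+(?P<severity>\S+)\s+(?P<message>.*)$"
)


def parse_events(text):
    """Return one dict per line that matches the assumed event layout."""
    events = []
    for line in text.splitlines():
        match = EVENT_RE.match(line.strip())
        if match:
            events.append(match.groupdict())
    return events


def count_by_name(events):
    """Count how many times each event name occurs."""
    return Counter(event["name"] for event in events)


# Sample lines copied from this article.
SAMPLE = """\
205466 2015-12-10 08:11:49.450 MDM_DATA_DEGRADED ERROR The system is now in DEGRADED state.
205468 2015-12-10 08:12:04.688 MDM_DATA_DEGRADED ERROR The system is now in DEGRADED state.
205470 2015-12-10 08:12:06.699 MDM_DATA_DEGRADED ERROR The system is now in DEGRADED state.
205472 2015-12-10 08:12:16.931 MDM_DATA_DEGRADED ERROR The system is now in DEGRADED state.
"""

if __name__ == "__main__":
    events = parse_events(SAMPLE)
    print(count_by_name(events)["MDM_DATA_DEGRADED"])
```

Lines that do not match the assumed layout are skipped rather than reported, so the count is a lower bound if the real listing uses a different format.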
- SDS exp.0:
10/12 02:13:14.134144 Panic in file /emc/svc_flashbld/workspace/ScaleIO-SLES12/src/tgt/ioh/ioh.c, line 70, function iohIo_TimerExpired, PID 22333.
Panic Expression !(1).
/opt/emc/scaleio/sds/bin/sds-1.32.3455.5(mosDbg_BackTrace+0x22) [0x479ba9]
/opt/emc/scaleio/sds/bin/sds-1.32.3455.5(mosDbg_Panic+0xf0) [0x4740ad]
/opt/emc/scaleio/sds/bin/sds-1.32.3455.5(iohIo_TimerExpired+0x5d) [0x43d92d]
/opt/emc/scaleio/sds/bin/sds-1.32.3455.5(mosTimerQ_PollUnlocked+0x1b4) [0x46f6e3]
/opt/emc/scaleio/sds/bin/sds-1.32.3455.5(mosTimer_PollQRange+0x83) [0x46fa6c]
/opt/emc/scaleio/sds/bin/sds-1.32.3455.5(netPoll_StartIntr+0x2ef) [0x465808]
/opt/emc/scaleio/sds/bin/sds-1.32.3455.5(mosUmt_StartFunc+0xbe) [0x47f07d]
/opt/emc/scaleio/sds/bin/sds-1.32.3455.5(mosUmt_SignalHandler+0x4a) [0x47fa3a]
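To compare a panic found in another exp file against the signature in this article, the panic location and the frame symbols can be extracted with two regular expressions. This is a rough sketch that assumes the panic text has the same shape as the excerpt above; `parse_panic` is a hypothetical helper name, not part of any ScaleIO tooling.

```python
import re

# Patterns inferred from the exp.0 excerpt in this article (an assumption,
# not a documented format).
PANIC_RE = re.compile(
    r"Panic in file (?P<file>\S+), line (?P<line>\d+), "
    r"function (?P<function>\w+), PID (?P<pid>\d+)"
)
FRAME_RE = re.compile(
    r"\((?P<symbol>\w+)\+0x[0-9a-fA-F]+\)\s+\[0x[0-9a-fA-F]+\]"
)


def parse_panic(text):
    """Return panic location and frame symbols, or None if no panic is found."""
    match = PANIC_RE.search(text)
    if match is None:
        return None
    info = match.groupdict()
    info["line"] = int(info["line"])
    info["pid"] = int(info["pid"])
    # Frame symbols in the order they appear in the trace.
    info["frames"] = [m.group("symbol") for m in FRAME_RE.finditer(text)]
    return info


# Shortened sample copied from this article (first three frames only).
SAMPLE = (
    "10/12 02:13:14.134144 Panic in file "
    "/emc/svc_flashbld/workspace/ScaleIO-SLES12/src/tgt/ioh/ioh.c, line 70, "
    "function iohIo_TimerExpired, PID 22333.Panic Expression !(1). "
    "/opt/emc/scaleio/sds/bin/sds-1.32.3455.5(mosDbg_BackTrace+0x22) [0x479ba9] "
    "/opt/emc/scaleio/sds/bin/sds-1.32.3455.5(mosDbg_Panic+0xf0) [0x4740ad] "
    "/opt/emc/scaleio/sds/bin/sds-1.32.3455.5(iohIo_TimerExpired+0x5d) [0x43d92d]"
)

if __name__ == "__main__":
    info = parse_panic(SAMPLE)
    print(info["function"], info["line"], info["frames"])
```

A match on the function `iohIo_TimerExpired` alone does not prove the same root cause; it only indicates that the panic was raised from the same I/O timer path shown here.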
Impact
- Data unavailable
- SDC lost connection to volumes.
- I/O failure
- Long I/O service/performance degradation
Cause
Because of a Linux kernel bug, the SDS process behaved abnormally: it was under stress and its behavior was unpredictable.
The SDS still replied to keepalive requests, but it was not fully functional and did not respond to SDC I/O requests.
Because the keepalives were still answered, ScaleIO could not mark the SDS as failed, which eventually led to data unavailability.
- Linux Kernel bug information:
Futex: Fix a race condition between REQUEUE_PI and task death (bcn #851603 (futex scalability series)).
Futex: Ensure get_futex_key_refs() always implies a barrier (bcn #851603 (futex scalability series)).
- For more information, see the following links:
SUSE: SUSE-SU-2015:0068-1
Red Hat: Serious Red Hat Linux Bug Affects Haswell-based Servers - InfoQ
Resolution
Workaround
- Upgrade the Linux kernel to a version that includes the futex fixes listed under Cause.
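Before planning the upgrade, it can help to confirm that a host actually has a Haswell CPU and to record the kernel release it is running. The sketch below is an illustration under stated assumptions: it reads `/proc/cpuinfo` on a Linux host, treats Intel family 6 models 0x3C, 0x3F, 0x45, and 0x46 as Haswell, and uses hypothetical helper names (`first_cpu_fields`, `is_haswell`). It does not decide whether a given kernel contains the fix; compare the printed release against the vendor advisory.

```python
import platform

# Intel family 6 model numbers for Haswell parts
# (0x3C client, 0x3F server, 0x45 low-power, 0x46 with GT3e graphics).
HASWELL_MODELS = {0x3C, 0x3F, 0x45, 0x46}


def first_cpu_fields(cpuinfo_text):
    """Return the first value seen for each key in /proc/cpuinfo-style text."""
    fields = {}
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            # setdefault keeps the first processor's value for each key.
            fields.setdefault(key.strip(), value.strip())
    return fields


def is_haswell(cpuinfo_text):
    """True if the first listed CPU is an Intel family 6 Haswell model."""
    fields = first_cpu_fields(cpuinfo_text)
    family = fields.get("cpu family", "")
    model = fields.get("model", "")
    if fields.get("vendor_id") != "GenuineIntel":
        return False
    if not (family.isdigit() and model.isdigit()):
        return False
    return int(family) == 6 and int(model) in HASWELL_MODELS


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as handle:
            cpuinfo = handle.read()
    except OSError:
        cpuinfo = ""  # not a Linux host, or /proc is unavailable
    print("Haswell CPU:", is_haswell(cpuinfo))
    print("Running kernel:", platform.release())
```

Only the first processor entry is inspected, which is adequate for hosts where all sockets carry the same CPU model.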
Affected Products
PowerFlex rack, ScaleIO
Article Properties
Article Number: 000281636
Article Type: Solution
Last Modified: 06 Feb 2025
Version: 1