Isilon: Increasing the drive stall timeout values in OneFS
Summary: Many OneFS versions have obsolete timeout values configured for the drive stall timer. Increasing these values can help prevent unnecessary drive stall events.
Symptoms
In OneFS, a drive is considered stalled if either of the following conditions is detected:
- A specific drive transaction takes longer than a certain amount of time to complete.
- 50 of the last 1300 input/output operations (I/Os) took longer than a certain amount of time to complete.
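The two conditions above can be modeled with a short sketch. This is an illustration only, not OneFS code: the function name stall_check and both threshold arguments are hypothetical, and the 50-of-1300 window is the only figure taken from this article. It reads one I/O latency in microseconds per line and reports whether either condition was ever met.

```shell
#!/bin/sh
# Illustrative model only; this is not how OneFS is implemented.
# Usage: stall_check SINGLE_LIMIT_USEC SLOW_LIMIT_USEC < latencies
# Both limits are hypothetical; the article does not state the real values.
stall_check() {
    awk -v single_limit="$1" -v slow_limit="$2" -v max_slow=50 -v window=1300 '
        {
            # Condition 1: a single transaction exceeds its time limit.
            if ($1 > single_limit) { stalled = 1 }

            # Condition 2: at least max_slow of the last `window` I/Os were slow.
            slot = NR % window
            if (NR > window && slow[slot]) { slow_count-- }  # expire oldest sample
            slow[slot] = ($1 > slow_limit)
            if (slow[slot]) { slow_count++ }
            if (slow_count >= max_slow) { stalled = 1 }
        }
        END { print (stalled ? "stalled" : "ok") }
    '
}
```

For example, `printf '%s\n' 100 900000 300 | stall_check 800000 500000` reports a stall because one sample exceeds the (hypothetical) single-transaction limit.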
When a drive stall occurs, OneFS tries to avoid sending operations to that drive and prefers other drives in the system. The intent is to improve performance by steering I/O away from a drive that is already overloaded, but it has an adverse effect if the drive was marked as stalled unnecessarily. Drive stalls can impact latency-sensitive workflows. They also typically cause group changes, which can in turn start restripe jobs unnecessarily.
Cause
The previously set limits were appropriate for the older generations of Isilon hardware that had different performance and throughput characteristics from modern hardware. As both node and hard drive technology changed over time, these values were determined to be excessively strict on newer node types, causing stall events to be triggered unnecessarily.
Resolution
NOTE: OneFS 8.0 already has the correct values configured by default, so this procedure is not necessary on OneFS versions 8.0.0.0 and up.
- Log in to the node's Command Line Interface using an SSH client, or establish a serial connection to the node.
- Run the following command to back up the /etc/mcp/override/sysctl.conf file:
# touch /etc/mcp/override/sysctl.conf && cp /etc/mcp/override/sysctl.conf /etc/mcp/override/sysctl.conf.bku1
- Run the following command to change the drive stall timeout value to 3500000 microseconds (3.5 seconds) on all nodes in the cluster:
# isi_sysctl_cluster hw.disk_event.thresh.slowacc_usec=3500000
- Run the following command to verify that the value is now set correctly:
# isi_for_array -s sysctl hw.disk_event.thresh.slowacc_usec
Output similar to the following should appear, with one line per node:
clustername-1: hw.disk_event.thresh.slowacc_usec: 3500000
clustername-2: hw.disk_event.thresh.slowacc_usec: 3500000
clustername-3: hw.disk_event.thresh.slowacc_usec: 3500000
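On larger clusters, reading the output by eye is error-prone. The following is a minimal sketch of a check that can be piped after the verification command. The function name check_slowacc is hypothetical, and the sketch assumes the output keeps the "node: sysctl: value" format shown above.

```shell
#!/bin/sh
# Hypothetical helper: reads the verification output on stdin and exits
# non-zero if any node reports a value other than the expected one
# (first argument, default 3500000), or if no matching lines are found.
check_slowacc() {
    awk -v expected="${1:-3500000}" '
        /hw\.disk_event\.thresh\.slowacc_usec:/ {
            seen++
            node = $1; sub(/:$/, "", node)   # strip trailing colon from node name
            if ($NF != expected) { print "mismatch on " node ": " $NF; bad++ }
        }
        END {
            if (seen == 0) { print "no sysctl lines found"; exit 1 }
            exit (bad > 0)
        }
    '
}
```

Example usage, reusing the command from the previous step: `isi_for_array -s sysctl hw.disk_event.thresh.slowacc_usec | check_slowacc 3500000`. A zero exit status with no output means every node reported the expected value.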
Affected Products
Isilon, PowerScale OneFS
Article Properties
Article Number: 000052229
Article Type: Solution
Last Modified: 28 Jun 2023
Version: 6