Isilon: Increasing the drive stall timeout values in OneFS

Summary: OneFS versions earlier than 8.0 have an obsolete timeout value configured for the drive stall timer. Increasing this value helps prevent unnecessary drive stall events.

Symptoms

In OneFS, a drive is considered stalled if either of the following conditions is detected:
  • A specific drive transaction takes longer than a certain amount of time to complete.
  • 50 of the last 1300 input/output operations (I/Os) took longer than a certain amount of time to complete.
In OneFS versions prior to 8.0, these thresholds were set to 1.5 s (single transaction) and 150 ms (50 of the last 1300 I/Os), respectively. EMC Isilon Engineering has determined that the 1.5 s value is no longer appropriate for newer hardware running any version of OneFS and should be changed to 3.5 s. Making this change helps prevent the system from generating unnecessary drive stall messages.
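
The two conditions above can be illustrated with a small shell sketch. This is not OneFS code: the function name check_stall, the one-latency-per-line input format, and the sliding-window bookkeeping are assumptions made for illustration only. The thresholds (1.5 s or 3.5 s for a single transaction, 150 ms for 50 of the last 1300 I/Os) are the ones described in this article.

```shell
#!/bin/sh
# Hypothetical sketch of the two stall conditions; not the OneFS implementation.
# Reads one I/O latency (in microseconds) per line on stdin and prints
# "stalled" if either condition was ever met, otherwise "ok".

check_stall() {
    # $1: single-transaction limit in usec (1500000 before the change, 3500000 after)
    # $2: slow-I/O limit in usec (150000, that is, 150 ms)
    awk -v single="$1" -v slow="$2" -v window=1300 -v max_slow=50 '
        {
            lat = $1 + 0

            # Condition 1: one transaction took longer than the single-transaction limit.
            if (lat > single) { stalled = 1 }

            # Condition 2: count slow I/Os among the last 1300 operations (ring buffer).
            slot = NR % window
            if (NR > window && ring[slot] == 1) { slow_count-- }   # evict the oldest entry
            ring[slot] = (lat > slow) ? 1 : 0
            slow_count += ring[slot]
            if (slow_count >= max_slow) { stalled = 1 }
        }
        END { print (stalled ? "stalled" : "ok") }
    '
}
```

For example, a single 2 s transaction trips the old 1.5 s limit but not the new 3.5 s limit:

```shell
printf '%s\n' 2000000 | check_stall 1500000 150000   # prints "stalled"
printf '%s\n' 2000000 | check_stall 3500000 150000   # prints "ok"
```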

When a drive stall occurs, the system attempts to prevent operations from being sent to that drive, preferring other drives in the system. This behavior is intended to improve performance by avoiding operations on drives that are already overloaded, but it can have an adverse effect if the drive was marked as stalled unnecessarily. Latency-sensitive workflows can be impacted by drive stalls. Typically, drive stalls also cause group changes, which may in turn lead to restripe jobs being initiated unnecessarily.

Cause

The previously set limits were appropriate for the older generations of Isilon hardware that had different performance and throughput characteristics from modern hardware. As both node and hard drive technology changed over time, these values were determined to be excessively strict on newer node types, causing stall events to be triggered unnecessarily.

Resolution

NOTE: OneFS 8.0 already has the correct values configured by default, so this procedure is not necessary on OneFS versions 8.0.0.0 and later.
  • Log in to the node's Command Line Interface using an SSH client, or establish a serial connection to the node.
  • Run the following command to back up the /etc/mcp/override/sysctl.conf file:
# touch /etc/mcp/override/sysctl.conf && cp /etc/mcp/override/sysctl.conf /etc/mcp/override/sysctl.conf.bku1
  • Run the following command to change the drive stall timeout value:
# isi_sysctl_cluster hw.disk_event.thresh.slowacc_usec=3500000
  • Run the following command to verify that the value is now set correctly:
# isi_for_array -s sysctl hw.disk_event.thresh.slowacc_usec

Output similar to the following should appear:
 
clustername-1: hw.disk_event.thresh.slowacc_usec: 3500000
clustername-2: hw.disk_event.thresh.slowacc_usec: 3500000
clustername-3: hw.disk_event.thresh.slowacc_usec: 3500000
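
On larger clusters, the per-node output can be checked with a short filter rather than by eye. The following is a minimal sketch, assuming the output format shown above (node name, sysctl name, value, separated by colons). The helper name check_slowacc is hypothetical and is not part of OneFS.

```shell
#!/bin/sh
# Hypothetical helper (not a OneFS tool): reads "node: name: value" lines on
# stdin and reports any node whose value differs from the expected one.
# Exit status: 0 if all nodes match, 1 on any mismatch, 2 if no lines matched.

check_slowacc() {
    expected="${1:-3500000}"   # 3.5 s expressed in microseconds
    awk -v want="$expected" '
        /hw\.disk_event\.thresh\.slowacc_usec/ {
            seen++
            node = $1
            sub(/:$/, "", node)              # strip the trailing colon from the node name
            if ($NF != want) {
                printf "MISMATCH %s: %s\n", node, $NF
                bad++
            }
        }
        END {
            if (seen == 0) { print "no matching lines found"; exit 2 }
            if (bad > 0) exit 1
            print "all " seen " nodes report " want
        }
    '
}
```

It could be fed directly from the verification command in the previous step:

```shell
isi_for_array -s sysctl hw.disk_event.thresh.slowacc_usec | check_slowacc 3500000
```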

Affected Products

Isilon, PowerScale OneFS

Article Properties

Article Number: 000052229
Article Type: Solution
Last Modified: 28 Jun 2023
Version: 6