PowerScale OneFS: Troubleshooting Performance Issues

Summary: Troubleshoot PowerScale OneFS slow performance with our comprehensive guide on network configuration, processing loads, and monitoring with InsightIQ for improved cluster efficiency.

This article is not tied to any specific product. Not all product versions are identified in this article.

Symptoms

Client computers perform slowly. Jobs running on the cluster either fail or take longer than expected.

Cause

Performance issues are typically due to network traffic, network configuration issues, client or cluster processing load, or a combination thereof. This article describes several effective ways to troubleshoot performance issues.

Resolution


Table of Contents:

  • Using Isilon InsightIQ
  • Troubleshooting without InsightIQ
  • Network throughput
  • Distribution of client connections
  • SmartConnect
  • Cluster throughput
  • Cluster processing
  • Queued operations
  • CPU


Using Isilon InsightIQ

Using Isilon InsightIQ is the best way to monitor performance and to troubleshoot performance issues.

The Isilon InsightIQ virtual appliance enables you to monitor and analyze Isilon cluster activity through flexible, customizable chart views in the InsightIQ web-based application. These charts provide detailed information about cluster hardware, software, and file system and protocol operations. InsightIQ transforms data into visual information that highlights performance outliers, enabling you to quickly diagnose bottlenecks and optimize workflows.

For details on using InsightIQ, see the InsightIQ User Guide.


Troubleshooting without InsightIQ

If you are not using InsightIQ, you can run various commands to investigate performance issues. Troubleshoot performance issues first by examining network and cluster throughput, then by examining cluster processing, and finally by examining individual node CPU rates.


Network throughput

Use a network testing tool such as Iperf to determine the throughput capabilities of the cluster and client computers on your network.

Using Iperf, run the following commands on the cluster and client. These commands define a window size that is large enough to reveal if the network link is a potential cause of latency issues.

  • Cluster:
iperf -s -w 262144
  • Client:
iperf -c <cluster IP> -w 262144
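
If a single stream looks healthy but clients still report slow transfers, you can optionally stress the link with several parallel streams. A minimal sketch using standard iperf (version 2) options; the stream count, duration, and reporting interval are arbitrary:

  iperf -c <cluster IP> -w 262144 -P 4 -t 30 -i 5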


Distribution of client connections

Check how many NFS and SMB clients are connected to the cluster to ensure that the connections are not concentrated on one node. (A monitoring loop sketch follows the steps below.)

  1. Open an SSH connection on any node in the cluster and log in using the "root" account.
  2. Run the following command to check NFS clients:
    isi statistics query --nodes=all --stats=node.clientstats.connected.nfs,node.clientstats.active.nfs
    The output displays the number of clients connected per node and how many of those clients are active on each node.
  3. Run the following command to check SMB clients:
    isi statistics query --nodes=all --stats=node.clientstats.connected.smb,node.clientstats.active.smb1,node.clientstats.active.smb2
    The output displays the number of clients connected per node and how many of those clients are active on each node.
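
Client counts can shift as users connect and disconnect, so it can help to watch the distribution over a few minutes. A minimal sketch that repeats the NFS query every 10 seconds (the interval is arbitrary; press Ctrl-C to stop):

  while true; do
      date
      isi statistics query --nodes=all --stats=node.clientstats.connected.nfs,node.clientstats.active.nfs
      sleep 10
  done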


SmartConnect

Check that the node hosting the SmartConnect service IP address is not burdened with network traffic.

  1. Open an SSH connection on any node in the cluster and log in using the "root" account.
  2. Run the following command:
    isi_for_array -sq 'ifconfig|grep em -A3'
    The output displays a list of all the IP addresses that are bound to the external interface.
  3. Check for a node that has one more IP address bound than the rest; that node is hosting the SmartConnect service IP address. (A counting sketch follows these steps.)
  4. Check the status of the nodes that you noticed in step 3 by running the following command:
    isi status
    Check the throughput column of the output to determine the load of the nodes noticed in step 3.
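
To make the extra address easier to spot, you can count the addresses bound to each node's external interfaces. A minimal sketch that builds on the command in step 2 (the interface match on em is carried over from that step):

  isi_for_array -sq 'ifconfig | grep em -A3 | grep -c "inet "'

The node that reports a count one higher than its peers is the node currently hosting the SmartConnect service IP address.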


Cluster throughput

Assess cluster throughput by conducting write and read tests that measure the amount of time it takes to read from and write to a file. Conduct at least one write test and one read test, as follows.

Write test

  1. Open an SSH connection on any node in the cluster and log in using the "root" account.
  2. Change to the /ifs directory:
    cd /ifs
  3. From the command-line interface (CLI) on the cluster or from a UNIX or Linux client computer, use the dd command to write a new file to the cluster. Run the following command:
    dd if=/dev/zero of=1GBfile bs=1024k count=1024
    This command creates a sample 1GB file and reports the amount of time it took to write it to disk.
  4. From the output of this command, extrapolate how many MB per second can be written to disk in a single-stream workflow. (A parallel-write sketch follows this list.)
  5. If you have a Mac client and want to conduct further analysis:
    1. Start Activity Monitor.
    2. Run the following command, where pathToFile is the path of the file to write on the mounted cluster share:
      cat /dev/zero > /pathToFile
      This command helps measure the throughput of write operations on the Isilon cluster. (Although it is possible to run the dd command from a Mac client, results can be inconsistent.)
    3. Monitor the results of the command in the Activity Monitor's Network tab.

Read test
When measuring the throughput of read operations, be sure not to conduct read tests on the file that you created during the write test. Because that file has been cached, the results of your read tests would be inaccurate. Instead, test a read operation of a file that has not been cached. Find a file on the cluster that is larger than 1GB, and reference that file in the read test.
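
If you need to locate a suitable file, a minimal sketch (the starting path is an assumption; point it at a directory that holds real data, and if your version of find does not accept the G suffix, use -size +2097152 to express 1GB in 512-byte blocks):

  find /ifs/data -type f -size +1G | head -n 1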

  1. Open an SSH connection on any node in the cluster and log in using the "root" account.
  2. From the CLI on the cluster or from a UNIX or Linux client computer, use the dd command to read a file on the cluster. Run the following command, where pathToLargeFile is the file path of the targeted file:
    dd if=/pathToLargeFile of=/dev/null bs=1024k
    This command reads the targeted file and reports the amount of time it took to read it.
  3. If you have a Mac client and want to conduct further analysis:
    1. Start Activity Monitor.
    2. Run the following command, where pathToLargeFile is the file path of the targeted file:
      time cp /pathToLargeFile /dev/null
      This command helps measure the throughput of read operations on the Isilon cluster. (Although it is possible to run the dd command from a Mac client, results can be inconsistent.)
    3. Monitor the results of the command in the Activity Monitor's Network tab.


Cluster processing

Restripe jobs
Before examining the cluster's input/output operations per second (IOPS):

  • Determine which jobs are running on the cluster. If restripe jobs such as AutoBalance, Collect, or MultiScan are running, consider why those jobs are running and whether they should continue to run.
  • Consider the type of data being consumed. If client computers are working with large video files or virtual machines (VMs), the restripe job requires more disk IOPS than normal.
  • Consider temporarily pausing a restripe job. Doing so can significantly improve performance and might be a viable short-term solution to a performance issue. (A CLI sketch follows this list.)
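
A minimal sketch of pausing and later resuming a restripe job from the CLI; the exact isi job syntax varies by OneFS release, so confirm it against the documentation for your version:

  isi job jobs list
  isi job jobs pause <job ID>
  isi job jobs resume <job ID>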

Disk I/O
Examining disk I/O can help determine if certain disks are being overused.

By cluster

  1. Open an SSH connection on any node in the cluster and log in using the "root" account.
  2. Run the following command to ascertain disk I/O:
    isi statistics pstat
  3. From the output of this command, divide the disk IOPS by the total number of disks in the cluster. For example, for an 8-node cluster using Isilon IQ 12000x nodes, which host 12 drives per node, you divide the disk IOPS by 96.

    For X-Series nodes and NL-Series nodes, you should expect to see disk IOPS of 70 or less for 100% random workflows, or disk IOPS of 140 or less for 100% sequential workflows. Because NL-Series nodes have less RAM and lower CPU speeds than X-Series nodes, X-Series nodes can handle higher disk IOPS.
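
    As a hypothetical worked example, if isi statistics pstat reports roughly 6,000 total disk IOPS on that 96-drive cluster, the per-disk rate is 6000 / 96 ≈ 62 IOPS, which is within the 70-IOPS guideline for a fully random workflow.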

By node and by disk

  1. Open an SSH connection on any node in the cluster and log in using the "root" account.
  2. Run the following command to ascertain disk IOPS by node, which can help discover disks that are overused:
    isi statistics query --nodes=all --stats=node.disk.xfers.rate.sum --top
  3. Run the following command to determine how to query for statistics on a per disk basis:
    isi statistics describe --stats=all | grep disk
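
Once you identify a per-disk statistic from the describe output, you can query it directly. A minimal sketch; the key name node.disk.xfers.rate.all is an assumption based on the per-node key in step 2, so substitute whichever per-disk key the describe command lists:

  isi statistics query --nodes=all --stats=node.disk.xfers.rate.all --top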
    


Queued operations

Another way to determine if disks are being overused is to check how many operations are queued for each disk in the cluster. For a single-stream SMB workflow, a queue depth of 4 can indicate an issue, while high-concurrency NFS namespace operations can tolerate a deeper queue.

  1. Open an SSH connection on any node in the cluster and log in using the "root" account.
  2. Run the following command to determine how many operations are queued for each disk in the cluster:
    isi_for_array -s sysctl hw.iosched | grep total_inqueue
  3. Determine the latency caused by the queued operations:
    sysctl -aN hw.iosched|grep bios_inqueue|xargs sysctl -D


CPU

CPU issues are frequently traced to the operations that clients perform on the cluster. Using the isi statistics command, you can determine which operations are being performed on the cluster, cataloged by either network protocol or client computer. (A client-view sketch follows the steps below.)

  1. Open an SSH connection on any node in the cluster and log in using the "root" account.
  2. Run the following command to determine which operations are being performed across the network and assess which of those operations are taking the most time:
    isi statistics protocol --orderby=TimeAvg --top
    The output of this command gives detailed statistics for all network protocols, ordered by how long the cluster takes to respond to clients. Although the output might not identify the single slowest operation, it can point you in the right direction.
  3. Run the following command to obtain more information about CPU processing, such as which nodes' CPUs are the most heavily used:
    isi statistics system --top
  4. Run the following command to obtain the four processes on each node that are consuming the most CPU resources:
    isi_for_array -sq 'top -d1|grep PID -A4'
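
The protocol view can be complemented with a per-client view to see which client computers are generating the slow operations. A minimal sketch; the flags mirror the protocol command in step 2, and the availability of the client view and its TimeAvg column is an assumption to verify against your OneFS version:

  isi statistics client --orderby=TimeAvg --top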

Affected Products

PowerScale, PowerScale OneFS

Products

Isilon, PowerScale OneFS
Article Properties
Article Number: 000015384
Article Type: Solution
Last Modified: 30 Jan 2025
Version:  11