
InsightIQ - Analyzing IOPS

March 13th, 2017 12:00

Community,

We have a script running on the cluster that sends us an email when the disks on the Isilon NAS are experiencing high disk IOPS.  I am trying to use InsightIQ to analyze the IOPS and wanted to make sure I had the correct graph that depicts IOPS.  Let me know.
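For reference, here is a rough sketch of the sort of check our script performs (not the script itself; `isi statistics drive` is a real OneFS command, but the flags, CSV column names, the threshold, and the mail settings below are placeholders to adapt):

```python
#!/usr/bin/env python
"""Sketch of a high-disk-IOPS email alert, meant to run on a node where the
`isi statistics drive` CLI is available.  The --format=csv flag and the
OpsIn/OpsOut/Drive column names vary by OneFS release and are assumptions."""
import csv
import smtplib
import subprocess
from email.mime.text import MIMEText

IOPS_THRESHOLD = 300                      # placeholder per-drive ops/s alert level
MAIL_TO = "storage-team@example.com"      # placeholder recipient
MAIL_FROM = "isilon-alerts@example.com"   # placeholder sender
SMTP_RELAY = "localhost"                  # placeholder mail relay

def busy_drives():
    # Ask OneFS for current per-drive statistics in CSV form.
    out = subprocess.check_output(
        ["isi", "statistics", "drive", "--nodes=all", "--format=csv"],
        universal_newlines=True)
    rows = csv.DictReader(out.splitlines())
    return [r for r in rows
            if float(r.get("OpsIn", 0) or 0) + float(r.get("OpsOut", 0) or 0) > IOPS_THRESHOLD]

def main():
    hot = busy_drives()
    if not hot:
        return
    body = "\n".join("%s: %s in / %s out ops/s"
                     % (r.get("Drive", "?"), r.get("OpsIn"), r.get("OpsOut"))
                     for r in hot)
    msg = MIMEText(body)
    msg["Subject"] = "Isilon: %d drive(s) above %d IOPS" % (len(hot), IOPS_THRESHOLD)
    msg["From"], msg["To"] = MAIL_FROM, MAIL_TO
    smtplib.SMTP(SMTP_RELAY).sendmail(MAIL_FROM, [MAIL_TO], msg.as_string())

if __name__ == "__main__":
    main()
```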

Thank you,

450 Posts

March 13th, 2017 12:00

Nope, that's the average size of disk operations, not the volume of them.  So here are some hints; it's best to look instead at things that actually matter to clients.  For instance (a quick CLI sketch that pulls a couple of these numbers follows the list):

1. Watch latency per protocol, per node. 

2. Or the number of NFS or SMB protocol ops/sec, again split out per node.  Is one node out of ten getting 50% of all NFS writes?  That's probably a problem.

3. Look at imbalances between read and write on the network interfaces (usually there are more reads than writes and that's fine), but understand it.

4. Problems frequently occur when namespace read values start to climb because an application is doing tons of metadata work. 

5. The number of connected clients per node is also an interesting item to keep an eye on, to ensure that the workload is being balanced across all the nodes correctly.  With a large number of clients, say 10K user home directories, Round-Robin as a SmartConnect load-balancing policy is fine.  For a small number of clients you might change it.
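As mentioned above, here is a quick sketch that pulls per-node protocol ops/sec from the CLI (the same numbers the IIQ graphs are built on).  `isi statistics protocol` is a real command; the exact flags and CSV column names differ between OneFS releases, so treat them as assumptions:

```python
"""Sketch: per-node, per-protocol ops/s from the OneFS CLI.
The --nodes/--protocols/--format flags and the Node/Ops column names
are version-dependent assumptions."""
import csv
import subprocess
from collections import defaultdict

def protocol_rows(protocols=("nfs3", "smb2")):
    out = subprocess.check_output(
        ["isi", "statistics", "protocol",
         "--nodes=all", "--protocols=" + ",".join(protocols), "--format=csv"],
        universal_newlines=True)
    return list(csv.DictReader(out.splitlines()))

def ops_by_node(rows):
    # Sum ops/s per node, so one node taking half of all the work stands out.
    totals = defaultdict(float)
    for r in rows:
        totals[r.get("Node", "?")] += float(r.get("Ops", 0) or 0)
    return totals

if __name__ == "__main__":
    rows = protocol_rows()
    for node, ops in sorted(ops_by_node(rows).items()):
        print("node %s: %.0f protocol ops/s" % (node, ops))
```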

Summary:

In general, there are no hard and fast rules on Isilon of the form "if you hit value X, then that is a problem," and as such performance tuning is more art than science.  It's also not a block array where filesystem IOPS is a terribly useful value.  There are back-end filesystem operations that'll create IO, but as long as they don't impact clients then who cares, which is why the jobs have low priorities and impact policies by default.  As a result, I've always seen and used IIQ more as a tool to track down problems while they are happening or after they occurred, rather than to watch for a particular value that indicates there is a problem.

~Chris

March 13th, 2017 19:00

Hi,

What Chris said is correct.

Aside from client workloads, Isilon has a number of built-in jobs that will incur IOPS under the hood.  So IOPS alone isn't particularly useful to the administrator.  If you want some type of indicator of "possible issues", I'd focus on customer-facing metrics like the Protocol Operations Average Latency graph module, available from a number of performance report types: the Cluster Performance, Client Performance, or Network Performance reports.  A caveat -- keep in mind that "event notification" in SMB can throw this metric way off.  Your best bet is to break out by protocol like Chris suggests, then break out by node.

Another metric I'd use for "under-the-hood" visibility is the Average Pending Disk Operations Count graph module.  You will find that one under the Disk Performance report.
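If you want to script against the same numbers outside of the IIQ UI, the OneFS Platform API exposes the statistics that IIQ itself draws from.  A minimal sketch follows; the /platform/1/statistics/current endpoint exists, but the key names, credentials, and cluster address below are assumptions -- list the real keys with /platform/1/statistics/keys on your cluster:

```python
"""Sketch: read current statistics values from the OneFS Platform API.
The key names, account, and cluster address are placeholders."""
import requests

CLUSTER = "https://isilon.example.com:8080"   # placeholder cluster address
AUTH = ("monitor", "password")                # placeholder read-only account

# Placeholder key names; enumerate real ones via /platform/1/statistics/keys.
KEYS = ["node.protostats.nfs", "node.protostats.smb2", "node.disk.busy.avg"]

def current_stats(keys):
    resp = requests.get(
        CLUSTER + "/platform/1/statistics/current",
        params={"key": keys, "devid": "all"},
        auth=AUTH,
        verify=False)          # clusters commonly run self-signed certificates
    resp.raise_for_status()
    return resp.json().get("stats", [])

if __name__ == "__main__":
    for stat in current_stats(KEYS):
        print(stat.get("key"), "node", stat.get("devid"), "=", stat.get("value"))
```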

1.2K Posts

March 14th, 2017 03:00

While we are at it, it appears that high protocol operation latencies can also be induced solely by a slow or overloaded client machine.  I believe we have observed this phenomenon mainly with the userspace NFS implementation starting from OneFS 7.2, where the release notes also mentioned unspecified 'inaccuracies' in the measurements.

FWIW -- has anybody had similar experiences (with the reporting behavior, not the actual performance)?

Curious,

-- Peter

March 14th, 2017 17:00

Hi Peter,

Very good point there.  I have in the past been involved in an escalation to troubleshoot a strangely higher-than-normal NFS3 latency.  It turns out that OneFS calculates protocol latency timing by including the ACK that's supposed to come back from the client side.  This means that if your NFS client doesn't send that ACK, the protocol latency clock keeps ticking.

356 Posts

March 15th, 2017 04:00

Guys,

I get it that IOPS isn't helpful to us admins of the Isilon system when it comes to performance.  Our past TEM wrote the script because our lead storage admin wanted to know, pre-InsightIQ, why the storage system seemed to be having performance issues for certain users/systems.  So the script was instituted, and when it reports high IOPS alerts in our email he wants to link that to a specific user/system as the culprit for the high disk I/O, mainly to see if that user/system requires a different type of storage, because the performance of the Isilon cluster may not meet the required I/O speeds for that application.  We may need to look at moving that user's/system's data to higher-performing storage.  Hopefully this helps and you can point me to where I can find the needed data in the InsightIQ tool.
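(For what it's worth, the per-client statistics on the cluster itself are one way to tie an alert like that back to the busiest client or system.  A minimal sketch follows, assuming `isi statistics client`, which is a real command, with flags and column names that vary by OneFS release; in IIQ the equivalent breakout lives in the Client Performance report.)

```python
"""Sketch: list the busiest client connections so a high-I/O alert can be
tied back to a specific user/system.  The flags and the Ops/RemoteAddr/
UserName column names are version-dependent assumptions."""
import csv
import subprocess

def top_talkers(top=10):
    out = subprocess.check_output(
        ["isi", "statistics", "client", "--nodes=all", "--format=csv"],
        universal_newlines=True)
    rows = list(csv.DictReader(out.splitlines()))
    # Highest operation rate first.
    rows.sort(key=lambda r: float(r.get("Ops", 0) or 0), reverse=True)
    return rows[:top]

if __name__ == "__main__":
    for r in top_talkers():
        print("%-15s %-20s %8s ops/s"
              % (r.get("RemoteAddr", "?"), r.get("UserName", "?"), r.get("Ops", "?")))
```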

Thank you,

1.2K Posts

March 15th, 2017 06:00

Thanks Rob!

1 Message

October 31st, 2019 06:00

Hi,

We are also interested in understanding what the internal disks are doing.

 

Could you share this script, or indicate what internal metric it was using?

 

Thnx

36 Posts

November 15th, 2019 15:00

Lots of good comments/advice in this thread.

Overall performance monitoring on OneFS is not simple because there are a lot of moving parts.  It's not irrelevant to look at drive performance (obviously that's a critical component), but it's far from the complete picture.  What I will say with regard to looking specifically at drive statistics is that there are a few that stand out and a few that are "traps".

For read performance, timeavg is the most useful statistic: it is the average amount of time taken to satisfy reads from the drive, and it will increase as your workload increases.

For write performance, the drive numbers are generally not interesting because our writes are cached by the journal and asynchronously destaged to storage. Nonetheless, queued and timeinq can give a good indicator as to the work happening, and obviously that has a cascading effect on the read times above.

The ops values are a trap because they're not measured on the drive itself, and there are layers in between where sequential operations can be, and are, coalesced.  That explains why you see 7200RPM SATA HDDs happily reporting >400 ops/s with reasonable latencies; those cannot be random operations.

The percentage busy is interesting, but only up to a point.  While it is below 100% you can infer the utilization of the drive, but once it hits 100% you have no way to distinguish between the drive being perfectly utilized and it being loaded to the point that it's about to become a molten puddle at the bottom of the cluster!
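If you want to eyeball those fields quickly, here is a minimal sketch that pulls them from `isi statistics drive` and flags drives that look pressured; the --format flag, the TimeAvg/Queued/Busy column names, and the thresholds are assumptions to adjust for your release and workload:

```python
"""Sketch: flag drives whose read latency or queue depth looks high, using
the statistics discussed above.  Column names and thresholds are placeholders."""
import csv
import subprocess

READ_LATENCY_MS = 20.0   # placeholder timeavg alert level (ms)
QUEUE_DEPTH = 10.0       # placeholder queued-ops alert level

def drive_rows():
    out = subprocess.check_output(
        ["isi", "statistics", "drive", "--nodes=all", "--format=csv"],
        universal_newlines=True)
    return list(csv.DictReader(out.splitlines()))

def pressured(rows):
    flagged = []
    for r in rows:
        timeavg = float(r.get("TimeAvg", 0) or 0)   # avg time to satisfy reads
        queued = float(r.get("Queued", 0) or 0)     # ops waiting on the drive
        if timeavg > READ_LATENCY_MS or queued > QUEUE_DEPTH:
            flagged.append(r)
    return flagged

if __name__ == "__main__":
    for r in pressured(drive_rows()):
        print("%s  TimeAvg=%s  Queued=%s  Busy=%s%%"
              % (r.get("Drive", "?"), r.get("TimeAvg"), r.get("Queued"), r.get("Busy")))
```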

HTH,

Tim
