Unsolved
2 Intern
•
356 Posts
0
Metrics for Isilon IOPS
Community,
I am trying to find the correct syntax to start collecting IOPS metrics for specific folders on the cluster. Is this possible? I know the command I am looking for is isi statistics, but the difficulty comes in with the syntax to indicate a specific folder or directory and the triggers needed to reflect IOPS or throughput. My experience with the isi statistics command is that the string can get very long and the order of the attributes can be very confusing; it takes me hours to figure out how to get a favorable output. So I was hoping someone here is a wiz with the isi statistics command and could whip this out in less than an hour?
The issue: I was tasked by my lead to pinpoint the offender causing the high IOPS notifications from the Isilon cluster. Granted, the alerting script is custom, and it is used to calculate IOPS across the entire cluster, not for any specific folder. We get alerts from this script when the IOPS threshold is met.
Thank you,
Peter_Sero
1.2K Posts
1
October 4th, 2016 06:00
You are probably talking about the isi statistics query subcommand
where the query string can get very long.
For your case I would start with two other subcommands:
isi statistics client
isi statistics heat
(Check the help output for sorting, restricting or extending the output etc)
The former command is based on NAS (NFS, SMB) protocol operations
and helps reveal the most active users' names and client hostnames,
but not the directory and file names.
The second command is based on the internal file system
operations which do not necessarily match the NAS
protocol operations one-by-one. But it reveals the
directory and file names -- without knowing anything
about users and client hosts.
Usually a combined analysis of both outputs gives
a great clue about what is going on. But I haven't seen
an automated way of integrating these statistics;
we just inspect both outputs side by side,
either "live" or as gathered by a cron script.
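If it helps, a minimal sketch of such a cron-driven gatherer might look like the following. The log directory and the exact isi flags are assumptions, not a tested recipe; adjust both for your cluster.

```shell
# Hypothetical cron wrapper for gathering both outputs side by side.
# LOGDIR and the isi flags below are assumptions; adjust for your environment.
LOGDIR=${LOGDIR:-/ifs/data/stats-logs}

collect_stats() {
  ts=$(date +%Y%m%d-%H%M%S)        # shared timestamp so the two outputs pair up
  mkdir -p "$LOGDIR"
  isi statistics client --noheader > "$LOGDIR/client.$ts.txt"
  isi statistics heat --totalby=Path --pathdepth=3 --noheader > "$LOGDIR/heat.$ts.txt"
}

# Run only where the isi CLI exists (i.e. when invoked on a cluster node):
command -v isi >/dev/null 2>&1 && collect_stats || true
```

From cron you would then just call this script every few minutes and diff or inspect the paired files after an alert fires.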
hth
-- Peter
Anonymous User
170 Posts
1
October 4th, 2016 10:00
What I do here is run isi statistics heat with a pre-defined path depth and send that to our graphite server. This lets me find hot volumes but won't usually let me find the user. In our HPC environment, it's next to impossible to find the offending user. For us, client-based statistics don't work at all since we have users running jobs across hundreds of clients at the same time.
You can start with this and tweak the pathdepth to see if something pops out:
isi statistics heat --degraded --nodes=all --totalby=Path --pathdepth=3 --noheader --noconversion | grep ifs | egrep -v ".ifsvar" |sed 's,/ifs/,,g' | sed -e "s/\/*$//" | head -n20
It's an instantaneous single point-in-time view of the cluster so you have to run it multiple times to see if you've got a short-term event or a trend. These are not absolute numbers but relative weights.
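A minimal sketch of the "send to Graphite" leg might look like the following. It assumes the --noheader output reduces to two columns (ops, then path) and the standard Graphite plaintext protocol on port 2003; heat_to_graphite and the metric prefix are made-up names.

```shell
# Hypothetical helper: convert heat output ("<ops> <path>" per line) into
# Graphite plaintext-protocol lines ("metric value timestamp").
heat_to_graphite() {
  ts=$(date +%s)
  awk -v ts="$ts" '{
    p = $2
    sub("^/ifs/?", "", p)          # strip the /ifs prefix
    gsub("/", ".", p)              # Graphite uses dots as path separators
    if (p == "") p = "root"
    print "isilon.heat." p, $1, ts
  }'
}

# Typical use on the cluster (flags as in the command above; nc target assumed):
# isi statistics heat --degraded --nodes=all --totalby=Path --pathdepth=3 \
#     --noheader --noconversion | heat_to_graphite | nc graphite.example.com 2003
```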
Happy hunting.
.../Ed
chjatwork
2 Intern
•
356 Posts
0
October 18th, 2016 06:00
Ed,
Since I am not up during the hours when the alerts show up, how would you approach trying to catch the possible offender on the cluster causing the HIGH IOPS alerts?
Thank you,
jasdav02
7 Posts
0
October 22nd, 2016 08:00
If you're not looking at historical data when troubleshooting performance issues, then you're missing out on a ton of insight. Obviously, I'd first start off looking at InsightIQ -- assuming you have it deployed. I want to say this is now a free add-on from Isilon (it used to be a paid, licensed feature).
Also, just focusing on IOPS is really missing the point. When you say "IOPS", what kind of operation? Remember, these are NAS systems, not block storage. There is a whole lot of complexity introduced by presenting a shared filesystem, and be cognisant that this sits on a distributed cluster, which makes it even more complex when comparing to a scale-up NAS.
Let's go through this!
Have you read the Ask The Experts discussion on performance troubleshooting?
Ask the Expert: Isilon Performance Analysis
Like Ed, my environment is a large HPC deployment (Hi Ed!) and as such we do a number of custom monitoring things to provide more inputs for analysis.
Basically, we dump in a ton of related stats from the various isi commands into a common time series database (Graphite).
Send Isilon stats to Graphite (Carbon) and to Dashing instance · GitHub
Quota data is also very important:
Sending directory quotas from the Isilon directly to a Graphite (carbon) instance · GitHub
I should also mention that Isilon has a tool that is something equivalent to this though it's a poller (pull vs. push) and it directs to InfluxDB.
GitHub - Isilon/isilon_data_insights_connector: Connector to fetch stats from OneFS and push them to InfluxDB/Grafana
As mentioned above, we've settled on Grafana as our universal visualization tool:
Grafana - Beautiful Metrics Dashboards, Data Visualization and Monitoring
The advantage of doing all of this custom monitoring is that you can collect stats not only from your Isilon cluster but also from other sources. Once you have these in one place, you can do some interesting things, like correlation.
Internally we've deployed both email alerting and large visual dashboards on large TVs that display the only metrics that really matter: read and write latencies! (isi statistics client --orderby timeavg --totalby class)
We set warn/error thresholds at 15 ms and 30 ms over a 10-minute moving average (in an attempt to exclude short-lived, one-time events).
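A sketch of that moving-average threshold logic is below; latency_alert is a hypothetical helper name, and it assumes one TimeAvg sample in milliseconds per input line (e.g. sampled once a minute).

```shell
# Hypothetical helper implementing the alert rule above: a 10-sample moving
# average of TimeAvg latency (ms), warning at 15 and erroring at 30.
latency_alert() {
  awk '{
    buf[NR % 10] = $1                  # ring buffer of the last 10 samples
    n = (NR < 10 ? NR : 10)
    s = 0; for (i in buf) s += buf[i]
    avg = s / n
    if      (avg >= 30) print "ERROR", avg
    else if (avg >= 15) print "WARN", avg
  }'
}
```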
Once we're alerted, we look at some of the easy things first, given that in our environment it's typically one of these 90% of the time.
An example checklist from our internal documentation:
isi quota quotas ls --exceeded --type directory
isi statistics heat --class write | awk '{print $4,$5}' | sort | uniq -c | sort -nrk 1 | less
Start using dashboards!
I appreciate that you might not have all of this infrastructure setup so here is what I'd do in the CLI, ad-hoc:
Once you have an idea who the busiest users are then start digging through heat: isi statistics heat --orderby ops --class write,namespace_write --totalby path | less
Note that I'm really hammering on write latencies because, for most performance issues I've seen, read saturation hasn't been the culprit.
cadencep45
2 Intern
•
301 Posts
0
April 24th, 2017 05:00
Is there an updated link for
"Send Isilon stats to Graphite (Carbon) and to Dashing instance"
as the link seems dead?
Anonymous User
170 Posts
0
April 24th, 2017 06:00
I clicked on the link and it's live for me.
https://gist.github.com/scr512/112397effdf1bce78908
You might want to start up at Jason's root and look at all of the stuff he's put up: https://gist.github.com/scr512
cadencep45
2 Intern
•
301 Posts
0
April 24th, 2017 07:00
yup, corporate firewall strikes again
cadencep45
2 Intern
•
301 Posts
0
April 24th, 2017 07:00
I suspect a corporate firewall. I will try from home
chjatwork
2 Intern
•
356 Posts
0
April 25th, 2017 08:00
Hey, I used a version of Ed's isi statistics command to get information, but for the most part we are using InsightIQ. It's not the best tool, but it's something.
RobChang-Isilon
136 Posts
0
April 25th, 2017 10:00
Hi chjatwork ,
Couple of things here:
1) If you could share with us what areas of InsightIQ you'd like to see improved, it would help us tremendously. The next evolution of InsightIQ will have a ton of goodness, but we'd still like to capture end-users' workflows and use cases as much as we can. If you don't mind sharing your thoughts, please connect with your account SE. The Dell EMC account SE will capture what you are trying to accomplish and forward that information to Product Management as a feature request. Product Management and Isilon Engineering regularly comb through feature requests and define product requirements for frequently-asked features.
2) As jasdav02 mentioned earlier, this is a NAS system, not block. Measuring performance (or, in your case, gauging potential "issues") purely on IOPS can often lead nowhere. As you can imagine, many tasks within the filesystem incur IOPS: FlexProtect, SmartPools, FSA, and a host of under-the-hood maintenance jobs all incur IOPS.
3) If you are trying to locate "the worst offenders", two InsightIQ reports come to mind. You could use the "Client Performance" report in InsightIQ -- use the client performance table and sort by one of the metrics. Or you could use the "Cluster Performance" report and perform various break-outs to look for your offending client.
For the Cluster Performance report I mentioned above, I had produced a short demo video a while back. Let me first apologize for the video being a bit unpolished -- it was the first demo video I had produced:
InsightIQ Demo Video: Identify Demanding Clients
https://youtu.be/C7qII9kamiQ?list=PLbssOJyyvHuXHcAu13ZFwJU7GGVdiL_z7
cadencep45
2 Intern
•
301 Posts
0
May 16th, 2017 04:00
question re isi statistics v REST
From the above, it appears a lot of folks are scheduling isi statistics commands on the Isilon and sending the results to graphs in Grafana. This sounds good. My question is, I see a lot of "here be dragons" warnings about using cron on Isilon.
Is cron the way to go, or can I do the equivalent of the isi statistics queries via the REST API, so that I can schedule the polling of data outside the Isilon itself?
My plan is to present isilon stats via grafana to correlate with applications using isilon storage.
Peter_Sero
1.2K Posts
1
May 16th, 2017 05:00
Take a look here:
https://community.emc.com/blogs/keith/2017/01/26/isilon-data-insights-connector--do-it-yourself-isilon-monitoring
Or, in case you'd like to start from scratch with the Isilon API:
REST API and performance stats
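For the REST route, a minimal sketch is below, assuming the OneFS Platform API statistics endpoint (/platform/1/statistics/current); the credentials and the key name in the example are placeholders, so verify both against the API reference for your OneFS version.

```shell
# Hypothetical sketch: query statistics via the OneFS Platform API from an
# external poller instead of cron on the cluster. The endpoint path and key
# name are assumptions; check the API docs for your OneFS version.
build_stats_url() {
  # $1 = cluster hostname, $2 = statistics key
  echo "https://$1:8080/platform/1/statistics/current?key=$2"
}

# Example ad-hoc poll (credentials and key are placeholders):
# curl -sku "monitor:$PASS" "$(build_stats_url mycluster cluster.protostats.nfs.total)"
```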
have fun!
-- Peter