Unsolved
2 Intern
•
356 Posts
0
Metrics for Isilon IOPS
Community,
I am trying to find the correct syntax to start collecting IOPS metrics for specific folders on the cluster. Is this possible? I know the command I am looking for is isi statistics, but the difficulty comes in with the syntax to indicate a specific folder or directory and the triggers needed to reflect IOPS or throughput. My experience with the isi statistics command is that the string can get very long and the order of the attributes can be very confusing; it takes me hours to figure out how to get a favorable output. So I was hoping someone here is a wiz with the isi statistics command and could whip this out in less than an hour?
The issue: I was tasked by my lead to pinpoint the offender causing the high IOPS notifications from the Isilon cluster. Granted, the alerting script is custom, and it is used to calculate IOPS across the entire cluster, not for any specific folder. We get alerts from this script when the IOPS threshold is met.
Thank you,
Peter_Sero
1.2K Posts
1
October 4th, 2016 06:00
You are probably talking about the isi statistics query subcommand
where the query string can get very long.
For your case I would start with two other subcommands:
isi statistics client
isi statistics heat
(Check the help output for sorting, restricting or extending the output etc)
The former command is based on NAS (NFS, SMB) protocol operations
and helps reveal the most active users' names and client hostnames,
but not the directory and file names.
The second command is based on the internal file system
operations which do not necessarily match the NAS
protocol operations one-by-one. But it reveals the
directory and file names -- without knowing anything
about users and client hosts.
Usually a combined analysis of both outputs gives
a great clue about what is going on. But I haven't seen
an automated way of integrating these statistics;
we just inspect both outputs side by side,
either "live" or as gathered by a cron script.
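If it helps, a minimal sketch of such a cron-driven gatherer might look like the following. The log directory and the exact isi flags are assumptions, not a tested recipe; adjust both for your cluster.

```shell
# Hypothetical cron wrapper for gathering both outputs side by side.
# LOGDIR and the isi flags below are assumptions; adjust for your environment.
LOGDIR=${LOGDIR:-/ifs/data/stats-logs}

collect_stats() {
  ts=$(date +%Y%m%d-%H%M%S)        # shared timestamp so the two outputs pair up
  mkdir -p "$LOGDIR"
  isi statistics client --noheader > "$LOGDIR/client.$ts.txt"
  isi statistics heat --totalby=Path --pathdepth=3 --noheader > "$LOGDIR/heat.$ts.txt"
}

# Run only where the isi CLI exists (i.e. when invoked on a cluster node):
command -v isi >/dev/null 2>&1 && collect_stats || true
```

From cron you would then just call this script every few minutes and diff or inspect the paired files after an alert fires.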
hth
-- Peter
Anonymous User
170 Posts
1
October 4th, 2016 10:00
What I do here is run isi statistics heat with a pre-defined path depth and send that to our graphite server. This lets me find hot volumes but won't usually let me find the user. In our HPC environment, it's next to impossible to find the offending user. For us, client-based statistics don't work at all since we have users running jobs across hundreds of clients at the same time.
You can start with this and tweak the pathdepth to see if something pops out:
isi statistics heat --degraded --nodes=all --totalby=Path --pathdepth=3 --noheader --noconversion | grep ifs | egrep -v ".ifsvar" |sed 's,/ifs/,,g' | sed -e "s/\/*$//" | head -n20
It's an instantaneous single point-in-time view of the cluster so you have to run it multiple times to see if you've got a short-term event or a trend. These are not absolute numbers but relative weights.
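A minimal sketch of the "send to Graphite" leg might look like the following. It assumes the --noheader output reduces to two columns (ops, then path) and the standard Graphite plaintext protocol on port 2003; heat_to_graphite and the metric prefix are made-up names.

```shell
# Hypothetical helper: convert heat output ("<ops> <path>" per line) into
# Graphite plaintext-protocol lines ("metric value timestamp").
heat_to_graphite() {
  ts=$(date +%s)
  awk -v ts="$ts" '{
    p = $2
    sub("^/ifs/?", "", p)          # strip the /ifs prefix
    gsub("/", ".", p)              # Graphite uses dots as path separators
    if (p == "") p = "root"
    print "isilon.heat." p, $1, ts
  }'
}

# Typical use on the cluster (flags as in the command above; nc target assumed):
# isi statistics heat --degraded --nodes=all --totalby=Path --pathdepth=3 \
#     --noheader --noconversion | heat_to_graphite | nc graphite.example.com 2003
```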
Happy hunting.
.../Ed
chjatwork
2 Intern
•
356 Posts
0
October 18th, 2016 06:00
Ed,
Since I am not up during the hours when the alerts show up, how would you approach trying to catch the possible offender on the cluster causing the HIGH IOPS alerts?
Thank you,
jasdav02
7 Posts
0
October 22nd, 2016 08:00
If you're not looking at historical data when troubleshooting performance issues, then you're missing out on a ton of insight. Obviously, I'd first start off looking at InsightIQ -- assuming you have it deployed. I want to say this is now a free add-on from Isilon (it used to be a paid, licensed feature).
Also, just focusing on IOPS is really missing the point. When you say "IOPS", what kind of operation? Remember, these are NAS systems, not block storage. There is a whole lot of complexity introduced by presenting a shared filesystem, and be cognisant that this sits on a distributed cluster, which makes it even more complex when comparing to a scale-up NAS.
Let's go through this!
Have you read the Ask The Experts discussion on performance troubleshooting?
Ask the Expert: Isilon Performance Analysis
Like Ed, my environment is a large HPC deployment (Hi Ed!) and as such we do a number of custom monitoring things to provide more inputs for analysis.
Basically, we dump in a ton of related stats from the various isi commands into a common time series database (Graphite).
Send Isilon stats to Graphite (Carbon) and to Dashing instance · GitHub
Quota data is also very important:
Sending directory quotas from the Isilon directly to a Graphite (carbon) instance · GitHub
I should also mention that Isilon has a tool that is something equivalent to this though it's a poller (pull vs. push) and it directs to InfluxDB.
GitHub - Isilon/isilon_data_insights_connector: Connector to fetch stats from OneFS and push them to InfluxDB/Grafana
As mentioned above, we've settled on Grafana as our universal visualization tool:
Grafana - Beautiful Metrics Dashboards, Data Visualization and Monitoring
The advantage of doing all of this custom monitoring is that you can collect stats not only from your Isilon cluster but also from other sources. Once you have these in one place, you can do some interesting things, like correlation.
Internally we've deployed both email alerting and large visual dashboards on large TVs that display the only metrics that really matter: read and write latencies! (isi statistics client --orderby timeavg --totalby class)
We set warn/error thresholds at 15 ms and 30 ms over a 10-minute moving average (in an attempt to exclude short-lived, one-time events).
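A sketch of that moving-average threshold logic is below; latency_alert is a hypothetical helper name, and it assumes one TimeAvg sample in milliseconds per input line (e.g. sampled once a minute).

```shell
# Hypothetical helper implementing the alert rule above: a 10-sample moving
# average of TimeAvg latency (ms), warning at 15 and erroring at 30.
latency_alert() {
  awk '{
    buf[NR % 10] = $1                  # ring buffer of the last 10 samples
    n = (NR < 10 ? NR : 10)
    s = 0; for (i in buf) s += buf[i]
    avg = s / n
    if      (avg >= 30) print "ERROR", avg
    else if (avg >= 15) print "WARN", avg
  }'
}
```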
Once we're alerted, we look at some of the easy things first, given that in our environment it's typically one of these 90% of the time.
An example checklist from our internal documentation:
isi quota quotas ls --exceeded --type directory
isi statistics heat --class write | awk '{print $4,$5}' | sort | uniq -c | sort -nrk 1 | less
Start using dashboards!
I appreciate that you might not have all of this infrastructure setup so here is what I'd do in the CLI, ad-hoc:
Once you have an idea who the busiest users are then start digging through heat: isi statistics heat --orderby ops --class write,namespace_write --totalby path | less
Note that I'm really hammering on write latencies because, for most performance issues I've seen, read saturation hasn't been the culprit.
cadencep45
2 Intern
•
301 Posts
0
April 24th, 2017 05:00
Is there an updated link for
"Send Isilon stats to Graphite (Carbon) and to Dashing instance"
as the link seems dead?
Anonymous User
170 Posts
0
April 24th, 2017 06:00
I clicked on the link and it's live for me.
https://gist.github.com/scr512/112397effdf1bce78908
You might want to start up at Jason's root and look at all of the stuff he's put up: https://gist.github.com/scr512
cadencep45
2 Intern
•
301 Posts
0
April 24th, 2017 07:00
yup, corporate firewall strikes again
cadencep45
2 Intern
•
301 Posts
0
April 24th, 2017 07:00
I suspect a corporate firewall. I will try from home
chjatwork
2 Intern
•
356 Posts
0
April 25th, 2017 08:00
Hey, I used a version of Ed's isi statistics command to get information, but for the most part we are using InsightIQ. It's not the best tool, but it's something.
RobChang-Isilon
136 Posts
0
April 25th, 2017 10:00
Hi chjatwork ,
Couple of things here:
1) If you could share with us what areas of InsightIQ you'd like to see improved, it would help us tremendously. The next evolution of InsightIQ will have a ton of goodness, but we'd still like to capture end-users' workflows and use cases as much as we can. If you don't mind sharing your thoughts, please connect with your account SE. The Dell EMC account SE will capture what you are trying to accomplish and forward that information to Product Management as a feature request. Product Management and Isilon Engineering regularly comb through feature requests and define product requirements for frequently-asked features.
2) As jasdav02 mentioned earlier, this is a NAS system, not block. Measuring performance (or, in your case, gauging potential "issues") purely on IOPS can often lead nowhere. As you can imagine, many tasks within the filesystem incur IOPS: FlexProtect, SmartPools, FSA, and a host of under-the-hood maintenance jobs all incur IOPS.
3) If you are trying to locate "the worst offenders", two InsightIQ reports come to mind. You could use the "Client Performance" report in InsightIQ -- use the client performance table and sort by one of the metrics. Or you could use the "Cluster Performance" report and perform various break-outs to look for your offending client.
For the Cluster Performance report I mentioned above, I had produced a short demo video a while back. Let me first apologize for the video being a bit unpolished -- it was the first demo video I had produced:
InsightIQ Demo Video: Identify Demanding Clients
https://youtu.be/C7qII9kamiQ?list=PLbssOJyyvHuXHcAu13ZFwJU7GGVdiL_z7
cadencep45
2 Intern
•
301 Posts
0
May 16th, 2017 04:00
question re isi statistics v REST
From the above, it appears a lot of folks are scheduling isi statistics commands on the Isilon and sending the results to graphs in Grafana. This sounds good. My question is, I see a lot of "here be dragons" warnings about using cron on Isilon.
Is cron the way to go, or can I do the equivalent of the isi statistics queries via the REST API, so that I can schedule the polling of data outside the Isilon itself?
My plan is to present isilon stats via grafana to correlate with applications using isilon storage.
Peter_Sero
1.2K Posts
1
May 16th, 2017 05:00
Take a look here:
https://community.emc.com/blogs/keith/2017/01/26/isilon-data-insights-connector--do-it-yourself-isilon-monitoring
Or, in case you'd like to start from scratch with the Isilon API:
REST API and performance stats
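For the REST route, a minimal sketch is below, assuming the OneFS Platform API statistics endpoint (/platform/1/statistics/current); the credentials and the key name in the example are placeholders, so verify both against the API reference for your OneFS version.

```shell
# Hypothetical sketch: query statistics via the OneFS Platform API from an
# external poller instead of cron on the cluster. The endpoint path and key
# name are assumptions; check the API docs for your OneFS version.
build_stats_url() {
  # $1 = cluster hostname, $2 = statistics key
  echo "https://$1:8080/platform/1/statistics/current?key=$2"
}

# Example ad-hoc poll (credentials and key are placeholders):
# curl -sku "monitor:$PASS" "$(build_stats_url mycluster cluster.protostats.nfs.total)"
```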
have fun!
-- Peter