Reid3

14 Posts

2492

May 11th, 2015 14:00

Looking for Isilon metrics that indicate cache utilization/under-utilization

At the EMC World session on Maximizing Isilon Performance a lot of discussion centered around tuning things to better utilize cache. We were told that there were some metrics on the Isilon that can be analyzed to identify if our cache is being well utilized, but these may not be documented anywhere. Can anyone shed some light on the commands and strategies I might use to identify how my cache is being used and how I might be able to tune things?

Thanks,

Reid

Responses(11)

crklosterman

450 Posts

1

May 11th, 2015 15:00

Reid,

Ideally what you're going to be looking for are things like the hit/miss rates of your existing cache configuration, as well as the volume of metadata operations, which are visible in InsightIQ Performance charts. Look for namespace_read and namespace_write operations.

'isi_cache_stats' can provide you a good real-time view of hits/misses on any single node for L1, L2, and L3 (if you have it).
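As a rough illustration of what to do with hit/miss counters once you have them, here is a minimal Python sketch that turns per-level counts into hit ratios. The counter values and the dictionary layout are made up for the example; they are not actual 'isi_cache_stats' output, and the real tool's format may differ.

```python
def hit_ratio(hits: int, misses: int) -> float:
    """Return the fraction of lookups served from cache (0.0 if no lookups)."""
    total = hits + misses
    if total == 0:
        return 0.0
    return hits / total


# Hypothetical counters per cache level (illustrative numbers only).
sample = {
    "L1": {"hits": 900, "misses": 100},
    "L2": {"hits": 80, "misses": 20},
}

ratios = {level: hit_ratio(c["hits"], c["misses"]) for level, c in sample.items()}

for level, ratio in ratios.items():
    print(f"{level}: {ratio:.1%} hit ratio")
```

A consistently low ratio at a given level under a steady workload is the kind of signal worth a closer look; what counts as "low" depends on the workload.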

That'll get you started. Engage with your EMC Isilon Account Team if you need further assistance in this arena.

~Chris Klosterman

Senior Solution Architect

EMC Isilon Offer & Enablement Team

kipcranford

125 Posts

0

May 11th, 2015 16:00

Yes, I was the presenter and said those things. Today there isn't very thorough documentation on actually analyzing a lot of our gathered statistics, especially the cache stats.

As Chris mentioned, the kernel stats that OneFS collects are what you want to analyze. At EMC World, I specifically mentioned the 'isi_cache_stats' utility, which is available from the CLI of any node. Additionally, the same stats buckets that the utility draws from are also accessible to IIQ, as Chris also mentioned. IIQ, 'isi_cache_stats', and 'isi statistics' in general are all just consumers of the same stats buckets.

Take some time to familiarize yourself with the various display tools, whether GUI (IIQ) or CLI (cache stats, 'isi statistics', etc.), and what they can show you, then post back in the community if you have more questions.

kipcranford

125 Posts

0

May 11th, 2015 20:00

Thanks dynamox!

His name is Anton Rang, software developer within the Isilon Engineering performance group...

dynamox

2 Intern

20.4K Posts

0

May 11th, 2015 20:00

Anton is very well spoken and great at explaining complicated concepts... are you sure he is a developer?!

dynamox

2 Intern

20.4K Posts

0

May 11th, 2015 20:00

That was one of my favorite sessions, by the way. Your co-presenter did great too (can't remember his name, sounded Russian).

kipcranford

125 Posts

0

May 11th, 2015 22:00

I lurk on a few lists where I see his commits; he's definitely a developer. I'll pass along the kind words...

Peter_Sero

1.2K Posts

0

May 14th, 2015 21:00

Just saw that the slides are available for download - thanks for the awesome stuff, Kip & other presenters!

As for the cache statistics, I have noticed on our cluster that L1 data cache hit figures have drastically changed between 7.0.2 and 7.1.1.

With 7.0.2 (and 6.5.5) the L1 data cache + prefetch hits were usually very low for streaming reads; all the action was taking place in L2.

With 7.1.1, L1 activity is about as high as the L2 activity (sum of cache + prefetch hits almost equally high on both, and both roughly equalling the overall outgoing NAS traffic).

I wonder whether only the L1 cache reporting has changed or the underlying functionality...

Curious

-- Peter

kipcranford

125 Posts

0

May 15th, 2015 11:00

> I wonder whether only the L1 cache reporting has changed or the underlying functionality...

I see the same thing you do. Attached (sorry, I couldn't figure out how to include the text inline without this tool screwing up the format) is output from two NFS3 concurrent streaming read tests where the filesystem was set to 'streaming' access and layout. The first output is from 7.0.2, the second from 7.2.0. This output comes from parsing and collating stats gathered using "isi_cache_stats" over the life of each test (the parser is something I wrote).

This seems to be in line with what you're seeing. As to the reasons why, I'm not sure about the reporting angle, but I do know that in OneFS 7.1.1+, the prefetch code did see a major rewrite, to remove some latencies, add some efficiencies, and mostly to lay the foundation for the Adaptive Prefetch functionality that will be debuting later this year.

1 Attachment

cache.txt

Peter_Sero

1.2K Posts

0

May 20th, 2015 01:00

Thanks Kip, one can interpret the 7.0 report as basically illustrating how the L1 data misses are showing up as L2 requests, as one would expect from stacked caches and much in line with the OneFS 7.1.1 cache paper. The 7.2 report says all reads are going to both L1 and L2 levels at once... weird...
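The "stacked caches" expectation can be written down as a quick sanity check: if L2 only sees what L1 missed, then L2 requests should roughly equal L1 misses. The sketch below is only an illustration of that reasoning, assuming cluster-wide totals (a node's L2 also serves other nodes' L1s, so per-node figures would not line up this neatly). The function name, the tolerance, and the numbers are hypothetical and not taken from the attached reports.

```python
def looks_stacked(l1_misses: int, l2_requests: int, tolerance: float = 0.10) -> bool:
    """Return True if L2 requests are within `tolerance` (relative) of L1 misses."""
    if l1_misses == 0:
        return l2_requests == 0
    return abs(l2_requests - l1_misses) / l1_misses <= tolerance


# Illustrative numbers only.
# Stacked pattern: nearly every L1 miss turns into one L2 request.
print(looks_stacked(l1_misses=1000, l2_requests=980))

# Non-stacked pattern: L2 sees far more requests than L1 missed,
# as if reads were counted at both levels.
print(looks_stacked(l1_misses=100, l2_requests=1000))
```

The first call reports the stacked pattern, the second does not.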

Time for a new White Paper...

-- Peter

Nikschen

179 Posts

0

May 26th, 2015 08:00

Hi Peter,

I will submit your feedback on the white paper to isicontent@emc.com. Feel free to do so yourself in the future for other requests.

Niki

kipcranford

125 Posts

0

June 26th, 2015 10:00

> The 7.2 report says all reads are going to both L1 and L2 levels at once... weird...

I try to describe what's going on in 7.2.0 (and 7.1.1) in this blog post:

https://community.emc.com/blogs/kip_cranford/2015/06/26/deconstructing-onefs-cache-statistics

As for the 7.0.2 stats, they are reported differently, possibly because of the underlying functionality of prefetch in that release, or possibly because the reporting in that release is slightly broken. I'm still trying to figure that all out (I have a bug filed with Engineering).
