2 Intern


300 Posts

January 7th, 2016 00:00

Hi Peter,

Nice insight, I hadn't thought about that. This means that to warm up the cache you would have to connect to each node and do a complete treewalk.

In the end I didn't warm up the cache, since the activation time was quite unpredictable (e.g. two clusters, one on standby for the other, so the same configuration, node type, node count and data amount: the standby needed ~70h, the active one (with user load!) needed ~18h), and my administrators were not willing to take a look every hour over the weekend...

All in all I'm quite happy with the activation.

After the users warmed up the cache (and we didn't reset the stats) we have:

data hit rates: ~70%

meta hit rates: ~80%

ratio of disk OpsOut to SSD OpsOut: ~1:15 - 1:20

ratio of disk BytesOut to SSD BytesOut: ~1:5

4 Operator


1.2K Posts

January 7th, 2016 01:00

Glad to see it works well for you!

Can you put the L3 hit rates into one context with the L2 hit and miss rates?

Because the actual L3 hit rate with respect to the requested data (or metadata) reads is conceptually: (L2 miss%) x (L3 hit%)

In other words: if L2 cache works great (low L2 miss%), there is not much left for the L3 to do, and the L3 hit percentage becomes less meaningful...
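Peter's point can be sketched numerically. A minimal illustration (the function name and figures are mine, chosen to match the hit rates discussed later in the thread):

```python
# The fraction of ALL read starts that the L3 cache actually serves is
# the product of the L2 miss fraction and the L3 hit fraction.
def effective_l3_hit_rate(l2_miss, l3_hit):
    """Fraction of all cache starts (= L2 starts) served by L3."""
    return l2_miss * l3_hit

# If L2 already absorbs most reads, even a high L3 hit% serves few requests:
print(effective_l3_hit_rate(0.065, 0.687))  # ~0.045, i.e. ~4.5% of all reads
```

So a 68.7% L3 hit rate only translates into ~4.5% of total reads when L2 already catches 93.5% of them.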

2 Intern


300 Posts

January 7th, 2016 02:00


I knew beforehand that I have relatively high hit rates in L1 and L2, and thus my logic was that I would have high hit rates in L3 too. That's correct, as my stats show.

And yes, you do profit more if the L3 has more requests to answer. But you can also have wins in the "low numbers".

Calculation:

Without L3: L1.miss x L2.miss = disk.hits

2% x 6.5% = 0.13%

With L3: L1.miss x L2.miss x L3.miss = disk.hits

2% x 6.5% x 31.3% = 0.04%

So even with high hit rates higher up, I still reduce the disk hits by roughly a factor of three.
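The arithmetic above can be checked with a short sketch (same percentages as in the post; the helper function is just for illustration):

```python
# Chain the miss rates of each cache level: a read only reaches disk if it
# misses every level, so the final disk-hit rate is the product of the misses.
def disk_hit_rate(*miss_rates):
    """Multiply per-level miss fractions to get the final disk-hit fraction."""
    rate = 1.0
    for m in miss_rates:
        rate *= m
    return rate

without_l3 = disk_hit_rate(0.02, 0.065)          # L1.miss x L2.miss
with_l3    = disk_hit_rate(0.02, 0.065, 0.313)   # ... x L3.miss
print(f"{without_l3:.4%}")  # 0.1300%
print(f"{with_l3:.4%}")     # 0.0407%
```

Adding the L3 term cuts the disk hits from 0.13% to ~0.04% of all reads, which is the ~1:3 win claimed above.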

cache_stats.png

4 Operator


1.2K Posts

January 7th, 2016 05:00

I'd leave the L1 cache out of the picture here.

Firstly, the L1 cache happens where the file is accessed (the node the client is connected to), while L2/L3 are local to the node that has the disk blocks. One would need to summarize all L1 activity in a cluster and compare it to the summarized L2/L3 activity (or by node pools, if they match the configured NAS access zones).

Secondly, the accounting of the L1 in OneFS 7.1 and later is "overlapping" with the L2 activity, so the cache "levels" are not visible as "stages" in the same sense as with L2 and L3, even if summarized over the cluster.

In your example the hits and misses don't add up to 100% (= L2 starts), because the prefetch hits etc. are omitted. The picture I get is as follows:

Data:

L2 hits/pref: 93.5% = 100% - 6.5% (assuming all non-misses are hits or prefetch hits)

L3 hits: 4.5% = 6.5% x 68.7%

L3 misses: 2.0% = 6.5% x 31.2% (i.e. FINAL misses)

Meta:

L2 hits+pref: 92.5% = 100% - 7.5%

L3 hits: 5.9% = 7.5% x 78.1%

L3 misses: 1.6% = 7.5% x 21.9%

Which means that the final misses are still pretty low, 2% and 1.6%, really nice!
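The bookkeeping above (normalizing L3 hits and misses to the scale of the L2 starts) can be expressed compactly. A sketch, using the same figures as the post (the function name is mine):

```python
# Express all cache activity as fractions of the total read starts (= L2 starts):
# whatever misses L2 becomes the L3's workload, split into L3 hits and final misses.
def breakdown(l2_miss, l3_hit):
    """Normalize L2/L3 activity to the scale of all cache starts."""
    return {
        "L2 hits+prefetch": 1.0 - l2_miss,
        "L3 hits":          l2_miss * l3_hit,
        "final misses":     l2_miss * (1.0 - l3_hit),
    }

data = breakdown(0.065, 0.687)   # data reads:     ~93.5% / ~4.5% / ~2.0%
meta = breakdown(0.075, 0.781)   # metadata reads: ~92.5% / ~5.9% / ~1.6%
```

The three fractions always sum to 100%, which is the consistency check Peter applies to the raw stats.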

Happy caching!

-- Peter

4 Operator


1.2K Posts

January 7th, 2016 05:00

L2_meta hits are reported as 81.5% and misses as 7.5%, so 11% is still missing to make up 100%. For a consistent picture, where my three figures should add up to 100%, I "adjusted" the 81.5% by those 11%, assuming the best case, namely prefetch hits.

If you have the full cache statistics, first double-check that the L2 misses (block rate) are basically the same as the L3 starts (block rate). This simply reflects how L3 is staged after L2. It also allows you to scale all L3 activity to the initial cache starts (from L2) by multiplying with the L2 miss percentage. Think of drawing all L2 and L3 activity into one bar chart, all at the same scale.

Sad to learn that your L3 is still "nonexistent" -- did anything go wrong with the drive (re-)purposing? There has been a KB article on this... I remember your last drive statistics were kind of inexplicable...

-- Peter

2 Intern


205 Posts

January 7th, 2016 05:00

Peter, my brain is addled right now, but I'm not seeing where you're getting the 92.5% number in your metadata calculation.

Looking at mine right now, I'm trying to figure out how it compares (bearing in mind that my L3 is largely nonexistent, other than those 6 NL410s).

And looking at the command run on an NL410 vs. an S210 right now, I'm seeing that the command is node-specific, is it not? The S210 shows 100% misses on L3 (as it should, there isn't any), and the NL410 shows 19.4% misses on the L3 (which is pretty crummy).

2 Intern


205 Posts

January 7th, 2016 14:00

We've decided not to use L3 at all, but we still have those 6TB NL410s that require it for metadata acceleration. That's pretty much all I meant.

4 Operator


1.2K Posts

January 7th, 2016 23:00

I see. So it's GNA with all SSDs in the main cluster then?

4 Operator


1.2K Posts

January 19th, 2016 01:00

Re my concerns

The L3 cache is *local* to each node, as we know. This has consequences:

With metadata being mirrored on at least three nodes, any naive attempt to warm up the metadata cache by a single treewalk will only cache one mirror for each affected LIN.

Which means that later on the odds of a cache MISS are still 2/3, or 2 against 1.

Unless, of course, there is one "preferred" copy of a LIN's metadata that is always used. Having that one in the cache would be sufficient for all subsequent accesses from other nodes.

I have received good news from the Isilon team. There is in fact, for each LIN, one preferred copy of the metadata that gets read and cached. It is determined by a hash function of the LIN id, so any access from any node goes to that very copy (unless it is offline, in which case access fails over to another copy). So one can effectively warm up an L3 metadata cache with a single treewalk...
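The mechanism described above can be sketched as follows. This is purely illustrative: OneFS's actual hash function and mirror layout are not public, and the function and node names here are made up to show why a deterministic choice makes one treewalk sufficient.

```python
# Illustrative only -- not the real OneFS hash or layout.
# Because the mirror choice is a pure function of the LIN id, every node in
# the cluster reads the SAME copy, so warming it once warms it for everyone.
def preferred_mirror(lin_id, mirror_nodes):
    """Deterministically pick one metadata mirror for a LIN."""
    return mirror_nodes[lin_id % len(mirror_nodes)]

mirrors = ["node-2", "node-5", "node-7"]   # e.g. 3x mirrored metadata
# Any access, from any client node, lands on the same preferred copy:
copy_a = preferred_mirror(0xABC123, mirrors)  # access via node 1
copy_b = preferred_mirror(0xABC123, mirrors)  # access via node 4
assert copy_a == copy_b
```

Without such a deterministic choice, a single treewalk would cache only one of three mirrors, leaving 2/3 odds of a miss on later accesses, which is exactly the concern raised above.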

Cheers

-- Peter

2 Intern


205 Posts

January 19th, 2016 04:00

Except that the metadata-only L3 caches provided are not big enough to hold all the metadata, depending on workload. And it also doesn't solve the metadata first-access problem for a combined L3 data/metadata cache; metadata will still be evicted in favor of data a fair amount of the time (unless your L3 is the same size as your data, in which case, why do you have HDDs? ;-)

4 Operator


1.2K Posts

January 19th, 2016 06:00

I understand there is the problem of predicting how data and metadata compete for L3 space. The same goes for L2, by the way; my wish to EMC Isilon for more insight persists...

When comparing metadata-only L3 with classic metadata-read GNA, it is now clear that both store at most one copy per LIN until SSD capacity is exhausted. So how could metadata-only L3 ever consume more SSD space than metadata-read GNA on the same file set? That remains a mystery to me.

fwiw

-- Peter

2 Intern


205 Posts

January 19th, 2016 16:00

It wouldn't. But the metadata-only nodes are not allowed to use GNA/metadata-read and therefore have insufficient SSD for it (for certain workloads). And they also have insufficient SSD to cache all metadata. Alternatively, it simply doesn't work right.

--Ken

1 Rookie


16 Posts

January 19th, 2016 23:00

This could probably be a new post, but it's related, so I'll add to this one.

We are planning to enable L3 cache on our cluster. What steps would be advisable from a 'performance data' collection point of view?

I can do treewalks prior to the change. According to InsightIQ, we have ~1.2PB of data in ~226M files and ~29M directories. Walking over SMB can be slow. Did the question of FSAnalyze populating the L3 cache get answered? What is the best way to perform the treewalks?

We are moving from pure SATA NL nodes with S200 SSD nodes for GNA to X400 nodes with 4x 800GB SSDs for L3. Our data is file shares with mostly user-interactive work, and a small percentage of active data.
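For the treewalk question: the goal is just to stat every entry so each LIN's metadata gets read. A minimal sketch of such a warm-up walk (the mount-point path is an example, not from the thread; running it on or close to the cluster, e.g. over NFS, should be faster than over SMB):

```python
import os

# Walk the tree and lstat every directory and file, so each entry's
# metadata is read once and can land in the node-local caches.
def warm_metadata(root):
    """Stat every entry below root; return the number of entries touched."""
    count = 0
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            try:
                os.lstat(os.path.join(dirpath, name))
                count += 1
            except OSError:
                pass  # entry vanished mid-walk; keep going
    return count

# warm_metadata("/mnt/isilon/share")   # example mount point
```

With ~226M files and ~29M directories this is still a long-running job, so splitting the walk across several top-level directories run in parallel would be worth considering.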

January 20th, 2016 08:00

So when data is written to an Isilon cluster, blocks are written to L2 at the same time they are written to disk; this accelerates a subsequent read of those blocks. Then, after some time, those blocks fall out of L2 into L3.

I am assuming you will connect the new nodes to the same cluster and use SmartPools to migrate the data to the new nodes. If that is the case, the question is going to be: do data and MD blocks hit L2 cache from a SmartPools job? If so, they will fall into L3, so your L3 will be warmed without having to force a treewalk. I would suggest reaching out to your account team to verify this.

If you are using SyncIQ to migrate, then this will absolutely work, since SyncIQ has to go through the proper write path.

Also, I believe OneFS.Next will have some improvements around keeping as much or all MD on the SSDs, as long as you have enough SSD capacity, which your X400s should have plenty of.

2 Intern


205 Posts

January 20th, 2016 16:00

If OneFS.Next is going to keep MD on the SSDs with L3, that would be excellent! Last I talked to EMC peeps, that wasn't even on the radar.
