2 Intern


300 Posts

January 7th, 2016 00:00

Hi Peter,

Nice insight, I hadn't thought about that. This means that to warm up the cache you would have to connect to each node and do a complete treewalk.

In the end I didn't warm up the cache, since the activation time was quite unpredictable (e.g. two clusters, one on standby for the other, so the same configuration, node type, node count and data amount: the standby needed ~70h, the active one (with user load!) needed ~18h), and my administrators were not willing to take a look every hour over the weekend...

All in all I'm quite happy with the activation.

After the users warmed up the cache (and we didn't reset the stats) we have:

data hit rates: ~70%

meta hit rates: ~80%

ratio of disk OpsOut to SSD OpsOut: ~1:15 - 1:20

ratio of disk BytesOut to SSD BytesOut: ~1:5

4 Operator


1.2K Posts

January 7th, 2016 01:00

Glad to see it works well for you!

Can you put the L3 hit rates into one context with the L2 hit and miss rates?

Because the actual L3 hit rate with respect to the requested data (or metadata) reads is conceptually: (L2 miss%) x (L3 hit%)

In other words: if L2 cache works great (low L2 miss%), there is not much left for the L3 to do, and the L3 hit percentage becomes less meaningful...
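Peter's point can be sketched numerically. A minimal illustration (the function name and figures are mine, chosen to match the hit rates discussed later in the thread):

```python
# The fraction of ALL read starts that the L3 cache actually serves is
# the product of the L2 miss fraction and the L3 hit fraction.
def effective_l3_hit_rate(l2_miss, l3_hit):
    """Fraction of all cache starts (= L2 starts) served by L3."""
    return l2_miss * l3_hit

# If L2 already absorbs most reads, even a high L3 hit% serves few requests:
print(effective_l3_hit_rate(0.065, 0.687))  # ~0.045, i.e. ~4.5% of all reads
```

So a 68.7% L3 hit rate only translates into ~4.5% of total reads when L2 already catches 93.5% of them.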

2 Intern


300 Posts

January 7th, 2016 02:00


I knew beforehand that I have relatively high hit rates in L1 and L2, and thus my logic was that I would have high hit rates in L3 too. That's correct, as my stats show.

And yes, you do profit more if the L3 has more requests to answer. But you can also have wins in the "low numbers".

Calculation:

Without L3: L1.miss x L2.miss = disk.hits

2% x 6.5% = 0.13%

With L3: L1.miss x L2.miss x L3.miss = disk.hits

2% x 6.5% x 31.3% = 0.04%

So even with high hit rates higher up, I still reduce the disk hits by roughly a factor of three.
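The arithmetic above can be checked with a short sketch (same percentages as in the post; the helper function is just for illustration):

```python
# Chain the miss rates of each cache level: a read only reaches disk if it
# misses every level, so the final disk-hit rate is the product of the misses.
def disk_hit_rate(*miss_rates):
    """Multiply per-level miss fractions to get the final disk-hit fraction."""
    rate = 1.0
    for m in miss_rates:
        rate *= m
    return rate

without_l3 = disk_hit_rate(0.02, 0.065)          # L1.miss x L2.miss
with_l3    = disk_hit_rate(0.02, 0.065, 0.313)   # ... x L3.miss
print(f"{without_l3:.4%}")  # 0.1300%
print(f"{with_l3:.4%}")     # 0.0407%
```

Adding the L3 term cuts the disk hits from 0.13% to ~0.04% of all reads, which is the ~1:3 win claimed above.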

cache_stats.png

4 Operator


1.2K Posts

January 7th, 2016 05:00

I'd leave the L1 cache out of the picture here.

Firstly, the L1 cache happens where the file is accessed (the node the client is connected to), while L2/L3 are local to the node that has the disk blocks. One would need to summarize all L1 activity in a cluster and compare it to the summarized L2/L3 activity (or by node pools, if they match the configured NAS access zones).

Secondly, the accounting of the L1 in OneFS 7.1 and later is "overlapping" with the L2 activity, so the cache "levels" are not visible as "stages" in the same sense as with L2 and L3, even if summarized over the cluster.

In your example the hits and misses don't add up to 100% (= L2 starts), because the prefetch hits etc. are omitted. The picture I get is as follows:

Data:

L2 hits/pref: 93.5% = 100% - 6.5% (assuming all non-misses are hits or prefetch hits)

L3 hits: 4.5% = 6.5% x 68.7%

L3 misses: 2.0% = 6.5% x 31.2% (i.e. FINAL misses)

Meta:

L2 hits+pref: 92.5% = 100% - 7.5%

L3 hits: 5.9% = 7.5% x 78.1%

L3 misses: 1.6% = 7.5% x 21.9%

Which means that the final misses are still pretty low, 2% and 1.6%, really nice!
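The bookkeeping above (normalizing L3 hits and misses to the scale of the L2 starts) can be expressed compactly. A sketch, using the same figures as the post (the function name is mine):

```python
# Express all cache activity as fractions of the total read starts (= L2 starts):
# whatever misses L2 becomes the L3's workload, split into L3 hits and final misses.
def breakdown(l2_miss, l3_hit):
    """Normalize L2/L3 activity to the scale of all cache starts."""
    return {
        "L2 hits+prefetch": 1.0 - l2_miss,
        "L3 hits":          l2_miss * l3_hit,
        "final misses":     l2_miss * (1.0 - l3_hit),
    }

data = breakdown(0.065, 0.687)   # data reads:     ~93.5% / ~4.5% / ~2.0%
meta = breakdown(0.075, 0.781)   # metadata reads: ~92.5% / ~5.9% / ~1.6%
```

The three fractions always sum to 100%, which is the consistency check Peter applies to the raw stats.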

Happy caching!

-- Peter

4 Operator


1.2K Posts

January 7th, 2016 05:00

L2_meta hits are reported as 81.5% and misses as 7.5%, so 11% is still missing to make up 100%. For a consistent picture, where my three figures should add up to 100%, I "adjusted" the 81.5% by those 11%, assuming the best case, namely prefetch hits.

If you have the full cache statistics, first double-check that the L2 misses (block rate) are basically the same as the L3 starts (block rate). This simply reflects how L3 is staged after L2. It also allows you to scale all L3 activity to the initial cache starts (from L2) by multiplying with the L2 miss percentage. Think of drawing all L2 and L3 activity into one bar chart, all at the same scale.

Sad to learn that your L3 is still "nonexistent" -- did anything go wrong with the drive (re-)purposing? There has been a KB article on this... I remember your last drive statistics were kind of inexplicable...

-- Peter

2 Intern


205 Posts

January 7th, 2016 05:00

Peter, my brain is addled right now, but I'm not seeing where you're getting the 92.5% number in your metadata calculation.

Looking at mine right now, I'm trying to figure out how it compares (bearing in mind that my L3 is largely nonexistent, other than those 6 NL410s).

And looking at the command run on an NL410 vs. an S210 right now, I'm seeing that the command is node-specific, is it not? The S210 shows 100% misses on L3 (as it should, there isn't any), and the NL410 shows 19.4% misses on the L3 (which is pretty crummy).

2 Intern


205 Posts

January 7th, 2016 14:00

We've decided not to use L3 at all, but we still have those 6TB NL410s that require it for metadata acceleration. That's pretty much all I meant.

4 Operator


1.2K Posts

January 7th, 2016 23:00

I see. So it's GNA with all SSDs in the main cluster then?

4 Operator


1.2K Posts

January 19th, 2016 01:00

Re my concerns

The L3 cache is *local* to each node, as we know. This has consequences:

With metadata being mirrored on at least three nodes, any naive attempt to warm up the metadata cache by a single treewalk will only cache one mirror for each affected LIN.

Which means that later on the odds of a cache MISS are still 2/3, or 2 against 1.

Unless, of course, there is one "preferred" copy of a LIN's metadata that is always used. Having that one in the cache would be sufficient for all subsequent accesses from other nodes.

I have received good news from the Isilon team. There is in fact, for each LIN, one preferred copy of the metadata that gets read and cached. It is determined by a hash function of the LIN id, so any access from any node goes to that very copy (unless it is offline, in which case access fails over to another copy). So one can effectively warm up an L3 metadata cache with a single treewalk...
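The mechanism described above can be sketched as follows. This is purely illustrative: OneFS's actual hash function and mirror layout are not public, and the function and node names here are made up to show why a deterministic choice makes one treewalk sufficient.

```python
# Illustrative only -- not the real OneFS hash or layout.
# Because the mirror choice is a pure function of the LIN id, every node in
# the cluster reads the SAME copy, so warming it once warms it for everyone.
def preferred_mirror(lin_id, mirror_nodes):
    """Deterministically pick one metadata mirror for a LIN."""
    return mirror_nodes[lin_id % len(mirror_nodes)]

mirrors = ["node-2", "node-5", "node-7"]   # e.g. 3x mirrored metadata
# Any access, from any client node, lands on the same preferred copy:
copy_a = preferred_mirror(0xABC123, mirrors)  # access via node 1
copy_b = preferred_mirror(0xABC123, mirrors)  # access via node 4
assert copy_a == copy_b
```

Without such a deterministic choice, a single treewalk would cache only one of three mirrors, leaving 2/3 odds of a miss on later accesses, which is exactly the concern raised above.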

Cheers

-- Peter

2 Intern


205 Posts

January 19th, 2016 04:00

Except that the metadata-only L3 caches provided are not big enough to hold all the metadata, depending on workload. And it also doesn't solve the metadata first-access problem for a combined L3 data/metadata cache; metadata will still be evicted in favor of data a fair amount of the time (unless your L3 is the same size as your data, in which case, why do you have HDDs? ;-)

4 Operator


1.2K Posts

January 19th, 2016 06:00

I understand there is the problem of predicting how data and metadata compete for L3 space. The same goes for L2, by the way; my wish to EMC Isilon for more insight persists...

When comparing metadata-only L3 with classic metadata-read GNA, it is now clear that both store at most one copy per LIN until SSD capacity is exhausted. So how could metadata-only L3 ever consume more SSD space than metadata-read GNA on the same file set? That remains a mystery to me.

fwiw

-- Peter

2 Intern


205 Posts

January 19th, 2016 16:00

It wouldn't. But the metadata-only nodes are not allowed to use GNA/metadata-read and therefore have insufficient SSD for it (for certain workloads). And they also have insufficient SSD to cache all metadata. Alternatively, it simply doesn't work right.

--Ken

1 Rookie


16 Posts

January 19th, 2016 23:00

This could probably be a new post, but it's related, so I'll add to this one.

We are planning to enable L3 cache on our cluster. What steps would be advisable from a 'performance data' collection point of view?

I can do treewalks prior to the change. According to InsightIQ, we have ~1.2PB of data in ~226M files and ~29M directories. Walking over SMB can be slow. Did the question of FSAnalyze populating the L3 cache get answered? What is the best way to perform the treewalks?

We are moving from pure SATA NL nodes with S200 SSD nodes for GNA to X400 nodes with 4x 800GB SSDs for L3. Our data is file shares with mostly user-interactive work, and a small percentage of active data.
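For the treewalk question: the goal is just to stat every entry so each LIN's metadata gets read. A minimal sketch of such a warm-up walk (the mount-point path is an example, not from the thread; running it on or close to the cluster, e.g. over NFS, should be faster than over SMB):

```python
import os

# Walk the tree and lstat every directory and file, so each entry's
# metadata is read once and can land in the node-local caches.
def warm_metadata(root):
    """Stat every entry below root; return the number of entries touched."""
    count = 0
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            try:
                os.lstat(os.path.join(dirpath, name))
                count += 1
            except OSError:
                pass  # entry vanished mid-walk; keep going
    return count

# warm_metadata("/mnt/isilon/share")   # example mount point
```

With ~226M files and ~29M directories this is still a long-running job, so splitting the walk across several top-level directories run in parallel would be worth considering.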

January 20th, 2016 08:00

So when data is written to an Isilon cluster, blocks are written to L2 at the same time they are written to disk; this accelerates a subsequent read of those blocks. Then, after some time, those blocks fall out of L2 into L3.

I am assuming you will connect the new nodes to the same cluster and use SmartPools to migrate the data to the new nodes. If that is the case, the question is going to be: do data and MD blocks hit L2 cache from a SmartPools job? If so, they will fall into L3, so your L3 will be warmed without having to force a treewalk. I would suggest reaching out to your account team to verify this.

If you are using SyncIQ to migrate, then this will absolutely work, since SyncIQ has to go through the proper write path.

Also, I believe OneFS.Next will have some improvements around keeping as much or all MD on the SSDs, as long as you have enough SSD capacity, which your X400s should have plenty of.

2 Intern


205 Posts

January 20th, 2016 16:00

If OneFS.Next is going to keep MD on the SSDs with L3, that would be excellent! Last I talked to EMC peeps, that wasn't even on the radar.
