4 Operator

 • 

1.2K Posts

November 18th, 2015 05:00

Sad to see it hasn't worked out for you yet. If I recall correctly, your original SSDs weren't fully utilized with GNA, so using the L3 cache to cover data as well was suggested as worth considering.

Now you have increased the SSD capacity, use it for L3 metadata cache only, and find it to be overused. That's certainly strange and there should be some explanation.

When you ran SmartPools, how did the NL410 SSDs fill over time, i.e. did you see them fill up way before the job finished?

Quick notes on the data analysis shown:

The cache age is for data in the L2 cache -- looks normal to me, given that it's just the RAM-sized cache. Nothing to worry about from the metadata side I'd say.

The L3 metadata hits... are percentages of the L3 cache read attempts. And we don't know how many L3 attempts there are.

Can you look at the absolute L2 and L3 metadata read attempts (=starts) and hits? As discussed in an earlier thread with sluetze, if your L2 hit rate happens to be high already, L3 starts = L2 misses are comparatively few, and L3 misses even less (i.e. as percentage of all ORIGINAL requests as measured by L2 starts). Seeing the bigger picture will help.

Or -- if you manage to get an exception for using SSD for GNA with 6TB equipped nodes, let us know...

-- Peter

2 Intern

 • 

205 Posts

November 18th, 2015 08:00

But really, even if it were big enough to hold all the metadata, it's still not pre-populated. So first access will always be slow. And since a percentage of the files are by definition old, when the user goes to look at them... or worse yet at a folder that contains them, it's gonna be slow.

Basically, my premise is that if combined with regular old metadata SSD patterns (GNA, metadata-read, metadata-write), L3 would make a hell of a lot of sense. Use the leftover unused space for cache. Awesome. But without it, it just results in a lot of pain for first access. I suppose if the workflows involved rarely encountered first access situations (I'm not sure what would), it wouldn't be a big deal, but as it stands, L3 becomes a very bad solution, despite how it's being pushed.

2 Intern

 • 

205 Posts

November 18th, 2015 08:00

Peter_Sero wrote:

Sad to see it hasn't worked out for you yet. If I recall correctly, your original SSDs weren't fully utilized with GNA, so using L3 cache covering also data was suggested as worthwhile considering.

Now you have increased the SSD capacity, use it for L3 metadata cache only, and find it to be overused. That's certainly strange and there should be some explanation.

We still use GNA for all of our node pools except the NL410s (by OneFS requirement). It is still wildly underused. But since L3 is a nodepool-only construct, it's vastly overused on the NL410s with a ratio of 0.4% SSD to HDD. Our average metadata SSD usage is 0.46%, so not really tooooo far off--but probably off enough, especially since our total SSD ratio not counting the L3 SSDs or L3-enabled nodes is ~1.2% (yes, we have an exception for GNA).

Ah, I didn't realize the cached data age was L2 only. That does make more sense.

As far as hits vs. misses...

[Attached screenshot: Screen Shot 2015-11-18 at 10.58.01 AM.png]

It's ooglay.

I am not certain how the NL410 SSDs filled; that information seems to be pretty well hidden.

4 Operator

 • 

1.2K Posts

November 19th, 2015 00:00

> I am not certain how the NL410 SSDs filled; that information seems to be pretty well hidden.

Not sure about InsightIQ (waiting for our FREE license...), but in the CLI on a node with SSD:

isi statistics drive  --node --type SSD --long

There is also some historical information retained in every cluster; let me know if you're interested in using

isi statistics history ...

for that.

Have you considered populating the L3 cache with some manual treewalks? Sounds terrible, but as long as there is this gap of cache warming for internally migrated data, what else can one do? BTW, new clusters with data ingested via NFS or SMB will obviously not see this issue.
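For what it's worth, a manual treewalk can be as dumb as stat'ing everything under a subtree; reading each inode should pull its metadata through the cache on first access. This is just a sketch (the /ifs path is a placeholder, and on a real cluster you'd probably parallelize across nodes):

```python
import os

def warm_metadata(root):
    """lstat every entry under root; returns the number of inodes touched."""
    touched = 0
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            try:
                os.lstat(os.path.join(dirpath, name))  # forces an inode read
                touched += 1
            except OSError:
                pass  # entry vanished mid-walk; ignore
    return touched

print(warm_metadata("/ifs/data"))  # hypothetical path
```

`find /ifs/data -ls > /dev/null` would do much the same from a shell.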

To slightly augment your wishlist:

(Dear Santa...)

- metadata on SSD as traditional

- spare SSD capacity used for read and write-back(!) cache

- SSDs on PCIe rather than sharing SAS with the HDDs

Cheers

-- Peter

PS: At least one of the 26762 sysctl's in 7.2.0.4 speaks of L3 write-back (which of course does not necessarily imply that such function will actually materialize)

2 Intern

 • 

205 Posts

November 19th, 2015 03:00

The output is suspiciously similar to that of the HDDs in those nodes:

dm11-58# isi statistics drive --node 33,38,43,44,46,59 --type SSD --long

   Drive Type OpsIn BytesIn SizeIn OpsOut BytesOut SizeOut TimeAvg Slow TimeInQ Queued Busy Used Inodes
LNN:bay        N/s     B/s      B    N/s      B/s       B      ms  N/s      ms           %    %
    33:1  SSD   1.0     11K    11K    6.8      55K    8.1K     0.4  0.0     0.0    0.0  0.2 66.3   5.2M
    38:1  SSD   1.4     13K   9.4K    6.8      54K    8.0K     0.4  0.0     0.0    0.0  0.2 66.3   5.2M
    43:1  SSD   1.4     13K   9.4K    6.8      54K    8.0K     0.4  0.0     0.0    0.0  0.0 66.3   5.2M
    44:1  SSD   2.4     26K    11K    8.4      69K    8.2K     0.4  0.0     0.0    0.0  0.2 66.3   5.2M
    46:1  SSD   0.8    6.6K   8.2K    7.0      57K    8.2K     0.5  0.0     0.0    0.0  0.1 66.3   5.2M
    59:1  SSD   1.4     11K   8.2K   12.6     102K    8.1K     0.5  0.0     0.0    0.0  0.4 66.4   5.2M

vs

dm11-58# isi statistics drive  --node 33,38,43,44,46,59 --type=sata --long

   Drive Type OpsIn BytesIn SizeIn OpsOut BytesOut SizeOut TimeAvg Slow TimeInQ Queued Busy Used Inodes
LNN:bay        N/s     B/s      B    N/s      B/s       B      ms  N/s      ms           %    %
    33:2 SATA   0.0     0.0    0.0   14.4     118K    8.2K     2.0  0.0     0.0    0.0  7.0 66.3   5.2M
    33:3 SATA   0.0     0.0    0.0   16.8     138K    8.2K     2.2  0.0     0.0    0.0  8.2 66.3   5.2M
    33:4 SATA   0.0     0.0    0.0   13.0     106K    8.2K     2.1  0.0     0.0    0.0  7.4 66.3   5.2M
    33:5 SATA   0.0     0.0    0.0   10.6      87K    8.2K     2.3  0.0     0.0    0.0  7.0 66.3   5.2M
    33:6 SATA   0.0     0.0    0.0   10.2      84K    8.2K     2.0  0.0     0.0    0.0  5.1 66.3   5.2M
    33:7 SATA   0.0     0.0    0.0   17.2     505K     29K     2.0  0.0     0.0    0.0  8.2 66.7   5.2M

4 Operator

 • 

1.2K Posts

November 19th, 2015 05:00

This is clearly unexpected... do you plan to have this checked by EMC?

2 Intern

 • 

205 Posts

November 19th, 2015 11:00

At this point, I don't even know where to point my finger. All I know is that it's slow as molasses when it's configured wrong, and as usual with EMC support, they want to see it while it's broken. I don't have the good will left in my organization to re-break it just so they can look at it.

November 20th, 2015 05:00

SSDs on the NL410 are meant to speed up job operations by doing metadata acceleration, but it is completely understood that we are unable to fit all the metadata on the single SSD. It's called L3-meta vs standard L3. Notice that HD nodes and the NL400 with 6TB drives required OneFS 7.2 because it added this L3-meta mode. L3-meta will not do any caching of data blocks.

Since you are on the NL410, I assume you are on OneFS 7.2.1, which makes L3 pools completely separate from the nodes that are being GNA accelerated (things were much worse for GNA in 7.1.1 and 7.2). This means that GNA is only accelerating the metadata on the NL nodes without SSD in them. Your NL410 6TB drives and one SSD are being ignored by GNA.

I would engage your account team and ask for them to research your options, because it does seem like there was a regression in functionality here. I will say though, that when you say the word performance in regards to NL nodes, I would not get your hopes up since the NL is not a performance platform.


Historically, one of the issues with GNA (before L3) is that you end up having to buy S nodes with SSD whenever you add NL nodes, which bumps up the price and complexity, so it's much cleaner to go with the X410 platform which has the SSD built in and allows you to use the traditional metadata strategies or L3.

2 Intern

 • 

205 Posts

November 20th, 2015 07:00

Hmm. There's this here "enable L3 cache" checkbox in the Smartpools interface for the NL410s. I wonder what happens if I uncheck that...

2 Intern

 • 

205 Posts

November 20th, 2015 07:00

Yes, I'm starting to work with my sales team.

Here's the issue with X410s... we need S class nodes for small file workloads, but we also need some deep storage to drop older stuff to, so this is a total regression.

I don't expect read or write performance from the NLs, of course, but with GNA I at least got namespace performance.

2 Intern

 • 

205 Posts

November 20th, 2015 13:00

I have spoken to an engineer, and we have determined the following:

If a cluster is GNA enabled AND it has nodes with L3-metadata setups (ie, NL410s or HD400s w/6TB drives)

-->Files on those 6TB nodes have their metadata stored via GNA on other nodes with SSD and that metadata is cached via L3

-->Directories on those 6TB nodes do NOT have their metadata stored via GNA, and that metadata is cached via L3

Incidentally, to see where the metadata for an object is, do an isi get -D on that object (or -Dd if it's a directory) and look for this:

*  File Data (48 bytes):

*    Metatree Depth: 1

*    78,16,988674719744:8192

*    101,16,918677561344:8192

*    121,19,74253762560:8192

*    130,34,1135611068416:8192

*    131,33,1172303634432:8192

*    133,33,1052311330816:8192

Where the first number is the devid of the node it's on, and the second is the lnum of the drive in that node that it's on. For a directory on an L3-metadata node, there are no values listed under Metatree Depth: 1.
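If you want to check a lot of objects, a small helper (my own sketch, not an Isilon tool) can pull the devid and lnum out of those block-address lines; the "78,16,988674719744:8192" layout is taken from the output above (devid, lnum, block address : length).

```python
def parse_metatree_lines(lines):
    """Return (devid, lnum) pairs from isi get -D block-address lines."""
    pairs = []
    for line in lines:
        body = line.lstrip("* ").strip()       # drop the "*    " prefix
        parts = body.split(",")
        if len(parts) == 3 and parts[0].isdigit() and parts[1].isdigit():
            pairs.append((int(parts[0]), int(parts[1])))
    return pairs

sample = [
    "*    78,16,988674719744:8192",
    "*    101,16,918677561344:8192",
]
print(parse_metatree_lines(sample))  # [(78, 16), (101, 16)]
```

Feed it the whole `isi get -D` output and non-matching lines (like "Metatree Depth: 1") are simply skipped.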
