carlilek
2 Iron

Using SSD for data storage

Hi all,

First, some info about our setup

Isilon:

60-node cluster composed of S200s, S210s, and NL400s: 26 S-class nodes, 34 NLs.

Total size: 4.2PB

Total SSD: 43.5TB (116 drives)

Total HDD usage: 3.3PB, total SSD usage: 7.14TB.

GNA is used.

Clients:

5000-core HPC compute cluster (10GbE on each node)

200-300 Linux servers and VMs (many 10GbE)

~1000 Mac, Windows, Linux workstations/laptops (mostly 1GbE, some 10GbE)

Environment:

Mixed protocol, NFSv3, SMB 1 & 2. Simple LDAP auth for NFSv3, AD auth for SMB1/2.

Linux home directories, also accessible from Windows/Mac, mounted as a drive in the user profile in AD.

We serve two zones for this, tier1 and tier2.

Tier1 data is ingested to the S2x0 tier, and has Metadata R/W set. If data on tier1 has an mtime > 4 months, it is tiered down to the NL400s, but keeps its Metadata R/W.

Tier2 data is ingested to the NL400 tier, with Metadata Read set.


Our weekly churn on Tier1 right now seems to be hovering around 4TB.


We are likely to expand our S210 footprint in the near future (I hear our sales guys salivating) and slightly reduce our NL400 footprint. This will put us at somewhere north of 60TB of SSD in capacity (and something like 140 "spindles")--and most likely we'll still only be using 7TB of it.


As many people around EMC are aware, I hate having that much unused SSD capacity and really wish I could actually use the stuff. L3 isn't an option, because we need GNA, and at this time it's pretty much an all-or-nothing choice. Ideally, in the future Isilon will be able to designate some SSDs in a node tier for GNA and some for L3, but that's mostly a pipe dream of mine.


If you've actually read this far, here's the meat:

I'm pondering using the SSDs as an ingest pool for 1 week of data on Tier1. Probably just for files in the sub 1MB range. My question is whether 140 SSDs will hold up to this and to serving the metadata needs of the cluster. Any and all thoughts are welcome.
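For scale, here's a quick back-of-envelope on the raw write rate that ingest would put on the SSD pool, using the 4TB/week churn figure from above (protection overhead and metadata writes ignored; "per SSD" assumes an even spread across ~140 drives):

```python
# Rough ingest math: 4 TB/week spread over ~140 SSDs.
# Protection overhead and metadata writes are ignored here.
tb_per_week = 4.0
bytes_per_week = tb_per_week * 1024**4
seconds_per_week = 7 * 24 * 3600

aggregate_mb_s = bytes_per_week / seconds_per_week / 1024**2
per_ssd_kb_s = aggregate_mb_s * 1024 / 140

print(f"aggregate: {aggregate_mb_s:.1f} MB/s, per SSD: {per_ssd_kb_s:.0f} KB/s")
# aggregate: ~6.9 MB/s, per SSD: ~51 KB/s
```

The steady-state write rate is clearly trivial; the real question is burstiness and small-file IOPS, not throughput.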


Thanks,

Ken

15 Replies
Peter_Sero
3 Zinc

Re: Using SSD for data storage

Couple of thoughts:

With your colorful mix of workloads you are not reporting a specific performance problem. So any metric you attempt to improve would be a quite unspecific one, like average throughput or latency. You might now be seeing a high variance in these figures, so an improvement will probably not be "statistically significant" in a strong sense.

Your total SSD capacity is about 1% of the HDD capacity. GNA requires 1.5% SSD; have you checked whether GNA is really active and functioning, rather than just set to "enabled"?

Your actual SSD usage is about 0.2% of the HDD usage -- much less than the 1.5% requirement. Of course this makes the SSD capacity appear wasted. I conclude that you are certainly lucky to have mostly large files on the cluster, rather than many-small-files with lots of snapshots (beware).
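Checking those ratios against the numbers at the top of the thread:

```python
# Figures from the original post (TB). "Capacity" here is SSD as a
# share of total cluster size, which is roughly the 1% mentioned above.
total_size_tb = 4200.0   # 4.2 PB total
total_ssd_tb = 43.5
hdd_used_tb = 3300.0     # 3.3 PB
ssd_used_tb = 7.14

capacity_pct = total_ssd_tb / total_size_tb * 100
usage_pct = ssd_used_tb / hdd_used_tb * 100

print(f"capacity: {capacity_pct:.2f}%, usage: {usage_pct:.2f}%")
# capacity: 1.04%, usage: 0.22%
```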

The SSDs are underused IOPS-wise, right?

The reported churn of 4TB per week on 26 S nodes is... peanuts. Did you mean 40TB?

One needs to ask: what will putting the small files (data) on SSD really gain, given that there aren't "many" and that we have no specific metric to look at (other than SSD usage)?

A technical note concerning SmartPools file pool policies: one cannot make a rule for "ingesting small files". At the time a file is created, the matching rule sees a size of 0 bytes (and that rule is "cached" per directory), so any new file would land on SSD regardless of how large it eventually grows. Only when the SmartPools job runs are files migrated down. If you run SmartPools once a week and your rule criterion is one week, the results will be uneven.
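A toy model of that unevenness, assuming the SmartPools job runs every 7 days and the rule moves files older than a week (the function and numbers are illustrative only, not OneFS behavior):

```python
def days_on_ssd(created_day, job_period=7, age_days=7):
    """Days a file stays on SSD before a periodic SmartPools run
    (at day 7, 14, 21, ...) finds it older than age_days and moves it."""
    day = job_period
    while day - created_day < age_days:
        day += job_period
    return day - created_day

# A file created right at day 0 moves after 7 days; one created a day
# later just misses the next run and sits for 13 days.
print(days_on_ssd(0), days_on_ssd(1))  # 7 13
```

So actual SSD residency drifts between one and nearly two weeks depending on where creation falls in the job cycle.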

My final take is a shot from left field: as it seems you have about 4 SSDs per node in the S series, have you considered redistributing some SSD drives to the NL nodes? SSD support for the NL400 is quite new, and it would allow you to enable the L3 cache across the whole cluster!

Hope it makes sense and helps

-- Peter

carlilek
2 Iron

Re: Using SSD for data storage

Hi Peter,

This is exactly the discussion I was hoping to have! Thanks!

Let me go through your points...

1. We have a GNA exception, which allows us to run at 1%.

2. We have many small files, but also many large files. Wildly mixed workload.

3. SSD IOPS underused: Most likely, yes, but I'm not sure how to confirm this. Any tips would be welcome. We do have IIQ, but it is not working at the moment (why did I try upgrading it again, why?!)

4. Yes, it's peanuts, but that's on the Tier1 storage. Tier2 storage churns much more (this is largely political and chargeback related).

5. This will help (hopefully) with specific projects that use many, many small files and are constantly referencing and modifying them. The one I have in mind stitches together many image tiles after slightly adjusting them for variances in order to create a large image. The owners typically compare it to Google Maps. Unfortunately, they have not provided any specific guidance on what current performance is or what they need it to be.

6. File pool policies: Agreed. That's probably more of a pipe dream for me; I'm essentially trying to do a poor man's L3.

7. Left field: Sadly, we don't have any spare 3.5" drive carriers, nothing to replace the SSDs in the S nodes with, and I doubt Isilon would shine a happy light on that solution (as you need to have a hardware configuration file from them, which is hashed and xml'd and all sorts of unreproducible). We did investigate purchasing SSDs for the NLs, but the upgrade price was... shall we say... not appealing. And just because I'm pedantic, we have 10 S200s with 2 SSDs, 9 S200s with 6 SSDs, and 7 S210s with 6 SSDs. All of this was to fulfill the GNA requirement (and even the exception).

--Ken

kipcranford
2 Iron

Re: Using SSD for data storage

>

> 3. SSD IOPS underused: Most likely, yes, but I'm not sure how to confirm this. Any tips would be welcome. We do have IIQ, but it is not working at the moment (why did I try upgrading it again, why?!)

>

You could just collect some statistics manually for a relatively short period of time (depending on how long you think it will take to get a good snapshot of your cluster's activity), then analyze those. You could run something like the following (which you'd probably want to script):

isi statistics protocol -d --nodes all --long --csv  > /ifs/data/stats/protocol_stats.csv

isi statistics drive -d --nodes all --long --csv --timestamp  > /ifs/data/stats/drive_stats.csv

isi_for_array "sysctl isi.cache.stats >> /ifs/data/stats/\`hostname\`_cachestats.dump"

If you want I can write up a quick script for you.  In the script you can control the timing on the stats queries to take each individual measurement at slightly different times, control the sample interval, etc.  You could also do all this by hand, but scripts are easier.  In future OneFS releases you'll also be able to use the API (technically stats are in the API now, but are considered "experimental").

(And before anyone comments on the above, yes, I know there's isi_cache_stats, but I like the raw sysctl data)

So you run this for a couple of hours (or whatever) and are left with a bunch of CSV data.  You can either parse that data, or send it to me and I'll parse it for you.
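If you do parse the drive CSV yourself, a rough pass like this gives you average per-drive ops to compare your SSDs against the spindles. The column names ("Drive", "OpsIn", "OpsOut") are assumptions -- check them against the header row your OneFS version actually emits:

```python
import csv
from collections import defaultdict

def avg_ops_per_drive(path):
    """Average total ops (in + out) per drive across all samples.

    Column names are assumed; adjust to your actual CSV header.
    """
    totals = defaultdict(lambda: [0.0, 0])
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            t = totals[row["Drive"]]
            t[0] += float(row["OpsIn"]) + float(row["OpsOut"])
            t[1] += 1
    return {drive: s / n for drive, (s, n) in totals.items()}
```

Sorting the result and eyeballing where the SSD bays land relative to the HDDs would answer the "IOPS underused?" question directly.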

Or just work with Support to get IIQ working again.

>

> 5. This will help (hopefully) with specific projects that use many, many small files and are constantly referencing and modifying them. The one I have in mind stitches together many image tiles after slightly adjusting them for variances in order to create a large image. The owners typically compare it to google maps. Unfortunately, they have not provided any specific guidance on what current performance is or what they need it to be.

>

What SSD strategy are you using right now? This workload sounds like it would benefit from either metadata-write (which would consume more of your SSD space) or L3. I'll echo Peter's comments on moving some of your SSD to the NL nodes, such that all your nodes have local SSD. This would make it much easier for you to run L3, which I think would also help this workload (for the cached small blocks) and probably all of your other workloads to some degree. L3 (along with L2) will also cache all your active metadata, so you won't see too much of a difference from what you see now in terms of performance (and some workloads could see noticeable improvements). Really only on first access to cold data will you not get the benefit of reading that metadata mirror from SSD -- it will have to come from the spindles.

If indeed you have another order coming up, then perhaps you can work with your Sales team to figure out some creative way of getting SSD in the NLs.

carlilek
2 Iron

Re: Using SSD for data storage

Hi Kip,

I've just opened a ticket re IIQ; we'll see if it gets anywhere.

We're currently using metadata write for tier1 and metadata read for tier2.

I am somewhat cautious about that "first access" thing, since so much of our data lies dormant for quite a while before it is used, and our users can be unpleasantly latency sensitive.

--Ken

carlilek
2 Iron

Re: Using SSD for data storage

The other (and probably more real) issue with putting an SSD in every NL400 is that I'd have to fail out at least one drive from every single NL400, then play fun little games with disi.

ed_wilts
2 Iron

Re: Using SSD for data storage

> The other (and probably more real) issue with putting an ssd in every NL400 is the fact I'd have to fail out at least one drive from every single NL400, then play fun little games with disi.

You also need to be at 7.2 to get the equivalency for this, I think. Lots of fun to get there from here.

I've actually given up with mixing NL and S nodes in the same cluster.  SmartPooling down works OK most of the time, but tiering back up is an impossibility in larger clusters.  I'm not buying any more NL nodes without SSDs either for my SyncIQ targets.

kipcranford
2 Iron

Re: Using SSD for data storage

> Tier1 data is ingested to the S2x0 tier, and has Metadata R/W set. If data on tier1 has an mtime > 4 months, it is tiered down to the NL400s, but keeps its Metadata R/W.

That data doesn't have metadata-write if it's sitting on the NL400 nodes.  Those nodes have no SSD and are using GNA, which means they get an *extra* metadata mirror created for every file/directory, which goes on SSD somewhere else in the cluster (wherever there are SSD).  Any metadata read for that NL data will use that mirror on SSD, but any metadata write will have to update the mirrors on the NL spindles, plus update the extra mirror in SSD.  GNA is a read acceleration technology.

carlilek
2 Iron

Re: Using SSD for data storage

Interesting... even if the data is modified, the metadata is still also stored on HDD in that scheme?

Peter_Sero
3 Zinc

Re: Using SSD for data storage

Let's hope the project will get stuck due to some missing drive carriers...  and you should not be on your own with xml hw configs and disi commands... Didn't a wise Isilon guy once say, "everything is negotiable"? And I like Kip's hint towards finding a "creative way".

Btw, make sure your nodes are not running out of gas CPU-wise; also monitor:

isi statistics query -s 'node.cpu.*.avg'

Cheers

-- Peter
