August 28th, 2015 11:00

Using SSD for data storage

Hi all,

First, some info about our setup

Isilon:

60 node cluster composed of S200s, S210s, and NL400s. 26 S class nodes, 34 NLs.

Total size: 4.2PB

Total SSD: 43.5TB (116 drives)

Total HDD usage: 3.3PB, total SSD usage: 7.14TB.

GNA is used.

Clients:

5000 core HPC compute cluster (10GbE on each node)

200-300 Linux servers and VMs (many 10GbE)

~1000 Mac, Windows, Linux workstations/laptops (mostly 1GbE, some 10GbE)

Environment:

Mixed protocols: NFSv3 and SMB 1 & 2. Simple LDAP auth for NFSv3, AD auth for SMB 1/2.

Linux home directories, also accessible from Windows/Mac, mounted as a drive in the user profile in AD.

We serve two zones for this, tier1 and tier2.

Tier1 data is ingested to the S2x0 tier, and has Metadata R/W set. If data on tier1 has an mtime > 4 months, it is tiered down to the NL400s, but keeps its Metadata R/W.

Tier2 data is ingested to the NL400 tier, with Metadata Read set.


Our weekly churn on Tier1 right now seems to be hovering around 4TB.


We are likely to expand our S210 footprint in the near future (I hear our sales guys salivating) and slightly reduce our NL400 footprint. This will put us at somewhere north of 60TB of SSD in capacity (and something like 140 "spindles")--and most likely we'll still only be using 7TB of it.


As many people around EMC are aware, I hate having that much unused SSD capacity and really wish I could actually use the stuff. L3 isn't an option, because we need GNA, and it's at this time pretty much an all or nothing choice. Ideally, in the future Isilon will be able to designate some SSDs in a node tier for GNA and some for L3, but that's mostly a pipe dream of mine.


If you've actually read this far, here's the meat:

I'm pondering using the SSDs as an ingest pool for 1 week of data on Tier1. Probably just for files in the sub 1MB range. My question is whether 140 SSDs will hold up to this and to serving the metadata needs of the cluster. Any and all thoughts are welcome.


Thanks,

Ken

1.2K Posts

August 30th, 2015 22:00

Couple of thoughts:

With your colorful mix of workloads, you are not reporting a specific performance problem. So any metric you attempt to improve would be quite unspecific, like average throughput or latency. You might well be seeing high variance in those figures already, so an improvement will probably not be "statistically significant" in any strong sense.

Your total SSD capacity is about 1% of the HDD capacity. GNA requires 1.5% SSD; have you checked whether GNA is really active and functioning, rather than just set to "enabled"?
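
Off the top of my head, something along these lines should show it (exact flag names and output fields vary a bit between OneFS releases, so treat this as a sketch, not gospel):

isi storagepool settings view        # look for the Global Namespace Acceleration state
isi storagepool nodepools list       # per-pool drive/SSD layout that feeds the GNA ratio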

Your actual SSD usage is about 0.2% of the HDD usage -- much less than the 1.5% requirement. Of course this makes the SSD capacity appear wasted. I conclude that you are lucky enough to have mostly large files on the cluster, rather than many small files with lots of snapshots (beware).

The SSDs are underused IOPS-wise, right?

The reported churn of 4TB per week on 26 S nodes is... peanuts. Or would that be 40TB?

One needs to ask how much putting the small files (data) on SSD will really help, given that there aren't "many" and that we have no specific metric to look at (other than SSD usage).

A technical note concerning SmartPools file pool policies: one cannot make a rule for "ingesting small files". At the time a file is created, the rule that matches sees a size of 0 bytes (and that rule is "cached" per directory), so any new file would land on SSD regardless of how large it eventually grows. Files are only migrated down when the SmartPools job runs. If you run SmartPools once a week and your rule criterion is one week, the results will be uneven.

My final take is a long shot from left field: since you seem to have about 4 SSDs per node in the S series, have you considered redistributing some SSD drives to the NL nodes? SSD support for the NL400 is quite new, and it would allow you to enable the L3 cache across the whole cluster!

Hope it makes sense and helps

-- Peter

205 Posts

August 31st, 2015 05:00

Hi Peter,

This is exactly the discussion I was hoping to have! Thanks!

Let me go through your points...

1. We have a GNA exception, which allows us to run at 1%.

2. We have many small files, but also many large files. Wildly mixed workload.

3. SSD IOPS underused: Most likely, yes, but I'm not sure how to confirm this. Any tips would be welcome. We do have IIQ, but it is not working at the moment (why did I try upgrading it again, why?!)

4. Yes, it's peanuts, but that's on the Tier1 storage. Tier2 storage churns much more (this is largely political and chargeback related).

5. This will help (hopefully) with specific projects that use many, many small files and are constantly referencing and modifying them. The one I have in mind stitches together many image tiles after slightly adjusting them for variances in order to create a large image. The owners typically compare it to Google Maps. Unfortunately, they have not provided any specific guidance on what current performance is or what they need it to be.

6. File pool policies: Agreed. That's probably more of a pipe dream for me; I'm essentially trying to do a poor man's L3.

7. Left field: Sadly, we don't have any spare 3.5" drive carriers, nothing to replace the SSDs in the S nodes with, and I doubt Isilon would shine a happy light on that solution (as you need a hardware configuration file from them, which is hashed and XML'd and all sorts of unreproducible). We did investigate purchasing SSDs for the NLs, but the upgrade price was... shall we say... not appealing. And just because I'm pedantic: we have 10 S200s with 2 SSDs, 9 S200s with 6 SSDs, and 7 S210s with 6 SSDs. All of this was to fulfill the GNA requirement (and even the exception).

--Ken

125 Posts

August 31st, 2015 09:00

> 3. SSD IOPS underused: Most likely, yes, but I'm not sure how to confirm this. Any tips would be welcome. We do have IIQ, but it is not working at the moment (why did I try upgrading it again, why?!)

You could just collect some statistics manually, for a relatively short period of time (which will depend on how long you think it will take to get a good  snapshot of your cluster's activity), then analyze those.  You could run something like (which you'd probably want to script):

isi statistics protocol -d --nodes all --long --csv  > /ifs/data/stats/protocol_stats.csv

isi statistics drive -d --nodes all --long --csv --timestamp  > /ifs/data/stats/drive_stats.csv

isi_for_array "sysctl isi.cache.stats >> /ifs/data/stats/\`hostname\`_cachestats.dump"

If you want I can write up a quick script for you.  In the script you can control the timing on the stats queries to take each individual measurement at slightly different times, control the sample interval, etc.  You could also do all this by hand, but scripts are easier.  In future OneFS releases you'll also be able to use the API (technically stats are in the API now, but are considered "experimental").
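
In the meantime, here's a rough, untested sketch of the idea, looping the same three commands (the interval, sample count, and output directory are just placeholders to adjust):

#!/bin/sh
# Untested sketch: take SAMPLES measurements, INTERVAL seconds apart.
STATS_DIR=/ifs/data/stats
INTERVAL=300
SAMPLES=24
mkdir -p "$STATS_DIR"
i=1
while [ "$i" -le "$SAMPLES" ]; do
    TS=$(date +%Y%m%d_%H%M%S)
    isi statistics protocol -d --nodes all --long --csv > "$STATS_DIR/protocol_stats_$TS.csv"
    isi statistics drive -d --nodes all --long --csv --timestamp > "$STATS_DIR/drive_stats_$TS.csv"
    isi_for_array "sysctl isi.cache.stats >> $STATS_DIR/\`hostname\`_cachestats.dump"
    i=$((i + 1))
    sleep "$INTERVAL"
done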

(And before anyone comments on the above, yes, I know there's isi_cache_stats, but I like the raw sysctl data)

So you run this for a couple of hours (or whatever) and are left with a bunch of CSV data.  You can either parse that data, or send it to me and I'll parse it for you.

Or, just work with Support to get IIQ working again

> 5. This will help (hopefully) with specific projects that use many, many small files and are constantly referencing and modifying them. The one I have in mind stitches together many image tiles after slightly adjusting them for variances in order to create a large image. The owners typically compare it to google maps. Unfortunately, they have not provided any specific guidance on what current performance is or what they need it to be.

What SSD strategy are you using right now? This workload sounds like it would benefit from either metadata-write (which would consume more of your SSD space) or L3. I'll echo Peter's comments on moving some of your SSD to the NL nodes, such that all your nodes have local SSD. This would make it much easier for you to run L3, which I think would also help this workload (for the cached small blocks) and probably all of your other workloads to some degree. L3 (along with L2) will also cache all your active metadata, so you won't see too much of a difference from what you see now in terms of performance (and some workloads could see noticeable improvements). Really only on first access to cold data will you not get the benefit of reading that metadata mirror from SSD -- it will have to come from the spindles.
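
And if you do end up with SSD in the NL pools, flipping a node pool over to L3 should just be something like this (from memory, so double-check the flag against your OneFS version; "nl400_pool" is a placeholder name):

isi storagepool nodepools modify nl400_pool --l3 true
isi storagepool nodepools view nl400_pool      # confirm L3 shows as enabled once the conversion finishes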

If indeed you have another order coming up, then perhaps you can work with your Sales team to figure out some creative way of getting SSD in the NLs.

205 Posts

August 31st, 2015 11:00

Hi Kip,

I've just opened a ticket re IIQ; we'll see if it gets anywhere.

We're currently using metadata write for tier1 and metadata read for tier2.

I am somewhat cautious about that "first access" thing, since so much of our data lies dormant for quite a while before it is used, and our users can be unpleasantly latency-sensitive.

--Ken

205 Posts

August 31st, 2015 12:00

The other (and probably more real) issue with putting an SSD in every NL400 is that I'd have to fail out at least one drive from every single NL400, then play fun little games with disi.

August 31st, 2015 14:00

> The other (and probably more real) issue with putting an SSD in every NL400 is that I'd have to fail out at least one drive from every single NL400, then play fun little games with disi.

You also need to be at 7.2 to get the equivalency for this I think.  Lots of fun to get there from here.

I've actually given up with mixing NL and S nodes in the same cluster.  SmartPooling down works OK most of the time, but tiering back up is an impossibility in larger clusters.  I'm not buying any more NL nodes without SSDs either for my SyncIQ targets.

125 Posts

August 31st, 2015 15:00

> Tier1 data is ingested to the S2x0 tier, and has Metadata R/W set. If data on tier1 has an mtime > 4 months, it is tiered down to the NL400s, but keeps its Metadata R/W.

That data doesn't have metadata-write if it's sitting on the NL400 nodes.  Those nodes have no SSD and are using GNA, which means they get an *extra* metadata mirror created for every file/directory, which goes on SSD somewhere else in the cluster (wherever there are SSD).  Any metadata read for that NL data will use that mirror on SSD, but any metadata write will have to update the mirrors on the NL spindles, plus update the extra mirror in SSD.  GNA is a read acceleration technology.

205 Posts

August 31st, 2015 15:00

Interesting... even if the data is modified, the metadata is still also stored on HDD in that scheme?

1.2K Posts

August 31st, 2015 19:00

Let's hope the project won't get stuck just because of some missing drive carriers... and you should not be on your own with XML hardware configs and disi commands. Didn't a wise Isilon guy once say, "everything is negotiable"? And I like Kip's hint towards finding a "creative way".

Btw, make sure your nodes are not running out of gas CPU-wise; also monitor:

isi statistics query -s 'node.cpu.*.avg'

Cheers

-- Peter

125 Posts

September 1st, 2015 09:00

> Interesting... even if the data is modified, the metadata is still also stored on HDD in that scheme?

Correct.  When a file moves from an S* node pool with SSD and (presumably) metadata-write, to an NL node pool without SSD but with GNA, OneFS removes all the metadata mirrors for that file from the S* node pool's SSD, writes new metadata mirrors on the NL spindles, then creates an additional metadata mirror and writes it somewhere on some node pool with SSD.  This extra mirror is then used to accelerate metadata reads for that file while it lives on the NL node pool.

Any updates to that file's metadata while it exists on the NL node pool will have to update all the metadata mirrors on the NL spindles, plus update the extra mirror sitting on SSD somewhere on one of the S* node pools.

The metadata-write and metadata-read SSD strategies only apply to data sitting on the node pool with the SSD in it.  So my example file will have metadata-write functionality while on an S* node pool, but then will "lose" that functionality when it migrates to the NL node pool.
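
If you ever want to see where a particular file's blocks and metadata mirrors actually live, isi get with the debug flags will dump it (the output is verbose, so aim it at a single file; the path below is just a placeholder):

isi get -DD /ifs/path/to/some/tiered-down/file | less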

205 Posts

September 8th, 2015 16:00

Well, I've finally got IIQ working again (note, peeps: do not erase your boot disk for IIQ 3.2.1, you will lose your database even if your data is stored on NFS; also note that you need IPv6 enabled in the kernel).

In any case, I've made a filter for my performance tier (read: S2x00s) and Disk 1/Bay 1, ensuring I'm just looking at SSDs. Over the last hour (during which I've been running metadata-intensive tests), the maximum Protocol Operations Rate I see (which I'm assuming corresponds to IOPS) is 2.43K. I don't know enough about the performance characteristics of the drives in these nodes to say whether that's underutilized, but I'm thinking it is, given the specs I can find floating around for these drives (57K write, 21K read).

Nevermind, it's worse than that, he's dead, Jim, dead, Jim, dead, Jim...

sorry.

I realized that what I needed to be looking at was Disk Operations Rate for the performance tier, broken out by disk. Under that rubric, the most I've seen in the last hour is around 440 IOPS. That's just pathetic.
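
(If anyone wants to cross-check outside of IIQ, something like the line below should work from the CLI -- assuming the drive type shows up in the long output, which it appears to here.)

isi statistics drive -d --nodes all --long | grep -i ssd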

1.2K Posts

September 8th, 2015 22:00

Ken, your metadata-intensive tests might not hit the SSDs very hard because

- on the "write" side, the HDDs used for data still determine the pace, unless you're creating just empty files.

- on the "read" side, you're lucky in that the metadata comes mostly from the L1/L2 RAM caches.

On our X200 nodes with single 200GB SSDs we have seen peaks around 4000 IOPS per SSD.

-- Peter

1.2K Posts

September 9th, 2015 01:00

> On our X200 nodes with single 200GB SSDs we have seen peaks around 4000 IOPS per SSD.

... apparently caused by NDMP tree-walks... 

205 Posts

September 9th, 2015 03:00

Well, we'll see what it looks like during the day when everyone in the building is around and hitting it, but I'm pretty sure these tests don't just hit cache. In any case, the performance tier as a whole is showing disk operations rates of 50-100K during this time. We've just got SO much SSD, it all gets sucked up in the noise.

1.2K Posts

September 9th, 2015 06:00

Check the metadata cache hit rates; they are usually much higher than the data cache hit rates!
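
isi_cache_stats should show the split (I believe -v gives the verbose breakdown; otherwise fall back to the raw sysctl Kip posted earlier):

isi_cache_stats -v        # per node; compare the metadata hit ratios against the data ones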
