
April 3rd, 2014 15:00

How best to remove a single node from a cluster

I have a customer with a cluster of four 108NL nodes and four NL400 nodes.  They need to remove one of the 108NL nodes as it is coming off maintenance.  Is the best way to remove the node just to smartfail it, or is there another way?  They are concerned that data protection will be compromised while the smartfail process is running.  They are at N+2:1 protection on OneFS 7.0.2.4.  They recently had to smartfail a 108NL node and it took 9 days to complete.

  I looked into using SmartPools, but that appears to be for migrating data from one pool to another, and this is only one node within a pool.

Thoughts?

450 Posts

April 3rd, 2014 21:00

Yes, a smartfail is your best option in this case. It will certainly take a while to complete, but that is somewhat by design: a smartfail is, after all, a deliberate (predictive) failure, and it kicks off a FlexProtect job that re-protects all the data in that disk pool/node pool.  Also keep in mind that at N+2:1 with 4 nodes, the parity overhead is basically 25%; when you drop to only 3 nodes, it increases to 33%. So not only does the node pool's capacity decrease, the amount of parity also increases.  Some simple math: if there were 100TB of data on 4 nodes, at 25% overhead it would occupy 125TB of formatted space; after removing a node, it would occupy 133TB.
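To put that arithmetic in concrete terms, here is a back-of-the-envelope sketch (illustrative only; the real OneFS overhead also depends on stripe width, file sizes and the protection policy, not just node count):

```python
# Back-of-the-envelope version of the figures above -- a rough sketch only,
# not an exact OneFS space calculation.

def approx_formatted_tb(data_tb, parity_overhead_pct):
    """Space consumed when FEC adds `parity_overhead_pct`% on top of the data."""
    return data_tb * (1 + parity_overhead_pct / 100.0)

# The 4-node vs. 3-node figures quoted above for N+2:1:
print(approx_formatted_tb(100, 25))  # 125.0 TB with four nodes in the pool
print(approx_formatted_tb(100, 33))  # 133.0 TB after one node is removed
```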

~Chris Klosterman

Senior Solution Architect

EMC Isilon Offer & Enablement Team

chris.klosterman@emc.com

Twitter: @croaking

450 Posts

April 3rd, 2014 22:00

No, you're not misreading the GUI, but you have to understand that it's impossible to calculate exactly how much space is usable for file storage, because it depends on the size of each file.  Isilon uses 8KB blocks and 128KB stripe units, and files smaller than one stripe unit (less than 128KB) must instead be mirrored to ensure protection. We mirror them at one level higher than they would otherwise be protected, so at N+2:1 we mirror at 3x.  As an example, a 50KB file consumes 56KB of space for one copy; because we cannot stripe it, we mirror it, so it ends up consuming 168KB of space.  The amount of parity is therefore dependent on the file size, but also on the number of nodes in the cluster (this is especially true with larger files).
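To make the 50KB example concrete, here is a quick sketch of the block-rounding and mirroring arithmetic (illustrative only, not how OneFS actually lays files out on disk):

```python
import math

# Sketch of the small-file arithmetic above: round the file up to 8KB
# blocks, then multiply by the number of mirror copies.

BLOCK_KB = 8          # OneFS block size
STRIPE_UNIT_KB = 128  # files smaller than one stripe unit are mirrored

def small_file_footprint_kb(file_kb, mirror_copies=3):
    """Space for a sub-128KB file mirrored `mirror_copies` times (3x at N+2:1)."""
    if file_kb >= STRIPE_UNIT_KB:
        raise ValueError("files of one stripe unit or more are striped, not mirrored")
    per_copy_kb = math.ceil(file_kb / BLOCK_KB) * BLOCK_KB
    return per_copy_kb * mirror_copies

print(small_file_footprint_kb(50))  # 168 KB: 56 KB per copy x 3 copies
```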

In the end, the key here is that Isilon is extremely flexible because protection is per-file, but that also means the overhead is determined by the file itself.  One common example I use for this discussion is home directories: you might put all home directories under one top-level directory with default protection, but then decide you want to protect just your executives' home directories at a higher level. You can do that.  You can also use file pool policies with lots of options, for example to protect all .jpg files over 6 months old at N+4. I don't know why you would do that particular thing, but that's not really the point.

What most storage admins are used to with RAID is a known quantity of storage: with a 4+1 RAID 5 you lose 20%, and then you lose more overhead when formatting the filesystem. In that case, regardless of file size, you have a known but very inflexible amount of overhead.

~Chris

2 Intern

 • 

467 Posts

April 3rd, 2014 22:00

What I end up doing for my executive dashboard is taking the "HDD Avail" stat, factoring in the default SmartPools protection, and adjusting for my file size distribution (I skew towards lots of small files).  I take the percentage of space that is mirrored (as reported by InsightIQ) and make a sort of educated guess at the likely amount of actual usable space.  It works, I just wish it were somehow easier.
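Something along these lines, roughly (all names and numbers below are made up for illustration; the mirrored percentage would come from InsightIQ and the available figure from the cluster stats):

```python
# Rough sketch of the "educated guess" described above -- illustrative only.

def estimate_usable_tb(hdd_avail_tb, default_overhead_pct, mirrored_pct,
                       mirror_copies=3):
    """Blend the striped share (default FEC overhead) with the mirrored
    share (full mirror multiple) to guess usable logical capacity."""
    mirrored_share = mirrored_pct / 100.0
    striped_share = 1.0 - mirrored_share
    # Raw TB consumed per TB of logical data, averaged across both populations.
    raw_per_logical = (striped_share * (1 + default_overhead_pct / 100.0)
                       + mirrored_share * mirror_copies)
    return hdd_avail_tb / raw_per_logical

# e.g. 400 TB available, N+2:1 (~25% overhead), 10% of data in small mirrored files:
print(round(estimate_usable_tb(400, 25, 10), 1))  # ~280.7 TB of usable data
```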

2 Intern

 • 

467 Posts

April 3rd, 2014 22:00

The Isilon GUI (at least in 7.0.2.5) is very bad at reflecting the amount of storage actually consumable on the "cluster status" screen.  It reports total consumable space without taking protection levels into account, unless I'm misreading my GUI.

450 Posts

April 4th, 2014 06:00

Mark,

I would suggest that you pass this feedback on to your EMC account team and have them submit a feature request on your behalf. I'm not sure exactly how such a feature would be implemented, because in the end it's just an estimate based on average file size and protection levels, but I certainly agree that some sort of rough estimate of usable capacity including overhead would be useful. Perhaps it could be added to the SmartPools page in the WebUI, giving a rough guess based on the protection level of each node pool. File pool policies that change protection could make it a poor estimate, but it would still be better than nothing.

~Chris

1.2K Posts

April 4th, 2014 07:00

>   They are concerned that data protection will be compromised while the smartfail process is running.

sfallon:

The good thing, which cannot be emphasized enough, is that during the entire smartfail procedure (a FlexProtect job that can take up to several days) the actual protection is never reduced or endangered!


So for this part of your original question, there is no reason for concern.


Cheers


-- Peter
