Unsolved
hersh1
197 Posts
0
April 4th, 2016 11:00
Testing new node types in existing cluster
What type of testing do others perform when introducing new node types into an existing cluster? Do others typically add the new nodes, wire up the external connections, and open the gates?
My tentative plan was to set up the external connections to confirm network connectivity and to reboot the node(s) before moving data to them. The reboot is mainly because of a bug we discovered on another cluster, which should be fixed by now.
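For reference, something along these lines is the rough shape of the connectivity pre-check I had in mind; the IP addresses below are placeholders, and it only tests basic ping reachability before the nodes join any SmartConnect pool:

import subprocess

# Placeholder external IPs for the new nodes -- substitute your own.
NEW_NODE_IPS = ["10.0.1.101", "10.0.1.102"]

def reachable(ip):
    """Return True if a single ICMP echo to the address succeeds."""
    result = subprocess.run(["ping", "-c", "1", ip], capture_output=True)
    return result.returncode == 0

for ip in NEW_NODE_IPS:
    print(ip, "reachable" if reachable(ip) else "NOT reachable")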



carlilek
2 Intern
•
205 Posts
0
April 7th, 2016 05:00
Ouch.
I'd be really careful in that scenario.
BTW, I'd love to have an Isilon engineer weigh in here and tell me how wrong I am and how it's all fixed...
sluetze
2 Intern
•
300 Posts
0
April 7th, 2016 07:00
We will add some test scenarios covering reboots, shutdowns, node adds, and so on. Luckily we have some clusters that are for testing only.
Thanks a ton, and good luck waiting on that engineer.
kipcranford
125 Posts
0
April 7th, 2016 09:00
And just to be clear (since I didn't see it explicitly mentioned), the L1 cache in the accelerators behaves differently than the L1 cache in our storage nodes. It's L1 cache in the sense that it caches file data only on that node (like Bernie mentioned), which is the same as how L1 behaves in our storage nodes. However, the data in the accelerator's L1 is flushed using an LRU-like algorithm rather than a drop-behind mechanism, which makes it behave more like L2 does on storage nodes.
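To make that distinction concrete, here is a toy sketch of the two eviction behaviors; it's purely illustrative Python, not OneFS internals:

from collections import OrderedDict

class LRUCache:
    """Keep blocks until capacity pressure, then evict the least recently used."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()
    def access(self, block_id, data):
        self.blocks.pop(block_id, None)
        self.blocks[block_id] = data              # most recently used goes last
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)       # evict the least recently used

class DropBehindCache:
    """Hand each block to the reader once, then drop it immediately."""
    def __init__(self):
        self.blocks = {}
    def fill(self, block_id, data):
        self.blocks[block_id] = data
    def read(self, block_id):
        return self.blocks.pop(block_id, None)    # gone after the first read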
BernieC
76 Posts
0
April 7th, 2016 09:00
Hmmm, A100s, and all accelerators really, are all L1 cache. A100s don't have any storage, so they're not capable of having L2, or even L3. They're all initiator cache. If you don't connect clients to an accelerator like the A100, you don't get the benefit of all that L1 cache.
crklosterman
450 Posts
0
April 7th, 2016 09:00
Funny guys... (I'm not that engineer, but I'll respond to a few of your questions)
on the A100s
The RAM in the A100s is for L2 cache, not L1, and it can be crazy aggressive with pre-fetching for streaming read workflows, such as 4K video streams. I haven't heard anything personally about the A100s being discontinued, and we do have a number of customers that are very happy with them, so I'm certainly sorry to hear that you haven't been. It's all about understanding what you're using them for. Big streaming reads? Good fit. Lots of random reads? Not as good a fit. That's simply because it's tough to pre-fetch into cache what we don't know you're going to ask for. This is where L3 cache on, say, an X410 or an S210 can prove advantageous.
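As a rough illustration of why streaming reads prefetch well and random reads don't, here is a toy read-ahead heuristic; this is only the general idea, not the actual prefetch logic:

def next_prefetch(recent_blocks, depth=4):
    """If the last few requested block numbers are strictly sequential,
    guess the next `depth` blocks; a random pattern yields nothing to prefetch."""
    if len(recent_blocks) >= 3 and all(
            b2 - b1 == 1 for b1, b2 in zip(recent_blocks, recent_blocks[1:])):
        last = recent_blocks[-1]
        return list(range(last + 1, last + 1 + depth))
    return []

print(next_prefetch([10, 11, 12, 13]))   # streaming read  -> [14, 15, 16, 17]
print(next_prefetch([3, 90, 12, 41]))    # random read     -> []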
On Group Merges
You're correct that group merges have historically been a problem when nodes with extremely large memory footprints are in a cluster and there is a reboot or some similar activity, but I don't know the current status of that, so I'll leave it to others to comment.
on the CPUs
I've personally seen clusters pushing upwards of 40Gbps of throughput where the CPUs were barely sweating at all. The NL and HD nodes certainly have slower CPUs than our other nodes, and that's intentional: they are designed as archive nodes with a lower $/TB cost. If you want a node that is both really big and really fast, the X410 is 4&*$(** awesome, as you yourself pointed out above.
That said, workloads with huge numbers of small-to-medium files will certainly tax the CPUs more than a more homogeneous workload of a small number of large files. This applies both to data access and to cluster-wide jobs that do a treewalk/LIN walk, like FSAnalyze or SmartPools. It all boils down to data layout and application behavior: if you have a dataset with 4 billion 10KB files and the app keeps doing directory listings all the time, or is doing a stat() or NFS getattr(), that's work that the node has to perform.
When each new node goes on the market, Hardware Engineering works hard to pick the right CPUs for the workload the node is designed for, based on what's available in the market at that time. Would a 12-core Skylake-based Xeon at the fastest possible clock speed be awesome in every node? Or two of them? Sure it would, but those didn't exist when the nodes we sell now were designed. And they draw more power, generate more heat, use different RAM, and cost more money. Every time a new generation of Isilon nodes comes to market we get much faster CPUs, so I have no doubt that'll happen the next time too. It's no different than whatever brand of commodity servers you use in that respect.
The difference, of course, is that we try not to have 500 different configurations floating around of, say, an X410, like you would find with a normal server where you get to pick and choose every component. Every X410 has the same CPUs, same motherboard, same LOX card and NIC, same backplane. Why? We're a clustered operating system and we want consistency, and our customers demand that 6 months from now they can still buy one more node of the same configuration as their other 10 and keep growing. To do that, we can't change our hardware composition once per quarter or even once per year.
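As a back-of-the-envelope illustration of how a listing-heavy application over billions of small files turns into pure metadata work, here is a quick sketch; the numbers are assumptions for illustration, not a benchmark:

# Illustrative numbers only -- not a benchmark.
files_total   = 4_000_000_000      # 4 billion ~10 KB files, per the example above
walks_per_day = 24                 # assume the app lists/stats the whole tree hourly
ops_per_walk  = files_total        # one stat()/getattr per file per walk

ops_per_day = walks_per_day * ops_per_walk
print(f"{ops_per_day:,} metadata ops/day (~{ops_per_day / 86_400:,.0f}/sec)")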
on Node Adds
A few points of guidance on doing node adds:
1. Personally, I prefer to remove any network provisioning rules first. Why? Because I would rather have the node add complete, then manually add my network interfaces to the necessary pools when I'm ready. This does mean that for a while you're running what we call NANON or NENON (not all nodes on network / not every node on network). That's fine 99% of the time; however, if you run ICAP, ICAP will try to send work to all nodes (even those not on the network), so you may have some trouble. Also, if you use ESRS and the cluster is on a version of OneFS before 8.0, then with NANON/NENON there are some quirks about support messages being proxied back to EMC through the other nodes, depending on the network layout. 8.0 allows you to specify which subnet and pool to send the ESRS messages out of.
2. Once the node is up, SSH over to it (across the back end) and run isi_hw_status to make sure that you have good input and output voltages from both power supplies (an unplugged cable will cause a power supply to show an input voltage of 0.0V); see the rough script sketch after this list. Also watch for any hardware events associated with the new nodes, to catch things like a drive that isn't fully seated.
3. Once you're sure the hardware has no issues, then add in the network interfaces to the smartconnect pools.
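Here's a rough sketch of step 2 as a script. It assumes passwordless SSH across the back end; the hostnames are placeholders, and the 0.0V match may need adjusting since the exact isi_hw_status output format varies by node type and OneFS version:

import subprocess

# Placeholder back-end hostnames for the newly added nodes.
NEW_NODES = ["newnode-1", "newnode-2"]

for node in NEW_NODES:
    out = subprocess.run(["ssh", node, "isi_hw_status"],
                         capture_output=True, text=True).stdout
    # Flag any line reporting 0.0V (e.g. an unplugged power cable).
    suspect = [line.strip() for line in out.splitlines() if "0.0V" in line]
    if suspect:
        print(node, "-- check power cabling:", *suspect)
    else:
        print(node, "-- power supply voltages look OK")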
~Chris Klosterman
Advisory Solution Architect
EMC Enablement Team
Chris.Klosterman@emc.com
crklosterman
450 Posts
0
April 7th, 2016 09:00
I stand corrected, Bernie, on that point. Thanks.
~Chris
hersh1
197 Posts
0
April 7th, 2016 11:00
Thanks, Chris, for the detailed response. We are planning to have the nodes off the network to begin with, so that's very helpful. I assume the ICAP caveat only applies if there are antivirus policies for scheduled scanning in place.
carlilek
2 Intern
•
205 Posts
0
April 7th, 2016 14:00
Hi Chris,
I have to dispute your statements on the appropriate sizing of CPUs. The impression I've gotten (and gotten from a very high level at Isilon) is that it is purely a cost play. The amount of markup you guys put on the hardware is downright incredible, and even putting a CPU that costs $1K more into the machines wouldn't cut into that noticeably.
Many workloads do have millions of tiny files, and you simply don't have a SKU to deal with that properly.
There were certainly faster CPUs available when you designed these nodes; every one I've looked at, from the NL400 up to the S210, uses what was close to the bottom-of-the-line SKU for that node's generation.
I understand the desire to maintain consistency, but that can easily be addressed by starting with a higher SKU.
FYI, I see my 70-node cluster pushing 120Gb/sec with around 40% CPU. I've also seen it max out CPU and drop connections with a MultiScan job running on Medium.
And don't get me started on L3 cache. With random workloads that are metadata heavy, L3 simply doesn't work. First access is always miserable. Now, if you pinned metadata in L3 and used the remainder for data, that would be awesome. As it stands, it simply doesn't work in my environment.
--Ken
carlilek
2 Intern
•
205 Posts
0
April 7th, 2016 14:00
The other big issue I have with the A100s is the incredible cost. I believe list price on them may be as much as $60K, and that's for a $2,000-$3,000 server. So they simply don't provide the value that they could. In my opinion, to truly make them useful you'd need enough of them to entirely front your Isilon cluster, at which point you might as well be buying a dedicated caching system, because it's going to cost that much.
This is addressed to the customers on the thread, not the engineers. I know you guys don't have anything to do with the pricing.