Unsolved

11 Legend • 20.4K Posts • 87.4K Points

September 9th, 2014 21:00

OneFS 7.1.0.x and extremely slow SmartConnect failover

Has anyone seen dramatically increased failover times for IP addresses in dynamic pools going from OneFS 6.5.5.x to OneFS 7.1.x?

I have 4 Isilon clusters: two are running OneFS 6.5.5.22 and two brand new clusters are running 7.1.0.4. I just went through an upgrade from 7.1.0.3 to 7.1.0.4 and noticed failover times of up to 2 minutes, during which an IP address from a dynamic pool was not available. I opened a ticket with support and was told that there were some SmartConnect issues in 7.1.0.3 that were resolved in 7.1.0.4. Now that I am on 7.1.0.4 I decided to see if it got any better: not at all. I just rebooted a node while pinging an IP address, and it stopped responding and was not available for 2 minutes and 10 seconds. On my old 6.5 clusters failover takes 10-30 seconds. Same number of subnets, same number of pools. Has anyone seen this?

Thanks
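For anyone who wants to put a number on the outage instead of eyeballing a ping window, here is a minimal Python sketch of the "ping an IP while a node reboots" test described above. The function names (`outage_windows`, `longest_outage`, `probe`) are made up for illustration, and `probe` assumes a Linux-style `ping` that accepts `-c` (count) and `-W` (timeout in seconds); adjust the flags for your client OS.

```python
import subprocess
import time
from typing import Iterable, List, Tuple

# One probe result: (timestamp in seconds, did the ping succeed?)
Sample = Tuple[float, bool]


def outage_windows(samples: Iterable[Sample]) -> List[Tuple[float, float]]:
    """Return (start, end) pairs, one per run of consecutive failed probes.

    start is the timestamp of the first failed probe in the run; end is the
    timestamp of the first successful probe after it. A run that is still
    failing when the samples end is closed at the last timestamp seen.
    """
    windows: List[Tuple[float, float]] = []
    start = None
    last_ts = None
    for ts, ok in samples:
        if not ok and start is None:
            start = ts                      # outage begins
        elif ok and start is not None:
            windows.append((start, ts))     # outage ends
            start = None
        last_ts = ts
    if start is not None and last_ts is not None:
        windows.append((start, last_ts))    # outage still open at the end
    return windows


def longest_outage(samples: Iterable[Sample]) -> float:
    """Length in seconds of the longest outage, or 0.0 if there was none."""
    return max((end - start for start, end in outage_windows(samples)),
               default=0.0)


def probe(ip: str, duration_s: float, interval_s: float = 1.0) -> List[Sample]:
    """Ping `ip` roughly once per interval for `duration_s` seconds.

    Assumption: Linux-style ping flags (-c 1 = one packet, -W 1 = 1 s timeout).
    """
    samples: List[Sample] = []
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        ts = time.monotonic()
        result = subprocess.run(
            ["ping", "-c", "1", "-W", "1", ip],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        samples.append((ts, result.returncode == 0))
        # Sleep out the rest of the interval, if any is left.
        time.sleep(max(0.0, interval_s - (time.monotonic() - ts)))
    return samples
```

To use it, start something like `longest_outage(probe("192.0.2.10", 300))` from a client just before rebooting the node, replacing the placeholder address with one of the dynamic pool IPs. The result is only accurate to about one probe interval, so it is good for telling 20 seconds from 2 minutes, not for fine-grained timing.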

September 11th, 2014 09:00

What I've done for reboots is to manually take the NIC out of the dynamic pool.  That fails over nice and quickly.  When the node comes back, put the NIC back in the pool.

There is a known NIC failover bug that I don't see resolved in 7.1.0.anything yet: bug 130151. I had it opened when a NIC failed and never got pulled out of the dynamic pool at all.

11 Legend • 20.4K Posts • 87.4K Points

September 11th, 2014 10:00

Removing the NIC from a pool? Doesn't that essentially permanently disconnect NFS clients, because they can no longer reach the IPs that were assigned to that NIC?

September 11th, 2014 10:00

Thanks Rob!  I suspect my next upgrade will be from 7.1.0.3 to 7.1.1.something...

2 Intern • 99 Posts

September 11th, 2014 10:00

FYI folks, this bug has moved to 130388 and is targeted for MR 7.1.0.7.  Ed, work with your account team if you believe you need a quick patch.  Cheers.

September 11th, 2014 10:00

Nope - you're removing the physical NIC, not the IP address.  When the NIC is removed, SmartConnect will migrate the IP address to another NIC in the pool.  The NFS clients will not experience an outage because the IP remains available.

11 Legend • 20.4K Posts • 87.4K Points

September 11th, 2014 10:00

Thanks, so it's different from what I am experiencing. My experience is also consistent: I have two clusters that exhibit the same slow failover times. I was actually sitting in the vSphere client and saw the datastore go away for a minute while a node rebooted, something that I have never seen in my 6.5.5 environment. Something is busted.

September 11th, 2014 10:00

In my case, it wouldn't fail over at all so existing connections hung when the IPs hosted on that NIC didn't migrate.  Kinda defeats one of the purposes of SmartConnect, hence the bug.

I normally have reasonably acceptable failover times during routine operations.  I haven't benchmarked them lately though.  With hard mounts, and almost exclusively NFS, I can do most of my work in non-interactive windows.  Not completely off-hours, but hours when the users aren't running interactively and the batch jobs can take the pause while the maintenance happens.

11 Legend • 20.4K Posts • 87.4K Points

September 11th, 2014 10:00

Yuck, that's ugly, and it does not address node panics. Did you experience slow failover times too, or would it not fail over at all?
