OneFS 7.1.0.x and extremely slow SmartConnect failover

Question

Has anyone seen dramatically increased failover times for IP addresses in dynamic pools going from OneFS 6.5.5.x to OneFS 7.1.x?

I have four Isilon clusters: two running OneFS 6.5.5.22 and two brand-new clusters running 7.1.0.4. I just went through an upgrade from 7.1.0.3 to 7.1.0.4 and noticed failover times of up to 2 minutes, during which an IP address from a dynamic pool was not available. I opened a ticket with support and was told that there were some SmartConnect issues in 7.1.0.3 that were resolved in 7.1.0.4. Now that I am on 7.1.0.4, I decided to see if it got any better. Not at all. I just rebooted a node while pinging one of its IP addresses, and the address stopped responding and was unavailable for 2 minutes and 10 seconds. On my old 6.5 clusters, failover takes 10-30 seconds. Same number of subnets, same number of pools. Has anyone seen this?
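
For anyone who wants to reproduce this kind of measurement, here is a minimal sketch of timing the outage with a ping loop. It is not a script taken from this thread. The helper names (probe, longest_outage, watch) and the address in the usage line are made up for illustration, and the ping flags assume a Linux-style ping where -W is a timeout in seconds.

```python
import subprocess
import time


def probe(ip, timeout_s=1):
    """Send one ICMP echo using the system ping.

    Assumption: Linux-style flags (-c count, -W timeout in seconds).
    """
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), ip],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def longest_outage(samples):
    """Return the longest outage in seconds.

    samples is a time-ordered list of (timestamp_seconds, reachable).
    An outage is measured from the last successful probe before the
    failures to the first successful probe after them. An outage that
    has not ended by the last sample is not counted.
    """
    longest = 0.0
    last_ok = None
    in_outage = False
    for ts, ok in samples:
        if ok:
            if in_outage and last_ok is not None:
                longest = max(longest, ts - last_ok)
            last_ok = ts
            in_outage = False
        else:
            in_outage = True
    return longest


def watch(ip, duration_s=300, interval_s=1.0):
    """Probe ip roughly once per interval and collect the samples."""
    samples = []
    end = time.monotonic() + duration_s
    while time.monotonic() < end:
        samples.append((time.monotonic(), probe(ip)))
        time.sleep(interval_s)
    return samples


if __name__ == "__main__":
    # Hypothetical dynamic-pool address; start this, then reboot the node.
    print(longest_outage(watch("192.0.2.10")))
```

Because the outage is measured between successful probes, the result is only accurate to about one probe interval plus the ping timeout.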

Thanks

Anonymous User · Answer

What I've done for reboots is to manually take the NIC out of the dynamic pool. That fails over nice and quickly. When the node comes back, put the NIC back in the pool.

There is a known bug with NIC failover, bug 130151, which is still open and which I don't see resolved in 7.1.0.anything yet. I had that opened when I had a NIC fail and it never got pulled out of the dynamic pool at all.

dynamox · Answer

Removing a NIC from a pool? By doing that, don't you essentially permanently disconnect NFS clients, because they can no longer reach the IPs that were assigned to that NIC?

Anonymous User · Answer

Thanks Rob!  I suspect my next upgrade will be from 7.1.0.3 to 7.1.1.something...

peglarr · Answer

FYI folks, this bug has moved to 130388 and is targeted for MR 7.1.0.7.  Ed, work with your account team if you believe you need a quick patch.  Cheers.

Anonymous User · Answer

Nope - you're removing the physical NIC, not the IP address.  When the NIC is removed, SmartConnect will migrate the IP address to another NIC in the pool.  The NFS clients will not experience an outage because the IP remains available.

dynamox · Answer

Thanks, so it's different from what I am experiencing. My experience is also consistent: I have two clusters that exhibit the same slow failover times.  I was actually sitting in the vSphere client and saw the datastore go away for a minute while a node rebooted, something that I have never seen in my 6.5.5 environment.  Something is busted.

Anonymous User · Answer

In my case, it wouldn't fail over at all so existing connections hung when the IPs hosted on that NIC didn't migrate. Kinda defeats one of the purposes of SmartConnect, hence the bug.

I normally have reasonably acceptable failover times during routine operations. I haven't benchmarked them lately though. With hard mounts, and almost exclusively NFS, I can do most of my work in non-interactive windows. Not completely off-hours, but hours when the users aren't running interactively and the batch jobs can take the pause while the maintenance happens.

dynamox · Answer

Yuck, that's ugly, and it does not address node panics. Did you experience slow failover times too, or would it not fail over at all?
