Isilon SmartConnect timeouts

If a node in an Isilon cluster goes offline, the SMB clients attached to that node/interface take about 18 seconds to re-establish a connection to another available node in the cluster. Is there any way to reduce this to 10 seconds or less?

I have disabled continuous availability on the shares and am using Windows Server 2012 R2 clients.

Responses(4)

SKT2

August 21st, 2017 01:00

Enable continuous availability and test again. Why do you want to reduce it further?

With CA enabled, your client should reconnect to an available node without disruption.

Anonymous

August 21st, 2017 01:00

Our application supports SMB2.

Right now it takes about 18 to 21 seconds for the client to reconnect to the SmartConnect DNS name. I want to reduce this to 10 seconds so that data loss is minimal.

In Wireshark I can see TCP/IP broadcasts searching for the failed node's IP for 19 seconds, after which the client re-queries the DNS name to get another node's IP and establishes a new SMB session.
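To put a repeatable number on the reconnect time without reading it off a capture, something like the following could be run on the client while a node is taken offline. This is only a rough sketch under assumptions: the UNC path is hypothetical, the helper names (`stat_probe`, `collect_samples`, `longest_gap_between_successes`) are made up for illustration, and the Windows SMB client may cache file metadata briefly, so a probe that actually reads or writes data may be more faithful than `os.stat`.

```python
import os
import time
from typing import Callable, List, Tuple

# One sample: (time the probe finished, whether it succeeded)
Sample = Tuple[float, bool]


def stat_probe(path: str) -> bool:
    """Return True if the path can be stat'ed through the OS SMB client."""
    try:
        os.stat(path)
        return True
    except OSError:
        return False


def collect_samples(
    probe: Callable[[], bool],
    duration: float,
    interval: float = 0.5,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> List[Sample]:
    """Run the probe repeatedly for `duration` seconds, pausing `interval` between runs."""
    samples: List[Sample] = []
    deadline = clock() + duration
    while clock() < deadline:
        ok = probe()
        # Timestamp after the probe returns, so a probe that blocks
        # during failover shows up as a long gap.
        samples.append((clock(), ok))
        sleep(interval)
    return samples


def longest_gap_between_successes(samples: List[Sample]) -> float:
    """Longest time between two consecutive successful probes, in seconds."""
    longest = 0.0
    last_ok = None
    for ts, ok in samples:
        if not ok:
            continue
        if last_ok is not None:
            longest = max(longest, ts - last_ok)
        last_ok = ts
    return longest


# Example use (hypothetical share path; fail a node while this runs):
#   path = r"\\smartconnect.example.com\share\probe.txt"
#   samples = collect_samples(lambda: stat_probe(path), duration=120.0)
#   print(longest_gap_between_successes(samples))
```

The longest gap includes one probe interval, so with a 0.5 second interval an 18 second outage would show up as roughly 18.5 seconds. Running it before and after a configuration change gives a like-for-like comparison.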

sjones51

August 22nd, 2017 12:00

Hi arjunnagaraju,

I asked around about this one, and the general consensus was that the way to get the shortest failover time is to use SMB3 with the OneFS 8.0 witness service and continuous availability (CA), as already mentioned. The following is from the OneFS 8.0 Technical Dive course, which I believe is only available internally. I could be mistaken about its availability, but here is the pertinent part:

In OneFS 8.0, Isilon offers the continuously available share option. This allows SMB clients to transparently fail over to another node in the event of a network or node failure. This feature applies to Microsoft Windows 8, Windows 10 and Windows Server 2012 R2 clients. This feature is part of Isilon's non-disruptive operation initiative to give customers more options for continuous work and less downtime. The CA option allows seamless movement from one node to another with no manual intervention on the client side. This enables a continuous workflow from the client side with no appearance of disruption to their working time. CA supports home directory workflows as well.

In SMB 3.0, Microsoft introduced an RPC-based mechanism to inform clients of any state change in the SMB servers. This service is called Service Witness Protocol (SWP); it provides a faster timeout should a server go down and allows SMB 3.0 clients to fail over in less time.

sleef

August 22nd, 2017 13:00

SMB 3.0 CA failover is possible because of SMB 3.0 Persistent Handles, which make file locks on a node "cluster aware" in a sense. The ability to fail over at all depends on Persistent Handles. How quickly failover happens depends on SWP (lwwit). The lwwit service is what serves the SWP implementation, and the client is responsible for "registering" with the Witness server (any node) and requesting a notification of when a failover should occur. If for any reason the client fails to register with Witness, or Witness fails to notify the client, the client will default to the ~20 second failover time to recover the persistent handles on another node, which is the seamless failover.

A functional Witness service (in my testing) will allow the client to fail over within ~8 seconds or so. Packet captures would be necessary to confirm that the client is registering with Witness properly and that notifications are being requested from and sent by the cluster. Event logs on the Windows client will also show some detail on any failures (and successes) that occurred with registration and notification (the "Witness client" logs, I believe).
