5 Practitioner

 • 

274.2K Posts

September 30th, 2014 04:00

Delay of access node IP failover

Hello Centera Experts,

The customer is complaining that failover takes too long in a cluster with two access nodes.

When one access node is offline, it takes 10 seconds before the other access node is able to accept connections.

I suspected an issue in their application, but found that their config file is set as below.

cas.poolAddress=10.255.206.181,10.255.206.182

According to the programmer's guide, when two or more IP addresses of the same cluster are given, the SDK tries to connect to all of them:

2 or more IP addresses of the same cluster (recommended scenario): The SDK tries to connect to all IP addresses. Failing to connect to one address does not prevent establishing the pool connection.
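To make the quoted rule concrete, here is a minimal sketch of the behavior it describes. It is only a model, not SDK code: `parse_pool_address`, `open_pool` and the `try_connect` callback are hypothetical names standing in for whatever the SDK does internally.

```python
from typing import Callable, List, Tuple


def parse_pool_address(value: str) -> List[str]:
    """Split a comma-separated pool address string into single addresses."""
    return [part.strip() for part in value.split(",") if part.strip()]


def open_pool(
    addresses: List[str], try_connect: Callable[[str], bool]
) -> Tuple[List[str], List[str]]:
    """Try every address; the pool opens if at least one address connects."""
    reachable: List[str] = []
    unreachable: List[str] = []
    for addr in addresses:
        if try_connect(addr):
            reachable.append(addr)
        else:
            unreachable.append(addr)
    if not reachable:
        raise ConnectionError("no access node reachable")
    return reachable, unreachable


addresses = parse_pool_address("10.255.206.181,10.255.206.182")
# Hypothetical stand-in for a real connection attempt: pretend .181 is offline.
up, down = open_pool(addresses, lambda addr: addr != "10.255.206.181")
```

In this model the pool still opens with one node down, but the failed attempt on the offline address is where any waiting time would be spent.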

Their code looks normal: it puts the two IP addresses into a String.

Is there anything else I should check with the customer? In particular, which part of their code should I look at?

409 Posts

September 30th, 2014 04:00

Hi

When they say that it is taking too long to failover, which API operation are they talking about?

If it's the FPPool_Open() call, then 10 seconds is fine. As FPPool_Open() should only be called once, at the startup of the service, it's not a big problem.

If it's during all IO operations then there is something wrong.

If a node fails during, say, a write op, then the SDK will mark that node as down and restart the IO op on another node (or at least attempt to). On any subsequent IO op the SDK knows the node is down and won't try to use it, so you will not see any delay (the SDK does periodically probe the failed node to see if it's back online and can be used again).

So during normal operation the customer should not actually notice anything just because a node has failed, assuming there are surviving nodes that can be used by the SDK.
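The mark-down, retry-elsewhere and periodic-probe behavior described above can be modeled roughly as follows. This is an illustrative sketch only: the `NodeTracker` class, the 30 second probe interval and the node names are all assumptions, and the real SDK's internals may differ.

```python
class NodeTracker:
    """Toy model: skip nodes marked down until a probe interval has passed."""

    def __init__(self, nodes, probe_interval_s=30.0):
        self.nodes = list(nodes)
        self.probe_interval_s = probe_interval_s  # hypothetical value
        self.down_since = {}  # node -> time it last failed

    def candidates(self, now):
        """Healthy nodes, plus failed nodes that are due for a probe."""
        usable = []
        for node in self.nodes:
            marked = self.down_since.get(node)
            if marked is None or now - marked >= self.probe_interval_s:
                usable.append(node)
        return usable

    def run(self, operation, now):
        """Run the operation on the first node that accepts it."""
        for node in self.candidates(now):
            try:
                result = operation(node)
            except ConnectionError:
                self.down_since[node] = now  # mark down, try the next node
                continue
            self.down_since.pop(node, None)  # node is healthy again
            return result
        raise ConnectionError("no surviving node")


attempts = []


def write(node):
    attempts.append(node)
    if node == "node-a":
        raise ConnectionError("node-a offline")
    return "written via " + node


tracker = NodeTracker(["node-a", "node-b"])
first = tracker.run(write, now=0.0)    # node-a fails, restarted on node-b
second = tracker.run(write, now=1.0)   # node-a is skipped, no delay
third = tracker.run(write, now=31.0)   # node-a is probed again, still down
```

Only the first operation and the later probe touch the failed node; everything in between goes straight to the surviving node.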

As they do seem to be noticing this effect then I suspect that what they are in fact doing is something like

FPPool_Open()

IO Operation

FPPool_Close()

and repeating this for every IO.

This is not the way to use the Centera API, as you will experience the 10 second delay on EVERY IO operation instead of just at startup or on the one operation during which the node failed.

Check and see if they are doing this.
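A rough back-of-the-envelope comparison of the two patterns, as a sketch. The 10 second open delay is the figure reported in this thread; the 50 ms per IO operation is a made-up number for illustration only.

```python
OPEN_DELAY_MS = 10_000  # failover delay on open, as reported in this thread
IO_TIME_MS = 50         # hypothetical time for one IO operation


def total_ms_open_per_io(operations: int) -> int:
    """Open, do one IO, close, repeated: every IO pays the open delay."""
    return operations * (OPEN_DELAY_MS + IO_TIME_MS)


def total_ms_open_once(operations: int) -> int:
    """Open once at startup, then reuse the pool for every IO."""
    return OPEN_DELAY_MS + operations * IO_TIME_MS


per_io = total_ms_open_per_io(100)  # 1,005,000 ms, almost 17 minutes
once = total_ms_open_once(100)      # 15,000 ms
```

With one node down, 100 operations cost about 17 minutes in the open-per-IO pattern against 15 seconds when the pool is opened once.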

5 Practitioner

 • 

274.2K Posts

October 1st, 2014 17:00

Hello Paul,

They were complaining that FPPool_Open() takes 10 seconds, not the entire operation.

The customer is not happy about having to experience a 10 second delay whenever an access node is offline.

Is there a good way to explain why FPPool_Open() takes 10 seconds when an access node is offline?

Thank you very much.

409 Posts

October 2nd, 2014 01:00

I'm surprised they even notice the timeout that leads to the failover in FPPool_Open(), as this should only occur once in the lifetime of their application being up. Even if they reboot once a day, say, how are they spotting a 10 second delay in their application starting?

The reason it takes 10 seconds or so to fail over is that the default retry and timeout variables are set that way. The settings

FP_OPTION_PROBE_LIMIT

FP_OPTION_RETRYCOUNT

FP_OPTION_RETRY_SLEEP

(all documented in the API reference Guide and the Programmers Guide)

govern this behavior.
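As a rough illustration of how retry settings of this kind can add up to a delay of that size, here is a sketch. Everything in it is an assumption: the formula is a generic retry model rather than documented SDK behavior, `attempt_timeout_ms` is a hypothetical parameter, and the numbers are not the SDK defaults (check the API Reference Guide for those).

```python
def worst_case_failover_ms(
    attempt_timeout_ms: int, retry_count: int, retry_sleep_ms: int
) -> int:
    """Generic model: one initial attempt plus retries, sleeping in between."""
    attempts = retry_count + 1
    return attempts * attempt_timeout_ms + retry_count * retry_sleep_ms


# Hypothetical values, chosen only to show the arithmetic.
delay_ms = worst_case_failover_ms(
    attempt_timeout_ms=3000, retry_count=2, retry_sleep_ms=500
)
```

The point is only that a modest per-attempt timeout multiplied by a few retries quickly reaches several seconds, and that lowering these values shortens every operation's failure handling, not just the open.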

They can change this if they really want to, but as a best practice I generally recommend not changing these settings, as they govern the retry/timeout behavior for all operations.

Do they notice a 10 second delay every time they try to read or write objects, or just the once on startup?

409 Posts

October 2nd, 2014 02:00

Tell them not to do the open every time. An open call is one of the most expensive calls they can make anyway, never mind when a node is down. Doing this is against our best practice for integrating with the Centera API. Our best practice presentation/training is available for them to download from this site.

5 Practitioner

 • 

274.2K Posts

October 2nd, 2014 02:00

Yes, they actually do their work in a single thread that runs something like "open, write, exist, read". They can't split this one thread up.

This thread takes 10 seconds each time, and they would like to reduce the 10 seconds to something less.

The local team will explain the behavior of FPPool_Open() to them as you described. Thank you very much.

5 Practitioner

 • 

274.2K Posts

March 6th, 2015 02:00

Hello Paul,

I know this issue is quite old, but the customer has come back to us about it.

This customer does not call FPPool_Open() on each operation. They call it only once.

One interesting point is that the two access nodes will be located on different network backbones.

In our internal lab test, where both access nodes share the same network, we performed the failover test and could barely detect any delay.

Is it OK for the access nodes of a single cluster to be split across two different backbones?
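One way to compare the two network paths from the application host is to time a plain TCP connect to each access node. A minimal sketch, using only the Python standard library; the function name `connect_time` is hypothetical, and the port is an assumption (Centera access nodes are normally reached on TCP 3218, but confirm this for the site).

```python
import socket
import time


def connect_time(host: str, port: int, timeout_s: float = 5.0):
    """Return (connected, seconds_elapsed) for one TCP connect attempt."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            connected = True
    except OSError:
        # Covers refused connections, timeouts and unreachable networks.
        connected = False
    return connected, time.monotonic() - start
```

For example, calling `connect_time("10.255.206.181", 3218)` and `connect_time("10.255.206.182", 3218)` with one node offline shows whether the unreachable address is refused quickly or only gives up after the full timeout. A difference between the lab network and the split-backbone network would show up here.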
