5 Posts
0
717
June 12th, 2009 04:00
AutoStart 5.2 SP3 - Managed IP (Path Failed)
It was my understanding that the network path testing tab of a managed IP address is a sort of 'ping list' that AutoStart would use to check if the network path is correct. If for whatever reason, the IP addresses cannot be reached, then the managed IP address is marked with a 'Path Failed'.
The admin guide suggests that even though a 'Path Failed' error may be received, the managed IP will still stay assigned to the node it was already assigned to (thus preventing a failover).
The problem one of our customers has is that they are receiving a Path Failed error message and AutoStart appears to make arrangements to relocate the resource group (or at least disconnect the resources), yet almost immediately backs out halfway through and reassigns the resource group back to its original node.
In this instance there are two nodes - znew_swiftlive (primary node) and znew_swiftback (backup node). The managed IP address is 172.17.32.1 and there is also a node alias (SWIFT_DB_LIVE) assigned to this IP address. Only 1 network path IP address has been added for network testing (the default gateway that both nodes are using). Currently, the managed IP and node alias are contained in a resource group called SWIFT.
The following 9 events are the actions that have occurred at various times over the past couple of days. They have been taken from the dedicated AutoStart event log in Windows. Because I am reading these events on my laptop, I do not have access to the full text of each event, just the last line of it.
Also, please bear in mind that these events were all marked as Information events, not Warnings or Errors.
- IP Address 172.17.32.1 on znew_swftlive is Path Failed. (02:35:32)
- Resource Group SWIFT is in the Online Pending state. Current trigger is for resource 172.17.32.1 on node znew_swftlive Cause - IP Address Failure. (02:35:32)
- Released Node Alias SWIFT_DB_LIVE from znew_swftlive. (02:35:33)
- Release request received for Managed IP 172.17.32.1. (02:35:33)
- Released Managed IP 172.17.32.1. (02:35:33)
- Assign request received for Managed IP 172.17.32.1. Target Node: znew_swftlive. (02:35:34)
- Assigned Managed IP 172.17.32.1 to NIC 172.17.32.11 on node znew_swftlive. (02:35:35)
- Assigned Node Alias SWIFT_DB_LIVE on znew_swftlive. (02:35:41)
- Resource Group SWIFT is in the Online state. Running on node znew_swftlive. (02:35:42)
As you can see, these events took place over a period of 10 seconds (from 02:35:32 to 02:35:42), with the actual reassigning of the IP address taking place in 3 seconds.
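For anyone who wants to check the arithmetic, the intervals can be recomputed from the timestamps quoted in the event list. This is only a small Python sketch: the shortened event labels and the `seconds_between` helper are names made up for illustration, and it assumes all nine events fall on the same day.

```python
from datetime import datetime

# Timestamps copied from the event list; labels are shortened for illustration.
events = [
    ("Path Failed", "02:35:32"),
    ("Online Pending", "02:35:32"),
    ("Released Node Alias", "02:35:33"),
    ("Release request for Managed IP", "02:35:33"),
    ("Released Managed IP", "02:35:33"),
    ("Assign request for Managed IP", "02:35:34"),
    ("Assigned Managed IP", "02:35:35"),
    ("Assigned Node Alias", "02:35:41"),
    ("Online", "02:35:42"),
]


def seconds_between(start, end):
    """Return whole seconds between two HH:MM:SS strings on the same day."""
    fmt = "%H:%M:%S"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds())


times = dict(events)

# Whole sequence: Path Failed through Resource Group Online.
total_seconds = seconds_between(times["Path Failed"], times["Online"])

# Path Failed through the managed IP being assigned again.
ip_reassign_seconds = seconds_between(times["Path Failed"], times["Assigned Managed IP"])

print(total_seconds, ip_reassign_seconds)
```

Measured this way, the 3 second figure covers the span from the Path Failed event to the managed IP being assigned again.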
They previously had data sources in the resource group but they were disconnected by AutoStart during the above events. This is a setup that has been in place for nearly two years. The IP timeout settings were originally set to the defaults, but yesterday the timeout was increased from 5 seconds to 7.
I can understand that if AutoStart genuinely believes there was a problem with the IP address, it would have made arrangements to relocate the resource group to the next available node - it's what it is supposed to do. But then why back out a few seconds later, causing an outage in the process? During the 3 second 'outage', applications that depend on the managed IP cease to run and have to be manually restarted when the users arrive at work.
The customer has since added another IP address for testing (although I have suggested that they add more, for more reliable results); however, this outage still occurred (the results of which are above).
Any ideas?


JoelStewart
24 Posts
0
June 30th, 2009 13:00
Investigating fail-over details related to this environment could be quite complex. I would recommend engaging EMC Support and having an engineer take a look at the logs and the specific configuration to help make the most educated diagnosis of why the Managed IP address failed or did not fail-over.
Best Regards,
Joel
yito1
262 Posts
0
June 30th, 2009 18:00
A failover loop is triggered when the managed IP enters the Path Failed state.
This is a bug.
You should report it to EMC Support.
To handle network path problems, you should use Isolation.
ecervant
63 Posts
0
July 2nd, 2009 08:00
If you only have one network test IP, then any network delay or hiccup can cause that one ICMP test to fail and therefore cause the Managed IP to transition into this Path Failed state. I usually suggest adding at least three network path testing IP addresses to prevent unwanted failover when there is simply a network delay or quick hiccup. The more test IPs you specify, the lower the probability of an invalid failover of the resource group.
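To put a rough number on why extra test addresses help, here is a minimal Python sketch. It rests on two assumptions that are not confirmed anywhere in this thread: that the path is only marked failed when every test address is unreachable, and that each probe fails independently with the same probability. The function names are hypothetical and this is not AutoStart's actual logic.

```python
def path_failed(probe_results):
    """Assumed rule: the path counts as failed only if no test address replied.

    probe_results is a list of booleans, True meaning the address answered.
    """
    return not any(probe_results)


def false_failure_probability(p_single, n_targets):
    """Chance that all n independent probes fail at once due to a hiccup.

    p_single is the assumed probability that one probe fails spuriously.
    """
    return p_single ** n_targets


# With one test address, a single lost reply is enough to fail the path.
one_target = path_failed([False])

# With three test addresses, one reply keeps the path healthy.
three_targets = path_failed([False, True, False])

for n in (1, 2, 3):
    print(n, false_failure_probability(0.01, n))
```

Under these assumptions, a 1% spurious failure rate per probe drops to about one in a million with three test addresses. Real probes on the same subnet are unlikely to be fully independent, so treat the numbers as an illustration only.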
As for the resource group functionality, it appears to be behaving as designed. Here is some insight into the Managed IP test and resource group actions.
Every managed resource has failure detection settings that tell the resource group what to do if there is a failure with the managed resource. By default, the managed IP address will take resource group action when the managed IP transitions to a Path Failed state. The action is to restart the resource group or relocate the resource group, depending on the scenario and settings.
When a managed resource fails on the active node, by default it will try to restart the resource group on the same server 3 times in 300 seconds. The fourth time it will try to relocate to the standby node. This option is configurable in the Resource group Options tab.
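The restart-then-relocate rule can be modelled roughly as follows. This is a hedged Python sketch, not AutoStart's implementation: the `RestartPolicy` class and its method are hypothetical names, and it assumes the 300 seconds is a sliding window counted back from the latest failure, which the thread does not confirm.

```python
from collections import deque


class RestartPolicy:
    """Hypothetical model: restart up to N times in a window, then relocate."""

    def __init__(self, max_restarts=3, window_seconds=300):
        self.max_restarts = max_restarts
        self.window_seconds = window_seconds
        self._failures = deque()  # times (in seconds) of recent failures

    def on_failure(self, now):
        """Record a failure at time `now` and return the action to take."""
        # Forget failures that have aged out of the (assumed sliding) window.
        while self._failures and now - self._failures[0] >= self.window_seconds:
            self._failures.popleft()
        self._failures.append(now)
        # The fourth failure inside the window triggers a relocation.
        if len(self._failures) > self.max_restarts:
            return "relocate"
        return "restart"


# Four failures within 300 seconds: three restarts, then a relocation.
burst = RestartPolicy()
burst_actions = [burst.on_failure(t) for t in (0, 60, 120, 180)]

# Four failures spread far apart: the count never reaches four in one window.
spread = RestartPolicy()
spread_actions = [spread.on_failure(t) for t in (0, 200, 400, 600)]

print(burst_actions)
print(spread_actions)
```

Under this model, a single Path Failed event produces a local restart on the same node, which would be consistent with the release and reassign sequence in the event log above.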
The reason your resource group only takes 3 seconds to restart is because it only has two managed resources which are fairly quick in assigning and un-assigning. I assume if you added more managed resources to the resource group, it would take longer to restart the resource group and increase your managed IP outage.
My final suggestion is to add more IP addresses to minimize unwanted test failures when the network is saturated or when there is a hiccup. The more reliable IP addresses on the subnet, the better. Ideally, you want to use IP addresses whose failure would indicate that your users are also unable to reach the resources behind the managed IP.
I hope this helps.