EdwinVanMierlo

Re: Ask the Expert: EMC Cluster Enabler multi-site clustering

All,

Let me address one of the many questions that come in from time to time.

Many times I have been asked to provide an explanation because:
"Cluster Enabler failed over the cluster groups"

Let's start with a statement: Cluster Enabler will not make any decisions (with one exception), will not cause the cluster groups to fail over (with one exception), and will not actively fail over any cluster groups itself.

Apart from the exceptions noted, which I will come back to further down, Cluster Enabler is merely a resource in the cluster and will follow the groups wherever the cluster wants them to "go".

Microsoft Windows Server Failover Cluster is in control. All the normal rules for cluster failover are honoured, and cluster will determine the failover based on its own configured error detection properties.

Cluster still behaves as a normal cluster. It performs error detection, and the following three methods are used:

  • Network error detection; this is what we used to call the "heartbeat". It detects whether a node in the cluster is up or down, causes a revote (regroup) for quorum, and takes appropriate action.
  • Resource error detection; through the "LooksAlive" and "IsAlive" methods the cluster checks the health of every resource in the cluster groups and detects failures.
  • User-mode hang detection; based on a "heartbeat" between the NetFT.sys network driver and the Clussvc.exe Cluster service, it can detect whether the node is hanging.
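To make the "LooksAlive" / "IsAlive" idea a bit more concrete, here is a minimal sketch in Python. It is not the cluster's real code: the names Resource and check_resource, the interval values, and the rule that a failed cheap check is confirmed by the thorough check are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Resource:
    # Hypothetical stand-in for a cluster resource with its two health checks.
    name: str
    looks_alive: Callable[[], bool]  # cheap, frequent check
    is_alive: Callable[[], bool]     # thorough, less frequent check


def check_resource(res: Resource, elapsed_s: int,
                   looks_alive_every_s: int = 5,
                   is_alive_every_s: int = 60) -> str:
    """Run whichever check is due at this second; intervals are illustrative."""
    if elapsed_s % is_alive_every_s == 0:
        # The thorough check is due: its result is final.
        return "ok" if res.is_alive() else "failed"
    if elapsed_s % looks_alive_every_s == 0:
        if res.looks_alive():
            return "ok"
        # Assumption of this sketch: a failed cheap check is confirmed
        # with the thorough check before the resource is declared failed.
        return "ok" if res.is_alive() else "failed"
    return "not due"
```

Only when a resource comes back as "failed" does cluster start applying the restart and failover rules to it.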

Based on the error detection routines within cluster, it is cluster that determines:

  • do we need to failover ?
  • where do we need to failover to ?

As cluster is itself a highly customisable engine, there are many factors which influence its failover decision. Here are the most common ones:

  • Per Group: the "Group Failover Threshold"; for each group you can set a threshold and a period which determine how many times the group can fail over due to failures within that period. This setting prevents a group from "failing over continuously" when there is a problem.
  • Per Group: "AutoStart / Priority"; this setting determines whether a group will fail over at all; if the Priority (or AutoStart) is set to "Do not AutoStart" then the group will not fail over.
  • Per Group: "Preferred Owners"; this setting determines the next node in line to fail over to (advisory only).
  • Per Group: "AntiAffinityClassNames"; this setting keeps a group from coming online on a target node that already hosts another group with a similar name. It is used to keep "similar applications apart" (advisory only, cluster may override this setting).

  • Per Resource: "Resource Restart" property; if the resource property is set to "do not restart" then a failure of that resource will not affect the group, and cluster will not fail over the group.
  • Per Resource: "Persistent State" property; if the resource was brought offline manually, and if the group is failing over to another node, the resource will not come online on the other node, as it was already offline (and therefore persistent) to begin with.
  • Per Resource: "Possible Owners"; prior to failing over to a node, cluster will enumerate all resources in the group and determine whether every resource is "possible" to bring online on that node. If one (or more) resources cannot come online there due to this "Possible Owners" setting, then cluster will not move the group to that node.
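As a rough illustration of how a few of these settings combine, here is a simplified sketch in Python. It only models the failover threshold and period, AutoStart, Preferred Owners and Possible Owners; all names (Group, Resource, pick_failover_target) and values are hypothetical, and the real cluster logic takes more into account than this.

```python
from dataclasses import dataclass, field


@dataclass
class Resource:
    name: str
    possible_owners: set  # nodes on which this resource may come online


@dataclass
class Group:
    name: str
    resources: list
    preferred_owners: list = field(default_factory=list)  # advisory order
    auto_start: bool = True
    failover_threshold: int = 1     # illustrative value
    failover_period_h: float = 6.0  # illustrative value


def pick_failover_target(group, current_node, up_nodes, failover_times_h, now_h):
    """Return the node to fail over to, or None if the group stays put."""
    if not group.auto_start:
        return None  # "Do not AutoStart": the group does not fail over
    recent = [t for t in failover_times_h
              if now_h - t <= group.failover_period_h]
    if len(recent) >= group.failover_threshold:
        return None  # threshold reached within the period
    # Possible Owners: every resource must be allowed on the target node.
    candidates = [n for n in up_nodes
                  if n != current_node
                  and all(n in r.possible_owners for r in group.resources)]
    # Preferred Owners only orders the candidates; it does not exclude any.
    preferred = [n for n in group.preferred_owners if n in candidates]
    others = [n for n in candidates if n not in preferred]
    ordered = preferred + others
    return ordered[0] if ordered else None
```

Note how a single resource that lacks a node in its "Possible Owners" list removes that node as a target for the whole group, even when that node is first in the "Preferred Owners" list.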

All of this (the error detection, the decision to fail over or not, and where to move to) is done by cluster.
Cluster Enabler does not come into play in this logic; it is merely a resource in the cluster.

Hope this clears up the situation of "Cluster Enabler failed over the group", the answer for that is: Cluster detected an error, and Cluster failed over the group. That will give you a good start when looking at those situations.

I mentioned 2 exceptions:

  • Cluster Enabler does not cause a cluster to fail over; the one exception is that Cluster Enabler is a resource in the cluster and therefore has a "LooksAlive" and an "IsAlive" check. While the checks we are doing are quite basic (just checking connectivity to the frame), it is possible for them to detect a failure, in which case the resource fails. When this happens, you will probably notice other resources failing at the same time, because you have a connectivity problem; disk resources will be failing as well at this point.
  • Cluster Enabler does not make any decisions; the one exception is that at the very moment the resource is coming online, the Cluster Enabler software checks the replication status of the "link" (could be RDF, MirrorView or RP), and if the "link" is not replicating/up then the Cluster Enabler resource will make the decision not to come online. This is to protect the data, but can be configured. (I will do another post detailing this)
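The second exception boils down to a single check at online time. A minimal sketch in Python, with a hypothetical function name and a hypothetical override flag standing in for the configurable behaviour (this is not the actual Cluster Enabler setting or API):

```python
def ce_resource_may_come_online(link_replicating: bool,
                                override_link_check: bool = False) -> bool:
    """Decide whether the (hypothetical) CE resource comes online.

    link_replicating: True if the replication link is up and replicating.
    override_link_check: hypothetical stand-in for the configurable
    behaviour mentioned above; off by default to protect the data.
    """
    if link_replicating:
        return True
    # Link is down or not replicating: stay offline unless configured otherwise.
    return override_link_check
```

If the resource refuses to come online, the group cannot come online on that node, and from there the normal cluster rules described above take over again.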

And there you have it: failover in a multi-site cluster using Cluster Enabler.
It is basically not much more than any other cluster!
All the normal cluster logic applies.

Please let me know if you have any questions
Rgds,
Edwin.