We've been around the block multiple times with EMC Support and Microsoft Support, and there's been a lot of finger-pointing between the two organizations on our issue. We've been getting nowhere with either vendor.
I'm curious if anyone has experienced this:
We occasionally have EXTREMELY long failovers on our MSCS clusters attached to both our CX3-80 and CX4-960.
Host OS is Win2k3 64-bit
PowerPath is 5.5
Emulex HBA Drivers are latest and greatest
Switches are Cisco running SAN-OS 3.3(4a)
CX3 is running Flare 28, CX4 is running Flare 29
What we see is that during a failover our larger LUNs (2 TB) will sit in Online Pending for an extended period of time, then randomly pop after 45 minutes or so and come online. If we shut down all nodes of the cluster and then try to bring just one node up, we get the same behavior, so I don't think we're hitting SCSI reservation conflicts.
Removing the host/disks from the storage group and re-adding them also has no impact. We just have to "wait it out" until MSCS decides to bring the disks online.
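While a disk is stuck you can watch the resource states and the arbitration attempts from the node itself; a rough sketch of what we run (paths are the Win2k3 defaults):

    cluster res
    findstr /i "arbitrate" %SystemRoot%\Cluster\cluster.log

The first shows the disk resources sitting in Online Pending; the second pulls the disk arbitration entries out of the MSCS cluster log.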
Has anyone else EVER run into this situation who can point us in a direction to keep working/researching?
So your cluster nodes are attached to both arrays? Do you experience this issue if your cluster is attached to only one of the arrays? Are the disks formatted GPT?
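(Quick way to check the partition style from a node: run diskpart and look for an asterisk in the Gpt column.)

    diskpart
    DISKPART> list disk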
Are you using Cluster Enabler to cluster across both arrays? Or do you have separate clusters, isolated to individual arrays?
This is what happens when you post after working an issue for waaaay too long:
Two separate clusters, running on two separate Clariions. No MirrorView between them, no MirrorView/CE. Simplest cluster config imaginable.
Correction: the CX3-80 is actually running Flare 26, not 28.
Cluster nodes zoned identically? (Each HBA is zoned to both SP A and SP B.) Failover mode on the Clariion is set to 1?
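If the failover mode needs fixing, I believe the Navisphere CLI syntax is along these lines (from memory, so verify against your CLI version; the SP address and host name are placeholders):

    naviseccli -h <SP_A_ip> storagegroup -sethost -host <node_name> -failovermode 1 -arraycommpath 1

That re-registers the node's initiator records with failover mode 1, which is what PowerPath expects on a Clariion.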
Could you run the following PowerPath command on all nodes in the cluster?
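(That would be the host ID utility from the Windows PowerPath kit; I believe the check is simply:)

    emcphostid get

Run it on every node and compare the output; each node should report a unique host ID.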
I'm not entirely certain of the exact behavior when there are collisions, but I do know to look for it in a clustered environment when installing PowerPath, so I figured it would be worth a quick look. Nodes with the same hostid entry would of course only occur where the systems were created from a cloned disk image. There is a procedure to change it using the set parameter.
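Again from memory (verify the exact syntax against the utility's help; the ID value is a placeholder), the fix is to assign a unique ID on the affected node and reboot:

    emcphostid set <unique_host_id>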