KTelep

24 Posts

1891

May 23rd, 2011 22:00

Issues with MSCS in Clariion Environment (Extremely Long Failovers)

We've been around the block multiple times with both EMC Support and Microsoft Support. There's a lot of finger pointing between the two orgs on our issue, and we've been getting nowhere with either vendor.

I'm curious if anyone has experienced this:

We occasionally have EXTREMELY long failovers on our MSCS clusters attached to both our CX3-80 and CX4-960.

Host OS is Win2k3 64-bit

PowerPath is 5.5

Emulex HBA Drivers are latest and greatest

Switches are Cisco running SAN-OS 3.3(4a)

CX3 is running Flare 28, CX4 is running Flare 29

What we see is that during a failover our larger LUNs (2 TB) will sit in Online Pending for an extended period of time, and then seemingly at random come online after 45 minutes or so. If we shut down all nodes of the cluster and then bring up just one node, we get the same behavior, so I don't think we're having SCSI reservation conflicts.

Removing the host/disks from the storage group and re-adding them also has no impact. We just have to "wait it out" for MSCS to decide to bring the disks online.

Has anyone else EVER run into this situation before that can provide us a direction to continue working/researching?

Responses(14)

RRR

5.7K Posts

0

May 24th, 2011 02:00

Sorry, I don’t have any experience with Mirrorview/CE just yet.

dynamox

1 Rookie

•

20.4K Posts

0

May 24th, 2011 04:00

So your cluster nodes are attached to both arrays? Do you experience this issue if your cluster is attached to only one of the arrays? Are the disks formatted GPT?

Storagesavvy

474 Posts

0

May 24th, 2011 10:00

Are you using Cluster Enabler to cluster across both Arrays? Or do you have separate clusters, isolated to individual arrays?

sasamir1

234 Posts

0

May 24th, 2011 11:00

Flare 28 on a CX3? Could you please check again?

regards,

Samir

KTelep

24 Posts

0

May 24th, 2011 13:00

This is what happens when you post after working an issue for waaaay too long.

Two separate clusters, running on two separate Clariions. No MirrorView between them, no MirrorView/CE. Simplest cluster config imaginable.

CX3-80 is running Flare 26

dynamox

1 Rookie

•

20.4K Posts

0

May 24th, 2011 13:00

Are the cluster nodes zoned identically (each HBA zoned to SPA and SPB)? Is failover mode on the Clariion set to 1?

KTelep

24 Posts

0

May 24th, 2011 14:00

Yep, all the zoning is correct for all nodes of the clusters, and failover mode is set properly and validated.

Storagesavvy

474 Posts

0

May 24th, 2011 14:00

What is the NTFS Allocation size for the large LUNs? 4KB(default)? Or something else?

Richard J Anderson

christopher_ime

2K Posts

0

May 25th, 2011 06:00

Could you run the following PowerPath command on all nodes in the cluster?

emcphostid check

I'm not entirely certain of the exact behavior when there are collisions, but I do know to look for them in a clustered environment when installing PowerPath, so I figured it would be worth a quick look. Of course, nodes would only have the same hostid entry if the systems were created from a cloned disk image. There is a procedure to change it using the set parameter.
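Once you have the host ID from each node, comparing them by hand is easy to get wrong on larger clusters. Here is a minimal sketch of a collision check, assuming you have already copied the IDs into a dictionary yourself. The node names, the ID values, and the case-insensitive comparison are all hypothetical choices for illustration; this does not parse real `emcphostid` output.

```python
from collections import defaultdict


def find_hostid_collisions(host_ids):
    """Return host IDs that are shared by more than one node.

    host_ids maps a node name to the host ID string recorded for it.
    IDs are compared after trimming whitespace and lowercasing, which is
    an assumption made for this sketch, not documented PowerPath behavior.
    """
    nodes_by_id = defaultdict(list)
    for node, host_id in host_ids.items():
        nodes_by_id[host_id.strip().lower()].append(node)
    # Keep only the IDs that more than one node reported.
    return {hid: sorted(nodes) for hid, nodes in nodes_by_id.items() if len(nodes) > 1}


if __name__ == "__main__":
    # Hypothetical values, typed in by hand from each cluster node.
    collected = {
        "NODE-A": "0a1b2c3d",
        "NODE-B": "0A1B2C3D",
        "NODE-C": "99887766",
    }
    for hid, nodes in find_hostid_collisions(collected).items():
        print(f"host ID {hid} is shared by: {', '.join(nodes)}")
```

An empty result means no two nodes reported the same ID, which would rule out the cloned-image scenario.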

DanJost

190 Posts

0

May 27th, 2011 12:00

How full is the 2 TB disk? A loooooonngg time ago I had a similar problem: chkdsk was running whenever the disk failed over between nodes. That said, I would think chkdsk on a very full 2 TB disk would take longer than 45 minutes. You can prohibit chkdsk from running on the cluster nodes, but that might not be desirable in the long run. It's a long shot, but that's all I've got.

Dan

SKT2

1.3K Posts

0

July 6th, 2011 16:00

Chris,

What version of PP are you talking about? I have not noticed emcphostid on my servers so far.

christopher_ime

2K Posts

0

September 11th, 2011 15:00

SKT,

Apologies for the delay. I need to better manage responses directed to me, as I only just realized there was an unanswered post. Anyways... the earliest I've seen this documented is in PowerPath 5.3. Would this explain why you are not seeing it?

SKT2

1.3K Posts

0

October 2nd, 2011 18:00

Maybe I missed that this is a Windows environment. I see this is the same as the HostIdFile file in the Unix world.

KTelep

24 Posts

0

October 3rd, 2011 15:00

So we finally resolved this by migrating the hosts to VNX. LUNs on the old array took up to 2 hours to fail over. LUNs on the new array take less than 30 seconds.....
