December 9th, 2009 02:00

Windows clustering disk signature problems

I'm trying to troubleshoot a problem we've got on one of our Windows clusters.

Specifically, it's failed to bring resources online, complaining:

000011fc.000002b4::2009/12/07-11:28:55.243 ERR  Physical Disk : Online, Cluster DB signature 13FF07D7 does not match PhysicalDrive11 signature 3FFFFFFF
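For anyone trawling a cluster log for the same message, here is a minimal Python sketch that pulls the drive number and both signatures out of a line like the one above. The pattern is inferred from this single sample line only, so treat the field layout as an assumption; `parse_mismatch` is a hypothetical helper name, not part of any Microsoft or EMC tool.

```python
import re

# Pattern for the "signature does not match" message quoted above.
# Assumption: the layout is inferred from that one sample line only.
MISMATCH_RE = re.compile(
    r"Cluster DB signature (?P<expected>[0-9A-Fa-f]{1,8}) does not match "
    r"PhysicalDrive(?P<drive>\d+) signature (?P<found>[0-9A-Fa-f]{1,8})"
)


def parse_mismatch(line):
    """Return (drive_number, expected_sig, found_sig), or None if no match."""
    m = MISMATCH_RE.search(line)
    if m is None:
        return None
    return (
        int(m.group("drive")),
        int(m.group("expected"), 16),
        int(m.group("found"), 16),
    )


sample = (
    "000011fc.000002b4::2009/12/07-11:28:55.243 ERR  Physical Disk : Online, "
    "Cluster DB signature 13FF07D7 does not match PhysicalDrive11 signature 3FFFFFFF"
)

result = parse_mismatch(sample)
if result is not None:
    drive, expected, found = result
    print(f"PhysicalDrive{drive}: expected {expected:08X}, found {found:08X}")
```

Running this over every line of the log would give a quick list of which drives reported a mismatch and when the on-disk value differed from the cluster database.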

Basically, it looks like 'somehow' this drive has changed disk signature (BCV/SRDF replicas also have this disk signature).

The really bothersome thing is that this is a database server, and _both_ the logs and data LUNs have had this (well, something very similar) happen over the last couple of weeks. But only on that particular database.

We've got a pair of DMX-3s, SRDF replicated, and backup stuff via BCVs. Powerpathed and clustered. So there's a load of 'complications' in the environment.

Now I'm trying to figure out what's going on, and how to deal with it - we've ended up presenting them with new storage, which has 'fixed' it, but that hasn't in my mind got to the root cause of it.

Microsoft have apparently told us it's a 'storage fault', but ... well, I figure they probably _would_ say that.

Has anyone seen anything similar? Specifically a cluster resource going offline with disk signature problems?

I've got the ball in my court at the moment, as Microsoft have pointed the finger at the multipathing driver or the storage array, which doesn't make much sense to me. But whatever. I was wondering if anyone had suggestions on avenues of enquiry? (I do have this open as a case with EMC support.)

Cheers,

Ed

53 Posts

December 9th, 2009 02:00

Production unfortunately - this problem hasn't shown up in dev, nor DR.

DR does have the disks with the 'replicated' signatures on disk, but that one works 'just fine'. Indeed, we used the SRDF copy to restore the database files to bring the database back into service.

2 Intern

 • 

1.3K Posts

December 9th, 2009 02:00

Is that a production or DEV environment?

53 Posts

December 9th, 2009 07:00

It's 2003 SP1 (yes, I know that's end of life - we have extended our support arrangement, and despite that, I'm still trying to pull together quite what happened for my own sanity - nothing worse than 'it'll all be fixed in the new release' only to find it hasn't been, and you're back to square one with a critical problem)

Changes - well, nothing dramatic, configuration wise. What triggered noticing this is building a new cluster node, but it all went splat -before- it got brought online in the cluster. It's likely there are patches and service packs applied though.

We've never done a BCV or RDF restore of this device (nor its 'buddy' that did the same thing).


We've also not swapped between MBR and GPT or vice versa.

VSS is a little more knotty - we use NetWorker to back up, via PowerSnap. However, since posting this, I've gone on a rummage and seen hints that point towards VSS as a possible culprit, and I've since found event log messages for the service starting and stopping. I _think_ this may be in response to the "SQL Server VSS Writer" service. But I've still not actually caught anything misbehaving.

61 Posts

December 9th, 2009 07:00

Is this Windows 2003 or 2008?

Were there any changes made to the environment recently? Were there any sort of restore operations that might have occurred from the BCV or RDF device since the last failover or reboot of the host?

Was the disk recently converted from MBR to GPT?

Are you using any VSS enabled backup products?

61 Posts

December 9th, 2009 10:00

I asked about VSS due to the issue found in http://support.microsoft.com/kb/939007. You might want to check the version of your ftdisk.sys driver to see if it is susceptible to this issue.

The signature 3FFFFFFF doesn't seem random enough to be a Windows assigned disk signature though, so this might be a dead end.
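To make the "not random enough" point concrete, here is a small Python sketch that prints the bit pattern of each signature mentioned in this thread and estimates how often a value with that many set bits would turn up by chance. The estimate assumes signatures are drawn uniformly at random from all 32-bit values, which is only an assumption for illustration; how Windows actually generates them is not covered here. `describe` and `prob_at_least` are hypothetical helper names.

```python
from math import comb


def describe(sig):
    """Return the 32-bit binary string of a signature and its count of set bits."""
    bits = format(sig & 0xFFFFFFFF, "032b")
    return bits, bits.count("1")


def prob_at_least(k, width=32):
    """Chance that a uniformly random `width`-bit value has at least k set bits.

    Assumption: uniform randomness, used purely as a yardstick.
    """
    return sum(comb(width, i) for i in range(k, width + 1)) / 2**width


for sig in (0x13FF07D7, 0x3FFFFFFF, 0xFFFFFFFF):
    bits, ones = describe(sig)
    print(f"{sig:08X}  {bits}  set bits: {ones}")

print(f"P(at least 30 of 32 bits set) = {prob_at_least(30):.3g}")
```

3FFFFFFF has 30 of its 32 bits set (only the top two are clear), so under the uniform assumption a value that dense would be very rare, which fits the suspicion that it was not generated as a normal signature.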

If you haven't done so already, you might consider opening a service request so that EMC support can take a look at this issue.

5 Practitioner

 • 

274.2K Posts

December 16th, 2009 03:00

Hi,

I saw this a couple of years ago. A disk with multiple paths can be the cause: the disk driver might see the disk on one path before the other, so the OS treats it as a new disk and writes a new signature to it.

It happened on an Exchange cluster when a node failover occurred.

On an MS host, check the Storport, HBA driver, and PP versions carefully; they are all linked.

The fix provided by Microsoft is one that includes the FTdisk driver, so I guess you have the same issue that I had.

Hope it helps.

Michaël

53 Posts

December 16th, 2009 03:00

Well, we're making progress on this - we're still at the stage of 'we're not sure' but we're going to be doing some patching/service packing on the OS, upgrading PowerPath and the QLogic drivers, and applying some FA flags. The procedure for each node:


1) Select node to upgrade - fail resources over to alternate node
2) Reduce access to disks to a single path (easiest method - block port on switch)
3) Scan for hardware changes in device manager
4) Uninstall PowerPath
5) Reboot
6) Apply Windows 2003 SP2
7) Reboot
8) Install HP ProLiant Support Pack (for ProLiant DL585 G1)
9) Upgrade HBA drivers and firmware and apply MS Storport hotfix 957910
10) Change FA flags to make sure OS2007, SPC2, CMSN, SCSI3 are set:
      symmask -sid set hba_flags on os2007,spc2,scsi3  -wwn -dir -p -enable
      symmask -sid refresh
11) Reboot
12) Install PowerPath 5.3
13) Reboot
14) Add removed paths from step 2
15) Scan for hardware changes in device manager
16) Repeat for alternate nodes.

Not fixed as of now - our current 'plan B' is to put in a 'test' driver that will detect that a signature change has occurred, and will crash dump the OS when that happens. Not as drastic as it sounds, as it's a clustered system.
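This is not the test driver, just a user-space Python sketch of the check it would be making. On an MBR disk the signature is the 4-byte little-endian value at offset 0x1B8 of sector 0, so detecting a change comes down to comparing that field with the value the cluster database expects. The sector below is synthetic; reading sector 0 from a real device (which needs administrator rights) is left out, and `mbr_signature` / `signature_changed` are hypothetical helper names.

```python
import struct

MBR_SIGNATURE_OFFSET = 0x1B8  # 4-byte little-endian disk signature in sector 0
MBR_BOOT_MARKER = b"\x55\xaa"  # last two bytes of a valid MBR sector


def mbr_signature(sector0):
    """Extract the disk signature from the first 512 bytes of an MBR disk."""
    if len(sector0) < 512:
        raise ValueError("need at least 512 bytes of sector 0")
    if bytes(sector0[510:512]) != MBR_BOOT_MARKER:
        raise ValueError("sector does not end with the 55 AA MBR marker")
    (sig,) = struct.unpack_from("<I", sector0, MBR_SIGNATURE_OFFSET)
    return sig


def signature_changed(sector0, expected):
    """True if the on-disk signature differs from the expected one."""
    return mbr_signature(sector0) != expected


# Synthetic sector for demonstration only (not data from the real LUN).
sector = bytearray(512)
struct.pack_into("<I", sector, MBR_SIGNATURE_OFFSET, 0x13FF07D7)
sector[510:512] = MBR_BOOT_MARKER

print(f"on-disk signature: {mbr_signature(sector):08X}")
print("changed:", signature_changed(sector, 0x13FF07D7))

# Simulate something overwriting the signature field.
struct.pack_into("<I", sector, MBR_SIGNATURE_OFFSET, 0x3FFFFFFF)
print("changed after overwrite:", signature_changed(sector, 0x13FF07D7))
```

A polling check like this can only say that a change happened between two reads; the appeal of a driver that crash dumps at the moment of the write is that it captures what was running when the signature was overwritten.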

We have caught a few 'signature change' collisions, but ... well, still can't see how we got signatures of FFFFFFFF and 3FFFFFFF - they're 'all 1' bit patterns, and they seem ... unlikely to be randomly generated sigs.
