dajonx
2 Iron

Windows Server 2008 R2 Lost A Volume...

Hi,

I was testing some SQL scripts at the DR site when I got an error in SSMS stating that the Windows OS could not access the TempDB files. I opened Server Manager > Disk Management and rescanned the disks, but the TempDB volume never showed up. I then checked Group Manager for the SAN: the volume was not offline, and the SAN showed no errors. In iSCSI Initiator the volume was still connected, so I'm wondering how Windows could "lose" a volume just like that.

I finally got it working by disconnecting from the volume in iSCSI Initiator, reconnecting, and then rescanning the disks. The TempDB volume showed up, but SSMS still could not access it, so I restarted the SQL service. After that it worked, and everything is fine.

I'm just curious whether anyone has run into this problem before. Does anyone know why it occurred?

Thank you.

Moderator

Re: Windows Server 2008 R2 Lost A Volume...

Hello Dajonx

After reading your post, the first thing I would like you to check is the access list for the volume you were having problems seeing. Is it set up using the IQN or the IP, or are you using CHAP? If you are using IP, are you using a wildcard in the 4th octet? Also, can I get a timeline for when the issue started happening?

Thanks

Kenny K.

dajonx
2 Iron

Re: Windows Server 2008 R2 Lost A Volume...

Hi Kenny,

This occurred shortly after my initial post (around 11:20 AM). The volume access is via IP with a wildcard in the 4th octet. All of our volumes (primary SAN and DR SAN) use IPs with a wildcard in the 4th octet. We have also checked the check box labeled "Allow simultaneous connections from initiators with different IQN names".

This is the first time I have encountered this in the year or so since we purchased the EqualLogics.

Thank you.


Re: Windows Server 2008 R2 Lost A Volume...

Using a wildcard in any of the IP octets is a very dangerous thing to do. It means any initiator whose address matches can connect to the volume and overwrite it at will. Two or more servers connecting to the same volume without a cluster file system is a surefire way to corrupt a volume. Restrict volume access to only the server that needs it.

There's a free utility called TestDisk that might be able to restore the partition table. After that, run chkdsk to hopefully repair any filesystem errors. You should also check the iSCSI connections on all of your remaining volumes to see whether multiple servers are connected.

Also, just changing the ACL on the array will not drop any existing connections. You are at extreme risk of further corruption right now.
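
One way to do the connection check is to list the connections for each volume and group them by initiator. Here is a minimal sketch in Python, assuming you have copied (volume name, initiator IQN) pairs out of the Group Manager connections view by hand. The volume names and IQNs are made up, and `find_shared_volumes` is a hypothetical helper, not part of any Dell tool. It groups by IQN rather than by IP because one server using multipath can legitimately show several connections from different IPs.

```python
from collections import defaultdict


def find_shared_volumes(connections):
    """Return volumes that have connections from more than one initiator.

    connections: iterable of (volume_name, initiator_iqn) pairs.
    Repeated pairs (for example multipath sessions) count only once.
    """
    initiators_by_volume = defaultdict(set)
    for volume, iqn in connections:
        # Normalize case so the same initiator is not counted twice.
        initiators_by_volume[volume].add(iqn.strip().lower())
    return {
        volume: sorted(iqns)
        for volume, iqns in initiators_by_volume.items()
        if len(iqns) > 1
    }


# Made-up sample data for illustration only.
sample = [
    ("tempdb-vol", "iqn.1991-05.com.microsoft:sqldr01"),
    ("tempdb-vol", "iqn.1991-05.com.microsoft:sqldr01"),  # second path, same host
    ("data-vol", "iqn.1991-05.com.microsoft:sqldr01"),
    ("data-vol", "iqn.1991-05.com.microsoft:fileserver02"),
]

for volume, iqns in find_shared_volumes(sample).items():
    print(f"{volume} is connected from {len(iqns)} initiators: {', '.join(iqns)}")
```

Any volume this reports deserves a closer look, unless the servers involved really are members of a cluster that coordinates access.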

Social Media and Community Professional
#IWork4Dell
Get Support on Twitter - @dellcarespro

dajonx
2 Iron

Re: Windows Server 2008 R2 Lost A Volume...

Even though the SANs are on different subnets (primary and DR)? For example, the DR site has a subnet of 10.1.1.* and the primary site has a subnet of 10.10.10.*, and those are the values we put in each SAN's access control list.

And is this the reason why that volume just disappeared from the Windows file system?


Re: Windows Server 2008 R2 Lost A Volume...

The subnets don't matter. Anything local or routable to either can access any volume. I could plug a laptop into either subnet, connect to any or all of your volumes, fire up Disk Management, and reformat them all. When more than one server is connected, each server believes it owns that volume exclusively. On a reboot or shutdown it will flush out what it believes to be the current state of the volume, which can overwrite what the real server did. This is known as a 'double mount' condition, and it is without a doubt the number one cause of file system corruption with SANs. Since there doesn't appear to be a connection issue, the most likely cause is file system corruption due to a double mount.
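
To make the wildcard point concrete, here is a rough sketch in Python of what an ACL entry like 10.1.1.* amounts to. It only illustrates the matching rule and is not how the array firmware implements it. The function name `ip_acl_allows` and the host addresses are made up for the example.

```python
def ip_acl_allows(acl_pattern: str, initiator_ip: str) -> bool:
    """Return True if initiator_ip matches a dotted ACL pattern such as '10.1.1.*'.

    A '*' in an octet position matches any value in that position.
    """
    acl_octets = acl_pattern.split(".")
    ip_octets = initiator_ip.split(".")
    if len(acl_octets) != 4 or len(ip_octets) != 4:
        return False
    return all(a == "*" or a == i for a, i in zip(acl_octets, ip_octets))


# Hypothetical hosts: the intended SQL server and a laptop plugged into the same subnet.
for name, ip in [("sql-server", "10.1.1.20"), ("stray-laptop", "10.1.1.200")]:
    print(name, ip, "allowed" if ip_acl_allows("10.1.1.*", ip) else "denied")
```

Both hosts pass the check, because the entry says nothing about which machine is asking, only which address range it is asking from.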

Social Media and Community Professional

dajonx
2 Iron

Re: Windows Server 2008 R2 Lost A Volume...

Understood. Is there any documentation that explains this and the steps to fix it?


Re: Windows Server 2008 R2 Lost A Volume...

No, there's no document that explains this.

There are docs for TestDisk that cover attempting a partition table recovery.

On the array, I would suggest you use the IQN of the initiator instead of an IP address. You can copy and paste it from the Microsoft iSCSI Initiator utility. Also, depending on the firmware, you can make sure that "Multi Host Access" is disabled on the volume.
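
For comparison with an IP wildcard, an IQN entry behaves like an exact-match allowlist. Below is a rough Python sketch of that idea, not the array's actual login logic. The function name `iqn_acl_allows` and the IQNs are invented for illustration.

```python
def iqn_acl_allows(allowed_iqns, initiator_iqn: str) -> bool:
    """Return True only if the initiator's IQN is explicitly listed.

    The comparison ignores case and surrounding whitespace.
    """
    allowed = {iqn.strip().lower() for iqn in allowed_iqns}
    return initiator_iqn.strip().lower() in allowed


# Hypothetical ACL containing a single server's IQN.
acl = ["iqn.1991-05.com.microsoft:sqldr01"]

print(iqn_acl_allows(acl, "iqn.1991-05.com.microsoft:sqldr01"))  # the listed server
print(iqn_acl_allows(acl, "iqn.1991-05.com.microsoft:laptop99"))  # any other host
```

A host that is not listed gets refused no matter which subnet it sits on, which is the practical difference from a wildcard IP entry.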

Social Media and Community Professional

dajonx
2 Iron

Re: Windows Server 2008 R2 Lost A Volume...

Ok, thank you.  

Can you please tell me if this is considered to be safe?

Example: ServerA has an IP of 10.1.1.1 and ServerB has an IP of 10.2.1.1.  The SAN has an IP of 10.1.100.1 and the IP Access is 10.1.100.*.  Nothing else has the IP range of 10.1.100.* except the SAN.  Isn't that the same as using IQNs?

I have created a few volumes with IQN Access and "Allow simultaneous connections from initiators with different IQN names" unchecked.  When I look at the connections tab on Group Manager, it is the same as if I used IP Access of 10.1.100.*.  (There are four connections with IP addresses of 10.1.100.101 - 10.1.100.104)


Re: Windows Server 2008 R2 Lost A Volume...

No. That's still not 100% safe. Using the IQN is not the same as just adding a '*' in the ACL. It has a similar effect, but the process of allowing access is very different: each login attempt is screened.

Using the IQN also makes things easier if you ever have to change the IP subnet of the iSCSI SAN.

You're only looking at the servers you have right now. If you add another one, or someone else does, you're still vulnerable.

The checkbox adds another layer of protection, as long as the desired server remains online and connected at all times.

Someone else coming in later may not realize the ramifications of this setup and may just follow your lead without understanding the consequences.

It's really no different from controlling physical access to a server room and then giving everyone who gets in all the passwords and full access to everything.

It's better to be more restrictive than not. You look to have lost data once because of this issue. Do you want to risk that again?

Social Media and Community Professional