January 16th, 2012 08:00

Windows Server 2008 R2 Lost A Volume...

Hi,

I was testing out some SQL scripts on the DR site when I got an error in SSMS stating that the Windows OS could not access the TempDB files.  I went on Server Manager - Disk Management and rescanned the disks, but the TempDB volume never showed up.  I then looked on Group Manager for the SAN and didn't see that volume offline or anything nor did the SAN have any errors.  On iSCSI Initiator, the volume was connected so I'm wondering how Windows could "lose" a volume just like that?

I finally got it working by disconnecting from the volume in iSCSI Initiator, reconnecting, and then rescanning the disks.  The TempDB volume showed up, but SSMS still could not access it, so I restarted the SQL service.  That worked, and everything is fine now.

I'm just curious whether anyone has run into this problem before.  Does anyone know why it occurred?

Thank you.

294 Posts

January 16th, 2012 09:00

Hi Kenny,

This occurred shortly after my initial post (around 11:20 AM).  The volume access is via IP with a wildcard in the 4th octet.  All of our volumes (primary SAN and DR SAN) are using IPs with a wildcard in the 4th octet.  We have also checked the checkbox labeled "Allow simultaneous connections from initiators with different IQN names".

This is the first time I have encountered this in the year or so since we purchased the EqualLogics.

Thank you.

685 Posts

January 16th, 2012 09:00

Hello Dajonx,

After reading your post, the first thing I would like you to check is the access list for the volume that you were having a problem seeing. Is it set up using the IQN or the IP, or are you using CHAP? If you are using IP, are you using a wildcard in the 4th octet? Also, can I get a timeline for when the issue started happening?

Thanks

5 Practitioner

274.2K Posts

January 16th, 2012 10:00

Using a wildcard in any of the IP octets is a very dangerous thing to do.  It means any initiator with a matching address can connect to the volume and overwrite it at will.  Two or more servers connecting to the same volume without a cluster file system is a surefire way to corrupt a volume.  Restrict volume access to only the server that needs it.

There's a free utility called TestDisk that might be able to restore the partition table; then run chkdsk to hopefully repair any filesystem errors.  You should check the iSCSI connections on all of your remaining volumes to see if multiple servers are connected.

Also, just changing the ACL on the array will not drop any existing connection.  You are at extreme risk right now for further corruption.
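To make the "check the iSCSI connections on all of your remaining volumes" step concrete, here is a minimal Python sketch. It assumes you have copied each volume's connections out of Group Manager by hand into a list of (volume, initiator IQN) pairs; the volume names, the IQNs, and the function name find_shared_volumes are all hypothetical, and nothing here talks to the array. It groups by initiator IQN rather than by IP, since one host can legitimately hold several sessions to the same volume.

```python
from collections import defaultdict


def find_shared_volumes(connections):
    """Return volumes that have sessions from more than one distinct initiator IQN.

    `connections` is an iterable of (volume_name, initiator_iqn) pairs.
    Several sessions from the same IQN (for example multiple paths from
    one host) are counted once.
    """
    by_volume = defaultdict(set)
    for volume, initiator_iqn in connections:
        # Lowercasing is an assumption made here so that the same host
        # is not counted twice because of differing case.
        by_volume[volume].add(initiator_iqn.lower())
    return {vol: sorted(iqns) for vol, iqns in by_volume.items() if len(iqns) > 1}


# Hypothetical data, typed in by hand for illustration only.
connections = [
    ("tempdb", "iqn.1991-05.com.microsoft:servera.example.local"),
    ("tempdb", "iqn.1991-05.com.microsoft:serverb.example.local"),
    ("data", "iqn.1991-05.com.microsoft:servera.example.local"),
    ("data", "iqn.1991-05.com.microsoft:servera.example.local"),  # second path, same host
]

shared = find_shared_volumes(connections)
for volume, iqns in shared.items():
    print(f"{volume}: connected from {len(iqns)} different initiators: {iqns}")
```

Any volume this reports is a candidate for a double mount, unless the servers involved are members of a cluster that is meant to share it.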

294 Posts

January 16th, 2012 11:00

Even though the SANs are on different subnets (primary and DR)?  For example, the DR site has a subnet of 10.1.1.* and the primary site has a subnet of 10.10.10.* which is the value we put in their respective SAN access control list.

And is this the reason why that volume just disappeared on the Windows File System?

5 Practitioner

274.2K Posts

January 16th, 2012 12:00

The subnets don't matter.  Anything local or routable to either can access any volume.  I could plug a laptop into either subnet, connect to any or all of your volumes, fire up Disk Management, and reformat them all.  Each server believes it owns that volume exclusively.  On a reboot or shutdown it will flush out what it believes to be the current state of the volume, which can overwrite what the real server did.  This is known as a 'double mount' condition, and it is without a doubt the number one cause of file system corruption with SANs.  Since there doesn't appear to be a connection issue, the most likely cause here is file system corruption due to a double mount.
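As a rough illustration of why a wildcard entry is so permissive, the following Python sketch compares a wildcard IP pattern with an exact IQN entry. It only shows the matching logic in the abstract and is not how the array's firmware evaluates ACLs; the IQN values and both function names are made up for the example.

```python
from fnmatch import fnmatchcase

# Hypothetical ACL entries, modeled on the patterns discussed in this thread.
IP_ACL = "10.1.1.*"
IQN_ACL = "iqn.1991-05.com.microsoft:servera.example.local"


def ip_acl_allows(pattern: str, initiator_ip: str) -> bool:
    """Return True if the initiator's IP matches the wildcard pattern."""
    return fnmatchcase(initiator_ip, pattern)


def iqn_acl_allows(allowed_iqn: str, initiator_iqn: str) -> bool:
    """Return True only when the initiator presents the one allowed IQN.

    Comparing in lowercase is an assumption made for this sketch.
    """
    return allowed_iqn.lower() == initiator_iqn.lower()


# The intended server and a stray laptop on the same subnet both pass the IP rule.
print(ip_acl_allows(IP_ACL, "10.1.1.20"))   # intended server
print(ip_acl_allows(IP_ACL, "10.1.1.57"))   # laptop plugged into the subnet

# Only the intended server passes the IQN rule.
print(iqn_acl_allows(IQN_ACL, "iqn.1991-05.com.microsoft:servera.example.local"))
print(iqn_acl_allows(IQN_ACL, "iqn.1991-05.com.microsoft:laptop.example.local"))
```

The IP rule cannot tell the two machines apart, while the IQN rule admits only the named initiator.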

294 Posts

January 16th, 2012 13:00

Understood.  Are there any documents that explain this and the steps to fix it?

5 Practitioner

274.2K Posts

January 16th, 2012 15:00

No, there's no document that explains this.  

There are docs for TestDisk that cover attempting partition table recovery.

On the array, I would suggest you use the IQN name of the initiator instead of the IP address.  You can copy/paste it from the MS iSCSI utility.  Also, depending on the firmware, you can make sure that "Multi Host Access" is disabled on the volume.

5 Practitioner

274.2K Posts

January 17th, 2012 11:00

No.  That's still not 100% safe.  Using an IQN name is not the same as just adding a '*' to the ACL.  It has a similar effect, but the process of allowing it is very different: each login attempt is screened.

Using IQN also makes it easier if you ever had to change the IP subnet of the iSCSI SAN.  

You're only looking at the servers you have right now.  If you add another one, or someone else does, you're still vulnerable.

The checkbox adds another layer of protection as long as the desired server remains online and connected at all times.

Someone else coming in later may not realize the ramifications of this setup, and may just follow your lead without understanding the consequences.

It's really no different from controlling physical access to a server room while at the same time giving everyone all the passwords and full access to everything.

It's better to be more restrictive than not.    You look to have lost data once because of this issue.  Do you want to risk that again?

294 Posts

January 17th, 2012 11:00

Ok, thank you.  

Can you please tell me if this is considered to be safe?

Example: ServerA has an IP of 10.1.1.1 and ServerB has an IP of 10.2.1.1.  The SAN has an IP of 10.1.100.1 and the IP Access is 10.1.100.*.  Nothing else has the IP range of 10.1.100.* except the SAN.  Isn't that the same as using IQNs?

I have created a few volumes with IQN Access and "Allow simultaneous connections from initiators with different IQN names" unchecked.  When I look at the connections tab on Group Manager, it is the same as if I used IP Access of 10.1.100.*.  (There are four connections with IP addresses of 10.1.100.101 - 10.1.100.104)

294 Posts

January 17th, 2012 12:00

Thank you, Don.

What is the best way to convert the existing volumes to IQN Access and to uncheck "Allow simultaneous connections from initiators with different IQN names"?

Just to clarify, the IQN names that I will be inserting in the Access tab are their respective "iSCSI Target" in Group Manager -> Modify Settings -> Advanced Tab.  Is that correct?

5 Practitioner

274.2K Posts

January 17th, 2012 12:00

Yes,

Go to each volume, and on the access tab add the IQN name using the "Add" link on the right-hand side.  Once that's done, you can remove the IP address and uncheck the box.  If you have sessions from another server, you will have to manually log out of that volume from the server side.  They won't be logged out when you change an ACL.

5 Practitioner

274.2K Posts

January 17th, 2012 13:00

That's a CHAP error.  It occurs when a volume has a CHAP account as an ACL, but the initiator isn't sending a CHAP username/password combo.  Make sure you put the ACL in the correct block on the access tab.  It sounds like the initiator name was entered in the CHAP field.

294 Posts

January 17th, 2012 13:00

In my test, when I changed the IP Access to IQN and unchecked the simultaneous connection box, I received a TON of error messages.

Severity  Date and Time       Member  Message
--------  ------------------  ------  -------
ERROR     1/17/12 4:09:15 PM  SAN1    iSCSI login to target '10.1.222.26:3260, iqn.2001-05.com.equallogic:0-8a0906-bd2802609-5bf0019d05d4e9d8-xxxdata' from initiator '10.1.222.103:51653, iqn.1991-05.com.microsoft:serverA.xxx.xx' failed for the following reason: Initiator tried to bypass the security phase but we cannot.
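When there are a lot of these messages, it can help to pull out which initiator and which volume each one refers to. The Python sketch below does that with a regular expression written against the one message shown above; the pattern and the function name parse_login_failure are assumptions for illustration, and other firmware versions may word the event differently.

```python
import re

# Pattern written against the single event shown above (an assumption;
# it is not taken from any vendor documentation).
LOGIN_FAILURE = re.compile(
    r"iSCSI login to target '(?P<target_addr>[^,']+), (?P<target_iqn>[^']+)' "
    r"from initiator '(?P<initiator_addr>[^,']+), (?P<initiator_iqn>[^']+)' "
    r"failed for the following reason:\s*(?P<reason>.+)"
)


def parse_login_failure(message: str):
    """Return the fields of a login-failure event as a dict, or None if no match."""
    match = LOGIN_FAILURE.search(message.strip())
    if match is None:
        return None
    fields = match.groupdict()
    fields["reason"] = fields["reason"].strip()
    return fields


message = (
    "iSCSI login to target '10.1.222.26:3260, "
    "iqn.2001-05.com.equallogic:0-8a0906-bd2802609-5bf0019d05d4e9d8-xxxdata' "
    "from initiator '10.1.222.103:51653, iqn.1991-05.com.microsoft:serverA.xxx.xx' "
    "failed for the following reason:   "
    "Initiator tried to bypass the security phase but we cannot.  "
)

event = parse_login_failure(message)
print(event["initiator_iqn"], "->", event["target_iqn"])
print("reason:", event["reason"])
```

Grouping the parsed events by initiator IQN and target IQN shows quickly whether the failures all come from one server and one volume or from several.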
