24 Posts
May 31st, 2011 20:00
Adding additional HBA/WWN resets host failover mode to 1
We are in the process of migrating our systems from a CX700 to a VNX 5700. The VNX is on a completely new set of SAN switches.
Our first group is about 15 HP Blades running VMWare 4.1U1. We set up the blade chassis configuration so that each blade had one HBA attached to the switch with the CX700 and one HBA on the fabric with the VNX attached. We then used the storage vmotion to move the virtual machines (about 130 VM's, 10TB of data).
When we first connected the blades to the VNX, I set the host initiators in the VNX to ALUA/failover mode 4, and set the VMWare end to "round robin" (we're using the native VMWare multipathing). Everything moved OK, and we have been running with all the VMWare data on the VNX for a couple of weeks.
Today, we wanted to go ahead and switch the HBA connections that had been on the CX700 to the VNX environment to provide redundancy. We confirmed that there was no storage traffic going to the CX700 (the LUNs had been removed from the storage group). We unplugged the HBA from the CX700, and no problems showed up. We then attached the fiber cables to the new switch with the VNX.
At that point, all the VMs stopped responding on the network. All paths from the blades to the VNX showed "dead". Thinking it was something with the moved connections, we moved the cables back to the CX700. All paths were still dead.
We then looked in Unisphere, under "connectivity status". Each blade host showed the additional WWN's from the second HBA, but the failover mode on the host record had been reset to "failover mode 1". Since VMWare had been expecting mode 4, I figure that's why it lost connectivity to the paths. Resetting the host in the VNX back to failover mode 4, then re-scanning the storage in VMWare fixed the problem for that host.
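For anyone wanting to catch this kind of reset before the hosts lose their paths, the check described above (compare each initiator's failover mode against the expected value) can be sketched roughly as follows. This is only a minimal sketch: the record layout, host names, and WWN strings are made-up placeholders, and getting the real values out of Unisphere or the array CLI is assumed to happen elsewhere.

```python
from collections import defaultdict

EXPECTED_MODE = 4  # ALUA, the mode the host initiators were originally set to


def find_mode_mismatches(initiators, expected=EXPECTED_MODE):
    """Group initiators whose failover mode differs from the expected one.

    `initiators` is a list of dicts with hypothetical keys
    "host", "wwn", and "failover_mode" (not a real array API format).
    Returns {host: sorted list of (wwn, mode)} for the mismatches only.
    """
    bad = defaultdict(list)
    for rec in initiators:
        if rec["failover_mode"] != expected:
            bad[rec["host"]].append((rec["wwn"], rec["failover_mode"]))
    return {host: sorted(entries) for host, entries in bad.items()}


# Placeholder data: blade01 was reset to mode 1, blade02 is still at mode 4.
sample = [
    {"host": "blade01", "wwn": "blade01-hba1", "failover_mode": 1},
    {"host": "blade01", "wwn": "blade01-hba2", "failover_mode": 1},
    {"host": "blade02", "wwn": "blade02-hba1", "failover_mode": 4},
    {"host": "blade02", "wwn": "blade02-hba2", "failover_mode": 4},
]

for host, entries in find_mode_mismatches(sample).items():
    print(host, entries)
```

Running a check like this right after registering a new WWN, and before rescanning on the VMWare side, would flag any host whose record needs to be set back to mode 4.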
Why would adding additional WWN's to the host entry cause the failover mode to reset to "1"? Is this a bug in the VNX software, or is there a reason to reset it when adding the extra WWN's?
Any suggestions would be appreciated.
Mike O'Donnell


pharford57
115 Posts
0
June 1st, 2011 01:00
Hi Mike
I have seen that happen on a Clariion: I set the failover mode to 4, and when I went back it had changed to 1. Have you also run the relevant commands on your VMware hosts to change them to ALUA?
Here is a link to a guy who had a similar issue, with all the commands he had to run to change VMware to use ALUA. Or you could use PowerPath to look after all the path management for you.
http://www.boche.net/blog/index.php/2010/02/04/configure-vmware-esxi-round-robin-on-emc-storage/
Anyway, have a read of the above; it may help you out.
paul
msodonnell
24 Posts
0
June 1st, 2011 06:00
I had ALUA working fine on both the VNX and VMWare side. I had actually seen the site you mentioned when I was doing the research to configure ALUA.
I just can't figure out why adding an additional WWN to the host record on the VNX would reset its mode back to 1. I would think once it's set for the host, it should stay, no matter what HBA/WWN changes you make.
I'm just trying to find out if it's a bug (or "undocumented feature"), or if there's an intentional reason to reset it back.
kelleg
6 Operator
4.5K Posts
0
June 3rd, 2011 14:00
This might be related to what you're experiencing:
glen
The following is a Primus(R) eServer solution:
ID: emc262738
Domain: EMC1
Solution Class: 3.X Compatibility
Goal Why is the failover mode on the array changing from 4 (ALUA) to 1 after a Storage Processor reboot or an NDU?
Fact Product: CLARiiON CX4 Series
Fact EMC Firmware: FLARE Release 30
Fact Product: VMware ESX Server 4.0
Fact Product: VMware ESX Server 4.1
[NOT] Fact
Symptom After a storage processor reboot (either because of a non-disruptive upgrade [NDU] or other reboot event), the failover mode for the ESX 4.x hosts changed from 4 (ALUA) to 1 on all host initiators.
Cause On this particular array, a Host LUN Zero was not configured for each Storage Group. This allowed the array to present a "LUNZ" to the host. All host initiators had been configured to failover mode 4 (ALUA). When the storage processor rebooted due to a non-disruptive upgrade (NDU) and the connection was reestablished, the ESX host saw the LUNZ as an active/passive device and sent a command to the array to set the failover mode to 1. This changed the failover mode settings for all the LUNs in the Storage Group, and since the Failover Policy on the host was set to FIXED, the host lost access to the LUNs while one SP was rebooting.
Fix VMware will fix this issue in an upcoming patch for ESX 4.0 and 4.1. ESX 5.x does not have this issue.
To work around this issue, you can bind a small LUN, add it to the Storage Group, and configure the LUN as Host LUN 0 (zero). You will need to reboot each host after adding the HLU 0. Each Storage Group needs its own HLU 0. See solution emc57314 for information on changing the HLU.
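Since the workaround has to be applied to every Storage Group, a quick audit of which groups still lack an HLU 0 may be useful. The following is a minimal sketch, assuming the Storage Group names and their host LUN numbers have already been collected into a plain dictionary; the group names and HLU values shown are hypothetical and do not come from any real array output.

```python
def groups_missing_hlu0(storage_groups):
    """Return the names of storage groups that have no Host LUN 0.

    `storage_groups` maps a group name to the list of host LUN numbers
    (HLUs) presented in that group. The layout is an assumption made for
    this sketch, not an array or CLI format.
    """
    return sorted(
        name for name, hlus in storage_groups.items() if 0 not in hlus
    )


# Placeholder data: only the second group already has an HLU 0.
example_groups = {
    "ESX_Blades_A": [1, 2, 3],
    "ESX_Blades_B": [0, 1, 2],
    "ESX_Blades_C": [],
}

print(groups_missing_hlu0(example_groups))
```

Any group the function reports would be a candidate for binding a small LUN and presenting it as HLU 0, followed by the host reboot described above.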
These are the directions from VMware for the workaround:
msodonnell
24 Posts
0
June 5th, 2011 06:00
Thanks for the info.
It does sound like a similar effect, but I'm not sure it was the same cause. A couple of days before the issue with the reset, we had both SP's replaced. During that time, as each SP went down and then back up, VMWare did exactly what it was supposed to. The paths to the down SP showed "dead" but the other paths worked fine.
The issue we had later was when we added additional host initiators to the existing host record on the VNX. Still, it sounds a lot like the effect we had, so I'll go ahead and set up a small LUN 0 in the group.
Actually, on the CX700, we had a 5G LUN "0" we had set up years ago to address some of the other "LUNZ" issues we had with VMWare. When we set up the VNX, apparently that LUN wasn't brought over. Fortunately, the other LUNs were created on the VNX with non-zero host LUN values, so it will be easy to create a small LUN 0.
Thanks again.
Mike O.