Unsolved
4 Posts
0
VMware multipathing, VNXe3100 -- whenever one path connects, another goes down
Has anyone else seen anything like this? I am working with an EMC support engineer and also with VMware support, and they both seem stumped.
2 ESXi 5.0 hosts in an HA cluster sharing one VNXe3100. Each host has two physical NICs for iSCSI. iSCSI port binding appears to be correct; both the EMC tech and the VMware tech have been over and over it.
I have one iSCSI server on the VNXe, on SP A. It uses eth2 and eth3, which have IP addresses on two different subnets, call them A and B.
The eth2 NICs on both storage processors are in subnet A and are patched to switch A. Both eth3 NICs are in subnet B and are patched to switch B.
On each ESXi host, iSCSI 1 (vmk1) is in subnet A and patched to switch A; iSCSI 2 is in subnet B and patched to switch B.
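In case it helps anyone double-checking the same kind of setup, this is roughly what we ran from the ESXi shell to confirm the vmk addressing and the port binding. The adapter name vmhba33 and the vmk numbers are just examples; yours will likely differ.

  # list VMkernel interfaces with their IP, netmask and MTU
  esxcfg-vmknic -l
  # find the software iSCSI adapter name (e.g. vmhba33)
  esxcli iscsi adapter list
  # show which vmk ports are bound to the software iSCSI adapter
  esxcli iscsi networkportal list -A vmhba33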
The problem is I can only get two out of four vmk NICs to connect at any one time. Right now, one ESXi host is "firing on both cylinders", iSCSI 1 and 2 both working, but the other ESXi host doesn't see any path to storage. At other times, after a reboot, each host might see one iSCSI NIC connect while the other one is dead.
On a particular host, when one iSCSI NIC is up and the other is down, I can get them to "swap statuses" -- if iSCSI 1 is up, I can get iSCSI 2 to connect, but then iSCSI 1 will be dead. All four iSCSI NICs on the two hosts have been working at various points (just never more than two of them at once), so I think that rules out a hardware problem, i.e. a bad NIC, patch cable or switch port.
Whichever iSCSI NIC is down, I can't vmkping the storage processor IP in that subnet. When that same path shows as Up in the vSphere Client, vmkping works.
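For reference, the vmkping tests I mean are just the plain ones from the ESXi shell; on builds whose vmkping supports the -I option you can also force the source vmk interface, which makes it easier to test one path at a time. The vmk names and SP IPs below are placeholders.

  # basic reachability test to the SP portal IP in that subnet
  vmkping 10.10.1.50
  # if your vmkping supports it, pin the outgoing VMkernel interface
  vmkping -I vmk1 10.10.1.50
  vmkping -I vmk2 10.10.2.50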
I have tried it with Jumbo Frames enabled and disabled (point to point, MTU the same on all devices including the vSwitch and the vmk NIC); no difference. When Jumbo was enabled, vmkping worked with the packet size set to 8000 on whichever path was Up. All NICs and switch ports are set to 1000/Full, no autonegotiate.
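For what it's worth, this is roughly how the MTU was set and tested on the vSphere side (vSwitch1, vmk1 and the SP IP are placeholders; the switch ports and the VNXe side obviously have to be changed separately):

  # raise the MTU on the vSwitch and on the iSCSI vmk port
  esxcli network vswitch standard set -v vSwitch1 -m 9000
  esxcli network ip interface set -i vmk1 -m 9000
  # test jumbo frames end to end with don't-fragment set
  # (8972 = 9000 minus IP/ICMP headers; I was testing with -s 8000)
  vmkping -d -s 8972 10.10.1.50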
I tried replacing one of the switches with another one of a different brand; no change. All the switches I've tried are low-end managed switches: two 3Com 2928-SFP Plus and a Dell PowerConnect 2716. The minimal logging available on the switches doesn't show any problems (dropped packets etc.).
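On the ESXi side, the NIC speed/duplex can be checked and forced like this if anyone wants to rule out a negotiation mismatch (vmnic2 is just an example uplink name):

  # show link state, speed and duplex for all physical NICs
  esxcli network nic list
  # force 1000/Full on a specific uplink (or use -a to go back to autonegotiate)
  esxcli network nic set -n vmnic2 -S 1000 -D full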
I am hoping someone out there has seen this problem and found a solution. Or just some advice on what to try next.
Thanks!
Mark
Johnny_Bravo
63 Posts
0
November 8th, 2011 03:00
Hey Bulgie,
You still having problems with this?
jjgrinwis
48 Posts
0
November 9th, 2011 02:00
I've seen similar strange things with the iSCSI service on the VNXe3100.
Our test setup: one host connected to both subnets and connected to both iSCSI servers.
With the default MTU this works fine: we have two paths per LUN, and we see two LUNs, one for SP A and one for SP B.
When we change the MTU to 9000 on the switches, the vSphere NICs/vSwitch and the SPs, I lose all connectivity to my iSCSI servers; the datastores are gone.
We also have some NFS servers running, and they stay connected, no problem at all.
After setting everything back to 1500 and rebooting the SPs, everything is back to normal.
As a test we removed one interface from the trunk, created an iSCSI server connected to this single interface, and added that target to the software iSCSI adapter; the new LUN was seen on the vSphere host.
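On the vSphere side that step is just the usual send-target-and-rescan, something like this (vmhba33 and the portal IP are placeholders for our setup):

  # add the new iSCSI server as a dynamic discovery target
  esxcli iscsi adapter discovery sendtarget add -A vmhba33 -a 192.168.10.50:3260
  # rescan so the new LUN shows up
  esxcli storage core adapter rescan -A vmhba33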
When we set that single interface to 9K again, we lose all iSCSI servers on SP A; the one on SP B keeps working fine. So even an iSCSI server with an unchanged MTU, running on its own set of interfaces, loses all connectivity. NFS over SP A and SP B keeps working fine, using the same interfaces as the iSCSI servers.
We need another reboot of SP A to get things back on their feet. After the reload, the iSCSI server is back online and the iSCSI datastores are available again.
We're running the latest version of the software on the VNXe.
(almost forgot why I love my FC network so much, now I know again)
bulgie_f8cfe5
4 Posts
0
November 10th, 2011 14:00
I'm pretty sure we had the same symptoms whether jumbo frames were enabled or not. I did switch the MTU back and forth a couple times during troubleshooting but it appeared to neither cause nor cure the problem.
I got an improvement by giving both eth2 and eth3 IPs in the same subnet instead of two subnets. Then my ESXi hosts reliably saw two paths to each LUN. But for some reason, even though the topology looked like we should have had complete switch redundancy, we didn't: if I turned off one switch to simulate a failure, both hosts lost all their storage. BTW, we had two iSCSI ports on each host, correctly bound to one physical NIC each.
Next I decided to add more NICs. I added two more physical NICs per host and made two more iSCSI ports, again each bound to one pNIC, for a total of four per host, all on the same subnet. THIS WORKED, though I can't tell you why. We have switch redundancy now: both hosts have multiple paths to each LUN even with one switch turned off. With regard to the comments from jjgrinwis -- we have jumbo frames on (MTU = 9000).
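In case anyone wants to reproduce it, the extra iSCSI ports were created and bound roughly like this (the port group, vmk, vmnic, IP and vmhba names are placeholders; each new port group gets exactly one active uplink before binding):

  # new port group and VMkernel port for the third iSCSI path
  esxcli network vswitch standard portgroup add -p iSCSI-3 -v vSwitch1
  esxcli network ip interface add -i vmk3 -p iSCSI-3
  esxcli network ip interface ipv4 set -i vmk3 -I 10.10.1.13 -N 255.255.255.0 -t static
  # pin the port group to a single active uplink
  esxcli network vswitch standard portgroup policy failover set -p iSCSI-3 -a vmnic4
  # bind the new vmk port to the software iSCSI adapter and rescan
  esxcli iscsi networkportal add -A vmhba33 -n vmk3
  esxcli storage core adapter rescan -A vmhba33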
It should have worked with just two iSCSI NICs per host, and it should have worked when we had two subnets instead of one, but I don't care now. It's working, and I consider the case closed.
Thanks all.