
July 26th, 2013 06:00

iSCSI Connection Failure Causing ESXi 5.1 VMs to Shut Down

Hi all,

 

I was hoping someone may have seen the problem we've encountered over the last two weeks.

Our VMs are shutting down unexpectedly. Our Exchange server has rebooted twice today :(

 

The timing of the shutdowns matches entries we are seeing in the EqualLogic logs:

Severity  Date and Time          Member    ID                        Message                                                                                                                                                                                                                                                                                                   
--------  ---------------------  --------  ------------------------  --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 
 INFO     26/07/2013 9:35:19 AM  APCEQLM1  7.2.15 | 7.2.24 | 7.2.29  iSCSI session to target '192.168.43.12:3260, iqn.2001-05.com.equallogic:4-52aed6-c7ca4349e-87b00217be35059e-lon-prod-vm-1' from initiator '192.168.43.31:58797, iqn.1998-01.com.vmware:apcesxhost1-24c73c18' was closed. | iSCSI initiator connection failure. | No response on connection for 6 seconds. 

 

We have one PS4100X running the latest EQL firmware, 6.0.5, and two stacked PowerConnect 6224 switches. Our array and switches were configured by Dell support.

 

We have followed Dell's recommendations on connecting vSphere to iSCSI storage. As far as I can see, our vSwitches are configured correctly.

 

Does anyone have any ideas why we're seeing this behavior? I've logged this with Dell support but thought I'd ask on the forums in case anyone has experienced this issue.

 

Thanks in advance!

 

Lee

7 Technologist • 729 Posts

July 26th, 2013 07:00

First, ensure you can ping/traceroute from the ESX host(s) to every eth interface on the array:

ping -I sourceIP destinationIP (source is one of the iSCSI interfaces on the ESX host, destination is each eth interface on the array(s)). You need to test all possible combinations!

Then do the same ping test from each array member to each iSCSI interface on your ESX hosts (telnet to an eth interface on the member; do this for all members and all possible combinations). The traceroute command on the array is a bit different: support traceroute -s sourceIP destinationIP (source would be each eth interface, destination would be each iSCSI interface on all ESX hosts connected to the array group).
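For the ESX side, vmkping lets you source the ping from a specific iSCSI VMkernel port. A rough sketch, assuming the iSCSI vmkernel ports are vmk1/vmk2 and the array eth ports are 192.168.43.11/.12 (substitute your own names and addresses):

# Ping every array eth port from every iSCSI vmkernel port
vmkping -I vmk1 192.168.43.11
vmkping -I vmk1 192.168.43.12
vmkping -I vmk2 192.168.43.11
vmkping -I vmk2 192.168.43.12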

Ensure you are using the Group IP for the iSCSI discovery address on your storage adapter.
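You can double-check the discovery address from the command line (vmhba33 below is a placeholder for your software iSCSI adapter name):

# List the dynamic discovery (send targets) addresses -- this should show the Group IP only
esxcli iscsi adapter discovery sendtarget list --adapter=vmhba33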

Don has written several posts on configuration, and some of the details/how-tos are covered in this forum (see Don's comments):

en.community.dell.com/.../20008239.aspx

Configure iSCSI with MPIO using this PDF:  

www.dellstorage.com/.../DownloadAsset.aspx

If you are convinced that you have done all of these, I would open a support case.

-joe

7 Technologist • 729 Posts

July 26th, 2013 07:00

Also, since you stated that this started two weeks ago, you might find that your network topology changed (a failover to the array's secondary controller is a typical example). This might indicate a configuration issue with the way the ESX hosts and/or the array's secondary controller (now active) is cabled/configured to the switches, or the inter-switch connection could be an issue too. The ping test should identify where the problem is.

-joe

24 Posts

July 28th, 2013 18:00

Hi Joe,

Thanks very much for the suggestions. I'll work through the troubleshooting steps you mentioned and hopefully they will point me in the right direction.

Thanks

Lee

24 Posts

July 28th, 2013 21:00

Hi Joe,

I've run ping tests across all combinations from both hosts to the array, and all were successful. The traceroute was also successful, although I received a warning about multiple interfaces being found; I believe this is because we have two vmnics assigned to our management network, one of which is on standby.

From the array, the ping tests were successful as well, across all combinations with no packet loss. The traceroute from the array failed, though.

Our interfaces are on the 192.168.43.x network, and the array points to a gateway of 192.168.43.1, which doesn't exist. The Dell engineer who set this up explained that the gateway was not relevant, so we accepted this. I'm not sure whether that is the cause of the problem; the ping tests work fine.

Stupidly, I forgot to mention we had some power issues recently due to a faulty UPS, and the EQL rebooted. I don't know whether the array is now using a different controller. However, the array is cabled according to Dell best practices for redundancy, so I'm not sure it makes any difference.

The only other thing I noticed was that the MTU for the iSCSI vSwitch on my second host was still at 1500, so I changed it to 9000. I'm not sure that would cause this problem, but either way it needed to be changed.
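In case it helps anyone, this is roughly what I ran from the ESXi shell (vSwitch1/vmk1 and the array IP below are placeholders for our actual names; note the vmkernel ports need to be set to 9000 as well, not just the vSwitch):

# Raise the MTU on the vSwitch and on each iSCSI vmkernel port
esxcli network vswitch standard set --vswitch-name=vSwitch1 --mtu=9000
esxcli network ip interface set --interface-name=vmk1 --mtu=9000

# Confirm jumbo frames end to end (8972 = 9000 minus IP/ICMP headers; -d means don't fragment)
vmkping -d -s 8972 -I vmk1 192.168.43.11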

I've logged this with DELL so hopefully they can shed some light on the issue.

Thanks.

Lee

5 Practitioner • 274.2K Posts

July 29th, 2013 06:00

What's the FW on the 6224 switches?  

A 6-second timeout means that the server failed to respond to keepalive packets. This is part of the iSCSI spec and works very much like a ping test: each side periodically pings the other, and when that fails, the iSCSI session is dropped. When you reboot a server you will see these errors as well.

24 Posts

July 29th, 2013 21:00

Hi Don,

Firmware on the switches is 3.3.4.1

When you say the server failed to respond to keepalive packets, do you mean the physical server or the VM? Just making sure I understand.

We didn't reboot the physical server or the VM, except of course when it shut down unexpectedly.

Thanks

Lee

24 Posts

July 29th, 2013 22:00

Hi Don,

Thanks for confirming re: keepalive. We use VMDKs, so ESX it is. I'll aim to get the firmware updated on the switch ASAP.

Regarding the ESX servers and EQL array, neither uses an NTP time source. Again, I will look to rectify this. Is there an NTP service that Dell recommends?

Thanks for the advice.

Lee

5 Practitioner • 274.2K Posts

July 29th, 2013 22:00

Re: Keepalive. Whoever is in charge of the iSCSI session handles this. If it's a VM using VMDKs, then it's ESX; if the VM is using its own native iSCSI adapter, then the VM is responsible for those sessions.

Re: Switch.  That firmware needs to be updated.  

Are the ESX servers and EQL array connected to the same NTP time source? If not, they should be.

5 Practitioner • 274.2K Posts

July 30th, 2013 09:00

Re: NTP. Not really; there are some public ones out there, pool.ntp.org for example. Setting up NTP is important: the clocks on ESX servers commonly drift, especially under heavy CPU load.

Your AD servers, Exchange, and SQL should all be on the same NTP servers.
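If you'd rather set it from the ESXi shell than the vSphere Client, something along these lines works (a sketch only; pool.ntp.org is just the example server):

# Point the host at an NTP server, then enable and restart the NTP daemon
echo "server pool.ntp.org" >> /etc/ntp.conf
chkconfig ntpd on
/etc/init.d/ntpd restart

The EqualLogic group's NTP server is set separately, through Group Manager.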

24 Posts

July 30th, 2013 22:00

Thanks Don,

FYI, DELL support came back to me after reviewing the logs of our EQL and switches. They said the Equallogic looked fine as it is presenting itself to the hosts. However they think there is a loop in our network as the switch logs are flooded with these messages

<189> JUL 26 08:10:04 192.168.42.48-1 TRAPMGR[151077744]: traputil.c(611) 647 %% 2/0/1 is transitioned from the Forwarding state to the Blocking state in instance 0

<189> JUL 26 08:10:04 192.168.42.48-1 TRAPMGR[151077744]: traputil.c(611) 648 %% 2/0/2 is transitioned from the Forwarding state to the Blocking state in instance 0

We don't have a network engineer, as we're a small IT team, but I'm assuming that if there is a loop, the switches could be overloaded and dropping the iSCSI traffic.

I don't know much about the switch configuration other than what the engineer sent through after the configuration was completed. I'm going to re-check the cabling setup and also read through the iSCSI optimization whitepaper for Dell PowerConnect switches to make sure everything looks OK.
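In case anyone else hits this: those Forwarding/Blocking transitions are spanning tree state changes. From what I can tell, a common recommendation for iSCSI edge ports (host- and array-facing) is to enable portfast so they skip the listening/learning delay, while spanning tree still protects the inter-switch links. Something like this on a 6224 (1/g1 is just a placeholder port):

console# show spanning-tree
console# configure
console(config)# interface ethernet 1/g1
console(config-if)# spanning-tree portfast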

Thanks.

Lee

5 Practitioner • 274.2K Posts

July 31st, 2013 12:00

A loop is a lot worse than that. A loop can totally mess up a network: the switch uses its MAC address table to forward packets to the correct port, and if you loop the switch, MAC addresses will incorrectly show up on the looped port, so packets will go there instead.

That's why, when you have multiple cables interconnecting switches, you have to trunk them together. Otherwise you could loop those switches.
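As a sketch, on a PowerConnect 6224 you would put the inter-switch ports into a LAG, something like the following (port numbers are placeholders; with stacked 6224s the stacking modules carry inter-switch traffic, so this only applies to any extra cross-connect cables):

console(config)# interface ethernet 1/g23
console(config-if)# channel-group 1 mode auto
console(config-if)# exit
console(config)# interface ethernet 1/g24
console(config-if)# channel-group 1 mode auto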

24 Posts

July 31st, 2013 17:00

Hi,

The EQL engineer spoke to the networking team, and they advised that the items in the log output are out of date (the timestamps were somehow missed), so they are nothing to be concerned about.

He says that after going over the logs, the network config actually looks good and that we should check the host side: ensure MEM (the EqualLogic Multipathing Extension Module) is configured and in use, etc. Unfortunately, we don't have an Enterprise license, so we have to use native Round Robin.

I'm following up with VMware now to see if they can shed any light on this.

Thanks for the advice on this issue.

Lee

5 Practitioner • 274.2K Posts

July 31st, 2013 20:00

In the forum posts that Joe linked, there is a script that will optimize the VMware Round Robin IOs-per-path setting. The default is 1000, so until 1000 IOs are sent, it won't switch to another path; this has basically the same effect as FIXED pathing. We suggest changing it to 3 instead, so more IOs will actually be in flight at one time.
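If you'd rather do it by hand than run the script, this is the general idea on ESXi 5.x (naa.xxxx is a placeholder for each EqualLogic volume's device ID):

# Find the EqualLogic device IDs (naa.*)
esxcli storage nmp device list

# Switch paths every 3 IOs instead of the default 1000
esxcli storage nmp psp roundrobin deviceconfig set --device=naa.xxxx --type=iops --iops=3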

Regards,

24 Posts

August 1st, 2013 22:00

Thanks for the info, I'll have a read and change the setting. Great tip!

24 Posts

August 4th, 2013 19:00

Just thought I'd post an update on this thread for anyone else that has experienced this issue. VMware has advised that the issue we are seeing is related to a known issue with the tg3 network driver.

See this KB for more info: kb.vmware.com/.../search.do?language=en_US&cmd=displayKC&externalId=2035701
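For anyone wanting to check whether their hosts are affected, you can confirm the NIC driver and version from the ESXi shell (vmnic0 is a placeholder; check each NIC used for iSCSI):

# Show which driver each NIC uses
esxcli network nic list

# Show driver name/version details for one NIC
esxcli network nic get --nic-name=vmnic0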

Thanks

Lee
