Unsolved

September 17th, 2014 13:00

RHEL 6.4 + HIT 1.3 + FW 6.0.11: iSCSI connections sometimes not restored

Hi, I've got several M620 servers connected to a PS array running firmware 6.0.11. The OS is RHEL 6.4 x64 and EQL HIT 1.3 is installed. Each server uses two 10 Gb BCM NICs for iSCSI, configured with jumbo frames.

The servers can mount the volumes defined on my PS and multipath is running fine (ehcmcli shows two connections per volume). However, when testing iSCSI failover by "killing" one iSCSI switch at a time, I have noticed that when the switch comes back online, sometimes (and randomly!) the second iSCSI connection is not restored.

The only way to get all of my iSCSI connections up and running again is to restart the ehcmd daemon on the host.

The event log on my PS is clear, no errors reported. Any idea?

5 Practitioner • 274.2K Posts

September 17th, 2014 14:00

What model array do you have?   Possibly a 6110/4110?

5 Practitioner • 274.2K Posts

September 17th, 2014 14:00

I would suggest opening a support case so the logs from array and host can be reviewed.

How long do you wait before restarting the service?   Every two minutes the service will check the current connections.   I would expect the open-iscsi initiator to see that a connection was lost and needs to be restored.

Regards,

29 Posts

September 17th, 2014 14:00

Waiting about 30 minutes didn't help :( , the only way was a restart of ehcmd.

29 Posts

September 17th, 2014 14:00

PS-M4110XS

5 Practitioner • 274.2K Posts

September 17th, 2014 16:00

Single-interface arrays have different rules with HIT, but I think a review of the logs is needed. With single-interface arrays, HIT will create and remove connections as needed based on load. Since there's just the one interface with vertical failover, there's no requirement for two connections.

29 Posts

September 17th, 2014 23:00

Donald, first of all many thanks for your answer.

I'm going to get the logs of my hosts and then I'll post the relevant info.

Thanks again.

5 Practitioner • 274.2K Posts

September 18th, 2014 09:00

You are very welcome.  I hope you do submit a support request for this issue.  If there is a problem, the sooner the dev folks get it the better.  

29 Posts

September 19th, 2014 00:00

Hi, some updates and more detailed info: we have opened an SR (# 900766879) and sent the requested logs, but so far we were only told to enable STP on our switches (...) and that maybe the culprit is some kind of misconfiguration on the hosts. The SR should now be escalated, but maybe you could get more info than me about this SR :D .

Reproducing the issue with a colleague, we noticed that:

1) We restart the "A" switch. When the switch is up and running again, the initiator reestablishes two connections.

2) A few minutes later we restart the "B" switch too: again, the initiator reestablishes two connections as soon as the second switch comes back online.

3) Now (and this is the issue) if we restart the "B" switch again, when the switch comes back online the host immediately reestablishes both iSCSI connections, but a few seconds later the initiator disconnects the second connection and uses only one path. I'd like to point out that this seems to be a "clean" logout by the initiator; we see the logouts in the EQL event log. At this point, only restarting the ehcmd daemon on the hosts restores both iSCSI connections.

I understand that this is not a "real world" scenario (a triple switch failure within a few minutes; I've never seen such a thing before), but, due to stiff business requirements, I have to configure a 99.99999999999...% available system. Actually, I don't know whether this should be considered an issue or whether it is just the normal behavior of EQL HIT with a one-port SAN like the M4110XS.

I think my hosts are correctly configured, performance is really good, and the storage survives a "normal" failover scenario with no problems; I followed EQL's best practices when configuring switches, hosts and storage, as I usually do.
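
To make the reproduction above easier to track, here is a minimal Python sketch that counts iSCSI sessions per target from the text printed by `iscsiadm -m session`. It assumes the usual one-line-per-session format (`tcp: [id] portal,tpgt target-iqn`); the sample text, IP address, and volume names are hypothetical and not taken from the hosts in this thread.

```python
import re
from collections import Counter

# Matches lines such as:
#   tcp: [1] 10.10.5.10:3260,1 iqn.2001-05.com.equallogic:example-vol1 (non-flash)
# and captures the target IQN (the fourth whitespace-separated field).
SESSION_RE = re.compile(r"^\S+:\s+\[\d+\]\s+\S+\s+(\S+)")


def sessions_per_target(iscsiadm_output: str) -> Counter:
    """Count sessions per target IQN in `iscsiadm -m session` style text."""
    counts = Counter()
    for line in iscsiadm_output.splitlines():
        match = SESSION_RE.match(line.strip())
        if match:
            counts[match.group(1)] += 1
    return counts


def targets_below(counts: Counter, expected: int = 2) -> list:
    """Return targets that currently have fewer than `expected` sessions."""
    return sorted(t for t, n in counts.items() if n < expected)


# Hypothetical sample output: vol1 has two sessions, vol2 has only one.
SAMPLE = """\
tcp: [1] 10.10.5.10:3260,1 iqn.2001-05.com.equallogic:example-vol1 (non-flash)
tcp: [2] 10.10.5.10:3260,1 iqn.2001-05.com.equallogic:example-vol1 (non-flash)
tcp: [3] 10.10.5.10:3260,1 iqn.2001-05.com.equallogic:example-vol2 (non-flash)
"""

if __name__ == "__main__":
    counts = sessions_per_target(SAMPLE)
    for target in targets_below(counts):
        print(f"{target}: only {counts[target]} session(s)")
```

On a real host the text would come from running `iscsiadm -m session` after each switch restart. A target with no sessions at all does not appear in that output, so it would not be flagged by this sketch.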

5 Practitioner • 274.2K Posts

September 19th, 2014 05:00

Hello,

This sounds more and more like expected behavior with a single-port array. One way to verify this is to remove the HIT kit from the equation, use the standard Linux MPIO instead, and repeat the test.

If the connection is logged out with "logout request from initiator" or similar message, that is HIT/LE logging out the connection on purpose.

With vertical failover support, even with one connection to the array, on a switch failure the existing connection is maintained by using the physical port of the other controller, which is connected to the surviving switch.

Re: 99.9999   I'm always curious to know how companies configure their environment to reach even seven nines. 99.99999% is about 3.15 seconds of unavailability per year in total. In the case of EQL, a FW upgrade or CM failover will exceed that number. What do you do on your hosts to achieve this? At another company I moved the majority of servers from Windows to Linux. I ended up with greater uptime, but I could never get even seven nines without clustering everything and completely mirroring all storage. Just rebooting for updates defeated it.

Even the MPIO in Linux/Windows takes longer than that to determine that a path has failed and resume on the remaining paths.
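
For reference, the downtime arithmetic can be checked with a short Python sketch. It assumes a 365-day year; the function name is only illustrative.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000 seconds in a 365-day year


def downtime_seconds_per_year(nines: int) -> float:
    """Allowed downtime per year for an availability made of `nines` nines.

    Example: nines=7 means 99.99999% availability, i.e. an
    unavailability fraction of 10**-7.
    """
    unavailability = 10.0 ** (-nines)
    return SECONDS_PER_YEAR * unavailability


if __name__ == "__main__":
    for n in (4, 5, 7, 9):
        print(f"{n} nines: {downtime_seconds_per_year(n):.4f} s/year")
```

Seven nines works out to roughly 3.15 seconds per year, and nine nines to roughly 0.03 seconds, so any failover that takes longer than that consumes the whole yearly budget in a single event.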

Regards,