5 Posts

May 16th, 2013 12:00

Thanks Don.  I've had a case open for a week or so and they are reviewing the logs now.  I was just wondering if anyone had similar experiences.

6 Posts

May 31st, 2013 08:00

I'm not 100% sure it's the same issue we're seeing, but it sounds like it might be related. We get the same events logged in the EQL event log, and the vmkernel logs look more or less the same (sorry, there is far too much text to check it line by line).

We have:

  • 3 EQL boxes in a group, all running 6.0.2 (R305616) (H2) firmware.
  • A six-host ESXi HA cluster running 5.0 U1, build 623860.
  • Hosts are all PowerEdge R815 with six NICs: two Intel and four onboard Broadcom.
  • The iSCSI SW adapter is bound to one NIC of each type, which should rule out issues with any single VMware NIC driver or firmware.
  • Both vmnics are on a single switch, and each is pinned to its own vmk port (vmk2 and vmk3).
  • We also have a vmk1 port with a lower IP bound to both physical NICs.
  • Delayed ACK has not been disabled, but testing has shown this is not the cause. Our bandwidth usage is nowhere near high enough for it to be a concern (I like over-engineering).
  • The EQL boxes and hosts are both connected to a four-switch Cisco 3750G stack. The stack carries two VLANs: one for VM management and monitoring, the other for iSCSI and an NFS host. Client traffic goes through another switch.
  • Flow control is enabled end-to-end.
  • Jumbo frames are in use.
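On the jumbo-frames point, the usual end-to-end check is an unfragmented vmkping with the largest payload that fits in one frame; the arithmetic is simple but easy to get wrong by a header. A quick sketch of it (the header sizes are standard IPv4/ICMP, not anything EqualLogic-specific):

```python
# Largest ICMP payload that fits in a single unfragmented frame:
# MTU minus the IPv4 header (20 bytes) and the ICMP header (8 bytes).
def max_ping_payload(mtu: int) -> int:
    IP_HEADER = 20
    ICMP_HEADER = 8
    return mtu - IP_HEADER - ICMP_HEADER

print(max_ping_payload(9000))  # 8972 -> on ESXi: vmkping -d -s 8972 <peer vmk IP>
print(max_ping_payload(1500))  # 1472
```

If `vmkping -d -s 8972` (don't-fragment, 8972-byte payload) fails between hosts and array on a 9000-byte MTU path, jumbo frames are not actually working end-to-end somewhere on that path.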

We also get periodic reports from the ESXi hosts of path redundancy being lost. We can actually reproduce these by rebooting an individual ESXi host: once the host has rebooted, it will lose path redundancy on at least one volume as it runs the load balancing. We used this to eliminate delayed ACK as a possible cause. It is a really odd issue, as it seems to have no effect on performance. We also tried reverting MEM from 1.1.2 back to 1.1.0, without success. Despite that, I've still got a number of things to eliminate before I'm ready to open a support call.

The thing that particularly bothers me is that the disconnect seems to happen before the connect, which is (in my view) logically the wrong way round for a load-balancing process.
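The disconnect-before-connect ordering can be checked mechanically rather than by eyeballing thousands of log lines. Here is a rough sketch of the idea; the log format below is invented for illustration, so a real vmkernel.log would need a different regex and timestamp format:

```python
import re
from datetime import datetime

# Hypothetical, simplified log lines. Real vmkernel.log entries look quite
# different; adjust the regex and timestamp parsing for an actual log file.
LINE_RE = re.compile(
    r"^(?P<ts>\S+)\s+iscsi:\s+(?P<event>connect|disconnect)\s+(?P<vol>\S+)$"
)

def pair_events(lines):
    """Pair up consecutive events per volume; report gap and ordering.

    Simplification: events for a volume are paired strictly two at a time,
    which is enough to see whether a disconnect precedes its reconnect.
    """
    pending = {}   # volume -> (timestamp, event) of the first unpaired event
    results = []
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue
        ts = datetime.strptime(m["ts"], "%Y-%m-%dT%H:%M:%S")
        vol, event = m["vol"], m["event"]
        if vol not in pending:
            pending[vol] = (ts, event)
        else:
            first_ts, first_event = pending.pop(vol)
            gap = (ts - first_ts).total_seconds()
            results.append((vol, gap, f"{first_event}-then-{event}"))
    return results

log = [
    "2013-06-03T14:00:05 iscsi: disconnect eql-esx-vol-c",
    "2013-06-03T14:00:12 iscsi: connect eql-esx-vol-c",
]
print(pair_events(log))  # [('eql-esx-vol-c', 7.0, 'disconnect-then-connect')]
```

Run over a real log this would show, per volume, whether the logout consistently lands first and how long the gap is, which would line up with the "under 20 seconds" events mentioned later in this thread.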

Did you get anywhere with your support call?

129 Posts

June 3rd, 2013 14:00

Hi,

I am seeing what looks like the exact same issue. I have a very similar setup: a 5-node cluster (5.0 U2) and a 6-member EQL array (6.0.2), and I have implemented all the recommended ESX/iSCSI settings from this forum. Fairly frequently I see path-redundancy-lost messages from vCenter, but when I check, all paths are active/OK. I can trigger these alerts by creating new EQL volumes and presenting them to the ESX hosts, so it looks like rebalancing, and it settles immediately. I have not opened a case yet, as the site is still under development. I would be interested to know if anyone gets to the bottom of this, as it is annoying if nothing else.

5 Posts

June 3rd, 2013 14:00

Hi Azriphale,

This ticket is still open. I continue to see the path-redundancy-lost messages across my hosts. Initially, I didn't have the delayed ACK and iSCSI timeout settings set correctly on each node, but I've corrected that and also disabled LRO on each host, and the issue persists. Again, I'm not seeing any kind of performance problem or I/O loss during these events, which usually last less than 20 seconds. I suspect they are MEM or NLB events, but I don't see that stated clearly in the EQL group logs or the vmkernel log. I'll update when we get it sorted out.

Jake

129 Posts

June 4th, 2013 02:00

Do you know if there is a way to reduce the vCenter spam in this scenario, especially when the "problem" is just a legitimate rebalancing of resources?

5 Posts

June 4th, 2013 05:00

Does the EQL group note the event in its log as a rebalance? The first entry I see in the group event log is usually something like this:

iSCSI session to target '192.168.90.102:3260, iqn.2001-05.com.equallogic:4-52aed6-dc615d168-d000021a7ff4f6cb-eql-esx-vol-c' from initiator '192.168.90.42:56606, iqn.1998-01.com.vmware:VHOST04-1123e289' was closed.   Logout request was received from the initiator.

I thought I saw someone post an entry online (can't find it now) that showed the array actually requesting the rebalance which was then followed by the logouts/logins.
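Since the close reason ("Logout request was received from the initiator" versus something array-initiated) is the interesting part, it can help to bulk-parse the group event log and tally reasons rather than reading entries one by one. A sketch fitted to the sample entry above (the field names are mine, not EqualLogic's, and other event types would need their own patterns):

```python
import re

# Pattern fitted to the one sample entry quoted above; it captures the
# target portal/IQN, initiator portal/IQN, and the trailing close reason.
SESSION_RE = re.compile(
    r"iSCSI session to target '(?P<target_portal>[^,]+), (?P<target_iqn>[^']+)' "
    r"from initiator '(?P<init_portal>[^,]+), (?P<init_iqn>[^']+)' was closed\."
    r"\s+(?P<reason>.*)"
)

def parse_session_close(line):
    """Return a dict of fields from a session-close entry, or None."""
    m = SESSION_RE.search(line)
    return m.groupdict() if m else None

entry = (
    "iSCSI session to target '192.168.90.102:3260, "
    "iqn.2001-05.com.equallogic:4-52aed6-dc615d168-d000021a7ff4f6cb-eql-esx-vol-c' "
    "from initiator '192.168.90.42:56606, "
    "iqn.1998-01.com.vmware:VHOST04-1123e289' was closed.   "
    "Logout request was received from the initiator."
)
info = parse_session_close(entry)
print(info["reason"])  # Logout request was received from the initiator.
```

If the reasons are overwhelmingly initiator logouts, that points at the host side (MEM session rebalancing) rather than the array closing sessions itself.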

Operator

2.4K Posts

June 4th, 2013 09:00

I manage around a dozen ESXi cluster + (single) EQL installations, and as soon as we enable the vCenter alarms we get the email notifications about "lost path" for random datastores on random hosts. The host reconnects without a problem in the same second. We see this anywhere from once a day to once a week. For customers who are too nervous, we disable the notification :(

Only in one case have we seen EQL ALB go crazy. A fourth array (same RAID level and capacity) was added to a pool, and we saw the LUN migration to the new member as expected. Two weeks later we deployed a volume, and then all hell broke loose: the group manager started playing "ping pong", moving volumes between the members.

With the "soft limit" of spreading a volume over up to three members, every time a volume leaves a member, all 10 ESXi hosts produce email notifications about losing the connection. Since we only have 1 and 2 TB volumes, the moving process takes some time; during it, SANHQ shows IOPS/traffic for the volume as a flat line. As soon as one migration finished, shortly afterwards another volume would start moving.

This went on for about 6-8 weeks, and Dell support said this is expected behaviour :/ The moving stopped 3 weeks later.

Regards,

Joerg
