Unsolved

December 9th, 2012 18:00

Randomly losing access to drives under Windows

We've been using Equallogic for years, but only this year have we started having major issues.

We have multiple groups and multiple arrays/models.  There are standalone Windows and Linux file servers that have iSCSI LUNs attached, ESX servers and then standalone and clustered SQL servers.  It's the database servers that have issues.  No issues with others that we've encountered.

Occasionally, a random database server will start to experience slow I/O 833 errors accompanied by iScsiPrt 9, 39 and 129 errors in the system log.  Sometimes access to the volume still works for a period of time, and other times it just hangs the server to the point where it has to be power cycled.  With clustered servers, graceful failover never works, and the cluster service has to be killed in order to break ownership of the LUN.  
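To line these events up across servers, here is a minimal sketch in Python. It assumes the System log has already been exported to records with `source` and `event_id` fields. That record layout and the sample rows are hypothetical; only the iScsiPrt event IDs 9, 39 and 129 come from the post.

```python
from collections import Counter

# Event IDs named in the post. The record layout is an assumption.
ISCSIPRT_IDS = {9, 39, 129}

def count_iscsiprt_events(records):
    """Count iScsiPrt events of interest, keyed by event ID.

    `records` is an iterable of dicts with the hypothetical keys
    'source' and 'event_id' (for example, rows of an exported log).
    """
    counts = Counter()
    for rec in records:
        if rec.get("source") != "iScsiPrt":
            continue
        event_id = int(rec.get("event_id", -1))
        if event_id in ISCSIPRT_IDS:
            counts[event_id] += 1
    return dict(counts)

# Made-up sample rows, for illustration only.
sample = [
    {"source": "iScsiPrt", "event_id": 9},
    {"source": "iScsiPrt", "event_id": 129},
    {"source": "iScsiPrt", "event_id": 9},
    {"source": "iScsiPrt", "event_id": 7},    # not one of the IDs of interest
    {"source": "Disk", "event_id": 129},      # different source, ignored
]
summary = count_iscsiprt_events(sample)
```

Running the same count per server and per time window would show whether the three event IDs always appear together when a volume hangs.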

This issue only exists with a single LUN out of many that are presented and attached, and only to a single server at any time.

I've recently found that if I toggle the volume offline and back online quickly, the errors stop instantly and access and functionality are restored until the next time it happens.

This isn't tied to any particular version or patch/hotfix level of Windows 2003 or 2008 R2.  Different models of servers and NICs experience this, as well as different switches.  There are PS6510X, 6110XV, 6010XVS, and 6010XV arrays in play.  Firmware versions for each group include 5.2.1, 5.2.2, and 5.2.5.

We're about ready to throw Equallogic out of the environment if we can't find a fix.  We've had an ongoing case open for nearly 7 months now and have had Dell onsite to troubleshoot.  Has anyone else experienced this?  Were you able to resolve?

7 Technologist

 • 

729 Posts

December 10th, 2012 05:00

Strongly suggest that you update to v5.2.6 (or go to the v6 FW which is 6.0.2).

Also check the iSCSI disk timeouts.  The file is located on the FW download page titled “iSCSI Initiator And Operating System Considerations”.  Since you are running multiple versions of the firmware, it’s best to review each one and ensure nothing has changed from 5.2.1 to 5.2.5.  

Verify all of your host adaptors are configured properly, and running the latest drivers and firmware.

If you post the make and model of the interfaces you are having issues with, I may have additional settings to try.

Verify that your switch is configured for iSCSI on the host and array interfaces.  Ensure flow control is configured, and if supported by your switch, enable jumbo frames.  A quick check for this is here (locate your switch): en.community.dell.com/.../3615.rapid-equallogic-configuration-portal-by-sis.aspx
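One way to confirm jumbo frames work end to end is a do-not-fragment ping with the largest payload that fits the MTU. A minimal sketch of the arithmetic, assuming IPv4 without IP options and an MTU of 9000 (the thread does not state the MTU in use, so that value is an assumption):

```python
IPV4_HEADER_BYTES = 20  # IPv4 header without options
ICMP_HEADER_BYTES = 8   # ICMP echo request header

def max_ping_payload(mtu: int) -> int:
    """Largest ICMP echo payload that fits in one unfragmented IPv4 packet."""
    overhead = IPV4_HEADER_BYTES + ICMP_HEADER_BYTES
    if mtu < overhead:
        raise ValueError("MTU is smaller than the IPv4 + ICMP headers")
    return mtu - overhead

jumbo_payload = max_ping_payload(9000)     # assumed jumbo MTU
standard_payload = max_ping_payload(1500)  # standard Ethernet MTU
```

On a Windows host, that payload size would go into something like `ping -f -l 8972 <array-ip>`, where `-f` sets the don't-fragment flag and `-l` sets the payload size. If that ping fails while a 1472-byte one succeeds, some hop in the path is probably not passing jumbo frames.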

Ensure you have the latest os/firmware on your switches.  Ensure they are approved: en.community.dell.com/.../2632.storage-infrastructure-and-solutions-team-publications.aspx.  In the list, click “EqualLogic Compatibility Matrix”.  Also on this page is specific information on configuring your switches.

Beyond that, if you still can’t keep the hosts connected, I would suggest opening a support case, so that we can look at the array diags to see if further information can be found there.

-joe

27 Posts

December 10th, 2012 05:00

Arista 7504, 7124, Cisco Nexus 5596, PowerConnect 6200 switches have all been used (different environments).

Jumbo frames are used.  STP is configured correctly (MSTP).  No topology changes.  Have tried flow control on, off, and TX only (EQL arrays do not send pause frames, but will honor them if received).  DCB is disabled.  

Our case is with an escalation manager who has been great, but he's stumped too.  This isn't your typical misconfiguration, firmware, or driver issue.  There is something inherently wrong.  If there were something systemic, I would expect ALL of the volumes on a given server to have issues, vs. a single volume out of the 4-12 presented and attached.  

127 Posts

December 10th, 2012 05:00

Hmmm... if he has had a case open for seven months, I don't think some diags will help, nor do I think it's the firmware. In addition, he says that forcing the volume offline and online again solves it for some time. So, with all this in mind, the first things that come to my mind when it comes to randomly losing access are:

a) misconfigured spanning tree

b) jabber (caused by misconfigured or "incompatible" jumbo frames)

c) misconfigured flow control

So....

I'd like to know what kind of switches sit between the EQL arrays and the machines. In addition: Are jumbo frames used? Is spanning tree configured? If yes, in what mode: portfast, RSTP? Do you see spanning tree topology changes that cause a global recalculation (not the normal topology changes, you will see thousands of those, THIS IS OK!!! I'm talking about the ones that cause a global recalculation)? Is flow control configured on the switch? Maybe DCB is turned on on your EQL?

Best regards,

Joerg

7 Technologist

 • 

729 Posts

December 10th, 2012 06:00

Have you tried to do a packet trace yet (wireshark)?

-joe

127 Posts

December 10th, 2012 06:00

The issue appears with Broadcom AND Intel NICs? And jumbo frames were turned off for one test and the issue still happened again? No STP blocks during these phases?

Puuuuh - that seems very strange to me.

My gut instinct definitely points to some switch issue. With all this in mind, I don't think it's the EQLs or the hosts.

127 Posts

December 10th, 2012 06:00

And this Cisco tcp-map command isn't accidentally set to drop on your 5596?

window-variation {allow | drop}

Sets the action for a connection that has changed its window size unexpectedly. The window size mechanism allows TCP to advertise a large window and to subsequently advertise a much smaller window without having accepted too much data. From the TCP specification, "shrinking the window" is strongly discouraged. When this condition is detected, the connection can be dropped.

(Default) The allow keyword allows connections with a window variation.

The drop keyword drops connections with a window variation.

127 Posts

December 10th, 2012 06:00

Seems like a cool jigsaw ;-)

But fun aside - wow, that is really strange!!!!

Is there ANYTHING in common during these phases?

Just thinking out loud:

Is the HIT Kit with multipathing in play on the hosts? Or on some hosts? Or can we rule that out as well?

127 Posts

December 10th, 2012 06:00

Oh, I'm sure you've checked this already, but just to be sure: there is NO switch with iSCSI-hostile features like ARP guard, storm control, or some other throttling or protection mechanism configured, right?

127 Posts

December 10th, 2012 06:00

How often do these things happen? Could you schedule a test window in which you disable jumbo frames on just one affected machine and observe what happens? Or did you already do that?

Best regards,

Joerg

27 Posts

December 10th, 2012 06:00

We did manage to grab wireshark captures once while it was happening, and it appeared that one of the arrays was sending zero byte window scaling messages.  Other than that, no smoking gun.
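For anyone re-checking such a capture, a minimal sketch in Python of how a zero-window advertisement can be spotted in a raw TCP header. The window is the 16-bit big-endian field at byte offset 14, and a value of 0 means the sender cannot accept more data, whatever window scale factor was negotiated. `build_header` is a hypothetical helper that only fabricates test headers; it is not taken from the captures discussed here.

```python
import struct

TCP_MIN_HEADER_BYTES = 20
WINDOW_OFFSET = 14  # byte offset of the 16-bit window field

def tcp_window(tcp_header: bytes) -> int:
    """Return the raw (unscaled) window field of a TCP header."""
    if len(tcp_header) < TCP_MIN_HEADER_BYTES:
        raise ValueError("TCP header must be at least 20 bytes")
    (window,) = struct.unpack_from("!H", tcp_header, WINDOW_OFFSET)
    return window

def is_zero_window(tcp_header: bytes) -> bool:
    """True if the segment advertises a receive window of zero."""
    return tcp_window(tcp_header) == 0

def build_header(window: int) -> bytes:
    """Hypothetical helper: a 20-byte TCP header with the given window.

    Fields: source port 3260 (iSCSI), destination port 50000, seq 0,
    ack 0, data offset 5 words, ACK flag, window, checksum 0, urgent 0.
    """
    return struct.pack("!HHIIBBHHH", 3260, 50000, 0, 0, 5 << 4, 0x10, window, 0, 0)

stalled = is_zero_window(build_header(0))
healthy = is_zero_window(build_header(8192))
```

In Wireshark itself, the display filter `tcp.analysis.zero_window` should surface the same segments without any scripting.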

27 Posts

December 10th, 2012 06:00

This occurs randomly.  We have already tried enabling/disabling jumbo frames, as well as toggling on/off most of the other usual settings (flow control settings, RSS, offloads, etc.).  Also, Broadcom and Intel NICs are in play, 1Gb and 10Gb, Dell and HP servers.

127 Posts

December 10th, 2012 06:00

Yeah, you are quite right. And that's why my first thought is spanning tree, because it can easily affect only one port on one server while the others stay happy and healthy.

You write that you see no topology changes at all. You SHOULD! Every server restart is a topology change. So I'd like to know whether you really have spanning tree configured correctly. Besides, you use different switch vendors. Were they able to select a spanning tree master?? Does the master flip sometimes?

And by "MSTP" you surely mean RSTP, correct?

 

27 Posts

December 10th, 2012 06:00

No topology changes when these issues are experienced.  Only when a server reboots or is plugged/unplugged as one would expect.  This is a dedicated iSCSI network, and is not shared with LAN traffic.  Different switch vendors in different environments, but one of them (the biggest and most important) runs on Nexus, and there is no STP there.

127 Posts

December 10th, 2012 06:00

There HAS to be something in common. There MUST ;-)

27 Posts

December 10th, 2012 06:00

No, none of those wonky settings are enabled...lol.  Again, keep in mind there are completely different switching technologies in play in different environments, so some of those settings don't even exist on all platforms.  If there's something wrong at the switching layer, it's wrong across 3 different vendors in 5 different environments.
