December 9th, 2012 18:00

Randomly losing access to drives under Windows

We've been using Equallogic for years, but only this year have we started having major issues.

We have multiple groups and multiple arrays/models.  There are standalone Windows and Linux file servers with iSCSI LUNs attached, ESX servers, and standalone and clustered SQL servers.  It's the database servers that have issues; we haven't encountered problems with the others.

Occasionally, a random database server will start logging SQL Server slow-I/O errors (Event ID 833), accompanied by iScsiPrt event ID 9, 39, and 129 errors in the System log.  Sometimes access to the volume still works for a period of time, and other times it hangs the server to the point where it has to be power cycled.  With clustered servers, graceful failover never works, and the cluster service has to be killed in order to break ownership of the LUN.
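For anyone trying to correlate these events across servers, here is a minimal sketch of scanning a CSV export of the System log (e.g. one produced via Get-WinEvent piped to Export-Csv) for the iScsiPrt event IDs mentioned above. The column names (`TimeCreated`, `ProviderName`, `Id`) are assumptions about the export format, not a fixed schema:

```python
import csv
import io

# iScsiPrt event IDs seen in this thread: 9, 39 (timeouts), 129 (bus reset).
ISCSIPRT_IDS = {9, 39, 129}

def find_iscsiprt_errors(csv_text):
    """Return (timestamp, event id) pairs for iScsiPrt errors of interest
    from a CSV export of the System event log (assumed column names)."""
    rows = csv.DictReader(io.StringIO(csv_text))
    return [
        (r["TimeCreated"], int(r["Id"]))
        for r in rows
        if r["ProviderName"] == "iScsiPrt" and int(r["Id"]) in ISCSIPRT_IDS
    ]

# Synthetic sample data for illustration only
sample = """TimeCreated,ProviderName,Id
2012-12-09 17:58:01,iScsiPrt,129
2012-12-09 17:58:03,Tcpip,4227
2012-12-09 17:58:05,iScsiPrt,39
"""
print(find_iscsiprt_errors(sample))
```

Running the same scan against exports from several servers makes it easier to see whether the errors cluster around the same time windows (e.g. backups).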

This issue only exists with a single LUN out of many that are presented and attached, and only to a single server at any time.

I've recently found that if I toggle the volume offline and then back online in quick succession, the errors stop instantly and access and functionality are restored until the next time it happens.
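For reference, that offline/online toggle can be scripted so it happens as quickly as possible. A sketch using a diskpart script file; the volume number is a placeholder and must be confirmed against the affected LUN first:

```
rem toggle-volume.txt - run with: diskpart /s toggle-volume.txt
rem Volume number 3 is a placeholder; confirm it with "list volume" first.
select volume 3
offline volume
online volume
```
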

This happens on no particular version or patch/hotfix level of Windows 2003 or 2008 R2.  Different models of servers and NICs experience it, as do different switches.  There are PS6510X, 6110XV, 6010XVS, and 6010XV arrays in play.  Firmware versions across the groups include 5.2.1, 5.2.2, and 5.2.5.

We're about ready to throw EqualLogic out of the environment if we can't find a fix.  We've had an ongoing case open for nearly 7 months now and have had Dell onsite to troubleshoot.  Has anyone else experienced this?  Were you able to resolve it?

27 Posts

December 10th, 2012 07:00

I don't think so, but I will check.  We followed the Dell recommendations verbatim for Nexus switches.

en.community.dell.com/.../3562.best-practices-for-dell-equallogic-sans-utilizing-cisco-nexus-by-sis.aspx

27 Posts

December 10th, 2012 07:00

Unfortunately, there are very few things in common.  HIT kit 3.5 and 4.0 are in play.

We've been banging our heads on this for months, and have a pretty good idea what we're doing.  We've burned through many EQL L2 and L3 techs and nobody has been able to find anything.

I'm starting to think the only alternative we have is to back out of using EQL for this purpose and go back to an FC solution.  It works great for our other environments, but it has given us fits with the MSSQL servers for too long now.  I was hoping someone could chime in with their experience and a potential fix before we finally throw in the towel.

127 Posts

December 11th, 2012 08:00

In addition, I found the following in the 5.2.6 release notes; maybe it's worth a try...

"Under an extremely heavy workload on the group, or in an extremely poor network environment, the total number of iSCSI connections on the member was significantly less than shown in the pool stats. As a consequence, the TCP receive window was not scaling correctly. [Tracking #: 80616, 250516]"

But be careful: I think I heard (please correct me if I'm wrong, I don't know 100% for sure) that 5.2.6 activates SCSI UNMAP without the possibility of disabling it at the EQL level. Thus, if you use, say, Windows Server 2012 and a CSV, and you are not sure whether that configuration is supported yet, you may want to disable ODX in the guest.

12 Posts

March 19th, 2013 04:00

I have seen drives disappearing in Windows.  We have two particular servers that lose drives; one has 5 drives connected, but it is always the F: drive that goes walkabout. The iSCSI initiator shows it as connected, but Windows cannot see it. The only common thing I can see is that the drives tend to disappear during times of elevated writes; whether this is significant or always the case I can't say for certain. Other sites with the same hardware/firmware/networking and so on do not get the problem, but they don't have the same data access profile.

It does seem to be intense writes rather than reads that trigger the loss of drives, as the drives tend to disappear before the backups kick in rather than during the backup.

274.2K Posts

March 19th, 2013 06:00

Is it still the case that taking the volume offline, then online, restores access?   If so, that's symptomatic of a SCSI RESERVE being applied to the volume by another server, preventing other servers from accessing it until a SCSI RELEASE is issued.  The offline/online cycle clears that status when all the connections drop off.  Unlike a SCSI-3 reservation, it's not persistent.   Another way to test this is to create a clone of the volume; if the clone is accessible, a reservation problem becomes more likely.
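If the affected volume happens to be a cluster physical disk, the FailoverClusters PowerShell module ships a cmdlet for clearing disk reservations, which can help confirm the reservation theory without a full offline/online cycle. A sketch only; the disk number is a placeholder for the affected physical disk:

```
# Run on a cluster node; disk number 4 is a placeholder.
Import-Module FailoverClusters
Clear-ClusterDiskReservation -Disk 4
```
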

30 Posts

April 9th, 2013 12:00

Is there any update on this issue?  We, too, are experiencing something similar, which seems isolated to Windows Server 2008 R2 Hyper-V hosts and the guests running on them.  We're using EQLs with 4 PowerConnect 6248s split into 2 stacks, and the stacks are LAG'ed together.  The iScsiPrt errors happen randomly every few days or weeks on different hosts or guests, sometimes during the day but mostly in the evening off-hours.

I've identified that I can pretty reliably reproduce the issue by Live Migrating a guest and wrote about it here: en.community.dell.com/.../20337974.aspx

But Live Migration must be only one of the ways this issue manifests itself, because it also happens randomly, usually in the evening hours when no one is in the office performing any Live Migrations.  Also, I've since noticed these errors on the Hyper-V hosts as well, which should rule out any direct connection to Live Migration itself, since hosts don't migrate.

The other interesting behavior noted in the link I provided is that some of the guests affected by this are part of a Failover Cluster, and when these iScsiPrt errors randomly happen, they usually cause an unplanned failover event as well (this is how we first started noticing these errors).  What happens is that the Failover Cluster suddenly sees the node receiving the iScsiPrt errors as unavailable, meaning the cluster heartbeat does not receive a response from that node.  But the cluster heartbeat is on the virtual public switch that the guests use for primary communications, which has nothing to do with iSCSI.

So at the moment this unknown event happens, 1) iScsiPrt errors are reported on the iSCSI NICs in the guests, and 2) at the same time something happens on the virtual public NIC that causes the cluster heartbeat to fail.  Since these unknown events mostly happen on random guests and occasionally on the Hyper-V hosts (I do not see iScsiPrt errors on any other non-virtualized Windows or Linux servers), I'm beginning to think there is something going on somewhere in the Hyper-V world.  If this were something at the EQL or switch level, I would expect it to manifest more widely across our environment.

Thanks,

Ryan
