




November 26th, 2015 08:00

What might cause multiple failures of drives in the same slot?

I wonder if anyone has seen a similar thing before. I have a VNX5500 to which I added an additional enclosure and drives back in June. On the 29th August, one of the drives in the new enclosure failed (prior to this, there were numerous Soft SCSI Bus Errors for the drive concerned):

Brief Description: CLARiiON call-home event number 0x712789a0 Host SPA Storage Array CKMxxxxxx SP N/A SoftwareRev 7.33.2 (0.51) BaseRev 05.32.000.5.209 Description Drive(Bus 0 Encl 2 Slot 5) taken offline. SN:Z1xxxxxx . TLA:005050144PWR. Reason:Drive Handler(0x00c3)

A replacement drive was shipped, but when it was inserted on 2nd September it wasn't recognised (it appeared to fail before the replacement wizard completed). Another replacement was shipped and failed about 6 hours after installation. Yet another replacement was shipped and lasted around 9 days, until it too failed on 18th September. On that occasion we noticed that several LUNs went intermittently unavailable for a couple of minutes (causing some user disruption) when the drive "failed". Eventually, as part of some information gathering, a colleague inadvertently pulled out and reseated the "failed" drive, at which point (on 29th October) it came back online and worked until a couple of days ago, when it went offline with similar LUN issues (less disruptive this time, as they happened in the middle of the night). I think another replacement drive is on its way to us as I write!

Interestingly, when it failed, one of the SPs logged a set of Unit Shutdown errors for the *adjacent* drive... whether those are the cause of the LUN disruption, I don't know.

I'm assuming this is not a common situation, but I've made little headway in getting it investigated so wonder if the wider community has any ideas?

Thanks

John

65 Posts

November 28th, 2015 23:00

Hello John,


It might be a case of a bad slot on the enclosure itself. However, given that the drives have been up for at least some periods of time (rather than always being reported as faulted immediately in this slot), that is unlikely.

Unfortunately, any other possible reasons would just be guesswork without an event log or even SP Collects:

- Did the soft SCSI errors stop being reported at any point, or do they usually show up on drives in this Disk Array Enclosure (DAE)?

- Did the failed drives have any soft media errors reported against them prior to failure?

- Do you see any mention of Uncorrectable Errors / Sectors?

Other than simply bad luck on this slot, I would say there's either a problem on an LCC (Link Control Card) / cable on this DAE, or media problems on the drives / RAID Group itself.

Creating a Service Request with support and attaching SP Collects would be the best route IMO, especially because of the disruption you mentioned.

If you prefer not to do that, however, just let me know the answers to the questions above.
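In case you want to gather the SP Collects yourself first, here's a rough sketch of how I'd kick one off and pull the files back with naviseccli; the SP address, credentials and wait time are placeholders, and it assumes the Navisphere Secure CLI is installed on your management host:

import subprocess
import time

SP = "spa.example.local"   # placeholder: SP A management address
CREDS = ["-user", "admin", "-password", "password", "-scope", "0"]   # placeholders

def navi(*args):
    # Run one naviseccli command against the SP and return its text output.
    cmd = ["naviseccli", "-h", SP, *CREDS, *args]
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout

navi("spcollect")                      # start the collection; it runs for several minutes
time.sleep(600)                        # crude wait - in practice, poll managefiles -list
print(navi("managefiles", "-list"))    # find the newest *_data.zip in the listing
# then retrieve it, for example:
# navi("managefiles", "-retrieve", "-path", "/tmp", "-file", "<zip name from the listing>", "-o")

Run the same thing against SP B as well, since each SP produces its own collect.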

Hope this helps,

Adham

12 Posts

November 30th, 2015 04:00

Thanks Adham.

I have an SR open and had a phone conversation about it this morning, which may be leading somewhere other than yet another replacement drive (perhaps an LCC reseat)! In terms of your questions, though:

The SCSI bus errors came as a burst of 84 Soft SCSI Bus Errors logged on SP B for 0/2/5 from 00:28:41 to 00:29:09 (17 were logged on SP A from 00:28:41 to 00:28:43). The next significant event on SP B is the "drive handler" offline I quoted above, at 01:25:24. I haven't spotted any media errors on the affected drive prior to that (whereas I have seen that for other drives before they failed).
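For what it's worth, I counted those bursts with a quick script against a plain-text export of the SP event log; the match pattern below is an assumption about how the export words the event, so it would need adjusting to your format:

import re
from collections import Counter

# Assumption: the event text includes something like "Bus 0 Enclosure 2 Disk 5"
# on the same line as the error; adjust the pattern to your export's wording.
pattern = re.compile(r"Soft SCSI Bus Error.*?(Bus \d+ Enclosure \d+ Disk \d+)")
counts = Counter()

with open("sp_b_event_log.txt", encoding="utf-8", errors="replace") as log:
    for line in log:
        m = pattern.search(line)
        if m:
            counts[m.group(1)] += 1

for disk, n in counts.most_common():
    print(f"{disk}: {n} soft SCSI bus errors")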

Thanks again for the response.

John

65 Posts

November 30th, 2015 22:00

Yeah, SCSI errors with no media errors would point away from a drive failure, but since the SCSI errors are reported from both the SPA and SPB sides, as you mentioned, it isn't clear which LCC / cable has a potential problem. The SP Collects (and the ktlogs found in the SP's directory alongside them) should, however, point out any LCC problems that would need reseating / replacement.

The SR owner would be able to confirm, but if you face any issues there, you can always let me know the SR number and I'll be able to get access to the SP Collects and talk to the owner / yourself.

Good luck, hope the problem is resolved.

Adham

12 Posts

December 18th, 2015 09:00

Thanks again for the pointers, Adham. My optimism has declined a bit. I've been told that the iSCSI logouts are a symptom of network congestion (though they are very closely correlated with the bus errors), and that since the drive wasn't actually replaced when it was reseated, I needed to replace it again.

In the end, the original drive replacement (aside from the reseating event) was sufficiently far in the past that I was asked to replace the drive again, which we arranged and did on the 10th. The slot failed again at 9PM last night with the same pattern: a burst of Soft SCSI Bus Errors for the slot, shutdown errors for the *adjacent* drive, and iSCSI timeouts for the ESXi hosts (thankfully 9PM on a weekday is quiet for our usage scenario).

For anyone at EMC who'd like to review, the latest fault (with today's SPCollects) is on 76183538. The previous case (at the start of this thread) was 75734958 and the previous attempt to that was 74134482.

Thanks

John

December 18th, 2015 21:00

That's quite a saga, John. It sounds like you need to progress beyond just drive replacements.

Have they had you reseat the LCCs or cables yet?

I had a similar experience with a Dell MD3000 a few years back which turned out to be a faulty/fractured backplane in a disk shelf, and the whole shelf was replaced.

Hope you get it sorted soon

12 Posts

December 21st, 2015 04:00

I haven't reseated anything yet. Is that any more involved than loosening the thumbscrews, pulling the LCC out and reinserting it? How disruptive is that to normal array services, assuming the SP on the other side is operating normally? (This might influence when I do it, to avoid disruption to the VMs running off the storage.)

4.5K Posts

December 29th, 2015 12:00

What happened when they replaced the disk on 12-17 (or 12-18)? It's not clear from the notes on the latest case (76183538). I also see that the case has been archived; if replacing the disk did not work, the case should still be open.

I did see that you may need to upgrade the disk firmware for some of the SAS disks - see ETA 195555. This is pretty important, so I'm attaching the latest Uptime Bulletin, which has the most recent information about this ETA.
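If it helps with planning the firmware update, a rough sketch like this will dump the reported revision for every disk so you can compare against the ETA. The SP address is a placeholder, it assumes naviseccli security is already set up on the management host, and the exact field labels in the getdisk output can vary by release:

import subprocess

SP = "spa.example.local"   # placeholder SP management address
out = subprocess.run(["naviseccli", "-h", SP, "getdisk"],
                     capture_output=True, text=True, check=True).stdout

disk = None
for line in out.splitlines():
    line = line.strip()
    if line.startswith("Bus ") and "Disk" in line:
        disk = line                                   # e.g. "Bus 0 Enclosure 2 Disk 5"
    elif disk and line.startswith("Product Revision"):
        # The label may differ slightly on your OE release; adjust if needed.
        print(f"{disk}: {line.split(':', 1)[1].strip()}")
        disk = None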

I also see some SCSI Reservation Conflict messages in the latest spcollects you uploaded on 12-18:

SPA   Interface:(FE5/SC)          SCSI status = 0x18   count:8   

SPA   Interface:(FE4/SC)          SCSI status = 0x18   count:4   

SPB   Interface:(FE4/SC)          SCSI status = 0x18   count:8   

These aren't many, but as you're running ESX you probably should not be seeing any, since VAAI/ATS should prevent SCSI Reservations from occurring.

I also see some iSCSI logout messages (10 in 30 days). Most look like congestion issues, but one or two occurred when the disk (0-2-5) shut down, and these could be due to latency on the LUNs while data was being rebuilt to the Hot Spare.

glen


12 Posts

December 30th, 2015 04:00

Thanks Glen

On the subject of the SCSI reservations and VAAI/ATS: I think we currently have ATS disabled per ETA 207784, until we can schedule an upgrade to the latest OE version (we're in a change freeze for Christmas and New Year, and the OE upgrade is probably not enough of an emergency to bypass that).

Drive firmware - it's on the list to do early in the new year (probably even sooner than the OE upgrade).

iSCSI logouts: I can see why the transfer of data to the hot spare might load the system, but the ESXi logs suggest *very* high latency. "Long VMFS rsv time on 'V2 BULK 20' (held for 4417 msecs)" makes me wonder whether the SP is stalling completely for several seconds.
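I pulled those reservation hold times out of a copy of the vmkernel log with a quick script like the one below; the message wording is as quoted above, but the log path and the timestamp slice are rough assumptions about our hosts' log layout:

import re

pattern = re.compile(r"Long VMFS rsv time on '([^']+)' \(held for (\d+) msecs\)")
holds = []

with open("vmkernel.log", encoding="utf-8", errors="replace") as log:
    for line in log:
        m = pattern.search(line)
        if m:
            # line[:23] is a rough guess at the timestamp prefix length.
            holds.append((int(m.group(2)), m.group(1), line[:23]))

# Print the worst offenders so they can be lined up against the array-side events.
for msecs, volume, stamp in sorted(holds, reverse=True)[:20]:
    print(f"{stamp}  {volume}: held {msecs} ms")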

I haven't replaced the drive yet, perhaps because I was hoping to trigger an escape from what has become a repetitive process. My concern is that I don't know how to get EMC to perform proper analysis on a superficially functional array with a repetitive fault, which is why I had wanted to leave the fault present. Perhaps I need to replace the drive (I think this will be the 6th replacement) and *then* raise a new SR to request some analysis of why the drives keep failing?

4.5K Posts

December 31st, 2015 07:00

John,

There are a couple of parts to VAAI. One is HardwareAcceleratedMove (the function used for clones and copying templates) and a separate part is ATS (Atomic Test and Set). The ATS part is the one that handles the reservation locking for the datastores/LUNs, and it needs to be enabled to prevent SCSI Reservations from occurring. Take a look at KB 91644 - the introduction section talks about ATS and its importance. You should disable the HardwareAcceleratedMove part of VAAI, but leave the ATS part enabled.
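As a rough sketch (the host names are placeholders, and it assumes SSH access to the ESXi hosts), the two advanced settings can be applied and verified per host like this:

import subprocess

HOSTS = ["esxi01.example.local", "esxi02.example.local"]   # placeholder host names

SETTINGS = {
    "/DataMover/HardwareAcceleratedMove": 0,   # turn the clone/copy offload off
    "/VMFS3/HardwareAcceleratedLocking": 1,    # keep ATS locking on
}

for host in HOSTS:
    for option, value in SETTINGS.items():
        # Apply the setting, then list it back to confirm the change took effect.
        subprocess.run(["ssh", f"root@{host}", "esxcli", "system", "settings",
                        "advanced", "set", "--int-value", str(value),
                        "--option", option], check=True)
        subprocess.run(["ssh", f"root@{host}", "esxcli", "system", "settings",
                        "advanced", "list", "--option", option], check=True)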

As for the disk replacement, I would recommend that you do replace the disk - all indications are that it has failed. I can see where someone could get confused when looking over the older cases, since the same serial number is still there, making it appear that the disk was not replaced.

glen

12 Posts

January 5th, 2016 08:00

Thanks Glen

I would be keen to restore the ATS functionality (which I think is HardwareAcceleratedLocking - see ETA 207784: "VNX: Storage Processors may restart if VMware vStorage APIs for Array Integration (VAAI) is enabled, resulting in potential data unavailability"). I think our decision was based on that discussion, or on an earlier version of the ETA before a hotfix was available for our OE version. I'll try to tackle getting that hotfix installed so we can turn ATS on again!

I have replaced the disk, but as I can't reopen the "CallHome" SR (it keeps getting closed automatically because I've been sent a replacement disk!), I've now opened another SR (76519062) to chase the underlying problem. I think this saga might be a case study worthy of review by someone with oversight of the EMC support process...

John

January 7th, 2016 01:00

John, keep in mind that the hotfix install is not much different from a full OE upgrade... SP reboots included.

Out of interest, what is the drive type that's failing? We had particular issues with 600GB 15k drives not long ago, and there's an ETA for that too (blogged about here: EMC VNX/VNX2/VMAX: 600GB 15k Drive Increased failure rates – with fix. – Pragmatic IO).

The resolution involved drive firmware upgrades and a "double binding" procedure.

You'd expect replacement drives to be shipped with newer firmware, but it's something to be aware of.

4.5K Posts

January 7th, 2016 09:00

From the latest spcollects, the drive in 0-2-5 is a 3TB Seagate NL-SAS. The fact that multiple drives have failed in this one location does seem to point to that particular slot, but it could also be one of the LCCs (there are two in the DAE).

Reseating or replacing the LCC cables (the external cables between DAEs) may have an effect, but those cables only carry data between DAEs. The LCC card itself is a better place to look, since it is what connects internally to each disk slot (it acts like a switch). Reseating the LCCs would be a sensible step before replacing the whole DAE.

Whenever you re-seat a cable or the LCC, you will break the connection on that bus side for any DAE after Bus 0 Enclosure 2 (I see that there is not an enclosure 3, so that should not be an issue). So if you re-seat the cable on the A-Side, you lose connection to the drives from the SPA side, but the SPB side is still active (the LUNs would trespass to the SPB side in this case). You should treat the re-seat like an NDU - perform it during a slow period.
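If you want to tidy up ownership once the re-seat is done, a rough sketch along these lines would show current LUN ownership and hand any trespassed LUNs back to their default owners. The SP addresses are placeholders and it assumes naviseccli security is already configured:

import subprocess

SPS = {"SPA": "spa.example.local", "SPB": "spb.example.local"}   # placeholders

def navi(sp, *args):
    # Assumes naviseccli security (or -user/-password/-scope) is already set up.
    return subprocess.run(["naviseccli", "-h", sp, *args],
                          capture_output=True, text=True, check=True).stdout

# Current owner per LUN; compare against the default owner shown in the full
# getlun output to spot anything still trespassed after the re-seat.
print(navi(SPS["SPA"], "getlun", "-owner"))

# Once the bus is healthy again, have each SP take back the LUNs it owns by default.
for name, sp in SPS.items():
    navi(sp, "trespass", "mine")
    print(f"{name}: issued 'trespass mine'")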

glen

