I wonder if anyone has seen a similar thing before. I have a VNX5500 to which I added an additional enclosure and drives back in June. On the 29th August, one of the drives in the new enclosure failed (prior to this, there were numerous Soft SCSI Bus Errors for the drive concerned):
Brief Description: CLARiiON call-home event number 0x712789a0 Host SPA Storage Array CKMxxxxxx SP N/A SoftwareRev 7.33.2 (0.51) BaseRev 05.32.000.5.209 Description Drive(Bus 0 Encl 2 Slot 5) taken offline. SN:Z1xxxxxx . TLA:005050144PWR. Reason:Drive Handler(0x00c3)
A replacement drive was shipped, but when inserted on 2nd September it wasn't recognised (it appeared to fail before the replacement wizard completed). Another replacement was shipped and failed about 6 hours after installation. A further replacement lasted around 9 days, failing on 18th September; on that occasion we noticed that several LUNs went intermittently unavailable for a couple of minutes when the drive "failed", causing some user disruption. Eventually, as part of some information gathering, a colleague inadvertently pulled out and reseated the "failed" drive, at which point (the 29th October) it came back online and worked until a couple of days ago, when it went offline with similar LUN issues (though these were less disruptive as they happened in the middle of the night). I think another drive is on its way to us as I write!
Interestingly, when it failed one of the SPs logged a set of Unit Shutdown errors for the *adjacent* drive... whether those are the cause of the LUN disruption, I don't know.
I'm assuming this is not a common situation, but I've made little headway in getting it investigated so wonder if the wider community has any ideas?
It might be a case of a bad slot on the enclosure itself. However, given that the drives have been up for at least some periods of time (rather than always being immediately reported as faulted in this slot), that seems unlikely.
Unfortunately, any other possible cause would just be guesswork without an event log or even SP Collects:
- Did the soft SCSI errors stop reporting at any point? Or do they usually show up on drives of this Disk Array Enclosure?
- Did the failed drives have reported soft media errors on them prior to failure?
- Do you see any mention of Uncorrectable Errors / Sectors?
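If it helps while gathering answers to those questions, here is one way you could tally such events per drive from an exported SP event log. This is only a sketch: the file name and the exact message wording below are assumptions modelled on typical CLARiiON event text, not your array's actual format, so adjust the patterns to match your SP Collects.

```shell
#!/bin/sh
# Sketch: count soft media / uncorrectable-sector events per drive position
# from an exported SP event log. The excerpt below is invented sample data;
# point LOG at a real event log export from your SP Collects instead.
LOG=sp_events.txt
cat > "$LOG" <<'EOF'
00:12:01 Bus 0 Enclosure 2 Disk 5 Soft Media Error
00:12:07 Bus 0 Enclosure 2 Disk 5 Soft SCSI Bus Error
00:13:44 Bus 0 Enclosure 2 Disk 4 Uncorrectable Sector
00:14:02 Bus 0 Enclosure 2 Disk 5 Soft Media Error
EOF

# Tally "Soft Media Error" and "Uncorrectable" events by disk position,
# most frequently affected drive first.
grep -E 'Soft Media Error|Uncorrectable' "$LOG" \
  | awk '{print $2, $3, $4, $5, $6, $7}' \
  | sort | uniq -c | sort -rn
```

A drive that shows a rising count of media errors before the offline event tells a different story from one that fails cold with only bus errors, which is why the distinction matters here.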
Other than simply bad luck on this slot, I would say there's either a problem on an LCC (Link Control Card) / cable on this DAE, or media problems on the drives / RAID Group itself.
Creating a Service Request with support and attaching SP Collects would be the best route IMO, especially because of the disruption you mentioned.
If you prefer not to do that however, just let me know the answers to the questions above.
Hope this helps,
I have got an SR open and had a phone conversation this morning about it which may be leading somewhere other than another replacement drive (perhaps an LCC reseat)! In terms of your questions, though:
The SCSI bus errors came as a burst of 84 Soft SCSI Bus Errors logged on SP B for 0/2/5 from 00:28:41 to 00:29:09 (17 were logged from 00:28:41 to 00:28:43 on SP A). The next significant event on SP B is the "drive handler" offline I quoted above, at 01:25:24. I haven't spotted any media errors on the affected drive prior to that (whereas I have seen that for other drives before they failed).
Thanks again for the response
Yeah, SCSI errors with no media errors would point away from a drive failure, but since the SCSI errors are reported from both the SPA and SPB sides like you mentioned, it wouldn't be clear which LCC / cable has a potential problem. That said, SP Collects (and the ktlogs found in the SP's directory alongside them) should point out any LCC problems that would need reseating / replacing.
The SR owner would be able to confirm, but if you face any issues there, you can always let me know the SR number and I'll be able to get access to the SP Collects and talk to the owner / yourself.
Good luck, hope the problem is resolved.
Thanks again for the pointers, Adham. My optimism has declined a bit: I've been told that the iSCSI logouts are a network congestion symptom (though they correlate very strongly with the bus errors) and that, as the drive wasn't replaced when it was reseated, I needed to replace it again. I did that.
In the end the original drive replacement (aside from the reseating event) was sufficiently far in the past that I was asked to replace the drive (which we arranged and was done on the 10th). The slot failed again at 9PM last night with the same pattern - a burst of Soft SCSI Bus Errors for the slot, shutdown errors for the *adjacent* drive and iSCSI timeouts for the ESXi hosts (thankfully 9PM on weekdays is quiet for our usage scenario).
For anyone at EMC who'd like to review, the latest fault (with today's SPCollects) is on 76183538. The previous case (at the start of this thread) was 75734958 and the previous attempt to that was 74134482.
That's quite a saga, John. Sounds like you need to progress beyond just drive replacements.
Have they had you reseat the LCCs or cables yet?
I had a similar experience with a Dell MD3000 a few years back which turned out to be a faulty/fractured backplane in a disk shelf, and the whole shelf was replaced.
Hope you get it sorted soon
I haven't reseated anything yet. Is that any more involved than loosening the thumbscrews, pulling the LCC out and reinserting it? And how disruptive is it to normal array services, assuming the other SP is operating normally? (This might influence when I do it, to avoid disruption to the VMs running off the storage.)
What happened when they replaced the disk on 12-17 (or 12-18) - it's not clear from the latest case (76183538) notes what happened. I also see that the case is archived. If replacing the disk did not work, then the case should still be open.
I did see that you may need to upgrade the disk firmware for some of the SAS disks - see ETA 195555. This is pretty important - I'm attaching the latest Uptime Bulletin as this has the latest information about this ETA:
I also see some SCSI Reservation Conflict messages in the latest SP Collects you uploaded from 12-18:
SPA Interface:(FE5/SC) SCSI status = 0x18 count:8
SPA Interface:(FE4/SC) SCSI status = 0x18 count:4
SPB Interface:(FE4/SC) SCSI status = 0x18 count:8
These aren't a lot, but as you're running ESX you probably shouldn't be seeing any, as VAAI/ATS should prevent SCSI Reservations from occurring.
I see some iSCSI logout messages (10 in 30 days) - most look like congestion issues, but one or two occurred when the disk (0-2-5) shut down, and these could be due to latency on the LUNs while data was moving to the Hot Spare.
On the subject of the SCSI reservations and VAAI/ATS - I think we currently have ATS disabled per ETA207784 until we can schedule an upgrade to the latest OE version (we're in a change freeze for Christmas and New Year and the OE upgrade is probably not enough of an emergency to bypass that).
Drive firmware - it's on the list to do early in the new year (probably even sooner than the OE upgrade).
iSCSI logouts: I can see why the transfer of data to the hot spare might load the system, but the ESXi logs suggest *very* high latency: "Long VMFS rsv time on 'V2 BULK 20' (held for 4417 msecs)" makes me wonder if the SP is stalling completely for several seconds?
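For what it's worth, one way to quantify that is to pull every reservation hold time out of the host's vmkernel log and look at the worst case. This is a sketch only: the sample lines below are invented to mirror the message quoted above, so point LOG at the real /var/log/vmkernel.log on an affected host.

```shell
#!/bin/sh
# Sketch: extract VMFS reservation hold times from an ESXi vmkernel log and
# report the worst case. Sample data below is made up; replace with the
# real log file from an affected host.
LOG=vmkernel_sample.log
cat > "$LOG" <<'EOF'
cpu2:4101)Long VMFS rsv time on 'V2 BULK 20' (held for 4417 msecs)
cpu0:4099)Long VMFS rsv time on 'V2 BULK 20' (held for 812 msecs)
cpu3:4100)Long VMFS rsv time on 'V2 BULK 21' (held for 1290 msecs)
EOF

# Pull out the millisecond values and print the maximum hold time seen.
grep -o 'held for [0-9]* msecs' "$LOG" \
  | awk '{ if ($3 > max) max = $3 } END { print "worst reservation hold:", max, "msecs" }'
```

If the worst holds cluster around the times the drive went offline, that would support the SP-stall theory over general network congestion.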
I haven't replaced the drive yet, perhaps because I was hoping to trigger an escape from what has become a repetitive process. My concern is that I don't know how to get EMC to perform proper analysis on a superficially functional array with a repetitive fault, which is why I had wanted to leave the fault present. Perhaps I need to replace the drive (I think this will be the 6th replacement) and *then* raise a new SR to request some analysis of why the drives keep failing?
There are a couple of parts to VAAI - one is HardwareAcceleratedMove (the function for clones and copying templates) and a separate part is ATS (Atomic Test and Set). The ATS part is the one that handles the reservation locking for the Datastores/LUNs - it needs to be enabled to prevent SCSI Reservations from occurring. Take a look at KB 91644 - the introduction section talks about ATS and its importance. You should disable the HardwareAcceleratedMove part of VAAI, but leave the ATS part enabled.
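In case it's useful: on an ESXi 5.x host those two features are separate advanced settings, so they can be toggled independently from the shell. A sketch, assuming the standard VMware option names for your ESXi version (verify against your hosts before changing anything):

```shell
# Check the current state of the two VAAI sub-features (Int Value 1 = enabled).
esxcli system settings advanced list -o /DataMover/HardwareAcceleratedMove
esxcli system settings advanced list -o /VMFS3/HardwareAcceleratedLocking   # ATS

# Disable only the full-copy offload (clone / template copies)...
esxcli system settings advanced set -o /DataMover/HardwareAcceleratedMove -i 0
# ...while leaving ATS locking enabled, so SCSI reservations stay suppressed.
esxcli system settings advanced set -o /VMFS3/HardwareAcceleratedLocking -i 1
```

The settings take effect immediately and need to be applied on every host sharing the datastores.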
As for the disk replacement, I would recommend that you do replace it - all indications are that the disk has failed. I can see where someone could get confused looking over the older cases, as it appears the disk was not replaced: the same serial number is still there.