A disk in our VNX failed and during the rebuild we received the following message.
Uncorrectable Sector RAID Group: 14e Position: 0 LBA: 34b02580 Blocks: 40 Error info: 40000000 Extra info: 12
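As an aside, here is a minimal sketch for pulling the fields out of a message like the one above. It assumes every numeric field is hexadecimal: the RAID Group and LBA values contain hex digits, so that much seems safe, but treating Blocks as hex is only a guess. If the guess is right, "Blocks: 40" means 64 sectors rather than 40. The regular expression and the `parse_alert` helper are illustrative names, not part of any EMC tool.

```python
import re

ALERT = ("Uncorrectable Sector RAID Group: 14e Position: 0 LBA: 34b02580 "
         "Blocks: 40 Error info: 40000000 Extra info: 12")

# Field layout copied from the message above; the pattern itself is an assumption.
FIELD_RE = re.compile(
    r"RAID Group: (?P<raid_group>[0-9a-fA-F]+) "
    r"Position: (?P<position>[0-9a-fA-F]+) "
    r"LBA: (?P<lba>[0-9a-fA-F]+) "
    r"Blocks: (?P<blocks>[0-9a-fA-F]+)"
)


def parse_alert(text):
    """Return the numeric fields of an alert, read as hexadecimal (assumed)."""
    match = FIELD_RE.search(text)
    if match is None:
        raise ValueError("unrecognised alert format")
    return {name: int(value, 16) for name, value in match.groupdict().items()}


fields = parse_alert(ALERT)
print("RAID group (decimal):", fields["raid_group"])
print("Starting LBA (decimal):", fields["lba"])
print("Blocks if hex:", fields["blocks"], "| if decimal:", 40)
```

Either reading gives a small, contiguous range on one disk position, which is why knowing the owning LUN matters more than the exact count.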
Ok - it looks like perhaps 40 sectors are lost. If you read the KB on uncorrectable sectors it will tell you the data is lost. This is where I started getting frustrated with the service request support.
1) They tell me I have no DU/DL (Data Unavailable/Data Loss) failures. That makes no sense to me since the KB page contradicts that.
2) Ok - I need to recover the 40 sectors. That will be easy if they belong to a database LUN. Those get backed up nightly. It will be more difficult if it's something like a VMware datastore. I'd like to know the same day what LUN is impacted so I can get on with the business of a recovery. The Service Center personnel say all of my LUNs are in a storage pool and so I must run ROBV on the entire pool and only when it completes 6 to 7 days later will they be able to tell me which LUN needs a recovery operation.
Is this really how it is with a VNX storage pool? I have to say, I'm kind of disappointed that the alert doesn't just flat out say which user LUN is impacted.
I am currently in a similar predicament, and unfortunately the nature of virtualization prevents us from quickly correlating the LBA in question with the LUN it belongs to. In your case, the alert names RAID Group 14E, but if EMC is saying the entire pool needs to be RoBV'd (Read only Background Verify), then I assume you must be running FAST VP too. That means FAST VP may be moving the uncorrectables within the pool, so they may be long gone from the reported LBA and could be anywhere.
There really is no way of finding out which LUN is impacted other than disabling FAST VP to stop the block relocation and running RoBV to identify where the uncorrectables are at that moment. If it's on the File side, finding the corresponding FS adds another layer and makes things more difficult.
Having said that, your idea of having the alert identify the LUN to be recovered would benefit not only us as customers but EMC as well, from a TTR perspective. It would be an excellent enhancement for EMC to kick off the process automatically when the OE code finds uncorrectables, or at least to make it an option for us to choose (stopping FAST without consent may not be preferable to some).
EMC > Please kindly open an Enhancement Request ticket with the above.
EMC doesn't usually open Product Enhancement Requests (PER) from the Community Forums. You can have your Sales Team initiate the process or use the DellEMC Support page to do so.
Once a PER is opened, you can request status updates from the engineering team.
Let us know if that helps!
I agree with what you've said about the data moving, but my point is that the data is not moving at the time of the I/O request. If my host says "give me the 100th sector in LUN1138", the VNX needs to be able to figure out where that 100th sector lives within the pool and return it to me in milliseconds. In the rare case where it locates the 100th sector in user LUN1138 and finds it to be "uncorrectable", why is it not possible to include the string "LUN1138" in the alert generated?
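To make the argument concrete, here is a toy model of the idea. It is not how the VNX stores its pool metadata; `ToyPool`, `PoolLocation`, and the alert text it builds are all made up for illustration. The only point is that a read already starts from the LUN and walks to a pool location, so the LUN name is in hand when the bad sector is found, whereas a relocation (as FAST VP might do) makes the old physical address meaningless afterwards.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PoolLocation:
    raid_group: int
    lba: int


class ToyPool:
    """Toy thin-pool map; NOT the VNX's real metadata layout."""

    def __init__(self):
        self._forward = {}  # (lun, lun_lba) -> PoolLocation
        self._reverse = {}  # PoolLocation -> (lun, lun_lba)

    def map_block(self, lun, lun_lba, location):
        """Place (or relocate) one LUN block at a pool location."""
        old = self._forward.get((lun, lun_lba))
        if old is not None:
            del self._reverse[old]  # the old address no longer belongs to anyone
        self._forward[(lun, lun_lba)] = location
        self._reverse[location] = (lun, lun_lba)

    def read(self, lun, lun_lba, bad_locations):
        """Host-driven read: the LUN is known before the disk is touched."""
        location = self._forward[(lun, lun_lba)]
        if location in bad_locations:
            return (f"Uncorrectable Sector LUN: {lun} LUN LBA: {lun_lba} "
                    f"RAID Group: {location.raid_group:x} LBA: {location.lba:x}")
        return "ok"

    def owner_of(self, location):
        """Reverse lookup, as a rebuild or media scan would need."""
        return self._reverse.get(location)


pool = ToyPool()
bad_spot = PoolLocation(0x14E, 0x34B02580)
pool.map_block("LUN1138", 100, bad_spot)

alert = pool.read("LUN1138", 100, {bad_spot})
owner_before = pool.owner_of(bad_spot)

# Simulate a relocation: the block moves, and the old address has no owner.
pool.map_block("LUN1138", 100, PoolLocation(0x14F, 0x1000))
owner_after = pool.owner_of(bad_spot)

print(alert)
print(owner_before, owner_after)
```

The rebuild case is the reverse lookup, which depends on the array keeping (or being able to derive) a physical-to-LUN map. Whether it does is exactly the question for EMC.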
I do see your point and I fully agree. I'm thinking from a TTR perspective: if the host knows which data is corrupted, we are better off just restoring that and writing over the bad block, and we most likely would not need to know where it lives. But the LUN still needs to be unbound and rebound to clear the uncorrectable counts for log purposes, so the LUN correlation is still very useful. The same goes if it is found during Background Media Scans or during a rebuild, where knowing the LUN would surely make things easier for all of us.
A couple of points.
1) "if the host knows which data is corrupted..." My problem is I don't know which host. It's really hard when you have lots of virtual hosts sharing VMware datastore LUNs. I have 50 LUNs, but 500 hosts. It's still going to be a chore to track down the host, but just knowing which LUN will eliminate many of them. As of now, I would have to run scans on all 500 hosts.
2) My understanding is the LUN doesn't need to be unbound. Once you write over the top of the uncorrectable sector, the count is decremented. If you recover everything, the count will go to zero without an unbind.
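To put point 1 in numbers, here is a small sketch of how much the search shrinks once the LUN is known. The inventory is hypothetical: it assumes the 500 hosts are spread evenly over the 50 datastore LUNs, which a real environment will not match exactly, and `hosts_to_scan` is just an illustrative helper.

```python
def hosts_to_scan(datastore_hosts, impacted_lun=None):
    """Return the hosts that need checking, given what the alert tells us."""
    if impacted_lun is None:
        # No LUN in the alert: every host on every datastore is a candidate.
        return sorted({h for hosts in datastore_hosts.values() for h in hosts})
    return sorted(datastore_hosts.get(impacted_lun, ()))


# Hypothetical inventory: 50 LUNs with 10 virtual hosts each (assumed even spread).
inventory = {
    f"LUN{n}": [f"vm{n * 10 + i:03d}" for i in range(10)]
    for n in range(50)
}

without_lun = hosts_to_scan(inventory)
with_lun = hosts_to_scan(inventory, "LUN7")

print(len(without_lun), "hosts to scan without the LUN")
print(len(with_lun), "hosts to scan once the LUN is known")
```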