I seem to have gotten into a little pickle. Maybe someone could advise please?
I have an MD1000 attached to my PowerEdge server via a PERC 6/E controller. It's been a while since I fired it up and there was a message on boot up after the POST about the battery being dead. Unfortunately I've only been using the server occasionally for storage the last week or so, so hadn't yet gotten around to figuring out how to replace the battery.
I have successfully been powering the server up and down over the week with no issues. Today something else has occurred and I don't know what, but the server would not boot. The controller complained that there was no configuration. When going into the controller BIOS I could see the VD Management screen and VD 0 complained that there was no configuration. When hitting enter it came up with an error message, something like, "configuration not available - unknown error".
Upon a bit of a further dig around, I could see that there was uncommitted data in the cache. The controller reported losing access to one or more of the VDs. I decided I was happy to sacrifice the data and I flushed the cache. I also set the controller to NOT force stop on error. However, when I rebooted, the controller still stopped the server from booting.
This is where I think I made the error. Mainly because I didn't check what I was doing before I did it. My bad... I went back in and chose to delete the "foreign configuration" as I thought I could somehow reset the configuration. Of course it didn't reset the configuration, it deleted it.
So previously I could see that there was meant to be a VD 0, even if I couldn't seem to access it properly. I could see all 6 of the HDDs and their statuses etc., but they all seem to have been reset. This is the status / config I had previously:
HDD 0 - In error
HDD 1 - Online
HDD 2 - Online
HDD 3 - Online
HDD 4 - Online
HDD 5 - Rebuilding
So I have also had a recent HDD failure and the hot spare (disk 5) was rebuilding. All of that config is now gone and all disks now report 'ready' apart from 0 which of course still has a SMART failure.
The question, of course, is how do I set the VD up again so that I don't lose any of the data?
Once I realised what I had done, I powered down the server and stopped mucking about.
Can anyone please advise on how to keep the data?
Before I advise on how to possibly get the array back online I'm going to give a disclaimer. If the data is important then you should contact a data recovery company. Any attempts to recover the data could further corrupt it and make it more difficult or impossible to recover.
Let me explain what the situation is before I tell you how to possibly fix it. The array information is stored in the controller cache and on the hard drives themselves. When that information does not match between the controller and the hard drives, the controller will put the array into a foreign status. This helps prevent data corruption. When this happens you have two options: you can import the foreign configuration or clear it. If you import, the controller will overwrite the configuration it has stored with the configuration stored on the drives. If you clear it, the controller deletes the metadata tags that define the array off of the hard drives. Clearing the foreign configuration is the same as deleting an array.
The good thing is that deleting the array only deletes the metadata tags that define the array. It does not actually delete the data or array formatting off of the drives. Because of this you can do what is called retagging. Retagging is when you create the exact same array. If the metadata tags are rewritten to the drives, the data becomes accessible again. The trick is to make sure you have the exact same settings as the previous array. If the array is created with a different number of drives, a different stripe segment size, or a different array size, then the data will not be in the expected location. If that happens, the controller will discard the old array formatting and write new formatting.
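To illustrate why the settings have to match exactly, here is a rough sketch of how a striped array maps a logical block to a physical drive and offset. It is a simplified model of plain striping with no parity, not the actual on-disk layout a PERC controller uses for RAID 6 (which also rotates parity across the drives). The function name, the drive counts, and the stripe sizes are made up for the example.

```python
def locate_block(logical_block, num_data_disks, stripe_blocks):
    """Map a logical block to (disk index, block offset on that disk).

    Simplified model: plain striping, no parity. Real RAID 6 layouts
    rotate parity and are NOT modelled here.
    """
    stripe_unit = logical_block // stripe_blocks   # which stripe segment overall
    within = logical_block % stripe_blocks         # position inside that segment
    disk = stripe_unit % num_data_disks            # segments go round-robin over disks
    row = stripe_unit // num_data_disks            # how many full rows came before
    return disk, row * stripe_blocks + within


# Hypothetical "original" array: 4 data disks, 128-block stripe segments.
print(locate_block(128, 4, 128))   # (1, 0)   second segment lives on disk 1
print(locate_block(512, 4, 128))   # (0, 128) wraps back to disk 0, next row

# Same logical blocks, but recreated with the wrong settings:
print(locate_block(128, 4, 256))   # (0, 128) wrong stripe size, different place
print(locate_block(512, 5, 128))   # (4, 0)   wrong drive count, different disk
```

With the wrong stripe size or drive count, the controller looks for the same logical block on a different drive or at a different offset, which is why the old data no longer appears where the file system expects it.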
Here is the process to retag your array:
Once that is done you should have a RAID 6 created with drives 0-5, and drives 0 and 5 should be offline. At this point you should try to boot up and access the data. If the data is available, back it up before trying to rebuild drives 0 and 5 back into the array. I would suggest that you have a backup solution ready before you start doing any of this.
After you have the data backed up I would recommend that you delete the array, create a dissimilar array, then recreate the array you want to use. I would not recommend that you continue using this array. Retagging is an array recovery procedure that is used to get access to data. It should not be seen as a permanent solution. Also, because there was discarded cache there is likely corrupt array information. In short, you will likely have problems with this array at some point in the future. That could be 2 minutes after the retag or never.
The reason you want to create a dissimilar array is that if you don't, you are just retagging. The controller may not rewrite the array data, and you may encounter corrupted array data in the future because of this. Create a 6 drive RAID 0 or anything else that is not the same array you are currently using. Then initialize the array. After that completes, delete the array, create the desired array, and initialize again.
Dell EMC, Enterprise Engineer
Great answer Daniel.
It saved me from a backup recovery that might have taken me 2 or more days.
The only thing I missed in your post is the possible reasons for this problem.
I have a RAID 1 VD0 that is fine. I had noticed the D: drive (VD1) had gone missing, but the data was not important. It consists of a single 3TB disk. I had to shut down and reboot for a move, and now have "FOREIGN" assigned to VD1. I do not care about the data on VD1 (D: drive) and just want to reboot from the good VD0 (C:). I cannot figure out which option to take (import or clear). I am pretty sure that the single 3TB disk is bad and I can easily replace it, but I am not sure of the correct steps.