Revert replace physical disk operation on MD3220 and import another disk

Question

We have an MD3220 (with an MD1220 expansion shelf) with all of its 24 slots full and 4 virtual disks set up in RAID 5. We had two hot spares on the array, in slots 10 and 15.

First, a disk in slot 16 was tagged as Failed and two virtual disks became degraded. The hot spare in slot 15 kicked in, but the rebuild failed. As we had more extra spares, we then chose, from the array management utility, to make 15 replace physical disk 16, and removed the failed 16.
The rebuild on 15 still failed. We inserted a new disk in slot 16, which was set as a hot spare; it also tried to rebuild and failed again.

After that, the disk in slot 14 also failed, and 2 of the 4 virtual disks went into the Failed state.

We turned off the servers and storage arrays for inspection. We managed to make a full clone of the failed disk from slot 14 onto another extra spare with `ddrescue`. As far as we can tell, only a single bad sector (512 bytes) was lost, somewhere not near the beginning or end of the disk.

After turning the arrays and servers back on with the new clone in slot 14, the array does not recognize it; the disk shows as Unassigned. We guess this is due to the different serial number / disk UUID.
We tried to tell the RAID controller to treat the new clone as the old disk. We converged on this command, but it fails:

# /opt/dell/mdstoragesoftware/mdstoragemanager/client/SMcli -n MD3220-HPS -S -c 'recover virtualDisk physicalDisks=(0,19 0,11 0,12 0,13 0,14 0,9 0,15 0,17 0,18 0,22 0,20) newDiskGroup="10" userLabel="2" virtualDiskWWN="6d4ae52000a5ff6f000003bb506415fe" capacity=1610612736000 offset=0 raidLevel=5 segmentSize=256 dssPreAllocate=TRUE SSID=2 owner=1 ;'

Could not recover a virtual disk using the Recover Virtual Disk command at line 1.
Error 4 - The operation cannot complete because of an incorrect parameter in the command sent to the RAID controller module.

Please retry the operation. If this message persists, contact your Technical Support Representative.
The command at line 1 that caused the error is:

recover virtualDisk physicalDisks=(0,19 0,11 0,12 0,13 0,14 0,9 0,15 0,17 0,18 0,22 0,20) newDiskGroup="10" userLabel="2" virtualDiskWWN="6d4ae52000a5ff6f000003bb506415fe" capacity=1610612736000 offset=0 raidLevel=5 segmentSize=256 dssPreAllocate=TRUE SSID=2 owner=1 ;

Script execution halted due to error.

SMcli failed.
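
Since Error 4 points at an incorrect parameter, one low-risk first step is to sanity-check the values passed in the command. The following is only a minimal sketch: it parses the command string quoted above and never talks to the array. The helper names (`parse_disks`, `parse_capacity_bytes`) are made up for illustration, and the checks are assumptions about what might be worth verifying, not a list of what the controller actually validates.

```python
import re

# The exact script command from the failed SMcli call above.
RECOVER_CMD = (
    'recover virtualDisk physicalDisks=(0,19 0,11 0,12 0,13 0,14 0,9 '
    '0,15 0,17 0,18 0,22 0,20) newDiskGroup="10" userLabel="2" '
    'virtualDiskWWN="6d4ae52000a5ff6f000003bb506415fe" '
    'capacity=1610612736000 offset=0 raidLevel=5 segmentSize=256 '
    'dssPreAllocate=TRUE SSID=2 owner=1 ;'
)


def parse_disks(cmd):
    """Return the (enclosure, slot) pairs in the order given in physicalDisks=(...)."""
    inside = re.search(r"physicalDisks=\(([^)]*)\)", cmd).group(1)
    return [tuple(int(n) for n in pair.split(",")) for pair in inside.split()]


def parse_capacity_bytes(cmd):
    """Return the capacity= value as an integer number of bytes."""
    return int(re.search(r"\bcapacity=(\d+)", cmd).group(1))


disks = parse_disks(RECOVER_CMD)
capacity = parse_capacity_bytes(RECOVER_CMD)

print("disks listed:", len(disks))
print("duplicate slots:", len(disks) != len(set(disks)))
print("capacity in GiB:", capacity / 2**30)
```

Comparing the printed disk order and capacity against the recovery profile is a quick way to rule out a transcription mistake before retrying.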

We also got the 'recovery profile', which contains the description of all disks. In theory we could also adjust it to replace the failed disk 14 with the new clone 14, but we can't find any documentation on whether this is possible.

We have two questions:

1. Is there a way to force this kind of import on the controller?

2. During the inspection, while everything was powered off, we also noticed that the disk from slot 16 that was initially tagged as Failed is in a good state. Put back into the array in slot 16, it now shows as Optimal but Unassigned (because we set "Replace physical disk" to 15 before taking it out). So slot 16 again holds the same disk that was originally tagged as Failed and is now Optimal. Can we somehow revert the "Replace physical disk" operation that substituted 15 for 16, without triggering any data rewrite on 16? The hope is that the controller would see the data on the disk and bring the virtual disks back.

Thanks in advance for any useful input.

arielzn · Answer

In case it helps to clarify, here is a screenshot of the Array Management tool showing the disk group with the issue:

Screenshot from 2020-10-20 22-32-04.png

We were able to clone the Failed disk from slot 14 with a lengthy ddrescue run.
The new clone has been reinserted in slot 14.
We need to make the array accept the new clone as a replacement.
Is it possible to import the new clone in slot 14 without touching its data?

The disk in slot 15 was a spare set to physically replace 16 after 16 was marked Failed; the rebuild on 15 failed.
After inspection, the disk from slot 16 seemed fine. It is back in the array and is seen as Optimal but Unassigned.
Can we switch back from 15 to 16 without touching the data on 16 either? It is the same original disk that was part of the array, reinserted.

DellEMCSupport · Answer

Hello arielzn,

Is the replacement drive on the support matrix for an MD3220? Here is the link to the support matrix: https://dell.to/3o7nrYu. If it is not, then you will need to get a drive that matches the support matrix. What is the current firmware version on your controllers?

arielzn · Answer

Thanks for your answer. Yes, it is correctly supported. The cloned replacement drive is now:

0, 14    838.363 GB    Physical Disk    SAS    6 Gbps    ST900MM0006    LS06

And we have it recognized as:

Physical Disk at Enclosure 0, Slot 14
Status: Optimal
Mode: Unassigned
Raw capacity: 838.363 GB
Usable capacity: 837.863 GB
World-wide identifier: (edited)
Associated disk group: None

Just in case, the failed drive that was removed (but was good enough for the ddrescue dump) was:

0, 14    838.363 GB    Physical Disk    SAS    6 Gbps    WD9001BKHG    D1S4

For the firmware details I get:

CONTROLLERS------------------------------
Number of RAID controller modules: 2

RAID Controller Module in Enclosure 0, Slot 0
Status: Online
Current configuration
Firmware version: 07.80.41.60
Appware version: 07.80.41.60
Bootware version: 07.80.41.60
NVSRAM version: N26X0-780890-001
Model name: 2660
Board ID: 2660
Submodel ID: 136
Product ID: MD32xx

The same applies to the other controller, in Slot 1.

DellEMCSupport · Answer

Hello arielzn,

Looking at your firmware, it is way out of date and is most likely causing your issue. You should be able to assign the drive back to your virtual disk. If that doesn't work, then you can try to revive the disk group, which should bring the virtual disk back online. Here are the steps for reviving a disk group.

Disable the hot spares.
Fail all disks that are part of the disk group.

You will want to use the following command via SMcli:

set physicalDisk [0,14] operationalState=failed;

Repeat the previous command for every other drive in the disk group, changing the drive slot each time.
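
The per-slot repetition can be generated rather than typed by hand. The sketch below only builds the script lines as text, one per disk; it does not call SMcli or touch the array. The slot list is copied from the `physicalDisks=(...)` parameter in the question and is an assumption about which disks make up the affected disk group, so check it against the array profile first. The function name `fail_commands` is made up for this example.

```python
# (enclosure, slot) pairs taken from the recover command in the question.
# Assumption: these are exactly the members of the affected disk group.
DISK_GROUP_SLOTS = [
    (0, 19), (0, 11), (0, 12), (0, 13), (0, 14), (0, 9),
    (0, 15), (0, 17), (0, 18), (0, 22), (0, 20),
]


def fail_commands(slots):
    """Return one 'set physicalDisk ... operationalState=failed;' line per disk."""
    return [
        f"set physicalDisk [{enclosure},{slot}] operationalState=failed;"
        for enclosure, slot in slots
    ]


if __name__ == "__main__":
    # Print the lines for review; nothing is sent to the controller.
    for line in fail_commands(DISK_GROUP_SLOTS):
        print(line)
```

Printing the lines first leaves room to review them before anything is run against the controller, which matters here because failing the wrong disk is not easily undone.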

Then revive the disk group. Here is the command for that:

revive diskGroup [disk group name];

Once all the steps are done the disk group should come back online and you should have access to your virtual disk again.

arielzn · Answer

My point was how to assign it back, asking it to take the same position in the order that the original disk had in the array: physicalDisks=(0,19 0,11 0,12 0,13 0,14 0,9 0,15 0,17 0,18 0,22 0,20)

arielzn · Answer

Hello. To first try just assigning the drive [0, 14] back to disk group 1, how precisely should I proceed?

DellEMCSupport · Answer

Hello arielzn,

All you have to do is add the drive back to the disk group first and then make sure it is added to the virtual disk. If you are using disk pools, then you will need to add the drive to the pool. Here is the link to the administrator guide; chapter 7 covers disk groups and chapter 8 covers disk pools: https://dell.to/31tYsEV

arielzn · Answer

OK. Following the guide, I don't see the option "Storage → Disk Group → Add Physical Disks (Capacity)" available, probably due to the Failed state. I do see it for the other disk group, which is Optimal.

The only approach available to assign the new 14 back is a Replace Physical Disk operation:

Screenshot from 2020-10-21 23-38-34.png

If we go for this replacement, will the current data on the cloned disk [0, 14] be left unmodified?

DellEMCSupport · Answer

Hello arielzn,

Disk 0,14 will be modified so that it can be added back to the virtual disk. The only way it would not be modified is if you had powered off your MD3220 when the drive failed. Since your array has been up and running, the data that you cloned is stale and will not match what the virtual disk currently holds.

arielzn · Answer

Thanks for all your answers so far.

Trying to add it back to the group didn't work; it gave an error due to the already Failed state of the VDs in the disk group.

Going for the second option, after failing all disks in the disk group and trying the revive, I hit:

# SMcli -n 
   -c "revive diskGroup ["1"];"

Unable to force disk group "1" to optimal at line 1.
Error 29 - This operation cannot complete because there was a security authentication failure on a parameter in the command sent to the RAID controller module.

Please retry the operation. If this message persists, contact your Technical Support Representative.
The command at line 1 that caused the error is:

revive diskGroup [1];

Script execution halted due to error.

SMcli failed.

Is there any other parameter I should add for SMcli?

DellEMCSupport · Answer

Hello arielzn, try rebooting your controllers and then run the command again.

dr.kiev · Answer

If you simply want to recover the data, it would be easier to reconstruct the RAID virtually and back it up to a separate location.
