January 10th, 2012 00:00

MetaLUN fail after drive replacement

Last week one drive failed in a LUN that is part of a MetaLUN.
The drive in question was a 73GB drive. Since we didn't have a 73GB spare in stock, we replaced it with a 146GB drive.
The replacement drive rebuilt and all looked fine. UNTIL we restarted our CX-600 yesterday.

The MetaLUN doesn't come online; the displayed status says: shutdown.
One of the MetaLUN components (1 out of 4 LUNs) went unowned.
Apparently this is a problem in the FLARE version:


CX200, CX400, CX600
LUNs may fail to assign (go unowned) after a drive replacement, a hot spare equalize where the hot spare is a different size than the drive it replaced, and then a trespass to the peer SP. (112437)

Fixed in Revision: 02.07.010

Solution: Correct defect in code that determines the rebuild checkpoint following drive replacement.


What to do?

Can I replace the drive in question with a 73GB spare without losing anything?
Will everything start working again?


Thanks for helping!

25 Posts

January 11th, 2012 14:00

Well, I reported this issue yesterday. Today they still didn't have a solution.
That's why I started to reproduce the problem on another LUN, to see what was going to happen.
After the successful test, we decided to go for it, and replaced the 146GB disk of the MetaLUN in question with a 73GB disk.

The owner of the LUN changed immediately back to its original owner and the status of the MetaLUN changed from shutdown to degraded.
After the rebuild, the MetaLUN was working again!

1.5K Posts

January 10th, 2012 00:00

I'd strongly recommend opening a service request with EMC or getting in touch with your local support personnel.

25 Posts

January 10th, 2012 00:00

So I cannot swap the drive for a 73GB disk, so that the system starts rebuilding onto it?

25 Posts

January 10th, 2012 01:00

I stated the wrong problem ID. The correct one is:

CX200, CX400, CX600
LUNs may fail to assign (go unowned) after a drive replacement, a hot spare equalize where the hot spare is a different size than the drive it replaced, and then a trespass to the peer SP. (112437)

Fixed in Revision: 02.07.010

Solution: Correct defect in code that determines the rebuild checkpoint following drive replacement.

25 Posts

January 10th, 2012 08:00

Can anyone help me..?

2 Posts

January 10th, 2012 11:00

Try restarting the management server on both SPs. If the ownership still doesn't change, replace the 146GB drive with a 73GB one. If it still doesn't work, then the only option is to call EMC.

thanks

kranthi

2 Posts

January 10th, 2012 13:00

If you have a hot spare then there is no trouble; you can swap the disks.

25 Posts

January 10th, 2012 13:00

Well, the question is whether swapping the disk would make the situation even worse.
I restarted the SPs a couple of times, but the ownership doesn't change.

25 Posts

January 11th, 2012 03:00

I tried to reproduce the problem on another LUN without any data on it, which consists of a 2-disk RAID 1.

Disks are 73GB in size. LUN was running fine.

When I removed one of the 2 disks from the array, the hot spare kicked in and started rebuilding the disk.
Then I placed a 146GB disk back into the disk array, and let the hot spare copy over its data.
The LUN was fine again.. UNTIL...... I rebooted the CLARiiON.

After the reboot the LUN showed Owner: N/A, so we got another unowned LUN. The problem apparently only occurs after a reboot. When I took the 146GB disk out, the hot spare didn't kick in at all. It didn't start rebuilding. This time, when I reinserted the 73GB disk, it started to rebuild on the new disk drive right away. Also, the owner changed immediately back to the original owner! So the LUN is owned again after replacing the 146GB disk with a 73GB disk.

The question now is: is it safe to do the same with the MetaLUN from the beginning of this discussion, which contains a lot of important data..?!

4.5K Posts

January 11th, 2012 13:00

It should work, but to be honest, I can't recall a case with this exact problem. When you use a larger disk to replace a smaller disk, FLARE will only use the space the size of the original disk. It would be best to have engineering review this to be sure, as you have important data on the MetaLUN. Please open a case with support and have them open a case with engineering to review.
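
As a rough illustration of the sizing rule mentioned above (only the original drive's capacity is used on a larger replacement), here is a minimal Python sketch. The function names are hypothetical and this is not FLARE code or any EMC tool; it only does the arithmetic for the 73GB/146GB case in this thread.

```python
def usable_capacity_gb(original_gb: int, replacement_gb: int) -> int:
    """Capacity assumed to be used on the replacement drive.

    Assumption taken from the thread: a larger replacement drive is only
    used up to the size of the drive it replaced.
    """
    if replacement_gb < original_gb:
        raise ValueError("replacement drive is smaller than the original")
    return original_gb


def unused_capacity_gb(original_gb: int, replacement_gb: int) -> int:
    """Capacity left idle on the replacement drive under the same assumption."""
    return replacement_gb - usable_capacity_gb(original_gb, replacement_gb)


if __name__ == "__main__":
    # The case from this thread: a 73GB drive replaced by a 146GB drive.
    print(usable_capacity_gb(73, 146), "GB used")
    print(unused_capacity_gb(73, 146), "GB unused")
```

So the extra space on the 146GB drive buys nothing here; it just sits unused.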

glen

4.5K Posts

January 11th, 2012 14:00

What was the SR# for the case? I'd like to follow up with engineering to see what they say and what you found - this would make a good support article.

glen

25 Posts

January 12th, 2012 06:00

The SR# is 45376354. The only thing the tech/engineer could tell me was "I have to investigate the problem and come back to you next day". The next day I got the same answer, even though I had given him all the information I also posted in this discussion. This problem is also known to EMC, since it's described in a release note.

Personally I think it is unacceptable to wait 3 days for what is only possibly, not even certainly, a solution to get the critical data back.

At the end of day 2 I got a call from another EMC employee saying that they couldn't find (after 2 long days!) a valid support contract. So after 2 days of pretended investigation, they couldn't help me anymore. Even if there had been a valid support contract, I don't think you can let a customer wait 3 or more days to get business-critical information back. That's why I started to reproduce the issue and tried to find a solution on my own. Thank god it all worked out as expected!

Personally I see this kind of problem as a factory defect that needs to be supported, regardless of whether there's a valid support contract. If I buy a car with one year of warranty, and after the warranty has expired its computer starts doing odd things, like accelerating when not requested, because of some bad code, the code/computer is replaced at no additional cost. I think that scenario is the same as the one described at the beginning of this discussion, since both could lead to very dangerous situations. You may expect this piece of hardware to operate properly. Anyway, this is my opinion..

That aside, the problem is solved by replacing the bigger disk with a disk of the original size. Caution: I think it's a risky operation, since the hot spare doesn't kick in for some reason. The device starts rebuilding on the new disk right away! Also, after inserting a disk of the original size, the LUN becomes accessible directly after the drive has been initialised. It needs to be said that this problem only occurs when the CLARiiON is restarted. Replacing a 73GB disk with a 146GB one works without any problems as long as the array isn't restarted!

Thanks everyone for helping on this one..
