April 11th, 2015 09:00

RAID group/LUN stuck in a transitioning state?

Hello,

I had a disk fail and it kicked over to a hot spare. I then replaced the failed disk, and it has now been stuck in a transitioning state for two days. The transitioning "T" indicator is on the hot spare disk. When I run /nas/sbin/navicli getlun -prb, it still shows 0% rebuilt on the LUN. I have since moved everything off this LUN, so it is basically empty at this point. Can I just unbind the LUN, destroy the RAID group, and start fresh, or will that make things worse?
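For reference, the commands I'm using to check on it look roughly like this (the SP address, LUN number, and disk position are placeholders for my own values):

    /nas/sbin/navicli -h <SP_IP> getlun <lun#> -prb     (percent rebuilt on the LUN)
    /nas/sbin/navicli -h <SP_IP> getdisk <b_e_d>        (full status of the hot spare disk)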

One other thing I should note: while I was waiting for the replacement disk, another disk in the same LUN/RAID group started to fail with thousands of soft errors. It has not failed outright yet, and I cannot do a proactive copy to a spare because the RAID group is still in a transitioning state.
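(By "proactive copy" I mean the classic copytohotspare operation; if I have the syntax right it's something like this, with the disk position a placeholder:

    navicli -h <SP_IP> copytohotspare <b_e_d> -initiate

but the array refuses it while the RG is transitioning.)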

I'd like to just delete the LUN, destroy the RAID group, replace all the bad disks, and then create a new RAID group and LUN.

Any help or suggestions would be appreciated.

2 Intern • 715 Posts

April 13th, 2015 04:00

Did the initial transition to the hot spare complete?

I don't think you'll be able to delete the LUN (let alone destroy the RG) while the array has such an operation in progress.

89 Posts

April 13th, 2015 08:00

The initial transition to the hot spare completed, and the bad disk faulted and un-mounted itself. Before the replacement disk arrived, a second disk in the same LUN/RG started to fail, but I could not do another proactive copy to a spare because the RG was already in a transitioning state from the first hot spare copy. Then I started getting "uncorrectable parity sector" and "parity invalidated" error messages on a third disk in the same LUN/RG. I replaced the first failed disk, and that is where I have been for the last four days: stuck in a transitioning rebuild back to the replaced disk. Now I'm also getting "uncorrectable sector" errors on the brand-new replacement disk I just put in.
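For what it's worth, I've been watching the per-disk error counters with getdisk (the disk position here is just an example):

    /nas/sbin/navicli -h <SP_IP> getdisk 0_0_8

which dumps the disk state along with the soft/hard read and write error counts.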

I'm pretty sure the whole process is stuck because it can't transition back from the hot spare, given the second bad disk and the parity errors on the RG (RAID 5). At this point I'm going to try just pulling out the transitioning hot spare and see what happens, since this is an empty LUN/RG and I don't care if it's lost.

4.5K Posts

April 20th, 2015 10:00

Once the second disk in the same RAID group failed, all rebuild processing on the LUNs stopped. RAID 5 can tolerate only a single failure; a second one stops everything. This is called a double-faulted RAID group, and it is a serious issue.
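You should be able to confirm this from the CLI; something like the following (RG number is a placeholder) reports the RaidGroup State along with the disks in the group:

    naviseccli -h <SP_IP> getrg <rg#>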

You need to open a service request with EMC. They will need current spcollects (if you haven't already gathered them).
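If you need to generate them from the CLI, the usual sequence is roughly the following, run against each SP (the retrieve step pulls whatever file name the array generated):

    naviseccli -h <SP_IP> spcollect
    naviseccli -h <SP_IP> managefiles -list
    naviseccli -h <SP_IP> managefiles -retrieve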

glen

27 Posts

April 20th, 2015 12:00

You can delete the LUN and RAID group without a problem; the hot spare will just become available again.
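Something like this, with your own LUN and RG numbers (the -o just skips the confirmation prompt):

    naviseccli -h <SP_IP> unbind <lun#> -o
    naviseccli -h <SP_IP> removerg <rg#>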

2 Intern • 715 Posts

April 20th, 2015 15:00

Andre Leite wrote:

You can delete the LUN and RAID group without a problem; the hot spare will just become available again.

Even while the array (rightly or wrongly) is displaying a transition operation in progress?

2 Intern • 715 Posts

April 20th, 2015 16:00

fl wrote:

This is what I had to do to fix the problem.

Well played

89 Posts

April 20th, 2015 16:00

This is what I had to do to fix the problem.

1) Unbind ALL the hot spare LUNs, because when I pulled the transitioning hot spare, a new transition would start on another hot spare, even one that was not the same capacity or speed as the failed disk.

2) Pull the last remaining hot spare disk. This puts the RAID group into a faulted state.

3) Now I could unbind the LUN and destroy the RAID group.

4) Put all the hot spare disks back in and re-bind them.

5) Create a new RAID group and LUN after replacing all the bad disks. (A rough CLI sketch of the whole sequence is below.)
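For anyone who hits this later, the classic CLI equivalents would be roughly as follows. The SP address, LUN/RG numbers, and disk positions are all placeholders from my setup, so treat this as a sketch rather than gospel:

    # 1) Unbind every hot spare LUN (repeat per hot spare)
    naviseccli -h <SP_IP> unbind <hs_lun#> -o
    # 2) Physically pull the last transitioning hot spare disk
    # 3) Unbind the data LUN and destroy the RAID group
    naviseccli -h <SP_IP> unbind <lun#> -o
    naviseccli -h <SP_IP> removerg <rg#>
    # 4) Reseat the hot spare disks and re-bind each as a hot spare
    naviseccli -h <SP_IP> createrg <hs_rg#> <b_e_d>
    naviseccli -h <SP_IP> bind hs <hs_lun#> -rg <hs_rg#>
    # 5) Replace the bad disks, then create the new RAID group and LUN
    naviseccli -h <SP_IP> createrg <rg#> <b_e_d> <b_e_d> <b_e_d> <b_e_d> <b_e_d>
    naviseccli -h <SP_IP> bind r5 <lun#> -rg <rg#>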

Good thing I saw this coming and migrated all the data off this LUN and RAID group, so I didn't lose anything other than my hair.

27 Posts

April 21st, 2015 02:00

Hi Brett, the answer is yes; the array will stop the transition operation.

I've done it many times. We have over 150 CLARiiONs/VNX1s/VNX2s here; double faults are not common, but they happen from time to time.

2 Intern • 715 Posts

April 21st, 2015 05:00

Thanks Andre, good to know in case I ever come across that situation.

PS: That's a lot of arrays. I'm jealous!

September 27th, 2015 14:00

I think I'm in the same situation. I had one drive fail, and while replacing it I accidentally pulled the wrong drive. I put that drive back in, then replaced the bad drive. Both drives have now been in the transitioning state for two days. I have only Navisphere Express, and when viewing the status in that utility it states that disk 8 and disk 9 are transitioning. To get more detail I installed naviseccli and ran the getall command. In that output, disks 8 and 9 are in a powering-up state and show 100 for percent rebuilt. I'm thinking nothing is actually happening with these drives; no data is being transitioned.
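In case it helps, this is roughly what I ran and the fields I'm going by (SP address is a placeholder; the field names are from memory and may differ slightly by revision):

    naviseccli -h <SP_IP> getall
    ...
    State:          Powering Up
    Prct Rebuilt:   100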

I have never supported a SAN before; what should I do at this point?

4.5K Posts

September 29th, 2015 12:00

This may be due to the failed LCC that you have. You could try powering the array off and back on; sometimes the disks can get stuck in the powering-up state. But the bad LCC might be preventing this.

Also, check the version of software running on the array. The latest version is 22.712, which fixes a lot of issues with disks and the SAS interface. If you do have a failed LCC, you'll need to get that fixed before you can upgrade.

In the getall output there should be a section for "Faults"; there you should be able to see whether anything is faulted.
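If you'd rather not wade through the whole getall output, something like this should work, though the exact syntax can vary by revision (SP address is a placeholder):

    naviseccli -h <SP_IP> faults -list     (summary of faulted components)
    naviseccli -h <SP_IP> getagent         (shows the software revision)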

glen
