
7 Posts


July 29th, 2020 11:00

EqualLogic PS4100E

Have 3 SANs:

- PS4100E

- PS4100X

- PS6100Z

Moved all three SANs from one site to another - moved the internet circuit and switches also (same equipment, different physical location).

Two SANs came up and online right away - the PS4100X and the PS6100Z.

The PS4100E is powered on and the disk lights light up, but there is no connection (currently "offline").

Tried to power down again, then bring back up - fail

Tried to power down, remove drives, power up, insert disks - fail

Tried to remove controller 0 (top) - see if controller 1 (bottom) would pick up activity - fail

Tried to remove controller 1 (bottom) - see if controller 0 (top) would pick up activity - fail

No activity lights on the network ports on either controller - nothing on the mgmt ports - I don't know if the mgmt ports were ever set up (inherited these SANs with no documentation).

I don't know if there is a way to force the SAN online, or to connect to it while it's offline.

Any thoughts? (Other than burning it down - I've had it.)

1 Rookie • 1.5K Posts

July 29th, 2020 14:00

Hello, 

  You will need to connect to the serial port of the active controller.  This will allow you to see how far along the boot process went. 

Until the array boots up completely, the network won't be initialized. The serial port is the console for the array, like you have on a switch: 9600 baud, no parity, 8 data bits, 1 stop bit, NO FLOW CONTROL. No hardware or software flow control should be used.
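If you are connecting from a Linux box instead of PuTTY, something like this matches those settings (/dev/ttyUSB0 is just an example device name - check dmesg for your adapter):

# 9600 baud, 8 data bits, no parity, 1 stop bit, no HW/SW flow control
stty -F /dev/ttyUSB0 9600 cs8 -parenb -cstopb -crtscts -ixon -ixoff
# -L logs the session to a file, which makes capturing the boot text easy
screen -L /dev/ttyUSB0 9600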

Generally speaking, with SANs it's not a good idea to just power off and power on repeatedly.

 Before you moved them, did you initiate a shutdown of the arrays first? 

  Once you get connected and powered up make sure you capture the output of the boot process.  Please copy the TEXT of that here, not a screenshot of the boot up. 

I suspect these arrays are no longer covered by contracts? If so, depending on where you are, you can open a one-time case for a fee and they will help you triage and potentially resolve the issue.

 But make sure you can successfully connect to the array first. 

 Regards, 

Don 

 

7 Posts

July 30th, 2020 07:00

So I was able to get the console/serial cable connected (found it). Was able to log in to the primary (active) controller on the PS4100E. It says that the storage array is still initializing - limited commands - try later.

Connected to the secondary controller just to confirm - it said connected to secondary - limited commands.

Moved serial back to primary -

Saw message/error -

"logger daemon is losing messages because offline disks are generating more events than the daemon can handle"

Tried to run "show member" or "member show" - neither did anything, guessing because of the limited commands.

Currently having it run "diag" to file - 

Is there a way to get a status message other than "still initializing"?

I ask because it has been "initializing" for over 12 hrs now. There's just a few servers' worth of data that I would like to get from there without doing a restore from backup. This SAN was one of the larger ones.

 

(thanks for the guidance) 

1 Rookie • 1.5K Posts

July 30th, 2020 08:00

Hello, 

Re: status. No, because there is a problem with the RAIDset, and all services/functions require the RAIDset to be online - which includes the group and member configuration database. Hence "show member", etc., won't work.

 CLI>support exec "raidtool" 

 Please send that output 

 Regards, 

Don 

7 Posts

July 30th, 2020 09:00


CLI> support exec "raidtool"
You are running a support command, which is normally restricted to PS Series Technical Support personnel. Do not use a support command without instruction from Technical Support.
Driver Status: Ok
RAID LUN 0 Faulted Beyond Recovery.
11 Drives (?,7,2,?,5,?,?,11,4,?,?)
RAID 5 (64KB sectPerSU)
Capacity 9,601,932,984,320 bytes
RAID LUN 1 Faulted Beyond Recovery.
12 Drives (?,10,?,?,?,?,?,?,?,?,?,?)
RAID 6 (64KB sectPerSU)
Capacity 9,601,932,984,320 bytes
Available Drives List: 8
CLI> RAID LUN 0 Faulted Beyond Recovery.

1 Rookie • 1.5K Posts

July 30th, 2020 10:00

Hello, 

 Well, that's a very unusual output.   Basically it's saying most of those drives are not online. 

Please run CLI>support exec "diskview -j"  and paste the output here. 

 I would then power down the array. Remove all the drives and inspect the connectors.  Then re-insert them and make sure they are fully seated and power up the array again. 

Re-run those two commands.

 Regards,

Don

 

 

1 Rookie • 1.5K Posts

July 30th, 2020 12:00

Hello, 

The problem with using images is they don't show up immediately, so I have no idea what you saw.

However, when you say you "replaced bad" disks, did you just replace them with new drives all at once? 

If you were running RAID6 you can do that with two failed drives, but not three in the same RAIDset. You replace a bad drive, let the RAIDset rebuild, then replace the next one, etc. Wholesale replacement of drives will likely leave the RAID faulted beyond recovery.
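To put rough numbers on it - a RAID6 set can reconstruct data from any N-2 of its members, so for a 12-drive set like your LUN 1:

# data is recoverable as long as at least N-2 = 10 members are present
echo $(( 12 - 2 ))   # 10: up to two drives can be out at once and still rebuild
echo $(( 12 - 3 ))   # 9: pulling three at once drops below 10 - faulted beyond recovery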

None of the 'show' commands or similar will work until the RAIDset is recovered and the array is fully booted up.

 Regards,

Don

7 Posts

July 30th, 2020 12:00

I shut the SAN down using the command.

Turned off the SAN, pulled the power cords. Removed the drives (in order), used compressed air shots on the drives, re-inserted the drives (in order). Powered on the SAN, let it run through boot up, waited 15 mins, connected back to serial and re-ran the disk check - found 3 failed drives. Replaced 2 out of 3 failed drives; didn't have a third spare drive. Ran the check again - shows 11 drives in total online, 1 offline. Disk 7 has 27 bad blocks - the only one showing anything.

Noticed during bootup: [screenshot redphil_2-1596138473884.png]

 

Before changing drives: [screenshot redphil_1-1596138419831.png]

 

After replacing drives: [screenshot redphil_0-1596138330271.png]

 

 

Still showing as initializing when checking with the "show" command.

 

7 Posts

July 30th, 2020 13:00

I replaced 1 drive, waited about 15 mins (how long it took in the past to rebuild a drive), then did the other drive. Have not had more than 1 fail at a time; normally when one fails I swap it out within 24 hrs, then get a new spare.

This SAN was set to RAID 5.

1 Rookie • 1.5K Posts

July 30th, 2020 13:00

Hello, 

You should never just assume the RAID rebuild finished, since another drive that is failing can slow down the rebuild process. Always check the status, especially with RAID5, since it is the least resilient RAID level.
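Put concretely, a RAID5 set can reconstruct data from any N-1 of its members, so its tolerance is exactly one drive - for an 11-drive set like your LUN 0:

# data survives as long as at least N-1 = 10 members are present
echo $(( 11 - 1 ))   # 10: one failure leaves the set degraded but rebuildable
echo $(( 11 - 2 ))   # 9: a second failure during the rebuild and the set faults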

 Regards,

Don

 

 

1 Rookie • 1.5K Posts

July 30th, 2020 13:00

Hello, 

OK, the images finally showed up. The array was faulted, so replacing the drives would not have resolved it - only a DEGRADED RAID condition can be recovered by replacing the failed disk.

You need to put back the disks that were there from the start. One of those drives is the key drive - it's the last one that failed. Until that comes back online, your RAID and array will be down.

Given the issues, I would suggest you send the drives out to be recovered by a third-party data recovery service. They will clone all the drives just in case. If they are successful, the RAIDset will come back online and the array will complete the startup process. Then you will have to run an integrity test on all your volumes, e.g. chkdsk/fsck.
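For example (the drive letter and device name are placeholders - point these at your actual volumes):

# Windows guest volumes, from an elevated prompt:
chkdsk E: /f
# Linux guest volumes - unmount first; -n does a read-only check before you commit to repairs:
fsck -n /dev/sdb1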

 Regards,

Don

 

1 Rookie • 1.5K Posts

July 30th, 2020 15:00

Hello,

 The EQL array doesn't care about slot assignments.  The metadata on each drive is all that's required. 

 You can move drives around without concern. 

 Don

21 Posts

July 30th, 2020 15:00

I think your movers dropped it. If it was healthy prior to the move, then it likely experienced a major shock or drop in transit.

 

You have to go back to your backup images at this point. I am not 100 percent certain if the EqualLogic array recognizes or cares whether a drive is plugged into the wrong slot. Maybe they got mixed up at some point during the move or your efforts to fix this. You might try switching the bad drives around, and if they show healthy in another slot then they got switched in transit (the movers dropped it, the drives fell out, and they were put back into the wrong slots without telling anyone).

 

 

 

1 Rookie • 1.5K Posts

July 31st, 2020 08:00

Hello, 

Glad to hear you are making progress. I would suggest RAID6 over RAID10: more space and more redundancy. With RAID10 you can still go offline if both drives in a mirror pair fail - I have seen that happen.
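For a rough sense of the space difference on a 12-bay array (600 GB drives are an assumption - substitute your actual size):

# RAID6: capacity of N-2 drives, survives ANY two simultaneous failures
echo $(( (12 - 2) * 600 ))   # 6000 GB usable
# RAID10: half the drives are mirror copies; fails if both halves of one pair die
echo $(( (12 / 2) * 600 ))   # 3600 GB usable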

Thank you for the update. I look forward to hearing what happens.

Regards,

Don

7 Posts

July 31st, 2020 08:00

I was the mover, and I didn't drop it.

Sorry, I should have been a little more clear - when I say I waited 15 mins and then put in the next drive, I was checking to make sure the disk was online. Using the view disk command, I would see the status change from empty to initializing to rebuilding to ready/online. Sorry for not being clearer on that.

I was able to get a little extra support from a third-party vendor - they did a lot of the same plus a little more. They were seeing a few issues with the drives and saw a ghost LUN in the RAID 6 set. They/we were able to remove the ghost LUN, then tried a few other things and noticed a few drives (6) were starting to rack up bad sectors/faults - I had to order 4 more drives.

I am/was concerned about the data, but I have backups of it - just not every server's backup was up to date. The critical stuff was; minor changes and some servers didn't get updated due to lack of priority. Currently restoring the VMs/data to another SAN - I should have just enough room to bring them up. By that I mean I would have about 1.5 TB left in the storage tank until the new drives are here and I can rebuild the SAN and storage pool into RAID 6 or 10 (leaning toward 10).

 

Thanks for the help - I will update as I move forward - probably next week. 
