My in-house EMC guy is away and I have what may be a very basic issue.
A long power outage drained our UPSes, causing a dirty shutdown of the SAN. Now that it is powered back on, I see the following error message under 'Storage' > 'Systems' in the GUI:
"Storage processor is ready to restore"
Other than that, our SAN's status is "OK". However, if I run 'nas_storage -c -a' in the CLI I see the following. I assume the root disk failover is due to the SPA failover?
Discovering storage (may take several minutes)
Error 5017: storage health check failed
CKM00095100510 SPA is failed over
CKM00095100510 root_disk is failed over
CKM00095100510 d51 is faulted/removed
CKM00095100510 d52 is faulted/removed
I did have a failed disk '0' before the power outage, which has since been replaced but doesn't want to rebuild; that may be because SPA is failed over.
Any help would be greatly appreciated, guys. I am moving servers off this SAN, but I need it to keep running in a healthy state for now.
Like you said, typically that means LUNs have been trespassed to the non-default SP, but the fact that you are having issues with drive rebuilds could indicate something more serious. I would open a ticket with support.
From those errors, I would assume that there is still a drive failure issue...
Can you provide the failed disk's location? How was the drive replaced? When a drive fails, its data gets copied over to a hot spare (HS). Do you know if the transition completed before the unexpected shutdown? Usually when a drive is replaced, it gets rebuilt via the HS or the RAID group. You can find the current state of the drive by issuing:
/nas/sbin/navicli -h SPA getdisk (bus)_(enclosure)_(disk) -state -type -rb
e.g. /nas/sbin/navicli -h SPA getdisk 1_1_2 -state -type -rb
(I would do the same for the HS location that the data was copied to.)
Then you can compare the results to see the current state of things.
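As a minimal sketch, here is one way to pull the rebuild percentage out of that getdisk output. The sample text below is hypothetical, for illustration only; the real field names in the getdisk output may differ, and in practice you would pipe the live command output in:

```shell
#!/bin/sh
# Hypothetical getdisk output for illustration only; the actual
# /nas/sbin/navicli getdisk output format may differ slightly.
sample='State: Rebuilding
Prct Rebuilt: 42'

# Extract the rebuild percentage from the output.
rb=$(printf '%s\n' "$sample" | awk -F': ' '/Prct Rebuilt/ { print $2 }')
echo "Rebuild is ${rb}% complete"
```

On a live system you would replace the sample with the real command output, e.g. `/nas/sbin/navicli -h SPA getdisk 1_1_2 -state -rb | awk ...`.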
As for the trespassed LUN, I would double-check against the LUNs that are affected and ensure there are no other hardware failures aside from that single drive, then issue:
/nas/sbin/navicli -h SPA getlun -trespass (to determine which LUNs are trespassed)
nas_storage -l (to determine the primary backend storage from the NAS perspective)
nas_storage -failback id=1 (the id is indicated by the command above)
nas_storage -c -a (you will probably still get a failure due to the drive)
I would still recommend opening a case with EMC since the array was shut down unexpectedly and you may encounter some dirty cache, although the EMC SPS on the Celerra should give the cache enough time to destage to disk.
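To make the trespass check easier to eyeball, here is a small sketch that parses the output format getlun -trespass prints and flags any LUN whose current owner differs from its default owner. The sample text is hard-coded here for illustration; in practice you would pipe the live command output in:

```shell
#!/bin/sh
# Sample text mirroring the getlun -trespass output format;
# replace with: /nas/sbin/navicli -h SPA getlun -trespass
sample='LOGICAL UNIT NUMBER 0
Default Owner: SP A
Current owner: SP B'

# Flag every LUN whose current owner is not its default owner.
result=$(printf '%s\n' "$sample" | awk '
  /LOGICAL UNIT NUMBER/ { lun = $NF }
  /Default Owner/       { def = $NF }
  /Current owner/       { cur = $NF
                          if (cur != def)
                            print "LUN " lun " trespassed: default SP" def ", current SP" cur }
')
printf '%s\n' "$result"
```

For the sample above this prints `LUN 0 trespassed: default SPA, current SPB`; an empty result means no LUNs are trespassed.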
Thanks for the replies,
The disk that failed was removed, and one of the HSs had fully taken over while we waited for the new disk to arrive.
$ /nas/sbin/navicli -h SPA getlun -trespass
LOGICAL UNIT NUMBER 0
Default Owner: SP A
Current owner: SP B
$ nas_storage -l
id acl name serial_number
1 0 CKM00095100510 CKM00095100510
If I do go ahead and restore this, will the servers running from this storage temporarily lose their connection to the SAN?
The support contract on this SAN expired last month; I didn't renew it because I am moving everything off this SAN over the next several weeks. Just my luck!
LUN 0 is the control LUN for the NAS side; you should be able to trespass it back to the default SP with little (depending on IO) or no impact at all.
It should be transparent to the host, like dynamox mentioned above.
If I may recommend the safest way: log in to Unisphere (GUI), double-click (or right-click > Properties) on the connected storage system, and click Restore SP. LUN 0 will be trespassed back to the default SP, and it should be transparent to the server/data mover.
Don't forget to mark your status as answered if this is resolved, and mark any post that was helpful to you, so others can reference it if they are in the same situation. Happy Friday!