PowerEdge R320 Failed Disk

Question

We've got a PowerEdge R320 which had an amber disk drive and was displaying the following amber alert: 'PDR1001 Fault Detected on Drive 0 in disk drive bay 1. Check disk'. We replaced the drive with a like-for-like unit while the server was switched on. The drive is now displaying a green light, but the message is still on the server display. Could someone point me in the right direction on how to get rid of this message? Do I have to do something else after replacing the drive? I can't figure out what to do next.

Daniel My · Answer

Hello

If you hot swap to the same slot a rebuild will typically initiate automatically. If a rebuild does not initiate automatically you can start a rebuild by setting the replacement disk as a hot spare on most of our controllers. Disk replacement procedures can vary between controllers. More information should be in the manual.

http://www.dell.com/storagecontrollermanuals/

Status messages will usually clear in a short amount of time. If the status persists after the issue is resolved then you may need to clear the hardware logs. Clearing the hardware logs will get rid of inactive errors that are still being reported on the LCD. You may want to save the log first. I would wait a few hours after the rebuild completes to see if the LCD error clears.

You can clear/view hardware logs and check the status of the storage controller from OpenManage Server Administrator. You can download OMSA from the system support page. The version you will want to install on the server is the managed node or (node). Documentation can be found on the OMSA support page.

http://www.dell.com/support/

http://www.dell.com/openmanagemanuals/

Thanks

CallMeMonks · Answer

I'm guessing it has rebuilt because when I go to the disk configuration the replaced disk has an online status rather than a status where I can assign it as a hot spare.

It's been a couple of days since I've replaced the drive so I'll try clearing the logs to see what happens. I don't remember seeing the option to clear logs on OMSA but I'll have another look.

CallMeMonks · Answer

The error message is still displayed on the server's display after clearing the log. Any ideas?

I cleared the logs from OpenManage Server Administrator (System > Logs > Clear Log), is this the correct method?

Daniel My · Answer

Yes, that is the correct log. If the error persists then it is an active error. If you cannot find information about the error in OMSA then you will need to review the controller log. You can pull a controller log using OMSA if the controller has logging capabilities. You may also be able to find more information in the alert log, which is on the same page as the hardware log. You can try saving/clearing that log as well.

If the error persists after all logs are cleared then it is an active error being reported by something.
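For anyone who prefers the command line to the OMSA web interface, the same view/clear/export steps can be sketched roughly as below. This is only a sketch: the `omreport` and `omconfig` invocations are my assumption of the OMSA CLI syntax and should be checked against the CLI guide for the installed version, `controller=0` is an assumed controller ID, and the `COMMANDS` table and `run` helper are hypothetical names made up for illustration. By default nothing is executed; the command line is only returned so it can be reviewed first.

```python
import subprocess

# Assumed OMSA CLI invocations (verify against the OMSA CLI guide).
# "controller=0" is an assumed controller ID, not taken from this thread.
COMMANDS = {
    "view_hardware_log": ["omreport", "system", "esmlog"],
    "view_alert_log": ["omreport", "system", "alertlog"],
    "clear_hardware_log": ["omconfig", "system", "esmlog", "action=clear"],
    "clear_alert_log": ["omconfig", "system", "alertlog", "action=clear"],
    "export_controller_log": [
        "omconfig", "storage", "controller", "action=exportlog", "controller=0",
    ],
}


def run(name: str, dry_run: bool = True) -> str:
    """Return the command line for `name`, or run it when dry_run is False."""
    argv = COMMANDS[name]
    if dry_run:
        # Nothing is executed; the caller just gets the text of the command.
        return " ".join(argv)
    # Requires OMSA to be installed and its CLI tools to be on PATH.
    result = subprocess.run(argv, capture_output=True, text=True, check=True)
    return result.stdout
```

Saving the output of the two view commands before running the clear commands keeps a copy of the logs, in line with the earlier advice to save the log first.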

CallMeMonks · Answer

Tried everything and still can't get rid of the error. I have checked the logs and nothing is being reported. Cleared all the logs. Updated the firmware and drivers and still getting an amber alert on the server's display!

Could the drive I've replaced it with also be faulty? Although it's not reporting any errors in the RAID configuration and states that it is ONLINE. This is driving me mad.

Daniel My · Answer

Maybe, what RAID controller are you using?

CallMeMonks · Answer

It's a PERC H710 Mini controller.

CallMeMonks · Answer

I couldn't copy and paste the log file to a text sharing site as I kept getting an error message about the size being too big, but I have uploaded the text file to the URL below:

https://ufile.io/r8qmu

Hope that helps...

If not I can copy and paste it on here if that works.

Daniel My · Answer

You should be able to go to the storage section in OMSA and export a controller log. It should be in the controller task menu. That log will provide the most details about the storage devices.

If you want to share the log for the community to review then please use a text sharing site and provide a URL.

Daniel My · Answer

That site does not load for me. No, please do not paste logs directly into the forum. If the file is too large then I would split it up. It is just a text file, so it should not be difficult to split it into multiple uploads.
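A minimal sketch of one way to split the file, assuming the log is plain UTF-8 text. The `split_log` name and the `max_bytes` parameter are hypothetical; set `max_bytes` to whatever limit the paste site enforces. Splitting happens on line boundaries so no log entry is cut in half.

```python
def split_log(text: str, max_bytes: int) -> list[str]:
    """Split text into chunks of whole lines, each at most max_bytes in UTF-8.

    A single line longer than max_bytes is kept whole in its own chunk,
    so that one chunk can exceed the limit.
    """
    chunks: list[str] = []
    current: list[str] = []
    size = 0
    for line in text.splitlines(keepends=True):
        line_size = len(line.encode("utf-8"))
        # Start a new chunk when adding this line would exceed the limit.
        if current and size + line_size > max_bytes:
            chunks.append("".join(current))
            current, size = [], 0
        current.append(line)
        size += line_size
    if current:
        chunks.append("".join(current))
    return chunks


# Example use (file names are placeholders):
# from pathlib import Path
# parts = split_log(Path("controller.log").read_text(errors="replace"), 500_000)
# for i, part in enumerate(parts, start=1):
#     Path(f"controller_part{i}.log").write_text(part)
```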

CallMeMonks · Answer

OK, finally managed to find a site that would allow me to paste the entire text file. The link is below: http://oneclickpaste.com/1432/ One long log file. Hope this works.

Daniel My · Answer

It looks like it cut off the log, most likely when it hit the file limit. The controller logs the last 10k lines, but there are fewer than 2k lines in the log. The part of the log that is showing is time stamped in 2017. I checked pastebin and their file limit is 512KB. I pulled a controller log and it is just under 1MB. The bottom of the log is the newest, so I would just copy the bottom 33% of the log and upload that.
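Copying the bottom third by hand is fine, but it can also be sketched in a few lines of Python. This assumes the log is plain text with the newest entries last, as described above; `tail_fraction` is a hypothetical helper name, and the fraction is rounded to a whole number of lines.

```python
def tail_fraction(text: str, fraction: float = 1 / 3) -> str:
    """Return roughly the last `fraction` of the lines in text."""
    lines = text.splitlines(keepends=True)
    if not lines:
        return ""
    # Keep at least one line so a very short log still returns something.
    keep = max(1, round(len(lines) * fraction))
    return "".join(lines[len(lines) - keep:])


# Example use (file names are placeholders):
# from pathlib import Path
# newest = tail_fraction(Path("controller.log").read_text(errors="replace"))
# Path("controller_tail.log").write_text(newest)
```

If the result is still over the site's size limit, a smaller fraction can be passed in.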

Daniel My · Answer

I don't see any issues since the rebuild. The LCD error appears to be inaccurate. If the error persists after clearing all of the logs then I would perform these actions, in order, until it clears: restart iDRAC, restart the server, disconnect power to clear NVRAM.

If the issue persists after doing all of that then make sure all hardware is up to date.

CallMeMonks · Answer

OK... third attempt. Just gone through the log file. I've picked out three events which might give the best results.

https://pastebin.com/m56q3iHY - Events from November when the drive first reported a failure.

https://pastebin.com/ZE3VuFei - Events from January when the drive was replaced.

https://pastebin.com/yuFQ77PK - The latest log entry from today.

CallMeMonks · Answer

Thank you. I'll try and update and restart the server and see what that does. I'll give an update when I've done that.
