
Last reply by 04-07-2021 Unsolved

PowerEdge R610 random crashes

Hi.
I bought a PowerEdge R610 server in July 2019. I installed Debian 10 Buster on it and everything was working great.
I had no issues at all. But for several months now, I've been having problems: the server randomly crashes and I don't know why.
I didn't change anything in the hardware, BIOS, iDRAC or Lifecycle Controller. I haven't installed any new driver or anything hardware-related on Debian either.

But here's the issue.
The server randomly crashes and then doesn't reboot, because it waits for me to press F1 or F2.
boot.png

So if it crashes in the night or when I'm not available, the server is unusable until I press F1, or reboot it.

Here's the error message iDRAC reported to me.
error.png

 

It mentions OEM diagnostics and "CPU 2 machine check detected", then the server crashes with the message "A fatal IO error detected on a component at", which doesn't tell me what the actual issue is.

 

Usually when the server crashes like this, the LCD panel says "e1715 fatal i/o error. review & clear sel". As I write this, though, the current message is different, and I don't know why.

Here it is: lcd error2.png

What's also weird is that the server may run for 2 days, 2 weeks or a whole month without any issue, and then BAM: out of nowhere, it crashes with "A fatal IO error detected on a component at". So it's really random.
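Since the LCD message asks to review the SEL, one way to see which events recur and how far apart they are is to tally an exported System Event Log. The following is only a minimal sketch: it assumes the log has been saved as pipe-delimited text in the layout `ipmitool sel elist` prints (id | date | time | sensor | event | direction), and the sample entries are hypothetical, not taken from this server.

```python
from collections import Counter
from datetime import datetime


def parse_sel(text):
    """Parse pipe-delimited SEL lines into (timestamp, sensor, event) tuples."""
    entries = []
    for line in text.splitlines():
        fields = [f.strip() for f in line.split("|")]
        if len(fields) < 5:
            continue  # not an event line
        try:
            stamp = datetime.strptime(
                f"{fields[1]} {fields[2]}", "%m/%d/%Y %H:%M:%S"
            )
        except ValueError:
            continue  # skip entries without a parseable date and time
        entries.append((stamp, fields[3], fields[4]))
    return entries


def summarize(entries):
    """Count events per (sensor, event) and list whole-day gaps between them."""
    counts = Counter((sensor, event) for _, sensor, event in entries)
    stamps = sorted(stamp for stamp, _, _ in entries)
    gaps_days = [(b - a).days for a, b in zip(stamps, stamps[1:])]
    return counts, gaps_days


# Hypothetical sample, NOT taken from the server in this thread.
SAMPLE = """\
1 | 01/03/2021 | 02:14:07 | Processor #0x61 | IERR | Asserted
2 | 01/03/2021 | 02:14:08 | Critical Interrupt #0x18 | Bus Fatal Error | Asserted
3 | 01/19/2021 | 23:40:51 | Critical Interrupt #0x18 | Bus Fatal Error | Asserted
"""

if __name__ == "__main__":
    counts, gaps = summarize(parse_sel(SAMPLE))
    for (sensor, event), n in counts.most_common():
        print(f"{n}x {sensor}: {event}")
    print("days between events:", gaps)
```

If the same sensor shows up in every crash, that would at least point to one component rather than something truly random.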

Here is some information about the machine, as reported by iDRAC:

BIOS Version: 6.6.0
Firmware Version: 2.92 (Build 05)
Lifecycle Controller Firmware: 1.7.5.4

 

I would like to point out that I know a little bit about PC hardware and software, but I don't have much knowledge of servers, so at the moment I don't know what to do.

I have already tried googling a lot of things and updated the PERC controller, but I couldn't solve the issue.
Could someone help me here, please?

Thank you

Replies (8)

Laizrod,

 

Would you confirm what, if any, expansion cards are installed? If any are installed, you may want to remove them and see if the issue persists. For the time being, until we locate the issue, you can access the BIOS and, at the bottom of the main page, disable the F1/F2 prompt.

 

Lastly, would you private message me your email contact details, so that I can give you steps to run and send a DSET report for the server?

 

Thank you.

 


Chris Hawk
Social Media and Communities Professional
Dell Technologies | Enterprise Support Services
#Iwork4Dell

Did I answer your query? Please click on ‘Accept as Solution’
‘Kudo’ the posts you like!

Hi

Thanks for your answer.
I'm not sure about it. I may be wrong, but I think the only expansion cards installed are the basic ones that come with the machine.
Unfortunately, I don't have physical access to the server yet. I have to wait until the weekend to get to it and confirm.

Thanks for the suggestion, I will do that.

Alright, I'm sending you a private message.

 

Thank you


Hi.
I checked today, so here are the installed cards:
I have the iDRAC6 of course, and 2 riser cards installed on B and U:

IMG_0505.jpg

On the 'U' riser card, nothing has been installed. However on the 'B' riser card, there's my RAID controller.

So in that case, what do you suggest please?


This problem was solved several months ago by updating some things and installing XCP-ng instead of Debian 10.
But now, after 5-6 months running without any issue, I am facing this problem once again. The server crashed and I got the same error message.

If someone can help me, thanks


Laizrod,

 

Since the BIOS and iDRAC are up to date, would you verify that the other devices installed in the server are up to date as well?

I would start there.

 

Let me know.

 

 


Chris Hawk

Yes I will do that and let you know.
Is there any tool provided by Dell to check if each device is up to date?
That would save me some time.

Thanks


You can use the Server Update Utility found here - https://dell.to/2PB9W6X

Let me know how it goes.

 


Chris Hawk

Are you getting core dumps that you can analyze?

In my experience with random crashes, the causes were usually hardware-related in one way or another. In no particular order:
1. Bad/failing power supply (e.g. T3500, 2 of them thus far)
2. Bad memory -- memtest86 can detect bad modules
3. Bad caps on the motherboard -- a couple of older Optiplexes and several other, non-Dell machines. A thorough visual inspection can usually find them, and they can usually be replaced relatively easily with the right equipment, though that may not always make sense, especially if you can't solder.
4. Cracked lead-free solder under a surface-mounted chip -- an older SFF Optiplex and an older HP laptop were suffering from this. Difficult to diagnose and even more difficult to repair.

There are of course a number of other possible causes; these are the ones I had to deal with at one time or another.
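If there are no core dumps, it may still be worth checking whether the OS logged a hardware error before the crash. The following is only a rough sketch: it assumes the kernel log has been saved to a text file, the pattern is just a guess at what such lines commonly contain (the "[Hardware Error]" tag or the words "machine check"), and the sample lines are hypothetical.

```python
import re
import sys

# Case-insensitive match on "[Hardware Error]" or the words "machine check".
HW_ERROR = re.compile(r"\[hardware error\]|machine check", re.IGNORECASE)


def find_hardware_errors(lines):
    """Return the log lines that look like hardware or machine check reports."""
    return [line.rstrip("\n") for line in lines if HW_ERROR.search(line)]


# Hypothetical sample lines, NOT taken from the server in this thread.
SAMPLE_LOG = [
    "Jan 03 02:14:05 host kernel: mce: [Hardware Error]: Machine check events logged\n",
    "Jan 03 02:14:06 host kernel: EXT4-fs (sda1): re-mounted. Opts: errors=remount-ro\n",
    "Jan 03 02:14:07 host systemd[1]: Started Daily apt download activities.\n",
]

if __name__ == "__main__":
    # Scan the file given as the first argument, or the sample if none is given.
    if len(sys.argv) > 1:
        with open(sys.argv[1], encoding="utf-8", errors="replace") as handle:
            hits = find_hardware_errors(handle)
    else:
        hits = find_hardware_errors(SAMPLE_LOG)
    for hit in hits:
        print(hit)
```

The input could be, for example, the output of `journalctl -k -b -1` saved to a file (kernel messages from the previous boot, which requires a persistent journal). A hard crash may leave nothing in the log at all, so an empty result does not rule out hardware.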
