Unsolved

5 Posts

May 26th, 2020 03:00

Poweredge R-610 random crashes

Hi.
I bought a PowerEdge R610 server in July 2019. I installed Debian 10 Buster on it and everything was working great.
I had no issues at all. But for several months now, I've been having problems. The server randomly crashes and I don't know why.
I didn't change anything in the hardware, BIOS, iDRAC or Lifecycle Controller. I haven't installed any new driver or anything hardware-related on Debian either.


But here's the issue.
The server randomly crashes and then doesn't reboot, because it waits for me to press F1 or F2.
boot.png

So if it crashes in the night or when I'm not available, the server is unusable until I press F1, or reboot it.

Here's the error message iDRAC reported to me.
error.png

 

It mentions OEM diagnostics and "CPU 2 machine check detected", then the server crashes with the message "A fatal IO error detected on a component at", which doesn't tell me what the actual issue is.


 

Usually when the server crashes like this, the LCD panel says "e1715 fatal i/o error. review & clear sel", but as I write this, the current message is different, and I don't know why.

Here it is: lcd error2.png

What's also weird is that the server may run for 2 days, 2 weeks or a whole month without any issue, and then BAM: out of nowhere, it crashes with "A fatal IO error detected on a component at". So it's really random.
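
To check whether the crashes follow any pattern, the System Event Log can be exported as text and the gaps between crash entries computed. Below is a minimal Python sketch, assuming a pipe-separated export similar to what `ipmitool sel elist` prints (id | date | time | sensor | event | state). The function names and the sample lines are made up for illustration; they are not real entries from this server.

```python
from datetime import datetime

# Keywords taken from the messages quoted in this thread.
KEYWORDS = ("machine check", "fatal io error", "fatal i/o error")


def parse_sel_line(line):
    """Split one pipe-separated SEL line into (timestamp, description).

    Assumed layout: id | MM/DD/YYYY | HH:MM:SS | sensor | event | state.
    Returns None for lines that do not match that layout.
    """
    fields = [f.strip() for f in line.split("|")]
    if len(fields) < 5:
        return None
    try:
        stamp = datetime.strptime(fields[1] + " " + fields[2], "%m/%d/%Y %H:%M:%S")
    except ValueError:
        return None
    return stamp, " | ".join(fields[3:])


def crash_events(lines):
    """Return sorted timestamps of SEL entries that contain a crash keyword."""
    events = []
    for line in lines:
        parsed = parse_sel_line(line)
        if parsed and any(k in parsed[1].lower() for k in KEYWORDS):
            events.append(parsed[0])
    return sorted(events)


def gaps_in_days(stamps):
    """Days between consecutive crash timestamps."""
    return [(b - a).total_seconds() / 86400 for a, b in zip(stamps, stamps[1:])]


if __name__ == "__main__":
    # Made-up sample lines, NOT taken from the real server.
    sample = [
        "1 | 05/01/2020 | 02:10:00 | Processor | CPU 2 machine check detected | Asserted",
        "2 | 05/03/2020 | 09:00:00 | Power Supply | Presence detected | Asserted",
        "3 | 05/15/2020 | 02:10:00 | Critical Interrupt | A fatal IO error detected | Asserted",
    ]
    print(gaps_in_days(crash_events(sample)))
```

If the gaps cluster around a particular interval or time of day, that points towards a scheduled job or load pattern; truly irregular gaps fit better with a marginal hardware component.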

Here is some information about the machine, as reported by iDRAC:

BIOS Version 6.6.0
Firmware Version 2.92 (Build 05)
Lifecycle Controller Firmware 1.7.5.4

 

I would like to point out that I know a little bit about PC hardware and software, but I don't have much knowledge of servers, so at the moment I don't know what to do.

I already tried googling a lot of things and updated the PERC controller, but I couldn't solve the issue.
Could someone help me here, please?



Thank you

Moderator


8.4K Posts

May 26th, 2020 12:00

Laizrod,

 

Would you confirm which expansion cards, if any, are installed? If any are installed, you may want to remove them and see if the issue persists. For the time being, until we locate the issue, you can enter the BIOS and, at the bottom of the main page, disable the F1/F2 prompt.

 

Lastly, would you private message me your email contact details, so that I can give you steps to run a DSET report on the server and send it to me?

 

Thank you.

 

5 Posts

May 27th, 2020 04:00

Hi

Thanks for your answer.
I'm not sure about it. I may be wrong, but I think the only expansion cards installed are the basic ones that come with the machine.
Unfortunately, I don't have physical access to the server yet. I have to wait until the weekend to get to it and confirm.

Thanks for the suggestion, I will do that.

Alright, I'm sending you a private message.

 

Thank you

5 Posts

June 1st, 2020 03:00

Hi.
I checked today, so here are the installed cards:
I have the iDRAC6 of course, and 2 riser cards installed in B and U:

IMG_0505.jpg

On the 'U' riser card, nothing has been installed. However on the 'B' riser card, there's my RAID controller.

So in that case, what do you suggest please?

5 Posts

April 5th, 2021 05:00

This problem was solved several months ago by updating some things and installing XCP-ng instead of Debian 10.
But now, after 5-6 months of running without any issue, I am facing this problem once again. The server crashed and I got the same error message.

If someone can help me, thanks

Moderator


8.4K Posts

April 5th, 2021 07:00

Laizrod,

 

Since the BIOS and iDRAC are up to date, would you verify whether the other devices installed in the server are up to date as well?

I would start there.

 

Let me know.

 

 

Moderator


8.4K Posts

April 5th, 2021 08:00

You can use the Server Update Utility found here - https://dell.to/2PB9W6X

Let me know how it goes.

 

5 Posts

April 5th, 2021 08:00

Yes, I will do that and let you know.
Is there any tool provided by Dell to check if each device is up to date?
That would save me some time.

Thanks

42 Posts

April 7th, 2021 11:00

Are you getting core dumps that you can analyze?
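
If there are no core dumps, a saved kernel log from before the crash may still contain machine check reports. Here is a rough Python sketch that filters such lines out of a log file. The search patterns are my assumption about typical Linux kernel wording, and the script and function names are hypothetical. A hard crash may also leave nothing in the log at all.

```python
import re
import sys

# Assumed wording of kernel hardware-error messages; adjust to what your log shows.
PATTERN = re.compile(r"machine check|hardware error|mce:", re.IGNORECASE)


def hardware_error_lines(lines):
    """Return the lines that look like machine check / hardware error reports."""
    return [line.rstrip("\n") for line in lines if PATTERN.search(line)]


if __name__ == "__main__":
    # Usage (hypothetical script name): python3 scan_log.py /path/to/saved/kernel.log
    if len(sys.argv) > 1:
        with open(sys.argv[1], errors="replace") as handle:
            for hit in hardware_error_lines(handle):
                print(hit)
```

On a systemd machine with a persistent journal, the input file could be produced with `journalctl -k -b -1 > kernel.log`, which dumps the kernel messages of the previous boot.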

In my experience with random crashes, the causes were usually hardware-related in one way or another. In no particular order:
1. Bad/failing power supply (e.g. T3500, 2 of them thus far)
2. Bad memory -- memtest86 can detect bad modules
3. Bad caps on motherboard -- a couple of older Optiplexes and several other, non-Dell machines. A thorough visual inspection can usually find them, and they can usually be replaced relatively easily with the right equipment, though that may not always make much sense, especially if you can't solder
4. Cracked lead-free solder under a surface-mounted chip -- an older SFF Optiplex and an older HP laptop were suffering from this. Difficult to diagnose and even more difficult to repair.

There are of course a number of other possible causes; these are just the ones I had to deal with at one time or another.
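
For the memory item, something that can be checked without taking the server offline: on Linux, the EDAC subsystem exposes per-memory-controller error counters under sysfs. Below is a minimal Python sketch that reads them. It assumes an EDAC driver for the platform is loaded; whether that is the case on this R610, or under XCP-ng, is not something I can confirm. The function name is made up. This is not a replacement for a memtest86 run.

```python
from pathlib import Path


def edac_counts(root="/sys/devices/system/edac/mc"):
    """Return {controller: (corrected, uncorrected)} from EDAC sysfs counters."""
    counts = {}
    base = Path(root)
    if not base.is_dir():
        return counts  # EDAC driver not loaded or not supported
    for mc in sorted(base.glob("mc[0-9]*")):
        try:
            ce = int((mc / "ce_count").read_text().strip())
            ue = int((mc / "ue_count").read_text().strip())
        except (OSError, ValueError):
            continue  # skip controllers whose counters cannot be read
        counts[mc.name] = (ce, ue)
    return counts


if __name__ == "__main__":
    result = edac_counts()
    if not result:
        print("No EDAC memory controllers found")
    for name, (ce, ue) in result.items():
        print(f"{name}: corrected={ce} uncorrected={ue}")
```

A corrected-error count that keeps growing on one controller would be a hint to look at the DIMMs behind it; all zeros do not rule out bad memory.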
