I bought a PowerEdge R610 server in July 2019. I installed Debian 10 Buster on it and everything was working great.
I had no issue at all. But then, for several months now, I've been having problems. The server now randomly crashes and I don't know why.
I didn't change anything in hardware, BIOS, iDRAC or Lifecycle. I haven't installed any new driver or something hardware-related on Debian either.
But here's the issue.
The server randomly crashes and then doesn't reboot because it waits for me to press F1 or F2.
So if it crashes in the night or when I'm not available, the server is unusable until I press F1, or reboot it.
Here's the error message iDRAC reported to me.
It's about OEM diagnostics and "CPU 2 machine check detected", then the server crashes with this "A fatal IO error detected on a component at" message which doesn't let me know what's the actual issue.
Usually when the server crashes like this, the LCD panel says "e1715 fatal i/o error. review & clear sel", but I don't know why, when I am writing these lines, the current message is different.
Here it is
What's also weird is that the server may run 2 days, 2 weeks or a complete month without any issue, and then BAM. Out of nowhere, it crashes with "A fatal IO error detected on a component at". So it's really random.
Here is some information about the machine given by iDRAC:
Firmware Version: 2.92 (Build 05)
Lifecycle Controller Firmware: 126.96.36.199
I would like to point out that I know a little bit about PC hardware and software, but I don't know much about servers, so at the moment I don't know what to do.
I already tried googling a lot of things and updated the PERC controller, but I couldn't solve the issue.
Could someone help me here, please?
Would you confirm which expansion cards, if any, are installed? If there are any, you may want to remove them and see if the issue persists. For the time being, until we locate the issue, you can access the BIOS and disable the F1/F2 prompt at the bottom of the main page.
Lastly, would you private message me your email contact details, so that I can give you steps to run and send a DSET report on the server?
Thanks for your answer.
I'm not sure about it. I may be wrong, but I think the only expansion cards installed are the basic ones that come with the machine.
Unfortunately, I don't have physical access to the server yet. I have to wait until the weekend to check and confirm.
Thanks for the suggestion, I will do that.
Alright, I'm sending you a private message.
I checked today, so here are the installed cards:
I have the iDRAC6 of course, and 2 riser cards installed on B and U:
On the 'U' riser card, nothing has been installed. However on the 'B' riser card, there's my RAID controller.
So in that case, what do you suggest please?
This problem was solved several months ago by updating some things and installing XCP-ng instead of Debian 10.
But now, after 5-6 months running without any issue, I am once again facing this problem. The server crashed and I got the same error message.
If someone can help me, thanks.
Since the BIOS and iDRAC are up to date, would you verify that the other devices installed in the server are up to date as well?
I would start there.
Let me know.
You can use the Server Update Utility found here - https://dell.to/2PB9W6X
Let me know how it goes.
Are you getting core dumps that you can analyze?
In my experience with random crashes, the causes were usually hardware related in one way or another. In no particular order:
1. Bad/failing power supply (e.g. T3500, 2 of them thus far)
2. Bad memory -- memtest86 can detect bad modules
3. Bad caps on motherboard -- a couple of older Optiplexes and several other, non-Dell machines. A thorough visual inspection can usually find them and they can usually be replaced relatively easily with the right equipment, though that may not always make much sense especially if you can't solder
4. Cracked lead-free solder under a surface-mounted chip -- an older SFF Optiplex and an older HP laptop were suffering from this. Difficult to diagnose and even more difficult to repair.
There are of course a number of other possible causes, these are the ones I had to deal with at one time or another.
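Since the LCD says to review the SEL, one thing that may help narrow this down is tallying which sensors show up in the log and how far apart the crash days are. Below is a rough Python sketch of that idea, not a definitive tool. It assumes the SEL has been exported as pipe-separated text lines (id | date | time | sensor | event | state); the sample entries, the sensor names and the function names parse_sel and summarize are all made up for illustration, so adjust the parsing to whatever your export actually looks like.

```python
from collections import Counter
from datetime import datetime

# Hypothetical SEL export: these lines are invented for illustration only.
# Assumed layout: id | date | time | sensor | event | state
SAMPLE_SEL = """\
1 | 03/02/2020 | 02:14:07 | Processor CPU 2 | Machine check | Asserted
2 | 03/02/2020 | 02:14:09 | Critical Interrupt | Fatal IO error | Asserted
3 | 03/19/2020 | 17:40:51 | Processor CPU 2 | Machine check | Asserted
4 | 03/19/2020 | 17:40:52 | Critical Interrupt | Fatal IO error | Asserted
"""


def parse_sel(text):
    """Return a list of (timestamp, sensor, event) tuples from SEL text."""
    events = []
    for line in text.splitlines():
        fields = [f.strip() for f in line.split("|")]
        if len(fields) < 5:
            continue  # skip blank or malformed lines
        try:
            when = datetime.strptime(
                fields[1] + " " + fields[2], "%m/%d/%Y %H:%M:%S"
            )
        except ValueError:
            continue  # skip entries without a usable timestamp
        events.append((when, fields[3], fields[4]))
    return events


def summarize(events):
    """Count events per sensor and compute gaps (in days) between crash days."""
    by_sensor = Counter(sensor for _, sensor, _ in events)
    days = sorted({when.date() for when, _, _ in events})
    gaps = [(later - earlier).days for earlier, later in zip(days, days[1:])]
    return by_sensor, gaps


if __name__ == "__main__":
    by_sensor, gaps = summarize(parse_sel(SAMPLE_SEL))
    for sensor, count in by_sensor.most_common():
        print(f"{sensor}: {count} event(s)")
    print("Days between crash days:", gaps)
```

If the same sensor (for example the CPU 2 machine check) precedes every fatal IO error, that points more toward one component than toward something truly random.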