October 5th, 2021 10:00

PE R630 - multiple hardware failures

We are trying to troubleshoot a PE R630 that either fails to boot or, if it does boot, crashes shortly afterward.

The system log shows a multitude of errors that started a few days ago including:

Fatal error bus 0 device 1 function 0
Multi-bit memory errors on DIMM_B1
Multi-bit memory errors on DIMM_A1
CPU 1 machine check error
CPU 1 has internal error
CPU 2 has internal error
Fatal error bus 3 device 0 function 0

It seems unlikely to me that all of these individual components would fail at once. I suspect the common denominator here is the system board.

Am I on the right track here? Any thoughts on how to approach this out-of-warranty server would be appreciated.

October 5th, 2021 15:00

Hello,

 

The system board is a reasonable suspect. However, because the memory controllers are on the processors, I'd be a little hesitant to replace the board just yet. If possible, I'd take the server down to a single-processor configuration (CPU1 only) with DIMM A1 only. Then power the server up and see what combination of errors you get. If you get a processor error, I would swap out the processor first. If you don't, then go ahead and swap out one of the DIMMs. The idea here is to isolate the fault, in case a failure of one of these individual components is creating the illusion of a cascaded failure, if that makes sense.
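For anyone who wants to script the triage, the decision above could be sketched roughly as follows. This is only an illustration: the function name `next_step`, the marker strings, and the returned labels are hypothetical, and the matching is based solely on the error wording quoted in the original post.

```python
def next_step(log_lines):
    """Suggest the next swap after booting with CPU1 and DIMM A1 only.

    Hypothetical helper: the marker strings are taken from the error
    text quoted in this thread, not from any Dell log specification.
    """
    cpu_markers = ("machine check", "internal error")
    for line in log_lines:
        text = line.strip().lower()
        # A processor error in the minimal configuration points at the CPU first.
        if text.startswith("cpu") and any(marker in text for marker in cpu_markers):
            return "swap processor"
    # No processor error was logged, so try a different DIMM next.
    return "swap DIMM"
```

Calling `next_step` with the lines logged after the minimal-configuration boot returns which part to swap before the next test cycle.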

 

You might also check the iDRAC to see which device it shows at bus 3, device 0, function 0 (3:0:0), and whether that device can be removed as well.
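If it helps when cross-referencing, the bus/device/function numbers in those log messages can be turned into the compact `bus:device.function` form with a small sketch like the one below. The function name `bdf_address` is hypothetical, and it assumes the numbers in the log are decimal; whether the resulting address matches the numbering your operating system uses is also an assumption to verify.

```python
import re

# Matches messages such as "Fatal error bus 3 device 0 function 0".
BDF_PATTERN = re.compile(r"bus (\d+) device (\d+) function (\d+)", re.IGNORECASE)

def bdf_address(message):
    """Return the PCI address as 'bb:dd.f' in hex, or None if no match.

    Assumption: the numbers in the log message are decimal.
    """
    match = BDF_PATTERN.search(message)
    if match is None:
        return None
    bus, device, function = (int(group) for group in match.groups())
    return f"{bus:02x}:{device:02x}.{function:x}"
```

For the two fatal errors quoted in the first post, this gives `03:00.0` and `00:01.0`.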

October 5th, 2021 16:00

Thanks for the info.  We will give your suggestions a try!
