October 5th, 2021 10:00
PE R630 - multiple hardware failures
We are trying to troubleshoot a PE R630 that either fails to boot or, if it does boot, crashes shortly afterward.
The system log shows numerous errors that started a few days ago, including:
Fatal error bus 0 device 1 function 0
Multi-bit memory errors on DIMM_B1
Multi-bit memory errors on DIMM_A1
CPU 1 machine check error
CPU 1 has internal error
CPU 2 has internal error
Fatal error bus 3 device 0 function 0
It seems unlikely to me that all of these individual components would fail at once, so I suspect the common denominator here is the system board.
Am I on the right track here? Any thoughts on how to approach this (out of warranty) server would be appreciated.
Dell-DylanJ
October 5th, 2021 15:00
Hello,
The system board is a reasonable suspect. However, because the memory controllers are on the processors, I'd be a little hesitant to replace the board just yet. If possible, I'd take the server down to a single-processor configuration (CPU1 only) with only DIMM A1 installed. Then power the server up and see what combination of errors you get. If you get a processor error, I would swap out the processor first. If you don't, then go ahead and swap out one of the DIMMs. The idea here is to isolate the fault, in case a failure of one of these individual components is creating the illusion of a cascading failure, if that makes sense.
You might also check the iDRAC to see which device it shows at 3:0:0 (bus 3, device 0, function 0), and whether that device can be removed as well.
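If it helps while working through the isolation steps, here is a minimal Python sketch that groups the log entries by the component they name, so you can compare which components still report errors after each hardware change. It assumes the log lines are pasted in as plain text in the same wording quoted in the original post (this is not an iDRAC export format), and that the bus, device, and function numbers are decimal. The function names `component_of` and `group_by_component` are hypothetical and only for illustration.

```python
import re
from collections import defaultdict

# Log lines exactly as quoted in the original post (pasted by hand).
LOG_LINES = [
    "Fatal error bus 0 device 1 function 0",
    "Multi-bit memory errors on DIMM_B1",
    "Multi-bit memory errors on DIMM_A1",
    "CPU 1 machine check error",
    "CPU 1 has internal error",
    "CPU 2 has internal error",
    "Fatal error bus 3 device 0 function 0",
]

PCI_RE = re.compile(r"bus (\d+) device (\d+) function (\d+)")
DIMM_RE = re.compile(r"DIMM_([A-Z]\d+)")
CPU_RE = re.compile(r"CPU (\d+)")


def component_of(line):
    """Return a short label for the component a log line refers to."""
    m = PCI_RE.search(line)
    if m:
        # Assumption: the numbers in the log are decimal.
        bus, dev, fn = (int(g) for g in m.groups())
        return f"PCI {bus:02x}:{dev:02x}.{fn:x}"
    m = DIMM_RE.search(line)
    if m:
        return f"DIMM {m.group(1)}"
    m = CPU_RE.search(line)
    if m:
        return f"CPU {m.group(1)}"
    return "unknown"


def group_by_component(lines):
    """Map each component label to the list of log lines that mention it."""
    groups = defaultdict(list)
    for line in lines:
        groups[component_of(line)].append(line)
    return dict(groups)


if __name__ == "__main__":
    for comp, entries in group_by_component(LOG_LINES).items():
        print(f"{comp}: {len(entries)} event(s)")
```

Running the same grouping on the log after each change (CPU1 only, DIMM A1 only, then each swap) makes it easier to see whether the errors follow a particular part or stay with the board.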
DukeB
October 5th, 2021 16:00
Thanks for the info. We will give your suggestions a try!