February 15th, 2013 14:00
Dell R415 "CPU 2 machine check detected."
I just got a new R415 from the Dell Outlet site, and it's periodically rebooting with the following messages. Any ideas?
| Fri Feb 15 2013 19:02:11 | An OEM diagnostic event has occurred. |
| Fri Feb 15 2013 19:02:11 | An OEM diagnostic event has occurred. |
| Fri Feb 15 2013 19:02:11 | An OEM diagnostic event has occurred. |
| Fri Feb 15 2013 19:02:11 | An OEM diagnostic event has occurred. |
| Fri Feb 15 2013 19:02:11 | An OEM diagnostic event has occurred. |
| Fri Feb 15 2013 19:02:11 | An OEM diagnostic event has occurred. |
| Fri Feb 15 2013 19:02:11 | An OEM diagnostic event has occurred. |
| Fri Feb 15 2013 19:02:11 | An OEM diagnostic event has occurred. |
| Fri Feb 15 2013 19:02:10 | CPU 2 machine check detected. |
| Fri Feb 15 2013 17:37:11 | An OEM diagnostic event has occurred. |
| Fri Feb 15 2013 17:37:11 | An OEM diagnostic event has occurred. |
| Fri Feb 15 2013 17:37:11 | An OEM diagnostic event has occurred. |
| Fri Feb 15 2013 17:37:11 | An OEM diagnostic event has occurred. |
| Fri Feb 15 2013 17:37:11 | An OEM diagnostic event has occurred. |
| Fri Feb 15 2013 17:37:11 | An OEM diagnostic event has occurred. |
| Fri Feb 15 2013 17:37:11 | An OEM diagnostic event has occurred. |
| Fri Feb 15 2013 17:37:11 | An OEM diagnostic event has occurred. |
| Fri Feb 15 2013 17:37:11 | CPU 2 machine check detected. |
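For anyone wanting to track how often this happens, here is a rough Python sketch that pulls the "machine check" rows out of a pasted log like the one above and works out the gap between them. The names `SEL_TEXT`, `parse_sel` and `machine_check_times` are made up for illustration, the sample keeps only one OEM row per reboot for brevity, and the timestamp format is assumed to match the rows shown here (English day and month names).

```python
from datetime import datetime

# Sample rows copied from the log above (one OEM row per reboot kept for brevity).
SEL_TEXT = """\
| Fri Feb 15 2013 19:02:11 | An OEM diagnostic event has occurred. |
| Fri Feb 15 2013 19:02:10 | CPU 2 machine check detected. |
| Fri Feb 15 2013 17:37:11 | An OEM diagnostic event has occurred. |
| Fri Feb 15 2013 17:37:11 | CPU 2 machine check detected. |
"""

def parse_sel(text):
    """Return (timestamp, message) tuples for each well-formed log row."""
    events = []
    for line in text.splitlines():
        # Split on pipes and drop the empty cells left by leading/trailing pipes.
        cells = [c.strip() for c in line.split("|")]
        cells = [c for c in cells if c]
        if len(cells) < 2:
            continue
        try:
            stamp = datetime.strptime(cells[0], "%a %b %d %Y %H:%M:%S")
        except ValueError:
            continue  # not a timestamped row; skip it
        events.append((stamp, cells[1]))
    return events

def machine_check_times(events):
    """Return the timestamps of machine check rows, oldest first."""
    return sorted(t for t, msg in events if "machine check" in msg.lower())

events = parse_sel(SEL_TEXT)
checks = machine_check_times(events)
# Time between each machine check and the next one.
gaps = [later - earlier for earlier, later in zip(checks, checks[1:])]
for gap in gaps:
    print("time between machine checks:", gap)
```

Because the parser ignores empty cells, it should also cope with rows that still carry stray trailing pipes from a copy and paste.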



DominoTree
February 15th, 2013 14:00
Here's the inventory. All firmware is current according to the update utility.
Hardware Inventory
| Component | Attribute | Value |
| iDRAC.Embedded.1 | Model | N/A |
| | Firmware Version | 1.92 |
| RAID.Slot.1-1 | Type | PERC H700 Adapter |
| | Firmware Version | 12.10.4-0001 |
| CPU.Socket.2 | Model | AMD Opteron(tm) Processor 4234 |
| | Manufacturer | AMD |
| CPU.Socket.1 | Model | AMD Opteron(tm) Processor 4234 |
| | Manufacturer | AMD |
| DIMM.Socket.B4 | Type | DDR3 DIMM |
| | Size | 8192 MB |
| | Part Number | M393B1K70DH0-YH9 |
| DIMM.Socket.B3 | Type | DDR3 DIMM |
| | Size | 8192 MB |
| | Part Number | M393B1K70DH0-YH9 |
| DIMM.Socket.B2 | Type | DDR3 DIMM |
| | Size | 8192 MB |
| | Part Number | M393B1K70DH0-YH9 |
| DIMM.Socket.B1 | Type | DDR3 DIMM |
| | Size | 8192 MB |
| | Part Number | M393B1K70DH0-YH9 |
| DIMM.Socket.A4 | Type | DDR3 DIMM |
| | Size | 8192 MB |
| | Part Number | M393B1K70DH0-YH9 |
| DIMM.Socket.A3 | Type | DDR3 DIMM |
| | Size | 8192 MB |
| | Part Number | M393B1K70DH0-YH9 |
| DIMM.Socket.A2 | Type | DDR3 DIMM |
| | Size | 8192 MB |
| | Part Number | M393B1K70DH0-YH9 |
| DIMM.Socket.A1 | Type | DDR3 DIMM |
| | Size | 8192 MB |
| | Part Number | M393B1K70DH0-YH9 |
| Disk.Bay.0:Enclosure.Internal.0-0:RAID.Slot.1-1 | Model | ST500NM0011 |
| | Manufacturer | ATA |
| | Serial Number | N/A |
| | Size | 499558383616 Bytes |
| NIC.Embedded.2-1 | Name | Broadcom NetXtreme II Gigabit Ethernet - 08:9E:01:70:F8:FC |
| | Permanent iSCSI MAC Address | 00:00:00:00:00:00 |
| | Permanent MAC Address | 08:9E:01:70:F8:FC |
| | Current MAC Address | 08:9E:01:70:F8:FC |
| NIC.Embedded.1-1 | Name | Broadcom NetXtreme II Gigabit Ethernet - 08:9E:01:70:F8:FB |
| | Permanent iSCSI MAC Address | 00:00:00:00:00:00 |
| | Permanent MAC Address | 08:9E:01:70:F8:FB |
| | Current MAC Address | 08:9E:01:70:F8:FB |
| Video.Embedded.1-1 | Name | MGA G200eW WPCM450 |
| | Manufacturer | Matrox Graphics, Inc. |
| Disk.vFlashCard.1 | Name | No SD Card |
| | Capacity | N/A |
| PSU.Slot.1 | Model | PWR SPLY,500W,RDNT,DELTA |
| | Manufacturer | Dell |
| | Part Number | 0H318JA03 |
| | Serial Number | CN1797224G0WS7 |
| | Firmware Version | 03.02.31 |
| PSU.Slot.2 | Model | PWR SPLY,500W,RDNT,DELTA |
| | Manufacturer | Dell |
| | Part Number | 0H318JA03 |
| | Serial Number | CN1797224G0WRN |
| | Firmware Version | 03.02.31 |
Firmware Inventory
| Component | Firmware Version |
| Dell Server BIOS 11G | 2.0.2 |
| Dell Lifecycle Controller | 1.5.5.27 |
| OS Driver Pack | 7.2.0.5 |
| Broadcom NetXtreme II Gigabit Ethernet 1 | 7.4.8 |
| Broadcom NetXtreme II Gigabit Ethernet 2 | 7.4.8 |
| Dell 32 Bit Diagnostics | 5158A3 |
DominoTree
February 15th, 2013 14:00
I have just noticed that the drive listed in the inventory is incorrect. It's the drive that the unit shipped with; however, it's been replaced by four 15k SAS disks. Also, here's the iDRAC6 Enterprise info:
| Attribute | Value |
| Device Type | iDRAC6 |
| Hardware Version | 0.01 |
| Firmware Version | 1.92 (Build 05) |
| Firmware Updated | Wed Feb 6 00:03:04 2013 |
| RAC Time | Fri Feb 15 22:37:43 2013 |
| Number of Possible Active Sessions | 5 |
| Number of Current Sessions | 0 |
| LAN Enabled | Yes |
| IPMI Version | 2.0 |
theflash1932
February 15th, 2013 14:00
Do you have any add-in cards installed?
What version is your BIOS at?
DominoTree
February 17th, 2013 10:00
Anyone? The hardware shipped to us from the Dell outlet about a week ago. It passes the built-in diagnostics without issue. The only "add-on" hardware is iDRAC6 Enterprise and a PERC-H700, and all firmware has been updated to the latest available version.
kevsan
February 28th, 2013 05:00
I have four Dell 11G servers that exhibit this behaviour. Two R415 and two R515 (they share similar chip-sets, so they're similar servers).
I've so far been unable to avoid these reboots. They can run for two days, three weeks, three months, or six months continuously, but eventually they reboot with messages similar to the ones you posted (the messages can vary, and sometimes include a false PSU failure).
The servers were bought directly from Dell a couple of years ago. Dell support have been unable to offer a solution, but they've also effectively been unwilling to escalate the issue. I've been in contact with someone who has 30+ of these machines, having similar problems, and Dell has not offered a solution (at my last contact). He was running Windows, I'm running Linux, suggesting that the problem is below the level of operating-system or drivers.
My latest area of interest is in the dmesg output:
mtrr: your CPUs had inconsistent fixed MTRR settings
mtrr: probably your BIOS does not setup all CPUs.
mtrr: corrected configuration.
If the Linux kernel has genuinely had to correct something here (I trust the kernel, so I assume it's true), then it suggests that the hardware is being misconfigured. If you can afford to remove a CPU temporarily, then it might be worth a try.
Please feel free to contact me directly if you'd like to experiment and share ideas - if you're running Linux, then I'd be interested in comparing our 'dmesg' output. I've posted details on what I've tried so far here and on a blog (see http://www.susa.net/wordpress) - you might glean some info for a head-start.
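If it helps with comparing output between machines, here is a small Python sketch that picks the mtrr lines out of saved dmesg text. The function name `mtrr_lines` and the sample text are only illustrative: the three mtrr lines are the ones quoted above, while the timestamp prefix and the unrelated "Booting" line are made-up examples of what else a dump might contain.

```python
def mtrr_lines(dmesg_text):
    """Return the mtrr messages found in dmesg output, in order."""
    found = []
    for line in dmesg_text.splitlines():
        body = line.strip()
        # Drop an optional "[   0.123456] " style timestamp prefix.
        if body.startswith("[") and "]" in body:
            body = body.split("]", 1)[1].strip()
        if body.lower().startswith("mtrr:"):
            found.append(body)
    return found

# Sample input: the mtrr lines quoted above plus hypothetical surrounding lines.
SAMPLE = """\
[    0.000000] Booting (hypothetical unrelated line)
[    0.512345] mtrr: your CPUs had inconsistent fixed MTRR settings
mtrr: probably your BIOS does not setup all CPUs.
mtrr: corrected configuration.
"""

hits = mtrr_lines(SAMPLE)
for hit in hits:
    print(hit)
```

Saving the output of `dmesg` to a file on each server and running the text through this gives two short lists that can be compared side by side.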
DominoTree
February 28th, 2013 13:00
Ended up swapping out CPU2 and it fixed the problem... Guess it was just a bad CPU!
kevsan
February 28th, 2013 18:00
How long were the intervals between reboots on your server?
DominoTree
February 28th, 2013 18:00
It'd go a few days between reboots sometimes, and sometimes I'd get 3 in a day.
SnakeSkull
August 8th, 2024 14:01
Watching the power monitoring, every time the server reboots there is a high wattage peak just before it happens.