February 15th, 2013 14:00

Dell R415 "CPU 2 machine check detected."

I just got a new R415 from the Dell Outlet site, and periodically, it's rebooting, with the following messages - any ideas?

   Fri Feb 15 2013 19:02:11 An OEM diagnostic event has occurred.
   Fri Feb 15 2013 19:02:11 An OEM diagnostic event has occurred.
   Fri Feb 15 2013 19:02:11 An OEM diagnostic event has occurred.
   Fri Feb 15 2013 19:02:11 An OEM diagnostic event has occurred.
   Fri Feb 15 2013 19:02:11 An OEM diagnostic event has occurred.
   Fri Feb 15 2013 19:02:11 An OEM diagnostic event has occurred.
   Fri Feb 15 2013 19:02:11 An OEM diagnostic event has occurred.
   Fri Feb 15 2013 19:02:11 An OEM diagnostic event has occurred.
   Fri Feb 15 2013 19:02:10 CPU 2 machine check detected.
   Fri Feb 15 2013 17:37:11 An OEM diagnostic event has occurred.
   Fri Feb 15 2013 17:37:11 An OEM diagnostic event has occurred.
   Fri Feb 15 2013 17:37:11 An OEM diagnostic event has occurred.
   Fri Feb 15 2013 17:37:11 An OEM diagnostic event has occurred.
   Fri Feb 15 2013 17:37:11 An OEM diagnostic event has occurred.
   Fri Feb 15 2013 17:37:11 An OEM diagnostic event has occurred.
   Fri Feb 15 2013 17:37:11 An OEM diagnostic event has occurred.
   Fri Feb 15 2013 17:37:11 An OEM diagnostic event has occurred.
   Fri Feb 15 2013 17:37:11 CPU 2 machine check detected.
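
For anyone who wants to measure how often this happens, the gaps between machine-check entries can be pulled out of a log like the one above. This is a minimal sketch in Python, assuming the log has been saved as text in exactly the format shown (weekday, month, day, year, time, message). The function names are made up for illustration and are not part of any Dell tool.

```python
from datetime import datetime

# Assumed timestamp layout, e.g. "Fri Feb 15 2013 19:02:10".
SEL_TIME_FORMAT = "%a %b %d %Y %H:%M:%S"

def machine_check_times(sel_text):
    """Return sorted timestamps of log lines that mention a machine check."""
    times = []
    for line in sel_text.splitlines():
        # First five fields are the timestamp, the rest is the message.
        parts = line.split(None, 5)
        if len(parts) < 6 or "machine check" not in parts[5]:
            continue
        times.append(datetime.strptime(" ".join(parts[:5]), SEL_TIME_FORMAT))
    return sorted(times)

def intervals(times):
    """Return the time differences between consecutive timestamps."""
    return [later - earlier for earlier, later in zip(times, times[1:])]

# Lines copied from the log above (OEM diagnostic entries trimmed to one).
sample = """\
Fri Feb 15 2013 19:02:10 CPU 2 machine check detected.
Fri Feb 15 2013 17:37:11 An OEM diagnostic event has occurred.
Fri Feb 15 2013 17:37:11 CPU 2 machine check detected.
"""

for gap in intervals(machine_check_times(sample)):
    print(gap)  # 1:24:59
```

For the two events in the log above, that works out to just under an hour and a half between machine checks.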

February 15th, 2013 14:00

Here's the inventory. All firmware is current according to the update utility.

Hardware Inventory

iDRAC.Embedded.1
   Model: N/A
   Firmware Version: 1.92

RAID.Slot.1-1
   Type: PERC H700 Adapter
   Firmware Version: 12.10.4-0001

CPU.Socket.2
   Model: AMD Opteron(tm) Processor 4234
   Manufacturer: AMD

CPU.Socket.1
   Model: AMD Opteron(tm) Processor 4234
   Manufacturer: AMD

DIMM.Socket.B4
   Type: DDR3 DIMM
   Size: 8192 MB
   Part Number: M393B1K70DH0-YH9

DIMM.Socket.B3
   Type: DDR3 DIMM
   Size: 8192 MB
   Part Number: M393B1K70DH0-YH9

DIMM.Socket.B2
   Type: DDR3 DIMM
   Size: 8192 MB
   Part Number: M393B1K70DH0-YH9

DIMM.Socket.B1
   Type: DDR3 DIMM
   Size: 8192 MB
   Part Number: M393B1K70DH0-YH9

DIMM.Socket.A4
   Type: DDR3 DIMM
   Size: 8192 MB
   Part Number: M393B1K70DH0-YH9

DIMM.Socket.A3
   Type: DDR3 DIMM
   Size: 8192 MB
   Part Number: M393B1K70DH0-YH9

DIMM.Socket.A2
   Type: DDR3 DIMM
   Size: 8192 MB
   Part Number: M393B1K70DH0-YH9

DIMM.Socket.A1
   Type: DDR3 DIMM
   Size: 8192 MB
   Part Number: M393B1K70DH0-YH9

Disk.Bay.0:Enclosure.Internal.0-0:RAID.Slot.1-1
   Model: ST500NM0011
   Manufacturer: ATA
   Serial Number: N/A
   Size: 499558383616 Bytes

NIC.Embedded.2-1
   Name: Broadcom NetXtreme II Gigabit Ethernet - 08:9E:01:70:F8:FC
   Permanent iSCSI MAC Address: 00:00:00:00:00:00
   Permanent MAC Address: 08:9E:01:70:F8:FC
   Current MAC Address: 08:9E:01:70:F8:FC

NIC.Embedded.1-1
   Name: Broadcom NetXtreme II Gigabit Ethernet - 08:9E:01:70:F8:FB
   Permanent iSCSI MAC Address: 00:00:00:00:00:00
   Permanent MAC Address: 08:9E:01:70:F8:FB
   Current MAC Address: 08:9E:01:70:F8:FB

Video.Embedded.1-1
   Name: MGA G200eW WPCM450
   Manufacturer: Matrox Graphics, Inc.

Disk.vFlashCard.1
   Name: No SD Card
   Capacity: N/A

PSU.Slot.1
   Model: PWR SPLY,500W,RDNT,DELTA
   Manufacturer: Dell
   Part Number: 0H318JA03
   Serial Number: CN1797224G0WS7
   Firmware Version: 03.02.31

PSU.Slot.2
   Model: PWR SPLY,500W,RDNT,DELTA
   Manufacturer: Dell
   Part Number: 0H318JA03
   Serial Number: CN1797224G0WRN
   Firmware Version: 03.02.31

Firmware Inventory

   Component                                   Firmware Version
   Dell Server BIOS 11G                        2.0.2
   Dell Lifecycle Controller                   1.5.5.27
   OS Driver Pack                              7.2.0.5
   Broadcom NetXtreme II Gigabit Ethernet 1    7.4.8
   Broadcom NetXtreme II Gigabit Ethernet 2    7.4.8
   Dell 32 Bit Diagnostics                     5158A3

February 15th, 2013 14:00

I have just noticed that the drive listed in the inventory is incorrect. It's the drive that the unit shipped with, but it has since been replaced by four 15k SAS disks. Also, here's the iDRAC6 Enterprise info:

   Attribute                             Value
   Device Type                           iDRAC6
   Hardware Version                      0.01
   Firmware Version                      1.92 (Build 05)
   Firmware Updated                      Wed Feb 6 00:03:04 2013
   RAC Time                              Fri Feb 15 22:37:43 2013
   Number of Possible Active Sessions    5
   Number of Current Sessions            0
   LAN Enabled                           Yes
   IPMI Version                          2.0

February 15th, 2013 14:00

Do you have any add-in cards installed?

What version is your BIOS at?

February 17th, 2013 10:00

Anyone?  The hardware shipped to us from the Dell outlet about a week ago.  It passes the built-in diagnostics without issue.  The only "add-on" hardware is iDRAC6 Enterprise and a PERC-H700, and all firmware has been updated to the latest available version.

February 28th, 2013 05:00

I have four Dell 11G servers that exhibit this behaviour. Two R415 and two R515 (they share similar chip-sets, so they're similar servers).

I've so far been unable to avoid these reboots. The servers can run for two days, three weeks, three months, or six months continuously, but eventually they reboot with messages similar to the ones you posted (the messages can vary, and have included a false PSU failure).

The servers were bought directly from Dell a couple of years ago. Dell support have been unable to offer a solution, but they've also effectively been unwilling to escalate the issue. I've been in contact with someone who has 30+ of these machines, having similar problems, and Dell has not offered a solution (at my last contact). He was running Windows, I'm running Linux, suggesting that the problem is below the level of operating-system or drivers.

My latest area of interest is the dmesg output:

mtrr: your CPUs had inconsistent fixed MTRR settings
mtrr: probably your BIOS does not setup all CPUs.
mtrr: corrected configuration.

If the Linux kernel has genuinely had to correct something here (I trust the kernel, so I assume it's true), then it suggests that the hardware is being misconfigured. If you can afford to remove a CPU temporarily, then it might be worth a try.

Please feel free to contact me directly if you'd like to experiment and share ideas - if you're running Linux, then I'd be interested in comparing our 'dmesg' output. I've posted details on what I've tried so far here and on a blog (see http://www.susa.net/wordpress) - you might glean some info for a head-start.
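
To make dmesg output from two machines easier to compare, the relevant lines can be filtered out first. This is a rough sketch in Python, assuming each machine's output was saved to a text file (for example with `dmesg > file`). The keyword list and function names are illustrative guesses, not a known signature for this fault.

```python
import re

# Keywords are an assumption: MTRR and machine-check related messages only.
KEYWORDS = re.compile(r"\bmtrr\b|machine check|\bmce\b", re.IGNORECASE)
# Optional "[   12.345678] " uptime prefix that dmesg may print.
TIMESTAMP = re.compile(r"^\[\s*\d+\.\d+\]\s*")

def interesting_lines(dmesg_text):
    """Return matching lines, timestamp prefix removed, in original order."""
    result = []
    for line in dmesg_text.splitlines():
        line = TIMESTAMP.sub("", line.strip())
        if KEYWORDS.search(line):
            result.append(line)
    return result

def only_in_first(first_text, second_text):
    """Lines of interest that appear in the first dump but not the second."""
    second = set(interesting_lines(second_text))
    return [line for line in interesting_lines(first_text) if line not in second]

# The mtrr messages are the ones quoted above; the timestamps and the
# final usb line are invented filler for the example.
sample_a = """\
[    0.000000] mtrr: your CPUs had inconsistent fixed MTRR settings
[    0.000000] mtrr: probably your BIOS does not setup all CPUs.
[    0.000000] mtrr: corrected configuration.
[    1.234567] usb 1-1: new high-speed USB device
"""
sample_b = "[    0.000000] mtrr: corrected configuration.\n"

for line in only_in_first(sample_a, sample_b):
    print(line)
```

Stripping the uptime prefix first means identical messages still match even when the two machines logged them at different times.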

February 28th, 2013 13:00

Ended up swapping out CPU2 and it fixed the problem...  Guess it was just a bad CPU!

February 28th, 2013 18:00

How long were the intervals between reboots on your server?

February 28th, 2013 18:00

It'd go a few days between reboots sometimes, and sometimes I'd get 3 in a day.

August 8th, 2024 14:01

Watching the power monitoring, every time the server reboots there is a high wattage peak just before it happens.
