
Unsolved



March 8th, 2017 04:00

CPU0704 CPU1/CPU2 Machine check error detected on Front LCD Panel

We have an R730 that has been power-cycling on its own, with a recurring amber front LCD message: "CPU0704 CPU1 (and CPU2) Machine check error detected. Power cycle system."

This first occurred after the initial Windows Server 2012 R2 load, Dell BIOS/driver/firmware updates, McAfee, and Windows updates. We initially saw this about once a week, but it has been occurring more and more frequently, currently several times a day.

In the Windows System Event Log these register as Kernel-Power failures, and the Lifecycle Controller logs show multiple (dozens of) instances of "An OEM diagnostic event occurred." within a couple of seconds, preceded by the same CPU1/CPU2 machine check error.
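(For anyone else chasing the same entries: below is a rough Python sketch of pulling the SEL entries off the iDRAC over the Redfish API and filtering for the machine check / OEM diagnostic messages. The "Sel" log service name is the usual one on iDRAC, but it varies by iDRAC generation, so list /redfish/v1/Managers/iDRAC.Embedded.1/LogServices first if the path below returns 404. The IP and credentials are placeholders.)

# Rough sketch: list iDRAC SEL entries over Redfish and filter for the
# machine check / OEM diagnostic events described above. Some firmware
# returns only links in "Members"; if so, GET each member individually.
import requests
from requests.auth import HTTPBasicAuth

IDRAC = "https://192.0.2.10"             # placeholder iDRAC address
AUTH = HTTPBasicAuth("root", "calvin")   # placeholder credentials
SEL = IDRAC + "/redfish/v1/Managers/iDRAC.Embedded.1/LogServices/Sel/Entries"

resp = requests.get(SEL, auth=AUTH, verify=False)  # iDRAC certs are usually self-signed
resp.raise_for_status()
for entry in resp.json().get("Members", []):
    msg = entry.get("Message", "")
    if "machine check" in msg.lower() or "diagnostic" in msg.lower():
        print(entry.get("Created"), msg)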

Upon contacting Dell support we were asked several times to provide a DSET or TSR (despite a search showing that DSET has been retired and TSR has been replaced with SupportAssist). Due to our environment limitations, the only way we can connect externally is via proxy, and iDRAC doesn't support proxy configurations. We also tried running the Dell-recommended Support Live Image, only to have the system reboot before diagnostics could complete.

Our requests for CPU and motherboard replacements via dispatch, referencing the work already done on the initial Service Request, were denied; we were asked to update the BIOS and selected firmware (which we had already done) and to provide a DSET yet again.

We're currently in communication with support to let them know we've already done all of this, and we feel we've exhausted our options at this point. A search of the forums indicates that BIOS updates or CPU replacements have resolved this in most cases, but our BIOS is already current. Just wondering if anyone else has come across this issue and has any recommendations.

Moderator

 • 

8.8K Posts

March 8th, 2017 08:00

Kaizenphilo, 

Would you clarify which processors you have installed? Are they the Intel E5-26xx v4 or v3 versions? If so, would you also verify in the BIOS whether C-states are enabled?

Let me know and we can go from there. 

March 10th, 2017 04:00

Hi Chris,

Thanks for the response. They are the Intel E5-2643 v4, and we're using the default Performance per Watt (DAPC) System Profile setting; C-states are enabled.

Moderator

 • 

8.8K Posts

March 10th, 2017 05:00

Thank you. It sounds like a BIOS update to resolve this is in the works, but there is a temporary workaround we can try until it is released.

Try the following:

1. Change the BIOS System Profile to Performance.

2. Then select the Custom profile, and set C States to Disabled and C1E to Enabled.

Let me know if that resolves the issue for you in the meantime.
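If it is easier to do this remotely, here is a rough Python sketch of staging those same settings through the iDRAC Redfish API instead of BIOS Setup. This assumes Redfish is enabled and uses the usual Dell attribute names (SysProfile, ProcCStates, ProcC1E); they can differ by BIOS version, so read the Bios resource first to confirm, and remember the change only takes effect after a BIOS configuration job and a reboot. The IP and credentials are placeholders.

# Rough sketch: read the current profile/C-state attributes and stage the
# workaround (Custom profile, C States disabled, C1E enabled) over Redfish.
import requests
from requests.auth import HTTPBasicAuth

IDRAC = "https://192.0.2.10"             # placeholder iDRAC address
AUTH = HTTPBasicAuth("root", "calvin")   # placeholder credentials
BIOS = IDRAC + "/redfish/v1/Systems/System.Embedded.1/Bios"

# 1. Read the current attributes.
resp = requests.get(BIOS, auth=AUTH, verify=False)
resp.raise_for_status()
attrs = resp.json()["Attributes"]
print("Profile:", attrs.get("SysProfile"),
      "| C States:", attrs.get("ProcCStates"),
      "| C1E:", attrs.get("ProcC1E"))

# 2. Stage the workaround; it stays pending until a BIOS config job + reboot.
payload = {"Attributes": {"SysProfile": "Custom",
                          "ProcCStates": "Disabled",
                          "ProcC1E": "Enabled"}}
resp = requests.patch(BIOS + "/Settings", json=payload, auth=AUTH, verify=False)
resp.raise_for_status()
print("Settings staged; pending until the next BIOS configuration job and reboot.")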

1 Message

August 22nd, 2017 11:00

Chris,

Does this also apply to R430 with E5-2650 v3? What BIOS version fixes this?

1 Message

March 27th, 2018 11:00

KaizenPhilo,

We are going through this exact same issue on a T620 with dual Xeon E5-2630s. We are getting the CPU0704 CPU2 Machine check error. I have updated the BIOS and firmware with no change, and have also changed the performance setting in the BIOS. It is rebooting very frequently, about every 40 to 90 minutes. Did you ever get a resolution on this? I've contacted support and sent a SupportAssist log. They suggested swapping the CPUs to see if the error follows the CPU. I have not done this yet, but may have to if no other solution is known. Any information you have would be appreciated. Thanks in advance!

4 Posts

June 4th, 2018 06:00

KaizenPhilo or Copper,

Do you have a solution now?

At the moment we have 5 servers (PowerEdge T640) with the same problem: a "CPU machine check error" randomly on CPU1 or CPU2, about once a week, at random times. The server restarts without Windows logging anything. BIOS, firmware, and drivers are all up to date. They are Windows Server 2016 machines with Hyper-V enabled, BIOS profile = Performance (C-states disabled).

Dell support has replaced three system boards. The problem still exists.

It looks like the error sometimes comes up when a virtual machine is starting up in the morning. I can't reproduce the problem manually.

Dell is searching for a solution, but it is slow. Does anyone have any ideas?

June 15th, 2018 08:00

I'm having/had the same issue on an R740, Server 2016 Standard with the Hyper-V role: CPU machine check errors, OEM diagnostic events, and Intel Management Engine warnings.

2+ months into the service call and we've replaced the motherboard twice, the PCI card (BOSS NVMe riser) twice, and the cables once... Flea drains, cold boots, AC cycles: you name it, we've tried it.

We've cut down the CPU machine check errors, but the OEM diagnostic and Intel M.E. events still come along together, and it looks like they happen when the system hangs. I've now seen it several times where, during a restart, it hangs on "Shutting down service: Hyper-V Virtual Machine Management".

Support just informed me of a new BIOS, 1.4.5, so I'm running my restart script against it to see what happens... The results are just in: updating the BIOS re-generated the CPU machine check error (does this mean it didn't go away? Is it a throwback reference? Who knows), and then the system stopped again while trying to shut down the Hyper-V VMM service.

4 Posts

June 18th, 2018 01:00

Hi Sierra,

We have also been running BIOS version 1.4.5 for five days now, and we also updated Windows Server 2016 with KB4103720. We have not seen the CPU machine check error so far, but it is too early to cheer.

Are you able to reproduce the problem manually? Rebooting a machine is no problem in our environment; the CPU machine check error only comes up randomly on a running machine.

I know about the long wait for the 'Hyper-V Virtual Machine Management' service to stop, but that is something I have seen on a lot of machines.

When I have news, I will post it.

4 Posts

July 3rd, 2018 01:00

All servers have been up for three weeks since the BIOS 1.4.5 update.
I'm slowly starting to believe that the BIOS update solved the problem.

July 30th, 2018 08:00

Sorry for the delayed reply; the forum didn't notify me and there's been so much going on :D

 

So, we never did get a fix for this - I have no idea what caused it, but it continued to be a problem. Eventually they caved, dispatched us a new server, and we sent the old one in for their dev team to rip apart and figure out.

If I ever find out what the cause was, I'll happily share it with the world - there was no need for this to take well over 3 months to get a brand-new server that works. But I'm just happy it does...

December 13th, 2018 02:00

Hi there, I am facing the same problem on an R630 with 2 x E5-2650 v3. The diagnostics report no errors, but the server reboots randomly and hangs at the boot screen with: "UEFI0078: One or more Machine Check errors occurred in the previous boot. Check the System Event Log to identify the source of the Machine Check error and resolve the issues." The CentOS 7 operating system reported irqbalance being killed by SIGSEGV in an abrt report. I have disabled irqbalance to stop hardware interrupts being spread across the other CPUs and updated all firmware. No luck, and this time neither abrt nor any other log file indicates an error. Has anyone managed to find the root cause?
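In case it is useful to anyone else on Linux, here is a minimal Python sketch of the kind of check that can be run after one of these crashes: it scans the previous boot's kernel log for machine check lines so you can see which CPU the MCE was reported against. It assumes journald keeps persistent logs; rasdaemon or mcelog decode the MCA banks in much more detail if installed.

# Minimal sketch: scan the previous boot's kernel log for machine check lines.
import subprocess

out = subprocess.run(
    ["journalctl", "-k", "-b", "-1", "--no-pager"],  # kernel log, previous boot
    capture_output=True, text=True, check=False,
).stdout

for line in out.splitlines():
    low = line.lower()
    if "machine check" in low or "mce:" in low or "hardware error" in low:
        print(line)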

1 Rookie

 • 

117 Posts

December 17th, 2018 08:00

We have a stack of R620s with 2 x E5-2667 v2 that almost all seem to have this problem. What I don't understand is that when running memtest or stress they run for days without any problems, but once we load Windows 2016 again they start crashing with the CPU0704 errors on CPU1 & CPU2.

Setting the System Profile to Performance in the BIOS appeared to help on some of the servers, but not all of them. I thought maybe it was just a Windows 2016 problem and was going to try loading CentOS, but it looks like someone has already tried that and was still having problems.
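Before swapping hardware again, it may be worth pulling the WHEA-Logger records from the Windows System log to see which processor and bank the machine checks are attributed to. Here is a rough Python sketch around the built-in wevtutil tool, run on the affected server itself; PowerShell's Get-WinEvent would do the same job.

# Rough sketch: dump the 20 most recent WHEA-Logger (hardware error) events
# from the Windows System log via the built-in wevtutil utility.
import subprocess

query = "*[System[Provider[@Name='Microsoft-Windows-WHEA-Logger']]]"
out = subprocess.run(
    ["wevtutil", "qe", "System", "/q:" + query, "/c:20", "/rd:true", "/f:text"],
    capture_output=True, text=True, check=False,
).stdout
print(out or "No WHEA-Logger events found.")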

9 Legend

 • 

16.3K Posts

December 17th, 2018 08:00

Have you updated the iDRAC/LCC, the BIOS, and the other system firmware? That is the most common cause of these types of errors. Any third-party cards or hardware?

1 Rookie

 • 

117 Posts

December 17th, 2018 11:00

Everything is updated. Nope, no extra hardware or cards. We have already pulled the whole server apart, cleaned everything, and re-seated the CPUs, RAM, and all of the risers and cables. We swapped CPU 1 and CPU 2 around but are still having the same problem. Extended stress tests and Dell utilities don't find anything wrong.

7 Posts

December 18th, 2018 02:00

We have an M1000e full of M620 blades with dual Xeon E5-2650 v2 (Model 62, Stepping 4) and have this problem on all of them.

They are all running Windows 10, and the BIOS, CPLD, iDRAC, etc. are all on the latest versions. I know Windows 10 is unsupported, but we need it; the blades are just part of a render farm and Windows Server is overkill for that.

It first started about a year ago, and it seems to happen in waves linked to when Microsoft pushes out updates. The last time was in August, and now again in the middle of December. Restarting a machine without installing updates mostly works just fine.

The only way to get a machine back up is to keep trying to restart it and hope that it will eventually start, or to reinstall Windows completely.

Setting the System Profile to Performance helped a little bit, but it still happens.

This is so **bleep** frustrating. Please Dell, fix this.

