1 Rookie

 • 

11 Posts

October 28th, 2021 01:00

PowerEdge 730xD, CPU {x} machine check error detected

Hello!

System hardware:
Xeon E5-2673 v3 (x2), 128GB RAM

I have a Dell PowerEdge 730xD (2 CPUs). In the past 2 months, we have had 3 "crashes" after ~2.5 years of error-free operation. iDRAC reported these errors:

Warning The Intel Management Engine has reported an internal system error. PWR2262
Error CPU 1 machine check error detected. CPU0704
Error One or more Machine Check errors occurred in the previous boot. UEFI0078


Before each crash, the logs show the same sequence of events. The whole "pre-crash" sequence occurs within 60 seconds:

  1. The Intel Management Engine has reported an internal system error. - PWR2262 (warning)
  2. The Intel Management Engine has reported an internal system error. - PWR2262 (warning)
  3. System CPU resetting - SYS1003
  4. Requested system hardreset - RAC0703
  5. System CPU resetting - SYS1003
  6. A problem was detected related to the previous server boot. - UEFI0078 (error)
  7. CPU 1 machine check error detected. - CPU0704 (error)
  8. An OEM diagnostic event occurred - CPU9000 (about 10 times in 1 sec)
  9. CPU 1 machine check error detected. - CPU0704 (error)
  10. An OEM diagnostic event occurred - CPU9000 (about 10 times in 1 sec)
  11. C: boot completed - OSE1002
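
For anyone who wants to check whether every crash follows this same pattern, here is a minimal Python sketch. It assumes the log entries have already been exported and parsed into `(timestamp, code)` tuples; that layout and the function name `find_crash_windows` are hypothetical, not an iDRAC export format or API. It only looks for a PWR2262 warning followed by a CPU0704 error within 60 seconds.

```python
from datetime import datetime, timedelta

# Event codes taken from the list above; the window length matches the
# roughly 60-second span described in the post.
PRECURSOR = "PWR2262"      # Intel Management Engine internal system error
MACHINE_CHECK = "CPU0704"  # CPU machine check error detected
WINDOW = timedelta(seconds=60)


def find_crash_windows(events):
    """Return (start, end) pairs where a PWR2262 warning is followed by a
    CPU0704 error within WINDOW.

    `events` is an iterable of (timestamp, code) tuples. This layout is an
    assumption for illustration, not an actual iDRAC export format.
    """
    events = sorted(events)
    windows = []
    last_end = None
    for i, (start, code) in enumerate(events):
        if code != PRECURSOR:
            continue
        # Skip warnings that already fall inside the previous window.
        if last_end is not None and start <= last_end:
            continue
        for ts, later_code in events[i + 1:]:
            if ts - start > WINDOW:
                break
            if later_code == MACHINE_CHECK:
                windows.append((start, ts))
                last_end = ts
                break
    return windows
```

If all three crashes follow the pattern, the function would be expected to return three windows, one per crash. A PWR2262 warning with no CPU0704 error after it would simply not show up in the result.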

I found a lot of forum posts about this CPU family on both Intel and Dell forums. It seems to me it's a general problem with these CPUs. 

iDRAC does not show any hardware-related problem; every component has a green symbol.
Edit: Windows created only one BSOD dump file, and it shows that ntoskrnl.exe crashed. (So it could be anything, since ntoskrnl handles everything hardware-related and close to the hardware.)

Things I did until this post:

  • Firmware and driver updates (except for the PERC controller)
  • Windows Server updates (Server 2016)

Next, I will shut down the server, re-apply the thermal paste on the CPUs, and swap them.

What more can we do? Is there any bug report or known issue about this CPU that I did not find?

Thanks for helping,
F

1 Rookie

 • 

11 Posts

October 30th, 2021 11:00

Hello,

An update on this post: we swapped the CPUs and found thermal paste on the wrong side of CPU2...
After the swap, something broke, and we got these errors:

 MEM0001
Multi-bit memory errors detected on a memory device at location(s) DIMM_A{x}.
PST0091
A problem was detected in Memory Reference Code (MRC).
UEFI0058
An uncorrectable Memory Error has occurred because a Dual Inline Memory Module (DIMM) is not functioning.

After each restart, a random one of the memory channels (0, 1, 2, 3) did not work.

We tried: swapping memory modules, only using one CPU, BIOS reset. 

So it looks like a replacement is needed.

F

Moderator

 • 

4K Posts

October 28th, 2021 08:00

Hello,

This kind of error is hard to diagnose, especially when it appears randomly.

Since it seems to be a hardware issue, the best thing you can do is contact support and ask for a motherboard/CPU replacement, so we can be sure it gets fixed.

I don't see any other part that could be responsible. Please also try upgrading the PERC firmware anyway.

Thanks
Marco

 

130 Posts

October 28th, 2021 09:00

I agree with swapping CPU1 with CPU2 to see whether the problem then moves to CPU2. It's also possible that the act of swapping positions will renew the processor-to-socket contact points.

#Iwork4Dell
