Same thing is happening to me with one of my three new R740s. All 3 machines are nodes of a 2012 R2 failover cluster, running about 15 VMs.
Nodes 1 and 2 have been perfectly stable for a week. Node 3 has randomly rebooted twice in the last 7 days, at different times of day (1am and 9pm). No Windows dump file, and nothing in Event Viewer other than Kernel-Power event ID 41 followed by the events relating to the restart. KB4088875 is not installed.
It's not our UPS as all 3 R740s run off the same UPS and this issue is only affecting one of them.
OpenManage Server Admin hardware / ESM log just shows: 'OEM software event' and 'C: boot completed'.
I don't have the iDRAC configured. I see others reporting CPU-related errors in their iDRAC logs. Before I try changing my BIOS System Profile to 'Performance' and disabling C1E/C states, I'd like to know whether I am getting these CPU errors as well.
Does the iDRAC log show more information than the OMSA ESM / hardware log?
I enabled the iDRAC and am receiving the same CPU errors as others.
2018-04-04 01:11:38 SYS1001 System is turning off.
2018-04-04 01:11:38 SYS1003 System CPU Resetting.
2018-04-04 01:11:21 RAC0703 Requested system hardreset.
2018-04-04 01:11:20 CPU0000 Internal error has occurred check for additional logs.
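In case it helps anyone comparing their own logs: the iDRAC entries above are listed newest first. Here is a minimal Python sketch, assuming lines in the same "date time ID message" layout as pasted above, that puts them in chronological order and checks whether a CPU0000 entry was followed shortly by a RAC0703 hard reset. The names `LogEntry`, `parse_line`, `chronological` and `reset_follows_cpu_error`, and the 60-second window, are my own for illustration and not part of any Dell tool.

```python
from datetime import datetime
from typing import NamedTuple


class LogEntry(NamedTuple):
    timestamp: datetime
    message_id: str
    message: str


def parse_line(line: str) -> LogEntry:
    # Assumed layout: "<date> <time> <message id> <free text>"
    date_part, time_part, message_id, message = line.split(maxsplit=3)
    timestamp = datetime.strptime(f"{date_part} {time_part}", "%Y-%m-%d %H:%M:%S")
    return LogEntry(timestamp, message_id, message)


def chronological(lines):
    # Input is assumed to be newest first, so reverse it before the
    # (stable) sort to keep same-second entries in a sensible order.
    entries = [parse_line(line) for line in lines if line.strip()]
    entries.reverse()
    return sorted(entries, key=lambda entry: entry.timestamp)


def reset_follows_cpu_error(entries, window_seconds=60):
    # True if any CPU0000 entry is followed by a RAC0703 entry
    # within window_seconds (an arbitrary window chosen for this sketch).
    for i, entry in enumerate(entries):
        if entry.message_id != "CPU0000":
            continue
        for later in entries[i + 1:]:
            gap = (later.timestamp - entry.timestamp).total_seconds()
            if later.message_id == "RAC0703" and 0 <= gap <= window_seconds:
                return True
    return False


LOG = """\
2018-04-04 01:11:38 SYS1001 System is turning off.
2018-04-04 01:11:38 SYS1003 System CPU Resetting.
2018-04-04 01:11:21 RAC0703 Requested system hardreset.
2018-04-04 01:11:20 CPU0000 Internal error has occurred check for additional logs.
"""

if __name__ == "__main__":
    ordered = chronological(LOG.splitlines())
    for entry in ordered:
        print(entry.timestamp, entry.message_id, entry.message)
    print("CPU error followed by hard reset:", reset_follows_cpu_error(ordered))
```

Read in that order, the CPU0000 internal error comes first, the hard reset request follows one second later, and the system turns off 17 seconds after that.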
I changed our system profile to 'Performance' (which disables C1E/C states of the CPUs), as others have recommended earlier in this thread.
I think this fix has done the trick. 6 days without any reboots. Fingers crossed it stays this way.
The Performance profile behaves differently on older BIOS versions, so be sure you're on the latest version.
v1.1.7 only disables C1E state
v1.3.7 disables both C and C1E states
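If you are checking several nodes, a small helper can flag which ones are still on a BIOS older than the one where Performance disabled both. This is only a sketch: the table holds just the two versions reported above, anything else comes back as unknown rather than guessed, and the names `parse_version`, `OBSERVED`, `observed_behaviour` and `is_older_than` are made up for this example.

```python
def parse_version(text: str) -> tuple:
    # "v1.3.7" or "1.3.7" -> (1, 3, 7), so versions compare numerically
    return tuple(int(part) for part in text.lstrip("vV").split("."))


# What the Performance profile was seen to disable, per this thread only.
OBSERVED = {
    (1, 1, 7): {"C1E"},
    (1, 3, 7): {"C1E", "C states"},
}


def observed_behaviour(version: str):
    # Returns None for any version nobody in the thread reported on.
    return OBSERVED.get(parse_version(version))


def is_older_than(version: str, reference: str = "1.3.7") -> bool:
    return parse_version(version) < parse_version(reference)


if __name__ == "__main__":
    for bios in ("v1.1.7", "v1.3.7"):
        print(bios, sorted(observed_behaviour(bios)), "older than 1.3.7:", is_older_than(bios))
```

Comparing as tuples of integers rather than strings avoids the usual trap where "1.10.0" sorts before "1.3.7".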
Running stable for about 5 weeks now
@tabletrtd wrote: Opgailey, hello friend! We have the same trouble. Please tell me, is the node (after switching to max performance and turning off C1E) still working without any reboots? If yes, for how many days already?
I can confirm that since I made this change, our 3 x R740 servers (acting as failover cluster nodes) have been stable. No more random reboots.
Stable for almost a month now.
Just another update in case anyone comes across this thread and wants to know.
I can confirm that since I made this change, our 3 x R740 servers have remained stable. No more random reboots.
Perfectly stable for 6 months now.