PowerEdge OS Forum

Last reply by 10-15-2019 Unsolved

Random Reboot R740

Hello,

We currently have a Windows Server 2016 Datacenter failover cluster with two PowerEdge R740 nodes.


The hardware configuration of each node is as follows:
2x Intel (R) Xeon (R) Silver 4116 CPU @ 2.10GHz Model 85 Stepping 4
RAM 196608 MB
Nvidia Tesla M60 Video Card
SAS connection with a PowerVault® 3420 SAN

The video cards are passed through to virtual machines using Discrete Device Assignment (DDA).

The nodes reboot abruptly and at random, and the logs contain no error other than Event ID 41 (Kernel-Power): "The system has rebooted without cleanly shutting down first".

EventData 
BugcheckCode 0 
BugcheckParameter1 0x0 
BugcheckParameter2 0x0 
BugcheckParameter3 0x0 
BugcheckParameter4 0x0 
SleepInProgress 0 
PowerButtonTimestamp 0 
BootAppStatus 0 
Checkpoint 0 
ConnectedStandbyInProgress false 
SystemSleepTransitionsToOn 0 
CsEntryScenarioInstanceId 0
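
As a reading aid for the fields above, here is a minimal Python sketch that parses an EventData dump pasted as text and sorts it into three rough cases. The function names are hypothetical, and the rule of thumb it encodes is an assumption based on the commonly documented reading of Event ID 41, not something stated in this thread: a non-zero BugcheckCode points to a stop error, a non-zero PowerButtonTimestamp points to a held power button, and all zeros means Windows recorded no cause.

```python
def parse_event_data(text):
    """Parse 'Name Value' lines copied from the Event Viewer details pane."""
    fields = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip the 'EventData' heading and blank lines
        name, raw = parts
        if raw.lower() in ("true", "false"):
            fields[name] = raw.lower() == "true"
            continue
        try:
            fields[name] = int(raw, 0)  # base 0 accepts both '0' and '0x0'
        except ValueError:
            continue  # leave out anything that is not a plain number
    return fields


def classify_event_41(fields):
    """Rough triage of an Event ID 41 record (assumed rule of thumb)."""
    if fields.get("BugcheckCode", 0) != 0:
        return "stop error (bugcheck) before the reboot"
    if fields.get("PowerButtonTimestamp", 0) != 0:
        return "power button was held"
    return "no cause recorded by Windows"


if __name__ == "__main__":
    sample = """EventData
BugcheckCode 0
BugcheckParameter1 0x0
PowerButtonTimestamp 0
ConnectedStandbyInProgress false
"""
    print(classify_event_41(parse_event_data(sample)))
```

With every field at zero, as in this post, the event only says that the machine went down without Windows seeing why, so any further detail has to come from outside the operating system, such as hardware or management controller logs.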

The nodes do not reboot at the same time, and there is no pattern to when it happens.

Hardware tests report no errors, and OpenManage shows no relevant events.

Do you have any idea what could be causing this problem?

Best Regards

Replies (29)

Same thing is happening to me with one of my three new R740s. All 3 machines are nodes of a 2012 R2 failover cluster, running about 15 VMs.

Nodes 1 and 2 have been perfectly stable for a week. Node 3 has randomly rebooted twice in the last 7 days, at different times of day (1am and 9pm). There is no Windows dump file and nothing in Event Viewer other than Kernel-Power Event ID 41, followed by the events relating to the restart. KB4088875 is not installed.

It's not our UPS as all 3 R740s run off the same UPS and this issue is only affecting one of them.

OpenManage Server Admin hardware / ESM log just shows: 'OEM software event' and 'C: boot completed'.

I don't have the iDRAC configured. I see others reporting CPU related errors via their iDRAC logs. Before I try changing my BIOS System Profile to 'Performance' and disabling C1E/C states, I'd like to know if I am receiving these CPU errors as well. 

Does the iDRAC log show more information than the OMSA ESM / hardware log?

UPDATE:

I enabled the iDRAC and am receiving the same CPU errors as others.

2018-04-04 01:11:38 SYS1001 System is turning off.
2018-04-04 01:11:38 SYS1003 System CPU Resetting.
2018-04-04 01:11:21 RAC0703 Requested system hardreset.
2018-04-04 01:11:20 CPU0000 Internal error has occurred check for additional logs.
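
For anyone comparing their own iDRAC log against this pattern, here is a minimal Python sketch that scans exported log text for a CPU0000 entry followed shortly by a RAC0703 hard reset. It assumes the log is plain text in the `date time ID message` layout shown above. The function names and the 60-second window are arbitrary choices for illustration, not anything defined by Dell.

```python
from datetime import datetime, timedelta


def parse_idrac_log(text):
    """Return (timestamp, message_id, message) tuples, oldest first."""
    entries = []
    for line in text.splitlines():
        parts = line.split(None, 3)
        if len(parts) < 4:
            continue  # not a 'date time ID message' line
        try:
            stamp = datetime.strptime(parts[0] + " " + parts[1],
                                      "%Y-%m-%d %H:%M:%S")
        except ValueError:
            continue  # the line does not start with a timestamp
        entries.append((stamp, parts[2], parts[3]))
    return sorted(entries, key=lambda entry: entry[0])


def find_cpu_error_resets(entries, window_seconds=60):
    """Timestamps of CPU0000 entries followed by RAC0703 within the window."""
    window = timedelta(seconds=window_seconds)
    hits = []
    for stamp, message_id, _ in entries:
        if message_id != "CPU0000":
            continue
        followed_by_reset = any(
            other_id == "RAC0703" and stamp <= other_stamp <= stamp + window
            for other_stamp, other_id, _ in entries
        )
        if followed_by_reset:
            hits.append(stamp)
    return hits


if __name__ == "__main__":
    log = """2018-04-04 01:11:38 SYS1001 System is turning off.
2018-04-04 01:11:38 SYS1003 System CPU Resetting.
2018-04-04 01:11:21 RAC0703 Requested system hardreset.
2018-04-04 01:11:20 CPU0000 Internal error has occurred check for additional logs.
"""
    for stamp in find_cpu_error_resets(parse_idrac_log(log)):
        print("CPU internal error followed by hard reset at", stamp)
```

Run against the four lines above, the sketch reports one match at 01:11:20, which is the CPU0000 entry that precedes the hard reset by one second.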

UPDATE 2:

I changed our system profile to 'Performance' (which disables C1E/C states of the CPUs), as others have recommended earlier in this thread. 

I think this fix has done the trick. 6 days without any reboots. Fingers crossed it stays this way.

Opgailey,
Hello, friend! We have the same problem. Please tell me: after switching to the Performance profile and turning off C1E, is the node still running without any reboots? If so, for how many days now?

The Performance profile behaves differently on older BIOS versions, so be sure you're on the latest version.

For example:

v1.1.7 only disables C1E state

v1.3.7 disables both C and C1E states

Running stable for about 5 weeks now :)

@tabletrtd wrote:
Opgailey,
Hello, friend! We have the same problem. Please tell me: after switching to the Performance profile and turning off C1E, is the node still running without any reboots? If so, for how many days now?

I can confirm that since I made this change, our 3 x R740 servers (acting as failover cluster nodes) have been stable. No more random reboots. 

Stable for almost a month now. :)

Just another update in case anyone comes across this thread and wants to know.

I can confirm that since I made this change, our 3 x R740 servers have remained stable. No more random reboots. 

Perfectly stable for 6 months now.

I am also experiencing this with two R740s in a failover cluster with 10 VMs. Both servers are rebooting randomly, and this has been affecting the cluster. I have changed the profile to "Performance" as recommended and have my fingers crossed that it will work.

Update: no random reboots since 03-23-2018, after changing to the Performance profile.

Hi, I am having the same issue. Dell replaced the motherboard once, but the problem remains. Has anyone got a solution?

We have an R940 that is doing the same thing. The system board has already been replaced, but it's still failing.

I suspect a CPU failure.

Can confirm: I had a T640 with this issue. Starting in late Sept 2019, it began rebooting daily, often multiple times a day. I upgraded all firmware to the latest versions with DSU, with no change.

In BIOS I set the profile to Performance, confirmed that the C / C1E states were disabled, and rebooted. It has not gone down in 11 days.
