November 7th, 2017 08:00

Random Reboot R740

Hello,

We currently have a Windows Server 2016 Datacenter failover cluster with two PowerEdge R740 nodes.


The hardware configuration of each node is as follows:
2x Intel (R) Xeon (R) Silver 4116 CPU @ 2.10GHz Model 85 Stepping 4
RAM 196608 MB
Nvidia Tesla M60 Video Card
SAS connection with a PowerVault® 3420 SAN

The video cards are assigned to virtual machines through Discrete Device Assignment.

The nodes reboot abruptly and at random, with no error message in the logs other than Event ID 41 (Kernel-Power): "The system has rebooted without cleanly shutting down first".

EventData 
BugcheckCode 0 
BugcheckParameter1 0x0 
BugcheckParameter2 0x0 
BugcheckParameter3 0x0 
BugcheckParameter4 0x0 
SleepInProgress 0 
PowerButtonTimestamp 0 
BootAppStatus 0 
Checkpoint 0 
ConnectedStandbyInProgress false 
SystemSleepTransitionsToOn 0 
CsEntryScenarioInstanceId 0
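
For anyone comparing their own Event ID 41 entries against the one above, the fields can be checked mechanically. Below is a minimal Python sketch, assuming the EventData has been copied out as plain "Name Value" lines like the listing above. The helper names `parse_event_data` and `looks_like_hard_reset` are made up for illustration and are not part of any Windows tool. A BugcheckCode of 0 means Windows recorded no stop error before the restart, and a PowerButtonTimestamp of 0 means no power-button press was logged.

```python
def parse_event_data(text):
    """Parse 'Name Value' lines from a Kernel-Power 41 EventData dump into a dict."""
    fields = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:  # skip blank lines and the 'EventData' header
            fields[parts[0]] = parts[1]
    return fields


def looks_like_hard_reset(fields):
    """True when neither a bugcheck nor a power-button press was recorded."""
    # int(value, 0) accepts both decimal ("0") and hex ("0x0") notation.
    bugcheck = int(fields.get("BugcheckCode", "0"), 0)
    button = int(fields.get("PowerButtonTimestamp", "0"), 0)
    return bugcheck == 0 and button == 0


sample = """\
EventData
BugcheckCode 0
BugcheckParameter1 0x0
SleepInProgress 0
PowerButtonTimestamp 0
"""

fields = parse_event_data(sample)
print(looks_like_hard_reset(fields))
```

A result of `True` only says that Windows itself saw nothing before the restart, which is why the hardware-side logs are the next place to look.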

The nodes do not reboot at the same time, and the reboots occur at completely random times.

Hardware tests report no errors, and there are no explicit events in OpenManage.

Do you have any idea what is causing this problem?

Best Regards

March 22nd, 2018 05:00

We solved the problem: it was caused by a Microsoft security update (KB4088875). We rolled it back and all is fine.

Microsoft has acknowledged it as a problem.

March 22nd, 2018 11:00

Same issue here: a new R740 with Windows Server 2016 Standard and the Hyper-V role, all firmware and drivers updated, and it keeps crashing. The system doesn't have update KB4088875; I only see three updates: KB4088787, KB4049065 and KB3192137.

2018-03-22 13:27:25 SYS1003 System CPU Resetting.
2018-03-22 13:27:18 PWR2271 The Intel Management Engine has encountered a Exception Event.
2018-03-22 13:27:18 SYS1003 System CPU Resetting.
2018-03-22 13:27:18 SYS1000 System is turning on.
2018-03-22 13:27:10 SYS1001 System is turning off.
2018-03-22 13:27:10 SYS1003 System CPU Resetting.
2018-03-22 13:26:55 RAC0703 Requested system hardreset.
2018-03-22 13:26:54 CPU0000 Internal error has occurred check for additional logs.
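
To check an exported iDRAC log for the same pattern, here is a rough Python sketch. It assumes each line has the form "date time code message" as in the excerpt above, and it treats a RAC0703 hard reset within 60 seconds of a CPU0000 entry as related. Both the 60-second window and the function names `parse_log` and `cpu_error_resets` are assumptions made for illustration, not anything defined by iDRAC.

```python
from datetime import datetime, timedelta


def parse_log(text):
    """Parse 'YYYY-MM-DD HH:MM:SS CODE message' lines, returned oldest first."""
    entries = []
    for line in text.splitlines():
        parts = line.strip().split(maxsplit=3)
        if len(parts) < 4:
            continue  # blank or incomplete line
        date, time, code, message = parts
        try:
            stamp = datetime.strptime(f"{date} {time}", "%Y-%m-%d %H:%M:%S")
        except ValueError:
            continue  # not a log line
        entries.append((stamp, code, message))
    # The log is listed newest first, so sort into chronological order.
    return sorted(entries, key=lambda entry: entry[0])


def cpu_error_resets(entries, window=timedelta(seconds=60)):
    """Return (cpu_error_time, reset_time) pairs where RAC0703 follows CPU0000."""
    pairs = []
    for i, (stamp, code, _) in enumerate(entries):
        if code != "CPU0000":
            continue
        for later_stamp, later_code, _ in entries[i + 1:]:
            if later_stamp - stamp > window:
                break
            if later_code == "RAC0703":
                pairs.append((stamp, later_stamp))
                break
    return pairs


sample = """\
2018-03-22 13:27:10 SYS1001 System is turning off.
2018-03-22 13:26:55 RAC0703 Requested system hardreset.
2018-03-22 13:26:54 CPU0000 Internal error has occurred check for additional logs.
"""

for error_time, reset_time in cpu_error_resets(parse_log(sample)):
    print(f"CPU0000 at {error_time}, hard reset at {reset_time}")
```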

 

March 23rd, 2018 00:00

Same problem here: a new R640 with Windows Server 2016 Standard, running as a Hyper-V cluster node.

Everything is up to date. There is nothing special in the iDRAC logs or in the Windows Event Viewer.

After a call with Dell Support, I was advised to uninstall update KB4088875; they said it could be an OS problem.

I changed the System Profile in the BIOS to Performance. I don't know yet whether that really solves the problem.

March 26th, 2018 04:00

Since the iDRAC log pointed to CPU0000 internal errors, we've now changed our System Profile to Performance, which disables the C1E and C-states of the CPUs. It seems the HLT instruction was triggering those reboots.

The R740s have been running fine for about 2 weeks now. It's a bit early to call it a victory, but it's at least a start.

April 4th, 2018 05:00

Same thing is happening to me with one of my three new R740s. All 3 machines are nodes of a 2012 R2 failover cluster, running about 15 VMs.

Nodes 1 and 2 have been perfectly stable for a week. Node 3 has randomly rebooted twice in the last 7 days, at different times of day (1am and 9pm). There is no Windows dump file and nothing in Event Viewer other than Kernel-Power Event ID 41, followed by events relating to the restart. KB4088875 is not installed.

It's not our UPS as all 3 R740s run off the same UPS and this issue is only affecting one of them.

OpenManage Server Admin hardware / ESM log just shows: 'OEM software event' and 'C: boot completed'.

I don't have the iDRAC configured. I see others reporting CPU related errors via their iDRAC logs. Before I try changing my BIOS System Profile to 'Performance' and disabling C1E/C states, I'd like to know if I am receiving these CPU errors as well. 

Does the iDRAC log show more information than the OMSA ESM / hardware log?

UPDATE:

I enabled the iDRAC and am receiving the same CPU errors as others.

2018-04-04 01:11:38 SYS1001 System is turning off.
2018-04-04 01:11:38 SYS1003 System CPU Resetting.
2018-04-04 01:11:21 RAC0703 Requested system hardreset.
2018-04-04 01:11:20 CPU0000 Internal error has occurred check for additional logs.

UPDATE 2:

I changed our system profile to 'Performance' (which disables C1E/C states of the CPUs), as others have recommended earlier in this thread. 

I think this fix has done the trick. 6 days without any reboots. Fingers crossed it stays this way.

April 11th, 2018 18:00

Opgailey,
Hello friend! We have the same problem. Please tell me: after switching to maximum performance and turning off C1E, is the node still running without any reboots? If so, for how many days now?

April 13th, 2018 04:00

The Performance profile behaves differently on older BIOS versions, so make sure you're on the latest version.

For example:

v1.1.7 only disables C1E state

v1.3.7 disables both C and C1E states
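
When checking several servers, it helps to compare BIOS versions numerically rather than as text. This is a small Python sketch that records only the two versions reported above; what the Performance profile does on any other BIOS version is not known from this thread, so the lookup returns `None` for those. The names `REPORTED`, `parse_version`, `states_disabled_by_performance` and `is_at_least` are invented for this example.

```python
# Behaviour of the Performance profile as reported in this thread only.
REPORTED = {
    (1, 1, 7): {"C1E"},
    (1, 3, 7): {"C1E", "C-states"},
}


def parse_version(text):
    """Turn 'v1.3.7' or '1.3.7' into a tuple of ints: (1, 3, 7)."""
    return tuple(int(part) for part in text.lstrip("vV").split("."))


def states_disabled_by_performance(version_text):
    """Return the reported set of disabled states, or None if not reported."""
    return REPORTED.get(parse_version(version_text))


def is_at_least(version_text, minimum="1.3.7"):
    """Compare versions numerically, so 1.10.0 counts as newer than 1.3.7."""
    return parse_version(version_text) >= parse_version(minimum)


print(states_disabled_by_performance("v1.1.7"))
print(is_at_least("v1.1.7"))
```

Comparing tuples of integers avoids the trap of plain string comparison, where "1.10.0" would sort before "1.3.7".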

We have been running stable for about 5 weeks now.

April 30th, 2018 07:00


@tabletrtd wrote:
Opgailey,
Hello friend! We have the same problem. Please tell me: after switching to maximum performance and turning off C1E, is the node still running without any reboots? If so, for how many days now?

I can confirm that since I made this change, our 3 x R740 servers (acting as failover cluster nodes) have been stable. No more random reboots. 

Stable for almost a month now. :)

September 28th, 2018 02:00

Just another update in case anyone comes across this thread and wants to know.

I can confirm that since I made this change, our 3 x R740 servers have remained stable. No more random reboots. 

Perfectly stable for 6 months now.

October 30th, 2018 23:00

I am also experiencing this with two R740s in a failover cluster with 10 VMs. Both servers are rebooting randomly, and this has been affecting the cluster. I have changed the profile to "Performance" as recommended and have my fingers crossed that it will work.

October 31st, 2018 00:00

Update: no random reboots since 03-23-2018, after changing to the Performance profile.

July 8th, 2019 22:00

Hi, I am having the same issue. Dell replaced the motherboard once, but the problem remains. Has anyone got a solution?

August 26th, 2019 00:00

We have an R940 that is doing the same thing. It has already had the system board changed, but it's still failing.

I suspect a CPU failure.

October 15th, 2019 08:00

Can confirm: I had a T640 with this issue. It started rebooting daily, multiple times a day, in late Sept 2019. I had upgraded all firmware to the latest versions with DSU, with no change.

In the BIOS I set the profile to Performance, confirmed that the C and C1E states were disabled, and rebooted. It has not gone down in 11 days.
