1 Rookie
•
5 Posts
0
42535
Random Reboot R740
Hello,
We currently have a Windows Server 2016 Datacenter failover cluster with two PowerEdge R740 nodes.
The hardware configuration of each node is as follows:
2x Intel (R) Xeon (R) Silver 4116 CPU @ 2.10GHz Model 85 Stepping 4
RAM 196608 MB
Nvidia Tesla M60 Video Card
SAS connection with a PowerVault® 3420 SAN
Video cards are used in Discrete Device Assignment by virtual machines
The nodes reboot abruptly and at random, with no error message in the logs other than an event ID 41 Kernel-Power: "The system has rebooted without cleanly shutting down first".
EventData
BugcheckCode 0
BugcheckParameter1 0x0
BugcheckParameter2 0x0
BugcheckParameter3 0x0
BugcheckParameter4 0x0
SleepInProgress 0
PowerButtonTimestamp 0
BootAppStatus 0
Checkpoint 0
ConnectedStandbyInProgress false
SystemSleepTransitionsToOn 0
CsEntryScenarioInstanceId 0
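For readers comparing their own event ID 41 records, here is a minimal Python sketch that parses the EventData fields listed above and gives a rough first reading. The field names and values are copied from this post. The classification rule (a zero BugcheckCode means no stop error was recorded, a non-zero PowerButtonTimestamp suggests the power button was held) is an assumption about how these fields are usually read, not an official diagnostic, and the helper names are hypothetical.

```python
# Sketch: read the EventData of a Kernel-Power event ID 41.
# Values are copied from the post; the rule in classify() is an assumption.

EVENT_DATA = """\
BugcheckCode 0
BugcheckParameter1 0x0
BugcheckParameter2 0x0
BugcheckParameter3 0x0
BugcheckParameter4 0x0
SleepInProgress 0
PowerButtonTimestamp 0
BootAppStatus 0
Checkpoint 0
ConnectedStandbyInProgress false
SystemSleepTransitionsToOn 0
CsEntryScenarioInstanceId 0
"""


def parse_event_data(text):
    """Split each 'Name Value' line into a dictionary entry."""
    fields = {}
    for line in text.splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2:
            fields[parts[0]] = parts[1].strip()
    return fields


def to_int(value):
    """Convert decimal ('0') or hexadecimal ('0x0') text to an integer."""
    return int(value, 0)


def classify(fields):
    """Give a rough reading of the event (assumed rule, see lead-in)."""
    if to_int(fields.get("BugcheckCode", "0")) != 0:
        return "stop error recorded"
    if to_int(fields.get("PowerButtonTimestamp", "0")) != 0:
        return "power button held"
    return "no stop error recorded"


if __name__ == "__main__":
    print(classify(parse_event_data(EVENT_DATA)))
```

With the values from this post, every bugcheck field is zero, so Windows itself recorded no stop error. That is consistent with a reset coming from below the operating system, such as hardware or firmware.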
The reboot of the nodes is not simultaneous and occurs in a totally random way.
Hardware tests report no errors, and there are no explicit events in OpenManage.
Do you have any idea what is causing this problem?
Best Regards
ThelmaCottage
1 Rookie
•
2 Posts
0
March 22nd, 2018 05:00
We solved the problem: it was caused by a Microsoft security update (KB4088875). We rolled it back and all is fine.
Microsoft has acknowledged it as a problem.
Indusflow
1 Message
0
March 22nd, 2018 11:00
Same issue here: a new R740 with Windows Server 2016 Standard and the Hyper-V role, all firmware and drivers updated, and it keeps crashing. The system doesn't have update KB4088875; I only see three updates: KB4088787, KB4049065 and KB3192137.
2018-03-22 13:27:25 SYS1003 System CPU Resetting.
2018-03-22 13:27:18 PWR2271 The Intel Management Engine has encountered a Exception Event.
2018-03-22 13:27:18 SYS1003 System CPU Resetting.
2018-03-22 13:27:18 SYS1000 System is turning on.
2018-03-22 13:27:10 SYS1001 System is turning off.
2018-03-22 13:27:10 SYS1003 System CPU Resetting.
2018-03-22 13:26:55 RAC0703 Requested system hardreset.
2018-03-22 13:26:54 CPU0000 Internal error has occurred check for additional logs.
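For anyone who wants to check an exported log for the same pattern, below is a minimal Python sketch. It assumes each line has the form `date time code message`, as in the excerpt above, and looks for a CPU0000 entry followed shortly by a RAC0703 hard reset. The 5-second window and the function names are assumptions chosen for illustration, not anything defined by iDRAC.

```python
# Sketch: find "CPU0000 then RAC0703" sequences in an iDRAC log excerpt.
# The log lines are copied verbatim from the post; the window is an assumption.
from datetime import datetime

LOG = """\
2018-03-22 13:27:25 SYS1003 System CPU Resetting.
2018-03-22 13:27:18 PWR2271 The Intel Management Engine has encountered a Exception Event.
2018-03-22 13:27:18 SYS1003 System CPU Resetting.
2018-03-22 13:27:18 SYS1000 System is turning on.
2018-03-22 13:27:10 SYS1001 System is turning off.
2018-03-22 13:27:10 SYS1003 System CPU Resetting.
2018-03-22 13:26:55 RAC0703 Requested system hardreset.
2018-03-22 13:26:54 CPU0000 Internal error has occurred check for additional logs.
"""


def parse_log(text):
    """Return (timestamp, code, message) tuples sorted oldest first."""
    entries = []
    for line in text.splitlines():
        parts = line.split(None, 3)
        if len(parts) < 4:
            continue  # skip blank or malformed lines
        date, time, code, message = parts
        stamp = datetime.strptime(f"{date} {time}", "%Y-%m-%d %H:%M:%S")
        entries.append((stamp, code, message))
    # The export lists the newest entry first, so reverse it before the
    # stable sort to keep same-second entries in chronological order.
    entries.reverse()
    entries.sort(key=lambda entry: entry[0])
    return entries


def find_cpu_triggered_resets(entries, window_seconds=5):
    """Pair each CPU0000 entry with a RAC0703 that follows within the window."""
    hits = []
    for i, (stamp, code, _) in enumerate(entries):
        if code != "CPU0000":
            continue
        for later_stamp, later_code, _ in entries[i + 1:]:
            if (later_stamp - stamp).total_seconds() > window_seconds:
                break
            if later_code == "RAC0703":
                hits.append((stamp, later_stamp))
                break
    return hits


if __name__ == "__main__":
    for cpu_time, reset_time in find_cpu_triggered_resets(parse_log(LOG)):
        print(f"CPU0000 at {cpu_time}, hard reset requested at {reset_time}")
```

In this excerpt the hard reset is requested one second after the CPU internal error, which is the sequence other posters in this thread report as well.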
kimse
2 Posts
0
March 23rd, 2018 00:00
Same problem here: a new R640 with Windows Server 2016 Standard, running as a Hyper-V cluster node.
Everything is up to date, and there is nothing special in the iDRAC logs or in Windows Event Viewer.
After a call with Dell Support, I was told to uninstall update KB4088875, as it could be an OS problem.
I changed the BIOS profile to Performance. I do not know yet whether that really solves the problem.
PowerEdgeR740
8 Posts
0
March 26th, 2018 04:00
Since the iDRAC log pointed to CPU0000 internal errors, we've changed our system profile to Performance, which disables the C1E/C-states of the CPUs. It seems the HLT instruction triggered those reboots.
The R740s have been running fine for about 2 weeks now. It's a bit early to call it a victory, but at least it's a start.
Opgailey
21 Posts
0
April 4th, 2018 05:00
Same thing is happening to me with one of my three new R740s. All 3 machines are nodes of a 2012 R2 failover cluster, running about 15 VMs.
Nodes 1 and 2 have been perfectly stable for a week. Node 3 has randomly rebooted twice in the last 7 days, at different times of day (1am and 9pm). There is no Windows dump file and nothing in Event Viewer other than Kernel-Power event ID 41, followed by events relating to the restart. KB4088875 is not installed.
It's not our UPS as all 3 R740s run off the same UPS and this issue is only affecting one of them.
OpenManage Server Admin hardware / ESM log just shows: 'OEM software event' and 'C: boot completed'.
I don't have the iDRAC configured. I see others reporting CPU related errors via their iDRAC logs. Before I try changing my BIOS System Profile to 'Performance' and disabling C1E/C states, I'd like to know if I am receiving these CPU errors as well.
Does the iDRAC log show more information than the OMSA ESM / hardware log?
UPDATE:
I enabled the iDRAC and am receiving the same CPU errors as others.
2018-04-04 01:11:38 SYS1001 System is turning off.
2018-04-04 01:11:38 SYS1003 System CPU Resetting.
2018-04-04 01:11:21 RAC0703 Requested system hardreset.
2018-04-04 01:11:20 CPU0000 Internal error has occurred check for additional logs.
UPDATE 2:
I changed our system profile to 'Performance' (which disables C1E/C states of the CPUs), as others have recommended earlier in this thread.
I think this fix has done the trick. 6 days without any reboots. Fingers crossed it stays this way.
tabletrtd
1 Message
0
April 11th, 2018 18:00
PowerEdgeR740
8 Posts
0
April 13th, 2018 04:00
The Performance profile behaves differently on older BIOS versions, so be sure you're on the latest version.
For example:
v1.1.7 only disables C1E state
v1.3.7 disables both C and C1E states
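If you script a check for this, compare BIOS versions as numbers rather than as text. The Python sketch below is only an illustration: the two table entries are the ones reported in this post, the helper names are hypothetical, and "1.10.0" is a made-up version used solely to show why string comparison goes wrong.

```python
# Sketch: compare BIOS versions numerically and look up the behaviour
# reported in this post. Only the two versions named above are known.


def parse_version(text):
    """Turn 'v1.3.7' or '1.3.7' into the tuple (1, 3, 7)."""
    return tuple(int(part) for part in text.lstrip("vV").split("."))


# What the Performance profile disables, as reported in this thread.
REPORTED = {
    (1, 1, 7): ("C1E",),
    (1, 3, 7): ("C-states", "C1E"),
}


def reported_behaviour(version_text):
    """Return the reported states, or None for versions not mentioned here."""
    return REPORTED.get(parse_version(version_text))


if __name__ == "__main__":
    print(reported_behaviour("v1.1.7"))
    print(reported_behaviour("v1.3.7"))
    # Tuples compare numerically; plain strings would sort "1.10.0" first.
    print(parse_version("1.10.0") > parse_version("1.3.7"))
```

Versions that are not in the table return `None`, since this thread gives no information about them.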
Running stable for about 5 weeks now.
Opgailey
21 Posts
0
April 30th, 2018 07:00
I can confirm that since I made this change, our 3 x R740 servers (acting as failover cluster nodes) have been stable. No more random reboots.
Stable for almost a month now. :)
Opgailey
21 Posts
2
September 28th, 2018 02:00
Just another update in case anyone comes across this thread and wants to know.
I can confirm that since I made this change, our 3 x R740 servers have remained stable. No more random reboots.
Perfectly stable for 6 months now.
fms_Vespucci
1 Message
1
October 30th, 2018 23:00
kimse
2 Posts
1
October 31st, 2018 00:00
Update: no random reboots since 03-23-2018, after changing the system profile to Performance.
Msafeer
1 Message
0
July 8th, 2019 22:00
Paul_McGuire
2 Posts
0
August 26th, 2019 00:00
We have an R940 that is doing the same thing. The system board has already been replaced, but it's still failing.
I suspect a CPU failure.
support-mmeconsulting.com
1 Message
0
October 15th, 2019 08:00
Can confirm: I had a T640 with this issue. It started rebooting daily, multiple times a day, in late September 2019. I upgraded all firmware to the latest versions with DSU, with no change.
In the BIOS I set the profile to Performance, confirmed that the C/C1E states were disabled, and rebooted. It has not gone down in 11 days.