Eluich
1 Nickel

Random Reboot R740

Hello,

We currently have a Windows Server 2016 Datacenter failover cluster with two PowerEdge R740 nodes.


The hardware configuration of each node is as follows:
2x Intel (R) Xeon (R) Silver 4116 CPU @ 2.10GHz Model 85 Stepping 4
RAM 196608 MB
Nvidia Tesla M60 Video Card
SAS connection with a PowerVault® 3420 SAN

The video cards are passed through to virtual machines using Discrete Device Assignment.

We are seeing abrupt, random reboots of the nodes, with no error message in the logs other than an Event ID 41 Kernel-Power: "The system has rebooted without cleanly shutting down first".

EventData 
BugcheckCode 0 
BugcheckParameter1 0x0 
BugcheckParameter2 0x0 
BugcheckParameter3 0x0 
BugcheckParameter4 0x0 
SleepInProgress 0 
PowerButtonTimestamp 0 
BootAppStatus 0 
Checkpoint 0 
ConnectedStandbyInProgress false 
SystemSleepTransitionsToOn 0 
CsEntryScenarioInstanceId 0
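
For anyone reading these fields, here is a rough Python sketch of how I understand Microsoft's guidance for Event ID 41: a non-zero BugcheckCode points to a Stop error, a non-zero PowerButtonTimestamp points to the power button being held, and zeros in both point to a hard hang or a loss of power. The function names and category labels are my own and purely illustrative; this is a triage aid under those assumptions, not a diagnosis.

```python
# Illustrative only: helper names and labels are hypothetical, not part of any tool.
EVENT_DATA = """\
BugcheckCode 0
BugcheckParameter1 0x0
BugcheckParameter2 0x0
BugcheckParameter3 0x0
BugcheckParameter4 0x0
SleepInProgress 0
PowerButtonTimestamp 0
BootAppStatus 0
Checkpoint 0
ConnectedStandbyInProgress false
SystemSleepTransitionsToOn 0
CsEntryScenarioInstanceId 0
"""


def parse_event_data(text):
    """Turn 'Name Value' lines into a dict of strings."""
    fields = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            fields[parts[0]] = parts[1]
    return fields


def to_int(value):
    """Accept decimal ('0') and hexadecimal ('0x0') values."""
    value = value.strip().lower()
    if value.startswith("0x"):
        return int(value, 16)
    return int(value)


def classify(fields):
    """Rough triage of an Event ID 41 based on two of its fields."""
    if to_int(fields.get("BugcheckCode", "0")) != 0:
        return "bugcheck"            # a Stop error was recorded
    if to_int(fields.get("PowerButtonTimestamp", "0")) != 0:
        return "power_button"        # power button was pressed and held
    return "hang_or_power_loss"      # nothing recorded by the OS


print(classify(parse_event_data(EVENT_DATA)))  # prints "hang_or_power_loss"
```

With the values posted above, the OS recorded neither a bugcheck nor a power button press, which is why the OS logs alone cannot explain the reboot.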

The nodes do not reboot at the same time, and the reboots occur completely at random.

Hardware tests show no errors, and there are no explicit events in OpenManage.

Do you have any idea what is causing this problem?

Best Regards

26 Replies
Moderator

RE: Random Reboot R740

Hello

Hardware tests show no errors, and there are no explicit events in OpenManage.

Do you have any idea what is causing this problem?

If there are no errors or warnings, look at what happened just before the system shut down. Check the hardware log at the time of the shutdown; it should state what initiated the shutdown. If nothing in the hardware log states what initiated the shutdown, then this is a hardware issue.

Thanks

Daniel Mysinger
Dell EMC, Enterprise Engineer

Get support on Twitter @DellCaresPRO

Eluich
1 Nickel

RE: Random Reboot R740

Hi Daniel

Thank you for your update

When you say "Check the hardware log at the time of the shutdown", how can I check the hardware log? Through OpenManage Essentials, iDRAC, ...?

Thank you in advance for your answer

Best Regards

Moderator

RE: Random Reboot R740

It is under the log section of the iDRAC. It is called the System Event Log in the iDRAC. It is not the same as the operating system's System Event Log. In OpenManage it is listed as the Hardware Log.

Daniel Mysinger
Dell EMC, Enterprise Engineer

Eluich
1 Nickel

RE: Random Reboot R740

When the reboot occurs, I have only these hardware log entries:

OEM software event.

C: boot completed.

So does that mean there is a hardware problem?

Best Regards

Eluich
1 Nickel

RE: Random Reboot R740

In iDRAC, in the Lifecycle Log, I have this event before the reboot

dafoxx
2 Iron

RE: Random Reboot R740

Just an idea, but set the power options in the BIOS to maximum performance.

Does it happen when the GPUs/systems are under load?

Eluich
1 Nickel

RE: Random Reboot R740

Hi

Today I changed the BIOS configuration, because the servers were using the Performance Per Watt (DAPC) system profile.

Each node is now set to the Custom profile with maximum performance, and the C1E and C-state options are disabled.

I hope that will solve the problem.

Best Regards

Moderator

RE: Random Reboot R740

When the reboot occurs, I have only these hardware log entries:

OEM software event.

C: boot completed.

So does that mean there is a hardware problem?

No, those are normal messages that occur during system shutdown and startup. You need to review all of the software and hardware logs and cross-reference them at the time the events occur. Until you find something in the logs or diagnostics that indicates what is happening, it is just guesswork.
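
As a rough illustration of that cross-referencing, here is a small Python sketch that merges entries from several logs and keeps only those within a few minutes of a reboot. It assumes you have already exported the entries and converted their timestamps yourself. The function name, the source labels, and every timestamp below are made-up placeholders, not values from this thread.

```python
from datetime import datetime, timedelta


def entries_near(reboot_time, entries, window_minutes=10):
    """Return the (timestamp, source, message) tuples that fall within
    window_minutes before or after reboot_time, oldest first."""
    window = timedelta(minutes=window_minutes)
    nearby = [e for e in entries if abs(e[0] - reboot_time) <= window]
    return sorted(nearby, key=lambda e: e[0])


# Placeholder data: timestamps and source labels are invented for illustration.
REBOOT = datetime(2000, 1, 1, 12, 0, 0)
SAMPLE = [
    (datetime(2000, 1, 1, 9, 0, 0), "os", "unrelated earlier event"),
    (datetime(2000, 1, 1, 12, 0, 30), "idrac-sel", "OEM software event."),
    (datetime(2000, 1, 1, 12, 2, 0), "idrac-sel", "C: boot completed."),
    (datetime(2000, 1, 1, 12, 1, 0), "os", "Kernel-Power 41"),
]

for ts, source, message in entries_near(REBOOT, SAMPLE):
    print(ts.isoformat(), source, message)
```

Seeing the OS, System Event Log, and Lifecycle Log entries on one timeline makes it easier to tell whether anything was logged just before the reboot or only after it.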

Make sure you turn off automatic restart on system failure in the operating system. If the OS is faulting, it automatically restarts the system by default.

Thanks

Daniel Mysinger
Dell EMC, Enterprise Engineer

dafoxx
2 Iron

RE: Random Reboot R740

Assuming the nodes are Windows VMs? If so, set those to high-performance power mode in both the guest OS and the host OS. Are they on the latest firmware?

Also look at the iDRAC power graph to see whether the systems are using too much power.
