Unsolved

November 7th, 2017 08:00

Random Reboot R740

Hello,

We currently have a Windows Server 2016 Datacenter failover cluster with two PowerEdge R740 nodes.


The hardware configuration of each node is as follows:
2x Intel (R) Xeon (R) Silver 4116 CPU @ 2.10GHz Model 85 Stepping 4
RAM 196608 MB
Nvidia Tesla M60 Video Card
SAS connection with a PowerVault® 3420 SAN

The video cards are assigned to virtual machines through Discrete Device Assignment.

We are seeing abrupt, random reboots of the nodes, with no error message in the logs other than Event ID 41, Kernel-Power: "The system has rebooted without cleanly shutting down first".

EventData 
BugcheckCode 0 
BugcheckParameter1 0x0 
BugcheckParameter2 0x0 
BugcheckParameter3 0x0 
BugcheckParameter4 0x0 
SleepInProgress 0 
PowerButtonTimestamp 0 
BootAppStatus 0 
Checkpoint 0 
ConnectedStandbyInProgress false 
SystemSleepTransitionsToOn 0 
CsEntryScenarioInstanceId 0
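The EventData above can be read programmatically. The following is a minimal Python sketch, not an official tool: the function names `parse_event_data` and `summarize` are made up for illustration, and the interpretation assumes Microsoft's general guidance for event 41, where a zero BugcheckCode together with a zero PowerButtonTimestamp points to a hard hang or a loss of power rather than a recorded stop error.

```python
# EventData fields copied from the Kernel-Power event 41 in this post.
EVENT_DATA = """\
BugcheckCode 0
BugcheckParameter1 0x0
BugcheckParameter2 0x0
BugcheckParameter3 0x0
BugcheckParameter4 0x0
SleepInProgress 0
PowerButtonTimestamp 0
BootAppStatus 0
Checkpoint 0
ConnectedStandbyInProgress false
SystemSleepTransitionsToOn 0
CsEntryScenarioInstanceId 0
"""


def parse_event_data(text):
    """Turn 'Name value' lines into a dict (hypothetical helper)."""
    fields = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or unexpected lines
        name, raw = parts
        if raw.lower() in ("true", "false"):
            fields[name] = raw.lower() == "true"
        else:
            fields[name] = int(raw, 0)  # base 0 accepts "0" and "0x0"
    return fields


def summarize(fields):
    """Give a rough reading of the fields (assumed interpretation)."""
    if fields.get("BugcheckCode", 0) != 0:
        return "stop error recorded: 0x%X" % fields["BugcheckCode"]
    if fields.get("PowerButtonTimestamp", 0) != 0:
        return "power button press recorded"
    return "no stop error or power button press recorded"


if __name__ == "__main__":
    print(summarize(parse_event_data(EVENT_DATA)))
```

With every field at zero, as here, the event itself does not say what caused the restart, so the cause has to be looked for elsewhere.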

The nodes do not reboot at the same time, and the reboots follow no pattern we can identify.

Hardware tests report no errors, and there are no explicit events in OpenManage.

Do you have any idea what could be causing this problem?

Best Regards

Moderator

6.2K Posts

November 7th, 2017 10:00

Hello

We have no errors in hardware testing and no explicit events in Open Manage.

Do you have any idea what caused this problem ?

If there are no errors or warnings, look at what happened just before the system shut down. Check the hardware log at the time of the shutdown; it should state what initiated the shutdown. If nothing in the hardware log states what initiated it, then this is a hardware issue.

Thanks

5 Posts

November 7th, 2017 11:00

Hi Daniel

Thank you for your update.

When you say "Check the hardware log at the time of the shutdown", how can I check the hardware log? Through OpenManage Essentials, iDRAC, ...?

Thank you in advance for your answer.

Best Regards

5 Posts

November 7th, 2017 12:00

When the reboot occurs, I have only these hardware log entries:

OEM software event.

C: boot completed.

So does that mean there is a hardware problem?

Best Regards

Moderator

6.2K Posts

November 7th, 2017 12:00

It is in the log section of the iDRAC, where it is called the System Event Log. It is not the same as the operating system's System event log. In OpenManage it is listed as the Hardware Log.

5 Posts

November 7th, 2017 23:00

On the iDRAC, in the Lifecycle Log, I have this event before the reboot

48 Posts

November 8th, 2017 07:00

Just an idea, but set the power options in the BIOS to maximum performance.

Does it happen when the GPUs/systems are under load?

5 Posts

November 8th, 2017 07:00

Hi

Today I changed the BIOS configuration, because the servers were using the Performance Per Watt (DAPC) system profile.

Each node is now set to the Custom profile with maximum performance, and the C1E and C-state options are disabled.

I hope that will solve the problem.

Best Regards

Moderator

6.2K Posts

November 8th, 2017 08:00

When the reboot occurs, i have only this hardware logs:

OEM software event.

C: boot completed.

So that means there's a hardware problem ?

No, those are normal messages that occur during system shutdown and startup. You need to review all of the software and hardware logs and cross-reference them at the time the events occur. Until you find something in the logs or diagnostics that indicates what is happening, it is just guesswork.
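One way to do that cross-referencing is to merge entries from every log and keep only those that fall shortly before a reboot. The following is a minimal Python sketch under stated assumptions: the helper `entries_before`, the 10-minute window, and all timestamps and messages are hypothetical placeholders, not data from this thread.

```python
from datetime import datetime, timedelta


def parse(ts):
    """Parse a 'YYYY-MM-DD HH:MM:SS' timestamp (assumed export format)."""
    return datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")


def entries_before(reboot_time, log, window=timedelta(minutes=10)):
    """Return entries logged within `window` before the reboot, oldest first.

    Each entry is a (time, source, message) tuple.
    """
    hits = [e for e in log if timedelta(0) <= reboot_time - e[0] <= window]
    return sorted(hits)


# Hypothetical entries merged from the OS and iDRAC logs.
log = [
    (parse("2017-11-07 07:10:00"), "OS System log", "placeholder entry A"),
    (parse("2017-11-07 07:38:30"), "iDRAC System Event Log", "placeholder entry B"),
    (parse("2017-11-07 07:41:10"), "iDRAC Lifecycle Log", "placeholder entry C"),
    (parse("2017-11-07 07:45:00"), "OS System log", "placeholder entry D"),
]

# Hypothetical reboot time, e.g. taken from the Kernel-Power event 41.
reboot = parse("2017-11-07 07:42:00")

if __name__ == "__main__":
    for when, source, message in entries_before(reboot, log):
        print(when, source, message)
```

Clocks on the iDRAC and in the operating system may differ, so the window may need widening when the two are not synchronized.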

Make sure you turn off automatic restart on failure in the operating system. By default, if the OS faults, it restarts the system automatically.
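To check the current setting on a node, the value can be read from the registry. This is a minimal Python sketch that assumes the setting meant here is the Windows "Automatically restart" option under Startup and Recovery, which is stored as the `AutoReboot` value under the `CrashControl` key. The function name `read_auto_reboot` is made up for illustration, and the sketch only reads the value; it does not change it.

```python
import sys

# Registry key under HKEY_LOCAL_MACHINE that holds the crash settings.
CRASH_CONTROL_KEY = r"SYSTEM\CurrentControlSet\Control\CrashControl"


def read_auto_reboot():
    """Return the AutoReboot value, or None when not running on Windows.

    1 means Windows restarts automatically after a stop error;
    0 means it stays on the error screen.
    """
    if sys.platform != "win32":
        return None
    import winreg  # standard library, available on Windows only

    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, CRASH_CONTROL_KEY) as key:
        value, _value_type = winreg.QueryValueEx(key, "AutoReboot")
    return value


if __name__ == "__main__":
    print("AutoReboot =", read_auto_reboot())
```

With the value at 0, a stop error leaves the error screen up instead of restarting, which helps tell an OS fault apart from a restart triggered outside the OS.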

Thanks

48 Posts

November 8th, 2017 09:00

Assuming the nodes are Windows VMs? If so, set them to the high-performance power mode in both the guest OS and the host OS. Are they on the latest firmware?

Also look at the iDRAC power graph to see if the systems are using too much power.

1 Message

November 20th, 2017 06:00

Same issue with a single T430 / Windows Server 2016. No hardware errors. Sometimes the server reboots two or three times within 5 minutes, sometimes it is OK for hours. For the moment I have just installed the OS: no users, no activity!

Dell support asked for a hardware test, which found no problem, and has given no other answer.

48 Posts

November 20th, 2017 07:00

Assuming the firmware is on the latest?

Do you have any add-in cards?

Try running the system on OS power control or max performance.

Install OpenManage

System > Main System Chassis

Power Management

Management > Profile

Choose

OS power control, then apply.

1 Message

January 30th, 2018 07:00

Eluich,

I am having the same problem with R740s that are randomly rebooting. These are out-of-the-box servers to which I've applied the latest drivers from Dell's website. Did changing the BIOS config to max performance resolve your issue?

Thanks,

Mike

March 14th, 2018 03:00

We have the same issue on a Citrix cluster of R740s (dual Xeon 6136/128GB)

BIOS: 1.3.7

iDRAC 3.15.17.15

The strange thing is that there is no BSOD or critical event in the event log on the host. There is also no load. We can't find a way to trigger it since it happens randomly; even an hour of 3ds Max/V-Ray rendering won't do it.

Capture.JPG

We took them out of our production environment for now.

March 15th, 2018 10:00

Same problem here with an R430. It only started yesterday: reboots for no reason, with no indication in the logs at all. Not sure if this is a coincidence, but it did coincide with a Windows update.

1 Message

March 22nd, 2018 05:00

I have the same problem with an R740 that was just added to a Hyper-V failover cluster. I currently have a case open with Dell.
