Unsolved

5 Posts

September 26th, 2018 21:00

Dell R630 overheating more than other R630 servers

Hi,

We have multiple R630 servers running for different purposes. Four of them serve the same purpose: providing internet access to our users company-wide. One of these R630s is overheating. Its temperature is always well above 75 degrees and reaches up to 85 degrees, and I couldn't find any reason for it. We're running CentOS Linux 7.3.1611. The main workload is Google Chrome for multiple users; more than 100 users are on the system at the same time. iDRAC raises the alerts "The system inlet temperature is greater than the upper warning threshold." and "The system inlet temperature is greater than the upper critical threshold.", and then the status returns to normal. The inlet temperature is 18 to 20 degrees most of the time, but whenever those alerts occur, the server overheats. The iDRAC version is 2.52.52.52. The other R630s hover around 60-70 degrees for the most part. I don't suspect anything is wrong with the OS setup, because our OS deployments are identical. Let me know.

 

Help me troubleshoot it.

Moderator

 • 

8.8K Posts

September 27th, 2018 07:00

Kamesh_s_s,

The first thing I would look at is whether the cooling shroud is actually installed in the overheating R630.

Also verify that:
• The system cover, EMI filler panel, memory-module blanks, and back-filler bracket have not been removed.
• External airflow is not obstructed.
• No cooling fan has been removed or has failed.
• The expansion card installation guidelines have been followed.

Additional cooling can also be added by one of the following methods:

From the iDRAC Web GUI
1. Click Hardware > Fans > Setup.
2. From the Fan Speed Offset drop-down list, select the cooling level needed or set the minimum fan speed to a custom value.

From F2 System Setup
1. Select iDRAC Settings > Thermal, and set a higher fan speed from the fan speed offset or minimum fan speed.

From RACADM Commands
1. Run the command racadm help system.thermalsettings
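
For anyone who wants to script the RACADM route, here is a minimal Python sketch for reading the current thermal settings. It assumes local racadm is installed and on the PATH, and that `racadm get system.thermalsettings` prints `Name=Value` lines with read-only attributes prefixed by `#`. The helper names are made up for illustration; check `racadm help system.thermalsettings` for the attributes and values your iDRAC actually accepts.

```python
import subprocess


def thermal_settings_cmd(attribute=None):
    """Build the argv for reading the thermal settings group or one attribute."""
    target = "system.thermalsettings"
    if attribute:
        target += "." + attribute
    return ["racadm", "get", target]


def parse_key_values(text):
    """Parse Name=Value lines; the output layout is an assumption."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blanks, the bracketed group header, and anything without '='.
        if not line or line.startswith("[") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # A leading '#' is assumed to mark a read-only attribute.
        settings[key.lstrip("#").strip()] = value.strip()
    return settings


def read_thermal_settings():
    """Run racadm locally and return the parsed settings (needs racadm installed)."""
    result = subprocess.run(
        thermal_settings_cmd(), capture_output=True, text=True, check=True
    )
    return parse_key_values(result.stdout)
```

Running `read_thermal_settings()` on the hot server and on a healthy R630 and comparing the two dictionaries is a quick way to confirm that the thermal profile and fan offset really match.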

Let me know what you see.

October 1st, 2018 03:00

Hi,

 

Thanks for the suggestions, Chris. I checked iDRAC and the other places. Everything seems to be on the Default profile, the same as on our other R630 servers. Only this one seems to generate more heat. I might have to schedule downtime for this server, open it, and check whether any fans are not working as intended. If a fan had failed we would get an alert, though, so I don't think that is the problem. I will also check whether any cables or other things are blocking airflow, and whether the CPU heatsink is seated properly; I assume it is. By the way, at what point would a Xeon E5-2697A v4 start thermal throttling? I used lm_sensors to check temperatures on my Linux machine. It reports max as 83 and critical as 93. Does that mean 83 is the thermal throttling point? Please let me know.
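
The max and critical values that lm_sensors prints come from the kernel's hwmon files. As far as the Linux coretemp driver documents them, `tempN_crit` is the maximum junction temperature and `tempN_max` is a lower threshold at which all cooling should be running; neither file states the throttling point directly. A minimal Python sketch, assuming the coretemp driver is loaded and exposes the standard `/sys/class/hwmon` layout (the function names are illustrative), reads the same values and labels each core against them:

```python
from pathlib import Path


def classify(temp_c, max_c, crit_c):
    """Label a reading against the driver-reported max and critical thresholds."""
    if temp_c >= crit_c:
        return "critical"
    if temp_c >= max_c:
        return "above max"
    return "ok"


def read_millideg(path):
    """hwmon files hold millidegrees Celsius as plain integers."""
    return int(Path(path).read_text().strip()) / 1000.0


def coretemp_readings(root="/sys/class/hwmon"):
    """Return (label, temp, max, crit, status) for every coretemp sensor found."""
    readings = []
    for hwmon in sorted(Path(root).glob("hwmon*")):
        name_file = hwmon / "name"
        if not name_file.exists() or name_file.read_text().strip() != "coretemp":
            continue
        for inp in sorted(hwmon.glob("temp*_input")):
            prefix = inp.name[: -len("_input")]
            max_f = hwmon / f"{prefix}_max"
            crit_f = hwmon / f"{prefix}_crit"
            label_f = hwmon / f"{prefix}_label"
            if not (max_f.exists() and crit_f.exists()):
                continue
            label = label_f.read_text().strip() if label_f.exists() else prefix
            t = read_millideg(inp)
            m = read_millideg(max_f)
            c = read_millideg(crit_f)
            readings.append((label, t, m, c, classify(t, m, c)))
    return readings
```

With the thresholds quoted above, `classify(85, 83, 93)` returns "above max". Logging `coretemp_readings()` on both the hot server and a healthy one under the same load would show whether every core runs hot or only some of them.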

 

Thanks,

Kamesh.

Moderator

 • 

8.8K Posts

October 1st, 2018 12:00

I would schedule a time and then open the case to see if all the fans are spinning at the rate they should be, just to be sure. You are correct that it should throw an error if a fan fails, but I would check to be safe. Also, are there ANY hardware differences between this server and the others, such as an expansion card, externally connected devices, etc.?

October 30th, 2018 00:00

Attaching an image of the server. Could you check if you're finding something abnormal?

45088667_1086077838231903_4825257291605344256_n.jpg

 

 

 

October 30th, 2018 00:00

Hey Chris,

Thanks for your time. I did open the case and check. All the fans are spinning properly, but the issue is still the same. We have an identical R630 mounted above it, and its temperature is better. This one is acting weird. Both R630 servers are identical and serve the same purpose, so the load is the same on both. I'm still not able to figure out what the problem is. Help me out.

Thanks,
Kamesh.

Moderator

 • 

8.8K Posts

October 30th, 2018 05:00

Everything looks fine from the picture. What I would like to try is increasing the fan speed. Reboot to the BIOS (F2), select iDRAC Settings > Thermal, and change the Fan Speed Offset to Low. If that doesn't stop the overheating, you can try increasing it to Medium, and so on.

 

7 Posts

July 16th, 2020 10:00

Hi Chris, 

I have a PowerEdge R420 on which, one day and for no apparent reason, all the fans went to full blast and stayed there.

I checked the iDRAC web GUI and saw that the System Board Inlet Temp was +127 degrees Celsius.

I read some forums, and it seems there can be a problem with mismatched iDRAC firmware versions, or with extra parts installed that do not communicate well. I have to tell you that I added a second CPU and RAM to the server, but I don't think this is the cause, considering that the server worked fine for the last two weeks with two CPUs.

After that I updated the entire server using the automatic update option, so BIOS, NIC, iDRAC, etc. are all on the latest version.

After this update the server went back to normal, but the fan problem is not entirely solved. From time to time it shows the System Board Inlet Temp as -128 degrees Celsius. That is when the server is quiet, with the fans spinning at 24% and sometimes dropping to 15%. But sometimes the reading goes to +127 and the server goes nuts.

There are episodes when the temperature goes to +127 and then to -128. I really don't know what to do anymore.

I suspect the temperature sensor is faulty, or the firmware/software doesn't handle it very well.
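
One detail worth noting: +127 and -128 are exactly the largest and smallest values a signed 8-bit integer can hold, which is consistent with a reading that is saturated or invalid rather than a real temperature. That the inlet value is carried as a signed byte is an assumption here, not something the thread confirms. A small Python sketch of the idea, with hypothetical helper names and hypothetical plausibility bounds:

```python
import struct

INT8_MIN, INT8_MAX = -128, 127


def as_signed_byte(raw):
    """Interpret the low 8 bits of raw as a two's-complement signed byte."""
    return struct.unpack("b", bytes([raw & 0xFF]))[0]


def is_plausible_inlet(temp_c, low=-10, high=60):
    """Reject rail values and anything outside a (hypothetical) sane room range."""
    if temp_c in (INT8_MIN, INT8_MAX):
        return False
    return low <= temp_c <= high


# 0x7F and 0x80 are the two extremes of a signed byte.
print(as_signed_byte(0x7F), as_signed_byte(0x80))
```

A monitoring script could use a check like `is_plausible_inlet` to tell a broken sensor path apart from a genuine overheating event before raising an alarm.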

I'd really appreciate any ideas. I really don't know what to do with this server.

Thanks in advance. 

Regards,

Moderator

 • 

3.7K Posts

July 16th, 2020 11:00

Hello Vasi_2020,

 

When adding a second processor you also need another fan and the correct risers.

 

https://dell.to/2OyeXJe

Page 79 Installing a Processor

NOTE: If you install a second processor, you must remove the dummy fan from the FAN 6 slot, install a cooling fan in the FAN 6 slot (CUSKITs for system fans are available with MOD: HR6C0 or G8KHX.), and upgrade both the riser cards (riser1 and riser 2). For more information,  see Expansion Card Installation Guidelines (page 66).  (Need two of  PN# 488MY  - ASSY,PWA,RISER-2,1P,2P,R4/320 )

 

NOTE: When installing a second processor in your system with a 350 W redundant power supply, it is highly recommended that you upgrade to 550 W redundant power supply, to avoid potential degraded  performance.

 

Please let me know if that helps you.

 

Moderator

 • 

3.7K Posts

July 16th, 2020 13:00

Hello Vasi_2020,

 

Thank you for the update. Since everything from the shrouds to the DIMM blanks is designed with cooling in mind, the correct risers may be required in a dual-CPU configuration. It would be a good idea to remove CPU2.

 

I need to make a correction on the risers:

You need  Riser2_1P and  Riser2_2P

not

Riser1_1P and Riser1_2P

 

Please confirm you have the shrouds installed.  There should be one over the power distribution board and one behind the fans.

Reseat the control panel cables at both the control panel end and the system board end.

Confirm the environment is clear and nothing is blocking airflow.

7 Posts

July 16th, 2020 13:00

Hi Chris,

I very much appreciate your kind answer.

The server was upgraded by somebody else, and I didn't pay much attention to riser hardware compatibility.

Indeed, there are two risers installed on the motherboard. One is located on the CPU1 side and has 1P/2P written on it. The other is located on the CPU2 side, where the iDRAC is connected, and has 1P written on it. I think both risers should have 2P written on them, shouldn't they? Especially the one the iDRAC is connected to.

Also, all the fans are installed. There is no dummy fan connected, and all of them are working.

At the moment I don't have a compatible riser for the second CPU, so I'll remove the second CPU and keep only the first one. Before doing that, I want to make sure whether the System Board Inlet Temp jumping from -128 degrees directly to +127 degrees is caused by the riser incompatibility, or whether some other hardware or software issue is behind it.

I hope there's nothing wrong with the motherboard. I'm thinking the temperature probe/sensor may be faulty.

Please let me know if you have encountered such a thing before or if you have other ideas.

Many thanks. 

Regards,

Vasile.

 

 

7 Posts

July 17th, 2020 02:00

 

Hi Charles,

Thank you for your answers. 

I opened up the server, removed the second CPU and RAM, then took out the iDRAC, both riser cards, and all the hard disks, and started it again. The system board inlet temp is still greater than the threshold: +127 degrees.

I put back the risers and the iDRAC, without the second CPU and RAM, and got the same problem: the fans go to full blast right after iDRAC initialization. I also checked and reseated all the cables, with no result.

I believe the temperature probe is faulty. But where is that sensor located, on the motherboard or on the iDRAC? Actually, it doesn't much matter even if I know where it is. If it's faulty, it can only be fixed in an electronics repair shop.

I also found out from a colleague that they have had several faulty servers with this fan problem before.

I know these are electronics and can fail, but I always believed Dell had some of the strongest servers on the market.

Do you have any other ideas? Or should we call it a day?

Thank you very much. 

Regards, 

Vasile.

Moderator

 • 

3.7K Posts

July 17th, 2020 05:00

Hello Vasi_2020,

 

At this point you may try resetting the iDRAC and BIOS to defaults and check the results:

 

Boot to F2 > iDRAC Settings > scroll down to Reset iDRAC configurations to defaults

-reconfigure any custom settings

 

Boot to F2 > System BIOS > Default button next to Finish button at the bottom

-reconfigure any custom settings

 

If you don't see any improvement after that, the suspect components would be:

Control panel (this has the sensor), control panel cables and System Board.

 

7 Posts

July 17th, 2020 06:00

Hi Charles,

My PowerEdge R420 has Service Tag: <Tag removed by moderator> and Express Service Code: .

When I upgraded all the firmware, I also upgraded iDRAC to the latest version, iDRAC with Lifecycle Controller v. 2.65.65.65.

Today I decided to try rolling back to the previous firmware, 2.63.60.62.A00, and guess what: with Lifecycle Controller enabled in the BIOS, the server fans went back to 21%, then 17%, then 15%. CPU 1 and CPU 2 are at 65 degrees Celsius, which seems OK. Now the server is as quiet as a mouse at 15%. I'm also helping the server stay at a constant temperature with an external room fan pointed at it.

The thing is, the System Board Inlet Temperature now reads -86 degrees Celsius, up from -128, so it still doesn't work properly. But iDRAC firmware version 2.63.60.62.A00 controls the fans much better than the latest firmware version. Very strange indeed.

I don't know what to say or do. I'll keep it under observation; hopefully it won't go nuts again. I've started to hate having jet planes in my room, hehe.

If it goes nuts again, I will reset everything to defaults and roll back to the previous firmware versions, and if that doesn't work I will call it a day and throw it in the bin. The thing is, I'd be throwing away good money, and it's a pity because I always trusted Dell to have very good stuff.

I will write to you if something happens. Until then, I would highly appreciate it if you could find out why we get those absurd negative and positive values for the system board inlet temperature, considering that the ambient temperature is normal and none of the fans are faulty according to the troubleshooting page in the manual. I guess I'm not the only one dealing with this issue. Surely some Dell technicians have encountered this before and can give us a resolution.

Many, many thanks for your answers. I hope this topic will also help other people.

We'll be in touch.

Best regards,

Vasile.

Moderator

 • 

3.7K Posts

July 17th, 2020 07:00

Hello Vasi_2020,

 

Thank you for the update. Good call on trying the firmware rollback. You have a good plan: keep monitoring, and reset to defaults if the issue occurs again.

 

A wild swing in temperature like that could be a sensor problem. The suspect components would be:
Control panel (this has the sensor), control panel cables and System Board.

 

I would advise against posting personal information such as the Service Tag in your posts. If we need the tag or other personal information, we can get it through a private message.
