July 27th, 2020 04:00

Fan at max speed on C6300 server with 2 x Xeon Phi

I own a C6300 server, hosting two C6320p blades with the exact same Xeon Phi 7230 @ 1.30GHz config.

Symptoms:

With only one C6320p blade in the chassis, everything works fine.
But as soon as we plug in a second one (even powered off), the C6300 fans go crazy and reach 16,800 RPM (out of a maximum of 17,850), causing a very relaxing plane-taking-off sound.
The problem appears as soon as one blade is powered on and a second one is merely plugged in.

What I tried:

- The available updates:
    - iDRAC: 2.75.75.75 (on both blades)
    - C6300 firmware 3.18 (https://www.dell.com/support/home/de-ch/drivers/DriversDetails?productCode=poweredge-c6300&driverId=8CTMG)

- The "raw 0x30 0x12" ipmitool command
    - Output with 1 blade:
      01 60 1b 03 18 06 26 00 00 01 ff 00 01 25 2a ff
      ff 0f c2 00 00 01 04 01 04 31 c5 11 ff 0f

    - Output with 2 blades:
      01 8c 1b 03 18 00 00 00 00 01 ff 00 01 25 2a ff
      ff 0f c2 00 00 01 04 01 04 31 c5 11 ff 0f


Does someone have a mysterious raw ipmitool hexadecimal command to solve this problem?
Or even more information about what this output means?
The green values (03 18) mean the firmware version is indeed the latest one available.
But what about the red ones?
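
The color highlighting of the original post does not survive in plain text, so one way to see what changes between the two dumps is to compare them byte by byte. Below is a minimal sketch in Python. The helper names `parse_hex` and `diff_bytes` are made up for this illustration, and the script only reports where the two responses differ, not what the bytes mean (the layout of the 0x30 0x12 response is not documented in this thread).

```python
# Raw responses to "ipmitool raw 0x30 0x12", copied from the post above.
ONE_BLADE = """
01 60 1b 03 18 06 26 00 00 01 ff 00 01 25 2a ff
ff 0f c2 00 00 01 04 01 04 31 c5 11 ff 0f
"""
TWO_BLADES = """
01 8c 1b 03 18 00 00 00 00 01 ff 00 01 25 2a ff
ff 0f c2 00 00 01 04 01 04 31 c5 11 ff 0f
"""


def parse_hex(dump: str) -> bytes:
    """Turn whitespace-separated hex pairs into a bytes object."""
    return bytes(int(tok, 16) for tok in dump.split())


def diff_bytes(a: bytes, b: bytes) -> list[tuple[int, int, int]]:
    """Return (offset, value_in_a, value_in_b) for every differing byte."""
    if len(a) != len(b):
        raise ValueError("responses have different lengths")
    return [(i, x, y) for i, (x, y) in enumerate(zip(a, b)) if x != y]


if __name__ == "__main__":
    one, two = parse_hex(ONE_BLADE), parse_hex(TWO_BLADES)
    for offset, v1, v2 in diff_bytes(one, two):
        print(f"offset {offset}: {v1:02x} (1 blade) -> {v2:02x} (2 blades)")
```

On these two dumps, only offsets 1, 5 and 6 (counting from 0) differ: 60 becomes 8c, and 06 26 becomes 00 00. Everything else, including the 03 18 firmware bytes, is identical.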

To be clear, I don't see any other weird behavior on this machine: no blinking LEDs, no strange IPMI bugs... Everything seems to work just fine except the fan control.

Thank you for your help!

7 Posts

July 31st, 2020 09:00

Hello,

I "solved" the problem by moving my 2 sleds to slots 2 and 4 of the machine (the bottom ones).

As soon as I have 2 machines plugged into the C6300 enclosure, and one of them is in slot 1 or 3, my fans go crazy. I don't understand why, but maybe something is not properly plugged between the upper connection card and the fan card / power splitting card?
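
The observation above can be written out as a small table of slot placements. The following Python sketch lists every way to put 2 sleds into the 4 slots and marks the placements that, according to this report, trigger the problem. The name `fans_go_crazy` is purely illustrative, and the rule it encodes is an observation on this one enclosure, not documented behavior.

```python
from itertools import combinations

SLOTS = (1, 2, 3, 4)
UPPER_SLOTS = {1, 3}  # reported above as the problematic slots


def fans_go_crazy(occupied: tuple[int, ...]) -> bool:
    """Rule as reported: 2 or more sleds, at least one in slot 1 or 3."""
    return len(occupied) >= 2 and any(s in UPPER_SLOTS for s in occupied)


if __name__ == "__main__":
    for pair in combinations(SLOTS, 2):
        state = "max fan speed" if fans_go_crazy(pair) else "normal"
        print(pair, state)
```

Of the six possible pairs, only (2, 4) avoids slots 1 and 3, which matches the workaround.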

However, I think I had already tried this configuration before updating the firmware and BIOS of the sleds. Maybe the updates helped.

If you hear about someone having the same kind of problem, do not hesitate to contact me.
I won't need to add any other sleds to this enclosure, so this solution is fine for me. But I am still curious about the real origin of the problem.

Thank you all for your help.

Moderator

3.3K Posts

July 27th, 2020 10:00

Hello Xarboule,

 

Regarding the high fan speed: does it matter which blade is installed first? Does the issue follow one blade, or does it occur regardless of which one is installed first?

 

See if you have the current BIOS:

Dell PowerEdge Server BIOS C6320P Version 2.3.0

Release date :  06 Feb 2020

 

https://dell.to/2EiGVXC

 

 

Was this working normally before and just started doing this? Was anything changed before this started?

Has anything changed on the hardware like added memory?

Are there any hardware issues on either blade?

 

Please let me know and I'll be happy to assist you.

7 Posts

July 28th, 2020 00:00

Hello Charles,
Thank you for your answer.

Indeed, we tried changing the order in which we plug in the blades. Apparently, the problem only shows up when both blades are plugged in; it depends neither on the order nor on the slots used in the chassis.

The 2 BIOS are version 2.3.0.

This machine never worked properly except with only one blade plugged in.

There is indeed one change in the hardware:
We had to add a cable in each blade to connect the blades to our SSD front bay. There is no disk attached to the blades themselves; each of them uses an SSD from the chassis.
Apparently, this makes the iDRAC unable to locate the disk in the "Storage" section. But we can boot our OS from the blades, and everything seems to work as expected.
We use standard Intel SSDs bought from Dell with the machine.

Do you think this may be related?
Screenshot_phi-idrac.png

Thank you again.

Moderator

2.2K Posts

July 28th, 2020 02:00

Hello,

 

That hardware change might have caused this. Unsupported hardware or a near-maximum hardware configuration can cause fan issues. As a simple first step, can you try draining the flea power? I ask because I see a RAC0501 warning in the screenshot.

  1. Turn off your server.
  2. Disconnect the power cable.
  3. Press and hold down the power button for 15 seconds to drain the flea power.
  4. Connect the power cable.
  5. Turn on your server.

While I am not sure, there may be an issue with the iDRAC firmware. Have you updated the iDRAC recently, or did the fans start running at maximum speed after an iDRAC update? We once solved a similar case by rolling back to the previous version, because the iDRAC update had caused the fans to run at maximum speed. In general, fan issues are resolved when new iDRAC and LCC firmware versions are released for PowerEdge servers.

Let us know if it helps.

7 Posts

July 28th, 2020 03:00

Hello,

Nothing new after draining the flea power. It was worth a try though.

I recently did the iDRAC 2.75.75.75 update, hoping it would solve the problem. So we already had the problem with the previous version (2.55.55.50).

As for the hardware change, I tried removing both SATA cables to the front panel, so the machines have no access to any disk and stay in the BIOS. But the problem remains, even after draining the flea power.

Thank you for helping

Moderator

2.2K Posts

July 28th, 2020 05:00

Hello,

 

Thanks for the info, and sorry that the problem persists. I'm digging through our articles to find out what causes this issue. I assume you didn't add any memory or change memory slots, did you? I have seen a similar situation on the C6525 model. I ask because added memory can result in an incorrect memory population, which can raise the fan speed even at idle.

Moderator

3.3K Posts

July 28th, 2020 06:00

Hello Xarboule,

 

That would be a good test to try. Leave one DIMM per processor. Check results.

 

You may strip the blades down to the minimum configuration to POST, check whether the fans are normal, and then add components back a little at a time until you find the faulty component:


Minimum configuration to POST

 

One power supply unit
One Processor (CPU) (minimum for troubleshooting)
One Memory Module (DIMM) installed in the socket A1
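
The add-back procedure described above can be sketched as a simple loop. This is only an illustration in Python: `find_faulting_component` and the `fans_normal` callback are hypothetical names, the extra part names in the demo are placeholders, and in practice the check is a person powering on the blade and listening to the fans.

```python
from typing import Callable, Optional, Sequence


def find_faulting_component(
    minimum: Sequence[str],
    extras: Sequence[str],
    fans_normal: Callable[[list[str]], bool],
) -> Optional[str]:
    """Add parts back one at a time; return the first one that upsets the fans.

    Returns None if the fans stay normal with everything reinstalled, and
    raises if they are already abnormal at the minimum configuration.
    """
    installed = list(minimum)
    if not fans_normal(installed):
        raise RuntimeError("fans abnormal even at minimum configuration")
    for part in extras:
        installed.append(part)
        if not fans_normal(installed):
            return part
    return None


if __name__ == "__main__":
    minimum = ["PSU 1", "CPU", "DIMM A1"]      # minimum configuration to POST
    extras = ["DIMM A2", "DIMM B1", "PSU 2"]   # placeholder names for the demo
    # Stand-in check for the demo: pretend "DIMM B1" is the culprit.
    check = lambda installed: "DIMM B1" not in installed
    print(find_faulting_component(minimum, extras, check))
```

If the fans are already abnormal at the minimum configuration, no single added part is to blame, and the function raises instead of returning a part.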

 

 

Please let me know your results

7 Posts

July 28th, 2020 06:00

Hello,

No problem, it is already great to have some help / feedback.

We did not add memory, nor change the slots in use.
Do you think it would be worth removing all the RAM modules and testing without them?

Thank you!

7 Posts

July 28th, 2020 07:00

Hello,

I just tried with the minimum configuration to POST. I even swapped the RAM module remaining in each blade, to make sure the modules are not responsible.
No change.

So, I only see 2 possibilities:

    1. The fan control card does not understand something that happens when we plug in 2 blades with this config, but it does not come from one specific blade. Only the combination triggers the problem.

    2. From what I read, the fan speed is sent by the iDRAC to the fan control card. If that is true, what if the two iDRACs disagree on the value to set? They have the exact same version and settings in the FAN section, though.

Moderator

3.3K Posts

July 28th, 2020 08:00

Hello Xarboule,

 

Please check the Lifecycle log for any errors reported. You can see this log in the DRAC or in the Lifecycle Controller by pressing F10 during boot.


Can you confirm all fans are working?

 

 A fan failure or temperature that is too high in the system will result in a notification from the BMC and the remaining fans will ramp to 100% duty cycle.

 

Try swapping the fans to get a good reseat -  remove fan 1, move all other fans down one and put fan 1 in the last slot.

7 Posts

July 28th, 2020 09:00

Hello,

I found no thermal-related alerts in the Lifecycle Controller. Nothing about temperature or fans.

And unless my iDRAC is lying, the 4 fans seem to be working.
I will try the fan swap tomorrow and keep you informed.

Thank you again.

Moderator

3.3K Posts

July 28th, 2020 11:00

Hello Xarboule,

 

That sounds good. I'll watch for an update. Check over these things also:


CAUTION: To maintain proper system cooling, all empty hard-drive bays must have drive blanks installed.
Dummy HDD (Required for non-populated drive slots)

 


Ensure that none of the following conditions exist:

 

System cover, cooling shroud, drive blank, front or back filler panel is removed.
Ambient temperature is too high.
External airflow is obstructed.
Cables inside the system obstruct airflow.
An individual cooling fan is removed or has failed.

Moderator

3.3K Posts

July 31st, 2020 10:00

Hello Xarboule,

 

It is great to see you were able to get it working.
If you want to investigate the other slots further, you may check the cable connections for loose connections or bent pins.
Hope you have a great day!
