802716

July 29th, 2011 02:00

Power supply detected a failure Sensor location

Hello,

    I get these alerts in the hardware log in OpenManage:

 1304 Thu Jul 28 03:52:12 2011 Instrumentation Service Redundancy regained Redundancy unit: System Board PS Redundancy Chassis location: Main System Chassis Previous redundancy state was: Lost
 1352 Thu Jul 28 03:52:05 2011 Instrumentation Service Power supply returned to normal Sensor location: PS 2 Status Chassis location: Main System Chassis Previous state was: Critical (Failed) Power Supply type: AC Power Supply state: Presence detected
 1354 Thu Jul 28 03:51:40 2011 Instrumentation Service Power supply detected a failure Sensor location: PS 2 Status Chassis location: Main System Chassis Previous state was: OK (Normal) Power Supply type: AC Power Supply state: Presence detected, Failure detected
 1306 Thu Jul 28 03:51:40 2011 Instrumentation Service Redundancy lost Redundancy unit: System Board PS Redundancy Chassis location: Main System Chassis Previous redundancy state was: Normal

    And these in the hardware log in OpenManage:

 Thu Jul 28 03:51:32 2011 Can not communicate with power supply 2.
 Thu Jul 28 03:51:33 2011 Power supply redundancy is lost.
 Thu Jul 28 03:51:33 2011 Power supply 2 failed.
 Thu Jul 28 03:51:43 2011 The power supplies are redundant.
 Thu Jul 28 03:51:43 2011 Power supply 2 is operating normally.
 Thu Jul 28 08:56:50 2011 Communications are restored for power supply 1.
 Thu Jul 28 08:56:50 2011 Can not communicate with power supply 1.
 Thu Jul 28 08:56:52 2011 Power supply 1 failed.
 Thu Jul 28 08:56:53 2011 Power supply redundancy is lost.
 Thu Jul 28 08:56:57 2011 Power supply 1 is operating normally.
 Thu Jul 28 08:56:58 2011 The power supplies are redundant.

         Any ideas?
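
For anyone comparing logs like the ones above, here is a minimal Python sketch that pairs each "failed" entry with the next "is operating normally" entry for the same power supply and reports how long the fault lasted. The sample lines are copied from the hardware log in this post; the function names `parse_line` and `outage_durations` are made up for illustration, and the parsing assumes an English locale and this exact message wording.

```python
import re
from datetime import datetime

# Sample lines copied from the hardware log in this post.
LOG = """\
Thu Jul 28 03:51:33 2011 Power supply 2 failed.
Thu Jul 28 03:51:43 2011 Power supply 2 is operating normally.
Thu Jul 28 08:56:52 2011 Power supply 1 failed.
Thu Jul 28 08:56:57 2011 Power supply 1 is operating normally.
"""


def parse_line(line):
    """Split a log line into (timestamp, message)."""
    # The first five whitespace-separated tokens form the timestamp,
    # e.g. 'Thu Jul 28 03:51:33 2011'; the rest is the message.
    parts = line.split(maxsplit=5)
    stamp = datetime.strptime(" ".join(parts[:5]), "%a %b %d %H:%M:%S %Y")
    message = parts[5].strip() if len(parts) > 5 else ""
    return stamp, message


def outage_durations(log_text):
    """Return {psu_number: [seconds from 'failed' to 'operating normally']}."""
    failed_at = {}
    durations = {}
    for line in log_text.splitlines():
        if not line.strip():
            continue
        stamp, message = parse_line(line)
        match = re.match(
            r"Power supply (\d+) (failed|is operating normally)\.", message
        )
        if not match:
            continue
        psu, event = int(match.group(1)), match.group(2)
        if event == "failed":
            failed_at[psu] = stamp
        elif psu in failed_at:
            seconds = (stamp - failed_at.pop(psu)).total_seconds()
            durations.setdefault(psu, []).append(seconds)
    return durations


print(outage_durations(LOG))
```

On the sample lines this reports a 10 second fault on PS 2 and a 5 second fault on PS 1, so both supplies recover within seconds of each reported failure.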

29 Posts

July 29th, 2011 15:00

I agree with Todd and would add that another level of test is to run each power supply independently. This eliminates the slight potential for one power supply to be just out of tolerance (ratio-wise) with the other. That isn't commonly a problem in dual power supply chassis, but it is a frequent problem in the tri and quad power supply systems (higher likelihood of inconsistent tolerances across all power supplies).

5 Practitioner • 274.2K Posts

July 29th, 2011 06:00

A number of scenarios can produce these errors, ranging from UPS issues to power distribution board issues.  What server model are the errors coming from?  From the errors, only power supply 2 is giving specific errors, and that can potentially cause power supply 1 to have issues as well.  Are there any lights other than the green light on PS2, perhaps an amber light?  Also, are the power supplies connected via a Y cable?

I would advise powering down, swapping power supply 2 with power supply 1, clearing the hardware log, and monitoring from there.  Obviously, if either power supply shows an amber light, that would indicate the failure.

188 Posts

July 31st, 2011 03:00

The server is a Dell PE R710; it's in a datacenter and gets power from 2 different UPSes, one for each PS.

What worries me are messages like these:

Thu Jul 28 08:56:52 2011 Power supply 1 failed.

Thu Jul 28 03:51:33 2011 Power supply 2 failed.

The logs above were taken after swapping the PSs; I only get errors roughly every 92 hours.

Thanks a lot.

6 Posts

September 10th, 2011 17:00

We're getting similar problems on two R515 servers, each at a different data-centre, each on separate UPS for PSU1 and PSU2. The first log entry in the iDRAC log usually shows a CRC error communicating with the PSU. These servers have been known to reboot unexpectedly on at least one occasion each in around 4 months.

I'm beginning to wonder if there's an issue with the embedded SMBus/I2C controller.

Fortunately, the restarts have been at times when the servers are idle - though this may have been coincidental. Still working on this one, so any further information or suggestions would be appreciated.

In my latest attempt to solve this, I blacklisted the i2c_piix4 module, rebooted, then performed a hard reset on BMC or iDRAC6 (depending on what was fitted to each specific machine). No symptoms have reappeared for around 20 days so far.
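
For anyone trying the same thing, here is a minimal Python sketch for confirming after the reboot that the module really stayed unloaded. It assumes a Linux host and that the blacklist was done the usual way, with a `blacklist i2c_piix4` line in a file under /etc/modprobe.d/. The helper names and the sample text are illustrative only, not output from the servers in this thread.

```python
def loaded_modules(proc_modules_text):
    """Return the set of module names listed in /proc/modules-style text."""
    # Each non-empty line starts with the module name.
    return {
        line.split()[0]
        for line in proc_modules_text.splitlines()
        if line.strip()
    }


def is_loaded(name, proc_modules_text):
    """Check whether a module name appears in /proc/modules-style text."""
    # The kernel lists module names with underscores, so normalise hyphens.
    return name.replace("-", "_") in loaded_modules(proc_modules_text)


# Illustrative sample in the /proc/modules layout (hypothetical values).
SAMPLE = """\
i2c_piix4 12574 0 - Live 0xffffffffa0000000
ipmi_si 42399 0 - Live 0xffffffffa0010000
"""

print(is_loaded("i2c_piix4", SAMPLE))
```

On a real machine, pass in `open("/proc/modules").read()` instead of `SAMPLE`; a result of `False` for `i2c_piix4` would suggest the blacklist took effect.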

UPDATE:

The problem reappeared. These sporadic reboots are affecting 2 x R415 and 2 x R515 servers. I'm convinced that something is corrupting one or more of the two-wire sensor busses (I2C and derivatives). Intervals between reboots have ranged from two days to 5 months, but are typically two weeks or more.

There were a number of firmware updates available via the 'update_firmware' utility that didn't seem to show up on Dell's support site. I applied everything that was available, and also downgraded my iDRAC6/BMC firmware to 1.54 from 1.80 (the BMC intuitively seemed to be a strong candidate). I also reset the BMC to default settings and reconfigured using OpenManage (the servers are remote).

If the problem reappears, my next steps are a) reset the NVRAM using the motherboard jumpers and reconfigure the machines entirely, and, if that doesn't work, b) permanently reboot the BMC/iDRAC in a loop (to prevent it from doing anything). A bit drastic, but if it works I won't really care.

1 Message

February 17th, 2012 10:00

Hi there,

I'm having the exact same problem with a Dell R610.  We have 14 others, this is the only one with this problem.  Changed the power supply on PS2.  Changed the physical power cord.  Checked the power: PS1 is on UPS and PS2 is straight AC power.

Is there a board sensor that could be in error?

Latest Errors from Server Admin Console:

Critical;Thu Feb 16 11:54:23 2012;PS 1 Status: Power Supply sensor for PS 1, failure (Communications error) was asserted
Ok;Thu Feb 16 11:54:25 2012;PS 1 Status: Power Supply sensor for PS 1, failure (Communications error) was deasserted

Ok;1152;Mon Jan 30 02:19:10 2012;Instrumentation Service;Voltage sensor returned to a normal value
Sensor location: PS 2 Voltage
Chassis location: Main System Chassis
Previous state was: Unknown
Voltage sensor value (in Volts): 122.000

Ok;1151;Mon Jan 30 02:18:46 2012;Instrumentation Service;Voltage sensor value unknown
Sensor location: PS 2 Voltage
Chassis location: Main System Chassis
Previous state was: OK (Normal)
Voltage sensor value (in Volts): 0.000

6 Posts

April 7th, 2012 18:00

That error you posted looks very familiar. Interestingly, upgrading all the firmware has caused the errors being reported to change, though the machines would still reboot.  I used ipmitool to disable the Event Message Buffer, and so far none of our problem servers has rebooted (a couple of months now?). See a blog post I've just written on this for more details... www.susa.net/.../dell-r415-r515-11g-random-reboots

6 Posts

July 25th, 2012 17:00

@Todd Pietzsch

In my case, I don't believe that any of the reported hardware is actually at fault. Here are my reasons:

1. The R415/R515 pairs were bought 3 months apart. All four exhibit the same problems.

2. The errors have changed over time. I've had OEM errors, ECC errors, PSU errors, and CPU machine check errors.

3. There's a temporal element to the occurrences - roughly two days, two weeks, or six weeks.

4. The 'last error' IPMI command (I forget the specific command) shows a date of January 1970 (suggesting all zeros, and an invalid entry).

5. The machines otherwise run perfectly.
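
To illustrate point 4: a timestamp field of all zeros, if it is read as seconds since the Unix epoch, decodes to 1 January 1970. This is only a sketch under that assumption, and `decode_timestamp` is a made-up helper name, not anything from ipmitool.

```python
from datetime import datetime, timezone


def decode_timestamp(seconds):
    """Interpret a raw timestamp as seconds since the Unix epoch (UTC)."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


# An all-zero timestamp field decodes to the start of the epoch.
print(decode_timestamp(0).isoformat())  # 1970-01-01T00:00:00+00:00
```

So a January 1970 date is what an empty or never-written entry would look like, rather than evidence of a real event on that date.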

I'd put money on this being BMC related. There is a bug in these machines, which share a suspiciously similar chipset. Something's not playing ball.

1 Message

February 20th, 2017 07:00

I have a PowerEdge C6220 server. We have a problem with power failures, and I also have 5 kVA and 6 kVA UPSes. When the electricity fails, everything in my computer lab, including the routers and switches, switches over to the UPS, but my server powers off instead of switching to the UPS. Kindly advise what I should do with this server.
