December 8th, 2014 08:00

Alerts - Server Off

Hello,

I've been trying to set up notifications in OME 2.0.1 to send an email alert when a server has a warning or critical error. I'm still testing it, but I did receive an alert when one of our servers reported that its hardware log was full.

Over the weekend, one of our site servers went down due to a power outage, but no alert was sent. When I edit the email alert I created, everything under Alert Categories is checked, which should include power alerts.

Is there something I'm missing that is needed for an alert to be sent when a server is registering an Off state in OME?

Any help would be appreciated.  Thank you.

Ryan

2 Intern

 • 

2.8K Posts

December 8th, 2014 08:00

Hi Ryan,

Can you post a screen shot or the text of the Power SNMP alert from the SNMP console in OME?  That might give us some clue on why the email rule did not trigger.

Thanks,

Rob

delltechcenter.com/ome

22 Posts

December 8th, 2014 09:00

So in the fourth screenshot, the server highlighted in blue, gws01, is the server that was off this morning, but no message was sent. I was able to send a test message to the email address and it was received. (I removed the email addresses from the screenshots.)

22 Posts

December 8th, 2014 09:00

22 Posts

December 8th, 2014 13:00

Ah, I apologize, I misunderstood. Looking through the console alerts, there is no alert saying that the server is down. It shows a controller information alert Saturday morning and then an SNMP Agent cold start message this morning when the server was powered back on. OME never generated an alert for this event.

Is this something I need to configure in OME?

2 Intern

 • 

2.8K Posts

December 8th, 2014 13:00

Yeah, Ok, but can you post a screen shot of the actual _alert_ that came in to the SNMP alert console in OME itself?  That alert would have the OID, IP address, etc.

Thanks,

Rob

22 Posts

December 8th, 2014 14:00

Yes, it was a power outage. The server likely did go down gracefully, since it's on a UPS with software that powers the server down after a set amount of time. However, I would really like a way to receive an alert when a server is reported as being in an Off condition in the console. Is this possible?

I appreciate your feedback.

2 Intern

 • 

2.8K Posts

December 8th, 2014 14:00

Oh...well there's your problem :) :)

I think if the server "up and dies" then there is no outgoing SNMP alert from the crashed box.  The email gets sent as a filter on the incoming SNMP alerts.  No SNMP alert, then no email.

There are some predictive / redundancy failure events, but that depends on the cause of failure.

Do you know why it went down?

Rob
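The point above, that the email action is only a filter over incoming SNMP alerts, can be pictured with a small sketch. This is not OME's implementation; the `Alert` and `EmailRule` types and the `emails_to_send` helper are hypothetical names used purely for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set


@dataclass
class Alert:
    """Hypothetical stand-in for an alert that reached the console."""
    source: str
    category: str
    severity: str
    message: str


@dataclass
class EmailRule:
    """Hypothetical email action: a filter on category and severity."""
    categories: Set[str]
    severities: Set[str]

    def matches(self, alert: Alert) -> bool:
        return alert.category in self.categories and alert.severity in self.severities


def emails_to_send(rule: EmailRule, incoming: Iterable[Alert]) -> List[str]:
    """Return one email line per incoming alert that the rule matches."""
    return [f"{a.source}: {a.message}" for a in incoming if rule.matches(a)]
```

However broad the rule is, an empty list of incoming alerts yields an empty list of emails, which is the situation a server that simply loses power leaves behind.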

2 Intern

 • 

2.8K Posts

December 8th, 2014 14:00

Ok, is your UPS set up to send SNMP alerts when it kicks in and does a shutdown?  This would be something outside the OMSA or iDRAC instrumentation.  OME does support MIB import.  So if there is a way to do this with your UPS and it has a MIB, that might be a way to do this.

Thanks,

Rob

December 8th, 2014 22:00

Another way to achieve this in OME would be through internal connection status alerts. OME generates an internal connection status alert whenever it detects a change in the connection status of a device (On to Off or vice versa). Internal connection status alerts can be enabled through OME preferences. Navigate to Preferences --> Alert Settings, select the required check box, and apply the settings as shown below:

Internal connection status alert looks like this:

Since you have selected all the alert categories and severity for the email action, this should trigger an email when the server is reported down in the subsequent discovery operation or status poll.

Hope this helps.
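As a rough illustration of the idea behind connection status alerts (an alert raised by the monitoring side when a poll sees a device change state, rather than one sent by the device itself), here is a minimal sketch. It is not how OME implements polling; `is_reachable`, `poll_once`, the TCP port, and the message wording are all assumptions made for the example.

```python
import socket
from typing import Callable, Dict, Iterable, List


def is_reachable(host: str, port: int = 22, timeout: float = 2.0) -> bool:
    """Hypothetical probe: True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def poll_once(
    hosts: Iterable[str],
    last_status: Dict[str, str],
    probe: Callable[[str], bool] = is_reachable,
) -> List[str]:
    """Probe each host, update last_status in place, and return one
    message for every host whose status differs from the previous poll."""
    alerts = []
    for host in hosts:
        status = "On" if probe(host) else "Off"
        previous = last_status.get(host)
        # The first poll only records a baseline; it raises no alert.
        if previous is not None and previous != status:
            alerts.append(
                f"Device {host} has changed connection status from {previous} to {status}."
            )
        last_status[host] = status
    return alerts
```

Because the change is only noticed when a poll runs, the alert arrives at the next discovery or status poll after the server goes down, not at the moment of failure.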

22 Posts

December 9th, 2014 08:00

Ok, I've enabled the Internal Connection Status alerts. I'll see if there is a box I can set up to test this with. Thank you for the suggestions, I really appreciate it.

We're getting new PowerEdge T420s for all our school locations, so I want to make sure OME is working properly before that happens (because in theory we should have no problems with our new servers, right...).

22 Posts

December 9th, 2014 08:00

Ok, I found a server that I could power off and I did receive two alerts to my email (I removed identifiable names and tag numbers).

Device:******************* **************, Service Tag:**********, Asset Tag:, Date:12/9/2014, Time:09:13:34, Severity:Warning, Message:LinkDown Port or number: 14

and

Device:******************* **************, Service Tag:***************, Asset Tag:, Date:12/9/2014, Time:09:13:36, Severity:Warning, Message:Device *************** has changed status to Unknown.

My last question: is there a way for the alert to report the status as Off instead of Unknown? This is certainly better than nothing, but Unknown isn't specific, even though the console shows the Status as Off.
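For anyone who wants to post-process these emails, the two alert lines quoted above share a fixed field order, so they can be split with a regular expression. This is only a sketch based on those two samples; `parse_alert` is a hypothetical helper, and the field order is assumed to be stable across alert types.

```python
import re
from typing import Dict, Optional

# Field order taken from the two sample emails quoted in this thread.
ALERT_PATTERN = re.compile(
    r"Device:(?P<device>.*?), Service Tag:(?P<service_tag>.*?), "
    r"Asset Tag:(?P<asset_tag>.*?), Date:(?P<date>.*?), "
    r"Time:(?P<time>.*?), Severity:(?P<severity>.*?), Message:(?P<message>.*)"
)


def parse_alert(line: str) -> Optional[Dict[str, str]]:
    """Split one alert line into named fields, or return None if the
    line does not follow the expected layout."""
    match = ALERT_PATTERN.match(line.strip())
    return match.groupdict() if match else None


def is_status_change(alert: Dict[str, str]) -> bool:
    """True for messages of the 'has changed status to ...' kind."""
    return "has changed status to" in alert["message"]
```

A mail filter built on this could, for example, flag only the status-change messages and ignore the LinkDown one.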

2 Intern

 • 

2.8K Posts

December 9th, 2014 17:00

Ok, just to clarify...when you click on the down server, then you are saying the health is unknown (on the tree)  and the connection status is showing off...is that correct?

Rob

1 Rookie

 • 

49 Posts

December 10th, 2014 10:00

For this I imported the UPS MIBs and then monitored the UPS. This alerted us to a problem with power to the server through the UPS. It's highly unlikely there would be a situation where the server and UPS would die at the very same moment.

22 Posts

December 11th, 2014 09:00

Wow, I'm definitely going to look into this.  Thank you.

December 11th, 2014 09:00

The connection status alert (reporting that the system is down) has higher priority than the health status alert (reporting unknown health). So when the server goes down, regardless of whether a discovery or a status poll has been performed, the alert reporting that the system is down should ideally be received first.

Can you try the use case once more?

1. Discover the server.

2. Shut it down.

3. Run discovery or status poll.

4. Observe the alert log.

We would like to make sure that the behavior you observed is consistent. If it is, we would like to take a deeper look at your OME setup. You can open a support ticket based on this result.
