Unsolved

August 23rd, 2013 05:00

OME 1.2 False Positives

Hi,

We receive false positive system down events for servers: OME reports a server as down while it is actually up and running. To try to solve this, ICMP is configured with a timeout of 2500 ms and 5 retries, but the false positives still occur. Ping times for the servers are between 40 and 200 ms, depending on their location. Status polling is set to every 5 minutes.

Is there something I have missed during the configuration of OME and/or the servers? Does anyone have a solution for this issue?

 

 

1K Posts

August 23rd, 2013 05:00

Hi Shortek,

Which version of OME are you using? If it is OME 1.1, upgrading to OME v1.1.1 or OME v1.2 should solve your problem.

If you are already using OME v1.2, then this link should be of some help:

http://en.community.dell.com/techcenter/systems-management/f/4494/p/19505533/20417596.aspx#20417596

 

1K Posts

August 23rd, 2013 06:00

Hi Shortek,

Thanks for confirming. A couple of questions:

  • Is it happening for all the servers in your device tree or only a specific range?
  • Is it happening only for servers, or also for other devices such as storage or network devices, or devices under the unknown group?
  • What is the status poll frequency set on your OME server? In other words, how often does the status poll run?
  • While the status poll runs, can you check the ping time for any of the devices for which the system up/down alert is generated?
  • Is your other monitoring application running on the same server on which OME is installed?
  • How many devices are there in your OME server, and does your OME server meet the minimum required configuration?
You can find this information in the User's Guide at

http://www.dell.com/support/Manuals/us/en/04/Product/dell-opnmang-essentials-v1.2

10 Posts

August 23rd, 2013 06:00

I am already on OME 1.2. I will have a look at the given article.

Thanks so far. I will update this thread if I manage to solve the issue with the given article.

 

10 Posts

August 23rd, 2013 06:00

The given article is not the solution I am looking for, as we have already done the testing described there. Our ICMP settings are already above normal values: timeout = 2500 ms and retries = 5.

I can't believe a normal server would take longer than 2500 ms to reply to 5 pings in a row.

Moreover, our other monitoring system shows the server as up, and its timeout is set to 500 ms with 3 retries.

 

10 Posts

August 23rd, 2013 07:00

 

  • Is it happening for all the servers in your device tree or only a specific range?
    • I split the servers into local and abroad groups. The servers abroad are the ones giving issues.
    • Our current environment runs from a test server. In the production environment, all servers will count as abroad, because the production server is in a different country than the test server.
  • Is it happening only for servers, or also for other devices such as storage or network devices, or devices under the unknown group?
    • Only servers are listed in OME, so only servers.
  • What is the status poll frequency set on your OME server? In other words, how often does the status poll run?
    • At this moment it is set to 5 minutes.
  • While the status poll runs, can you check the ping time for any of the devices for which the system up/down alert is generated?
    • We have done this and the ping shows a normal ping time. At least, the time is not more than 2500 ms.
    • The issue does not happen all the time. But when it happens, the response of the server is normal, both when pinging it and in our other monitoring environment.
  • Is your other monitoring application running on the same server on which OME is installed?
    • No, the other monitoring tool is on a different server.
  • How many devices are there in your OME server, and does your OME server meet the minimum required configuration?
    • At this moment there are 37 devices listed in OME. The server meets the minimum requirements.

Let me know if you need more information
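One way to back up the ping observations above is to save the output of a repeated ping taken during the status poll window (for example `ping -n 5 -w 2500 <host>` on a Windows server) and summarise it. The following is only a sketch and not part of OME: `parse_rtt_ms` and `summarize` are hypothetical helper names, the address `192.0.2.10` is a placeholder, and the 2500 ms threshold simply mirrors the timeout mentioned in this thread.

```python
import re
from typing import Iterable, Optional

# Matches "time=45ms" and "time<1ms" (Windows style) and "time=0.045 ms" (Linux style).
_RTT = re.compile(r"time[=<]\s*([\d.]+)\s*ms", re.IGNORECASE)


def parse_rtt_ms(line: str) -> Optional[float]:
    """Return the round-trip time in ms from one ping output line, or None."""
    match = _RTT.search(line)
    return float(match.group(1)) if match else None


def summarize(lines: Iterable[str], timeout_ms: float = 2500.0) -> dict:
    """Count replies, replies slower than timeout_ms, and lost probes."""
    replies, slow, lost = 0, 0, 0
    max_ms = None
    for line in lines:
        rtt = parse_rtt_ms(line)
        if rtt is not None:
            replies += 1
            if rtt > timeout_ms:
                slow += 1
            max_ms = rtt if max_ms is None else max(max_ms, rtt)
        elif "timed out" in line.lower() or "unreachable" in line.lower():
            lost += 1
    return {"replies": replies, "slow": slow, "lost": lost, "max_ms": max_ms}


# Example with hand-written sample lines (placeholder address, not real data).
sample = [
    "Pinging 192.0.2.10 with 32 bytes of data:",
    "Reply from 192.0.2.10: bytes=32 time=45ms TTL=128",
    "Request timed out.",
    "Reply from 192.0.2.10: bytes=32 time=3100ms TTL=128",
    "Reply from 192.0.2.10: bytes=32 time<1ms TTL=128",
]
result = summarize(sample)
```

A log summarised this way for the same minutes in which OME raises a system down alert would show whether probes were actually lost or slow at that moment.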

2.8K Posts

August 23rd, 2013 10:00

Thanks for all of this detail.

So let's try to put the status poll back to the default 1 hour and run with that for a while to see how it behaves.

One misconception with OME is that you have to drop the status poll to 5 minutes in order to get timely alerts. OME has on-demand health polling, so when an alert comes into the console we always go out and poll the device for its health status. It is therefore not usually necessary to have aggressive status polling.

Let's see what that does and report back.

Thanks much,

Rob

10 Posts

September 9th, 2013 04:00

Sorry for not coming back to this sooner.

It looks like it is going better now, but we still get some false positives now and then.

We are busy moving the OME server to a different location. This will be a fresh install, with a new database, so maybe the problems will be solved after the move.

If not, I will come back to this.

Thanks to everyone who tried to help.

10 Posts

September 16th, 2013 01:00

We have moved the installation, but the problems are not solved. One question: are the ICMP configuration settings that are set in the Discovery Ranges also used for the Status Polling?

We still have devices reported as down that are up again an hour later, because the status polling saw the devices as down while they were actually up and running. This probably has to do with missed replies during the scan. Since we run off-site backups over the LAN line, this can probably cause high ping times or connection timeouts. Upgrading the line is not an option at this moment.

So if I change the number of retries and the timeout in the ICMP settings, will this also affect the Status Polling?
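As a rough way to reason about how much extra retries could help with missed replies, the sketch below computes the chance that every probe in one poll is lost. It assumes that losses are independent from one probe to the next, which may well not hold while a backup saturates the line, and `p_all_attempts_lost` is a hypothetical helper, not an OME function.

```python
def p_all_attempts_lost(loss_rate: float, attempts: int) -> float:
    """Probability that all probes are lost, assuming independent losses."""
    if not 0.0 <= loss_rate <= 1.0:
        raise ValueError("loss_rate must be between 0 and 1")
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    return loss_rate ** attempts


# Illustrative numbers only: 50% loss with 5 and with 10 attempts.
five = p_all_attempts_lost(0.5, 5)
ten = p_all_attempts_lost(0.5, 10)
```

Under this assumption, 50% loss makes 5 attempts fail together about 3% of the time and 10 attempts about 0.1% of the time. If the line is effectively blocked for the whole poll (loss close to 100%), adding retries changes very little.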

1K Posts

September 16th, 2013 02:00

Hi,

The answer is yes, OME uses the same ICMP configuration settings for status polling as well. So if you change the number of retries and the timeout in the ICMP settings, it will affect Status Polling.

This post should help you: http://en.community.dell.com/techcenter/systems-management/f/4494/p/19505533/20417596.aspx#20417596

10 Posts

September 16th, 2013 02:00

Perfect, thanks for your response.

I am going to find the best settings by trial and error.

10 Posts

September 23rd, 2013 05:00

At this moment I have the following settings in place

ICMP Configuration:

Timeout: 2500ms

Retries: 10

Status Schedule:

1 hour

Speed is in the middle of the bar.

We still receive false positives, sometimes for 4, 5 or 6 servers at a time. When I receive a notification and ping the device myself, from my machine or from the OME server, the device is UP.

Correct me if I am wrong, but with the above settings, the status polling should only report a device as down when 10 pings in a row are either lost or take longer than 2500 ms. Am I correct?

What could cause this issue?
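The expectation described above can be written down as a small sketch. This is my reading of the settings, not OME's documented behaviour: it assumes "Retries" is the total number of probes (OME may instead send one probe plus that many retries), and `is_reported_down` and `worst_case_wait_ms` are hypothetical names.

```python
from typing import Optional, Sequence


def worst_case_wait_ms(timeout_ms: int, retries: int) -> int:
    """Longest time one poll of one device can take if every probe times out."""
    return timeout_ms * retries


def is_reported_down(
    rtts_ms: Sequence[Optional[float]],
    timeout_ms: float = 2500.0,
    retries: int = 10,
) -> bool:
    """True only if every probe was lost (None) or slower than timeout_ms."""
    attempts = rtts_ms[:retries]
    if not attempts:
        raise ValueError("at least one probe result is required")
    return all(rtt is None or rtt > timeout_ms for rtt in attempts)


# Illustrative probe results: None means no reply was received.
all_lost = is_reported_down([None] * 10)
one_reply = is_reported_down([None] * 9 + [120.0])
all_slow = is_reported_down([3000.0] * 10)
```

With these settings a single reply within 2500 ms should be enough to keep the device up, and a device would only be reported down after roughly 25 seconds of consecutive failures. If OME reports devices down faster than that, the settings may not be applied to status polling the way this sketch assumes.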

 

 
