October 26th, 2017 10:00

OME losing connection to DRACs

We have a number of DRACs in our environment that OME repeatedly fails to query for status, so every hour they either "go down" (OME can't ping them) or "come up" (OME is able to contact them again).

Every time we completely reboot our network environment (e.g. due to total power loss from a hurricane) the DRACs that OME has trouble with change to different ones. So OME might keep alerting on DRAC A but then after the network reboot OME stops alerting on DRAC A and starts alerting on DRAC B.

I know this doesn't sound like an OME/DRAC issue, but I wondered if anyone else had seen a similar problem in their environment and/or had some ideas as to what the problem might be.

Thanks.

October 29th, 2017 22:00

Hi, thanks for the query.

I would suggest increasing the status polling interval in OME to avoid these alarms until your network setup is stable. The default is 1 hour, and it can be changed to suit your needs.

November 2nd, 2017 12:00

Thanks, but you evidently didn't understand my query. You're basically asking me to stop OME from polling my servers, which defeats the point of the status polling in the first place.

Besides, that doesn't actually fix anything. Something is causing the status polling to fail for certain servers. Turning off/delaying polling doesn't fix the problem.

As stated in my original post, I've noticed that the servers that "fail" status polling have changed since the last network reboot. I'd like to know the reason for the change, but I'm drawing a blank on what to look for.

November 3rd, 2017 01:00

Since you mentioned there were IP address changes or reassignments on target servers in your environment, it would be best to first delete those servers from OME and discover them again with the correct IPs.

November 3rd, 2017 07:00

Please read my posts again. I said nothing about IP changes or reassignments. What I said was that OME has trouble polling certain devices (i.e. they sometimes respond to the poll and sometimes do not, so I get frequent erroneous up/down alerts) and that after a network reboot (i.e. all the switches and routers are powered down, e.g. due to a hurricane, and then powered back on) the specific devices that OME is unable to poll correctly change to different ones. So OME was having trouble with device A before the network reboot, and now is having trouble with device B instead.

November 3rd, 2017 07:00

OME generates these errors when it cannot reach the devices during the status poll. You can run a simple ping test to confirm the devices are responding during that period. Please also make sure the network infrastructure is healthy and is not dropping packets or producing other errors that could prevent OME from connecting to the targets.
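
For anyone who wants to try the reachability test suggested above, here is a minimal Python sketch that probes each device on a fixed interval and logs the result, so the output can be lined up against OME's status-poll times. It is only an illustration under stated assumptions: it uses a TCP connect to port 443 (assumed to be the iDRAC web interface port) as a stand-in for an ICMP ping, and the function names, interval, and round count are hypothetical, not anything OME itself does.

```python
import socket
import time
from datetime import datetime


def tcp_reachable(host: str, port: int = 443, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections, and name-resolution failures.
        return False


def count_flaps(samples: list[bool]) -> int:
    """Count up/down transitions in a chronological list of probe results."""
    return sum(1 for prev, cur in zip(samples, samples[1:]) if prev != cur)


def watch(hosts: list[str], interval: float = 60.0, rounds: int = 60) -> dict[str, list[bool]]:
    """Probe every host once per interval and print a timestamped log line."""
    history: dict[str, list[bool]] = {h: [] for h in hosts}
    for _ in range(rounds):
        for h in hosts:
            ok = tcp_reachable(h)
            history[h].append(ok)
            stamp = datetime.now().isoformat(timespec="seconds")
            print(f"{stamp} {h} {'up' if ok else 'DOWN'}")
        time.sleep(interval)
    return history


# Example (hypothetical host names):
# history = watch(["drac-a.example.com", "drac-b.example.com"])
# for host, samples in history.items():
#     print(host, "flaps:", count_flaps(samples))
```

Running this from the OME server itself, rather than from a workstation, keeps the probe on the same network path as the status poll.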

November 3rd, 2017 07:00

Thank you for the reply, but I am trying to do just that - determine what could be causing the status poll to fail. Your response to "ensure the network infrastructure is fine" doesn't give me much to go on. If I ask my network group they will say the network is fine.

This is why I posted this question, to see if anyone else had seen similar behavior in their environments and could make some suggestions as to what to look at/for to troubleshoot the issue.

I am very puzzled why OME would have issues polling certain servers and then after a network reboot have issues polling different servers. If it was a physical problem with the infrastructure or devices themselves I would expect OME to have trouble with the same devices both before and after a network reboot.
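
That reasoning can be checked against the alert history. A rough Python sketch, assuming the names of the devices that flapped before and after the network reboot have been exported from the alert log (the device names and the function are hypothetical):

```python
def compare_flapping(before: set[str], after: set[str]) -> dict[str, set[str]]:
    """Split flapping devices into those seen in both periods or only one."""
    return {
        "both": before & after,         # flapped before and after the reboot
        "only_before": before - after,  # stopped flapping after the reboot
        "only_after": after - before,   # started flapping after the reboot
    }


# Hypothetical device names, for illustration only.
result = compare_flapping({"drac-a", "drac-c"}, {"drac-b", "drac-c"})
for group, devices in result.items():
    print(group, sorted(devices))
```

A mostly empty "both" group would be consistent with the problem following the state of the network rather than specific devices, though it would not prove it.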

November 21st, 2017 14:00

I've been dealing with this issue for a while too. I had a single iDRAC node failing in OME for no apparent reason on the status poll that follows the daily discovery, then reporting as UP just a couple of minutes later. I played with Dell's suggestion of increasing the polling time (since this host was in a different site than the OME server) and that didn't help, but then a few weeks later the issue went away on its own.

Recently, however, I changed the SNMP trap settings on all iDRACs to include more data for alerts (about 155 servers) via racadm and all of a sudden about 40 of those now throw the same Node Down error right after the daily discovery, only to report as back up a few minutes later. Unbelievably frustrating.

Sorry I don't have much for you to go on, but here's hoping Dell can give us more clues as to why this is happening and, more importantly, how to fix it. Right now, I have all our Dell alerts go only to me, as I can't spam the entire team with daily false alerts.

By the way, I'm on OME 2.2 and iDRACs are on 2.41.40.40. Does OME 2.3 have the same issues? I think Dell has changed the way they're doing status polling in 2.3.

November 21st, 2017 15:00

Thanks for the response. I sympathize...

I noticed that your nodes report being back up "a few minutes later" whereas mine wait for the next scheduled status poll (1 hour for us). Do you poll every few minutes?

I wouldn't know about OME 2.3 as I'm still running 2.2.0.2056. My iDRACs are on similar levels too. If all it takes to fix this is an upgrade that would be great, but I don't want to make things worse...
