Unsolved

1 Rookie

 • 

34 Posts

November 24th, 2020 03:00

iDRACs SNMP doesn't respond steadily

I have a R430 that simply doesn't like to give me SNMP responses like the rest of our 33 R430 servers do.

I made a script that keeps querying the host name over and over with a 5 second pause. The one server I have issues with might give a response, or it might not. Sometimes it times out, sometimes it says "No Such Object available on this agent at this OID". And even when it responds, I can see (with tshark) that this is only because snmpget internally retries the query. With tshark I can also see that it gets a lot of ICMP 139 Destination unreachable (Port unreachable). But then the next second it might reply.
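For readers who want to reproduce this kind of test, here is a minimal sketch of such a polling loop. It is not the poster's script. It assumes the Net-SNMP `snmpget` command is installed, uses `sysName.0` as a stand-in for "the host name", and uses a placeholder community string. Retries are disabled with `-r 0` so that every failed attempt is visible instead of being hidden by snmpget's internal retry.

```python
import subprocess
import time

# Assumption: SNMPv2-MIB::sysName.0 stands in for "the host name".
SYS_NAME_OID = "1.3.6.1.2.1.1.5.0"


def classify(returncode, stdout, stderr):
    """Map one snmpget result to a short status label."""
    text = stdout + stderr
    if "Timeout" in text:
        return "timeout"
    if "No Such Object" in text or "No Such Instance" in text:
        return "no_such_object"
    if returncode != 0:
        return "error"
    return "ok"


def poll_once(host, community="public", oid=SYS_NAME_OID, timeout_s=2):
    """Run a single snmpget without retries and time it."""
    cmd = ["snmpget", "-v2c", "-c", community,
           "-r", "0", "-t", str(timeout_s), host, oid]
    started = time.monotonic()
    proc = subprocess.run(cmd, capture_output=True, text=True)
    elapsed = time.monotonic() - started
    return classify(proc.returncode, proc.stdout, proc.stderr), elapsed


def poll_forever(host, pause_s=5):
    """Query the host repeatedly, pausing pause_s seconds between queries."""
    while True:
        status, elapsed = poll_once(host)
        print(f"{time.strftime('%H:%M:%S')} {host} {status} {elapsed:.3f}s")
        time.sleep(pause_s)
```

Calling `poll_forever("<idrac-address>")` prints one line per attempt, which makes it easy to line the failures up against a packet capture.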

Is there any way I can run some kind of tshark on the iDRAC itself to see whether or not it receives those packets? Is the iDRAC's CPU the limiting factor here? Or do I have faulty wires or bad switch ports?

The iDRACs are all connected to a HP J9775A 2530-48G Switch, revision YA.16.03.0004, ROM YA.15.19

Moderator

 • 

3.5K Posts

November 24th, 2020 09:00

Hello,

Did you try a soft reset with RACADM or the iDRAC GUI?

You can see here how to do it

https://dell.to/3nQYtM3

 

If the issue persists then make sure the system BIOS and lifecycle controller are up to date. I would also review their configuration to see if they are identical, mainly the processor.

 

Also try changing the port speed of the iDRAC.

Please let me know if it helps.

 

Thanks

Marco

1 Rookie

 • 

34 Posts

November 24th, 2020 21:00

Nope, didn't help. I dropped the speed from 1G to 100M and still get timeouts. I also compared all the network settings between a working iDRAC and the broken one and didn't find anything different.

1 Rookie

 • 

34 Posts

November 24th, 2020 21:00

Identical hardware bought on the same order. I've reset the iDRAC 2-3 times while testing this. But changing the port speed is something I haven't tried. I'll see where that can be done and test if it helps. Thanks!

4 Operator

 • 

3K Posts

November 24th, 2020 23:00

What iDRAC firmware do you have on the server where it is not working? Are all the working and non-working iDRACs connected to the same network and switch?

1 Rookie

 • 

34 Posts

November 25th, 2020 00:00

System Model: PowerEdge R430
BIOS: 2.11.0
Firmware Version: 2.75.75.75
Lifecycle Controller Firmware: 2.75.75.75

And all our 33 hosts are compliant with the latest versions Dell provides, according to OpenManage Enterprise.

I've been running these SNMP queries against all 33 hosts every now and then for a while, but this one host is the only one giving me timeouts or replies only after a retry.

When it starts totally timing out, it does so for several minutes. Then it comes back to working (albeit, tshark shows that it has to be queried multiple times before it actually gives a response). Those "No Such Object" responses are totally random here and there. But again, only from this one server.

Moderator

 • 

3.4K Posts

November 25th, 2020 01:00

Hi,

 

It's weird that 33 identical servers are working as intended but 1 is not. Have you tried swapping the network cable from the iDRAC port of one of the 33 servers to the problem server? Or have you tried swapping IP addresses with any of the 33 servers?

 

Do all servers have iDRAC Enterprise? If yes, you can swap from another server, but you may need to export the iDRAC license so you can import it to the swapped server. Be careful with the license swap so you don't mix up the licenses.

 

Let us know how it goes.

1 Rookie

 • 

34 Posts

December 6th, 2020 21:00

It seems that there really is something wrong with that specific iDRAC. We swapped the port and the wire, and that same iDRAC still fails to respond properly.

Moderator

 • 

4.1K Posts

December 6th, 2020 22:00

Hi, how about we try this?

 

The following are some example use cases:

  • Run the tool and select the SNMP protocol, specify protocol specific settings, and provide remote system IP Address when unable to discover a supported device in OpenManage Essentials or the device is classified as an unknown device while using SNMP protocol. For Dell PowerEdge servers, verify whether the tool provides Server Administrator version in the result area for a successful test (using the SNMP Protocol for an in-band server discovery).
  • Verify the ability to ping a remote device (using Ping/ICMP protocol) and the ability to resolve the host name of the device from the management station (using Name Resolution protocol).
  • Verify if the Dell OpenManage Server Administrator services are running on a remote device (using Services protocol).
  • Check warranty status of any Dell server (using Service Tag).
  • Execute a command on the remote system and verify its output.
  • Listen to incoming SNMP traps.
  • Use the tool to forward test SNMP traps.
  • Use the tool to verify the SMTP settings to ensure that OpenManage Essentials can successfully run the Alert Emails action.
  • Use the tool to verify the availability of the local or remote ports.
  • Identify the user groups of the active user.
  • View or delete the pending system update jobs in an iDRAC.
  • Verify connectivity to the deployment file share.

Hope this helps and have a great one!

Moderator

 • 

3.3K Posts

January 18th, 2021 03:00

Hi,

 

Thanks for posting this summary.

 

Please inform me if you need any further assistance.

 

regards Martin

1 Rookie

 • 

34 Posts

January 18th, 2021 03:00

We've done some heavy digging into this problem and found out lots of interesting stuff.

All of our servers are suffering from weird SNMP behaviour. The reply time to a simple SNMP query can vary between 0.02 seconds up to 50+ seconds.

When we plotted how long a simple SNMP query takes and overlaid the graphs from all the hosts, we noticed that the duration of the queries went up depending on the time of day. In other words, sometimes all the hosts responded very quickly, sometimes all of them showed huge delays, and sometimes something in between.

When we looked at the CPU load graphs from all these hosts, we noticed that they correlate. Meaning, whenever a host has load, the SNMP reply times go up.
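The correlation described above was spotted by eye on graphs. As a rough illustration of how it could be put into a number, the sketch below computes a Pearson correlation coefficient between two equally long series, for example CPU load and SNMP reply time sampled at the same moments. The function name and the sample values are made up for the example and are not data from this thread.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("a constant series has no defined correlation")
    return cov / sqrt(var_x * var_y)


# Hypothetical samples: host CPU load and SNMP reply time in seconds.
cpu_load = [0.5, 1.0, 4.0, 8.0, 2.0]
reply_s = [0.02, 0.05, 3.1, 12.0, 0.4]
print(f"correlation: {pearson(cpu_load, reply_s):.2f}")
```

A value close to 1 would support the observation that reply times rise together with host load. A value near 0 would suggest the two are unrelated.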

Now this didn't make any sense at all, especially since iDRAC has its own network port, own CPU etc.

But...then I remembered the service dcismeng, the service module installed on the operating system. Its purpose is to fetch data from the operating system to iDRAC, so that iDRAC can give me that data via the SNMP queries. Like the name of the operating system for example (system-osname .1.3.6.1.4.1.674.10892.5.1.3.6.0 and system-osversion .1.3.6.1.4.1.674.10892.5.1.3.14.0).

So I tried shutting the services down from the hosts that were suffering from the delays right now. And indeed, the delays went away.

So what we learned is that the service module on Ubuntu is the culprit for the delays, which makes it something we can't really use. The SNMP results are more important.

Moderator

 • 

3.3K Posts

January 18th, 2021 04:00

Hi,

 

I checked our database and no developer is aware of this, because we could not reproduce it.

I think you are the only one with this issue, because no one has reported this failure to our hotline before.

You can check our support page for iDRAC here:

https://dell.to/3qtg3ag

 

regards Martin

 

 

1 Rookie

 • 

34 Posts

January 18th, 2021 04:00

Well, I wouldn't mind if someone told me that this is by design. Are the developers aware of the service module having this kind of impact? Should it have? Do I have to tweak it somehow? The internal USB bus shouldn't be oversaturated after all, and querying the host name shouldn't be that big of an effort.

1 Rookie

 • 

34 Posts

June 4th, 2021 04:00

But that's not all. We've had problems with our SNMP queries ever since we set the system up, and we've been tracking the source of the problem, but simply can't find it.

So as I said earlier, we measure the time it takes for SNMP to give us a reply. We store that data in InfluxDB and draw a graph where we can compare the time it takes for different servers to reply. And these values are not even close to being the same across different servers. The load on the servers has nothing to do with the variations in time anymore, because we disabled the service module altogether. We actually uninstalled it. So we are not querying for the distro name or distro version anymore.
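As an illustration of the storage step, here is a minimal sketch that formats one measurement as an InfluxDB line protocol record. The measurement name `snmp_reply`, the `host` tag and the `reply_seconds` field are assumptions made for this example. The thread does not show the actual schema.

```python
def escape_tag(value):
    """Escape the characters that are special in line protocol tag values."""
    return (value.replace(",", "\\,")
                 .replace("=", "\\=")
                 .replace(" ", "\\ "))


def to_line(host, reply_seconds, timestamp_ns, measurement="snmp_reply"):
    """Format one reply-time sample as an InfluxDB line protocol record."""
    return (f"{measurement},host={escape_tag(host)} "
            f"reply_seconds={float(reply_seconds)} {int(timestamp_ns)}")


# Hypothetical sample: a 0.02 second reply recorded at a nanosecond timestamp.
print(to_line("idrac-01", 0.02, 1606186800000000000))
```

Keeping the host as a tag rather than a field is what allows one graph to overlay the reply times of all servers.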

The servers are more or less identical. Same model, same CPU, same amount of RAM, which shouldn't even matter, since iDRAC SNMP replies shouldn't cause any load on the CPU of the server. Same network cards, same speeds, same switches they connect to, same BIOS settings, same settings in the switches. We can't find out why this happens, and it matters because some servers constantly time out and spam our logs about it, causing alerts.

[Image: graph of SNMP reply times]
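To compare hosts with numbers rather than by eye, the reply times could be reduced to a median and a timeout share per host. The sketch below assumes the samples are already collected as a dictionary of host name to reply times in seconds, with `None` marking a timeout. That layout and the host names are hypothetical.

```python
from statistics import median


def summarize(samples):
    """Return the median reply time and timeout share for each host."""
    summary = {}
    for host, times in samples.items():
        answered = [t for t in times if t is not None]
        summary[host] = {
            "median_s": median(answered) if answered else None,
            "timeout_ratio": (len(times) - len(answered)) / len(times) if times else 0.0,
        }
    return summary


# Hypothetical samples: None marks a query that timed out.
samples = {
    "idrac-01": [0.02, 0.03, 0.02, 0.04],
    "idrac-02": [0.5, None, 12.0, None],
}
for host, stats in summarize(samples).items():
    print(host, stats)
```

Sorting the result by `timeout_ratio` gives a quick list of the hosts that are responsible for most of the alerts.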

4 Operator

 • 

2.7K Posts

June 4th, 2021 08:00

Hello @tosaraja,


As far as I can see, this issue should be escalated to the system engineers. If any of the servers affected by this issue are under warranty, I suggest you open a request via phone support and ask for a level 3 technician to be engaged.


This is either an issue that should be checked by them or an infrastructure issue.


Regards.
