Unsolved
iDRACs SNMP doesn't respond steadily
I have an R430 that simply doesn't like to give me SNMP responses the way the rest of our 33 R430 servers do.
I wrote a script that queries the host name over and over with a 5-second pause. The one server I have issues with might give a response, or it might not. Sometimes it times out, and sometimes it says "No Such Object available on this agent at this OID". Even when it responds, I can see (with tshark) that this is only because snmpget retries internally. With tshark I can also see that it gets a lot of ICMP 139 Destination unreachable (Port unreachable). But then the next second it might reply.
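The polling loop described above could look roughly like the following sketch. It is not the original script: the address, the community string and the choice of the standard sysName.0 OID are assumptions, and it shells out to the net-snmp `snmpget` command with retries disabled (`-r 0`), so that a reply obtained only after an internal retry shows up as a timeout instead of being hidden.

```python
import subprocess
import time

# Hypothetical values: replace with your own iDRAC address and community string.
HOST = "192.0.2.10"
COMMUNITY = "public"
SYS_NAME_OID = "1.3.6.1.2.1.1.5.0"  # standard MIB-II sysName.0 (assumed "host name" OID)


def classify(returncode, stdout, stderr):
    """Map one snmpget result to 'ok', 'no_such_object', 'timeout' or 'error'."""
    if "No Such Object" in stdout:
        return "no_such_object"
    if "Timeout" in stderr:
        return "timeout"
    return "ok" if returncode == 0 else "error"


def probe(host=HOST, community=COMMUNITY, oid=SYS_NAME_OID):
    """Run a single snmpget with retries disabled and measure how long it takes."""
    cmd = ["snmpget", "-v", "2c", "-c", community, "-r", "0", "-t", "2", host, oid]
    start = time.monotonic()
    result = subprocess.run(cmd, capture_output=True, text=True)
    elapsed = time.monotonic() - start
    return classify(result.returncode, result.stdout, result.stderr), elapsed


def run(count, interval=5.0):
    """Probe `count` times, pausing `interval` seconds between queries."""
    for _ in range(count):
        status, elapsed = probe()
        print(f"{time.strftime('%H:%M:%S')} {status} {elapsed:.3f}s")
        time.sleep(interval)
```

Calling `run(720)` would cover about an hour at the 5-second interval. Logging the status and elapsed time per query makes it easier to line the failures up with a tshark capture taken at the same time.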
Is there any way to run some kind of tshark on the iDRAC itself to see whether or not it receives those packets? Is the iDRAC's CPU the limiting factor here? Or do I have faulty wires or bad switch ports?
The iDRACs are all connected to a HP J9775A 2530-48G Switch, revision YA.16.03.0004, ROM YA.15.19
DELL-Marco B
Moderator
November 24th, 2020 09:00
Hello,
Did you try a soft reset with RACADM or the iDRAC GUI?
You can see here how to do it
https://dell.to/3nQYtM3
If the issue persists then make sure the system BIOS and lifecycle controller are up to date. I would also review their configuration to see if they are identical, mainly the processor.
Also try changing the port speed of the iDRAC.
Please let me know if it helps.
Thanks
Marco
tosaraja
1 Rookie
November 24th, 2020 21:00
Nope, didn't help. I dropped the speed from 1G to 100M and still get timeouts. I also compared all the network settings between a working iDRAC and the broken one, and didn't find anything different.
tosaraja
1 Rookie
November 24th, 2020 21:00
Identical hardware bought on the same order. I've reset the iDRAC 2-3 times while testing this. But changing the port speed is something I haven't tried. I'll see where that can be done and test if it helps. Thanks!
DELL-Shine K
4 Operator
November 24th, 2020 23:00
Which iDRAC firmware do you have on the server where it is not working? Are all the working and non-working iDRACs connected to the same network and switch?
tosaraja
1 Rookie
November 25th, 2020 00:00
System Model: PowerEdge R430
BIOS: 2.11.0
Firmware Version: 2.75.75.75
Lifecycle Controller Firmware: 2.75.75.75
All 33 of our hosts are compliant with the latest versions Dell provides, according to OpenManage Enterprise.
I've been running these SNMP queries against all 33 hosts every now and then for a while, but this one host is the only one giving me timeouts or replies only after a retry.
When it starts totally timing out, it does so for several minutes. Then it comes back to working (albeit, tshark shows that it has to be queried multiple times before it actually gives a response). Those "No Such Object" responses are totally random here and there. But again, only from this one server.
DELL-Joey C
Moderator
November 25th, 2020 01:00
Hi,
It's weird that 33 identical servers are working as intended but one is not. Have you tried swapping the network cable from the iDRAC port of one of the working servers to the problem server? Or have you tried swapping IP addresses with any of the working servers?
Do all servers have iDRAC Enterprise? If yes, you can swap from another server, but you may need to export the iDRAC license so you can import it on the swapped server. Be careful with the license swap so you don't mix up the licenses.
Let us know how it goes.
tosaraja
1 Rookie
December 6th, 2020 21:00
It seems that there really is something wrong with that specific iDRAC. We swapped the port and the wire, and that same iDRAC still fails to respond properly.
DELL-Young E
Moderator
December 6th, 2020 22:00
Hi, how about we try this?
The following are some example use cases:
Hope this helps and have a great one!
Dell-Martin S
Moderator
January 18th, 2021 03:00
Hi,
Thanks for posting this summary.
Please inform me if you need any further assistance.
regards Martin
tosaraja
1 Rookie
January 18th, 2021 03:00
We've done some heavy digging into this problem and found out lots of interesting stuff.
All of our servers are suffering from weird SNMP behaviour. The reply time to a simple SNMP query can vary between 0.02 seconds up to 50+ seconds.
When we plotted the time a simple SNMP query takes and overlaid the graphs from all the hosts, we noticed that the duration of the queries went up depending on the time of day. In other words, sometimes all the hosts seemed to respond very quickly, sometimes all of them showed huge delays, and sometimes something in between.
When we looked at the CPU load graphs from all these hosts, we noticed that they correlate. Meaning, whenever a host has load, the SNMP reply times went up.
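To put a number on "they correlate", one option is a Pearson correlation coefficient between the host's CPU load and the SNMP reply time sampled at the same moments. The snippet below is only a sketch of that idea; the function name and the sample values in the usage note are made up for illustration and are not our measurements.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sums of products of deviations from the mean.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("a constant series has no defined correlation")
    return cov / math.sqrt(var_x * var_y)
```

For example, `pearson([0.5, 2.0, 8.0, 16.0], [0.02, 0.4, 6.0, 30.0])` pairs hypothetical load averages with hypothetical reply times in seconds. A value close to 1 would support the observation that reply times rise with host load, while a value near 0 would not.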
Now this didn't make any sense at all, especially since iDRAC has its own network port, own CPU etc.
But then I remembered the dcismeng service, the service module installed on the operating system. Its purpose is to pass data from the operating system to iDRAC so that iDRAC can return that data in SNMP queries, for example the name and version of the operating system (system-osname .1.3.6.1.4.1.674.10892.5.1.3.6.0 and system-osversion .1.3.6.1.4.1.674.10892.5.1.3.14.0).
So I tried shutting the service down on the hosts that were suffering from delays at that moment. And indeed, the delays went away.
What we learned is that the service module on Ubuntu is the culprit for the delays, which means we can't really use it. The SNMP results are more important.
Dell-Martin S
Moderator
January 18th, 2021 04:00
Hi,
I checked our database, and no developer is aware of this because we could not reproduce it.
I think you are the only one with this issue, because no one has reported this failure to our hotline before.
You can check our support page for iDRAC here:
https://dell.to/3qtg3ag
regards Martin
tosaraja
1 Rookie
January 18th, 2021 04:00
Well, I wouldn't mind if someone told me that this is by design. Are the developers aware of the service module having this kind of impact? Should it have? Do I have to tweak it somehow? The internal USB bus shouldn't be oversaturated after all, and querying the host name shouldn't be that big of an effort.
tosaraja
1 Rookie
June 4th, 2021 04:00
But that's not all. We've had problems with our SNMP queries ever since we set the system up, and we've been tracking the source of the problem, but simply can't find it.
So as I said earlier, we measure the time it takes for SNMP to give us a reply. We store that data in InfluxDB and draw a graph where we can compare the time it takes for different servers to reply. These values are not even close to being the same across different servers. The load on the servers has nothing to do with the variations in time anymore, because we disabled the service module altogether. We actually uninstalled it. So we are not querying for the distro name or distro version anymore.
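For comparing servers, a per-host summary is sometimes easier to read than overlaid graphs. The following is a minimal sketch that assumes the reply times have already been exported as plain lists of seconds, with `None` marking a query that timed out. The function names and the data layout are illustrative and have nothing to do with how the data is actually stored.

```python
import math
import statistics


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(samples):
    """Summarize reply times in seconds; None marks a timed-out query."""
    answered = [s for s in samples if s is not None]
    return {
        "count": len(samples),
        "timeouts": len(samples) - len(answered),
        "median": statistics.median(answered) if answered else None,
        "p95": percentile(answered, 95) if answered else None,
    }


def rank_hosts(samples_by_host):
    """Host names ordered worst first: never-answering hosts, then highest p95."""
    summaries = {host: summarize(s) for host, s in samples_by_host.items()}

    def sort_key(host):
        p95 = summaries[host]["p95"]
        # Hosts without any answer sort first; the rest by descending p95.
        return (p95 is not None, -(p95 or 0.0))

    return sorted(summaries, key=sort_key)
```

Feeding it something like `rank_hosts({"host-a": [0.02, 0.03], "host-b": [0.5, None, 1.0]})` (hypothetical host names and values) lists the slowest responders first, which helps pick out the servers worth capturing traffic on.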
The servers are more or less identical. Same model, same CPU, same amount of RAM (which shouldn't even matter, since iDRAC SNMP replies shouldn't cause load on the server's CPU). Same network cards, same speeds, same switches they connect to, same BIOS settings, same settings in the switches. We can't find out why this happens, and it matters because some servers constantly time out and spam our logs about it, causing alerts.
DiegoLopez
4 Operator
June 4th, 2021 08:00
Hello @tosaraja,
As far as I can see, this issue should be escalated to the system engineers. If any of the servers affected by this issue are under warranty, I suggest you open a request via phone support and ask for a level 3 technician to be engaged.
This is either an issue that should be checked by them or an infrastructure issue.
Regards.