June 3rd, 2015 13:00

iDRAC7 unresponsive issue continues to 2.10.10.10 firmware on R620

Hi.


I've got 4 Dell 12th gen R620 servers, each with iDRAC 7 Express. Over time, the iDRAC 7 becomes unresponsive: I can't ping it or SSH to it, even though it was previously accessible. I recently installed the updated Lifecycle Controller/iDRAC firmware 2.10.10.10, which was supposed to resolve this "known" issue with iDRAC 7. Unfortunately, it does not completely resolve the issue.

The 4 servers are running Red Hat Enterprise Linux 7.1. One of them is a file server, while the other 3 are virtualization hosts for an oVirt virtualization cluster. After several weeks, I found that while the file server iDRAC was still responding, the iDRAC on all 3 virtualization nodes had stopped responding entirely. I wanted to run "racadm racreset" to soft reset the iDRAC, but I had no local OMSA install, and therefore no local copy of racadm, so I could not. I tried "ipmitool mc reset warm", which I'm told should be the equivalent of a soft reboot of the iDRAC through IPMI, but this returned "MC reset command failed: Invalid command", even though "ipmitool mc reset" returns: " Not enough parameters given. usage: mc reset ". IPMI is configured, and I can query the power status through IPMI.

I installed OMSA to get a local racadm tool so that I could issue the soft reset. The iDRAC did reboot, but was still very sluggish. For comparison, when I ping the file server iDRAC interface, 100% of the packets get through, and accessing it via, say, SSH is very responsive. After the soft reset, the iDRACs on the virtualization hosts dropped more than 80% of the ping packets. When I manually requested power status via IPMI, I would sometimes get a response, and other times the connection would time out. Response was EXTREMELY slow. I tried a "racreset hard", and surprisingly, this did not solve the problem either. Finally, I rebooted each of the 3 hosts, and the problem was gone. I'm pretty sure it will be back in a few weeks.
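To put numbers on the packet loss described above, the comparison between a healthy and a sluggish iDRAC could be scripted. This is only a minimal Python sketch, assuming the Linux iputils `ping` and its usual summary line; the function names `parse_packet_loss` and `measure_loss` are made up for illustration and are not part of any Dell tool.

```python
import re
import subprocess

# Matches the summary line printed by Linux iputils ping, for example:
# "10 packets transmitted, 2 received, 80% packet loss, time 9012ms"
_LOSS_RE = re.compile(r"(\d+(?:\.\d+)?)% packet loss")


def parse_packet_loss(ping_output):
    """Return the packet-loss percentage from ping's summary, or None if absent."""
    match = _LOSS_RE.search(ping_output)
    return float(match.group(1)) if match else None


def measure_loss(host, count=10):
    """Ping `host` `count` times and return the reported loss percentage.

    Assumption: `ping -c COUNT HOST` is available (Linux iputils syntax).
    """
    result = subprocess.run(
        ["ping", "-c", str(count), host],
        capture_output=True,
        text=True,
    )
    return parse_packet_loss(result.stdout)
```

Calling `measure_loss()` with the file server iDRAC address and then with a virtualization host iDRAC address, at regular intervals, would show when the loss starts to climb rather than only that it has.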

What differentiates the 3 servers from the file server is that oVirt uses IPMI to ask them for their power status once every few minutes. Nobody is really querying the status of the file server. That suggests there is still some kind of memory leak in this firmware, triggered by the repeated IPMI queries.
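If periodic IPMI queries are the trigger, the same kind of polling can be reproduced outside oVirt while recording how long each query takes. The following is a rough Python sketch under several assumptions: that the query goes over the LAN with ipmitool's `lanplus` interface, that the host, user, and password values are placeholders, and that a 180-second interval approximates "every few minutes". It does not reflect how oVirt itself issues the query.

```python
import subprocess
import time


def power_status_cmd(host, user, password):
    """Build an ipmitool command asking the BMC for chassis power status."""
    return ["ipmitool", "-I", "lanplus", "-H", host, "-U", user, "-P", password,
            "chassis", "power", "status"]


def timed_query(cmd, timeout=30):
    """Run `cmd` once and return (ok, seconds_elapsed, output_or_error)."""
    start = time.monotonic()
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
        ok = proc.returncode == 0
        text = (proc.stdout or proc.stderr).strip()
    except subprocess.TimeoutExpired:
        ok, text = False, "timed out"
    except OSError as exc:
        # For example, ipmitool is not installed on this machine.
        ok, text = False, str(exc)
    return ok, time.monotonic() - start, text


def poll(cmd, interval=180, rounds=3):
    """Query `rounds` times, `interval` seconds apart, printing one line per query."""
    for _ in range(rounds):
        ok, elapsed, text = timed_query(cmd)
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        print(f"{stamp} ok={ok} elapsed={elapsed:.2f}s {text}")
        time.sleep(interval)
```

A log like this, kept from the moment a host is rebooted, would show whether response time degrades gradually or fails all at once, which is the kind of data that could be handed to support.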


Since I've rebooted the hosts, I no longer have a host in the failed state from which I can submit test data to Dell. However, I'm pretty sure it will happen again, and it would be great to have a direct contact for reporting the problem data. I'm more than happy to work with Dell Enterprise tech support to help resolve this, because the latest 2.10.10.10 firmware, which was supposed to finally resolve this long-standing iDRAC 7 issue, clearly does not entirely resolve it. I'm sure you would all agree that it's critical for the iDRAC 7 to function properly without requiring periodic server reboots!

Firmware Version        = 2.10.10.10
Firmware Build          = 49
Last Firmware Update    = 04/08/2015 18:37:50
Hardware Version        = 0.01
System Model            = PowerEdge R620
System Revision         = I
System BIOS Version     = 2.5.2

Thanks for any assistance you can provide!

Jason.

June 10th, 2015 11:00

The problem has occurred again.  The server in question was rebooted less than 24 hours ago.

The server is accessible, but the iDRAC is not pingable (100% packet loss).

I tried "ipmitool mc reset hard" from the server. The iDRAC rebooted and is pingable, but very, very slow. Ping shows about 80% of packets dropped. If I ping the host's IP address on the shared LOM, it is still 100% accessible and works perfectly fine.

I tried the Dell racadm CLI: ./racadm racreset hard


I get back:

ERROR: A firmware update is currently in progress. Unable to reset the RAC at this time.

==============================================================================
IMPORTANT NOTE!
The RAC is unable to communicate with the BMC. This condition may
occur because of (1) no BMC is present, (2) missing or disfunctional
IPMI-related software components. Many RAC features depend on BMC
connectivity in order to work properly, and you may see failures
as a result.
===============================================================================

I tried to SSH in, and it took forever to connect because of the number of dropped packets.

After I got in, I first tried a "racreset soft".  The same thing happened as before.

I let the RAC reboot, then tried a racreset hard.  Same thing again.

It seems the only way I can recover from this scenario is by rebooting the server.
