December 9th, 2010 03:00

IPMIDRV 1004 errors on Windows 2008 R2 - M600 blades

Since updating the firmware on all our M600 blades to try to resolve another issue (http://www.delltechcenter.com/thread/4353610/M1000e+fans+and+the+M610x...what%27s+happening%3F), various network connections have started dropping out at random, usually for about 10 minutes, before returning with no intervention. Throughout each dropout, the following warning is logged in the Windows System event log:

Log Name: System
Source: IPMIDRV
Date: 09/12/2010 09:53:46
Event ID: 1004
Task Category: None
Level: Warning
Keywords: Classic
User: N/A
Computer: hostname.domain.com
Description:
The IPMI device driver attempted to communicate with the IPMI BMC device during normal operation. However the communication failed due to a timeout. You can increase the timeouts associated with the IPMI device driver.
Event Xml:
EventID: 1004
Level: 3
Task: 0
Keywords: 0x80000000000000
EventRecordID: 3638
Channel: System
Computer: hostname.domain.com
Data: \Device\00000066
Binary: 000004000100000000000000EC030580000000000000000000000000000000000000000000000000E50000C0

Can anyone enlighten me as to what the problem might be?
We've had these errors on at least 5 of our 8 M600s, all on Windows 2008 R2.
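
For anyone trying to read the long hex string at the end of the event above: it appears to follow the standard Windows I/O error log packet layout (IO_ERROR_LOG_PACKET in the driver headers). The snippet below is only a rough sketch under that assumption. The function name and the dictionary keys are made up for illustration, and treating the trailing bytes as the driver's dump data is a guess based on the size field.

```python
import struct

# Hex payload copied verbatim from the event above.
BINARY_HEX = "000004000100000000000000EC030580000000000000000000000000000000000000000000000000E50000C0"


def decode_error_log_packet(hex_blob):
    """Decode the leading fields of the blob, assuming an IO_ERROR_LOG_PACKET layout."""
    raw = bytes.fromhex(hex_blob)
    # Bytes 0-9 (little-endian): major function code, retry count,
    # dump data size, number of strings, string offset, event category.
    major, retries, dump_size, n_strings, str_offset, category = struct.unpack_from(
        "<BBHHHH", raw, 0
    )
    # Bytes 12-23 (after 2 bytes of padding): ErrorCode, UniqueErrorValue, FinalStatus.
    error_code, unique_value, final_status = struct.unpack_from("<III", raw, 12)
    # Driver-specific dump data: taken here as the trailing dump_size bytes (assumption).
    dump = raw[-dump_size:] if dump_size else b""
    return {
        "retry_count": retries,
        "dump_data_size": dump_size,
        "error_code": error_code,
        "severity": error_code >> 30,        # top two bits; 2 means warning
        "event_id": error_code & 0xFFFF,     # low 16 bits
        "final_status": final_status,
        "dump_value": int.from_bytes(dump, "little"),
    }


fields = decode_error_log_packet(BINARY_HEX)
for name, value in fields.items():
    print(f"{name}: {value:#x}")
```

On this payload the low 16 bits of the error code come out as 0x3ec (1004) and the severity bits as 2 (warning), which matches the event header. The four trailing bytes read as 0xC00000E5, which looks like an NTSTATUS value returned to the driver, so it may be worth quoting to support.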

December 9th, 2010 07:00

This took down our Exchange servers and our Hyper-V cluster. We just installed the newly released BIOS version 2.4.0 and still get these lock-ups. The iDRAC usually bombs at the same time, and it's been happening since we went to iDRAC 1.53 on the M600s. It is still occurring on Rev 1.55 (Build 2).

December 9th, 2010 08:00

Neil,

Pinging PG and IPS.

KongY@Dell

December 9th, 2010 09:00

Raised SR 827226208. I'll dump the logs/DSETs I pull off the M600s into the FileExchanger for that. The response from ProSupport was to reboot the chassis, so we're going to try and get that done at some point. As always, I'll keep this updated with any progress.

December 9th, 2010 10:00

Neil,

Thanks. And your updates are appreciated.

Kong

December 10th, 2010 08:00

For this issue, Dell are 99% sure that a chassis power-off will resolve it...which sort of defeats the object of a blade chassis...but nonetheless we will have to do it. I shall report back in a few hours' time!

December 13th, 2010 02:00

We have rebooted the blade chassis and reseated everything. We haven't had a full outage as yet, but we have seen IPMIDRV errors in Windows and in the "racadm racdump" logs. So we haven't closed off the call as yet...we may need some more time.

A new CMC firmware (3.10) came out just before we powered the lot down, so that was installed on Dell's advice.

December 14th, 2010 01:00

We've only seen the IPMI errors on two blades since we powered down our chassis. An engineer with two M600 motherboards is on the way now. Hopefully that will finally resolve it.

December 14th, 2010 08:00

Alas - no joy. The replacement motherboards made no difference. We're still getting IPMI errors and NIC and iDRAC dropouts on two of our blades in particular, but occasionally on other blades too.

Had a little point of interest...I downloaded and installed the latest Broadcom firmware the other day, dated 06/Dec/2010. It installed Family 4.6.8. The new motherboard came with version 5.x on it...so I went back to the M600 support site and it wouldn't return any updates. Then I went back about 30 minutes later and a completely different file was linked to the release on 6/Dec/2010...this installed Broadcom Family 6.0.1.

All very strange...nonetheless, even this latest firmware has not resolved our issues.

December 14th, 2010 10:00

Neil,

My PG and IPS contacts were not aware of any issues similar to what you're observing. Still waiting for responses from a few extended team members.

Kong

December 15th, 2010 00:00

I checked the file dates for the Broadcom firmware packages - they are all dated 10th Dec 2010...which ties in with what I saw...maybe a content management problem? Anyway - it's not the cause of my issue. The latest dropout was yesterday. Wondering where to turn now. I've uploaded all sorts of logs and we're no nearer a fix. Is a new chassis a possibility here? What else can be done?

The next thing will be a "midplane" swap I guess...are there any other parts that can be swapped after that?

December 17th, 2010 01:00

The IPMI errors seem to be limited to one server now, pretty much. We swapped out its system board yesterday. The IPMI errors returned again overnight. If we move the server to another slot, then the IPMI errors follow the server.

We've swapped out the mainboard, so what could possibly be the cause? It's not software, because we've seen it happen in POST. I'm just trying to think of the parts of an M600 that aren't on the mainboard that might possibly cause the problem. The TOE/iSCSI key maybe?

December 22nd, 2010 13:00

Hi - it's been a few days since I had anything significant to say on this one. A replacement TOE key has just gone in, so hopefully this will sort it. Only one server is having these dropouts now, and it's the iDRAC and Fabric A which drop. We had IPMI errors on this server and one other at the same time today, but this server has always been involved, and it went offline first, so logic is beginning to point to this server causing the problem. Not that logic is always correct in these sorts of complex cases.

I shall keep you updated on how the TOE key change gets on.

January 4th, 2011 05:00

The replacement TOE key didn't fix it.

January 13th, 2011 11:00

Neil,

PG Engineering said that they have not observed what you've seen. They're still going through their queues for anything similar.

January 14th, 2011 01:00

This had sort of died down a bit - we'd only seen it on one server for about a month. We're going to try Intel NICs instead of Broadcoms and move the I/O to another fabric.