9 Legend


16.3K Posts

February 15th, 2010 11:00

Well, it can be ignored, as it sounds like the ECC is simply doing its job - correcting errors.  However, I think you would be doing yourself a favor to figure out why these machines are experiencing so many errors that need correcting.  You mentioned in your post that these errors happen on three different machines, "across all speeds and types".  Did you add after-market memory to these machines, or are they running in their original configuration from Dell?  If after-market, what type of memory came with the machines and what kind of memory was added?  These new machines can be finicky about their memory.  Also, I would check to see if there are any BIOS/ESM updates.

11 Posts

February 15th, 2010 12:00

All original Dell equipment installed in July 09.  Happens on three different memory configurations over 224 machines.   Some machines have different speed and type due to memory configuration (total mem).  There is a BIOS update available for these but there isn't anything in the change log regarding any memory or other bug fixes.  Thanks.

 

 

347 Posts

February 16th, 2010 13:00

My two cents: if you are running OpenManage Server Administrator (OMSA), you can use the System Components (FRU) Information section to see what brand of memory you have in each system. Start tracking, in a spreadsheet, the service tag of the system and the FRU information for each DIMM that is reported. Perhaps this is happening to a specific brand or part number of memory, and if you present the data to Dell, they may either confirm that the part number has issues or escalate the issue, and they may want to capture some of your hardware for analysis. If you are not running OMSA, you could use the Dell System E-Support Tool (DSET) to create reports of the systems, and the FRU info would be there too.
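For anyone who wants to go a step beyond the spreadsheet, here is a rough sketch of how the tally per part number could be done in Python once the data is exported to CSV. The column names (service_tag, dimm_slot, manufacturer, part_number, ecc_events) and the sample rows are made up for illustration; they are not an OMSA or DSET export format, so adjust them to whatever your own spreadsheet uses.

```python
import csv
import io
from collections import Counter


def errors_by_part(csv_text):
    """Sum ECC events per (manufacturer, part_number) from CSV text.

    Column names are hypothetical; rename them to match your own export.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    counts = Counter()
    for row in reader:
        key = (row["manufacturer"].strip(), row["part_number"].strip())
        counts[key] += int(row["ecc_events"])
    return counts


# Invented placeholder data, only to show the expected layout.
SAMPLE = """service_tag,dimm_slot,manufacturer,part_number,ecc_events
TAG0001,A1,VendorA,PN-100,2
TAG0001,A2,VendorA,PN-100,0
TAG0002,B1,VendorB,PN-200,1
TAG0003,A1,VendorA,PN-100,3
"""

if __name__ == "__main__":
    # Print part numbers with the most ECC events first.
    for (vendor, part), total in errors_by_part(SAMPLE).most_common():
        print(vendor, part, total)
```

If one part number dominates the totals, that is the data worth presenting to Dell; if the events are spread evenly, the cause is more likely something common to all machines.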

7 Posts

February 16th, 2010 16:00

Don't forget to reseat the memory.

11 Posts

February 17th, 2010 11:00

Good advice, and we have done all of the above.  The main issue here is that Dell isn't concerned about hardware unless it's failing, which is logical.  These DIMMs aren't failing a memtest or a Dell diagnostic.  Only when I get a repeat offender do I urge Dell to replace it, and then they will capture it via a process we have set up with them.  The problem is that 90% of the time, an orange light caused by these ECC warnings on a machine won't trigger a memtest or diag failure.  I know the brand, type, and speed of every DIMM in these 9 racks, and the errors occur across them all occasionally.

49 Posts

October 1st, 2010 08:00

Dell released a BIOS update (version 1.4.8) for the R410s in September 2010 which may well fix your problem. It updates the microcode of the Intel Xeon 56XX series to avoid invalid DRAM statuses (which can cause hangs, amongst other things).

 

We've seen two such DRAM errors on one of our R410s in the last 6 months, and we reseated the RAM the first time. However, the second one has just happened and it looks like a BIOS update might do the trick for us. The snag is that I have been unable to update any BIOSes on our R410s and R710s all this year because Dell's updating programs for Red Hat Linux bomb out with "The update failed to complete" :-( I posted about this here:

 

http://en.community.dell.com/support-forums/servers/f/956/t/19325421.aspx

 

but sadly no-one has replied. And, no, BIOS 1.4.8 doesn't apply either and that's only weeks old!

11 Posts

October 5th, 2010 08:00

Thanks for the heads up on this new BIOS.  Looking into it now.
