Unsolved


3 Posts


February 5th, 2019 23:00

R710 memory upgrade - issues

Hi all,

I recently upgraded the memory on my R710 server running ESXi 6.5 and I've started getting PSODs as a result. The iDRAC shows a number of errors which (I believe) I've resolved, but once the first one was fixed, the second one appeared about 12 hours later:

The first error message was "Multi-bit memory errors detected on a memory device at location DIMM_A8", so I replaced that DIMM. I have now got the same message for DIMM_A7, but I haven't replaced that DIMM yet, just rebooted the server.

Initially I only had 48 GB of RAM (M393B5170FH0-CH9) in the server, in all slots except A7-A9 and B7-B9 (so 12 x 4 GB modules). I purchased some more RAM, same model, and plugged these DIMMs into the empty slots.

Once I did that, **bleep** broke loose ... ESXi has hit a PSOD twice in 24 hours now.

I read somewhere that 3x6x4GB = 72GB is a supported R710 memory configuration, and I'm running the latest BIOS and firmware available on the Dell website.

Can anyone shed some light on what might be the problem here?

Many thanks.

-trailro 

 

6 Operator

 • 

2.9K Posts

February 6th, 2019 07:00

Good morning,

One potential issue is that one of the DIMMs could be quad ranked. If a quad-ranked DIMM is in use, you can't use more than two slots in that memory channel. I'd also confirm that this isn't a mix of buffered (RDIMM) and unbuffered (UDIMM) memory.
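For anyone who wants to sanity-check a layout against the quad-rank rule above, here is a minimal sketch. The function name is hypothetical, the rank counts are example inputs you would read off the DIMM labels yourself, and the three-slots-per-channel limit assumes the R710's 18-slot layout (2 processors x 3 channels x 3 slots).

```python
def channel_population_ok(ranks_in_channel):
    """Check one memory channel against the quad-rank rule.

    ranks_in_channel: ranks of each DIMM installed in that channel,
    e.g. [2, 2, 2] for three dual-rank DIMMs (hypothetical input).
    """
    # Assumption: at most three slots per channel on this platform.
    if len(ranks_in_channel) > 3:
        return False
    # With any quad-rank DIMM present, only two slots may be populated.
    if any(rank == 4 for rank in ranks_in_channel):
        return len(ranks_in_channel) <= 2
    return True


print(channel_population_ok([2, 2, 2]))  # three dual-rank DIMMs
print(channel_population_ok([4, 2, 2]))  # one quad-rank DIMM, three slots used
```

This only covers the rank rule, not the RDIMM/UDIMM mixing check.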

As for the arrangement, the matrix I am looking at indicates that fully populating with 4GB DIMMs is supported.

Slots A7 and A8 are in separate memory channels, so I'd consider switching processors 1 and 2 to isolate the memory controller.
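To make the channel point concrete, here is a rough sketch of a slot-to-channel lookup. It assumes slots 1/4/7, 2/5/8 and 3/6/9 on each processor share a channel, which matches the A1/A4/A7 grouping mentioned later in this thread; the function name and the zero-based channel index are made up for illustration, so check the owner's manual before relying on it.

```python
def channel_of(slot):
    """Return (bank, channel_index) for a slot label such as 'A7'.

    Assumed layout: slots n, n+3 and n+6 of a bank share one channel.
    """
    bank, number = slot[0].upper(), int(slot[1:])
    if bank not in ("A", "B") or not 1 <= number <= 9:
        raise ValueError(f"unexpected slot label: {slot!r}")
    return bank, (number - 1) % 3


# A7 and A8 land in different channels under this layout.
print(channel_of("A7"), channel_of("A8"))
# A1, A4 and A7 share a channel.
print(channel_of("A1") == channel_of("A4") == channel_of("A7"))
```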

3 Posts

February 7th, 2019 10:00

Thanks for your answer.

I investigated the situation and I can confirm that all DIMMs are RDIMMs, so they all "look" the same.

I swapped a couple of them around, and the issue persists. Initially I was getting A8 errors, then after about 12 hours, A7 was marked as having issues. I swapped them with some others and got the same errors.

However, looking at the DIMMs, I can see only one difference, and I can't find anywhere on the internet whether it matters. The existing DIMMs are M393B5170FH0-CH9Q4, whilst the new ones are M393B5170FH0-CH9Q5. If this is not an issue, then I guess the only explanation would be the memory controller - which seems odd.

6 Operator

 • 

2.9K Posts

February 8th, 2019 07:00

The specs make them look the same. Moving the processors to see if the problem follows to the B side is what I'd do next. You can fault-isolate your way through the issue: swap the processors and see if the error follows. If it doesn't, switch DIMMs A7, A4, and A1 with DIMMs B1, B2, and B3. That way the DIMMs in the channel get broken up, and spreading them across different channels prevents the hardware from misreporting which DIMM is having issues. If the issue still comes up on slot A7 at that point, you'd know you're looking at a system board issue.
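The isolation steps above can be written down as a small decision helper. This is only a sketch: the function and argument names are hypothetical, and the "suspect one of the moved DIMMs" outcome is my own inference for the case where the error leaves A7 after the DIMM swap, not something stated above.

```python
def next_step(followed_cpu_swap, still_on_a7_after_dimm_swap=None):
    """Suggest the next action from what has been observed so far.

    followed_cpu_swap: True if the error moved to the B side after
        processors 1 and 2 were swapped.
    still_on_a7_after_dimm_swap: None if the DIMM swap has not been
        tried yet, otherwise True/False for whether A7 still errors.
    """
    if followed_cpu_swap:
        return "suspect the processor / memory controller"
    if still_on_a7_after_dimm_swap is None:
        return "swap DIMMs A1, A4, A7 with B1, B2, B3 and retest"
    if still_on_a7_after_dimm_swap:
        return "suspect the system board"
    # Inferred outcome: the error left A7 along with the DIMMs.
    return "suspect one of the moved DIMMs"


print(next_step(followed_cpu_swap=False))
print(next_step(followed_cpu_swap=False, still_on_a7_after_dimm_swap=True))
```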
