Troubleshooting memory errors on PowerEdge systems by swap testing



When a single-bit error (SBE) or multi-bit error (MBE) is reported against one or more DIMM locations, the DIMM itself is not necessarily at fault, so some simple troubleshooting is needed to determine exactly where the fault lies. Figure 1 shows an example of memory errors appearing in the iDRAC interface on an R715.


Figure 1: Memory errors as displayed in iDRAC 6 logs

Isolating memory issues involves swapping DIMMs into different memory sockets, channels, banks, and controllers. There are several ways to swap the DIMMs around to narrow down the fault, and you might need more than one of these methods to pinpoint the faulty DIMM or socket. The methods are illustrated below. To keep the explanations simple, we assume the faulty DIMM is A1 or one of the set marked in blue in the illustrations.

Note: You can read more about memory for your systems in our memory articles


Swapping DIMMs in groups (by channel or bank) rather than individually is the best method to identify the failed DIMM or DIMMs.
Once a group of DIMMs has been identified as containing the failed DIMM or DIMMs, move single DIMMs to identify which DIMM or DIMMs have failed.

Note: The Memory Video Archive contains videos showing how to remove and install memory in different servers.


Method 1:

Swapping DIMM A1 (marked in blue) with DIMM A9 (marked in red) tries the DIMM in a different memory channel and bank.


Figure 2: Swapping DIMM A1 with DIMM A9

Method 2:

Swapping DIMM A1 (marked in blue) with DIMM B1 (marked in red) puts the DIMM on an altogether different memory controller (CPU).


Figure 3: Swapping DIMM A1 with DIMM B1

Method 3:

Swapping the whole bank of DIMMs (A1, A2, A3 - marked blue) with another bank (B1, B2, B3 - marked red) tests the whole bank of DIMMs in a new bank, on a new memory controller.


Figure 4: Swapping DIMMs A1, A2, A3 with DIMMs B1, B2, B3

Method 4:

Swapping a whole channel of DIMMs (A1, A4, A7 - marked blue) with another channel (B1, B4, B7 - marked red) tests the whole channel of DIMMs in a new channel, on a new memory controller.


Figure 5: Swapping DIMMs A1, A4, A7 with DIMMs B1, B4, B7
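
The four methods differ in which part of the memory path each swap changes. The following Python sketch makes that explicit. It assumes the slot layout implied by the illustrations (A1, A2, A3 form one bank; A1, A4, A7 form one channel; the letter identifies the memory controller/CPU). The numbering on a real system may differ, so check the owner's manual before relying on it. The names `locate` and `what_changes` are hypothetical helpers, not part of any Dell tool.

```python
import re

# Assumption taken from the illustrations: three slots per bank,
# so A1-A3 share a bank and A1/A4/A7 share a channel.
SLOTS_PER_BANK = 3


def locate(slot: str) -> dict:
    """Split a label such as 'A9' into controller, bank and channel."""
    match = re.fullmatch(r"([A-Z])(\d+)", slot)
    if match is None:
        raise ValueError(f"unrecognised slot label: {slot!r}")
    index = int(match.group(2)) - 1  # zero-based position on the controller
    return {
        "controller": match.group(1),
        "bank": index // SLOTS_PER_BANK,
        "channel": index % SLOTS_PER_BANK,
    }


def what_changes(src: str, dst: str) -> list:
    """List which parts of the memory path differ between two slots."""
    a, b = locate(src), locate(dst)
    return [key for key in ("controller", "bank", "channel") if a[key] != b[key]]


if __name__ == "__main__":
    print(what_changes("A1", "A9"))  # Method 1: bank and channel change
    print(what_changes("A1", "B1"))  # Method 2: only the controller changes
```

Under this assumed layout, Method 1 moves the DIMM to a new bank and channel on the same controller, while Methods 2 to 4 keep the bank and channel positions and change only the controller.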

Interpreting the results after swapping DIMMs

As a general rule, DIMM errors tend to follow the DIMMs identified in the errors. For example, with an SBE reported on DIMM A1, swapping this DIMM with a different DIMM will result in one of the following:

  1. The error message is no longer reported and the problem is resolved.
     • This indicates that reseating the memory resolved the issue.
  2. The error message follows the DIMM (DIMM A1 is swapped with DIMM B1, and the error message is now reported against DIMM B1).
     • This indicates that the DIMM has most likely failed and requires replacement.
  3. The error message follows the DIMM socket (DIMM A1 is swapped with DIMM B1, and the error message is still reported against DIMM A1).
     • This indicates that the system board or CPU has most likely failed.
     • Swapping CPUs will confirm which component requires replacement:
       • If the problem follows the CPU (the error message moves after swapping CPUs), replace the CPU.
       • If the problem stays with the DIMM socket, replace the system board.
  4. The error message doesn't follow the DIMM or the socket (the error is reported against a completely different DIMM after swapping).
     • This indicates that a different DIMM or DIMMs are most likely bad.
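
The outcomes above amount to a small decision table. The sketch below expresses it in Python, assuming you note by hand which slots report errors after the swap. The function name `interpret_swap` and its arguments are hypothetical and do not correspond to any iDRAC interface. The final "inconclusive" branch is an addition for cases the list above does not describe, such as both swapped slots reporting errors.

```python
def interpret_swap(original_slot, swapped_with, slots_reporting_after):
    """Map the slots reporting errors after a swap to a likely cause.

    original_slot: slot that reported the error before the swap, e.g. "A1"
    swapped_with: slot the suspect DIMM was moved to, e.g. "B1"
    slots_reporting_after: slots reporting errors after the swap
    """
    reported = set(slots_reporting_after)
    if not reported:
        # Outcome 1: the error is gone.
        return "resolved: reseating the memory cleared the error"
    if reported == {swapped_with}:
        # Outcome 2: the error moved with the DIMM.
        return "follows DIMM: the DIMM has most likely failed, replace it"
    if reported == {original_slot}:
        # Outcome 3: the error stayed with the socket.
        return ("follows socket: system board or CPU is suspect, "
                "swap CPUs to confirm which one")
    if reported.isdisjoint({original_slot, swapped_with}):
        # Outcome 4: the error appeared somewhere else entirely.
        return "other DIMM: a different DIMM or DIMMs are most likely bad"
    # Not covered by the list above (assumption): mixed results.
    return "inconclusive: repeat with a different swap method"


if __name__ == "__main__":
    print(interpret_swap("A1", "B1", ["B1"]))
```

For example, after swapping A1 with B1, a log that now shows the error only on B1 maps to the second outcome.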
We also advise you to keep your firmware up to date, as this can reduce the risk of memory errors and prolong the life of the DIMMs.






Article ID: SLN289424

Last modified: 02/07/2017 03:34 AM

