
Unsolved


61 Posts


July 1st, 2022 11:00

R420 seems to be randomly rebooting by itself, need help troubleshooting it

This server had an uptime of 450 days and has basically been working fine since I bought it. It never had any problems since it went into service. Two days ago I upgraded its RAM, and according to my monitoring (Zabbix running in a VM on this server) it has rebooted 3 times since then. It had 4x8GB and I added another 4x8GB, so the server now has 64GB of installed memory. All the memory modules are the same model, M393B1K70DH0-YK0, both the old ones and the new ones. After the installation I ran 90% of the diagnostic tests in the Lifecycle Controller. It ran many memory tests and everything seemed fine. I rebooted the server, it booted ESXi successfully, and I called it a day. Then, today, I noticed in the monitoring that the server had rebooted 3 times, and I want to start troubleshooting it, for which I need some help.

This is the information I have access to right now, through ESXi:

Gregasd_0-1656700204921.png

The original 4x8GB memory modules were installed in slots A1, A2, B1 and B2 (2 CPUs). I've placed the new modules in slots A3, A4, B3 and B4. The following images show how each set of slots looked before and after the memory upgrade:

 

The slots were like that before the upgrade
That's how it is now, after the upgrade



Today I went to the data center to check it in person, and there were 2 amber lights blinking (picture). I couldn't make them stop by pressing the button on the front panel (the one next to the power button) or the one on the back (which was blinking in the same color and rhythm as the front lights).

Lights blinking when I arrived at the datacenter today

iDRAC isn't accessible right now because I don't remember its configuration. I will eventually have to shut the machine down, but I need to gather as much information as possible beforehand to improve the chances of successful troubleshooting. I will also reconfigure the iDRAC when I shut the server down, so we will have better ways of monitoring it.

Questions:
- What can those 2 lights mean?
- Did I place the new memory modules incorrectly?
- What else may be causing the problem in a machine that never had problems since it went into service?

PS: I tried to upgrade the server's total memory when I first received it. I had bought 2 identical machines but was only going to use one, so I took all the memory from the second machine and installed it in the first. I then had a lot of errors trying to install and run things on the server. I ended up removing the added memory and using the server with only the initial 32GB it arrived with. At the time, I thought I had misplaced the modules or sc***ed up something else in the configuration... This time (the modules I installed now are NOT the ones from the spare server that I tried to use in the past), I thought I had placed the modules in the correct slots for this configuration (8x8GB), but it seems I didn't. Or I'm having an unrelated problem that I'm not aware of.

What do you guys suggest I check first?

Moderator

 • 

3.7K Posts

July 3rd, 2022 17:00

Hi, I'd say pull out the memory first.

https://dell.to/3ajBVCm

61 Posts

July 4th, 2022 04:00

The new memory? What for, to see if the system gets back to stability? I'm quite sure it will be stable if I remove the new modules. I want to find out how to make it stable with the full 64GB installed. Do you have any suggestions about what to test for that?

I will remove the new modules today and see what happens, but I'm 99% sure it will stabilize. What do you suggest we do after removing them and confirming that the system is stable with only the original 32GB?

4 Operator

 • 

2.7K Posts

July 4th, 2022 10:00

Hello @Greg asd,


I think you really need to check the Lifecycle controller log or the iDRAC log (hardware log). You need to check if there are any memory errors registered on the server's log.


What is the BIOS firmware version?


Regards.

61 Posts

July 4th, 2022 19:00

By the way, I now have remote access via the iDRAC. Here are the firmware versions of everything. I'm pretty sure they are all up to date; I updated everything I could when I was setting the machine up.

image.png

 

61 Posts

July 4th, 2022 19:00

A few minutes ago I got back from the data center where the server is located. I removed the extra memory and took the time to reconfigure the iDRAC's network access and take a look at the Lifecycle logs. I found a lot of these:

image.png

As I said in the first post, the problem doesn't happen when only the 32GB are installed. Can it be a problem in memory slots A3 or A4 (CPU1)? Is there a way to check for it? And why don't the memory tests in the Lifecycle Controller detect such a problem?

What else can I look at in the system to try and find the cause of the problem?

61 Posts

July 4th, 2022 20:00

For some reason the screenshot I uploaded with the log got corrupted, so I will write the text down:

Critical,"Mon Jul 04 2022 20:12:21","CPU 1 MEM VDDQ PG voltage is outside of range."
Critical,"Mon Jul 04 2022 20:12:21","CPU 1 MEM VTT PG voltage is outside of range."
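The two entries above are comma-separated with quoted fields, so a short script could count such events per CPU in a larger export. This is only a minimal sketch, assuming the log export keeps exactly the severity, timestamp, message layout shown above. The function name and the regular expression are illustrative and are not part of any Dell tool.

```python
import csv
import io
import re
from collections import Counter
from datetime import datetime

# Sample text copied from the two log entries quoted in this thread.
SAMPLE_LOG = '''Critical,"Mon Jul 04 2022 20:12:21","CPU 1 MEM VDDQ PG voltage is outside of range."
Critical,"Mon Jul 04 2022 20:12:21","CPU 1 MEM VTT PG voltage is outside of range."
'''

# Hypothetical pattern: matches messages worded like the ones above.
VOLTAGE_RE = re.compile(r"CPU (\d+) MEM (\w+) PG voltage is outside of range")


def parse_voltage_events(text):
    """Return (timestamp, cpu, rail) tuples for critical memory voltage entries."""
    events = []
    for row in csv.reader(io.StringIO(text)):
        if len(row) != 3:
            continue  # skip blank or malformed lines
        severity, stamp, message = row
        match = VOLTAGE_RE.search(message)
        if severity != "Critical" or match is None:
            continue
        # Timestamp layout assumed from the sample, e.g. "Mon Jul 04 2022 20:12:21".
        when = datetime.strptime(stamp, "%a %b %d %Y %H:%M:%S")
        events.append((when, int(match.group(1)), match.group(2)))
    return events


events = parse_voltage_events(SAMPLE_LOG)
per_cpu = Counter(cpu for _, cpu, _ in events)
print(per_cpu)
```

Grouping by CPU number this way may help show whether the voltage errors always point at the same processor's memory rails as modules are moved between slots.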

4 Operator

 • 

3K Posts

July 4th, 2022 21:00

Can you populate only A1, A2, A3, B1, B2 and B3 and check whether you still see the issue?

61 Posts

July 4th, 2022 22:00

That's my next test. I will do it at the end of the day, after I confirm the server has run fine with only 32GB for at least 24h.

What else can I look for?

61 Posts

July 4th, 2022 22:00

If I find out that A3 or A4 is defective, can I still use B1, B2, B3 and B4? Or does each processor's memory need to match the other's? Can I run 16GB on CPU A and 32 or 40 on CPU B?

4 Operator

 • 

3K Posts

July 4th, 2022 23:00

You can use B1 and B2 if A3 and A4 are defective. It is recommended to have the same memory population across the CPUs in the same server. You can refer to the link below for memory module installation guidelines.

https://www.dell.com/support/manuals/en-us/poweredge-r420/r420ownersmanual-v2/general-memory-module-installation-guidelines?guid=guid-531f9c5e-ed4d-4905-bf64-5b81ff092739&lang=en-us 
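
The recommendation above can be expressed as a quick check on a planned slot layout. This is a minimal sketch, assuming slot labels start with the processor letter (A or B) as they do in this thread; the helper names are hypothetical.

```python
from collections import Counter


def dimms_per_cpu(populated_slots):
    """Count populated slots per processor, keyed by the slot's leading letter."""
    return Counter(slot[0].upper() for slot in populated_slots)


def is_balanced(populated_slots, cpus=("A", "B")):
    """True when every processor has the same number of populated slots."""
    counts = dimms_per_cpu(populated_slots)
    return len({counts.get(cpu, 0) for cpu in cpus}) == 1


# Layouts discussed in this thread.
original = ["A1", "A2", "B1", "B2"]
uneven = ["A1", "A2", "B1", "B2", "B3", "B4"]
print(is_balanced(original), is_balanced(uneven))
```

The check only compares module counts per processor. It does not validate slot order or module capacity.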

Moderator

 • 

3.7K Posts

July 5th, 2022 00:00

Just to add a little more- https://dell.to/3IiLVsq

61 Posts

July 5th, 2022 05:00

The manual has no scenario using A1, A2, A3, A4 and B1, B2, B3, B4. That's the problematic scenario. Is that configuration not recommended by Dell?

4 Operator

 • 

2.7K Posts

July 5th, 2022 07:00

Exactly @Greg asd, you need to follow the recommended memory population rules:


- DIMMs must be installed in each channel starting with the DIMM farthest from the processor. DIMMs should be installed with the largest rank count to the smallest. For example, if DR are mixed with SR DIMMs, the DRs should be placed in the lowest DIMM slots then the SR DIMMs.


- Population order is identified by the silk screen designator and the System Information Label (SIL) located on the chassis cover. The graphic below indicates the installation order for each configuration type. In dual CPU configurations, memory should be built out evenly: A1, B1, A2, B2, etc.
Memory Optimized (Independent Channel): C1{1}, C2{1}, C1{2}, C2{2}, C1{3}, C2{3}...
Advanced ECC (Lockstep / x8 SDDC): C1{2,3}, C2{2,3}, C1{5,6}, C2{5,6}
Mirrored: C1{2,3}, C2{2,3}, C1{5,6}, C2{5,6}
Rank Sparing Population Order (Lockstep rules): C1{2,3}, C2{2,3}, C1{5,6}, C2{5,6}
Rank Sparing Population Order (Optimized): C1{1}, C2{1}, C1{2}, C2{2}, C1{3}, C2{3}...
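
The "built out evenly" rule above (A1, B1, A2, B2, ...) can be sketched as a small helper. This only illustrates the alternating order and does not cover the lockstep, mirrored or sparing modes. The function name is made up, and the default of six slots per processor is an assumption that should be checked against the owner's manual.

```python
def balanced_population_order(dimm_count, slots_per_cpu=6, cpus=("A", "B")):
    """List the slots to fill, alternating between processors: A1, B1, A2, B2, ..."""
    capacity = slots_per_cpu * len(cpus)
    if not 0 <= dimm_count <= capacity:
        raise ValueError(f"dimm_count must be between 0 and {capacity}")
    # Walk slot numbers in ascending order, visiting each processor in turn.
    order = [f"{cpu}{slot}" for slot in range(1, slots_per_cpu + 1) for cpu in cpus]
    return order[:dimm_count]


print(balanced_population_order(8))
```

For eight modules this gives A1, B1, A2, B2, A3, B3, A4, B4, which is the same set of slots used in this thread.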

 

Memory Population

61 Posts

July 5th, 2022 16:00

I have no idea what most of it means haha, the parts with the { }.

Anyway, it doesn't matter; I've tried the scenario with A1, A2, A3 / B1, B2, B3 and the server barely finishes loading ESXi. It's resetting nonstop, and the same error persists. I've already tried cleaning the slots, using electrical contact cleaner with a soft cleaning brush and a blower. The problem definitely isn't dirt. So... what can I test now? And if the problem really is the B3 slot, what are my options now for increasing the amount of memory installed in the system?

Moderator

 • 

3.7K Posts

July 5th, 2022 23:00

Hi, another post I found may also help:
https://dell.to/3P7uSeV
