1 Rookie
•
6 Posts
0
231
March 1st, 2024 13:31
Dell T620 Shuts Down Randomly
Dear community,
I have a Dell T620 that shuts down randomly. It can happen anywhere: in the BIOS, in the Lifecycle Controller, or at the Proxmox screen. When it happens, there is no power to the front panel and no power to the iDRAC. Running a full hardware diagnosis shows no hardware issues. To coax the server into booting again, I need to unplug both PSUs for, usually, an hour, then plug them back in. Once I can power on the server, it might run for 10 minutes or for a week. Lately, I *feel like* the uptimes between shutdowns are getting shorter.
The Lifecycle Log notes nothing about the shutdown itself. Only when I finally coax the server into booting again do the following entries appear, and they are always timestamped at the minute I get the server powered back on, not when it went down.
(Pulling from memory as I can't get to the logs at this exact moment)
4th event -Something about CPU reseating-
3rd event The iDRAC firmware was rebooted with the following reason: ac.
2nd event Power Supply 1: Status = 0x00, IOUT = 0x0, VOUT= 0x0, TEMP= 0x0, FAN = 0x0, INPUT= 0x0
1st event Power Supply 2: Status = 0x00, IOUT = 0x0, VOUT= 0x0, TEMP= 0x0, FAN = 0x0, INPUT= 0x0
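As an aside, log entries like the two above can be parsed mechanically. Here is a minimal Python sketch (the field names come from the entries quoted above; the parsing logic is just an illustration, not a Dell tool) that flags a PSU record whose telemetry reads all zeros, which is consistent with the supply reporting no AC input at the time it was logged:

```python
import re

def parse_psu_event(line: str) -> dict:
    """Parse a 'Power Supply N: Status = ..., IOUT = ...' entry into name -> int."""
    fields = re.findall(r"(\w+)\s*=\s*(0x[0-9A-Fa-f]+)", line)
    return {name: int(value, 16) for name, value in fields}

def lost_ac(event: dict) -> bool:
    """All-zero telemetry suggests the supply saw no input power when logged."""
    return all(event.get(key, 0) == 0 for key in ("IOUT", "VOUT", "TEMP", "FAN", "INPUT"))

entry = "Power Supply 1: Status = 0x00, IOUT = 0x0, VOUT= 0x0, TEMP= 0x0, FAN = 0x0, INPUT= 0x0"
print(lost_ac(parse_psu_event(entry)))  # prints True: every reading is zero
```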
Both PSUs are Delta 750W units (PN: 5NF18) on the latest available firmware: 07.2B.80
PSU1 is connected to a UPS (the UPS is not wired to the server, so there is no shutdown signal, and the UPS logs show nothing strange). PSU2 is plugged into a regular wall outlet. Both PSUs show a solid green light. The peak power draw is roughly 300W.
With the server working intermittently, I may have chased some red herrings already (in no particular order):
Replaced the power distribution unit
Tried re-upgrading the PSU firmware
IDRAC reset
Using expletives
Swapping PS1 and PS2
Using one PSU in slot 1
Using one PSU in slot 2
Taking apart the server and visually inspecting all boards for any sign of blown caps or burns/shorts.
Things I haven't tried yet:
Bought new (technically used/refurbished) PSUs
Motherboard replacement
iDRAC7 - there are no logs that mention anything about a "watchdog". The iDRAC is integrated into this board.
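For what it's worth, if you can export the Lifecycle log to a text file (iDRAC's racadm utility can dump it; the exact command varies by iDRAC version), a quick keyword filter makes it easy to spot watchdog or AC-loss entries without paging through the web UI. A small Python sketch, using hypothetical sample lines:

```python
# Filter exported log text for entries that might explain a power loss.
# The sample lines below are hypothetical stand-ins for real Lifecycle entries.
KEYWORDS = ("watchdog", "reason: ac", "power supply")

def suspicious_lines(log_text: str) -> list:
    """Return log lines containing any power- or watchdog-related keyword."""
    return [line for line in log_text.splitlines()
            if any(keyword in line.lower() for keyword in KEYWORDS)]

sample = (
    "The iDRAC firmware was rebooted with the following reason: ac.\n"
    "User root logged in from 192.168.0.10.\n"
    "Power Supply 1: Status = 0x00, IOUT = 0x0, VOUT= 0x0\n"
)
for line in suspicious_lines(sample):
    print(line)  # prints the 1st and 3rd sample lines
```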
Anything else I should be looking at troubleshooting?
Best,
Randy


DELL-Chris H
Moderator
•
9.7K Posts
1
March 1st, 2024 17:50
Riped01,
This may be tough to diagnose without any errors, but we can see what we can do.
What I would start with is making sure the server is up to date on BIOS, iDRAC, etc. I know you are running the latest Delta 750W PSU firmware, so no need there. If getting the server up to date doesn't resolve the issue, you may want to consider taking the server down to its minimum-to-POST configuration. Minimum to POST for the T620 means removing everything from the server, internally and externally, except the following:
If you aren't able to replicate the issue in the minimum-to-POST configuration, then one of the devices you removed is causing it; reinstall each part individually and test whether the issue reappears after each one, which identifies the faulty part. If the issue isn't resolved by the minimum-to-POST configuration, then it is likely one of the devices listed above causing the issue.
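The reinstall-one-part-at-a-time procedure described above is essentially a linear search for the faulty component. As a sketch (the part names and the failure check are placeholders; in reality each "test" means physically installing the part and letting the server soak), the logic looks like this:

```python
def find_faulty_part(parts, issue_reappears):
    """Reinstall parts one at a time on top of minimum-to-POST; return the
    first part whose reinstallation brings the shutdown back, else None.

    `issue_reappears` stands in for the real-world test: install the listed
    parts, then run the server long enough to trust the result.
    """
    installed = []
    for part in parts:
        installed.append(part)
        if issue_reappears(installed):
            return part
    return None  # fault not reproduced: suspect the minimum config itself

# Hypothetical example: pretend the second DIMM bank is the culprit.
removed = ["CPU2", "DIMM bank B", "RAID controller", "SAS backplane"]
print(find_faulty_part(removed, lambda parts: "DIMM bank B" in parts))
# prints: DIMM bank B
```

With an intermittent fault like this one, each soak period may need to run for days before a "pass" is trustworthy, so it helps to record the uptime reached at every step.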
Lastly, do you see any lights flashing on the PSUs? If so, what are you seeing?
Let me know what you see and if this helps.
riped01
1 Rookie
•
6 Posts
0
March 1st, 2024 18:58
Hi Chris,
Thanks for chiming in. I got so caught up in testing every bit of hardware and upgrading firmware that I didn't even think about trying a minimum-to-POST configuration.
I should mention that when I first bought the server, I ran it for a couple of weeks straight with no issues. When I figured out how I wanted to use the server (Proxmox with various VMs hosting containers and TrueNAS), I figured it was time to update all of the firmware:
BIOS
IDRAC
Lifecycle controller
PSU
My last firmware to update was the PSUs. I did one at a time, and each took roughly 2 hours before the server powered itself back on at the tail end of the upgrade. iDRAC confirms the firmware upgrades were successful, and the firmware version numbers match.
After the PSU firmware upgrade, I installed my second CPU and filled the remaining empty DIMM slots (22 x 8 GB). All 192 GB of RAM were detected and passed memtest. I ran a light CPU burn-in and some other tests to confirm everything was fine. Then, a week after the PSU firmware and hardware upgrades, this issue started. Last week, I ran through the gamut of tests I mentioned above; all passed, everything green in iDRAC, etc.
Both PSUs have a solid GREEN light. In the iDRAC controller, everything is detected and green (when the server is powered on) - with temperatures in normal ranges under 60C. When PSUs are plugged in and powered, the motherboard has a green "+12V AUX" light that is solid on. However, the server is stone-cold dead. When the server does power on, everything POSTs relatively quickly, and all my VMs run just fine (Media server works fine, steam/game server runs great etc etc). No bugs or weird errors. iDRAC is very responsive.
Last night the server shut down without warning after running for 8 days with no issues. This morning, I got it to run for about 10 minutes. I've been trying all day today to get it to turn on. As soon as it does come on, I can screenshot the BIOS and iDRAC versions and provide the exact error messages I am getting.
I am also beginning to suspect that the PSU firmware upgrades did not go as smoothly as they should have; 2 hours is a long time for an upgrade to complete. Since I have "upgraded" each PSU twice, I no longer see an option to roll back to the previous PSU firmware either.
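One sanity check worth doing once the server is up: compare the firmware version iDRAC reports for each PSU slot against the version you expect. A trivial Python sketch (the slot names mimic iDRAC's FQDD naming style, and the inventory dict is a stand-in for what iDRAC actually reports; the expected version is the 07.2B.80 mentioned above):

```python
def fw_mismatches(inventory, expected="07.2B.80"):
    """Return the slots whose reported firmware differs from the expected one."""
    return {slot: version for slot, version in inventory.items() if version != expected}

# Stand-in for what iDRAC's hardware inventory reports per PSU slot.
reported = {"PSU.Slot.1": "07.2B.80", "PSU.Slot.2": "07.2B.80"}
print(fw_mismatches(reported) or "both PSUs report the expected firmware")
```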
Looks like I should try the minimum-to-POST configuration... The painful part about this process is that it can take anywhere from 10 minutes to a week before the server cuts off (this is the 5th or 6th time it has happened in total), so what seems to work for a couple of days might be another red herring. Granted, I already have a month's worth of downtime with this server; I should have spent it making sure a minimum-to-POST configuration wouldn't flake out on me.
I also ordered two PSUs online, as well as a replacement, just in case.
Also, I forgot to mention that the replacement power distribution board came with cables, so those cables were replaced too.
riped01
1 Rookie
•
6 Posts
0
March 6th, 2024 01:01
Update -
New PSUs came in. Didn't make any difference.
Pulled all the RAM out except A1 and pulled CPU2 out. Unplugged both SAS connectors.
The server won't even post.
Tried swapping CPU1 and CPU2. Still nothing. Tried a different stick of RAM. Nothing.
This is my 2nd motherboard and 2nd power distribution board. I have replaced everything EXCEPT the SAS backplane. This is a nightmare.
DELL-Joey C
Moderator
•
4.1K Posts
1
March 6th, 2024 03:01
Hi Randy,
So basically, you have already checked the OS and iDRAC logs for any signs of a shutdown being initiated, and nothing there gives any hints, right? I read that it shuts down even in the BIOS, so it should be a hardware issue.
I've done troubleshooting for a case like this before, years ago, and the engineer found that the chassis was causing some kind of short circuit against the mainboard. I wouldn't recommend doing this yourself, but in that case the engineer ran the server outside of the chassis to check the issue.
You have unplugged the SAS backplane cables and the issue persists, so it's not going to be the backplane. How about the chassis control panel where the power button is; could it be a faulty button?
riped01
1 Rookie
•
6 Posts
0
March 14th, 2024 23:28
Hey Joey- thanks for chiming in.
Thanks for bringing up shorts and the power button.
I took apart the front bezel, and everything *looks* fine with respect to the power button. I haven't examined it more closely yet, since the iDRAC is not powering on either, which leads me to believe it's a different issue. However, if the chassis control panel is bad, then I can see how the system might refuse to power itself on for safety reasons.
Once upon a time, I tried to install a Supermicro motherboard in a server case we had lying around. One bad standoff caused a short between two pins of a fan connector. When I powered up that server, I smelled it, and learned quickly. Spare screws or anything metal can easily cause a short. With respect to this T620, I have removed the motherboard and power distribution card twice now, and both times I've flipped the chassis upside down and given it a gentle shake in case I missed anything.
I think I found a solution -
The first board had a voltage problem: one of the memory voltage regulator chips popped off during shipping. Why the second board failed is a mystery. I put the first board back in, and I get iDRAC and sometimes a full boot into the OS. Because of that voltage chip, the system is unreliable for obvious reasons (BSODs and kernel panics), but at least I got 'consistent' power. So, I bought another motherboard from eBay just in case. This afternoon, I installed the new-to-me used board and did a minimum POST. Everything looks good. I loaded up the server with all my RAM and CPUs and ran health tests. Still looks good.
Currently, I am flashing BIOS and then iDRAC7 firmware to get me up to date.
Two bad boards later, I hope this time I am all set.
Maybe it's optimism or a placebo, but I swear navigating through the Lifecycle Controller and through iDRAC is much, much faster with this board.
I will keep you guys posted if anything else happens. I will also check back in about a month to share a status update.
riped01
1 Rookie
•
6 Posts
0
April 14th, 2024 14:46
1 month later - new motherboard is working great.
Proxmox, Plex/Jellyfin, TrueNAS, Pi-hole, nginx, and all the other containers/VMs I've spun up on this hardware are running great.