I purchased a T320 a couple of days ago from an ad on Craigslist... (Please don't blast me for this - first time I'd ever done that, and last time. Lesson learned)... to be used as a fileserver for my home. It came with a single E5-2440, 24GB RAM, H710, DVD, and (7) 500GB drives. I paid $600 and figured for what I was getting, it'd be cheaper than buying new or building from scratch.
Ever since I got it home, it's given me nothing but troubles.... well almost.
LCD panel keeps throwing: VLT0304 CPU 1 PLL PG voltage outside of range. Reseat CPU
This happens so sporadically it's not funny. Sometimes, it will throw the error and power cycle during POST. Sometimes, it will throw the error while I am in BIOS. Sometimes it threw the error while I was trying to build the RAID (H710 Ctrl-R). It managed to stay up long enough to get Ubuntu 18.04 LTS installed... then the error again and power cycle. I managed to get the iDRAC firmware and BIOS updated to the latest versions in between all this, hoping it would solve the issue.
Throughout all of this, I have unplugged, drained the flea, unplugged everything from the motherboard, removed the CPU, removed the H710, put everything back, and powered back on... The first time I did this, it came up without issues for about 30 minutes. The second and third time, didn't seem to help much.
Then all of a sudden, it decides to stay up and running without any issues for hours on end. It managed to stay up for almost 24 hours (overnight), and I've been on/in the OS, tinkering and setting it up, all day without any issue.
I pulled out one of the hot-swap drives (just to get a model number off the drive) and plugged it back in... about 30 minutes later, I got the LCD error again... 3 times now within the time it took me to write this.
Each time this has happened, the SEL shows a bit more:
CPU 1 has a thermal trip (over-temperature) event
CPU 1 PLL PG voltage is outside of range
The system board PS1 PG Fail voltage is out of range
When I run the Hardware Diagnostics, everything comes back green.
I guess I need to know the following:
Is this a bad CPU ? or a bad motherboard? or a bad power supply (350W)?
Is this happening because I have it plugged into 1) the wall and/or 2) surge protector power strip, instead of a UPS ?
If I need to replace the motherboard, does Dell sell them new, and how much? (I can't seem to find anything other than "refurbished" on eBay.)
I really don't want to call it quits on this just yet, since I've only had the machine a few days now... but it seems if I look at it wrong, it craps out.
Any help would be appreciated.
Each of the errors can be caused by the other errors. An improper shutdown typically causes voltage/power errors. A thermal error could also be caused by an improper shutdown. The CPU can overheat very quickly without cooling. I would monitor temperatures to see if they only spike during the improper shutdown. If they only happen at the time of the shutdown then I would suspect they are a symptom and not a cause of the problem.
I would try the system on a different circuit to rule out an issue of inadequate power available on the circuit. I would also remove any unnecessary hardware or unsupported hardware installed in the system.
Dell EMC, Enterprise Engineer
Thanks Daniel, I appreciate your time.
I hope you aren't suggesting improper shutdown by me. That doesn't happen.
The box itself improperly shuts off... then reboots... as if you were to unplug the server. I take it that's what you meant: that something is causing the server to die, and THAT causes the other errors to show up.
According to iDRAC, CPU temp averages a steady 24°, with a peak of 25°
There is no unnecessary hardware in the box:
Unsupported... How would I know this? Does the T320 not support 7 WD Blue 500GB drives and 1 Samsung 960 EVO SSD? (both sincere questions). It seems like it does (when the box isn't crashing), and at the time of this response, the box & OS have been running for 2 hours without issue.
At this point in time, I'm hesitant to run any system-intensive HDD tests other than the Linux fsck.ext4 filesystem checking (which I've already done with 0 errors), for fear of royally screwing up the drives should the box decide to crash in the middle of it.
In the meantime, I have moved the server downstairs and it's plugged in on a different circuit. Time will tell. But if it is "inadequate power available on the circuit" as you suggested, wouldn't something say so somewhere? Is this 350W power supply not meant to be run off the wall? I've run 800W+ gaming rigs in the past with no issues. Just doesn't make sense. Or were you suggesting Dell's 350W power supply was not meant to power the 8 hot-swap bays (only 7 in use) that it was configured for?
As for the order of things, based on the log, it almost seems that the "CPU 1 has a thermal trip (over-temperature)" event is occurring first... yet iDRAC is not showing anything above 25°
|Mon Oct 07 2019 21:47:19||The system board PS1 PG Fail voltage is within range.|
|Mon Oct 07 2019 21:47:14||CPU 1 is operating correctly.|
|Mon Oct 07 2019 21:47:14||The system board PS1 PG Fail voltage is outside of range.|
|Mon Oct 07 2019 21:46:55||CPU 1 PLL PG voltage is outside of range.|
|Mon Oct 07 2019 21:46:55||CPU 1 has a thermal trip (over-temperature) event.|
|Mon Oct 07 2019 21:14:03||The system board PS1 PG Fail voltage is within range.|
|Mon Oct 07 2019 21:13:58||CPU 1 is operating correctly.|
|Mon Oct 07 2019 21:13:58||The system board PS1 PG Fail voltage is outside of range.|
|Mon Oct 07 2019 21:13:38||CPU 1 PLL PG voltage is outside of range.|
|Mon Oct 07 2019 21:13:38||CPU 1 has a thermal trip (over-temperature) event.|
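To check the ordering rather than eyeball it, a rough Python sketch along these lines could sort the entries oldest first and group them per crash. It assumes the log is listed newest first (as pasted above) and that events less than 5 minutes apart belong to the same crash; the names `parse_sel` and `group_incidents` and the 5-minute gap are made up for illustration, not anything from Dell's tools. Note that the SEL timestamps only have one-second resolution, so the thermal trip and the PLL PG event share a timestamp and their relative order is only as reliable as the order in which the log lists them.

```python
from datetime import datetime, timedelta

# SEL entries copied from the post above, newest first.
SEL_TEXT = """\
|Mon Oct 07 2019 21:47:19||The system board PS1 PG Fail voltage is within range.|
|Mon Oct 07 2019 21:47:14||CPU 1 is operating correctly.|
|Mon Oct 07 2019 21:47:14||The system board PS1 PG Fail voltage is outside of range.|
|Mon Oct 07 2019 21:46:55||CPU 1 PLL PG voltage is outside of range.|
|Mon Oct 07 2019 21:46:55||CPU 1 has a thermal trip (over-temperature) event.|
|Mon Oct 07 2019 21:14:03||The system board PS1 PG Fail voltage is within range.|
|Mon Oct 07 2019 21:13:58||CPU 1 is operating correctly.|
|Mon Oct 07 2019 21:13:58||The system board PS1 PG Fail voltage is outside of range.|
|Mon Oct 07 2019 21:13:38||CPU 1 PLL PG voltage is outside of range.|
|Mon Oct 07 2019 21:13:38||CPU 1 has a thermal trip (over-temperature) event.|
"""


def parse_sel(text):
    """Return (timestamp, message) tuples, oldest first."""
    events = []
    for line in text.splitlines():
        fields = [f for f in line.strip().split("|") if f]
        if len(fields) != 2:
            continue  # skip anything that is not a timestamp/message pair
        stamp, message = fields
        events.append((datetime.strptime(stamp, "%a %b %d %Y %H:%M:%S"), message))
    # Assumption: the listing is newest first, so reversing before a stable
    # sort keeps same-second entries in the order they were logged.
    events.reverse()
    events.sort(key=lambda event: event[0])
    return events


def group_incidents(events, max_gap=timedelta(minutes=5)):
    """Split chronologically sorted events into bursts separated by more than max_gap."""
    incidents = []
    for event in events:
        if incidents and event[0] - incidents[-1][-1][0] <= max_gap:
            incidents[-1].append(event)
        else:
            incidents.append([event])
    return incidents


incidents = group_incidents(parse_sel(SEL_TEXT))
for number, incident in enumerate(incidents, start=1):
    first_time, first_message = incident[0]
    print(f"Incident {number}: {len(incident)} events, "
          f"first at {first_time:%H:%M:%S}: {first_message}")
```

If the first entry of every incident is the thermal trip, that would at least be consistent with what I'm seeing in the log, though it doesn't say whether the trip is a cause or just the first symptom to get logged.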
Moot point now.
Since I couldn't get a straight answer out of Dell as to what Dell's own error codes represent, what the real problem was with this Dell T320, and how much a new Dell T320 motherboard would run me (which I specifically asked in my opening post), all of which should have been provided to me by the Dell representative who took the time to respond without really helping at all (in typical Dell fashion)...
Just said 'fck it' and bought a new T340 instead.... Probably Dell's ploy all along. Now I just have to wait and see how long before it actually arrives. Likely to be the last Dell product I buy.