Unsolved

1 Rookie

 • 

48 Posts

3432

February 24th, 2023 17:00

GPU upgrade for my Dell Precision T3610 causes reboots

I have a Dell Precision T3610. 

My PC came with a 685 watt PSU, and it originally shipped with an Intel Xeon 1260v2 and an Nvidia Quadro K4000, as well as a 15000RPM SAS drive. I replaced the CPU with a Xeon 2667v2, upgraded the RAM to 8x16GB and replaced the SAS HDD with two SSDs and a 7200RPM HDD. 
 
So far the system was working perfectly fine. But then I tried to upgrade the GPU. I got an HP OEM style RTX 2060 Super, hoping that the lower-profile and lower-power GPU would both be easier to fit in my case and not demand too much power. This kind of card: 
 
 
I thoroughly tested the card in another machine for about two weeks and had no issues. 
 
However, one problem is that the PSU only has a single proprietary 8-pin connector for GPU power, which by default gets split into two 6-pins, and the GPU I got takes a single 8-pin. I wanted to just get an 8-to-8 pin cable, but I could not find one; all of them were 8-pin to 2x8-pin, which I do not trust. So I just got a cable that converts my two 6-pins back to a single 8-pin, this one: 
 
 
Here it is installed:
https://i.imgur.com/mWEvDnX.jpg 
 
One oddity that I noticed was that on the stock OEM cable, only 6 of the 8 pins appear to be populated on the end that connects to the PSU: https://i.imgur.com/IRwaHXR.jpg  
 
The whole setup looks like this:
https://i.imgur.com/n64XbRu.jpg 
 
So after I installed the card I ran more tests to make sure my system can handle it. I tried Furmark's stress test, and it ran fine for about 5 minutes. According to my UPS my system was pulling around 300-360 watts. 
 
Then I closed that and tried Prime95, again my system was pulling in the mid-300s according to my UPS. 
 
Then I ran both... to my surprise it seemed to be pulling the same amount of power, a few spikes to 400 watts but that's it. 
 
I walked away for a minute, and when I came back the system had rebooted. 
 
No doubt some kind of current protection had kicked in while I was gone, and I am looking for advice on what to do. I would have assumed that 685 watts would be enough for all this, and I have no idea if the PSU is at fault. If so, there are 800 and 1300 watt PSUs, but I am not sure if they would work in my system. I see a few sites listing them for the Precision T3600 series and up, but most sites list them for the T5000 series and up. I have no idea if the 800+ watt PSUs are compatible with my system, and I don't want to risk putting in an incompatible PSU. It's not like I can use a standard PSU. 
 
Or I wondered if it could maybe be the 2x6 to 1x8 pin adapter. Like I mentioned, I found it weird that only 6 pins are populated on the OEM cable of all things, and that my system was not pulling more than 400 watts with the CPU and GPU both stressed when it was pulling in the mid-to-high 300s with each part stressed separately. So I don't know if either the OEM cable or the 6-to-8 pin adapter I got is not supplying enough power. 
 
And if that is the case, I wonder if that 1x8 to 2x8 adapter might be a better idea, this one here:
https://www.amazon.com/dp/B07P82ZH22
 
I do not trust that kind of cable because it is splitting a single 8 pin to two, something that seems pretty risky and dangerous, but I also would only be using it to power a single 8-pin and it APPEARS to populate all the pins on the PSU side (though it seems like it just re-routes the ground pins on the GPU-end to other pins?) 
 
Or this could be something else entirely, or I am just pushing my system too hard with all the modifications I made: upgrading the CPU, maxing out the RAM, putting in three drives in place of the one, which required a SATA splitter (though that is replacing one 15000RPM SAS drive with a 7200RPM drive and two SSDs on a system designed to handle up to two of those SAS drives, so I would assume the SATA power is not being overloaded), and now upgrading the GPU. 
 
Any advice on what could be the issue and how to try to solve it?

1 Rookie

 • 

48 Posts

February 24th, 2023 19:00

Whoops, you're right. It's a 1620, not 1260.

The first thing I did was check the Event Log, and all I saw was "Event 41: Kernel Power: The system has rebooted without cleanly shutting down first. This error could be caused if the system stopped responding, crashed, or lost power unexpectedly."

Not very helpful. Also as I was running stress tests just now I saw this:

The description for Event ID 0 from source nvlddmkm cannot be found. Either the component that raises this event is not installed on your local computer or the installation is corrupted. You can install or repair the component on the local computer.

If the event originated on another computer, the display information had to be saved with the event.

The following information was included with the event: 

\Device\Video3
Error occurred on GPUID: 200

The message resource is present but the message was not found in the message table

 

Which is odd, since I had no trouble running any benchmark or game on the card so far. The first thing I did before installing the card was boot my system into safe mode and use DDU to uninstall the Quadro drivers, and the first thing I did with the card installed was reboot without internet and install the GeForce drivers. Also, the very next message in my event log is "Display driver nvlddmkm stopped responding and has successfully recovered."

This happened well after the reboot though. This is why I am so confused: I spent weeks looking into every upgrade and modification I made and am sure I did all of it right, as you mentioned as well, so I don't understand why this happened.

I was having RAM errors several months before, but Windows always logged all of those, and since it's ECC RAM they were always caught and fixed without even a crash. This time I didn't even get a BSOD (and I have auto-reboot on BSOD disabled). I just walked away for barely two minutes and when I came back it had rebooted.
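
One way to check whether the nvlddmkm errors could explain the Event 41 reboot is to line the two up by timestamp. The following is only a rough Python sketch: it assumes the events have already been exported by hand into (timestamp, source, event_id) tuples, and the source names, the function name and the sample timestamps are all made up for illustration, not taken from the actual log.

```python
from datetime import datetime, timedelta

# Source names below are assumptions about how an exported log labels them;
# adjust them to whatever your export actually contains.
REBOOT_SOURCE = "Kernel-Power"
REBOOT_EVENT_ID = 41
DRIVER_SOURCE = "nvlddmkm"


def driver_errors_before_reboot(events, window_minutes=10):
    """Return nvlddmkm events logged within window_minutes before a Kernel-Power 41.

    events is an iterable of (timestamp, source, event_id) tuples, a
    hypothetical export format used only for this sketch.
    """
    events = list(events)
    window = timedelta(minutes=window_minutes)
    reboots = [t for t, src, eid in events
               if src == REBOOT_SOURCE and eid == REBOOT_EVENT_ID]
    hits = []
    for t, src, eid in events:
        if src != DRIVER_SOURCE:
            continue
        # Keep the driver error only if a reboot entry follows it closely.
        if any(timedelta(0) <= r - t <= window for r in reboots):
            hits.append((t, src, eid))
    return hits


# Made-up timestamps: the driver error comes AFTER the reboot entry.
sample = [
    (datetime(2023, 2, 24, 17, 40), "Kernel-Power", 41),
    (datetime(2023, 2, 24, 18, 30), "nvlddmkm", 0),
]
print(driver_errors_before_reboot(sample))  # prints []
```

An empty result matches what is described above: driver errors that only show up well after the reboot do not by themselves explain it. Keep in mind that Event 41 is written on the next boot, so its timestamp is slightly later than the actual crash.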

9 Legend

 • 

8.1K Posts

February 24th, 2023 19:00

There seems to be a typo in the original processor model. Anyhow, an E5-1620 v2 has the same TDP as the Xeon 2667 v2, so there is no change in power consumption there.

The T3610 specification allows up to 2 full-height graphics cards, up to a total of 300w, to be used with the 685w PSU. Therefore it should support a 160w RTX 2060 card without any power issue. 

Your setup using a 2x 6-pin to 1x 8-pin adapter cable is proper and should provide plenty of power (75w + 150w) to the RTX graphics card. The Dell 8-pin cable 0D92C9 is factory wired with 6 active wires and 2 empty, so there is no problem there either.
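
The arithmetic behind the 75w + 150w figure can be sketched in a few lines of Python. The connector limits are the nominal PCI Express values (75 W from the slot, 75 W per 6-pin, 150 W per 8-pin); the function names are made up for this illustration, and the 160 W card figure is simply the number quoted in this reply, not a measurement.

```python
# Nominal PCI Express power limits in watts.
SLOT_W = 75
SIX_PIN_W = 75
EIGHT_PIN_W = 150


def card_side_budget(eight_pins=0, six_pins=0):
    """Power the card may draw: slot plus its own auxiliary connectors."""
    return SLOT_W + eight_pins * EIGHT_PIN_W + six_pins * SIX_PIN_W


def adapter_within_rating(source_six_pins, target_eight_pins):
    """True if the 6-pin leads feeding an adapter are rated for the 8-pin load."""
    return source_six_pins * SIX_PIN_W >= target_eight_pins * EIGHT_PIN_W


CARD_W = 160  # figure quoted in the reply above, not a measured value

budget = card_side_budget(eight_pins=1)
print(budget, budget - CARD_W, adapter_within_rating(2, 1))  # prints 225 65 True
```

On paper this leaves about 65 W of headroom, and two 6-pin leads are rated for exactly what one 8-pin may draw. It says nothing about the condition of an aging PSU or the quality of a particular adapter.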

With all that said, I do not think that you have an underpower issue, so there is no need to upgrade the PSU. You may want to check the event log for any information regarding the system restart. Try updating to different drivers and/or changing the power plan to high performance.

As a last resort, pulling the coin cell battery to reset the CMOS may also help. Just take note of the current settings so you can restore them afterwards.

9 Legend

 • 

8.1K Posts

February 24th, 2023 21:00

Well, it seems that all the preparation and manual installation did not help, and all indications are pointing to drivers and configuration.

Let's try to see if an RTX driver installed by Windows works better. You must uninstall the current driver first, though. Go to Device Manager, expand Display adapters, right-click the RTX card, and select the option to uninstall the device. Check the box to attempt to remove the driver for this device. When the uninstallation is complete, restart the system and give it a few minutes, or manually open Windows Settings and check for updates. Let Windows Update install the RTX driver and the Nvidia control panel. Although it isn't required, go ahead and give your system another restart (to trigger the Nvidia audio driver installation). 

Stress test your system again.

1 Rookie

 • 

48 Posts

February 24th, 2023 22:00

That's a slight bit of an issue. I am on driver 528.24 for benchmarking purposes. As I mentioned, I was testing the card heavily in another system, and I also ran days of different benchmarks to compare the differences in CPUs and GPUs between it and this PC. I was on 528.24 when I tested that system, so I don't want to use a different driver version here, to keep as much the same as possible.

I also never got that nvlddmkm error on that test system.

... it's getting even weirder now. I tried Prime95 alone again, and it was running fine. I even noticed that I had accidentally left it set to use only half my cores from when I had the 1620, so I set it to use all 8 cores now. But it ran fine... so I tried Furmark's stress test again.... and it still ran fine. Both ran for 20 minutes before I closed them, without a crash or any error in the Event Viewer (though earlier tests did give me a few more nvlddmkm errors... I will have to check exactly which tests/games did it, as most do not). I noticed I was pulling more power now too. I was at around 260-ish watts at 4 cores, around 290 at 8 cores, and at a very steady 496 or something like that with both Prime95 and Furmark running. When I closed Furmark the system was still at 315 watts with just Prime95.

This is the first time I saw the system go over 425 watts. I would almost suspect that I was given a 425 watt PSU, but I can't imagine anyone forging a 685W sticker for an enterprise system just to save a small amount by using a 425 watt one.

Attempting to Google this nvlddmkm error was.... a needle in a haystack. Tons of results from a year to 13 years ago, all seemingly never really solved or randomly solved all in different ways that don't work for the others having the same error at the time. I wonder if perhaps somehow my old Quadro drivers were not cleaned out completely or they got corrupted when it crashed since it was heavily using the GPU at the time. Any ideas about any of that? I am VERY stumped on this one, I hate problems that make no sense and seemingly go away only to bring in new problems that Google is no help with.

The only thing I really did differently is that I had rebooted and disconnected my hard drive, since I didn't want to risk it getting corrupted if the system kept crashing, as it's an SMR drive. (I performed a backup of the main SSD just now, and the other SSD contains nothing important.) I can't imagine that a single HDD was somehow pushing the system past its limit, even if I somehow "overloaded" this system with two SSDs and a HDD when it was supposed to have a max of two drives connected to it (even though the motherboard has connectors for SIX SATA drives).
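
For reference, the UPS figures from this post can be read as deltas rather than absolute numbers. This is a small Python sketch using only the approximate readings quoted above; the dictionary keys and the helper name are made up, and a UPS reading is wall power, so it includes PSU losses and anything else plugged into the UPS.

```python
# Approximate UPS readings in watts, as quoted in the post above.
readings = {
    "prime95_4_cores": 260,
    "prime95_8_cores": 290,
    "prime95_and_furmark": 496,
    "prime95_after_furmark_closed": 315,
}


def delta(readings, loaded, baseline):
    """Extra wall power drawn in the 'loaded' state compared to 'baseline'."""
    return readings[loaded] - readings[baseline]


# Going from 4 to 8 Prime95 threads.
print(delta(readings, "prime95_8_cores", "prime95_4_cores"))  # prints 30

# What Furmark adds on top of Prime95, measured at the wall.
print(delta(readings, "prime95_and_furmark", "prime95_after_furmark_closed"))  # prints 181
```

Seen this way, the GPU load adds roughly 180 W at the wall on top of the CPU load, which is why the combined run finally went past the 400 watt mark seen in the first tests.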

1 Rookie

 • 

48 Posts

February 24th, 2023 23:00

I don't know if I mentioned it, but I unplugged the ethernet cable when uninstalling and reinstalling the drivers. I also disabled automatic updates and made sure Windows only lets me install them manually. It should not have tampered with my drivers automatically in any way.

I checked and it's set to high performance power plan too.

I got the machine used from a 3rd party seller on Newegg about two years ago, but yeah, I can't imagine it would be worthwhile to tamper with the PSU like that. The only reason I considered it is that when I asked what PSU it came with, they were okay with giving me the 685 watt one instead of the default 425 for no additional cost.

The SMR drive is just for mass storage of large files that don't need fast access, I couldn't afford something better at the time. The operating system and all programs and games run off of the SSDs.

9 Legend

 • 

8.1K Posts

February 24th, 2023 23:00

Windows may end up installing the same driver version you are using now, but with cleaner and more proper (default) settings. Also, I wouldn't worry too much about any leftovers or residue of the Quadro driver, as Windows does not use any of it while running the RTX.

With the power consumption of the CPU and GPU combined, it could spike over 300w. Add to that the motherboard's usage of over 100w for RAM, drives, and fans. It sounds about right for your system to pull close to 500w under maximum load, as extra power is also needed to make up for PSU efficiency losses. I did mention in my earlier post to use the high performance power plan, which would run all cores at full power. I don't know where you got the machine from, but I can't imagine the need or benefit for anyone to tamper with the PSU.
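
The step from component power to the wall reading can be sketched as a simple division by PSU efficiency. This Python sketch is only illustrative: the 300 W and 100 W figures are the rough estimates from this reply, and the efficiency values are assumptions chosen for the example, not Dell's specification for this PSU.

```python
def wall_power(dc_load_w, efficiency):
    """Estimate power drawn from the wall for a given DC load on the PSU."""
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    return dc_load_w / efficiency


# Rough estimates from the reply above: CPU + GPU, plus RAM, drives and fans.
DC_LOAD_W = 300 + 100

# Assumed efficiencies, for illustration only.
for eff in (0.80, 0.85, 0.90):
    print(f"{eff:.2f}: {wall_power(DC_LOAD_W, eff):.0f} W at the wall")
```

Under these assumptions a 400 W DC load shows up as roughly 440 to 500 W on a UPS display, which is in line with the 496 W reading reported earlier.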

The nvlddmkm error is obviously related to Nvidia. Hence the suggestion to test with different drivers, including the same driver version you want, but letting Windows install it. About the HDD you unplugged: it would not have any impact on the machine, power-wise. And why use an SMR drive for internal daily use? It is best suited to write-once, read-many purposes.

With the additional information provided, I still suggest looking at the Nvidia drivers and their setup as the cause of the crashes and reboots, not the power supply.

9 Legend

 • 

8.1K Posts

February 25th, 2023 00:00

Yes, you mentioned the driver was manually installed. You can try running Windows Update to see if it picks up any Nvidia driver. If that does not work out, you can roll back the driver and have your originally installed driver again.

1 Rookie

 • 

48 Posts

February 26th, 2023 00:00

After using DDU to fully uninstall the driver, including PhysX, using Disk Cleanup to clear the shader cache, and using CCleaner to remove any temp files, I re-installed the same drivers and have not gotten the error once in two days of running GPU stress tests and benchmarks.

However, my CPU strangely failed the OCCT benchmark a few times at the exact same place before passing it.... no other CPU or GPU test failed. Now I am starting to worry that it's my CPU. Specifically, I think it may be AVX instructions that trigger it, since every time the OCCT CPU benchmark crashed, it crashed at the 1 thread AVX test. Even more confusing, the benchmark was taking 55 seconds when it crashed, but took 44 seconds when it passed.

(Spoke too soon, just got the nvlddmkm error again, although it seems to only happen when I launch the Forspoken demo to test this card)

9 Legend

 • 

8.1K Posts

February 26th, 2023 04:00

Well, when you throw the CPU benchmark test into the mix, it's a whole other ball game. Keeping separate test notes would help with accuracy in troubleshooting, identifying the causes, and documenting progress until the issues are resolved. 

Although the nvlddmkm error came back, you were able to stress test the RTX for 2 days with a re-installed driver. That should be welcome news because it addresses the earlier concerns: your power supply is sufficient and the restart issue was driver related.

Besides upgrading the GPU, you had mentioned that you also upgraded the CPU, memory, and storage. By chance, did you run a memory test and/or reinstall the operating system? If not, let's run a Windows memory diagnostic and a system CHKDSK. It may not have a strong impact on solving your failed benchmark tests, but it can rule out some basic issues and mark them off your troubleshooting checklist. 

 

4 Operator

 • 

1.4K Posts

February 26th, 2023 05:00

If I may intrude, I suspect the issue lies with the PSU. It may be slightly above the requirements for this setup, but it's possible that, due to age, the GPU rails have some glitch and randomly cause a fluctuation under heavy load.

It's not the GPU, since it works elsewhere. It's not the rest of the system with the original GPU, which works fine as you said (and the K4000 uses just 80W max, so...).

Personally I would try a new PSU of the same or higher wattage, or maybe consider reselling the 2060 and getting an A2000 (there are good bargains now I think, you may even break even).

1 Rookie

 • 

48 Posts

February 26th, 2023 11:00


@Chino de Oro wrote:

Besides upgrading the GPU, you had mentioned that you also upgraded the CPU, memory, and storage. By chance, did you run a memory test and/or reinstall the operating system?

 


First thing I did after upgrading the RAM was run Memtest86 and it passed, and then Memtest86+ overnight and that passed too, and then Prime95 for several hours. I ran into no errors, although I did not think to check the Event Viewer at the time.

After I upgraded my CPU a month later I did the same. However, I also ran OCCT's tests as I had forgotten to do those, and sometimes I would get an error in the RAM test, other times not. Looking at the event log I would see a WHEA error (Prime95 would generate an Event 2 and OCCT an Event 47), but my system never crashed. I have no idea if that's the CPU, RAM, or just a freak accident. But despite that I had my system running with this CPU and RAM 24/7 for months on end with no crashes or errors. OCCT never gave me any errors either when I was doing a baseline benchmark a few weeks ago with the old Quadro card in this PC, to compare against once I installed the new RTX GPU.

9 Legend

 • 

8.1K Posts

February 27th, 2023 00:00

If your system ran 24/7 for a month with no errors and no crashes, then the RTX is a contributing factor to the current issues. Besides drivers and settings, note the thermal conditions during your troubleshooting tests.

I would consider the overall performance of your machine to be at a satisfactory level. With all the maximum and higher-end components combined, you have already achieved the performance expected from a decade-old system. Having added an HP (OEM signed) graphics card and managed to get the driver properly installed and accepted by a Dell workstation, you should be glad and enjoy your achievement.

On a side note, you may want to take the input from mazzinia into consideration. Although the VGA rail of your power supply could provide up to 216w (not counting the 75w slot power), the aging condition of the PSU may cause some fluctuation (drop) in current under sustained high stress. An out-of-range voltage spike from the RTX might be interpreted by OCCT as an issue with the GPU, which did not occur with the lower power draw K4000. To be honest, I'm unfamiliar with OCCT, so it's just an educated guess. It will also depend on whether you have access to another PSU for testing.

1 Rookie

 • 

48 Posts

February 27th, 2023 19:00

Seems I spoke too soon. While I was using the system my display suddenly went blank. My monitor claimed no signal, seemed like it kept almost seeing a signal but then not. Attempting to soft-reboot the system by pressing Win+X then U twice did not work, nor did ctrl+alt+del.

The event log claimed that nvlddmkm crashed, and then there were dozens of warnings that the driver had recovered.... clearly it had not.

I tried OCCT again, several CPU benchmarks and nothing odd. I then tried a CPU stress test... never heard my fans go that crazy before.... after about 20 minutes into the test the exact same thing happened again.

I am completely stumped as to just what is going on now.

9 Legend

 • 

8.1K Posts

February 27th, 2023 23:00

It might be a long shot here, but have you considered undervolting it, just for testing? Ever used Afterburner?

1 Rookie

 • 

48 Posts

February 28th, 2023 00:00

I only ever used EVGA's tool before, and never messed with voltages.

You think it could still be the GPU since it seems to be stressing the CPU that causes it? I can't tell what to even look at anymore.
