March 13th, 2017 15:00

Aurora-R6 : Hard-Lockup and crash while gaming (SOLVED)

On 3-9-2017 I had a hard lockup while playing Fallout-4 (Steam version). I was unable to access Task Manager. The (two-week-old) Aurora-R6 rebooted itself shortly thereafter.
 
It was pretty scary since it first looked like the SSD had fried. On consecutive reboots (and even attempted cold boots) it just kept trying to boot from the network instead of the SSD. Turns out the BIOS decided to move the Windows Boot Manager to the lowest Boot Priority. 
 
Event Viewer did not record anything important before or during the lockup.
It's been my experience that this usually points to a hardware problem (or running out of power).

I mistakenly thought the Aurora-R6 had been stable since I installed it a few weeks ago. However, after checking Event Viewer, I found the same thing had happened 4 other times since 2-18-2017. Apparently, it's been happening while I'm not around (like when it wakes itself to install updates or do other maintenance).

These are the only "Critical" errors found.
Quantity:5 (in less than a month)
Level:Critical, Source:Kernel-Power (un-clean shut-down), Event-ID:41
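
For anyone who wants to pull the same entries without scrolling through Event Viewer, here is a minimal sketch that builds a `wevtutil` query for Kernel-Power Event ID 41 in the System log. The helper name `build_query` and the default of 10 events are my own choices for illustration, not something from the original post.

```python
import subprocess
import sys

# XPath filter: Kernel-Power events with Event ID 41 in the System log.
KERNEL_POWER_41 = (
    "*[System[Provider[@Name='Microsoft-Windows-Kernel-Power'] "
    "and (EventID=41)]]"
)


def build_query(max_events=10):
    """Return the wevtutil command line that lists the newest matching events."""
    return [
        "wevtutil", "qe", "System",
        f"/q:{KERNEL_POWER_41}",
        "/f:text",           # human-readable output
        "/rd:true",          # newest events first
        f"/c:{max_events}",  # cap the number of events returned
    ]


if __name__ == "__main__":
    cmd = build_query()
    if sys.platform == "win32":
        # Only meaningful on Windows, where wevtutil ships with the OS.
        print(subprocess.run(cmd, capture_output=True, text=True).stdout)
    else:
        print("Run this on the Windows machine:", " ".join(cmd))
```

Counting the events returned gives the same tally as the "Quantity" figure above.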

I set Aurora-R6 to not reboot on crash
Set crash-logs to "Small Memory Dump" (mini dump) ... Just to see if it will write one
Set Virtual Memory to Custom on C: 4000/8000
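
To double-check that these settings actually stuck, a rough sketch like the one below reads the `CrashControl` registry key, where Windows stores the auto-reboot and dump-type options. The function names are hypothetical; the value meanings listed are the commonly documented ones, and the registry read only works on Windows.

```python
CRASH_CONTROL_KEY = r"SYSTEM\CurrentControlSet\Control\CrashControl"

# Commonly documented meanings of the CrashDumpEnabled value.
DUMP_TYPES = {
    0: "none",
    1: "complete memory dump",
    2: "kernel memory dump",
    3: "small memory dump (minidump)",
    7: "automatic memory dump",
}


def describe(auto_reboot, crash_dump_enabled):
    """Turn the two raw registry values into a readable summary."""
    reboot = "reboots automatically" if auto_reboot else "stays halted"
    dump = DUMP_TYPES.get(crash_dump_enabled, f"unknown ({crash_dump_enabled})")
    return f"On a bug check the system {reboot}; dump type: {dump}"


def read_crash_control():
    """Read AutoReboot and CrashDumpEnabled from the registry (Windows only)."""
    import winreg  # imported here so the module still loads on other platforms

    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, CRASH_CONTROL_KEY) as key:
        auto_reboot, _ = winreg.QueryValueEx(key, "AutoReboot")
        dump_enabled, _ = winreg.QueryValueEx(key, "CrashDumpEnabled")
    return auto_reboot, dump_enabled
```

On the Windows machine, `print(describe(*read_crash_control()))` should report "stays halted" and a small memory dump if the two settings above took effect.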
 
My Aurora-R6 passed 2 scans with ePSA Diagnostics.
 
BIOS is currently v1.0.1 (what it shipped with)

Ran Alienware Update (I always run it manually)
- It seems a new BIOS was recently posted
- Said BIOS v1.0.3 was ready to install. Crossed fingers and let it.
- It rebooted to DOS and did it successfully

BIOS v1.0.3 NOW (3-11-2017)
Check BIOS settings ... still the same.

It's happened 2 times since the above time on 3-9-2017.
- Once just an hour into Fallout-4 and once while playing FarCry-3.
- I run this same Steam version of Fallout-4 on my Aurora-R1 (also with GTX-1070) for many hours and sessions with no problems whatsoever. Same with FarCry-3 on UPlay.
 
So, BIOS 1.0.3 didn't seem to help with the lockups.
Hopefully it at least fixed the shifting Boot Priorities.
- Would be nice if we could get the issue details (or at least a proper summary) about Dell BIOSes when they are published.
 
The only difference now is that it doesn't reboot itself (since I told it not to).
- Sometimes the screen is frozen and sound looping.
- Sometimes the screen eventually goes to sleep due to lack of video signal.
I still can't access Task Manager to get a clean shut-down. I have to hold down the power button to force it off.
Still no Event Viewer entries about it (other than the Critical Kernel-Power entry that is created on reboot after the un-clean shut-down).
No BSOD
No crash logs or memory dumps created.
It's connected to a good APC 1500-LCD UPS.
(Click my name to see its detailed config)

Edited after initial posting (left out some details)

Dell Rockstar

Registered Microsoft Partner and Apple Developer
- Like many of you, I can appreciate a good game-engine.
- I answer questions here, but I'm not a Dell employee.
- Consider giving posts you like a "thumbs-up"
- Posting models-numbers and software versions speeds trouble-shooting.
- Click "Mark as Accepted Answer" on any post that answers your question best.

8 Wizard • 17K Posts

April 10th, 2017 14:00

My crashing issue appears to be fixed. I have not had any more "hard lockups" or critical Hardware Errors in Reliability Monitor since I updated the Nvidia driver (3-11-2017).

It appears that the old Nvidia v375.63 driver (that Dell Pre-installed) is one of those particularly bad sets. It was just the luck-of-the-draw that it ended up in their current Aurora-R6 BootDisk pre-load Image. They would cause an actual Windows Hardware Failure. If it happened during a Full-Screen Direct-X game, the computer would hard-lockup (no BSOD, no crash-dump files). It's been a long time since I've seen something like this, that wasn't truly malfunctioning hardware that needed to be replaced.

Over a period of 3 weeks, I have now logged over 20 hours (in 8 sessions) in Fallout-4 without the slightest problem. I'm calling it fixed.
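
As a rough sanity check on "calling it fixed", here is a small sketch of how much a clean run can tell you. It assumes lockups arrive at a constant average rate per gaming hour (a Poisson model), which is my simplifying assumption and not something measured in this thread. The function names are made up for illustration.

```python
import math


def max_rate_after_clean_run(hours, confidence=0.95):
    """Upper bound on lockups per hour, given zero lockups in `hours`.

    Under a Poisson model, P(no lockups in T hours) = exp(-rate * T).
    Solving exp(-rate * T) = 1 - confidence gives the bound below.
    """
    return -math.log(1.0 - confidence) / hours


def prob_clean_run(rate_per_hour, hours):
    """Chance of seeing no lockups at all in `hours` at the given rate."""
    return math.exp(-rate_per_hour * hours)


# 20 clean hours, as reported in the post above.
bound = max_rate_after_clean_run(20)
print(f"95% upper bound: {bound:.3f} lockups/hour")
```

In plain terms: 20 clean hours rules out (at 95% confidence) a problem that strikes more often than roughly once every 6 to 7 hours of gaming. If the machine used to lock up within an hour or two, a clean run this long would be very unlikely by chance.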

Updating Nvidia video driver:

Downloaded and installed GTX-1070 v378.78 03-09-2017 WHQL Driver (from nvidia.com)

Selected Custom
Selected only Graphics Driver, HD Audio Driver, and PhysX Driver.
Selected Clean Install
Rebooted when finished

Nvidia Control Panel says v378.78 is installed
- Left it set on "Let the 3D Applications Decide" for now (about as generic as you can get)
- Left PhysX Processor on Auto-Select (since it said Recommended). It correctly shows that "GTX-1070" is being used.
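
If you would rather confirm the installed version from a script than from the Nvidia Control Panel, something like the following sketch should do. It assumes `nvidia-smi` is on the PATH (on some Windows installs it lives in the driver's own folder instead), and the helper names are hypothetical. The 378.78 threshold is simply the version that worked in this thread.

```python
import subprocess


def parse_version(text):
    """Turn a version string such as '378.78' into a comparable tuple."""
    return tuple(int(part) for part in text.strip().split("."))


def is_at_least(installed, required="378.78"):
    """True if the installed driver is the required version or newer."""
    return parse_version(installed) >= parse_version(required)


def installed_driver_version():
    """Ask nvidia-smi for the driver version (needs nvidia-smi on the PATH)."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout
    return out.splitlines()[0].strip()
```

For example, `is_at_least("375.63")` is false for the pre-installed driver, while `is_at_least(installed_driver_version())` should be true after the clean install.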

Of note, the last time it crashed was just after installing BIOS v1.0.3, so I don't think that was the fix. This was also right before updating Nvidia driver.

Thanks for everyone's help. I'm only marking this particular post Correct to move it to the top of the thread (so the solution can be easily found by others later).

March 18th, 2017 00:00

Tesla,

You've helped this community a lot, sorry to see no one's offered you some back.

I've also been bitten by that darn boot process issue. As others have posted, I had grief when trying to migrate over to a 960 Pro: the boot order changed on me, and that cost a few blood pressure points and 15 minutes of wasted F12 commands... Luckily, after changing that back, it hasn't seemed to cause any more grief. I assume the same for you.

As for the actual crashes, no doubt you're knowledgeable enough to troubleshoot the basics yourself, but I thought I'd see how this is progressing at least.

As for the crash dumps: Windows 10 has that "Reliability Monitor" tool. Does that parse out any info that helps diagnose this issue?

Otherwise, I might suggest NirSoft's BlueScreenView ("Blue screen of death (STOP error) information in dump files"). NirSoft is a decently well-known company in this space, and their software is usually pretty helpful for running through the logs.

I wouldn't put it past the SSD being trouble though. These last 6 months of the 40+ computers I've ordered for work, 20 of them have had PM9X1 failures/corruptions. If it wasn't late on a Friday, I'd remember the bloody Event ID it keeps tripping, but it's pretty obvious and has something to do with the SSD controller.

I've also never really had great luck with that ePSA diag. It really only triggers for full/obvious hardware failures. Event viewer catches so much more for me...

Good luck!

1 Message

March 20th, 2017 09:00

There was a BIOS update. Have you installed it?

8 Wizard • 17K Posts

March 20th, 2017 16:00

Yes, I did (as I posted). It didn't seem to help.
 
Hopefully, BIOS v1.0.3 at least fixed the Windows Boot Manager - Boot Priorities bug that CometusCrimsonion and I have experienced. I haven't seen that happen again lately. Sure would be nice if we could get a proper list of fixes, features, and issues about new Dell BIOSes when they are published.

8 Wizard • 17K Posts

March 20th, 2017 16:00

CometusCrimsonion wrote:

 ePSA diag. It really only triggers for full/obvious hardware failures. Event viewer catches so much more for me...

 

That is my experience as well. Same with UEFI Diags on other machines. It's important, but really seems more like just the first hardware test or hurdle you need to pass.

8 Wizard • 17K Posts

March 20th, 2017 16:00

Thanks for posting. Yes, I spent thousands of dollars on this machine and I'm getting a little worried about now.

Yes, I can use all the help I can get. I've been updating drivers, uninstalling Dell stuff, and turning processes off pretty much daily trying to get it stable enough to play a game for a couple of hours without crashing. Productivity work and everything else seems fine. It just seems to be when machine is pushed hard.

For Critical Events in Reliability Monitor, I was about to report that it only shows "Windows wasn't properly shut-down" (after each lockup), but closer inspection now reveals other Criticals like "Hardware Error". This warrants further analysis for sure. I'm embarrassed to say I don't use this tool as much as I should. It's definitely easier to find the errors there than with Event Viewer.

Yes, I tried BlueScreenView (and another similar one). Best I can tell, Crash Dumps are working but they never get written during this event. It's like the machine hard-locks before the OS can even BSOD or write a dump file. 
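
A quick way to confirm that nothing is being written is to list the dump folder directly. The sketch below assumes the default small-dump location, `C:\Windows\Minidump`; if the dump path was changed in the Startup and Recovery settings, pass that folder instead. The function name is hypothetical.

```python
from datetime import datetime
from pathlib import Path


def list_dumps(folder=r"C:\Windows\Minidump"):
    """Return (file name, modified time) for each .dmp file, oldest first.

    Returns an empty list if the folder does not exist, which is also what
    you would see if Windows has never written a minidump.
    """
    root = Path(folder)
    if not root.is_dir():
        return []
    dumps = sorted(root.glob("*.dmp"), key=lambda p: p.stat().st_mtime)
    return [(p.name, datetime.fromtimestamp(p.stat().st_mtime)) for p in dumps]


if __name__ == "__main__":
    found = list_dumps()
    if not found:
        print("No minidump files found.")
    for name, modified in found:
        print(f"{modified:%Y-%m-%d %H:%M}  {name}")
```

An empty result after a lockup matches what is described above: the machine freezes before Windows gets the chance to bug-check and write anything.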

8 Wizard • 17K Posts

March 20th, 2017 19:00

CometusCrimsonion wrote:

... 

I wouldn't put it past the SSD being trouble though. These last 6 months of the 40+ computers I've ordered for work, 20 of them have had PM9X1 failures/corruptions. If it wasn't late on a Friday, I'd remember the bloody Event ID it keeps tripping, but it's pretty obvious and has something to do with the SSD controller.

 

Well, that’s disheartening to hear. Overall, I’ve had excellent performance from my Samsung SSDs over the years. Maybe I’ve just been lucky. Also, I’m pretty sure Apple still uses custom Samsung SSDs in their machines. I think the controller in the 960/961 is all new, and maybe even a new memory type.

 

I thought I had done well in the Aurora-R6 SSD-lottery with my 512gb PM961 (at least the price was right). I read the posts about NVMe SSD setup challenges with these Dells. I thought it would be nice to finally order a machine that doesn’t have to be upgraded as soon as I open the box (nowadays, I refuse to use a machine without a bootable SSD).

I see a lot of SSD firmware upgrades (but no Samsung) for Aurora-R5 but none for the Aurora-R6 yet. Not sure if that is good or bad. If there was a required firmware upgrade for the PM961, you would think Dell would follow suit and not only have access to it, but actually post it.

 

Personally, I’m more leery of the (tricked out) Intel-RST (Rapid Storage Technology) driver-suite. While it has been historically unstable (not what you want in a HDD/SSD driver), maybe Intel finally has it right with this Dell pre-installed v15.2.2.1030. Honestly, I’m still a little unsure how all the NVMe stuff works, so I’m not confident enough to try to remove Intel-RST (if it even can be). I’ve got some other M.2-SATA machines here, but this is my first with M.2-NVMe.

Please let me know if you hear anything more about this subject. Thanks again for taking the time to help me.

March 21st, 2017 05:00

Intel RST does not support NVMe drives. you can uninstall the driver.

NVMe drives have their own drivers. see here:

Tool & Software | Download | Samsung V-NAND SSD 

the only problem is, you have an OEM SSD and i don't know if the Samsung driver for consumer SSDs works. if not, try to find another one or, you know, contact Dell tech-support.

as for hard lockup: try with your own advice. remove the SSD, use another hard drive, install windows and see what happens.

8 Wizard • 17K Posts

March 21st, 2017 10:00

So, you think Intel-RST is only installed for the secondary 1tb spinner HDD? I did notice the BIOS is set to RAID-Mode, but again, like you said ... that's just for SATA drives? I used to think that but recently came across some conflicting info.

Well, either Windows-10 supports NVMe natively, or you need an "F6 Driver", right? It must exist or it would not be working now.

Yeah, I was waiting for someone to write my "words of wisdom" back to me ... "you should just Nuke-and-Pave that thing Tesla and get back to us."

March 21st, 2017 16:00

Haha, Yes, your system is the br0kenz, I'll send you an address where you can return it.

I keep trying to post this reddit thread everywhere, because it's decently useful information IMHO. To get the information out there as much as possible, here are a few takeaways:

1) it really doesn't matter, from a purely performance mindset, if your computer is AHCI vs RAID in the BIOS. If you're going for simple Windows usage, just use whichever freggin works.

2) AHCI mode will be the only mode where you can install Samsung's drivers - RAID mode uses generic stuff

3) I THINK (as in, my HYPOTHESIS, but don't really know) the biggest issue in general for people is secure boot ON with AHCI, causing no one to be able to F6 their way into installing drivers...

But basically don't worry too much about it, just find whichever works (and theoretically the generic windows should be "easiest" under RAID)

After 30 years in this industry, when I was going through this process for my new-out-of-box Samsung 960 Pro, I felt absolutely clueless and immensely frustrated. In the end, after 45 minutes of various stages of failed installs/boot, I just cloned the original dell M.2 and called it a day. Luckily I had a laptop with 2x m.2 slots. This is waaaay too hard of a process for 2017 & current tech...

If it helps, I followed the insane set of hoops to jump through for downloading and extracting the Samsung drivers here:

I don't entirely remember what happened (round 9 of 30 attempts or some such) but I was able to get win 10 installed, it just wouldn't boot. (this might be the Secure boot:on issue, but TBH I really don't remember)

I do at least remember getting slightly excited that win10 install would at least see my drive finally... Ultimately, Aomei backupper clone of the original OEM drive is how I got my computer to work tho.

Side note: While we're troubleshooting the overall issue, and this SSD is an absolute time drain, do you have just a "traditional" SATA SSD or yikes-spindle-based hard drive to try, if even for 24 hours, just to confirm the motherboard/PSU/memory side of things is intact?

8 Wizard • 17K Posts

March 21st, 2017 19:00

Both of those links are interesting reads (I also PDFed them both).

Yeah, Macrium Reflect guys figured out similar, but I think slightly different and shorter. It should also be noted that Macrium Reflect bootable-flash works fine on my system as it was shipped from Dell (even in UEFI/SecureBoot mode).

Your cloning solution was ... well, like you said ... whatever works.

As for my problem, I don't think it has anything to do with the SSD. It works fine for heavy productivity work, stress tests, SSD benchmarks, etc. Never had a "disk error" or even a need to "soft repair" anything ... I do a disk-check on both drives after a lockup (it never finds anything wrong). Oh, yeah, it can do this:

Aurora-R6, Samsung PM-961 512gb SSD (M.2 PCIe NVMe), Win-10/64 Pro

March 30th, 2017 12:00

is the problem still not fixed?

is Intel RST still installed? could you please open the Device Manager and then go to Storage Controllers. what do you see there?

8 Wizard • 17K Posts

March 30th, 2017 13:00

Yes, Intel-RST (Rapid Storage Technology) v15.2.2.1030 is still installed (I haven't messed with it). Additionally, it does not seem to be causing a problem with the spinning 1tb HDD or interfering with the PCIe/NVMe SSD.

It has two listed:

Intel Chipset SATA RAID Controller
Microsoft Storage Spaces Controller
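
For anyone comparing their own machine, here is a minimal sketch that pulls the same list from a script and checks whether anything in it looks like an NVMe controller. It assumes PowerShell's `Get-CimInstance` with the `Win32_SCSIController` class returns the entries shown under Storage Controllers, and the substring match on "nvm" is only a rough heuristic of mine ("Standard NVM Express Controller" is used below purely as an example name).

```python
import subprocess


def has_nvme_controller(names):
    """Heuristic: does any controller name mention NVM/NVMe?"""
    return any("nvm" in name.lower() for name in names)


def storage_controller_names():
    """List storage controller names via PowerShell (Windows only)."""
    out = subprocess.run(
        [
            "powershell", "-NoProfile", "-Command",
            "(Get-CimInstance -ClassName Win32_SCSIController).Name",
        ],
        capture_output=True, text=True, check=True,
    ).stdout
    return [line.strip() for line in out.splitlines() if line.strip()]


# The two entries reported in this post:
reported = [
    "Intel Chipset SATA RAID Controller",
    "Microsoft Storage Spaces Controller",
]
print(has_nvme_controller(reported))
```

With the two names listed above, the check comes back false, meaning no separate NVMe controller entry shows up on this system.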

 

I updated my NVidia.com drivers again. This time with a Custom Clean-Install of latest WHQL driver and with only the base/minimum "components".

It seems to be running a little better now. However, I haven't had much time to game recently, so I think it's still too early to tell. Please recall that it only hard-locked while gaming (and it was fairly intermittent). I am getting fewer Critical Events in the Reliability Monitor report lately.

 

Thanks for checking-in on me.

March 31st, 2017 03:00

That's strange. no NVMe Controller listed. instead you have Intel Chipset SATA RAID Controller. not even Intel NVMe RAID or something like that.

it could be the NVIDIA Driver causing problems. every now and then Nvidia releases faulty Drivers. last year it was pretty much hit or miss. they released one hotfix after another....

come back if the System still crashes.

8 Wizard • 17K Posts

March 31st, 2017 10:00

Yes, it seems the Nvidia v375.63 driver that Dell Pre-installed is one of those particularly bad sets. It was just the luck-of-the-draw that it ended up in their current Aurora-R6 BootDisk pre-load Image.

They would cause an actual Windows Hardware Failure. If it happened during a Full-Screen Direct-X game, the computer would hard-lockup (no BSOD, no crash-dump files). It's been a long time since I've seen something like this, that wasn't truly malfunctioning hardware that needed to be replaced.

I'm reminded that this Aurora-R6 is one of my only truly dual-GPU systems (because of its Intel i7-7700k). All my other systems (even my laptops) are either one-or-the-other (Intel/AMD Internal GPU or dedicated external Nvidia/AMD GPU). I'm wondering if this has something to do with it.

The only way I know to ensure a system with an intermittent problem is "Fixed" is ... hours of stress-testing without a single recurrence of the failure. That's what I'm doing now.

Carbon Based Lifeform wrote:

it could be the NVIDIA Driver causing problems. every now and then Nvidia releases faulty Drivers. last year it was pretty much hit or miss. they released one hotfix after another....

   