1 Rookie

 • 

2 Posts

January 29th, 2014 00:00

"Transition to non-recoverable" machine check error

Hi. I am currently using a PowerEdge R815 with four AMD Opteron 6272 processors.

The system halted, and the system event log shows the following entry:

Sensor ID : CPU Machine Chk (0xd)

Description : Transition to Non-recoverable

Entity ID : 34.1

Generator ID : 00b1

I want to know the error code for this machine check error, as described in the AMD specification document (BKDG for Family 15h).

As far as I know, when a machine check error occurs, I can read the error code from the MCA status registers (e.g. MC0 to MC6) after the next warm reset.

But after the warm reset, no machine check error was recorded in the MCA status registers. (I wrote a Linux device driver to read them.)
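For anyone checking what such a driver returns, here is a minimal Python sketch of how a raw MCi_STATUS value could be decoded. It assumes the architectural x86 MCA layout (bank i starts at MSR 0x400 + 4*i, Val is bit 63, the error code is bits 15:0, and the extended error code is taken as bits 20:16). The function names are illustrative only, and the field positions should be verified against the BKDG for the exact processor.

```python
# Sketch only: decode a raw 64-bit MCi_STATUS value.
# Assumes the architectural x86 MCA layout; verify against the BKDG.

MC0_CTL = 0x400  # first MCA bank register; each bank has 4 MSRs


def bank_msrs(bank):
    """Return the MSR addresses (CTL, STATUS, ADDR, MISC) of one bank."""
    base = MC0_CTL + 4 * bank
    return {"ctl": base, "status": base + 1, "addr": base + 2, "misc": base + 3}


def decode_status(value):
    """Split MCi_STATUS into the flags and codes needed for triage."""
    def bit(n):
        return bool((value >> n) & 1)

    return {
        "valid": bit(63),            # Val: the bank holds a logged error
        "overflow": bit(62),         # a further error arrived while Val was set
        "uncorrected": bit(61),      # UC
        "enabled": bit(60),          # En: reporting was enabled in MCi_CTL
        "misc_valid": bit(59),       # MCi_MISC holds extra information
        "addr_valid": bit(58),       # MCi_ADDR holds the failing address
        "context_corrupt": bit(57),  # PCC: processor context may be corrupt
        "ext_error_code": (value >> 16) & 0x1F,  # assumed to be bits 20:16
        "error_code": value & 0xFFFF,
    }
```

With an invented value such as 0xB200000000030E0F (not taken from this system), decode_status reports a valid, uncorrected, enabled error with extended code 3 and error code 0x0E0F. If "valid" is False for every bank, there is nothing logged to decode.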

My first question: is my procedure for getting the error information from the AMD MCA registers correct?

My second question: if so, how can I get the error code that caused the machine check error?
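As a cross-check on a custom driver, the same registers can usually be read from user space through the Linux msr driver. The sketch below assumes the msr kernel module is loaded, that it runs as root, and that /dev/cpu/0/msr exists; the helper names are made up for illustration. One possibility worth ruling out (an assumption, not something confirmed in this thread) is that firmware or the kernel's own machine-check code has already read and cleared the banks before a driver gets to look.

```python
import os
import struct

MCG_CAP = 0x179     # bits 7:0 hold the number of MCA banks
MC0_STATUS = 0x401  # MCi_STATUS is at 0x401 + 4*i


def read_msr(path, reg):
    """Read one 64-bit MSR; the file offset selects the register."""
    fd = os.open(path, os.O_RDONLY)
    try:
        raw = os.pread(fd, 8, reg)
    finally:
        os.close(fd)
    if len(raw) != 8:
        raise OSError(f"short read for MSR {reg:#x}")
    return struct.unpack("<Q", raw)[0]


def valid_banks(path="/dev/cpu/0/msr"):
    """Return {bank: status} for banks whose Val bit (63) is set."""
    count = read_msr(path, MCG_CAP) & 0xFF
    found = {}
    for bank in range(count):
        status = read_msr(path, MC0_STATUS + 4 * bank)
        if status >> 63:
            found[bank] = status
    return found
```

Calling valid_banks() for each /dev/cpu/N/msr device right after the warm reset gives a view that does not depend on the driver under development. An empty result on every CPU suggests the banks were cleared, or never written, before the read.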

 

My third question: what does "Transition to Non-recoverable" mean?

 


Thanks.

7 Practitioner

 • 

9.7K Posts

 • 

48K Points

January 29th, 2014 06:00

Pat,

There are a few things that can cause this error. It doesn't necessarily mean there is an issue with the processor itself. A CPU Machine Chk is logged when the processor detects an error while executing instructions, or when the system tells it to raise one. I would like to get some additional information from you to help diagnose what's causing the issue.

What is the OS on the system?

Are you able to boot the server, and if so when do you get the error or halt?

Lastly, what revisions are the BIOS and the ESM/DRAC firmware at? I ask because I have seen out-of-date systems cause similar issues.

Also, this link breaks down the error and has some good information: ftp://ftp.dell.com/Manuals/all-products/esuprt_ser_stor_net/esuprt_poweredge/poweredge-7250_Reference%20Guide_en-us.pdf

Let me know.

1 Rookie

 • 

2 Posts

January 29th, 2014 07:00

Thanks for your fast reply.

As you mentioned, many things can cause the machine check.
I am currently developing a Linux device driver running on CentOS 6.2, and I think it may be causing the machine check. I just want to know the exact error code for the machine check, as in the following excerpt from the specification:

2301 Rev 3.14 - January 23, 2013 BKDG for AMD Family 15h Models 00h-0Fh Processors

| Error Type | Description | CTL¹ | ETG² | EAC |
|---|---|---|---|---|
| Master Abort | Master abort seen as result of link operation. Reasons for this error include requests to non-existent addresses, and requesting extended addresses while extended mode disabled (see D18F0x[E4,C4,A4,84][Addr64BitEn]). The NB returns an error response back to the requestor with any associated data all 1s independent of the state of the control bit. | MstrAbortEn | L | D |
| Target Abort | Target abort seen as result of link operation. The NB returns an error response back to the requestor with any associated data all 1s independent of the state of the control bit. | TgtAbortEn | L | D |
| GART Error | GART cache table walk encountered a GART PTE entry which was invalid. | GartTblWkEn | L | D |
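To relate a logged status value to rows like these, the low 16 bits of MCi_STATUS can first be sorted by format. The Python sketch below assumes the standard x86 MCA error code encodings (TLB: 0000 0000 0001 TTLL, memory: 0000 0001 RRRR TTLL, bus: 0000 1PPT RRRR IILL). The function name is illustrative, and which format and extended error code each row above actually uses has to be looked up in the BKDG.

```python
def classify_error_code(code):
    """Return the MCA error code format for MCi_STATUS bits 15:0."""
    code &= 0xFFFF
    if code & 0xF800 == 0x0800:  # 0000 1PPT RRRR IILL
        return "bus"
    if code & 0xFF00 == 0x0100:  # 0000 0001 RRRR TTLL
        return "memory"
    if code & 0xFFF0 == 0x0010:  # 0000 0000 0001 TTLL
        return "tlb"
    return "other"               # not one of the three formats above
```

For example, classify_error_code(0x0E0F) returns "bus". The value is invented for illustration and is not a reading from this server.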

 

This is from the AMD specification for the Opteron 6272, which is used in the Dell PE R815.
I know that the cause is obviously software (specifically, my device driver), not hardware.
So my questions are:
1. How can I get the error code that caused the machine check error?
2. What exactly does "Transition to Non-recoverable" mean?
Any other advice is welcome too :-)
About your link: I think the reference guide is based on Intel architecture. Am I right?
I wonder whether it is useful for AMD architecture.
Does the reference guide apply to both Intel and AMD machines?

Thanks.

 

 
