March 24th, 2014 11:00
PE R710 with intermittent Core temperature above threshold error
I have a DELL R710 that intermittently gives the following errors within CentOS release 6.3 (Final):
Mar 24 11:43:04 swamp.mail.uwo.pri kernel: CPU10: Core temperature above threshold, cpu clock throttled (total events = 1)
Mar 24 11:43:04 swamp.mail.uwo.pri kernel: CPU22: Core temperature above threshold, cpu clock throttled (total events = 1)
Mar 24 11:43:04 swamp.mail.uwo.pri kernel: CPU22: Core temperature/speed normal
Mar 24 11:43:04 swamp.mail.uwo.pri kernel: CPU10: Core temperature/speed normal
Mar 24 11:45:24 swamp.mail.uwo.pri kernel: [Hardware Error]: Machine check events logged
Mar 24 11:48:05 swamp.mail.uwo.pri kernel: CPU6: Core temperature above threshold, cpu clock throttled (total events = 64905)
Mar 24 11:48:05 swamp.mail.uwo.pri kernel: CPU18: Core temperature above threshold, cpu clock throttled (total events = 64905)
Mar 24 11:48:05 swamp.mail.uwo.pri kernel: CPU6: Core temperature/speed normal
Mar 24 11:48:05 swamp.mail.uwo.pri kernel: CPU18: Core temperature/speed normal
Mar 24 11:50:24 swamp.mail.uwo.pri kernel: [Hardware Error]: Machine check events logged
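For anyone tracking a similar problem, here is a minimal Python sketch that tallies the throttle messages per logical CPU, which makes it easier to see whether the same CPUs keep showing up. It assumes the kernel message format matches the lines quoted above; the function name `summarize_throttle_events` is made up for this example.

```python
import re
from collections import defaultdict

# Matches kernel lines of the form quoted in the post, e.g.
# "... kernel: CPU10: Core temperature above threshold, cpu clock throttled (total events = 1)"
THROTTLE_RE = re.compile(
    r"kernel: CPU(?P<cpu>\d+): Core temperature above threshold, "
    r"cpu clock throttled \(total events = (?P<total>\d+)\)"
)


def summarize_throttle_events(lines):
    """Return {cpu: (times_logged, highest_total_events_seen)}.

    Hypothetical helper: 'times_logged' counts matching log lines, while
    'highest_total_events_seen' is the largest counter the kernel reported.
    """
    logged = defaultdict(int)
    highest = defaultdict(int)
    for line in lines:
        match = THROTTLE_RE.search(line)
        if match is None:
            continue  # skip "temperature/speed normal" and unrelated lines
        cpu = int(match.group("cpu"))
        logged[cpu] += 1
        highest[cpu] = max(highest[cpu], int(match.group("total")))
    return {cpu: (logged[cpu], highest[cpu]) for cpu in sorted(logged)}
```

The function accepts any iterable of lines, so an open file handle for the system log (on CentOS 6 typically /var/log/messages) can be passed in directly.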
The errors have been happening intermittently for about a week; sometimes they point to CPU 6 and CPU 12.
When the errors happen, no error message appears on the server's display, and there is no other indication of a failure apart from the log messages above; operation continues as normal.
We flashed the BIOS to 6.4.0 after the first error, with no change.
Ran full diagnostics on the machine with no errors.
Opened the machine and inspected the fans, airflow, etc.; there appeared to be no issues.
This server does NOT have Dell OpenManage installed; we are in the process of installing it.
Any thoughts or suggestions appreciated.


DELL-Josh Cr
Moderator
March 24th, 2014 14:00
Hi,
Is this a new install? It is possible that the system really is overheating. Once you get OMSA installed we should be able to tell from the hardware logs, though the diagnostics you ran should already have ruled that out. If it is not overheating, the installed version of mcelog may not be recognizing the processor properly. It may also help to turn off C-states and C1E in the BIOS processor options.
whatwave
April 9th, 2014 12:00
Josh:
This is not a new install; the machine has been running like this for at least 2 years with no errors.
We attempted to install OMSA but could not. The OS presently on the system is 32-bit and the diags are 64-bit, so we were not able to do the install.
As a long shot (and to see if the error would change) we swapped the two CPUs in the unit. Since then (approx. 2.5 weeks) we have not had any further errors.
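If the errors come back after a swap like this, it helps to know which socket and core each logical CPU number in the log belongs to, so you can tell whether the problem followed the processor or stayed with the socket. Here is a rough Python sketch that reads the standard Linux sysfs topology files. The helper names are invented for this example, and treating CPU pairs such as 6 and 18 as two threads of one core is only a guess until the output confirms it.

```python
from pathlib import Path


def cpu_topology(cpu, sysfs_root="/sys/devices/system/cpu"):
    """Return (physical_package_id, core_id) for one logical CPU.

    Hypothetical helper; sysfs_root is a parameter only so that the
    function can be pointed at a copy of the tree for testing.
    """
    topo = Path(sysfs_root) / f"cpu{cpu}" / "topology"
    package = int((topo / "physical_package_id").read_text().strip())
    core = int((topo / "core_id").read_text().strip())
    return package, core


def group_by_core(cpus, sysfs_root="/sys/devices/system/cpu"):
    """Group logical CPUs that share the same (socket, core) pair."""
    groups = {}
    for cpu in cpus:
        groups.setdefault(cpu_topology(cpu, sysfs_root), []).append(cpu)
    return groups
```

Calling `group_by_core([6, 18, 10, 22])` on the affected server would show whether each pair in the log maps to a single physical core and which socket that core sits in.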
Maybe it's fixed.
dave