
PowerEdge Servers - Processor issues: information and troubleshooting techniques


As with most systems, the processor is a key component in a server, computing instructions and managing other components such as memory or PCI buses. So when a CPU seems to be having issues, it can be very worrying.

However, physical processor failures are extremely rare. In fact, in the majority of processor replacements, the CPU shows no failure once tested individually. When a CPU does fail, it's usually caused by an electrical surge to the system, cascade failure from another major component failing, or thermal issues. Therefore, it is critical to follow key troubleshooting steps when a processor failure is suspected in order to properly identify the component at fault.


The information and steps provided in this article will help you understand the possible source of the issue.

With each generation of servers, the role of the processor has evolved to improve performance and reliability.

Note: Details on the processors supported by your PowerEdge server are provided in our Processor Information page.

Generation 11:

Most 11th-generation servers are equipped with Intel® Nehalem-EP processors. Nehalem-EP is the codename for the one- and two-socket server/workstation processor, with up to four cores, targeted at the Intel 5520 chipset-based platform (compatible with the Intel® Xeon® 5500 platform). Nehalem-EP is part of the family of 45 nm processors based on the Intel microarchitecture codenamed Nehalem. More information is available on the manufacturer's website, www.intel.com.

The main change with this microarchitecture is that the memory controller is now embedded in the processor. This affects the server's performance, and also which errors can be mistaken for processor errors (memory-related faults, for example).


Generation 12:

For this generation, servers equipped with Intel processors use the Sandy Bridge EP platform, which replaces the Nehalem microarchitecture. The integration of PCI-E lanes into the processor is a further step towards a multipurpose processing unit. More information is available online or on the manufacturer's website, www.intel.com.


Generation 13:

The 13th generation of PowerEdge servers features the Intel® Haswell EP product family, offering an ideal combination of performance, power efficiency, and cost. More information is available online or on the manufacturer's website, www.intel.com.

Note: Not sure which generation your PowerEdge server belongs to? Review our dedicated article to find out.

Since the processor interacts with all the components in a server, the symptoms and errors that can occur vary widely.
Here are some examples of common CPU issues with technical articles and troubleshooting steps:


1. No POST:
The server does not complete the Power-On Self-Test (POST). This means that a component is preventing the server from starting during the self-test.
Here are some steps to follow to narrow down the list of components that could cause this:
  • Look for a possible error message on the LCD panel or LED lights on the front of the server. If an error message is available, it will provide some valuable information. You can review the CPU related error messages page or type this error message in a search engine to find more information.
  • Remove all ESD (electrostatic discharge) from the server:
    1. Turn off the server (hold the power button for 30 seconds).
    2. Disconnect all cables from the server, including the power cable.
    3. Hold the power button down for 60 seconds to discharge.
    4. Reconnect the power cable and video cable only.
    5. Try to power on the server.
  • If the processor has recently been changed or reinstalled, or might be physically damaged, visually check inside the chassis for damage (to the CPU or the CPU socket on the motherboard, for example).
  • Minimum to POST: Since a component might be causing the No POST situation, removing all unnecessary components to complete POST is an efficient technique.
    The list of minimum components will vary depending on the model of server you currently have. Usually this will include: Power Supply, Motherboard, 1 CPU, 1 DIMM. For the exact list of components you can review the user manual for your Dell PowerEdge server.
Warning: If you want to remove and/or reinstall the CPU in your server, you must ensure that you use the appropriate tools. Use our CPU video archive to see the detailed steps:
- Processor Video Archive for PowerEdge servers.
- Learn how to Avoid ESD (Electrostatic Discharge) damage when manipulating components.
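
The minimum-to-POST step above can be planned before opening the chassis. The following is a minimal sketch in Python, assuming the "usual" baseline quoted in this article (power supply, motherboard, 1 CPU, 1 DIMM); the component names and the inventory dictionary are hypothetical, and the exact baseline for a given model should be taken from its user manual.

```python
# Baseline taken from the article's "usual" minimum-to-POST list.
# Assumption: the real list for a given model may differ; check the manual.
MINIMUM_TO_POST = {"power_supply": 1, "motherboard": 1, "cpu": 1, "dimm": 1}


def components_to_remove(installed):
    """Return how many of each component to pull to reach the baseline.

    `installed` maps a (hypothetical) component name to the quantity fitted.
    Components that are not part of the baseline are removed entirely.
    """
    to_remove = {}
    for name, count in installed.items():
        keep = MINIMUM_TO_POST.get(name, 0)
        extra = count - keep
        if extra > 0:
            to_remove[name] = extra
    return to_remove


def missing_for_post(installed):
    """Return baseline components that are absent or under-populated."""
    return {
        name: need - installed.get(name, 0)
        for name, need in MINIMUM_TO_POST.items()
        if installed.get(name, 0) < need
    }


if __name__ == "__main__":
    # Illustrative inventory only, not taken from a real server.
    server = {"power_supply": 2, "motherboard": 1, "cpu": 2, "dimm": 8,
              "raid_controller": 1}
    print(components_to_remove(server))
```

If the server completes POST in this reduced configuration, the component blocking POST is likely among those that were removed.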

2. Thermal Issues:

The symptoms of thermal issues vary widely: a temperature, fan, or heatsink error message on the LCD panel; the server turning off after a period of time and not turning back on right away; or system fans running at full speed all the time. Examples of error messages on a Dell PowerEdge server:

LCD panel error messages:
  • E0119 - Temp CPU
  • E0119 - Temp PROC
  • E1414 - CPU # Thermtrip
  • E1119 - Chipset # temp out of range. Check motherboard heatsinks

System Event Logs:
  • CPU0001 - CPU has a thermal trip (over-temperature) event
  • CPU0010 - The CPU is throttled due to thermal or power conditions.

For more information on CPU related error messages, you can take a look at our dedicated CPU error page.
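
When reviewing a long log export, it can help to pick out the thermal codes automatically. The sketch below is one possible approach in Python, not a Dell tool: the lookup tables contain only the example codes listed above, and the function name `find_thermal_codes` and the regular expression are assumptions made for illustration.

```python
import re

# Codes and descriptions copied from the examples in this article only.
LCD_CODES = {
    "E0119": "Temp CPU / Temp PROC",
    "E1414": "CPU # Thermtrip",
    "E1119": "Chipset # temp out of range. Check motherboard heatsinks",
}
SEL_CODES = {
    "CPU0001": "CPU has a thermal trip (over-temperature) event",
    "CPU0010": "The CPU is throttled due to thermal or power conditions.",
}

# Assumed shape of a code: "E" or "CPU" followed by exactly four digits.
CODE_PATTERN = re.compile(r"\b(E\d{4}|CPU\d{4})\b")


def find_thermal_codes(text):
    """Return (code, source, description) for each known code found in text."""
    hits = []
    for code in CODE_PATTERN.findall(text):
        if code in LCD_CODES:
            hits.append((code, "LCD panel", LCD_CODES[code]))
        elif code in SEL_CODES:
            hits.append((code, "System Event Log", SEL_CODES[code]))
    return hits


if __name__ == "__main__":
    for hit in find_thermal_codes("E1414 - CPU # Thermtrip"):
        print(hit)
```

Codes that are not in the two tables are ignored, so a line containing a different CPU-related code returns no thermal hit and should be looked up on the CPU error page instead.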

Here is a list of key points to check in case of thermal issues:

  • Check the LCD and ESM for any additional error messages to identify the component causing issues.
  • Ensure the airflow to the server is not blocked. Placing the server in an enclosed area or blocking the vent holes can cause it to overheat. If it is installed in a rack, make sure the rack cooling system is working properly.
  • Verify the ambient temperature is within acceptable levels.
  • Check the internal system fans for obstructions and verify all fans are spinning properly. Swap any failing fans with a known-good fan for testing.
  • Verify that any required shroud and blanks are installed (power supply, hard drive, DIMM, riser, fan, etc.).
  • If all of the fans are spinning properly, verify that the heatsink is installed correctly and thermal grease is applied.
  • For multi-processor servers, you can test each processor in turn in the first socket.
Warning: If you want to remove and/or reinstall the CPU in your server, you must ensure that you use the appropriate tools. Use our CPU video archive to see the detailed steps:
- Processor Video Archive for PowerEdge servers.
- Learn how to Avoid ESD (Electrostatic Discharge) damage when manipulating components.

3. Errors in the logs of the server:

As mentioned previously, the first step in troubleshooting any issue or error is to review the server logs for possible error messages. Our article Error Messages in System Event Log and how they can be viewed will guide you on how to access these logs.

Another example of error messages that refer to the CPU is CPU IErr (for example "E1410 CPU IErr was asserted"). This is usually not an error with the CPU itself, but a sign that the CPU has detected an error in the system, or received an erroneous instruction from a system component. It could be the memory, PCI-E slots, etc.

For more information on this type of error and some troubleshooting steps, you can read our dedicated article: Troubleshooting CPU Internal Error (CPU IErr) on PowerEdge Servers

4. Errors in the OS

Within the operating system, the symptoms of a possible CPU issue vary widely: slow performance, random reboots, or CPU errors in the operating system's system logs.
For PowerEdge servers, there are a few key elements to ensure optimal usage of the processor by the operating system:

  • Ensure the physical memory configuration of the server is correct as this will have an impact on the processor. The right DIMM must be in the right slot in the right channel for each processor and the total memory size must be balanced between the channels and the processors.
  • Check the memory configuration in the BIOS. Different settings are available depending on the type of behaviour you are looking for (Advanced ECC, Memory Optimized, Mirror). The required physical memory configuration can differ for each setting, so it is important to verify it.
  • The server BIOS and iDRAC must be up to date. Any improvement or fix that could impact the processor will be done through a BIOS update so it's very important, when faced with a possible CPU issue, to update the BIOS of the server. The Embedded Server Management (also called BMC or iDRAC depending on the generation) is also an important element to have updated as it directly interacts with all the components in the server.
    • Important: Updating the BIOS of the server requires a reboot of the server.
    • Article explaining the different server update methods available: SLN293301.
  • Review the operating system provider's website to ensure the hardware is on its Hardware Compatibility List.
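
The memory-balance point in the list above can be checked against a written inventory of the installed DIMMs. The following Python sketch assumes a hypothetical inventory format of `(cpu, channel, size_gb)` tuples; it only compares totals across populated processors and channels, and it does not validate slot order or the rules of a specific BIOS memory mode, which remain model-specific.

```python
from collections import defaultdict


def check_memory_balance(dimms):
    """Check that installed memory is spread evenly.

    `dimms` is a list of (cpu, channel, size_gb) tuples, one per installed
    DIMM. This layout is hypothetical and chosen only for illustration.
    Returns a list of warnings; an empty list means no imbalance was found.
    """
    per_cpu = defaultdict(int)
    per_channel = defaultdict(int)
    for cpu, channel, size_gb in dimms:
        per_cpu[cpu] += size_gb
        per_channel[(cpu, channel)] += size_gb

    warnings = []
    # Total memory should be the same behind every populated processor.
    if len(set(per_cpu.values())) > 1:
        warnings.append(f"Unequal memory per processor: {dict(per_cpu)}")
    # Within each processor, every populated channel should hold the same total.
    for cpu in sorted(per_cpu):
        sizes = {ch: gb for (c, ch), gb in per_channel.items() if c == cpu}
        if len(set(sizes.values())) > 1:
            warnings.append(f"Unequal memory per channel on CPU {cpu}: {sizes}")
    return warnings


if __name__ == "__main__":
    # Illustrative inventory only: CPU 1 has mismatched channels.
    inventory = [(1, "A", 16), (1, "B", 8), (2, "A", 16)]
    for warning in check_memory_balance(inventory):
        print(warning)
```

Because the function only sees populated slots, a processor or channel with no DIMMs at all is not reported; that case still needs to be checked against the server's manual.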

More technical content in our PowerEdge Knowledge Resources


Article ID: SLN298206

Last modified: 05/19/2017 08:38 AM

