
May 28th, 2009 16:00

Greetings Forum Readers,

This is more of a response to the responses to the original post than to the post itself.  I hope this helps to clarify a few things.

CPU IERR and PERR errors can happen for a variety of reasons, and sometimes they are not related to the physical processor(s) at all.  In general, CPU IERR and PERR errors happen when the processor (or the motherboard as a whole) detects something that isn't compatible with the rest of the hardware, and the problem is significant enough to halt the server and/or throw errors.  Really, it's a very vague error... unfortunately.  That doesn't mean it's not fixable, but special attention should be paid to the following:

1. Recent hardware additions to or changes in the server base hardware configuration (PCI cards, additional processors, memory, etc.)
2. ANY OS changes that affect how the OS addresses the processor(s)
3. Failures on the PCI Risers, the memory boards and/or risers, NICs, HBAs (storage controllers and the like), and any other PCI device not originally ordered with the server

Of course, this is not an exhaustive list, but it is an excellent place to start.  These errors CAN also point to specific processors, so be sure to note the entire error message when you see it ("CPU IERR E1410" as opposed to just "CPU IERR").

Now to the nitty gritty.  VMware, Hyper-V, Xen and XenServer, and Microsoft Virtual Server all have limits on what can and can't be used for virtualization.  The main thing to take from this is that the number of vCPUs (virtual CPUs) allocated to any SINGLE VM cannot exceed the number of physical CPU cores available in the server.  Each core in a given CPU counts as an individual CPU as far as VMware is concerned.  So, if you have (2) physical CPUs in the server, each a quad core, VMware sees this as 8 CPUs.  Typically, you would allocate at most n-1 vCPUs, with "n" being the total number of cores in your server.  This is because you always want to leave at least ONE core for the host OS.

If you set any individual VM to have more vCPUs than are physically available, it can throw a CPU IERR error.  This is because VMware is attempting to address CPUs that don't exist.
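The n-1 rule above can be sketched in a few lines of shell.  This is just an illustration using the example from this post of two quad-core CPUs (8 cores); on a live Linux host you would get the core count from `nproc` instead of hard-coding it:

```shell
# Sketch of the n-1 vCPU rule.  The values here are examples, not live data.
cores=8                    # on a real host: cores=$(nproc)
max_vcpus=$((cores - 1))   # leave one core for the host OS
requested=8                # vCPUs planned for a single VM (example)
if [ "$requested" -gt "$max_vcpus" ]; then
  echo "Too many vCPUs: $requested requested, $max_vcpus allowed"
fi
```

With the example values this prints a warning, since 8 vCPUs on an 8-core host leaves nothing for the host OS.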

Next, the 64-bit versions of these operating systems.  The newer Dell servers (2900, 2950, R200, etc.) come with processors able to handle 64-bit extensions, and can run 64-bit guest OSs under VMware, Hyper-V, etc.  On these servers, in order to run 64-bit VMs, you absolutely must have VT enabled on the processor.  Virtualization Technology not only provides fully virtualized environments in hypervisors such as Xen (for SuSE and RH), but also enables running 64-bit guest operating systems under VMware.  If VT isn't enabled, this may trigger CPU IERR errors on your server.
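On a Linux host, one way to see whether VT is usable is to look for the "vmx" flag in the flags line of /proc/cpuinfo (when the BIOS has VT disabled, the flag is hidden even on capable CPUs).  A minimal sketch, using hard-coded example flag strings so it runs anywhere:

```shell
# Sketch: check a cpuinfo flags string for the Intel VT-x "vmx" flag.
# On a live host you would use: flags=$(grep -m1 '^flags' /proc/cpuinfo)
has_vt() {
  case " $1 " in
    *" vmx "*) echo "VT-x present" ;;
    *)         echo "no vmx flag - enable Virtualization Technology in the BIOS" ;;
  esac
}
has_vt "fpu vme pae msr vmx sse2"   # example flag list with VT
has_vt "fpu vme pae msr sse2"       # example flag list without VT
```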

These are two very common causes of IERR and PERR errors on servers running virtualization software such as VMware.  You need to provide the server with the hardware needed to support your VM workload.

To quickly address issue #2 from the original post: the LCD screen on the server reports errors from the server's ESM log, which is the System Event Log for the hardware.  The DSET reporting utility can be downloaded to pull this log (so you can see what's actually being reported without reading it on a small screen) and can also be used to CLEAR it.  Clearing the ESM log with DSET clears the error from the LCD on the front of your server, returning the light to blue.  This is important to do after you fix an issue; if you don't, the light stays amber and you will not see when/if another error is being reported.  The DSET reporting utility can be downloaded from http://support.dell.com/dset
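As a side note, on Linux hosts the same System Event Log can usually be pulled with the open-source ipmitool (`ipmitool sel list` to dump it, `ipmitool sel clear` to clear it), assuming the IPMI drivers are loaded.  A small sketch of filtering such an exported log for processor-related entries; the log lines below are invented examples, not real DSET or ipmitool output:

```shell
# Hypothetical sketch: filter an exported ESM/SEL log for processor entries.
# The log lines here are made-up examples for illustration only.
cat <<'EOF' > /tmp/esm.log
1 | 05/10/2017 | 13:48:12 | Processor | Transition to Non-recoverable
2 | 05/10/2017 | 13:50:02 | Fan 1 RPM | Lower Critical going low
EOF
grep -i 'Processor' /tmp/esm.log
```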

As always, it's important to keep your server up to date, including the BIOS and BMC revisions as well as your RAID controller firmware (PERC, CERC, etc.), so that these issues can be avoided.  These firmware revisions are written to address known issues that arise over the server's lifecycle.  The updates can be downloaded from http://support.dell.com.  Keeping your server up to date, especially after anything changes in the configuration, OS setup, etc., helps prevent CPU IERR errors from happening.  Also, be sure that ANY device you plan on putting in your server is officially supported by the Dell hardware you are connecting it to.  Servers have a much more specific purpose than the broader array of machines that are manufactured, and many PCI cards will simply not work, or worse yet, will cause your server to stop working.
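The "am I current?" part of this check can be sketched with `sort -V` (GNU coreutils version sort) to compare firmware version strings.  The 2.0.1 / 2.7.0 values are just the BIOS example from later in this thread, and on a live host you would read the installed version with `dmidecode -s bios-version`:

```shell
# Sketch: compare installed vs. latest firmware versions with sort -V.
# Example values only; on a real host: installed=$(dmidecode -s bios-version)
installed="2.0.1"
latest="2.7.0"
oldest=$(printf '%s\n' "$installed" "$latest" | sort -V | head -n1)
if [ "$oldest" = "$installed" ] && [ "$installed" != "$latest" ]; then
  echo "BIOS $installed is older than $latest - update recommended"
fi
```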

Good luck!

 

October 15th, 2009 07:00

Just an update... or rather a non-update.

I too was plagued by the E1420 CPU BUS PERR error on a PowerEdge 2950 running Windows Server 2003 R2 SP2 with the latest updates and Virtual Server 2005 SP1 with 7 Windows Server 2003 VMs.

It took all of 20 seconds for my Dell tech to send me an email with instructions that were posted here a year ago.  I guess I'll update you in a month to say whether it worked.

Description

Microsoft Virtual Server 2005 may produce hardware errors on Intel®-based 9th-generation PowerEdge systems, in particular the PowerEdge 1950III, 2950III, and 2900III.  The hardware fault reported in most instances is E1420 - CPU BUS PERR.

 
Solution
 

When working on the issue perform the following steps to get into a Microsoft supported configuration:

1. Update to the latest Service Pack (SP) or patch level for Microsoft Virtual Server 2005.

2. Enable Intel Virtualization Technology in the BIOS.

3. Verify that the host operating system is supported.

4. Verify that the guest operating system(s) are supported. Refer to the following Microsoft KB article: 948515

5. Exclude virtual machine directories from any anti-virus scanning.

6. Update the Dell hardware to the latest BIOS and Base Management Controller (BMC) / Dell Remote Access Card (DRAC) firmware.

 

Note: Dell has not tested or validated Microsoft Virtual Server on Dell PowerEdge servers. The information put together here is a compilation of Microsoft and Dell requirements and recommendations. Where applicable, external links to the main sources have been provided for further information.


December 7th, 2009 11:00

Hi SpiderPlant, so have you run into this error lately?

December 7th, 2009 13:00

Actually the Virtual Host has now been running with no issues since I made this post.

Thank god!


May 10th, 2017 15:00

A PowerEdge 2950 II running VMware ESXi, 6.0.0, 5050593 Image Profile (Updated) ESXi-6.0.0-20170202001-standard ran without issue for quite some time, and the underlying hardware has had no issues for several years.  Recently, an Intel 350T2V2 NIC was installed and configured for use, then a Dell SAS 6 GB HBA External Controller Card 7RJDT was installed.  Neither installation had a negative impact on system stability.

Next, upon replacing four (4) Crucial 4GB 240 Pin 512Mx72 DDR2 PC2-5300 CL5 ECC DIMMs with eight (8) A-TECH 8G DDR2 PC2-5300 ECC FULLY BUFFERED DIMMs, the BIOS memory check passed, but seemed to proceed very (very) slowly.  ESXi started to boot, but took an extraordinarily (very) long time at the /sb.v00 and /s.v00 steps of the "Loading VMware Hypervisor" stages.  Eventually, and a (very) long time later, a message appeared stating "Relocating modules and starting up the kernel...".  Again, a significant amount of time transpired.  Then the screen blacked out and this appeared:

 VMB: 398: Unexpected exception 2 @0x41800e06957e

VMB: 405: cr0 0x8001003d cr2 0x0 cr3 0x100803000 cr4 0x30

VMB: 407: error code 0x2 rip 0x41800001eee0 cs 0x8

VMB: 409: rflags 0x86 rsp 0x42800001eee0 ss 0x0

VMB: 411: rax 0x12345678 rcx 0x101ffff rdx 0xffff4c000

VMB: 413: rbx 0x0 rbp 0x0 rsi 0x1000

VMB: 415: rdi 0xffff81100004c000 r8 0x2 r9 0x23

VMB: 417: r10 0x8000000000000003 r11 0x0 r12 0xffff4c

VMB: 419: r13 0x420000045221 r14 0xd r15 0x0

VMB: 420: gs 0x10 fs 0x10

VMB: 422: FSbase:0x0 GSase:0x417rce236200 kernelGSbase:0x0

VMB: 139: [0x42800001eee0] 0x41800e06957e

VMB: 139: [0x42800001ef00] 0x41800e06a0ad

VMB: 139: [0x42800001ef900] 0x41800e814c24

VMB: 139: [0x42800001efc0] 0x41800e000fb8

VMB: 85: Halting.

At the same time, the PowerEdge 2950 front panel LCD switched from blue to amber and reported:

   E1420 CPU BUS PERR

At this point the system is dead and must be powered off.

The RAC System Event Log shows entries like:

   Entry 007 of 007

  Severity: Non-Recoverable

  Date and Time: Wed May 10 13:48:12 2017

  Description:

  CPU Bus PERR: Processor sensor, transition to

  non-recoverable was asserted.

It was noted that the BIOS version was 2.0.1 (much, much older than the apparent latest 2.7.0).

After updating the BIOS to 2.7.0 via a CentOS 6.5 i386 Live DVD in conjunction with `yum install compat-libstdc++-33.i686` and the December 2015 Dell Server Update Utility DVD, the problem is gone.  (The SUU was used to update other system components later, after finding out that the BIOS update is what fixed it.)  The release notes mention CPU microcode changes and better compatibility with "some DIMMs".

