T-420 server 2012 R2 backup causes disk failure eventIds: 153/140/14

Question

Running Windows Server Backup causes the following disk errors. We have a Raid 1+0 (4 disks) array on a Perc 310 controller:

EventID: 153 - System, disk
The IO operation at logical block address ef0 for Disk 1 (PDO name: \Device\00000044) was retried.

EventID: 140 - System, Microsoft-Windows-Ntfs
The system failed to flush data to the transaction log. Corruption may occur in VolumeId: C:, DeviceName: \Device\HarddiskVolume7.
(The I/O device reported an I/O error.)

EventID: 14 - System, volsnap
The shadow copies of volume C: were aborted because of an IO failure on volume C:.

DELL-Geoff P · Answer

NickC_UK,

I found a good article concerning the event 153 errors: http://blogs.msdn.com/b/ntdebugging/archive/2013/04/30/interpreting-event-153-errors.aspx

You may have a disk that needs to be replaced; I would run diagnostics on the drives just to confirm.

EventID: 140 also points to a hard drive issue; see http://social.technet.microsoft.com/Forums/en-US/85dda2c8-485a-45f1-b438-80720fb10a7e/ntfs-warning-id-50-id-140-user-profile-disk?forum=winserverTS

EventID: 14 - System, volsnap: this looks like it follows on from the previous two errors. Since it detects errors on the drive, it aborts writing the shadow copies.

Regards,

NickC_UK · Answer

These are brand new SAS disks. If there was a problem, surely the Raid controller would have detected it, wouldn't it? How do I check individual disks when they are part of a Raid array?

DELL-Geoff P · Answer

You can use the online diagnostic package, which will test the drives individually. It can be found here:

http://www.dell.com/support/drivers/us/en/19/DriverDetails/Product/poweredge-t420?driverId=TRWYD

Regards,

NickC_UK · Answer

Thanks Geoff, I have downloaded that and installed the update to 'Dell 64 Bit uEFI Diagnostics', but how do I run this online? Also, what happens to the raid array if I test one disk offline? I assume it will then need to be rebuilt, will it?

NickC_UK · Answer

In the meantime we have run a Disk Consistency check on the Raid array which failed with:

The Check Consistency found inconsistent parity data. Data redundancy may be lost.: Virtual Disk 0 (Virtual Disk 0) Controller 0 (PERC H310 Adapter)

The Raid controller is now resynching; it has been running for about 17 hrs and is 75% of the way through.

The issue is that the Raid controller was reporting that all disks were operating fine; without running the Raid controller's Disk Consistency check we wouldn't have known there was a problem. This is a brand new server, so how can it be that the Raid array is not in sync? Would it not have been synchronised before it left the factory?
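
As a rough guide to how much longer the resync might take, the figures quoted above (about 17 hours elapsed, 75% complete) can be extrapolated. This is only a sketch: it assumes the controller progresses at a constant rate, which may not hold in practice, and the function name is made up for illustration.

```python
def estimate_remaining_hours(elapsed_hours, fraction_done):
    """Linear extrapolation; assumes a constant resync rate (an assumption)."""
    if not 0 < fraction_done <= 1:
        raise ValueError("fraction_done must be in (0, 1]")
    total_hours = elapsed_hours / fraction_done
    return total_hours - elapsed_hours


# Figures from the post: about 17 hours elapsed at 75% complete.
remaining = estimate_remaining_hours(17, 0.75)
print(f"Roughly {remaining:.1f} hours remaining")
```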

NickC_UK · Answer

This problem is being caused by Windows Server Backup trying to back up from the Hyper-V host partition. Backup from a virtualised server works fine; it is just the Hyper-V host that has the problem.

We have also identified that this only happens when backing up to a non-raid disk in the same disk chassis as the raid array disks. Backup to an external disk and all works fine.

Backup fails as follows and then leaves the Raid array corrupted!

Backup failed as shadow copy on source volume got deleted. This might caused by high write activity on the volume. Please retry the backup. If the issue persists consider increasing shadow copy storage using 'VSSADMIN ShadowStorage' command.

EventID: 140 – System, Microsoft-Windows-Ntfs
The system failed to flush data to the transaction log. Corruption may occur in VolumeId: V:, DeviceName: \Device\HarddiskVolume9.
(The I/O device reported an I/O error.)

EventID: 153 - System, disk
The IO operation at logical block address c60 for Disk 0 (PDO name: \Device\00000043) was retried.

EventID: 157 - System, disk
Disk 2 has been surprise removed.

EventID: 517 – Application, Microsoft-Windows-Backup
The backup operation that started at '2014-03-20T15:37:24.984150300Z' has failed with following error code '0x8007045D' (The request could not be performed because of an I/O device error.). Please review the event details for a solution, and then rerun the backup operation once the issue is resolved.
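
The error code in event 517 is a standard HRESULT, so it can be split into its fields to see where the failure comes from. The sketch below assumes the usual HRESULT layout (top bit is the failure flag, bits 16 to 26 are the facility, low 16 bits are the code); the helper name is illustrative only. Facility 7 is the Win32 facility, and Win32 error 1117 is ERROR_IO_DEVICE, which matches the "I/O device error" text in the event.

```python
def decode_hresult(hresult):
    """Split a 32-bit HRESULT into failure flag, facility and code."""
    return {
        "failure": bool((hresult >> 31) & 1),   # top bit set means failure
        "facility": (hresult >> 16) & 0x7FF,    # 11-bit facility field
        "code": hresult & 0xFFFF,               # low 16 bits: the error code
    }


# Error code reported by Windows Server Backup in event 517.
fields = decode_hresult(0x8007045D)
print(fields)
```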

This is obviously an incompatibility between the Perc 310 Raid controller and Windows Server Backup 2012 R2.

Any known fixes?

DELL-Geoff P · Answer

You will need to increase the VSS cache size for backups. The error that is occurring (Backup failed as shadow copy on source volume got deleted. This might caused by high write activity on the volume. Please retry the backup. If the issue persists consider increasing shadow copy storage using 'VSSADMIN ShadowStorage' command.) means that the data stored in the cache is being deleted before it can be written to the backup location. Because the data is deleted before it is written and acknowledged as written, you get what is known as dirty cache. When the cache buffer fills up, it does a forced flush, which is supposed to flush out only the old data that has already been written to the drive, but for some reason it is flushing most or all of the data in cache instead of checking and flushing just the old cache data.

Let us know how it works.
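
For reference, the shadow copy storage limit mentioned above is changed with `vssadmin resize shadowstorage`. The sketch below only builds the command line; the drive letters and the 20% limit are placeholder values, not a recommendation for this server, and the helper name is made up. `/maxsize` also accepts `UNBOUNDED` or an absolute size such as `10GB`.

```python
def shadow_storage_resize_command(for_volume, on_volume, max_size):
    """Build a 'vssadmin resize shadowstorage' command as an argument list."""
    return [
        "vssadmin", "resize", "shadowstorage",
        f"/for={for_volume}",    # volume whose shadow copies are stored
        f"/on={on_volume}",      # volume that holds the shadow storage
        f"/maxsize={max_size}",  # e.g. "20%", "10GB" or "UNBOUNDED"
    ]


# Placeholder values: C: storing its own shadow copies, capped at 20%.
cmd = shadow_storage_resize_command("C:", "C:", "20%")
print(" ".join(cmd))
# To apply it on the server, run the printed command from an elevated prompt.
```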

NickC_UK · Answer

Just found a message elsewhere, which again was on a Dell Perc 310 controller. As no one other than Dell Perc 310 owners is reporting this problem, this strongly suggests it is down to the Perc 310 or its driver. Are there any driver or firmware updates in the pipeline which might cure this?

http://serverfault.com/questions/566591/windows-server-backup-keeps-failing-after-upgraded-my-os-from-windows-server-201/583909#583909

Rgds,

NickC_UK · Answer

All drives set to unlimited space for Shadow Copies. Latest event log errors below. It seems the root of the problem is that VSS is causing 'Disk 3 has been surprise removed', any idea why that is happening?

Rgds,
Nick

EventID: 157 - System, disk
Disk 3 has been surprise removed.

EventID: 153 - System, disk
The IO operation at logical block address 00 for Disk 1 (PDO name: \Device\00000044) was retried.

EventID: 140 - System, Microsoft-Windows-Ntfs
The system failed to flush data to the transaction log. Corruption may occur in VolumeId: C:, DeviceName: \Device\HarddiskVolume7. (The I/O device reported an I/O error.)

EventID: 14 - System, volsnap
The shadow copies of volume C: were aborted because of an IO failure on volume C:.

EventID: 1001 - Application, Windows Error Reporting
Fault bucket , type 0
Event Name: Windows Server Backup Error

EventID: 1001 - Application, Windows Error Reporting
Fault bucket , type 0
Event Name: Windows Server Backup Error

EventID: 519 - Application, Microsoft-Windows-Backup
The backup operation that started at '2014-03-22T16:44:46.777575300Z' has failed to back up volume(s) 'C:,RECOVERY,X:'. Please review the event details for a solution, and then rerun the backup operation once the issue is resolved.

NickC_UK · Answer

Hi Geoff,

Have now set Shadow Copies Limit to approx 20% of max size for all disks and volumes.

Not sure that has helped as we seem to be getting the following error more often now:

EventID: 157 - System, disk
Disk n has been surprise removed.

Just a thought but most of these virtual disks are dynamically resizing .vhdx which don't yet have a lot of data written to them. Could it be that VSS is trying to use this limit but the Vdisks have not been resized to have that much space yet?

Nick

NickC_UK · Answer

Have been doing more testing and the source problem is eventid: 157 - Disk n has been surprise removed.

Why is this happening?

I have enabled 'Disk' event logging but no events seem to get written into those logs. How can we establish why this disk is being surprise removed?

Log Name:      System
Source:        disk
Event ID:      157
Task Category: None
Level:         Warning
Keywords:      Classic
User:          N/A
Description:
Disk 2 has been surprise removed.
Event Xml:

    EventID: 157
    Level: 3
    Task: 0
    Keywords: 0x80000000000000
    EventRecordID: 20275
    Channel: System

    Data: \Device\Harddisk2\DR4
    Data: 2
    Binary: 0000000002003000000000009D000480000000000000000000000000000000000000000000000000

Followed by a whole load of:

Log Name:      System
Source:        disk
Event ID:      153
Task Category: None
Level:         Warning
Keywords:      Classic
User:          N/A
Description:
The IO operation at logical block address d90 for Disk 1 (PDO name: \Device\00000044) was retried.
Event Xml:

    EventID: 153
    Level: 3
    Task: 0
    Keywords: 0x80000000000000
    EventRecordID: 20276
    Channel: System

    Data: \Device\Harddisk1\DR1
    Data: d90
    Data: 1
    Data: \Device\00000044
    Binary: 0F0104000400340000000000990004800000000000000000000000000000000000000000000000000028042A

Disk 2 has been surprise removed.

NickC_UK · Answer

It seems that the 'EventID: 157 - System, disk, Disk n has been surprise removed' error is not the problem; many others have seen it elsewhere. The error that is the real problem is: EventID: 153 - The IO operation at logical block address d90 for Disk 1 (PDO name: \Device\00000044) was retried. The disk drive has now been tested in the same scenario in a spare HP server, also running 2012 R2, and WSB backs up to it fine, so this is strongly looking like a problem specific to this server or the Perc 310 controller.
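
The logical block addresses in the event 153 messages (ef0, c60, d90) appear to be hexadecimal. As a small sketch, they can be converted to decimal sector numbers and byte offsets to see how close to the start of the disk the retried I/O is. The 512-byte sector size is an assumption, since the thread does not state the sector size of these disks, and the helper name is illustrative.

```python
def lba_to_byte_offset(lba_hex, sector_size=512):
    """Convert a hex LBA string from event 153 to a byte offset on the disk.

    sector_size defaults to 512 bytes, which is an assumption here.
    """
    return int(lba_hex, 16) * sector_size


# LBAs quoted in the event 153 messages in this thread.
for lba in ("ef0", "c60", "d90"):
    print(lba, int(lba, 16), lba_to_byte_offset(lba))
```

Under that assumption, all three addresses fall within the first couple of megabytes of the disk.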

NickC_UK · Answer

This has now become a lot more serious. I have been testing backups to a completely different USB-attached SSD drive, and that has now suffered from the same problem:

Log Name:      System
Source:        disk
Event ID:      153
Task Category: None
Level:         Warning
Keywords:      Classic
Description:
The IO operation at logical block address 8 for Disk 8 (PDO name: \Device\0000008d) was retried.

robertps73123 · Answer

Go ahead and laugh: I was having the same thing, and my fault was in the power options on the USB Power tab. Robert

robertps73123 · Answer

EventID: 519 - Application, Microsoft-Windows-Backup
The backup operation that started at '2014-03-22T16:44:46.777575300Z' has failed to back up volume(s) 'C:,RECOVERY,X:'. Please review the event details for a solution, and then rerun the backup operation once the issue is resolved.

I have researched it for 3 days. Microsoft kept the Recovery partition at 300 MB, so they made changes to the WinPE Recovery partition; I cannot get anyone to tell me what they did or how they did it. The workaround is to give Recovery 400 MB using diskpart and disk administrator. I have eval copies where the backups are flawless and the Recovery partition is 300 MB. Until they get a fix that does not hack up the partitions on the boot drives, I'm using reagentc /disable before running the backup, which skips the Recovery partition, and reagentc /enable when the backup is finished, each from an elevated command prompt. Robert
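
The disable, back up, re-enable sequence described above could be wrapped in a small script so that the recovery environment is re-enabled even if the backup fails. This is a sketch only: the `wbadmin` arguments (target drive E:, including only C:) are placeholder values rather than the configuration used in this thread, the function name is made up, and the script would need to run from an elevated prompt. By default it runs in dry-run mode and just returns the commands.

```python
import subprocess


def run_backup_without_recovery(backup_cmd, dry_run=True):
    """Disable WinRE, run the backup, then re-enable WinRE.

    Returns the list of commands in the order they are (or would be) run.
    """
    disable = ["reagentc", "/disable"]
    enable = ["reagentc", "/enable"]
    steps = [disable, list(backup_cmd), enable]
    if dry_run:
        return steps
    subprocess.run(disable, check=True)
    try:
        subprocess.run(list(backup_cmd), check=True)
    finally:
        # Re-enable the recovery environment even if the backup failed.
        subprocess.run(enable, check=True)
    return steps


# Placeholder backup command: target E: and source C: are assumptions.
backup = ["wbadmin", "start", "backup", "-backupTarget:E:", "-include:C:", "-quiet"]
for step in run_backup_without_recovery(backup, dry_run=True):
    print(" ".join(step))
```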
