Intermittent SQL backup issue

Question

Hi

We have this issue on approx. 100-300 SQL machines each month and it's never recurring. If you re-run the backup it will complete. The problem exists on both Win2k8 and Win2k12 machines.

The backup fails with the following error in nsrsqlsv.raw:

A nonrecoverable I/O error occurred on file "Legato#1978fba8-cf4b-4d3b-8cec-bf2316364b27:" 995(failed to retrieve text for this error. Reason: 15105)..

I found that 995 is ERROR_OPERATION_ABORTED, signalled to the NetWorker/VDI writer, but I can't figure out what is sending it.

Looking in the SQL logs, I find entries for the same time and database:

07/23/2013 14:19:05,spid54,Unknown,BackupVirtualDeviceFile::RequestDurableMedia: Flush failure on backup device 'Legato#097d3227-43a8-4b2e-ad0a-5cf35a661b63'. Operating system error 995(failed to retrieve text for this error. Reason: 15105).

07/23/2013 14:19:05,spid54,Unknown,Error: 18210 Severity: 16 State: 1.

07/23/2013 14:19:05,Backup,Unknown,BackupIoRequest::ReportIoError: write failure on backup device 'Legato#097d3227-43a8-4b2e-ad0a-5cf35a661b63'. Operating system error 995(failed to retrieve text for this error. Reason: 15105).

07/23/2013 14:19:05,Backup,Unknown,Error: 18210 Severity: 16 State: 1.

07/23/2013 14:19:05,Backup,Unknown,BACKUP failed to complete the command BACKUP LOG RP_DBadminOp. Check the backup application log for detailed messages.

07/23/2013 14:19:05,Backup,Unknown,Error: 3041 Severity: 16 State: 1.
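To compare these entries across many machines, the log lines can be parsed mechanically. Below is a minimal Python sketch, assuming the comma-separated layout shown above (timestamp, source, severity, message). The function name and the regular expression are my own illustration, not part of NetWorker or SQL Server.

```python
import re

# Pulls the failing operation, the virtual device name and the OS error code
# out of messages shaped like the ones quoted above (format assumed from them).
DEVICE_ERROR = re.compile(
    r"(?P<kind>\w+) failure on backup device '(?P<device>[^']+)'\. "
    r"Operating system error (?P<os_error>\d+)"
)

def parse_log_line(line):
    """Split one exported log line; add device/OS error fields when present."""
    # Only the first three commas separate fields; the message may contain more.
    timestamp, source, _severity, message = line.split(",", 3)
    entry = {"timestamp": timestamp, "source": source, "message": message}
    match = DEVICE_ERROR.search(message)
    if match:
        entry["kind"] = match.group("kind")
        entry["device"] = match.group("device")
        entry["os_error"] = int(match.group("os_error"))
    return entry

sample = (
    "07/23/2013 14:19:05,spid54,Unknown,"
    "BackupVirtualDeviceFile::RequestDurableMedia: Flush failure on backup device "
    "'Legato#097d3227-43a8-4b2e-ad0a-5cf35a661b63'. Operating system error "
    "995(failed to retrieve text for this error. Reason: 15105)."
)
entry = parse_log_line(sample)
print(entry["kind"], entry["device"], entry["os_error"])
```

Lines with fewer than four comma-separated fields would raise a ValueError here, so a real collector would need to skip or log them.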

So I can see that there is an issue with the RequestDurableMedia operation and that it fails while doing a flush. But I have no idea what that means or what the Flush operation does.

In the file xbsa.messages:

_nwbsa_is_retryable_error: received a retryable network error (Severity 0 Number -13): busy

Is this due to packet loss, latency or other network-related issues?

And please don't ask me for FQDN checks, licenses or other default troubleshooting. The backup works 5 minutes after it fails, so there is no DNS failure. We're running a capacity license, so we can have as many clients as we wish.

The servers are mostly virtual, with their data stored on EMC VMAX 10K disk over SAN. All physical servers have the same disk setup.

bingo.1 · Answer

I don't understand: 'We have this issue on approx. 100-300 SQL machines each month and it's never recurring.' Does this mean it occurs only once per client? Once per month? 'If you re-run the backup it will complete.' This could point to another process that runs at the same time but ends later. A scheduled task? I do not expect a NetWorker timeout issue, as that would time out after 30 minutes. Talk to EMC support. If there is a server where the failure rate is higher, I would suggest switching on debug mode there for more info.

petterssonnl · Answer

I was trying to give some background. We have this issue on almost all clients, but it is not recurring. If the fault does occur, you can restart the backup and it will be successful. By mentioning that, my goal was to avoid this thread being filled with suggestions regarding DNS or other basic troubleshooting that has already been done.

I don't understand your theory of another process that ends the backup. Then there would have to be something else that takes the session and ownership of the shadow copy and then closes the connection. A bit far-fetched if you ask me.

EMC support has not yet provided any useful solutions for the last two SRs we've created. Mostly they ask for log files and give no advice.

I'd like to know what the exception "BackupVirtualDeviceFile::RequestDurableMedia: Flush failure on backup device" is thrown for and what is being done before it, so we can troubleshoot this further and narrow it down.

Is it the SAN disk, network latency/packet drops, SQL Server running out of resources, shadow copy space running low so the OS deletes the copy before the backup completes, or something else?

petterssonnl · Answer

I might have misunderstood you regarding the process. I get that something is aborting the backup and that this is the root cause of the problem.

I've done some more searching today and found a Symantec KB article stating that the server is running out of resources and aborts the process itself.

http://www.symantec.com/business/support/index?page=content&id=TECH5970

So I'm going to add a script to our alert management that sets the NetWorker parallelism to 2 instead of 4 if I get this problem. It's not a solution but a possible workaround. It would be nice to get more logging on this issue to see the real problem, but I don't know where to start, and the D9 debug level on nsrsqlsv.exe doesn't show anything interesting.

It would be terrible if it's the OS that can't manage its own resources properly and needs to abort operations just because it has allocated too many resources to one process...
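The decision part of such a script could look roughly like the Python sketch below. It only decides which parallelism value to use, based on whether the log text contains the 995 abort. The function name and the pattern are hypothetical, and actually applying the value on the NetWorker side is deliberately left out, since that depends on the tooling already in place.

```python
import re

# "995(" is how the code appears in the nsrsqlsv.raw and SQL log lines quoted
# in this thread; the lookbehind avoids matching longer numbers such as 1995.
ABORT_PATTERN = re.compile(r"(?<!\d)995\(")

def choose_parallelism(log_text, normal=4, reduced=2):
    """Return the reduced value if the log shows OS error 995, else the normal one."""
    if ABORT_PATTERN.search(log_text):
        return reduced
    return normal

failed_log = (
    'A nonrecoverable I/O error occurred on file '
    '"Legato#1978fba8-cf4b-4d3b-8cec-bf2316364b27:" '
    '995(failed to retrieve text for this error. Reason: 15105)..'
)
print(choose_parallelism(failed_log))
print(choose_parallelism("backup completed"))
```

Keeping the decision separate from the change itself makes it easy to log how often the workaround triggers before touching any client settings.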

petterssonnl · Answer

Yeah, I know that's the issue, and we've concluded that too.

bingo.1 · Answer

'I don't understand your theory of another process that ends the backup.' I did not mean that 'it will take the session'. However, it might influence the network behavior/load, which would affect the backup indirectly. For example, it could be a SQL maintenance process which uses more RAM than necessary.

I am not a SQL specialist. I don't have to be, because we export the databases and back up the export files. In fact we have dropped the SQL module. It worked fine so far, but it is backup-admin friendly, not DB-admin friendly. So my SQL experience is limited.

However, if I see a message like 'Operating system error 995', I would first check what it stands for. For Windows 2008 the message is clear:

D:\>net helpmsg 995

The I/O operation has been aborted because of either a thread exit or an application request.

D:\>

Assuming the message code is correct, you can search the internet, and you may also find this page, which refers to the same code with the OS text: Dell Software - Knowledge Base Articles. So this might give you or your DB admin a clue.
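If many log files need triaging, the codes seen in this thread can be kept in a small lookup. The following is a sketch in Python under stated assumptions: the table holds only the three codes quoted in this thread, the text for 995 is the 'net helpmsg' output above, and the descriptions for 18210 and 3041 are paraphrased from the quoted log lines rather than taken from official documentation. The names are illustrative.

```python
# (origin, code) -> short description; only codes that appear in this thread.
KNOWN_CODES = {
    ("os", 995): (
        "ERROR_OPERATION_ABORTED: The I/O operation has been aborted because "
        "of either a thread exit or an application request."
    ),
    ("sql", 18210): "failure on backup device, reported with the OS error",
    ("sql", 3041): "BACKUP failed to complete the command",
}

def describe(origin, code):
    """Return the known description, or a hint to look the code up manually."""
    fallback = "unknown %s code %d, look it up manually" % (origin, code)
    return KNOWN_CODES.get((origin, code), fallback)

print(describe("os", 995))
print(describe("sql", 3041))
print(describe("os", 5))
```

For operating system codes not in the table, 'net helpmsg' on the affected machine remains the authoritative source.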
