Avamar: SQL Backup and Restore of Large Databases With Multistreaming Enabled Fails
Summary: Multistreaming backups of SQL Server databases fail with VDS::Getconfig timeouts and stalled threads. The cause is a Virtual Device Interface (VDI) semaphore change in SQL 2016 and 2019 that creates a deadlock during device configuration. ...
Symptoms
Environment:
- Avamar client 19.3
- SQL 2016
- SQL 2019
- Windows servers hosting large databases (more than 1 TB)
The Avamar SQL backup is configured to use multistreaming with six parallel streams and fails as follows:
- With multistreaming enabled, the backup fails and logs:
2021-09-16 15:01:54 avsql Error <6478>: VDS::Getconfig failed with 'The api was waiting and the timeout interval had elapsed'
2021-09-16 15:01:54 avsql Error <6478>: VDS::Getconfig failed with 'The api was waiting and the timeout interval had elapsed'
2021-09-16 15:01:54 avsql Error <6478>: VDS::Getconfig failed with 'The api was waiting and the timeout interval had elapsed'
2021-09-16 15:01:54 avsql Error <6479>: Timed out. Was Microsoft SQLServer running?
All spawned avtar threads become unresponsive and display no byte progress for hours, as illustrated in the snippet below.
- From avtar thread #1, log "2pm-1631815200078#1" (snippet):
2021-09-16 14:02:02 avtar Info <19155>: - Establishing a connection to the Data Domain system with certificate authentication (Connection mode: A:2 E:2).
2021-09-16 14:02:02 avtar Info <18120>: DDR trace is enabled.
2021-09-16 14:17:02 avtar Info <8688>: Status 2021-09-16 14:17:02, 0 bytes (0 bytes, 0.00% new) 37MB 0% CPU (1 open files) (local)\ARCUSYM000\f-0.ARCUSYM000.stream0
2021-09-16 14:32:05 avtar Info <8688>: Status 2021-09-16 14:32:05, 0 bytes (0 bytes,
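The stalled state can be spotted by checking how long the status lines keep reporting 0 bytes. The following is a minimal Python sketch, not an Avamar tool. It assumes one log entry per line and the `<8688>` status layout seen in the snippet above; the function name `zero_progress_span` is hypothetical.

```python
import re
from datetime import datetime, timedelta

# Assumed layout, copied from the snippet above:
# "<timestamp> avtar Info <8688>: Status <timestamp>, <progress> (<details>..."
STATUS_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) avtar Info <8688>: "
    r"Status [^,]+, (?P<progress>[^(]+?) \("
)


def zero_progress_span(lines):
    """Return how long the latest run of status lines has reported '0 bytes'."""
    first_zero = last_zero = None
    for line in lines:
        match = STATUS_RE.match(line.strip())
        if match is None:
            continue  # not a status line
        stamp = datetime.strptime(match.group("ts"), "%Y-%m-%d %H:%M:%S")
        if match.group("progress").strip() == "0 bytes":
            if first_zero is None:
                first_zero = stamp
            last_zero = stamp
        else:
            first_zero = last_zero = None  # progress resumed, reset the run
    if first_zero is None:
        return timedelta(0)
    return last_zero - first_zero


sample = [
    r"2021-09-16 14:02:02 avtar Info <18120>: DDR trace is enabled.",
    r"2021-09-16 14:17:02 avtar Info <8688>: Status 2021-09-16 14:17:02, 0 bytes (0 bytes, 0.00% new) 37MB 0% CPU (1 open files) (local)\ARCUSYM000\f-0.ARCUSYM000.stream0",
    r"2021-09-16 14:32:05 avtar Info <8688>: Status 2021-09-16 14:32:05, 0 bytes (0 bytes,",
]
stalled_for = zero_progress_span(sample)
print(stalled_for)  # 0:15:03
```

A span that keeps growing across many status intervals, on every stream at once, matches the symptom described here.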
Cause
This is a known issue with the Microsoft Virtual Device Interface (VDI), and Avamar Engineering has delivered a patch.
Microsoft provided a Root Cause Analysis (RCA) for this issue, which identified the following:
- For SQL 2019
  - VDI signals successful device configuration with a semaphore, unlike earlier SQL versions, which use an event (SetEvent).
To summarize the change in behaviour:
- When "Configuration" of the backup set completes on the server in SQL 2016, we set the event using SetEvent.
- When the configuration completes on the server in SQL 2019, we increase the semaphore count by 1 (signaling it).
- There is a manual-reset event in SQL 2016, so the event stays signaled and all waiting threads on the client side return.
- The semaphore count decreases by 1 (back to zero) in SQL 2019 when one of the threads waiting on WaitForSingleObject is signaled. The other waiting threads on the client wait forever, as nothing ever increments or signals the semaphore again.
These changes break "avsql" VDI device configuration when multistreaming is enabled for backup or restore.
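The difference between the two signaling primitives can be reproduced outside SQL Server. The sketch below is a Python analogue, not the Win32 or VDI code: `threading.Event` stands in for the manual-reset event and `threading.Semaphore` for the semaphore. The six waiters mirror the six streams in this case, and the helper name `count_released` and the 0.5-second timeout are assumptions added so that the demonstration terminates, whereas the real client threads wait indefinitely.

```python
import threading


def count_released(wait, signal, waiters=6, timeout=0.5):
    """Start `waiters` threads blocked in `wait`, signal once, count how many return."""
    released = []
    lock = threading.Lock()

    def worker(index):
        # wait() returns True only if this thread was actually signaled.
        if wait(timeout):
            with lock:
                released.append(index)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(waiters)]
    for thread in threads:
        thread.start()
    signal()  # a single "configuration completed" notification
    for thread in threads:
        thread.join()
    return len(released)


event = threading.Event()          # stays set until cleared, like a manual-reset event
semaphore = threading.Semaphore(0)  # count starts at zero

event_released = count_released(lambda t: event.wait(t), event.set)
semaphore_released = count_released(lambda t: semaphore.acquire(timeout=t), semaphore.release)

print(event_released, semaphore_released)  # 6 1
```

With the event, one signal releases all six waiters. With the semaphore, one signal releases exactly one waiter and the other five time out, which corresponds to the streams that hang in the failing backups.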
Resolution
Avamar implemented the following solution:
- During GetConfiguration, each VDI device moves from the Configurable state to Initializing.
- The VDI devices are opened for I/O only after SQL Server confirms that device configuration has completed.
This resolves the potential deadlock, which is known to affect SQL 2016 and 2019 releases only.
To resolve backup and restore failures for large databases with multistreaming enabled, upgrade the Avamar Server to one of the following software releases:
- 19.3.100-149
- 19.4.100-116
- 19.7.100-82
Once the Avamar Server upgrade completes:
- For SQL plug-in version 19.3.100-149, download the hotfix 334445 and follow the README instructions to install it on the affected SQL server.
- For SQL plug-in version 19.4.100-116, download the hotfix 334589 and follow the README instructions to install it on the affected SQL server.
- For SQL plug-in version 19.7.100-82 and later releases, no hotfix is needed because these versions already contain the fix.
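The version-to-hotfix mapping above can be expressed as a small lookup, for example when checking several SQL hosts. This is an illustrative Python sketch only: the function names are hypothetical, the version string format `major.minor.patch-build` is assumed from the releases listed here, and any version not named in this article is reported as not covered rather than guessed.

```python
import re

# Plug-in builds named in this article and the hotfix each one needs.
HOTFIX_BY_BUILD = {
    (19, 3, 100, 149): "334445",
    (19, 4, 100, 116): "334589",
}
# First plug-in build that already contains the fix.
FIRST_FIXED_BUILD = (19, 7, 100, 82)


def parse_build(version):
    """Turn a string such as '19.4.100-116' into a comparable tuple of integers."""
    match = re.fullmatch(r"(\d+)\.(\d+)\.(\d+)-(\d+)", version.strip())
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.groups())


def hotfix_action(version):
    """Return the action this article prescribes for a SQL plug-in version."""
    build = parse_build(version)
    if build >= FIRST_FIXED_BUILD:
        return "no hotfix needed"
    if build in HOTFIX_BY_BUILD:
        return f"install hotfix {HOTFIX_BY_BUILD[build]}"
    return "not covered by this article; upgrade to a listed release first"


for plugin_version in ("19.3.100-149", "19.4.100-116", "19.7.100-82"):
    print(plugin_version, "->", hotfix_action(plugin_version))
```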
To download the hotfixes mentioned above, follow the instructions in the KB article below to access the Dell support page.