Having worked with numerous DR scenarios on different OSes, I feel SRDF/A or SRDF/S (if distance allows) is the simplest and most efficient DR process, because it minimizes the "moving parts" involved.
Since it is "crash consistent", as has been pointed out, the biggest risk revolves around two main points for recovery timing:
1) How busy your application is at the "crash" point. The more pending updates, the more work to roll back the changes and start the database.
2) How your application is written. If the application is poorly written, it may have a lot of data "pending" and not "committed", which could result in long recovery times. I have seen this on numerous database types (Oracle, SQL Server, Sybase, etc.), so this is a common issue.
SRDF/AR has its place, but the biggest issues I have with recommending it over SRDF/A are:
1) You have more to manage (scripts, software tools, etc.), which is more likely to break down in a real disaster. I have seen numerous multi-step DR processes fail due to lack of maintenance or lack of understanding of how to implement them. Recovery also tends to require more steps than crash recovery.
2) You will increase your RPO (i.e., data loss), because you will automatically be talking about minutes if not hours, depending on your process.
The only reasons I could find to argue for SRDF/AR over SRDF/A are:
1) Your database truly may not recover from a crash-consistent copy (if so, why in the world are you using that application?).
2) You have a batch-intensive application which only does updates during a certain window. It may be more efficient to take a consistent image after the update process and ship it via SRDF/AR.
3) You want to use the image for something other than DR recovery. In that case a "database consistent" copy may be necessary. One example is making a copy for your development environment, which may require database consistency to do things such as renaming your database, applying additional transaction logs, etc.
The big point (in my mind at least) to remember is that a DR copy is NOT the same as a backup copy. You are generally taking these to recover from a major disaster, which means the simpler the process, the more likely you are to succeed in recovering your applications.
You can find a nice paper on this topic: it is about implementing a Disaster Restart solution for SAP R/3 using MS SQL Server and SRDF. Even if not exactly new (about two years old), it covers the SRDF/A and SQL setup. It's called "White Paper: Using SRDF for Disaster Restart of SAP R/3 Enterprise Systems - Implementation Guide for Windows and SQL Server" and you can find it on Powerlink.
Having said that, you don't need any TimeFinder trick, since SRDF/A guarantees that all the volumes in the same RDF group (RDFG) working in async mode will be "consistent": every 30 seconds the R1 devices "capture" writes and transmit them to the R2 side. On the other side of the RDF link, the R2 devices apply the changes received from the R1 side as soon as they have received all of them. A host that sees the R2 devices sees an image somewhere from 30 to 60 seconds in the past of the R1 devices.
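A toy model may help picture the cycle mechanism just described. Everything here (the class, the method names, the fixed 30-second figure) is illustrative only, not EMC's actual implementation:

```python
CYCLE_SECONDS = 30  # assumed fixed cycle time, for illustration only

class SrdfA:
    """Toy model: R1 captures writes per cycle; R2 applies only whole cycles."""
    def __init__(self):
        self.capture = []     # writes collected in the current R1 cycle
        self.transmit = None  # previous cycle, in flight on the RDF link
        self.r2 = []          # image on R2: always a whole number of cycles

    def write(self, w):
        self.capture.append(w)

    def cycle_switch(self):
        # R2 applies the in-flight cycle only once it has arrived in full,
        # so the R2 image never exposes a partial cycle.
        if self.transmit is not None:
            self.r2.extend(self.transmit)
        self.transmit = self.capture
        self.capture = []

rep = SrdfA()
rep.write("w1"); rep.write("w2")
rep.cycle_switch()   # w1, w2 move to transmit; R2 still empty
rep.write("w3")
rep.cycle_switch()   # w1, w2 now applied on R2; w3 in transmit
# R2 holds a consistent image one to two cycles (30-60 s) behind R1.
```

The point of the model is that R2 always sits at a clean cycle boundary, never at a mix of two cycles, which is what makes the image crash consistent.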
But won't that give only write-order consistency, which to SQL is a database it will have to perform crash recovery on? To guarantee SQL recovery of the data, wouldn't you need a process to create SQL-consistent points in time in the replicated data? That would require a call to the SQL VDI to create the SQL-consistent data.
Thanks for clarifying that, Stefano; that was what I figured: crash consistent is all that is created.
And yes, it is the same type of consistency as a host crash will give you. But will the database recover from it? It depends on the database. Sometimes it requires a DBA to roll back transactions until they find a point in the database that they can recover from. I have seen with really large SQL databases (greater than 1 TB in size, which is large for SQL) that crash recovery does not always work and requires rolling back to get the database online.
And in Microsoft's white papers on Disaster Recovery they do not guarantee crash-consistent databases for DR, and they recommend a method for creating SQL-consistent snapshots for guaranteed recovery.
We ran into this issue after implementing RecoverPoint and found that the write-order consistency of the SQL data being replicated was not always sufficient to perform crash recovery on. So we use a combination of Consistency Groups to provide asynchronous crash-consistent snapshots and scheduled tasks to create SQL-consistent snapshots every four hours.
And the source production cluster has always been able to recover from failovers and crashes, but the replicated data in a crash-consistent format has not seemed as resilient at crash recovery. Not sure why; that is why we have the SQL snapshots every four hours: for a guaranteed RPO.
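That two-tier scheme (a near-current crash-consistent image, with periodic SQL-consistent snapshots as the guaranteed fallback) can be sketched roughly as follows. The four-hour interval comes from the post above; the one-minute replication lag and the function itself are just assumptions for illustration:

```python
SNAPSHOT_INTERVAL_H = 4.0  # scheduled SQL-consistent (VDI) snapshots

def recovery_point(disaster_h: float, crash_recovery_ok: bool,
                   replication_lag_h: float = 1 / 60) -> float:
    """Return the point in time (in hours) the database can restart from."""
    if crash_recovery_ok:
        # Crash-consistent replica works: only the replication lag is lost.
        return disaster_h - replication_lag_h
    # Otherwise fall back to the last completed SQL-consistent snapshot.
    return (disaster_h // SNAPSHOT_INTERVAL_H) * SNAPSHOT_INTERVAL_H

print(recovery_point(10.0, True))   # about a minute of loss
print(recovery_point(10.0, False))  # 8.0: up to four hours of loss
```

The worst case is crash recovery failing just before the next snapshot fires, which is why the guaranteed RPO is the snapshot interval, not the replication lag.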
AranH, you need to use VDI when you need to create a snapshot of your data using TimeFinder (Snap/Clone/Mirror). You need VDI (or VSS) since you need a point in time where your data is consistent, because creating such a snapshot may involve different volumes that may be cloned/snapped at different times.
But what happens if you can't use VDI or VSS? What can you do? You can use ECA (the -consistent option) and trust the storage.
Think of SRDF/A like a consistent snap every 30 seconds (capture), after which you move the snap elsewhere (onto the R2 devices).
SRDF/A guarantees that the image of your data on the R2 devices is a "consistent" image: a snapshot taken 30 to 60 seconds ago.
And yes it is the same type of consistency as a host crash will give you. But will the database recover from it?
Will your database restart if your main host crashes and your cluster switches the application to the other host? I hope it will recover. If it won't, ask M$ to polish their filesystem and their database.
Let me take your words a bit further: what happens when your host crashes? Will VDI grant you a "restartable" image on the disks?
SRDF/A will give you the same image of your data that a crash would give you. Do you think your database will recover from a host crash? I hope it will.
If you trust EMC (and SAP) you can use SRDF/A and have recent data at your R2 site. If you want "consistent snapshots" you have SRDF/AR, which combines TimeFinder and SRDF to move your data to the disaster recovery site. But now your data will be less recent, since SRDF/AR works in cycles and every cycle takes some time: you have to establish BCV volumes at the primary site, split them in a consistent way, and then move the data from the R1 BCVs to the R2 devices. SRDF/AR takes care of all the steps, but they take time and the R2 image will be older.
You have to choose. If you can't lose any data, use SRDF/S. If you can tolerate losing only a few seconds of data, use SRDF/A. If you can tolerate losing minutes or hours of data, go SRDF/AR.
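The trade-off can be put in rough numbers. The cycle times below are illustrative assumptions (SRDF/A's from the 30-second figure discussed earlier, SRDF/AR's an arbitrary one-hour batch cycle), not product guarantees:

```python
def rpo_seconds(mode: str, async_cycle_s: float = 30.0,
                ar_cycle_s: float = 3600.0) -> float:
    """Worst-case data loss, in seconds, for each replication mode."""
    if mode == "SRDF/S":
        return 0.0                 # synchronous: no acknowledged write is lost
    if mode == "SRDF/A":
        return 2 * async_cycle_s   # up to two capture cycles still in flight
    if mode == "SRDF/AR":
        return 2 * ar_cycle_s      # establish/split/ship cycles: minutes to hours
    raise ValueError(f"unknown mode: {mode}")

for mode in ("SRDF/S", "SRDF/A", "SRDF/AR"):
    print(mode, rpo_seconds(mode))  # 0.0, 60.0, 7200.0 respectively
```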
Now you can choose, but don't ask for discounts!
AFAIK RecoverPoint is not exactly like SRDF/A; we are talking about two different products, two different architectures. I don't think we can apply to SRDF/A what we discovered about RecoverPoint.
Of course, but replication of a database will encounter the same challenges in ensuring data consistency regardless of the transport.
I would think that whether we are talking about SRDF/A, MirrorView/A, or RecoverPoint CRR, each transport needs to ensure that there is write-order consistency at the target site. And write-order consistency is still only crash consistent, whether it is SRDF/A or RecoverPoint CRR. Each of the technologies mentioned would still need an additional mechanism to provide SQL consistency.
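A small made-up example shows why write-order fidelity gives exactly crash consistency and nothing more. A database writes the log record before the matching data page; if the target applies writes in source order, every prefix of the stream is an image some host crash could have produced:

```python
# Source-side write stream: the log record always precedes its data page.
source_order = ["log:T1", "data:T1", "log:T2", "data:T2"]

def prefix_is_crash_consistent(applied):
    """True if no data page appears without its log record."""
    logged = {w.split(":")[1] for w in applied if w.startswith("log:")}
    return all(w.split(":")[1] in logged
               for w in applied if w.startswith("data:"))

# In-order replication: every possible interruption point is crash consistent...
assert all(prefix_is_crash_consistent(source_order[:i])
           for i in range(len(source_order) + 1))

# ...while out-of-order application can expose a data page with no log
# record, an image no host crash could ever have produced.
assert not prefix_is_crash_consistent(["data:T1"])
```

Even the in-order image still needs crash recovery to be usable, which is the "additional mechanism" point above.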
Not trying to pick on you about this, Stefano; I am enjoying the conversation. It is just that I have been presented with challenges similar to the poster's, and I am trying to understand them more deeply to come up with other options for my infrastructure.
As always I appreciate all the information you share so frequently on this forum
Excellent info! As an ex-DBA I can relate to some of the issues Aran is talking about.
We use Microsoft clustering with our SQL databases. If the active host fails, the passive node comes alive and takes ownership of the database on the Symm (basically the .mdf and .ldf files). As the passive node starts SQL Server, it does some kind of transaction backout/cleanup that usually takes 30 seconds to three minutes, depending on how many transactions were in flight. I don't fully understand this process, but it seems to work OK because the DB comes up on the second node. It's very likely that some of the last transactions may be lost.
It sounds like regular old SRDF/A of the .ldf and .mdf files would give us the same kind of recovery that occurs on a local cluster node failure. Is this correct? That is, SQL would come up in Texas pointing to the meta devices in Texas that make up the .ldf and .mdf files; some transaction recovery/backout would occur, and hopefully we would have a good database in Texas.
If this is the case, then it seems that SRDF could protect me as well as the current cluster environment protects me locally. It sounds like we are calling this "crash" consistent.
If I wanted to do better, it sounds like I need to use SRDF/AR to combine TimeFinder clones with SRDF: get a consistent mirror and then replicate it to Texas.
It seems like I can take it one step further by using the TimeFinder/SQL Server Integration Module (TF/SIM). Are there additional costs for SRDF/AR and TF/SIM?
Excellent points. I think it comes down to this: any disaster technique (SRDF, log shipping, SQL clustering, etc.) can handle the SQL transaction log (.ldf) and the data portion (.mdf) of the SQL database.
What these techniques don't have access to is the SQL buffer/memory on the active node. Upon failure of the active node, everything in memory/buffers is lost.
This can cause some problems, but it sounds like most of the time, when SQL comes up on the failover node attached to the .mdf and .ldf, it can do some transaction log recovery that will allow the database to come up OK. Obviously it won't be exactly the same as before the failure, because in-flight transactions that a user thought were committed may end up getting rolled back.
That's OK as long as we have a working database at the DR site and we can resume business. Hopefully we won't see corruption when the DB comes up on the failover box. If corruption does occur, you may have to go back to a snapshot or other backup.
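The transaction backout described in this thread can be sketched as a toy redo/undo pass over a write-ahead log. This is not SQL Server's actual recovery algorithm, just the general idea: committed work found in the log survives, in-flight work is backed out:

```python
def crash_recover(log):
    """Replay a toy WAL: committed updates survive, in-flight ones roll back."""
    committed = {txn for op, txn in log if op == "commit"}
    db = {}
    for op, txn in log:
        if op == "update":
            db[txn] = "applied"          # redo phase: replay every logged update
    for txn in [t for t in db if t not in committed]:
        del db[txn]                      # undo phase: back out in-flight work
    return db

log = [("update", "T1"), ("commit", "T1"),
       ("update", "T2")]                 # T2 was still in flight at the crash
print(crash_recover(log))                # {'T1': 'applied'}: T2 rolled back
```

This is also why recovery time grows with the number of in-flight transactions at the crash point, as noted earlier in the thread.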
Thanks for the great info.
I expect that any DB will log each transaction (update) even before servicing it, keeping track on disk of what's going on.
I think the only operations that stay "only in cache" are SELECTs and similar read-only operations.