Having worked with numerous DR scenarios on different OSes, I feel SRDF/A or SRDF/S (if distance allows) is the simplest and most efficient DR process, because it minimizes the "moving parts" involved.
Since it is "crash consistent", as has been pointed out, the biggest risk revolves around two main points for recovery timing:
1) How busy your application is at the "crash" point. The more pending updates, the more work to roll back the changes and start the database.
2) How your application is written. If the application is poorly written, it may have a lot of data "pending" and not "committed", which could result in long recovery times. I have seen this on numerous database types (Oracle, SQL Server, Sybase, etc.), so this is a common issue.
SRDF/AR has its place, but the biggest issues I have with recommending it over SRDF/A are:
1) You have more to manage (scripts, software tools, etc.), which is more likely to break down in a real disaster. I have seen numerous multi-step DR processes fail due to lack of maintenance or lack of understanding of how to implement them. Recovery also tends to require more steps than crash recovery.
2) You will increase your RPO (i.e., data loss), because you will automatically be talking about minutes if not hours, depending on your process.
The only reasons I could find to argue for SRDF/AR over SRDF/A are:
1) Your database truly may not recover from a crash-consistent copy (if so, why in the world are you using that application?).
2) You have a batch-intensive application which only does updates during a certain window. It may be more efficient to take a consistent image after the update process and ship it via SRDF/AR.
3) You want to use the image for something other than DR recovery. In that case a "database consistent" copy may be necessary. One example is making a copy for your development environment, which may require database consistency to do things such as renaming your database, applying additional transaction logs, etc.
The big point (in my mind at least) to remember is that a DR copy is NOT the same as a backup copy. You are generally taking these to recover from a major disaster, which means the simpler the process, the more likely you are to succeed in recovering your applications.
You can find a nice paper on this topic: it is about implementing a Disaster Restart solution for SAP R/3 using MS SQL Server and SRDF. Even if not exactly new (about two years old), it covers the SRDF/A and SQL setup. It's called "White Paper: Using SRDF for Disaster Restart of SAP R/3 Enterprise Systems - Implementation Guide for Windows and SQL Server" and you can find it on Powerlink.
Having said that, you don't need any TimeFinder trick, since SRDF/A guarantees that all the volumes in the same RDF group (RDFG) working in async mode will be "consistent": every 30 seconds the R1 devices "capture" writes and transmit them to the R2 side. On the other side of the RDF link, the R2 devices apply the changes received from the R1 side as soon as they have received all of them. A host that sees the R2 devices sees an image somewhere from 30 to 60 seconds in the past of the R1 devices.
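A toy model may help picture the cycle mechanism just described. Everything here (the class, the method names, the fixed 30-second figure) is illustrative only, not EMC's actual implementation:

```python
CYCLE_SECONDS = 30  # assumed fixed cycle time, for illustration only

class SrdfA:
    """Toy model: R1 captures writes per cycle; R2 applies only whole cycles."""
    def __init__(self):
        self.capture = []     # writes collected in the current R1 cycle
        self.transmit = None  # previous cycle, in flight on the RDF link
        self.r2 = []          # image on R2: always a whole number of cycles

    def write(self, w):
        self.capture.append(w)

    def cycle_switch(self):
        # R2 applies the in-flight cycle only once it has arrived in full,
        # so the R2 image never exposes a partial cycle.
        if self.transmit is not None:
            self.r2.extend(self.transmit)
        self.transmit = self.capture
        self.capture = []

rep = SrdfA()
rep.write("w1"); rep.write("w2")
rep.cycle_switch()   # w1, w2 move to transmit; R2 still empty
rep.write("w3")
rep.cycle_switch()   # w1, w2 now applied on R2; w3 in transmit
# R2 holds a consistent image one to two cycles (30-60 s) behind R1.
```

The point of the model is that R2 always sits at a clean cycle boundary, never at a mix of two cycles, which is what makes the image crash consistent.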
But won't that give only write-order consistency, which to SQL is a database it will have to perform crash recovery on? To guarantee SQL recovery of the data, wouldn't you need a process to create SQL-consistent points in time in the replicated data? That would require a call to the SQL VDI to create the SQL-consistent data.
Thanks for clarifying that, Stefano; that was what I figured: crash consistent is all that is created.
And yes, it is the same type of consistency as a host crash will give you. But will the database recover from it? It depends on the database. Sometimes it requires a DBA to roll back transactions until they find a point in the database that they can recover from. I have seen with really large SQL databases (greater than 1 TB in size, which is large for SQL) that crash recovery does not always work and requires rolling back to get the database online.
And in Microsoft's white papers on Disaster Recovery they do not guarantee crash-consistent databases for DR, and they recommend a method for creating SQL-consistent snapshots for guaranteed recovery.
We ran into this issue after implementing RecoverPoint and found that the write-order consistency of the SQL data being replicated was not always sufficient to perform crash recovery on. So we use a combination of Consistency Groups to provide asynchronous crash-consistent snapshots and scheduled tasks to create SQL-consistent snapshots every four hours.
And the source production cluster has always been able to recover from failovers and crashes, but the replicated data in a crash-consistent format has not seemed as resilient at crash recovery. Not sure why; that is why we have the SQL snapshots every four hours: for a guaranteed RPO.
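That two-tier scheme (a near-current crash-consistent image, with periodic SQL-consistent snapshots as the guaranteed fallback) can be sketched roughly as follows. The four-hour interval comes from the post above; the one-minute replication lag and the function itself are just assumptions for illustration:

```python
SNAPSHOT_INTERVAL_H = 4.0  # scheduled SQL-consistent (VDI) snapshots

def recovery_point(disaster_h: float, crash_recovery_ok: bool,
                   replication_lag_h: float = 1 / 60) -> float:
    """Return the point in time (in hours) the database can restart from."""
    if crash_recovery_ok:
        # Crash-consistent replica works: only the replication lag is lost.
        return disaster_h - replication_lag_h
    # Otherwise fall back to the last completed SQL-consistent snapshot.
    return (disaster_h // SNAPSHOT_INTERVAL_H) * SNAPSHOT_INTERVAL_H

print(recovery_point(10.0, True))   # about a minute of loss
print(recovery_point(10.0, False))  # 8.0: up to four hours of loss
```

The worst case is crash recovery failing just before the next snapshot fires, which is why the guaranteed RPO is the snapshot interval, not the replication lag.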
AranH, you need to use VDI when you need to create a snapshot of your data using TimeFinder (Snap/Clone/Mirror). You need VDI (or VSS) since you need a point in time where your data is consistent, because creating such a snapshot may involve different volumes that may be cloned/snapped at different times.
But what happens if you can't use VDI or VSS? What can you do? You can use ECA (the -consistent option) and trust the storage.
Think of SRDF/A like a consistent snap every 30 seconds (capture), after which you move the snap elsewhere (onto the R2 devices).
SRDF/A guarantees that the image of your data on the R2 devices is a "consistent" image: a snapshot taken 30 to 60 seconds ago.
And yes it is the same type of consistency as a host crash will give you. But will the database recover from it?
Will your database restart if your main host crashes and your cluster switches the application to the other host? I hope it will recover. If it won't, ask M$ to polish their filesystem and their database.
Let me take your words a bit further: what happens when your host crashes? Will VDI grant you a "restartable" image on the disks?
SRDF/A will give you the same image of your data that a crash would give you. Do you think your database will recover from a host crash? I hope it will.
If you trust EMC (and SAP) you can use SRDF/A and have recent data at your R2 site. If you want "consistent snapshots" you have SRDF/AR, which combines TimeFinder and SRDF to move your data to the disaster recovery site. But now your data will be less recent, since SRDF/AR works in cycles and every cycle takes some time: you have to establish BCV volumes at the primary site, split them in a consistent way, and then move the data from the R1 BCVs to the R2 devices. SRDF/AR takes care of all the steps, but they take time and the R2 image will be older.
You have to choose. If you can't lose any data, use SRDF/S. If you can tolerate losing only a few seconds of data, use SRDF/A. If you can tolerate losing minutes or hours of data, go SRDF/AR.
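The trade-off can be put in rough numbers. The cycle times below are illustrative assumptions (SRDF/A's from the 30-second figure discussed earlier, SRDF/AR's an arbitrary one-hour batch cycle), not product guarantees:

```python
def rpo_seconds(mode: str, async_cycle_s: float = 30.0,
                ar_cycle_s: float = 3600.0) -> float:
    """Worst-case data loss, in seconds, for each replication mode."""
    if mode == "SRDF/S":
        return 0.0                 # synchronous: no acknowledged write is lost
    if mode == "SRDF/A":
        return 2 * async_cycle_s   # up to two capture cycles still in flight
    if mode == "SRDF/AR":
        return 2 * ar_cycle_s      # establish/split/ship cycles: minutes to hours
    raise ValueError(f"unknown mode: {mode}")

for mode in ("SRDF/S", "SRDF/A", "SRDF/AR"):
    print(mode, rpo_seconds(mode))  # 0.0, 60.0, 7200.0 respectively
```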
Now you can choose, but don't ask for discounts!
AFAIK RecoverPoint is not exactly like SRDF/A; we are talking about two different products, two different architectures. I don't think we can apply to SRDF/A what we discovered about RecoverPoint.
Of course, but replication of a database will encounter the same challenges in ensuring data consistency regardless of the transport.
I would think that whether we are talking about SRDF/A, MirrorView/A, or RecoverPoint CRR, each transport needs to ensure that there is write-order consistency at the target site. And write-order consistency is still only crash consistent, whether it is SRDF/A or RecoverPoint CRR. Each of the technologies mentioned would still need an additional mechanism to provide SQL consistency.
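A small made-up example shows why write-order fidelity gives exactly crash consistency and nothing more. A database writes the log record before the matching data page; if the target applies writes in source order, every prefix of the stream is an image some host crash could have produced:

```python
# Source-side write stream: the log record always precedes its data page.
source_order = ["log:T1", "data:T1", "log:T2", "data:T2"]

def prefix_is_crash_consistent(applied):
    """True if no data page appears without its log record."""
    logged = {w.split(":")[1] for w in applied if w.startswith("log:")}
    return all(w.split(":")[1] in logged
               for w in applied if w.startswith("data:"))

# In-order replication: every possible interruption point is crash consistent...
assert all(prefix_is_crash_consistent(source_order[:i])
           for i in range(len(source_order) + 1))

# ...while out-of-order application can expose a data page with no log
# record, an image no host crash could ever have produced.
assert not prefix_is_crash_consistent(["data:T1"])
```

Even the in-order image still needs crash recovery to be usable, which is the "additional mechanism" point above.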
Not trying to pick on you about this, Stefano; I am enjoying the conversation. It is just that I have been presented with challenges similar to the poster's, and I am trying to understand them more deeply to come up with other options for my infrastructure.
As always I appreciate all the information you share so frequently on this forum
Excellent info! As an ex-DBA I can relate to some of the issues Aran is talking about.
We use Microsoft clustering with our SQL databases. If the active host fails, the passive node comes alive and takes ownership of the database on the Symm (basically the .mdf and .ldf files). As the passive node starts SQL Server, it does some kind of transaction backout/cleanup that usually takes 30 seconds to three minutes, depending on how many transactions were in flight. I don't fully understand this process, but it seems to work OK because the DB comes up on the second node. It's very likely that some of the last transactions may be lost.
It sounds like regular old SRDF/A of the .ldf and .mdf files would give us the same kind of recovery that occurs on a local cluster node failure. Is this correct? That is, SQL would come up in Texas pointing to the meta devices in Texas that make up the .ldf and .mdf files; some transaction recovery/backout would occur, and hopefully we would have a good database in Texas.
If this is the case, then it seems that SRDF could protect me as well as the current cluster environment protects me locally. It sounds like we are calling this "crash" consistent.
If I wanted to do better, it sounds like I need to use SRDF/AR to combine TimeFinder clones with SRDF: get a consistent mirror and then replicate it to Texas.
It seems like I can take it one step further by using the TimeFinder/SQL Server Integration Module (TF/SIM). Are there additional costs for SRDF/AR and TF/SIM?
Excellent points. I think it comes down to this: any disaster technique (SRDF, log shipping, SQL clustering, etc.) can handle the SQL transaction log (.ldf) and the data portion (.mdf) of the SQL database.
What these techniques don't have access to is the SQL buffer/memory on the active node. Upon failure of the active node, everything in memory/buffers is lost.
This can cause some problems, but it sounds like most of the time, when SQL comes up on the failover node attached to the .mdf and .ldf, it can do some transaction log recovery that will allow the database to come up OK. Obviously it won't be exactly the same as before the failure, because in-flight transactions that a user thought were committed may end up getting rolled back.
That's OK as long as we have a working database at the DR site and we can resume business. Hopefully we won't see corruption when the DB comes up on the failover box. If corruption does occur, you may have to go back to a snapshot or other backup.
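The transaction backout described in this thread can be sketched as a toy redo/undo pass over a write-ahead log. This is not SQL Server's actual recovery algorithm, just the general idea: committed work found in the log survives, in-flight work is backed out:

```python
def crash_recover(log):
    """Replay a toy WAL: committed updates survive, in-flight ones roll back."""
    committed = {txn for op, txn in log if op == "commit"}
    db = {}
    for op, txn in log:
        if op == "update":
            db[txn] = "applied"          # redo phase: replay every logged update
    for txn in [t for t in db if t not in committed]:
        del db[txn]                      # undo phase: back out in-flight work
    return db

log = [("update", "T1"), ("commit", "T1"),
       ("update", "T2")]                 # T2 was still in flight at the crash
print(crash_recover(log))                # {'T1': 'applied'}: T2 rolled back
```

This is also why recovery time grows with the number of in-flight transactions at the crash point, as noted earlier in the thread.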
Thanks for the great info.
I expect that any DB will log each transaction (update) even before servicing it, keeping track on disk of what's going on.
I think the only operations that stay "only in cache" are SELECTs and similar read-only operations.