SKT2
2 Intern
1.3K Posts
August 18th, 2011 07:00
Thanks for sharing this. You mentioned this is applicable to "Host based ZFS mirrors". Are there other options?
solvme
2 Posts
August 18th, 2011 08:00
Hi there. It's difficult to say about other file system types because ZFS is the only one we have to test on. We have moved 100% over to ZFS because it suits us. As for other options: we also ran the snap against a single ZFS plex, which gave us RAID 5 resilience on the hardware with hot spares etc., but we would lose the HA safety net should that room lose power, get destroyed and so on. With a non-mirrored system the problem doesn't occur.
What we have noticed since the initial blog is that the offline/online process on a 400GB database that has been rolled back takes about 30 minutes to become consistent. This is far better than a full detach and re-attach, which takes 4 hours or so. We noticed that the initial ZFS projection was 10 hours to online, but it quickly corrects its estimate and flies through. Offlining and onlining around the session snaps takes seconds, so that is no problem.
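For anyone wanting to try this, the offline/online dance around the snap looks roughly like the sketch below. It is only an illustration of the sequence I described, not our exact script: the pool and device names are made up, and the array-side snap/rollback step depends entirely on your EMC tooling, so it is left as a comment.

```shell
#!/bin/sh
# Hypothetical sketch of offlining a mirror plex around an array snap.
# POOL and SNAP_PLEX are placeholders -- substitute your own names.
POOL=datapool
SNAP_PLEX=c2t0d0        # the mirror half that lives on the snapped LUN

# 1. Take the plex on the snapped LUN offline before touching the snap
zpool offline "$POOL" "$SNAP_PLEX"

# 2. Activate or roll back the array-side snap session here,
#    using whatever your EMC tooling provides (placeholder step).

# 3. Bring the plex back; ZFS resilvers it against the live half
zpool online "$POOL" "$SNAP_PLEX"

# 4. Watch the resilver progress and its (initially pessimistic) estimate
zpool status -v "$POOL"
```

The point is that the snapped plex is never part of the live pool while the snap session changes underneath it, which is what avoids the inconsistent rollback results.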
The one thing we can guarantee is that without doing this, you get very different results on rolling back. It may be something EMC wish to test in more detail, and I have sent a fuller description to them. What worries me is that this could have left us with a corrupted system if we hadn't tested it. ZFS is very clever, but this proves one step too far for it.
I suggest you try some simple benchmarked rollbacks without offlining before the snap and you'll see the issue. We are on the latest Solaris 10 patch level, so it's not an older issue raising its head.
Hope this helps.