hanacd
2 Bronze

Experience / Best Practices with long time VM snaps for huge VMFS stores?

We have a huge global customer from the logistics sector who is looking into offering more flexibility to app owners when doing patches and the like.

Right now they use VMAX with 1.2 TB LUNs in their vSphere environment, with dozens of VMs on top of each LUN.

There is a project driving this: offering their customers warehouses (real ones; remember, this is the logistics sector) together with the portal, as a service.

What is the challenge?

The end user might keep snaps for several weeks (up to 5) until they commit changes. The IT team is skeptical about using VMware snapshots for this. And on VMAX, snapping the whole LUN with a lot of VMs on top is not a good idea.

Think thousands of VMs, from 50 to 500 GB in size.

We received some really valuable input from EMC's vSpecialist and VMware Presales Minor group. We would like to share that here and give anybody else the chance to contribute to the discussion and / or get ideas for their challenges out there.

We discussed the following:

  • NFS with Isilon
    • + granularity
    • – workload and performance: so far not recommended for a 100% VMware environment because of random I/O
  • XtremIO
    • + performance
    • – unpredictable deduplication ratio, as the customer WILL put anything on there, and as you see there is a lot of data going to hit the storage

In general our assumption at this customer is: if you can imagine a worst case scenario, it is going to happen here!

6 Replies
hanacd
2 Bronze

Re: Experience / Best Practices with long time VM snaps for huge VMFS stores?

Josh Atwell provided some detail on why the customer has a GOOD reason NOT to use vSphere snapshots:

It looks like they’re looking for guidance.  They want to initiate a mandatory 2 week limit on VM snapshots for systems.  I can provide a few reasons and arguments we can share with them.

  1. They should implement a 1-2 week maximum on snapshots. 
  2. Snapshots are not backup.  Failure to commit or rollback snapshots is either laziness or an oversight.
  3. Snapshots kill performance over time.  Period.  Performance degradation leads to outages/support calls which leads to egg on face and face-palms.
  4. Snapshots are usually used to handle application upgrade rollbacks.  This usually means there is a predefined change window.  In 10+ years of operations I never saw a change window last more than a weekend.  There is almost always a go or no-go point in the process within a much smaller window. 
  5. If an issue arises post change that was not foreseen (bug for instance) it will usually show its head within a few hours of regular usage of the system(s).
  6. Snapshots hinder the operation team’s ability to perform some tasks that may be critical to their change windows. 
  7. Storage outages/issues may potentially create data consistency issues for VMs with snapshots.  I saw this once at Cisco but we were never able to fully root cause why it only happened to VMs with snapshots.  This can be mentioned but I personally do not have documentation to back up the situation.  In the end we were not able to remove the snapshot and we lost delta data.
  8. In order to enforce the maximum there should be automation that:
    1. Reports snapshots approaching the limit.
    2. Removes the snapshot when the 1-2 week limit is reached.
    3. Never depends on a manual process (existing scripts are out there, or other orchestration options).
    4. Has a check file for approved exceptions.  These exceptions should require director level approval which includes a breakdown of potential impacts if they circumvent this.  If you make people spend time defending their choice to do it and outlining the risks, they are less likely to even ask and may realize they don’t really need it.  This also allows IT to track these requests and respond accordingly if it is determined that a longer time is actually needed and deemed acceptable by key stakeholders.
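To make the automation in point 8 a bit more concrete, here is a minimal Python sketch of just the decision logic: which snapshots to report, which to remove, and which are covered by the exception check file. Everything in it is an assumption for illustration: the `Snapshot` record, the `load_exceptions` and `classify` helpers, the 14-day limit, the 3-day warning lead time, and the check file format (one VM name per line, `#` for comments) are all hypothetical. Collecting the snapshot list and actually removing snapshots would go through whatever tooling the team already uses (PowerCLI, the vSphere API, an orchestrator) and is deliberately left out.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Snapshot:
    """Hypothetical record; in practice this comes from the inventory tooling."""
    vm_name: str
    snapshot_name: str
    created: datetime


def load_exceptions(lines):
    """Parse the exception check file: one VM name per line, '#' starts a comment."""
    names = set()
    for line in lines:
        name = line.split("#", 1)[0].strip()
        if name:
            names.add(name)
    return frozenset(names)


def classify(snapshots, now, max_age=timedelta(days=14),
             warn_before=timedelta(days=3), exceptions=frozenset()):
    """Sort snapshots into ok / warn / remove / exempt buckets by age."""
    report = {"ok": [], "warn": [], "remove": [], "exempt": []}
    for snap in snapshots:
        age = now - snap.created
        if age >= max_age:
            # Past the limit: remove unless the VM has an approved exception.
            key = "exempt" if snap.vm_name in exceptions else "remove"
        elif age >= max_age - warn_before:
            # Approaching the limit: goes into the report sent to the owner.
            key = "warn"
        else:
            key = "ok"
        report[key].append(snap)
    return report
```

A scheduled job could call `classify` once a day, mail the `warn` bucket to the VM owners, and hand the `remove` bucket to the removal tooling. Keeping exceptions in a reviewed file means every entry can be tied back to an approval.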

In the end they should ask how long a system’s patch window is.  Snapshots should not be kept longer than that plus a day or two. I’d be happy to talk with you about this further if you have any follow up questions.  I am also open to discussion if anyone has a different perspective on any of the points outlined above.

hanacd
2 Bronze

Re: Experience / Best Practices with long time VM snaps for huge VMFS stores?

From Richard Anderson:

Why not suggest Avamar for incremental-forever deduplicated backups of the VMs? It can even do an incremental restore using changed block tracking for fast recovery to a previous backup.

Just run a new incremental backup before a change and you are all set: no snapshots to worry about, no performance problems when it's time to clean up snaps, and no disk space consumed on the datastore. Plus it's an actual backup and can be replicated offsite for DR.

hanacd
2 Bronze

Re: Experience / Best Practices with long time VM snaps for huge VMFS stores?

From Josh Atwell:

I would support this concept as well.  The key to this is having them dictate the culture of the service they provide.  I have yet to see a reasonable use case for long term snapshots of any kind.

Even in that scenario it was less than 5 weeks. 

I'd suggest simply having the team push back and requiring all requests for longer snapshots to defend a business case for allowing that feature (risk mitigation, long term roll back, etc.).  If a business case is justified and approved by upper management (based on legitimate analysis of impact vs gain), then this could be implemented in an isolated environment as part of their service, dedicating specific datastores/LUNs where long term snapshots are allowed.  These would be thick provisioned to minimize performance impact during snapshot maintenance tasks, etc.

Just because you can, doesn't mean you should.  I ran into this a lot while at Cisco working behind their portal (CITEIS).  Here's how I approached feature requests.

  1. Identify the business objectives and benefits of the portal/feature (need business justification for dangerous activities in portal)
  2. Identify a design that is operationally sustainable and meets those objectives.  What impact do non-standard capabilities have on the ability to recover from failure, on the environment, on neighbors, etc.?
  3. Implement automation to maintain design and prevent people from "going rogue" or trying to go around the portal process.

In the end, if the behavior might impact SLAs of other tenants, those customers were not given the full experience.  They would get isolated to a non-standard offering, which extended deployment cycles, increased costs, and limited portal capabilities.  That would typically force the app owners to think more critically about whether they really need an offering such as long term snapshots or not.  98 times out of 100 they decide they can live within the stricter constraints and you never hear from them about it again.

hanacd
2 Bronze

Re: Experience / Best Practices with long time VM snaps for huge VMFS stores?

And from Rich Barlow:

Use RecoverPoint for long term snaps.  At the hypervisor level I'm in total agreement that snaps kept beyond a couple of days are asking for disaster, because the technology there is so different. I think we should make sure that we don't conflate these two technologies.

hanacd
2 Bronze

Re: Experience / Best Practices with long time VM snaps for huge VMFS stores?

Josh Atwell's response:

Fair enough.  I speak solely from a portal/service standpoint.  The task is then to identify what the use case is really asking for, and whether we anticipate that array snapshots and/or RecoverPoint can meet that objective.  Then it becomes a separate infrastructure-related option that they can offer as a separate "service", or ideally on an isolated tier with that capability.  The standard service offering would be a virtual-layer option, controlled in the portal, that puts a limit on virtual-layer snapshots at no additional cost since it uses native architecture and toolsets.  This will then limit the impact of needing/wanting to alter LUN sizes and isolate impact.

I also dealt with some long term snaps, but in the end we found that we reclaimed data from them so infrequently that it wasn't worth making a service around.  Naturally I found this out after spending a week or two figuring out how to enable users to do it independently. 🙂  We still kept the snap policy, but requests for mounting and using the snapshots came with a best effort SLA.

dpironet
1 Copper

Re: Experience / Best Practices with long time VM snaps for huge VMFS stores?

My 2 cents

>offering more flexibility to app owners when doing patches and the like.

First thing to do is to define clearly the offering here. And then use the best technology to address the offering.

Here is an idea of the offering. Thinking out loud here, so I'm positively missing things, and don't forget that the journey is more important than the destination.

Functionals:

1- the offering should allow a point in time (PIT) recovery state

2- the offering should allow rollback to a previous point in time (PIT) recovery state

Non-Functionals:

1- PIT recovery should be easy and quick to enable (manageability)

2- PIT recovery should have the minimum impact on VM performance

3- If there is a consolidation process, it should have the minimum impact on VM performance as well

4- PIT recovery, rollback and eventually consolidate functions should take a minimum of time to process

Let's create some abbreviations:

1- TTE = Time To Enable

2- TTR = Time To Rollback

3- TTC = Time To Consolidate

Here is a table to summarise my point

| | VMW Snapshot | VMW aware Backup |
| --- | --- | --- |
| TTE | Snapshot is fast | Backup takes longer than a snapshot |
| TTR | | Restore takes longer than roll back of a snapshot |
| TTC | Variable. It depends on the size of the delta, which usually is high since it captured the many changes. | There is no consolidation per se here and thus TTC=0. Note: snapshots are being used and consolidated during the backup process, though the goal is not the same. |
| PIT recovery should be easy and quick to enable | | |
| PIT recovery should have the minimum impact on VM performance | Snapshot hinders VM performance as soon as it is turned on | Once the backup is done, there is no impact on the VM performance |
| In case of consolidation process, it should have the minimum impact on VM performance | Consolidation of snapshot puts a great stress on the storage and thus hinders VM performance | There is no consolidation here |

You can of course add non-functional qualities, such as disk space requirements, security and cost, to further add granularity to this table.

Feedback is welcome of course

Rgds,

Didier