We have a huge global customer from the logistics sector who is looking into offering more flexibility to app owners when doing patches and the like.
Right now they leverage VMAX with 1.2TB LUNs in their vSphere environment with dozens of VMs on top.
There is a project driving this: offering their customers warehouses (real ones; remember, this is the logistics sector) with the portal as a service.
What is the challenge?
The end user might keep snaps for several weeks (up to 5) until they commit changes. The IT team is skeptical about using VMware snapshots for this. And on VMAX, snapping a whole LUN with a lot of VMs on top is not a good idea.
Think thousands of VMs, each from 50 to 500 GB in size.
We received some really valuable input from EMC's vSpecialist and VMware Presales Minor group. We would like to share that here and give anybody else the chance to contribute to the discussion and / or get ideas for their challenges out there.
Here is what we discussed:
In general our assumption at this customer is: if you can imagine a worst case scenario, this is going to happen here!
Josh Atwell provided some detail on why the customer has a GOOD reason NOT to use vSphere snapshots:
It looks like they’re looking for guidance. They want to initiate a mandatory 2 week limit on VM snapshots for systems. I can provide a few reasons and arguments we can share with them.
In the end they should ask how long a system's patch window is. Snapshots should not be kept longer than that plus a day or two. I'd be happy to talk with you about this further if you have any follow-up questions. I'm also open to discussion if anyone has a different perspective on any of the points outlined above.
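The rule of thumb above (patch window plus a day or two) can be expressed as a simple age check. This is only a minimal sketch, not a vSphere API call: the function names, the two-day grace default, and the example dates are all hypothetical, and in practice the snapshot creation time would come from whatever inventory tooling is in use.

```python
from datetime import datetime, timedelta


def max_snapshot_age(patch_window: timedelta,
                     grace: timedelta = timedelta(days=2)) -> timedelta:
    """Longest a snapshot may live: the patch window plus a short grace period."""
    return patch_window + grace


def is_overdue(created: datetime, now: datetime,
               patch_window: timedelta,
               grace: timedelta = timedelta(days=2)) -> bool:
    """True when the snapshot has outlived the patch window plus grace."""
    return now - created > max_snapshot_age(patch_window, grace)


# Hypothetical example: a 3-day patch window, snapshot taken on 1 March.
created = datetime(2024, 3, 1, 8, 0)
window = timedelta(days=3)
print(is_overdue(created, datetime(2024, 3, 5, 8, 0), window))  # snapshot is 4 days old
print(is_overdue(created, datetime(2024, 3, 7, 8, 0), window))  # snapshot is 6 days old
```

A check like this could feed a report of snapshots that need attention, which is usually an easier conversation with app owners than deleting anything automatically.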
From Richard Anderson:
Why not suggest Avamar for incremental-forever deduplicated backups of the VMs? It can even do an incremental restore using changed block tracking for fast recovery to a previous backup.
Just run a new incremental backup before a change and you are all set: no snapshots to worry about, no performance problems when it's time to clean up snaps, and no disk space consumed on the datastore. Plus it's an actual backup and can be replicated offsite for DR recovery.
From Josh Atwell:
I would support this concept as well. The key to this is having them dictate the culture of the service they provide. I have yet to see a reasonable use case for long term snapshots of any kind.
Even in that scenario it was less than 5 weeks.
I'd suggest simply having the team push back and requiring every request for longer snapshots to defend a business case for allowing that feature (risk mitigation, long term rollback, etc.). If a business case is justified and approved by upper management (based on legitimate analysis of impact vs gain), then this could be implemented in an isolated environment as part of their service. That means dedicating specific datastores/LUNs where long term snapshots are allowed. These would be thick provisioned to minimize performance impact during snapshot maintenance tasks, etc.
Just because you can, doesn't mean you should. I ran into this a lot while at Cisco working behind their portal (CITEIS). Here's how I approached feature requests.
In the end, if the behavior might impact the SLAs of other tenants, those customers were not given the full experience. They would get isolated to a non-standard offering with longer deployment cycles, higher costs, and limited portal capabilities. That would typically force the app owners to think more critically about whether they really need an offering such as long term snapshots or not. 98 times out of 100 they decide they can live within the stricter constraints and you never hear from them about it again.
And from Rich Barlow:
Use RecoverPoint for long term snaps. At the hypervisor level I'm in total agreement that snaps beyond a couple of days are asking for disaster, but array-level snaps are another matter because the technology is so different. I think we should make sure that we don't conflate these two technologies.
Josh Atwell's response:
Fair enough. I speak solely from a portal/service standpoint. The task is then to identify what the use case is really asking for, and whether we anticipate that array snapshots and/or RecoverPoint can meet that objective. Then it becomes a separate infrastructure-related option that they can offer as a separate "service", or ideally on an isolated tier with that capability. The standard service offering would be a virtual layer option, controlled in the portal, that puts a limit on virtual layer snapshots at no additional cost since it uses native architecture and toolsets. This will then limit the impact of needing/wanting to alter LUN sizes and isolate impact.
I also dealt with some long term snaps, but in the end we found that we reclaimed data from them so infrequently that it wasn't worth making a service around. Naturally I found this out after spending a week or two figuring out how to enable users to do this independently. 🙂 We still kept the snap policy, but requests for mounting and using the snapshots came with a best effort SLA.
My 2 cents:
>offering more flexibility to app owners when doing patches and the like.
The first thing to do is to clearly define the offering, and then use the best technology to address it.
Here is an idea of the offering. I'm thinking out loud here, so I'm surely missing things, and don't forget that the journey is more important than the destination.
1- the offering should allow a point in time (PIT) recovery state
2- the offering should allow rollback to a previous point in time (PIT) recovery state
1- PIT recovery should be easy and quick to enable (manageability)
2- PIT recovery should have the minimum impact on VM performance
3- If consolidation is needed, it should have the minimum impact on VM performance as well
4- The PIT recovery, rollback and, eventually, consolidate functions should take a minimum of time to process
Let's create some abbreviations:
1- TTE = Time To Enable
2- TTR = Time To Rollback
3- TTC = Time To Consolidate
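To make the three measures concrete, one option is to record them per approach and compare. The sketch below is purely illustrative: the class name, the field names, the `worst_case` helper, and above all the sample timings are assumptions made up for the example, not measurements from this customer. The helper assumes that after enabling PIT recovery a change is either rolled back or consolidated, never both.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PitTimings:
    """Timings, in minutes, for one point-in-time (PIT) recovery approach."""
    name: str
    tte: float  # Time To Enable
    ttr: float  # Time To Rollback
    ttc: float  # Time To Consolidate

    def worst_case(self) -> float:
        """Enable time plus the slower of the two outcomes (rollback or consolidate)."""
        return self.tte + max(self.ttr, self.ttc)


# Hypothetical sample values, chosen only to show how the comparison works.
snapshot = PitTimings("snapshot", tte=1, ttr=5, ttc=120)
backup = PitTimings("backup", tte=30, ttr=45, ttc=0)

for option in sorted([snapshot, backup], key=PitTimings.worst_case):
    print(f"{option.name}: TTE={option.tte} TTR={option.ttr} "
          f"TTC={option.ttc} worst case={option.worst_case()}")
```

Real values would have to be measured in the customer's environment, since TTC in particular depends on how much the VM changed while the snapshot was open.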
Here is a table to summarise my point:
| | VMW Snapshot | VMW aware Backup |
|---|---|---|
| TTE | Snapshot is fast | Backup takes longer than a snapshot |
| TTR | | Restore takes longer than roll back of a snapshot |
| TTC | Variable. It depends on the size of the delta, which usually is high since it captured the many changes. | There is no consolidation per se here and thus TTC=0. Note: snapshots are being used and consolidated during the backup process, though the goal is not the same. |
| PIT recovery should be easy and quick to enable | | |
| PIT recovery should have the minimum impact on VM performance | Snapshot hinders VM performance as soon as it is turned on | Once the backup is done, there is no impact on VM performance |
| In case of consolidation process, it should have the minimum impact on VM performance | Consolidation of snapshot puts a great stress on the storage and thus hinders VM performance | There is no consolidation here |
You can of course add non-functional qualities, such as disk space requirements, security and cost, to add further granularity to this table.
Feedback is welcome, of course.