For anyone who hasn’t had a chance to review Aaron Patten’s EMC Proven Solutions Guide on VDI for a Windows 7 environment, it is truly a great piece of work. What I liked about it is that we now have a proven guide for VDI that minimizes idle hardware by leveraging the latest FAST suite on the NS midrange array.
Since I have used the guide extensively and am very familiar with it, I wanted to highlight some of the great takeaways and key points. The guide provides a best-practices approach to running a VMware View VDI environment for 500 users while meeting SLAs during boot storms, View refresh operations, View recompose operations, full antivirus scans, installation of security updates, and steady-state user load. Technologies highlighted in the guide are Enterprise Flash Drives for replica storage, FASTCache (Mega Cache) to absorb bursts of activity, FAST (performance-driven storage tiering) to minimize backend disk needs by mixing SATA and FC for linked clone growth, and the NAS side of the NS for storing user profiles and data on CIFS.
My disclaimer here is that this write-up is partly theoretical. You can absolutely use the PSG as it is written; this write-up, however, is an abstraction that applies some theories to make design assumptions. That being said, I tried to be conservative with facts I did not have and with scenarios too complex to be relied on for predictability.
500 users in a NS120
The guide sets SLAs for 500 users during steady-state IO and 95th-percentile bursts (desktop refreshes, antivirus, etc.). The guide does not provide guidance for a maximum. That can be found below, along with the method by which I arrived at the estimates.
How about the maximums per scale point?
See below for an explanation, but here are my ballpark figures. The first number is a minimum, though the actual value can be any point lower than stated. When selecting the array and scale point, keep in mind whether the array will be fully dedicated to VDI users, as that will affect this number.
NS120 500 – 800
NS240 1000 – 1600
NS480 1200 – 2000
NS960 2000 – 3000
How can one use the guide strategically..
During a discovery meeting we presented the design guide; the customer’s initial response was that the 500-user scale point was far too low, and they wanted a guide for 10,000 users. That is all fine and dandy, but for how many companies does a design guide referencing 10,000 users actually make sense? Sure it proves the technology, but everything that’s been published at those numbers to date is based on a scale-out model, basically a single design repeated over and over: marketing fluff. I agree with that methodology, but feel that a design guide at 500 users is actually much more appealing to the general customer base looking for a rock-solid, scalable VDI solution. Using the guide strategically means you can take the recipe and replicate it inside an array until you hit the array’s minimum scale point, or a maximum based on your SLA needs during 95th-percentile bursts.
Yes, the PSG uses RAID10! I can’t highlight enough that CPU cycles are uber-critical for successful and fault-tolerant VDI deployments. Why do I say RAID10 is good here? The write penalty for RAID 5 (4 IOs per write), RAID 6 (6 IOs per write), and RAID 10 (2 IOs per write) has a huge effect on backend disk performance and, more importantly in this design, storage processor CPU utilization. Replicas are all reads, so forget about the penalty there. The penalty matters for the FASTCache-protected linked clone datastore growth, which is shown to be very busy mainly during desktop refreshes. Overall in this solution, RAID10 minimizes backend IO, which minimizes CPU cycles. A cost-benefit debate shows lost usable capacity with RAID10 but a huge benefit in reduced CPU cycles, and at the higher scale points this means a lot. At a 70:30 read/write ratio, RAID10 can be modeled to decrease CPU utilization by approximately 10%; at 90:10 the decrease is only around 2-4%. How many disks is a 10% savings in storage processor utilization worth on writes? So if RAID10 is great, what capacity-saving technology will help get back the lost capacity?
Aha – Fully Automated Storage Tiering (FAST)
FAST allows you to take different types of disk and pool them together. FAST then moves data between tiers based on which data is busiest, so your most expensive disk serves your most performance-critical apps automagically. FAST is a scheduled routine that runs to analyze and perform data moves between tiers. How does this compare to other space-saving technologies, i.e. compression and deduplication? Those technologies are inline to operations and consume CPU cycles on a transaction-by-transaction basis. This goes back to the crux of the design: minimize CPU cycles. FAST differs from other space-saving technologies in that it adds no per-transaction operational overhead, only overhead during scheduled windows when it reprioritizes data among tiers.
More specifically to the design guide, it allows for 5GB of linked clone growth for every 500 users purely on FC disk (with more space automagically expandable to SATA drives) in the FAST storage pool. If there is a desire to scale up to a higher user count or more space per user, more disk is needed; the design includes SATA drives to accomplish just this.
While we are discussing linked clone growth, it is important to note that this area can take a beating. Due to its write-heavy ratio it is definitely a good candidate for RAID10. An important note about FASTCache is that it has been seen to reduce contention on the backend disk by as much as 70%. This makes some of the RAID discussion moot, since the backend disk can absorb the inefficiencies of RAID penalties. However, the downside of RAID5/6 compared to RAID10 is still increased IO at the interface levels, which causes CPU overhead.
Another important point is that the solution is planned for 100% concurrent users. If you have less concurrency, then using FAST is critical to expand the user base beyond the minimum scale point.
Where does the NAS side of the EMC NS fit into the solution?
In the PSG, the NAS is leveraged to host user profiles and data. It is important to note that without hosting this data via CIFS shares there is no single-instance file storage, and compression can only be applied at the block level, with a critical penalty in CPU cycles as discussed above. An even bigger point: if the data is not redirected to CIFS shares, there is no efficient disaster recovery strategy for user profiles or user data unless an agent is installed in every guest.
How do I scale a solution out based on the PSG?
First of all, pick an array model that meets the SLA needs at a proper scale point. From there it is simply a matter of taking the backend disk design and scaling it up internally within that array model. The main difference between the models, where there is a choice, is going to be around FASTCache; a linear scale-up based on the 500-user design is appropriate to ensure there is enough FASTCache to meet the performance of the PSG. Otherwise, scale the drives up in pods supporting 500 users at a time.
Once you have the solution scaled up and are appropriately minimizing idle capacity on the array, the design can be scaled out to your heart’s content. There is really no limit, since there don’t need to be dependencies outside of a “pod.” It is important to note that individual best practices for View deployments should be maintained especially when straying from the design guide’s 500 users.
One of the most critical pieces for any VDI deployment..
Make sure to reference the Applied Best Practices document for deploying View on Windows 7. It has critical information about how to minimize the IOs produced per user, and this goes a long way! Keep in mind that if you deploy a solution at 10 IOs per user when, with a little tweaking, it could run at 5 IOs per user, that equates to a 100% increase in cost, or a 50% reduction in the number of users per array.
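The cost arithmetic above is worth spelling out. This sketch assumes a hypothetical fixed steady-state IOPS budget for an array; the 5,000 figure is illustrative only, not from the PSG.

```python
def users_supported(array_iops_budget, iops_per_user):
    """Users a fixed IOPS budget can carry at a given per-user IO rate."""
    return array_iops_budget // iops_per_user

budget = 5000  # hypothetical steady-state IOPS an array can sustain
print(users_supported(budget, 10))  # 500 users at 10 IOs per user
print(users_supported(budget, 5))   # 1000 users at 5 IOs per user: same array, double the seats
```

Halving per-user IO doubles the seats on the same hardware, which is exactly the "100% increase in cost" framing above when the optimization is skipped.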
Below I discuss the method I used to ballpark scaling up and out, and the minimums/maximums per array model..
Ok, so how can we use the guide as a model..
I went through the guide many different ways looking for the best method to scale it up across array models. With a bit of research I found guides giving ballpark estimates for linear scaling of processing power among EMC midrange arrays. This was great, but which workload should be chosen and modeled across arrays? The peaks, the 95th percentile, the averages? Yes, the PSG has stats for each special circumstance (desktop refreshes, etc.), but how appropriate would it be to scale based on those numbers, and how repeatable? The most basic approach seemed to be the steady-state workload (with PDF opens, etc.), because that is the metric we want the solution to perform optimally against at all times. The guide is meant to let a VDI designer choose a best-case user load that meets the SLA expectations of the PSG, then tack on storage and leverage FAST to scale the model up to higher levels. Yes, this means starting with a high cost per VDI user, with the disclaimer that it can be scaled up while the steady-state load still meets SLAs. As the number of users increases, the cost per user drops, and so does the ability to meet the SLAs outlined for 95th-percentile situations. It is up to the consumer to decide what is acceptable, and the outcome is heavily reliant on guest OS optimizations to minimize unnecessary IOs.
Ok, we know what we’re modeling; now let’s make it useful..
My thought was that the model should give two numbers. The low number is a minimum user count that matches the performance of the workload (steady state, logins, refreshes, etc.) exactly as depicted in the guide, but at a different scale. Per EMC performance gurus, to scale from an NS120 to an NS960 in a conservative manner you can multiply by 3.333 (based on small-block IO) for the difference in processing power. Five hundred users times 3.333 yields roughly 1,666, rounded down to an easy-to-remember 1,500 users as a minimum scale point for the larger midrange NS960. That is the basic logic; Excel was used to help with the modeling, which generated a higher minimum scale point of 2,000 users on the NS960. Doing this math lets us take the performance information from the PSG and apply it to another array model. Not a perfect science, but a conservative number for matching the guide with a different array.
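The minimum-scale-point logic above can be sketched in a few lines. Only the NS960 multiplier (3.333x the NS120, from the EMC small-block guidance cited above) is from the source; any other model's multiplier would be a placeholder you'd have to fill in from the same scaling guides.

```python
import math

BASE_USERS = 500  # validated NS120 scale point from the PSG

# Processing-power multipliers relative to the NS120. Only the NS960 value
# is from the text; others would come from EMC's linear-scaling estimates.
MULTIPLIERS = {
    "NS120": 1.0,
    "NS960": 3.333,
}

def min_scale_point(model, round_to=500):
    """Scale the PSG's 500-user point by processing power, rounding down
    conservatively to an easy-to-remember figure."""
    raw = BASE_USERS * MULTIPLIERS[model]
    return int(math.floor(raw / round_to) * round_to)

print(min_scale_point("NS960"))  # 500 * 3.333 = 1666.5 -> rounds down to 1500
```

This reproduces the hand math above; the Excel model that produced the higher 2,000-user figure accounts for more inputs than this simple multiplier.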
Now comes the tricky part..
Now that we have a minimum, let’s find the maximum. Keep in mind that we want a solution that isn’t going to completely fall apart during special bursts of activity. We can’t run the array at 100% during steady-state activity because any burst would drive response times toward infinity (queueing theory); that takes us down to somewhere around the 80% level. Best practice for an active-passive design is to also allow headroom in case all IO load lands on one storage processor (some customers may differ here and opt not to plan for single-SP scenarios), which takes us down to around a 60% target (a similar requirement exists for non-disruptive upgrade operations). You may be asking why 60%, shouldn’t it be below 50%? Some of the overhead on an SP exists because of the dual-headed nature of the storage processors, i.e. replicating cache; with a single node online, that overhead requirement is reduced. There are also CPU cycles for basic SP functionality that aren’t duplicated when all load runs on one SP, i.e. statistics gathering per object. Keep in mind that with 60% as a target, if only a single SP is running the array is in a degraded state and will have challenges meeting the CPU cycle needs of bursts. The maximum uses 60% as the target per SP and would be approaching saturation in a single-SP condition.
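The queueing-theory point above is easy to demonstrate. In the simplest single-server queueing model, mean response time scales with 1/(1 - utilization), so latency explodes as utilization approaches 100%; this is an intuition sketch, not the PSG's sizing method.

```python
def relative_response_time(utilization):
    """Mean response time relative to raw service time in a simple
    single-server queue; unbounded as utilization approaches 1.0."""
    assert 0 <= utilization < 1, "at 100% utilization response time is unbounded"
    return 1.0 / (1.0 - utilization)

for u in (0.5, 0.8, 0.95):
    print(u, relative_response_time(u))  # roughly 2x, 5x, 20x the service time
```

That hockey-stick curve is why the ceiling starts at roughly 80% before the active-passive headroom pulls the target down to 60%.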
Ok, so we have a target percentage per SP..
The PSG shows that during steady state with 500 users each SP runs at 40%. However, that metric is most likely inflated, since SP utilization numbers from NAR data aren’t always the most accurate (other methods are available if accuracy is extremely critical). In comes another method of estimating utilization, based on a somewhat repeatable “model”: the traffic handled at different IO interfaces within the array and what it implies for utilization as a whole. For example, if in practice a workload of 50,000 IOs drove the array to 100% utilization, then 40,000/50,000 would predict 80% utilization. The real calculation is a bit more complicated, taking into account read IO, write IO, read bandwidth, write bandwidth, RAID type, IO profile, snapshots, replication, and FASTCache.
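Here is the proportional example above in its most stripped-down form. The real model weighs reads, writes, bandwidth, RAID type, and FASTCache differently; this simply captures the 40,000/50,000 ratio used in the text.

```python
def estimated_utilization(observed_io, io_at_saturation):
    """Naive interface-traffic estimate: utilization as the fraction of the
    IO level empirically observed to saturate the array."""
    return observed_io / io_at_saturation

print(estimated_utilization(40_000, 50_000))  # 0.8 -> 80% predicted utilization
```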
So we have real-life NAR-based SP utilization shown in graphs in the PSG, and we have the IO profile that we can feed into the model described above to estimate storage processor utilization. The PDF goes on to note that the delta between the two utilization numbers (“Performance as a function of Utilization on a Clariion”, p15) is based on how much coalescing is going on, cache hits, etc. This means we have a working model and NAR storage processor utilization (which is typically reported as higher than actual). However, the model is skewed by a lack of information for some of its metrics. After entering the data, the model, as expected (due to factors not accounted for during steady state), showed higher SP utilization than NAR did. What do you know, this was actually predicted (coalesced backend writes, multi-stripe-element prefetch, short seek distances, under-utilized drives, a large percentage of cache hits, write cache near or below the low watermark; some of these are more likely than others to cause the delta for the steady-state IO load with FASTCache). With all that said, let’s simplify the model and essentially fold the missing data into a constant in our equation. We can do this by using a repeatable, predictable load: the steady-state IO load. By accepting that the model will only be applied to repeatable and predictable data, we eliminate the need to include complicated inputs such as cache hit rate and the other factors listed above. We can then scale the model down linearly until the NS120 model matches the NAR utilization for the NS120, and from there scale the IOs up in the adjusted model until we reach our 60% target utilization on the NS960.
For example, at 40% utilization about 10% isn’t directly tied to IO load. To scale the solution by double, as a rough estimate, take 10% overhead + 30% for 250 users + 30% for another 250 users (per SP on the NS120).
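Carrying that split (10% fixed overhead plus 30% of IO-driven load per 250 users per SP) to the 60% per-SP target gives a rough maximum for the NS120. This is my extrapolation of the text's example, not a calculation from the PSG itself.

```python
FIXED_OVERHEAD = 10.0         # % per SP not directly tied to IO load (from the text)
IO_PCT_PER_USER = 30.0 / 250  # 30% per SP for 250 users at NS120 steady state
TARGET = 60.0                 # best-practices per-SP ceiling discussed above

def max_users_per_sp(target=TARGET):
    """Users per SP before IO-driven load plus fixed overhead hits the target."""
    return (target - FIXED_OVERHEAD) / IO_PCT_PER_USER

per_sp = max_users_per_sp()
print(per_sp, per_sp * 2)  # roughly 417 per SP, ~833 across both SPs
```

That ~833-user result lands right around the 800-user maximum quoted for the NS120 earlier, which is a reassuring sanity check on the method.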
So what’s the outcome?
We can provide a conservative low number of users based on the PSG for any midrange array model, and we can estimate a maximum best-practices number based on a single head being able to handle the steady state of VDI sessions. With these two numbers we get ballpark array scale points based on best practices, meeting performance SLAs in steady state, with the option for customers to scale up users at the cost of SLAs during 95th-percentile bursts.
In summary, if a designer wants the cheapest solution per seat then the solution can be planned based on the higher scale point in the appropriate array platform and users scaled down if there is a need to meet 95th percentile loads. Conversely, the solution can be planned conservatively to meet SLAs and then scaled up to meet an appropriate storage cost per VDI seat.
You know what they say about butterflies and hurricanes though..
"Summary and useful takeaways from the EMC Windows 7 FAST VDI Proven Solutions Guide"
To view the discussion, visit: https://community.emc.com/message/503111#503111