I wrote the following blog post a few months ago and wanted to share it with this community in the hope of generating more discussion on the topic. I'm interested in different opinions and in any accounts of implementation successes or failures with different EMC products.
Best practice for VMware LUN size
September 21, 2012
I was asked this question today. It’s one of my favorite questions to answer, but I’ve never written the answer down. Today I did, so here it is. Let me know if you agree or if you have other thoughts.
For a long time VMware’s maximum LUN size was 2TB. This restriction was not a big issue for many, but some wanted larger LUNs because of an application requirement. In those cases it was typically one or only a few VMs accessing the large datastore/LUN. vSphere 5 raised the LUN size limit from 2TB to 64TB: quite a dramatic improvement, and hopefully big enough to satisfy those applications.
For general purpose VMs, prior to vSphere 4.1, the best practice was to keep LUN sizes smaller than 2TB (i.e. even though ESX supports 2TB LUNs, don’t make them that big). 500GB was often recommended. 1TB was OK too. But it really depended on a few factors. In general, the larger the LUN, the more VMs it can support. The reason for keeping LUN sizes small in the past was to limit the number of VMs per datastore/LUN, because putting too many VMs on a datastore/LUN meant that performance would suffer.
The first reason is that vSphere’s native multipathing only leverages one path at a time per datastore/LUN. So if you have multiple datastores/LUNs, you can leverage multiple paths at the same time. Or you could go with EMC’s PowerPath/VE to better load balance the IO workload.
The second reason is that with block storage on vSphere 4.0 and earlier there was a LUN-level locking issue (SCSI reservations). If a VM was powered on, powered off, suspended, cloned, and so on, the entire datastore/LUN was locked until the operation completed, freezing out the other VMs using that same datastore/LUN. This was resolved in vSphere 4.1 with VAAI hardware-assisted locking, assuming the underlying storage array supported the APIs. But before VAAI, keeping LUN sizes small helped administrators limit the number of VMs on a single datastore/LUN, reducing the effects of the locking and pathing issues.
OK, that was the history, now for the future. The general direction for VMware is to go with larger and larger pools of compute, network, and storage. It makes the whole cloud thing simpler, hence the increase in support from 2TB to 64TB LUNs. I wouldn’t recommend going out and creating 64TB LUNs all the time, though. Because of VAAI the locking issue goes away. The pathing issue is still there with native multipathing, but if you go with EMC’s PowerPath/VE then that goes away too. So it comes down to how big the customer wants to make their failure domains. The thinking is that the smaller the LUN, the fewer VMs are placed on it, and so the smaller the impact if a datastore/LUN were to go away. Of course we go to great lengths to prevent that with five 9’s arrays, redundant storage networks, etc. So the guidance I’ve been seeing lately is that 2TB datastores/LUNs are a happy medium, not too big and not too small, for general purpose VMs. If the customer has specific requirements to go bigger then that’s fine, it’s supported.
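To make the failure domain trade-off concrete, here is a rough sketch of the arithmetic. The fleet of 200 VMs and the 50GB average VM size are made-up numbers for illustration, and the function name is hypothetical. It also ignores the free space you would normally reserve for snapshots and swap, so treat the output as an upper bound on density rather than a sizing recommendation.

```python
import math

def failure_domain(total_vms, avg_vm_gb, datastore_gb):
    """Return (datastores needed, VMs impacted if one datastore is lost).

    Assumes VMs are packed evenly and every VM has the same size,
    which is a simplification for illustration only.
    """
    if total_vms < 1 or avg_vm_gb <= 0 or datastore_gb <= 0:
        raise ValueError("all inputs must be positive")
    # How many VMs fit on one datastore (at least one, so we never divide by zero).
    vms_per_datastore = max(1, int(datastore_gb // avg_vm_gb))
    datastores = math.ceil(total_vms / vms_per_datastore)
    # A single datastore failure cannot impact more VMs than exist.
    impacted = min(total_vms, vms_per_datastore)
    return datastores, impacted

# Hypothetical fleet: 200 VMs at 50GB each, sizes in GB (1TB treated as 1000GB).
for size_gb in (500, 2000, 64000):
    count, impacted = failure_domain(200, 50, size_gb)
    print(f"{size_gb}GB datastores: {count} needed, {impacted} VMs per failure")
```

With these assumed numbers, moving from 500GB to 2TB datastores cuts the datastore count from 20 to 5 but quadruples the number of VMs affected by a single datastore outage, and a single 64TB datastore would put the whole fleet in one failure domain.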
So, in the end, it depends!!!
Oh, and the storage array behavior does have an impact on the decision. In the case of an EMC VNX, assuming a FAST VP pool then the blocks will be distributed across various tiers of drives. If more drives are added to the pool then the VNX will rebalance the blocks to take advantage of all the drives. So whether it’s a 500GB LUN or 50TB LUN, the VNX will balance the overall performance of the pool. Lots of good info here about the latest Inyo release for VNX:
This is a great summary.
One thing not mentioned that I think is crucially important is a discussion of queues.
Let's say everything you are doing fits into a single 2TB LUN. Great for you. However, I'd strongly suggest you DON'T build just a single LUN (ever!). Why? Queues.
For every LUN being accessed, vSphere has (effectively) a single queue, and so there's basically 1 command at a time that can be run. Most backend arrays have a similar scenario. So, with a 2TB device, all your (maybe 30? 50?) VMs are having to wait in the same line.
If you break that up into 4 LUNs, suddenly each line is roughly 4x shorter, and everyone gets serviced faster. In this case, 500GB / LUN would be a better choice.
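A quick sketch of that arithmetic, assuming 40 VMs (a number picked from the 30 to 50 range above) spread evenly across the LUNs, and treating 2TB as 2000GB so the split comes out at 500GB. The function names are made up for illustration.

```python
import math

def vms_per_lun(total_vms, lun_count):
    """Worst-case number of VMs sharing one LUN queue with an even spread."""
    if lun_count < 1:
        raise ValueError("lun_count must be at least 1")
    return math.ceil(total_vms / lun_count)

def lun_size_gb(total_capacity_gb, lun_count):
    """Size of each LUN when the capacity is split evenly."""
    if lun_count < 1:
        raise ValueError("lun_count must be at least 1")
    return total_capacity_gb / lun_count

total_vms = 40        # assumed VM count, not a measured value
capacity_gb = 2000    # 2TB treated as 2000GB

for luns in (1, 4):
    print(f"{luns} LUN(s): {lun_size_gb(capacity_gb, luns):.0f}GB each, "
          f"{vms_per_lun(total_vms, luns)} VMs waiting in each line")
```

This only counts how many VMs share a line. It says nothing about how busy each VM is, so a few heavy hitters on one LUN can still skew the picture.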
This is important on both mid-tier arrays and higher end Symmetrix style systems.
So while I think Peter's guidance of 2TB is overall good, I would say that you should consider a minimum of 4 LUNs, regardless of the amount of data being stored. Second, talk to your local expert on your storage array. Certain arrays have certain best practices. For example, on a Symmetrix, we'd want to see the number of LUNs be 2-10x the number of FA ports servicing the workload.
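As a rough illustration of combining the two rules of thumb above (at least 4 LUNs, and on a Symmetrix 2-10x the number of FA ports), something like the following could give a starting range. The function name is hypothetical and this is not an EMC tool or formula, just the two guidelines written down as code. Your array specialist's advice should win over it.

```python
from typing import Tuple

def symmetrix_lun_count_range(fa_ports: int, min_luns: int = 4) -> Tuple[int, int]:
    """Return a (low, high) LUN count range from the rules of thumb above.

    low  = the larger of the general minimum (4) and 2x the FA port count
    high = 10x the FA port count, never below low
    """
    if fa_ports < 1:
        raise ValueError("fa_ports must be at least 1")
    low = max(min_luns, 2 * fa_ports)
    high = max(low, 10 * fa_ports)
    return low, high

# Example port counts are arbitrary, chosen only to show the shape of the result.
for ports in (1, 2, 4, 8):
    low, high = symmetrix_lun_count_range(ports)
    print(f"{ports} FA port(s): between {low} and {high} LUNs")
```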
Recently, after attending a VMware webcast where we discussed this topic, I made the decision to increase our standard LUN sizes. We went from 600 GB to 1 TB LUNs. Looking back, I regret that decision: performance has noticeably decreased, to the point that one can feel the difference when using one of the VMs. When I looked closer, I found that both our write latencies and our command queues have increased since converting everything over.
Something hinted at above yet not discussed is the work required to balance out the workloads on these large datastores. If one has or can afford Enterprise Plus licensing then Storage DRS is available. For those of us who don't have it, it means a lot of work to try to balance workload IO on a datastore with 20+ VMs.
Another option that I've been using on our account is to create RDM drives so that a VM gets its own queue. This has proven to work quite well, and it also has a few other management benefits. Namely, it allows for more consistent VM sizes in our main datastores, which makes moving VMs around when re-sizing or adding a new drive much easier, since we don't have to move a bunch of VMs off a datastore to grow a single VM by 20 GB. The one downside becomes apparent if you have Avamar and try to implement VADP with VMware Imaging, which will not work on a Physical RDM due to the lack of snapshot capability.
Another thing not mentioned above (and I agree that this is a great post and discussion) is that even though we can create 64 TB datastores, the maximum VMDK size remains at 2 TB on VMFS. This is rumored to change in vSphere 5.5, but for now that is something to keep in the back of your minds. I've had many customers upgrade expecting to be able to get rid of a lot of RDMs, only to run into trouble because of this limitation.
As for lastpick's issue, I'd be curious to hear what the back end storage is. A VNX licensed with FAST VP or with a little FAST Cache would go a long way to mitigate the latency and queueing. If those features or that hardware are not available in the array, then those are things you need to consider up front before deciding to increase the VM density on your datastores.
This is a great and ever-evolving discussion so keep it going!
Just adding to the discussion based on a customer question yesterday... they expressed concern about an issue with Heap Size for large VMFS volumes. Here's the article with a good description of the problem and resolutions.
Monster VMs & ESX(i) Heap Size: Trouble In Storage Paradise
In summary, this is a known issue, but VMware released a patch to increase the max heap size and has baked the solution into vSphere 5.1 Update 1. The only concern is that hosts that have been upgraded (vs. fresh installs) will need to have the heap size manually reconfigured.
Just adding to the discussion... here's an article that essentially states that you should avoid extents unless you have a specific performance requirement. If you need to grow/expand the LUN then you can use the VMware volume grow facility. If you need the performance then you may be able to tune the queue depth rather than aggregate the queue depth for multiple extents.
Since we mentioned queue depth in the comments, I came across another vSpecialist email discussion asking about enabling the Adaptive Queue Depth setting. The response was:
ADQ is actually off by default, unless you enable it explicitly by setting values for the 2 variables (QFullThreshold, QFullSampleSize).
Our recommendation is to leave it off unless specifically directed by support (VMware's or EMC's). Its functionality is fully subsumed by SIOC, and SIOC is always better. The next version of the techbooks will contain a comment to this effect.
So, we recommend leaving ADQ off and unset.
You may also be thinking of the iops=1 setting; that only applies to round robin using NMP. The recommendation to set it to 1 from the default of 1000 is for Symmetrix only.
I've been looking for this kind of information, but VMware isn't exactly forthcoming on the subject - which is understandable, since the answer vastly depends on your specific needs.
I ended up building 2 TB datastores and grouping them in datastore clusters with Storage DRS enabled, and I'm very satisfied with the performance so far. The backend might be helping, though: we're using a VNX5300 with FAST VP enabled behind a pair of VPLEX.
I'll monitor the evolution of queue depth over the next few months.
This blog is from 2012. Does this recommendation still apply today with vSphere 5.5/6?
I've read somewhere that for block there is no difference between 10x500GB LUNs and 1x5TB LUN. Is this correct in vSphere 5.5/6?
There IS definitely a difference in block storage between 10x500 and 1x5000. There are more queues, and as such, with the same workload you should have shorter queues, i.e. lower latency.
Richard J Anderson
Sr Systems Engineer, Software Defined Storage, West Globals
EMC Emerging Technologies Division
VMAX SPEED, Unified SPEED, ScaleIO SPEED, Proven EMCTA