8 Krypton

Monster VMs performance 32vCPUs experience? beyond 16vCPUs drops?

Does anybody have real-world experience with 32-vCPU monster VMs?

We have a customer in Austria that got some feedback that beyond 16vCPUs the VM performance doesn’t scale as expected.

I received some more information: it's primarily Oracle on a 32-core machine, performing worse than expected in the VM.

At the VMworld 2013 panel I heard the recommendation for Oracle to turn off NUMA and set HT = 1.

There is more information in these sessions:

VAPP4679 - Software-Defined Datacenter Design Panel for Monster VM's: Taking the Technology to the Limits for High Utilisation, High Performance Workloads

Sam Lucido and Jeff Browning did a session on Oracle (VAPP5180 - Extreme Virtualized Oracle 12c Performance in a Proven Architecture).

These are worth a look if you are interested. However, they don't answer my question completely, so I am still looking for real-world experience:

It’s one of our loyal customers with VMware, VPLEX, VMAX,… and they are in the process of moving some of their biggest workloads (DB and more) to vSphere.

  • I thought about EMC IT's experience (with Propel / SAP, Oracle)
  • Any sessions from VMworld 2013?
  • Anybody else who had the need and ran huge VMs?

Thanks in advance!


Message was edited by: David Hanacek

8 Krypton

Re: Monster VMs performance 32vCPUs experience? beyond 16vCPUs drops?

So here is where I am with my research. Any comments, hints, tips, and real-world experience are welcome:

Does anybody know who created the document on the VMware Oracle landing page? It appears to be from 2011 and I was wondering if there are any findings / news that might be considered:

http://www.vmware.com/business-critical-apps/oracle-virtualization/index

http://www.vmware.com/files/pdf/solutions/oracle/Oracle_Databases_VMware_Best_Practices_Guide.pdf

And then I found a blog on a case that sounds very similar to what EMC presented as best practice at VMworld 2013 by Sam Lucido and Jeff Browning:

http://cloudarchitectmusings.com/2012/12/17/the-impact-of-numa-on-virtualizing-business-critical-app...

But the recommendation there is to build smaller 6-vCPU VMs, which is what I had recommended to the customer previously. In this specific case that is not an option. It's a fairly small 1.6TB Oracle DB, but they need to do some big data analytics on it and the application scales only in a monolithic environment. Right now it runs on a 32-core physical system.

8 Krypton

Re: Monster VMs performance 32vCPUs experience? beyond 16vCPUs drops?

Good news: the customer is running benchmark tests on a system with 32 cores (4x4650 with 8 cores each) and 768GB RAM. A monster VM now scales fine up to 28 vCPUs and only drops to the performance of 24 cores with 32 vCPUs. We haven't run the application yet, but it looks like Hyperthreading (HT) on the previous machine didn't work well in this case. We are waiting for the vSphere 5.5 release to try its advanced features and to apply best practices for Oracle tuning on a VMware VM. As posted earlier, the document on VMware's Oracle landing page seems to be outdated (2011), so I hope to find more recent hints and tips. I will post any new findings.

Daniel_Polombo
6 Indium

Re: Monster VMs performance 32vCPUs experience? beyond 16vCPUs drops?

The main point to keep in mind is that, to schedule a VM, vSphere needs to reserve one logical CPU (one core = one LCPU without HT, one core = 2 LCPUs with HT) per vCPU in the VM. In other words, to run a 32-vCPU VM for a time slice, ESXi will require the simultaneous availability of 32 LCPUs.

If you have 32 physical CPUs, it means that the ESXi kernel can't even schedule itself during the same time slice, which leads to significant performance degradation. Hyperthreading will somewhat mitigate the problem since you will benefit from extra-fast context switching, but it's still not equivalent to 64 physical CPUs.

Even running a 24-vCPU VM on a 32-CPU server will generate significant latency, as the VM will need to wait for the simultaneous availability of 24 CPUs before it can be scheduled.

Have a look at the CPU Ready statistics for the VM, and also look at the %CSTP value with esxtop (see VMware KB: Determining if multiple virtual CPUs are causing performance issues for more details about %CSTP).
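For reading the CPU Ready statistic mentioned above, here is a minimal Python sketch of the commonly used conversion from the millisecond "summation" value shown in the vSphere performance charts to a percentage. The function name, the 20-second real-time sampling interval, and the per-vCPU averaging are assumptions for illustration; check the interval of the chart you are actually reading.

```python
def cpu_ready_percent(ready_ms, interval_s=20, num_vcpus=1):
    """Convert a CPU Ready summation value (milliseconds) to a percentage.

    ready_ms   -- summation value reported for the sampling interval
    interval_s -- sampling interval in seconds (assumed 20 s for real-time charts)
    num_vcpus  -- divide by the vCPU count to get an average per vCPU
    """
    if interval_s <= 0 or num_vcpus <= 0:
        raise ValueError("interval_s and num_vcpus must be positive")
    return ready_ms / (interval_s * 1000.0 * num_vcpus) * 100.0


# Hypothetical reading: 2000 ms of ready time in one 20 s sample on one vCPU.
print(round(cpu_ready_percent(2000), 1))
# Hypothetical reading: 64000 ms summed over a 32-vCPU VM in the same interval.
print(round(cpu_ready_percent(64000, num_vcpus=32), 1))
```

Both hypothetical readings work out to about 10% ready time per vCPU, which makes values from differently sized VMs comparable.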

8 Krypton

Re: Re: Monster VMs performance 32vCPUs experience? beyond 16vCPUs drops?

Hello Daniel, your point is something we are looking at, but more in the context of how to build an ideal 3-node cluster that lets various workloads share resources: the monster VM runs a nightly job, and during the day its resources would otherwise hardly be used.

  • vCPU coscheduling:

However, regarding “a 16-core VM needs 16 cores *at the same time* to execute”:

This hasn’t actually been true since 4.0. ‘Relaxed coscheduling’, developed in 4.x and further refined in 5.x, makes this far rarer. *Sometimes* it's needed in order to reduce the core clock skew in the VM and prevent excessive SMP slip, but not usually.

Everything you could want to know: http://www.vmware.com/files/pdf/techpaper/VMware-vSphere-CPU-Sched-Perf.pdf

Thanks mattcowger

  • vSphere 5.5 improvements expected

We are eager to test the vSphere 5.5 low-latency feature for bypassing the virtualization layer, which should be available once the bits are GA.

Deploying Extremely Latency-Sensitive Applications in VMware vSphere 5.5.

http://www.vmware.com/files/pdf/techpaper/latency-sensitive-perf-vsphere55.pdf

Thanks itzikr

Daniel_Polombo
6 Indium

Re: Re: Monster VMs performance 32vCPUs experience? beyond 16vCPUs drops?

Thanks for the tech paper, it makes for a very interesting read.

8 Krypton

Re: Monster VMs performance 32vCPUs experience? beyond 16vCPUs drops?

Please see the original post. I am in no way recommending a monster VM for every Oracle workload; in fact, a best practice is to divide whenever possible. But if you can't go that way, as is true in our case, read on for the details I was able to capture.

Good news: the customer is running benchmark tests on a system with 32 cores (HP B660 with 4x4650 with 8 cores) and 768GB RAM. A monster VM now scales fine up to 28 vCPUs and only drops to the performance of 24 cores with 32 vCPUs. We haven't run the application yet, but it looks like Hyperthreading (HT) on the previous machine didn't work well in this case. We are waiting for the vSphere 5.5 release to try its advanced features and to apply best practices for Oracle tuning on a VMware VM. As posted earlier, the document on VMware's Oracle landing page seems to be outdated (2011), so here are more recent hints and tips. In this scenario we run Oracle 11.2.0.2 at this point.

  • Oracle HT recommended
  • Memory
    • Oracle – Reserve memory directly via the VM reservation for a 32-vCPU system, possibly up to 100%. We use 128GB of VM RAM, which is optimal for us, but this value may be different for you. Follow the formula from the best practices if you want to share some resources (in our case maximum performance is needed for a nightly 8-hour job).
      • Memory reservation = size of the SGA + 2 × aggregate PGA target + 500MB for the OS (assuming some flavor of Linux).
      • These reservations should be for production clusters only, as development and test databases do not usually require peak performance.
    • Faster RAM – more physical memory can be counter-productive if it’s cross-architecture (a CPU accesses the memory of another CPU, or the memory has higher latency due to the memory architecture). A Cisco B440 with 4x8 physical cores and 256GB of memory has proven to be a pretty good sweet spot in terms of price / performance / resources.
    • NUMA locality
    • Linux huge pages (typical default page size 4K, large / huge pages 2MB, adjustable) – Oracle memory lookups (linear search) can improve by up to 15%. See also the EMC Oracle session at VMworld 2013 for more detail.
  • Reservation / prioritization generally
    • CPU: Prod, Test, and Dev in the same ESX cluster, e.g. Prod priority 1 / Test priority 2 / Dev priority 3 ...
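The memory reservation formula and the huge page sizing from the list above can be sketched in a few lines of Python. This is only an illustration: the function names are made up, the SGA and PGA figures in the example are hypothetical and not the customer's values, and it assumes huge pages are sized to back the SGA only.

```python
import math


def oracle_vm_reservation_mb(sga_mb, pga_aggregate_target_mb, os_overhead_mb=500):
    """Reservation = SGA + 2 x aggregate PGA target + 500MB for the OS."""
    return sga_mb + 2 * pga_aggregate_target_mb + os_overhead_mb


def huge_pages_for_sga(sga_mb, huge_page_mb=2):
    """Number of huge pages needed to hold the SGA (2MB pages assumed)."""
    return math.ceil(sga_mb / huge_page_mb)


# Hypothetical sizing: 64GB SGA and 16GB aggregate PGA target.
sga_mb = 64 * 1024
pga_mb = 16 * 1024
print(oracle_vm_reservation_mb(sga_mb, pga_mb))  # reservation in MB
print(huge_pages_for_sga(sga_mb))                # count of 2MB huge pages
```

With these example inputs the reservation comes to 98804MB and the SGA needs 32768 huge pages; plug in your own SGA and PGA targets before relying on either number.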

These are the key topics we identified in our situation. You might want to visit the following resources, which provide further detail and which we were looking at:

Thanks to Darryl Smith, Bart Sjerps, Sam Lucido, itzikr, Jeff Browning who provided input and insight.

8 Krypton

Re: Monster VMs performance 32vCPUs experience? beyond 16vCPUs drops?

David:

Exceptional discussion, and thanks very much for your kind comments concerning the recent VMworld presentation that Sam Lucido and I co-presented. You should know about a similar project that VMware and EMC are involved in, which was also discussed during the VMworld presentation. We are creating a workload characterization using a relatively large VM. We got to this level of VM on this test:

6 vCPUs

128 GB of vRAM

That's much smaller than what you are talking about though. Having said that, I will post as much information on that study as possible on this discussion.

Regards,

Jeff Browning

vcdxnz001
1 Copper

Re: Monster VMs performance 32vCPUs experience? beyond 16vCPUs drops?

Hi David, I'm one of the speakers from VAPP4679 at VMworld. There are a number of factors that could explain the behaviour seen when going to 32 vCPUs for that VM. Firstly, the vCPUs of the VM are not the only threads executing on the host. IO and other functions (MKS and other host threads) are executed outside of the VM's vCPUs, and these require more CPU time as you scale up. On a host with only 32 cores (4 x 8 in this case), with HT enabled there will at least be some LCPUs for the non-vCPU threads to execute on. But we all know an HT thread is not a full core; it's more like a 20% performance boost in some cases. Unless you use the PreferHT=1 option at the guest or host level, the VM will by default prefer to use pCPU cores and cross NUMA boundaries, and will not use the HT threads unless absolutely necessary. This causes non-vCPU threads to contend with the vCPUs when executing. You could experiment with the PreferHT option and see whether it improves performance or changes it at all. But really the problem is that the hardware platform chosen did not have enough physical host resources to support the VM and host threads at those load levels when the VM was configured with 32 vCPUs.

From your experiments it appears the host needed around 4 cores to process all the other threads that support the VM and host functions, so you might want to configure the VM for 24 or 28 vCPUs, or for 32 vCPUs with the PreferHT option, so that enough pCPU cores are available for the host. Remember that the best option will be divisible by the number of LCPUs on the host. Alternatively, you could configure the VM to use all LCPUs on the host and just have it contend with the other threads and VMs when it's running after hours. DRS will kick in and move the other VMs off the host to ensure it gets the performance it needs. This might ultimately provide the best performance, assuming the app can scale to this number of vCPUs. You'd need to check the impact during the normal day when the other VMs are running on the host. Some smart power management settings in the VM and on the host might also help improve the performance of the other VMs outside of this special case. The advantage of going to 64 vCPUs is that the VM will be able to make use of all threads on the host, but it could also introduce ready time across the VM when the other non-vCPU threads execute on the host. So you should validate that.

Additionally, the way you configure the vSockets and vCores can have an impact on performance of between 1% and 3%, based on our testing. Our recommendation, provided the OS supports it, is to use 1 vCore per vSocket. So if you wanted to make use of all LCPUs, this would be a 64-vSocket VM.

Also, you may recall our guidance from the Monster VM Panel that a single VM should usually be only half as big as the host, and I mentioned that in some cases it could be up to 75%, depending on individual requirements etc. This is to allow for optimal performance and also HA failover. Once you know how big your biggest VM is going to be, you can then design your host platform accordingly. So 75% on your platform would be 24 vCPUs for a single VM if you're basing your calculations solely on pCPU cores (which is my recommended starting point).
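As a quick check of the sizing guidance above, the following Python sketch computes the largest recommended single VM for a host from its pCPU core count. The function name is hypothetical, and the 50% default and 75% upper bound are simply the two figures quoted from the panel, not a general rule.

```python
def max_vm_vcpus(host_pcpu_cores, fraction=0.5):
    """Largest recommended single-VM vCPU count as a fraction of host pCPU cores."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    return int(host_pcpu_cores * fraction)


host_cores = 32  # 4 sockets x 8 cores, as on the host discussed in this thread
print(max_vm_vcpus(host_cores))        # usual guidance: half the host
print(max_vm_vcpus(host_cores, 0.75))  # upper bound mentioned for some cases
```

For the 32-core host this gives 16 vCPUs at 50% and 24 vCPUs at 75%, the latter matching the figure in the post.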

Lastly, setting the Oracle init parameter parallel_threads_per_cpu = 1 and making sure your OS uses the most efficient IO scheduler (NOOP), among other settings, will make a big difference to performance, in addition to huge pages etc. There will be tuning you can do to limit the overhead on the hosts and get better performance in the after-hours one-VM-per-host scenario, but without digging into the specific configuration of the OS and app it's not possible to make specific recommendations.
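To verify which IO scheduler a Linux guest is actually using, one option is to read the scheduler file under sysfs, where the active scheduler is shown in square brackets. The Python sketch below assumes that file format; the helper names and the device name `sda` are placeholders for illustration only.

```python
def active_io_scheduler(sysfs_text):
    """Return the scheduler marked with [brackets] in a sysfs scheduler line."""
    for token in sysfs_text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    return None


def read_io_scheduler(device="sda"):
    """Read the active scheduler for a block device (device name is an assumption)."""
    with open(f"/sys/block/{device}/queue/scheduler") as handle:
        return active_io_scheduler(handle.read())


# Example line as it might appear in /sys/block/<device>/queue/scheduler.
print(active_io_scheduler("noop deadline [cfq]"))
```

In the example line the active scheduler is cfq, so the guest would still need to be switched to NOOP to follow the recommendation above.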

I have a lot of other information you might find useful on my blog. http://longwhiteclouds.com/oracle.

8 Krypton

Re: Re: Monster VMs performance 32vCPUs experience? beyond 16vCPUs drops?

Good news! Virtual beats physical on a monster Oracle machine!

Here are the details from this week's test run, before preparing for production:

physical system to complete job: ~11h16

virtual system to complete job: ~10h21
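To put the two run times above into perspective, this small Python sketch converts them to minutes and computes the difference. The helper name and the `11h16` style parsing are assumptions made for this example; the run times are the approximate values quoted in the post.

```python
def to_minutes(duration):
    """Parse a duration written like '11h16' into minutes."""
    hours, minutes = duration.lower().split("h")
    return int(hours) * 60 + int(minutes)


physical = to_minutes("11h16")  # physical system run time
virtual = to_minutes("10h21")   # virtual system run time

saved = physical - virtual
reduction_pct = saved / physical * 100

print(f"physical: {physical} min, virtual: {virtual} min")
print(f"saved: {saved} min ({reduction_pct:.1f}% shorter run time)")
```

That is 676 versus 621 minutes, so the virtual run finished 55 minutes (about 8.1%) sooner.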

The comparison isn't 1:1, as:

  • the physical system has 2 AMD Opterons with a total of 24 cores
  • the virtual system is VMware vSphere 5.1 on 4x Intel E5-4650 with a total of 32 cores

However, the vSphere host still had some smaller workloads on it. Given that, the customer is happy with the results so far.

  • For those tests we chose to P2V the original test system.
  • The main key to the positive experience in this scenario is the VM memory reservation.
  • Here is a nice graph of the utilization: CPU ready goes down as CPU usage goes up during the run of the big data analytics job.

cpu-ready-vs-usage.jpg
