Hi David, I'm one of the speakers at VAPP4679 at VMworld. There are a number of factors that could explain the behaviour you saw when going to 32 vCPUs for that VM. Firstly, the VM's vCPUs are not the only threads executing on the host: IO and other functions (MKS and other host threads) execute outside of the VM's vCPUs, and these need CPU time too when you scale up. On a host with only 32 cores (4 x 8 in this case), with HT enabled there will at least be some LCPUs for the non-vCPU threads to execute on. But as we all know, an HT thread is not a full core; it's roughly a 20% performance boost in some cases. Unless you use the PreferHT=1 option at the guest or host level, the VM will by default prefer to use pCPU cores and cross NUMA boundaries rather than use the HT threads unless absolutely necessary. This causes the non-vCPU threads to contend with the vCPUs when executing. You could experiment with the PreferHT option and see whether it improves performance, or whether it changes performance at all. But really the problem is that the chosen hardware platform did not have enough physical resources to support both the VM and the host threads at those load levels once the VM was configured with 32 vCPUs.
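If you want to experiment with PreferHT, it can be set per VM as an advanced configuration parameter or host-wide as an advanced host attribute. A minimal sketch (parameter names as documented by VMware; verify them against your ESXi version before use):

```
# Per-VM: add to the .vmx file, or via Edit Settings > Options >
# Advanced > Configuration Parameters
numa.vcpu.preferHT = "TRUE"

# Host-wide: advanced host attribute (affects all VMs on the host)
Numa.PreferHT = 1
```

The per-VM setting is usually preferable here, since it only changes behaviour for the one monster VM rather than everything on the host.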
From your experiments it appears the host needed around 4 cores to process all the other threads supporting the VM and host functions, so you might configure the VM for 24 or 28 vCPUs, or for 32 vCPUs with the PreferHT option, so that there are enough pCPU cores left available for the host. Remember that the best option will be one that divides evenly into the number of LCPUs on the host. Alternatively you could configure the VM to use all LCPUs on the host and just let it contend with the other threads and VMs when it runs after hours; DRS will kick in and move the other VMs off the host to ensure it gets the performance it needs. This might ultimately provide the best performance, assuming the app can scale to that number of vCPUs, but you'd need to check the impact during the normal day when the other VMs are running on the host. Some smart power management settings in the VM and on the host might also help improve the performance of the other VMs outside of this special case. The advantage of going to 64 vCPUs is that the VM will be able to make use of all threads on the host, but it could also introduce ready time across the VM when the other non-vCPU threads execute on the host, so you should validate that.
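To make the arithmetic behind those candidate sizes explicit, here is a small sketch. The host figures (4 x 8 = 32 cores, 64 LCPUs with HT) come from the discussion above; the ~4 cores of host/IO overhead is an assumption based on your experiments:

```python
# Rough sizing sketch for the host described above.
HOST_CORES = 32           # 4 sockets x 8 cores
HOST_LCPUS = 64           # with HT enabled
HOST_OVERHEAD_CORES = 4   # assumption: host/IO threads observed in the experiments

# Leave room for the host threads when sizing on physical cores only.
max_vcpus_on_cores = HOST_CORES - HOST_OVERHEAD_CORES  # 28

# Candidate sizes: 24 or 28 vCPUs (cores only), 32 with PreferHT,
# or 64 to consume every logical CPU on the host.
candidates = [24, 28, 32, 64]
for v in candidates:
    if v <= max_vcpus_on_cores:
        print(f"{v} vCPUs: fits on pCPU cores with headroom for host threads")
    else:
        print(f"{v} vCPUs: needs PreferHT or HT threads to leave room for the host")
```

Treat the overhead figure as something to measure on your own platform rather than a fixed rule.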
Additionally, the way you configure the vSockets and vCores can have an impact on performance; in our testing it was between 1% and 3%. Our recommendation, provided the OS supports it, is to use 1 vCore per vSocket. So if you wanted to make use of all LCPUs, this would be a 64 vSocket VM.
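In vmx terms, 1 vCore per vSocket on a 64 vCPU VM would look like the following (a sketch only; `cpuid.coresPerSocket` is the standard vmx parameter, but confirm against your vSphere version and normally set it via the vSphere client rather than hand-editing):

```
numvcpus = "64"
cpuid.coresPerSocket = "1"   # 1 core per virtual socket -> 64 vSockets
```

Leaving coresPerSocket at 1 also lets the NUMA scheduler place the vCPUs without being constrained by an artificial virtual socket topology.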
Also, you may recall our guidance from the Monster VM Panel that a single VM should usually be no more than half the size of the host, and I mentioned that in some cases it could be up to 75% depending on individual requirements. This allows for optimal performance and also HA failover. Once you know how big your biggest VM is going to be, you can then design your host platform accordingly. So 75% on your platform would be 24 vCPUs for a single VM if you're basing your calculations solely on pCPU cores (which is my recommended starting point).
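The rule of thumb works out like this on your platform (the 50% and 75% figures are the panel guidance above, applied to the 32 pCPU cores already discussed):

```python
# Monster VM sizing rule of thumb, applied to this host.
host_cores = 32  # 4 sockets x 8 cores, counting pCPU cores only

default_max = int(host_cores * 0.50)  # usual guidance: half the host
special_max = int(host_cores * 0.75)  # special cases: up to 75%

print(f"default max: {default_max} vCPUs, special-case max: {special_max} vCPUs")
```

Which gives 16 vCPUs as the default ceiling and 24 vCPUs at the 75% mark, matching the number above.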
Lastly, setting the Oracle init parameter parallel_threads_per_cpu = 1 and making sure your OS uses the most efficient IO scheduler (NOOP), among other settings, will make a big difference to performance, in addition to huge pages etc. There will be tuning you can do to limit the overhead on the hosts to get better performance in the after-hours 1-VM-to-host scenario, but without digging into the specific configuration of the OS and app it's not possible to make specific recommendations.
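As a sketch of those two settings (the device name `sda` is just an example; check which block devices and schedulers your kernel actually exposes, and test on a non-production system first):

```shell
# Linux: switch the IO scheduler to noop for an Oracle data disk
echo noop > /sys/block/sda/queue/scheduler
cat /sys/block/sda/queue/scheduler   # the active scheduler appears in [brackets]

# Oracle: limit parallel threads per CPU (in SQL*Plus, takes effect on restart)
# SQL> ALTER SYSTEM SET parallel_threads_per_cpu=1 SCOPE=SPFILE;
```

You can also set `elevator=noop` on the kernel boot line to make the scheduler change persistent across reboots.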
I have a lot of other information you might find useful on my blog. http://longwhiteclouds.com/oracle.