
Unsolved


December 14th, 2017 12:00

ScaleIO 2.0.1.3 on VMWare - abysmal write performance while uploading to datastore

Hello, guys!

We are experiencing some rather weird behavior in our ScaleIO deployment. First, let me describe the setup:

6 Dell FX2 blocks with two FC630 and FD332 nodes each.

Every node is connected by four 10G links to an MLAG group of two Arista DCS-7150S-52-CL switches. Two of the uplinks are used for VM data and two for ScaleIO.

In every FX2 block, one FC630 node runs ESXi 6.5U1 and the other runs Windows 2012R2 Core with Hyper-V. Both contribute to ScaleIO storage, and both are in the same (and only) PD.

In the FD332 storage nodes we are using 10 Dell 1.64 TB SFF SAS spinning drives with two 800 GB Mixed Use SSDs for rfcache.

The system runs quite smoothly, with satisfactory performance:

[Screenshot: Capture.PNG.png]

Yet there is a problem, and it's quite bad.

When uploading a file to the datastore or replicating a VM to the VMware part of the cluster, the speed never exceeds 16 MB/s per session; i.e. if I copy two files simultaneously, the aggregate speed is about 31 MB/s. As you might guess, when moving VMs with disks over 1 TB, such speed becomes a pain. Restoring from a backup (we use Veeam B&R) also hits this limit. All of this applies only to the VMware part of the cluster: with Hyper-V we easily exceed 120 MB/s for the same activities. Once again, the ScaleIO cluster is the same for VMware and Hyper-V. The hardware is also identical, down to each and every P/N.

The most interesting part, however, is that Storage vMotion to the ScaleIO cluster, from the same source cluster we have been copying the VMs from with Veeam, works fine!
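To put those rates in perspective, here is a rough back-of-the-envelope calculation. It is only a sketch: it assumes decimal units (1 TB = 1,000,000 MB) and a perfectly steady transfer rate, and the function name `transfer_hours` is made up for illustration.

```python
def transfer_hours(size_tb: float, rate_mb_s: float) -> float:
    """Hours needed to move size_tb terabytes at a steady rate_mb_s MB/s."""
    size_mb = size_tb * 1_000_000  # decimal units: 1 TB = 1,000,000 MB
    return size_mb / rate_mb_s / 3600


# Rates taken from the post: 16 MB/s per session on VMware, 120 MB/s on Hyper-V.
for label, rate in (("VMware datastore upload", 16), ("Hyper-V", 120)):
    print(f"{label}: {transfer_hours(1, rate):.1f} h per TB")
# VMware datastore upload: 17.4 h per TB
# Hyper-V: 2.3 h per TB
```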

Here is what we have checked:

  • Everything network-related: MLAGs, Arista configs, etc.
  • ESXi configuration: LUN depth, vSwitch configs, SVM configuration, etc.
  • We tried to avoid using vmk for Veeam traffic by using two proxy VMs (guest OS is Windows 2016): one in the source ESXi cluster (not ScaleIO) and one in the destination cluster (with ScaleIO). In this setup Veeam attaches the snapshot directly to the proxy VM on both source and destination, so traffic never goes through vmk and the destination proxy VM writes directly to the ScaleIO LUN. In this setup we see "Highest active time: 100%" in Windows Resource Monitor.

So, any help would be greatly appreciated as we are really stuck with this situation.

January 30th, 2018 23:00

Hi,

Apologies for the late response. Could you please open a service request so that we can look into this further?

Please provide the get_info from the master MDM. We may need to engage Engineering, so a formal service request will be required.

Thanks for your help.


February 14th, 2018 04:00

We have created SR 10142536 and uploaded the dump there.


February 15th, 2018 10:00

I'm preparing to do a deployment soon. I've had successful deployments in the past, but you never know what you may run into in the future. Can you keep us posted on your findings?

Also, are you using snap & grafana for monitoring? How is that working out?

Thanks,


February 15th, 2018 15:00

Oh, and by the way, we have investigated further and were able to isolate the issue.

It only happens when write IO goes to unallocated space inside the VMFS datastore, that is, when we use a thin-provisioned vmdk or do a restore or a replication with Veeam. (By design, such operations create a snapshot of the target vmdk and write to it. The snapshot is essentially empty, so write IO to it triggers our issue.)

We have also opened an SR with Veeam. They ran some tests on our system using VMware's vixlib as well as their own software and came to the conclusion that the issue is with the storage:
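One rough way to check for this kind of first-write penalty from inside a guest is to time a first pass over a new file against a second pass that overwrites the same file in place. This is only a sketch under assumptions: the helper names and the test file path are hypothetical, and the first pass only lands on unallocated vmdk space if that part of the guest volume has never been written before.

```python
import os
import time


def timed_write(path: str, mode: str, total_mb: int, block_mb: int = 1) -> float:
    """Write total_mb MiB to path in block_mb chunks; return throughput in MiB/s."""
    block = os.urandom(block_mb * 1024 * 1024)
    start = time.perf_counter()
    with open(path, mode) as f:
        for _ in range(total_mb // block_mb):
            f.write(block)
        f.flush()
        os.fsync(f.fileno())  # make sure data reaches the disk before stopping the clock
    return total_mb / (time.perf_counter() - start)


def compare_first_write_vs_rewrite(path: str, total_mb: int = 1024):
    """Return (first_pass, rewrite) throughput in MiB/s for the same file."""
    first = timed_write(path, "wb", total_mb)    # creates the file, allocates new blocks
    again = timed_write(path, "r+b", total_mb)   # overwrites in place, no truncation
    return first, again
```

Calling `compare_first_write_vs_rewrite` with a path on the affected volume and comparing the two numbers gives a quick indication; a large gap between the first pass and the rewrite would be consistent with the behavior described above, though guest-side caching and filesystem allocation also affect the result.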

    99 - Target Write Busy % AKA “Target”


February 16th, 2018 00:00

Sure, I will report any updates on the case here.

As for the monitoring setup, we are using ScaleIO exporter + Prometheus + Grafana.


March 12th, 2018 12:00

Interesting. I need to test Avamar restores and see whether the same issue happens. However, the ReadyNodes we are using came with 2.5, so it may have been addressed by now.
