November 21st, 2016 03:00

ScaleIO 4k qd1 performance in Windows 2012 R2

Hello,

We are currently evaluating EMC ScaleIO 2.x on Microsoft Windows Server 2012 R2.

We are testing a 3-node cluster; each node has 4x Samsung PM863 512GB SSDs. The nodes have recent Intel Xeon CPUs and 64GB of DDR4 memory (quad-channel).

Each node has 5x Intel 1Gbit network connections (5Gbit total), used for both SDS and SDC traffic; we are using ScaleIO's load balancing.

(I know the networking could be better, e.g. 10Gbit, but this should work fine for a small environment. We are happy with the sequential performance as it is and don't need more.)

When measuring performance with different tools (CrystalDiskMark / AS SSD Benchmark / IOMeter) we see extremely slow random 4k reads and writes at queue depth 1, while everything else looks fine.

Here are our benchmark results:

Sequential read (QD32): 788 MB/s

Sequential write (QD32): 560.9 MB/s

Sequential read (QD1): 217.9 MB/s

Sequential write (QD1): 220.2 MB/s

Random read (QD32): 70,000 IOPS

Random write (QD32): 45,000 IOPS

Random read (QD1): 190 IOPS

Random write (QD1): 1,500 IOPS

To us the random 4k I/O at QD1 seems extremely slow, while random performance at queue depth 32 is fine (it peaks at 80k IOPS, which seems impressive to me). 190 IOPS, however, is what one would expect from an old mechanical hard disk!
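
We measured this with CrystalDiskMark / AS SSD / IOMeter, but for anyone who wants a scriptable repro of the problematic case, a roughly equivalent diskspd command would look something like this (the test file path and size are placeholders, not what we actually ran):

diskspd.exe -c10G -b4K -d60 -o1 -t1 -r -w0 -Sh -L D:\iotest.dat

(-o1 -t1 gives queue depth 1 with a single thread, -r -w0 makes it a pure random read test, -Sh disables caching and -L reports latency.)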

When testing the drives standalone the performance looks fine, but when testing a ScaleIO volume on a Windows node we see these results.

We already tried the following:

- Disable all power saving features

- Change the performance profile in ScaleIO for both SDS and SDC (note: when the performance profiles are set to high, performance is actually lower, which is odd!)

- Disable a node

Any suggestions would be more than welcome! Thanks in advance and best regards,

Paul

14 Posts

November 22nd, 2016 06:00

Hi Pawel,

Thanks again for your response, that was a little bit silly of me. I could execute the command now, however I don't see any improvement.

I set the performance profile to High and then executed the command, and it reported success (I take it it changes this parameter of the High profile in this case). I ran another test and there was no change in the results.

I hope I did it alright.

@Coredump, thanks for this tip! I must say I already looked into it and thought it wouldn't really help, since my SSD, when tested locally, is already much faster. So I assumed the bottleneck was not the SSD but the network. However, I understand your point and will test it in any case! I will let you know the results.

Thanks all again.

306 Posts

November 21st, 2016 05:00

Hi Paul,

Can you please let me know the following:

1. How many threads do you use for testing? single vs. many threads?

2. Did you try any other IO size than 4k?

3. Is there any particular reason you want to keep the queue depth at 1?

4. What are the exact versions of Windows and ScaleIO you are testing?

Many thanks,

Pawel

14 Posts

November 21st, 2016 06:00

Hi Pawel,

Thanks a lot for your quick response. I am happy to answer your questions and also to let you know we have already made a massive improvement.

1. I think by threads you mean the queue depth? We tested both QD=1 and QD=32.

2. Yes, we tried sequential writes with 512k, no problems there.

3. Not really, but some applications just use QD=1, I think (we run various virtual servers such as Exchange, SQL, domain controllers and Remote Desktop servers).

4. Windows Server 2012 R2 Datacenter

Now about the improvement we made: I read the ScaleIO performance tuning guide and made the following registry changes on 2 of the 3 nodes (the 3rd node is currently in "production" and will be updated tonight); a PowerShell sketch of applying them follows the list:

1. Key: HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters\Interfaces\ (one subkey per network interface)

2. Entries: TcpAckFrequency, TcpNoDelay

3. Value type: REG_DWORD, number

4. Value to disable delayed ACK / Nagle: 1
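
For anyone who wants to script this instead of editing the registry by hand, here is a minimal PowerShell sketch (my own, not from the tuning guide; it assumes you want the values on every TCP/IP interface, must be run elevated, and needs a reboot to take effect):

$base = 'HKLM:\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters\Interfaces'
Get-ChildItem $base | ForEach-Object {
    # 1 = disable delayed ACK / disable Nagle on this interface
    New-ItemProperty -Path $_.PSPath -Name 'TcpAckFrequency' -PropertyType DWord -Value 1 -Force | Out-Null
    New-ItemProperty -Path $_.PSPath -Name 'TcpNoDelay' -PropertyType DWord -Value 1 -Force | Out-Null
}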

The benchmark results are now as follows:

Random read (QD1): 777 IOPS

Random write (QD1): 1,600 IOPS

Looking at the read performance, we can see a massive improvement. The system (the virtual servers) also runs MUCH faster now. So it seems obvious the performance drop comes from the Ethernet network, which needed tuning.

However, 777 IOPS is still relatively slow compared to the SSD performance without ScaleIO. A local SSD without ScaleIO achieves around 7,000 IOPS of 4k read performance instead of 777 IOPS.

My guess is that Ethernet latency is the main cause of these low numbers, but it seems there is still quite a lot of room for improvement. (At QD1, 7,000 IOPS corresponds to roughly 0.14 ms per I/O, while 777 IOPS corresponds to roughly 1.3 ms, so the network and software stack are adding on the order of a millisecond per request.)

Are there any other performance improvements to be made?

Thanks, Paul.

68 Posts

November 21st, 2016 22:00

Hello Paul,

4k random QD1 read/write throughput is highly related to network latency, which is why you see improvements from disabling TCP delayed ACK and the Nagle algorithm. There are many network card parameters that can influence this specific workload. For example, interrupt moderation (or coalescing), which is useful for reducing CPU load, also increases network latency: it helps with high-throughput workloads (sequential ones, for example) but leads to a performance loss with 4k random QD1 workloads. Disabling it can help you achieve better results with 4k random QD1. In general, all parameters that raise latency (TSO, LRO, TCO and so on) have a negative impact on that kind of workload. 10 GbE also helps because of its lower latency, as do switches with very fast backplanes. Infiniband hardware is probably ideal, but I have only read a few posts about it on the forum and did not find a case history. Remember, though, that 4k random QD1 can be considered a synthetic workload.
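
If you want to experiment with interrupt moderation on Windows, something along these lines should show and disable it (the adapter name pattern is just a placeholder for your ScaleIO NICs, and the exact display name can vary by driver):

# Check the current setting on the ScaleIO-facing NICs
Get-NetAdapterAdvancedProperty -Name 'SIO-NIC*' -DisplayName 'Interrupt Moderation'
# Disable it to trade some extra CPU load for lower latency on 4k random QD1 workloads
Set-NetAdapterAdvancedProperty -Name 'SIO-NIC*' -DisplayName 'Interrupt Moderation' -DisplayValue 'Disabled'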

We bought ScaleIO for our ISP infrastructure. We have three SIO nodes in production with 1920GB Samsung SM863 drives. We keep a larger over-provisioning area on our SSDs to get better performance and to extend their life. The 2TB Samsung SM863 comes with roughly 7% factory-configured over-provisioning: the real size is 2048GB but only 1920GB is usable. If you partition the disk so that a 28% over-provisioning area stays free (and the free area is trimmed), you gain performance because the disk has a bigger area for garbage collection. You won't see a difference when the disk is new and completely empty, but when it is full the increased over-provisioning area helps. Samsung states that you can get 29k IOPS at 4k random writes QD32 with the 2TB SM863; that figure holds in the worst-case scenario, and indeed I get 29k IOPS when the disk is full with 7% over-provisioning. With a 28% over-provisioning area I get 50k IOPS at 4k random writes QD32 even when the disk is full. Obviously only 1600GB is then available on each disk, and 320GB is left free and must be trimmed.
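
As an illustration only (disk number and size are placeholders for a freshly secure-erased 2TB SM863, not a recipe), on Windows you could leave the extra over-provisioning area unallocated like this:

# Create a 1600GB partition and leave the remaining space unpartitioned (and trimmed) as extra over-provisioning
New-Partition -DiskNumber 2 -Size 1600GB -AssignDriveLetter | Format-Volume -FileSystem NTFS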

Our infrastructure has 3 Samsung 2TB SM863 drives in every node. Every node is connected with two 10 GbE interfaces to two different Dell Force10 S4810 switches.

Our storage is disaggregated (not hyperconverged): ScaleIO MDM and SDS are installed on three Ubuntu Linux 16.04 servers. The SDC role is installed on four Dell M620 blade servers that host roughly 60 VMs.

The VMs hosted on the 4 blades are a mix of Linux and Windows: roughly 20 VMs are Windows Server 2012 R2 and the remaining 40 are Linux. We sell Exchange as a service, for example, so we have an Exchange DAG (MBOX+CAS roles) replicated across different VMs, redundant domain controllers and a witness server. We also have several VMs that run MSSQL Server with heavily loaded databases, plus domain controllers for different domains. On Linux we have some Postfix mail servers, a Dovecot server, a big NFS server and some LAMP servers, and we make wide use of MySQL, Postgres, Mongo and InfluxDB as databases.

ScaleIO has been in production for 1 month and we are very satisfied with the results: it is lightning fast.

Before this we had a ZFS-based solution with a 24-disk SAS array, two FusionIO cards as a replicated ZIL and 3 Samsung SV843 drives as L2ARC, but ScaleIO is from another planet. Storage performance won't be a problem anymore: we can scale vertically by adding disks to the 3 nodes or horizontally by adding new SIO nodes. We will probably add 2 new nodes in order to switch the SIO cluster to 5-node mode in the future, deploying the SDS role on the new nodes as well and adding 3 disks to every new node; after that we will probably scale vertically a bit.

Kind regards,

Davide

14 Posts

November 22nd, 2016 03:00

Hi Davide,

Thanks a lot for your detailed reply! And yes, you must be right that random 4k QD1 performance depends heavily on network latency; it is exactly what we are seeing right now.

I think it's important to keep that type of I/O optimized since it's something an end user will really "feel" when working with the system (RDS / Citrix). We temporarily put some production servers on the ScaleIO infrastructure to test the end-user experience, and before our network changes users were reporting slow responses in the system. However, once we changed those settings, users reported a lightning-fast system as well (and yes, it really does "feel" fast now)!

I was already looking into other improvements on the networking side, like receive and send buffers; you just mentioned some others as well, thanks for that, and we will look into them.

We can currently pull off 1k IOPS of 4k random QD1 reads through the network, while 7k IOPS is achievable when the disk is tested locally, so 1k including the relatively slow network is starting to look impressive to me.

When we are done tuning we will probably end up with even more 4k QD1 performance.

So our question is actually answered now and we can consider this solved. ScaleIO is a great product.

There is one strange thing left which I don't understand: when we set the performance profile to "high", we end up with lower performance instead of higher! I thought this performance profile was made for systems able to pull off something like 80k IOPS at QD32, and our system can quite easily do that, as we have seen. Still, setting the performance profile to high LOWERS our performance by quite a bit. It is mainly the sequential performance that drops; random performance stays more or less the same (or even improves a bit).

So it seems the performance profiles are tuned for random I/O rather than sequential I/O.

Isn't it possible to optimize both? Here are some benchmark results:

Sequential read (QD32): (default profile) 863 MB/s  (high profile) 531 MB/s

Sequential write (QD32): (default profile) 563 MB/s  (high profile) 315 MB/s

*note: there is quite a massive decrease in sequential QD32 performance when the SDS profile is set to high

Sequential read (QD1): (default profile) 302 MB/s  (high profile) 225 MB/s

Sequential write (QD1): (default profile) 238 MB/s  (high profile) 168 MB/s

*note: there is quite a massive decrease in sequential QD1 performance when the SDS profile is set to high

Random read (QD32): (default profile) 63,000 IOPS  (high profile) 67,000 IOPS

Random write (QD32): (default profile) 45,000 IOPS  (high profile) 46,000 IOPS

Random read (QD1): (default profile) 925 IOPS  (high profile) 1,000 IOPS

Random write (QD1): (default profile) 1,800 IOPS  (high profile) 1,700 IOPS

My conclusion from these benchmarks is that I am better off running the default profile, since the increase in random I/O on the high profile is minimal and the loss in sequential I/O is quite massive. I'm not even sure the random I/O actually improves; the numbers always differ a bit between benchmark runs.

I still find it weird that generally the high performance profile gives lower performance.

In any case, our test system is running great now and we are happy with this product.

Best regards,

Paul

14 Posts

November 22nd, 2016 04:00

Thanks Pawel, I looked into the release notes and this could well be it!

However, I cannot use the CLI. I tried it from all nodes; on 1 node I get this error:

Error: MDM failed command.  Status: Invalid session. Please login and try again.

On the other 2 nodes I get this error:

Error: Failed to connect to MDM 127.0.0.1:6611

Any suggestion? Thanks in advance. Paul

306 Posts

November 22nd, 2016 04:00

Hi Paul,

I see that you mostly fixed the problem yourself and Davide already came in with a great explanation, so thanks for that :-)

By threads, I actually meant the number of threads in the test process - I think it's called 'numjobs' in FIO. In general, ScaleIO behaves much better with multiple threads. Would you mind giving it another try - with the same test parameters - but increasing the number of threads (jobs) in FIO?

We quite often see a similar testing problem here: many customers use the basic 'dd' tool to test ScaleIO volume speed, and sometimes the results are not great. That's because 'dd' is a single-threaded application and simply cannot put enough load on the device, so the results are much more impressive if you try with 4, 8 or even 16 threads with FIO or any other specialized I/O traffic generator.
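
For example, something along these lines should work with FIO's Windows build (the volume path, run time and thread count are just placeholders):

# 4k random reads, QD1 per thread, but 8 parallel jobs to keep the volume busy
fio --name=4k-randread --filename=\\.\D: --ioengine=windowsaio --direct=1 --rw=randread --bs=4k --iodepth=1 --numjobs=8 --runtime=60 --time_based --group_reporting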

Regarding the high performance profile results, I am wondering whether you are experiencing the problem described in the ScaleIO Release Notes under SCI-17001; could you please try changing the number of sds_sds sockets to 4?

scli --set_performance_parameters --tech --all_sds --sds_number_sockets_per_sds_ip 4

and re-test?

Thank you,

Pawel

306 Posts

November 22nd, 2016 05:00

Hi Paul,

On the primary MDM - where you will execute all the scli commands - you need to log into the ScaleIO system first:

scli --login --username admin --password

and then you can run all the commands.

Thanks!

Pawel

68 Posts

November 22nd, 2016 06:00

Hi Paul,

I forgot to mention another test you can do to improve 4k random performance (reads only).

I know that the EMC fine-tuning guide suggests using the XCACHE module (RAM cache) only with mechanical disk arrays, probably because an all-flash array made of fast SSDs can be faster than memory for read workloads. But I think that with 4k random reads at QD1 it can help a lot, because RAM latency is lower than disk latency, so you can get an improvement by using a RAM read cache. I also think that with smaller disk arrays, offloading some reads to memory can help because the SSDs then handle mostly writes and only a smaller amount of read IOPS (the EMC-optimized LRU algorithm is very efficient). I think it could be an interesting test in your scenario; I hope you have some RAM available to use for this benchmark. Please remember that if you have more than one processor on your SDS, only a percentage (50% or 75%, depending on memory size) of the NUMA node 0 RAM can be used as cache.

Obviously, to benchmark 4k random QD1 reads and see the benefit of this setting, you have to be sure that the content you are accessing is cached in RAM.

Kind regards,

Davide

68 Posts

November 22nd, 2016 07:00

Hi Paul,

Since performance in this specific scenario depends on the total latency you have from the SDC to the SDS, and the total latency is a function of network latency + storage latency, it is possible (though not certain) that lowering the data-access latency will help increase performance. That's not guaranteed, but let me know if you test this configuration.

Davide

306 Posts

November 22nd, 2016 08:00

Hi Paul,

BTW, is Windows a must? Since you are still evaluating, would you consider another OS, like Linux or ESXi?

Best,

Pawel

14 Posts

November 23rd, 2016 04:00

Hi Pawel,

Yes, we are still evaluating and can still change the underlying system. However, we prefer Microsoft since we plan to run a hyperconverged Hyper-V cluster. But if Linux gives much better performance in this scenario, it might be interesting!

But as said, the performance actually looks more than acceptable, so for now we are happy to keep working on the optimizations.

Is Linux (much) faster for ScaleIO than Windows? I take it it's faster for sure, but is it by that much?

Thanks, Paul

14 Posts

November 23rd, 2016 04:00

Hi Davide,

We enabled the RAM read cache: we changed the default RAM cache size from 128MB to 10GB (10240MB) and enabled it for the test volume.

And yes, we can see an improvement for sure! Especially the sequential QD1 workload increased A LOT (from 200 MB/s to 320 MB/s).

The 4k QD1 workload also improved a little: I'm seeing 1k IOPS now instead of 700 IOPS (compared to 7k IOPS when testing an SSD locally without ScaleIO).

The weird thing, however, is that I don't see any higher memory usage on the nodes, and the RAM cache hit rate is 0%.

Maybe the system needs some time to identify the "hot" data blocks before it really starts working, which would make sense.

We are pretty sure that ScaleIO is the right system for our customer! Performance is already really good and we will consider upgrading the network (maybe InfiniBand).

Let me know if you have any more suggestions, that would be great. Otherwise, thanks a lot for the tips; you were right all the way about the latency. The different latencies do indeed stack, it seems.

Best regards,

Paul

306 Posts

November 23rd, 2016 06:00

Hi Paul,

I'm afraid I don't have any exact figures on this matter - that's why I asked whether you could try it :-)

But if you want to use Hyper-V, then it's out of the question anyway.

Anyway, I am happy to hear that performance is good. Let us know if you have any other questions!

Best,

Pawel

110 Posts

November 23rd, 2016 07:00

My understanding is there should be very little performance difference, if any, between Linux and Windows deployments.
