
ScaleIO 4k qd1 performance in Windows 2012 R2

November 21st, 2016 03:00

Hello,

We are currently evaluating EMC ScaleIO 2.x on Microsoft Windows Server 2012 R2.

We are testing in a 3-node cluster; each node has 4x Samsung PM863 512GB SSDs. The nodes have the latest Intel Xeon CPUs and 64 GB of DDR4 memory (quad-channel).

Each node has 5x Intel 1 Gbit network connections (5 Gbit total), which are used for both SDS and SDC traffic; we are using ScaleIO load balancing.

(I know the networking could be better, e.g. 10 Gbit, but I think this should work fine for a small environment. We are happy with the sequential performance as it is; we don't need more.)

When doing performance measurements with different tools (CrystalDiskMark / AS SSD Benchmark / IOMeter), we are seeing extremely slow random 4k reads and writes at queue depth 1, while everything else seems fine.

Here are our benchmark results:

Sequential read (QD32): 788 MB/s

Sequential write (QD32): 560.9 MB/s

Sequential read (QD1): 217.9 MB/s

Sequential write (QD1): 220.2 MB/s

Random read (QD32): 70,000 IOPS

Random write (QD32): 45,000 IOPS

Random read (QD1): 190 IOPS

Random write (QD1): 1,500 IOPS

To us the random 4k IO at QD1 seems extremely slow, while random performance at queue depth 32 is fine (it peaks at 80k IOPS, which seems impressive to me). 190 IOPS, however, is what one would expect from an old hard disk with mechanical components!

When testing the drives standalone, performance seems fine, but when testing a ScaleIO volume on a Windows node we see the numbers above.

We already tried the following:

- Disable all power saving features

- Change the performance profile in ScaleIO for both SDS and SDC (note: when the performance profiles are set to high, performance is actually lower, which is strange!)

- Disable a node

Any suggestions would be more than welcome! Thanks in advance and best regards,

Paul

14 Posts

November 23rd, 2016 12:00

Hi Davide,

I found that RAM Cache needed to be enabled at the lowest level as well, using the CLI (scli --set_rmcache_usage --protection_domain_name default --storage_pool_name defaultSP --use_rmcache).

So RAM Cache needs to be enabled for each SDS, enabled for the storage pool (using the CLI), and last but not least enabled on the specific volumes.
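For completeness, this is roughly the full sequence of CLI commands involved; the storage pool command is the exact one quoted above, while the SDS and volume variants are written from memory, so double-check the names against "scli --help" for your version:

scli --enable_sds_rmcache --sds_name <SDS name>
scli --set_rmcache_usage --protection_domain_name default --storage_pool_name defaultSP --use_rmcache
scli --set_volume_rmcache_usage --volume_name <volume name> --use_rmcache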

We also increased the NIC send and receive buffers to the maximum possible (2048).
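(Side note: on the Intel NICs the buffers can also be set from PowerShell instead of the driver GUI; the DisplayName strings below are the ones Intel drivers typically expose, so list them first to be sure:)

Get-NetAdapterAdvancedProperty -Name "NIC1" | Where-Object { $_.DisplayName -like "*Buffers*" }
Set-NetAdapterAdvancedProperty -Name "NIC1" -DisplayName "Receive Buffers"  -DisplayValue "2048"
Set-NetAdapterAdvancedProperty -Name "NIC1" -DisplayName "Transmit Buffers" -DisplayValue "2048"
# "NIC1" is a placeholder adapter name; repeat for every port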

It works great! We have 1.4k random IOPS QD1 now! Getting impressive...

A short list of improvements we made so far:

(everything is 4k random IO QD1)

Default config: 112 IOPS

TcpAckFrequency and TcpNoDelay set in registry (see the sketch below): 770 IOPS

Updated NIC buffers and enabled RAM cache: 1400 IOPS
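For anyone following along, the TcpAckFrequency/TcpNoDelay change referenced above is the usual per-interface registry tweak, roughly like the sketch below (apply it to the interfaces carrying ScaleIO traffic and reboot, or at least restart the NICs, afterwards):

# Disable delayed ACK (TcpAckFrequency=1) and Nagle (TcpNoDelay=1) on every TCP/IP interface
$base = "HKLM:\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters\Interfaces"
Get-ChildItem $base | ForEach-Object {
    New-ItemProperty -Path $_.PSPath -Name "TcpAckFrequency" -Value 1 -PropertyType DWord -Force | Out-Null
    New-ItemProperty -Path $_.PSPath -Name "TcpNoDelay"      -Value 1 -PropertyType DWord -Force | Out-Null
}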

A local disk gets around 7,000 IOPS, so we are getting closer and closer! I have a feeling even more performance is possible, and we keep finding improvements. I will keep this thread updated with everything we find; it might help someone else!

Thanks again all. Best regards, Paul

68 Posts

November 23rd, 2016 20:00

Hello Paul,

I read your post today but I couldn't reply immediately. RAM read cache has to be enabled at different levels:

- StoragePool

- SDS

- Volume

I imagine that you can now see the hits in RAM cache; could you confirm?

Raising the TX and RX buffers helps a lot. If you want to optimize this specific workload, you can try disabling interrupt coalescing on all the network cards installed on your ScaleIO nodes, including the consumers (SDC). This helps because, in order to moderate interrupts, frames are buffered and then a single IRQ is issued for multiple frames; that buffering adds latency. I can't say exactly how much improvement you can get by changing this parameter on a 1 GbE network. I don't recommend disabling coalescing completely, because the higher interrupt rate can raise CPU load; some network card drivers let you change the coalescing rate, which by default is normally adaptive (regulated by the driver).

I imagine you are using RSS to distribute IRQs over multiple cores. How many RSS queues are you using? Leave RSS enabled; it is good, but using too many queues, even when you have a lot of cores, can sometimes cause a drop in performance. I don't know how many cores you have on the servers, but if, for example, the network driver is configured to use 6 or 8 RSS queues, you can try limiting this setting to 4. With some network cards I noticed improvements using 4 queues instead of 6 or 8 (the driver default).
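If you want to change the queue count without clicking through the driver GUI, something like this should work from an elevated PowerShell prompt (the adapter name is just a placeholder):

Set-NetAdapterRss -Name "NIC1" -NumberOfReceiveQueues 4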

@osaddict: I didn't notice substantial differences when comparing ScaleIO performance on Windows and Linux. A better driver implementation for a specific hardware component might lead to different results between the two operating systems, but with my hardware there is no proven difference (I tested both). Since I work with Hyper-V but my storage is disaggregated (dedicated servers for ScaleIO), I chose Linux because I had a better feeling during complete power outage simulations. I wrote on the forum about that specific simulation here: ScaleIO - SDS on Windows 2012 R2 - Problem after complete power outage simulation

The problem I faced was not related to ScaleIO but to bad Windows behaviour.

Paul, thanks a lot for keeping us updated. Performance tuning is an art.

Kind Regards,

Davide

14 Posts

November 24th, 2016 03:00

Hi Davide,

It's an art indeed, and I love it! Yes, I can confirm the cache works great now; most IO seems to hit the cache and the improvements are clearly visible. We are still testing, but once we go into production we will increase the cache memory even more (to around 20 GB per node).

This weekend we will look into more network-related settings like you mentioned, and keep this thread updated!

Kind regards, Paul

14 Posts

November 24th, 2016 05:00

Hi Davide,

By the way, we have a maximum of 2 RSS queues configured on the Intel NICs (the default; the options are 1, 2, 4 and 8). We will try 4 this weekend. The servers have more than enough CPU cores available (however, using too many of them can cause latency, if I understand you correctly).

We can only ENABLE or DISABLE interrupt moderation; there is nothing in between. Oh wait, I found another setting that lets us change the interrupt moderation rate using the following values:

- Minimal

- Low

- Medium

- High

- Extreme

- Adaptive (default)

Which value would you advise? Of course we can test it ourselves, but we have 5 NIC ports per server, so it will take a while to test them all.

We have DMA coalescing set to DISABLED in the performance options (the default).

Thanks, Paul

68 Posts

November 24th, 2016 20:00

Hi Paul,

I am happy to hear that you are improving the performance of your infrastructure day by day.

I have some suggestions for you. To optimize latency there are a lot of things you can try and check. The first thing I suggest is to go into the BIOS of your servers, disable all C-states in the processor settings, and disable all power optimization features (set the profiles to "High Performance" everywhere you can, avoiding every type of power saving).

I noticed a huge boost after optimizing the BIOS settings, especially under heavy load.

So this is a summary of the BIOS settings you can check:

- Processor Settings: disable all C-States

- If the memory configuration section lets you select a performance profile, choose "High Performance"

- If you can choose a power plan in the BIOS, choose "High Performance", or otherwise disable all the energy saving settings.

Then, from a Windows prompt, launch this command:

powercfg /list

All the available power plans are listed; the selected one is marked with an asterisk. If the selected plan is not "High performance", change it using the command:

powercfg /setactive <GUID>

where <GUID> is the "High Performance" power plan GUID you can see by running "powercfg /list".
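For example, on a default installation the "High Performance" plan usually has this well-known GUID (verify it against your own "powercfg /list" output first):

powercfg /setactive 8c5e7fda-e8bf-4a96-9a85-a6e23a8c635c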

Interrupt Moderation

You have a lot of settings here; the lowest rate is best from a latency point of view, but I suggest monitoring the CPU load (of every core) during a storage benchmark. Then try disabling interrupt moderation and check the CPU load again with a benchmark running. If the load on the CPU is too high, repeat the steps, switching the "Interrupt Moderation Rate" to Minimal, Low, Medium and so on until you find the best balance between IOPS and CPU load. Remember that the lowest interrupt moderation rate your CPU load allows is probably the best setting for your needs.
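These driver options can usually also be scripted; the exact DisplayName strings are driver-specific (the ones below are typical for Intel drivers), so enumerate them first and adjust as needed:

Get-NetAdapterAdvancedProperty -Name "NIC1" | Where-Object { $_.DisplayName -like "*Interrupt Moderation*" }
Set-NetAdapterAdvancedProperty -Name "NIC1" -DisplayName "Interrupt Moderation Rate" -DisplayValue "Low"
# "NIC1" is a placeholder adapter name; repeat (and benchmark) per adapter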

RSS Queues

Regarding RSS queues there are two different approaches. I don't know how many cores you have on your ScaleIO nodes, but I think you can try raising the value to 4 and running a benchmark to check IOPS. Another approach is to use only 1 RSS queue per NIC and then set the affinity of every RSS queue to a different core using PowerShell.

I'll show you how it works. From a PowerShell prompt, run the command:

Get-NetAdapterRss

For every NIC you have the property "BaseProcessor", which is the starting processor for RSS queue allocation. The property "NumberOfReceiveQueues" should match the number of RSS queues selected in the NIC driver (in your case currently 2). The property "MaxProcessor" represents the last processor that can be used for RSS queue allocation, the property "RssProcessorArray" lists all the processors that can potentially be used for allocation, and the property "IndirectionTable" shows the cores actually used by that adapter (the values are repeated; it is a 128-entry table because of the algorithm used to select the packet destination). The IndirectionTable can be derived from the BaseProcessor and NumberOfReceiveQueues values.

For example, these are the relevant settings of an Ethernet adapter with 4 RSS queues starting from core 0:0:

NumberOfReceiveQueues                           : 4

BaseProcessor: [Group:Number]                   : 0:0

MaxProcessor: [Group:Number]                    : 0:6

RssProcessorArray: [Group:Number/NUMA Distance] : 0:0/0  0:2/0  0:4/0  0:6/0

IndirectionTable: [Group:Number]                : 0:0    0:2    0:4    0:6    0:0    0:2    0:4    0:6

The last line is repeated 15 more times.

As you can see, only even processor numbers are used, because the odd processor numbers represent the hyperthreaded logical cores. So in a server with eight cores and Hyper-Threading enabled, the RssProcessorArray value will be "0:0/0 0:2/0 0:4/0 0:6/0 0:8/0 0:10/0 0:12/0 0:14/0". In the previous example MaxProcessor is set to 0:6, so starting from 0:0 we have a processor for every queue. If MaxProcessor had been set to 0:2, only two RSS queues would actually be used.

An example: if you have a server with a single 12-core processor, five NICs and 2 RSS queues per adapter, the best settings for RSS queue affinity are these (assuming, of course, that you expect the same load on all NICs):

NIC1 (NumberOfReceiveQueues: 2)

BaseProcessor: 2

Resultant Indirection Table: 0:2 0:4

NIC2 (NumberOfReceiveQueues: 2)

BaseProcessor: 6

Resultant Indirection Table: 0:6 0:8

NIC3 (NumberOfReceiveQueues: 2)

BaseProcessor: 10

Resultant Indirection Table: 0:10 0:12

NIC4 (NumberOfReceiveQueues: 2)

BaseProcessor: 14

Resultant Indirection Table: 0:14 0:16

NIC5 (NumberOfReceiveQueues: 2)

BaseProcessor: 18

Resultant Indirection Table: 0:18 0:20

With 12 cores I didn't use BaseProcessor 0 because in this specific configuration there are 10 RSS queues in total and 12 available processors; when possible it is better to leave processor 0:0 unused, because it is used by the OS for some primary functions. In this specific example the last core is also unused (because we have 12 cores and 10 queues). If in your setup you can't leave BaseProcessor 0 free, that's not a major problem (you can use it). Remember that the same cores can be used for different NICs (overlapping is possible).

If in your current setup the 2 RSS queues of all 5 NICs are allocated starting from BaseProcessor 0:0, the IRQs are spread only over cores 0:0 and 0:2, so those cores will be overloaded during high network load.

So set the number of RSS queues in the NIC driver, then check the affinity allocation for every NIC using the command "Get-NetAdapterRss"; if you need to correct the affinity settings, use the command:

Set-NetAdapterRss -Name <NIC name> -BaseProcessorNumber <base processor> -MaxProcessors <max processors>

Example for a NIC with 2 RSS Queues:

Set-NetAdapterRSS -name NIC1 -BaseProcessorNumber 2 -MaxProcessors 4

So the resultant IndirectionTable will be: 0:2 0:4
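To apply the whole layout from the 12-core / five-NIC example above in one go, a small loop like this can be used from PowerShell (the adapter names are placeholders; substitute your own, and verify the result afterwards):

# Hypothetical adapter names mapped to their base processors (2 RSS queues each)
$layout = @{ NIC1 = 2; NIC2 = 6; NIC3 = 10; NIC4 = 14; NIC5 = 18 }
foreach ($nic in $layout.Keys) {
    Set-NetAdapterRss -Name $nic -BaseProcessorNumber $layout[$nic] -NumberOfReceiveQueues 2
}
Get-NetAdapterRss   # check the resulting IndirectionTable of every adapter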

After checking and tuning these settings, run a storage benchmark and check the CPU load to see whether it is spread equally across the cores.

I suggest benchmarking the system with 4 RSS queues, 2 RSS queues and 1 RSS queue. When possible, avoid overlapping by setting the RSS affinity manually. When you do have to configure overlapping RSS queues, try to balance the IRQ load (probably not all networks carry the same amount of data). If you can, leave BaseProcessor 0 free.

I hope the section about CPU affinity is clear enough.

Kind regards,

Davide

14 Posts

November 26th, 2016 04:00

Hi Davide,

Wow... what an extremely detailed answer again. Really appreciated! I am busy with other things for a few days, but I will look into everything again next week and keep this thread updated! Cheers.

Kind regards, Paul

68 Posts

November 26th, 2016 21:00

Hi Paul,

A question regarding your hardware: which SAS controller do you have in your servers? Is its write cache battery backed?

I'm asking because I noticed different behaviours with battery-backed controllers and SSDs. Before going into production I tested two different backend configurations: a controller with battery-backed writeback cache and a generic HBA.

First Configuration (3 SDS nodes)

- Controller LSI 9271-8i (battery backed)

- 3 Samsung SM863

I configured every device as a separate RAID-0 as suggested in the ScaleIO Deployment Guide. I left the controller write cache enabled (hoping to lower write latency).

Second Configuration (3 SDS nodes)

- Controller LSI 9207-8i

- 3 Samsung SM863

This controller is an HBA, so the disks are exposed as individual devices (JBOD) and writeback cache is not available.

I benchmarked both configurations, and with the SM863 the second configuration offers higher performance and better latencies.

The ScaleIO Deployment Guide is very well written and states that writeback cache offers better performance with HDDs, while "For flash devices (e.g. SSD): Depends on the device".

That result seems strange, but I discovered the reason behind the scenes. With writeback cache enabled, the LSI controller automatically disables the "Disk Cache" option (it appears grayed out in the controller BIOS). This is for protection: if you have an SSD without capacitors, the commitment of data from the disk cache to the NAND is not protected in case of a sudden power loss. Working this way, the controller protects your data by disabling "Disk Cache" and using its own write cache, which is protected (battery backed).

But if you have an enterprise SSD with capacitors (such as the SM863 or PM863), committing the disk cache data to the NAND is guaranteed even in case of a sudden power loss, because the capacitors keep the disk powered long enough to complete the commitment.

The controller doesn't know whether or not you are using enterprise drives, so it disables "Disk Cache" and doesn't let you re-enable it manually.

The "Disk Cache" that is embedded on SM863 is well sized and results faster than disabling it and using the controller writeback cache: this is why with SM863 and the HBA (LSI 9207-8i) I get better results than using SM863 and the controller LSI 9271.8i with writeback cache enabled.

I went straight for HBAs, which are cheaper, but if you have a battery-backed controller with writeback cache enabled, you can try disabling the writeback cache in the controller BIOS and benchmarking your ScaleIO environment (your PM863 drives are enterprise SSDs with capacitors).

I think this can give you better results in your specific configuration as well.

Kind regards,

Davide

14 Posts

November 27th, 2016 06:00

Hi Davide,

Just some short feedback regarding your answer; in the coming week I will post a more detailed reply covering the other tips you gave me as well.

We are not using a RAID controller. The drives (Samsung PM863) are configured as JBOD, so there is no battery-backed cache involved.

We just added 4 SSDs to each node using the built-in SATA controller (Intel X99), gave them a raw partition and added them to ScaleIO as devices.

We are using SuperMicro servers.

I was busy this weekend taking our testing one step further, since the performance is great now. We have created a Hyper-V cluster (on top of the ScaleIO cluster) using the 3 nodes, and to my surprise I measured 120,000 IOPS for both reads and writes at QD32. This is getting extremely fast... wow...

We have already put some virtual servers on the cluster and it's really blazing fast.

Will come back next week in more detail.

Thanks, Paul

68 Posts

November 27th, 2016 20:00

Hi Paul,

Thanks a lot for keeping this thread updated. It is useful to compare benchmarks from different configurations in order to understand the best parameters for every type of need.

Right now I'm working on releasing a collectd plugin so that Grafana can be used to track historical metrics of a ScaleIO deployment. This can be useful to you for tracking throughput during the day (graphs are updated in real time and you have access to historical data). SwissCom did a great job and released the first plugin for this, but I modified the approach used to collect ScaleIO data. I wrote about the way it works in another thread (ScaleIO Historical Performance Monitoring). My collector plugin can be installed on a standalone VPS that only needs connectivity to the ScaleIO gateway. I have been testing it for one week and it works perfectly. I wanted to release it last Monday but, while writing in this thread, I realized the importance of tracking latency, so I preferred to delay the release a little in order to add latency indicators as well.

I will post the link to the GitHub project as soon as I release it. It helps a lot with tracking performance.

Thanks,

Davide

14 Posts

November 29th, 2016 03:00

Hi Davide,

That's nice! We are running (or planning to run) Microsoft System Center Operations Manager, and I had already thought about creating my own custom Management Pack to keep track of ScaleIO performance, but also to generate (SMS) alerts to mobile phones in case of malfunctions.

I might look into your thread on how to collect the performance counters for ScaleIO; I guess we have to use SNMP for that.

I haven't looked into it yet; I will share my Management Pack for ScaleIO as well once it's done.

Keep up the great work! I am loving ScaleIO more each day.

Kind regards,

Paul

110 Posts

November 29th, 2016 09:00

I'm glad to hear it!

If you create a Management Pack and want to share it, we can create a repo within the DellEMC GitHub account so others can find it and contribute.

As an alternative to SNMP, you can use the REST API, which is accessed via the Installation Manager/Gateway server. The API details are in the User Guide that is included in the download. You can also look at the ScaleIO Historical Performance Monitoring thread, mentioned before, for links to other projects that use the API to gather those stats, as a starting point.
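A rough PowerShell sketch of the login flow, just to show the idea (the gateway address and credentials are placeholders, and the endpoint paths should be verified against the REST API section of the User Guide; it also assumes the gateway certificate is trusted by the client):

# Log in to the ScaleIO Gateway and list the System objects
$gw    = "https://scaleio-gateway"              # placeholder gateway address
$user  = "admin"; $pass = "YourPassword"        # placeholder credentials
$basic = [Convert]::ToBase64String([Text.Encoding]::ASCII.GetBytes("${user}:${pass}"))
$token = Invoke-RestMethod -Uri "$gw/api/login" -Headers @{ Authorization = "Basic $basic" }
# Subsequent calls authenticate with the user name plus the returned token as the password
$auth  = [Convert]::ToBase64String([Text.Encoding]::ASCII.GetBytes("${user}:${token}"))
Invoke-RestMethod -Uri "$gw/api/types/System/instances" -Headers @{ Authorization = "Basic $auth" }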

Thanks,

Jason

68 Posts

November 29th, 2016 20:00

Hi Jason and Paul,

I just released my version of the collectd-scaleio plugin (forked from the SwissCom repository), which relies on the REST API to get metrics from the ScaleIO infrastructure. I updated the thread "ScaleIO Historical Performance Monitoring" linked by Jason with some details.

I published it in our GitHub repository, but if it's possible to create a repository under the DellEMC GitHub account, that would be even better! Obviously it would be much easier for others to find it on the official DellEMC GitHub account.

The code can surely be improved; I'm not a developer. I'm also planning to add additional metrics in the future.

Paul, our main cluster is based on Hyper-V and managed using SCVMM. I'd like to find the time to install and configure SCOM (since we have licensed the whole System Center suite); in that case I will surely develop a script to extract ScaleIO metrics using the REST API. The SNMP approach is possible anyway.

Jason, the idea of a GitHub repository under the DellEMC account is great. Contact me if you open it; I will surely use it to share my work.

Thanks in Advance,

Davide
