
March 14th, 2011 11:00

Are you a customer using DART to host an NFS datastore?

This could be on an EMC Celerra, EMC Unified, or EMC VNX platform.  We think we've developed something that might be an awesome performance improvement, but the truest test would be customer feedback.

This is very "hot off the presses" from engineering, and comes before the formal beta phase.

This is EXPERIMENTAL CODE - and not supported in production!!!   Do not deploy on production systems.

We would love informal feedback (but as formal as possible - including your before/after experiences).  To encourage people to do this (but again, only on non-production environments!!), I will kick-start another contest.

For the first 30 people who provide data showing the before/after comparison of this new code, I will provide a 16GB WiFi iPad.

So – how do you get it?

  1. For customers who would like to try the fix, here is the process (this is the preferred process, as it formalizes feedback):
    1. The epatch is available to the tech support group, which is the usual method by which we release code.
    2. The epatch is called 6.0.40.805.  If you wish to obtain and test with it, please open an SR with your local service representative and have them work with tech support to get the 6.0.40.805 epatch so that you can schedule your upgrades.
    3. Please provide any feedback based on the experience that results from this patch.  Negative or positive, we'd like to hear it.
  2. For non-customers (EMC Partners/employees) who would like to try the fix, the experimental DART epatch is here, with the MD5 here, and the release notes here (a quick way to check the download against that MD5 is sketched below).
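This is an illustrative Python sketch only - the file names are placeholders, not the actual epatch file names:

```python
import hashlib

def md5sum(path, chunk_size=1 << 20):
    """Compute the MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Placeholder file names -- substitute whatever tech support gives you.
expected = open("dart_6.0.40.805.md5").read().split()[0].lower()
actual = md5sum("dart_6.0.40.805.bin")
print("MD5 OK" if actual == expected else f"MISMATCH: got {actual}")
```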

Please use the EMC/VMware PowerShell tools from this earlier post to measure the effect: https://community.emc.com/thread/117654?start=0&tstart=0 - but ALSO please capture your IOmeter results (see below).

Understanding this a little more:

  • The optimization is for the NAS write path – it has shown very large performance improvements in some tests with very small, random IO workloads on NFS datastores.
  • Multi-VM testing is a must if you want to provide data. We know it's good with a single VM (4x better latency); the improvement comes from reducing the serialization of writes through the NAS stack to the back-end disk devices (which produces a lot less latency). The main questions are: a) how well it holds up with random, multi-VM workloads; b) whether performance regresses with other workloads (large-IO sequential guest workloads)
  • I can't say it enough - it is experimental.  This means: don't use it in production – PERIOD.   If you have a non-production Celerra/VNX use case, give it a shot.  We've been playing with it for a while, so it seems solid, but never use non-production code in production environments.

Since this will have before/after data, being a little more prescriptive is useful.   Note - the test harness below is not a prerequisite to winning an iPad.   Any before/after data (even if the observed effect is negative) will enter you in the running for an iPad.

Ideally, tests should share this configuration:

  • Capture your IOmeter workload results - the ultimate effect of this should be better guest-level latency and higher maximum IOps.   Do this in ADDITION to the vSCSIstats, ESXtop, and array stats noted earlier.
  • Windows VM, one vCPU, 512MB RAM, IOmeter installed, a second unformatted VMDK sized to 10GB attached to the VM
    • The tests will run against the second VMDK.
    • Run each test for at least four minutes.
  • 32 outstanding IOs
  • The NFS volume should be configured across as many disks as are available, in RAID0.   The point here is to eliminate back-end bottlenecks - we're trying to stress the NAS stack, not the block stack.

Then I propose we run the following tests against the entire unformatted second VMDK (the sketch after this list enumerates the same matrix as a run checklist).  For each of these we should collect pre- and post-patch results.

  • 4k IO, 100% random, 0% read
  • 4k IO, 100% random, 50% read
  • 4k IO, 66% random, 0% read
  • 4k IO, 66% random, 50% read
  • 4k IO, 33% random, 0% read
  • 4k IO, 33% random, 50% read
  • 4k IO, 0% random, 0% read
  • 4k IO, 0% random, 50% read
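A minimal Python sketch that enumerates the matrix above, purely as a bookkeeping aid for your before/after runs (it is not an IOmeter config generator - build the access specs in the IOmeter GUI):

```python
from itertools import product

# Run parameters from the guidelines above.
RUN_MINUTES = 4          # minimum run length per test
OUTSTANDING_IOS = 32     # outstanding IOs per worker

BLOCK = "4k"
RANDOMNESS = [100, 66, 33, 0]   # % random
READ_PCT = [0, 50]              # % read (the remainder is write)

for rnd, rd in product(RANDOMNESS, READ_PCT):
    print(f"{BLOCK} IO, {rnd}% random, {rd}% read "
          f"({OUTSTANDING_IOS} OIO, >= {RUN_MINUTES} min)")
```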

Pause and analyze your results.  What would be awesome would be to find the test with the most (MOST) dramatic change and the one with the least (LEAST) change - see the sketch after the list below for one way to compute this.  Based on those, we should then add the following tests for pre- and post-patch configurations:

  • 8k IO, MOST
  • 8K IO, LEAST
  • 16k IO, MOST
  • 16k IO, LEAST
  • 32k IO, MOST
  • 32k IO, LEAST
  • 4 VM configuration: 4xMOST
  • 4 VM configuration: 4xLEAST
  • 8 VM configuration: 8xMOST
  • 8 VM configuration: 8xLEAST
  • 16 VM configuration: 16xMOST
  • 16 VM configuration: 16xLEAST
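Finding MOST and LEAST is just a matter of ranking the before/after improvement ratios. A minimal Python sketch - the numbers are placeholders, substitute your own measured IOps:

```python
results = {
    # "test label": (iops_before_patch, iops_after_patch) -- placeholders
    "4k IO, 100% random, 0% read": (1000.0, 4000.0),
    "4k IO, 100% random, 50% read": (2000.0, 6000.0),
}

ratios = {name: after / before for name, (before, after) in results.items()}
most = max(ratios, key=ratios.get)
least = min(ratios, key=ratios.get)
print(f"MOST change:  {most} ({ratios[most]:.2f}x)")
print(f"LEAST change: {least} ({ratios[least]:.2f}x)")
```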

Post your results to this thread.

Thanks - any and all data is welcome!   Remember - DO NOT RUN THIS IN PRODUCTION!

5 Practitioner • 274.2K Posts

March 17th, 2011 12:00

Did some tests today with DART version 6.40.500 and later compared the results with 6.40.805. More results with different block sizes will be published tomorrow.

Wow! What a difference. This is just one indication out of several that there's a lot of performance gain to be had from clever algorithms in software.

Unfortunately I cannot run any PowerShell scripts here, so I had to take another approach. All tests ran for at least 4 minutes and were monitored; the performance statistics for the virtual disk have been exported to Excel spreadsheets that I will clean up and publish later.

At the moment a lot more tests remain to be done, but the results below speak for themselves in terms of the performance gain between the current and the experimental DART code.

Another thing I noticed was reboot time. A Data Mover node seems to reboot faster than with the current released code. I haven't had time to measure the difference yet, and probably won't have time to either.

The first tests were done with 2x W2k8R2 VMs. Why W2k8R2? The simple answer is that it performs better than 2k3.

On the downside it needs more memory, so I had to run the tests with 2GB RAM instead of 512MB.

Attached to each VM is a 10GB unformatted VMDK used with IOmeter.

IOmeter is configured with 6 workers, each with 32 outstanding IOs.

Note: The back-end storage is not optimally configured, with only 4x 4+1 RAID groups backing the NFS export.

Block   Random   Read/Write ratio      Ops/s [6.40.500]   Ops/s [6.40.805]   Improvement

Reference
4k      0%       100% read, 0% write   69336 [1x VM]      35918+35960        1.04x

4k blocks
4k      100%     0% read, 100% write   1040+1270          5303+5284          4.58x
4k      100%     50% read, 50% write   2221+2305          7550+7526          3.33x
4k      66%      0% read, 100% write   1377+1378          4227+4219          3.07x
4k      66%      50% read, 50% write   2326+2330          7046+7024          3.02x
4k      33%      0% read, 100% write   1507+1511          3333+3243          2.18x
4k      33%      50% read, 50% write   2296+2303          2859+2881          1.24x
4k      0%       0% read, 100% write   1473+1490          3074+3274          2.14x
4k      0%       50% read, 50% write   2024+2045          2933+3051          1.47x

8k blocks
8k      100%     0% read, 100% write   1229+1230          3929+3850          3.16x
8k      100%     50% read, 50% write   1741+1624          5870+4672          3.13x
8k      66%      0% read, 100% write   1427+1190          3447+3522          2.66x
8k      66%      50% read, 50% write   1819+1817          6082+6197          3.38x
8k      33%      0% read, 100% write   1614+1540          3718+3463          2.28x
8k      33%      50% read, 50% write   1742+1791          5738+5296          3.12x
8k      0%       0% read, 100% write   2351+1460          4517+4731          2.43x
8k      0%       50% read, 50% write   2313+2303          11302+11770        4.99x (strange)

32k blocks
32k     100%     0% read, 100% write   -                  -                  -
32k     100%     50% read, 50% write   -                  -                  -
32k     66%      0% read, 100% write   -                  -                  -
32k     50%*     50% read, 50% write   1807+1514          3187+3375          1.98x
32k     33%      0% read, 100% write   -                  -                  -
32k     33%      50% read, 50% write   -                  -                  -
32k     0%       0% read, 100% write   -                  -                  -
32k     0%       50% read, 50% write   -                  -                  -

Ops/s values are shown per VM as VM1+VM2.
* Entered the wrong randomness (50% instead of 66%) - still an interesting result for bigger blocks.

Added results for the rest of the 4k tests; 8k testing is done. Also added a test for 32k blocks, but unfortunately in the hurry I entered the wrong randomness, 50% instead of 66%. Still an interesting result though, as the patch seems to improve performance for bigger blocks too.
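(For clarity: the Improvement column is the two VMs' combined ops/s on the patch divided by the combined ops/s on the released code. A quick Python check against the first 4k row:)

```python
# Improvement = (VM1+VM2 ops/s on 6.40.805) / (VM1+VM2 ops/s on 6.40.500)
before = 1040 + 1270    # 4k, 100% random, 100% write, on 6.40.500
after = 5303 + 5284     # same test on 6.40.805
print(f"{after / before:.2f}x")   # -> 4.58x, matching the table
```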

21 Posts

March 19th, 2011 23:00

Single ESX 4.1.0 build 260247 on a Dell server with 8 quad-core CPUs, 128GB memory and a 10G network
Celerra: NS-G8 with CX4-960
Network: 10Gb all the way between ESX and the DM, 1500 MTU
Celerra doesn't work with RAID0, so the FS was built on 14 FC disks (7 LUNs) striped at 32K across them using RAID10
VMs were running WinXP SP3 with 512MB RAM and a second 10GB unformatted drive against which IOmeter was run
IOmeter on the VMs was configured with 1 worker having 32 outstanding IOs

The biggest difference between the base 6.0.40.8 version and the 6.0.40.805 patch showed up when using a single VM (with 32 outstanding IOs) - up to 3.5x for almost any IO size with 100% writes (both sequential and random)

                                6.0.40.805                              6.0.40.8                               Improvement
Test                            start   IOPS       MB/s     resp (ms)   start   IOPS      MB/s    resp (ms)    (x)
4k IO, 100% random, 0% read     7:33    13659.55   53.36    2.342       9:31    4560.25   17.81   7.0167       3.00
4k IO, 100% random, 50% read    6:46    6078.9     23.75    5.2627      9:37    3857.68   15.07   8.2942       1.58
4k IO, 66% random, 0% read      6:55    13927.31   54.4     2.2968      9:42    4901.22   19.15   6.5282       2.84
4k IO, 66% random, 50% read     7:00    5127.75    20.03    6.2394      9:48    4095.92   16      7.8116       1.25
4k IO, 33% random, 0% read      7:06    12742.6    49.78    2.5104      9:52    4892.34   19.11   6.5401       2.60
4k IO, 33% random, 50% read     7:11    5155.74    20.14    6.2054      9:58    4243.46   16.58   7.54         1.21
4k IO, 0% random, 0% read       7:17    13753.42   53.72    2.326       10:03   5080.17   19.84   6.298        2.71
4k IO, 0% random, 50% read      7:22    12320.83   48.13    2.5953      10:19   7129.63   27.85   4.4874       1.73

8k IO, 100% random, 0% read     12:36   13646.88   106.62   2.3445      11:27   4616.11   36.06   6.9314       2.96
8k IO, 33% random, 50% read     12:42   4486.31    35.05    7.1313      11:20   3534.25   27.61   9.0533       1.27
16k IO, 100% random, 0% read    12:49   11908.82   186.08   2.6863      11:37   3440.26   53.75   9.3006       3.46
16k IO, 33% random, 50% read    12:56   4390.22    68.6     7.2878      11:43   3213.38   50.21   9.9575       1.37
32k IO, 100% random, 0% read    13:09   8266.88    258.34   3.8702      11:56   3023.74   94.49   10.5818      2.73
32k IO, 33% random, 50% read    13:14   3830.09    119.69   8.3537      12:02   2842.07   88.81   11.2575      1.35

(Per row, the improvement factors for IOPS, MB/s and response time are identical to within 0.01, so a single Improvement column is shown.)

I also noticed that there seems to be some sort of limitation on the ESX side preventing more than 64 simultaneous outstanding IOs. So 2 VMs with 32 outstanding IOs each were doing great, but adding more VMs of the same type caused delays in serving writes/reads. Is there anything that can be tuned on ESX?

Just for comparison, another test was performed on an FS built on 10x 4+1 RAID5 FC disks, using 4 VMs with 4 workers each and 1 outstanding IO (32K in size). The table below represents the numbers for 1 VM:

Version      Throughput (MB/s)   Response time (ms)
6.0.40.8     36                  3.5
6.0.40.805   57                  2.2

Total throughput for 4 VMs was:

6.0.40.805 patch:

32 seq writes - ~235MB/s

32 random writes - ~235MB/s

6.0.40.8 image:

32 seq writes - ~157MB/s

32 random writes - ~156MB/s

235MB/s was close to the max throughput you can get out of that FS, so that was impressive!

So I think that by changing the IOmeter configuration it's possible to get much better numbers than the ones I got with 1 worker and 32 outstanding IOs.

1 Attachment

27 Posts

March 20th, 2011 09:00

2x ESX 4.1.0 on Dell servers with 8x CPUs, 32GB memory and a 1Gb network
Celerra: NS480, FLARE 30.511
Network: Single 1Gb connection, 1500 MTU
Storage: 4+1 400GB EFDs, 16x LUNs with a Celerra stripe created across all 16 LUNs (256KB stripe size), and then a single 200GB file system presented.

             The file system was mounted with pre-fetch disabled and Direct Writes disabled
VMs:   Windows 2003 with 512MB RAM and a second 10GB unformatted drive against which IOmeter was run.  I inflated the VMDKs on the NFS volume before I started testing.
IOmeter on the VMs was configured with 1 worker having 32 outstanding IOs, default alignment used (sector boundaries)

I saw the biggest improvement with 0% random 0% Read, and actually the biggest drop in performance with 0% random 50% Read.

Overall I am not seeing much of a difference between the two. My best theory is that by striping across 16 LUNs on the EFDs I was able to get very good concurrency on the GA code, so the patch isn't giving much benefit.  Only a few of the higher-VM-count, larger-block-size tests were hitting the 100MB/s limit on my network ports.  I have five more EFDs in another array; I could potentially pull those in to give 10x EFDs in RAID10, then connect up some more network ports and re-test.

6.0.40.8 1xVM 2xVM 4xVM 8xVM 16xVM
4KB 0% random 0% Read 4665.7 7141.1 10124.3 13744.4 15064.4
4KB 0% random 50% Read 3906.7 4643.5 6060.3 8455.1 10344.4
8KB 0% random 0% Read 5808.6 8729.0 11526.8 12579.6 12654.0
8KB 0% random 50% Read 3322.7 3358.9 4677.4 6360.9 8037.8
16KB 0% random 0% Read 3526.0 5492.4 6465.4 6650.0 6574.7
16KB 0% random 50% Read 2241.3 2635.2 3503.3 4212.8 5240.5
32KB 0% random 0% Read 2366.9 3156.0 3340.4 3413.6 3398.4
32KB 0% random 50% Read 1438.1 1766.9 2354.2 2656.5 3216.9

6.0.40.805 1xVM 2xVM 4xVM 8xVM 16xVM
4KB 0% random 0% Read 4687.8 7091.7 10335.1 11041.2 11486.5
4KB 0% random 50% Read 4109.6 4613.7 6355.6 8796.5 10546.5
8KB 0% random 0% Read 5389.4 8689.1 11390.0 11941.3 11865.1
8KB 0% random 50% Read 3152.6 3173.4 4533.4 5858.9 7904.7
16KB 0% random 0% Read 3439.6 5369.5 6484.3 6628.6 6547.1
16KB 0% random 50% Read 2011.1 2542.1 3423.3 4038.9 4903.4
32KB 0% random 0% Read 2279.3 3077.3 3402.4 3433.1 3377.5
32KB 0% random 50% Read 1399.4 1662.1 2255.0 2638.3 2922.2

Relative 1xVM 2xVM 4xVM 8xVM 16xVM
4KB 0% random 0% Read 1.00 0.99 1.02 0.80 0.76
4KB 0% random 50% Read 1.05 0.99 1.05 1.04 1.02
8KB 0% random 0% Read 0.93 1.00 0.99 0.95 0.94
8KB 0% random 50% Read 0.95 0.94 0.97 0.92 0.98
16KB 0% random 0% Read 0.98 0.98 1.00 1.00 1.00
16KB 0% random 50% Read 0.90 0.96 0.98 0.96 0.94
32KB 0% random 0% Read 0.96 0.98 1.02 1.01 0.99
32KB 0% random 50% Read 0.97 0.94 0.96 0.99 0.91

1 Attachment

92 Posts

March 20th, 2011 21:00

1x ESX 4.1
- 2x Intel Nehalem (4 cores) 2.4 GHz
- 32GB memory
- 1Gb Ethernet


NS-120
2x 73GB EFD


In the attached file:
- an Excel with an overview of the IOmeter results
- all IOmeter CSV output files
- complete performance grabs from ESX (vscsiStats & esxtop) and from the Celerra using the PowerCLI scripts

Two main findings:
- Huge performance improvements with this new patch, especially in the write-intensive workloads (up to 3x)
- Amazing IOPS numbers from a single pair of EFD drives (more than 8,000)


You guys did a great job! Can't wait to have this patch GA!

1 Attachment

21 Posts

March 21st, 2011 09:00

Hi Sile,

Could you clarify what you mean by "direct writes disabled"? Or could you show your server_mount output for the FS used in testing? In order to get the benefit of the patch, it HAS to be mounted with the "uncached" option.

Regards,

Nick

March 21st, 2011 15:00

Chad... thank you and my BETA patch results are posted on my blog here: http://www.boche.net/blog/index.php/2011/03/21/emc-celerra-beta-patch-pumps-up-the-nfs-volume/

I'll perform additional post-patch testing in accordance with the guidelines outlined above.  I wasn't aware of the specific test parameters a month ago when I originally embarked on the patch upgrade mission.

27 Posts

March 21st, 2011 20:00

Yes, uncached mode (the GUI calls it Enable Direct Writes).  That was it.

Here are my new results.

Environment:

2x ESX 4.1.0 on Dell servers with 8x CPUs, 32GB memory and a 1Gb NIC for NFS
Celerra: NS480, FLARE 30.511
Network: 4x 1Gb connections using LACP, 1500 MTU (only 2 ports actually used, because I had two hosts, each using a single 1Gb adapter for NFS)
Storage: 2x 4+1 400GB EFD RGs, 2x LUNs per RG using Celerra AVM, and then a single 200GB file system presented.
             The file system was mounted with Direct Writes Enabled
VMs:   Windows 2003 with 512MB RAM and a second 10GB unformatted drive which IOmeter used.  I inflated the VMDKs on the NFS volume before I started testing.
IOmeter on the VMs was configured with 1 worker having 32 outstanding IOs, default alignment used (sector boundaries)

1 Attachment

March 21st, 2011 22:00

Sile,

Thanks for the comprehensive set of results.  See the attached images below for a summary of your vscsistats results based on T1/T2 (1st phase) and T3/T4 for the deep dive into IO sizes.  For anyone curious, the thread below shows you how to create this kind of data.

https://community.emc.com/thread/118723

2 Attachments

5 Practitioner • 274.2K Posts

March 28th, 2011 17:00

Great results, thanks for posting!

What did your VM look like?   Did you do I/O to a 10GB unformatted VMDK?

And ... to verify ... this is with a single 1GbE Ethernet connection?

27 Posts

March 28th, 2011 20:00

Increased network connections, 2 active NICs now.  Updated the previous post with more details (see above).

27 Posts

April 4th, 2011 06:00

Another round of testing.  I increased the test duration from 4 to 10 minutes to get more stable numbers.  I added another NIC to my ESX hosts, giving a total of 4x NICs for the testing; I confirmed this with a read test that got 333MB/s.  I also changed over to RAID10.

Environment:

ESX: 2x ESX 4.1.0 on Dell servers with 8x CPUs, 32GB memory and 2x 1Gb NICs for NFS
Celerra: NS480, FLARE 30.511
Network: 4x 1Gb connections using LACP, 1500 MTU

Storage: 5x 1+1 400GB EFD RGs, 2x LUNs per RG using Celerra MVM.  A single stripe was created across all 10 dvols with a 256KB stripe size.  It was then added to a pool and the test file system was created from the pool.

VMs:   Windows 2003 with 512MB RAM and a second 10GB unformatted drive which IOmeter used.  I inflated the VMDKs on the NFS volume before I started testing.

IOmeter on the VMs was configured with 1 worker having 32 outstanding IOs, sector alignment used.  Then all the access specifications were duplicated, except with proper alignment: 4KB aligned at 8KB, 8KB @ 8KB, 16KB @ 16KB ...

3 test scenarios:

B1 (cached mode), B2 (uncached mode), and B3 (uncached mode with epatch 6.0.40.805) were tested with all access specifications and with 1, 2, 4, 8, and 16 VMs.

4 Attachments

115 Posts

May 12th, 2011 14:00

Hi Sakacc

               I assume you need to be on DART version 6 to apply this patch? Will this patch be integrated into a code update at some stage?

Thanks

Paul

115 Posts

May 12th, 2011 14:00

Hi Dynamox

Yes, that's what I was hoping for. We have had some NFS issues and the customer is on 5.6, so I was hoping the upgrade plus all the recommended VMware NFS and Celerra tweaks we have made would solve our NFS performance problems.

Have you upgraded from 5.6 to 6 before?

Paul

1 Rookie • 20.4K Posts

May 12th, 2011 14:00

Or could it already be integrated into 6.0.41.3?

1 Rookie • 20.4K Posts

May 12th, 2011 17:00

Not yet, and probably won't ... we are replacing our aging NS80 with a VNX5700.
