Slow replication over 10G link

May 17th, 2011 20:00

Hi storage experts,
I'm having a performance issue with replication between two EqualLogic arrays. It's a 10G network, but replication only runs at around 94 MB/s.

These are the performance tests I ran:

1. Scenario: HV guest direct-attached SAN – SAN
   Location: HV guest – Production; SAN – Production
   Average speed: 180 MB/s

2. Scenario: HV guest direct-attached SAN – SAN
   Location: HV guest – Production; SAN – Office
   Average speed: 180 MB/s

3. SAN replication
   Average speed: 94.6 MB/s
   Replicating 47 GB of data took about 8 min 24 sec.

I only see the performance issue on volume replication; the speed of direct writes to the SAN is fine. Please advise.
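As a quick sanity check of the numbers in test 3 above, a minimal Python sketch (the 47 GB and 8 min 24 sec figures are taken from that test) converts the size and elapsed time into an average rate:

# Average replication throughput from data size and elapsed time.
# Figures are from test 3 above; 1 GB is treated as 1024 MB.
size_gb = 47
elapsed_s = 8 * 60 + 24                      # 8 min 24 sec = 504 s
throughput_mb_s = size_gb * 1024 / elapsed_s
print(f"{throughput_mb_s:.1f} MB/s")         # ~95.5 MB/s, close to the ~94.6 MB/s reported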



Regards
Dean

1 Message

May 18th, 2011 05:00

Can you tell us what model your array is? Not all arrays have 10G interfaces.

Thanks

7 Technologist • 729 Posts

May 18th, 2011 08:00

Dean,

Hi, it's Joe with Dell EqualLogic. Replication MB/s = TCP window size / latency. The two major factors for replication speed are absolute bandwidth and packet latency. It's possible for a high-bandwidth link to be limited in the maximum replication speed it can support if the packet latency is sufficiently high.

The basic formula for latency-limited bandwidth is that the maximum bandwidth you can get from a TCP/IP session is equal to the TCP window size divided by the latency. If that max bandwidth is greater than the WAN bandwidth, then you're fine. If your latency is high enough, your max bandwidth could be lower than the WAN bandwidth.

TCP Window / latency == Max B/W
72KB / 1ms == 72MB/s
72KB / 10ms == 7.2MB/s
72KB / 20ms == 3.6MB/s
72KB / 100ms == 720KB/s
256KB / 1ms == 256MB/s
256KB / 10ms == 25.6MB/s
256KB / 20ms == 12.8MB/s
256KB / 100ms == 2.5MB/s
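For anyone who wants to plug in their own numbers, here is a minimal Python sketch of the same table (using the same KB-per-ms = MB/s convention as the figures above; results are approximate):

# Latency-limited throughput: max MB/s = TCP window (KB) / round-trip latency (ms).
def max_throughput_mb_s(window_kb, latency_ms):
    return window_kb / latency_ms

for window_kb in (72, 256):
    for latency_ms in (1, 10, 20, 100):
        rate = max_throughput_mb_s(window_kb, latency_ms)
        print(f"{window_kb}KB / {latency_ms}ms == {rate:.1f}MB/s")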

If you have a 100Mbit WAN (which is approximately 10 MBytes per second) but you have 10ms of latency, your maximum throughput is limited to 7.2MB/s, not 10MB/s. To measure the packet latency, simply log in to an array on the source side and ping an array on the destination side. Logging into the group IP address and pinging the other group IP address is fine.

Group> ping 172.23.49.110
PING 172.23.49.110 (172.23.49.110): 56 data bytes
64 bytes from 172.23.49.110: icmp_seq=0 ttl=54 time=40.000 ms
----172.23.49.110 PING Statistics----
round-trip min/avg/max/stddev = 30.000/32.500/40.000/5.000 ms

Look at the "avg" field in the summary line.

The default array receive TCP window for replication is 72KB. This can be adjusted up as high as 2MB per connection, at the cost of performance on the receiving side (cache memory is decreased to handle the new requirements of receiving replication data).
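Conversely, a minimal sketch (an editorial Python example, not an EqualLogic tool) of the receive window you would need for a given target rate at a measured ping latency, using the same bandwidth-delay arithmetic; the 72 KB default and 2 MB ceiling come from the post above:

# Required receive window (KB) = target rate (MB/s) * round-trip latency (ms).
def required_window_kb(target_mb_s, latency_ms):
    return target_mb_s * latency_ms

for target in (94.6, 200, 400):
    window = required_window_kb(target, 1.0)   # assumes a 1 ms measured RTT
    print(f"{target} MB/s at 1 ms RTT needs roughly a {window:.0f} KB window")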

Regards,
Joe

10 Posts

May 18th, 2011 16:00

Hi,
There are 2x EqualLogic PS6010 arrays and 4x PowerConnect 8024 switches (LAG). No WAN! The switch-to-switch links use fibre (LAGged, total speed = 20Gb). I tried to replicate a single volume and the speed I got is below:

Average speed: 94.6 MB/s
Replicate 47GB of data took about 8min 24sec.


Interestingly, copying files from a Windows server to a SAN volume is much, much faster. Given this, I do believe it is the replication process that is slow. Please advise.


Regards
Dean

847 Posts

May 19th, 2011 10:00

"Hi,
There are 2x Equallogic PS6010. 4x 8024 switches (LAG). NO WAN! switch-to-switch connect using FIbre (LAGged, total speed =20GB). I tried to replicate a single volume and the speed i got is as below:

Average speed: 94.6 MB/s
Replicate 47GB of data took about 8min 24sec.


Interestingly, copy files from Windows server to SAN volume is much much faster. With this, i do believe is the replication process is slow. Please advise.


Regards
Dean
"
Interesting. We are seeing similar issues with arrays from a different manufacturer. We went back to the default frame size (1500) with no change at all. No switches either; our 10G is direct-connected from SAN to SAN.

May 19th, 2011 10:00

Dean,

That is interesting. One thing that might be getting in your way is that local iSCSI has a higher priority than replication iSCSI, so if you are moving a lot of data from the array to the servers it might slow down replication.

Normally, replication will take as much bandwidth as you give it.
Try this: open a telnet session to each array and ping the other array. See if you have any latency in the pings.
You can also run the command: su exec traceroute (IP address) to make sure the route to the other array is direct.

I would say to call into tech support so we can get the diagnostics from both arrays and see if replication is having problems.

11 Posts

May 19th, 2011 10:00

Dean,

Are you using jumbo frames at all? If jumbo frames are in use, is there any possibility of a frame size mismatch on the various network ports in the environment?

In an ideal world you would have the following.

1. EQL Controller ports set to MTU 9000 (or 9216)
2. All LAG ports used on the 8024s set with MTU 9000 (or 9216)

If there is a mismatch in frame size on the LAGs, you will see degraded performance due to packets being dropped, etc.
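One way to test for exactly this kind of mismatch is a don't-fragment ping sized just under the jumbo MTU. A minimal sketch, assuming a Linux host on the SAN subnet with the standard iputils ping (the target IPs are placeholders):

# Don't-fragment pings sized for jumbo frames: 8972 bytes of payload plus
# 28 bytes of IP/ICMP header = 9000 bytes on the wire. If any hop has a
# smaller MTU, the pings fail instead of being silently fragmented.
# (The Windows equivalent is: ping -f -l 8972 <target>)
import subprocess

targets = ["10.1.20.11", "10.1.20.12"]   # placeholder array/switch-facing IPs

for ip in targets:
    result = subprocess.run(
        ["ping", "-M", "do", "-s", "8972", "-c", "3", ip],
        capture_output=True, text=True,
    )
    ok = result.returncode == 0
    print(f"{ip}: {'jumbo-frame path OK' if ok else 'possible MTU mismatch'}")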

Also, are you using the EQL MEM? (if using VMware)

Thanks,
Mike.

3 Posts

May 20th, 2011 19:00

Hi,

I'm getting exactly the same issue, although not with replication; it's slow performance from VMware to the EqualLogic due to high read latency. I have logged a call with EqualLogic support. I have a single EQL PS6010XV (latest firmware 5.0.5) connected via 2 x PowerConnect 8024F switches; the hosts are connected the same way with dual-port Broadcom 10GbE NetXtreme II 57711 NICs. If I look at SAN HQ, I can see write latency is perfectly fine, but read latency is through the roof (50ms+, it even hit 1500ms at one stage!). I'm running ESX 4.1, and the software iSCSI initiator / vSwitch / vmnic etc. are set up as recommended (jumbo frames enabled all the way through: 9000 on the VMware side, 9216 on the PowerConnects, and 9000 on the EqualLogic).

Dell so far have said the firmware on the switches needs updating to at least 3.4.8 A3 (done this, no difference). They have also said the PowerConnect should carry the iSCSI traffic in anything but the default VLAN, as VLAN 1 doesn't support jumbo frames. I have an outage scheduled to do this on Monday night, so I will post results afterwards and whether it resolves the issue.

Oh, and I don't think it's VMware specifically, because I also have a backup server with the same NIC running Windows 2k8 with Veeam and the iSCSI initiator to the EQL, and I get poor backup performance in Veeam when it's backing up directly off the SAN. So it's either the Dell NIC (firmware, probably), a switch configuration issue (firmware already updated to the version Dell recommends/tests with), or potentially the latest firmware 5.0.5 on the EQL may be the cause.

Cheers

10 Posts

May 22nd, 2011 20:00

Hi all,
Yes, jumbo frames are enabled on the switches, the EqualLogics and the LAG ports. During my test there was no other traffic, only the replication of a single volume, and I'm disappointed with the performance.

SAN replication
Average speed: 94.6 MB/s
Replicating 47 GB of data took about 8 min 24 sec.

I'm running firmware v5.0.5 and the speed is slow. I'm not sure about previous firmware; I would appreciate it if someone on older firmware could try this replication test.

Interestingly, if I schedule a group of 5 volumes (each containing 20 GB of data) to replicate at once, they operate at a much better overall speed; each volume replicates at around 3300 MB/min:

Test 1:
Replicate 1 volume of 20 GB data
Time: 8 min 15 sec

Test 2:
Replicate 5 volumes with 20 GB data each, total 100 GB
Time: 6 min 15 sec

I do believe there is something not going right with the EqualLogic replication itself.
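A quick cross-check of the two tests above in Python (the sizes and times are taken from this post) shows the per-volume rate dropping while the aggregate climbs, which is consistent with a per-connection limit rather than a link bandwidth limit:

# Throughput check for the two tests above.
def mb_per_s(gb, minutes, seconds):
    return gb * 1024 / (minutes * 60 + seconds)

single    = mb_per_s(20, 8, 15)    # Test 1: one 20 GB volume   -> ~41 MB/s
aggregate = mb_per_s(100, 6, 15)   # Test 2: five 20 GB volumes -> ~273 MB/s total
per_volume = aggregate / 5         # ~55 MB/s each, i.e. ~3300 MB/min

print(f"Test 1: {single:.0f} MB/s for the single volume")
print(f"Test 2: {aggregate:.0f} MB/s aggregate, {per_volume:.0f} MB/s per volume")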

Regards
Dean

7 Technologist • 729 Posts

May 24th, 2011 10:00

Dean,

In version 5.0.x, performance improvements were made for replicating multiple volumes simultaneously, so the faster times in your test with the 5 volumes are exactly what you should be seeing.

Also, version 5.0.7 will be released soon (no release date yet, because it is still in the testing phase); it will have additional performance improvements/fixes for replication. I'll keep you posted when I learn more.

As indicated by Aaron@Dell, open a telnet session to each array (telnet into each array separately using one of the eth port IP addresses, not the group IP address) and ping through each specific eth interface to each eth interface of the other array to see if you have any latency in the pings (then do the same from the other array).

To ping from the PS array using a specific eth port interface:
Usage: ping "-I <sourceIP> <destinationIP>"
Note: the above is a capital "I" (eye).
The sourceIP is the IP address of a specific array eth port. This is done from the group prompt after logging into the array. The quotes are needed at the group prompt.

Example:
Here is an example going from one network '10.1' to another '10.3' which might be the case if you are checking the connectivity for replication:
groupname>ping " -I 10.1.20.11 10.3.20.100"

Another example would be a local test. In this case, we're just staying on the same subnet.
groupname>ping " -I 10.1.20.11 10.1.20.100"

Again, do this from all eth ports to all other eth ports. You may also want to ensure that the latency from each switch to each eth port on both arrays is within acceptable limits.
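If there are many eth ports to cover, a small sketch like the following (the IP addresses are placeholders) prints the full mesh of ping commands in the format shown above, ready to paste at each group prompt:

# Generate the "every eth port to every other eth port" ping commands.
from itertools import product

local_eth  = ["10.1.20.11", "10.1.20.12"]     # placeholder source-array eth IPs
remote_eth = ["10.3.20.100", "10.3.20.101"]   # placeholder destination-array eth IPs

for src, dst in product(local_eth, remote_eth):
    print(f'groupname> ping "-I {src} {dst}"')
# Repeat with the lists swapped from the other array's prompt, and check the
# avg round-trip time reported in each ping summary.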

Regards,
Joe

55 Posts

May 24th, 2011 11:00

Also, to throw this out there: is flow control enabled, and are you using the default VLAN? You could be looking at jitter on the line and running into retries. On the VMkernel, what are you using as your NIC teaming rule? Are you using IP source hash or MAC?

10 Posts

May 26th, 2011 18:00

Hi,

Flow control is enabled, and the SAN traffic uses the default VLAN (VLAN 1). We are running Hyper-V. It is clear that the performance loss only happens during replication; when writing to the SAN volume directly, there is no performance loss.

As JoeSatDell mentioned, I hope v5.0.7 will fix/improve the replication performance.

Regards
Dean

3 Posts

June 6th, 2011 23:00

Just a quick update to the thread I posted. Unfortunately it won't be of help with your replication performance problems, but just in case it helps anyone else: updating to the latest revision of the Broadcom 57711 10GbE NIC drivers on the ESX 4.1 hosts and on the Windows backup server resolved the read latency issues.

Good luck Dean!