crklosterman
November 19th, 2014 06:00
Good Morning,
A number of points: SyncIQ can chew through a ton of bandwidth with no problem, but the amount of bandwidth it'll consume is limited by a number of things.
1. On the first full sync, the number of files matters. If there are a large number of small files, up the worker count per node on the policy to 6, 7, or 8. On subsequent block-level, snapshot-based incrementals this is far less important, and you can probably drop back to the default of 3.
2. How many nodes are participating in the job? The more nodes, the better.
3. What version of OneFS are you using? If both clusters are running Jaws (OneFS 7.1.1 or above), SyncIQ will automatically do file-splitting. On older versions, each worker gets one file to process at a time. With small to medium-sized files that's fine, but with bigger files you'll see performance tail off towards the end of the job, because one worker may be stuck on a 100GB file while the other workers finished their 100MB files long ago. With file-splitting, SyncIQ automatically splits files over 20MB across multiple workers, so performance stays much more even throughout the job.
4. Source and target node restrictions. Source subnet/pool or target SmartConnect zone restrictions help limit SyncIQ's impact on production I/O, but they can also hurt SyncIQ throughput, because you're limiting the amount of resources SyncIQ has at its disposal.
5. Separate WAN circuits for storage replication. If you find yourself in this boat and have a separate WAN circuit for SyncIQ, make certain the traffic is actually going across that link. That usually means a dedicated replication subnet on the cluster plus a static route, or alternatively Source-Based Routing on OneFS 7.2 (Moby).
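To see why file-splitting (point 3) evens out job runtime, here's a rough simulation — plain Python, not SyncIQ code, and the 100 MB/s per-worker rate and file sizes are made-up illustrative numbers:

```python
import heapq

def job_runtime(files_mb, workers, split_mb=None, rate_mb_s=100):
    """Simulate total job time: each idle worker pulls the next file
    (or chunk, if splitting is on) from the queue until it's empty."""
    work = list(files_mb)
    if split_mb:
        # break every file into split_mb-sized chunks, SyncIQ-style
        chunks = []
        for f in work:
            while f > split_mb:
                chunks.append(split_mb)
                f -= split_mb
            chunks.append(f)
        work = chunks
    # min-heap of per-worker finish times
    finish = [0.0] * workers
    heapq.heapify(finish)
    for item in sorted(work, reverse=True):  # biggest first
        t = heapq.heappop(finish)
        heapq.heappush(finish, t + item / rate_mb_s)
    return max(finish)

# one 100 GB file plus sixty 100 MB files, 3 workers per node
files = [100_000] + [100] * 60
print(job_runtime(files, workers=3))               # one worker stuck on the big file
print(job_runtime(files, workers=3, split_mb=20))  # 20 MB chunks spread evenly
```

Without splitting, the job can't finish before the lone worker grinds through the 100GB file; with splitting, that same work is spread across all three workers and the runtime drops to roughly a third.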
Hope this helps. I'll also forward this thread to a colleague who has even more experience with SyncIQ performance tuning.
~Chris Klosterman
Senior Solution Architect
EMC Isilon Offer & Enablement Team
chris.klosterman@emc.com
twitter: @croaking
peglarr
November 19th, 2014 06:00
You may have anticipated this answer, namely: "it depends" :-)
Troubleshooting issues like this is a 'chase the bottleneck' exercise. For SIQ itself, there is an optimal balance of SIQ jobs (policies executing simultaneously), SIQ nodes engaged, and SIQ workers per job and per cluster. You also have to look at both 'sides' of the SIQ job; it's not enough to look at the source cluster. Oftentimes it is the target cluster that is the bottleneck.
Still, there are dozens of things to check. Obtaining a packet trace is the key first move, to see first-hand where the latencies deviate from what you expect. This is the fine art of blending WAN/LAN and filesystem expertise.
It may be helpful to halt all SIQ activity and restart the appropriate daemons as well, to establish a baseline.
There is no substitute for a hop-by-hop analysis with packet traces when troubleshooting issues involving a network.
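One quick sanity check once you have those traces: a single TCP stream (i.e. one replication connection) can carry at most one window's worth of data per round trip, so throughput is capped at window/RTT no matter how fat the WAN pipe is. A back-of-the-envelope sketch — the 64 KB window and 40 ms RTT are just illustrative figures, not measured values:

```python
def max_tcp_throughput_mbps(window_bytes, rtt_ms):
    """Bandwidth-delay product ceiling for one TCP stream:
    at most one window of data in flight per round trip."""
    return (window_bytes * 8) / (rtt_ms / 1000) / 1e6

# e.g. a 64 KB window over a 40 ms WAN link:
print(max_tcp_throughput_mbps(65536, 40))  # roughly 13 Mb/s per stream
```

If the per-stream ceiling times the number of active workers lands near the throughput you're actually seeing, the WAN latency, not the clusters, is your bottleneck.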
Cheers
Rob
mattashton1
November 19th, 2014 10:00
Hi Chjatwork,
(Adding to the two excellent replies above, and I am the colleague Chris is referring to)
Give us some more info on how you are currently configured:
How many nodes (both source and destination)?
How many workers set? (sounds like you've upped them...)
How many interfaces available on both source and destination?
How many jobs?
How many concurrent jobs?
Typically what I see is that the target cluster is the limiting factor; e.g., you have a 12-node source cluster but only 3 nodes on the destination. Guess where the bottleneck (typically) is?
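As a back-of-the-envelope illustration of that imbalance — the per-node rate below is a made-up placeholder, not an Isilon spec:

```python
def cluster_ceiling_mb_s(source_nodes, target_nodes, per_node_mb_s=200):
    # a replication job can only run as fast as the smaller side can serve it
    return min(source_nodes, target_nodes) * per_node_mb_s

print(cluster_ceiling_mb_s(12, 3))   # the 3-node target sets the ceiling
print(cluster_ceiling_mb_s(12, 12))  # a matched target quadruples it
```

In other words, adding source nodes (or workers) past what the destination can absorb buys you nothing.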
Cheers,
Matt