crklosterman
November 19th, 2014 06:00
Good Morning,
A number of points: SyncIQ can chew through a ton of bandwidth with no problem, but the amount of bandwidth it'll consume is limited by a number of things.
1. On the first full sync, the number of files matters. If there are a large number of small files, up the worker count per node on the policy to 6, 7, or 8. On subsequent block-level, snapshot-based incrementals this is far less important, and you can probably drop back to the default of 3.
2. How many nodes are participating in the job? The more nodes, the better.
3. What version of OneFS are you using? If both clusters are running Jaws (OneFS 7.1.1 or above), SyncIQ will automatically do file-splitting. On older versions, each worker gets one file to process at a time. With small to medium-sized files that's fine, but with bigger files you'll see performance tail off towards the end of the job, because one worker may be stuck on a 100GB file while the other workers finished their 100MB files long ago. With file-splitting, SyncIQ automatically splits files over 20MB across multiple workers, so performance stays much more even throughout the job.
4. Source and target node restrictions. Source subnet/pool or target SmartConnect zone restrictions help limit SyncIQ's impact on production I/O, but they can also hurt SyncIQ throughput, because you're limiting the amount of resources SyncIQ has at its disposal.
5. Separate WAN circuits for storage replication. If you find yourself in this boat and have a separate WAN circuit for SyncIQ, make certain the traffic is actually going across that link. That usually means a dedicated replication subnet on the cluster plus a static route, or alternatively Source-Based Routing on OneFS 7.2 (Moby).
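To see why file-splitting (point 3) evens out job runtime, here's a rough simulation — plain Python, not SyncIQ code, and the 100 MB/s per-worker rate and file sizes are made-up illustrative numbers:

```python
import heapq

def job_runtime(files_mb, workers, split_mb=None, rate_mb_s=100):
    """Simulate total job time: each idle worker pulls the next file
    (or chunk, if splitting is on) from the queue until it's empty."""
    work = list(files_mb)
    if split_mb:
        # break every file into split_mb-sized chunks, SyncIQ-style
        chunks = []
        for f in work:
            while f > split_mb:
                chunks.append(split_mb)
                f -= split_mb
            chunks.append(f)
        work = chunks
    # min-heap of per-worker finish times
    finish = [0.0] * workers
    heapq.heapify(finish)
    for item in sorted(work, reverse=True):  # biggest first
        t = heapq.heappop(finish)
        heapq.heappush(finish, t + item / rate_mb_s)
    return max(finish)

# one 100 GB file plus sixty 100 MB files, 3 workers per node
files = [100_000] + [100] * 60
print(job_runtime(files, workers=3))               # one worker stuck on the big file
print(job_runtime(files, workers=3, split_mb=20))  # 20 MB chunks spread evenly
```

Without splitting, the job can't finish before the lone worker grinds through the 100GB file; with splitting, that same work is spread across all three workers and the runtime drops to roughly a third.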
Hope this helps. I'll also forward this thread to a colleague who has even more experience with SyncIQ performance tuning.
~Chris Klosterman
Senior Solution Architect
EMC Isilon Offer & Enablement Team
chris.klosterman@emc.com
twitter: @croaking
peglarr
November 19th, 2014 06:00
You may have anticipated this answer, namely: "it depends" :-)
Troubleshooting issues like this is a 'chase the bottleneck' exercise. For SIQ itself, there is an optimal balance of SIQ jobs (policies executing simultaneously), SIQ nodes engaged, and SIQ workers per job and per cluster. You also have to look at both 'sides' of the SIQ job; it's not enough to look at the source cluster. Oftentimes it is the target cluster that is the bottleneck.
Still, there are dozens of things to check. Obtaining a packet trace is the key first move, to see first-hand where the latencies deviate from what you expect. This is the fine art of blending WAN/LAN and filesystem expertise.
It may be helpful to halt all SIQ activity and restart the appropriate daemons as well, to establish a baseline.
There is no substitute for a hop-by-hop analysis with packet traces when troubleshooting issues involving a network.
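One quick sanity check once you have those traces: a single TCP stream (i.e. one replication connection) can carry at most one window's worth of data per round trip, so throughput is capped at window/RTT no matter how fat the WAN pipe is. A back-of-the-envelope sketch — the 64 KB window and 40 ms RTT are just illustrative figures, not measured values:

```python
def max_tcp_throughput_mbps(window_bytes, rtt_ms):
    """Bandwidth-delay product ceiling for one TCP stream:
    at most one window of data in flight per round trip."""
    return (window_bytes * 8) / (rtt_ms / 1000) / 1e6

# e.g. a 64 KB window over a 40 ms WAN link:
print(max_tcp_throughput_mbps(65536, 40))  # roughly 13 Mb/s per stream
```

If the per-stream ceiling times the number of active workers lands near the throughput you're actually seeing, the WAN latency, not the clusters, is your bottleneck.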
Cheers
Rob
mattashton1
November 19th, 2014 10:00
Hi Chjatwork,
(Adding to the two excellent replies above, and I am the colleague Chris is referring to)
Give us some more info on how you are currently configured:
How many nodes (both source and destination)?
How many workers set? (sounds like you've upped them...)
How many interfaces available on both source and destination?
How many jobs?
How many concurrent jobs?
Typically what I see is that the target cluster is the limiting factor; e.g., you have a 12-node source cluster but only 3 nodes on the destination. Guess where the bottleneck (typically) is?
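As a back-of-the-envelope illustration of that imbalance — the per-node rate below is a made-up placeholder, not an Isilon spec:

```python
def cluster_ceiling_mb_s(source_nodes, target_nodes, per_node_mb_s=200):
    # a replication job can only run as fast as the smaller side can serve it
    return min(source_nodes, target_nodes) * per_node_mb_s

print(cluster_ceiling_mb_s(12, 3))   # the 3-node target sets the ceiling
print(cluster_ceiling_mb_s(12, 12))  # a matched target quadruples it
```

In other words, adding source nodes (or workers) past what the destination can absorb buys you nothing.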
Cheers,
Matt