
October 2nd, 2009 08:00

Replication warnings in Celerra Manager after DR move

We had two NS20s on site. One was production, the other for DR/replication. When the replications were configured, both NS20s were on the same LAN and everything was fine.

We eventually moved the DR NS20 offsite to a data center. The sites are connected with a 20 Mbps fiber WAN connection.

After the move, we started receiving the following warnings in Celerra Manager. The replications are still working and marked OK, but 20-30 warnings pop up daily about drops in the connection.

Severity: Warning
Brief Description: Slot 2: Primary=fs31_T1_LUN0_APM00081800540_0000_fs27_T1_LUN0_APM00080701610_0000(alias=iSCSI_LUN0_Replication), transferring. Data connection down. Retry.
Full Description: communication between source and destination is down.
Recommended Action: Bring network communication between source and destination up for the data to be transferred from source to destination.

The warnings either appear as single entries or persist for exactly one minute, to the second. They happen with all of the replications we have set up.

After running bandwidth monitoring, we found we are only using 6-7 Mbps of our 20 Mbps pipe.

Although I do think this is a network connectivity issue, the provider of the fiber line says everything is fine. Other tests do not show any drops in the connection; Celerra Manager is the only device reporting connection drops.

Has anyone seen these warnings after moving replications from a LAN to a WAN connection? Are there settings in the Celerra to adjust because of the change in bandwidth?

366 Posts

October 2nd, 2009 09:00

Hi,

What's the NAS code level running on the source and destination boxes?
What's the latency on this link?

If you do a server_ping from server_2 to the destination, what's the response time?
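
For example, something like this from the Control Station (a sketch from memory; the interface name and destination IP are placeholders, and the exact options are in the server_ping man page):

server_ping server_2 -interface cge0 10.10.20.5

Note the round-trip time it reports against the replication interface on the destination side.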


Gustavo Barreto.

366 Posts

October 2nd, 2009 09:00

Hi,

This RTT should not cause any issues...

There are several improvements in later NAS code versions, especially in 5.6.44.

I recommend opening a support ticket so they can investigate.

Although this does not mean the message will be gone, a code upgrade would give you all the fixes and enhancements...



Gustavo Barreto.

5 Posts

October 2nd, 2009 09:00

Both NS20s are running NAS code version 5.6.40-3.

When I server_ping the destination Data Mover, the response time is 3 ms.

366 Posts

October 2nd, 2009 10:00

Hi,

I understand.
The clock skew they fixed has no relation to the error message you are seeing; it would only block your sessions if it grew beyond 10 minutes.

I suggest asking for a NAS code upgrade on both sides.
As I mentioned, this might not "fix" your error messages, but it will give you several fixes and enhancements.

Gustavo Barreto.

5 Posts

October 2nd, 2009 10:00

I did open two tickets with EMC. On the first ticket they found and fixed the following:
The time on the destination Control Station was one hour behind.
The interconnect time was skewed, and this was corrected.

However, this did not fix the issue with the connection drops. We have talked to our EMC rep, and he said that we could run the Fiber Analyzer from EMC, but we would be charged for the service. They did not recommend running it because this is probably a network issue with the connection.

I am caught in the middle at this point, because EMC is telling me everything is fine and our consultant is saying everything is fine with the line.

17 Posts

October 2nd, 2009 11:00

Hi.

If you think it is a network problem and have noticed a time of day when it consistently happens, could you script a periodic/frequent ping to the IP address of the replication interface from a workstation to see if you lose connectivity? If the network goes down, the ping should fail while it is down, and if you capture the ping output you should be able to see the failure in the log. That'll give you one more data point.
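
Something quick and dirty like this would do it (a minimal sketch, assuming a Linux workstation; the IP address is a placeholder for your replication interface, and the ping flags differ slightly on other platforms):

#!/bin/sh
# log a timestamped line every time a single ping fails
while true; do
    if ! ping -c 1 -W 2 10.10.20.5 > /dev/null 2>&1; then
        echo "$(date '+%Y-%m-%d %H:%M:%S') ping to 10.10.20.5 FAILED" >> ping_drops.log
    fi
    sleep 5
done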

Also, you might see if the server logs give you any indication of what is going on.
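
On the Celerra side, the Data Mover log is where those messages land; from the Control Station, something like this (a sketch; see the server_log man page for the exact options):

server_log server_2 | grep -i "connection"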

Cheers.
-Dan

5 Posts

October 5th, 2009 05:00

The warnings appear randomly and do not fall into any particular periods of time. Sometimes they appear in the early AM during non-production hours, and other times they happen during the day. The warnings are mostly single entries, but sometimes they appear for exactly one minute, to the second.

At first, our consultant blamed the NS20 for saturating the WAN line and causing the warnings. They wanted us to throttle the NS20 to slow down the speed. After running some bandwidth monitoring, we found that wasn't the case.

The consultant is now opening tickets with the WAN provider and they have been running some tests. On Wednesday, they did some "repairs" on the line but the warnings are still appearing.

I'm trying to cover the bases with the NS20. I still do not know if this is 100% a WAN issue. I have a feeling the WAN provider will come back and say the line is fine, which it is, to an extent: the replications are working, and it is the small drops that cause the Celerra to throw a warning.

Is it possible to configure the Celerra to be less restrictive about the warnings it displays, or should I keep trying to find the source of the issue?

5 Posts

October 7th, 2009 06:00

I have scheduled a DART upgrade on both boxes to see if this will fix the issue.

131 Posts

November 25th, 2009 06:00

I am also having the same issue currently. Were you able to fix it with the upgrade? If not, could you share the current state of the issue?

Thanks,

95 Posts

December 1st, 2009 15:00

I have had this issue for a while now, replicating CIFS and iSCSI. Out of 12 replications, I have an issue with 3 of them.

The latest code, 5.6.45-5, does NOT resolve the problem. Support suggested looking into my network after reviewing a packet capture, which shows a lot of "TCP Previous Segment Lost", "TCP Dup ACK", and "TCP Out-of-Order" entries.

Can anyone take a capture before your firewall and check if you see the same?
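
To be concrete about what I mean (a sketch; the interface name and replication IP are placeholders, and the display filters are standard Wireshark/tshark analysis flags):

# capture replication traffic on the interface facing the firewall
tcpdump -i eth1 -s 0 -w repl.pcap host 10.10.20.5

# then pull out retransmissions, duplicate ACKs, and out-of-order segments
tshark -r repl.pcap -R "tcp.analysis.retransmission || tcp.analysis.duplicate_ack || tcp.analysis.out_of_order"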

The weird thing is that it is not happening with all replications, so if it were a networking issue, I would assume all my sessions would have the same problem.

131 Posts

December 1st, 2009 16:00

I heard the following from the thread owner; upgrading to the .46 code might have the fix.

"We did a DART code upgrade on both SANs to version 5.6.46.410.  After the upgrade, the warnings went from about 20-30 a day down to 1 or 2 every other day.  We are still receiving the occasional warning, but after talking with support, we decided the warning level is not high enough to look into any further".

6 Posts

December 9th, 2009 07:00

We have all these same issues, but with the addition that we're getting very low transfer rates on a 150 Mbps line (like 15 Mbps MAX) along with the errors. We've worked with tech support on it for a while, both Cisco and EMC. Our network traces show all the same issues: over 25% retransmits in some cases!

We've changed firewall settings (enabling SACK has given the most relief so far), Data Mover settings (tcpwindowlat, tcpwindowsizing), and done DART upgrades. Nothing so far has fixed it, but we did go from 2 Mbps to the 15 Mbps we get now. A rough sketch of both changes is below.
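
(A sketch only: the tcp-map and class names here are made up, the ASA lines assume the TCP normalizer was stripping SACK, the class-map matching replication traffic is assumed to already exist, and the Data Mover parameter name and value are illustrative; confirm them against the Celerra Parameters Guide before changing anything.)

! ASA: allow SACK through the TCP normalizer for replication flows
tcp-map repl-tcp-map
  tcp-options selective-ack allow
policy-map global_policy
  class repl-traffic
    set connection advanced-options repl-tcp-map

# Celerra Control Station: inspect, then modify, a Data Mover TCP parameter
server_param server_2 -facility tcp -info tcpwindow
server_param server_2 -facility tcp -modify tcpwindow -value 65535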

I basically log in every few days and delete over 2,500 of these errors, and while we have gotten replication working "acceptably," as we add more data this is going to become a bigger problem.

In our case, we're running ASA5520s in this site and PIX525Es in the DR site, so our next step is to upgrade the DR firewalls and revisit it at that point. But I wanted to be part of this thread in case we can help each other out! We're getting 5540s for the DR site next week, and once they are implemented I'll report back, since Cisco, EMC, and we all believe it to be a VPN/firewall issue at this point. It's worth noting that we are not seeing CPU saturation on any of our firewalls, and non-VPN/non-replication traffic is plenty fast without retransmit issues.

95 Posts

December 9th, 2009 09:00

xzi,

We are also running ASA5520s... Did you ever take any packet captures to see what is going on with the traffic when replication starts?

6 Posts

December 9th, 2009 11:00

Yup, a bunch of times, for both EMC and for Cisco. We started doing traces after working through all the "easy stuff" like verifying switchport settings, duplex settings, etc. The only other real network setting I think is worth mentioning is that we use LACP port-groups on both ends for the DMs: 2 each for iSCSI and the other 2 for replication along with our CIFS. It's 6509 gig blades on the production side and 6513 gig blades in DR, both using 2 interfaces with LACP.

Anyway, here are a few "observations" they made on our earlier traces, before SACK was enabled:

TCP Retransmission Information
TCP Segments         : 30348
TCP Retransmissions  : 2090 (3126)
Retransmission Ratio : 6.886% (10.30%)
Fast Retransmissions : 34
Possible Retrans.    : 1036
        RTO <= 0.25s : 276
0.25s < RTO <= 0.50s : 0
0.50s < RTO <= 1.00s : 0
1.00s < RTO <= 1.50s : 1780
1.50s < RTO          : 0
* Segments flagged as Out of Order are possibly Fast Retransmissions
* Numbers in parentheses include possible Fast Retransmissions

From the traces we sent during the establishment of a replication session, they suggested changing DM settings like tcpwindow and tcpwindowlowat, and while those helped, we still get the retransmits, just fewer, I assume. It's interesting that you are also on the 5520s; I was kinda hoping the 525Es were the issue, since we're putting 5540s on the other end!

6 Posts

August 26th, 2010 06:00

I wanted to follow up, as we have FINALLY fixed our issues. It turns out it was related to the IDS modules in our ASA5520s on the source side. Disabling them (inline/bypass) was not enough; we had to actually remove the inspect rule from the service policy, and like magic, we saturated our circuit almost immediately. Put the inspect rule back in (even without enabling the IDS module) and traffic died again.

For now we've left this rule off. It doesn't appear to be a load issue with the IDS module (it's definitely not with the firewall), but just in case you guys (or new visitors) run into this issue, this is a place to look. Our configuration happened to be ASA5520s with SSM-10 IDS modules; however, I would imagine this could happen with any combination of ASA firewall and SSM IDS module. The change is sketched below.
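
For reference, removing the rule amounted to taking the IPS action out of the service policy, roughly like this (a sketch; "inspection_default" is the ASA default class, and your config may send traffic to the SSM from a different class):

policy-map global_policy
  class inspection_default
    no ips inline fail-open

Re-entering the "ips inline fail-open" line under the same class reproduced the slowdown immediately, even with the module itself disabled.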
