Replication 'Partner Down' / Packet Errors on Mgmt Interface

Question

Hello, I have strange replication issues on two customer installations.

Installation 1:
- PS6110XS replicating to PS4110E using Smart Replicas (HIT/VE)
- Both EQLs running V. 6.0.1
- Both EQLs connected to a PC8024F switch stack. Configured as 'Dell Best Practice' (Flow Control, Jumbo frames, Port Fast, STP off)
- The Group Admin web interface suddenly shows 'Partner down' for some replications. The last replication is stuck 'in progress'. Other replications (to the same member) are working fine.
- After a manual controller failover, 'partner down' changes back to 'in progress', but no data is transferred.

Installation 2:
- PS6110XV replicating to PS6110XV using Smart Replicas (HIT/VE)
- Both EQLs running V. 6.0.1
- Each EQL is connected to a PC8024 switch stack. Both stacks are connected via a 2x 10 Gbit LAG (~300 meters). Configured as 'Dell Best Practice' (Flow Control, Jumbo frames, Port Fast, STP off)
- The Group Admin web interface suddenly shows 'Partner down' for some replications. The last replication is stuck 'in progress'. Other replications (to the same member) are working fine.
- After a manual controller failover, 'partner down' changes back to 'in progress', but no data is transferred.
- Massive 'packet errors' (~100 errors/minute) on all management interfaces (even after a manual controller failover). Changed the switch port to 100/Fdx: no change. The switches (HP 8212 and HP 2810) show no errors. I can't believe there are four broken cables.

Any ideas?

Marcel Mertens

MarcelMertens · Answer

A case is already open, but at the moment it doesn't look like support has a clue.

DCB is disabled on the switch and the EQL.

We don't know whether the packet errors on the management interface are causing the partner down error.

I changed the switch, the duplex settings, and the cable: still packet errors.

Also updated the replication site to 6.0.2. Still packet errors.

MarcelMertens · Answer

DCB is disabled on the EQL. AFAIK there is no 'DCB off' switch on the PC8024. Firmware is 5.0.0.4. PFC (priority flow control) is inactive.

MarcelMertens · Answer

No, iSCSI is running on VLAN 10. No WAN accelerators. The replication site (~400 meters away) is connected via a 20 Gbit LAG between two PC8024 stacks (one stack of 2 switches at each site).

There are some strange things:

Some replications work fine while others run into "partner down". I have to delete the complete replica set and start over. Sometimes it works for a few replications until it runs into "partner down" again.

The replication destination always shows that the replication with "partner down" was successful.

As you can see (source site), the replication from 14:31 is still "in progress" with replication status "partner down".

Destination site shows that this replication is completed:

MarcelMertens · Answer

iSCSI VLAN 10 is untagged to all storage ports. See pictures: LLDP is active for all storage ports.

innyinskip · Answer

Hi Guys,

Did you ever find a resolution to this? We are seeing something similar.

Replication is working fine on particular volumes, but on others it stops with the message 'partner-down'. We have logged a call with Dell, who at the moment are going through networking troubleshooting tasks, but if some volumes are working it can't be a networking issue.

Interestingly, if I cancel the replica creation and start it again, it stops at EXACTLY the same point each time.

MarcelMertens · Answer

Yep,

very simple (in our case):

The destination site must have a small amount of free space (at least 20-30 GB) left for the incoming replication. I had configured all of the array's space as delegated space. After reducing the delegated space by 20-30 GB, so that a small amount of "free space" remained, everything was fine.

You can't find this in the manual; it is written in the release notes of the V6 firmware. It took a while for EQL support to figure this out.

I hope this helps you...
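To make the free-space check above concrete, here is a minimal sketch in Python. It only does the arithmetic by hand: the function name, its parameters, and the 30 GB default (taken from the upper end of the 20-30 GB mentioned above) are assumptions for illustration, not anything from the EqualLogic tooling. Read the real numbers from Group Manager on the destination group.

```python
def delegated_space_reduction_gb(pool_capacity_gb: float,
                                 delegated_gb: float,
                                 other_used_gb: float = 0.0,
                                 min_free_gb: float = 30.0) -> float:
    """Return how many GB to take away from delegated space so that
    at least min_free_gb of the pool stays unallocated.

    All names and the 30 GB default are illustrative assumptions.
    """
    free_gb = pool_capacity_gb - delegated_gb - other_used_gb
    # Nothing to do if enough free space is already left in the pool.
    return max(0.0, min_free_gb - free_gb)


# Hypothetical pool where everything was handed to delegated space.
print(delegated_space_reduction_gb(10000, 10000))  # 30.0
# Hypothetical pool that already has 50 GB free.
print(delegated_space_reduction_gb(10000, 9950))   # 0.0
```

A result greater than zero means the delegated space should be shrunk by about that much before the next replication runs.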

vmbru · Answer

They say 5% or 100 GB to 200 GB of free pool space is needed as best practice, right? You must also have plenty of space to test failover of all volumes. If HQ is 14.4 TB raw, figure a bare minimum of 28.8 TB at the DR site to fire up cloned volumes while leaving the original replicas alone when using VMware SRM. I'd give the DR site 3x the raw capacity of the HQ site, factoring in dedicated SRM volumes, etc.
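The sizing arithmetic above can be written down as a tiny Python sketch. The function name and the multipliers as parameters are assumptions for illustration; the 14.4 TB figure and the 2x and 3x factors are the ones from the post, and the real requirement depends on the number of restore points and the replication schedule.

```python
def dr_raw_capacity_tb(primary_raw_tb: float, factor: float = 2.0) -> float:
    """Raw capacity to plan for at the recovery site, expressed as a
    multiple of the primary site's raw capacity (illustrative helper)."""
    if primary_raw_tb <= 0 or factor < 1:
        raise ValueError("capacity must be positive and factor at least 1")
    return primary_raw_tb * factor


primary = 14.4  # TB raw at HQ, the example figure from the post
print(f"bare minimum (2x): {dr_raw_capacity_tb(primary):.1f} TB")       # 28.8 TB
print(f"with headroom (3x): {dr_raw_capacity_tb(primary, 3.0):.1f} TB")  # 43.2 TB
```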

MarcelMertens · Answer

You need more raw space on the recovery site than on the primary. How much more depends on your restore points and replication schedule, but double the size of the primary site is a good rule of thumb.

None of our customers is currently using replication over a WAN link. The most common use case is that customers have two datacenters on the same campus and replicate between them.

The application servers (mostly VMware hosts) are spread across both datacenters, so in case of a failover you just have to fire up the replication. No use case for SRM.

wadet5k · Answer

Fixes have always been listed in the firmware details for each version, although the fix list is not complete and not all symptoms may be listed. 6.0.4 and later address replication issues that you should review.

A partner down message can also mean just that: you don't have good communication between the groups, or there is too much going on.

You should review your logs for replication start times and replication complete times. You may need to spread out your replication schedules.
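As a rough way to do that review, the Python sketch below flags replications whose run windows overlap. It assumes you have already copied the start and complete times out of the group's event log by hand; the volume names and times are made up, and nothing here parses a real EqualLogic log format.

```python
from datetime import datetime


def overlapping_replications(windows):
    """Given (volume, start, end) tuples, return pairs of volume names
    whose replication windows overlap in time."""
    ordered = sorted(windows, key=lambda w: w[1])  # sort by start time
    overlaps = []
    for i, (vol_a, _start_a, end_a) in enumerate(ordered):
        for vol_b, start_b, _end_b in ordered[i + 1:]:
            if start_b >= end_a:
                break  # sorted by start, so later windows cannot overlap either
            overlaps.append((vol_a, vol_b))
    return overlaps


def at(hhmm):
    """Build a datetime on an arbitrary fixed day from an 'HH:MM' string."""
    return datetime.strptime("2013-01-01 " + hhmm, "%Y-%m-%d %H:%M")


# Hypothetical windows, typed in by hand from the event log.
sample = [
    ("vol1", at("14:00"), at("14:40")),
    ("vol2", at("14:31"), at("14:50")),
    ("vol3", at("15:00"), at("15:10")),
]
print(overlapping_replications(sample))  # [('vol1', 'vol2')]
```

Pairs that show up here are candidates for moving to a different slot in the schedule.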
