Replication partner down status following firmware upgrades
I am currently seeing volumes with a replication status of "partner down" following recent firmware upgrades. I have a case open with EQ Support, but their only suggestion so far doesn't seem to be related to the problem at hand (they asked me to enable jumbo frames on our hosts and switches - not sure what that has to do with SAN-to-SAN replication across a WAN link, but maybe they were just making a general performance suggestion).
Despite the "partner down" status for certain volumes, there are still volumes that have successfully replicated to that same partner since the upgrades. I've also seen volumes that replicated successfully for one or two days after the upgrades before exhibiting the "partner down" issue.
We replicate to a DR site in a neighboring state, so initial replicas of our 10-15 TB volumes can take quite a while. Manual transfers aren't practical for us, and we can typically afford to wait a month or so for the initial replicas to complete. Having to delete a handful of these and start over, however, would be far from ideal.
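As a rough sketch of the scale involved, the arithmetic below estimates the sustained WAN throughput needed to finish one initial replica within a given window. It assumes decimal terabytes and a 30-day month, and it ignores compression, protocol overhead, and other replicas sharing the link. The function name `required_mbps` is made up for illustration.

```python
def required_mbps(volume_tb, days):
    """Sustained rate in Mbit/s needed to move volume_tb (decimal TB) in the given days."""
    bits = volume_tb * 1e12 * 8      # decimal TB -> bits
    seconds = days * 86_400          # days -> seconds
    return bits / seconds / 1e6      # bits per second -> Mbit/s


if __name__ == "__main__":
    for size_tb in (10, 15):
        print(f"{size_tb} TB in 30 days: ~{required_mbps(size_tb, 30):.1f} Mbit/s")
```

Under those assumptions, a single 10-15 TB volume needs roughly 31 to 46 Mbit/s sustained for the whole month, which is why restarting even a few initial replicas is costly.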
Here is the background of the upgrades I performed:
2/19/18 - Paused replication, performed disk and controller firmware upgrades on the 2-member group used for replication at the DR site. Multi-step - 8.1.1 -> 9.0.9 -> 9.1.4. Resumed replication. Source group remained at f/w 8.1.1 (still compatible with 9.1.4 for replication according to the EQ compatibility matrix). A couple of volumes ended up with "partner down" status.
2/22/18 - Paused replication (on both sides), performed disk and controller firmware upgrades on the 2-member source group. Multi-step - 8.1.1 -> 9.0.9 -> 9.1.4. Resumed replication. All members involved in replication are now on the same firmware version. At least one of the volumes that previously showed "partner down" completed its replication, but other volumes started exhibiting the problem. Because we typically have replication happening 24/7, I had to pause replication while some volumes were in progress. I would have preferred to wait until they finished, but we are adding two new members and therefore had to perform the upgrades now.
Opened support case on 2/23/18. For reference, I have tested "ping" for all combinations of source and destination communication (ping -I). No issues.
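For anyone wanting to repeat the connectivity check, here is a minimal sketch that lists every source/destination pair in both directions so none gets skipped. The IP addresses are hypothetical placeholders from the documentation ranges, and the `ping -I <source> <destination>` argument order is an assumption to verify against the array CLI reference. The script only builds the command strings; it does not run anything.

```python
from itertools import product

# Hypothetical placeholder addresses, not real group or member IPs.
SOURCE_IPS = ["192.0.2.11", "192.0.2.12"]
PARTNER_IPS = ["198.51.100.21", "198.51.100.22"]


def ping_matrix(local_ips, remote_ips):
    """Return one 'ping -I <local> <remote>' command string per address pair."""
    return [f"ping -I {src} {dst}" for src, dst in product(local_ips, remote_ips)]


def full_matrix(site_a, site_b):
    """Cover both directions: site A to site B, then site B to site A."""
    return ping_matrix(site_a, site_b) + ping_matrix(site_b, site_a)


if __name__ == "__main__":
    for cmd in full_matrix(SOURCE_IPS, PARTNER_IPS):
        print(cmd)
```

With two interfaces per side this yields eight commands: four to run from the source group and four from the DR group.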
At this point, this looks more like a bug than anything else.
Has anyone else seen this issue and found a fix? The fact that some volumes replicated successfully after the upgrades and only later started having issues leads me to believe that pausing and resuming replication isn't necessarily what broke it.