Overview of SRDF/A session drop reasons, with three case studies

Introduction

This article gives an overview of the common reasons an SRDF/A session drops and walks through three cases.

Detailed Information

SRDF Solution:

  • SRDF modes of operation:
    • SRDF/S: Synchronous mode. Maintains a real-time (synchronous) mirrored copy of production data (R1 devices) at a physically separated Symmetrix system (R2 devices).
    • SRDF/A: Asynchronous mode. Mirrors data from the R1 devices while maintaining a dependent-write consistent copy of the data on the R2 devices at all times. The copy of the data at the secondary site is typically only seconds behind the primary site.
    • Adaptive copy: Adaptive copy modes allow the R1 and R2 devices to be more than one I/O out of synchronization. Unlike the asynchronous mode, adaptive copy modes do not guarantee a dependent-write consistent copy of data on R2 devices.
  • Modes can be changed dynamically.
  • Modes of operation can be specified at the device level.
  • All the modes share Symmetrix memory.

I/O Operation of SRDF/A:

The following figure shows the I/O operation of SRDF/A.

SRDFA_1.png

Common SRDF/A Connectivity

IP WAN Connectivity

  • Long distance
  • The interconnecting routers are less stable than FC links.

SRDFA_2.png

SRDF/A link recovery capabilities

  • Transmit Idle and link limbo (default 10 s) keep SRDF/A fully active during network outages that cause an All Links Lost condition.
  • Write pacing is an SRDF/A feature that balances cache utilization by extending the host write I/O response time to prevent SRDF/A operational interruptions.
    • Group-level pacing: enabled for the entire SRDF/A group when slowdowns in host I/O rates, transmit cycle rates, or apply cycle rates occur.
    • Device-level pacing: intended for SRDF/A solutions in which the SRDF/A R2 devices participate in TimeFinder copy sessions.
  • Tolerance mode is an SRDF/A feature that lets you balance performance against data consistency requirements. With Tolerance mode on, one or more devices can be Not Ready on the link at the R1 side while the SRDF/A session remains active. R2 consistency is not maintained, but this may be acceptable during certain controlled service activities. If all links are lost, Tolerance mode does not keep SRDF/A active.

SRDF/A cache utilization limits

  • Write Pending Limit
    • Prior to Enginuity 5875: 80% of system memory.
    • After 5875: set by the Cache Allocation Policy parameter in the BIN file, 50% to 70% of available memory.
    • SRDF/A cache usage limit: 74% on VMAX, 94% on pre-VMAX arrays. Example: on a VMAX with a Write Pending limit of 75%, the SRDF/A limit is 74%.
  • SRDF group session priority: the RDF group attribute session_priority determines which SRDF/A sessions are dropped first if cache becomes full. Values range from 1 to 64, with 1 being the highest priority (last to be dropped).
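The priority rule above can be illustrated with a minimal Python sketch. The `SrdfaSession` class, the `drop_order` function, and the group numbers are hypothetical names and values invented for illustration; the only facts taken from the article are the 1 to 64 range and that priority 1 is dropped last. How the array breaks ties between equal priorities is not stated here, so the sketch makes no claim about it.

```python
from dataclasses import dataclass


@dataclass
class SrdfaSession:
    group: int             # RDF group number (hypothetical sample value)
    session_priority: int  # 1 = highest priority (dropped last), 64 = lowest


def drop_order(sessions):
    """Return sessions in the order they would be dropped: lowest priority first."""
    for s in sessions:
        if not 1 <= s.session_priority <= 64:
            raise ValueError(f"session_priority out of range: {s.session_priority}")
    # A larger session_priority value means a lower priority, so sort descending.
    return sorted(sessions, key=lambda s: s.session_priority, reverse=True)


sessions = [SrdfaSession(10, 1), SrdfaSession(11, 33), SrdfaSession(12, 64)]
print([s.group for s in drop_order(sessions)])
```

In this sample, group 12 (priority 64) would be the first candidate to drop and group 10 (priority 1) the last.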

Error messages and events

Enginuity error messages:

  • 047D - One link lost
  • 046D - All links for a group lost
  • 047E - A single link has been regained
  • 046E - All links in an RDF group have been regained
  • CACA - SRDF/A session dropped by a non-user request
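When scanning logs for these codes, a small lookup table can help. The following is a minimal Python sketch; the `explain` function and the sample log lines are hypothetical, and the real inline log format is not shown in this article, so the sketch only searches for the codes as standalone tokens.

```python
import re

# Code meanings taken from the list above.
ERROR_CODES = {
    "047D": "one link lost",
    "046D": "all links for a group lost",
    "047E": "a single link has been gained",
    "046E": "all links in an RDF group have been regained",
    "CACA": "SRDF/A session dropped due to a non-user request",
}

# Match a known code as a whole token, so "CACA.10" matches but "1047D" does not.
_CODE_RE = re.compile(r"\b(047D|046D|047E|046E|CACA)\b")


def explain(line):
    """Return (code, meaning) pairs for every known code found in a log line."""
    return [(code, ERROR_CODES[code]) for code in _CODE_RE.findall(line.upper())]


# Hypothetical log lines, for illustration only.
print(explain("dir 7E reported CACA.10"))
print(explain("event 046D followed by 046E"))
```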

Symevent error messages:

  • Link drop
  • All groups go into the Suspended state.
    • If Transmit Idle is enabled, they go into the Transmit Idle state instead.

SRDFA_3.png

  • Memory limit reached

SRDFA_4.png

Common SRDF/A drop reasons and solutions

Besides the link problems described above, an SRDF/A session commonly drops when the data to be transferred exceeds the link bandwidth: cache fills until the limit is reached, which triggers the drop. Array-wide performance problems can also exhaust cache and lead to a drop.

  • Bandwidth issue: increase the link bandwidth.
  • Short-term insufficient bandwidth: use DSE (Delta Set Extension).
  • Array-level performance issue: fix the array-level performance issue first.

Case 1 - Bandwidth issue

SRDFA_5.png

Configuration: SRDF/A from Site B to Site C. The three arrays share a WAN with a total bandwidth of 622 Mb/s.

Issue: One of the RDF groups drops at around 17:30 every day.

Cause: The performance trace shows SRDF/A data transfer reaching around 90 MB/s (over 720 Mb/s), which exceeds the total bandwidth. Insufficient bandwidth at peak time causes the SRDF/A group to drop.

Time                  SRDF/A throughput (MB/s)
11/12/2013 17:20      87.94
11/12/2013 17:25      94.59
11/12/2013 17:30      71.96
11/12/2013 17:35      73.58
11/12/2013 17:40      79.58
11/12/2013 17:45      79.58
11/12/2013 17:50      78.46

Solution: Increase bandwidth.
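The comparison in this case is simple unit arithmetic, sketched below in Python. The sample values are copied from the trace above and the 622 Mb/s figure from the configuration. The sketch assumes the trace values are MB/s, that 1 byte is 8 bits, and that protocol overhead and other traffic on the shared WAN can be ignored, so the real headroom is likely smaller than shown.

```python
LINK_MBIT_PER_S = 622  # total WAN bandwidth from the case description

# (time, MB/s) samples copied from the performance trace above
samples = [
    ("17:20", 87.94),
    ("17:25", 94.59),
    ("17:30", 71.96),
    ("17:35", 73.58),
    ("17:40", 79.58),
    ("17:45", 79.58),
    ("17:50", 78.46),
]


def to_mbit(mbyte_per_s):
    """Convert megabytes per second to megabits per second."""
    return mbyte_per_s * 8


# Times at which the SRDF/A throughput alone exceeds the whole WAN link.
over_times = [t for t, mb in samples if to_mbit(mb) > LINK_MBIT_PER_S]

for t, mb in samples:
    status = "over" if t in over_times else "within"
    print(f"{t}  {mb:6.2f} MB/s = {to_mbit(mb):7.2f} Mb/s  ({status} {LINK_MBIT_PER_S} Mb/s)")
```

A 622 Mb/s link carries at most 77.75 MB/s, so five of the seven samples are above what the link can transfer even before overhead is counted.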

Case 2 - Short term insufficient bandwidth

SRDFA_6.png

Issue: The performance log shows that SRDF/A traffic suddenly increased, and then the entire group went offline.

Log analysis: The inline log contains a CACA.10 error, which means that array memory usage exceeded the limit and the SRDF/A group was dropped.

  • The peak is very short
  • DSE is not used
  • Backend resources are not busy

Solution: DSE is recommended. After DSE was applied, a short-term burst of data from SRDF/S group 206 had to be transferred through SRDF/A group 207 to the remote disaster recovery array. DSE buffered a large amount of data, and SRDF/A group 207 ran close to its transmission limit for a period of time, but the SRDF/A group did not drop.

SRDFA_7.png
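Why DSE helps with a short peak can be shown with a toy backlog model in Python. This is a deliberately simplified sketch and not how the array implements DSE: it assumes the backlog grows by the difference between the write rate and the link rate each second, that the session drops as soon as the backlog exceeds the cache limit when DSE is off, and that DSE absorbs any excess on disk. The `simulate` function and all numbers are hypothetical.

```python
def simulate(write_mb_s, link_mb_s, cache_limit_mb, dse_enabled):
    """Toy model: return (dropped, peak_dse_mb) for per-second write rates in MB/s."""
    backlog = 0.0   # data written on R1 but not yet sent to R2, in MB
    peak_dse = 0.0  # largest amount spilled beyond the cache limit, in MB
    for rate in write_mb_s:
        # The backlog grows when writes outpace the link and drains otherwise.
        backlog = max(backlog + rate - link_mb_s, 0.0)
        spill = max(backlog - cache_limit_mb, 0.0)
        if spill > 0 and not dse_enabled:
            return True, peak_dse  # cache limit exceeded: session drops
        peak_dse = max(peak_dse, spill)
    return False, peak_dse


# Hypothetical workload: 10 s quiet, a 30 s burst, then 60 s quiet.
burst = [50] * 10 + [200] * 30 + [50] * 60
without_dse = simulate(burst, link_mb_s=100, cache_limit_mb=1000, dse_enabled=False)
with_dse = simulate(burst, link_mb_s=100, cache_limit_mb=1000, dse_enabled=True)
print(without_dse, with_dse)
```

In this model the same burst drops the session without DSE, while with DSE the excess is buffered and drained once the burst ends. If the overload were sustained rather than short, the backlog would keep growing, which is why a lasting bandwidth shortfall still calls for more bandwidth.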

Case 3 - A link problem caused an out-of-memory condition

Log analysis: At 16:30, an inline error shows that the cycle number had not switched for an hour.

SRDFA_8.png

A later inline error shows that a memory overrun led to the SRDF/A interruption. The performance data shows Write Pending reaching 79% at 23:44.

SRDFA_9.png

Performance log analysis: From 15:30, the cycle number no longer increased, but the SRDF/A state was still active. The active cycle size then kept growing, and memory usage grew with it. The session became inactive at 23:45.

SRDFA_10.png

SRDFA_11.png

Cause: A link quality problem prevented R2 from receiving the cycle, so the cycle size kept growing and eventually exceeded the cache usage limit.
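The symptom in this case, a cycle number that stops advancing while the session still reports active, can be checked for in sampled performance data. The Python sketch below is one possible check; the `find_stall` function, the sample layout, and the cycle numbers are hypothetical. Only the 15:30 stall time mirrors the case.

```python
def find_stall(samples, min_stalled_samples=3):
    """Find where the cycle number stops advancing while the session is active.

    samples: list of (timestamp, cycle_number, session_active) in time order.
    Returns the timestamp of the last cycle change once the number has stayed
    the same for min_stalled_samples consecutive active samples, else None.
    """
    stall_start = None
    stalled = 0
    prev_cycle = None
    for ts, cycle, active in samples:
        if active and prev_cycle is not None and cycle == prev_cycle:
            stalled += 1
            if stalled >= min_stalled_samples:
                return stall_start
        else:
            # Cycle advanced (or session inactive): restart the count here.
            stalled = 0
            stall_start = ts
        prev_cycle = cycle
    return None


# Hypothetical samples: the cycle number stops changing from 15:30 onward.
samples = [
    ("15:00", 100, True),
    ("15:15", 130, True),
    ("15:30", 160, True),
    ("15:45", 160, True),
    ("16:00", 160, True),
    ("16:15", 160, True),
]
print(find_stall(samples))
```

Flagging the stall early matters here because the session stayed active for hours after the cycle stopped switching, while cache usage kept climbing toward the limit.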

Solution: Repair the link quality problems.

Author: Fenglin Li



