Unsolved

June 13th, 2012 13:00

SRDF/A getting suspended - Write Pending Limit reached

Guys,

   I am encountering a problem with SRDF/A replication on one of our critical RAC production databases. The database does a lot of I/O on the VMAX array, close to 8000 IOPS on average and peaking at 12000 IOPS.

When it reaches the peak, RDF gets suspended with the error "SRDF/A Session Dropped, Write pending limit reached".

When I look at the frame write-pending limit, it is under 20% utilized, which means I have plenty of write-pending slots available.

Is there a limit on how many cache slots an RDF group can use? Or is there a per-device limit?

I suspect the archive logs and redo logs were hitting the array very hard and were the source of the problem.

Any help in nailing down this issue would be appreciated.

Thanks,

Ram.

22 Posts

June 13th, 2012 14:00

Cache Size (Mirrored)                :  120320 (MB)

# of Available Cache Slots           : 1466088

Max # of System Write Pending Slots  : 1099566

Max # of DA Write Pending Slots      :       0

Max # of Device Write Pending Slots  :   54978

Replication Cache Usage (Percent)    :       5

Now, the max slots per device is 54978. Does that mean I have 54978 x 64 KB (3.35 GB) of cache available per device?
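As a quick check of the arithmetic in the question above, this sketch converts the device write-pending slot limit into a cache size. It assumes, as the question does, a 64 KB cache slot; the function name is illustrative and not part of any EMC tool.

```python
SLOT_SIZE_KB = 64  # assumption taken from the question: one cache slot = 64 KB


def slots_to_gib(slots: int, slot_size_kb: int = SLOT_SIZE_KB) -> float:
    """Convert a number of cache slots to GiB (1 GiB = 1024 * 1024 KB)."""
    return slots * slot_size_kb / (1024 * 1024)


max_device_wp_slots = 54978    # "Max # of Device Write Pending Slots"
max_system_wp_slots = 1099566  # "Max # of System Write Pending Slots"

print(f"per-device WP ceiling: {slots_to_gib(max_device_wp_slots):.2f} GiB")
print(f"device limit / system limit: {max_device_wp_slots / max_system_wp_slots:.1%}")
```

With these figures the per-device ceiling comes out at about 3.36 GiB, in line with the estimate in the question, and the device limit is roughly 5% of the system write-pending limit.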

Thanks,

Ram.

448 Posts

June 14th, 2012 07:00

Hopefully you already have a case open with EMC on this so they can look into it.  At first glance, it sounds as though you are not getting the data over to the target fast enough.  I would look at the number of SRDF ports you are using and the bandwidth between the sites.

Look up this Primus solution:

EMC142579  "What are the best practices for configuring SRDF/A?"

859 Posts

June 15th, 2012 00:00

Hi Ram,

The Symm drops SRDF/A when Snow cache usage (also called maximum cache usage) hits 74%. Perhaps you looked at the average WP usage, which is at 20%, but at some point it had hit the max limit? I am just assuming...

Your R2 Symm should also have enough cache slots (we recommend the same as or more than the R1), and the best way to identify this problem is to collect WLA or STP data.

regards,

Saurabh

1.3K Posts

March 20th, 2014 11:00

The WP count can drop rapidly, so it could have been high, then when you looked it was low.  Could also be the R2 WP count, not the R1?
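To illustrate the point about a fast-moving write-pending count, here is a minimal sketch that compares the average and the peak of a series of sampled counts against a drop threshold. The sample values, the 74% threshold, and the function name are all hypothetical; real counts would come from STP or WLA data.

```python
from statistics import mean


def summarize_wp(samples, system_wp_limit, drop_percent):
    """Return (average %, peak %, crossed) for sampled write-pending counts.

    `drop_percent` is the assumed SRDF/A cache-usage limit, expressed as a
    percentage of the system write-pending limit.
    """
    avg_pct = 100 * mean(samples) / system_wp_limit
    peak_pct = 100 * max(samples) / system_wp_limit
    return avg_pct, peak_pct, peak_pct >= drop_percent


# Hypothetical samples: mostly quiet, with one short burst.
samples = [150_000, 180_000, 200_000, 850_000, 210_000, 160_000]
avg_pct, peak_pct, crossed = summarize_wp(samples, 1_099_566, 74)
print(f"average {avg_pct:.0f}%, peak {peak_pct:.0f}%, limit crossed: {crossed}")
```

In this made-up series the average stays well under the limit while a single sample exceeds it, which is how a low reading taken after the event can coexist with a drop.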

1.3K Posts

March 20th, 2014 11:00

The following is a Primus(R) eServer solution:

ID: emc142579
Domain: EMC1
Solution Class: 3.X Compatibility

Goal       What are the best practices for configuring SRDF/A?
Goal       SRDF/A Best Practices
Fact       Enginuity: 5670
Fact       Enginuity: 5671
Fact       Enginuity: 5771
Fact       Enginuity: 5772
Fact       Enginuity: 5773
Fact       Enginuity: 5874
Fact       Enginuity: 5875
Fact       Product: Symmetrix DMX Series
Fact       Product: Symmetrix DMX-3
Fact       Product: Symmetrix DMX-4
Fact       Product: Symmetrix VMAX Series
Symptom    Error code: AX CACA 10
Symptom    Performance problems when SRDF/A is running.
Symptom    SRDF/A running on Symmetrix DMX or DMX-3 Symmetrix systems, but failing to stay active, dropping with CACA.10 errors.

Fix        QUALIFICATION: The BCSD tool must be used to size new configurations. It is recommended that all SRDF configurations, including SRDF/A, be qualified by your EMC support representative via the SVC group.

R2 FRAME:
The R2 Frame should be AT LEAST as fast as the R1. This means the same number, size, and type of drives and the same protection schemes should be used in both the R1 and R2 for the standard volumes. If additional volumes such as BCVs are configured on the R2 side, additional drives and cache should be used. For example, if using RAID 1/0 on the source frame with 15k drives, RAID 1/0 with 15k drives should be used on the target frame. Consideration should be given to segregating standards and BCV volumes onto separate drives.

The default device write pending limit (amount of cache slots per volume)
should be the same or higher in the R2 as in the R1. This may require
more physical cache in the R2 than in the R1.

  • When defining CLONE on the R2, keep the clone devices on
    segregated drives and use the pre-copy option.
  • QOS with an initial value of 2 can be used to help reduce
    the copy impact.
  • SNAP is NOT ALLOWED on the R2
    volumes.

RA COUNT:
The correct number of RAs need to be configured. There should be at least N+1 RAs, where N is the number of RAs required, so that a service action can be performed to replace an RA if necessary.

Synchronous groups and SRDF/A groups should be segregated onto their own physical adapters. Do not mix Synchronous and SRDF/A on the same adapters. Directors supporting SRDF/A should not be shared with any other SRDF solution.

Caution! When moving from a Synchronous solution to SRDF/A, in many cases we have seen the bandwidth and adapter utilization INCREASE as a result of the overall response time to the system decreasing.

MONITORING:
SRDF/A should be monitored during the initial roll-out to ensure that all components were properly sized and configured. Data needs to be collected via STP or WLA and then run through the tools again to verify the initial projections were correct. STP at 5x71 microcode includes SRDF/A statistics, which can be very beneficial.

Do not forget that Mainframe MSC customers have a way to monitor for issues, and that is the SCF1562I and SCF1563I messages. These will tell if they are getting transmit or restore issues. The messages will also tell which box is the issue.

The SYMSTAT commands were specifically created for monitoring open systems SRDF/A, but when issued from the Service Processor on the DMX they can be quite informative regardless of whether it is mainframe or open systems.

There are three options:

  1. Cycle
  2. Requests
  3. Cache

Using different combinations of the three options can help determine what
caused the CACA and you can even prevent a drop by monitoring the cache
utilization closely. SRDF/A should be monitored on a regular basis to look for
workload changes and to predict  increases in CACHE or BANDWIDTH due to
growth.

VERIFICATION:
The network should always be verified to ensure that the
projected amount of bandwidth is configured. STP or WLA should be collected
during the initial Adaptive Copy Synchronization to ensure that the required
bandwidth is configured and that the network runs error free. Compression ratios
should also be checked either at the switches or on the GigE adapters to verify
that the correct numbers were used.

Upgrade or Reconfiguration:
Always re-evaluate the SRDF/A solution prior to doing any upgrades or reconfigurations. This includes drive upgrades, adding volumes to the SRDF/A links, or changing the front-end connectivity (for example, changing ESCON to FICON).

Starting SRDF/A:
SRDF/A activation is considerate of cache utilization.
SRDF/A will capture a delta set of writes and send them in cycles across the
link. In addition to the new writes, SRDF/A will include up to 30,000 invalid
tracks per cycle. This is a design feature and the 30,000 track value was chosen
to prevent cache from being flooded by the invalid tracks. Therefore, EMC
generally recommends as a best practice to synchronize the boxes in Adaptive
Copy Disk mode to below 30,000 invalid tracks before activating SRDF/A. This
will ensure that SRDF/A will become secondary consistent within a few cycles.

SRDF/A will activate with many more than 30,000 invalid tracks and in
fact, some customers choose to activate SRDF/A  when they have thousands or
millions  of invalid tracks. This is allowed, but only a maximum of 30,000
invalid tracks will be sent with each SRDF/A cycle. As a result, it will take
many cycles before the frames are secondary consistent.
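The effect of the 30,000-track cap can be sketched with simple arithmetic. The example below gives a lower bound on the number of cycles needed to drain a backlog of invalid tracks. It assumes every cycle carries the full 30,000 tracks and ignores new host writes and bandwidth limits; the cycle length is a hypothetical input, not a value from the article.

```python
import math

MAX_INVALID_TRACKS_PER_CYCLE = 30_000  # per-cycle cap described above


def min_cycles_to_drain(invalid_tracks: int) -> int:
    """Lower bound on SRDF/A cycles needed to send all invalid tracks."""
    return math.ceil(invalid_tracks / MAX_INVALID_TRACKS_PER_CYCLE)


def min_drain_minutes(invalid_tracks: int, cycle_seconds: float) -> float:
    """Lower bound on drain time for an assumed, constant cycle length."""
    return min_cycles_to_drain(invalid_tracks) * cycle_seconds / 60


for backlog in (25_000, 300_000, 3_000_000):
    cycles = min_cycles_to_drain(backlog)
    minutes = min_drain_minutes(backlog, cycle_seconds=30)  # hypothetical 30 s cycle
    print(f"{backlog:>9} invalid tracks: at least {cycles} cycles, ~{minutes:.1f} min")
```

A backlog in the millions of tracks needs on the order of a hundred cycles or more by this estimate, which is why synchronizing in Adaptive Copy Disk mode first shortens the time to secondary consistency.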

Fiber RDF Directors: Enable RF flow control. See emc152051 for a description of this feature.

Page Data Sets: Your EMC CE needs to set Enable Page Data Set Mode to YES in the IMPL.bin file to ensure synchronous replication of all page data sets. Refer to emc100913.

Configuring Delta Set Extension (DSE): See emc204521 for best practices for configuring DSE.

Note       SRDF/A will drop when 94% of System WP limit is reached.  There is a parameter called "Snow Cache Use" or "Max Cache Usage" limit that controls this.  This parameter can be lowered to cause SRDF/A to drop sooner. Only SRDF/A devices count against this value. If only a subset of the devices in the DMX has SRDF/A running, then this parameter may need to be lowered.  If DSE is configured in the Frame, Engineering recommends lowering the SRDF/A "Snow Cache Use" percentage to 74%. The "Snow Cache Use" limit can be changed via Inlines, Host Component, or SymCLI.  The recommendation is to have the customer change it with their software.

As of October 6, 2009 the recommendation from EMC Engineering is to lower the
SRDF/A  "Snow Cache Use" percentage to 74% on all Symmetrix
running SRDF/A. The Snow Cache Use setting is normally set for the R1 (Source)
side since that is where the host is typically configured. But if there is ever
a fail over to the R2 (Target) side you would want to set it there. So for best
practices set it to the recommended 74% on both the Source and Target
boxes.

Symmetrix VMAX:
Starting at Enginuity 5874.207.166, the SRDF/A "Snow Cache Use" percentage will automatically be lowered to 74%.
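To make the percentages concrete, this sketch converts a cache-usage limit into a slot count, using the "Max # of System Write Pending Slots" value posted earlier in the thread. It assumes the limit is applied as a straight percentage of the system write-pending limit, as the note describes; the function name is illustrative.

```python
def drop_threshold_slots(system_wp_limit: int, cache_percent: int) -> int:
    """Slots of SRDF/A write-pending data at which a drop would be expected,
    assuming the limit is a plain percentage of the system WP limit."""
    return system_wp_limit * cache_percent // 100


system_wp_limit = 1_099_566  # from the array output posted in this thread

for percent in (94, 74):
    slots = drop_threshold_slots(system_wp_limit, percent)
    print(f"{percent}% of system WP limit = {slots} slots")
```

Because only SRDF/A devices count against the limit, the figure to compare with these thresholds is the write-pending count of the SRDF/A devices, not the array-wide count.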

Note       To make the changes using Solutions Enabler, create a text file with the following and use the command symconfigure:

file.txt

set Symmetrix rdfa_cache_percent=75;

symconfigure -sid XXXX -file c:\file.txt preview  (to check that the command is valid)
symconfigure -sid XXXX -file c:\file.txt commit

226 Posts

March 20th, 2014 11:00

bz15,

The new support.emc.com article number is 8391. Try this link -- https://support.emc.com/kb/8391

Thanks,

- Sean

1 Rookie • 7 Posts

March 20th, 2014 11:00

I can't seem to be able to access the Primus link (looks like decommissioned service).

Can someone attach a copy of this EMC142579 doc here?

1 Rookie • 7 Posts

March 20th, 2014 12:00

Thx guys. You guys are awesome!

60 Posts

February 18th, 2016 11:00

We have a VMAX2 at Enginuity 5876.272.177, and SRDF is getting suspended frequently. Here are the system write-pending and cache details. What should we look into? Can anyone suggest something?

Cache Size (Mirrored)                :  999424 (MB)

# of Available Cache Slots           : 13256360

Max # of System Write Pending Slots  : 7974529

Max # of DA Write Pending Slots      :       0

Max # of Device Write Pending Slots  :  398726

Replication Cache Usage (Percent)    :       4

Max # of Replication Cache Slots     : 3469437
