July 5th, 2013 19:00

Avamar slows down at 90%?

Has anyone else experienced a performance slowdown as their grid approaches 90% used capacity?

Is it really the capacity? Have you found anything that helps besides deleting or buying?


Our grid in question is a Gen3 running version 6.1.1-87. This grid has been this full, and higher, several times in the past without any noticeable slowdown, so we are skeptical that it's simply the capacity. Everything seems slow: the hfschecks all run long, replication has slowed and runs long, and the most irritating part is that the backups don't finish in the time allowed.

Input from anyone with experience running a near-full grid would be helpful.

Thanks,

Marty

1.2K Posts

July 7th, 2013 19:00

90% is a little high.

Please SSH to the Avamar server (for example, with a PuTTY session), run the following commands, and paste the results:

status.dpn

capacity.sh

avmaint nodelist | grep fs-perc
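
If it helps to collect everything in one pass, here is a minimal sketch (assuming a standard bash shell on the utility node; the output path is just an example) that captures all three results into a single file for pasting:

# Illustrative sketch only -- gather the three outputs into one file.
OUT=/tmp/avamar-health-$(date +%Y%m%d).txt
{
  echo "### status.dpn";      status.dpn
  echo "### capacity.sh";     capacity.sh
  echo "### fs-percent-full"; avmaint nodelist | grep fs-perc
} > "$OUT" 2>&1
echo "Results written to $OUT"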

July 7th, 2013 20:00

Thanks for the response. See the command results below. Sorry about the x's; I don't want to get in trouble with Security by posting host names and IP addresses on an open forum.

We're working to get the top 4 consumers retired, but it can't happen soon enough.

We've had this grid and its predecessor grids well into the low 90s before without them slowing down this badly. We're just looking for confirmation that the full condition is really the cause, or whether something in version 6.1.1, or in using image backups, or something along those lines makes a difference.

root@xxxxxxxxxx:~/#: status.dpn
Sun Jul  7 22:02:59 CDT 2013  [xxxxxxx] Mon Jul  8 03:02:59 2013 UTC (Initialized Tue Nov  9 16:57:45 2010 UTC)
Node   IP Address     Version   State   Runlevel  Srvr+Root+User Dis Suspend Load UsedMB Errlen  %Full   Percent Full and Stripe Status by Disk
0.0    x.x.x.x            6.1.1-87  ONLINE fullaccess mhpu+0hpu+0hpu  13 false   1.46 36088  5000682  59.0%  59%(onl:11377)
0.1    x.x.x.x.   6.1.1-87  ONLINE fullaccess mhpu+0hpu+0hpu  14 false   1.52 36080 12705336  59.0%  59%(onl:11407)
0.2    x.x.x.x.   6.1.1-87  ONLINE fullaccess mhpu+0hpu+0hpu  14 false   0.73 35675 78751301  58.8%  58%(onl:11358)
0.3    x.x.x.x.   6.1.1-87  ONLINE fullaccess mhpu+0hpu+0hpu  14 false   4.66 36070 11057251  58.6%  58%(onl:11355)
0.4    x.x.x.x.   6.1.1-87  ONLINE fullaccess mhpu+0hpu+0hpu  13 false   1.49 35987 12766806  58.7%  58%(onl:11348)
0.5    x.x.x.x.   6.1.1-87  ONLINE fullaccess mhpu+0hpu+0hpu  13 false   1.23 35697 12950119  58.6%  58%(onl:11413)
0.6    x.x.x.x.   6.1.1-87  ONLINE fullaccess mhpu+0hpu+0hpu  21 false   0.95 36106 88993917  58.6%  58%(onl:11365)
0.7    x.x.x.x.   6.1.1-87  ONLINE fullaccess mhpu+0hpu+0hpu  13 false   1.43 35671 11344763  57.2%  57%(onl:11395)
0.8    x.x.x.x.   6.1.1-87  ONLINE fullaccess mhpu+0hpu+0hpu  13 false   1.72 36047 15925735  56.7%  56%(onl:11381)
0.9    x.x.x.x.   6.1.1-87  ONLINE fullaccess mhpu+0hpu+0hpu  13 false   4.46 36090 12015515  56.8%  56%(onl:11395)
Srvr+Root+User Modes = migrate + hfswriteable + persistwriteable + useraccntwriteable

All reported states=(ONLINE), runlevels=(fullaccess), modes=(mhpu+0hpu+0hpu)
System-Status: ok
Access-Status: full

Last checkpoint: cp.20130707195915 finished Sun Jul  7 15:13:36 2013 after 13m 19s (OK)
Last GC: finished Sun Jul  7 09:17:17 2013 after 04h 02m >> recovered 331.03 GB (OK)
Last hfscheck: finished Sun Jul  7 14:57:07 2013 after 03h 09m >> checked 107005 of 107005 stripes (OK)

Maintenance windows scheduler capacity profile is active.
  The backup window is currently running.
  Next backup window start time: Mon Jul  8 18:00:00 2013 CDT
  Next blackout window start time: Mon Jul  8 05:00:00 2013 CDT
  Next maintenance window start time: Mon Jul  8 11:00:00 2013 CDT
root@xxxxxxxxxxx:~/#: capacity.sh
Date          New Data #BU       Removed #GC    Net Change
----------  ---------- -----  ---------- -----  ----------
2013-06-23   429808 mb 579    -344097 mb 1        85711 mb
2013-06-24   264808 mb 613    -358416 mb 1       -93607 mb
2013-06-25   324323 mb 487    -241288 mb 1        83035 mb
2013-06-26   389259 mb 488    -206816 mb 1       182443 mb
2013-06-27   499685 mb 486    -302497 mb 1       197188 mb
2013-06-28   396560 mb 454    -313166 mb 1        83394 mb
2013-06-29   235463 mb 602    -367743 mb 1      -132279 mb
2013-06-30   664250 mb 600    -356170 mb 1       308080 mb
2013-07-01   448398 mb 556          0 mb         448398 mb
2013-07-02   527765 mb 427    -263946 mb 1       263819 mb
2013-07-03   409491 mb 388    -488189 mb 1       -78697 mb
2013-07-04   568293 mb 421    -372435 mb 1       195858 mb
2013-07-05   482242 mb 464    -557735 mb 1       -75492 mb
2013-07-06   173598 mb 620    -430612 mb 2      -257013 mb
2013-07-07   197886 mb 646    -338980 mb 1      -141093 mb
2013-07-08    41912 mb 232          0 mb          41912 mb
----------  ---------- -----  ---------- -----  ----------
Average      378359 mb        -308880 mb          69478 mb

Top 5 Capacity Clients                     Added  % of Total   ChgRate
-----------------------------------  ------------  ---------- ---------
  xxxxxx.xxx.com                1198264 mb       19.8%    8.865%
  xxxxxx.xxx.com                 779497 mb       12.9%    8.696%
  xxxxxx.xxx.com                 671585 mb       11.1%    8.130%
  xxxxxx.xxx.com                 489296 mb        8.1%    8.201%
  xxxxxx_UAmzZtaFG1XKBGCuqUtzZA     216933 mb        3.6%    8.412%
Total for all clients                  6053750 mb      100.0%    0.009%
root@xxxxxxxxxxx:~/#: avmaint nodelist | grep fs-perc
        fs-percent-full="73.8"
        fs-percent-full="73.7"
        fs-percent-full="74.0"
        fs-percent-full="74.3"
        fs-percent-full="74.4"
        fs-percent-full="74.1"
        fs-percent-full="74.2"
        fs-percent-full="74.4"
        fs-percent-full="74.3"
        fs-percent-full="74.5"
root@xxxxxxxxxxxx:~/#:

1.2K Posts

July 7th, 2013 21:00

The output shows that the OS view is not high (73%-74%), but the GSAN view is a little high (56%-59%).

The last GC recovered 331 GB, which should be fine.

The average Net Change of 69478 mb indicates the grid has been growing by roughly 69 GB per day, on average, over the past two weeks.

You may need to take some action (such as deleting backups or reducing retention policies) to keep the grid at steady state.
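
As a rough illustration (not an official tool), the daily Net Change column can be totaled straight from the capacity.sh output above to see how much the grid has grown over the reporting window; this assumes the row layout shown above, where the net change number is the next-to-last field on each dated line:

# Illustrative only: sum the daily Net Change values from capacity.sh.
capacity.sh | awk '/^20[0-9][0-9]-/ { net += $(NF-1); days++ }
    END { if (days) printf "net change: %d mb over %d days (about %.0f mb/day)\n", net, days, net/days }'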

For the slow performance issue, there are usually three possible bottlenecks in this situation: the Avamar server, the network connection, or the client itself.

It is expected that hfscheck will take more time as the amount of data increases.

Are all clients backing up slowly?

Also check CPU/RAM and disk I/O.
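
For a quick first pass on the server side, a few generic Linux commands (not Avamar-specific; iostat comes from the sysstat package and may not be installed on every node) can show whether CPU, memory, or disk I/O is the constraint during a slow period:

# Generic Linux resource checks -- run on each node while backups are slow.
uptime            # load average vs. number of CPU cores
free -m           # memory and swap usage in MB
vmstat 5 5        # CPU, run queue, and swap activity over ~25 seconds
iostat -x 5 3     # per-device utilization and await times (requires sysstat)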

2K Posts

July 8th, 2013 06:00

I see from status.dpn that these are 3.3TB nodes.

3.3TB nodes use a RAID5 disk configuration, which means that disk seek performance is lower than on node types that use RAID1. To compensate for the lower seek performance, these nodes have 36GB of RAM, which they use for caching index information.

These index caches are unloaded at the start of the maintenance window in order to ensure that enough memory is available for the hfscheck process to run. The caches stay unloaded until the maintenance window ends AND hfscheck is complete. If the index caches are unloaded, backups will run substantially slower. If backups and hfscheck overlap, both will run substantially slower.

It seems counter-intuitive, but if backups aren't completing in time and hfscheck is running long, pushing the backup schedule out can actually help the backups complete sooner.
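
One quick, low-risk way to see how the backup and maintenance windows currently line up is to pull the scheduler summary out of status.dpn (the same block shown in the output earlier in this thread); the exact wording of the lines can vary slightly by version:

# Show only the scheduler/window lines from status.dpn.
status.dpn | grep -iE "window|profile"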

One other possible concern has to do with the size of the index -- if the calculated size of the index exceeds the amount of available memory, the caches won't be loaded. This is a very rare condition but I have seen it before. Support would need to take a look at your system to determine if this is the case. There is a request for enhancement open to allow partial cache loading in a future version of the product.

2K Posts

July 8th, 2013 07:00

I got all the way through that answer and realized I didn't actually answer the question.

This type of performance issue can be caused by high capacity but there are usually other factors involved as well.

July 9th, 2013 17:00

Thank you for the detail.
It sounds like there isn't a lot we can (or want to) control with the indexes. We did shorten the retention a little and deleted the oldest weekly backups, but garbage collection now seems to be somewhat behind.

Also, our performance has returned to roughly normal, and the grid is still at 89.8%, which is not much lower than the 90.3% it was at when we were having the problem.

We do have a new hypothesis now though:

Our four main Exchange servers are the four highest-change clients on this grid and consume about 30% of the daily bytes sent, and the email admins started a relatively massive mailbox migration to our new-version Exchange system, which is not backed up by Avamar.

Our hypothesis is that during this activity the Exchange change rate ballooned, the grid spent much more effort dealing with the Exchange backups, and everything else suffered. Exchange is backed up by the Exchange snap-in with the regular Avamar client installed. The Exchange backups also start at the very beginning of the evening window and ran all night right alongside all the other backups.

Whatever it was, it's over now; the Exchange migrations are also done, and the problem is gone.

Thanks for the information.

Marty
