Slow backups

We have a CX3/40C and aren't getting expected backup performance (NetBackup) from our user fileshares.
RAID Group 3 is 5x 750GB SATA-II disks and has 8 LUNS, total 1774GB used. It takes 41hrs for the last Full Backup job to finish! 3.1MB/sec/disk
RAID Group 2 is 4x 750GB SATA-II disks and has 1 LUN, 999GB used. It takes 22 hrs for Full Backup. 4.31 MB/sec/disk
RAID Group 6 is 9x 300GB FC2 disks and has 2 LUNS, total 1473GB used. It takes 21 hrs for Full Backup. 2.49MB/sec/disk
(NB: all three groups are RAID-5, so I subtract 1 disk from each calculation to allow for parity.)
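For anyone wanting to check the per-disk figures above, here is a minimal sketch of the arithmetic. It assumes 1 GB = 1024 MB and one disk's worth of capacity lost to parity in each RAID-5 group; the function and variable names are only illustrative.

```python
def mb_per_sec_per_disk(used_gb, hours, disks, parity_disks=1):
    """Average backup read rate per data disk, in MB/sec."""
    data_disks = disks - parity_disks
    total_mb_per_sec = used_gb * 1024 / (hours * 3600)
    return total_mb_per_sec / data_disks

# (GB used, hours for the Full Backup, disks in the RAID group), from the post
groups = {
    "RG3": (1774, 41, 5),
    "RG2": (999, 22, 4),
    "RG6": (1473, 21, 9),
}

rates = {name: round(mb_per_sec_per_disk(*vals), 2) for name, vals in groups.items()}
for name, rate in rates.items():
    print(f"{name}: {rate} MB/sec/disk")
```

With these inputs the results come out at roughly 3.1, 4.3 and 2.5 MB/sec/disk, in line with the numbers quoted for RG3, RG2 and RG6.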
I have tuned NetBackup to maximise buffers & block-sizes etc. The backup goes to disk (RAID-3 sets in other DAEs) before being duplicated to 2 tape libraries, 2 LTO4 drives in each library. All the backup jobs start at the same time, so RG3 has 8 simultaneous jobs.
I am not a Navisphere expert but have logged last weekend's performance. From the graphs, I see that :-
1. a typical RG3 disk runs at 60-70% utilisation throughout the backup period (i.e. whether 1 or 8 jobs are running simultaneously), and this only comes down to 16% when no jobs are running. Why? - a typical RG2 disk runs at 20-45% utilisation, and a RG6 disk at 20-30%; both drop to 4% when jobs have finished.
2. RG3 disk queue-length is 1.5-2.5 throughout the backup period (RG2 & 6 <1).
3. Typical RG2/3/6 disks show read-sizes about 30-60KB during backups, rising to 350/100/200KB after jobs finish. Maybe having a lot of small files (Word docs etc) is slowing down our backups.
4. Neither the destination RAID-3 disks, nor the SPs seem to be anything like maxed out, even when our Exchange disks are backing up as well (Exchange backups run at about 150MB/sec for 2 hrs onto the same RAID-3 disks).
From other tests, I see that running just the biggest RG2 and RG3 backup jobs simultaneously (1 job on each RAID group) reduces their times by 30-50%, so I suspect disk-thrashing is normally the problem.
Is the solution to buy more disks and spread the storage over more spindles? Another approach is to stagger the Full backups, so one LUN runs Monday, another Tuesday, etc.

Any comments/suggestions gratefully received,
TIA

Responses(4)

kelleg

October 6th, 2009 09:00

1. RG 3 has 8 LUNs - are you starting backup on all eight LUNs at the same time?

Which Raid Group(s) are you using to backup these eight LUNs?
What are the disk types and number in the Raid 3 raid groups - 4+1 or 8+1?

2. Need the same for the RG 2 and RG 6 - where are they being backed up to - number of disks and disk type FC or sata?

3. In Analyzer, look at the Total Throughput (IOPS) for one of the disks in each of the raid groups, including the raid 3 raid groups. Is the level of IOPS exceeding the recommended limits for the type of disk? Do the same but use Total Bandwidth (MB/s)

10K RPM FC = 120 IOPS or 10MB/s
7200 RPM SATA = 80 IOPS or 8MB/s

4. Look at "Dirty Pages (%)" in the Analyzer files that you have. This will only show up when you have SPA or SPB selected and highlighted. What is the percent Dirty Pages during backup? It should vary between the low and high watermark numbers (usually 60% and 80%). If it is consistently over the high watermark or close to 99%, then you are probably suffering from excessive force flushing.
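As a rough illustration of the Dirty Pages check in point 4, the sketch below classifies a series of "Dirty Pages (%)" samples against the high watermark. The 80% default comes from the usual setting mentioned above; treating "consistently" as 90% of samples is purely an assumption for the example, and the sample values are made up, not taken from any NAR file.

```python
def classify_dirty_pages(samples, high=80.0, fraction=0.9):
    """Flag likely forced flushing from Dirty Pages (%) samples.

    `fraction` is an assumed cut-off for "consistently over the high
    watermark"; it is not an EMC-defined value.
    """
    if not samples:
        raise ValueError("need at least one sample")
    share_above_high = sum(1 for s in samples if s > high) / len(samples)
    if share_above_high >= fraction:
        return "forced flushing likely"
    return "normal"

# Hypothetical sample series, for illustration only
healthy = [62, 70, 78, 75, 65]
saturated = [97, 99, 99, 98, 99]

print(classify_dirty_pages(healthy))
print(classify_dirty_pages(saturated))
```

The first series stays between the low and high watermarks and is reported as normal; the second sits near 99% throughout and is flagged.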

glen

jps00

October 6th, 2009 10:00

(I see Glen has already responded. I'll post this because it's already been written.)

This is a really good question and it's well written. A few pieces of information that would clarify things have been left out, but they may not be important.

For example, the CX3-40 has two buses. How are the RAID groups distributed across them? What is your bus utilization? How are the RAID 3 destination RAID groups configured? A verification of the total number of backup streams would be helpful (8?). In addition, where is the NetBackup catalog and how is it configured?

Do a quick look at your Backup Server to be sure its CPU, Memory and I/O metrics are within reason.

It may be my questions above are better than my answers, but I will try to answer your questions as best I can with the information provided.

1. a typical RG3 disk runs at 60-70% utilization throughout the backup period (i.e. whether 1 or 8 jobs are running simultaneously), and this only comes down to 16% when no jobs are running. Why? - a typical RG2 disk runs at 20-45% utilization, and a RG6 disk at 20-30%; both drop to 4% when jobs have finished.

The large number of LUNs in RG3 may be increasing the RAID group seek time. It also may be that these LUNs are highly fragmented, and are resulting in a random-like and not sequential read I/O pattern. A possible solution is to do a file system defragmentation, and reduce the number of LUNs in that RAID group.

2. RG3 disk queue-length is 1.5-2.5 throughout the backup period (RG2 & 6 <1).

See #1.

3. Typical RG2/3/6 disks show read-sizes about 30-60KB during backups, rising to 350/100/200KB after jobs finish. Maybe having a lot of small files (Word docs etc) is slowing down our backups.

NetBackup uses a 64KB I/O by default. Did you change it? (I think other things are slowing your backup.)

4. Neither the destination RAID-3 disks, nor the SPs seem to be anything like maxed out, even when our Exchange disks are backing up as well (Exchange backups run at about 150MB/sec for 2 hrs onto the same RAID-3 disks).
From other tests, I see that running just the biggest RG2 and RG3 backup jobs simultaneously (1 job on each RAID group) reduces their times by 30-50%, so I suspect disk-thrashing is normally the problem.
Is the solution to buy more disks and spread the storage over more spindles? Another approach is to stagger the Full backups, so one LUN runs Monday, another Tuesday, etc.

It's possible that you are bottlenecking at the RAID 3 destination LUN. Best Practices recommends: Do not use more than five backup streams per LUN. More disks always help. However, file system defragmentation and redistribution through LUN migration may be all you need.

The RAID groups and LUN sections of "EMC CLARiiON Best Practices for Performance and Availability FLARE revision 28.5" contain a detailed description of RAID group and LUN performance factors. This document is available on PowerLink.

ian Button

October 7th, 2009 03:00

Thanks for your help, guys. To try & answer your questions :-
Kelleg: 1. Yes, all 8 backups on RAID Group 3 start at the same time (00:10 early Saturday morning), along with the backups to RG2 & RG6. The destinations are three 4+1 RAID-3 groups (each 1 LUN) comprising 5x750GB SATA-II disks. Four of the 8 RG3 LUNs (and two small Windows servers) back up to RG10, and the other four (plus 7 small Windows servers) back up to RG12.
2. The one RG2 LUN and the two RG6 LUNs (also one of the Exchange databases and 21 small Windows servers) back up to RG13. But the big Exchange backups (400GB each mailserver) don't conflict with the RG2/3/6 backups - they back up to disk 19:00-21:00ish daily, and duplicate to tape 21:00-23:00ish (excellent performance). And the small Windows servers (application servers, domain-controllers, etc) back up at quite good speeds even while the SAN disks are chugging along slowly.
3. Hmm, I hadn't seen those recommended limits before. A typical RG3 disk (SATA-II) shows Total Throughput 90-110 IOPS throughout the backup period (from the start of the 8 jobs till the end of the last (biggest) one 41 hrs later). The RG2 disk (SATA-II) averages about 90 IOPS and the RG6 disk (FC2) averages about 80 IOPS. For the destination disks, RG10 averages about 10 IOPS, RG12 disk averages 50 IOPS (range 20-90) and RG13 averages about 10 IOPS. All this is mostly Read IO - for the backup LUNs (RGs 10/12/13) it looks as though the peak values may coincide with the duplications to tape, but I need to investigate this more. Looking at Total Bandwidth MB/s, RG2/3/6 show values 4-6, 2-5, 3-5 MB/s (mostly Read) and RG10/12/13 show values 2-4, 6-16 & 2-4 MB/s (mostly Write).
4. Dirty Pages % - sorry, I can't see that parameter anywhere, even when I select SPs via the SP tab, or look at them via the LUN tab. Maybe I omitted to tick some parameters when I enabled & started logging???
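To set the figures in point 3 against the rule-of-thumb limits Glen quoted (10K RPM FC = 120 IOPS or 10MB/s, 7200 RPM SATA = 80 IOPS or 8MB/s), here is a small sketch. It takes the upper end of each reported range, assumes the bandwidth values are per disk like the IOPS values, and assumes the RG6 FC2 disks are 10K RPM (the thread does not state their speed).

```python
# Rule-of-thumb per-disk limits quoted earlier in the thread
LIMITS = {
    "FC 10K": {"iops": 120, "mb_s": 10},
    "SATA 7200": {"iops": 80, "mb_s": 8},
}

# (disk type, IOPS, MB/s): upper end of each range reported in point 3.
# "FC 10K" for RG6 is an assumption; the disk speed is not given.
OBSERVED = {
    "RG3": ("SATA 7200", 110, 5),
    "RG2": ("SATA 7200", 90, 6),
    "RG6": ("FC 10K", 80, 5),
    "RG10": ("SATA 7200", 10, 4),
    "RG12": ("SATA 7200", 90, 16),
    "RG13": ("SATA 7200", 10, 4),
}

def exceeded(disk_type, iops, mb_s):
    """Return the list of metrics that are above the rule-of-thumb limit."""
    limit = LIMITS[disk_type]
    over = []
    if iops > limit["iops"]:
        over.append("iops")
    if mb_s > limit["mb_s"]:
        over.append("mb_s")
    return over

report = {rg: exceeded(*vals) for rg, vals in OBSERVED.items()}
for rg, over in report.items():
    print(rg, ("over limit on: " + ", ".join(over)) if over else "within limits")
```

On these assumptions the RG2 and RG3 source disks are above the SATA IOPS guideline, RG12 exceeds both guidelines at its peaks, and RG6, RG10 and RG13 stay within them.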

Jps00: Good point about the buses - all the RGs we are talking about here, except RG6, are on Bus 0. How can I check Bus utilisation? I can't find this parameter - or should I look at something else? Please see 1 above for the RG3 setup; yes, the 8 RG3 backup jobs start together (also a dozen or so small Win server jobs).
The NetBackup setup is Enterprise Version 6.5.4 with 1 Master/media server, plus 1 extra media-server, running Win 2003 Server R2, 64-bit, SP2. Both are Dell PE2950-III 64-bit, single-processor (Xeon 2330MHz) Quad-core with 8GB RAM, SAN-attached. RG10 that receives half the RG3 backups is owned by the extra media-server; the other backup LUNs are owned by the Master/media server. What exactly did you want to know about the Catalog setup? I upped the NetBackup Client Communication buffer sizes to 512KB & server NET_BUFFER_SIZE to 513KB; the server NUMBER_DATA_BUFFER is 64 and SIZE_DATA_BUFFERS is 524288. I don't think NetBackup itself is the bottleneck, as performance for Exchange full backups is terrific - every day, two 400GB mailbox servers are staged to disk (2 hrs in parallel) and the images copied to tape at two locations each with a pair of fibre-attached LTO4 tape drives on the SAN (another 2 hours).

Thanks for suggesting the "Best Practices - 28.5" - I'll read it carefully. As this is more concerned with CX4s I had just used the older "BP-26" document for our CX3.
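On the "no more than five backup streams per LUN" guideline quoted from the Best Practices document, a rough count of the jobs assigned to each destination group (each is a single LUN) can be sketched as below. The counts are the ones listed in points 1 and 2 above; the Exchange database is left out because those backups run at a different time, and treating every assigned job as a concurrent stream is an assumption, so these are upper bounds.

```python
MAX_STREAMS_PER_LUN = 5  # guideline quoted from the Best Practices document

# Jobs assigned to each destination RAID group (one LUN each), per points 1 and 2
destinations = {
    "RG10": {"RG3 LUNs": 4, "small Windows servers": 2},
    "RG12": {"RG3 LUNs": 4, "small Windows servers": 7},
    "RG13": {"RG2 LUN": 1, "RG6 LUNs": 2, "small Windows servers": 21},
}

# Upper bound on concurrent streams, assuming all assigned jobs overlap
totals = {rg: sum(jobs.values()) for rg, jobs in destinations.items()}
over = {rg: n for rg, n in totals.items() if n > MAX_STREAMS_PER_LUN}

for rg, n in totals.items():
    status = "over guideline" if rg in over else "ok"
    print(f"{rg}: up to {n} streams ({status})")
```

If the jobs really do overlap, all three destination LUNs would be above the guideline, with RG13 furthest over.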

Additional information - the RG3 LUNs total 6.8 million files, the RG2 LUN is 3.6 million, and the two RG6 (FC) LUNs add up to 2.8 million files. Users...!! (sigh!) Small file-sizes could well be a major factor, but we don't back up Windows profiles, so no cookies or Temporary Internet Files.
It has been suggested that NetBackup SAN Client licences might speed up backups - the RG2/3/6 LUNs are shares on 4 clustered Dell NX1950 64-bit machines (dual quad-core Xeon 2330MHz, 8GB RAM) running WUDSS 64-bit. But I've read that this may be true for slow backups of big files, not necessarily for lots of small files.

kelleg

October 7th, 2009 09:00

Quick answer on the Dirty Pages - if you open the Analyzer NAR file in the detail view, you will see the three tabs at the top - LUNs, Storage Pool and SP. Select the SP tab then check SPA and SPB (not the individual ports under each SP). Then make sure that one of the SPs is highlighted. In the lower left hand window, you should see a number of parameters that you can select; one should be "Dirty Pages (%)". If you do not see this then you need to enable Advanced Characteristics - go to Tools\Analyzer\Customize - on the General tab, check the Advanced button. You should now see the Dirty Pages parameter.

You can also see this if Analyzer is running - click on SPA and select Properties - then select the Statistics tab - wait for a bit and the values should fill in - this will update every 60 seconds - Dirty Pages is one of the parameters.

This value is important - if you're reading (backup) from a LUN with say 5 disks (sata) and writing to 5 disks (sata), you can read faster than you can write - you need to calculate parity for raid 5. So you have 8 LUNs reading from the same raid group and writing to two different raid groups - the number of disks in each raid group is the same, but you may be writing faster than the destination disks can de-stage the writes from write cache to the disks. Dirty Pages will tell you if you have this problem - if Dirty Pages is constantly close to 99%, then you have spindle contention issues. If you are not seeing this, then you probably need to look at the host.

On the destination disks you should be seeing mostly Writes (there will be some Reads if you need to calculate parity) - if the write IO load is lower than the Reads from the source LUNs, where is the IO getting stuck? The IO path should be: read from source LUNs (pull the data off the array down to the server) then write the data to the destination LUNs (server writes the data back up to the array).

glen
