Data Domain - Compression FAQ

Summary: This article answers the most frequent questions regarding compression. Data Domain Restorers (DDRs) work independently of data type: the compression engine stores only unique data, so duplicated patterns and multiple backups of the same data are stored only once. Typical compression rates are 20:1 over many weeks of full and incremental backups. Data type does, however, affect the compression ratio: already-compressed data such as picture files, compressed databases, and compressed archives (for example, .zip files) does not compress well.


Instructions

APPLIES TO

  • All DDRs
  • All Releases

 

Compression: Frequently Asked Questions


1. Will incremental and full backups use the same disk space?
 

Ideally, this would be true: after deduplication, a full backup should consume about the same space as an incremental covering the same changes. In practice, the full backup uses a little more space than the incremental, for the following reasons. These reasons also explain why a full backup taken after no changes in the data still consumes a positive amount of space.

  • The metadata takes about 0.5% of the logical size of the backup. Suppose the logical size of the full is 100GB and that of the incremental is 2GB, and suppose the incremental compresses to 1GB. Then the full will take at least 1.5GB (1GB of compressed data plus 0.5GB of metadata).
  • The DD compression engine rewrites some duplicate data segments for performance. The poorer the data locality of the changes, the more duplicates are written. The duplicates are later reclaimed by "filesys cleaning". In the field, about 2% of the logical size is typically rewritten as duplicates. Assuming this level of duplicates, the full might take 1GB (compressed) + 0.5GB (metadata) + 2GB (duplicates) = 3.5GB, as the sketch after this list illustrates. The amount of duplicates written can be controlled through a system parameter, but we generally do not tune this parameter in the field.
  • The data segmentation may vary a little from backup to backup depending on the order in which the NFS client sends the data. This order is not deterministic. In general, the segmentation algorithm tolerates shifts and re-ordering. However, it also creates some "forced" segments, which are prone to shifts and re-ordering. Typically, about 0.2% of the segments are forced, so one can expect that much more space is used.
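
A minimal Python sketch combining the first two factors above into a rough space estimate (the segmentation effect, ~0.2%, is small enough to ignore here). The function name and default percentages are illustrative values taken from the text, not system constants:

    # Rough estimate of the physical space a full backup consumes, using the
    # typical percentages quoted above (illustrative, not system constants).
    def full_backup_space_gb(logical_full_gb, compressed_new_data_gb,
                             metadata_pct=0.005, duplicate_pct=0.02):
        metadata_gb = metadata_pct * logical_full_gb      # ~0.5% of logical size
        duplicates_gb = duplicate_pct * logical_full_gb   # ~2% rewritten duplicates
        return compressed_new_data_gb + metadata_gb + duplicates_gb

    # Example from the text: a 100GB full whose changes compress to 1GB.
    print(full_backup_space_gb(100, 1))   # -> 3.5 (GB)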

2. The "filesys show space" and "filesys show compression" show different numbers:
 

"filesys show space" provides the compression ratio based on the logical size of the data stored and the disk space used at the time that the command is run.

"filesys show compression" provides the compression ratio based on how each file was compressed at the time that it was created.

"filesys show compression" is used mostly for support and debugging. In the presence of file deletes, "filesys show compression" overestimates the compression ratio.

For example, assume the first full backup gets 2x compression, and a subsequent full backup without any data changes gets 200x compression. Now the first full backup is deleted. "filesys show space" will show a compression ratio of about 2x, while "filesys show compression" will show 200x, because the only file that still exists got a compression of 200x when it was created.

In the same example, after the second backup and before any deletion, "filesys show space" shows a cumulative ratio of about 4x. The cumulative ratio would improve asymptotically toward 200x as more backups are taken without deletions.
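
The arithmetic behind these figures is easy to reproduce. A short Python sketch, assuming both fulls have the same logical size S (the actual value cancels out):

    # Reproduce the "filesys show space" ratios from the example above.
    S = 100.0                      # logical size of one full backup, arbitrary units

    first_physical  = S / 2        # first full compresses 2x
    second_physical = S / 200      # unchanged second full compresses 200x

    # After the second backup, before any deletion:
    print(2 * S / (first_physical + second_physical))   # ~3.96, about 4x

    # After deleting the first full: the logical size halves, but its segments
    # are still referenced by the second full, so physical space stays put.
    print(S / (first_physical + second_physical))       # ~1.98, about 2x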

There are some other minor differences:

  •  "filesys show compression" does not account for container-level wastage, thus over-estimating the compression-ratio further
  •  "filesys show compression" does not account for duplicate-elimination by global compression, thus underestimating the compression ratio
  •  "filesys show compression" can provide per-file or per-directory information, while "filesys show space" is limited to the entire system
  •  "filesys show compression" provides the breakdown between global and local compression, while "filesys show space" does not
 

3. Why are the compression ratios different for "filesys show space" and "vtl tape show summary"?

The compression ratio shown in "vtl tape show summary" is intended to match "filesys show compression /backup/vtc".

More generally, this VTL command may be given an optional filter to select a subset of tape cartridges, and the compression is supposed to match "filesys show compression" on that subset of cartridges.

However, because of a bug in the VTL UI code, the compression shown in "vtl tape show summary" is erroneous. This known issue is resolved in release 4.5.0.0.
 

4. Why does "filesys show compression last 24 hours" not match expectation for VTL?

For VTL, the output of commands such as "filesys show compression last 24 hours" often does not match expectations based on other sources, such as "system show performance".

The problem is due to a peculiarity of "filesys show compression". In general, "filesys show compression" shows cumulative statistics for the selected files. The qualifier "last 24 hours" selects the files that were updated in the last 24 hours, but the statistics are still cumulative since each file was created or last truncated to zero size. Thus, if a file was appended to in the last 24 hours, "filesys show compression last 24 hours" includes its cumulative stats from before the last 24 hours.

In non-VTL environments, backup files are written only once, so there is little discrepancy between files updated and files created. With VTL, backups may be appended to existing tape files. For example, consider a tape of capacity 100GB that is filled up to 50GB. If 10GB of data is appended to this tape in the last 24 hours, "filesys show compression last 24 hours" will report the file's "Original bytes" as 60GB, not 10GB.
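
If the amount written in the window is what is actually wanted, it can be recovered by differencing two cumulative snapshots. This is plain arithmetic, not a DD OS command; the values are taken from the tape example above:

    # "filesys show compression" reports cumulative "Original bytes" per file.
    # The true per-window figure is the difference of two cumulative snapshots
    # taken at the window boundaries.
    original_bytes_before_gb = 50   # cumulative value 24 hours ago
    original_bytes_now_gb    = 60   # cumulative value reported now

    written_last_24h_gb = original_bytes_now_gb - original_bytes_before_gb
    print(written_last_24h_gb)      # -> 10, the data actually appended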
 

5. How is the cumulative compression ratio computed?

Individual compression ratios do not add up linearly.

Suppose the compression on the first full backup is 2x and that on the second full backup is 20x. The cumulative compression is not (2+20)/2 or 11x, but 2/(1/2+1/20) or 3.64x.

In general, lower compression ratios have more impact than higher ones on the cumulative compression ratio.

Suppose that the i-th backup has logical size s_i and compression ratio c_i. Then, the cumulative compression ratio C for k backups can be computed as follows:

C = (total logical size)/(total space used)
total logical size = s_1 + s_2 + ... + s_k
total space used = s_1/c_1 + s_2/c_2 + ... + s_k/c_k


Often, the logical sizes are roughly the same. In that case, the above calculation simplifies to the following:

C = k/(1/c_1 + 1/c_2 + ... + 1/c_k)


For example, if the first full backup gets 3x compression, and each subsequent full gets 30x compression, and the retention period is 30 days, the user sees a cumulative compression of 30/(1/3+29/30) or 23x.
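
A direct transcription of the formula into Python, reproducing the 23x figure above (the function name is illustrative):

    # Cumulative compression ratio from per-backup logical sizes and ratios.
    def cumulative_ratio(sizes, ratios):
        logical = sum(sizes)
        physical = sum(s / c for s, c in zip(sizes, ratios))
        return logical / physical

    # Example from the text: 30 equal-size fulls, the first at 3x,
    # the remaining 29 at 30x.
    sizes  = [1.0] * 30
    ratios = [3.0] + [30.0] * 29
    print(round(cumulative_ratio(sizes, ratios), 1))   # -> 23.1, about 23x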
 

6. How does Data Domain Compression work?

This question is answered in detail in the separate KB article Data Domain: Understanding Data Domain Compression.
 

7. Does Data Domain support multiplexing?

Multiplexed data from the backup application results in very poor global deduplication, so multiplexing is not supported. For more information, see the related article Data Domain: Multiplexing in Backup Software Is Not Supported.
 

8. With 1-to-1 directory replication, why does the replica show better global compression?

This is usually because of variations in the level of duplicate segments written on the system:

  • The data stored at the source has been deduplicated once, against the previous data stored at the source.
  • The data sent over the wire has been deduplicated once, against the data stored at the replica.
  • The data stored at the replica has been deduplicated twice: once when the data was sent over the wire, and again when the received data is written on the replica.

 

Since the deduplication process leaves some duplicates behind, data that has been deduplicated more times contains fewer duplicates. The data stored at the source and the data sent over the wire are each deduplicated once, so their sizes are roughly the same, assuming the data stored at the source and at the replica are similar. The data stored on the replica is deduplicated twice, so it is better compressed.

Filesystem cleaning removes most of the duplicates. Therefore, after cleaning has been run on the source and the replica, the amount of data stored there should be about the same.

 
9. What is the change in compression when using the lz, gzfast, and gz local compression settings?

The local compression algorithm used on a DDR can be changed with the following command:
 

filesys option set compression {none | lz | gzfast | gz}
 

Warning: Prior to changing the local compression type, the file system must be shut down. It can then be restarted immediately after the compression option has been set.
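
For example, a typical sequence on the DDR command line might look like the following; gzfast is only an example target, and the exact procedure should be verified against the release notes for the DD OS version in use:

    filesys disable
    filesys option set compression gzfast
    filesys enable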

 

In general, the order of compression effectiveness is as follows:

lz < gzfast < gz

 

The rough differences are:

  • lz to gzfast gives ~15% better compression and consumes 2x CPU
  • lz to gz gives ~30% better compression and consumes 5x CPU
  • gzfast to gz gives ~10-15% better compression
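
For rough capacity planning, the percentages above can be applied to the current physical footprint. A sketch that treats the quoted figures as fixed averages (they are not) and interprets "X% better compression" as X% less physical space:

    # Rough projection of physical space after a local-compression change,
    # using the approximate improvements over lz quoted above.
    improvement_vs_lz = {"lz": 0.00, "gzfast": 0.15, "gz": 0.30}

    def projected_space(current_space, old_algo, new_algo):
        old_factor = 1.0 - improvement_vs_lz[old_algo]   # relative size, old algo
        new_factor = 1.0 - improvement_vs_lz[new_algo]   # relative size, new algo
        return current_space * new_factor / old_factor

    # Example: 10TB physically stored under lz, converted to gz.
    print(projected_space(10.0, "lz", "gz"))   # -> ~7.0 TB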


Note that changing the local compression affects only new data written to the Data Domain Restorer after the change is made. Old data retains its previous compression format until the next cleaning cycle, which copies forward all the old data into the new compression format. This causes that cleaning run to take much longer and consume more CPU.

If the customer's system is already low on CPU, particularly if backup and replication run simultaneously, the conversion can slow down backups and/or replication. The customer may want to explicitly schedule a maintenance window for this conversion.

 


    Affected Products

    Data Domain

    Article Properties
    Article Number: 000022100
    Article Type: How To
    Last Modified: 02 Oct 2024
    Version:  11