Data Domain: Compression Frequently Asked Questions
Summary: This article answers the most frequent questions regarding compression. Data Domain systems are independent of data type. Data Domain uses deduplication and compression algorithms that store only unique data: duplicated patterns across multiple backups are stored only once. ...
This article is not tied to any specific product.
Not all product versions are identified in this article.
Instructions
Table of Contents
- Do incremental and full backups use the same disk space?
- Why do 'filesys show space' and 'filesys show compression' show different numbers?
- Why does 'filesys show compression last 24 hours' not match expectations for VTL?
- How is the cumulative compression ratio calculated?
- How does Data Domain compression work?
- Does Data Domain support multiplexing?
- With 1-to-1 directory replication, why does the replica show better global compression?
- What is the change in compression when using lz, gzfast, and gz local compression settings?
Typical compression rates are about 20:1 over many weeks of daily full and incremental backups. Data type affects the compression ratio: compressed image files, databases, and compressed archives (such as .zip files) do not compress well.
Do incremental and full backups use the same disk space?
Ideally, this would be true: with deduplication, a full backup of unchanged data would consume no more space than an incremental. In practice, the full backup uses a little more space than the incremental for the following reasons. These reasons also explain why a full backup taken after no changes in the data still consumes a positive amount of space.
- The metadata takes about 0.5% of the logical size of the backup. Suppose that:
- The logical size of the full is 100 GB
- The logical size of the incremental is 2 GB
- The incremental compresses to 1 GB
- ...then the full takes at least 1.5 GB
- The DD compression engine rewrites some duplicate data segments for performance. The poorer the data locality of the changes, the more duplicates are written. The duplicates are later reclaimed by file system garbage collection (GC). In some cases, about 2% of the logical size is rewritten as duplicates. Assuming this level of duplicates, the full might take 1 GB (compressed) + 0.5 GB (metadata) + 2 GB (duplicates) = 3.5 GB; a rough calculation is sketched after this list. The amount of duplicates written can be controlled through a system parameter, but this parameter is generally not tuned in the field.
- The data segmentation may vary a little from backup to backup depending on the order in which the NFS client sends the data. This order is not deterministic. In general, the segmentation algorithm tolerates shifts and reordering. However, it also creates some "forced" segments, which are prone to shifts and reordering. Typically about 0.2% of the segments are forced, so that much more space usage can be expected.
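As a rough illustration, the Python sketch below combines the example figures above (0.5% metadata, ~2% rewritten duplicates, ~0.2% forced segments) to estimate the space a full backup might consume. These percentages are the illustrative values from this article, not fixed system constants.

```python
# Illustrative estimate only: the percentages below are the example values
# used in this article, not guaranteed system behaviour.

def estimate_full_backup_space_gb(
    full_logical_gb: float,            # logical size of the full backup
    changed_data_physical_gb: float,   # compressed size of the changed data
    metadata_pct: float = 0.005,       # ~0.5% of logical size for metadata
    duplicate_pct: float = 0.02,       # ~2% of logical size rewritten as duplicates
    forced_segment_pct: float = 0.002, # ~0.2% extra from "forced" segments
) -> float:
    """Rough physical space a full backup may consume on a Data Domain."""
    metadata = full_logical_gb * metadata_pct
    duplicates = full_logical_gb * duplicate_pct
    forced = full_logical_gb * forced_segment_pct
    return changed_data_physical_gb + metadata + duplicates + forced

# Example from this article: 100 GB full, changed data compresses to 1 GB.
# Prints ~3.7 GB (the 3.5 GB example plus the 0.2% forced-segment overhead).
print(round(estimate_full_backup_space_gb(100, 1), 1))
```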
Why do 'filesys show space' and 'filesys show compression' show different numbers?
- 'filesys show space' provides the compression ratio based on the logical size of the data stored and the disk space used at the time that the command is run.
- 'filesys show compression' provides the compression ratio based on how each file was compressed at the time that it was created.
- 'filesys show compression' is used mostly for support and debugging. In the presence of file deletes, 'filesys show compression' overestimates the compression ratio.
For example, assume that:
- The first full backup gets 2x compression
- A subsequent full backup without any data changes gets 200x compression
- The first full backup is deleted
The output of 'filesys show space' would show a compression ratio of 2x, while 'filesys show compression' would show a compression ratio of 200x, because the only file that exists now got a compression ratio of 200x when it was created.
In the example above, after the second backup (and before the delete), 'filesys show space' would show a cumulative ratio of about 4x. The cumulative ratio would improve asymptotically towards 200x if more backups were taken without any deletion.
There are some other minor differences. The 'filesys show compression' command:
- Does not account for container-level wastage, thus overestimating the compression ratio further
- Does not account for duplicate elimination by global compression, thus underestimating the compression ratio
- Can provide per-file or per-directory information, while 'filesys show space' is limited to the entire system
- Provides the breakdown between global and local compression, while 'filesys show space' does not
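The following Python sketch is a toy model of the delete example above. It ignores details such as container-level wastage and cleaning timing; it is only meant to show why the two commands can report roughly 2x and 200x for the same remaining data.

```python
# Toy model of the delete example above (illustrative only; real Data Domain
# accounting is more involved).

# Each file: (logical_gb, compression_ratio_at_creation)
first_full  = (100.0, 2.0)    # first full backup: 2x compression
second_full = (100.0, 200.0)  # identical second full backup: 200x compression

# Physical space written by each file at creation time.
physical = {name: logical / ratio
            for name, (logical, ratio) in
            {"first": first_full, "second": second_full}.items()}

# After deleting the first full, its unique segments are still referenced by
# the second full, so the physical space used stays roughly the same.
physical_used = physical["first"] + physical["second"]   # ~50.5 GB
logical_remaining = second_full[0]                       # 100 GB

# 'filesys show space' style ratio: current logical data vs current disk usage.
show_space_ratio = logical_remaining / physical_used     # ~2x

# 'filesys show compression' style ratio: per-file ratio of files that still exist.
show_compression_ratio = second_full[1]                  # 200x

print(round(show_space_ratio, 1), show_compression_ratio)  # 2.0 200.0
```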
Why does 'filesys show compression last 24 hours' not match expectations for VTL?
For VTL, the output of commands such as 'filesys show compression last 24 hours' often does not match expectations based on other sources such as 'system show performance'.
The discrepancy is due to a peculiarity of 'filesys show compression'. In general, it shows cumulative stats for the selected files. The qualifier "last 24 hours" selects files that were updated in the last 24 hours, but the stats are still cumulative since the file was created or last truncated to zero size. Thus, if a file was appended to in the last 24 hours, 'filesys show compression last 24 hours' also includes its cumulative stats from before the last 24 hours.
Backup files in non-VTL environments are written only once, so there is little discrepancy between files updated and files created. With VTL, backups may be appended to existing tape files. For example, consider a 100 GB tape that is filled up to 50 GB. If 10 GB of data were appended to this tape in the last 24 hours, 'filesys show compression last 24 hours' would show the file's "Original bytes" as 60 GB rather than 10 GB.
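The arithmetic for the tape example can be illustrated with a short Python sketch; the 50 GB and 10 GB figures are the hypothetical values from the example above, not actual command output.

```python
# Illustrative arithmetic for the VTL example above (not actual DD output).

tape_size_before_gb = 50   # data already on the tape file
appended_last_24h_gb = 10  # data appended in the last 24 hours

# 'filesys show compression last 24 hours' selects the tape file because it
# was updated recently, but reports cumulative stats since the file was
# created or last truncated, not just the newly appended data.
reported_original_bytes_gb = tape_size_before_gb + appended_last_24h_gb

print(reported_original_bytes_gb)  # 60, even though only 10 GB was written in the last 24 hours
```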
How is the cumulative compression ratio calculated?
Individual compression ratios do not add up linearly.
Suppose that the compression on the first full backup is 2x and that on the second full backup is 20x. The cumulative compression is not (2 + 20) / 2 = 11x, but 2 / (1/2 + 1/20) = 3.64x.
In general, lower compression ratios have more impact than higher ones on the cumulative compression ratio.
Suppose that the i-th backup has logical size s_i and compression ratio c_i. Then, the cumulative compression ratio C for k backups can be computed as follows:
C = (total logical size) / (total space used)
total logical size = s_1 + s_2 + ... + s_k
total space used = s_1/c_1 + s_2/c_2 + ... + s_k/c_k
Often, the logical sizes are roughly the same. In that case, the above calculation simplifies to the following:
C = k / (1/c_1 + 1/c_2 + ... + 1/c_k)
For example, if:
- The first full backup gets 3x compression
- Each subsequent full gets 30x compression
- The retention period is 30 days
the user sees a cumulative compression of 30 / (1/3 + 29/30), or 23x.
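The formulas above translate directly into a few lines of Python. The sketch below is illustrative only and reproduces the 3.64x and ~23x results from the examples in this section.

```python
# Cumulative compression ratio as defined above:
#   C = (total logical size) / (total space used)
# where each backup contributes s_i logical bytes and s_i / c_i physical bytes.

def cumulative_compression(backups):
    """backups: iterable of (logical_size, compression_ratio) pairs."""
    total_logical = sum(s for s, _ in backups)
    total_physical = sum(s / c for s, c in backups)
    return total_logical / total_physical

# Two equally sized fulls at 2x and 20x -> ~3.64x, not (2 + 20) / 2 = 11x.
print(round(cumulative_compression([(100, 2), (100, 20)]), 2))   # 3.64

# 30-day retention: first full at 3x, 29 subsequent fulls at 30x -> ~23x.
month = [(100, 3)] + [(100, 30)] * 29
print(round(cumulative_compression(month), 1))                   # 23.1
```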
How does Data Domain compression work?
This question is answered in detail in a separate article: Understanding Data Domain Compression
Does Data Domain support multiplexing?
Multiplexing is not recommended: multiplexed data from the backup application results in very poor global deduplication. For more information, see this article: Data Domain: Multiplexing in Backup Software
With 1-to-1 directory replication, why does the replica show better global compression?
This is usually because of variations in the level of duplicate segments written on the system:
- The data stored at the source has been deduplicated once - against the previous data stored at the source.
- The data sent over the wire has been deduplicated once - against the data stored at the replica.
- The data stored at the replica has been deduplicated twice, once when the data was sent over the wire, and again when the received data is written on the replica.
Since the deduplication process leaves some duplicates behind, data that has been deduplicated more times contains fewer duplicates. The data stored at the source and the data sent over the wire have each been deduplicated once, so their sizes are roughly the same (assuming the data stored at the source and at the replica are similar). The data stored on the replica has been deduplicated twice, so it is better compressed.
File system cleaning removes most of the duplicates. Therefore, after cleaning has been run on the source and the replica, the amount of data stored there should be about the same.
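As a toy illustration of why the replica can show better global compression before cleaning, the Python sketch below assumes that each deduplication pass misses a small fixed fraction of duplicate segments. The 2% leak fraction and the data sizes are hypothetical values chosen only to make the effect visible; they are not Data Domain parameters.

```python
# Toy model only: assumes each deduplication pass misses a small fixed
# fraction of duplicate segments. The 2% figure is purely illustrative.

LEAK_FRACTION = 0.02  # hypothetical share of duplicates each pass fails to remove

def stored_size(unique_gb: float, duplicate_gb: float, dedup_passes: int) -> float:
    """Physical data remaining after the duplicates were deduplicated N times."""
    remaining_duplicates = duplicate_gb * (LEAK_FRACTION ** dedup_passes)
    return unique_gb + remaining_duplicates

unique, dupes = 50.0, 50.0  # 100 GB logical, half of it duplicate
source  = stored_size(unique, dupes, dedup_passes=1)  # deduped once at the source
replica = stored_size(unique, dupes, dedup_passes=2)  # once over the wire + once on write

print(round(source, 2), round(replica, 2))  # 51.0 50.02 -> replica shows better compression
```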
What is the change in compression when using lz, gzfast, and gz local compression settings?
Use the following command to change the local compression algorithm used in a Data Domain:
filesys option set compression {none | lz | gzfast | gz}
Note: The file system must be shut down prior to changing the local compression type. It can then be restarted immediately after the compression option has been set.
In general, the order of compression is as follows:
lz < gzfast < gz
| Type | Expected local compression | Relative CPU load |
|---|---|---|
| none | 1x | 0x |
| lz | 2x | 1x |
| gzfast | 2.5x | 2x |
| gz | 3x | 5x |
The rough differences are:
- lz to gzfast gives ~15% better compression and consumes 2x the CPU
- lz to gz gives ~30% better compression and consumes 5x the CPU
- gzfast to gz gives ~10-15% better compression
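As a rough what-if, the Python sketch below uses the expected compression figures from the table above to estimate the on-disk footprint of already-deduplicated data under each local compression setting. The figures are ballpark expectations from this article, not guaranteed ratios.

```python
# Rough what-if based on the table above (expected local compression per type).
# These are ballpark figures from this article, not guaranteed ratios.

EXPECTED_LOCAL_COMPRESSION = {"none": 1.0, "lz": 2.0, "gzfast": 2.5, "gz": 3.0}

def estimated_physical_gb(post_dedup_gb: float, comp_type: str) -> float:
    """Approximate on-disk size of post-deduplication data for a given setting."""
    return post_dedup_gb / EXPECTED_LOCAL_COMPRESSION[comp_type]

post_dedup = 1000.0  # GB of data left after global compression (deduplication)
for comp_type in ("lz", "gzfast", "gz"):
    print(comp_type, round(estimated_physical_gb(post_dedup, comp_type), 1))
# lz 500.0, gzfast 400.0, gz 333.3 -> gz stores the data in about a third of the space of 'none'
```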
Note that changing the local compression initially affects only new data written to the Data Domain after the change was made. Old data retains its previous compression format until the next cleaning cycle, which copies forward all of the old data into the new compression format. This causes that cleaning run to take much longer and consume more CPU.
If the system is already low on CPU, particularly if backups and replication are running simultaneously, the conversion can slow down the backups and replication. The customer may want to explicitly schedule a time to do this conversion.
Affected Products
Data Domain
Article Properties
Article Number: 000022100
Article Type: How To
Last Modified: 24 Apr 2026
Version: 12