Data Domain: Understanding Data Domain Compression
Summary: This article explains the terminology, tradeoffs, and measurements used to describe the compression types and other aspects of compression on Data Domain systems.
Instructions
Data Domain systems use state-of-the-art compression techniques to reduce the physical space required by backup data. As such, the technologies and measurements of compression levels are complex topics.
This article discusses some terminologies, tradeoffs, and measures to better explain the compression types used, and other aspects of compression in a Data Domain environment.
APPLIES TO: All Data Domain Models
1. Introduction:
Last updated: January 2024
- Compression is a data reduction technology which aims to store a dataset using less physical space.
- In Data Domain systems (DDOS), deduplication and local compression are used to compress user data. Deduplication, or "dedupe," identifies redundant data segments and stores only unique data segments.
- Local compression further compresses the unique data segments with certain compression algorithms, such as lz, gzfast, gz, and so on.
- The overall user data compression in DDOS is the joint effort of deduplication and local compression. DDOS uses "compression ratio" to measure the effectiveness of its data compression.
- Generally, it is the ratio of the total user data size to the total size of compressed data or the used physical space size.
- Data Domain file system is a "log-structured" deduplication file system.
- A log-structured file system only appends data to the system and deletion by itself cannot free physical space.
- Such file systems rely on garbage collection to reclaim no-longer-needed space.
- The characteristics of the log-structured file system and the deduplication technology together make it tricky to clearly understand all aspects of compression in DDOS.
For compression, there are many aspects that can be measured.
This article discusses the step-by-step details to help understand DDOS compression.
- First, the overall system compression effect is explained, which tells us the realistic compression achieved in a Data Domain system: the amount of user data, the amount of physical space consumed, and their ratio.
- This ratio is called "system effective compression ratio" in this article.
- DDOS conducts deduplication inline and tracks the statistics of the original user data segments, post-dedupe unique data segments, and the local compression effect on the unique data segments.
- These inline compression statistics are used to measure the inline compression effect. Inline compression statistics may be measured for each write. DDOS also tracks the statistics at different levels: files, MTrees, and the entire system.
The contents of this article apply to all DDOS releases up to the publication of this article (DDOS 7.13).
There is no guarantee that all the contents are accurate for future releases.
In releases prior to 5.0, the entire system has only one MTree, and the term MTree is not explicitly called out.
2. Compression: System Overall Effect:
The system-wide overall compression effect is measured by the effective compression ratio, which is the ratio of the user data size to the size of the used physical space. It is reported by the "filesys show compression" (FSC) CLI command (the corresponding information is also available in the UI). A sample output of FSC is shown below:
# filesys show compression
From: 2023-12-31 03:00 To: 2024-01-07 03:00
Active Tier:
Pre-Comp Post-Comp Global-Comp Local-Comp Total-Comp
(GiB) (GiB) Factor Factor Factor
(Reduction %)
---------------- -------- --------- ----------- ---------- -------------
Currently Used:* 6439.6 113.4 - - 56.8x (98.2)
Written:
Last 7 days 135421.3 1782.0 35.1x 2.2x 76.0x (98.7)
Last 24 hrs 532.5 1.5 334.3x 1.1x 356.5x (99.7)
---------------- -------- --------- ----------- ---------- -------------
* Does not include the effects of pre-comp file deletes/truncates
since the last cleaning on 2024/01/05 11:34:13.
- The system effective compression ratio is reported in the first row ("Currently Used") of the result section in the CLI output.
- The total user data size is labeled as "Pre-Comp."
- The total consumed physical space (by both data and metadata) is labeled as "Post-Comp."
- The "Pre-Comp" number and "Post-Comp" number are both read at runtime. FSC implicitly synchronizes the entire system, then queries the two numbers.
- These two numbers are measured in the same way as the "filesys show space" command.
- System effective compression ratio = Pre-Comp / Post-Comp
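As a quick check, the ratio in the "Currently Used" row above can be reproduced from the Pre-Comp and Post-Comp numbers. A minimal sketch using the sample values (this is an illustrative calculation, not a DDOS tool):

```python
# Illustrative calculation of the system effective compression ratio,
# using the "Currently Used" numbers from the sample FSC output above.
pre_comp_gib = 6439.6   # total user data size (Pre-Comp)
post_comp_gib = 113.4   # physical space used (Post-Comp)

effective_ratio = pre_comp_gib / post_comp_gib
reduction_pct = (pre_comp_gib - post_comp_gib) / pre_comp_gib * 100

print(f"{effective_ratio:.1f}x ({reduction_pct:.1f})")  # → 56.8x (98.2)
```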
The rest of the output describes the inline compression statistics (discussed later).
There are some operations that can affect the system effective compression ratio:
- Fastcopy
  When a fastcopy is done from a file in the active namespace (not a snapshot), it is a perfect deduplication, as no extra physical space is needed for the target file. The effect of a fastcopy is that the user data size is increased without consuming additional physical space. This increases the system effective compression ratio. When many fastcopies are done, the system effective compression ratio may become artificially high.
- Virtual synthetics
  Virtual synthetic backups tend to show a high system effective compression ratio. This is because virtual synthetics make logical full backups, but only transfer changed or new data to Data Domain systems. The impact of virtual synthetics on the system effective compression ratio is somewhat like the effect of fastcopy.
- Overwrites
  Overwrites consume more physical space but do not increase the logical size of the dataset; thus, overwrites lower the system effective compression ratio.
- Storing sparse files
  Sparse files contain large "holes" that are counted in the logical size but do not consume physical space due to compression. As a result, they can make the system effective compression ratio seem high.
- Storing small files
  DDOS adds nearly 1 KB of overhead to each file for certain internal metadata. When a system stores a significant number of small files (smaller than 1 KB or in the single-digit kilobytes), the metadata overhead drags the effective compression ratio down.
- Storing pre-compressed or pre-encrypted files
  Compression and encryption amplify the level of data change and reduce the possibility of deduplication. Such files usually cannot be deduplicated well and bring the system effective compression ratio lower.
- Deletes
  Deletions reduce the logical size of the system, but the corresponding unused space is not returned until garbage collection runs. Many deleted files make the compression ratio appear low until Garbage Collection (GC) runs.
- Garbage Collection (GC) or Cleaning
  GC reclaims the space consumed by data segments that are no longer referenced by any file. If many files have been deleted recently, GC may increase the system compression ratio by reducing the physical space consumption footprint.
- Aggressively taking snapshots
  When a snapshot of an MTree is taken, the logical size of the dataset is not changed. However, all the data segments referenced by the snapshot must be locked down, even if all files captured by the snapshot are deleted after the snapshot was taken. GC cannot reclaim space that is still needed by snapshots, so having many snapshots may make the system effective compression ratio appear low. However, snapshots are useful crash-recovery facilities. Never hesitate to take snapshots or set up proper snapshot schedules when needed.
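To illustrate the small-file item above, here is a rough model of the metadata overhead effect. The ~1 KiB per-file overhead comes from the list above; the 10x compression of the data itself and the file counts are assumed purely for illustration:

```python
# Rough model of how per-file metadata overhead hurts the effective
# compression ratio for small files. Assumes ~1 KiB overhead per file
# (from the article) and 10x compression of the file data (assumed).
def effective_ratio(file_size_kib, n_files, data_comp=10.0, overhead_kib=1.0):
    logical = file_size_kib * n_files
    physical = logical / data_comp + overhead_kib * n_files
    return logical / physical

print(round(effective_ratio(1024, 1000), 1))  # 1 MiB files → 9.9
print(round(effective_ratio(2, 1000), 1))     # 2 KiB files → 1.7
```

With large files the overhead is negligible and the ratio stays near the data compression factor; with 2 KiB files the metadata dominates and the effective ratio collapses.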
3. Compression: Inline Statistics:
DDOS conducts deduplication inline, as data is written to the system. It tracks the effects of inline deduplication and local compression for each write, and accumulates the statistics at the file level. Per-file inline compression statistics are further aggregated at the mtree level and at the system level. Compression is measured based on three numbers in the inline statistics:
- The length of each write: raw_bytes
- The length of all unique segments: pre_lc_size
- The length of locally compressed unique segments: post_lc_size
- Global compression (g_comp): equals (raw_bytes / pre_lc_size) and reflects the deduplication ratio
- Local compression (l_comp): equals (pre_lc_size / post_lc_size) and reflects the effect of the local compression algorithm
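A minimal sketch of these two ratios (only the names raw_bytes, pre_lc_size, and post_lc_size come from the article; the byte counts are assumed for illustration):

```python
# Illustrative computation of the inline compression ratios from the
# three tracked byte counts (names follow the article; values assumed).
raw_bytes = 100 * 2**30      # 100 GiB written
pre_lc_size = 10 * 2**30     # 10 GiB of unique segments after dedupe
post_lc_size = 5 * 2**30     # 5 GiB after local compression

g_comp = raw_bytes / pre_lc_size       # deduplication ratio
l_comp = pre_lc_size / post_lc_size    # local compression ratio
total = raw_bytes / post_lc_size       # total = g_comp * l_comp

print(f"g_comp={g_comp:.1f}x l_comp={l_comp:.1f}x total={total:.1f}x")
# → g_comp=10.0x l_comp=2.0x total=20.0x
```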
The accumulated inline compression statistics are part of the file metadata in DDOS and are stored in the file inode. DDOS provides tools to check the inline compressions at all three levels: File, MTree, and system-wide. These are detailed in the following sections.
3.1 File Compression:
- File compression can be checked with the "filesys show compression <path>" CLI command, which reports the accumulated compression statistics stored in the file inode.
- When a directory is specified, the inline compression statistics of all the files under that directory are summed up and reported.
- In the CLI output, raw_bytes is labeled as "Original Bytes", pre_lc_size is labeled as "Globally Compressed", and post_lc_size is labeled as "Locally Compressed". The other overheads are reported as "Meta-data." The following two examples are captured from an actual Data Domain system:
Example 1: Inline compression statistics of a file
filesys show compression /data/col1/main/dir1/file_1
Total files: 1; bytes/storage_used: 7.1
Logical Bytes: 53,687,091,200
Original Bytes: 11,463,643,380
Globally Compressed: 4,373,117,751
Locally Compressed: 1,604,726,416
Meta-data: 18,118,232
Example 2: Inline compression statistics of all files under a directory, including all subdirectories
filesys show compression /data/col1/main/dir1
Total files: 13; bytes/storage_used: 7.1
Logical Bytes: 53,693,219,809
Original Bytes: 11,501,978,884
Globally Compressed: 4,387,212,404
Locally Compressed: 1,608,444,046
Meta-data: 18,241,880
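For reference, the reported "bytes/storage_used" value appears consistent with Original Bytes divided by the sum of Locally Compressed and Meta-data. Checking against the numbers from Example 1 (this derivation is an observation from the sample numbers, not an official formula):

```python
# Check the "bytes/storage_used: 7.1" value against the Example 1 numbers,
# assuming it is Original Bytes / (Locally Compressed + Meta-data).
original = 11_463_643_380          # "Original Bytes"
locally_compressed = 1_604_726_416 # "Locally Compressed"
metadata = 18_118_232              # "Meta-data"

ratio = original / (locally_compressed + metadata)
print(round(ratio, 1))  # → 7.1
```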
- The system reports the overall inline compression ratio in the above CLI output as "bytes/storage_used."
- However, care must be taken in interpreting this information, as it can be misleading for various reasons.
- One reason is that the pre_lc_size and post_lc_size are recorded at the time the data operations are processed.
- When the file that originally added those segments gets deleted, the number of unique data segments attributed to the remaining file should be increased.
As an example, assume that a sample file is backed up to a Data Domain system. In the first backup, the compression information of the file is pre_lc_size = 10 GiB, post_lc_size = 5 GiB.
- Next, assume that the data of this file is unique with no data sharing with any other file.
- In the second backup of the file, further assume that the file gets ideal deduplication, such that both pre_lc_size and post_lc_size are zero because all segments of the file already exist on the system.
- When the first backup is deleted, the second backup of the file becomes the only file that references the 5 GiB of data segments.
- In this case, ideally, the pre_lc_size and post_lc_size of the file in the second backup should be updated from zero to 10 GiB and 5 GiB, respectively.
- However, there is no way to detect which files this should be done for, so the inline compression statistics of the existing files are left unchanged.
- Another factor that affects the above numbers is the cumulative statistics.
- When a file gets a lot of overwrites, it is impossible to track the extent to which the cumulative statistics reflect the writes that introduced the live data.
- Thus, over a long time, the inline compression statistics can only be treated as a heuristic to roughly estimate the compression of a particular file.
- Another fact worth highlighting is that the inline compression of a file cannot be measured for an arbitrary time interval.
- The file inline compression statistics are a cumulative result and cover all the writes that the file has ever received.
- When a file receives lots of overwrites, the raw_bytes can be far larger than the logical size of the file. For sparse files, the file sizes may be larger than the "Original Bytes."
3.2 MTree Compression:
- The compression of a particular MTree can be checked with the "mtree show compression" (MSC) CLI command.
- The absolute values of the inline compression statistics are cumulative over the lifetime of the MTree.
- Given that the lifetime of an MTree can be many years long, these values become less and less informative over time.
- To address this issue, the deltas of the inline compression statistics are used, and compression is reported only for certain time intervals.
- The underlying approach is that the MTree inline compression statistics are periodically dumped to a log.
- When a client queries MTree compression with the MSC command, the log is used to calculate the deltas of the numbers for compression reporting.
- By default, MSC reports compression for the last 7 days and the last 24 hours, though any time period of interest can be specified.
To demonstrate, assume the following log for MTree A:
3:00AM, raw_bytes=11000GB, pre_lc_size=100GB, post_lc_size=50GB
4:00AM, raw_bytes=12000GB, pre_lc_size=200GB, post_lc_size=100GB
Then the compression of MTree A for this hour is:
g_comp = (12000-11000)/(200-100) = 10x
l_comp = (200-100)/(100-50) = 2x
overall compression ratio = (12000-11000)/(100-50) = 20x
The above compression ratio calculation is independent of the dataset size. For example, the above MTree may only have 500 GB of logical data.
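The delta calculation above can be sketched in a few lines (values taken from the example log; the log representation here is illustrative, not the actual DDOS log format):

```python
# Sketch of the delta-based MTree compression reporting described above,
# using the two log records from the example (values in GB).
log = [
    {"raw_bytes": 11000, "pre_lc_size": 100, "post_lc_size": 50},   # 3:00AM
    {"raw_bytes": 12000, "pre_lc_size": 200, "post_lc_size": 100},  # 4:00AM
]

d = {k: log[1][k] - log[0][k] for k in log[0]}  # deltas over the interval

g_comp = d["raw_bytes"] / d["pre_lc_size"]      # (12000-11000)/(200-100)
l_comp = d["pre_lc_size"] / d["post_lc_size"]   # (200-100)/(100-50)
overall = d["raw_bytes"] / d["post_lc_size"]    # (12000-11000)/(100-50)

print(g_comp, l_comp, overall)  # → 10.0 2.0 20.0
```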
- MSC supports the "daily" and "daily-detailed" options, as does the "filesys show compression" command.
- When "daily" is specified, the command reports the daily compression in a calendar fashion. It uses the daily deltas of the raw_bytes and post_lc_size to compute the daily compression ratio.
- When "daily-detailed" is specified, the command shows all three deltas (of the raw_bytes, pre_lc_size, and post_lc_size, respectively) for each day. It also computes the g_comp and l_comp alongside the total compression factor.
Sample outputs of these commands are shown in the Appendix.
3.3 System Compression:
- Once it is understood how compression is reported on MTrees, it is straightforward to extend the concept to the entire system.
- The system-wide inline compression statistics collection and reporting are exactly the same as with MTrees.
- The only difference is the scope: one is a particular MTree, while the other is the entire system.
- The results can be checked by using the "filesys show compression" command. An example of this can be found in Section 2.
- The "last 7 days" and "last 24 hours" system compression is reported in the last two lines of the result section in the FSC output.
4. Cloud Tier:
- On Data Domain systems with a cloud tier, the storage is separated into the active tier and the cloud tier, which are two independent deduplication domains.
- Users can inject data only into the active tier.
- Later, DDOS data-movement functions can be used to migrate data from the active tier to the cloud tier.
- Thus the space and compression measurement and reporting are handled independently in each tier.
- At the file level, however, inline compression statistics are reported without differentiation by tier; they are exactly the same as described in Section 3.1.
5. Deduplication:
The last topic to highlight is some of the characteristics of deduplication, which is called "global compression" in many Data Domain articles.
Although it contains the word "compression," it is entirely different from the traditional concept of compression, which DDOS also provides under the name "local compression."
- Local compression reduces the size of a piece of data using a certain algorithm (some kinds of data are not compressible and applying compression algorithms on them may slightly increase data size).
- Usually, once an algorithm is decided, the data itself is the only factor of the compression ratio.
- However, deduplication is different - it is not a local concept, it is "global."
- An incoming data segment is deduped against all the existing data segments in a deduplicated domain, which includes all the data on non-cloud Data Domain systems.
- Unlike local compression, the deduplication result is not determined by the data segment itself, but by whether identical segments already exist in the system.
- In practice, a high deduplication ratio is rarely seen in the initial backup of a dataset. In initial backups, often the major data reduction comes from local compression.
- When subsequent backups land on the Data Domain, deduplication shows its strength and becomes the dominant factor for compression.
- The effectiveness of deduplication relies on the fact that the change rate of a dataset is low from backup to backup. For this reason, datasets with high change rates cannot be well deduped.
- When the backup application inserts its own metadata chunks (called markers by Data Domain) into the backup images at high frequency, it also may not get a good deduplication ratio. Our marker-handling techniques can help sometimes, but not always.
Given these observations, here is what to expect:
- Initial backups may only achieve a small system effective compression ratio, often 2x or 3x. Deduplication usually has little opportunity to show its strength in initial backups.
- The global compression ratio of an incremental backup is lower than the compression ratio of the corresponding full backup. This is because an incremental backup contains only changed or new files compared to the immediately preceding backup. The global compression ratio depends on the percentage of new data within the incremental backup.
- The deduplication ratio of a full backup (the non-initial ones) can also be low in some scenarios. Some frequently observed scenarios include:
  - A high change rate in the data being backed up
  - The dataset being dominated by small files (less than 5 MiB)
  - Backup applications adding a lot of closely spaced markers
  - Database backups that are incremental or use a small block size
  When a low compression ratio is observed in a full backup with a low data change rate, check whether one of the above cases applies, or whether further analysis is needed.
- The compression of a later backup image is not always better than that of the initial one. Consecutive backup images can show a high deduplication ratio because the initial and earlier backup images already added most of the data to the system. When all the earlier backup images are deleted, the global and local compression ratio of the earliest existing backup image may still be high, but this only means that it got good deduplication when it was added to the system, nothing else. When a file that has a high global and local compression ratio and is the last backup image of a particular dataset is deleted, it may release more space than the size derived from the compression ratio.
- Compression ratios of the same dataset on different systems cannot be compared, regardless of the way the dataset is added to those systems. This is because each system is an independent deduplication domain. There is no expectation that two different Data Domain systems get the same or even similar compression ratios, even if their datasets are the same.
6. Summary:
- Measuring compression is difficult in deduplicated file systems, but it is even harder in log-structured deduplicated file systems.
- How deduplication works and how compression statistics are tracked must be understood.
- Compression ratios are useful information to understand the behavior of a particular system.
- The system effective compression ratio is the most important, reliable, and informative measure.
- The inline compression statistics can be helpful too, but they might be no more than heuristics in some circumstances.
Appendix: Sample output of "mtree show compression" command
- Assume that there is an MTree holding 254792.4 GiB of data. It has received 4379.3 GiB of new data in the last 7 days, and 784.6 GiB in the last 24 hours (other time intervals can be specified).
- The "daily" option reports the inline compression statistics for the last 33 days.
- When the "daily-detailed" option is provided, the total compression ratios are further detailed by separating them into global and local compression ratios.
The mtree list output:
mtree list /data/col1/main
Name Pre-Comp (GiB) Status
--------------- -------------- ------
/data/col1/main 254792.4 RW
--------------- -------------- ------
D : Deleted
Q : Quota Defined
RO : Read Only
RW : Read Write
RD : Replication Destination
IRH : Retention-Lock Indefinite Retention Hold Enabled
ARL : Automatic-Retention-Lock Enabled
RLGE : Retention-Lock Governance Enabled
RLGD : Retention-Lock Governance Disabled
RLCE : Retention-Lock Compliance Enabled
M : Mobile
m : Migratable
MSC (no options):
mtree show compression /data/col1/main
From: 2023-09-07 12:00 To: 2023-09-14 12:00
Pre-Comp Post-Comp Global-Comp Local-Comp Total-Comp
(GiB) (GiB) Factor Factor Factor
(Reduction %)
------------- -------- --------- ----------- ---------- -------------
Written:
Last 7 days 4379.3 883.2 3.4x 1.5x 5.0x (79.8)
Last 24 hrs 784.6 162.1 3.3x 1.4x 4.8x (79.3)
------------- -------- --------- ----------- ---------- -------------
With "daily" option:
mtree show compression /data/col1/main daily
From: 2023-08-12 12:00 To: 2023-09-14 12:00
Sun Mon Tue Wed Thu Fri Sat Weekly
----- ----- ----- ----- ----- ----- ----- ------ -----------------
-13- -14- -15- -16- -17- -18- -19- Date
432.0 405.9 284.1 438.8 347.0 272.7 331.4 2511.8 Pre-Comp
85.5 66.2 45.3 81.9 61.4 57.4 66.3 464.1 Post-Comp
5.0x 6.1x 6.3x 5.4x 5.7x 4.7x 5.0x 5.4x Total-Comp Factor
-20- -21- -22- -23- -24- -25- -26-
478.0 387.8 450.2 533.1 386.0 258.4 393.6 2887.1
100.6 81.5 100.8 119.0 84.0 40.6 75.3 601.8
4.8x 4.8x 4.5x 4.5x 4.6x 6.4x 5.2x 4.8x
-27- -28- -29- -30- -31- -1- -2-
27.6 1.0 0.4 470.7 467.3 517.7 641.9 2126.7
4.9 0.2 0.1 83.9 92.3 89.8 140.1 411.2
5.6x 5.6x 4.3x 5.6x 5.1x 5.8x 4.6x 5.2x
-3- -4- -5- -6- -7- -8- -9-
539.6 495.0 652.8 658.7 537.1 398.7 305.5 3587.3
110.8 108.0 139.4 137.0 111.5 78.3 48.3 733.3
4.9x 4.6x 4.7x 4.8x 4.8x 5.1x 6.3x 4.9x
-10- -11- -12- -13- -14-
660.2 738.3 787.2 672.9 796.9 3655.5
143.9 152.5 167.6 126.9 163.3 754.2
4.6x 4.8x 4.7x 5.3x 4.9x 4.8x
----- ----- ----- ----- ----- ----- ----- ------ -----------------
Pre-Comp Post-Comp Global-Comp Local-Comp Total-Comp
(GiB) (GiB) Factor Factor Factor
(Reduction %)
-------------- -------- --------- ----------- ---------- -------------
Written:
Last 33 days 14768.3 2964.5 3.4x 1.5x 5.0x (79.9)
Last 24 hrs 784.6 162.1 3.3x 1.4x 4.8x (79.3)
-------------- -------- --------- ----------- ---------- -------------
Key:
Pre-Comp = Data written before compression
Post-Comp = Storage used after compression
Global-Comp Factor = Pre-Comp / (Size after de-dupe)
Local-Comp Factor = (Size after de-dupe) / Post-Comp
Total-Comp Factor = Pre-Comp / Post-Comp
Reduction % = ((Pre-Comp - Post-Comp) / Pre-Comp) * 100
With "daily-detailed" option:
mtree show compression /data/col1/main daily-detailed
From: 2023-08-12 12:00 To: 2023-09-14 12:00
Sun Mon Tue Wed Thu Fri Sat Weekly
----- ----- ----- ----- ----- ----- ----- ------ -----------------
-13- -14- -15- -16- -17- -18- -19- Date
432.0 405.9 284.1 438.8 347.0 272.7 331.4 2511.8 Pre-Comp
85.5 66.2 45.3 81.9 61.4 57.4 66.3 464.1 Post-Comp
3.5x 4.1x 4.3x 3.6x 3.8x 3.3x 3.4x 3.7x Global-Comp Factor
1.4x 1.5x 1.5x 1.5x 1.5x 1.4x 1.5x 1.5x Local-Comp Factor
5.0x 6.1x 6.3x 5.4x 5.7x 4.7x 5.0x 5.4x Total-Comp Factor
80.2 83.7 84.1 81.3 82.3 78.9 80.0 81.5 Reduction %
-20- -21- -22- -23- -24- -25- -26-
478.0 387.8 450.2 533.1 386.0 258.4 393.6 2887.1
100.6 81.5 100.8 119.0 84.0 40.6 75.3 601.8
3.3x 3.3x 3.0x 3.0x 3.3x 4.1x 3.6x 3.3x
1.4x 1.5x 1.5x 1.5x 1.4x 1.5x 1.4x 1.5x
4.8x 4.8x 4.5x 4.5x 4.6x 6.4x 5.2x 4.8x
79.0 79.0 77.6 77.7 78.2 84.3 80.9 79.2
-27- -28- -29- -30- -31- -1- -2-
27.6 1.0 0.4 470.7 467.3 517.7 641.9 2126.7
4.9 0.2 0.1 83.9 92.3 89.8 140.1 411.2
4.4x 3.7x 2.6x 3.8x 3.5x 3.9x 3.2x 3.5x
1.3x 1.5x 1.6x 1.5x 1.4x 1.5x 1.5x 1.5x
5.6x 5.6x 4.3x 5.6x 5.1x 5.8x 4.6x 5.2x
82.1 82.2 76.8 82.2 80.3 82.7 78.2 80.7
-3- -4- -5- -6- -7- -8- -9-
539.6 495.0 652.8 658.7 537.1 398.7 305.5 3587.3
110.8 108.0 139.4 137.0 111.5 78.3 48.3 733.3
3.4x 3.1x 3.2x 3.4x 3.3x 3.4x 4.1x 3.3x
1.4x 1.5x 1.5x 1.4x 1.4x 1.5x 1.6x 1.5x
4.9x 4.6x 4.7x 4.8x 4.8x 5.1x 6.3x 4.9x
79.5 78.2 78.6 79.2 79.2 80.4 84.2 79.6
-10- -11- -12- -13- -14-
660.2 738.3 787.2 672.9 796.9 3655.5
143.9 152.5 167.6 126.9 163.3 754.2
3.1x 3.4x 3.2x 3.7x 3.4x 3.3x
1.5x 1.4x 1.5x 1.4x 1.5x 1.5x
4.6x 4.8x 4.7x 5.3x 4.9x 4.8x
78.2 79.3 78.7 81.1 79.5 79.4
----- ----- ----- ----- ----- ----- ----- ------ -----------------
Pre-Comp Post-Comp Global-Comp Local-Comp Total-Comp
(GiB) (GiB) Factor Factor Factor
(Reduction %)
-------------- -------- --------- ----------- ---------- -------------
Written:
Last 33 days 14768.3 2964.5 3.4x 1.5x 5.0x (79.9)
Last 24 hrs 784.6 162.1 3.3x 1.4x 4.8x (79.3)
-------------- -------- --------- ----------- ---------- -------------
Key:
Pre-Comp = Data written before compression
Post-Comp = Storage used after compression
Global-Comp Factor = Pre-Comp / (Size after de-dupe)
Local-Comp Factor = (Size after de-dupe) / Post-Comp
Total-Comp Factor = Pre-Comp / Post-Comp
Reduction % = ((Pre-Comp - Post-Comp) / Pre-Comp) * 100