
Data Domain: Understanding Data Domain Compression

Summary: This article explains the compression types, terminology, tradeoffs, and measurements used on Data Domain systems.


Article Content


Instructions

Data Domain systems use state-of-the-art techniques to reduce the physical space required by customer data. As such, the technologies and measurements of compression levels are complex topics. This document discusses the terminology, tradeoffs, and measures of compression in order to better explain the compression types used and other aspects of compression in a Data Domain system.

APPLIES TO:
All Data Domain Models

1. Introduction

Last updated: January 2024

Compression is a data reduction technology that aims to store a dataset using less physical space. Data Domain systems (DDOS) use deduplication and local compression to compress user data. Deduplication ("dedupe") identifies redundant data segments and stores only the unique data segments. Local compression further compresses the unique data segments with certain compression algorithms, such as lz, gzfast, and gz. The overall user data compression in DDOS is the joint effect of deduplication and local compression. DDOS uses the "compression ratio" to measure the effectiveness of its data compression. Generally, it is the ratio of the total user data size to the total size of compressed data (the used physical space).

The Data Domain file system is a "log-structured" deduplication file system. A log-structured file system only appends data to the system, and deletion by itself cannot free physical space. Such file systems rely on garbage collection to reclaim space that is no longer needed. The combination of the log-structured design and deduplication technology makes it tricky to clearly understand all aspects of compression in DDOS.

There are many aspects of compression that can be measured. In this document, we discuss the details step by step to help understand DDOS compression. First, we explain the overall system compression effect, which tells us the realistic compression achieved in a Data Domain system: the amount of user data, the amount of physical space consumed, and their ratio. This ratio is called the "system effective compression ratio" in this document. DDOS conducts deduplication inline and tracks statistics on the original user data segments, the post-dedupe unique data segments, and the local compression effect on the unique data segments. These inline compression statistics are used to measure the inline compression effect and may be measured for each write. DDOS also tracks these statistics at different levels: files, MTrees, and the entire system.

The contents of this document apply to all DDOS releases up to DDOS 7.13, the latest release at the time of publication. There is no guarantee that all contents remain accurate for future releases. In releases prior to 5.0, the entire system had only one MTree and the term MTree was not explicitly called out.

2. Compression: System Overall Effect

The system-wide overall compression effect is measured by the system effective compression ratio, which is the ratio of the user data size to the size of the used physical space. It is reported by the filesys show compression (FSC) CLI command (the corresponding information is also available in the UI). A sample output of FSC is shown below:

# filesys show compression

From: 2023-12-31 03:00 To: 2024-01-07 03:00


Active Tier:
                   Pre-Comp   Post-Comp   Global-Comp   Local-Comp      Total-Comp
                      (GiB)       (GiB)        Factor       Factor          Factor
                                                                     (Reduction %)
----------------   --------   ---------   -----------   ----------   -------------
Currently Used:*     6439.6       113.4             -            -    56.8x (98.2)
Written:
  Last 7 days      135421.3      1782.0         35.1x         2.2x    76.0x (98.7)
  Last 24 hrs         532.5         1.5        334.3x         1.1x   356.5x (99.7)
----------------   --------   ---------   -----------   ----------   -------------
 * Does not include the effects of pre-comp file deletes/truncates
   since the last cleaning on 2024/01/05 11:34:13.

The system effective compression ratio is reported in the "Currently Used" row of the result section in the CLI output. The total user data size is labeled "Pre-Comp." The total consumed physical space (by both data and metadata) is labeled "Post-Comp."

The "Pre-Comp" number and "Post-Comp" number are both read at runtime. FSC implicitly synchronizes the entire system, then queries the two numbers. These two numbers are measured in the same way as the "filesys show space" command.

System effective compression ratio = Pre-Comp/Post-Comp
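As an illustrative sketch (not a DDOS tool), the "Currently Used" numbers from the sample FSC output above can be plugged into this formula; a few lines of Python reproduce the reported factor and reduction percentage:

```python
# Illustrative sketch: reproduce the "Currently Used" row of the sample
# "filesys show compression" output using the formula above.
pre_comp_gib = 6439.6    # total user data size ("Pre-Comp")
post_comp_gib = 113.4    # used physical space ("Post-Comp")

effective_ratio = pre_comp_gib / post_comp_gib
reduction_pct = (pre_comp_gib - post_comp_gib) / pre_comp_gib * 100

print(f"{effective_ratio:.1f}x ({reduction_pct:.1f})")  # 56.8x (98.2)
```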

The rest of the FSC output describes the inline compression statistics, and we discuss them later.

There are some operations that can affect the system effective compression ratio:

  • Fastcopy

    • When a fastcopy is done from a file in the active namespace (not a snapshot), it is a perfect deduplication, as no extra physical space is needed for the target file. The effect of a fastcopy is that we increase the user data size without consuming additional physical space. This increases the system effective compression ratio. When many fastcopies are done, the system effective compression ratio may become artificially high.

  • Virtual synthetics

    • Virtual synthetic backups tend to show a high system effective compression ratio. This is because virtual synthetics create logical full backups but transfer only changed or new data to Data Domain systems. The impact of virtual synthetics on the system effective compression ratio is similar to the effect of fastcopy.

  • Overwrites

    • Overwrites consume more physical space but do not increase the logical size of the dataset, thus overwrites lower the system effective compression ratio.

  • Storing sparse files

    • Sparse files contain large "holes" that are counted in the logical size but consume little physical space, since the zero-filled holes deduplicate and compress extremely well. As a result, they can make the system effective compression ratio seem high.

  • Storing small files

    • DDOS adds nearly 1 KB overhead to each file for certain internal metadata. When a system stores a significant number of small files (sizes less than 1 KB or in single-digit kilobytes), the overhead of metadata drags the effective compression ratio down.

  • Storing pre-compressed or pre-encrypted files

    • Compression and encryption can amplify the level of data change and reduce the possibility of deduplication. Such files usually cannot be deduplicated well and bring the system effective compression ratio lower.

  • Deletes

    • Deletions reduce the logical size of the system, but the system does not get the corresponding unused space back until garbage collection runs. Many deleted files make the compression ratio low until Garbage Collection (GC) runs.

  • Garbage Collection (GC) or Cleaning

    • GC reclaims the space consumed by data segments that are no longer seen by any file. If a lot of files have been deleted recently, GC may increase the system compression ratio by reducing the physical space consumption footprint.

  • Aggressively taking snapshots

    • When we take a snapshot of an MTree, we do not change the logical size of the dataset. However, all the data segments referenced by the snapshot must be locked down, even if all files captured by the snapshot are deleted after it was taken. GC cannot reclaim space that is still needed by snapshots; therefore, having many snapshots may make the system effective compression ratio appear low. However, snapshots are useful crash recovery facilities. Never hesitate to take snapshots or set up proper snapshot schedules when needed.
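Several of the effects above are simple arithmetic. For instance, the small-file penalty can be estimated with a rough model; this is a hedged back-of-the-envelope sketch (the ~1 KB per-file overhead is the figure quoted above; the 10x data reduction is an arbitrary assumption), not a DDOS formula:

```python
# Rough model: how per-file metadata overhead affects the effective ratio.
# Assumes a hypothetical 10x reduction on the data itself and ~1 KB of
# metadata per file (the figure quoted above); ignores all other overheads.
def estimated_ratio(file_size_bytes, n_files, data_ratio=10.0, meta_per_file=1024):
    logical = file_size_bytes * n_files
    physical = logical / data_ratio + meta_per_file * n_files
    return logical / physical

print(estimated_ratio(1 << 30, 1))       # one 1 GiB file: ~10x
print(estimated_ratio(2048, 1_000_000))  # a million 2 KiB files: ~1.7x
```

The second case shows the metadata overhead dominating: even with 10x reduction on the data itself, tiny files pull the effective ratio below 2x.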

3. Compression: Inline Statistics

DDOS conducts deduplication inline, as data is written to the system. It tracks the effects of inline deduplication and local compression for each write, and accumulates the statistics at the file level. Per-file inline compression statistics are further aggregated at the mtree level and at the system level. Compression is measured based on three numbers in the inline statistics:

  • The length of each write, called raw_bytes
  • The length of all unique segments, called pre_lc_size
  • The length of locally compressed unique segments, called post_lc_size

Based on the above three numbers, DDOS defines two more fine-granularity compression ratios:

  • Global compression (g_comp). It equals (raw_bytes/pre_lc_size), and reflects the deduplication ratio;
  • Local compression (l_comp). It equals (pre_lc_size/post_lc_size) and reflects the effect of the local compression algorithm.
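A minimal sketch of these definitions in Python (the variable names mirror the statistics above); note that the overall inline compression ratio factors into the product of the two:

```python
# Minimal sketch of the two fine-granularity ratios defined above.
def g_comp(raw_bytes, pre_lc_size):
    """Global compression: the deduplication ratio."""
    return raw_bytes / pre_lc_size

def l_comp(pre_lc_size, post_lc_size):
    """Local compression: the effect of the local compression algorithm."""
    return pre_lc_size / post_lc_size

# raw_bytes/post_lc_size == g_comp * l_comp, e.g. 1000 -> 100 -> 50 bytes:
print(g_comp(1000, 100) * l_comp(100, 50))  # 20.0
```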

The accumulated inline compression statistics are part of the file metadata in DDOS and are stored in the file inode. DDOS provides tools to check the inline compression statistics at all three levels: file, MTree, and system-wide. We detail them in the following sections.

3.1 File Compression
File compression can be checked with the "filesys show compression <path>" CLI command, which reports the accumulated compression statistics stored in the file inode. When a directory is specified, the inline compression statistics of all the files under that directory are summed up and reported. In the CLI output, raw_bytes is labeled as "Original Bytes"; pre_lc_size is labeled as "Globally Compressed"; post_lc_size is labeled as "Locally Compressed"; the other overheads are reported as "Meta-data." The two examples below were captured from an actual Data Domain system:

Example 1: Inline compression statistics of a file

# filesys show compression /data/col1/main/dir1/file_1 
Total files: 1;  bytes/storage_used: 7.1
        Logical Bytes:       53,687,091,200
       Original Bytes:       11,463,643,380
  Globally Compressed:        4,373,117,751
   Locally Compressed:        1,604,726,416
            Meta-data:           18,118,232
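The "bytes/storage_used: 7.1" figure in Example 1 is consistent with dividing "Original Bytes" by the physical footprint, assuming the footprint is "Locally Compressed" plus "Meta-data" (an interpretation inferred from the sample numbers, not an official formula):

```python
# Sanity check of "bytes/storage_used: 7.1" from Example 1 above.
original_bytes = 11_463_643_380      # "Original Bytes" (raw_bytes)
locally_compressed = 1_604_726_416   # "Locally Compressed" (post_lc_size)
metadata = 18_118_232                # "Meta-data"

ratio = original_bytes / (locally_compressed + metadata)
print(f"bytes/storage_used: {ratio:.1f}")  # bytes/storage_used: 7.1
```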

Example 2: Inline compression statistics of all files under a directory, including all subdirectories

# filesys show compression /data/col1/main/dir1 
Total files: 13;  bytes/storage_used: 7.1
        Logical Bytes:       53,693,219,809
       Original Bytes:       11,501,978,884
  Globally Compressed:        4,387,212,404
   Locally Compressed:        1,608,444,046
            Meta-data:           18,241,880

The system reports the overall inline compression ratio in the above CLI output as "bytes/storage_used." However, care must be taken in interpreting this information, as it can be misleading for various reasons. One reason is that pre_lc_size and post_lc_size are recorded at the time the data operations are processed. When the file that originally added certain data segments is deleted, the statistics of the remaining files that still reference those segments are not updated to account for them.

As an example, assume a file sample.file is backed up to a Data Domain system, and in the first backup, the compression information of the file is pre_lc_size=10 GiB, post_lc_size=5 GiB.

Next, assume that the data of this file is unique, with no data sharing with any other file. In the second backup of the file, further assume that the file gets ideal deduplication, such that both pre_lc_size and post_lc_size should be zero because all segments of the file already existed on the system. When the first backup is deleted, the second backup of the file becomes the only file that references the 5 GiB of data segments. In this case, ideally, the pre_lc_size and post_lc_size of the file in the second backup should be updated from zero to 10 GiB and 5 GiB, respectively. However, there is no way to detect for which files this should be done, so the inline compression statistics of the existing files are left unchanged.

Another factor that affects these numbers is that the statistics are cumulative. When a file receives many overwrites, it is impossible to track the extent to which the cumulative statistics reflect the writes that introduced the live data. Thus, over a long time, the inline compression statistics can only be treated as a heuristic to roughly estimate the compression of a particular file.

Another fact worth highlighting is that the inline compression of a file cannot be measured for an arbitrary time interval. The file inline compression statistics are cumulative and cover all the writes that the file has ever received. When a file receives many overwrites, raw_bytes can be far larger than the logical size of the file. For sparse files, the file sizes may be larger than the "Original Bytes."

3.2 MTree Compression
We can check the compression of a particular MTree with the "mtree show compression" (MSC) CLI command. The absolute values of the inline compression statistics are cumulative over the lifetime of the MTree. Given that the lifetime of an MTree can be many years, these values become less and less informative over time. To address this issue, we use the amount of change (deltas) of the inline compression statistics and report compression only for certain time intervals. The underlying approach is that we periodically dump the MTree inline compression statistics to a log. When a client queries MTree compression with the MSC command, we use the log to calculate the deltas of the numbers for compression reporting. By default, MSC reports compression for the last 7 days and the last 24 hours, though any time period of interest can be specified.

To demonstrate, assume the following log for MTree A:

3:00AM, raw_bytes=11000GB, pre_lc_size=100GB, post_lc_size=50GB
4:00AM, raw_bytes=12000GB, pre_lc_size=200GB, post_lc_size=100GB

Then the compression of MTree A for this hour is:

g_comp = (12000-11000)/(200-100) = 10x
l_comp = (200-100)/(100-50) = 2x
overall compression ratio = (12000-11000)/(100-50) = 20x

The above compression ratio calculation is independent of the dataset size. For example, the above MTree may hold only 500 GB of logical data.
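The delta-based calculation above can be sketched in a few lines of Python (hypothetical log samples, values in GB as in the example):

```python
# Delta-based MTree compression from two hypothetical log samples (GB).
sample_3am = {"raw_bytes": 11000, "pre_lc_size": 100, "post_lc_size": 50}
sample_4am = {"raw_bytes": 12000, "pre_lc_size": 200, "post_lc_size": 100}

# Deltas over the hour, counter by counter.
delta = {k: sample_4am[k] - sample_3am[k] for k in sample_3am}

g_comp = delta["raw_bytes"] / delta["pre_lc_size"]     # deduplication ratio
l_comp = delta["pre_lc_size"] / delta["post_lc_size"]  # local compression
overall = delta["raw_bytes"] / delta["post_lc_size"]

print(f"{g_comp:.0f}x, {l_comp:.0f}x, {overall:.0f}x")  # 10x, 2x, 20x
```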

MSC supports the "daily" and "daily-detailed" options, as does the "filesys show compression" command. When "daily" is specified, the command reports the daily compression in a calendar fashion. It uses the daily deltas of the raw_bytes and post_lc_size to compute the daily compression ratio. When "daily-detailed" is specified, the command shows all three deltas (of the raw_bytes, pre_lc_size, and post_lc_size, respectively) for each day; it also computes the g_comp and l_comp alongside the total compression factor.

Sample outputs of these commands are shown in the Appendix.

3.3 System Compression
Once we understand how compression is reported on MTrees, it is straightforward to extend the concept to the entire system. The system-wide inline compression statistics collection and reporting are exactly the same as with MTrees. The only difference is scope: one is a particular MTree, while the other is the entire system. The results can be checked by using the "filesys show compression" command. An example of this can be found in Section 2. The "last 7 days" and "last 24 hours" system compression is reported in the last two lines of the result section in the FSC output.

4. Cloud Tier

On Data Domain systems with a cloud tier, the storage is separated into the active tier and the cloud tier, which are two independent deduplication domains. Users can inject data only into the active tier. Later, DDOS data-movement functions can be used to migrate data from the active tier to the cloud tier. Thus, space and compression measurement and reporting are handled independently in each tier. At the file level, however, we do not differentiate by tier when reporting inline compression statistics; they are exactly as described in Section 3.1.

5. Deduplication

The last topic to highlight is some of the characteristics of deduplication, which is called "global compression" in many Data Domain documents. Although it contains the word "compression," it is entirely different from the traditional concept of compression, which DDOS also provides under the name "local compression."

Local compression reduces the size of a piece of data using a certain algorithm (some kinds of data are not compressible, and applying compression algorithms on them may slightly increase the data size). Usually, once an algorithm is chosen, the data itself is the only factor determining the compression ratio.

However, deduplication is different: it is not a local concept but a "global" one. An incoming data segment is deduplicated against all the existing data segments in a deduplication domain, which on non-cloud Data Domain systems includes all the data on the system. The deduplication outcome depends on what already exists in the system, not on the data segment itself.

In practice, we rarely see a high deduplication ratio in the initial backup of a dataset. In initial backups, the major data reduction often comes from local compression. When subsequent backups land on the Data Domain system, deduplication shows its strength and becomes the dominant factor for compression. The effectiveness of deduplication relies on the fact that the change rate of a dataset is low from backup to backup. For this reason, datasets with high change rates cannot be deduplicated well. When the backup application inserts its own metadata chunks (called markers by Data Domain) into the backup images at high frequency, it also may not get a good deduplication ratio. Our marker-handling techniques can help sometimes, but not always.

Given these observations, what can we expect?

  • Initial backups may achieve only a small system effective compression ratio, often 2x or 3x. Deduplication usually has little opportunity to show its strength in initial backups.
  • The global compression ratio of an incremental backup is lower than the compression ratio of the corresponding full backup. This is because an incremental backup contains only changed or new files compared to the immediate earlier backup. The global compression ratio depends on the percentage of new data within the incremental backup.
  • The deduplication ratio of a full backup (the non-initial ones) can also be low in some scenarios. Some frequently-observed scenarios include:
    • A high change rate in the data being backed up
    • The dataset being dominated by small files (less than 5 MiB)
    • Backup applications adding a lot of closely spaced markers
    • Database backups that are incremental or using small block size
  • When a low compression ratio is observed in a full backup with a low data change rate, check whether one of the above cases applies or whether further analysis is needed.
  • Compression of a later backup image is not always better than that of the initial one. Consecutive backup images can show a high deduplication ratio because the initial and earlier backup images already added most of the data to the system. When all the earlier backup images are deleted, the global and local compression ratios of the earliest remaining backup image may still be high, but this only means it deduplicated well when it was added to the system, nothing else. When such a file, the last backup image of a particular dataset with high global and local compression ratios, is deleted, it may release more space than the size derived from its compression ratios suggests.
  • Compression ratios of the same dataset on different systems cannot be compared, regardless of how the dataset is added to those systems. This is because each system is an independent deduplication domain. There is no expectation that two different Data Domain systems achieve the same, or even similar, compression ratios, even if their datasets are the same.

6. Summary

Measuring compression is difficult in deduplicated file systems, but it is even harder in log-structured deduplicated file systems. We must understand how deduplication works and how compression statistics are tracked. Compression ratios are useful information to understand the behavior of a particular system. The system effective compression ratio is the most important, reliable, and informative measure. The inline compression statistics can be helpful too, but they might be no more than heuristics in some circumstances.

Appendix: Sample output of "mtree show compression" command

Assume that there is an MTree holding 254792.4 GiB of data. It has received 4379.3 GiB of new data in the last 7 days and 78.4 GiB in the last 24 hours (other time intervals can be specified). The "daily" option reports the inline compression statistics for the last 33 days. When the "daily-detailed" option is provided, the total compression ratios are further broken down into global and local compression ratios.

Mtree List output:

# mtree list /data/col1/main 
Name              Pre-Comp (GiB)   Status
---------------   --------------   ------
/data/col1/main         254792.4   RW
---------------   --------------   ------
 D    : Deleted
 Q    : Quota Defined
 RO   : Read Only
 RW   : Read Write
 RD   : Replication Destination
 IRH  : Retention-Lock Indefinite Retention Hold Enabled
 ARL  : Automatic-Retention-Lock Enabled
 RLGE : Retention-Lock Governance Enabled
 RLGD : Retention-Lock Governance Disabled
 RLCE : Retention-Lock Compliance Enabled
 M    : Mobile
 m    : Migratable
MSC (no options):
# mtree show compression /data/col1/main

From: 2023-09-07 12:00 To: 2023-09-14 12:00

                Pre-Comp   Post-Comp   Global-Comp   Local-Comp      Total-Comp
                   (GiB)       (GiB)        Factor       Factor          Factor
                                                                  (Reduction %)
-------------   --------   ---------   -----------   ----------   -------------
Written:
  Last 7 days     4379.3       883.2          3.4x         1.5x     5.0x (79.8)
  Last 24 hrs      784.6       162.1          3.3x         1.4x     4.8x (79.3)
-------------   --------   ---------   -----------   ----------   -------------

With "daily" option:

# mtree show compression /data/col1/main daily

From: 2023-08-12 12:00 To: 2023-09-14 12:00

  Sun     Mon     Tue     Wed     Thu     Fri     Sat   Weekly
-----   -----   -----   -----   -----   -----   -----   ------   -----------------
 -13-    -14-    -15-    -16-    -17-    -18-    -19-            Date
432.0   405.9   284.1   438.8   347.0   272.7   331.4   2511.8   Pre-Comp
 85.5    66.2    45.3    81.9    61.4    57.4    66.3    464.1   Post-Comp
 5.0x    6.1x    6.3x    5.4x    5.7x    4.7x    5.0x     5.4x   Total-Comp Factor

 -20-    -21-    -22-    -23-    -24-    -25-    -26-
478.0   387.8   450.2   533.1   386.0   258.4   393.6   2887.1
100.6    81.5   100.8   119.0    84.0    40.6    75.3    601.8
 4.8x    4.8x    4.5x    4.5x    4.6x    6.4x    5.2x     4.8x

 -27-    -28-    -29-    -30-    -31-     -1-     -2-
 27.6     1.0     0.4   470.7   467.3   517.7   641.9   2126.7
  4.9     0.2     0.1    83.9    92.3    89.8   140.1    411.2
 5.6x    5.6x    4.3x    5.6x    5.1x    5.8x    4.6x     5.2x

  -3-     -4-     -5-     -6-     -7-     -8-     -9-
539.6   495.0   652.8   658.7   537.1   398.7   305.5   3587.3 
110.8   108.0   139.4   137.0   111.5    78.3    48.3    733.3 
 4.9x    4.6x    4.7x    4.8x    4.8x    5.1x    6.3x     4.9x 

 -10-    -11-    -12-    -13-    -14-   
660.2   738.3   787.2   672.9   796.9                   3655.5
143.9   152.5   167.6   126.9   163.3                    754.2 
 4.6x    4.8x    4.7x    5.3x    4.9x                     4.8x 
-----   -----   -----   -----   -----   -----   -----   ------   -----------------
                 Pre-Comp   Post-Comp   Global-Comp   Local-Comp      Total-Comp
                    (GiB)       (GiB)        Factor       Factor          Factor
                                                                   (Reduction %)
--------------   --------   ---------   -----------   ----------   -------------
Written:
  Last 33 days    14768.3      2964.5          3.4x         1.5x     5.0x (79.9)
  Last 24 hrs       784.6       162.1          3.3x         1.4x     4.8x (79.3)
--------------   --------   ---------   -----------   ----------   -------------

Key:
       Pre-Comp = Data written before compression
       Post-Comp = Storage used after compression
       Global-Comp Factor = Pre-Comp / (Size after de-dupe)
       Local-Comp Factor = (Size after de-dupe) / Post-Comp
       Total-Comp Factor = Pre-Comp / Post-Comp
       Reduction % = ((Pre-Comp - Post-Comp) / Pre-Comp) * 100

With "daily-detailed" option:

# mtree show compression /data/col1/main daily-detailed 

From: 2023-08-12 12:00 To: 2023-09-14 12:00

  Sun     Mon     Tue     Wed     Thu    Fri     Sat    Weekly
-----   -----   -----   -----   -----   -----   -----   ------   -----------------
 -13-    -14-    -15-    -16-    -17-    -18-    -19-            Date
432.0   405.9   284.1   438.8   347.0   272.7   331.4   2511.8   Pre-Comp
 85.5    66.2    45.3    81.9    61.4    57.4    66.3    464.1   Post-Comp
 3.5x    4.1x    4.3x    3.6x    3.8x    3.3x    3.4x     3.7x   Global-Comp Factor
 1.4x    1.5x    1.5x    1.5x    1.5x    1.4x    1.5x     1.5x   Local-Comp Factor
 5.0x    6.1x    6.3x    5.4x    5.7x    4.7x    5.0x     5.4x   Total-Comp Factor
 80.2    83.7    84.1    81.3    82.3    78.9    80.0     81.5   Reduction %

 -20-    -21-    -22-    -23-    -24-    -25-    -26-
478.0   387.8   450.2   533.1   386.0   258.4   393.6   2887.1
100.6    81.5   100.8   119.0    84.0    40.6    75.3    601.8
 3.3x    3.3x    3.0x    3.0x    3.3x    4.1x    3.6x     3.3x 
 1.4x    1.5x    1.5x    1.5x    1.4x    1.5x    1.4x     1.5x 
 4.8x    4.8x    4.5x    4.5x    4.6x    6.4x    5.2x     4.8x
 79.0    79.0    77.6    77.7    78.2    84.3    80.9     79.2

 -27-    -28-    -29-    -30-    -31-    -1-     -2-
 27.6     1.0     0.4   470.7   467.3   517.7   641.9   2126.7
  4.9     0.2     0.1    83.9    92.3    89.8   140.1    411.2
 4.4x    3.7x    2.6x    3.8x    3.5x    3.9x    3.2x     3.5x 
 1.3x    1.5x    1.6x    1.5x    1.4x    1.5x    1.5x     1.5x
 5.6x    5.6x    4.3x    5.6x    5.1x    5.8x    4.6x     5.2x
 82.1    82.2    76.8    82.2    80.3    82.7    78.2     80.7

  -3-     -4-     -5-     -6-     -7-    -8-     -9-
539.6   495.0   652.8   658.7   537.1   398.7   305.5   3587.3 
110.8   108.0   139.4   137.0   111.5    78.3    48.3    733.3 
 3.4x    3.1x    3.2x    3.4x    3.3x    3.4x    4.1x     3.3x 
 1.4x    1.5x    1.5x    1.4x    1.4x    1.5x    1.6x     1.5x
 4.9x    4.6x    4.7x    4.8x    4.8x    5.1x    6.3x     4.9x 
 79.5    78.2    78.6    79.2    79.2    80.4    84.2     79.6

 -10-    -11-    -12-    -13-    -14-   
660.2   738.3   787.2   672.9   796.9                   3655.5
143.9   152.5   167.6   126.9   163.3                    754.2
 3.1x    3.4x    3.2x    3.7x    3.4x                     3.3x 
 1.5x    1.4x    1.5x    1.4x    1.5x                     1.5x
 4.6x    4.8x    4.7x    5.3x    4.9x                     4.8x
 78.2    79.3    78.7    81.1    79.5                     79.4
-----   -----   -----   -----   -----   -----   -----   ------   -----------------
                 Pre-Comp   Post-Comp   Global-Comp   Local-Comp      Total-Comp
                    (GiB)       (GiB)        Factor       Factor          Factor
                                                                   (Reduction %)
--------------   --------   ---------   -----------   ----------   -------------
Written:
  Last 33 days    14768.3      2964.5          3.4x         1.5x     5.0x (79.9)
  Last 24 hrs       784.6       162.1          3.3x         1.4x     4.8x (79.3)
--------------   --------   ---------   -----------   ----------   -------------

Key:
       Pre-Comp = Data written before compression
       Post-Comp = Storage used after compression
       Global-Comp Factor = Pre-Comp / (Size after de-dupe)
       Local-Comp Factor = (Size after de-dupe) / Post-Comp
       Total-Comp Factor = Pre-Comp / Post-Comp
       Reduction % = ((Pre-Comp - Post-Comp) / Pre-Comp) * 100

Article Properties


Affected Product

Data Domain

Product

Data Domain

Last Published Date

28 Mar 2024

Version

16

Article Type

How To