Troubleshooting poor deduplication and compression ratio of files on Data Domain Restorers (DDRs)
Symptoms
Data Domain Restorers (DDRs) are designed to hold large amounts of logical (precompressed) data using minimal physical (postcompressed) disk space. This is achieved using:
- Deduplication of ingested data, which removes chunks that are already stored on disk on the DDR, leaving only unique data
- Compression of unique data before that data is physically written to disk.
The overall compression ratio of data that a DDR can ingest varies due to several factors such as:
- Use case
- Data types being ingested
- Backup application configuration
When optimally configured, DDRs typically achieve a 10-20x overall compression ratio (and sometimes higher). In some environments, however, the overall compression ratio can be lower than this, which can cause:
- The DDR to quickly exhaust its usable capacity
- Impact to backup, restore, or replication performance
- A failure of the DDR to meet customer expectations
- Similar impact on any cloud units attached to the DDR
Cause
This article discusses:
- A brief overview of deduplication and compression of data on a DDR
- How to determine the overall compression ratio for the system and individual files
- Factors which can cause degradation to overall compression ratio
Resolution
How does a Data Domain Restorer ingest new data?
- The backup application sends data (that is, files) to the DDR.
- The DDR splits these files into chunks of 4-12 KB in size; each chunk is known as a 'segment'.
- The DDR generates a unique 'fingerprint' (similar to a checksum) for each segment based on the data contained within the segment.
- The fingerprints of newly arrived segments are checked against on-disk indexes on the DDR to determine whether the DDR already holds a segment with the same fingerprint.
- If the DDR already holds a segment with the same fingerprint, then the corresponding segment in the newly arrived data is a duplicate and can be dropped (that is, deduplicated).
- Once all duplicate segments have been removed from the newly arrived data, only unique or new segments remain.
- These unique or new segments are grouped into 128 KB 'compression regions' and then compressed (using the lz algorithm by default).
- Compressed compression regions are packed into 4.5 MB units of storage known as 'containers', which are then written to disk.
- A very similar process is used when moving data to the cloud tier.
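The flow above can be illustrated with a short, purely illustrative Python sketch (this is not how DD OS is implemented): fixed 8 KB chunks and SHA-1 hashes stand in for the DDR's variable-size segments and fingerprints, a Python set stands in for the on-disk indexes, and zlib stands in for the lz algorithm.

import hashlib
import os
import zlib

SEGMENT_SIZE = 8 * 1024           # real segments vary from ~4 KB to ~12 KB
COMPRESSION_REGION = 128 * 1024   # unique segments are grouped into 128 KB regions

fingerprint_index = set()         # stands in for the DDR's on-disk fingerprint indexes
containers = []                   # compressed regions would be packed into 4.5 MB containers

def ingest(data):
    """Deduplicate and compress an incoming byte stream (highly simplified)."""
    unique = bytearray()
    for off in range(0, len(data), SEGMENT_SIZE):
        segment = data[off:off + SEGMENT_SIZE]
        fp = hashlib.sha1(segment).digest()     # the segment's 'fingerprint'
        if fp in fingerprint_index:
            continue                            # duplicate segment: dropped (deduplicated)
        fingerprint_index.add(fp)
        unique.extend(segment)
        if len(unique) >= COMPRESSION_REGION:   # compress each full region before 'writing'
            containers.append(zlib.compress(bytes(unique)))   # zlib stands in for lz
            unique = bytearray()
    if unique:
        containers.append(zlib.compress(bytes(unique)))

payload = os.urandom(1024 * 1024)   # 1 MB of 'backup' data
ingest(payload)                     # first backup: every segment is new
first_pass = len(containers)
ingest(payload)                     # second backup of the same data: fully deduplicated
print(first_pass, len(containers))  # the container count does not grow on the second pass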
How does the DDR track which segments make up a certain file?
In addition to deduplicating and compressing newly arrived data, the DDR also builds a 'segment tree' for each ingested file. This is essentially a list of the segment 'fingerprints' making up that file. If the DDR must later read the file back, it:
- Determines the location of the file's segment tree.
- Reads the segment tree to obtain a list of all segment fingerprints making up the region of the file being read.
- Uses on-disk indices to determine the physical location (that is, the container) of the data on disk.
- Reads the physical segment data from the underlying containers on disk.
- Uses the physical segment data to reconstruct the file.
File segment trees are also stored in 4.5 MB containers on disk and represent the majority of each file's 'metadata' (discussed later in this article).
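The read path can be sketched in the same simplified way (Python dictionaries stand in for the segment trees and on-disk indices; the file path is hypothetical):

import hashlib

segment_store = {}    # fingerprint -> segment bytes (stands in for containers plus indices)
segment_trees = {}    # file name   -> ordered list of segment fingerprints

def write_file(name, data, seg_size=8192):
    """Store a file as a 'segment tree' of fingerprints plus any new segment data."""
    tree = []
    for off in range(0, len(data), seg_size):
        seg = data[off:off + seg_size]
        fp = hashlib.sha1(seg).digest()
        segment_store.setdefault(fp, seg)   # only new segments are stored
        tree.append(fp)
    segment_trees[name] = tree

def read_file(name):
    """Locate the segment tree, look up each fingerprint, and reconstruct the file."""
    return b"".join(segment_store[fp] for fp in segment_trees[name])

data = b"sample backup data " * 100000
write_file("/data/col1/backup/example", data)                # hypothetical file name
assert read_file("/data/col1/backup/example") == data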
How can overall compression ratio on a DDR be determined?
The overall utilization of a DDR (and compression ratio) can be seen using the 'filesys show space' command. For example:
Active Tier:
Resource Size GiB Used GiB Avail GiB Use% Cleanable GiB*
---------------- -------- -------- --------- ---- --------------
/data: pre-comp - 115367.8 - - -
/data: post-comp 6794.2 6242.4 551.8 92% 202.5
/ddvar 49.2 9.1 37.6 20% -
---------------- -------- -------- --------- ---- --------------
In this case, we see that:
- Precompressed (logical) data held on the DDR: 115367.8 GiB
- Postcompressed (physical) space used on the DDR: 6242.4 GiB
- The overall compression ratio is 115367.8 / 6242.4 = 18.48x
The output of the 'filesys show compression' command confirms the amount of data held, the space used, and the compression ratio. For example:
Pre-Comp Post-Comp Global-Comp Local-Comp Total-Comp
(GiB) (GiB) Factor Factor Factor
(Reduction %)
---------------- -------- --------- ----------- ---------- -------------
Currently Used:* 115367.8 6242.4 - - 18.5x (94.6) <=== NOTE
Written:
Last 7 days 42214.7 1863.2 11.0x 2.1x 22.7x (95.6)
Last 24 hrs 4924.8 274.0 8.8x 2.0x 18.0x (94.4)
---------------- -------- --------- ----------- ---------- -------------
Overall utilization figures on the DDR are calculated as follows:
- Total precompressed data: The sum of the precompressed (logical) size of all files held by the DDR.
- Total postcompressed data: The number of in-use 'containers' on disk multiplied by 4.5 MB (the size of a single container).
- Total postcompressed size (capacity): The maximum number of 'containers' which can be created given the available disk space on the system, multiplied by the container size.
Statistics on the maximum and in-use containers are available in autosupport files. For example:
Container set 73fcacadea763b48:b66f6a65133e6c73:
...
attrs.psize = 4718592 <=== Container size in bytes
...
attrs.max_containers = 1546057 <=== Maximum possible containers
attrs.free_containers = 125562 <=== Currently free containers
attrs.used_containers = 1420495 <=== Currently in use containers
...
Note that:
Postcomp used = 1420495 * 4718592 / 1024 / 1024 / 1024 = 6242.4 GiB
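The same arithmetic, expressed as a small Python calculation using the figures from the outputs above:

# Reproduce the post-comp and overall compression ratio figures shown above
container_size_bytes = 4718592            # attrs.psize
used_containers = 1420495                 # attrs.used_containers

postcomp_gib = used_containers * container_size_bytes / 1024 ** 3
print(round(postcomp_gib, 1))             # 6242.4 - matches 'filesys show space'

precomp_gib = 115367.8                    # from 'filesys show space' above
print(round(precomp_gib / postcomp_gib, 2))   # 18.48 - reported as 18.5x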
How can deduplication and compression ratios for an individual file, directory, or directory tree be determined?
When a file is ingested the DDR records statistics about the file including:
- Precompressed (logical) bytes
- Size of unique segments after deduplication
- Size of unique segments after deduplication and compression
- Size of the file's metadata (that is segment tree and so on)
It is possible to dump some of these statistics using the 'filesys show compression [path]' command - for example to report statistics for a single file:
SE@DDVE60_JF## filesys show compression /data/col1/backup/testfile
Total files: 1; bytes/storage_used: 2.9
Original Bytes: 3,242,460,364
Globally Compressed: 1,113,584,070
Locally Compressed: 1,130,871,915
Meta-data: 4,772,672
To report statistics for an entire directory tree:
SE@DDVE60_JF## filesys show compression /data/col1/backup
Total files: 3; bytes/storage_used: 1.4
Original Bytes: 7,554,284,280
Globally Compressed: 5,425,407,986
Locally Compressed: 5,510,685,100
Meta-data: 23,263,692
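For reference, the reported 'bytes/storage_used' figure appears to be the Original Bytes divided by the sum of the Locally Compressed and Meta-data bytes; a quick Python check against the two outputs above is consistent with this:

def bytes_per_storage_used(original, locally_compressed, metadata):
    # logical bytes written divided by physical bytes consumed (data plus metadata)
    return original / (locally_compressed + metadata)

# Single file example above:
print(round(bytes_per_storage_used(3242460364, 1130871915, 4772672), 1))   # 2.9
# Directory tree example above:
print(round(bytes_per_storage_used(7554284280, 5510685100, 23263692), 1))  # 1.4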
Note, however, that there are a couple of caveats around using these statistics:
- The statistics are generated at the time of file or data ingestion, and following this are not updated. Due to how a DDR works, ingestion of new files or deletion of files referencing the same data, and so on, can change how a file deduplicates over time causing these statistics to become stale.
- In addition, certain use cases on the DDR (such as fastcopy of a file then deletion of the original file) can cause these statistics to become misleading or incorrect.
As a result, these figures should be considered as estimates only.
Note also that the precompressed bytes figure is not necessarily the precompressed (logical) size of the file. Instead, it is the total number of bytes written to the file over its lifetime. As a result, in environments where existing files are commonly overwritten (such as those using virtual tape library functionality), this figure can be larger than the logical size of the corresponding files.
Can ingestion of 'poor quality' data cause degradation in the overall compression ratio?
Yes - For a DDR to achieve a good overall compression ratio of ingested data, it must be able to deduplicate and compress that data. There are various types of data which can prevent this as discussed below:
Precompressed/pre-encrypted data:
These are data types which are either compressed or encrypted on the client system or by the backup application. This may also include application-specific files which are compressed or encrypted by design (for example, media files), and database files which are compressed, encrypted, or embed binary objects such as media files.
Due to how compression and encryption algorithms work, a relatively small change to a file's underlying data causes changes to 'ripple out' across the file. For example, a client may hold a 100 MB encrypted file within which 10 KB is modified. Ordinarily, the resulting file would be identical before and after modification apart from the 10 KB section which has changed. When encryption is used, however, even though only 10 KB of unencrypted data has changed, the encryption algorithm causes the entire contents of the encrypted file to change.
When such data is regularly modified and periodically sent to a DDR, this 'ripple out' effect causes each generation of the file to look different from previous generations of the same file. As a result, each generation contains a unique set of segments (and segment fingerprints) and so shows a poor deduplication ratio.
Note also that, in the case of precompressed files, the lz algorithm is unlikely to be able to further compress the constituent segment data, so the data cannot be reduced further before being written to disk.
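The 'ripple out' effect can be demonstrated with a short illustrative Python example. Fixed 8 KB chunks and SHA-1 hashes stand in for the DDR's segments and fingerprints, and zlib stands in for client-side compression; the data and figures are illustrative only:

import hashlib
import os
import zlib

def fingerprints(data, seg_size=8192):
    """Fingerprint fixed-size chunks - a simplified stand-in for DDR segmenting."""
    return {hashlib.sha1(data[i:i + seg_size]).digest()
            for i in range(0, len(data), seg_size)}

# Two generations of a compressible 'file'; generation 2 has ~10 KB modified in place
gen1 = b"".join(b"transaction record %010d\n" % i for i in range(400000))
gen2 = bytearray(gen1)
gen2[64000:64000 + 10240] = os.urandom(10240)
gen2 = bytes(gen2)

# Sent uncompressed, almost every segment is unchanged, so deduplication works well
plain = fingerprints(gen1)
print(len(plain & fingerprints(gen2)), "of", len(plain), "plaintext segments unchanged")

# Pre-compressed by the client (zlib stands in for any client-side compression or
# encryption): the change alters the compressed encoding and shifts everything written
# after it, so almost no segments of the second compressed generation match the first
comp = fingerprints(zlib.compress(gen1))
print(len(comp & fingerprints(zlib.compress(gen2))), "of", len(comp), "compressed segments unchanged")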
As a general guideline precompression/pre-encryption causes the following:
- Preencrypted data: Poor deduplication ratio but acceptable compression ratio
- Precompressed data: Poor deduplication ratio and poor compression ratio
When the same (that is unchanged) precompressed/pre-encrypted data is ingested by a DDR multiple times, deduplication ratio of the data improves as, despite the use of compression or encryption algorithms, a similar set of segments (and segment fingerprints) is seen during each backup.
Where possible, data sent to a DDR should not be encrypted or compressed. This may necessitate disabling encryption or compression on the end client or within the corresponding backup application.
For assistance in checking or modifying encryption or compression settings within a given backup application, client application, or operating system, contact the appropriate support provider.
Media files:
Certain file types contain precompressed or preencrypted data by design. For example:
- PDF files
- Certain audio files (mp3, wma, ogg, and so forth)
- Video files (avi, mkv, and so forth)
- Image files (png, bmp, jpeg, and so forth)
- Application-specific files (Microsoft Office, Open Office, Libre Office, and so forth)
The data within these files is compressed or encrypted by the file's codec or format and, as a result, causes the same issues when ingested by a DDR as described above for precompressed or preencrypted data.
Files with high 'uniqueness':
Achieving a good deduplication ratio depends on the DDR seeing the same set of segments (and segment fingerprints) multiple times. Certain data types, however, consist of transactional data which, by design, is unique.
If these files are sent to a DDR, then each generation of the backup contains a unique set of segments and segment fingerprints and, as a result, sees a degraded deduplication ratio.
Examples of such files are:
- Database transaction logs (for example Oracle archive logs).
- Microsoft Exchange transaction logs
The first backup of a 'new' client to a DDR can also cause this issue (as the data has not been seen before by the DDR, the corresponding segments and segment fingerprints in the backup are unique). Over time, however, as further generations of the same backup are sent to the DDR, the deduplication ratio of the backups improves as fewer segments in each new backup are unique. Due to this, it is expected that the overall deduplication/compression ratio on a newly installed DDR receiving mostly new backups is degraded initially but improves over time.
Small files:
Small files cause various issues when written to a DDR. These include:
- Metadata bloat - The DDR starts to hold a higher than expected amount of file metadata when compared to physical data.
- Poor container utilization - By design (due to the Data Domain Stream Informed Segment Layout or SISL architecture, which is beyond the scope of this document), a 4.5 MB container on disk only holds data from one single file. As a result, backing up a single 10 KB file, for example, causes at least one full 4.5 MB container to be written for that file. This may mean that, for such files, the DDR uses considerably more postcompressed (physical) space than the corresponding amount of precompressed (logical) data being backed up, which in turn drives the effective compression ratio for that data below 1x (see the worked arithmetic after this list).
- Poor deduplication ratio - Files which are smaller than 4 KB (the minimum supported segment size on a DDR) consist of a single segment which is padded to 4 KB. Such segments are not deduplicated but instead are written directly to disk. This can cause the DDR to hold multiple copies of the same segment (seen as duplicate segments).
- Poor backup, restore, or clean performance - There are large overheads during backup, restore, or clean when moving from one file to the next (as the context of metadata being used has to be switched).
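The container utilization point above can be illustrated with simple arithmetic in Python, using the 4.5 MB container size and the 10 KB/10 GB file sizes referenced in this article:

CONTAINER_BYTES = int(4.5 * 1024 * 1024)    # one 4.5 MB container
SMALL_FILE_BYTES = 10 * 1024                # a single 10 KB file

# Worst case for an isolated small file: at least one full container is written,
# so physical usage far exceeds the logical data backed up
print(round(SMALL_FILE_BYTES / CONTAINER_BYTES, 4))    # ~0.0022 (well below 1x)

# Combining the files into one large archive lets segments share containers:
# 1,048,576 x 10 KB files written as a single stream is 10 GB of logical data
print(1048576 * SMALL_FILE_BYTES / 1024 ** 3)          # 10.0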
Note that:
- The impact to clean performance when using small files has been mitigated, to an extent, by the introduction of physical cleaning or garbage collection in DDOS 5.5 and later.
- Cleaning attempts to 'undo' poor container utilization by aggregating data from containers with low utilization into more tightly packed containers during its copy phase.
- Cleaning attempts to remove excessive duplicate segments during its copy phase.
Despite the above, the use of large numbers of small files, or workloads consisting mainly of small files, should be avoided. It is better to combine large numbers of small files into a single uncompressed/unencrypted archive before backup than to send the small files to the DDR in their native state. For example, it is far better to back up a single 10 GB file containing 1,048,576 individual 10 KB files than to back up all 1,048,576 files individually.
These same issues with small files also affect the cloud tier: backing up small files to a cloud unit causes metadata bloat in the local cloud metadata storage, and container utilization and deduplication ratio for the cloud unit are poor.
Excessive multiplexing by backup applications:
Backup applications can be configured to perform multiplexing of data across the streams being sent to the backup device; that is, data from multiple input streams (different clients) is interleaved into a single stream sent to the backup device. This functionality is primarily of use when writing to physical tape devices as:
- A physical tape device can only support a single incoming write stream.
- The backup application must maintain sufficient throughput to the tape device to prevent tape starts, stops, or rewinds (also known as shoe-shining) - This is easier if the stream going to the tape device contains data being read from more than one client.
In the case of a DDR, however, this causes a single file on the DDR to contain data from multiple clients, interleaved in arbitrary order and chunk sizes. This can cause a degraded deduplication ratio as the DDR may not be able to reliably recognize duplicate segments from each generation of a given client's backup. In general, the smaller the multiplexing granularity, the worse the impact on deduplication ratio.
In addition, restore performance may be poor, as to restore a certain client's data the DDR must read many files or containers in which most of the data is superfluous because it relates to other clients' backups.
Backup applications do not need to use multiplexing when writing to a DDR, as DDRs support a higher incoming stream count than physical tape devices, with each stream able to write at a variable speed. As a result, multiplexing by backup applications should be disabled. If backup performance is impacted after disabling multiplexing, then:
- Backup applications using CIFS, NFS, or OST (DDBoost) should have their number of write streams increased (so that more files can be written in parallel on the DDR).
- Environments using VTL should add additional drives to the DDR as each drive allows support of an additional parallel write stream.
If assistance is required in disabling multiplexing or you want to discuss recommended multiplexing configuration for a specific backup application, contact your contracted support provider.
Backup applications inserting excessive tape markers:
Some backup applications may insert repeated data structures, known as 'markers', into a backup stream. Markers do not represent physical data within the backup but instead are used as an indexing or positional system by the backup application.
In some circumstances, the inclusion of markers in a backup stream can degrade the deduplication ratio, for example:
- In the first generation of a backup, there was 12 KB of contiguous data - This was recognized by the DDR as a single segment.
- In the second generation of the backup, however, the same 12 KB of data is split by the inclusion of a backup marker, so it is now represented as 6 KB of data, a backup marker, then 6 KB of data.
- As a result, the segments created during the second generation of the backup do not match those generated during the first generation and hence do not deduplicate properly.
The more closely spaced the markers are, the worse the impact on deduplication ratio (for example, a backup application inserting markers every 32 KB causes more issues than one inserting markers every 1 MB).
To avoid this issue, the DDR uses marker recognition technology which allows:
- Backup markers to be transparently removed from the backup stream during ingest of the backup.
- Backup markers to be reinserted into the backup stream during restore of the backup.
This helps to prevent fragmentation of data or segments by backup markers and improves the deduplication ratio of corresponding backups.
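The marker problem, and the effect of marker stripping, can be sketched in a few lines of illustrative Python (the marker bytes below are hypothetical; real marker formats are application-specific and the strip shown here is only a conceptual stand-in for the DDR's marker recognition):

import hashlib

def fingerprint(b):
    return hashlib.sha1(b).hexdigest()[:12]   # shortened stand-in fingerprint

data = bytes(range(256)) * 48              # 12 KB of client data
marker = b"<HYPOTHETICAL-BACKUP-MARKER>"   # hypothetical marker inserted by the backup app

gen1 = data                                         # generation 1: one contiguous region
gen2 = data[:6 * 1024] + marker + data[6 * 1024:]   # generation 2: split by a marker

# The client data is identical, but the segments cut from generation 2 no longer
# line up with generation 1, so their fingerprints do not match (no deduplication)
print(fingerprint(gen1) == fingerprint(gen2[:len(gen1)]))            # False

# With marker recognition, the marker is stripped on ingest and dedup is restored
print(fingerprint(gen2.replace(marker, b"")) == fingerprint(gen1))   # True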
To take full advantage of this technology, however, it is important that the DDR can correctly recognize the markers being inserted in backup streams. The DDR looks for markers depending on the setting of the 'marker type' option, for example:
SE@DDVE60_JF## filesys option show
Option Value
-------------------------------- --------
...
Marker-type auto
...
-------------------------------- --------
Usually this should be left set to 'auto' as this allows the DDR to automatically match most common marker types. If the system is ingesting data from only one backup application which does insert markers, then there may be a performance benefit from specifying a specific marker type, that is:
# filesys option set marker-type {auto | nw1 | cv1 | tsm1 | tsm2 | eti1 | fdr1 | hpdp1 | besr1 | ssrt1 | ism1 | bti1 | none}
Note that:
- Any benefit to performance from selecting a specific marker type is likely to be minimal.
- Selecting an incorrect marker type may cause significant additional degradation to backup or restore performance and deduplication ratio.
As a result, Data Domain generally recommends leaving marker type set to 'auto'. For further advice on modifying marker type, contact your contracted support provider.
For systems ingesting data from applications which use backup markers that are not recognized by the automated marker handling technology (such as products from BridgeHead Software), contact your contracted support provider, who can then work with Data Domain support to determine the settings required on the DDR to detect the nonstandard marker.
Indications of 'poor quality' data being received by a DDR:
The following table lists expected deduplication and compression ratios for the different data types that are listed above. This list is not exhaustive and there can obviously be some variation in the exact figures that are seen on a given system due to workload or data that is ingested by the DDR:
| Global Compression (deduplication) | Local Compression | Likely Cause |
| --- | --- | --- |
| Low (1x - 4x) | Low (1x - 1.5x) | Precompressed or encrypted data |
| Low (1x - 2x) | High (>2x) | Unique but compressible data, such as database archive logs |
| Low (2x - 5x) | High (>1.5x) | Markers that are not detected or high data change rate or stream multiplexing. |
| High (>10x) | Low (<1.5x) | Backups of the same compressed or encrypted data. This is uncommon. |
Are there certain factors on a DDR which can impact overall deduplication ratio?
Yes - There are several factors which can cause old/superfluous data to be retained on disk on a DDR, which increases postcompressed (physical) disk space usage and lowers the overall compression ratio. Such factors are discussed below.
A failure to regularly run file system cleaning:
File system cleaning is the only way to physically remove old/superfluous data on disk which is no longer referenced by files on the DDR. As a result, a user might delete several files from the system (causing a drop in precompressed utilization) but not run clean (leaving postcompressed/physical utilization high). This would cause a drop in the overall compression ratio.
Data Domain recommends scheduling clean to run at regular intervals as follows:
- Normal DDR: Once per week
- DDR using extended retention: Once every two weeks
Clean should not be run more than once a week as this can cause issues with fragmentation of data on disk which manifests itself as poor restore/replication performance.
Excessive old snapshots on the system:
DDRs can create mtree snapshots which represent the contents of an mtree at the point in time the snapshot was created. Note, however, that leaving old snapshots on a system can cause an increase in postcompressed/physical utilization causing a drop in overall compression ratio. For example:
- An mtree exists containing many files (so precompressed utilization is high).
- A snapshot of the mtree is created.
- Many of the files are deleted (causing precompressed utilization to drop).
- File system cleaning is run - note, however, that minimal hard drive space is freed as a copy of the deleted files remains in the mtree snapshot meaning that data that is referenced by those files cannot be removed from disk.
- As a result postcompressed/physical utilization remains high
Data Domain recommends that if mtree snapshots are used (for example, for recovery from accidental data deletion), they are managed using automated snapshot schedules such that snapshots are created at regular intervals with a defined expiration period (the amount of time before the snapshot is automatically removed). In addition, the expiration period should be as short as possible (although this depends on the use case for the snapshots and the level of protection they provide). This prevents a buildup of old snapshots with long expiration periods.
Further information about working with snapshots and snapshot schedules is available in the following article: Data Domain - Managing Snapshot Schedules
Excessive replication lag:
Native Data Domain replication uses either a replication log or mtree snapshots (depending on replication type) to track which files or data are pending replication to a remote DDR. Replication lag is the concept of the replica falling behind changes to the source DDR. This can occur due to various factors including:
- Replication contexts being disabled
- Insufficient network bandwidth between DDRs
- Frequent network disconnects.
Large replication lag can cause the replication log to retain references to files which have been deleted on the source DDR, or can leave old/stale mtree snapshots on the source and destination DDRs. As described above, data referenced by these snapshots (or by the replication log) cannot be physically removed from disk on the DDR even if the corresponding files have been deleted from the system. This can cause the postcompressed (physical) utilization of the DDR to increase, which in turn degrades the overall compression ratio.
If DDRs are suffering from high utilization, and this is believed to be due to replication lag, contact your contracted support provider for further assistance.
Are there configuration changes or certain factors on a DDR which can increase the overall compression ratio?
Yes - Removing or addressing the issues discussed previously in this document should allow a DDR to show an improving overall compression ratio over time. There are also various factors or workloads on a DDR which can lead to an increase in the overall compression ratio. These generally involve:
- Reducing the amount of hard drive space used by files on the DDR (for example increasing the aggressiveness of the compression algorithm used by the DDR)
- Suddenly increasing the amount of precompressed (logical) data on the DDR without a corresponding increase in postcompressed/physical utilization
Modification of compression algorithm:
By default, DDRs compress data being written to disk with the lz algorithm. lz is used as it has relatively low overheads in terms of the CPU required for compression and decompression, while showing reasonable effectiveness in reducing data size.
It is possible to increase the aggressiveness of the compression algorithm to provide further savings in postcompressed (hard drive) utilization and, as a result, improve the overall compression ratio. Supported compression algorithms, in order of effectiveness (from low to high), are as follows:
- lz
- gzfast
- gz
A general comparison of each algorithm is as follows:
- gzfast, compared to lz, gives ~15% better compression but consumes ~2x the CPU.
- gz, compared to lz, gives ~30% better compression but consumes ~5x the CPU.
- gz, compared to gzfast, gives ~10-15% better compression.
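As a rough, purely indicative Python calculation of what the guideline percentages above mean for post-compressed space (assuming a hypothetical workload that occupies 1000 GiB under the default lz algorithm):

# Indicative only: approximate post-comp footprint for a hypothetical 1000 GiB lz baseline
lz_gib = 1000.0
gzfast_gib = lz_gib * (1 - 0.15)   # ~15% better compression than lz, ~2x the CPU
gz_gib = lz_gib * (1 - 0.30)       # ~30% better compression than lz, ~5x the CPU
print(gzfast_gib, gz_gib)          # 850.0 700.0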
It is also possible to completely disable compression (by specifying an algorithm of 'none'); however, this is not supported for use on customer systems and is intended for internal testing only.
As the above comparison shows, the more aggressive the compression algorithm, the more CPU is required during compression and decompression of data. Due to this, changes to a more aggressive algorithm should only be made on systems which are lightly loaded under normal workload. Changing the algorithm on heavily loaded systems can lead to extreme degradation in backup or restore performance and possible file system panics or restarts (causing an outage of the DDR).
For further information about changing compression type see the following article: Data Domain System and Cleaning Performance Impact of Converting to GZ Compression
Due to the potential impact of changing compression algorithm, it is recommended that customers interested in doing this contact their contracted support provider to further discuss the change before proceeding.
Use of file system fastcopy:
DDRs allow use of the 'filesys fastcopy' command to quickly copy a file (or directory tree). This functionality creates a file by cloning the metadata of an existing file (or group of files) so that, while the new files are not physically connected to the original file, they reference exactly the same data on disk as the original file. This means that, regardless of the size of the original file, the new file consumes very little additional space on disk (as it deduplicates perfectly against existing data).
The result of this behavior is that when fastcopy is used, the precompressed (logical) size of data on the DDR increases quickly but the postcompressed (physical) utilization of the DDR remains static.
For example, the following DDR has utilization as follows (indicating overall compression ratio of ~1.8x):
Active Tier:
Resource Size GiB Used GiB Avail GiB Use% Cleanable GiB*
---------------- -------- -------- --------- ---- --------------
/data: pre-comp - 12.0 - - -
/data: post-comp 71.5 6.8 64.7 10% 0.0
/ddvar 49.2 1.1 45.6 2% -
/ddvar/core 158.5 0.2 150.2 0% -
---------------- -------- -------- --------- ---- --------------
It contains a large file (/data/col1/backup/testfile):
!!!! DDVE60_JF YOUR DATA IS IN DANGER !!!! # ls -al /data/col1/backup/testfile
-rw-r--r-- 1 root root 3221225472 Jul 29 04:20 /data/col1/backup/testfile
The file is fastcopied several times:
sysadmin@DDVE60_JF# filesys fastcopy source /data/col1/backup/testfile destination /data/col1/backup/testfile_copy1
sysadmin@DDVE60_JF# filesys fastcopy source /data/col1/backup/testfile destination /data/col1/backup/testfile_copy2
sysadmin@DDVE60_JF# filesys fastcopy source /data/col1/backup/testfile destination /data/col1/backup/testfile_copy3
This causes precompressed utilization to increase for little change in postcompressed utilization:
Active Tier:
Resource Size GiB Used GiB Avail GiB Use% Cleanable GiB*
---------------- -------- -------- --------- ---- --------------
/data: pre-comp - 21.0 - - -
/data: post-comp 71.5 6.8 64.7 10% 0.0
/ddvar 49.2 1.1 45.6 2% -
/ddvar/core 158.5 0.2 150.2 0% -
---------------- -------- -------- --------- ---- --------------
As a result the DDR now shows overall compression ratio of ~3.1x.
As mentioned above, the compression statistics of the copies show that they deduplicate perfectly:
sysadmin@DDVE60_JF# filesys show compression /data/col1/backup/testfile_copy1
Total files: 1; bytes/storage_used: 21331976.1
Original Bytes: 3,242,460,364
Globally Compressed: 0
Locally Compressed: 0
Meta-data: 152
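These figures again appear consistent with the relationship shown earlier, that is, logical bytes divided by the physical bytes consumed (Locally Compressed plus Meta-data):

# Quick Python check of the fastcopy statistics above
original = 3242460364
locally_compressed = 0      # the copy references existing segments; no new data is written
metadata = 152              # only a small amount of new metadata is created

print(round(original / (locally_compressed + metadata), 1))   # 21331976.1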
While fastcopy functionality cannot be used to improve the overall compression ratio by reducing physical utilization of the DDR, it can be the cause of a high overall compression ratio (especially in environments making extensive use of fastcopy, such as Avamar 6.x).