
Data Domain: An overview of Data Domain File System clean/garbage collection phases

Summary: This article provides an overview of the phases of Data Domain File System (DDFS) cleaning/garbage collection (GC) and describes the differences between the clean algorithms used in various releases of the Data Domain Operating System (DDOS).


Article Content


Symptoms

The Data Domain File System (DDFS) differs from many common file system implementations in that, when a file is deleted, the space used by that file is not immediately available for reuse. This is because the Data Domain Restorer (DDR) does not immediately know whether data referenced by the deleted file is also deduplicated against other files, and therefore whether it is safe to remove that data.

Cleaning (sometimes known as garbage collection or GC) is the process by which a DDR:
  • Determines which data on disk is superfluous (that is, no longer referenced by objects such as files or snapshots)
  • Physically removes superfluous data, making the underlying disk space available for reuse (that is, for ingestion of new data)
Clean/GC is commonly scheduled to run at regular intervals (by default it starts at 6am every Tuesday) and can be:
  • Long running
  • Computationally expensive
Note, however, that running clean/GC is the only way to remove data and physically free space on a Data Domain Restorer (that is, there are no shortcuts to speed up this process).

This article describes clean/GC in more detail explaining:
  • The phases through which clean generally runs
  • The different clean algorithms used in various versions of DDOS

Cause

None

Resolution

Each time clean/GC runs it has two main purposes. First, it must find superfluous data on the DDR - a brief overview of how this is done is as follows:
  • Clean/GC enumerates the contents of the DDFS file system looking for objects such as files, snapshots, and replication logs which currently exist on the system
  • It then determines all the physical data on disk which is actively referenced by these objects
  • Data which is actively referenced is said to be 'live' and cannot be removed from the DDR; otherwise the objects referencing this data would be damaged (they could no longer be read, as the underlying data they depend on would no longer exist on disk)
  • Data which is not actively referenced by any object is said to be 'dead' and is superfluous - this data can be safely removed from the system
  • All data on a DDR is packed into objects 4.5 MB in size known as containers
  • Through enumeration clean/GC can determine which 4.5 MB containers hold dead data and the amount of dead data in each
  • By default clean/GC selects 4.5 MB containers holding > 8% dead data for 'processing'
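The selection rule above can be sketched in a few lines. This is an illustrative model only: the 8% threshold and 4.5 MB container size come from this article, but the function and data structures are invented for the example and do not reflect any real DDFS interface.

```python
# Illustrative sketch of clean/GC container selection (not a real DDFS API).
CONTAINER_SIZE_MB = 4.5
DEAD_THRESHOLD = 0.08  # containers with > 8% dead data are selected by default

def select_containers(containers):
    """Return the ids of containers whose dead-data fraction exceeds the
    threshold. `containers` is a list of (container_id, dead_fraction)
    pairs as might be produced by the enumeration phases."""
    return [cid for cid, dead in containers if dead > DEAD_THRESHOLD]

# Example: only containers 2 and 3 exceed 8% dead data (4 is exactly 8%, not above)
containers = [(1, 0.02), (2, 0.10), (3, 0.95), (4, 0.08)]
print(select_containers(containers))  # → [2, 3]
```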
Second, it must remove dead data from the DDR - a brief explanation of how this is done is as follows:
  • Containers selected for processing are checked again to confirm that they do hold a sufficient amount of dead data
  • Live data is extracted from these containers and written to new 4.5 MB containers at the end of the file system
  • Once this is complete the selected containers (including the dead data they contain) are deleted from disk, physically freeing disk space
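The three removal steps above can be simulated with a toy model. Everything here (the container map, segment ids, function name) is invented purely for illustration; DDFS does not expose such an interface.

```python
# Toy simulation of the copy step: extract live segments from the selected
# containers, delete those containers, and write the live data into a new
# container at the end of the container set.

def run_copy(containers, selected_ids):
    """`containers` maps container id -> list of live segment ids.
    Returns the container map after the copy step."""
    # 1. Extract live data from the selected containers
    live = [seg for cid in selected_ids for seg in containers[cid]]
    # 2. Delete the selected containers (their dead data goes with them)
    survivors = {cid: segs for cid, segs in containers.items()
                 if cid not in selected_ids}
    # 3. Write the extracted live data into a new container at the end
    if live:
        survivors[max(containers) + 1] = live
    return survivors

before = {1: ["a"], 2: [], 3: ["b", "c"]}
print(run_copy(before, {2, 3}))  # → {1: ['a'], 4: ['b', 'c']}
```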
The clean process is split into several 'phases' with the total phase count dependent on:
  • The version of DDOS being used on the DDR (hence the clean algorithm used, by default, by that version of DDOS)
  • The configuration or contents of the system
In general, however, the process of finding 'dead' data and selecting corresponding containers takes place across several phases whereas removal of dead data takes place in a single phase known as 'copy'. For example, certain versions of DDOS would run clean phases as follows:
  1. Pre-enumeration - enumerate the contents of the DDFS file system
  2. Pre-merge - perform a DDFS index merge to ensure that the latest copy of index information is flushed to disk
  3. Pre-filter - determine if there is duplicate data within the DDFS file system and if so where this is
  4. Pre-select - determine which 4.5 MB containers should be 'processed' by cleaning
  5. Copy - physically extract live data from selected containers, write this to new containers, then delete the selected containers
  6. Summary - rebuild summary vectors (which are used as an optimization during ingestion of new data)
In the above example, phases 1-4 are used to determine where 'dead' data exists on the DDR; these are referred to as 'enumeration phases' in the remainder of this document. Phase 5 (copy) is used to physically remove this data.

No space is physically freed on the system until clean/GC reaches the copy phase. As a result there may be a significant delay between clean being started and space starting to be freed (because enumeration must first run to completion). For this reason, systems should not be allowed to become 100% full before clean/GC is started.

Enumeration phases tend to be expensive in terms of CPU utilization (that is, they are generally CPU bound) whereas the copy phase is expensive in terms of both CPU and I/O (that is, it is generally CPU and I/O bound). In summary, however, it is possible to say that:
  • The total length of enumeration phases depends on the amount of data on the DDR which must be enumerated
  • The total length of the copy phase depends on the amount of dead data on the DDR which must be removed and how 'fragmented' that data is on disk (discussed further below)
The number/functionality of enumeration phases depends on the release of DDOS used on a DDR.

DDOS 5.4 (and earlier) - full clean algorithm: Runs 6 or 10 phases (as shown above):
  • The contents of the DDFS file system are enumerated top-down (that is, enumeration is file-centric)
  • DDFS discovers all files which exist on the DDR then scans each file in turn to determine which data is referenced by that file
  • This allows clean/GC to determine which data on disk is 'live'
DDOS 5.5 (and later) - physical clean algorithm (PGC): Runs 7 or 12 phases:
  • The contents of the DDFS file system are enumerated bottom-up (that is, individual files are no longer scanned)
  • DDFS discovers file system metadata which references physical data on disk and scans that metadata to determine which data is referenced
  • This allows clean/GC to determine which data on disk is 'live'
  • This is achieved by the addition of an 'analysis' phase (hence the increase in phase count over the full clean algorithm)
  • Usually, the total duration of physical clean is expected to be shorter than that of full clean on the same system (despite consisting of more individual phases)
DDOS 6.0 (and later) - perfect physical clean algorithm (PPGC):
  • This is simply an optimization to the physical clean algorithm and is discussed further below
DDOS switched from the full clean algorithm to the physical clean algorithm to improve scalability/performance of the enumeration process - due to the top down nature of the full clean algorithm it did not scale well on DDRs with either:
  • Many small files (as the context switch when moving from enumeration of one file to the next was expensive/slow)
  • High deduplication ratio (as multiple files referenced the same physical data so the same data was enumerated multiple times)
DDRs switch from the full to the physical clean algorithm automatically when upgraded from DDOS 5.4 (or earlier) to 5.5 (or later). The only exception is systems configured with extended retention, where the contents of the DDFS file system must be checked for 'spanning' files before physical clean can be enabled. A discussion of this process is beyond the scope of this document; however, the check runs automatically following the upgrade, and physical clean is enabled on its completion with no manual action required.

Similarly, DDRs switch from the physical to the perfect physical clean algorithm automatically when upgraded from DDOS 5.x to 6.0 (or later). Note, however, that the perfect physical clean algorithm requires indexes to be in the 'index 2.0' format before it can be used. Note that:
  • The 'index 2.0' format was introduced with DDOS 5.5 (so all file systems created on 5.5 or later will already be using index 2.0)
  • File systems created on 5.4 or earlier will initially have had indexes in the index 1.0 format. Once upgraded to DDOS 5.5 (or later) the indexes are converted to the index 2.0 format. Conversion happens every time clean runs; however, only ~1% of indexes are converted during each clean, so it may take up to 2 years (assuming clean runs weekly) to fully convert indexes to the 2.0 format
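A back-of-the-envelope check of the conversion timeline above: at ~1% of indexes per clean and one clean per week, full conversion takes on the order of 100 cleans, or roughly two years. The 1% rate is taken from this article; nothing below is a real DDFS interface.

```python
# Rough arithmetic for the gradual index 1.0 -> 2.0 conversion.
fraction_per_clean = 0.01                        # ~1% of indexes per clean run
cleans_needed = round(1 / fraction_per_clean)    # ≈ 100 weekly clean runs
years_needed = round(cleans_needed / 52, 1)      # ≈ 1.9 years at one clean per week
print(cleans_needed, years_needed)  # → 100 1.9
```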
DDRs initially running DDOS 5.4 (or earlier) which have subsequently been upgraded to DDOS 5.5 (or later) can be forced to convert indexes to the index 2.0 format by a one-time 'index rebuild'. Note, however, that an index rebuild requires a period of downtime whilst indexes are physically rebuilt - this operation generally takes 2-8 hours to complete depending on the size/amount of data on the DDR. To discuss performing an index rebuild, contact your contracted support provider.

Regardless of the algorithm used, clean may require a variable number of phases - for example the full clean algorithm may require 6 or 10 phases. The reason for this is that:
  • When DDFS is started, it reserves a fixed amount of memory to be used by clean/GC
  • Within this memory clean/GC creates data structures to describe the results of enumeration (that is describe where live vs dead data exists on disk)
  • When a DDR contains a relatively small amount of data, the entire contents of the DDFS file system can be described in this area of memory
  • On many systems, however, this is not possible and this area of memory would become exhausted before the entire contents of the DDFS file system was enumerated
  • As a result these systems perform 'sampling' which increases the number of clean phases required
When sampling is used clean/GC will:
  • Perform a sampling pass of enumeration across the entire file system - note that this enumeration is not 'complete' (that is it does not record full information about each part of the file system but instead approximates information for each part of the file system)
  • Use this sampling information to determine which part of the DDFS file system would benefit most from having clean/GC run against it (that is which part of the file system would give the best returns in terms of space being freed if it were cleaned)
  • Perform a second round of complete enumeration against the selected part of the file system whose contents can now be fully described within memory reserved for GC
Sampling is automatically enabled during clean/GC if required; however, it causes:
  • An increase in the number of phases performed by GC
  • A corresponding increase in the total duration of GC
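The sampling decision described above can be modelled as a simple memory-budget check. The real sizing of GC memory and enumeration structures is internal to DDFS; the numbers and names below are purely illustrative.

```python
# Toy model of the sampling decision: if enumeration structures for the
# whole data set would not fit in the memory reserved for clean/GC, a
# sampling pass (and hence extra clean phases) is required.

def needs_sampling(segments_on_disk, bytes_per_segment_entry, gc_memory_bytes):
    """True if describing every segment would exhaust the GC memory budget."""
    return segments_on_disk * bytes_per_segment_entry > gc_memory_bytes

# Small data set fits in the (hypothetical) 4 GiB budget; large one does not
print(needs_sampling(10**8, 8, 4 * 2**30))   # → False (800 MB needed)
print(needs_sampling(10**10, 8, 4 * 2**30))  # → True  (80 GB needed)
```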
Prior to DDOS 6.0 the majority of DDRs performed sampling during GC (unless they held a relatively small dataset). The perfect physical clean algorithm, however, includes various optimizations to reduce the amount of memory required by GC when enumerating data within the file system. This means that many systems which performed sampling during GC on DDOS 5.x no longer require sampling on DDOS 6.0 - this reduces the number of phases performed by clean and causes a corresponding decrease in total clean run time (that is, an improvement in clean performance).

There is no way to directly determine that a system has switched from the physical clean algorithm to the perfect physical clean algorithm other than that:
  • When the system was running physical clean on DDOS 5.5 - 5.7 it performed 12 phases during clean
  • Following an upgrade to DDOS 6.0 (or later) it performs only 7 phases during clean
If a system running DDOS 6.0 still has to perform sampling, this is enabled automatically during clean and the system falls back to running 12 phases.

Regardless of clean algorithm used the copy phase (where dead data is physically removed from the system) functions in a similar manner across all releases. Performance of the copy phase is generally dependent on:
  • The amount of 'dead' data which has to be removed
  • The 'fragmentation' of this dead data (that is how it is spread across disk)
As described above, copy works by selecting 4.5 MB containers which hold dead data, extracting any live data from those containers and writing that live data to new containers, then deleting the originally selected containers. The following examples describe why fragmentation of dead data is important:

Example 1:
  • Ten containers are selected for copy (45 MB total data)
  • All of these containers hold no live data (that is, the data they hold is completely unreferenced)
  • As a result copy only has to mark these containers as deleted to free 45 MB of physical space on disk
Example 2:
  • 100 containers are selected for copy (450 MB total data)
  • Each of these containers holds 90% live data/10% dead data
  • To process these containers copy has to:
      1. Read the 90% live data from all 100 containers (405 MB of data).
      2. Create a set of new containers at the end of the file system to hold this 405 MB of data.
      3. Write this 405 MB of data to these containers and update structures such as indexes accordingly.
      4. Mark the 100 selected containers as deleted, hence freeing 45 MB of physical space on disk.

More I/O and CPU are required to perform the copy described in example 2 than in example 1, hence it takes longer to free the same 45 MB of physical space on disk.
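The two examples can be expressed as a small calculation showing why fragmented dead data is so much more expensive to reclaim: both cases free the same 45 MB, but example 2 must also read and rewrite 405 MB of live data. Container size and percentages are taken from the examples above; the function itself is illustrative only.

```python
# Cost model for the copy phase: rewritten live data vs reclaimed space.
CONTAINER_SIZE_MB = 4.5

def copy_cost(num_containers, live_pct):
    """Return (MB of live data that must be read and rewritten,
    MB of physical space freed) for a set of selected containers
    that each hold `live_pct` percent live data."""
    total = num_containers * CONTAINER_SIZE_MB
    rewritten = total * live_pct / 100
    return rewritten, total - rewritten

print(copy_cost(10, 0))    # Example 1 → (0.0, 45.0): nothing to rewrite
print(copy_cost(100, 90))  # Example 2 → (405.0, 45.0): 405 MB rewritten for the same 45 MB
```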

Users generally have no control over the 'fragmentation' of dead data on disk on a DDR, as this depends very much on the use case/type of data being written to the system. Note, however, that clean/GC maintains statistics which help determine the 'fragmentation' of dead data encountered during the copy phase (and which therefore allow a user to determine whether fragmentation can explain a potentially long-running copy phase). Such statistics from the latest run of clean/GC are collected in autosupports. For example, the following shows a copy phase where dead data was fairly contiguous (that is, most containers selected for copy held mostly dead data):

Percentage of live data in copied:              3.6%          4.3%

Conversely the following shows a copy phase where dead data was fragmented (that is most containers selected for copy held mostly live data):

Percentage of live data in copied:             70.9%         71.5%

As described above, clean/GC has to perform comparatively much more work in the second scenario to physically free space on the DDR, which causes the throughput of the copy phase to reduce.

Copy phase throughput can also be adversely affected by:
  • The use of encryption: Data may need to be decrypted/re-encrypted during copy which significantly increases the amount of CPU required
  • The use of low bandwidth optimization: Containers may need 'sketch' information to be generated during copy which also causes a significant increase in the amount of CPU required
Note that if low bandwidth optimization and/or encryption has recently been enabled, all existing containers (regardless of whether they are selected for copy or not) may need to be encrypted and/or have sketch information generated against them during subsequent cleans - this can cause the clean operation (specifically the copy phase) to take longer than normal.

Additional Information

Further notes on checking/modifying clean schedule and throttle are available in the following KB article: Data Domain - Scheduling Cleaning on a DDR

Note however that:
  • In normal circumstances clean should be scheduled to run at most once per week - running clean more frequently than this can cause data on disk to become excessively 'fragmented' (that is exhibit poor spatial locality) which can result in poor read/replication/data movement performance
  • Clean throttle does not affect the total amount of CPU and I/O bandwidth consumed by clean - instead it controls how sensitive clean is to other workload on the system. For example:
    ◦ A DDR with a clean throttle of '1' (that is, the lowest/least aggressive possible throttle setting) will still use significant CPU and I/O whilst clean is running. It should, however, immediately back off and release resources as soon as the DDR experiences any other workload.
    ◦ A DDR with a clean throttle of '100' (that is, the highest/most aggressive possible throttle setting) will use significant CPU and I/O whilst clean is running and will not release resources even if the DDR is subject to other workload (in this scenario it is highly likely that running clean will cause significant degradation to the performance of ingest/restore/replication operations).
  • By default clean throttle is set to 50 - it is the responsibility of the user to test running clean with different throttle settings whilst the DDR experiences normal workload to determine a setting which allows:
    ◦ Clean to run in the minimum amount of time possible.
    ◦ Clean to run without causing excessive degradation to the performance of other workload.
  • A long-running clean is not necessarily an issue as long as:
    ◦ Clean is able to fully complete between its scheduled start times (that is, if clean is scheduled to start at 6am on Tuesdays it should complete before 6am the following Tuesday).
    ◦ The system has sufficient free space so as not to become full before clean reaches its copy phase (and space starts to be reclaimed).
    ◦ Clean does not cause excessive degradation to the performance of other workload whilst it runs.
  • Systems using extended retention functionality should be configured such that:
    ◦ Data movement from the active tier to the archive tier is scheduled to run at regular intervals (for example once a week)
    ◦ Active tier clean is scheduled to run on completion of data movement
    ◦ Active tier clean does not have its own/independent schedule (as this can cause excessive cleaning to take place)
  • Full information from the latest clean operation is included in autosupports and details:
    ◦ An overview of the phases run during clean
    ◦ The duration and throughput of each phase of clean
    ◦ Detailed statistics for each phase of clean

For example:
 
GC stats for Physical Cleaning on Active Success 39 Aborted 0
Most recent successful GC container range: 15925661 to 62813670
GC phase:        pre-merge      time:     133 average:     154  seg/s:        0 cont/s:       0
GC phase:     pre-analysis      time:    1331 average:    1768  seg/s:        0 cont/s:       0
GC phase:  pre-enumeration      time:   34410 average:   31832  seg/s:  1471833 cont/s:       0
GC phase:       pre-filter      time:    2051 average:    1805  seg/s:  1988827 cont/s:       0
GC phase:       pre-select      time:    2770 average:    2479  seg/s:  1472593 cont/s:    2675
GC phase:            merge      time:     111 average:      69  seg/s:        0 cont/s:       0
GC phase:         analysis      time:    1350 average:     900  seg/s:        0 cont/s:       0
GC phase:        candidate      time:    1478 average:     739  seg/s:  6833465 cont/s:    2156
GC phase:      enumeration      time:   37253 average:   20074  seg/s:  5490502 cont/s:       0
GC phase:           filter      time:    1667 average:     910  seg/s:  9787652 cont/s:       0
GC phase:             copy      time:   52164 average:   49496  seg/s:        0 cont/s:      61
GC phase:          summary      time:    2840 average:    2427  seg/s:  5552869 cont/s:    2501

GC analysis phase details:                             Recent                   Cumulative
Number of Segments in Index:                                    16316022459             572186212855
Unique Segment count iterated:                                    494653358             319255282440
Unique Lp Segment count:                                          494653866              17879171482
Delay buffer reallocated count:                                           0                        0
Index fully upgraded:                                                     1                       16
Only Scan For Lps:                                                        1                       39
Max Lp segment count supported:                                 18105971430             706132885747
...

Article Properties


Affected Product

Data Domain

Product

Data Domain

Last Published Date

11 Dec 2023

Version

4

Article Type

Solution