
Introduction to De-duplication

January 15th, 2014 06:00

Introduction

De-duplication is a technique that finds and eliminates duplicate data during backup to improve backup efficiency.

This article introduces the levels and types of de-duplication, explains how de-duplication works, and describes the benefits it brings.

Detailed Information

First let’s have a look at a simple question:

How many kinds of fruit appear below?

[Figure: an assortment of fruits]



Out of all the pieces shown, a child could pick out 12 unique kinds of fruit. That is de-duplication at its simplest. De-duplication looks simple, but it can be complicated in live production environments, and the de-duplication rate and efficiency of different products can differ dramatically.
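
In code, the child's task is just this (a minimal Python sketch; the fruit names are made-up stand-ins, not the ones in the picture):

    # De-duplication at its simplest: keep one copy of each distinct item.
    basket = ["apple", "pear", "banana", "apple", "grape", "banana", "pear"]
    unique_kinds = set(basket)        # duplicates collapse automatically
    print(len(unique_kinds))          # -> 4 distinct kinds in this toy basket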

Now let’s have a look at how de-duplication works in live production environments.

De-duplication levels:

De-duplication reduces storage requirements by recognizing duplicate or redundant data. It operates at several levels, and each level differs in its ability to recognize redundancy.


  • File-based de-duplication: If any part of a file is modified, the entire file is treated as a new file and stored again. If nothing in the file has changed, the file is considered redundant and is not stored again; instead, a pointer to the already-stored copy is created, and the pointer and metadata are retained. When the file needs to be recovered, it can be reconstructed from the unique copy together with its pointer and metadata.

[Figure: file-based de-duplication]


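As a rough Python sketch of the file-level idea (illustrative only, not any product's implementation; SHA-256 as the whole-file fingerprint is an assumption):

    import hashlib

    store = {}      # content hash -> stored file bytes (the unique copies)
    catalog = {}    # file path -> content hash (the "pointer" plus metadata)

    def backup_file(path: str) -> None:
        """File-level de-duplication: an unchanged file is stored once;
        any modification, however small, stores the whole file again."""
        with open(path, "rb") as f:
            data = f.read()
        digest = hashlib.sha256(data).hexdigest()
        if digest not in store:      # new or modified file: keep a full copy
            store[digest] = data
        catalog[path] = digest       # pointer + metadata used for recovery

    def restore_file(path: str) -> bytes:
        return store[catalog[path]]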

  • Fixed segment de-duplication: Often used in snapshot and replication technology. A file is divided into fixed-size segments, which recognizes redundant data at a finer granularity than whole files. However, because the segments have a fixed size, even a small modification can change every segment that follows it: as the figure shows, when data is inserted, the rest of the data stream must shift to make room, so all subsequent segments change. The efficiency of this kind of de-duplication is therefore limited.

[Figure: fixed segment de-duplication]


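The shift problem is easy to demonstrate. In the toy sketch below (4-byte segments and truncated SHA-256 fingerprints are arbitrary choices for illustration), inserting a single byte at the front leaves no fixed segment unchanged:

    import hashlib

    SEGMENT = 4  # tiny fixed segment size, for illustration only

    def fixed_segments(data: bytes):
        return [data[i:i + SEGMENT] for i in range(0, len(data), SEGMENT)]

    def fingerprints(segments):
        return [hashlib.sha256(s).hexdigest()[:8] for s in segments]

    old = b"AAAABBBBCCCCDDDD"
    new = b"XAAAABBBBCCCCDDDD"   # one byte inserted at the front

    old_fp = fingerprints(fixed_segments(old))
    new_fp = fingerprints(fixed_segments(new))
    # The insertion shifts every byte after it, so no fixed segment matches:
    print(sum(fp in old_fp for fp in new_fp))  # -> 0 duplicate segments found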

  • Variable segment size de-duplication: An improvement on fixed segment de-duplication, it uses the data content itself to determine segment boundaries intelligently. Variable segment size de-duplication recognizes duplicate data at a finer granularity and avoids the inefficiency of the file-based and fixed segment approaches: when data is inserted, only the affected segments change and the rest of the data stream does not need to shift, so more segments are recognized as identical and less data has to be stored. As with fixed segment de-duplication, pointers to the unique segments and metadata are retained. EMC Avamar and Data Domain use variable segment size de-duplication.

[Figure: variable segment size de-duplication]


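Here is a toy content-defined chunking sketch: boundaries are derived from a hash of a small sliding window of the data itself, so an insertion disturbs only nearby chunks and the boundaries resynchronize afterwards. Real products such as Avamar and Data Domain use far more sophisticated rolling fingerprints; the window size and divisor below are arbitrary illustration values.

    import hashlib

    WINDOW, DIVISOR = 4, 7   # toy parameters; real systems tune these carefully

    def content_defined_chunks(data: bytes):
        """Cut a chunk wherever a hash of the last WINDOW bytes hits a magic
        value, so boundaries depend on content, not on byte offsets."""
        chunks, start = [], 0
        for i in range(WINDOW, len(data) + 1):
            window = data[i - WINDOW:i]
            if int.from_bytes(hashlib.sha256(window).digest()[:4], "big") % DIVISOR == 0:
                chunks.append(data[start:i])
                start = i
        if start < len(data):
            chunks.append(data[start:])
        return chunks

    old = content_defined_chunks(b"the quick brown fox jumps over the lazy dog")
    new = content_defined_chunks(b"X" + b"the quick brown fox jumps over the lazy dog")
    # Boundaries resynchronize after the insertion, so most chunks still match:
    print(len(set(old) & set(new)), "of", len(old), "old chunks reused")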

De-duplication types:

Depending on where the de-duplication work is performed, de-duplication can be divided into two types:


  • Source-based de-duplication: Redundant data is recognized on the client (data source) side, so only de-duplicated data is transmitted over the network and stored on the backup disk. EMC Avamar provides this type of de-duplication.

[Figure: source-based de-duplication]


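A sketch of the source-side flow under simplifying assumptions (fixed 8-byte chunks and an in-process "server"; the class and function names are hypothetical, not Avamar's API): the client fingerprints chunks locally, and only chunks the target has never seen cross the wire.

    import hashlib

    class BackupServer:
        """Stands in for the backup target; it holds the unique chunks."""
        def __init__(self):
            self.chunks = {}                 # fingerprint -> chunk bytes
        def missing(self, digests):
            return {d for d in digests if d not in self.chunks}
        def put(self, digest, chunk):
            self.chunks[digest] = chunk

    def source_side_backup(data: bytes, server: BackupServer, size: int = 8):
        """Fingerprint on the client; send only chunks the server lacks."""
        chunks = [data[i:i + size] for i in range(0, len(data), size)]
        digests = [hashlib.sha256(c).hexdigest() for c in chunks]
        for d in server.missing(digests):          # one query instead of bulk data
            server.put(d, chunks[digests.index(d)])  # only new chunks are sent
        return digests                             # the recipe to reassemble the file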

  • Target-based de-duplication: Data is sent from the client to the destination storage device, and de-duplication is performed there. This is where the differences between vendors' de-duplication products become apparent. Many vendors' products must first land the data in a temporary staging area and then de-duplicate it, and they cannot compress the data after de-duplication; this not only requires additional disks but also takes extra effort to manage data in different states across the data pools. With EMC Data Domain, by contrast, 99% of the redundancy analysis is performed in memory; only the small portion of data that cannot be identified in memory is compared against data already stored on disk. Because de-duplication is almost entirely done in memory and rarely accesses the disk, it is very fast. If a segment is recognized as one already stored, it is not stored again; a pointer to it is created instead. Data Domain then compresses the de-duplicated data and stores it to disk on the destination device.

[Figure: target-based de-duplication]


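The order of operations just described can be sketched as follows. This is a toy model, not Data Domain's actual architecture (SHA-256 fingerprints, zlib compression, and a plain dict as the in-memory index are all assumptions):

    import hashlib
    import zlib

    class TargetDedupeDevice:
        """Toy inline target-side de-duplication: fingerprints are checked
        in an in-memory index; only new segments are compressed and written."""
        def __init__(self):
            self.index = {}       # fingerprint -> location of stored segment
            self.disk = []        # compressed unique segments (stand-in disk)

        def ingest(self, segment: bytes) -> int:
            digest = hashlib.sha256(segment).hexdigest()
            if digest in self.index:                  # seen before: pointer only
                return self.index[digest]
            self.disk.append(zlib.compress(segment))  # compress after de-dup
            self.index[digest] = len(self.disk) - 1
            return self.index[digest]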

Advantages of de-duplication:

  • Save disk space: After de-duplication, the amount of stored data can be reduced by a factor of dozens, saving disk space to a great extent.
  • Improve backup speed: Because de-duplication greatly reduces the amount of backup data, the backup window is shortened.
  • Lower network load: With source-based de-duplication, far less data needs to be transmitted, so network bandwidth demands drop sharply.
  • Speed up data recovery: Since the amount of data shrinks so much after de-duplication, long-term retention on disk becomes practical, and reading data from disk instead of tape speeds up recovery.
  • Improve data protection: Without de-duplication, backup-window limits usually allow only a weekly full backup plus daily incremental backups. With de-duplication, a more aggressive strategy, such as a daily full backup, becomes feasible.
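
As a back-of-envelope illustration of the space saving, with assumed numbers (a 10 TB full backup, 1% daily change) rather than measurements:

    full_backup = 10_000      # GB in one full backup (assumed)
    daily_change = 0.01       # fraction of data that changes per day (assumed)
    days = 7

    without_dedup = full_backup * days                          # seven full copies
    with_dedup = full_backup * (1 + daily_change * (days - 1))  # unique data only
    print(f"{without_dedup / with_dedup:.1f}x less disk")       # ~6.6x for a week of fulls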

Author: Tim Quan


iEMC APJ

Please click here for all contents shared by us.
