2 Intern

 • 

308 Posts

8776

October 13th, 2014 00:00

EMC Isilon SmartDedupe

Introduction

Version 7.1 of OneFS includes numerous enhancements that improve the performance of Isilon clusters, and SmartDedupe is one of them. The SmartDedupe software module enables deduplication, which saves storage space on a cluster by reducing redundant data. This article introduces the SmartDedupe software module and demonstrates how to configure deduplication on an Isilon cluster.

Detailed Information

Deduplication is applied at the subdirectory level and targets all files and directories underneath one or more root directories. As you write files to the cluster, some of those files, or blocks of data within them, might be duplicates. You can run a deduplication job that scans the file system to see whether the data already exists. After duplicate blocks are discovered, SmartDedupe moves a single copy of those blocks to a special set of files known as shadow stores. During this process, duplicate blocks are removed from the actual files and replaced with pointers to the shadow stores.
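To make the shadow store idea concrete, here is a toy Python sketch. It is not how OneFS implements SmartDedupe; the 8 KB block size, the SHA-256 fingerprint, and all function names are assumptions chosen only for illustration. Blocks that occur more than once are moved into a shared store and replaced with pointers, while unique blocks stay in the file.

```python
import hashlib
from collections import Counter

BLOCK_SIZE = 8192  # assumed block size for this toy model


def split_blocks(data):
    """Cut a byte string into fixed-size blocks (the last one may be short)."""
    return [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]


def fingerprint(block):
    """Fingerprint a block; SHA-256 is an illustrative choice."""
    return hashlib.sha256(block).hexdigest()


def dedupe(files):
    """files maps a file name to its bytes.

    Returns (layout, shadow_store). layout maps each file name to a list of
    ("shadow", fingerprint) or ("local", block) entries.
    """
    # Pass 1: count how often each block fingerprint occurs across all files.
    counts = Counter(
        fingerprint(b) for data in files.values() for b in split_blocks(data)
    )
    shadow_store = {}
    layout = {}
    # Pass 2: move one copy of each duplicate block into the shadow store
    # and leave a pointer behind; unique blocks remain local to the file.
    for name, data in files.items():
        entries = []
        for block in split_blocks(data):
            fp = fingerprint(block)
            if counts[fp] > 1:
                shadow_store.setdefault(fp, block)
                entries.append(("shadow", fp))
            else:
                entries.append(("local", block))
        layout[name] = entries
    return layout, shadow_store


def read_file(name, layout, shadow_store):
    """Reassemble a file by following its pointers into the shadow store."""
    return b"".join(
        shadow_store[value] if kind == "shadow" else value
        for kind, value in layout[name]
    )
```

Reading a deduplicated file in this model means following each pointer back to the shared copy, which is why the file contents are unchanged from the user's point of view.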

With post-process deduplication, new data is first stored on the storage device and then a subsequent process analyzes the data looking for commonality. This means that initial file write or modify performance is not impacted because no additional computation is required in the write path.

The deduplication job is set to run at low priority by default, so impact to your workflow should be minimal. However, it’s a good idea to wait until users have finished modifying their files on the cluster before you run the job.

You can perform the following tasks from the CLI to configure a SmartDedupe assessment and start deduplication.

Step 1: Log in to the Isilon cluster.

Step 2: Configure the SmartDedupe assessment directory.

The assessment directory in this example is /ifs/Demo_Data/Data/"Legal Discovery", so type isi dedupe settings modify --assess-paths /ifs/Demo_Data/Data/"Legal Discovery" and press Enter. This command configures which directory SmartDedupe will assess.


Step 3: Start the deduplication assessment job.

Type isi job jobs start dedupeassessment and then press Enter. This command starts the DedupeAssessment job.


Step 4: View active jobs.

Type isi job jobs list and then press Enter. This command displays information about active jobs. Wait for the DedupeAssessment job to finish; re-run the command to refresh the list. Once the job has finished, it no longer appears in the output of isi job jobs list.

Step 5: View the deduplication assessment report.

Type isi dedupe reports view (job ID from step 4) and then press Enter. This command displays the deduplication report. Note the Dedupe percent value: it is the percentage of scanned blocks that would be deduplicated.
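As a quick illustration of how to read that figure, the percentage can be sketched in Python as below. The function and parameter names are hypothetical and do not correspond to fields in the actual report; only the definition (deduplicated blocks as a share of scanned blocks) comes from the text above.

```python
def dedupe_percent(scanned_blocks, deduplicated_blocks):
    """Share of scanned blocks that are (or would be) deduplicated, in percent."""
    if scanned_blocks <= 0:
        raise ValueError("scanned_blocks must be positive")
    return 100.0 * deduplicated_blocks / scanned_blocks
```

For example, if an assessment scanned 1,000 blocks and found that 250 of them could be shared, the result would be 25 percent.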

Step 6: Specify the directory to deduplicate.

Type isi dedupe settings modify --paths /ifs/Demo_Data/Data/"Legal Discovery" and then press Enter. This command configures which directory SmartDedupe will deduplicate.

Step 7: Start the deduplication job.

Type isi job jobs start dedupe and then press Enter. This command starts the Dedupe job, which deduplicates the directory specified in step 6.

Step 8: View active jobs.

Type isi job jobs list and then press Enter.

Step 9: View the deduplication report.

Type isi dedupe reports view (job ID from step 8) and then press Enter. Dedupe percent is the percentage of scanned blocks that were deduplicated.

Step 10: View disk space savings.

Type isi dedupe stats and then press Enter. This command shows the amount of disk space that you are currently saving with deduplication. Observe the estimated physical saving: this is the total amount of physical disk space saved by deduplication, including protection overhead and metadata.
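The steps above can be collected into a small checklist generator. The Python sketch below only builds the command strings from this walkthrough and does not run anything on a cluster. The function name and the <job-id> placeholder are hypothetical; the real job IDs come from the isi job jobs list output. shlex.quote is used so that a path containing a space, such as the Legal Discovery directory, is passed as a single argument.

```python
import shlex


def dedupe_workflow(path):
    """Return the CLI commands from steps 2 to 10 as shell-ready strings."""
    quoted = shlex.quote(path)  # protects spaces in the directory name
    return [
        f"isi dedupe settings modify --assess-paths {quoted}",  # step 2
        "isi job jobs start dedupeassessment",                  # step 3
        "isi job jobs list",                                    # step 4
        "isi dedupe reports view <job-id>",                     # step 5
        f"isi dedupe settings modify --paths {quoted}",         # step 6
        "isi job jobs start dedupe",                            # step 7
        "isi job jobs list",                                    # step 8
        "isi dedupe reports view <job-id>",                     # step 9
        "isi dedupe stats",                                     # step 10
    ]


if __name__ == "__main__":
    for command in dedupe_workflow("/ifs/Demo_Data/Data/Legal Discovery"):
        print(command)
```

Printing the list rather than executing it keeps the sketch safe to run anywhere; each line can then be pasted into a cluster session once the job ID is known.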

Author: Jeffey Liu

Community Manager

 • 

7.4K Posts

 • 

67.8K Points

January 5th, 2017 01:00

hi ... can you please advise how to stop dedupe if the customer decides to?

thanks !

aya

2 Intern

 • 

308 Posts

January 5th, 2017 03:00

Please use the following command to stop deduplicating the specified root directory:

isi dedupe settings modify --remove-paths

Community Manager

 • 

7.4K Posts

 • 

67.8K Points

January 5th, 2017 16:00

hi !

thanks heaps !

aya

2 Intern

 • 

356 Posts

January 6th, 2017 08:00

Can you explain what the impact is when users want to read or write to a deduped file?  What performance hits can we expect on the user and systems side? What is the process the system goes through to make this happen?

Thank you,

2 Intern

 • 

308 Posts

January 9th, 2017 01:00

chjatwork

As with most things in life, deduplication is a compromise. In order to gain increased levels of storage efficiency, additional cluster resources (CPU, memory and disk IO) are utilized to find and execute the sharing of common data blocks.

Another important performance impact consideration with dedupe is the potential for data fragmentation. After deduplication, files that previously enjoyed contiguous on-disk layout will often have chunks spread across less optimal file system regions. This can lead to slightly increased latencies when accessing these files directly from disk, rather than from cache. To help reduce this risk, SmartDedupe will not share blocks across node pools or data tiers, and will not attempt to deduplicate files smaller than 32KB in size. On the other end of the spectrum, the largest contiguous region that will be matched is 4MB.
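The limits mentioned above can be written down as simple checks. This Python sketch only restates the three rules from this paragraph; the function names are hypothetical, and treating 32KB as 32 x 1024 bytes and 4MB as 4 x 1024 x 1024 bytes is an assumption.

```python
MIN_FILE_SIZE = 32 * 1024            # files smaller than this are skipped
MAX_MATCH_REGION = 4 * 1024 * 1024   # largest contiguous region that is matched


def is_dedupe_candidate(file_size_bytes):
    """Files below the minimum size are not considered for deduplication."""
    return file_size_bytes >= MIN_FILE_SIZE


def can_share_blocks(pool_a, pool_b):
    """Blocks are shared only within the same node pool or data tier."""
    return pool_a == pool_b


def clamp_match_length(length_bytes):
    """A matched region never exceeds the maximum contiguous size."""
    return min(length_bytes, MAX_MATCH_REGION)
```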

Because deduplication is a data efficiency product rather than performance enhancing tool, in most cases the consideration will be around cluster impact management. This is from both the client data access performance front, since, by design, multiple files will be sharing common data blocks, and also from the dedupe job execution perspective, as additional cluster resources are consumed to detect and share commonality.

The first deduplication job run will often take a substantial amount of time to run, since it must scan all files under the specified directories to generate the initial index and then create the appropriate shadow stores. However, deduplication job performance will typically improve significantly on the second and subsequent job runs (incrementals), once the initial index and the bulk of the shadow stores have already been created.

If incremental deduplication jobs do take a long time to complete, this is most likely indicative of a data set with a high rate of change. If a deduplication job is paused or interrupted, it will automatically resume the scanning process from where it left off.

As mentioned previously, deduplication is a long running process that involves multiple job phases that are run iteratively. SmartDedupe typically processes around 1TB of data per day, per node.
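For rough planning, that rate can be turned into a back-of-the-envelope estimate. The Python sketch below assumes the throughput scales linearly with node count and uses the 1TB per day, per node figure quoted above as a default; the function name is hypothetical and real job times depend on the data set and cluster load.

```python
def estimate_job_days(data_tb, node_count, tb_per_node_per_day=1.0):
    """Rough duration of a dedupe pass, assuming linear scaling across nodes."""
    if node_count <= 0 or tb_per_node_per_day <= 0:
        raise ValueError("node_count and tb_per_node_per_day must be positive")
    return data_tb / (node_count * tb_per_node_per_day)
```

Under these assumptions, 40TB of data on a 4-node cluster would take on the order of 10 days for the initial pass.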

Community Manager

 • 

7.4K Posts

 • 

67.8K Points

January 22nd, 2017 20:00

hi

thanks for the information ... BTW Do you know anybody had issue Pausing SmartDedup job and start aga

thanks

aya

2 Intern

 • 

308 Posts

January 22nd, 2017 21:00

Pardon me, what's aga?

Community Manager

 • 

7.4K Posts

 • 

67.8K Points

January 22nd, 2017 21:00

Oops ... sorry, my typo ...

Question is ...

Do you know of anybody who has had an issue pausing a SmartDedupe job and starting it again? (e.g. the job taking a long time to complete the second time, or similar)

2 Intern

 • 

308 Posts

January 22nd, 2017 22:00

ayas

SmartDedupe comprises five modules:

  • Deduplication control path
  • Deduplication Job
  • Deduplication Engine
  • Shadow Store
  • Deduplication Infrastructure

These modules are described in more detail below:

Untitled.png

SmartDedupe works on data sets that are configured at the directory level, targeting all files and directories under each specified root directory. The deduplication job therefore automatically ignores any directories that aren't listed in the deduplication settings. In other words, if no directory has been added to the deduplication settings, the Dedupe job will not scan any files.
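The scoping rule can be illustrated with a short Python sketch. It only models the behavior described here (a file is scanned if it sits under one of the configured root directories) and is not taken from OneFS; the function name is hypothetical.

```python
from pathlib import PurePosixPath


def is_in_dedupe_scope(file_path, dedupe_roots):
    """True if file_path lies under any configured root directory."""
    path = PurePosixPath(file_path)
    for root in dedupe_roots:
        root_path = PurePosixPath(root)
        # A file is in scope when a configured root is one of its ancestors.
        if root_path == path or root_path in path.parents:
            return True
    # With no roots configured, nothing is ever in scope.
    return False
```

With an empty list of roots the function always returns False, which matches the point that a Dedupe job with no configured paths scans nothing.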

1 Rookie

 • 

4 Posts

May 10th, 2022 06:00

What happens when dedupe is disabled? Does it remove deduplication from each existing file, or do only new files stop being deduplicated?

Moderator

 • 

7.9K Posts

 • 

45 Points

May 10th, 2022 12:00

Hello vgite,

What is your current OneFS version?
