ayas
Community Manager
•
7.4K Posts
0
January 5th, 2017 01:00
hi ... can you please advise how to stop dedupe if the customer decides to?
thanks !
aya
ECN-APJ
2 Intern
•
308 Posts
0
January 5th, 2017 03:00
Please use the following command to stop deduplicating the specified root directory:
isi dedupe settings modify --remove-paths <path>
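For example (the `/ifs/data/example` path below is only a placeholder; substitute the directory you actually configured):

```shell
# Hypothetical example: stop deduplicating one configured directory.
# Replace /ifs/data/example with the actual path on your cluster.
isi dedupe settings modify --remove-paths /ifs/data/example

# Then confirm which root directories remain configured for dedupe:
isi dedupe settings view
```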
ayas
Community Manager
•
7.4K Posts
0
January 5th, 2017 16:00
hi !
thanks heaps !
aya
chjatwork
2 Intern
•
356 Posts
0
January 6th, 2017 08:00
Can you explain what the impact is when users want to read or write to a deduped file? What performance hits can we expect on the user and systems side? What is the process the system goes through to make this happen?
Thank you,
ECN-APJ
2 Intern
•
308 Posts
0
January 9th, 2017 01:00
chjatwork
As with most things in life, deduplication is a compromise. In order to gain increased levels of storage efficiency, additional cluster resources (CPU, memory and disk IO) are utilized to find and execute the sharing of common data blocks.
Another important performance impact consideration with dedupe is the potential for data fragmentation. After deduplication, files that previously enjoyed contiguous on-disk layout will often have chunks spread across less optimal file system regions. This can lead to slightly increased latencies when accessing these files directly from disk, rather than from cache. To help reduce this risk, SmartDedupe will not share blocks across node pools or data tiers, and will not attempt to deduplicate files smaller than 32KB in size. On the other end of the spectrum, the largest contiguous region that will be matched is 4MB.
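The two thresholds above (32KB minimum file size, 4MB maximum contiguous match) can be sketched as simple checks. This is an illustrative model only; the function names are hypothetical and this is not how OneFS implements them internally:

```python
# Illustrative sketch of the SmartDedupe matching constraints described above.
# The 32 KB and 4 MB figures come from the text; everything else is made up.

MIN_FILE_SIZE = 32 * 1024            # files smaller than 32 KB are skipped
MAX_MATCH_REGION = 4 * 1024 * 1024   # largest contiguous region matched


def eligible_for_dedupe(file_size: int) -> bool:
    """A file qualifies for deduplication only if it meets the size floor."""
    return file_size >= MIN_FILE_SIZE


def split_match_region(region_size: int) -> list:
    """Break a run of common data into chunks no larger than 4 MB each."""
    chunks = []
    remaining = region_size
    while remaining > 0:
        chunk = min(remaining, MAX_MATCH_REGION)
        chunks.append(chunk)
        remaining -= chunk
    return chunks


print(eligible_for_dedupe(16 * 1024))        # a 16 KB file is skipped
print(split_match_region(10 * 1024 * 1024))  # 10 MB match -> 4 MB + 4 MB + 2 MB
```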
Because deduplication is a data efficiency product rather than a performance-enhancing tool, in most cases the consideration will be around cluster impact management. This applies both on the client data access front, since, by design, multiple files will be sharing common data blocks, and on the dedupe job execution front, as additional cluster resources are consumed to detect and share commonality.
The first deduplication job will often take a substantial amount of time to complete, since it must scan all files under the specified directories to generate the initial index and then create the appropriate shadow stores. However, deduplication job performance will typically improve significantly on the second and subsequent job runs (incrementals), once the initial index and the bulk of the shadow stores have already been created.
If incremental deduplication jobs do take a long time to complete, this is most likely indicative of a data set with a high rate of change. If a deduplication job is paused or interrupted, it will automatically resume the scanning process from where it left off.
As mentioned previously, deduplication is a long running process that involves multiple job phases that are run iteratively. SmartDedupe typically processes around 1TB of data per day, per node.
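The ~1TB/day/node figure quoted above allows a back-of-the-envelope duration estimate for a first full run. A minimal sketch, where the dataset size and node count are made-up example inputs, assuming throughput scales linearly with node count:

```python
# Rough estimate of initial dedupe job duration from the ~1 TB/day/node
# throughput figure quoted above. Inputs below are example values only.

def estimate_days(dataset_tb: float, nodes: int,
                  tb_per_node_per_day: float = 1.0) -> float:
    """Days for the initial scan, assuming per-node throughput adds up."""
    return dataset_tb / (nodes * tb_per_node_per_day)


# e.g. 60 TB of data on a 4-node cluster:
print(estimate_days(60, 4))  # 15.0 days
```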
ayas
Community Manager
•
7.4K Posts
0
January 22nd, 2017 20:00
hi
thanks for the information ... BTW Do you know anybody had issue Pausing SmartDedup job and start aga
thanks
aya
ECN-APJ
2 Intern
•
308 Posts
0
January 22nd, 2017 21:00
Pardon me, what's aga?
ayas
Community Manager
•
7.4K Posts
0
January 22nd, 2017 21:00
Oops ... sorry, my typo ...
Question is ...
Do you know if anybody had issues pausing a SmartDedupe job and starting it again? (e.g., the job taking a long time to complete the second time, or so)
ECN-APJ
2 Intern
•
308 Posts
0
January 22nd, 2017 22:00
ayas
SmartDedupe is comprised of five principal modules: the deduplication control path, the deduplication job, the deduplication engine, the shadow store, and the deduplication infrastructure.
SmartDedupe works on data sets which are configured at the directory level, targeting all files and directories under each specified root directory. So the deduplication job will automatically ignore directories that aren't listed in the deduplication settings. In other words, if no directories are added to the deduplication settings, the dedupe job will not scan any files.
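The directory-level scoping described above amounts to a path-prefix check. A hypothetical sketch (illustrative only, not how OneFS implements it, and the `/ifs` paths are made-up examples):

```python
# Illustrative model of dedupe job scoping: a file is only considered if it
# sits under one of the configured root directories.

from pathlib import PurePosixPath


def under_dedupe_root(path: str, roots: list) -> bool:
    """True if `path` falls inside any configured dedupe root directory."""
    p = PurePosixPath(path)
    return any(p.is_relative_to(root) for root in roots)


roots = ["/ifs/data/projects"]
print(under_dedupe_root("/ifs/data/projects/a.bin", roots))  # in scope
print(under_dedupe_root("/ifs/home/user/a.bin", roots))      # ignored
print(under_dedupe_root("/ifs/anything", []))                # empty settings: nothing is scanned
```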
v_g
1 Rookie
•
4 Posts
0
May 10th, 2022 06:00
What happens when dedupe is disabled? Does it remove deduplication from each file, or do only new files stop being deduplicated?
DELL-Sam L
Moderator
•
7.8K Posts
0
May 10th, 2022 12:00
Hello vgite,
What is your current OneFS version?