ayas
Community Manager
•
7.4K Posts
0
January 5th, 2017 01:00
hi ... can you please advise how to stop dedupe if the customer decides to?
thanks !
aya
ECN-APJ
2 Intern
•
308 Posts
0
January 5th, 2017 03:00
Please use the following command to stop deduplicating the specified root directory:
isi dedupe settings modify --remove-paths <path>
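For example (the `/ifs/data/example` path below is only a placeholder; substitute the directory you actually configured):

```shell
# Hypothetical example: stop deduplicating one configured directory.
# Replace /ifs/data/example with the actual path on your cluster.
isi dedupe settings modify --remove-paths /ifs/data/example

# Then confirm which root directories remain configured for dedupe:
isi dedupe settings view
```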
ayas
Community Manager
•
7.4K Posts
0
January 5th, 2017 16:00
hi !
thanks heaps !
aya
chjatwork
2 Intern
•
356 Posts
0
January 6th, 2017 08:00
Can you explain what the impact is when users want to read or write to a deduped file? What performance hits can we expect on the user and systems side? What is the process the system goes through to make this happen?
Thank you,
ECN-APJ
2 Intern
•
308 Posts
0
January 9th, 2017 01:00
chjatwork
As with most things in life, deduplication is a compromise. In order to gain increased levels of storage efficiency, additional cluster resources (CPU, memory and disk IO) are utilized to find and execute the sharing of common data blocks.
Another important performance impact consideration with dedupe is the potential for data fragmentation. After deduplication, files that previously enjoyed contiguous on-disk layout will often have chunks spread across less optimal file system regions. This can lead to slightly increased latencies when accessing these files directly from disk, rather than from cache. To help reduce this risk, SmartDedupe will not share blocks across node pools or data tiers, and will not attempt to deduplicate files smaller than 32KB in size. On the other end of the spectrum, the largest contiguous region that will be matched is 4MB.
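The two thresholds above (32KB minimum file size, 4MB maximum contiguous match) can be sketched as simple checks. This is an illustrative model only; the function names are hypothetical and this is not how OneFS implements them internally:

```python
# Illustrative sketch of the SmartDedupe matching constraints described above.
# The 32 KB and 4 MB figures come from the text; everything else is made up.

MIN_FILE_SIZE = 32 * 1024            # files smaller than 32 KB are skipped
MAX_MATCH_REGION = 4 * 1024 * 1024   # largest contiguous region matched


def eligible_for_dedupe(file_size: int) -> bool:
    """A file qualifies for deduplication only if it meets the size floor."""
    return file_size >= MIN_FILE_SIZE


def split_match_region(region_size: int) -> list:
    """Break a run of common data into chunks no larger than 4 MB each."""
    chunks = []
    remaining = region_size
    while remaining > 0:
        chunk = min(remaining, MAX_MATCH_REGION)
        chunks.append(chunk)
        remaining -= chunk
    return chunks


print(eligible_for_dedupe(16 * 1024))        # a 16 KB file is skipped
print(split_match_region(10 * 1024 * 1024))  # 10 MB match -> 4 MB + 4 MB + 2 MB
```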
Because deduplication is a data efficiency product rather than a performance-enhancing tool, in most cases the consideration will be around cluster impact management. This applies both on the client data access front, since, by design, multiple files will be sharing common data blocks, and on the dedupe job execution front, as additional cluster resources are consumed to detect and share commonality.
The first deduplication job will often take a substantial amount of time to complete, since it must scan all files under the specified directories to generate the initial index and then create the appropriate shadow stores. However, deduplication job performance will typically improve significantly on the second and subsequent job runs (incrementals), once the initial index and the bulk of the shadow stores have already been created.
If incremental deduplication jobs do take a long time to complete, this is most likely indicative of a data set with a high rate of change. If a deduplication job is paused or interrupted, it will automatically resume the scanning process from where it left off.
As mentioned previously, deduplication is a long running process that involves multiple job phases that are run iteratively. SmartDedupe typically processes around 1TB of data per day, per node.
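The ~1TB/day/node figure quoted above allows a back-of-the-envelope duration estimate for a first full run. A minimal sketch, where the dataset size and node count are made-up example inputs, assuming throughput scales linearly with node count:

```python
# Rough estimate of initial dedupe job duration from the ~1 TB/day/node
# throughput figure quoted above. Inputs below are example values only.

def estimate_days(dataset_tb: float, nodes: int,
                  tb_per_node_per_day: float = 1.0) -> float:
    """Days for the initial scan, assuming per-node throughput adds up."""
    return dataset_tb / (nodes * tb_per_node_per_day)


# e.g. 60 TB of data on a 4-node cluster:
print(estimate_days(60, 4))  # 15.0 days
```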
ayas
Community Manager
•
7.4K Posts
0
January 22nd, 2017 20:00
hi
thanks for the information ... BTW Do you know anybody had issue Pausing SmartDedup job and start aga
thanks
aya
ECN-APJ
2 Intern
•
308 Posts
0
January 22nd, 2017 21:00
Pardon me, what's aga?
ayas
Community Manager
•
7.4K Posts
0
January 22nd, 2017 21:00
Oops ... sorry, my typo ...
Question is ...
Do you know if anybody had issues pausing a SmartDedupe job and starting it again? (e.g., the job taking a long time to complete the second time, or so)
ECN-APJ
2 Intern
•
308 Posts
0
January 22nd, 2017 22:00
ayas
SmartDedupe is comprised of five principal modules: the deduplication control path, the deduplication job, the deduplication engine, the shadow store, and the deduplication infrastructure.
SmartDedupe works on data sets which are configured at the directory level, targeting all files and directories under each specified root directory. So the deduplication job will automatically ignore directories that aren't listed in the deduplication settings. In other words, if no directories are added to the deduplication settings, the dedupe job will not scan any files.
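The directory-level scoping described above amounts to a path-prefix check. A hypothetical sketch (illustrative only, not how OneFS implements it, and the `/ifs` paths are made-up examples):

```python
# Illustrative model of dedupe job scoping: a file is only considered if it
# sits under one of the configured root directories.

from pathlib import PurePosixPath


def under_dedupe_root(path: str, roots: list) -> bool:
    """True if `path` falls inside any configured dedupe root directory."""
    p = PurePosixPath(path)
    return any(p.is_relative_to(root) for root in roots)


roots = ["/ifs/data/projects"]
print(under_dedupe_root("/ifs/data/projects/a.bin", roots))  # in scope
print(under_dedupe_root("/ifs/home/user/a.bin", roots))      # ignored
print(under_dedupe_root("/ifs/anything", []))                # empty settings: nothing is scanned
```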
v_g
1 Rookie
•
4 Posts
0
May 10th, 2022 06:00
What happens when dedupe is disabled? Does it remove deduplication from each file, or do only new files stop being deduplicated?
DELL-Sam L
Moderator
•
7.8K Posts
0
May 10th, 2022 12:00
Hello vgite,
What is your current OneFS version?