January 5th, 2011 05:00
Deduplication questions: single instance only, forcing a complete scan, and updating fs_dedup figures
Hello,
A few questions raised by the deduplication tests I'm currently running.
1/ After an initial fs_dedup "-state on" run, I can check dedup efficiency figures with the -i option.
Unfortunately, if I decide to deduplicate/un-deduplicate some files or folders from a Windows computer, through the corresponding Windows share (CIFS Compression Enabled is ON), how can I know the consequences on storage use at the FS level?
Would a (tree) quota be more helpful? I suppose not: block-based quotas are useless here, and if I turn to file-based quotas, I won't see any difference.
Any idea?
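The only workaround I can see so far is to snapshot the free space of the share before and after the change, something like this rough Python sketch (\\celerra\myshare is just a placeholder for my real share, and the result only means something if nothing else is writing to the FS in the meantime):

import shutil

# Rough sketch: compare free space on the share before and after (un)deduplicating.
# \\celerra\myshare is a placeholder for the real CIFS share; the measurement is only
# meaningful if nothing else is writing to the file system at the same time.
SHARE = r"\\celerra\myshare"

before = shutil.disk_usage(SHARE).free
input("(Un)deduplicate the files or folders, then press Enter...")
after = shutil.disk_usage(SHARE).free

print(f"Change in free space: {(after - before) / 1024.0 ** 2:.1f} MB")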
2/ After I change some deduplication settings, how can I force a complete scan to update the fs_dedup figures?
For instance, the first deduplication run was done with the compression feature only (Duplicate Detection Method is OFF).
Then I change this setting to SHA1.
When I ask the system to scan again and take the new parameters into account, almost nothing happens... even though the access time and modification time thresholds are both set to 0.
Would the next daily full scan do what I expect?
I can fix it by un-deduplicating with "-state off" and deduplicating again, but that's a complete waste of time.
How can I force the system to run a true full scan NOW?
3/ It's easy to set up deduplication with single instance OFF and compression ON.
Is there any way to do the opposite: single instance only, with no compression?
I have many file systems with a lot of files that are the worst possible candidates for compression (gif, jpg, proprietary compressed data files...).
But that doesn't mean they are not candidates for single instancing, even though I confess they are not the best candidates for it either...
This is why I'm interested in having single instance without compression.
I can guess what you're going to suggest:
- exclude gif/jpg files
=> they would be excluded from the whole deduplication process, including single instancing, right? Not exactly what I want.
- disable compression for this FS; you can't expect that much gain from single instancing anyway!
=> partially true... but this FS contains other trees with totally different content that could benefit from single instancing
- blacklist the tree with the Path Exclude List option
=> too much work. Also, how can I figure out that qtree #1 should be in the exclude list but not qtree #2? I don't have detailed dedup efficiency figures per qtree to make that decision, and I have poor knowledge of the file types the system is hosting. I manage the container (Celerra), not the content.
Any good tips appreciated!
Eric


whoreallycares
January 6th, 2011 02:00
Related to my questions above, the document Achieving Storage Efficiency through EMC Celerra Data Deduplication (h6065-achieve-storage-effficiency-celerra-dedup-wp.pdf) mentions the following on page 7:
"Note that Celerra detects non-compressible files and stores them in their original form.
However, these files can still benefit from file-level deduplication."
This would be great, but I don't feel it's actually working like this...
Can someone confirm these features are implemented? I'm running DART 5.6.47.11.
Thanks
Eric
Rainer_EMC
January 6th, 2011 07:00
Hi Eric,
It should work.
Just to be clear - it means that even if a file isn't compressible, it is still a candidate for single instancing - i.e. if multiple files with the same content exist in the file system, only one instance gets stored; the others are represented by 8k stubs.
Rainer
whoreallycares
January 6th, 2011 07:00
Hi,
I have the same understanding of the statements I quoted from this doc.
It would be good if it worked like this, but I'm not as confident as you are.
Here is the file size information taken from right-click > Properties in Windows Explorer for a jpg file:
Size: 91659 bytes
Size on disk: 98304 bytes
91659 / 8192 = 11.18
So this file needs 12 blocks
12 * 8 KB = 96 KB (98304 bytes)
After compression:
Size on disk: 106496 bytes
It means it now occupies 13 blocks
I guess this single extra block is accounted for by the stub.
Can you confirm?
This file is unique, so I see no reason for the stub mechanism to be in place for it.
Is a stub file mandatory for any file evaluated for deduplication, even when the file turns out not to be compressible and is unique?
Sounds weird to me...
Is that the way it behaves?
Let's take another example: another jpg file, but a big one this time.
The size on disk went from 103 blocks to 107 blocks. Even if one block is due to the stub, it means 3 extra blocks are needed to store the file in its compressed form.
It is not surprising that the file is bigger in its compressed form: a compression algorithm designed for speed rather than for maximum compression ratio (I guess) will make a jpg file bigger when compressed, because of its compression metadata.
If my hypothesis is right in this example, why doesn't DART consider that this file is not a good candidate for compression?
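To make my arithmetic explicit, here is the small calculation I'm doing (assuming 8 KB blocks and a single extra block for the stub):

BLOCK = 8192  # 8 KB file system block, as used in my calculation above

def blocks_needed(nbytes):
    # ceiling division: number of 8 KB blocks needed to hold nbytes
    return -(-nbytes // BLOCK)

# Small jpg: 91659 bytes
data_blocks = blocks_needed(91659)
print(data_blocks)                  # 12 blocks -> 98304 bytes before dedup
print((data_blocks + 1) * BLOCK)    # 13 blocks -> 106496 bytes, assuming 1 stub block

# Big jpg: size on disk went from 103 to 107 blocks
print(107 - 103 - 1)                # 3 extra blocks beyond a single stub block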
Thanks,
Eric
bergec
January 6th, 2011 13:00
The doc says there is an additional stub (therefore 2 stubs for one file).
You can also specify a list of extensions not to be deduped.
Claude
jukokkon
January 6th, 2011 13:00
I haven't seen dedup doing any real selection based on how well a file compresses. But it doesn't really matter. For a normal mixed file system you gain 20-30% of space. If you are not using single instancing, then I think you should exclude stuff like jpeg.
Also, fs_dedupe -info seems to lie. For example, I have a deduped file system (2.5 TB) for which fs_dedupe says the original size is 3.7 TB, but actually it is 3.2 TB. My guess is that it counts all the files that have been deduped and never subtracts the size when files get deleted.
There's no option to use just single instancing.
We've been using dedupe for about as long as it has existed. Backups are a bit slower; otherwise it's been working nicely and transparently for users. Now we're archiving and getting rid of dedupe in tier 1. Tier 2 gets deduped. The main reason for re-duplicating is to move to VBB backups (with NDMP tape drives). In our environment it is the only way to back up within a reasonable time window, and one can't do file-level restores from a deduped FS backed up with VBB.
--
Jussi
whoreallycares
January 7th, 2011 01:00
Jussi,
Thanks for sharing your experience, including the backup feedback.
As I already said, the problem is not only jpeg files, or picture files in general.
I'm dealing with files of very different types. If EMC provided a tool to collect and report dedup efficiency per file type (based on extension) for each FS, maybe I could rely on it to update the blacklist... but once again, it's a pity because it would also exclude those files from single instancing...
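Something along these lines is what I have in mind - a rough sketch only, assuming the FS is also exported over NFS and mounted on a Linux host at /mnt/myfs (a placeholder), and assuming st_blocks reflects the real allocation after dedup/compression:

import os
from collections import defaultdict

# Rough sketch of the per-extension report I wish existed.
# /mnt/myfs is a placeholder mount point; st_blocks is assumed to reflect
# the real allocation after dedup/compression (it is counted in 512-byte units).
apparent = defaultdict(int)   # logical size per extension
on_disk = defaultdict(int)    # allocated size per extension

for root, dirs, files in os.walk("/mnt/myfs"):
    for name in files:
        try:
            st = os.lstat(os.path.join(root, name))
        except OSError:
            continue
        ext = os.path.splitext(name)[1].lower() or "(none)"
        apparent[ext] += st.st_size
        on_disk[ext] += st.st_blocks * 512

for ext in sorted(apparent, key=lambda e: apparent[e] - on_disk[e], reverse=True):
    saved = (apparent[ext] - on_disk[ext]) / 1024.0 ** 2
    print(f"{ext:12s} apparent={apparent[ext] / 1024.0 ** 2:10.1f} MB "
          f"on disk={on_disk[ext] / 1024.0 ** 2:10.1f} MB saved={saved:10.1f} MB")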
I disagree with your interpretation of fs_dedup -i (or did I misunderstand your remark?).
fs_dedup correctly reports the savings from compression and single instancing:
On the same dataset:
With compression only (Duplicate Detection Method = off)
As of the last file system scan (Thu Jan 6 17:55:18 CET 2011):
Files scanned = 13680
Files deduped = 13665 (100% of total files)
File system capacity = 54160 MB
Original data size = 27462 MB (51% of current file system capacity)
Space saved = 14410 MB (52% of original data size)
With compression + single instance (Duplicate Detection Method = sha1)
As of the last file system scan (Fri Jan 7 09:44:37 CET 2011):
Files scanned = 13680
Files deduped = 13665 (100% of total files)
File system capacity = 54160 MB
Original data size = 27462 MB (51% of current file system capacity)
Space saved = 19386 MB (71% of original data size)
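As a quick sanity check on those percentages (my own arithmetic, not something fs_dedup prints):

# Quick sanity check of the percentages reported by fs_dedup above
fs_capacity = 54160              # MB
original = 27462                 # MB
saved_compression_only = 14410   # MB
saved_with_sha1 = 19386          # MB

print(round(100 * original / fs_capacity))             # ~51% of file system capacity
print(round(100 * saved_compression_only / original))  # ~52% of original data size
print(round(100 * saved_with_sha1 / original))         # ~71% of original data size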
In this example, file removal has been accounted for by fs_dedupe.
To me, fs_dedup is lying for another reason: it reports 0 savings when running dedup has actually had a negative effect because the dataset was not a good candidate (inode overhead, compression overhead).
Eric
jukokkon
January 7th, 2011 02:00
Just try it.
If you know the file system contains mostly music, videos, compressed files and that sort of stuff, forget dedupe. If it's "normal" mixed stuff you'll gain something like 20-30%. The first dedupe round for millions of files takes a few days. After that you'll see the outcome. If it's not good, you can always redupe the FS (you'll have an impact on checkpoint storage and backups with both operations).
We've gained terabytes with dedupe so it's been good to us.
--
Jussi