We have a system here with 240 million files totalling 40 TB; the average file size is around 174 KB.
This eats 90 TB on the customer's 3-node cluster. So instead of 33% overhead we are seeing around 127% overhead...
From everything I can find we can expect some additional overhead with this relatively small block size, but 50 TB extra is a bit huge...
Can anybody shed some light on this? (In the past there was a fantastic Isilon capacity calculator, but the URL doesn't work anymore...)
By the way: the cluster is running OneFS 7.1 code.
For files under 128 KiB, OneFS will mirror the files rather than stripe them. The number of mirrors depends on your protection level; let's assume the default of N+2:1. For example, a 32 KiB file will take a total of 98 KiB of real physical space: 32 KiB * 3 (3x mirroring on N+2:1) = 96 KiB, plus 2 KiB of inode overhead. The inode overhead varies with the number of ACLs, the file size and some other factors, but 2-4 KiB per file is a reasonable average.
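To make that arithmetic concrete, here is a minimal Python sketch of the small-file model described above. The 8 KiB block size, 3x mirror count and 2 KiB inode figure are the assumptions from this post, not an official OneFS formula:

```python
BLOCK_KIB = 8    # OneFS disk block size
MIRRORS = 3      # 3x mirroring for small files under N+2:1
INODE_KIB = 2    # assumed average inode overhead (2-4 KiB per the post)

def small_file_physical_kib(logical_kib):
    """Physical space for a file < 128 KiB: block-aligned, then mirrored."""
    blocks = -(-logical_kib // BLOCK_KIB)           # round up to 8 KiB blocks
    return blocks * BLOCK_KIB * MIRRORS + INODE_KIB

# 32 KiB -> 4 blocks * 8 KiB * 3 + 2 = 98 KiB
# 127 KiB -> 16 blocks * 8 KiB * 3 + 2 = 386 KiB
```

The 127 KiB case matches the 386 KiB figure quoted later in the thread.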
For storing LOTS of small files, OneFS may not be that efficient on a per-file basis. However, you don't have to reserve space for failed drives, reserve space for snapshots, or split the file system into smaller chunks, none of which you want to run at 100%. The space lost to small-file mirroring is generally more than made up for by reclaiming the space lost to that over-provisioning.
I don't know the exact file sizes used, but yes, it's possible that quite a percentage of the files is smaller than 128 KiB, so those will be triple-mirrored. The N+2:1 protection scheme sounds very inefficient for small files: a 127 KiB file will consume 386 KiB of disk space on a 3-node cluster (203% overhead), while on the same cluster a 256 KiB file also consumes 386 KiB (50% overhead).
This is the minimum overhead on 3 nodes. The formal guides, however, state that the overhead is 33%; it comes down to how you calculate and look at it:
Assume 100 TB raw capacity: with 3 nodes and N+2:1 you get at most 67 TB usable. So 67% is usable, but relative to usable space the overhead is 50% (the 33 TB of overhead is 50% of the 67 TB usable).
So a good calculation on a 3-node N+2:1 cluster would be:
slack space = average 64 KiB slack for each file + mirror protection overhead for files < 128 KiB + protection overhead for files >= 128 KiB
slack space = (number of files * 64 KiB) + (number of files < 128 KiB * 2 * 128 KiB) + (number of files >= 128 KiB * 64 KiB)
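As a rough sketch, that formula could be coded as follows (note that a later reply in this thread disputes parts of this estimate; the constants are taken from the formula above, not from OneFS documentation):

```python
def estimated_slack_kib(files_small, files_large):
    """Rough slack/overhead estimate for a 3-node N+2:1 cluster.

    files_small: number of files < 128 KiB (triple-mirrored)
    files_large: number of files >= 128 KiB (striped with parity)
    Constants follow the formula in the post, not an official model.
    """
    slack = (files_small + files_large) * 64   # ~64 KiB average slack per file
    mirror = files_small * 2 * 128             # two extra mirror copies, 128 KiB each
    parity = files_large * 64                  # assumed parity overhead per large file
    return slack + mirror + parity
```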
Could it be that the protection level is set too high?
When I run the isi get command, I get the following feedback about the directories I'm querying:
Current Protection: 4x
How should I interpret this?
The admin guide states the following:
"OneFS also supports data mirroring from 2x-8x, allowing from two to eight mirrors of the specified content." Does this imply that in our example the specified content is stored four times on the cluster?
I agree with you in terms of how the overhead is presented. You can look at it either as the percentage used for protection versus the raw capacity, or as the total overhead divided by the total usable. Two ways of saying the same thing.
Just be aware that the overhead calculation is at its worst on a 3-node cluster. Isilon is not very space-efficient at 3 nodes, but the benefit is the consolidated file system space and the ability to expand quickly and easily. So as you add the 4th, 5th and 6th node, your overhead goes from 33%/50% to 25%/33%, 20%/25% and 16%/20%. At 5 and 6 nodes the overhead becomes much better, and the above-mentioned benefits of less space wasted on provisioning continue to apply.
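That node-count progression follows directly from the stripe geometry: with N+2:1 a protection group spans two stripe units per node, i.e. 2*(nodes-1) data units plus 2 parity units. A small sketch, derived from the numbers in this post rather than any official formula:

```python
def n21_overhead(nodes):
    """Parity overhead for N+2:1 on a cluster of `nodes` nodes.

    Each N+2:1 protection group spans 2 stripe units per node:
    2*(nodes-1) data units plus 2 parity units.
    Returns (share of raw capacity, share of usable capacity).
    """
    parity_of_raw = 2 / (2 * nodes)            # e.g. 3 nodes -> 33%
    parity_of_usable = 2 / (2 * (nodes - 1))   # e.g. 3 nodes -> 50%
    return parity_of_raw, parity_of_usable

# 3..6 nodes reproduce the 33%/50%, 25%/33%, 20%/25%, ~16.7%/20% progression
```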
For your calculation of the wasted/overhead space, I'm not sure I would agree with you there. If you know how many files are < 128 KiB, you can assume their average size is 64 KiB, so the actual space consumed by those files would be 64 KiB * 3, or 192 KiB each. For files > 128 KiB we do NOT waste an entire stripe. For example, a 448 KiB file uses 3 full 128 KiB stripes (384 KiB of space) plus only 64 KiB (8 disk blocks); the other half of that 128 KiB stripe is not wasted the way it would be if we were talking about disk block size. This file also has a total of 2 parity stripes of 128 KiB each. The real physical size on disk is 730 KiB (384 KiB + 64 KiB + 256 KiB + 26 KiB of inode overhead).
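Under the model described in that example (block-aligned data with no stripe padding, one full 128 KiB stripe per parity unit, 26 KiB inode overhead for striped files), the 448 KiB case can be reproduced with a short sketch. The 6-data-stripes-per-protection-group constant assumes the 6+2/2 layout of the 4-node cluster discussed in this thread; all of this is my reading of the post, not OneFS internals:

```python
import math

BLOCK_KIB = 8            # disk block size
STRIPE_KIB = 128         # stripe unit size
DATA_PER_GROUP = 6       # 6+2/2 layout on a 4-node N+2:1 cluster
PARITY_PER_GROUP = 2
INODE_KIB = 26           # inode overhead observed for striped files

def striped_physical_kib(logical_kib):
    """Assumed physical size of a file >= 128 KiB on the 4-node example."""
    data = math.ceil(logical_kib / BLOCK_KIB) * BLOCK_KIB   # block-aligned, no stripe padding
    groups = math.ceil(math.ceil(logical_kib / STRIPE_KIB) / DATA_PER_GROUP)
    parity = groups * PARITY_PER_GROUP * STRIPE_KIB
    return data + parity + INODE_KIB

# 448 KiB -> 448 + 256 + 26 = 730 KiB, matching the example above
```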
Just for other people's knowledge: when looking at the protection level, 6+2/2 tells you that the protection level is N+2:1. In this case it is a 4-node cluster, so you have 6 data stripes and 2 parity stripes in total. The /2 portion tells you that the stripes are grouped into 2 sets: the first set has 3 data stripes and 1 parity stripe, and the second set also has 3 data stripes and 1 parity stripe.
Some ASCII art to illustrate:
N1 | N2 | N3 | N4
D1 | D2 | D3 | P1 <-- 3 of the 6 stripes for the file
D4 | D5 | D6 | P2 <-- The other 3 of the 6 stripes for the file
N1 | N2 | N3 | N4
D1 | D3 | D4 | P1
D2 | D5 | D6 | P2
Assume each stripe unit lives on a disk in a node, and assume the file we are writing is exactly 768 KiB, so we need 6 data stripes.
If you only considered a single set at a time (D1, D2, D3 and P1), you can see that if you lose D1 and P1 you cannot reconstruct the data. This looks in some ways like RAID 5 with 4 drives: it does not allow any 2 disks to fail while still preserving the data.
This represents the same amount of data, but now instead of taking just 1 set of 3 data and 1 parity stripe, we consider 2 sets as a logical unit. So now we have D1 through D6 for 6 data stripes, and P1 and P2 for 2 parity stripes. Taken as a whole, if we lost, say, D3 and D4, those data stripes could be recovered using the data from P1 and P2! Another way to look at it is that we are no longer striping across 4 disks, but 8. This makes the data look like RAID 6 with 8 disks.
Each individual stripe still has the same overhead of 33% (1 parity / 3 data, or 2 parity / 6 data). Using the same idea, you can see how Isilon is able to provide N+3:1 protection, i.e. 3-drive protection: you essentially look at 3 sets of data and parity stripes instead of 2.
Yes this helps. Thanks.
We dug a bit further into the box.
It's actually a 4-node cluster we have here (N+2:1 protection, 6+2).
We looked at the real file size (ls -la), the apparent file size on Isilon (du -h -A), and the file size consumed on Isilon (du -h). The following table is for our 4-node NL400 cluster with N+2:1 protection:
| real file size (KiB) | file size on Isilon (KiB) | consumed with protection (KiB) | overhead vs. source | overhead vs. total capacity |
What we learn from this is that files below 2000 KiB can consume significantly more disk space than expected. Generally speaking, the 35% overhead (or 26% when measured against total capacity) is reached only for files larger than 5 MB; below 5 MB the overhead is higher for most file sizes.
Not completely clear to me is why the inode overhead is 26 KiB for files of 256 KiB and larger, while smaller files have only 2 KiB of inode overhead. But I am sure there is a good reason for that.
I wonder whether anybody has experience with packing bunches of small files
into larger tars or images, and then transparently accessing the files on the clients
through a local hook (like a virtual filesystem that mounts the tar or image
and presents its contents as a directory)?
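For what it's worth, FUSE tools along these lines do exist (archivemount, for example, mounts an archive as a directory). The underlying idea is to index a plain tar once and then serve individual members with seeks instead of a full scan. A minimal Python sketch of that indexing step; the file names here are hypothetical:

```python
import tarfile

def build_index(tar_path):
    """Map member name -> (data offset, size) in an uncompressed tar,
    so files can later be read with a plain seek instead of a scan."""
    index = {}
    with tarfile.open(tar_path, "r:") as tf:     # "r:" = uncompressed only
        for member in tf.getmembers():
            if member.isfile():
                index[member.name] = (member.offset_data, member.size)
    return index

def read_member(tar_path, index, name):
    """Read one member's bytes directly via its recorded offset."""
    offset, size = index[name]
    with open(tar_path, "rb") as f:
        f.seek(offset)
        return f.read(size)
```

With an index like this persisted next to the archive, clients can fetch any packed file with a single open/seek/read, which avoids the per-file mirroring and inode overhead discussed above while keeping reads cheap.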