If you fall below 2%, nothing happens. If you fall to the minimum supported level for GNA, 1.5%, GNA will be disabled. We will continue to write to the SSDs until they are full; however, we will no longer record the GNA metadata blocks. We will continue to READ the metadata blocks already stored for acceleration.
So are you saying that once GNA is enabled, even if we fall below 1.5% of total space, you essentially still get GNA? I had thought that when you go below the threshold the metadata on the SSD would be evicted. You are saying that is not the case?
What do you mean exactly when you say that "we will no longer record the GNA metadata blocks"?
The following white paper covers GNA and other OneFS storage topics in more detail.
Metadata blocks are files, directories and OneFS internal file system structures. GNA is a separate metadata block that allows nodes that don't physically have SSDs in them to store metadata on nodes that do. Think of GNA as another layer where we offer a means to leverage SSD hardware from nodes that don't have SSDs. The challenge as you grow the cluster is simply that, with GNA enabled, if we have too many nodes that are not sharing the work, we create a situation where many non-SSD nodes bully the SSD nodes, which does impact overall cluster performance.
When GNA falls to the minimum supported level, meaning we no longer have the right ratio of SSDs to SATA/SAS (i.e. we added more nodes that don't have SSDs), we stop creating new GNA copy blocks as you create more files or directories on nodes that don't have SSDs.
Your metadata will still be written to the SSDs as per your metadata policy; we just won't create the GNA copies. You gain no benefit for new files or dirs that you create on non-SSD nodes, because GNA copies are not being written. This means all the metadata on non-SSD nodes is written to SAS/SATA disks.
Last point: we don't delete existing GNA BLOCKS, meaning you still get the benefit from old blocks.
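To make the thresholds above concrete, here is a minimal Python sketch that classifies a cluster by its SSD share of total capacity. The 2% and 1.5% values come from this thread; the function name, the state labels, and the choice to treat exactly 1.5% as still supported are assumptions for illustration, and nothing here is read from a real cluster.

```python
# Thresholds as quoted in this thread (percent of total capacity on SSD).
GNA_RECOMMENDED_PCT = 2.0
GNA_MINIMUM_PCT = 1.5


def gna_state(ssd_capacity, total_capacity):
    """Return (ratio_pct, state) for an SSD share of total capacity.

    Both arguments must use the same unit (e.g. TB). The state labels
    are hypothetical names, not OneFS terminology.
    """
    if total_capacity <= 0:
        raise ValueError("total_capacity must be positive")
    ratio_pct = 100.0 * ssd_capacity / total_capacity
    if ratio_pct >= GNA_RECOMMENDED_PCT:
        state = "ok"
    elif ratio_pct >= GNA_MINIMUM_PCT:
        # Below 2% but not under the minimum: nothing changes.
        state = "below-recommended"
    else:
        # Assumption: strictly below 1.5% counts as "disabled".
        # New GNA copies stop; existing GNA blocks are still read.
        state = "no-new-gna-blocks"
    return ratio_pct, state


if __name__ == "__main__":
    for ssd_tb in (3.0, 1.8, 1.0):
        print(ssd_tb, gna_state(ssd_tb, 100.0))
```

Adding non-SSD nodes raises total_capacity without raising ssd_capacity, which is how a growing cluster drifts toward the lower states.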
Thanks a lot, John.
How do the vnodes fit into the Level1/Level2 caching mechanism?
Is it advisable to "tune" the vnode space?
Considering the SSD/GNA aspect (not so much caching) for large
tree walks, why is there no speedup when having only
the directories' metadata (= links and filenames) - but not the files' metadata - on SSD?
Using and enjoying GNA, we found that 2% is quite a "safe" limit.
For example, we see a 1% ratio of (consumed)SSD : (consumed)HDD
for a pool with an average file size of 0.25 MB
(actually on an X200/SATA+SSD pool with metadata on for all files; so not really GNA).
Summarizing several real GNA observations (X200/SATA+SSD -> 108NL),
the SSD:HDD ratio seems to drop down to 0.2% for an average file size of about 2 MB,
and to 0.1% for 50 MB and larger.
That's on 6.5.5; probably on 7.0 it comes out a bit different.
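As a rough illustration of what these numbers imply, the sketch below turns the three observed ratios into an estimated SSD consumption and a headroom factor against the 2% limit. The ratios are only the observations quoted above (6.5.5, these specific pools); the function names and the 100 TB example figure are made up for illustration, and this is not a sizing rule.

```python
# (average file size in MB, observed consumed SSD:HDD ratio in percent),
# taken from the observations in this post.
OBSERVED = [(0.25, 1.0), (2.0, 0.2), (50.0, 0.1)]


def estimated_ssd_tb(hdd_consumed_tb, ratio_pct):
    """SSD space consumed for a given HDD consumption at a given ratio."""
    return hdd_consumed_tb * ratio_pct / 100.0


def headroom_vs_limit(ratio_pct, limit_pct=2.0):
    """How many times the observed ratio fits into the limit."""
    return limit_pct / ratio_pct


if __name__ == "__main__":
    hdd_tb = 100.0  # hypothetical consumed HDD capacity
    for size_mb, ratio in OBSERVED:
        print(
            f"avg file {size_mb} MB: ~{estimated_ssd_tb(hdd_tb, ratio):.2f} TB SSD, "
            f"{headroom_vs_limit(ratio):.0f}x headroom vs 2%"
        )
```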
VNODES are part of the operating system side of OneFS, which is seeded from the BSD 7.x code base in 6.5.x through 7.0.x. VNODES play a role in L1 and L2 in that what is being tracked are file, directory or symlink level details. My point in mentioning kern.maxvnodes is that it does play a performance role with large file sets: the more direct pointers to data blocks you can cache in memory, the better the performance. This kernel parameter scales the number of elements as memory increases in a system:
4GB = 250,000
8GB = 500,000
16GB = 1,000,000
32GB+ = 2,000,000
per node. Tuning is not recommended. On smaller memory footprints, storing too many vnodes takes memory away from other areas because of pressure for available pages. On larger memory footprints, you need to be concerned about page operations that effectively poison the cache and burn CPU cycles on a large list of cache lines.
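The memory-to-vnode scaling listed above can be written as a small lookup. This is only a sketch of the table in this post: the function name is hypothetical, and mapping in-between memory sizes (e.g. 12GB) to the next lower tier is an assumption, since the post only lists the four tiers.

```python
# (minimum RAM in GB, per-node kern.maxvnodes value) as listed in this post.
VNODE_TIERS = [
    (32, 2_000_000),
    (16, 1_000_000),
    (8, 500_000),
    (4, 250_000),
]


def listed_maxvnodes(ram_gb):
    """Return the per-node vnode count listed for a given RAM size.

    Returns None below 4GB, which the table does not cover.
    Assumption: sizes between tiers fall back to the lower tier.
    """
    for min_gb, vnodes in VNODE_TIERS:
        if ram_gb >= min_gb:
            return vnodes
    return None


if __name__ == "__main__":
    for ram in (4, 8, 16, 32, 64):
        print(f"{ram}GB -> {listed_maxvnodes(ram)} vnodes per node")
```

On a live node the actual value would be whatever sysctl kern.maxvnodes reports; the lookup is only a way to sanity-check it against the table.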
For example, say you have a cluster with 500,000,000 (500 million) file/dir objects. If one of your shares is attached at a top level such as \\ALLMYFILES, and all users are allowed to run Microsoft Index Scanner, anti-virus, or an iTunes media scan, these processes would start to walk the 500 million file/dir objects, and you would reach high-water marks where managing a really large cache structure becomes expensive. Increasing the limit then leads to more poisoning, not more performance.
Tuning beyond 2 million objects per node requires some thought, and in most cases the ratio of available memory to number of cache elements is adequate for the active file set. That is, when you have 500,000,000+ files or directories, it is more likely that the active file set is a much smaller number, say 8 million files on any given day.
There are differences between 6.5 and 7.x, and likely 7.2, which is just around the corner. In 7.x we added METADATA-WRITE: the ability to put all metadata from workflow DIRS/FILES/LINKS onto the SSDs to optimize METADATA WRITE operations like file create, rename, and attribute changes: size, acls, uid, gid, anti-virus scan hash...
Internally, OneFS stores certain types of metadata that are part of the file system: B-trees (a part of DIRS/SNAPSHOTS/OneFS structures) and extension blocks, as needed when large numbers of snapshots are retained.
As we move forward in product development, we will balance the metadata structures that live on SSD and better balance your application I/O performance with optimal I/O execution. For example, creating a file involves many OneFS metadata structure updates, especially as you add protection layers (snapshots, SyncIQ). We will be tuning the engine to get the most out of SSDs on your behalf.
The 1.5% and 2% SSD ratios, and the requirement that at least 20% of all nodes have SSDs for GNA, are a balance between the following:
1) We want to ensure that the amount of free blocks on the SSDs is adequate for metadata needs. A file INODE at its smallest will be 512 bytes. However, as we add ACL->ACE AIMA items, the inode will grow, making INODES somewhat variable in size.
isi status -d # gives you a high level overview of usage.
isi statistics drive -nall --orderby=inode --long --top # shows you in real time which drives have the most inodes.
sysctl efs.lbm.drive_space # gives you a very granular level blkfree, usedino.
2) You want to ensure that no nodes with SSDs in a given node pool become bullied to the point where they begin to slow down overall cluster performance.
e.g. X200, 6SSD and 6 SATA # this is not an invalid config some of you may have it. But the sizing exercise considered the following:
thanks a lot for the new answers!
> isi statistics drive -nall --orderby=inode --long --top
On our NL pool we see the Used% pretty balanced around 73%
(we always made sure that MultiScan succeeds after
disk swaps, node reboots etc.).
But the Inodes range from 590K to 2.4M per disk
in that pool; is this kind of unbalance normal or should
we take care of it?
"But the Inodes range from 590K to 2.4M per disk in that pool; is this kind of unbalance normal or should we take care of it?"
Ideally, the inodes should be balanced. The SmartPools process, if you have a license, runs every day at 22:00 by default. This process should be balancing the inodes for you. Over the life of a filesystem, you will delete files or purge aged files, so it's possible that you see this from time to time. However, jobs like
when run, will balance out the inodes on disk.
If it were an issue, you might notice in the output of isi statistics drive -nall --orderby=Queued | head -14 that some drives are unusually busy. If the size of the In or Out I/O divided by the number of operations is <4k, it's likely that these are namespace (metadata) operations.
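The "<4k" rule of thumb above is just average bytes per operation. A minimal sketch, assuming you have already copied a drive's throughput (bytes/s) and operation rate (ops/s) out of the isi statistics output by hand; the function names are hypothetical, "4k" is assumed to mean 4096 bytes, and no parsing of the real command output is attempted:

```python
def avg_io_bytes(bytes_per_sec, ops_per_sec):
    """Average I/O size in bytes, or None when the drive is idle."""
    if ops_per_sec <= 0:
        return None
    return bytes_per_sec / ops_per_sec


def looks_like_metadata_io(bytes_per_sec, ops_per_sec, threshold_bytes=4096):
    """True when the average I/O size is below the threshold.

    threshold_bytes defaults to 4096, an assumed reading of "<4k".
    """
    avg = avg_io_bytes(bytes_per_sec, ops_per_sec)
    return avg is not None and avg < threshold_bytes


if __name__ == "__main__":
    # Hypothetical sample values, not taken from a real cluster.
    print(looks_like_metadata_io(2_000_000, 1000))   # small average I/O
    print(looks_like_metadata_io(64_000_000, 1000))  # large average I/O
```

Apply it separately to the In and Out columns, since a drive can be doing small metadata reads while writes are large, or the reverse.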
Quick way to see namespace count
Looking at the Files #/s columns, Create and Remove are metadata writes and Lookup is a metadata read.