> MEMORY – isi_cache_stats -v
Another great tool, and this one is actually new to me ;-)
I had been fiddling with isi statistics query before,
to get some insight into OneFS caching,
but found it hard to get a clear picture.
It seems that isi_cache_stats -v prints totals since startup,
and it is even more useful when monitoring live deltas at
regular intervals like 5s: isi_cache_stats -v 5
The one-line form appears more compact, but I don't get
the meaning of the actual numbers (after the first line with the totals):
Totals l1_data: r 3.1T 6% p 34T 73%, l1_meta: r 113T 98% p 70G 51%, l2_data: r 23T 13% p 117T 74%, l2_meta: r 13T 68% p 4.6T 99%
13/08/13 18:19:39 l1_data: r 4.7M 8% p 41M 78%, l1_meta: r 365M 99% p 48K 40%, l2_data: r 86M 54% p 70M 96%, l2_meta: r 3.1M 29% p 24K 100%
13/08/13 18:19:44 l1_data: r 5M 8% p 41M 79%, l1_meta: r 328M 99% p 96K 100%, l2_data: r 80M 56% p 60M 94%, l2_meta: r 2.2M 24% p 16K 100%
So l1_meta: r 365M in the second row would mean level1 reads, but I don't think we have 365M of those...
(isi statistics pstat says: 12521.38/s NFS3-Ops, and 15930.80/s disk IOPS at this time.)
Can you explain how to read these numbers? (All numbers in isi_cache_stats -v 5 appear reasonable.)
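To make the question concrete, here is a minimal Python sketch that splits one of the one-line samples above into its fields and divides the l1_meta "r" value by the 5 s interval. It rests on assumptions the tool's output does not confirm: that K/M/G/T are powers of 1024, that each timestamped row is a delta over the interval, and that the percentage following each value belongs to that value. The names parse_line, SUFFIX and FIELD are made up for this illustration; what "r" and "p" actually count is exactly what I am asking.

```python
import re

# Assumption: K/M/G/T are binary multipliers; the tool may use decimal ones.
SUFFIX = {"": 1, "K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}

# One "<cache>: r <value> <pct>% p <value> <pct>%" group per cache section.
FIELD = re.compile(
    r"(l[12]_(?:data|meta)): r ([\d.]+)([KMGT]?) (\d+)% p ([\d.]+)([KMGT]?) (\d+)%"
)


def parse_line(line):
    """Return {cache: {"r": value, "r_pct": int, "p": value, "p_pct": int}}."""
    stats = {}
    for name, r, r_suf, r_pct, p, p_suf, p_pct in FIELD.findall(line):
        stats[name] = {
            "r": float(r) * SUFFIX[r_suf],
            "r_pct": int(r_pct),
            "p": float(p) * SUFFIX[p_suf],
            "p_pct": int(p_pct),
        }
    return stats


sample = (
    "13/08/13 18:19:39 l1_data: r 4.7M 8% p 41M 78%, "
    "l1_meta: r 365M 99% p 48K 40%, "
    "l2_data: r 86M 54% p 70M 96%, "
    "l2_meta: r 3.1M 29% p 24K 100%"
)

INTERVAL_S = 5            # interval passed on the command line
NFS_OPS_PER_S = 12521.38  # from isi statistics pstat at the same time

stats = parse_line(sample)
l1_meta_rate = stats["l1_meta"]["r"] / INTERVAL_S

print(f"l1_meta r per second: {l1_meta_rate / 1024**2:.0f}M")
print(f"l1_meta r per NFS op: {l1_meta_rate / NFS_OPS_PER_S:.0f}")
```

Under these assumptions the l1_meta "r" value works out to several thousand units per NFS operation, which is why I doubt it is a plain count of read operations.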
But the real questions are of course about the OneFS caching in general.
How can one see the cache usage for certain traffic (by user, client, operation/event, path, ...)?
How is cache memory allocated or prioritized to Level 1/2 and data/metadata (four combinations)?
Could one check the cache ages separately for these four cache sections?
(similar to isi statistics query -snode.ifs.cache.oldest_page_age)
I ask because we often find that large data transfers mainly
fill the cache without much benefit (only a few % data hits later).
The node.ifs.cache.oldest_page_age goes down to 1 minute in such situations;
and it seems that this number also applies to the metadata cache.
I'd prefer to assign more memory to the metadata cache
(in the absence of SSD for metadata) to allow the metadata content
to last for 30 minutes or more, while the data cache is short anyway.
Does this make sense to you?
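As a rough sketch of the reasoning behind this request: in a simplified LRU model, the age of the oldest page is about the cache size divided by the rate at which new pages are inserted. This is only a toy model and not a description of how OneFS actually manages its caches; the function residency_seconds and all sizes and rates below are hypothetical, chosen only to show the shape of the effect.

```python
def residency_seconds(cache_bytes, insert_bytes_per_s):
    """Approximate time a page survives in a simple LRU cache before eviction."""
    return cache_bytes / insert_bytes_per_s


GIB = 1024**3
MIB = 1024**2

# Hypothetical numbers, for illustration only (not measured on our cluster).
shared_cache = 24 * GIB         # one pool shared by data and metadata
data_insert_rate = 400 * MIB    # per second, during a large streaming transfer
meta_insert_rate = 4 * MIB      # per second, metadata pages

# Shared pool: streaming data pushes everything out, metadata included.
shared_age = residency_seconds(shared_cache, data_insert_rate + meta_insert_rate)

# Dedicated pool: 8 GiB reserved for metadata, filled only by metadata inserts.
meta_age = residency_seconds(8 * GIB, meta_insert_rate)

print(f"shared pool, oldest page age: {shared_age:.0f} s")
print(f"dedicated metadata pool, oldest page age: {meta_age / 60:.0f} min")
```

With these made-up figures the shared pool turns over in about a minute, while a smaller pool reserved for metadata would keep its pages for roughly half an hour.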