Peter_Sero (4 Operator, 1.2K Posts)
January 6th, 2016 22:00
It seems the overall capacity information is not recorded simultaneously by path/LIN and by disk_pool.
What comes closest:
list_top_n_files_by_log_size
list_top_n_files_by_phys_size
but obviously only for a small set, the largest files.
You can get pool statistics per single subtree with
isi filepool apply -r -s DIR
which runs quickest shortly after a SmartPools job.
(Actually it gives you "file pool" statistics, which would need to
be mapped to "disk pools" in a post-processing step.)
I am currently experimenting with statistics by progressive
random sampling of /ifs (pick random LINs, run
as long as needed, about 1000 LIN/min in a single thread). Interested?
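The progressive part can be sketched like this (plain Python, nothing Isilon-specific: `sample_sizes_iter` stands in for whatever feeds you the sizes of randomly picked LINs, and the shrinking error bound tells you when you can stop sampling):

```python
import math

def progressive_estimate(sample_sizes_iter, total_file_count, report_every=1000):
    """Progressively estimate total logical size from a stream of sampled
    per-file sizes (bytes).  `total_file_count` is the known file count
    (e.g. from cluster-wide accounting); sizes arrive one per sampled LIN.
    Yields (n_sampled, estimated_total, 95%-CI half-width) tuples."""
    n = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations (Welford's method)
    for size in sample_sizes_iter:
        n += 1
        delta = size - mean
        mean += delta / n
        m2 += delta * (size - mean)
        if n % report_every == 0:
            stderr = math.sqrt(m2 / (n - 1) / n) if n > 1 else float("inf")
            yield n, mean * total_file_count, 1.96 * stderr * total_file_count
```

The running confidence interval is the whole point of doing it progressively: you watch the half-width shrink and stop when it is tight enough for your purposes.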
Cheers
-- Peter
rakosnicek (4 Posts)
January 6th, 2016 23:00
It feels like "isi filepool apply ..." is also doing a disk walk.
My problem is that one directory under /ifs/data owns about 162,000,000 files.
With that many files, even 1000 LIN/min is just too slow.
Peter_Sero (4 Operator, 1.2K Posts)
January 7th, 2016 04:00
The idea is to sample only a tiny fraction of the files, say 10 or 100 thousand, and extrapolate...
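With the numbers already mentioned in this thread, the cost of that is easy to bound (plain back-of-the-envelope arithmetic, no cluster access needed):

```python
# Back-of-the-envelope for the sampling proposal, using only
# numbers already mentioned in this thread.
total_files = 162_000_000    # files in the problem directory
sample_size = 100_000        # the larger of the two proposed samples
rate_per_min = 1_000         # observed single-thread sampling rate

sampling_fraction = sample_size / total_files   # fraction of files touched
runtime_min = sample_size / rate_per_min        # minutes of wall time
print(f"fraction {sampling_fraction:.4%}, runtime {runtime_min:.0f} min")
# → fraction 0.0617%, runtime 100 min
```

So even the large sample touches well under a tenth of a percent of the files and finishes in under two hours of single-threaded work.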
rakosnicek (4 Posts)
January 10th, 2016 16:00
If I were a statistician rather than an engineer, then maybe I'd go with the random sampling approach.
But as an engineer I can't take the risk of missing outliers that might mean missing a 100GB file among a collection of files that are mostly 20-30GB in size.
It's just a shame that without the LIN key there is no way to join any of the stats_* tables in results.db with disk_usage.
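For what it's worth, here is what that join would look like if an RFE ever added a LIN column (toy sqlite session only; the table and column names here are made up, since the real stats_* schema is exactly what lacks the key):

```python
import sqlite3

# Toy illustration: the real results.db stats_* tables do NOT carry a
# LIN column -- that is the complaint.  If one were added, the join
# would be this simple.  All names below are hypothetical.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE stats_files (lin INTEGER, logical_size INTEGER);
    CREATE TABLE disk_usage  (lin INTEGER, path TEXT, phys_size INTEGER);
    INSERT INTO stats_files VALUES (100, 20), (101, 30);
    INSERT INTO disk_usage  VALUES (100, '/ifs/data/a', 24),
                                   (101, '/ifs/data/b', 36);
""")
rows = con.execute("""
    SELECT d.path, s.logical_size, d.phys_size
    FROM stats_files s JOIN disk_usage d USING (lin)
    ORDER BY d.path
""").fetchall()
```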
Another use for the data in disk_usage would be to create a graph or presentation similar to what "SequoiaView" does for Windows filesystems.
Peter_Sero (4 Operator, 1.2K Posts)
January 12th, 2016 07:00
Correct, and it gets even worse when you have a "few" GB-sized files among millions of KB-sized files...
Random sampling is most useful for file counts -- it can give a reasonable picture of where those (100s of) millions of files are located within an hour or even minutes.
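A sketch of that count-oriented use (plain Python; it assumes only a uniform random sample of full paths and a known grand total, e.g. from cluster-wide file-count accounting -- nothing here is an Isilon API):

```python
from collections import Counter

def estimate_counts_by_dir(sampled_paths, total_files, depth=1):
    """Extrapolate per-directory file counts from a uniform random
    sample of file paths.  Each path is truncated to `depth` components
    below /ifs, and sample counts are scaled up to the known total."""
    buckets = Counter("/".join(p.split("/")[:depth + 2]) for p in sampled_paths)
    n = len(sampled_paths)
    return {d: c * total_files // n for d, c in buckets.items()}
```

With a 100,000-path sample, a directory holding 10 % of the files collects roughly 10,000 hits, so its estimated share is accurate to a small fraction of a percentage point -- good enough to see where the bulk of the files live.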
Cheers
-- Peter
rakosnicek (4 Posts)
January 12th, 2016 22:00
Is there a way to file an RFE to get the stats_* tables in results.db updated to include the LIN for each file?