jaangeletti
1 Copper

File Count Per Node Recommendations


Looking for recommendations on the number of files per node. I have numerous existing customers having massive challenges with jobs completing; in some cases, certain jobs have never successfully run. I have them all engaged with technical support, but the one trend I've noticed is that my predecessor didn't take into consideration the number of files to be supported and instead focused on the capacity required. On a hunch, we've remedied one of these accounts by adding additional nodes to the cluster. Obviously, this isn't a fix I'd like to offer up often, so I'd like to know if there is any good estimate of files supported per node. Or rather, where should I look at leveraging smaller drives to manipulate the CPU/TiB ratio?
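To illustrate what I mean by sizing on file count as well as capacity, here's the back-of-envelope math I've been doing (Python). The per-node capacity is an assumed model figure, and the files-per-node ceiling is a pure placeholder; that ceiling is exactly the number I'm asking for:

```python
import math

# Back-of-envelope sizing. USABLE_TIB_PER_NODE is an assumed model figure,
# and FILES_PER_NODE_CEILING is a pure placeholder -- the number this
# post is actually asking for.
USABLE_TIB_PER_NODE = 28.8
FILES_PER_NODE_CEILING = 50e6

def nodes_needed(total_files: float, total_tib: float) -> int:
    """Node count driven by whichever constraint dominates:
    raw capacity or file count."""
    by_capacity = math.ceil(total_tib / USABLE_TIB_PER_NODE)
    by_files = math.ceil(total_files / FILES_PER_NODE_CEILING)
    return max(by_capacity, by_files)

# Example: 400 million small files in only 100 TiB. Sizing on capacity
# alone gives 4 nodes; the file count pushes it to 8. Smaller drives
# lower the TiB per node, which raises the CPU available per TiB.
print(nodes_needed(total_files=400e6, total_tib=100))  # -> 8
```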


Accepted Solutions
colby_trotter
1 Copper

Re: File Count Per Node Recommendations


There is no best ratio of files per node, or TiB per CPU. The common response to the question will be "it depends on the workflow and data layout". You may have billions of files in an archive, or transient data in scratch space, where Isilon operates just fine, or a much smaller data set that is problematic. You have correctly identified the real challenge in the Job Engine, which is serialized (it runs one job at a time), and jobs are often paused when higher-priority jobs start. The result is that very useful and necessary jobs do not complete in a timely fashion, sometimes not at all.

Scaling CPU can help by scaling Job Engine threads, and SSD can assist with the metadata operations of the Job Engine; however, the real solution (we hope) will be the improvements in the Job Engine itself. OneFS 7.1 will include Job Engine 2.0, which can run multiple concurrent jobs (3), thereby improving the efficiency and effectiveness of internal operations.
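As a toy illustration of that starvation effect (this is not the real Job Engine scheduler; the tick counts and arrival rate are made up):

```python
# Toy model of job starvation under a serialized scheduler. All numbers
# are invented for illustration; the real Job Engine is more involved.

def simulate(slots: int, ticks: int) -> int:
    """Progress (out of 100) made by a low-priority job when a
    high-priority job needing 3 ticks of work arrives every 3 ticks
    and only `slots` jobs may run concurrently."""
    low_done = 0
    high_jobs = []                     # remaining work per high-priority job
    for t in range(ticks):
        if t % 3 == 0:
            high_jobs.append(3)        # a new high-priority job arrives
        busy = 0
        for i in range(len(high_jobs)):
            if busy < slots:
                high_jobs[i] -= 1      # high-priority jobs take slots first
                busy += 1
        high_jobs = [w for w in high_jobs if w > 0]
        if busy < slots and low_done < 100:
            low_done += 1              # a free slot advances the low-priority job
    return low_done

print(simulate(slots=1, ticks=300))    # -> 0: serialized, the job never runs
print(simulate(slots=3, ticks=300))    # -> 100: concurrency lets it finish
```

With one slot, a steady trickle of higher-priority work consumes every cycle; with three slots, the same trickle leaves room for the long-running job to complete.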

RobertGadbois
1 Copper

Re: File Count Per Node Recommendations


Hi,

Are you asking about the number of files (inodes) that can be hosted on a node, or about how many files can be actively open on a particular node at one time?

If it's the latter, then we would also need to know which version of OneFS you are running, as this limit is specific to each OneFS version.

Robert

smaug1
1 Copper

Re: File Count Per Node Recommendations


This OneFS Limits document can help with setting appropriate expectations and properly planning the cluster configuration:

https://support.emc.com/docu50238_Isilon-Guidelines-for-Large-Workloads-.pdf?language=en_US

Peter_Sero
4 Tellurium

Re: File Count Per Node Recommendations


Fantastic document, but the recommendations for file and directory counts mainly address the end users' view. The Isilon admin needs to learn about the impact of the file count on the performance of the restripe jobs. You can be perfectly within the limits recommended in this paper, but find yourself in trouble with FlexProtect, AutoBalance, and SmartPools jobs taking too long (weeks...). Additional advice on file count is needed to keep restripe times within acceptable limits on a given cluster. On the same cluster, the same total amount of data will lead to different restripe performance depending on whether it is spread across all 1 MB files or all 1 GB files, and it would be good to be able to predict that performance when planning a cluster.

Would be glad to see some guidelines from Isilon for this.

So far, in the cases from the community where clusters were stuck in restripes, the common denominator I observed was that the ratio of total files in the node pool to total number of disks in the node pool (all SATA, no SSD) far exceeded 1 million : 1.
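As a quick way to check a planned pool against that rule of thumb (the pool size, drive count, and file sizes below are my own illustrative assumptions, not Isilon guidance):

```python
# Files-per-spindle check against the ~1 million : 1 observation above.
# All figures are illustrative assumptions, not Isilon guidance.

MIB_PER_TIB = 1024 * 1024

def files_per_disk(total_tib: float, avg_file_mib: float, disks: int) -> float:
    """Estimated file count in the pool divided by spindle count."""
    total_files = total_tib * MIB_PER_TIB / avg_file_mib
    return total_files / disks

# The same 150 TiB data set on a hypothetical 3-node SATA pool with
# 36 drives per node (108 spindles), as all-1MiB vs. all-1GiB files:
DISKS = 3 * 36
for avg_mib in (1, 1024):
    ratio = files_per_disk(150, avg_mib, DISKS)
    flag = "over" if ratio > 1_000_000 else "well under"
    print(f"avg {avg_mib:>4} MiB files: {ratio:>12,.0f} files/disk ({flag} 1M:1)")
```

Same capacity, yet three orders of magnitude apart in files per spindle, which is exactly the restripe exposure that capacity-only planning misses.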

Cheers

-- Peter
