cassij

Re: Ask the Expert: Isilon Performance Analysis

Peter, I will again tackle your questions both indirectly and directly. To begin:

isi_cache_stats is simply a means to determine

   cache hits

   prefetch hits

and from them develop an understanding of your work-flow. Typical uses are sizing memory, leveraging accelerator nodes for L1 on a work-flow that is heavy on repeat reads, and of course sizing your spindle count to have adequate disks to satisfy your prefetch I/O goals and cache benefits.
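For a first look, the invocations discussed further down this post are all you need (the 5-second interval is just an example value):

   isi_cache_stats        # totals since uptime or last stats reset
   isi_cache_stats 5      # refresh the counters every 5 seconds
   isi_cache_stats -v     # verbose, per-counter breakdown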

cache hits case: CACHE memory available is the resource. If you have a high hit rate (90%+), the translation is that your work-flow repeatedly reads sections of a file. More memory in the cluster will help.
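The arithmetic behind that threshold is just hits over lookups. A quick sanity check with bc (the counter values here are made up for illustration):

   # hit rate = hits / lookups * 100; say 9.2M hits out of 10M lookups
   echo "9200000 / 10000000 * 100" | bc -l    # => 92.0, "high" by the 90%+ rule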

prefetch hits case: direct benefit from optimized file prefetch from disk storage. The two prefetch optimizations are CONCURRENCY and STREAMING (prefetch of 9 and 200 blocks respectively). Now there is a cost here, and that cost is to your disk spindles. If you have a low prefetch hit rate, it would mean that your files are typically <1MB in size or that the work-flow is mostly small random I/O to files. Prefetching blocks where an application reads files from start to finish will be of high benefit, in that subsequent reads hit the already warm memory cache.
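An easy way to see the streaming benefit is a sequential, start-to-finish read while the counters run (the path here is just a placeholder; use any large file under /ifs):

   # sequential read -- the prefetch-friendly pattern
   dd if=/ifs/data/bigfile of=/dev/null bs=1m

   # in a second session, watch the prefetch hits climb
   isi_cache_stats 5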

L1 is a drop-behind read cache. Meaning if you were using a new A100 accelerator and were performing a READ-predominant work-flow, the idea would be to cache those blocks locally on the node such that we avoid the 50-150μsec L2 global lookup over InfiniBand. i.e. it's local to the protocol read operations.

L2 is global cache. Meaning if a protocol read operation occurs on node 2 and the block is not presently in cache, we may read the block from node 3. At that point we have a global cache line for that block on node 3. Subsequent reads of the same block from node 5 will benefit from the global cache on node 3. Internally we refer to I/O operations on local data as Local Block Management (lbm_), and when we need to look up data on an adjacent node the lookup uses Remote Block Management (rbm_) over IB, at 50-150μsec latency.
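To get a feel for what those rbm_ round trips add up to, some rough arithmetic (the lookup rate is invented; ~100μsec is the midpoint of the range above):

   # 50,000 remote lookups/sec * ~100 usec each
   # = 5 seconds of cumulative wait per wall-clock second, cluster-wide
   echo "50000 * 100 / 1000000" | bc -l    # => 5.0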

Data and Metadata: Data is the actual data blocks that make up the file. Metadata is the inodes and the internal B-trees used as part of OneFS metadata structures, i.e. directories are managed as a B-tree. To expand on metadata: as a file in the file-system grows to a very large size, say 1TB, we need to allocate metadata blocks that contain pointers to the data blocks. In the 1TB case that is ~1GB of metadata. Now, fitting 1TB in L2 data cache would be ridiculous, so we cache the metadata pointers; in this way we know exactly how to read and prefetch the data. Ideally, on a large file you want a high rate of metadata prefetch hits.
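That ~1GB figure lines up with OneFS's 8KB block size if you assume (my assumption, purely for illustration) on the order of 8 bytes per block pointer:

   # (1TB / 8KB per block) addresses * ~8 bytes per pointer, expressed in GB
   echo "(1024^4 / 8192) * 8 / 1024^3" | bc -l    # => 1.0, i.e. ~1GB of metadata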

Answering your questions:

The isi_cache_stats tool is a wrapper around the sysctl isi.cache.stats. As you indicated, the data is collected from cluster uptime. The first row returned is typically the global total since uptime or the last stats reset.
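That also means you can query the raw counters directly if you ever want them without the wrapper's formatting:

   # the raw source of the isi_cache_stats numbers
   sysctl isi.cache.stats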

You can run isi_cache_stats -z and then isi_cache_stats 5. This will clear the global stats and then start monitoring, in real time, the number of blocks started and the number from which you gain cache benefit.

The isi_cache_stats output without -v is just a summary of what you see with -v. The only real difference is that it shows the counts as human-readable sizes rather than raw blocks.

BTW: Another lightweight means to look at cluster-wide work-flow as you watch isi_cache_stats is

   isi perfstat

This gives you a quick way of generalizing your work-flow:

Cluster Name: b5-2

Initiator Statistics:

                   Throughput (bytes/s)      Files (#/s)
 ID |   Type   | Write |  Read | Delete| Create| Remove| Lookup
----+----------+-------+-------+-------+-------+-------+-------
  1 | storage  |   52K |  200K |     0 |     0 |     0 |    68
  2 | storage  |   112 |     0 |     0 |     0 |     0 |    53
  3 | storage  |    88 |     0 |     0 |     0 |     0 |    82
  4 | storage  |     0 |     0 |     0 |     0 |     0 |    67
  5 | storage  |   14K |     0 |  8.0K |     0 |     0 |    59
----+----------+-------+-------+-------+-------+-------+-------
     Totals:   |   67K |  200K |  8.0K |     0 |     0 |   331

BE WARNED: what follows isn't something that I recommend outside a test situation. You can flush all read CACHE from a node, or from all nodes, using

   isi_flush

   or isi_for_array -s isi_flush

This will happily flush all the cache warmth for your work-flow. USE WITH CARE; you will impact the real-time cache performance benefit for all your active work-flow clients.
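On a test cluster, though, the commands above combine into a useful before/after experiment:

   isi_cache_stats -z               # zero the counters
   isi_for_array -s isi_flush       # drop read cache on every node
   isi_cache_stats 5                # watch hit rates rebuild as clients re-warm the cache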

Best,

John