Welcome to this EMC
Support Community Ask the Expert conversation. This is an opportunity to learn about and discuss best practices for Isilon performance analysis.
This discussion begins on August 12 and concludes on August 23. Get ready by bookmarking this page or signing up to receive email notifications.
John Cassidy has spent three decades developing and supporting complex solutions and simplifying problems. He brings the following to bear on OneFS performance and complex work-flow issues:
* Work-Flow profiling
* Simplification methods
* Measurement tools
* Deterministic Results
This Ask the Expert thread is now open for discussion and questions! We are looking forward to an interesting discussion!
The following is from John, who asked me to post it on his behalf. I am not sure all of his formatting or ascii art will survive so I have uploaded it as a document as well.
Document Link below.
Welcome to ask the expert for EMC OneFS performance thread.
Let’s first address the “it’s slow” problem in performance engineering terms. Let me start with a simple model that allows you to break down latency. Like any other storage solution that you have dealt with, breaking down and mastering the architecture latencies, blockers and serializers from client to server are the first steps. As we address points or questions raised over the next few days I will reference the below model. We will build tips, techniques and simplified methodology that will help us identify and set performance expectations.
CLIENT MODEL OPERATING
/ T1----READ-----> |HW (IRQ) |+--->Rx ( Rx RTT1 )
DISK + |KERNEL (SYS) |
\ <---WRITE-----T1 |SERVICES (USER)|+<---Tx ( Tx RTT1 )
Alright, not the world's prettiest ASCII art, but no special application is needed to view it either '-).
In the above
T1 represents the response time for Reads or Writes to local storage.
TT1 represents the Think Time processing from HW/KERNEL/APP layers.
RTT1 represents the TCP round trip time
Simple MATH: computing a READ (non-cached). Scenario: you want to copy a file from local client storage to a server.
( READ ( T1 + TT1 )) = client-side latency.
Throughput per second can be expressed as:
(1 second / ( READ ( T1 + TT1 ))) * IO SIZE = expected throughput of the client
If the latency on disk is 6ms and the think time from application through kernel to HW is 1ms:
(1 second / ( 6ms + 1ms )) * 32KB ≈ 4,571 KB/s, or roughly 4.5 MB/s
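The arithmetic above can be sketched in a few lines; this is only an illustration of the serial-latency model, using the example figures from the text rather than real measurements:

```python
# Illustrative sketch of the serial-latency throughput model above.
# The 6ms disk latency and 1ms think time are the example values from the text.

def expected_throughput_kbps(latencies_ms, io_size_kb):
    """Throughput = (1 second / total latency per IO) * IO size."""
    total_s = sum(latencies_ms) / 1000.0
    return (1.0 / total_s) * io_size_kb

# READ(T1 + TT1): 6ms disk latency + 1ms think time, 32KB IOs
kbps = expected_throughput_kbps([6.0, 1.0], 32)
print(round(kbps))  # ~4571 KB/s, i.e. roughly 4.5 MB/s per serial stream
```

Plugging in your own measured T1 and TT1 gives a quick ceiling estimate for a single serial client stream.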
NETWORK MODEL
/ T2----Rx READ----> |VLAN |+--->Rx ( Rx RTT2 )
TCP + |SPANNING TREE |
\ <---Tx WRITE----T2 |Rate Limiting |+<---Tx ( Tx RTT2 )
( RTT2 )
From the above I want to introduce network influencers on the bandwidth-delay product (BDP). The OneFS file-server protocols (SMB, NFS, HTTP, FTP, …) are all TCP/IP based. The influencers on network throughput performance include whether TCP window scaling is enabled and whether selective acknowledgement (SACK) is enabled. Both of these TCP features require that a socket connection be established between client and server where the physical network environment allows them.
QoS, if enabled from client to server, may limit the available bandwidth by 20%, where that 20% is reserved for services other than TCP/IP. VLAN/LACP affect the routing of packets between the client and the server. When packets do not arrive in order, the TCP layer needs to re-assemble them. This re-assembly adds overhead that can be expressed as latency, typically in microseconds (e.g. 500 microseconds is .5 milliseconds). Handling out-of-order packets can be a significant factor in achieving high-end network performance.
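To make the bandwidth-delay product concrete, here is a small sketch. The link speed and RTT values are illustrative assumptions, not figures from this thread; the point is that if the TCP window is smaller than bandwidth × RTT, the window, not the wire, caps throughput, which is why window scaling matters:

```python
# Bandwidth-delay product: how much data must be "in flight" to fill the pipe.
# Link speeds and RTTs below are assumed values for illustration only.

def bdp_bytes(link_bps, rtt_seconds):
    """Bytes in flight needed to keep a link of link_bps busy at this RTT."""
    return link_bps / 8.0 * rtt_seconds

# 1 Gbit/s link with a 0.5ms RTT (the .5ms reordering penalty mentioned above)
print(bdp_bytes(1e9, 0.0005))  # 62500.0 bytes -- just under a classic 64KB window

# On a 10ms WAN RTT the BDP is 1.25MB: without TCP window scaling,
# a 64KB (65535-byte) window caps throughput regardless of link speed.
capped_bps = 65535 * 8 / 0.010
print(round(capped_bps / 1e6, 1))  # ~52.4 Mbit/s even on a 1 Gbit link
```

This is why checking that window scaling and SACK are negotiated end-to-end is an early step in any "it's slow" network investigation.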
netstat -s # From the command line on most operating systems, this will give you an indication of out-of-order packets or SACK (selective acknowledgement) being utilized. If you see SACK recoveries, this is also a clue to how lossy the network is. (Lossy = packets that are dropped or lost.)
wireshark # Wireshark, or the command-line tshark, breaks down packet captures,
tshark # allowing you to see out-of-order packets, SACK recovery episodes, spanning tree, …
   |  ^             |  ^             |  ^
   v  |             v  |             v  |
   Rx Tx            Rx Tx            Rx Tx
+-+-------+-----+ +-+-------+-----+ +-+-------+----+
| Node1 | | Node2 | | Node 3 |
+---------------+ +---------------+ +--------------+
|HW | |HW | |HW |
|KERNEL | IB (TT3) |KERNEL |IB(TT3)|KERNEL |
|ONEFS +<-------->+ONEFS +<----->+ONEFS |
|SERVICES | |SERVICES | |SERVICES |
+---------------+ +---------------+ +--------------+
| DISKS (T3 ) | | DISKS (T3 ) | | DISKS (T3 ) |
+---------------+ +---------------+ +--------------+
All OneFS nodes can receive client traffic. SmartConnect is the DNS delegation server which will offer, and then bind, a given client end point to a server end point. In the above diagram the take-away is that TCP/IP packets from a client arrive at one node, while each node manages only a fraction of the drives for the entire cluster. In the above, roughly 2/3rds of the disk I/O for network request operations arriving on node 2 will be satisfied over IB (InfiniBand) from nodes 1 and 3. The point is that NETWORK resources are entirely managed on the node of request; the DISK I/O, however, is leveraged in a scale-out model across all of the nodes.
TT3 = IB latency, typically in the .050ms range (very low latency). I am also throwing the cost of OneFS services into this latency bucket.
T3 = DISK latency: ~5ms for a 7,200 RPM SATA drive, ~3ms for a 10K RPM SAS drive.
Simple MATH: computing a WRITE (non-endurant cache). Scenario: you want to copy a file to a OneFS server.
( WRITE ( T3 + TT3 )) = server-side latency.
Throughput per second can be expressed as:
(1 second / ( WRITE ( T3 + TT3 ))) * IO SIZE = expected throughput of the client
Copying a file:
If we take the simple MATH from the client READ and the OneFS server WRITE, we end up with:
(1 second / ( READ ( T1 + TT1 ) + NETWORK COST ( RTT2 ) + WRITE ( T3 + TT3 ))) * IO SIZE
The above doesn't factor in BDP properly, but as an illustration of diagnosing "it's slow" it's important to be able to measure and identify where the lion's share of the latency is. e.g.
If the sum of READ latency is 30ms, the network cost is .2ms, the WRITE latency is 3ms and the IO SIZE is 32KB, your expected throughput would be:
((1 second) / ((30 + .2 + 3) milliseconds)) * 32 KB ≈ 963.9 KBps
Clearly the above fits "IT'S SLOW". Please modify the latency figures and see how this affects throughput.
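To experiment with the latency figures as suggested, here is a small sketch of the end-to-end copy model; the numbers are the example values from the text, and you can substitute your own measurements:

```python
# End-to-end copy model: client READ + network cost + server WRITE, serialized.
# Latency figures are the worked examples from the text, not measurements.

def copy_throughput_kbps(read_ms, net_ms, write_ms, io_size_kb):
    """(1 second / total per-IO latency) * IO size, in KB/s."""
    total_s = (read_ms + net_ms + write_ms) / 1000.0
    return (1.0 / total_s) * io_size_kb

# The "it's slow" case: 30ms read + .2ms network + 3ms write, 32KB IOs
print(round(copy_throughput_kbps(30, 0.2, 3, 32), 1))   # ~963.9 KB/s

# A well-cached case: .1ms read + .2ms network + .1ms write cache
print(round(copy_throughput_kbps(0.1, 0.2, 0.1, 32)))   # 80000 KB/s (~78 MB/s)
```

The two runs show why finding the lion's share of the latency matters: shaving the dominant 30ms term changes throughput by nearly two orders of magnitude, while tuning the .2ms network term barely moves it.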
MEMORY CACHE is a good thing. Memory access for cached files is in the microsecond range as it translates to file-server protocols; this is how we can achieve high-end throughput. A cached read on a client will likely be ~0.1ms, the network latency ~.2ms, and the OneFS write cache ~.1ms. This is how OneFS and file-servers can achieve performance on scale-out as high as, or higher than, traditional block storage.
Measurement Commands on OneFS
OneFS JOBs (overhead)
OneFS maintains both the protection of your data and the balance of data between nodes and disks. The job engine:
isi job status -v # will show you active running jobs or pending
isi job sched # will show you jobs scheduled to run at certain times
isi job list # will list all of the types of jobs that can run
When jobs run they have a default impact level, sized to the number of disks per node. There are three impact levels: LOW, MEDIUM and HIGH. On a LOW setting the impact to the disks while a job is running will be <= 5%, MEDIUM <= 20%, and HIGH as much as 40%.
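As a rough sketch of what those ceilings mean for client headroom, assuming a hypothetical per-spindle IOPS figure (the percentages are from the text; the 140 IOPS/SATA-drive baseline is an assumption for illustration):

```python
# Rough headroom estimate while a job runs, using the stated impact ceilings.
# 140 IOPS per SATA spindle is an assumed ballpark figure, not a OneFS spec.

IMPACT_CEILING = {"LOW": 0.05, "MEDIUM": 0.20, "HIGH": 0.40}

def client_iops_during_job(spindles, per_disk_iops, impact):
    """Disk IOPS left for client work while a job runs at the given impact."""
    total = spindles * per_disk_iops
    return total * (1.0 - IMPACT_CEILING[impact])

# e.g. 36 SATA drives at ~140 IOPS each, with a MEDIUM-impact job running
print(round(client_iops_during_job(36, 140, "MEDIUM")))  # 4032 IOPS for clients
```

This is only a ceiling model; actual job impact varies with the job type and phase, but it helps set expectations when a restripe or MultiScan is active.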
OneFS jobs need to run on the system to maintain balance and repair data layout after drive failures. I mention this to note that when jobs are running you should expect some hit in performance, since there is more OneFS activity on the disk drives.
I was wondering: what is the most common performance issue you see when troubleshooting? I know performance is a wide and encompassing topic, but I was just curious whether the majority is workflow, networking, cluster sizing/IOPS, or something else.
Indeed, performance and sizing of work-flows is quite nebulous, especially so in file-serving. The reason is that in order to complete an open, read or write operation you need to consider that the issue could be 33% work-flow or client side, 33% network and 33% server.
The most common issue is not describing the performance problem and its impact, and not setting reasoned expectations.
To give you a more specific answer: many folks do not factor in the cost of data protection or quota monitoring. Reducing data risk has a trade-off in performance.
On workflow, the most common area to address is ensuring that the client workflow scales out across all the OneFS nodes in the cluster, in order to leverage all of the CPU, network, memory and disk scale-out technology.
Performance monitoring: how do we view the utilization of write cache on an Isilon array, and do all the nodes aggregate the write cache, or can only the nodes that are part of a SmartConnect zone use write cache?
I will answer the second part of your question first. Smartcache is enabled by default for concurrent (default) and streaming optimized directories and files. Random optimization by default implicitly disables write cache coalescing.
As you perform buffered write operations over SMB, NFS or other protocols, the writes are coalesced in a write buffer. E.g. if there is a sequence of proto.nfs write operations at 32KB at adjacent offsets, we will buffer them up to 1MB or 2MB (concurrent / streaming optimized). This buffer is cluster-wide and cache-coherent between all nodes. At some point we write out the data, using both data protection and disk-space utilization as guides to which nodes are written to. The journaled writes leverage our NVRAM as part of the optimized write to disk storage.
Answering the first part of your question: a simple means to measure your buffered writes is to measure the latency seen for protocol write operations:
isi statistics protocol --class=write --orderby=timeavg --top
On OneFS 7.x you should see very fast writes, in the microsecond range. When this climbs to the millisecond range, the two simple reasons would be:
1) the journal cannot flush writes to disk based on rate of change. This is another way of saying that there are insufficient disks in the node pool to satisfy the demand.
isi statistics drive -nall --orderby=timeinq --long --top
You might note that the sum of OpsIn (writes) + OpsOut (reads) exceeds the normal range for the disk type. You would see > 1 queued I/O; the more queued, the stronger the case for increasing spindle count. Adding nodes almost immediately brings new disks into the fold.
2) the write work-flow is not buffering, meaning it is setting a direct-I/O flag. In this case NVRAM is still leveraged; however, the I/Os are no longer coalesced into more optimal operations.
A paper that talks to smartpools and node pools
There are sysctls that monitor write cache, but they are non-trivial to interpret. This is why I suggested a more simplified approach, i.e. you will more likely be spindle-challenged than NVRAM/write-cache challenged.
John, thanks a lot for the detailed overall picture.
> At some point we will write out the data using both data protection and disk space utilization as guides to which nodes are written too.
Does this mean, the busy or queue situations of the individual disks are not taken into account?
One more thing: The isi statistics xyz CLI tools are really great. An aspect I'm missing though (for isi statistics client in particular) is an "export" or "share" column to sort out where certain traffic (by user or remote client host for example) is actually going to on the file system. There is the very nice isi statistics heat, but it cannot be correlated with users or remote hosts on a busy cluster with many simultaneous workloads.
Any chance we will see per "export" or per "share" data with isi statistics in the future?
> MEMORY – isi_cache_stats -v
Another great tool, and this one actually new to me ;-)
I had been fiddling with isi statistics query before,
to get some insight into OneFS caching,
but found it hard to get a clear picture.
It seems that isi_cache_stats -v prints totals since startup,
and it is even more useful when monitoring live deltas at
regular intervals like 5s: isi_cache_stats -v 5
The one-line form appears more compact, but I don't get
the meaning of the actual numbers (after the first line with the totals):
Totals l1_data: r 3.1T 6% p 34T 73%, l1_meta: r 113T 98% p 70G 51%, l2_data: r 23T 13% p 117T 74%, l2_meta: r 13T 68% p 4.6T 99%
13/08/13 18:19:39 l1_data: r 4.7M 8% p 41M 78%, l1_meta: r 365M 99% p 48K 40%, l2_data: r 86M 54% p 70M 96%, l2_meta: r 3.1M 29% p 24K 100%
13/08/13 18:19:44 l1_data: r 5M 8% p 41M 79%, l1_meta: r 328M 99% p 96K 100%, l2_data: r 80M 56% p 60M 94%, l2_meta: r 2.2M 24% p 16K 100%
So l1_meta: r 365M in the second row would mean level1 reads, but I don't think we have 365M of those...
(isi statistics pstat says: 12521.38/s NFS3-Ops, and 15930.80/s disk IOPS at this time.)
Can you explain how to read these numbers? (All numbers in isi_cache_stats -v 5 appear reasonable.)
But the real questions are of course about the OneFS caching in general.
How can one see the cache usage for certain traffic (by user, client, operation/event, path,...)?
How is cache memory allocated or prioritized to Level 1/2 and data/metadata (four combinations)?
Could one check the cache ages separately for these four cache sections?
(similar to isi statistics query -snode.ifs.cache.oldest_page_age)
I ask this because we find that often large data transfers mainly
fill the cache without much benefit (only few % data hits later).
The node.ifs.cache.oldest_page_age goes down to 1 minute in such situations;
and it seems that this number also applies to the metadata cache.
I'd rather prefer to assign more memory to the metadata cache
(in the absence of SSD for metadata) to allow the metadata content
to last for 30 minutes or more, while the data cache is short anyway.
Does this make sense to you?
We monitor the drives for fail or stall rates.
disi -I hwhealth ls # will list ECC and STALL occurrences.
If a drive fails sufficiently it will be smartfailed at which point we will not target new writes. If the drive is healthy and eligible for a write based on protection level and free space, we will write to it.
sysctl efs.lbm.drive_space # local to each node, reports the total blocks and used block space.
NOTE: When you replace a failed drive it is important to allow MultiScan (Collect + AutoBalance) to run. It is ideal to have all drives balanced.
OneFS's disk scale-out model does a good job of leveraging all the disks in the system. If a drive has too many queued I/Os we will still queue to it. When you are in this state you should notice, from isi statistics drive -nall --long --orderby=timeinq | head -14, and then again with tail -14, that your top 10 and bottom 10 drives have a uniform degree of queued I/O.
isi statistics <sub-command> --help will show the supported options; --csv produces comma-separated output, which is the closest means to export data to Excel or another tool.
The latest version of InsightIQ, the fully-fledged performance and trend analysis tool, is another option to consider for reporting.
Peter, I will again tackle your questions both indirectly and directly. To begin:
isi_cache_stats is simply a means to determine your cache hit and prefetch rates and, from them, develop an understanding of your work-flow. Typical follow-ups are sizing memory, leveraging accelerator nodes for L1 in work-flows heavy on repeat reads, and of course sizing your spindle count to have adequate disks to satisfy your prefetch I/O goals and cache benefits.
Cache hits case: cache memory available is the resource. If you have a high hit rate (90+%), the translation is that your work-flow repeat-reads sections of a file. More memory in the cluster will help.
Prefetch hits case: direct benefit from file-optimization prefetch from disk storage. The two prefetch optimizations are CONCURRENCY and STREAMING (prefetching 9 and 200 blocks respectively). There is a cost here, and that cost is to your disk spindles. A low prefetch hit rate would suggest that your files are typically <1MB in size or that the work-flow is more small random I/O to files. Prefetching blocks where an application reads files from start to finish is of high benefit, in that subsequent reads are served from the already-warm memory cache.
L1 is a drop-behind read cache. If you were using a new A100 accelerator and performing a read-predominant workload, the idea would be to cache these blocks locally on the node so that we avoid the 50-150μsec L2 global lookup over InfiniBand, i.e. the cache is local to the protocol read operations.
L2 is global cache. If a protocol read operation occurs on node 2 and the block is not presently in cache, we may read it from node 3. At that point we have a global cache line for that block on node 3; subsequent reads of the same block from node 5 will benefit from the global cache on node 3. Internally, I/O operations to local data are called Local Block Management (lbm_), and when we need to look up data from an adjacent node the lookup uses Remote Block Management (rbm_) over IB at 50-150μsec latency.
Data and metadata: data is the actual data blocks that make up the file. Metadata is the inodes and internal B-trees used as part of OneFS metadata structures, i.e. directories are managed as a B-tree. To expand on metadata: as a file grows to a very large size, say 1TB, we need to allocate metadata blocks that contain pointers to the data blocks; in the 1TB case, ~1GB of metadata. Now, fitting 1TB in L2 data cache would be ridiculous, so we cache the metadata pointers; this way we know exactly how to read and prefetch the data. Ideally, with a large file you want a high rate of metadata prefetch hits.
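The ~1GB-of-metadata-per-1TB-file figure can be sanity-checked with a back-of-the-envelope sketch. The 8KB block size is OneFS's; the 8-bytes-per-block-pointer figure is an assumption chosen to illustrate the ratio, not a statement about the on-disk format:

```python
# Back-of-the-envelope: metadata needed to address every block of a large file.
# 8KB is the OneFS block size; 8 bytes per block pointer is an assumption.

def metadata_bytes(file_bytes, block_size=8192, ptr_size=8):
    """Bytes of block pointers needed to address every block of the file."""
    blocks = file_bytes // block_size
    return blocks * ptr_size

one_tb = 1024**4
print(metadata_bytes(one_tb) / 1024**3)  # 1.0 -- ~1GB of pointers per 1TB file
```

Whatever the exact pointer size, the ratio stays around 1:1000, which is why caching metadata instead of data is such a good trade for very large files.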
Answering your questions:
The isi_cache_stats tool is a wrapper around the sysctl isi.cache.stats. As you indicated, the data is collected from cluster uptime; the first row returned is the global total since uptime or the last reset.
You can run isi_cache_stats -z and then isi_cache_stats 5. This will clear the global stats and then start monitoring, in real time, the number of blocks read and the benefit you gain from them.
The non -v output of isi_cache_stats is just a summary of what you see with -v. The only real difference is that it shows totals in human-readable units rather than raw block counts.
BTW: another lightweight means to look at cluster-wide work-flow, alongside isi_cache_stats, is the per-node throughput and file-operation summary below. This gives you a lightweight way of generalizing your work-flow:
          |         | Throughput (bytes/s)  |    Files (#/s)
  ID      | Type    | Write   Read   Delete | Create Remove Lookup
  1       | storage |   52K   200K      0   |    0      0     68
  2       | storage |   112      0      0   |    0      0     53
  3       | storage |    88      0      0   |    0      0     82
  4       | storage |     0      0      0   |    0      0     67
  5       | storage |   14K      0   8.0K   |    0      0     59
  Totals: |         |   67K   200K   8.0K   |    0      0    331
BE WARNED: what follows isn't something that I recommend outside a test situation. You can flush all (read) cache from a node, or from all nodes, using:
isi_flush                    # flush a single node
isi_for_array -s isi_flush   # flush every node in the cluster
This will happily flush all the cache warmth for your work-flow. Use with care: you will impact the real-time cache performance benefit for all of your active work-flow clients.