Re: Ask the Expert: Isilon Performance Analysis

This ask the expert is now open for discussions and questions! We are looking forward to an interesting discussion!

The following is from John, who asked me to post it on his behalf. I am not sure all of his formatting or ascii art will survive so I have uploaded it as a document as well.

Document Link below.

Welcome to the Ask the Expert thread on EMC OneFS performance.

Performance Expectation

Let’s first address the “it’s slow” problem in performance engineering terms. I will start with a simple model that allows you to break down latency. As with any other storage solution you have dealt with, the first step is to break down and understand the latencies, blockers and serializers in the architecture from client to server. As we address points or questions raised over the next few days I will reference the model below. We will build tips, techniques and a simplified methodology that will help us identify and set performance expectations.


                              SYSTEM     (TT1)


       /  T1----READ----->   |HW (IRQ) |+--->Rx ( Rx RTT1 )
DISK  +                      |KERNEL   (SYS) |

       \ <---WRITE-----T1   |SERVICES (USER)|+<---Tx ( Tx RTT1 )

                             |APPS     (USER)|


Alright, not the world’s prettiest ASCII art, but no special application is needed either ‘-).

In the above

T1     represents the response time for Reads or Writes to local storage.

TT1   represents the Think Time processing from HW/KERNEL/APP layers.

RTT1 represents the TCP round trip time

Simple MATH: Computing READ (non-cached). Scenario: you want to copy a file from local client storage to a server.

( READ ( T1 + TT1 )) = Client side latency.

Through-put per second can be expressed as:

(1 second / ( READ ( T1 + TT1 ))) * IO SIZE = expected throughput of client

If the latency on disk is 6ms and the think time from application through kernel to HW is 1ms:

(1 second / ( 6ms + 1ms )) * 32KB = 4571.43 KBps
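The client side formula above can be sketched in a few lines of Python. The function name and its parameters are my own, chosen for illustration, and are not part of any OneFS tooling. Latencies are in milliseconds, the I/O size is in KB, and the model assumes one outstanding I/O at a time.

```python
def throughput_kbps(latencies_ms, io_size_kb):
    """Expected throughput in KB/s with one outstanding I/O at a time.

    latencies_ms: per-I/O latencies in milliseconds that add up serially,
                  e.g. disk time T1 and think time TT1.
    io_size_kb:   size of each I/O in KB.
    """
    total_ms = sum(latencies_ms)
    if total_ms <= 0:
        raise ValueError("total latency must be positive")
    ios_per_second = 1000.0 / total_ms  # how many I/Os fit in one second
    return ios_per_second * io_size_kb


# Client read from the example: T1 = 6ms disk, TT1 = 1ms think time, 32KB I/O.
client_read = throughput_kbps([6.0, 1.0], 32)
print(f"{client_read:.2f} KBps")  # 4571.43 KBps
```

Changing either latency in the list shows quickly which term dominates the result.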


NETWORK MODEL                 Network



       /  T2----Rx READ----> |VLAN           |+--->Rx ( Rx RTT2 )
TCP   +                      |SPANNING TREE  |

       \ <---Tx WRITE----T2 |Rate Limiting  |+<---Tx ( Tx RTT2 )

                             |QOS            |

                             |LACP           |


                                  V     ^

                                  |     |

                                  R     T

                                  x     x


                                  ( RTT2 )

From the above I want to introduce the network influencers on bandwidth delay product (BDP). The OneFS file-server protocols (SMB, NFS, HTTP, FTP, …) are all TCP/IP based. The influencers on network through-put include whether TCP window scaling is enabled and whether selective acknowledgement is enabled. Both of these TCP options are negotiated when the socket connection is established between client and server, and they only take effect where the physical network environment allows them.
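To make BDP concrete, here is a minimal Python sketch. The 10 Gb/s link speed and 0.2ms round trip time are assumed values for illustration only, and the function names are my own. The 65,535-byte figure is the largest window TCP can advertise when window scaling is not in use.

```python
def bdp_bytes(bandwidth_bits_per_s, rtt_ms):
    """Bandwidth delay product: bytes that must be in flight to fill the link."""
    return bandwidth_bits_per_s / 8.0 * (rtt_ms / 1000.0)


def window_limited_bytes_per_s(window_bytes, rtt_ms):
    """Upper bound for one TCP connection: at most one window per round trip."""
    return window_bytes / (rtt_ms / 1000.0)


LINK_BPS = 10e9          # assumed 10 Gb/s link (illustrative)
RTT_MS = 0.2             # assumed round trip time in ms (illustrative)
UNSCALED_WINDOW = 65535  # largest TCP window without window scaling

needed = bdp_bytes(LINK_BPS, RTT_MS)
ceiling = window_limited_bytes_per_s(UNSCALED_WINDOW, RTT_MS)

print(f"BDP: {needed:.0f} bytes in flight to fill the link")
print(f"Without window scaling: at most {ceiling / 1e6:.1f} MB/s per connection")
```

With these assumed numbers the link needs about 250,000 bytes in flight, nearly four times what an unscaled window allows, which is why window scaling matters for a single fast connection.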

QOS, if enabled from client to server, may limit the available bandwidth, for example by 20% where that 20% is reserved for services other than TCP/IP. VLAN and LACP affect the routing of packets between the client and the server. When packets do not arrive in order, the TCP protocol layer needs to re-assemble them. This re-assembly adds overhead that can be expressed as latency, typically in microseconds (e.g. 500 microseconds is .5 milliseconds). Handling out-of-order packets can be a significant factor in achieving high-end network performance.

netstat -s  # From the command line on most operating systems, this will give you an indication of out-of-order packets or SACK (selective acknowledgement) being utilized. If you see SACK recoveries, this is also a clue to how LOSSY the network is. (LOSSY = packets that are DROPPED or LOST)

wireshark  # Wireshark, or the command-line tshark, breaks down packet captures,

tshark     # allowing you to see out-of-order packets, SACK recovery episodes, spanning tree, …



    |       ^                  |       ^               |       ^
    |       |                  |       |               |       |
    Rx      Tx                 Rx      Tx              Rx      Tx
    |       |                  |       |               |       |
    V       |                  V       |               V       |
  +-+-------+-----+          +-+-------+-----+       +-+-------+----+
  | Node 1        |          | Node 2        |       | Node 3       |
  +---------------+          +---------------+       +--------------+
  |HW             |          |HW             |       |HW            |
  |KERNEL         | IB (TT3) |KERNEL         |IB(TT3)|KERNEL        |
  |ONEFS          +<-------->+ONEFS          +<----->+ONEFS         |
  |SERVICES       |          |SERVICES       |       |SERVICES      |
  +---------------+          +---------------+       +--------------+
  | DISKS (T3)    |          | DISKS (T3)    |       | DISKS (T3)   |
  +---------------+          +---------------+       +--------------+

All OneFS nodes can receive client traffic. SMARTCONNECT is the DNS delegation server, which will offer and then bind a given client end point to a server end point. In the above diagram the takeaway is that TCP/IP packets from a client will arrive at one node. Each node manages a fraction of the drives for the entire cluster. In the above, 2/3rds of the disk I/O needed for network requests arriving on node 2 will be satisfied over IB (InfiniBand) from nodes 1 and 3. The point is that NETWORK resources are entirely managed on the node that receives the request, whereas the DISK I/O is leveraged in a scale-out model across all of the nodes.
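The 2/3rds figure generalizes if you assume data is spread evenly across all nodes, which is a simplification of the real on-disk layout. A small Python sketch under that assumption (the function name is my own):

```python
from fractions import Fraction


def remote_io_fraction(node_count):
    """Share of disk I/O served by other nodes over the back-end network.

    Assumes data is spread evenly across all nodes, so the node that
    received the request holds 1/N of the data locally.
    """
    if node_count < 1:
        raise ValueError("a cluster needs at least one node")
    return Fraction(node_count - 1, node_count)


for nodes in (3, 4, 8):
    print(f"{nodes} nodes: {remote_io_fraction(nodes)} of disk I/O comes over IB")
```

As the cluster grows, nearly all disk I/O for a request is served by other nodes, so the back-end latency (TT3) applies to most operations.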

    TT3 = IB latency is typically in the 0.050ms range, very low latency. I am also throwing the cost of services into this latency bucket.

    T3 = DISK LATENCY for a 7200 RPM SATA drive is 5ms; for a 10K RPM SAS drive it is 3ms.

Simple MATH: Computing WRITE (non-endurant cache). Scenario: you want to copy a file to a OneFS server.

( WRITE ( T3 + TT3 )) = Server side latency.

Through-put per second can be expressed as:

(1 second / ( WRITE ( T3 + TT3 ))) * IO SIZE = expected throughput of server

Copying a file:

If we take the simple MATH from the client READ and the OneFS server WRITE we end up with

( 1 second / ( READ ( T1 + TT1 ) + NETWORK COST ( RTT2 ) + WRITE ( T3 + TT3 ) ) ) * IO SIZE

The above doesn’t factor in BDP properly, but as an illustration for diagnosing “it’s slow” it’s important to be able to measure and identify where the lion’s share of the latency is, e.g.

If the sum of READ latency is 30ms, Network Cost is .2ms and WRITE latency is 3ms and the IO SIZE is 32KB, your expected throughput would be

((1 second) / ((30 + .2 + 3) * milliseconds)) * 32 KB = 963.855422 KBps

Clearly the above fits the “it’s slow” description. Please modify the latency figures and see how this affects through-put.

MEMORY CACHE is a good thing. Memory access for cached files is in the microsecond range as it translates to file-server protocols. This is how we can achieve high-end through-put. A cached read on a client will likely be 0.1ms, the network latency will be .2ms and the OneFS write cache will be in the .1ms range. This is how OneFS and file-servers can achieve as high or higher performance on scale-out than traditional block storage.
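To compare the uncached and cached cases side by side, here is a minimal Python sketch of the copy model using the latency figures quoted in this post. The function and variable names are my own, and like the simple MATH it ignores BDP and assumes one I/O in flight at a time.

```python
def copy_throughput_kbps(read_ms, network_ms, write_ms, io_size_kb):
    """Expected copy throughput in KB/s when each I/O pays all three latencies."""
    total_ms = read_ms + network_ms + write_ms
    return 1000.0 / total_ms * io_size_kb


# (read latency, network cost, write latency) in milliseconds, from the text.
scenarios = {
    "uncached": (30.0, 0.2, 3.0),
    "cached": (0.1, 0.2, 0.1),
}

results = {
    name: copy_throughput_kbps(*latencies, io_size_kb=32)
    for name, latencies in scenarios.items()
}

for name, kbps in results.items():
    print(f"{name:>9}: {kbps:,.1f} KBps")
```

Once both ends are served from cache, the network cost becomes the largest single term, which is where the BDP discussion comes back in.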


Measurement Commands on OneFS

  • CPU       - top, isi statistics system --nodes --top
  • NET       - isi statistics protocol --top --orderby=timeavg
                ping, iperf
  • DISK      - isi statistics drive -nall --top --long
  • MEMORY    - isi_cache_stats -v
  • EXTERNAL latencies - anti-virus, DNS, AD, LDAP, NIS

OneFS JOBs (overhead)

OneFS maintains both the protection of your data and the balance of data between nodes and disks. This work is done by the job engine:

isi job status -v  # will show you active running jobs or pending

isi job sched # will show you jobs scheduled to run at certain times

isi job list   # will list all of the types of jobs that can run

When jobs run they have a default impact level. The impact level is sized to the number of disks per node. There are three impact levels: LOW, MEDIUM and HIGH. On a low setting the impact on the disks while a job is running will be <= 5%, on medium <= 20%, and on high as much as 40%.

OneFS jobs need to run on the system to maintain balance and repair data layout as a result of drive failures. I mention it here to note that when jobs are running there should be an expected hit in performance in that there is more OneFS activity on the disk drives.