Peter_Sero
4 Operator
•
1.2K Posts
0
October 8th, 2013 07:00
Hello Gary,
It seems the writes are smaller than 4K already, and I wouldn't restrict the wsize
to a value lower than what the client likes to send in one op.
(Divide the In rate by the Ops rate in your example, or use the --long option
to see InAvg/InMin/InMax, i.e. the distribution of actual write sizes in B or KB).
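As a quick back-of-the-envelope check, the division looks like this (the numbers below are hypothetical; substitute the In and Ops values from your own "isi statistics" output for the MySQL connection):

```shell
# Hypothetical sample values taken from an "isi statistics" interval:
IN_RATE=2500000                # inbound bytes per second for the connection
OPS_RATE=1200                  # write ops per second

echo $((IN_RATE / OPS_RATE))   # average bytes per write op; prints 2083
```

Anything well under 4096 here means a smaller wsize would not change what the coalescer sees.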
Unless you put the real DB data files on SSD, the flash will (only) help
in navigating to the data blocks in the file. As you do many updates
to existing blocks, the accessed file layout information might be held largely in
the RAM cache anyway, and you probably wouldn't see much improvement.
But it could help in principle, and would never hurt in any case.
Snapshots can hurt with random updates, as copy-on-write
might be chosen here. If you have Snapshots, delete them, or
run the test on fresh copies of the DB files which are not covered by Snapshots.
Forgot to mention that the coalescer in 7.0 had been improved for latency in 2012.
But whether it can do wonders where there is nothing to coalesce in the end
due to heavy scattering? It will most likely behave differently, and probably not worse than
the 6.5 coalescer for your case. In 7.1 more changes have been made, as Jim just
explained in the context of many-small-files within the ongoing Ask The Expert discussion on 7.1.
Cheers
-- Peter
Peter_Sero
4 Operator
•
1.2K Posts
2
October 8th, 2013 01:00
The locks shown by "isi statistics heat" are OneFS-internal locks taken when a node updates a file,
rather than application/protocol lock operations. You'll see plenty of them
with random IO and small write blocks.
It's difficult to get the full picture from the statistics excerpts,
but I would guess the NFS client is simply filling the node's NVRAM,
while the OneFS write coalescer hopes it can do good work
in the end (i.e. to coalesce many small writes into fewer and larger
physical writes). However, when the NVRAM is full and it is time to
write the data to the disks, the write chunks are too fragmented
overall, and a large number of small disk transfers need to be made at once.
At that point the NVRAM cannot sustain a high rate of new writes.
So the intermittent phases of slow writes would correlate
with filling up and flushing the NVRAM.
In mixed loads with most of the clients doing streaming writes,
this effect doesn't become so prominent. The random IO clients
would just see so-so performance all the time.
A simple test can be to add some streaming write load
to the node from another client...
You can also try to observe the following, in 2-second intervals and all simultaneously:
- isi statistics client (for exactly the MySQL connection)
- isi statistics drive --long (check out sorting by OpsIn or TimeInQ or Queued)
- sysctl efs.bam.coalescer_stats (many things going on here; you will see patterns in time for sure)
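A small helper can drive the sampling; run one instance per command, each in its own terminal on the node (the counts and interval below are illustrative; the isi/sysctl commands are the ones listed above):

```shell
# sample_every COUNT INTERVAL CMD...: run CMD COUNT times, INTERVAL seconds apart
sample_every() {
    local count=$1 interval=$2 i
    shift 2
    for i in $(seq 1 "$count"); do
        date                       # timestamp so the outputs can be correlated
        "$@"
        sleep "$interval"
    done
}

# On the cluster node, e.g. (one per terminal):
#   sample_every 30 2 isi statistics client
#   sample_every 30 2 isi statistics drive --long
#   sample_every 30 2 sysctl efs.bam.coalescer_stats
```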
A couple of further thoughts:
- Try switching off the coalescer: either disable SmartCaching in SmartPools
or check whether DIRECTIO can be used by MySQL
- The access pattern for the MySQL file should be set to RANDOM
- Different database/table engines available for MySQL
can show different write patterns, and hence different coalescing behavior in OneFS
- The same will be true for the acclaimed drop-in MySQL replacement MariaDB
with its further options for tables and (application-side) caching.
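If DIRECTIO turns out to be usable, the application side would look roughly like this for InnoDB tables (a sketch, assuming the tables are InnoDB; the option is standard MySQL, but verify that it is honored over NFS on your version):

```ini
# my.cnf excerpt (hypothetical): open InnoDB data files with O_DIRECT,
# so data writes bypass the client page cache and reach the filer directly
[mysqld]
innodb_flush_method = O_DIRECT
```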
Let us know what you find, good luck!
-- Peter
flyingkiwiguy
2 Posts
0
October 8th, 2013 03:00
Thanks Peter for the fast and informative reply.
Would a small (e.g. 4K) wsize NFS3 client option make the coalescer work harder? I'm assuming adding SSDs to the nodes will allow the NVRAM cache to be flushed faster and thus better buffer random NFS writes?
This is an NFS environment I have inherited, and I'm in the process of reverse engineering how it was (mis)constructed. I've determined there are some Citrix virtual disks mounted off the Isilon cluster that are generating as many LIN locks as the MySQL DB is, if not more.
Regards,
Gary