February 22nd, 2016 05:00

tryLock() on Isilon over NFS hangs

Hello,

We (Zhendong.Li@emc.com) are running some performance tests for InfoArchive, which uses xDB.

If we install xDB on an NFS-mounted Isilon, it hangs when accessing the xDB bootstrap file. The situation is as follows:

  1. BootStrapFile on local disk => fine.
  2. BootStrapFile on Isilon (accessed through NFS) => readFile succeeds, but lockBootStrapFile hangs.
  3. BootStrapFile on RedHat NFS server (accessed through NFS) => fine.

Looking at the mount options for Isilon and RedHat we see this:

Isilon:

[dmadmin@apollo ~]$ nfsstat -m

/mnt/data from 10.108.1.11:/ifs/data/apollo/table

Flags: rw,relatime,vers=3,rsize=131072,wsize=524288,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=10.108.1.11,mountvers=3,mountport=300,mountproto=udp,local_lock=none,addr=10.108.1.11

RedHat NFS Server:

zhendong@cfadmin:~$ nfsstat -m

/datatest from 10.32.122.222:/data/

Flags: rw,relatime,vers=3,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=10.32.122.222,mountvers=3,mountport=900,mountproto=udp,local_lock=none,addr=10.32.122.222
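
To make the comparison of the two `nfsstat -m` outputs less error-prone, the flag strings can be diffed mechanically. The following is only a minimal sketch: the class name `MountFlagDiff` and its methods are made up for illustration, and the two flag strings are copied from the outputs above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper (not part of xDB or OneFS): diffs two NFS mount flag strings.
class MountFlagDiff {

  // Splits "rw,vers=3,..." into option -> value; bare flags such as "rw" map to "".
  static Map<String, String> parse(String flags) {
    Map<String, String> options = new LinkedHashMap<>();
    for (String part : flags.trim().split(",")) {
      int eq = part.indexOf('=');
      if (eq < 0) {
        options.put(part, "");
      } else {
        options.put(part.substring(0, eq), part.substring(eq + 1));
      }
    }
    return options;
  }

  // Returns option -> "left vs right" for every option that differs or exists on one side only.
  static Map<String, String> diff(String left, String right) {
    Map<String, String> l = parse(left);
    Map<String, String> r = parse(right);
    Map<String, String> changed = new LinkedHashMap<>();
    for (Map.Entry<String, String> e : l.entrySet()) {
      String other = r.get(e.getKey());
      if (!e.getValue().equals(other)) {
        changed.put(e.getKey(), e.getValue() + " vs " + (other == null ? "(absent)" : other));
      }
    }
    for (Map.Entry<String, String> e : r.entrySet()) {
      if (!l.containsKey(e.getKey())) {
        changed.put(e.getKey(), "(absent) vs " + e.getValue());
      }
    }
    return changed;
  }

  public static void main(String[] args) {
    String isilon = "rw,relatime,vers=3,rsize=131072,wsize=524288,namlen=255,hard,proto=tcp,"
        + "timeo=600,retrans=2,sec=sys,mountaddr=10.108.1.11,mountvers=3,mountport=300,"
        + "mountproto=udp,local_lock=none,addr=10.108.1.11";
    String redhat = "rw,relatime,vers=3,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,"
        + "timeo=600,retrans=2,sec=sys,mountaddr=10.32.122.222,mountvers=3,mountport=900,"
        + "mountproto=udp,local_lock=none,addr=10.32.122.222";
    diff(isilon, redhat).forEach((k, v) -> System.out.println(k + ": " + v));
  }
}
```

Apart from the server addresses, only rsize, wsize and mountport differ. The options that govern locking (vers=3, local_lock=none) are identical on both mounts, so the mount options alone do not explain the different behaviour.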

We have narrowed this down to the tryLock() call in xDB. One of the xDB engineers (philip.arickx@emc.com) looked at this further and reported the following:

I ran the small TryLock program, provided to us by the performance team, on our Isilon VM. It simply tries to take a lock on a file using tryLock(). The code is appended below.

Indeed, as they report, when the lock is taken on a file on the local filesystem, it succeeds. If the target file is on the Isilon mount, the tryLock() call hangs.

The javadoc for FileChannel says :

“This method does not block. An invocation always returns immediately, either having acquired a lock on the requested region or having failed to do so. If it fails to acquire a lock because an overlapping lock is held by another program then it returns null. If it fails to acquire a lock for any other reason then an appropriate exception is thrown.”

The closest I could find on Google for now is https://bugs.openjdk.java.net/browse/JDK-8065927, which concludes with “I wonder if the hang is just a timeout trying to connect to the NFS locking daemon? Is this running on the serve that is exporting the home directory? I doubt very much that we have a JDK issue here, this seems to something that at the lower level or in the configuration that is causing a syscall to hang.”

http://arstechnica.com/civis/viewtopic.php?t=1177811 is also about NFS locks hanging on Isilon, but isn’t very helpful in general.

I do see timeouts in /var/log/messages : “Feb 19 07:49:45 isilon kernel: [8839693.795649] lockd: server 10.108.1.11 not responding, still trying”

In xDB we have 18 usage instances of tryLock in 10 classes.

I tried to mount with NFSv4 (instead of NFSv3), but that isn’t supported on the Isilon (at least in the current configuration, over which I have no control).

I also tried some variations: shared lock versus exclusive lock, size = 0L instead of MAX_VALUE, and an "rws" RandomAccessFile (trying to take an exclusive lock on a non-writable channel fails with java.nio.channels.NonWritableChannelException).

Bottom line for now:

  • This is probably an NFS-on-Isilon issue somehow, but we’ll need to dig into Isilon and NFS client configuration a bit to figure out whether it can be resolved at all.
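
Until the root cause is found, a caller can at least avoid blocking forever by running tryLock() on a worker thread and giving up after a timeout. This is only a sketch of that idea, not how xDB does it: the class `LockProbe`, its `Result` enum and the 10-second timeout are made up for illustration. It does not un-hang the underlying system call; it only stops the caller from waiting on it.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper, not part of xDB: bounds how long the caller waits on tryLock().
class LockProbe {

  enum Result { ACQUIRED, HELD_BY_OTHER, TIMED_OUT }

  static Result probe(Path file, long timeoutMillis) throws IOException, InterruptedException {
    // Daemon thread, so a tryLock() stuck in the kernel cannot keep the JVM alive.
    ExecutorService worker = Executors.newSingleThreadExecutor(r -> {
      Thread t = new Thread(r, "lock-probe");
      t.setDaemon(true);
      return t;
    });
    try {
      Future<Result> attempt = worker.submit(() -> {
        try (FileChannel ch = FileChannel.open(file,
            StandardOpenOption.READ, StandardOpenOption.WRITE)) {
          FileLock lock = ch.tryLock(); // exclusive lock on the whole file
          if (lock == null) {
            return Result.HELD_BY_OTHER; // another program holds an overlapping lock
          }
          lock.release();
          return Result.ACQUIRED;
        }
      });
      return attempt.get(timeoutMillis, TimeUnit.MILLISECONDS);
    } catch (TimeoutException e) {
      return Result.TIMED_OUT; // the worker may still be blocked; we only stop waiting
    } catch (ExecutionException e) {
      if (e.getCause() instanceof IOException) {
        throw (IOException) e.getCause();
      }
      throw new IllegalStateException(e.getCause());
    } finally {
      worker.shutdownNow();
    }
  }

  public static void main(String[] args) throws Exception {
    // Assumed timeout of 10 seconds; adjust to taste.
    System.out.println(probe(Paths.get(args[0]), 10_000));
  }
}
```

On a healthy filesystem this should report ACQUIRED almost immediately. A TIMED_OUT result on the Isilon NFSv3 mount would match the hang described above.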

Does this ring a bell, or should I open a ticket for this (if so, where)?

Is it possible that a firewall blocks a port that Isilon needs (if so which ports does Isilon need)?
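
As a first check of the firewall theory, basic TCP reachability of the server can be tested from the NFS client. The sketch below is an assumption-laden starting point: the class `PortCheck` is made up, and it only probes the standard rpcbind (111) and NFS (2049) ports. With NFSv3, locking goes through the separate NLM (lockd) service, whose port is assigned by the server and can be listed with `rpcinfo -p <server>`; that port would have to be added by hand. UDP services (the mounts above use mountproto=udp) are not covered by a TCP connect test.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical helper: checks whether a TCP connection to host:port can be established.
class PortCheck {

  static boolean isTcpPortOpen(String host, int port, int timeoutMillis) {
    try (Socket socket = new Socket()) {
      socket.connect(new InetSocketAddress(host, port), timeoutMillis);
      return true;
    } catch (IOException e) {
      // Refused, filtered (timeout) or unreachable: all count as "not open" here.
      return false;
    }
  }

  public static void main(String[] args) {
    String host = args[0];
    // 111 = rpcbind/portmapper, 2049 = NFS; add the lockd port reported by rpcinfo.
    int[] ports = {111, 2049};
    for (int port : ports) {
      System.out.println(host + ":" + port + " open=" + isTcpPortOpen(host, port, 3000));
    }
  }
}
```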

Any help is appreciated, thanks.

Michiel

////////////////////////////////////////

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;

public class TryLock {

  // Prints the file line by line; used to confirm that plain reads work on the mount.
  public void readFile(String file) {
    System.out.println("File content:");
    // try-with-resources closes the reader even if opening or reading fails
    try (BufferedReader bReader = new BufferedReader(new FileReader(file))) {
      String line = bReader.readLine();
      while (line != null) {
        System.out.println(line);
        line = bReader.readLine();
      }
    } catch (IOException e) {
      e.printStackTrace();
    }
  }

  // Tries up to three times to lock the bootstrap file with tryLock().
  public void lockBootStrapFile(String file) {
    File bsFile = new File(file);
    // An exclusive lock needs a writable channel, hence "rws" rather than "r".
//    try (RandomAccessFile bootstrapFile = new RandomAccessFile(bsFile, "r")) {
    try (RandomAccessFile bootstrapFile = new RandomAccessFile(bsFile, "rws")) {
      FileChannel ch = bootstrapFile.getChannel();
      for (int i = 0; i < 3; i++) {
        System.out.println("Trying to acquire the lock");
        // Variants tried: shared (true) vs. exclusive (false), size Long.MAX_VALUE vs. 0L
//        FileLock bsLock = ch.tryLock(0L, Long.MAX_VALUE, true);
//        FileLock bsLock = ch.tryLock(0L, Long.MAX_VALUE, false);
//        FileLock bsLock = ch.tryLock(0L, 0L, true);
        FileLock bsLock = ch.tryLock(0L, 0L, false);
        if (bsLock == null) {
          System.out.println("Sleep 5 seconds to acquire the lock again");
          Thread.sleep(5000);
        } else {
          // The active variant passes shared=false, so this is an exclusive lock.
          System.out.println("Acquired an exclusive lock on file " + bsFile.getAbsolutePath());
          System.out.println("Lock will be released 3 seconds later");
          Thread.sleep(3000);
          bsLock.release();
          break;
        }
      }
    } catch (IOException e) {
      e.printStackTrace();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      e.printStackTrace();
    }
  }

  public static void main(String[] args) {
    System.out.println("Input file path: " + args[0]);
    TryLock tryLock = new TryLock();
    // tryLock.readFile(args[0]);
    tryLock.lockBootStrapFile(args[0]);
  }
}

11 Posts

February 23rd, 2016 01:00

The Isilon is running 7.2.0.1.

In the release notes I found a fix in 7.2.0.4:

"If an NFS client attempted to send an NLM asynchronous request to lock a file and received an error in response to the request, a socket was opened but was not closed. Over time, it was possible for the maximum number of open sockets to be reached. If this occurred, processes could not open new sockets on the affected node. As a result, affected nodes might have been slow to respond to file lock requests, or lock requests sent to an affected node might have timed out. If lock requests timed out, NFS clients could have been prevented from accessing files or applications on the cluster."

I wonder if that might be the issue. I'll try to get the Isilon upgraded to 7.2.0.5.

Also: I just enabled NFSv4 on the Isilon. The locking works just fine when mounting over NFSv4.

Oh, and the log message is from the VM that acts as the NFS client; I (confusingly, I guess) set its hostname to isilon, which causes the message to be logged as "isilon kernel"...

Philip

1.2K Posts

February 22nd, 2016 06:00

OneFS ports are listed here: Ports used by OneFS

The log message looks kind of suspicious to me -- is that a prehistoric OneFS?

The test code runs fine with OneFS 7.2.0.4.

-- Peter

1.2K Posts

February 23rd, 2016 01:00

> to be logged as "isilon kernel"...

You can also try the Isilon simulator (aka virtual nodes) with 7.2.0.5 first, ideally with and without the firewall.

-- Peter

11 Posts

April 18th, 2016 01:00

In the meantime, Isilon OneFS has been upgraded to version 8.0.0. I ran the tests again, with the same results.

That is, with NFSv4 I see no locking issues, but the test still hangs if NFSv3 is used.

Cheers

Philip
