Problems with NFS and flock (PHP)

Question

Hi,

I know that flock on NFS isn't the best combination, but some of our customers are experiencing problems I've never seen before. Their site is written in PHP, and they use an adodb library that caches database results. The library uses flock() (http://www.php.net/manual/en/function.flock.php) to lock these cache files. Under some circumstances, the Apache process that handles the request hangs indefinitely, waiting for the flock to return. I haven't been able to reproduce this behaviour on demand, but when the caching mechanism of the adodb library is enabled, it is bound to happen.

We recently transferred the website in question to our NS20, and that is when the problems started. There are no problems when I transfer the website back to its old storage, a self-installed Linux NFS server.

Settings on the NS20 (mounts and exports):

[nasadmin@ns20 nasadmin]$ server_mount server_2
server_2 :
[..]
home1-c1 on /home1-c1 uxfs,perm,rw

[nasadmin@ns20 nasadmin]$ server_export server_2 -list /home1-c1
server_2 :
export '/home1-c1' root=10.1.1.0/24

Settings on clients (Linux):

root@web1.c1 ~ # mount
[..]
home1.c1:/home1-c1 on /home/users20 type nfs (rw,nosuid,bg,intr,hard,rsize=32768,wsize=32768,nfsvers=3,addr=10.1.1.201)

Do these problems seem familiar to anyone? Are there specific mount or export options I should or could use?

Rainer_EMC · Answer

Hmm, sounds very much like a Linux bug between that adodb library and Linux NFS.

With Linux in the beginning NFS was implemented a bit 'loose' in terms of features/interoperability and IIRC file locking came later.

If you want to test file locking itself you can use the lock tests from the Connectathon test suite: http://www.connectathon.org/nfstests.html

What I would try:
- ask in PHP and Linux forums
- update to the latest versions of Linux, NFS, PHP
- see if there are any Linux or PHP options to change the locking

You could also go troubleshooting and take a tcpdump - if you catch it in the act you can at least find out if that flock never gets sent on the wire or if there is no return.

If you don't need shared access, an alternative might be using iSCSI - there you put a native file system on the iSCSI LUN and get native locking behaviour.

What DART version are you using? It's always worth a try looking at the release notes to see if something in that area was fixed in newer code.

Rainer

P.S.: your mount options look fine

Gertjan_OL · Answer

Thanks for your reply. Comments are inline.

sounds very much like a Linux bug between that adodb
library and Linux NFS.

With Linux in the beginning NFS was implemented a bit
"loose" in terms of features/interoperability and
IIRC file locking came later.

That's true, Linux' NFS implementation is not known to be the best around. However, our problem lies with locking on the NS20 now, not on Linux. As far as I know, the NFS client on Linux is not known to be buggy.

If you want to test file locking itself you can use
the lock tests from the Connectathon test suite
http://www.connectathon.org/nfstests.html

I tried those tests during the testing period of our NS20, a few months ago. They didn't show any problems.

You could also go troubleshooting and take a tcpdump
- if you catch it in the act you can at least find
out if that flock never gets sent on the wire or if
there is no return.

I was afraid you were going to say that. Ah well, the first step is to create a situation in which the error is easy to reproduce.
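
A reproduction attempt could look like the minimal sketch below. It is written in Python rather than PHP purely for illustration, on the assumption that PHP's flock() ends up in the same flock() system call on Linux (where, since kernel 2.6.12, the NFS client emulates flock() with whole-file byte-range locks that are sent to the server). The names lock_cycle and run_contention, the worker count, and the timeout are all hypothetical and not taken from the adodb library.

```python
import fcntl
import multiprocessing
import time


def lock_cycle(path, iterations, hold_seconds):
    """Repeatedly take and release a blocking exclusive flock() on path."""
    for _ in range(iterations):
        with open(path, "a") as handle:
            fcntl.flock(handle, fcntl.LOCK_EX)  # blocks until the lock is granted
            time.sleep(hold_seconds)            # pretend to write a cache file
            fcntl.flock(handle, fcntl.LOCK_UN)


def run_contention(path, workers=4, iterations=50, hold_seconds=0.001, timeout=60.0):
    """Let several processes fight over one lock file.

    Returns the PIDs of workers that had not finished when the timeout
    expired, which is the symptom described in this thread.
    """
    # The "fork" start method is requested explicitly so that the sketch
    # behaves the same regardless of the interpreter's default.
    ctx = multiprocessing.get_context("fork")
    procs = [
        ctx.Process(target=lock_cycle, args=(path, iterations, hold_seconds))
        for _ in range(workers)
    ]
    for proc in procs:
        proc.start()

    deadline = time.monotonic() + timeout
    stuck = []
    for proc in procs:
        proc.join(max(0.0, deadline - time.monotonic()))
        if proc.is_alive():
            stuck.append(proc.pid)
            # Assumption: the mount is interruptible (intr), so SIGTERM can
            # end a waiting worker. On an uninterruptible wait it may linger.
            proc.terminate()
    return stuck
```

Running run_contention() at the same time on two or more web servers, against a scratch file somewhere under the NFS mount (for example a hypothetical /home/users20/locktest.tmp), would come closest to the load-balanced setup. A non-empty return value means at least one worker was still waiting for its lock when the timeout expired.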

If you don't need shared access an alternative might
be using iSCSI - there you put a native file system
on the iSCSI LUN and get native locking behaviour

We need shared access: we use a bunch of web servers behind load balancers.

What DART version are you using ?

[nasadmin@ns20 nasadmin]$ server_version server_2
server_2 : Product: EMC Celerra File Server Version: T5.5.32.4

It's always worth a try looking at the release notes
if something in that area was fixed in a newer code.

I'll check them out.

P.S.: your mount options look fine

Thanks.

Rainer_EMC · Answer

That's true, Linux' NFS implementation is not known
to be the best around. However, our problem lies
with locking on the NS20 now, not on Linux. As far as
I know, the NFS-client on Linux is not known to be
buggy.

Well, NFS locking is tricky to get right. Not every detail is documented in the RFCs.
My experience is that most established vendors tested against the Sun reference implementation for interoperability in grey areas, whereas Linux was mostly tested against Linux.

I've tried those tests during the testing period of our
NS20, a few months ago. They didn't show any
problems.

OK - knowing about Connectathon does make you an advanced user.

I was afraid you were going to say that. Ah well,
first step is to create a situation to easily
reproduce the error.

I know - and the tricky thing is how to trigger on the offending packets so that you don't have to capture and go through GBs of data.
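
One way to keep the capture small, sketched here under assumptions: restrict tcpdump to the lock manager traffic and let it write a ring of fixed-size files, so the capture can run until the hang shows up. The helper below parses `rpcinfo -p` output to find the ports registered for nlockmgr (RPC program 100021) and builds the tcpdump argument list. The function names, the interface name eth0, the file prefix, and the sizes are hypothetical. Since the server can grant a blocked lock via a callback to the client's own lock manager, the port list probably needs the nlockmgr ports of both the NS20 and the web server.

```python
NLOCKMGR_PROGRAM = 100021  # ONC RPC program number of the NFS lock manager


def nlockmgr_ports(rpcinfo_output):
    """Return the sorted, de-duplicated ports that `rpcinfo -p` lists for nlockmgr."""
    ports = set()
    for line in rpcinfo_output.splitlines():
        fields = line.split()
        # Expected columns: program, version, protocol, port, service name
        if len(fields) >= 4 and fields[0] == str(NLOCKMGR_PROGRAM):
            ports.add(int(fields[3]))
    return sorted(ports)


def tcpdump_command(server, ports, interface="eth0", prefix="nlm",
                    file_mb=50, file_count=10):
    """Build a tcpdump argument list that captures only lock traffic.

    -C and -W together make tcpdump rotate through file_count files of
    roughly file_mb million bytes each, overwriting the oldest one.
    """
    if not ports:
        raise ValueError("no nlockmgr ports given")
    port_filter = " or ".join(f"port {port}" for port in ports)
    capture_filter = f"host {server} and ({port_filter})"
    return [
        "tcpdump", "-i", interface, "-s", "0",
        "-C", str(file_mb), "-W", str(file_count),
        "-w", f"{prefix}.pcap",
        capture_filter,
    ]
```

With the output of `rpcinfo -p 10.1.1.201` (the server address from the mount line) passed to nlockmgr_ports(), the resulting list can be handed to subprocess.run() as root on one of the web servers. Once an Apache process hangs, only the most recent files in the ring need to be inspected.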

[nasadmin@ns20 nasadmin]$ server_version server_2
server_2 : Product: EMC Celerra File Server Version:
T5.5.32.4

OK, that's fairly recent.

Of course there is always the chance of a bug - however, the Celerra lockd code is quite stable. The last bug fix I could find was for HP-UX clients in 5.5.30.
I would say the chances of it being in the Linux NFS client code are greater.

I would still open a service request with EMC support - maybe they can provide some debugging options.
