1 Rookie

106 Posts

March 3rd, 2015 11:00

The detail that this only happens with files of 100GB or more is an interesting one.  I can certainly imagine that each test takes a long time to complete with that much data to transfer.  I assume there is a workflow that regularly transfers files of this size, and that is why this has become an issue?

I would highly recommend working with your account team to renew the support for your Isilon hardware. You really are missing a lot of great new things without it.

With files of that size, I wonder if there is a better way to transfer them that would offer the reliability you seek.  I certainly don't know why an NFS connection would not be reliable, but at that size there could be other options to take advantage of.  Packaging the file as a tarball and then transferring it?  Using FTP instead of NFS?  I will admit this line of thinking is a bit of wild speculation, but my instinct is that with this file size, there may be something in this version of NFS that reduces the stability of transfers.  I have no information to back that up, just a gut feeling that I'd have to research further, but it would be consistent with what you are describing.

1 Rookie

106 Posts

February 23rd, 2015 07:00

From the output it looks like you are using NFSv3 with the nolock and async mount options.  The async option is the first change I would test.  The async write method tells the client that it does not need to wait for the server to confirm reception and can just continue sending packets until the file is completely transferred.  Therefore, if there are any dropped packets or other confusion during the transfer, the resulting file could be corrupted or at least different from the original.

I would drop the async mount option entirely and then test a transfer.  It should take slightly longer but transfer reliably.

450 Posts

February 23rd, 2015 08:00

The other thing that I would keep in mind is that today you are running OneFS 6.5.x, which is going End Of Primary Support in about 4 months, so an upgrade needs to be on your near-term horizon.  That may or may not change the behavior you're seeing here, but it is something to start planning.  The current target code release is OneFS 7.1.1.2.  This is not the latest GA version, but it is our current target.

~Chris Klosterman

Senior Solution Architect

EMC Isilon Offer & Enablement Team

email: chris.klosterman@emc.com

5 Posts

February 24th, 2015 05:00

Hi cadiletta,

First, thank you for your answer.

Of course, I have the two parameters you mention in mind.

But the fact is that even on a Mac with no nfs.conf, or with an empty one, I have the same issue.

Still, I agree with you and will push my tests further, though I am not confident about the first point above.

I am thinking of creating a script on the Mac that copies a file over and back about 5 times and computes the md5 of each copy for comparison.

Then I'll modify the parameters (async, nolock, ...) until I obtain a reliable copy.

I was just wondering whether I had missed a configuration setting, specific to Isilon, that could explain this behavior...

Thank you for your help

Pierre

450 Posts

February 24th, 2015 07:00

Pierre,

It appears that there is a KB article related to this behavior:

166056 : OneFS: Mac OS X clients create a non-matching copy when copying a file to an Isilon cluster
https://support.emc.com/kb/166056

Based on my review, it would appear that moving your MTU from 1500 to 9K (on the cluster, the switches, and the client) should alleviate the issue.  From my review of a couple of related support cases, it appears to be a problem with the data the OS X NFS client sends while under heavy load.  The packet captures seem to indicate that Isilon takes what is sent over the wire and writes it as requested by the client; however, the client sends some garbled information (again, only intermittently under heavy load).  It actually looks like it sends the same payload twice with 2 different offsets.
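To check whether a bad copy actually shows that pattern (a payload written at the wrong offset), one option is to compare the original and the copy block by block and then look for each bad block's payload elsewhere in the original. The following is only a rough sketch under stated assumptions: the function names are hypothetical, the 32 KiB block size is a guess that should be adjusted to the write size of the mount, and it only detects duplicates that land on block-aligned offsets.

```python
import hashlib


def mismatched_blocks(original, copy, block_size=32 * 1024):
    """Return [(offset, md5 digest of the copy's block)] for every differing block."""
    bad = []
    offset = 0
    with open(original, "rb") as src, open(copy, "rb") as dst:
        while True:
            block_src = src.read(block_size)
            block_dst = dst.read(block_size)
            if not block_src and not block_dst:
                break  # both files exhausted
            if block_src != block_dst:
                bad.append((offset, hashlib.md5(block_dst).digest()))
            offset += block_size
    return bad


def find_duplicate_sources(original, mismatches, block_size=32 * 1024):
    """Map each bad offset to the offsets in the original holding the same payload.

    Assumption: a misplaced payload is block-aligned in the original file.
    """
    wanted = {}
    for bad_offset, digest in mismatches:
        wanted.setdefault(digest, []).append(bad_offset)
    found = {bad_offset: [] for bad_offset, _ in mismatches}
    offset = 0
    with open(original, "rb") as src:
        while True:
            block = src.read(block_size)
            if not block:
                break
            for bad_offset in wanted.get(hashlib.md5(block).digest(), ()):
                found[bad_offset].append(offset)
            offset += block_size
    return found
```

If `find_duplicate_sources` returns a non-empty list for a bad offset, the copy contains data from another position in the original at that offset, which would be consistent with the behavior described in the packet captures. An empty list means the bad block is some other kind of corruption (or is not block-aligned).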

For further investigation, I would suggest you open an EMC Support SR to confirm that this is the issue you are experiencing (give them the KB number above) and, at the same time, an Apple support case.  What version of OS X are you running?  Perhaps this has been fixed in a newer OS X release.

Lastly, it is still in your best interest to work on upgrading your Isilon cluster to a newer release in the near term.  Depending on your maintenance agreement, EMC's support organization may be able to do that for you at no charge, or you can do it yourself.  Even if this does not alleviate your issue, newer releases of OneFS are going to be easier to troubleshoot than 6.5.

Chris Klosterman

Senior Solution Architect

EMC Isilon Offer & Enablement Team

email: chris.klosterman@emc.com

twitter: @croaking

5 Posts

March 3rd, 2015 07:00

Hi Chris, and others!

Just a short post to let you know that this thread is not neglected.

I am continuing to test and trying to find a correct configuration.

Things are arduous because the behavior described is not systematic and only happens with big files (100G minimum), so the tests take a very long time.

As I said, I created a script which computes the md5 of a file, copies it 5 times, computes the md5 of each copy, copies the 5 files back, and computes the md5 again.
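For anyone who wants to reproduce this kind of test, here is a minimal sketch of such a round-trip check in Python. It is not the actual script used here: the function names and the `remote_dir` / `local_dir` arguments are hypothetical, and it assumes the Isilon export is already mounted at `remote_dir`.

```python
import hashlib
import shutil
from pathlib import Path


def md5_of(path, chunk_size=1024 * 1024):
    """Return the hex MD5 of a file, reading in chunks so huge files stay out of RAM."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def round_trip(source, remote_dir, local_dir, copies=5):
    """Copy source to remote_dir `copies` times, copy each one back, compare MD5s.

    Returns a list of (copy_index, stage, path) for every checksum mismatch,
    where stage is "to_remote" or "copy_back".
    """
    source = Path(source)
    expected = md5_of(source)
    mismatches = []
    for index in range(copies):
        remote_copy = Path(remote_dir) / f"{source.name}.{index}"
        shutil.copyfile(source, remote_copy)
        if md5_of(remote_copy) != expected:
            mismatches.append((index, "to_remote", str(remote_copy)))
        local_copy = Path(local_dir) / f"{source.name}.{index}.back"
        shutil.copyfile(remote_copy, local_copy)
        if md5_of(local_copy) != expected:
            mismatches.append((index, "copy_back", str(local_copy)))
    return mismatches
```

One caveat with this approach: reading a remote copy right after writing it may be served partly from the client's cache rather than from the cluster, so remounting the export (or checking from a second client) before computing the checksum gives a more trustworthy result.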

The findings so far:

  • With the Isilon parameters in nfs.conf and sysctl.conf, copies take longer than with empty nfs.conf and sysctl.conf files.
  • Whatever the configuration, I still get md5 differences at random.
  • The md5 differences always appear when I copy from the Macs to the Isilon.
  • On the older Mac Pro hardware, one of the Isilon parameters makes the Mac reboot while copying (from Mac to Isilon as well).
  • I have to investigate further to find which parameter makes the Mac reboot.


I haven't tested the MTU tip yet because it could affect other computers in production.


Thank you, Chris, for your suggestion. Unfortunately, I can't read the KB anymore because our Isilon is no longer under support.

For the same reason, I can't upgrade it (even though I would like to).

We asked Isilon France what arrangement we could find to do it without being under support, but got no answer.

So here I am, with the challenge of solving this problem under those conditions.


Thanks everyone !


Pierre

9 Legend

20.4K Posts

March 3rd, 2015 11:00

I remember that 3 years ago we had an issue with large transfers that ended up being a Nexus 5k NX-OS bug.  It would be interesting to test from a Red Hat/CentOS box, or maybe you can plug your OS X box directly into the cluster, bypassing any switching in between.
