We have an application that uses Isilon for its temporary files and for its final output files. It runs for many hours to generate the solution. It was working well with OneFS 22.214.171.124, but after we upgraded to 126.96.36.199 the application started failing at the final step, where it cleans up the temporary files and writes the final output file. The error message indicates that it could not remove a temporary file because it was in use by another process.
My question: I see a new SMB parameter was added in OneFS 188.8.131.52, ChangeNotifyInterval, which defaults to 100ms. Could this be causing the issues we are seeing?
What was the behavior of OneFS for Change Notifications prior to 184.108.40.206?
Did it just send out every change immediately instead of saving them up and reporting them in 100ms intervals?
Would setting the value lower resolve the issue?
How low does the value need to be set?
I've got an open SR with EMC regarding this issue, but Isilon support seems to be overwhelmed for some reason right now. I've gotten very little response from them on this SR or on the changes between the two OneFS versions. EMC Support wants me to set up a Wireshark capture on the client and a tcpdump capture on the Isilon, but with a 20+ hour run time the captures get quite large, and watching for the failure so the capture can be stopped at the right moment is problematic.
Change notifications in SMB are a feature where the server tells every client connected to a tree, "there are new elements in this tree," so that changes made by one client become visible to the other connected clients. Each client then re-reads the content of the changed tree.
This becomes a problem if you have a *lot* of clients connected to the same tree and frequent changes trigger change notifications: all connected clients re-read the tree, which produces a lot of metadata queries and can result in massive network transfers.
This is why change notification can be configured with "norecurse", which sends notifications only for the tree the user is connected to, not for its subfolders.
This process *should* not require any locks on files from Isilon. It CAN, however, be triggered when lock states change (a client saves and closes a file, thus releasing the lock).
You can test this by disabling change notifications. The result is that applications/users have to refresh the structure they are in manually (F5 in Windows Explorer).
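If you want to try that on a per-share basis, OneFS exposes a change-notify setting on SMB shares. A hedged sketch, assuming the OneFS 8.x CLI; the share name "projects" is hypothetical, and the exact flag syntax can vary between releases:

```shell
# Disable change notifications entirely for one share
# (valid values are typically all | norecurse | none).
isi smb shares modify projects --change-notify=none

# Or limit notifications to the top level of the connected tree only:
isi smb shares modify projects --change-notify=norecurse
```

Remember this affects every client of the share, so expect users to need manual refreshes while testing.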
I have seen issues with "open files" and "locks" more often with things like the status bar of Windows Explorer or (image) previews, which sometimes take exclusive locks (for reasons unknown) and prevent file operations.
As for temporary files: most of the time, temporary files really are in use.
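Until the root cause is found, one application-side workaround is to retry the cleanup instead of failing on the first sharing violation. A minimal sketch, assuming a POSIX shell; the path and retry/sleep values are illustrative, not taken from the application:

```shell
# Retry removal of a temp file that another process may still hold open,
# rather than aborting the whole run on the first "file in use" error.
tmpfile="/tmp/app_scratch.$$"
: > "$tmpfile"            # stand-in for the application's temp file

removed=0
for attempt in 1 2 3 4 5; do
    if rm -- "$tmpfile" 2>/dev/null; then
        removed=1
        echo "removed on attempt $attempt"
        break
    fi
    sleep 2               # give the other SMB client time to drop its handle
done
[ "$removed" -eq 1 ] || echo "gave up: $tmpfile still in use"
```

On a quiet file the first attempt succeeds; when a handle lingers, the loop absorbs short-lived contention instead of failing the job.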
Things you can do:
- Set up a network trace and take a look (some knowledge of SMB and TCP required)
- Check open connections / open files with the "isi smb" commands (connection information is node-local!)
- Disable change notifications (I don't think this will help)
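For the second point, a hedged sketch of the inspection commands, assuming the OneFS 8.x `isi smb` CLI (syntax can differ between releases):

```shell
# SMB session and open-file information is node-local, so fan the
# commands out to every node with isi_for_array.
isi_for_array 'isi smb sessions list'    # who is connected, per node
isi_for_array 'isi smb openfiles list'   # which files are open via SMB, per node
```

Running these while the application is in its cleanup phase should show which client still holds the temp file open.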
I am also experiencing problems since upgrading from 220.127.116.11 to 18.104.22.168
We have some software which uses watch folders residing on the Isilon storage. The software monitors a file and, once the file stops growing, analysis begins. Since upgrading to 22.214.171.124 we get reports that the file could not be opened by the software. If the file is left on Isilon for a period of time, it can then be analysed without a problem.
I would be interested in the response from EMC regarding the open SR. Let's hope it's resolved soon, especially as this version is target code.
I resolved the issue by lowering the ChangeNotifyInterval from the default of 100ms to 50ms.
I made this change on my own; EMC did not recommend it, but it did resolve the issue we were seeing.
Perhaps you could try this change as well to see if the application behaves better.
Hi Michael, thanks for the reply. Looking further into my problems here, I'm not convinced that the change in OneFS version actually caused my issue. It could still be related; however, my cause was the number of headless machines we have running as transcode engines, which all connect to the same share via the same local Isilon user account. We were up to about 13 machines all using the same account; reducing the count to 9 has brought back satisfactory operation.
I have yet to find any documentation on how many users can simultaneously connect using the same credentials, although given how the cluster assigns permissions, multiple access tokens for the same user may well be problematic.
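One way to check whether the shared account is the bottleneck is to count established SMB sessions for that user on each node. A hedged sketch; the account name `transcode` is hypothetical, and the `isi smb sessions list` output format varies by OneFS release:

```shell
# Count, per node, how many SMB sessions the shared transcode account holds.
isi_for_array 'isi smb sessions list | grep -c transcode'
```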
Hi Michael, just wanted to update everyone on this issue. It looks as though changing the login credentials for this user to a unique local account has not cured the issue after all. I will look at your suggestion of lowering the interval from 100ms to 50ms, but I will also raise a support case with EMC regarding the issue.
Hopefully EMC will resolve this change in behaviour soon.