Unsolved
This post is more than 5 years old
13 Posts
0
2374
July 19th, 2016 08:00
High res timestamp on Isilon
Hello Isilon gurus - I'm having a bit of an odd problem.
One of my in house pieces of software makes use of the high res timestamp on NFS mtime.
And we've found that our (new) Isilon, seems to change timestamp on us occasionally - making newly written files slightly older. And this seems to happen more often if we 'stat' the file from another location.
Our proof of concept code doesn't do anything more complicated than an 'open-write-fstat-fsync' then an 'open-read-stat'.
But the time read from the stat calls is different by a few milliseconds. (and in the past on the latter 'read' operation).
I have a proof of concept code snippet, and can reliably reproduce this by running the above in a loop, and calling 'stat' from another nfs location.
Has anyone run into this? We don't really care much _what_ the time stamp says, but that it changes between file write and reopen.
I'm trying to figure out what might be causing this, and if there's any sort of workaround - I'm thinking it might be to do with the different nodes and their timestamps.


kipcranford
125 Posts
1
July 19th, 2016 13:00
Can you post the code snippet? And the OneFS release you're on?
sobrique
13 Posts
0
July 20th, 2016 01:00
Hello, yes. I was going to, but had trouble with pasting into the forums for some reason.
So it's hosted here:
http://pastebin.com/qBrUkxAZ
It's OneFS 8.0.1
It's triggered by running the code on an nfs mount, and then (somewhere else that has it mounted - e.g. the isilon console will do)
sobrique
13 Posts
0
July 20th, 2016 02:00
I've also tried a packet capture of the above - there's a response to 'GETATTR' from the NFS server, which gives two different timestamps on the file - so it looks like it's _not_ client workstation specific.
sjones51
252 Posts
1
July 20th, 2016 10:00
Hi sobrique,
This looks similar to an issue seen prior to 7.2, but seems to be of a different variety. Either way, setting the time delta on the export may provide some relief.
Here is the knowledge base article about the older issue:
https://support.emc.com/kb/457853
Page 442 of the OneFS 8.0.0 CLI Administration Guide provides the commands for changing and reverting the time delta on an export:
https://support.emc.com/docu65065_OneFS-8.0.0-CLI-Administration-Guide.pdf?language=en_US
sobrique
13 Posts
0
July 21st, 2016 00:00
Yes, we've tried setting --time-delta to a lower number. This doesn't solve the issue, but as might be expected - does make it less frequent, as it only occurs when you're on an appropriate boundary.
isi exports modify --zone --time-delta 1
This sets the resolution to 1s, and means it's considerably less frequent.
time mismatch: 1469087524.000000000 != 1469087523.000000000
It serves as a workaround in the short term though.
kipcranford
125 Posts
1
July 21st, 2016 10:00
Hi sobrique,
How long does this take to reproduce in your environment? I've been running your steps for many hours without hitting the problem, and just want an expectation of how long I might need to wait.
sobrique
13 Posts
0
July 21st, 2016 14:00
Without running the "stat" loop on another host, it can take a while.
With it, it's seconds (if that).
Which implies you aren't getting it. Which OneFS version are you on? (we are 8.0.1)
sobrique
13 Posts
0
July 21st, 2016 14:00
It's also Linux, we have tried both RHEL 6.5 and 7
sobrique
13 Posts
0
July 21st, 2016 15:00
We have 6 nodes, and they are x210s.
It occurs really quickly when I run on the isilon cli, so it's interesting that you are getting something different. When running tcpdump it's less than 800 packets before a mismatch occurs.
And yes, OneFS 8.0.0.1 - we needs the "swap drives without power cycling node" fix.
But this problem occurs repeatably on both primary and replica cluster.
I wonder what could be different here.
kipcranford
125 Posts
1
July 21st, 2016 15:00
> Without running the "stat" loop on another host, it can take a while.
> With it, it's seconds (if that).
>
> Which implies you aren't getting it. Which OneFS version are you on? (we are 8.0.1)
Here's my setup:
I haven't seen a time mismatch yet, but i'll let it keep running.
To clarify the release... you're saying this is 8.0.1, not 8.0.0.1? 8.0.1 hasn't been released yet and is only in Beta. 8.0.0.1 is currently the most recent GA code that we have.
kipcranford
125 Posts
1
July 22nd, 2016 08:00
> I wonder what could be different here.
Ok, I did get this to reproduce:
time mismatch: 1469198843.243407821 != 1469198843.242582628
delta: 0.000825193
time mismatch: 1469198843.253932236 != 1469198843.253127882
delta: 0.000804354
time mismatch: 1469198843.307776035 != 1469198843.306946869
delta: 0.000829166
time mismatch: 1469198843.313143443 != 1469198843.312333116
delta: 0.000810327
It seems that this will never reproduce when everything is accessing the same node. When I changed my setup to having each client mount a different node, I saw a few hits. When I changed the setup to having the writer access node 1 and the stat loop run locally on node 3, I got tons of mismatches.
.8 ms seems to be about the average delta I see. I need to dig into what's really going on a bit more.
kipcranford
125 Posts
0
July 25th, 2016 11:00
> .8 ms seems to be about the average delta I see. I need to dig into what's really going on a bit more.
I don't fully understand what's going on here, so I've filed a bug with Engineering.
If this is not already an SR, please open one with Isilon Support. This will help drive the problem resolution from the other direction. If I find out anything from my internal investigation, I'll update what I can here.
Also, can you describe to me the effects this issue has in your environment? You mention in-house applications that notice this time discrepancy -- does this break those workloads? Or is this phenomenon simply something you've observed but that is fairly innocuous?
sobrique
13 Posts
0
July 26th, 2016 00:00
There is an SR open for this - along with an attached 'isi_gather_info' and a tcpdump picking out the differing GETATTR calls.
It's not disastrous for us, because this has come up in testing - we had noticed one of our database mechanisms reporting 'corrupt' data files, because of a time mismatch between index and data. (e.g. data file newer than the index file, which shouldn't happen in this application in normal use).
However, as it stands - we're stalled on migrating because of this - we can _probably_ try and do a rewrite on our end to work around it. Lowering the resolution might help, but we're not too keen on that, as it doesn't actually make the problem go away, it just means it happens less often.
It sort of depends if this is a 'proper bug' for which a fix might be expected, and what sort of time frame that would take.