September 5th, 2013 23:00
SMB change notify and performance
Hi
I was looking into the protocol statistics on the Isilon clusters and noticed that the op class change_notify is at the top of the list of operations taking the longest time to complete.
Would it be considered a performance boost to set smb change notify to none? I know that this would require XP clients to manually refresh their file browsers in order to see updated content, but roughly 95% of our clients are Windows 7 anyway.


MRWA
September 6th, 2013 09:00
My understanding was that change notify is off by default; is there something else in your workflow that requires it?
BernieC
September 6th, 2013 09:00
You should not disable change notification as this will break Windows Vista (and beyond) clients' ability to refresh directories. See KB 91441 on support.emc.com. Also, this post from ECN has a little more detail, from the recent Ask the Expert series on SMB.
What you could do is set change notify to norecurse, which stops the cluster from having to notify clients of changes several levels down the directory tree.
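For reference, the change-notify level can be inspected and changed from the OneFS CLI. The option names below are a sketch from memory and may differ between OneFS releases, so check `isi smb settings global --help` on your cluster first:

```shell
# View the current global SMB settings, including the change-notify level
isi smb settings global view

# Limit notifications to the watched directory itself (no recursion)
isi smb settings global modify --change-notify=norecurse

# Valid levels are typically: all | norecurse | none
```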
MRWA
September 9th, 2013 08:00
Here is a link to the pertinent portion of the Ask The Expert thread that Bernie mentioned:
https://community.emc.com/message/748427#748427
arott
September 13th, 2013 04:00
Thanks for the input. As stated, we have almost no Windows XP clients left in the company.
I have recently seen a rather odd performance issue on the Isilon clusters. After we upgraded from 6.5.5 to 7.0.2 we have seen an extreme increase in file_state operations, and most of the Ops and TimeAvg comes from change_notify.
Take a look at the output:
Ops In Out TimeAvg TimeStdDev Node Proto Class Op
N/s B/s B/s us us
2.6 255.3 122.1 256993232.0 814915584.0 5 smb2 file_state change_notify
112.5 10K 14K 141.9 226.3 5 smb2 file_state close
153.0 15K 12K 9647.5 190915.2 6 smb2 file_state change_notify
526.4 48K 67K 140.6 820.0 6 smb2 file_state close
0.4 46.2 28.7 2159.5 2940.9 6 smb2 file_state lock
1.6 156.2 71.1 85465352.0 210455536.0 7 smb2 file_state change_notify
57.0 5.2K 7.3K 178.8 663.5 7 smb2 file_state close
0.4 46.2 28.7 2294.0 3121.2 7 smb2 file_state lock
0.6 58.2 30.3 3141439.8 3203243.0 8 smb2 file_state change_notify
30.5 2.8K 3.8K 194.0 376.8 8 smb2 file_state close
3.2 311.8 112.8 29923476.0 69968784.0 9 smb2 file_state change_notify
108.2 9.9K 14K 128.6 48.5 9 smb2 file_state close
2.1 202.5 72.5 25473304.0 72545008.0 10 smb2 file_state change_notify
186.1 17K 24K 177.5 513.7 10 smb2 file_state close
1.5 174.6 108.4 618.5 1471.0 10 smb2 file_state lock
I am aware that the TimeAvg for change_notify is expected to be rather high, but I find it odd that we went from a small number of microseconds before the upgrade to several seconds afterwards.
InsightIQ confirms this trend as well.
Peter_Sero
September 13th, 2013 07:00
Isn't the situation on node 6 pretty much OK?
Are/were the clients connected to node 6 somehow different (by number, by OS, by application, ...) from the clients connected to the other nodes?
Can you check the sysctl settings for tcp and smb2 on node 6 versus the other nodes? Maybe things got out of sync during the 7.0 upgrade.
On the "slow" nodes, you can check the tcp connections to the clients with netstat: do you see non-zero values in the Send-Q (send queue) column for the clients targeted by change_notify? That would mean the Isilon side is sending data but the client is unable to process it in time (or there is a latent network problem).
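The Send-Q check can be scripted with an awk filter. This is a sketch that assumes the BSD netstat column order (Proto, Recv-Q, Send-Q, Local-Address, Foreign-Address, State) and SMB traffic on port 445; the addresses below are made-up sample data standing in for real `netstat -an` output:

```shell
# Two hypothetical sample lines standing in for `netstat -an` output on a node;
# on the cluster you would pipe the real output instead:
#   netstat -an | awk '$1 ~ /^tcp/ && $4 ~ /\.445$/ && $3 > 0'
printf 'tcp4 0 128 10.1.1.5.445 10.2.2.9.51515 ESTABLISHED\ntcp4 0 0 10.1.1.5.445 10.2.2.8.51516 ESTABLISHED\n' |
  awk '$1 ~ /^tcp/ && $4 ~ /\.445$/ && $3 > 0'   # keep only rows with a non-zero Send-Q
```

Only the first sample connection (Send-Q of 128) survives the filter; a client that drains its queue promptly, like the second line, is dropped.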
-- Peter
arott
September 13th, 2013 12:00
There were a couple of applications accessing all of those nodes except node 6, with lots of IO operations, which explains the different values. Also, we have two clusters, and the second cluster shows the same pattern with regard to file_state.
The sysctl values are fine, though; we verified them right after the upgrade.
The only latent connections found with netstat are those to the ICAP servers; I can't see any delay for the client workstations or servers.
Change notify has actually been set to norecurse since last year. I am still not sure whether I would gain anything by disabling it altogether.
Peter_Sero
September 15th, 2013 05:00
You might simply have hit some (unknown? new?) oddity in OneFS's SMB services at high load; have you opened a support case?
If you are willing to investigate further and share, here are a couple more thoughts:
An average service time of 250 s per operation (node 5) is obviously ridiculous, and completing 2.6 ops per second means that ~650 operations are "being acted on inside the node" concurrently. The node will be super-busy or waiting for something (or both), but that should be identifiable (suggestions follow).
And how does this number, 650, correlate with the number of clients? (Even on a highly loaded cluster, that product, Ops/s x TimeAvg (converted to seconds), summed over all operations, should be around 10, or at least < 50.)
Was the output of 'isi statistics client' an excerpt, or does it show all traffic?
The current picture is that the physical network bandwidth is NOT being used up yet.
What is actually going on:
- at CPU level ('top') - lwiod or others super busy or using excessive amounts of memory, or even crashing?
- at filesystem level ('isi statistics heat') - hot spots, locking?
- at disk level ('isi statistics drive') - drives busy/late?
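The three checks above can be run directly on a suspect node; the commands are the ones named in the list, with no extra flags since the available options vary by OneFS release (see `isi statistics --help` on your cluster):

```shell
top                      # CPU level: watch lwiod CPU and memory use
isi statistics heat      # filesystem level: per-path operation heat, lock contention
isi statistics drive     # disk level: per-drive throughput and latency figures
```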
Are the SMB clients watching widely shared folders, or is each one watching only a private ('home') folder?
Have you checked the impact of toggling
isi smb settings global modify --enable-security-signatures
on/off?
Have you increased or reduced the number of SMB server worker threads?
(You are not using any 'protocol enhancing', WAN-optimizing network appliance
between the clients and the Isilon, are you?)
-- Peter
arott
October 4th, 2013 04:00
Some more statistics. Disk statistics show around 50-60 IOPS for the SATA drives and 50 IOPS for the SAS drives.
Number of connections per node (connected / active):
Node Connected Active
5 1520 164
6 1983 186
7 1448 226
8 1496 249
9 1818 256
10 1595 217
Here is an excerpt from node 5 (the issue is the same on all nodes):
Ops In Out TimeAvg TimeStdDev Node Proto Class Op
N/s B/s B/s us us
243.7 63K 33K 643.7 687.6 5 smb2 create create
1.5 147.2 103.2 153958064.0 366812000.0 5 smb2 file_state change_notify
135.0 12K 16K 189.8 1052.8 5 smb2 file_state close
6.6 760.8 472.2 124.7 38.5 5 smb2 file_state lock
0.2 22.0 22.0 105.0 0.0 5 smb2 file_state oplock_break
130.5 14K 109K 7607.1 68325.1 5 smb2 namespace_read query_directory
119.3 13K 25K 486.4 833.6 5 smb2 namespace_read query_info
2.1 440.0 148.1 2644.7 2996.2 5 smb2 namespace_write set_info
0.2 15.2 0.0 27.0 0.0 5 smb2 other cancel
3.2 371.3 22K 3494.6 6810.8 5 smb2 read read
0.4 30.5 30.5 210.0 63.6 5 smb2 session_state logoff
0.4 46.1 99.4 71.5 24.7 5 smb2 session_state negotiate
0.4 1.2K 101.3 271018.0 6950.9 5 smb2 session_state session_setup
1.5 195.1 124.4 313.9 400.1 5 smb2 session_state tree_connect
3.0 213.3 213.3 87.6 34.5 5 smb2 session_state tree_disconnect
58.6 689K 4.9K 476.1 819.0 5 smb2 write write
"top" on node 5:
last pid: 3469; load averages: 0.68, 1.21, 1.47 up 13+20:29:32 13:39:29
617 processes: 2 running, 615 sleeping
CPU: 1.6% user, 0.0% nice, 14.2% system, 3.5% interrupt, 80.7% idle
Mem: 1064M Active, 14G Inact, 7099M Wired, 575M Cache, 511M Buf, 614M Free
Swap:
PID USERNAME THR PRI NICE SIZE RES STATE C TIME WCPU COMMAND
43565 root 57 96 0 3461M 367M ucond 0 0:00 3.91% lwio
65070 root 4 8 0 113M 15900K cpendi 0 0:07 0.98% isi_job_d
4245 root 1 4 0 55692K 7508K kqread 5 106:07 0.00% isi_mcp
Most of the file shares are shared folders with lots of users, but we have some home areas as well.
SMB signing is not enabled.
I have not modified the SMB worker threads either; not sure how to do that?
There is a set of Riverbed WAN accelerators between the clients and the Isilon cluster.
Peter_Sero
October 4th, 2013 06:00
Very interesting. A couple of thoughts:
(1) It doesn't seem that the cluster is overloaded (CPU, disk ops).
(2) Oplocks apparently are also not a problem per se (Ops rate, TimeAvg).
You might still test switching oplocks off; they might interfere with
change_notify and, beware, with (4) below.
(3) For SMB worker threads, check the SMB section of the 7.0 Command Reference (not the Admin Guide):
isi smb settings global modify --srv-num-workers {integer}
(4) The WAN accelerator is the most suspicious element to me here. Having heard of a case where a (not Riverbed, but another big-name) accelerator chose to mix up "\" and "/" in CIFS file paths, breaking the CIFS workflows right after an OS upgrade on the NAS server side(!), I have no illusions about the "transparency" of the magic those boxes perform, nor about the compatibility claims stated by manufacturers.
You might try to:
- switch off individual "optimization features" or "acceleration levels" offered on the Riverbed.
- (with a good network team) analyze the CIFS packets entering and leaving the Riverbed.
- consult Riverbed support/engineering.
Best of luck, and keep us posted
-- Peter