Unsolved


13 Posts


September 5th, 2013 23:00

SMB change notify and performance

Hi

I was looking into the protocol statistics on our Isilon clusters and noticed that the op class change_notify tops the list of operations taking the longest time to complete.

Would setting smb change notify to none be considered a performance boost? I know this would require XP clients to manually refresh their file browsers in order to see updated content, but roughly 95% of our clients are Windows 7 anyway.

83 Posts

September 6th, 2013 09:00

My understanding was that change notify is off by default; is there something else in your workflow that requires it?

76 Posts

September 6th, 2013 09:00

You should not disable change notification, as this will break Windows Vista (and later) clients' ability to refresh directories. See KB 91441 on support.emc.com. Also, this post from ECN has a little more detail, from the recent Ask the Expert series on SMB.

What you could do is set change notify to norecurse, which stops it from having to notify clients of changes multiple levels down the directory tree.
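As a sketch only: on OneFS 7.x this is a share-level setting, and the exact flag names may differ between versions, so verify against `--help` on your cluster before running anything:

```sh
# Hedged sketch (OneFS 7.x CLI; flag names are assumptions, check `--help` first).
# Limit change notifications on a share to the watched directory only:
isi smb shares modify MyShare --change-notify=norecurse

# Inspect the current setting:
isi smb shares view MyShare
```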

83 Posts

September 9th, 2013 08:00

Here is a link to the pertinent portion of the Ask The Expert thread that Bernie mentioned:

https://community.emc.com/message/748427#748427

13 Posts

September 13th, 2013 04:00

Thanks for the input. As stated, we have almost no Windows XP clients left in the company.

I have recently seen a rather odd performance issue on the Isilon clusters. After we upgraded from 6.5.5 to 7.0.2, we have seen an extreme increase in file_state operations, and most of the Ops and TimeAvg come from change_notify.

Take a look at the output:

   Ops    In   Out     TimeAvg   TimeStdDev Node Proto      Class            Op
   N/s   B/s   B/s          us           us
   2.6 255.3 122.1 256993232.0  814915584.0    5  smb2 file_state change_notify
 112.5   10K   14K       141.9        226.3    5  smb2 file_state         close
 153.0   15K   12K      9647.5     190915.2    6  smb2 file_state change_notify
 526.4   48K   67K       140.6        820.0    6  smb2 file_state         close
   0.4  46.2  28.7      2159.5       2940.9    6  smb2 file_state          lock
   1.6 156.2  71.1  85465352.0  210455536.0    7  smb2 file_state change_notify
  57.0  5.2K  7.3K       178.8        663.5    7  smb2 file_state         close
   0.4  46.2  28.7      2294.0       3121.2    7  smb2 file_state          lock
   0.6  58.2  30.3   3141439.8    3203243.0    8  smb2 file_state change_notify
  30.5  2.8K  3.8K       194.0        376.8    8  smb2 file_state         close
   3.2 311.8 112.8  29923476.0   69968784.0    9  smb2 file_state change_notify
 108.2  9.9K   14K       128.6         48.5    9  smb2 file_state         close
   2.1 202.5  72.5  25473304.0   72545008.0   10  smb2 file_state change_notify
 186.1   17K   24K       177.5        513.7   10  smb2 file_state         close
   1.5 174.6 108.4       618.5       1471.0   10  smb2 file_state          lock

I am aware that change_notify's TimeAvg is expected to be rather high, but I find it odd that we went from a small number of microseconds before the upgrade to several seconds afterwards.

InsightIQ confirms this trend as well.

4 Operator • 1.2K Posts

September 13th, 2013 07:00

Isn't the situation on node 6 pretty much OK? Are/were the clients connected to node 6 somehow different (by number, by OS, by application, ...) from the clients connected to the other nodes?

Can you check the sysctl settings for tcp and smb2 on node 6 vs. the other nodes? Maybe during the 7.0 upgrade things have gotten out of sync.
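One hedged way to compare the nodes is to dump the relevant sysctls cluster-wide and eyeball the differences; `isi_for_array` runs a command on every node, and which sysctl names matter depends on what was tuned before the upgrade (the two below are just illustrative):

```sh
# Hedged sketch (OneFS shell): print per-node values of some TCP sysctls,
# prefixed with the node name, so mismatched nodes stand out.
# The specific sysctl names here are examples, not a definitive list.
isi_for_array -s 'sysctl net.inet.tcp.sendspace net.inet.tcp.recvspace'
```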

On the "slow" nodes, you can check the tcp connections to the clients with netstat: do you see non-zero values in the Send-Q (queue) column for those clients which are targeted by change_notify? This would mean that the Isilon side is sending data, but the client is unable to process it in time (or there is a latent network problem).
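A quick way to spot such peers is to filter the netstat output for rows with a non-zero Send-Q (column 3 in BSD-style output). The sample lines below are fabricated for illustration; on a real node you would pipe `netstat -an` in instead of the printf:

```shell
# Filter connections with a non-zero Send-Q (3rd column of BSD netstat output).
# The printf stands in for `netstat -an` so the example is self-contained.
printf '%s\n' \
  'Proto Recv-Q Send-Q  Local Address          Foreign Address        (state)' \
  'tcp4       0      0  10.1.1.5.445           10.1.2.20.51234        ESTABLISHED' \
  'tcp4       0  43800  10.1.1.5.445           10.1.2.30.51500        ESTABLISHED' |
awk 'NR > 1 && $3 + 0 > 0 { print $5 " Send-Q=" $3 }'
# prints: 10.1.2.30.51500 Send-Q=43800
```

The addresses above are hypothetical; the awk filter itself works on any netstat output where Send-Q is the third column.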

-- Peter

13 Posts

September 13th, 2013 12:00

There were a couple of applications accessing all of those nodes except node 6, with lots of I/O operations, which explains the different values. Also, we have two clusters, and the second cluster shows the same pattern with regard to file_state.

The sysctl values are fine, though; we verified those right after the upgrade.

The only latent connections found with netstat are those to the ICAP servers. I can't see any delay for the client workstations or servers, though.

Change notify has actually been set to norecurse since last year. I'm still not sure whether I would gain anything by disabling it altogether, though.

4 Operator • 1.2K Posts

September 15th, 2013 05:00

You simply might have hit some (unknown? new?) oddity in OneFS's SMB services at high load; have you opened a support case?

If you are willing to investigate further and share, here are a couple more thoughts:

250 s average service time per operation (node 5) is obviously ridiculous, and finishing 2.6 ops per second means that ~650 operations are "being acted on inside the node" concurrently. The node will be super-busy or waiting for something (or both), but that should be identifiable (suggestions follow). And how does this number, 650, correlate with the number of clients? (Even on a highly loaded cluster, that product, Ops/s x TimeAvg (converted to seconds), summed over all operations, should be around 10, or at least <50.)
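Peter's back-of-the-envelope estimate is Little's Law (in-flight ops ≈ arrival rate × average service time); it can be reproduced directly from the numbers in the table above, remembering that TimeAvg is in microseconds:

```shell
# Little's Law: concurrent ops ~= Ops/s * TimeAvg (converted from us to s).
# Inputs are the change_notify rows for nodes 5 and 6 from the table above.
awk 'BEGIN {
  printf "node 5: ~%.0f concurrent change_notify ops\n", 2.6   * 256993232.0 / 1e6
  printf "node 6: ~%.1f concurrent change_notify ops\n", 153.0 * 9647.5      / 1e6
}'
# prints: node 5: ~668 concurrent change_notify ops
#         node 6: ~1.5 concurrent change_notify ops
```

Node 5 is carrying hundreds of change_notify operations in flight at once, while node 6, despite a far higher request rate, carries only a handful, which is exactly the asymmetry Peter is probing.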

Was the output of 'isi statistics client' an excerpt, or does it show all traffic?

The current picture is that the physical network bandwidth is NOT being used up yet.

What is actually going on:

- at CPU level ('top') - lwiod or others super busy or using excessive amounts of memory, or even crashing?

- at filesystem level ('isi statistics heat') - hot spots, locking?

- at disk level ('isi statistics drive') - drives busy/late?

Are the SMB clients watching widely shared folders, or is each one watching only private ('home') folders?

Have you checked the impact of toggling

isi smb settings global modify --enable-security-signatures

on/off?

Have you increased or reduced the number of SMB server worker threads?

(You are not using any "protocol-enhancing", WAN-optimizing network appliance between the clients and the Isilon, are you?)

-- Peter

13 Posts

October 4th, 2013 04:00

Some more statistics. Disk statistics show around 50-60 IOPS for the SATA drives and 50 IOPS for the SAS drives.

Number of connections (connected/active):

Node  Connected  Active
   5       1520     164
   6       1983     186
   7       1448     226
   8       1496     249
   9       1818     256
  10       1595     217

Here is an excerpt from node 5 (the issue is the same on all nodes):

   Ops    In   Out      TimeAvg   TimeStdDev Node Proto           Class              Op
   N/s   B/s   B/s           us           us
 243.7   63K   33K        643.7        687.6    5  smb2          create          create
   1.5 147.2 103.2  153958064.0  366812000.0    5  smb2      file_state   change_notify
 135.0   12K   16K        189.8       1052.8    5  smb2      file_state           close
   6.6 760.8 472.2        124.7         38.5    5  smb2      file_state            lock
   0.2  22.0  22.0        105.0          0.0    5  smb2      file_state    oplock_break
 130.5   14K  109K       7607.1      68325.1    5  smb2  namespace_read query_directory
 119.3   13K   25K        486.4        833.6    5  smb2  namespace_read      query_info
   2.1 440.0 148.1       2644.7       2996.2    5  smb2 namespace_write        set_info
   0.2  15.2   0.0         27.0          0.0    5  smb2           other          cancel
   3.2 371.3   22K       3494.6       6810.8    5  smb2            read            read
   0.4  30.5  30.5        210.0         63.6    5  smb2   session_state          logoff
   0.4  46.1  99.4         71.5         24.7    5  smb2   session_state       negotiate
   0.4  1.2K 101.3     271018.0       6950.9    5  smb2   session_state   session_setup
   1.5 195.1 124.4        313.9        400.1    5  smb2   session_state    tree_connect
   3.0 213.3 213.3         87.6         34.5    5  smb2   session_state tree_disconnect
  58.6  689K  4.9K        476.1        819.0    5  smb2           write           write

"top" on node 5:

last pid:  3469;  load averages:  0.68,  1.21,  1.47    up 13+20:29:32  13:39:29
617 processes: 2 running, 615 sleeping
CPU:  1.6% user,  0.0% nice, 14.2% system,  3.5% interrupt, 80.7% idle
Mem: 1064M Active, 14G Inact, 7099M Wired, 575M Cache, 511M Buf, 614M Free
Swap:

  PID USERNAME   THR PRI NICE   SIZE    RES STATE   C   TIME   WCPU COMMAND
43565 root        57  96    0  3461M   367M ucond   0   0:00  3.91% lwio
65070 root         4   8    0   113M 15900K cpendi  0   0:07  0.98% isi_job_d
 4245 root         1   4    0 55692K  7508K kqread  5 106:07  0.00% isi_mcp

Most of the file shares are shared folders with lots of users, but we have some home areas as well.

SMB signing is not enabled.

We have not modified the SMB worker threads either; not sure how to do this?

There is a set of Riverbed WAN accelerators between the clients and the Isilon cluster.

4 Operator • 1.2K Posts

October 4th, 2013 06:00

Very interesting. A couple of thoughts:

(1) It doesn't seem that the cluster is overloaded (CPU, disk ops).

(2) Oplocks apparently are also not a problem per se (Ops rate, TimeAvg). You might still test switching oplocks off; they might interfere with change_notify and, beware, with (4) below.

(3) For SMB worker threads, check out the SMB section of the 7.0 Command Reference (not the Admin Guide):

isi smb settings global modify --srv-num-workers {integer}

(4) The WAN accelerator is the most suspicious element to me here. After having learned about a story where a (not Riverbed, but another big-name) accelerator chose to mix up "\" and "/" in CIFS file paths, breaking the CIFS workflows right after an OS upgrade on the NAS server side(!), I have no illusions about the "transparency" of the magic those boxes perform, nor about the compatibility claims stated by manufacturers.

You might try to:

- switch off individual "optimization features" or "acceleration levels" offered on the Riverbed.
- (with a good network team) analyze the CIFS packets entering and leaving the Riverbed.
- consult Riverbed support/engineering.

Best of luck, and keep us posted

-- Peter
