Unsolved


13 Posts


September 5th, 2013 23:00

SMB change notify and performance

Hi

I was looking into the protocol statistics on our Isilon clusters and noticed that the op class change_notify tops the list of operations taking the longest time to complete.

Would setting smb change notify to none be considered a performance boost? I know this would require XP clients to manually refresh their file browsers in order to see updated content, but roughly 95% of our clients are Windows 7 anyway.

83 Posts

September 6th, 2013 09:00

My understanding was that change notify is off by default; is there something else in your workflow that requires it?

76 Posts

September 6th, 2013 09:00

You should not disable change notification, as this will break Windows Vista (and later) clients' ability to refresh directories. See KB 91441 on support.emc.com. Also, this post from ECN has a little more detail, from the recent Ask the Expert series on SMB.

What you could do is set change notify to norecurse, which stops it from having to notify clients of changes multiple levels down the directory tree.
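As a sketch only: on OneFS 7.x this is a share-level setting, and the exact flag names may differ between versions, so verify against `--help` on your cluster before running anything:

```sh
# Hedged sketch (OneFS 7.x CLI; flag names are assumptions, check `--help` first).
# Limit change notifications on a share to the watched directory only:
isi smb shares modify MyShare --change-notify=norecurse

# Inspect the current setting:
isi smb shares view MyShare
```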

83 Posts

September 9th, 2013 08:00

Here is a link to the pertinent portion of the Ask The Expert thread that Bernie mentioned:

https://community.emc.com/message/748427#748427

13 Posts

September 13th, 2013 04:00

Thanks for the input. As stated, we have almost no Windows XP clients left in the company.

I have recently seen a rather odd performance issue on the Isilon clusters. After we upgraded from 6.5.5 to 7.0.2, we have seen an extreme increase in file_state operations, and most of the Ops and TimeAvg come from change_notify.

Take a look at the output:

   Ops    In   Out     TimeAvg   TimeStdDev Node Proto      Class            Op
   N/s   B/s   B/s          us           us
   2.6 255.3 122.1 256993232.0  814915584.0    5  smb2 file_state change_notify
 112.5   10K   14K       141.9        226.3    5  smb2 file_state         close
 153.0   15K   12K      9647.5     190915.2    6  smb2 file_state change_notify
 526.4   48K   67K       140.6        820.0    6  smb2 file_state         close
   0.4  46.2  28.7      2159.5       2940.9    6  smb2 file_state          lock
   1.6 156.2  71.1  85465352.0  210455536.0    7  smb2 file_state change_notify
  57.0  5.2K  7.3K       178.8        663.5    7  smb2 file_state         close
   0.4  46.2  28.7      2294.0       3121.2    7  smb2 file_state          lock
   0.6  58.2  30.3   3141439.8    3203243.0    8  smb2 file_state change_notify
  30.5  2.8K  3.8K       194.0        376.8    8  smb2 file_state         close
   3.2 311.8 112.8  29923476.0   69968784.0    9  smb2 file_state change_notify
 108.2  9.9K   14K       128.6         48.5    9  smb2 file_state         close
   2.1 202.5  72.5  25473304.0   72545008.0   10  smb2 file_state change_notify
 186.1   17K   24K       177.5        513.7   10  smb2 file_state         close
   1.5 174.6 108.4       618.5       1471.0   10  smb2 file_state          lock

I am aware that change_notify's TimeAvg is expected to be rather high, but I find it odd that we went from a small number of microseconds before the upgrade to several seconds afterwards.

InsightIQ confirms this trend as well.

4 Operator • 1.2K Posts

September 13th, 2013 07:00

Isn't the situation on node 6 pretty much OK? Are/were the clients connected to node 6 somehow different (by number, by OS, by application, ...) from the clients connected to the other nodes?

Can you check the sysctl settings for tcp and smb2 on node 6 vs. the other nodes? Maybe during the 7.0 upgrade things have gotten out of sync.
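One hedged way to compare the nodes is to dump the relevant sysctls cluster-wide and eyeball the differences; `isi_for_array` runs a command on every node, and which sysctl names matter depends on what was tuned before the upgrade (the two below are just illustrative):

```sh
# Hedged sketch (OneFS shell): print per-node values of some TCP sysctls,
# prefixed with the node name, so mismatched nodes stand out.
# The specific sysctl names here are examples, not a definitive list.
isi_for_array -s 'sysctl net.inet.tcp.sendspace net.inet.tcp.recvspace'
```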

On the "slow" nodes, you can check the tcp connections to the clients with netstat: do you see non-zero values in the Send-Q (queue) column for those clients which are targeted by change_notify? This would mean that the Isilon side is sending data, but the client is unable to process it in time (or there is a latent network problem).
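A quick way to spot such peers is to filter the netstat output for rows with a non-zero Send-Q (column 3 in BSD-style output). The sample lines below are fabricated for illustration; on a real node you would pipe `netstat -an` in instead of the printf:

```shell
# Filter connections with a non-zero Send-Q (3rd column of BSD netstat output).
# The printf stands in for `netstat -an` so the example is self-contained.
printf '%s\n' \
  'Proto Recv-Q Send-Q  Local Address          Foreign Address        (state)' \
  'tcp4       0      0  10.1.1.5.445           10.1.2.20.51234        ESTABLISHED' \
  'tcp4       0  43800  10.1.1.5.445           10.1.2.30.51500        ESTABLISHED' |
awk 'NR > 1 && $3 + 0 > 0 { print $5 " Send-Q=" $3 }'
# prints: 10.1.2.30.51500 Send-Q=43800
```

The addresses above are hypothetical; the awk filter itself works on any netstat output where Send-Q is the third column.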

-- Peter

13 Posts

September 13th, 2013 12:00

There were a couple of applications accessing all of those nodes except node 6, with lots of I/O operations, which explains the different values. Also, we have two clusters, and the second cluster shows the same pattern with regard to file_state.

The sysctl values are fine, though; we verified those right after the upgrade.

The only latent connections found with netstat are those to the ICAP servers. I can't see any delay for the client workstations or servers, though.

Change notify has actually been set to norecurse since last year. I'm still not sure whether I would gain anything by disabling it altogether, though.

4 Operator • 1.2K Posts

September 15th, 2013 05:00

You simply might have hit some (unknown? new?) oddity in OneFS's SMB services at high load; have you opened a support case?

If you are willing to investigate further and share, here are a couple more thoughts:

250 s average service time per operation (node 5) is obviously ridiculous, and finishing 2.6 ops per second means that ~650 operations are "being acted on inside the node" concurrently. The node will be super-busy or waiting for something (or both), but that should be identifiable (suggestions follow). And how does this number, 650, correlate with the number of clients? (Even on a highly loaded cluster, that product, Ops/s x TimeAvg (converted to seconds), summed over all operations, should be around 10, or at least <50.)
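Peter's back-of-the-envelope estimate is Little's Law (in-flight ops ≈ arrival rate × average service time); it can be reproduced directly from the numbers in the table above, remembering that TimeAvg is in microseconds:

```shell
# Little's Law: concurrent ops ~= Ops/s * TimeAvg (converted from us to s).
# Inputs are the change_notify rows for nodes 5 and 6 from the table above.
awk 'BEGIN {
  printf "node 5: ~%.0f concurrent change_notify ops\n", 2.6   * 256993232.0 / 1e6
  printf "node 6: ~%.1f concurrent change_notify ops\n", 153.0 * 9647.5      / 1e6
}'
# prints: node 5: ~668 concurrent change_notify ops
#         node 6: ~1.5 concurrent change_notify ops
```

Node 5 is carrying hundreds of change_notify operations in flight at once, while node 6, despite a far higher request rate, carries only a handful, which is exactly the asymmetry Peter is probing.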

Was the output of 'isi statistics client' an excerpt, or does it show all traffic?

The current picture is that the physical network bandwidth is NOT being used up yet.

What is actually going on:

- at CPU level ('top') - lwiod or others super busy or using excessive amounts of memory, or even crashing?

- at filesystem level ('isi statistics heat') - hot spots, locking?

- at disk level ('isi statistics drive') - drives busy/late?

Are the SMB clients watching widely shared folders, or is each one watching only private ('home') folders?

Have you checked the impact of toggling

isi smb settings global modify --enable-security-signatures

on/off?

Have you increased or reduced the number of SMB server worker threads?

(You are not using any "protocol-enhancing", WAN-optimizing network appliance between the clients and the Isilon, are you?)

-- Peter

13 Posts

October 4th, 2013 04:00

Some more statistics. Disk statistics show around 50-60 IOPS for the SATA drives and 50 IOPS for the SAS drives.

Number of connections (connected/active):

Node  Connected  Active
   5       1520     164
   6       1983     186
   7       1448     226
   8       1496     249
   9       1818     256
  10       1595     217

Here is an excerpt from node 5 (the issue is the same on all nodes):

   Ops    In   Out      TimeAvg   TimeStdDev Node Proto           Class              Op
   N/s   B/s   B/s           us           us
 243.7   63K   33K        643.7        687.6    5  smb2          create          create
   1.5 147.2 103.2  153958064.0  366812000.0    5  smb2      file_state   change_notify
 135.0   12K   16K        189.8       1052.8    5  smb2      file_state           close
   6.6 760.8 472.2        124.7         38.5    5  smb2      file_state            lock
   0.2  22.0  22.0        105.0          0.0    5  smb2      file_state    oplock_break
 130.5   14K  109K       7607.1      68325.1    5  smb2  namespace_read query_directory
 119.3   13K   25K        486.4        833.6    5  smb2  namespace_read      query_info
   2.1 440.0 148.1       2644.7       2996.2    5  smb2 namespace_write        set_info
   0.2  15.2   0.0         27.0          0.0    5  smb2           other          cancel
   3.2 371.3   22K       3494.6       6810.8    5  smb2            read            read
   0.4  30.5  30.5        210.0         63.6    5  smb2   session_state          logoff
   0.4  46.1  99.4         71.5         24.7    5  smb2   session_state       negotiate
   0.4  1.2K 101.3     271018.0       6950.9    5  smb2   session_state   session_setup
   1.5 195.1 124.4        313.9        400.1    5  smb2   session_state    tree_connect
   3.0 213.3 213.3         87.6         34.5    5  smb2   session_state tree_disconnect
  58.6  689K  4.9K        476.1        819.0    5  smb2           write           write

"top" on node 5:

last pid:  3469;  load averages:  0.68,  1.21,  1.47    up 13+20:29:32  13:39:29
617 processes: 2 running, 615 sleeping
CPU:  1.6% user,  0.0% nice, 14.2% system,  3.5% interrupt, 80.7% idle
Mem: 1064M Active, 14G Inact, 7099M Wired, 575M Cache, 511M Buf, 614M Free
Swap:

  PID USERNAME   THR PRI NICE   SIZE    RES STATE   C   TIME   WCPU COMMAND
43565 root        57  96    0  3461M   367M ucond   0   0:00  3.91% lwio
65070 root         4   8    0   113M 15900K cpendi  0   0:07  0.98% isi_job_d
 4245 root         1   4    0 55692K  7508K kqread  5 106:07  0.00% isi_mcp

Most of the file shares are shared folders with lots of users, but we have some home areas as well.

SMB signing is not enabled.

We have not modified the SMB worker threads either; not sure how to do this?

There is a set of Riverbed WAN accelerators between the clients and the Isilon cluster.

4 Operator • 1.2K Posts

October 4th, 2013 06:00

Very interesting. A couple of thoughts:

(1) It doesn't seem that the cluster is overloaded (CPU, disk ops).

(2) Oplocks apparently are also not a problem per se (Ops rate, TimeAvg). You might still test switching oplocks off; they might interfere with change_notify and, beware, with (4) below.

(3) For SMB worker threads, check out the SMB section of the 7.0 Command Reference (not the Admin Guide):

isi smb settings global modify --srv-num-workers {integer}

(4) The WAN accelerator is the most suspicious element to me here. After having learned about a story where a (not Riverbed, but another big-name) accelerator chose to mix up "\" and "/" in CIFS file paths, breaking the CIFS workflows right after an OS upgrade on the NAS server side(!), I have no illusions about the "transparency" of the magic those boxes perform, nor about the compatibility claims stated by manufacturers.

You might try to:

- switch off individual "optimization features" or "acceleration levels" offered on the Riverbed.
- (with a good network team) analyze the CIFS packets entering and leaving the Riverbed.
- consult Riverbed support/engineering.

Best of luck, and keep us posted

-- Peter
