prashant_shah

Re: Ask the Expert: SMB Protocol on an Isilon Cluster

This might be a pretty basic question, but I haven't yet found a good explanation for this.  When setting up an SMB share, we are given two choices: "Apply Windows Default ACLs" and "Do not change existing permissions".  When I attended training, we were advised to choose "Do not change existing permissions".  However, when I ran into issues at a client site and called support, I was told to use the other option.  What is the difference between the two, and what is the general use case for each?  Thanks.


Re: Ask the Expert: SMB Protocol on an Isilon Cluster

Why use isi_for_array with tcpdump vs. isi_netlogger?

pabromitis

Re: Ask the Expert: SMB Protocol on an Isilon Cluster

Hello prashant_shah,

This option is often misunderstood, so I am glad you asked.

When a cluster is set up, /ifs is configured with the following default permissions:

ISI7021-1# ls -led /ifs

drwxrwxrwx    9 root  wheel  158 Jul 17 07:46 /ifs

OWNER: user:root

GROUP: group:wheel

SYNTHETIC ACL

0: user:root allow dir_gen_read,dir_gen_write,dir_gen_execute,std_write_dac,delete_child

1: group:wheel allow dir_gen_read,dir_gen_write,dir_gen_execute,delete_child

2: everyone allow dir_gen_read,dir_gen_write,dir_gen_execute,delete_child

If you create a directory through the WebUI or CLI, the directory will get the following permissions:

ISI7021-1# ls -led /ifs/tmp

drwxr-xr-x    2 root  wheel  0 Jul 17 07:46 /ifs/tmp

OWNER: user:root

GROUP: group:wheel

SYNTHETIC ACL

0: user:root allow dir_gen_read,dir_gen_write,dir_gen_execute,std_write_dac,delete_child

1: group:wheel allow dir_gen_read,dir_gen_execute

2: everyone allow dir_gen_read,dir_gen_execute

If you create a new share pointing to the /ifs/tmp directory and select "Do not change existing permissions", it will leave the permissions as:

ISI7021-1# ls -led /ifs/tmp

drwxr-xr-x    2 root  wheel  0 Jul 17 07:46 /ifs/tmp

OWNER: user:root

GROUP: group:wheel

SYNTHETIC ACL

0: user:root allow dir_gen_read,dir_gen_write,dir_gen_execute,std_write_dac,delete_child

1: group:wheel allow dir_gen_read,dir_gen_execute

2: everyone allow dir_gen_read,dir_gen_execute

If you create a new share pointing to the /ifs/tmp directory and select "Apply Windows Default ACLs", the equivalent of the following will be run against the directory:

chmod -D /ifs/tmp

chmod -c dacl_auto_inherited,dacl_protected /ifs/tmp

chmod +a# 0 group Administrators allow dir_gen_all,object_inherit,container_inherit /ifs/tmp

chmod +a# 1 group creator_owner allow dir_gen_all,object_inherit,container_inherit,inherit_only /ifs/tmp

chmod +a# 2 group everyone allow dir_gen_read,dir_gen_execute /ifs/tmp

chmod +a# 3 group Users allow dir_gen_read,dir_gen_execute,object_inherit,container_inherit /ifs/tmp

chmod +a# 4 group Users allow std_synchronize,add_file,add_subdir,container_inherit /ifs/tmp

That ends up converting the ACL to:

ISI7021-1# ls -led /ifs/tmp

drwxrwxr-x +  2 root  wheel  0 Jul 17 07:46 /ifs/tmp

OWNER: user:root

GROUP: group:wheel

CONTROL:dacl_auto_inherited,dacl_protected

0: group:Administrators allow dir_gen_all,object_inherit,container_inherit

1: creator_owner allow dir_gen_all,object_inherit,container_inherit,inherit_only

2: everyone allow dir_gen_read,dir_gen_execute

3: group:Users allow dir_gen_read,dir_gen_execute,object_inherit,container_inherit

4: group:Users allow std_synchronize,add_file,add_subdir,container_inherit

This may or may not be a good thing for the permissions on your directories.  Let's say that /ifs/tmp was an NFS export and you explicitly wanted those mode bit rights set because of Unix client application requirements.  By selecting the "Apply Windows Default ACLs" option, you have now overwritten the original permissions, which may break the application.  Thus, there is risk associated with using "Apply Windows Default ACLs" on an existing directory.

On the flip side, let's say that /ifs/tmp was a brand new directory created from the CLI that you want Windows users to be able to create and delete files in.  When creating the share, if you select "Do not change existing permissions" and then have users attempt to save files there, they will get access denied because "Everyone" only gets read access.  In fact, even as Administrator, you would not be able to modify the Security tab of the directory to add Windows users, because the mode bits limit write access to root only.

In summary, a pretty good rule of thumb is as follows:

-- If you have an existing directory structure that you want to add a share to, you most likely do not want to change the ACL, so you should select the "Do not change existing permissions" option.

-- If you are creating a new share for a new directory, you will likely be changing the ACL to grant Windows users rights to perform operations.  Thus, you should select the "Apply Windows Default ACLs" option and then, once the share is created, go into the Security tab from Windows and assign permissions to users as needed.
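As a side note, if you would rather grant those rights from the CLI than from the Windows Security tab, a rough sketch modeled on the chmod examples earlier in this post would be the following (the domain and group name are placeholders, and the exact syntax can vary by OneFS version, so treat this as illustrative):

# "DOMAIN\Domain Users" is a placeholder group; adjust the rights to what your users actually need
chmod +a group "DOMAIN\Domain Users" allow dir_gen_read,dir_gen_write,dir_gen_execute,delete_child,object_inherit,container_inherit /ifs/tmp

Running ls -led /ifs/tmp afterwards should show the new ACE appended to the ACL.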

pabromitis

Re: Ask the Expert: SMB Protocol on an Isilon Cluster

Hello Mark,

isi_netlogger is a wrapper for tcpdump and is cluster aware, which is why you do not need to run it with isi_for_array; it has the -c switch to select all nodes.

Tcpdump is native to FreeBSD and is not cluster aware, therefore you have to use isi_for_array when you want to run it across multiple nodes.

I used to use isi_netlogger quite a bit, but have since switched to just using tcpdump.  One of my favorite commands to run, depending on the scenario, is:

tcpdump -s 0 -i <interface> -w /ifs/data/Isilon_Support/`hostname`.$(date +%m%d%Y_%H%M%S).<interface>.pcap &


Or if I need multiple nodes:


isi_for_array 'tcpdump -s 0 -i <interface> -w /ifs/data/Isilon_Support/`hostname`.$(date +%m%d%Y_%H%M%S).<interface>.pcap' &


Also, when taking traces for SMB issues, please make sure to use the -s 0 switch.  By default tcpdump will truncate frames to 96 bytes.  People used to set it to around 400 for SMB1, but if you do that for SMB2, you will lose compounded commands, so it is best to capture the entire frame.
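One more note on full-frame captures: because they grow quickly, it can help to let tcpdump rotate its output into a ring buffer while you wait to reproduce the issue.  A rough example using the standard -C/-W switches (the file size and count here are arbitrary):

# rotate at roughly 100 MB per file and keep at most 10 files so /ifs/data/Isilon_Support does not fill up
tcpdump -s 0 -i <interface> -C 100 -W 10 -w /ifs/data/Isilon_Support/`hostname`.<interface>.pcap &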

prashant_shah

Re: Ask the Expert: SMB Protocol on an Isilon Cluster

Great explanation!  Thank you.


Re: Ask the Expert: SMB Protocol on an Isilon Cluster

When looking to run a tcpdump to troubleshoot an SMB connection between a client and an Isilon cluster, I have always limited my capture to a specific client (i.e. tcpdump -s 0 -i <interface> host <windows client>).  I've done that because we have up to a thousand client connections per node, so that is just a lot to dig through.  Is that a bad idea, since I am missing any other interactions which may or may not be relevant, such as interactions between Isilon and the domain controllers?

pabromitis

Re: Ask the Expert: SMB Protocol on an Isilon Cluster

Hello Mark,

Collecting packet traces is an art; you have to know enough about the problem in order to identify how to filter.  We used to be able to get away with no capture filters, but as interfaces have gotten faster, it's just not reasonable.  Two things tend to happen on a 10G interface when you don't use a capture filter:

1.) The trace becomes massive very quickly because of all the traffic

2.) Tcpdump cannot flush to disk fast enough, so you end up with dropped frames, making the trace unreliable

Even with just filtering on a single client, they can push enough load that the trace ends up with dropped frames.  At that point it becomes a question of what you are trying to accomplish with the trace.

When troubleshooting a failure via packet trace, I usually do the following:

-- Connect to \\cluster <do not add a share>

   -- If this works, you can almost always get away with filtering on just the client IP from a cluster-side trace, because the problem is outside of authentication.

-- If this fails, connect to a node ip without a share \\x.x.x.x

     -- If this works, you are troubleshooting a Kerberos-type problem; the trace you need is from the client so you can see the traffic between client -> DC and client -> cluster

    -- If this fails, both NTLM and Kerberos are failing; the trace you need is cluster side, and you can filter on the client and all of the DCs that are in the same AD Site as the cluster (see the example after the note below).

**It should be noted that all of the above assumes your client has a direct connection to the cluster, i.e. they are not going through a firewall or WAN accelerator.  If they do go through one of those devices, you will probably need port mirrors of various interfaces to get a full understanding of where the problem is.
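To make that last scenario concrete, a cluster-side capture filtered on the client plus the DCs in the cluster's AD Site could look something like the sketch below (the IPs are placeholders; extend the host list to match your own DCs):

isi_for_array 'tcpdump -s 0 -i <interface> -w /ifs/data/Isilon_Support/`hostname`.$(date +%m%d%Y_%H%M%S).pcap host <client ip> or host <dc1 ip> or host <dc2 ip>' &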

Hopefully after reading this, it will make a little more sense as to why, when working with support, we may ask you to take multiple traces.  The reality is, we are often troubleshooting while collecting packet traces and using them to narrow in on where the problem is.


Re: Ask the Expert: SMB Protocol on an Isilon Cluster

The problem I am working on now is an odd one.  We have a drive mapping set via Group Policy to a DFS server, with Isilon as the share DFS is encapsulating (i.e. client -> \\DFSServer\Share -> \\IsilonCluster\Share).  When the user first logs in, they get a generic "Access Denied".  If they UNC to either the DFS path or the Isilon path, it works completely fine.  We map a second drive via CLI or GUI to the same path Group Policy maps to, and it works fine.  But the drive mapped via Group Policy still fails.  Log out and log back in (not a reboot) and the problem goes away.  Disable SMB2 and the problem goes away.  Map a drive directly to Isilon via Group Policy, and it works.

What I'm struggling to understand is when authentication happens and when Isilon interacts with a Domain Controller for both SMB2 and SMB1.

pabromitis

Re: Ask the Expert: SMB Protocol on an Isilon Cluster

Hello Mark,

While SMB1 and SMB2 use two different code paths, there technically should not be much difference between them, as DFS works over IOCTL.  On the positive side, you have a working and a failing example that we can compare, and it seems you have narrowed the issue down to something that should be reproducible in a lab.  The downside is that the first place I would look is the client-side trace, to figure out whether it is failing against the DFS server or the Isilon cluster.  Since you are using GPO, that means you would need to port mirror the client port as you reboot the box.  If you want to PM the case number, I can take a look at the data that we have to see if I can identify where the failure is.

What I'm struggling to understand is when authentication happens and when Isilon interacts with a Domain Controller for both SMB2 and SMB1.

For both SMB1 and SMB2, authentication and communication with the DC always occur as follows:

Step 1.) Figure out what version of SMB to use (smb1 or smb2)

Client -> SMB Negotiate Protocol Request -> Server

Client <- SMB Negotiate Protocol Response <- Server

Step 2.) Perform Authentication

Client -> Session Setup Request -> Server

  -- For NTLM the Server talks to the DC at this point

  -- For Kerberos, it's the client's job to get the Kerberos ticket, so the server does not have to talk to the DC at all

Client <- Session Setup Response <- Server

Step 3.) Access the shares and do all other operations (i.e. findfirst, reads, writes, etc.)

Client -> Tree Connect Request -> Server

Client <- Tree Connect Response <- Server

Once Step 2 is complete, authentication is done; the Windows token has been established and is kept in memory for the life of the SMB session.  When the client accesses files and permission checking is required in Step 3 and beyond, there is no need to talk to the DC to look up group memberships.  Once the client tears down the SMB session (for example, a Session Logoff or TCP RST), the client will have to go back through Step 2 before it can move on to Step 3 and beyond.
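If you want to pick those three steps out of a trace, a quick Wireshark display filter for the SMB2 side is the following sketch (smb2.cmd values 0, 1 and 3 correspond to Negotiate, Session Setup and Tree Connect):

smb2.cmd == 0 || smb2.cmd == 1 || smb2.cmd == 3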

When you add DFS in the mix, the client has to:

-- Perform Step 1-3 against the DFS Server

-- Get redirected to the cluster via a DFS referral

-- Go through Steps 1-3 against the cluster

-- Finally connect to the path on the cluster

Since you have GPO in play as well, that initial connection against the cluster may be under the client's machine context rather than the client's user context, which means it may be coming in as an anonymous user, which could be causing the Access Denied.

The best course of action would be:

-- Start a port mirror of the client

-- Reboot the client and generate the error

-- Look at the trace for the following:

-- Apply the Wireshark filter smb2.nt_status != 0 and figure out which frame the Access Denied is coming in on

-- Determine if it is the DFS Server or the Isilon Cluster throwing the error

-- Follow the TCP Stream (right-click option on the problem frame) and go to the beginning to locate the Session Setup to determine what user account was actually being used
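Since disabling SMB2 changes the behavior in your case, it may also be worth catching errors from both dialects in one pass; a hedged variant of the display filter in the steps above would be:

smb.nt_status != 0 || smb2.nt_status != 0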

pabromitis

Re: Ask the Expert: SMB Protocol on an Isilon Cluster

We had a lot of great discussions last week, so I figured I would kick this week off with one of my least favorite topics: SMB performance.

When it comes to performance, 99% of the time there is no silver bullet to fix the issue.  Since performance will likely be a future ATE, I am going to focus this conversation on what to look for from an SMB perspective.

Most of the time, when someone comes to me and says SMB is slow, I ask the following questions:

1.) How are you measuring slow?  Wall Clock or something that can accurately measure time?

2.) Define slow?  Did a job that always ran in 10 seconds now take 20 seconds?

3.) What has changed? Are there new jobs running on the system that never ran before? *Interesting note, in all the years I have been in support, nothing has ever changed

From an Isilon perspective, we have four ways to measure performance:

1.) InsightIQ - This tool is a VM that sits on your network, collects data from your cluster, and stores the data in a local database.  You can then give the database to support, who can pull it into their own InsightIQ system to analyze the data.

2.) Historical Counters - Yes!  We collect historical counters, so you could call support and tell us, "Hey, I saw a slowdown last Monday," and we can pull the historical stats to start our analysis.  There are some caveats with historical counters:

  -- Not all counters are historical

  -- They are less accurate as time elapses

3.) FreeBSD stats - Standard performance tools built into FreeBSD for troubleshooting issues

4.) isi statistics - Isilon-specific OneFS counters that can give us a better understanding of where latency is seen

To limit the size of this post, I am going to talk about the FreeBSD and isi statistics stats that I look at when troubleshooting SMB performance issues.

One of the first places people look during an issue is ps or top.  These are good places to start to help you understand how much CPU lwio and the other Likewise services are using, but they are also the greatest cause of confusion.  The lwio process within OneFS is multithreaded, but when you look at it with the default output of ps and top, it looks single threaded.  Engineers tend to become concerned when they see it approaching 100% and become confused when it is over 100%.  Do not be afraid of lwio running at 100% when using the default output of ps.

For example:

Regular ps looks scary at 110%:

b5-2-1# ps -fwulp `pgrep lwio`

USER   PID %CPU %MEM   VSZ   RSS  TT  STAT STARTED      TIME COMMAND            UID  PPID CPU PRI NI MWCHAN

root  3311  110.0  0.1 130836 15688  ??  I    24May13 387:36.82 lw-container lwi     0  3171   0  96  0 ucond

But ps with the -H flag to show threads does not look so bad:

b5-2-1# ps -fwulHp `pgrep lwio`

USER   PID %CPU %MEM   VSZ   RSS  TT  STAT STARTED      TIME COMMAND            UID  PPID CPU PRI NI MWCHAN

root  3311  0.0   0.1 130836 15688  ??  I    24May13   0:00.03 lw-container lwi     0  3171   0  20  0 sigwait

root  3311  20.0  0.1 130836 15688  ??  I    24May13   1:39.00 lw-container lwi     0  3171   0   4  0 kqread

root  3311  20.0  0.1 130836 15688  ??  I    24May13   0:00.09 lw-container lwi     0  3171   0   4  0 kqread

root  3311  20.0  0.1 130836 15688  ??  I    24May13   0:00.33 lw-container lwi     0  3171   0   4  0 kqread

root  3311  20.0  0.1 130836 15688  ??  I    24May13 378:10.17 lw-container lwi     0  3171   0   4  0 kqread

root  3311  20.0  0.1 130836 15688  ??  I    24May13   0:00.32 lw-container lwi     0  3171   0   4  0 kqread

root  3311  1.0   0.1 130836 15688  ??  I    24May13   0:00.26 lw-container lwi     0  3171   0   4  0 kqread

root  3311  1.0   0.1 130836 15688  ??  I    24May13   0:00.48 lw-container lwi     0  3171   0   4  0 kqread

root  3311  0.0   0.1 130836 15688  ??  I    24May13   7:43.44 lw-container lwi     0  3171   0   4  0 kqread

root  3311  0.0   0.1 130836 15688  ??  I    24May13   0:00.03 lw-container lwi     0  3171   0  96  0 ucond

root  3311  0.0   0.1 130836 15688  ??  I    24May13   0:00.08 lw-container lwi     0  3171   0  96  0 ucond

root  3311  0.0   0.1 130836 15688  ??  I    24May13   0:02.60 lw-container lwi     0  3171   0  96  0 ucond

I am going to steal a quote from the great Tim Wright:

"It is entirely normal and expected to see multiple threads consuming 15, 20, 25% cpu at times. *If* you see one or more threads that are consistently and constantly consuming 100% cpu, *then* you probably have a problem. If you just see the sum of all the lwio threads consuming  >100% cpu, that is not likely to be a problem. Certain operations including auth can be somewhat cpu-intensive. Imagine hundreds of users connecting to the cluster in the morning."

After we have eased our concerns over CPU, the next place to look is the isi statistics commands so we can understand what kind of work the clients are doing.  When running isi statistics, there are a couple of things to be aware of:

-- You only need to run the command from one node to capture stats for all nodes (--nodes=all)

-- Use the --degraded switch so if one of the nodes does not respond to the counter fast enough, it does not stop the continual output

To start, it is always good to know how many clients are connecting to the nodes:

isi statistics query --nodes=all --stats node.clientstats.connected.smb,node.clientstats.active.cifs,node.clientstats.active.smb2 --interval 5 --repeat 12 --degraded

isi-ess-east-1# isi statistics query --nodes=all --stats node.clientstats.connected.smb,node.clientstats.active.cifs,node.clientstats.active.smb2 --interval 5 --repeat 12 --degraded

  NodeID node.clientstats.connected.smb node.clientstats.active.cifs node.clientstats.active.smb2

       1                            560                            1                           18

       3                            554                            0                           17

       4                            558                            0                           3

average                       551                        0                            25

The above output shows there are 560 SMB sessions to node 1, with 1 active SMB1 session and 18 active SMB2 sessions.  The 560 SMB sessions represent clients that are connected to the node but did not send any requests during the time the counter was run; thus, they are considered idle connections.  The 19 active sessions represent clients that sent an SMB1/2 request during the time this counter was collected that the node had not yet responded to.  This counter can be indicative of an issue but will not tell you directly where the problem is.  As the active count increases (specifically when it becomes a higher percentage of all connections), it usually means something is latent.

When I start looking to see if SMB is latent, I prefer the following stat over the connection count:

isi statistics protocol --nodes=all --protocols=smb1,smb2 --total --interval 5 --repeat 12 --degraded

isi-ess-east-1# isi statistics protocol --nodes=all --protocols=smb1,smb2 --total --interval 5 --repeat 12 --degraded

Ops  In Out TimeAvg TimeStdDev Node Proto Class Op

N/s B/s B/s      us         us                   

  10.1  1.2K   10K  2081.6     4037.0    1  smb1     *  *

706.4  140K  129K  180817.1   2589.5    1  smb2     *  *

   0.4  30.3  33.5  5085.5     5895.1    3  smb1     *  *

812.7   18K  8.2K  151469.2   6842.4    3  smb2     *  *

   0.4  30.4  33.6  1542.0     1074.8    4  smb1     *  *

  71.6   23K   13K  25407.8      714.0   4  smb2     *  *

The above stat tells you whether the clients are using SMB1 or SMB2 and what the overall latency looks like.  The only problem I have with this stat is that it includes Change Notify in the calculation of latency, so it will throw off the time average.  A good rule of thumb is to look for a good number of ops (like 1k) and make sure the standard deviation is not abnormal.  The stat above does suggest that SMB2 operations to nodes 1 and 3 are showing signs of latency: they have 700-800 ops with a Time Avg of 150ms - 180ms.

This leads us to our next logical counter, which breaks down the protocol by operation type.  For that I use this command:

isi statistics protocol --nodes=all --protocols=smb1,smb2 --orderby=Class --interval 5 --repeat 12 --degraded

isi-ess-east-1# isi statistics protocol --nodes=all --protocols=smb1,smb2 --orderby=Class --interval 5 --repeat 12 --degraded

Ops  In Out TimeAvg TimeStdDev Node Proto Class Op

N/s B/s B/s      us         us                   

   0.2  63.8  39.0   1098.0        0.0    3  smb2         create          create

   2.5 604.9 340.0   2353.2     5749.7    4  smb2         create          create

   0.2  15.1  21.1   3009.0        0.0    1  smb2     file_state           close

   0.9  82.7 115.1    132.5       30.4    4  smb2     file_state           close

   1.0 112.1 100.7    253.8       79.8    3  smb2 namespace_read      query_info

   0.9  97.1  72.4     83.0       41.6    4  smb2 namespace_read      query_info

695.2   81K   46M 190364.5      223.9    3  smb2           read            read

   3.1 368.1   91K   1556.9     4474.0    4  smb2           read            read

   0.2  26.5  18.9    201.0        0.0    4  smb2  session_state    tree_connect

   0.2  63.8  39.0   1098.0        0.0    3  smb2         create          create

   0.2  16.6  23.1    400.0        0.0    1  smb2     file_state           close

   1.0 112.1 100.7    253.8       79.8    3  smb2 namespace_read      query_info

695.2   81K   46M 190364.5       223.9    3  smb2           read            read

If we compare the data above to the previous output, we can see that for node 3, out of the 812 SMB2 ops, 695 were reads and the average latency was 190 ms.  We are on the path to finding our culprit.

Since we have established that SMB2 is latent and it appears to be impacting reads, the next place to look would be disk:

isi statistics drive --nodes=all --interval 5 --repeat 12 --degraded

isi-ess-east-1# isi statistics drive --nodes=all --interval 5 --repeat 12 --degraded

Drive    Type OpsIn BytesIn SizeIn OpsOut BytesOut SizeOut TimeAvg Slow TimeInQ Queued  Busy  Used Inodes

LNN:bay           N/s     B/s      B    N/s      B/s       B      ms  N/s      ms            %     %      

    1:1     SATA  72.2    2.3M    32K  129.8     2.2M     17K     0.6  0.0    58.8    8.7  93.5 100.0   3.4M

    1:2     SATA  56.8    1.9M    34K  157.4     2.9M     18K     0.4  0.0   208.8   31.0  65.1 100.0   3.0M

    1:3     SATA  86.0    2.4M    28K   88.6     1.6M     18K     0.4  0.0   133.1   22.0  84.3 100.0   3.0M

    1:4     SATA  54.0    2.1M    38K  118.6     2.3M     20K     0.4  0.0    52.3   11.1  72.7 100.0   2.5M

    1:5     SATA  74.0    2.5M    34K  106.6     2.1M     20K     0.4  0.0    52.3    9.3  57.3 100.0   3.3M

    1:6     SATA  66.2    2.5M    38K  100.6     2.0M     20K     0.4  0.0    53.3    8.4  86.1 100.0   3.2M

    1:7     SATA  47.4    1.6M    34K   94.2     1.8M     20K     0.4  0.0    46.4    7.8  49.7 100.0   3.3M

    1:8     SATA  65.4    2.3M    35K  145.8     2.5M     17K     0.4  0.0    37.8    7.5  75.1 100.0   3.4M

    1:9     SATA  51.2    2.1M    40K  119.2     2.1M     18K     0.4  0.0    35.8    6.7  56.3 100.0   2.5M

    1:10    SATA  62.0    2.0M    32K  101.2     2.2M     22K     0.4  0.0    33.8    6.0  56.5 100.0   3.4M

    1:11    SATA 126.6    3.2M    25K   76.2     1.4M     18K     0.3  0.0   201.1   33.5 100.0 100.0   3.0M

    1:12    SATA  66.2    2.0M    31K  117.8     1.9M     16K     0.3  0.0   106.9   21.3  85.1 100.0   3.0M

    3:1     SATA  40.0    1.4M    36K  107.4     1.8M     17K     0.3  0.0    89.2   17.1  37.5 100.0   2.9M

    3:2     SATA  54.2    1.8M    33K  113.4     1.9M     17K     0.3  0.0    68.4   14.7  60.7 100.0   3.0M

    3:3     SATA  56.0    2.1M    38K  112.2     2.0M     17K     0.3  0.0    65.6   14.4  40.7 100.0   3.3M

    3:4     SATA  73.8    2.3M    32K  113.6     2.0M     17K     0.3  0.0   114.3   13.9  54.5 100.0   2.3M

    3:5     SATA  66.8    2.1M    32K  106.8     1.9M     18K     0.3  0.0    74.0   11.2  50.5 100.0   3.5M

    3:6     SATA  78.4    2.7M    34K  138.2     2.2M     16K     0.3  0.0    75.8   11.1  82.1 100.0   3.4M

    3:7     SATA  58.4    2.2M    38K  127.8     2.1M     16K     0.3  0.0    77.1   11.0  54.7 100.0   3.4M

    3:8     SATA  54.6    2.0M    37K   90.4     1.4M     16K     0.3  0.0    75.1   10.7  39.9 100.0   3.0M

    3:9     SATA  56.2    2.0M    36K  139.4     2.5M     18K     0.3  0.0    59.9   10.4  61.5 100.0   3.3M

    3:10    SATA  59.0    1.9M    33K  110.2     1.8M     16K     0.3  0.0    55.2   10.2  49.3 100.0   3.3M

    3:11    SATA  55.0    2.0M    37K  122.2     1.9M     16K     0.3  0.0    59.4    9.3  46.1 100.0   2.5M

    3:12    SATA  51.4    1.8M    35K  102.0     2.1M     20K     0.3  0.0    50.3    9.1  47.7 100.0   3.3M

    4:1     SATA  52.2    1.8M    34K  117.2     2.1M     18K     0.3  0.0    53.5    8.8  51.7 100.0   2.8M

    4:2     SATA  58.8    2.1M    35K  107.2     2.0M     18K     0.3  0.0    47.8    8.7  48.9 100.0   3.3M

    4:3     SATA  64.8    2.3M    35K  120.6     2.2M     18K     0.3  0.0    44.2    8.6  57.1 100.0   3.4M

    4:4     SATA  50.8    1.8M    35K   77.8     1.7M     22K     0.3  0.0    53.8    8.6  38.1 100.0   2.7M

    4:5     SATA  58.4    2.2M    38K  135.6     2.4M     18K     0.3  0.0    51.8    8.4  48.9 100.0   3.4M

    4:6     SATA  65.0    2.4M    37K  108.8     2.1M     19K     0.3  0.0    55.9    8.3  55.3 100.0   3.3M

    4:7     SATA  57.0    2.1M    37K  106.8     2.2M     21K     0.3  0.0    46.9    8.2  49.5 100.0   3.3M

    4:8     SATA  58.8    2.0M    34K  149.0     2.7M     18K     0.3  0.0    46.2    8.2  58.7 100.0   3.3M

    4:9     SATA  53.2    1.8M    33K  124.0     2.3M     19K     0.3  0.0    45.0    8.2  56.3 100.0   3.4M

    4:10    SATA  76.0    2.3M    30K  103.8     1.9M     18K     0.3  0.0    44.8    8.0  60.1 100.0   3.3M

    4:11    SATA  60.2    2.1M    35K  116.0     1.9M     17K     0.3  0.0    42.6    7.9  65.5 100.0   3.1M

    4:12    SATA  59.0    2.1M    35K  100.2     1.8M     18K     0.3  0.0    48.2    7.8  41.3 100.0   2.4M

The above output shows the source of our problem: these poor SATA disks are doing an average of 170 ops (in and out) and are struggling to keep up.  The latent SMB2 reads are a victim of the disks, which in this case are spindle bound due to other contention.

Outside of tracking down an SMB performance issue due to disk, a couple of other useful counters to look at are:

Is directory enumeration bad?

isi statistics protocol  --nodes=all --protocols=smb1,smb2 --orderby=Out --classes=namespace_read --interval 5 --repeat 12 --degraded

isi-ess-east-1#   isi statistics protocol  --nodes=all --protocols=smb1,smb2 --orderby=Out --classes=namespace_read --interval 5 --repeat 12 --degraded

   Ops    In   Out TimeAvg TimeStdDev Node Proto          Class               Op

   N/s   B/s   B/s      us         us                                          

  13.0  1.5K   44K   510.1     1331.5    1  smb2 namespace_read  query_directory

227.9   25K   35K   226.6      791.0    3  smb2 namespace_read  query_directory

  60.1  6.9K   31K   400.1     3127.8    4  smb2 namespace_read  query_directory

   5.2 720.3  5.6K   822.5      293.8    1  smb1 namespace_read trans2:findfirst

   2.2 305.8  4.7K  6452.5    19478.9    3  smb1 namespace_read trans2:findfirst

  20.5  2.3K  4.3K  1158.9     6969.6    1  smb2 namespace_read  query_directory

   0.2  29.4  3.0K  1293.0        0.0    3  smb1 namespace_read  trans2:findnext

When looking at this stat, focus on the FindFirst/FindNext operations.  You are looking for a small number of ops that cause a large amount of Out B/s.  The Time Avg might look normal, but what we are really looking for is when a client does a * enumeration of a very large directory.  The request itself will be 1 op but will amount to a tremendous number of bytes that need to be returned.

Is one of the authentication providers causing a delay?

isi statistics protocol --nodes=all --protocols=lsass_in,lsass_out --total --interval 5 --repeat 12 --degraded

isi-ess-east-1# isi statistics protocol --nodes=all --protocols=lsass_in,lsass_out --total --interval 5 --repeat 12 --degraded

Ops  In Out TimeAvg TimeStdDev Node    Proto         Class                         Op

N/s B/s B/s      us         us                                                      

0.4 0.0 0.0  8977.0     5208.5    4 lsass_in session_state lsa:id:ioctl:pac_to_ntoken

0.4 0.0 0.0   383.0        2.8    4 lsass_in session_state      ntlm:accept_sec_ctxt1

0.4 0.0 0.0 10256.5       27.6    4 lsass_in session_state      ntlm:accept_sec_ctxt2

0.7 0.0 0.0   576.2      390.3    4 lsass_in session_state         ntlm:acquire_creds

0.4 0.0 0.0   136.0      144.2    4 lsass_in session_state       ntlm:delete_sec_ctxt

0.7 0.0 0.0    37.0        6.8    4 lsass_in session_state            ntlm:free_creds

1.8 0.0 0.0    48.2       24.6    4 lsass_in session_state            ntlm:query_ctxt

lsa:id:ioctl:pac_to_ntoken - Represents how long it takes a DC to complete a Sid2Name lookup

ntlm:accept_sec_ctxt2 - Represents how long it took a DC to complete NTLM authentication

If either of the above shows ops where the Time Avg is high, it's time to start looking at the DC as the cause of the delay.

Good luck with performance troubleshooting, and remember: it's very rare that the protocol itself (i.e. SMB) is the one causing the latency.