
July 1st, 2013 07:00

Ask the Expert: SMB Protocol on an Isilon Cluster

Welcome to this EMC Support Community Ask the Expert conversation.


This discussion will focus on supporting the SMB Protocol on an Isilon Cluster, including:

  • Differences between SMB1 and SMB2
  • What the various isi auth and isi smb configuration options do
  • What logs and commands are used to diagnose issues
  • General troubleshooting concepts for SMB on an Isilon Cluster

Your host:


Peter Abromitis has been in support for over 8 years and specializes in Windows protocols. He is currently the Subject Matter Expert for Windows Protocols within Isilon Support, which involves everything from troubleshooting problems with SMB1, SMB2, Active Directory, and permissions using standard Isilon tools and packet traces, to helping and developing TSEs as they progress through their careers, to driving supportability improvements into OneFS to make life easier for both customers and support engineers when dealing with issues on an Isilon cluster.


Please see the posts below to read the full discussion. For a high-level summary of some of the key topics, Pete has posted this document: https://community.emc.com/docs/DOC-26337

17 Posts

July 18th, 2013 05:00

Hello Mark,

Collecting packet traces is an art; you have to know enough about the problem to identify how to filter.  We used to be able to get away with no capture filters, but as interfaces have gotten faster, it's just not reasonable anymore.  Two things tend to happen on a 10G interface when you don't use a capture filter:

1.) The trace becomes massive very quickly because of all the traffic

2.) Tcpdump cannot flush to disk fast enough, so you end up with dropped frames, making the trace unreliable

Even when filtering on just a single client, it can push enough load that the trace ends up with dropped frames.  At that point it becomes a question of what you are trying to accomplish with the trace.

When troubleshooting a failure via packet trace, I usually do the following:

-- Connect to \\cluster

   -- If this works, you can almost always get away with filtering on just the client IP from a cluster-side trace, because the problem is outside of authentication.

-- If this fails, connect to a node IP without a share: \\x.x.x.x

   -- If this works, you are troubleshooting a Kerberos-type problem; the trace you need is from the client so you can see the traffic between client -> DC and client -> cluster.

   -- If this fails, both NTLM and Kerberos are failing; the trace you need is cluster-side, and you can filter on the client and all of the DCs that are in the same AD site as the cluster.

**Note that all of the above assumes your client has a direct connection to the cluster, i.e., it is not going through a firewall or WAN accelerator.  If it does go through one of those devices, you will probably need port mirrors of various interfaces to get a full understanding of where the problem is.
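As a concrete sketch of the simplest case above (a cluster-side trace filtered on a single client), a plain tcpdump run from the node the client connects to is usually enough.  The interface name, client IP, and output path below are placeholders, not Isilon-specific values:

# Hypothetical example: capture only SMB traffic to/from one client on this node.
# Replace em0, 10.1.1.50 and the output path with values from your environment.
tcpdump -i em0 -s 0 -w /ifs/data/Isilon_Support/client_trace.pcap 'host 10.1.1.50 and (port 445 or port 139)'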

Hopefully after reading this, it makes a little more sense why, when working with support, we may ask you to take multiple traces.  The reality is, we are often troubleshooting while collecting packet traces, and we are using them to narrow in on where the problem is.

467 Posts

July 18th, 2013 18:00

The problem I am working on now is an odd one.  We have a drive mapping set via Group Policy to a DFS server, with an Isilon share as the target DFS is encapsulating (i.e., client -> \\DFSServer\Share -> \\IsilonCluster\Share).  When the user first logs in they get a generic "Access Denied".  If they UNC to either the DFS path or the Isilon path, it works completely fine.  If we map a second drive via CLI or GUI to the same path Group Policy maps to, it works fine.  But the drive mapped via Group Policy still fails.  Log out and log back in (not a reboot) and the problem goes away.  Disable SMB2 and the problem goes away.  Map a drive directly to Isilon via Group Policy, and it works.

What I'm struggling to understand is when authentication happens and when Isilon interacts with a Domain Controller, for both SMB2 and SMB1.

17 Posts

July 19th, 2013 06:00

Hello Mark,

While SMB1 and SMB2 use two different code paths, there technically should not be much difference between them, as DFS works over IOCTL.  On the positive side, you have a working and a failing example that we can compare, and it seems you have narrowed the issue down to something that should be reproducible in a lab.  The downside is that the first place I would look is a client-side trace, to figure out whether it is failing against the DFS server or the Isilon cluster.  Since you are using GPO, that means you would need to port mirror the client's port as you reboot the box.  If you want to PM me the case number, I can take a look at the data that we have to see if I can identify where the failure is.

What I'm struggling to understand is when authentication happens and when Isilon interacts with a Domain Controller for both SMB2 and SMB1.

For both SMB1 and SMB2, authentication and communication with the DC always occurs as follows:

Step 1.) Figure out what version of SMB to use (smb1 or smb2)

Client -> SMB Negotiate Protocol Request -> Server

Client <- SMB Negotiate Protocol Response <- Server

Step 2.) Perform Authentication

Client -> Session Setup Request -> Server

  -- For NTLM, the server talks to the DC at this point

  -- For Kerberos, it's the client's job to get the Kerberos ticket, so the server does not have to talk to the DC at all

Client <- Session Setup Response <- Server

Step 3.) Access the shares and do all other operations (i.e., findfirst, reads, writes, etc.)

Client -> Tree Connect Request -> Server

Client <- Tree Connect Response <- Server

Once Step 2 is complete, authentication is done: the Windows token has been established and is kept in memory for the life of the SMB session.  When the client accesses files and permission checking is required in Step 3 and beyond, there is no need to talk to the DC to look up group memberships.  Once the client tears down the SMB session (for example via a Session Logoff or a TCP RST), it will have to go back through Step 2 before it can move on to Step 3 and beyond.
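If you want to see just these three steps in a trace you have already collected, a display filter on the SMB command codes does the trick.  A rough sketch with tshark (the capture file name is a placeholder; older tshark builds take -R instead of -Y for the display filter):

# Hypothetical example: SMB2 Negotiate (0x00), Session Setup (0x01) and Tree Connect (0x03) only.
tshark -r client_trace.pcap -Y 'smb2.cmd == 0 || smb2.cmd == 1 || smb2.cmd == 3'

# SMB1 equivalents: Negotiate (0x72), Session Setup AndX (0x73), Tree Connect AndX (0x75).
tshark -r client_trace.pcap -Y 'smb.cmd == 0x72 || smb.cmd == 0x73 || smb.cmd == 0x75'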

When you add DFS in the mix, the client has to:

-- Perform Step 1-3 against the DFS Server

-- Get redirected to the cluster via a dfs referral

-- Go through Steps 1-3 against the cluster

-- Finally connect to the path on the cluster

Since you have GPO in play as well, that initial connection against the cluster may be made under the client's machine context rather than the client's user context, which means it may be coming in as an anonymous user, which could be causing the Access Denied.

The best course of action would be:

-- Start a port mirror of the client

-- Reboot the client and generate the error

-- Look at the trace for the following:

   -- Apply the Wireshark filter smb2.nt_status != 0 and figure out in which frame the Access Denied comes back

   -- Determine whether it is the DFS server or the Isilon cluster throwing the error

   -- Follow the TCP stream (right-click option on the problem frame) and go to the beginning to locate the Session Setup, to determine what user account was actually being used
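For the first two bullets, the same check can be done from the command line instead of the Wireshark GUI.  A rough sketch with tshark, assuming the port mirror was saved as client_trace.pcap (older tshark builds take -R instead of -Y):

# Hypothetical example: list every SMB2 response with a non-zero status, with source and
# destination addresses so you can tell whether the DFS server or the cluster returned it.
tshark -r client_trace.pcap -Y 'smb2.nt_status != 0' -T fields -e frame.number -e ip.src -e ip.dst -e smb2.cmd -e smb2.nt_status

# Once you have the problem frame, dump its whole TCP conversation; replace 7 with the
# tcp.stream value of that frame to find the Session Setup at the start of the stream.
tshark -r client_trace.pcap -Y 'tcp.stream == 7'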

17 Posts

July 22nd, 2013 08:00

We had a lot of great discussions last week, so I figured I would kick this week off with one of my least favorite topics: SMB Performance.

When it comes to performance, 99% of the time there is no silver bullet to fix the issue.  Since performance will likely be a future ATE, I am going to focus this conversation on what to look for from an SMB perspective.

Most of the time, when someone comes to me and says SMB is slow, I ask the following questions:

1.) How are you measuring slow?  Wall Clock or something that can accurately measure time?

2.) Define slow?  Did a job that always ran in 10 seconds now take 20 seconds?

3.) What has changed? Are there new jobs running on the system that never ran before? *Interesting note, in all the years I have been in support, nothing has ever changed

From an Isilon Perspective, we have 4 ways to measure performance:

1.) InsightIQ - This tool is a VM that sits on your network, collects data from your cluster, and stores it in a local database.  You can then give the database to support, who can pull it into their own InsightIQ system to extrapolate the data.

2.) Historical counters - Yes!  We collect historical counters, so you could call support and say, "Hey, I saw a slowdown last Monday," and we can pull the historical stats to start our analysis.  There are some caveats with historical counters:

  -- Not all counters are historical

  -- They are less accurate as time elapses

3.) FreeBSD stats - Standard performance tools built into FreeBSD to troubleshoot issues

4.) isi statistics - Isilon-specific counters that measure OneFS internals and can give us a better understanding of where latency is being seen

To limit the size of this post, I am going to talk about the FreeBSD and isi statistics counters that I look at when troubleshooting SMB performance issues.

One of the first places people look during an issue is ps or top.  These are good places to start to help you understand how much CPU lwio and the other Likewise services are using, but they are also the greatest cause of confusion.  The lwio process within OneFS is a multithreaded process, but when you look at it with the default output of ps or top, it looks single-threaded.  Engineers tend to become concerned when they see it approaching 100% and become confused when it is over 100%.  Do not be afraid of lwio running at 100% when using the default output of ps.

For example:

Regular ps looks scary at 110%:

b5-2-1# ps -fwulp `pgrep lwio`

USER   PID %CPU %MEM   VSZ   RSS  TT  STAT STARTED      TIME COMMAND            UID  PPID CPU PRI NI MWCHAN

root  3311  110.0  0.1 130836 15688  ??  I    24May13 387:36.82 lw-container lwi     0  3171   0  96  0 ucond

But ps with the flag to show threads does not look so bad:

b5-2-1# ps -fwulHp `pgrep lwio`

USER   PID %CPU %MEM   VSZ   RSS  TT  STAT STARTED      TIME COMMAND            UID  PPID CPU PRI NI MWCHAN

root  3311  0.0   0.1 130836 15688  ??  I    24May13   0:00.03 lw-container lwi     0  3171   0  20  0 sigwait

root  3311  20.0  0.1 130836 15688  ??  I    24May13   1:39.00 lw-container lwi     0  3171   0   4  0 kqread

root  3311  20.0  0.1 130836 15688  ??  I    24May13   0:00.09 lw-container lwi     0  3171   0   4  0 kqread

root  3311  20.0  0.1 130836 15688  ??  I    24May13   0:00.33 lw-container lwi     0  3171   0   4  0 kqread

root  3311  20.0  0.1 130836 15688  ??  I    24May13 378:10.17 lw-container lwi     0  3171   0   4  0 kqread

root  3311  20.0  0.1 130836 15688  ??  I    24May13   0:00.32 lw-container lwi     0  3171   0   4  0 kqread

root  3311  1.0   0.1 130836 15688  ??  I    24May13   0:00.26 lw-container lwi     0  3171   0   4  0 kqread

root  3311  1.0   0.1 130836 15688  ??  I    24May13   0:00.48 lw-container lwi     0  3171   0   4  0 kqread

root  3311  0.0   0.1 130836 15688  ??  I    24May13   7:43.44 lw-container lwi     0  3171   0   4  0 kqread

root  3311  0.0   0.1 130836 15688  ??  I    24May13   0:00.03 lw-container lwi     0  3171   0  96  0 ucond

root  3311  0.0   0.1 130836 15688  ??  I    24May13   0:00.08 lw-container lwi     0  3171   0  96  0 ucond

root  3311  0.0   0.1 130836 15688  ??  I    24May13   0:02.60 lw-container lwi     0  3171   0  96  0 ucond

I am going to steal a quote from the great Tim W: "It is entirely normal and expected to see multiple threads consuming 15, 20, 25% CPU at times. *If* you see one or more threads that are consistently and constantly consuming 100% CPU, *then* you probably have a problem. If you just see the sum of all the lwio threads consuming >100% CPU, that is not likely to be a problem. Certain operations, including auth, can be somewhat CPU-intensive. Imagine hundreds of users connecting to the cluster in the morning."

After we have eased our concerns over CPU, the next place to look is the isi statistics commands, so we can understand what kind of work the clients are doing.  When running isi statistics, there are a couple of things to be aware of:

-- You only need to run the command from one node to capture stats for all nodes (--nodes=all)

-- Use the --degraded switch so if one of the nodes does not respond to the counter fast enough, it does not stop the continual output

To start, it is always good to know how many clients are connecting to the nodes:

isi statistics query --nodes=all --stats node.clientstats.connected.smb,node.clientstats.active.cifs,node.clientstats.active.smb2 --interval 5 --repeat 12 --degraded

isi-ess-east-1# isi statistics query --nodes=all --stats node.clientstats.connected.smb,node.clientstats.active.cifs,node.clientstats.active.smb2 --interval 5 --repeat 12 --degraded

  NodeID node.clientstats.connected.smb node.clientstats.active.cifs node.clientstats.active.smb2

       1                            560                            1                           18

       3                            554                            0                           17

       4                            558                            0                            3

 average                            551                            0                           25

The above output shows there are 560 SMB sessions to node 1, with 1 active SMB1 session and 18 active SMB2 sessions.  The 560 SMB sessions represent clients that are connected to the node but did not send any requests while the counter was running; they are considered idle connections.  The 19 active sessions represent clients that sent an SMB1/2 request during the collection interval that the node had not yet responded to.  This counter can be indicative of an issue but will not tell you directly where the problem is.  As the active count increases (specifically when it becomes a higher percentage of all connections), it usually means something is latent.
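As a rough way to eyeball that percentage, you can pipe the same query through awk.  This sketch assumes the column order shown above (NodeID, connected.smb, active.cifs, active.smb2); adjust the field numbers if your OneFS version prints the columns differently:

# Hypothetical example: active (cifs + smb2) sessions as a share of connected SMB sessions, per node.
isi statistics query --nodes=all \
  --stats node.clientstats.connected.smb,node.clientstats.active.cifs,node.clientstats.active.smb2 \
  --interval 5 --repeat 12 --degraded | \
  awk '$1 ~ /^[0-9]+$/ && $2 > 0 {printf "node %s: %.1f%% active\n", $1, 100*($3+$4)/$2}'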

When I start looking to see if SMB is latent, I prefer the following stat over the connection count:

isi statistics protocol --nodes=all --protocols=smb1,smb2 --total --interval 5 --repeat 12 --degraded

isi-ess-east-1# isi statistics protocol --nodes=all --protocols=smb1,smb2 --total --interval 5 --repeat 12 --degraded

Ops  In Out TimeAvg TimeStdDev Node Proto Class Op

N/s B/s B/s      us         us                   

  10.1  1.2K   10K  2081.6     4037.0    1  smb1     *  *

706.4  140K  129K  180817.1   2589.5    1  smb2     *  *

   0.4  30.3  33.5  5085.5     5895.1    3  smb1     *  *

812.7   18K  8.2K  151469.2   6842.4    3  smb2     *  *

   0.4  30.4  33.6  1542.0     1074.8    4  smb1     *  *

  71.6   23K   13K  25407.8      714.0   4  smb2     *  *

The above stat tells you whether the clients are using SMB1 or SMB2 and what the overall latency looks like.  The only problem I have with this stat is that it includes Change Notify in the calculation of latency, so it will throw off the time average.  A good rule of thumb is to look for a good number of ops (like 1k) and make sure the standard deviation is not abnormal.  The stat above does suggest that operations to nodes 1 and 3 are showing signs of latency: they have 700-800 ops with a time average of 150 ms - 180 ms.

This leads us to our next logical counter, which breaks down the protocol by operation type.  For that I use this command:

isi statistics protocol --nodes=all --protocols=smb1,smb2 --orderby=Class --interval 5 --repeat 12 --degraded

isi-ess-east-1# isi statistics protocol --nodes=all --protocols=smb1,smb2 --orderby=Class --interval 5 --repeat 12 --degraded

Ops  In Out TimeAvg TimeStdDev Node Proto Class Op

N/s B/s B/s      us         us                   

   0.2  63.8  39.0   1098.0        0.0    3  smb2         create          create

   2.5 604.9 340.0   2353.2     5749.7    4  smb2         create          create

   0.2  15.1  21.1   3009.0        0.0    1  smb2     file_state           close

   0.9  82.7 115.1    132.5       30.4    4  smb2     file_state           close

   1.0 112.1 100.7    253.8       79.8    3  smb2 namespace_read      query_info

   0.9  97.1  72.4     83.0       41.6    4  smb2 namespace_read      query_info

695.2   81K   46M 190364.5      223.9    3  smb2           read            read

   3.1 368.1   91K   1556.9     4474.0    4  smb2           read            read

   0.2  26.5  18.9    201.0        0.0    4  smb2  session_state    tree_connect

   0.2  63.8  39.0   1098.0        0.0    3  smb2         create          create

   0.2  16.6  23.1    400.0        0.0    1  smb2     file_state           close

   1.0 112.1 100.7    253.8       79.8    3  smb2 namespace_read      query_info

695.2   81K   46M 190364.5       223.9    3  smb2           read            read

If we compare the data above to the previous output, we can see that for node 3, out of the 812 ops that were SMB2, 695 were reads, and the average latency was 190 ms.  We are on the path to finding our culprit.

Since we have established that SMB2 is latent and it appears to be impacting reads, the next place to look would be disk:

isi statistics drive --nodes=all --interval 5 --repeat 12 --degraded

isi-ess-east-1# isi statistics drive --nodes=all --interval 5 --repeat 12 --degraded

Drive    Type OpsIn BytesIn SizeIn OpsOut BytesOut SizeOut TimeAvg Slow TimeInQ Queued  Busy  Used Inodes

LNN:bay           N/s     B/s      B    N/s      B/s       B      ms  N/s      ms            %     %      

    1:1     SATA  72.2    2.3M    32K  129.8     2.2M     17K     0.6  0.0    58.8    8.7  93.5 100.0   3.4M

    1:2     SATA  56.8    1.9M    34K  157.4     2.9M     18K     0.4  0.0   208.8   31.0  65.1 100.0   3.0M

    1:3     SATA  86.0    2.4M    28K   88.6     1.6M     18K     0.4  0.0   133.1   22.0  84.3 100.0   3.0M

    1:4     SATA  54.0    2.1M    38K  118.6     2.3M     20K     0.4  0.0    52.3   11.1  72.7 100.0   2.5M

    1:5     SATA  74.0    2.5M    34K  106.6     2.1M     20K     0.4  0.0    52.3    9.3  57.3 100.0   3.3M

    1:6     SATA  66.2    2.5M    38K  100.6     2.0M     20K     0.4  0.0    53.3    8.4  86.1 100.0   3.2M

    1:7     SATA  47.4    1.6M    34K   94.2     1.8M     20K     0.4  0.0    46.4    7.8  49.7 100.0   3.3M

    1:8     SATA  65.4    2.3M    35K  145.8     2.5M     17K     0.4  0.0    37.8    7.5  75.1 100.0   3.4M

    1:9     SATA  51.2    2.1M    40K  119.2     2.1M     18K     0.4  0.0    35.8    6.7  56.3 100.0   2.5M

    1:10    SATA  62.0    2.0M    32K  101.2     2.2M     22K     0.4  0.0    33.8    6.0  56.5 100.0   3.4M

    1:11    SATA 126.6    3.2M    25K   76.2     1.4M     18K     0.3  0.0   201.1   33.5 100.0 100.0   3.0M

    1:12    SATA  66.2    2.0M    31K  117.8     1.9M     16K     0.3  0.0   106.9   21.3  85.1 100.0   3.0M

    3:1     SATA  40.0    1.4M    36K  107.4     1.8M     17K     0.3  0.0    89.2   17.1  37.5 100.0   2.9M

    3:2     SATA  54.2    1.8M    33K  113.4     1.9M     17K     0.3  0.0    68.4   14.7  60.7 100.0   3.0M

    3:3     SATA  56.0    2.1M    38K  112.2     2.0M     17K     0.3  0.0    65.6   14.4  40.7 100.0   3.3M

    3:4     SATA  73.8    2.3M    32K  113.6     2.0M     17K     0.3  0.0   114.3   13.9  54.5 100.0   2.3M

    3:5     SATA  66.8    2.1M    32K  106.8     1.9M     18K     0.3  0.0    74.0   11.2  50.5 100.0   3.5M

    3:6     SATA  78.4    2.7M    34K  138.2     2.2M     16K     0.3  0.0    75.8   11.1  82.1 100.0   3.4M

    3:7     SATA  58.4    2.2M    38K  127.8     2.1M     16K     0.3  0.0    77.1   11.0  54.7 100.0   3.4M

    3:8     SATA  54.6    2.0M    37K   90.4     1.4M     16K     0.3  0.0    75.1   10.7  39.9 100.0   3.0M

    3:9     SATA  56.2    2.0M    36K  139.4     2.5M     18K     0.3  0.0    59.9   10.4  61.5 100.0   3.3M

    3:10    SATA  59.0    1.9M    33K  110.2     1.8M     16K     0.3  0.0    55.2   10.2  49.3 100.0   3.3M

    3:11    SATA  55.0    2.0M    37K  122.2     1.9M     16K     0.3  0.0    59.4    9.3  46.1 100.0   2.5M

    3:12    SATA  51.4    1.8M    35K  102.0     2.1M     20K     0.3  0.0    50.3    9.1  47.7 100.0   3.3M

    4:1     SATA  52.2    1.8M    34K  117.2     2.1M     18K     0.3  0.0    53.5    8.8  51.7 100.0   2.8M

    4:2     SATA  58.8    2.1M    35K  107.2     2.0M     18K     0.3  0.0    47.8    8.7  48.9 100.0   3.3M

    4:3     SATA  64.8    2.3M    35K  120.6     2.2M     18K     0.3  0.0    44.2    8.6  57.1 100.0   3.4M

    4:4     SATA  50.8    1.8M    35K   77.8     1.7M     22K     0.3  0.0    53.8    8.6  38.1 100.0   2.7M

    4:5     SATA  58.4    2.2M    38K  135.6     2.4M     18K     0.3  0.0    51.8    8.4  48.9 100.0   3.4M

    4:6     SATA  65.0    2.4M    37K  108.8     2.1M     19K     0.3  0.0    55.9    8.3  55.3 100.0   3.3M

    4:7     SATA  57.0    2.1M    37K  106.8     2.2M     21K     0.3  0.0    46.9    8.2  49.5 100.0   3.3M

    4:8     SATA  58.8    2.0M    34K  149.0     2.7M     18K     0.3  0.0    46.2    8.2  58.7 100.0   3.3M

    4:9     SATA  53.2    1.8M    33K  124.0     2.3M     19K     0.3  0.0    45.0    8.2  56.3 100.0   3.4M

    4:10    SATA  76.0    2.3M    30K  103.8     1.9M     18K     0.3  0.0    44.8    8.0  60.1 100.0   3.3M

    4:11    SATA  60.2    2.1M    35K  116.0     1.9M     17K     0.3  0.0    42.6    7.9  65.5 100.0   3.1M

    4:12    SATA  59.0    2.1M    35K  100.2     1.8M     18K     0.3  0.0    48.2    7.8  41.3 100.0   2.4M

The above output shows the source of our problem: these poor SATA disks are doing an average of 170 ops (in and out) and are struggling to keep up.  The latent SMB2 reads are a victim of the disks, which in this case are spindle-bound due to other contention.
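If you want to reduce that wall of drive stats to a single ops number per drive, a quick pipe works.  This assumes the column layout shown above (OpsIn is the third field and OpsOut the sixth); adjust if your output differs:

# Hypothetical example: total ops per drive (OpsIn + OpsOut) from isi statistics drive.
isi statistics drive --nodes=all --interval 5 --repeat 12 --degraded | \
  awk '$1 ~ /^[0-9]+:[0-9]+$/ {printf "%-8s %7.1f ops\n", $1, $3 + $6}'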

Outside of tracking down an SMB performance issue due to disk, a couple of other useful counters to look at are:

Is directory enumeration bad?

isi statistics protocol  --nodes=all --protocols=smb1,smb2 --orderby=Out --classes=namespace_read --interval 5 --repeat 12 --degraded

isi-ess-east-1#   isi statistics protocol  --nodes=all --protocols=smb1,smb2 --orderby=Out --classes=namespace_read --interval 5 --repeat 12 --degraded

   Ops    In   Out TimeAvg TimeStdDev Node Proto          Class               Op

   N/s   B/s   B/s      us         us                                          

  13.0  1.5K   44K   510.1     1331.5    1  smb2 namespace_read  query_directory

227.9   25K   35K   226.6      791.0    3  smb2 namespace_read  query_directory

  60.1  6.9K   31K   400.1     3127.8    4  smb2 namespace_read  query_directory

   5.2 720.3  5.6K   822.5      293.8    1  smb1 namespace_read trans2:findfirst

   2.2 305.8  4.7K  6452.5    19478.9    3  smb1 namespace_read trans2:findfirst

  20.5  2.3K  4.3K  1158.9     6969.6    1  smb2 namespace_read  query_directory

   0.2  29.4  3.0K  1293.0        0.0    3  smb1 namespace_read  trans2:findnext

When looking at this stat, focus on the findfirst/findnext (SMB1) and query_directory (SMB2) rows.  You are looking for a small number of ops that produce a large amount of Out B/s.  The time average might look normal, but what we are really looking for is a client doing a wildcard (*) enumeration of a very large directory.  The request itself will be 1 op but will result in a tremendous number of bytes that need to be returned.
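If this counter points you at a suspect share, a quick sanity check from the cluster shell is simply to count the entries in the directory the client is enumerating.  The path here is a placeholder:

# Hypothetical example: count entries in a suspect directory (-f skips sorting so very large
# directories come back quickly).
ls -f /ifs/data/suspect_dir | wc -l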

Is one of the authentication providers causing a delay?

isi statistics protocol --nodes=all --protocols=lsass_in,lsass_out --total --interval 5 --repeat 12 --degraded

isi-ess-east-1# isi statistics protocol --nodes=all --protocols=lsass_in,lsass_out --total --interval 5 --repeat 12 --degraded

Ops  In Out TimeAvg TimeStdDev Node    Proto         Class                         Op

N/s B/s B/s      us         us                                                      

0.4 0.0 0.0  8977.0     5208.5    4 lsass_in session_state lsa:id:ioctl:pac_to_ntoken

0.4 0.0 0.0   383.0        2.8    4 lsass_in session_state      ntlm:accept_sec_ctxt1

0.4 0.0 0.0 10256.5       27.6    4 lsass_in session_state      ntlm:accept_sec_ctxt2

0.7 0.0 0.0   576.2      390.3    4 lsass_in session_state         ntlm:acquire_creds

0.4 0.0 0.0   136.0      144.2    4 lsass_in session_state       ntlm:delete_sec_ctxt

0.7 0.0 0.0    37.0        6.8    4 lsass_in session_state            ntlm:free_creds

1.8 0.0 0.0    48.2       24.6    4 lsass_in session_state            ntlm:query_ctxt

lsa:id:ioctl:pac_to_ntoken - Represents how long it takes a DC to complete a Sid2Name lookup

ntlm:accept_sec_ctxt2 - Represents how long it took a DC to complete NTLM authentication

If either of the above shows ops where the time average is high, it's time to start looking at the DC as the cause of the delay.
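A minimal way to watch just those two operations over time is to filter the output of the same command; a sketch:

# Hypothetical example: show only the DC-dependent operations from the lsass counters above.
isi statistics protocol --nodes=all --protocols=lsass_in,lsass_out --total --interval 5 --repeat 12 --degraded | egrep 'pac_to_ntoken|accept_sec_ctxt2'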

Good luck with your performance troubleshooting, and remember: it's very rare that the protocol itself (i.e., SMB) is the one causing the latency.

467 Posts

July 22nd, 2013 10:00

This is great information!  Is there a way to monitor the statistics so we can trend and alert on them if they get to an unacceptable limit?

isilon# isi statistics protocol --nodes=5 --protocols=smb1,smb2 --orderby=Class --interval 5 --repeat 12 --degraded

.....

2.0 196.6 134.5 32535068.0 74775888.0    5  smb2      file_state    change_notify

That looks awfully high to me, so in theory I'd like to know about that via some alerting system... if it's a problem, at least...

17 Posts

July 22nd, 2013 10:00

This is great information!  Is there a way to monitor the statistics so we can trend and alert on them if they get to an unacceptable limit?

You can use InsightIQ for this:

https://support.emc.com/docu47071_InsightIQ-Installation-and-Setup-Guide-Version-2.5.1.pdf?language=en_US

2.0 196.6 134.5 32535068.0 74775888.0    5  smb2      file_state    change_notify

Ah yes, change notify.  When a client opens an Explorer window against a share, the client sets a change notification request so that it can refresh when something has changed.  It may take 1 second for some other client to trigger a change, or it may take 2 days if nothing changes in the directory while Explorer is open.  Thus, from a latency perspective this counter is useless and will always look abnormal, and because it is included in our overall latency counter, it can skew results.
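One crude way to keep change_notify from skewing what you are reading is simply to drop those rows before they reach your eyes; a sketch against the per-class command from earlier:

# Hypothetical example: same per-op breakdown as before, minus the change_notify rows.
isi statistics protocol --nodes=all --protocols=smb1,smb2 --orderby=Class --interval 5 --repeat 12 --degraded | grep -v change_notify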

467 Posts

July 22nd, 2013 10:00

Peter Abromitis wrote:

This is great information!  Is there a way to monitor the statistics so we can trend and alert on them if they get to an unacceptable limit?

You can use InsightIQ for this:

https://support.emc.com/docu47071_InsightIQ-Installation-and-Setup-Guide-Version-2.5.1.pdf?language=en_US

I didn't think InsightIQ could do alerting? I know it's pretty good at reporting on the performance of the cluster, but I'd like to know about a problem before the client(s) call me saying "Isilon is slow!"

17 Posts

July 22nd, 2013 12:00

Right, InsightIQ can be used for monitoring but not alerting; sorry about the confusion.

15 Posts

July 22nd, 2013 22:00

We have LDAP as the authentication source on the back end and a PDC running on a Linux box (Samba).  Can we configure the PDC as an authentication source in Isilon to access SMB shares from OS X Mavericks?

17 Posts

July 23rd, 2013 07:00

Hello VP1, here is what you need to set up this configuration:

https://support.emc.com/kb/89368

467 Posts

July 23rd, 2013 12:00

One issue I've run into with isi_netlogger (this may be caused by tcpdump, I haven't investigated) is that it dies after very long captures... (hours)

We have an issue which happens very sporadically, and by the time we identify it, the problem is gone and not reproducible... We were trying to run a tcpdump (via isi_netlogger) for a very long duration (overnight), and when we came in the next morning the isi_netlogger command had errored and no archive was created...

What are your thoughts on the best way to packet capture an event when you don't know when it will happen?

17 Posts

July 23rd, 2013 13:00

Yeah, that would make sense as to why you are having problems with isi_netlogger.  The beauty of OneFS running on FreeBSD is that you can script just about anything.  If your issue causes a message to be logged, you can write a script that:

1.) Starts a trace

2.) Checks for a failure to be logged

3.) If after 5 minutes no failure has been seen, stops the trace and starts the process over again

I can send you an example script if you would like.
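For reference, here is a minimal sketch of that kind of loop, not a supported tool.  The interface, capture filter, output path, log file, and error pattern are all placeholders you would adapt to your own failure signature:

#!/bin/sh
# Hypothetical example: capture in 5-minute windows and keep only the window
# in which a new failure message was logged.
LOG=/var/log/lwiod.log                      # placeholder log to watch
while true; do
    pcap="/ifs/data/Isilon_Support/smb_$(date +%Y%m%d_%H%M%S).pcap"
    before=$(wc -l < "$LOG")
    tcpdump -i em0 -s 0 -w "$pcap" 'host 10.1.1.50 and port 445' &
    pid=$!
    sleep 300                               # 5-minute capture window
    kill "$pid"
    # Only keep the trace if something new was logged during the window.
    if tail -n "+$((before + 1))" "$LOG" | grep -q 'ERROR_PATTERN'; then
        echo "Failure seen, keeping $pcap"
        break
    fi
    rm -f "$pcap"
done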

467 Posts

July 23rd, 2013 20:00

No errors are logged on Isilon... It is specific to an application, which we monitor and alert on... but by the time the error is generated inside the application it's too late.  The app writes a temporary file to an SMB share on Isilon.  Then it tries to access the temporary file for some processing and gets a generic "file not found" error inside the application.  I'm trying to capture what it's trying to access when the "file not found" is generated, but no luck..

I guess I'll set up a collection every 20 minutes, then kill it, and repeat... and hope!

4 Posts

July 25th, 2013 11:00

We have been having an issue with SMB connections going stale. Currently we have under 300 active connections, but over 5100 total connections. Is there a way to drop the inactive connections without affecting the active connections? What would be causing this in the first place? We are on 7.0.1.5


20.4K Posts

July 25th, 2013 13:00

Is there any way to improve the OS X user experience when using Isilon via CIFS? Browsing shares is much slower than on Windows. I did read the paper "docu45329_Using-Mac-OS-X-Clients-with-Isilon-OneFS-6.5", but we don't have SSD nodes, and changing the view in Finder did not do anything.

Thanks
