{ISILON} OneFS: Intermittent slow SMB authentication or share enumeration performance; isi_cbind_d DNS delays

Summary: Intermittent delays or timeouts during SMB authentication and/or share enumeration


Symptoms

Users may experience intermittent latency or timeouts when accessing shares on an Isilon cluster. Access recovers on its own, without intervention, within seconds or minutes.

Authentication and/or share enumeration may take a multiple of 5 seconds (for example, 25 seconds) to complete, and the delays recur at regular intervals of roughly 15 minutes per node. One node may experience the issue while others do not. When SmartConnect round-robin is used, the issue may appear more frequently, because each node experiences it independently of the others.
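To check whether your own observations fit this pattern, the shell function below is a minimal sketch, not a Dell tool. It assumes a hypothetical input format: one line per observed slow logon on a single node, oldest first, as `EPOCH_SECONDS DELAY_SECONDS`. Its tolerances are also assumptions: a delay counts as a multiple of 5 seconds when it is within 1 second of 5, 10, 15, ... seconds, and events must be at least 840 seconds (about 14 minutes) apart. Only a lower bound on the gap is checked, because the connection expires when it is next used, so on a quiet node the gap can exceed 15 minutes.

```sh
# matches_signature: read "EPOCH_SECONDS DELAY_SECONDS" lines for ONE node
# (oldest first) and print "match" if every delay is close to a multiple of
# 5 seconds and events are at least ~15 minutes apart, else "no match".
# ASSUMPTION: input format and both tolerances are hypothetical choices.
matches_signature() {
  awk '
    {
      d = $2
      m = int(d / 5 + 0.5)                  # nearest multiple of 5 seconds
      r = d - 5 * m; if (r < 0) r = -r
      if (m < 1 || r > 1) bad++             # not within 1 s of 5, 10, 15, ... seconds
      if (NR > 1 && $1 - prev < 840) bad++  # recurred sooner than ~15 minutes
      prev = $1
    }
    END {
      result = (NR > 0 && bad == 0) ? "match" : "no match"
      print result
    }'
}
```

For example, `printf '1000 25\n1900 10.4\n' | matches_signature` prints `match`.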

NOTE: This KB covers one possible cause of slow SMB authentication and share enumeration; other causes may exist. The resolution recommended here is a means of ruling out one probable cause.

The following warning may appear in the isi_cbind_d logs around the time of the issue:
isi_cbind_d[76119]: [0x800703400]bind: CBIND_send_query(1161) Warning: Stallset dns has no available stalls

These warnings can be found by running:
# isi_for_array 'zegrep Stallset /var/log/isi_cbind_d.log*'
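On a large cluster, a per-node count of these warnings makes it easier to see which nodes are affected. The function below is a sketch that assumes isi_for_array prefixes each output line with the node name followed by `: `; adjust the field separator if your output differs.

```sh
# count_stallsets: read the isi_for_array/zegrep output from stdin and print
# "<node> <number of Stallset warnings>", one line per node, sorted by node.
# ASSUMPTION: each line starts with "<node-name>: " (isi_for_array prefix).
count_stallsets() {
  awk -F': ' '/Stallset/ { count[$1]++ }
      END { for (node in count) print node, count[node] }' | sort
}
```

Usage: `isi_for_array 'zegrep Stallset /var/log/isi_cbind_d.log*' | count_stallsets`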

Cause

By default, OneFS proactively expires its AD LDAP domain controller (DC) connection every 15 minutes. The expiration takes effect when the connection is next used (for example, during authentication), which starts the following process:
  1. Lock the AD DC connection mutex (this blocks all requests until a new DC is selected).
  2. Expire the existing connection.
  3. Enumerate the list of DCs to connect to (DNS SRV record: _ldap._tcp.dc._msdcs.domain.com).
  4. Resolve the DC host names in that list to IP addresses (DNS A record lookups).
  5. Send a CLDAP ping to all DCs and wait for the fastest responders (sending and receiving stop 10 ms after the first response).
  6. Select a DC from those that responded (semi-randomly, using historical DC statistics).
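Steps 3 and 4 can be reproduced by hand to see how many A record lookups a DC reselection requires. The sketch below assumes the `dig` utility is available on the node (the `DIG` variable lets you substitute another command, which is also how it is tested); replace example.com with your AD domain. It prints one `host address` line per resolved DC address.

```sh
# list_dc_addresses DOMAIN
# Step 3: query the SRV record _ldap._tcp.dc._msdcs.DOMAIN for DC host names.
# Step 4: resolve each DC host name to its A record(s).
# ASSUMPTION: dig is installed; set DIG to use a different command.
list_dc_addresses() {
  dig_cmd=${DIG:-dig}
  # dig +short prints each SRV answer as "priority weight port target."
  for host in $($dig_cmd +short SRV "_ldap._tcp.dc._msdcs.$1" | awk '{ print $4 }'); do
    # One A record lookup per DC host name.
    for addr in $($dig_cmd +short A "$host"); do
      printf '%s %s\n' "$host" "$addr"
    done
  done
}
```

The number of distinct host names printed is the number of A record lookups performed in step 4.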
Delays may be introduced during the DNS A record lookup phase (step 4 above). By default, on an Isilon cluster running OneFS 8.x, the groupnet DNS cache (isi_cbind_d) is enabled. When isi_cbind_d, the DNS cache daemon, cannot service a DNS lookup (it neither answers the request nor forwards it to the external DNS server), the kernel DNS resolver fails over to the next DNS server listed in /etc/resolv.conf (more specifically, the DNS servers configured for the groupnet in question). This failover incurs a 5-second timeout for each A record query.
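To find which configured DNS server is failing to answer, you can time one lookup against each server directly. The function below is a diagnostic sketch, not part of OneFS. It assumes the relevant servers are listed as `nameserver` lines in the file you pass (for example /etc/resolv.conf) and that `dig` is available (override it with `DIG`). It uses `+tries=1 +time=5` so each probe waits about the 5-second timeout described above; elapsed time is measured in whole seconds.

```sh
# probe_nameservers RESOLV_CONF NAME
# For each "nameserver" line in RESOLV_CONF, send one A lookup for NAME and
# print "<server> ok|FAILED <elapsed>s".
# ASSUMPTION: dig is installed; set DIG to use a different command.
probe_nameservers() {
  dig_cmd=${DIG:-dig}
  for ns in $(awk '$1 == "nameserver" { print $2 }' "$1"); do
    start=$(date +%s)
    # One attempt, 5-second wait: a silent server shows up as FAILED after ~5s.
    if $dig_cmd "@$ns" +short +tries=1 +time=5 A "$2" >/dev/null 2>&1; then
      status=ok
    else
      status=FAILED
    fi
    printf '%s %s %ss\n' "$ns" "$status" "$(( $(date +%s) - start ))"
  done
}
```

Usage: `probe_nameservers /etc/resolv.conf dc1.example.com`, where dc1.example.com stands in for one of your DC host names.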

The total delay attributed to DNS lookups is therefore 5 seconds multiplied by the number of A records that must be resolved. If more than 12 records must be resolved (more than 60 seconds of delay), the client hits its 60-second timeout and resets the connection, and the client reports errors indicating that it cannot reach the cluster or share.
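The arithmetic can be expressed as a small shell function (the name `dns_stall` is hypothetical). It assumes every A record lookup hits the full 5-second timeout and compares the result with the 60-second client timeout.

```sh
# dns_stall N: added delay when N A record lookups each take 5 seconds,
# flagged when it exceeds the 60-second client timeout.
dns_stall() {
  delay=$(( 5 * $1 ))
  if [ "$delay" -gt 60 ]; then
    echo "${delay}s (exceeds 60s client timeout)"
  else
    echo "${delay}s"
  fi
}
```

For example, `dns_stall 5` prints `25s`, `dns_stall 12` prints `60s`, and `dns_stall 13` prints `65s (exceeds 60s client timeout)`.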

NOTE: In this situation, the most common culprit is isi_cbind_d DNS lookups, but other factors can also cause these delays.

Resolution

If all of the following apply:
  1. The issue resolves itself with no intervention from administrators.
  2. Delays occur in 5-second increments.
  3. The issue recurs roughly every 15 minutes on each node.
then the quickest path to resolution and validation is to disable DNS caching for the groupnet in use:
# isi network groupnets modify <groupnet> --dns-cache-enabled=false
 

Alternatively, if you prefer to keep the DNS cache enabled, you can mitigate the issue by restarting the DNS caching service on all nodes:
# isi_for_array 'killall -9 isi_cbind_d'

Then verify that it has restarted on all nodes by checking the process start time:
# isi_for_array 'ps auxwp `pgrep isi_cbind_d`'
NOTE: The restart may need to be repeated if the issue recurs. In that case, either restart the service again manually or disable the cache until the cluster can be upgraded to a release that contains the fix (details below).

If the above does not resolve your issue, other factors may be involved that require assistance from Dell EMC Isilon Support.

OneFS versions 8.0.0.6, 8.0.1.3, 8.1.0.2, and 8.1.1.1 contain a fix (ID 205142) for an isi_cbind_d defect that is a probable cause of the DNS failures described in this KB. Isilon recommends upgrading to the release that contains the fix for your OneFS code line; after the upgrade, the DNS cache can be re-enabled.
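To check a cluster's version against the releases listed above, the sketch below compares a dotted version string with the fixed release for its code line. Pass the version in yourself (for example, as reported by `isi version`; that command's exact output format is not assumed here). The function name is hypothetical, and code lines not named in this article print `unknown` rather than a guess.

```sh
# has_fix_205142 VERSION: compare a version such as 8.1.0.3 with the fixed
# release listed in this article for the same code line (first three fields).
has_fix_205142() {
  printf '%s\n' "$1" | awk -F. '
    {
      line = $1 "." $2 "." $3; patch = $4 + 0
      if      (line == "8.0.0") need = 6
      else if (line == "8.0.1") need = 3
      else if (line == "8.1.0") need = 2
      else if (line == "8.1.1") need = 1
      else { print "unknown"; exit }   # code line not covered by this article
      result = (patch >= need) ? "fixed" : "not fixed"
      print result
    }'
}
```

For example, `has_fix_205142 8.0.0.5` prints `not fixed` and `has_fix_205142 8.1.0.2` prints `fixed`.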

If the issue persists after upgrading to one of the versions noted above, or after DNS caching has been disabled on the groupnet(s), additional details and data collection may be necessary to determine the exact cause.

If there are any questions regarding this issue and related paths to resolution, or if assistance is required, contact Isilon Support.

Affected Products

PowerScale OneFS
Article Properties
Article Number: 000170774
Article Type: Solution
Last Modified: 08 Jul 2025
Version:  4