PowerScale: OneFS: Clients are unable to authenticate or connect to nodes when the AD domain shows "Offline"

Summary: An offline Active Directory Provider impacts any client that uses it for authentication, regardless of protocol. Troubleshooting steps to resolve authentication to cluster nodes when AD shows offline are below. ...


Symptoms



Clients are unable to authenticate to some or all nodes in the cluster, causing intermittent or total Data Unavailability (DU). Access over any protocol is impacted if the user relies on an offline AD provider, though SMB is the most commonly affected. Many circumstances can lead to this scenario, either cluster-wide or on specific nodes. SMB clients cannot connect to nodes in the cluster when the Local Security Authority Subsystem Service (LSASS) loses its connection with the domain controller.

When the LSASS loses its connection with the domain controller, errors similar to the following appear in the /var/log/lsassd.log file:

2012-06-11T12:58:42-07:00 <30.6> cluster1-13(id13) lsassd[66251]: 0x28f016a0:Domain 'domain.com' is now online
2012-06-11T13:03:57-07:00 <30.6> cluster1-13(id13) lsassd[66251]: 0x28f016a0:Domain 'domain.com' is now offline
2012-06-12T21:05:03-07:00 <30.6> cluster1-13(id13) lsassd[66251]: 0x28f016a0:Domain 'domain.com' is now online
2012-06-13T16:35:03-07:00 <30.6> cluster1-13(id13) lsassd[66251]: 0x28f016a0:Domain 'domain.com' is now offline
2012-06-13T16:40:03-07:00 <30.6> cluster1-13(id13) lsassd[66251]: 0x28f016a0:Domain 'domain.com' is now online
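
To gauge how often the domain is flapping, the transitions in the log can be tallied. The following is a minimal sketch rather than a supported tool: lsass_transitions is a hypothetical helper name, and the parsing assumes that every matching line ends with "Domain '<name>' is now online" or "is now offline", as in the sample above.

```shell
# Tally lsassd domain state transitions read from standard input.
# Prints one line per domain: <domain> offline=<n> online=<n>
# Assumption: the last four fields are: '<domain>' is now <state>
lsass_transitions() {
    awk -v q="'" '
        / is now (online|offline)$/ {
            domain = $(NF - 3)      # quoted domain name
            gsub(q, "", domain)     # strip the surrounding single quotes
            count[domain, $NF]++    # $NF is "online" or "offline"
            seen[domain] = 1
        }
        END {
            for (d in seen)
                printf "%s offline=%d online=%d\n", d, count[d, "offline"], count[d, "online"]
        }
    '
}
```

On a node, this could be run as: lsass_transitions < /var/log/lsassd.log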

 

Alternatively, isi auth status may show the Active Directory provider as offline, with output similar to the following:

testcluster-1# isi auth status
ID                                           Active Server      Status
-----------------------------------------------------------------------
lsa-activedirectory-provider:TESTDOMAIN.COM dc1.testdomain.com offline
lsa-local-provider:System                    -                  active
lsa-file-provider:System                     -                  active
-----------------------------------------------------------------------
Total: 3
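
When checking many nodes or providers, the offline entries can be filtered out of this output. This is a hedged sketch: offline_ad_providers is a hypothetical helper, and it assumes the column layout shown above, with the provider ID in the first column and the status in the last.

```shell
# Print the ID of every Active Directory provider whose Status column is "offline".
# Reads `isi auth status` output from standard input.
# Assumption: ID is the first whitespace-separated field and Status is the last.
offline_ad_providers() {
    awk '$1 ~ /^lsa-activedirectory-provider:/ && $NF == "offline" { print $1 }'
}
```

Possible usage on a node: isi auth status | offline_ad_providers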
 
 
 
Below are some common situations that lead to an offline AD provider, with steps to identify and resolve each.
 

Different cluster added to the same AD domain using the same machine account as the existing cluster:
If a new cluster was recently joined to the same domain, verify that the new cluster is not using the same machine account as the original cluster. Run the following command on the new cluster (the one that shows the domain as online) to verify the hostname and machine account:
 
# isi auth ads view <domain>

(Relevant output)
Hostname: isitest.example.lab.com
<snipped>
Machine Account: ISITEST$

 
If the new cluster is using the machine account of the original cluster (both use ISITEST and have the same Hostname), leave the domain from the new cluster and rejoin the domain from the old cluster. Then rejoin the new cluster, ensuring that the same machine account name is not specified. If there are issues, contact Support.
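
The comparison between the two clusters can be sketched as follows. ads_field is a hypothetical helper, and the file names in the usage note are placeholders. It assumes the "Field: value" layout shown in the relevant output above, and uppercases the value so that the comparison is not case-sensitive.

```shell
# Print the value of one field from `isi auth ads view <domain>` output on stdin.
# $1 = field name, for example "Machine Account" or "Hostname".
# Assumption: each line has the form "<field>: <value>", possibly indented.
ads_field() {
    awk -v name="$1" -F': *' '
        { key = $1; sub(/^[[:space:]]+/, "", key) }   # drop leading indentation
        key == name { print toupper($2); exit }
    '
}
```

For example, after saving the command output from each cluster to old.txt and new.txt, compare "$(ads_field 'Machine Account' < old.txt)" with "$(ads_field 'Machine Account' < new.txt)". Identical values suggest that both clusters share one machine account.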
 

Machine account password/account deleted from AD:
If the cluster was joined to Active Directory but isi auth status no longer lists an lsa-activedirectory-provider entry, check whether the machine account was deleted on the Active Directory side. Rejoining the cluster to the domain creates a new machine account in Active Directory and restores authentication.
 

DNS is refusing the SRV lookup:
If this is happening (status: REFUSED in the dig output), verify that DNS is configured to accept queries from the relevant node IPs.
dig @<DNS IP> SRV _ldap._tcp.dc._msdcs.domain.com

; <<>> DiG 9.10.0-P2 <<>> @<DNS IP> SRV _ldap._tcp.dc._msdcs.<domain>
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: REFUSED, id: 52396
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;_ldap._tcp.dc._msdcs.<domain>. IN SRV

;; Query time: 0 msec
;; SERVER: <IP>#53(<IP>)
;; WHEN: <date>
;; MSG SIZE rcvd: 59
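
When the same query is run against several DNS servers, it can help to pull out only the response status. The sketch below is illustrative: dig_status and explain_dig_status are hypothetical helper names, and the parsing assumes the ";; ->>HEADER<<-" line format shown above. The status names are standard DNS response codes.

```shell
# Extract the response status (for example NOERROR, REFUSED, NXDOMAIN)
# from full `dig` output read on standard input.
dig_status() {
    sed -n 's/.*->>HEADER<<-.* status: \([A-Z]*\),.*/\1/p'
}

# Map a status to a short hint. The wording of the hints is illustrative only.
explain_dig_status() {
    case "$1" in
        NOERROR)  echo "server answered the query" ;;
        REFUSED)  echo "server refused the query; check that it accepts queries from the node IPs" ;;
        NXDOMAIN) echo "name does not exist in DNS" ;;
        *)        echo "unexpected status: $1" ;;
    esac
}
```

Possible usage: explain_dig_status "$(dig @<DNS IP> SRV _ldap._tcp.dc._msdcs.<domain> | dig_status)"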
 

 

Cause

The Local Security Authority Subsystem Service (LSASS) can lose its connection to the domain controller because of hardware issues, connectivity issues, or DNS cache poisoning. Alternatively, this can happen because a node was added to the cluster without network connectivity. If the domain controller is unavailable for any reason when a node is rebooted or LSASS is restarted, some of the domain information may remain unpopulated. Usually, the Domain GUID for the primary domain remains unpopulated. As a result, authentication cannot complete for users connecting to that node.

Resolution

Workaround 1 - Investigate hardware connectivity issues:

  • Look for hardware connectivity issues in the /var/log/messages file. Messages in the log file indicate if the node network port is in an "Up" state or not. Look at the /var/log/messages file on other nodes in the cluster to determine if the network connectivity issue occurs cluster-wide.
  • Look at the system and application event logs in the domain controller. These log files might include errors about driver or hardware issues that relate to the loss of network connectivity.


Workaround 2 - Investigate DNS cache poisoning:

If the network connectivity issue is not hardware-related, inspect the service (SRV) records in the _msdcs DNS zone for the Active Directory domains. A packet trace of the DNS request from the cluster to the DNS server shows incorrect or missing information. If incorrect information is registered in the SRV records, or if the domain controllers do not have all the records in the _msdcs DNS zone, the nodes in the cluster report that the domain is offline when they attempt to contact the domain controller. Updating the SRV records with current information or changing to a different DNS server should resolve DNS cache poisoning.

Example 1

This packet trace shows the SRV records that the DNS server returned to a node in the cluster. The response lists dc2.domain.com, but the follow-up A record query for dc2.domain.com returns "No such name", so the node cannot resolve that domain controller.

No. Time             Source   Destination  Protocol  Length  Info
5   16:40:19.061003  1.1.1.1  1.1.1.2      DNS       110     Standard query SRV _ldap._tcp.dc._msdcs.domain.com
6   16:40:19.062626  1.1.1.2  1.1.1.1      DNS       1270    Standard query response SRV 0 100 389 dc1.domain.com SRV 0 100 389 dc2.domain.com SRV 0 100 389 SRV 0 100 389 dc3.domain.com
7   16:40:19.063146  1.1.1.1  1.1.1.2      DNS       87      Standard query A dc2.domain.com
8   16:40:20.797403  1.1.1.2  1.1.1.1      DNS       146     Standard query response, No such name

Example 2

In this packet trace, a node looks up the SRV record for _ldap._tcp.dc._msdcs.domain.com, but no information is returned to the client.

No. Time             Source   Destination  Protocol  Length  Info
15  16:40:21.458636  1.1.1.1  1.1.1.2      DNS       100     Standard query SRV _ldap._tcp.dc._msdcs.domain.com
16  16:40:21.783630  1.1.1.2  1.1.1.1      DNS       100     Standard query response, No such name


In this situation, please work with your networking and Active Directory team to ensure that the DNS SRV records are accurate and resolve to the domain controller.
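
The situation in Example 1, where an SRV record points at a host that does not resolve, can also be checked without a packet trace. This is a minimal sketch under stated assumptions: srv_targets and check_srv_targets are hypothetical helpers, and the parsing assumes the four-field "priority weight port target" lines that dig +short prints for SRV records. A domain controller that has only an AAAA record would be flagged by this check.

```shell
# Print the target host of each SRV record from `dig +short SRV ...` output on stdin.
# Assumption: each record line is "<priority> <weight> <port> <target>."
srv_targets() {
    awk 'NF == 4 { sub(/\.$/, "", $4); print $4 }'
}

# Report SRV targets that have no A record on the given DNS server.
# $1 = DNS server IP, $2 = Active Directory domain name
check_srv_targets() {
    dig @"$1" +short SRV "_ldap._tcp.dc._msdcs.$2" | srv_targets |
    while read -r host; do
        if [ -z "$(dig @"$1" +short A "$host")" ]; then
            echo "no A record: $host"
        fi
    done
}
```

Possible usage: check_srv_targets <DNS IP> domain.com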

Workaround 3 - Refresh LSASS:

  1. Open an SSH connection to the node and log in using the "root" account.
  2.  Verify that the authentication daemon is not connected to AD, where <nodeID> is the node number of the recently added node:

    isi_for_array -n <nodeID> 'isi auth ads list'

    If the node is joined to the domain, output similar to the following appears:
     
    cluster-1: Name            Authentication Status DC Name Site
    cluster-1: --------------------------------------------------------------------
    cluster-1: LAB.EXAMPLE.COM Yes            online -       Default-First-Site-Name
    cluster-1: --------------------------------------------------------------------
    cluster-1: Total: 1

     
  3.  Verify that the Domain GUID is not populated. If LSASS does not retrieve its configuration correctly, it does not populate a Domain GUID value:

    isi_for_array -n <nodeID> /usr/likewise/bin/lw-lsa get-status | egrep -A 12 "Domain:" | egrep "Domain (SID|GUID)"

    Output similar to the following appears:
     
    cluster-1:   Domain SID: S-1-5-21-584721463-3180705917-972194821
    cluster-1:   Domain SID: S-1-5-21-584721463-3180705917-972194821
    cluster-1:   Domain GUID:

     
  4. Refresh LSASS on the newly added node:

    isi_for_array -n <node_range> /usr/likewise/bin/lwsm refresh lsass
     
  5. Verify that the new node is reporting that it is connected to an AD provider.

    isi_for_array -n <node_range> 'isi auth ads list -v'
     
  6.  Ensure that the GUID value appears:

    isi_for_array -n <node_range> /usr/likewise/bin/lw-lsa get-status | egrep -A 12 "Domain:" | egrep "Domain (SID|GUID)"

    Output similar to the following appears:
     
    cluster-1:   Domain SID: S-1-5-21-584721463-3180705917-972194821
    cluster-1:   Domain SID: S-1-5-21-584721463-3180705917-972194821
    cluster-1:   Domain GUID: 61b2a8c6-af25-1941-8d57-59073b7ceb19

     
  7. On the Windows client, verify that the user can authenticate to the cluster by mapping a drive, specifying the IP address of the newly added node.
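
The Domain GUID checks in steps 3 and 6 can be reduced to a list of nodes that still need attention. The following is a sketch only: nodes_missing_guid is a hypothetical helper, and it assumes the "<node>:   Domain GUID: <value>" line format shown in the sample output, where an unpopulated GUID leaves the value empty.

```shell
# Print the name of every node whose "Domain GUID:" line has no value.
# Reads the filtered lw-lsa get-status output (as in steps 3 and 6) from stdin.
# Assumption: fields are "<node>:", "Domain", "GUID:", then the GUID if populated.
nodes_missing_guid() {
    awk '$2 == "Domain" && $3 == "GUID:" && NF == 3 { sub(/:$/, "", $1); print $1 }'
}
```

To use it, pipe the output of the command from step 3 or step 6 into nodes_missing_guid. Empty output means that every listed node reported a GUID.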

Workaround 4 - Restart LSASS:

1. Open an SSH connection to the recently added node and log in using the "root" account.
2. List available domain controllers, where <domain_name> is the fully qualified domain name (FQDN) of the domain that the cluster is joined to:

isi auth ads trusts controllers list --provider=<domain_name> -v

3. Forcibly connect to a domain controller, where <domain_name> is the FQDN of the domain that the cluster is joined to, and <dc_name> is the FQDN of the domain controller:

isi auth ads modify <domain_name> --domain-controller <dc_name> -v

4. Refresh AD status:

isi_classic auth ads status --refresh --all

The Status should change to online as shown below:
 

Active Directory Services Status:
Mode:                unprovisioned
Status:              online
Primary Domain:      LAB.EXAMPLE.COM
NetBios Domain:      LAB
Domain Controller:   dc1.lab.example.com
Hostname:            cluster.lab.example.com
Machine Account:     CLUSTER$

 

5. If the status still shows as offline, restart the authentication daemon. Note that this interrupts authentication to the node for up to a minute:

Single node:
pkill -f 'lw-container lsass'
Multiple nodes (nodes 1-3 here as an example):
isi_for_array -n1-3 'pkill -f "lw-container\ lsass"'


6. Repeat Step 4.
7. On the Windows client, verify that the user can authenticate to the cluster by mapping a drive, specifying the IP address of the node on which LSASS was restarted.
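
Because the provider can take a moment to come back after a restart, the status check in step 4 may need to be repeated. The sketch below polls a status command until it reports online. wait_for_online is a hypothetical helper, the retry count and delay are arbitrary, and the parsing assumes the "Status: online" line format shown in the step 4 output.

```shell
# Re-run a status command until its "Status:" line reads "online".
# $1 = maximum attempts, $2 = seconds between attempts, remaining args = command.
# Returns 0 once online is seen, 1 if every attempt fails.
wait_for_online() {
    tries=$1
    delay=$2
    shift 2
    while [ "$tries" -gt 0 ]; do
        # Only an exact "Status" key matches, so the heading line is ignored.
        status=$("$@" 2>/dev/null | awk -F': *' '$1 == "Status" { print $2; exit }')
        if [ "$status" = "online" ]; then
            return 0
        fi
        tries=$((tries - 1))
        if [ "$tries" -gt 0 ]; then
            sleep "$delay"
        fi
    done
    return 1
}
```

Possible usage: wait_for_online 6 10 isi_classic auth ads status --refresh --all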

Additional Information

Troubleshooting note for domain connectivity issues:
If you leave and rejoin the domain, confirm that the Active Directory provider shows up in the authentication providers list for the relevant zone after you rejoin.

In this example, the Active Directory provider for the example.com domain no longer appears under Auth Providers, so it must be re-added:

isi zone zones list -v

                       Name: accesszonedev1
                       Path: /ifs/accesszone1
                   Groupnet: groupnet0
              Map Untrusted: -
             Auth Providers:  lsa-ldap-provider:Primary, lsa-file-provider:System, lsa-local-provider:accesszone1 **No Active Directory Provider** <<<<<<<<<<<<<
               NetBIOS Name: -
         User Mapping Rules:
       Home Directory Umask: 0077
         Skeleton Directory: /usr/share/skel
         Cache Entry Expiry: 4H
Negative Cache Entry Expiry: 1m
                    Zone ID: 2


The WebUI may be the easiest way to add the provider back and to set the order in which providers are searched. See the applicable administration guide for your version of OneFS, available from the PowerScale OneFS Info Hubs.

Affected Products

Isilon, PowerScale OneFS

Article Properties
Article Number: 000055836
Article Type: Solution
Last Modified: 16 Oct 2024
Version:  4