PowerScale: OneFS: Clients are unable to authenticate or connect to nodes when the AD domain shows "Offline"
Summary: An offline Active Directory provider impacts any client that uses it for authentication, regardless of protocol. The troubleshooting steps below help restore authentication to cluster nodes when AD shows offline.
Symptoms
Clients are unable to authenticate to some or all nodes in the cluster, causing intermittent or total Data Unavailability (DU). Access over any protocol is impacted if the user relies on an offline AD provider, though SMB is the protocol most commonly affected. Many circumstances can lead to this scenario, either cluster-wide or on specific nodes. SMB clients cannot connect to nodes in the cluster when the Local Security Authority Subsystem Service (LSASS) loses its connection with the domain controller.
When LSASS loses its connection with the domain controller, messages similar to the following appear in the /var/log/lsassd.log file:
2012-06-11T12:58:42-07:00 <30.6> cluster1-13(id13) lsassd[66251]: 0x28f016a0:Domain 'domain.com' is now online
2012-06-11T13:03:57-07:00 <30.6> cluster1-13(id13) lsassd[66251]: 0x28f016a0:Domain 'domain.com' is now offline
2012-06-12T21:05:03-07:00 <30.6> cluster1-13(id13) lsassd[66251]: 0x28f016a0:Domain 'domain.com' is now online
2012-06-13T16:35:03-07:00 <30.6> cluster1-13(id13) lsassd[66251]: 0x28f016a0:Domain 'domain.com' is now offline
2012-06-13T16:40:03-07:00 <30.6> cluster1-13(id13) lsassd[66251]: 0x28f016a0:Domain 'domain.com' is now online
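To see how often a node flaps between states, the log can be condensed to one line per transition. The following is a minimal sketch, assuming the message format shown above; the function name lsass_transitions is hypothetical and not part of OneFS.

```shell
# Summarize lsassd online/offline transitions from log lines on stdin.
# Assumes each relevant line ends with "Domain '<name>' is now <state>".
lsass_transitions() {
    awk '
        /Domain .* is now (online|offline)/ {
            state = $NF                  # last field: online or offline
            print $1, state              # first field: timestamp
            if (state == "offline") n++
        }
        END { printf "offline transitions: %d\n", n + 0 }
    '
}

# Example use on a node:
#   lsass_transitions < /var/log/lsassd.log
```

Frequent transitions in a short window point toward a connectivity or DNS problem rather than a one-time outage.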
Alternatively, reviewing the status of Active Directory with isi auth status may show output similar to the following:
testcluster-1# isi auth status
ID Active Server Status
-----------------------------------------------------------------------
lsa-activedirectory-provider:TESTDOMAIN.COM dc1.testdomain.com offline
lsa-local-provider:System - active
lsa-file-provider:System - active
-----------------------------------------------------------------------
Total: 3
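The check above can be scripted so that only offline Active Directory providers are reported. This is a sketch that assumes the column layout shown in the sample output (provider ID first, status last); the function name offline_ad_providers is hypothetical.

```shell
# Print the ID of each Active Directory provider reported as offline.
# Reads `isi auth status` output on stdin.
offline_ad_providers() {
    awk '$1 ~ /^lsa-activedirectory-provider:/ && $NF == "offline" { print $1 }'
}

# Example use on a node:
#   isi auth status | offline_ad_providers
```

Empty output means no AD provider on that node is reporting offline.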
Different cluster added to the same AD domain using the same machine account as the existing cluster:
If a new cluster was recently joined to the same domain, verify that it is not using the same machine account as the original cluster. Run the following command on the new cluster (the one showing the domain online) to check the hostname and machine account:
# isi auth ads view <domain>
(Relevant output)
Hostname: isitest.example.lab.com
<snipped>
Machine Account: ISITEST$
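If the isi auth ads view output from both clusters is saved to files, the machine accounts can be compared directly. This is a minimal sketch; the function names and the file names in the usage comment are hypothetical, and the comparison is case-insensitive on the assumption that AD treats account names that way.

```shell
# Print the machine account from `isi auth ads view` output on stdin,
# upper-cased so the comparison below ignores case.
machine_account() {
    awk -F': *' '/^ *Machine Account:/ { print $2; exit }' |
        tr '[:lower:]' '[:upper:]'
}

# Succeed (exit 0) when two saved outputs name the same machine account.
same_machine_account() {
    a=$(machine_account < "$1")
    b=$(machine_account < "$2")
    [ -n "$a" ] && [ "$a" = "$b" ]
}

# Example use (file names are placeholders):
#   same_machine_account old_cluster.txt new_cluster.txt &&
#       echo "Both clusters use the same machine account"
```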
Machine account password/account deleted from AD:
If the cluster was joined to Active Directory but isi auth status no longer lists an lsa-activedirectory provider, check whether the machine account was deleted on the Active Directory side. The cluster can be rejoined to the domain to create a new machine account in Active Directory and restore authentication.
DNS is refusing the SRV lookup:
If this is happening, verify that DNS is configured to accept queries from the relevant node IPs. In the following query, the DNS server returns a status of REFUSED:
dig @<DNS IP> SRV _ldap._tcp.dc._msdcs.domain.com
; <<>> DiG 9.10.0-P2 <<>> @<DNS IP> SRV _ldap._tcp.dc._msdcs.<domain>
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: REFUSED, id: 52396
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;_ldap._tcp.dc._msdcs.<domain>. IN SRV
;; Query time: 0 msec
;; SERVER: <IP>#53(<IP>)
;; WHEN: <date>
;; MSG SIZE rcvd: 59
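When checking several DNS servers, it can help to pull only the response status out of the dig output. The sketch below assumes the standard dig header line shown above; the function name dig_status is hypothetical.

```shell
# Print the response status (for example NOERROR, NXDOMAIN, REFUSED)
# from full dig output on stdin.
dig_status() {
    sed -n 's/.*->>HEADER<<-.*status: \([A-Z]*\),.*/\1/p'
}

# Example use, with the same placeholders as the command above:
#   dig @<DNS IP> SRV _ldap._tcp.dc._msdcs.<domain> | dig_status
```

A status of REFUSED matches the symptom described here; NOERROR with an empty answer section or NXDOMAIN points to missing SRV records instead.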
Cause
The Local Security Authority Subsystem Service (LSASS) can lose its connection to the domain controller because of hardware issues, connectivity issues, or DNS cache poisoning. Alternatively, this can happen because a node was added to the cluster while it did not have network connectivity. If the domain controller is unavailable for any reason when a node is rebooted or LSASS is restarted, some of the domain information may remain unpopulated. Usually, the Domain GUID for the primary domain remains unpopulated. As a result, authentication cannot complete for users connecting to that node.
Resolution
Workaround 1 - Investigate hardware connectivity issues:
- Look for hardware connectivity issues in the /var/log/messages file. Messages in the log file indicate if the node network port is in an "Up" state or not. Look at the /var/log/messages file on other nodes in the cluster to determine if the network connectivity issue occurs cluster-wide.
- Look at the system and application event logs in the domain controller. These log files might include errors about driver or hardware issues that relate to the loss of network connectivity.
Workaround 2 - Investigate DNS cache poisoning:
If the network connectivity issue is not hardware-related, inspect the service (SRV) records in the _msdcs DNS zone for the Active Directory domains. A packet trace of the DNS request from the cluster to the DNS server can reveal incorrect or missing information. If incorrect information is registered in the SRV records, or if the _msdcs DNS zone does not contain records for all of the domain controllers, the nodes in the cluster report that the domain is offline when they attempt to contact the domain controller. Updating the SRV records with current information or changing to a different DNS server should resolve DNS cache poisoning.
Example 1
This packet trace shows the SRV records that the DNS server returned to a node in the cluster. The follow-up A record query for dc2.domain.com returned "No such name," so no address information was available for that domain controller.
No. Time Source Destination Protocol Length Info
5 16:40:19.061003 1.1.1.1 1.1.1.2 DNS 110 Standard query SRV _ldap._tcp.dc._msdcs.domain.com
6 16:40:19.062626 1.1.1.2 1.1.1.1 DNS 1270 Standard query response SRV 0 100 389 dc1.domain.com SRV 0 100 389 dc2.domain.com SRV 0 100 389 SRV 0 100 389 dc3.domain.com
7 16:40:19.063146 1.1.1.1 1.1.1.2 DNS 87 Standard query A dc2.domain.com
8 16:40:20.797403 1.1.1.2 1.1.1.1 DNS 146 Standard query response, No such name
Example 2
In this packet trace, a node looks up the SRV record for _ldap._tcp.dc._msdcs.domain.com, but no information is returned to the client.
No. Time Source Destination Protocol Length Info
15 16:40:21.458636 1.1.1.1 1.1.1.2 DNS 100 Standard query SRV _ldap._tcp.dc._msdcs.domain.com
16 16:40:21.783630 1.1.1.2 1.1.1.1 DNS 100 Standard query response, No such name
In this situation, please work with your networking and Active Directory team to ensure that the DNS SRV records are accurate and resolve to the domain controller.
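The condition in Example 1, an SRV record whose target has no A record, can also be checked without a packet trace. The sketch below assumes dig +short output of the form "priority weight port target"; the function names and the DNS_IP variable are hypothetical, and the resolver is passed in as a command so the logic can be exercised without network access.

```shell
# Extract target host names from `dig +short SRV` output on stdin.
# Each line is expected to be: <priority> <weight> <port> <target>
srv_targets() {
    awk 'NF == 4 { sub(/\.$/, "", $4); print $4 }'
}

# Print each host name read from stdin for which the resolver command
# named in $1 returns no output (that is, no address).
unresolved_targets() {
    resolver=$1
    while read -r host; do
        [ -n "$($resolver "$host")" ] || echo "$host"
    done
}

# Example use (DNS_IP is a placeholder for the DNS server address):
#   resolve_a() { dig @"$DNS_IP" "$1" A +short; }
#   dig @"$DNS_IP" _ldap._tcp.dc._msdcs.domain.com SRV +short |
#       srv_targets | unresolved_targets resolve_a
```

Any host name printed is a domain controller advertised in the SRV records that the DNS server cannot resolve to an address.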
Workaround 3 - Refresh LSASS:
- Open an SSH connection to the node and log in using the "root" account.
- Verify that the authentication daemon is not connected to AD, where <nodeID> is the node number of the recently added node:
isi_for_array -n <nodeID> 'isi auth ads list'
If the node is joined to the domain, output similar to the following appears:
cluster-1: Name Authentication Status DC Name Site
cluster-1: --------------------------------------------------------------------
cluster-1: LAB.EXAMPLE.COM Yes online - Default-First-Site-Name
cluster-1: --------------------------------------------------------------------
cluster-1: Total: 1
- Verify that the Domain GUID is not populated. If LSASS does not retrieve its configuration correctly, it does not populate a Domain GUID value:
isi_for_array -n <nodeID> /usr/likewise/bin/lw-lsa get-status | egrep -A 12 "Domain:" | egrep "Domain (SID|GUID)"
Output similar to the following appears:
cluster-1: Domain SID: S-1-5-21-584721463-3180705917-972194821
cluster-1: Domain SID: S-1-5-21-584721463-3180705917-972194821
cluster-1: Domain GUID:
- Refresh LSASS on the newly added node:
isi_for_array -n <node_range> /usr/likewise/bin/lwsm refresh lsass
- Verify that the new node reports that it is connected to an AD provider:
isi_for_array -n <node_range> 'isi auth ads list -v'
- Ensure that the GUID value appears:
isi_for_array -n <node_range> /usr/likewise/bin/lw-lsa get-status | egrep -A 12 "Domain:" | egrep "Domain (SID|GUID)"
Output similar to the following appears:
cluster-1: Domain SID: S-1-5-21-584721463-3180705917-972194821
cluster-1: Domain SID: S-1-5-21-584721463-3180705917-972194821
cluster-1: Domain GUID: 61b2a8c6-af25-1941-8d57-59073b7ceb19
- On the Windows client, verify that the user can authenticate to the cluster by mapping a drive, specifying the IP address of the newly added node.
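The Domain GUID check in the steps above can be scripted to list only the nodes that still need a refresh. This is a minimal sketch, assuming the node-prefixed output format shown above; the function name nodes_missing_guid is hypothetical.

```shell
# Print the name of each node whose "Domain GUID:" line has no value.
# Reads node-prefixed isi_for_array output on stdin.
nodes_missing_guid() {
    awk '/Domain GUID:/ && $NF == "GUID:" { sub(/:$/, "", $1); print $1 }'
}

# Example use, with the same placeholder as the commands above:
#   isi_for_array -n <node_range> /usr/likewise/bin/lw-lsa get-status |
#       nodes_missing_guid
```

Empty output after the refresh indicates that every checked node has a populated Domain GUID.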
Workaround 4 - Restart LSASS:
1. Open an SSH connection to the recently added node and log in using the "root" account.
2. List available domain controllers, where <domain_name> is the fully qualified domain name (FQDN) of the domain to which the cluster is joined:
isi auth ads trusts controllers list --provider=<domain_name> -v
3. Forcibly connect to a domain controller, where <domain_name> is the FQDN of the domain to which the cluster is joined, and <dc_name> is the FQDN of the domain controller:
isi auth ads modify <domain_name> --domain-controller <dc_name> --v
4. Refresh AD status:
isi_classic auth ads status --refresh --all
The Status should change to online as shown below:
Active Directory Services Status:
Mode: unprovisioned
Status: online
Primary Domain: LAB.EXAMPLE.COM
NetBios Domain: LAB
Domain Controller: dc1.lab.example.com
Hostname: cluster.lab.example.com
Machine Account: CLUSTER$
5. If the status still shows as offline, restart the authentication daemon. Note that this interrupts authentication to the node for up to a minute.
Single node:
pkill -f 'lw-container lsass'
Multiple nodes (nodes 1-3 here as an example):
isi_for_array -n1-3 'pkill -f "lw-container\ lsass"'
6. Repeat Step 4.
7. On the Windows client, please verify that the user can authenticate to the cluster by mapping a drive, specifying the IP address of the node on which LSASS was restarted.
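The status check in Step 4 can be reduced to a pass/fail test, which is convenient when deciding whether Step 5 is needed. The sketch below assumes the "Status:" line format shown in the sample output; the function names ads_status and ad_is_online are hypothetical.

```shell
# Print the value of the "Status:" field from
# `isi_classic auth ads status` output on stdin.
ads_status() {
    awk -F': *' '/^ *Status:/ { print $2; exit }'
}

# Succeed (exit 0) only when the reported status is "online".
ad_is_online() {
    [ "$(ads_status)" = "online" ]
}

# Example use on a node:
#   isi_classic auth ads status --refresh --all | ad_is_online ||
#       echo "AD still offline; consider restarting LSASS (Step 5)"
```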
Additional Information
Troubleshooting note for domain connectivity issues:
If you leave and rejoin the domain, confirm that the Active Directory provider shows up in the authentication providers list for the relevant zone after you rejoin.
In this example, the Active Directory provider for the example.com domain was removed from the Auth Providers list and must be re-added:
isi zone zones list -v
Name: accesszonedev1
Path: /ifs/accesszone1
Groupnet: groupnet0
Map Untrusted: -
Auth Providers: lsa-ldap-provider:Primary, lsa-file-provider:System, lsa-local-provider:accesszone1 **No Active Directory Provider** <<<<<<<<<<<<<
NetBIOS Name: -
User Mapping Rules:
Home Directory Umask: 0077
Skeleton Directory: /usr/share/skel
Cache Entry Expiry: 4H
Negative Cache Entry Expiry: 1m
Zone ID: 2
The WebUI may be the easiest way to add the provider back and confirm that it is searched in the intended order. See the administration guide for your version of OneFS, available from the PowerScale OneFS Info Hubs.
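After a leave and rejoin, the zone listing can be scanned for access zones that no longer reference an Active Directory provider. This is a sketch that assumes the field labels shown in the sample output above; the function name zones_without_ad is hypothetical. A zone that never used Active Directory is also reported, so the result still needs to be reviewed.

```shell
# Print the name of each access zone whose "Auth Providers:" line
# contains no lsa-activedirectory-provider entry.
# Reads `isi zone zones list -v` output on stdin.
zones_without_ad() {
    awk '
        $1 == "Name:" { zone = $2 }
        $1 == "Auth" && $2 == "Providers:" {
            if ($0 !~ /lsa-activedirectory-provider:/) print zone
        }
    '
}

# Example use on a node:
#   isi zone zones list -v | zones_without_ad
```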