Great question; unfortunately, the answer is: it depends. I am not familiar with your specific issue, so I am going to speak to what we generally see. The lsassd service handles authentication requests, and when things are working normally it should be in an Online state. In OneFS, if there is a problem with domain connectivity, lsassd goes into an Offline state. While lsassd is Offline, a client may or may not be able to authenticate, depending on the authentication mechanism it uses. The lsassd service stays Offline for 5 minutes, at which point it performs a new Domain Controller discovery and selects a new DC. The 5-minute interval is tunable:
isi auth config modify --check-online-interval=
isi auth ads modify --check-online-interval=
How is authentication impacted when lsassd goes Offline?
-- If a user connects to a cluster and the client chooses to use NTLM for authentication, it will fail because in an Offline state we do not have a connection to a Domain Controller.
-- If a user connects to a cluster and it uses Kerberos:
   -- If the user connected earlier and we already have the SID from the user token resolved to a username in our SID Cache, it will work.
   -- If the user connects and we do not have the SID in our SID Cache, it will fail as we will be unable to complete a SID2Name lookup to the domain controller.
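To make the cases above easier to reason about, here is a minimal sketch of the same decision logic as a shell function. The function name offline_auth_outcome and its arguments are hypothetical and purely illustrative; it does not query lsassd or the SID Cache, it only restates the outcomes described above.

```shell
#!/bin/sh
# Hypothetical helper (not a OneFS command): models the expected result of a
# NEW authentication request while lsassd is Offline, per the cases above.
#   $1 = authentication mechanism: ntlm | kerberos
#   $2 = is the user's SID already in the SID Cache: yes | no
offline_auth_outcome() {
    mechanism=$1
    sid_cached=$2
    case "$mechanism" in
        ntlm)
            # NTLM needs a live connection to a Domain Controller.
            echo "fail"
            ;;
        kerberos)
            # Kerberos works only if no SID2Name lookup to the DC is needed.
            if [ "$sid_cached" = "yes" ]; then
                echo "success"
            else
                echo "fail"
            fi
            ;;
        *)
            echo "unknown"
            return 1
            ;;
    esac
}

offline_auth_outcome ntlm yes
offline_auth_outcome kerberos yes
offline_auth_outcome kerberos no
```

Existing sessions are not covered by this sketch, since they do not trigger a new authentication request.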
Are existing user connections impacted when lsassd goes Offline?
-- No, the existing user connections will continue as normal. They will only experience an issue if the client does something that triggers a new authentication request. Even in that scenario, the new authentication request is highly likely to work, because it is probably using Kerberos and our SID Cache is already populated.
Why does lsassd go Offline?
Our lsassd process goes Offline when it detects problems with connectivity to a domain controller. The type of failure it detects determines whether lsassd goes Offline or triggers a failover to another DC. This process is documented in the following KB:
Why was my answer "It depends?"
The lsassd service can go Offline because of an external event (a DC reset our TCP connection) or an internal event (a bug in lsassd). If it is an external event, the resolution will need to come from the DC side. From the sound of it, since support has declared your issue fixed in a newer release, they are indicating it is a bug, so an upgrade would be relevant. If the problem continues after the upgrade, it may have been an external event all along, or it may be a new defect. Either way, if you are on a fixed version, the best thing to do is contact support and collect the necessary data for root cause.
What data should I collect so support can resolve the issue?
I am glad you asked. I have a step-by-step action plan for collecting the data we need to resolve the issue.
1.) Make the following directory:
2.) Start the packet traces. You will have to modify these commands for the specific interfaces in your cluster (e.g., lagg0 may be em0), and you will also need to substitute your DC IPs for the placeholders:
isi_for_array 'tcpdump -s 0 -i lagg0 -w /ifs/data/Isilon_Support/DomainOfflineIssue/`hostname`.$(date +%m%d%Y_%H%M%S).lagg0.pcap -- host <ip of dc1 in cluster site> or host <ip of dc2 in cluster site> &'
isi_for_array 'tcpdump -s 0 -i lagg1 -w /ifs/data/Isilon_Support/DomainOfflineIssue/`hostname`.$(date +%m%d%Y_%H%M%S).lagg1.pcap -- host <ip of dc1 in cluster site> or host <ip of dc2 in cluster site> &'
3.) Turn on lsassd debug logging
isi_for_array -s 'isi auth log-level --set=debug'
4.) Wait for the domain to report offline
5.) After the domain goes offline, run the following to stop the traces
isi_for_array -s 'pkill -9 tcpdump'
6.) Turn off lsassd debug logging
isi_for_array -s 'isi auth log-level --set=error'
7.) Copy lsassd logs to case directory
isi_for_array -s 'ls /var/log/lsassd.log | cut -d / -f 4 | while read foo; do bar=$(cp "/var/log/$foo" /ifs/data/Isilon_Support/DomainOfflineIssue/`hostname`.$foo);done'
8.) Upload all the data
isi_gather_info -n 1 --nologs -s "isi_hw_status -i" -f /ifs/data/Isilon_Support/DomainOfflineIssue
9.) Perform a full log gather
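If you have more than two DCs in the cluster's site, typing the host filter for step 2 by hand gets error-prone. Below is a minimal sketch of a helper that builds the "host A or host B ..." expression from a list of IPs. The function name build_dc_filter is hypothetical, and the addresses shown are documentation placeholders, not real DCs; paste the resulting string after the "--" in the tcpdump commands from step 2.

```shell
#!/bin/sh
# Hypothetical helper: joins DC IP addresses into a tcpdump host filter.
# Each argument is one DC IP; the output is "host IP1 or host IP2 ...".
build_dc_filter() {
    filter=""
    for ip in "$@"; do
        if [ -z "$filter" ]; then
            filter="host $ip"
        else
            filter="$filter or host $ip"
        fi
    done
    printf '%s\n' "$filter"
}

# Placeholder addresses; replace with the IPs of the DCs in the cluster's site.
build_dc_filter 192.0.2.10 192.0.2.11
```

With the two placeholder addresses above, this prints: host 192.0.2.10 or host 192.0.2.11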