VNX: DOMAIN_CONTROLLER_NOT_FOUND after upgrading code that supports SMB2 secure channel communications
Summary: DOMAIN_CONTROLLER_NOT_FOUND messages appear after upgrading to code that supports SMB2 secure channel communications with DCs. (Dell Correctable)
Symptoms
After the code was upgraded to one of the following versions, error messages started appearing in the logs indicating that the domain controller was down:
VNX2:
8.1.9.211
VNX1:
7.1.82.0
The message looks like this:
2017-06-20 20:51:27: SMB: 3:[NASSERVER1] OpenAndBind[LSA] DC=DC01 failed: Bind_OpenXFailed DOMAIN_CONTROLLER_NOT_FOUND
Cause
The code versions listed under Symptoms all add a new feature that allows the NAS server or Data Mover to communicate with the domain controllers using SMB2. Prior to this code, domain controller communications were all handled in SMB1 (though clients could still talk to us in SMB2/SMB3).
With the change to SMB2, our commands to the domain controllers are no longer all serialized. This appears to allow some commands to run in parallel.
An example is the attempt to open the lsarpc named pipe.
In this error message, it is important to note the service we are trying to bind to:
2017-06-20 20:51:27: SMB: 3:[NASSERVER1] OpenAndBind[LSA] DC=DC01 failed: Bind_OpenXFailed DOMAIN_CONTROLLER_NOT_FOUND
From the error message, we can see it is trying to open LSA (the name in brackets after OpenAndBind). This is where the problem comes in. We attempt to open the lsarpc named pipe multiple times simultaneously before receiving a response from the DC. The first request is successful, but the subsequent ones fail. We see the failures returned as STATUS_PIPE_NOT_AVAILABLE and log the DOMAIN_CONTROLLER_NOT_FOUND message.
It is important to note that these messages do not always indicate this problem. DOMAIN_CONTROLLER_NOT_FOUND errors can have many causes.
When this particular issue is the cause, the logs are likely to also contain many of the following informational messages:
2017-06-28 14:37:45: 26041909248: SMB: 6:[NASSERVER1] sendLookupSIDs pipe lsarpc reopened
2017-06-28 14:37:47: 26041909248: SMB: 6:[NASSERVER1] sendLookupSIDs pipe lsarpc reopened
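As a rough way to check whether both message types described here are present, the two patterns can be counted in a saved copy of the Data Mover log. This is only a sketch: the function name and the temporary sample file are hypothetical, and the sample content is built from the log lines quoted in this article.

```shell
#!/bin/sh
# Hypothetical helper: count the two message types from this article
# in a saved log file whose path is passed as the first argument.
count_lsarpc_messages() {
    logfile="$1"
    # Lines that mention both the LSA bind attempt and the DC-not-found result.
    # -F treats the pattern as a fixed string, so the brackets are literal.
    failed=$(grep -F 'OpenAndBind[LSA]' "$logfile" | grep -cF 'DOMAIN_CONTROLLER_NOT_FOUND')
    # Informational messages logged when the lsarpc pipe is reopened.
    reopened=$(grep -cF 'sendLookupSIDs pipe lsarpc reopened' "$logfile")
    echo "bind_failures=$failed pipe_reopens=$reopened"
}

# Sample input built from the log lines quoted in this article.
sample=$(mktemp)
cat > "$sample" <<'EOF'
2017-06-20 20:51:27: SMB: 3:[NASSERVER1] OpenAndBind[LSA] DC=DC01 failed: Bind_OpenXFailed DOMAIN_CONTROLLER_NOT_FOUND
2017-06-28 14:37:45: 26041909248: SMB: 6:[NASSERVER1] sendLookupSIDs pipe lsarpc reopened
2017-06-28 14:37:47: 26041909248: SMB: 6:[NASSERVER1] sendLookupSIDs pipe lsarpc reopened
EOF
count_lsarpc_messages "$sample"   # prints: bind_failures=1 pipe_reopens=2
rm -f "$sample"
```

A high count of pipe reopens alongside LSA bind failures is consistent with this issue, but it does not confirm it on its own.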
If there is any question about whether this issue is the cause, it can be confirmed in a packet trace. In the trace, we would see multiple simultaneous requests to open lsarpc sent before a response to any of them is received. When the DC responds, the first request succeeds and the subsequent ones fail with STATUS_PIPE_NOT_AVAILABLE.
This issue tends to occur mainly on systems that require a lot of SID lookups on the domain controller. If there are orphaned SIDs in the environment, it tends to log many more of these errors. This is due to the amount of traffic being sent to the DC: each time an ACL is accessed, we have to send a request to the DC to ask for the identity of any SIDs we do not have in our SID cache. Orphaned SIDs are never in the SID cache, so a lookup is attempted every time, which increases the number of opens we have to do on the lsarpc named pipe.
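The effect of orphaned SIDs on lookup traffic can be illustrated with a toy model. This is not the Data Mover's actual cache implementation; the SIDs, variable names, and function below are made up, and the only behavior modeled is that a resolvable SID is cached after its first lookup while an orphaned SID never is.

```shell
#!/bin/sh
# Toy model only. Space-delimited strings stand in for the SID cache
# and for the set of SIDs the DC can resolve (both hypothetical).
cached_sids=" "
known_sids=" S-1-5-21-1-1001 "
dc_lookups=0

resolve_sid() {
    # Cache hit: nothing is sent to the DC.
    case "$cached_sids" in
        *" $1 "*) return 0 ;;
    esac
    # Cache miss: one lookup goes to the DC.
    dc_lookups=$(( dc_lookups + 1 ))
    # Only SIDs the DC can resolve end up in the cache.
    case "$known_sids" in
        *" $1 "*) cached_sids="$cached_sids$1 " ;;
    esac
}

# Read the same ACL five times: one known SID and one orphaned SID.
for i in 1 2 3 4 5; do
    resolve_sid "S-1-5-21-1-1001"
    resolve_sid "S-1-5-21-1-9999"
done
echo "DC lookups after 5 ACL reads: $dc_lookups"   # prints 6
```

In this model the known SID costs one lookup in total, while the orphaned SID costs one lookup per ACL read, which is why the number of lsarpc opens grows with the number of orphaned SIDs.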
As the first open attempt succeeds, this is a non-impactful event and these messages can be ignored.
Resolution
Permanent fix:
Engineering is aware of the problem and is working on a fix for a future release of code. This is a non-disruptive problem and can be safely ignored in the meantime. However, if you want to try to reduce or eliminate the messages, there are some workarounds available.
Workaround 1:
There are a couple of ways to try to stop these messages from occurring in the logs. As multiple concurrent open attempts on the lsarpc pipe cause the problem, the easiest way to reduce the messages is to reduce the number of SID lookups needed.
Orphaned SIDs cause these excessive lookups. The following parameter can be modified so that lookups for unknown SIDs are done only in the caches (secmap, global SID cache, or per-connection SID cache), which should reduce the amount of traffic going to the DC:
[nasadmin@CS0 ~]$ server_param server_2 -f cifs -i acl.mappingErrorAction -v
server_2 :
name = acl.mappingErrorAction
facility_name = cifs
default_value = 8
current_value = 8
configured_value = 8
user_action = none
change_effective = immediate
range = (0,31)
description = Define rules for an unknown SID/UID/GID mapping
detailed_description
Defines the rules for unknown mapping between SID/UID/GID on ACL settings. Two kinds of errors might occur: the SID set in the ACL is unknown to the Domain Controllers we are using, or the username is not yet mapped to a UID/GID.
0x01: Stores unknown sid.
0x02: Stores sid with no UNIX mapping.
0x04: Enables debug traces.
0x08: Only do lookup in cache (secmap or globalSid cache or per connection SID cache) 0x08 is HIGHLY RECOMMENDED WITH OPTION=0x01.
0x10: Disable log displayed when an unknown SID resolution takes too much time.
Maximum value = 0x1F
Refer to param cifs.acl.retryAuthSID
This translates to the following info:
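Bit0 (0x01) = Stores unknown SID.
Bit1 (0x02) = Stores SID with no UNIX mapping.
Bit2 (0x04) = Enables debug traces.
Bit3 (0x08) = Only do lookup in cache (secmap or globalSid cache or per connection SID cache)
Bit4 (0x10) = Disables the log displayed when an unknown SID resolution takes too much time.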
If bits 0 and 1 are set (a value of 3, or 0x3), then orphaned SIDs are allowed to be stored on file system ACLs (by default they are not). In that case, it is suggested to change the value from 3 to 11 (0xB), which turns on bits 0, 1, and 3. This means: store unknown SIDs and SIDs with no UNIX mapping, and only look in the secmap and SID caches. If the value is 8 (0x8) or another combination where bits 0 and 1 are turned off, then orphaned SIDs are not allowed to be stored and no change should be made to this parameter.
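The bit arithmetic can be checked with a short sketch. The function name is hypothetical, and the bit meanings are copied from the parameter description. It assumes that the value passed to server_param is decimal, which is consistent with default_value = 8 and range = (0,31) in the parameter output.

```shell
#!/bin/sh
# Hypothetical helper: list which acl.mappingErrorAction bits are set in a value.
decode_mapping_error_action() {
    value=$(( $1 ))   # accepts decimal (11) or hex (0xB)
    for bit in 0 1 2 3 4; do
        if [ $(( value & (1 << bit) )) -ne 0 ]; then
            case "$bit" in
                0) meaning="store unknown SID" ;;
                1) meaning="store SID with no UNIX mapping" ;;
                2) meaning="enable debug traces" ;;
                3) meaning="look up in caches only" ;;
                4) meaning="disable slow-resolution log" ;;
            esac
            echo "bit $bit: $meaning"
        fi
    done
}

# Bits 0, 1, and 3 combined: 0x01 | 0x02 | 0x08
mask=$(( 0x01 | 0x02 | 0x08 ))
printf 'bits 0, 1, 3 -> decimal %d, hex 0x%X\n' "$mask" "$mask"   # decimal 11, hex 0xB

decode_mapping_error_action 11     # bits 0, 1, 3
decode_mapping_error_action 0x11   # bits 0 and 4 (decimal 17), a different setting
```

Decimal 11 and hexadecimal 0x11 are different values, so check which base is intended before changing the parameter.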
If you want to modify the parameter, the following command can be run:
server_param server_2 -f cifs -m acl.mappingErrorAction -v 11
This likely reduces the occurrences but may or may not eliminate them entirely.
Workaround 2:
The surefire way to get rid of these errors in the logs is to revert the DC communication behavior to what it was prior to the new codes (meaning that we only talk to the domain controllers in SMB1.)
If you want to revert to SMB1 for DC communications (the old VNX behavior), contact Dell support and reference this knowledge base article.