PowerScale OneFS: Delayed Responses from Slow DCs to Nodes May Result in Intermittent Access Issues to the Cluster Over SMB.

Summary: On certain versions of code, a slow or nonresponsive domain controller (DC) may induce a condition where LSASS stops responding.

This article is not tied to any specific product. Not all product versions are identified in this article.

Instructions

A slow or nonresponsive domain controller (DC) in Active Directory may trigger a rare condition within LSASS. This condition leads to a backlog of threads that are unable to process. A thread that is waiting on a response from a problematic DC has its RPC Call Timer expire as expected. However, unexpected handling in the code then prevents that thread from transitioning into a finished state. The thread is never canceled, and this in turn creates a block that other threads are stuck behind.

From a client-side perspective, an impacted node is unable to service SMB requests. Any active sessions to the impacted node may see failures when performing SMB operations, and clients attempting to establish a new connection to the node over SMB see it time out. Impacted nodes may also show a build-up of closed connections to port 445 in netstat output. You may additionally observe messages in the LSASS logs indicating a timeout on IRP calls, such as:
2023-08-15T15:27:56.424739-04:00 <30.4> ninefiveoh-1(id1) lsass[9436]: [rdr] [context 0x84e7a8618] IRP_TYPE_READ timed out
Note that while these symptoms may be observed when this issue is live, they are not exclusive to this particular issue.
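For a quick first pass before engaging Support, you can count these timeout messages per node. The following sketch assumes the LSASS log resides at /var/log/lsassd.log, which is the typical location but should be verified on your cluster:
# isi_for_array -X 'grep -c "IRP_TYPE_READ timed out" /var/log/lsassd.log'
A nonzero count alone does not confirm the issue; Support analysis is still required.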

Sufficient testing and adequate data analysis from Support are required to confirm the presence of the issue. At minimum, identifying the issue requires reviewing a memory dump of the LSASS process, which Support can collect.

If you suspect you are experiencing this issue, engage Support immediately for data collection and mitigation. Certain datasets are ephemeral and must be captured before any mitigation efforts are enacted in order to properly identify the issue.

Cause:

A domain controller that is slow to respond, or that does not respond at all within the RPC Call Timer window, induces the issue.

Although the timer expires as expected, the code does not properly cancel and close the thread. This lack of thread closure leads to a backlog of threads within LSASS. The issue is present starting in OneFS 9.4.0.14 and 9.5.0.3, and in subsequent patches up until the fixed releases listed in the Resolution below.

For example, a cluster on OneFS 9.4.0.15 is susceptible to this issue because that patch carries the same affected code introduced in 9.4.0.14.
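To determine which release a cluster is running before comparing it against the affected and fixed versions described in this article, you can use the standard version command:
# isi version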

Resolution:

While fixing the issues that lead to a slow DC would be optimal, the following cluster-side mitigations may be considered until a code enhancement is provided per an ongoing investigation by Dell Engineering. These actions should only be taken after the issue is confirmed through Support engagement.
  1. Extend the RPC Call Timer - You may consider extending this to anywhere between three and four minutes, but nothing beyond that, as a longer timeout would not be beneficial.
What does this parameter do?
It defines the maximum amount of time, in seconds, that an RPC call to Active Directory is allowed to take. A value of 0 indicates no timeout. The default is one minute (60 seconds).
 
How to view the current RPC Call Timeout value on the cluster for your configured AD provider:
# isi auth ads list -v |grep -i rpc
How to modify the RPC Call Timeout for a specific AD provider:
# isi auth ads modify <AD DOMAIN> --rpc-call-timeout=<time in seconds>
Example showing default parameters:
ninefiveoh-1# isi auth ads list -v |grep -i rpc
         RPC Call Timeout: 1m
Example showing parameters updated to three minutes:
ninefiveoh-1# isi auth ads list -v |grep -i rpc
         RPC Call Timeout: 1m
ninefiveoh-1# isi auth ads modify TESTDOMAIN.LOCAL --rpc-call-timeout=180
Verifying that the change took effect:
ninefiveoh-1# isi auth ads list -v |grep -i rpc 
         RPC Call Timeout: 3m
This is not anticipated to be an impactful change. Out of an abundance of caution, we recommend making the change during a maintenance window if there are any concerns about potential impact.

NOTE: If you have DCs that are so slow that adjusting the timer to four minutes is not beneficial, then priority should be placed on investigating why those DCs are slow. This may require engagement with Microsoft Support.
 
  2. Controlled LSASS restart - This is required on the nodes impacted by the issue to resolve the nonresponding state of the process. It is recommended that you validate the Process ID (PID) both before and after the process restart to confirm that it has changed.
Validate the PID:
# isi_for_array -n<LNN> 'ps auwwxx | grep lsass | grep -v grep'
Restart LSASS while pulling a core (signal 6 is SIGABRT, which causes the process to dump core as it terminates):
# isi_for_array -n<LNN> 'pkill -6 -f "lw-container\ lsass"'
Validate the PID a second time, confirming that it changed for the process:
# isi_for_array -n<LNN> 'ps auwwxx | grep lsass | grep -v grep'
Example where the command is run against nodes LNN 3, 5, and 6:
# isi_for_array -n3,5-6 'ps auwwxx | grep lsass | grep -v grep'
# isi_for_array -n3,5-6 'pkill -6 -f "lw-container\ lsass"'
# isi_for_array -n3,5-6 'ps auwwxx | grep lsass | grep -v grep'
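If you want to record the before and after states in one pass, the following sketch builds on the commands above. The sleep duration and the file paths under /tmp are illustrative assumptions, and LNN 3 is used purely as an example:
# isi_for_array -n3 'ps auwwxx | grep lsass | grep -v grep' > /tmp/lsass_before.txt
# isi_for_array -n3 'pkill -6 -f "lw-container\ lsass"'
# sleep 30
# isi_for_array -n3 'ps auwwxx | grep lsass | grep -v grep' > /tmp/lsass_after.txt
# diff /tmp/lsass_before.txt /tmp/lsass_after.txt
A difference in the PID column between the two files confirms that the process restarted.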
NOTE: While restarting LSASS does not terminate active sessions, there is a slight risk of impact to end users. This depends on how demanding the workflow is and whether any operations dependent on calls to LSASS are performed while the process restarts. If you opt to do this cluster-wide instead of on specific nodes, consider doing it during a maintenance window.

The code-level issue in OneFS's handling of the thread state is corrected in OneFS 9.5.0.6 and 9.4.0.16. All patches from those levels and later include the fix.

Once the patch with the fix is installed, revert the workarounds on the cluster as soon as possible. Leaving the RPC Call Timeout extended to such a high value may lead to unexpected behavior.
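As a sketch of the revert, reusing the modify command shown earlier with the default of 60 seconds (substitute your own AD provider name for TESTDOMAIN.LOCAL):
# isi auth ads modify TESTDOMAIN.LOCAL --rpc-call-timeout=60
# isi auth ads list -v |grep -i rpc
The second command verifies that the RPC Call Timeout reads 1m again.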

Additional Information

In addition to checking lsassd.log for IRP read timeout messages, you can identify which nodes have a build-up of closed SMB connections with this command:
echo -e "\e[32m\n >>> Any buildup of closed sockets against SMB? <<< \e[0m"; isi_for_array -X 'netstat -an | grep "\.445 " | grep CLOSED | wc -l' | sort -V
This prints a list of nodes by LNN with the number of connections in a closed state against port 445.
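If you would rather order the list by the number of closed connections instead of by LNN, a minimal variant is below. It assumes the per-node output format of "node: count", sorting numerically on the colon-separated count field:
# isi_for_array -X 'netstat -an | grep "\.445 " | grep CLOSED | wc -l' | sort -t: -k2 -n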
 
NOTE: While there is no hard threshold for what is considered a point of concern, anything excessive could be indicative of the issue. CLOSED is a valid TCP state during normal operations, so long as there is no unusual build-up. This behavior is not exclusive to this issue, but it can be used to identify which nodes may require closer inspection.
Article Properties
Article Number: 000216978
Article Type: How To
Last Modified: 07 Nov 2025
Version:  14