PowerScale: The audit_flt Driver in the SMB Process is in Shutdown State, Causing Node DU
Summary: On OneFS 9.7.1.0 through 9.7.1.7, the audit_flt driver may fail to load properly in the SMB process. This can lead to data unavailability (DU).
Symptoms
Clusters running OneFS 9.7.1.0 through 9.7.1.7 may be affected by this issue. The main symptoms and conditions are:
- SMB users are unable to access some or all nodes.
- The SMB service on some or all nodes shows a high number of connections in the CLOSED state. Connections can be in this state for other reasons, but a high count is also a symptom of this issue. To check, run the following command:
isi_for_array -X 'netstat -an | grep "\.445 " | grep CLOSED | wc -l' | sort -V
On a healthy cluster, every node should report 0. Below is an example of a cluster exhibiting this issue:
MyCluster-1# isi_for_array -X 'netstat -an | grep "\.445 " | grep CLOSED | wc -l' | sort -V
MyCluster-1: 208
MyCluster-2: 425
MyCluster-3: 2228
MyCluster-4: 146
MyCluster-5: 5284
MyCluster-6: 964
- Auditing is enabled on the cluster. Check with this command:
isi audit settings global view | grep "Protocol Auditing"
MyCluster-1# isi audit settings global view | grep "Protocol Auditing"
Protocol Auditing Enabled: Yes
- The cluster is running an affected release: OneFS 9.7.1.0 through 9.7.1.7.
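The affected-release condition can be expressed as a small shell check. The following is a minimal sketch, not a Dell-provided tool: the function name is_affected_version is hypothetical, and it assumes the caller has already obtained the OneFS release as a plain dotted string such as 9.7.1.3.

```shell
#!/bin/sh
# Hypothetical helper: succeeds (returns 0) when the given OneFS release
# string is in the affected range 9.7.1.0 through 9.7.1.7, and fails
# (returns 1) for any other string.
# Assumption: the release is passed as a bare dotted version, e.g. "9.7.1.3".
is_affected_version() {
    case "$1" in
        9.7.1.[0-7]) return 0 ;;   # fourth field is a single digit from 0 to 7
        *)           return 1 ;;   # anything else, including 9.7.1.8 and later
    esac
}

# Example use:
#   if is_affected_version "9.7.1.3"; then echo "release is in the affected range"; fi
```

A match on the release alone does not confirm the issue; it only means the other conditions in this list are worth checking.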
To conclusively determine whether your cluster is experiencing this issue, open a service request with Dell Support. Support can help examine the SMB core dump from the LWIO service.
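On larger clusters, it can help to list only the nodes that report a non-zero CLOSED count. The following is a minimal sketch under stated assumptions: the function name flag_closed_nodes is hypothetical, and the input is assumed to be in the "node-name: count" form shown in the example output above.

```shell
#!/bin/sh
# Hypothetical helper: reads "<node-name>: <count>" lines on stdin and
# prints only the nodes whose count is greater than zero.
# Assumption: the count is the last whitespace-separated field on each line.
flag_closed_nodes() {
    awk '
        NF >= 2 && $NF ~ /^[0-9]+$/ && $NF + 0 > 0 {
            node = $1
            sub(/:$/, "", node)   # drop the trailing colon from the node name
            printf "%s: %d CLOSED connections on port 445\n", node, $NF
        }
    '
}

# Example use, piping in the per-node counts from the command shown earlier:
#   isi_for_array -X 'netstat -an | grep "\.445 " | grep CLOSED | wc -l' | flag_closed_nodes
```

No output means that no node reported a CLOSED connection on port 445. A non-zero count on its own is not conclusive, because connections can be in this state for other reasons.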
Cause
This issue occurs because the SMB process does not properly load the audit_flt driver on startup. It is typically triggered when the SMB service is restarted during a code upgrade or during failover and failback between clusters, but it can occur whenever the SMB service is restarted for any reason.
Resolution
To alleviate the issue, restart the SMB service. Under normal circumstances, restarting only the SMB process (lwio) should be sufficient:
MyCluster-1# killall -6 lwio
This can be done on multiple nodes with isi_for_array. Below is an example of restarting the SMB service on nodes 1-4:
MyCluster-1# isi_for_array -n1-4 'killall -6 lwio'
If this does not alleviate the issue, it may be necessary to restart SMB and all dependencies:
MyCluster-1# /usr/likewise/bin/lwsm restart lwio
Again, this can be done on multiple nodes simultaneously using isi_for_array. Below is an example of restarting the SMB stack on nodes 1-4:
MyCluster-1# isi_for_array -n1-4 '/usr/likewise/bin/lwsm restart lwio'
This issue is addressed in OneFS 9.7.1.8 and later releases.