PowerScale: 'NFS: v4 server returned a bad sequence-id error on an unconfirmed sequence'.
Summary: OneFS 9.3 and OneFS 9.4: NFSv4 client reports error: 'NFS: v4 server returned a bad sequence-id error on an unconfirmed sequence'. In packet captures, Dell Technologies sees the following error as well: NFS4ERR_NO_GRACE ...
Symptoms
PowerScale is on OneFS 9.3 or 9.4, and NFSv4 clients are reporting errors like the following:
Nov 18 13:00:22 kernel: NFS: v4 server returned a bad sequence-id error on an unconfirmed sequence 00000000c6d21f3b!
Nov 18 13:00:22 kernel: NFS: nfs4_reclaim_open_state: unhandled error -10026
Nov 18 13:00:22 kernel: NFSv4: state recovery failed for open file /test2.txt, error = -10026
When these errors appear, the application accessing the NFS file system crashes. This affects production and requires manual intervention a few times per day.
Even after the NFS clients have been rebooted, the clients still report the errors.
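To confirm that a client is hitting this symptom, its kernel log can be scanned for the three messages quoted above. The following is a minimal sketch, not a Dell-provided tool. The function name scan_log and the pattern names are illustrative assumptions; the message text and the error value -10026 (the negated NFS4ERR_BAD_SEQID status code, 10026) come from the log excerpt in this article.

```python
import re

# Patterns copied from the kernel messages quoted in the Symptoms section.
# The dictionary keys are illustrative names, not kernel identifiers.
PATTERNS = {
    "bad_seqid": re.compile(r"NFS: v4 server returned a bad sequence-id error"),
    "reclaim_unhandled": re.compile(
        r"nfs4_reclaim_open_state: unhandled error -10026"
    ),
    "recovery_failed": re.compile(
        r"NFSv4: state recovery failed for open file (\S+), error = -10026"
    ),
}


def scan_log(lines):
    """Count each symptom message and collect the files named in them."""
    counts = {name: 0 for name in PATTERNS}
    files = set()
    for line in lines:
        for name, pattern in PATTERNS.items():
            match = pattern.search(line)
            if match:
                counts[name] += 1
                if name == "recovery_failed":
                    files.add(match.group(1))
    return counts, files


# Sample input: the three log lines shown in this article.
SAMPLE = [
    "Nov 18 13:00:22 kernel: NFS: v4 server returned a bad sequence-id "
    "error on an unconfirmed sequence 00000000c6d21f3b!",
    "Nov 18 13:00:22 kernel: NFS: nfs4_reclaim_open_state: unhandled error -10026",
    "Nov 18 13:00:22 kernel: NFSv4: state recovery failed for open file "
    "/test2.txt, error = -10026",
]

if __name__ == "__main__":
    counts, files = scan_log(SAMPLE)
    print(counts)
    print(sorted(files))
```

In practice the lines would come from the client's kernel log (for example, a saved copy of /var/log/messages) rather than from the embedded sample.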
In the client or node packet captures you may also see the following errors:
PCAP:
41 13:00:11.313563 10.205.224.32 10.205.224.12 NFS 302 V4 Call (Reply In 42) OPEN DH: 0x1eb1379b/
42 13:00:11.313804 10.205.224.12 10.205.224.32 NFS 122 V4 Reply (Call In 41) OPEN Status: NFS4ERR_NO_GRACE
43 13:00:11.314731 10.205.224.32 10.205.224.12 NFS 330 V4 Call (Reply In 44) OPEN DH: 0xa07785fa/test2.txt
44 13:00:11.314911 10.205.224.12 10.205.224.32 NFS 122 V4 Reply (Call In 43) OPEN Status: NFS4ERR_BAD_SEQID
Cause
This issue is caused by a known defect: PSCALE-162845: Accept incremented sequence id for the previous operation having NFS4ERR_NO_GRACE or NFS4ERR_GRACE error.
All versions of NFSv4 are affected, not just 4.1 and 4.2.
Detail about the defect is as follows:
The client increments its sequence id in a case where OneFS does not expect it to, so the sequence id that OneFS tracks falls behind the client's.
The NFS client appears to increment the sequence id monotonically for OPEN, CLOSE, and other operations, including operations that fail with NFS4ERR_NO_GRACE or NFS4ERR_GRACE. PowerScale, however, does not accept an incremented sequence id if the previous operation returned NFS4ERR_NO_GRACE or NFS4ERR_GRACE. As a result, when PowerScale returns one of these errors for an operation, the next incoming operation fails with NFS4ERR_BAD_SEQID because PowerScale does not expect an incremented sequence id.
The NFSv4 RFC does not state that the sequence id must not be incremented after NFS4ERR_NO_GRACE or NFS4ERR_GRACE.
Linux and PowerScale differ in how they increment the sequence id after NFS4ERR_NO_GRACE errors. The fix allows the incremented sequence id in the Isilon code so that it aligns with the Linux behavior.
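The mismatch described above can be illustrated with a toy model. This is a simplified sketch under stated assumptions, not OneFS or Linux code: the class ToyServer, the flag advance_on_grace_error, and the function run are hypothetical names, replay handling is omitted, and the server is assumed to track a single expected sequence id per open owner. The client is modeled as incrementing its sequence id after every reply, including NFS4ERR_NO_GRACE, as this article describes for Linux.

```python
NFS4_OK = "NFS4_OK"
NFS4ERR_NO_GRACE = "NFS4ERR_NO_GRACE"
NFS4ERR_BAD_SEQID = "NFS4ERR_BAD_SEQID"


class ToyServer:
    """Toy per-owner sequence-id check (hypothetical, not OneFS code)."""

    def __init__(self, advance_on_grace_error):
        # False models the defect; True models the behavior after the fix.
        self.advance_on_grace_error = advance_on_grace_error
        self.expected_seqid = 0

    def open(self, seqid, is_reclaim):
        if seqid != self.expected_seqid:
            return NFS4ERR_BAD_SEQID
        if is_reclaim:
            # Assumption: the grace period is over, so the reclaim is refused.
            if self.advance_on_grace_error:
                self.expected_seqid += 1
            return NFS4ERR_NO_GRACE
        self.expected_seqid += 1
        return NFS4_OK


def run(advance_on_grace_error):
    """Send a reclaim OPEN followed by a normal OPEN and return both replies."""
    server = ToyServer(advance_on_grace_error)
    client_seqid = 0
    replies = []
    for is_reclaim in (True, False):
        replies.append(server.open(client_seqid, is_reclaim))
        # The modeled client increments after every reply, errors included.
        client_seqid += 1
    return replies


if __name__ == "__main__":
    print("defect:", run(advance_on_grace_error=False))
    print("fixed: ", run(advance_on_grace_error=True))
```

With the flag set to False, the model reproduces the reply order seen in the packet capture: NFS4ERR_NO_GRACE followed by NFS4ERR_BAD_SEQID. With the flag set to True, the second OPEN is accepted.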
Resolution
The workaround is to move the workflow to NFSv3.
OR
Install patch:
GA: PSP-3035 PATCH: [9.4.0.11_GA-RUP_2023-01][Multiple User space and Kernel Fixes](January 2023)
DA: PSP-3069 PATCH: [9.4.0.10_DA-CUSTOM_2022-12][9.4.0.10_GA-RUP_2022-12 + NFS Fix](VMWARE)
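If the NFSv3 workaround is chosen, it helps to first identify which client mounts currently use NFSv4. The sketch below assumes a Linux client, where /proc/mounts reports NFSv4 mounts with the file system type nfs4. The function name nfsv4_mounts and the host names and paths in the sample are hypothetical and serve only as illustration.

```python
def nfsv4_mounts(mounts_text):
    """Return (source, mount_point) for every mount whose type is nfs4."""
    found = []
    for line in mounts_text.splitlines():
        fields = line.split()
        # /proc/mounts fields: source, mount point, fstype, options, dump, pass
        if len(fields) >= 3 and fields[2] == "nfs4":
            found.append((fields[0], fields[1]))
    return found


# Hypothetical sample in /proc/mounts format; names and paths are made up.
SAMPLE_MOUNTS = """\
/dev/sda1 / ext4 rw,relatime 0 0
cluster.example.com:/ifs/data /mnt/data nfs4 rw,relatime,vers=4.0 0 0
cluster.example.com:/ifs/home /mnt/home nfs rw,relatime,vers=3 0 0
"""

if __name__ == "__main__":
    for source, mount_point in nfsv4_mounts(SAMPLE_MOUNTS):
        print(f"{source} mounted on {mount_point} uses NFSv4")
```

On a real client, the input would be the contents of /proc/mounts instead of the embedded sample.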