PowerScale: NFS Core Dumps from NFSv4 GETATTR Request with an Invalid File Descriptor.
Summary: In rare instances, the Network File System (NFS) process continuously core dumps on nodes due to an NFSv4 GETATTR request with an invalid file descriptor. The issue has only been reported in workflows with NFSv4 clients running the Solaris operating system. ...
Symptoms
The NFS process continuously core dumps and restarts on multiple PowerScale nodes with the following stack trace:
2025-12-12T09:50:12.851358-08:00 <0.5> powerscale01-28(id28) /boot/kernel.amd64/kernel: [kern_sig.c:4043](pid 6400="nfs")(tid=103190) Stack trace:
2025-12-12T09:50:12.851392-08:00 <0.5> powerscale01-28(id28) /boot/kernel.amd64/kernel: Stack: --------------------------------------------------
2025-12-12T09:50:12.851397-08:00 <0.5> powerscale01-28(id28) /boot/kernel.amd64/kernel: /usr/likewise/lib/lw-svcm/nfs.so:Nfs4AttrGatherAttrs+0x516
2025-12-12T09:50:12.851401-08:00 <0.5> powerscale01-28(id28) /boot/kernel.amd64/kernel: /usr/likewise/lib/lw-svcm/nfs.so:$dtrace1150544965.Nfs4FillAttr+0x736
2025-12-12T09:50:12.851404-08:00 <0.5> powerscale01-28(id28) /boot/kernel.amd64/kernel: /usr/likewise/lib/lw-svcm/nfs.so:$dtrace1209865017.NfsProtoNfs4ProcGetattr+0x515
2025-12-12T09:50:12.851408-08:00 <0.5> powerscale01-28(id28) /boot/kernel.amd64/kernel: /usr/likewise/lib/lw-svcm/nfs.so:$dtrace1357219149.NfsProtoNfs4ProcCompound+0x18a2
2025-12-12T09:50:12.851412-08:00 <0.5> powerscale01-28(id28) /boot/kernel.amd64/kernel: /usr/likewise/lib/lw-svcm/nfs.so:$dtrace1895683854.NfsProtoNfs4Dispatch+0xa31
2025-12-12T09:50:12.851415-08:00 <0.5> powerscale01-28(id28) /boot/kernel.amd64/kernel: /usr/likewise/lib/lw-svcm/nfs.so:NfsExecContextCallback+0x61
2025-12-12T09:50:12.851419-08:00 <0.5> powerscale01-28(id28) /boot/kernel.amd64/kernel: /usr/likewise/lib/liblwsched.so.0:WorkSparkMain+0x4f
2025-12-12T09:50:12.851422-08:00 <0.5> powerscale01-28(id28) /boot/kernel.amd64/kernel: /usr/likewise/lib/liblwbase.so.0:SparkMain+0x142
2025-12-12T09:50:12.851426-08:00 <0.5> powerscale01-28(id28) /boot/kernel.amd64/kernel: --------------------------------------------------
2025-12-12T09:50:12.851429-08:00 <0.6> powerscale01-28(id28) /boot/kernel.amd64/kernel: pid 6400 (nfs), jid 0, uid 0: exited on signal 11 from pid 0 (unknown) (core dumped)
OR
2023-03-01T09:18:00.403811+01:00 <0.5> powerscale01-5(id6) /boot/kernel.amd64/kernel: [kern_sig.c:4026](pid 71661="nfs")(tid=102404) Stack trace:
2023-03-01T09:18:00.403856+01:00 <0.5> powerscale01-5(id6) /boot/kernel.amd64/kernel: Stack: --------------------------------------------------
2023-03-01T09:18:00.403868+01:00 <0.5> powerscale01-5(id6) /boot/kernel.amd64/kernel: /usr/likewise/lib/lw-svcm/nfs.so:Nfs4AttrGatherAttrs+0x50a
2023-03-01T09:18:00.403879+01:00 <0.5> powerscale01-5(id6) /boot/kernel.amd64/kernel: /usr/likewise/lib/lw-svcm/nfs.so:$dtrace1150544965.Nfs4FillAttr+0x700
2023-03-01T09:18:00.403889+01:00 <0.5> powerscale01-5(id6) /boot/kernel.amd64/kernel: /usr/likewise/lib/lw-svcm/nfs.so:$dtrace1209865017.NfsProtoNfs4ProcGetattr+0x5e7
2023-03-01T09:18:00.403900+01:00 <0.5> powerscale01-5(id6) /boot/kernel.amd64/kernel: /usr/likewise/lib/lw-svcm/nfs.so:$dtrace1357219149.NfsProtoNfs4ProcCompound+0x1721
2023-03-01T09:18:00.403911+01:00 <0.5> powerscale01-5(id6) /boot/kernel.amd64/kernel: /usr/likewise/lib/lw-svcm/nfs.so:$dtrace1895683854.NfsProtoNfs4Dispatch+0x402
2023-03-01T09:18:00.403921+01:00 <0.5> powerscale01-5(id6) /boot/kernel.amd64/kernel: /usr/likewise/lib/lw-svcm/nfs.so:$dtrace2038417139.NfsProtoNfs4CallDispatch+0xd0
2023-03-01T09:18:00.403932+01:00 <0.5> powerscale01-5(id6) /boot/kernel.amd64/kernel: /usr/likewise/lib/liblwbase.so.0:SparkMain+0x141
2023-03-01T09:18:00.403943+01:00 <0.5> powerscale01-5(id6) /boot/kernel.amd64/kernel: --------------------------------------------------
2023-03-01T09:18:00.403953+01:00 <0.6> powerscale01-5(id6) /boot/kernel.amd64/kernel: pid 71661 (nfs), uid 0: exited on signal 11 from pid 0 (unknown) (core dumped)
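To gauge how often and on which nodes the crash occurs, the log lines above can be summarized per node. The following is a minimal sketch, not a supported tool: the function name count_nfs_segv is hypothetical, the log file location on a given cluster is an assumption and is passed in as an argument, and the parsing assumes the line layout shown in the traces above (node name as the third whitespace-separated field).

```shell
#!/bin/sh
# Hypothetical helper: count "nfs exited on signal 11" events per node.
# Assumes each log line looks like the examples above, where the third
# whitespace-separated field is the node name, e.g. "powerscale01-28(id28)".
count_nfs_segv() {
    # $1: path to a saved copy of the kernel messages (location is an assumption)
    awk '/\(nfs\)/ && /exited on signal 11/ { n[$3]++ }
         END { for (node in n) print node, n[node] }' "$1" | sort
}

# Hypothetical helper: count stack frames that name Nfs4AttrGatherAttrs,
# the top frame in both traces shown above.
count_gatherattrs_frames() {
    grep -c 'nfs.so:Nfs4AttrGatherAttrs' "$1"
}
```

A node that appears with a high count in the first function, together with a nonzero result from the second, is consistent with the signature described in this article, but it does not by itself confirm the root cause.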
Cause
This issue occurs when a Solaris NFSv4 client sends an NFSv4 GETATTR request with a NULL or invalid file descriptor.
This causes the NFS process to core dump and restart on the PowerScale node. The crash occurs while handling a root file handle in a second GETATTR, in a case where pExecContext->pExport is not NULL.
To date, all field reports of this issue have involved Solaris NFSv4 client workflows. However, PowerScale Engineering can replicate the issue using other UNIX or Linux operating systems as well. Evidence also indicates that Solaris clients using the autofs or automount feature may be more prone to causing the issue.
A new defect has been created to address the issue: PSCLDF-6198: Invalid Pointer pGattrCtx->pFilePosixInfo causes a core dump.
Resolution
Permanent Solution:
Upgrade to a OneFS version that includes the fix. PowerScale Engineering is working on a patch for the issue. No exact release date is available.
Workaround:
Until a permanent solution is applied, the following workarounds can be used to mitigate the impact:
- Identify the NFSv4 clients that are causing NFS to core dump.
If needed, Support can identify the culprit client IP address through the autogenerated core dumps found in /var/crash on the affected nodes. Do not manually produce a core dump. Support requires the core dump generated by the issue itself, found in /var/crash on the affected nodes. Support can create a consult escalation if assistance is needed in identifying the clients causing the issue.
- Disable the autofs/automount function on the Solaris clients, as Dell Technologies Support believes this is related to the issue. Instead, manually mount the exports on the Solaris clients by configuring /etc/vfstab on the client.
- Once Dell Technologies Support has identified the clients causing the issue, the impact on the rest of the NFS machines can be mitigated by suspending one or two nodes in the NFS pool. Customers can then configure the problematic Solaris clients to connect directly to the IP addresses of the suspended nodes (instead of using the SmartConnect zone name or FQDN). Dell Technologies Support can assist with this procedure if needed. With a node suspended, the problematic Solaris clients can still connect to it by IP address, whereas any new connections to the FQDN from all other NFS clients are prevented from reaching this node. However, any preexisting connections to the node are affected. Again, the goal is to lessen the impact until a patch is applied, so that the NFS daemons on only one or two nodes core dump.
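For the automount workaround above, a static mount can be expressed as a single line in /etc/vfstab on the Solaris client. The fragment below is only a sketch: the IP address 192.0.2.26 (standing in for a suspended node's address), the export path /ifs/data, and the mount point /mnt/data are hypothetical placeholders, and the vers=4 mount option is an assumption about the desired settings rather than a recommendation from this article.

```text
#device            device    mount      FS     fsck   mount     mount
#to mount          to fsck   point      type   pass   at boot   options
192.0.2.26:/ifs/data  -      /mnt/data  nfs    -      yes       vers=4
```

Pointing the entry at a node IP address rather than the SmartConnect zone name matches the third workaround, where the problematic clients connect directly to the suspended nodes.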
Steps to suspend a node from a SmartConnect network pool:
Using node 26 as an example (26 is the node's LNN):
# isi network pools sc-suspend-nodes groupnet0.NFS_Subnet.NFS_Pool 26
Repeat for each affected pool.
To resume (again using LNN 26):
# isi network pools sc-resume-nodes groupnet0.NFS_Subnet.NFS_Pool 26
Repeat for each affected pool.
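When several pools are affected, it can help to generate the full list of commands first and review it before running anything. The sketch below only prints the commands used in the steps above and does not execute them. The function name print_sc_commands is hypothetical, and the pool IDs passed to it are examples to be replaced with the cluster's own pools.

```shell
#!/bin/sh
# Hypothetical helper: print (not run) the suspend or resume command for one
# node across several pools, so the list can be reviewed before execution.
print_sc_commands() {
    if [ "$#" -lt 3 ]; then
        echo "usage: print_sc_commands suspend|resume LNN POOL..." >&2
        return 1
    fi
    action=$1   # "suspend" or "resume"
    lnn=$2      # logical node number, for example 26
    shift 2
    case $action in
        suspend|resume) ;;
        *)
            echo "usage: print_sc_commands suspend|resume LNN POOL..." >&2
            return 1
            ;;
    esac
    # One command per remaining argument (each argument is a pool ID).
    for pool in "$@"; do
        echo "isi network pools sc-${action}-nodes ${pool} ${lnn}"
    done
}
```

For example, `print_sc_commands suspend 26 groupnet0.NFS_Subnet.NFS_Pool` prints the suspend command shown earlier, and passing additional pool IDs adds one line per pool. Printing rather than executing keeps the operator in control of when each node is actually suspended or resumed.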