PowerScale OneFS 9.10: Rare Performance Issues when Running a SnapshotDelete job
Summary: On clusters upgraded to OneFS 9.10 or 9.11, performance issues may be experienced when running a SnapshotDelete job if there are multiple storage pools.
Symptoms
Clusters with two or more node pools that were upgraded to OneFS 9.10 or later may experience performance issues whenever a SnapshotDelete job is running. Pausing the SnapshotDelete job brings immediate relief, but the issue returns once the job is resumed.
On clusters whose snapshots have distant expiration dates, the issue may not become apparent until several weeks or months after the OneFS upgrade completes.
Logs and hangdumps show the job engine (isi_job_d) SnapshotDelete job thread holding a LIN lock, with a stack trace similar to this example:
77886 isi_job_d:
...
thread 100637: je_worker_main at 0xfffffe8b55ea95c0 in state "running":
On cpu 5 for 3 ticks
Stack: --------------------------------------------------
kernel:btree_leaf_check_prefetch+0xde
kernel:btree_leaf_get_entry+0x349
kernel:stf_is_fake_entry+0x41
kernel:stf_iterate_block+0x66
kernel:ifs_snap_get_lins_helper+0xac
kernel:_sys_ifs_snap_get_lins+0x279
kernel:amd64_syscall+0x7b0
--------------------------------------------------
Cause
OneFS 9.10 introduces illogical Logical Inodes (LINs) to the Snapshot Tracking Files (STFs). This change was added to support a new feature, MetadataIQ. An STF is a special file type with several unique characteristics and is involved in the full snapshot life cycle, including the creation, storage, modification, and deletion of snapshots.
When data is migrated between different pools, illogical LINs are added to the STF and can gradually build up. Performance issues occur when expired snapshots are being deleted and the STF of a snapshot contains too many illogical LINs.
How to determine whether a cluster is at risk for this issue?
Clusters that meet the following criteria are at higher risk of experiencing this issue if they are upgraded to OneFS 9.10 or 9.11.
- SnapshotIQ is licensed and enabled. Snapshots are being created and expired on the cluster.
- The cluster contains multiple node pools.
Resolution
Permanent solution:
Upgrade to one of these OneFS versions, or a later version, which include the fix:
- OneFS 9.10.1.4 PSP-4686 MR:[9.10.1.4_GA-MR][Multiple Userspace and Kernel Fixes](October 2025)
- OneFS 9.11.0.5 PSP-4681 MR:[9.11.0.5_GA-MR][Multiple Userspace and Kernel Fixes](September 2025)
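As an illustration only, the comparison against the fixed releases listed above can be sketched as a small shell helper. The function names `version_ge` and `onefs_fix_status` are hypothetical and not part of OneFS. The sketch assumes a plain dotted release number such as 9.10.1.3 is passed in, treats releases earlier than 9.10 as not affected because the Cause section states that illogical LINs were introduced in 9.10, and reports branches newer than 9.11 as "unknown" because this article names fixes only for the 9.10 and 9.11 branches.

```shell
#!/bin/sh
# Sketch only: classify a OneFS release number against the fixed releases
# named in this article (9.10.1.4 and 9.11.0.5). Helper names are hypothetical.

# version_ge A B: succeeds when dotted version A >= B, comparing each
# field numerically and treating missing fields as 0.
version_ge() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        na = split(a, x, "."); nb = split(b, y, ".")
        n = (na > nb) ? na : nb
        for (i = 1; i <= n; i++) {
            xi = (i <= na) ? x[i] + 0 : 0
            yi = (i <= nb) ? y[i] + 0 : 0
            if (xi > yi) exit 0
            if (xi < yi) exit 1
        }
        exit 0
    }'
}

# onefs_fix_status VERSION: prints fixed, affected, not-affected, or unknown.
onefs_fix_status() {
    v=$1
    case $v in
        9.10|9.10.*)
            if version_ge "$v" 9.10.1.4; then echo fixed; else echo affected; fi ;;
        9.11|9.11.*)
            if version_ge "$v" 9.11.0.5; then echo fixed; else echo affected; fi ;;
        *)
            # Assumption: branches after 9.11 are not classified here.
            if version_ge "$v" 9.12; then echo unknown; else echo not-affected; fi ;;
    esac
}
```

For example, `onefs_fix_status 9.10.1.3` prints `affected`, and `onefs_fix_status 9.11.0.5` prints `fixed`. A result of `affected` only means the release predates the fix; whether the cluster is actually at risk still depends on the criteria listed earlier.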
Workaround:
Until a permanent solution is applied, the following workaround should be used:
Apply the following setting change to disable illogical LINs cluster-wide:
isi_sysctl_cluster efs.snapshot.stf_populate_illogical_lin_enabled=0
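After applying the setting, it may be worth confirming that every node reports the value 0. The helper below is a minimal sketch, not a Dell-provided tool: `check_illogical_lin_disabled` is a hypothetical name, and it assumes the per-node output has the form `<node>: <sysctl name>: <value>`, which is what piping `isi_for_array -s sysctl efs.snapshot.stf_populate_illogical_lin_enabled` into it is expected to produce. Verify the actual output format on the cluster before relying on it.

```shell
#!/bin/sh
# Sketch only: read per-node sysctl output on stdin and report any node
# where the illogical LIN setting is not 0.
# ASSUMED input line format: "<node>: <sysctl name>: <value>"
check_illogical_lin_disabled() {
    awk -v name="efs.snapshot.stf_populate_illogical_lin_enabled" '
        index($0, name ":") {
            seen++
            # The value is taken to be the last whitespace-separated field.
            if ($NF != "0") { bad++; print "still enabled: " $0 }
        }
        # Fail when no matching line was seen or any node is not 0.
        END { exit (seen == 0 || bad > 0) ? 1 : 0 }
    '
}

# Assumed usage on a cluster node:
#   isi_for_array -s sysctl efs.snapshot.stf_populate_illogical_lin_enabled \
#       | check_illogical_lin_disabled && echo "disabled on all nodes"
```

The function returns a non-zero status when the input contains no matching lines, so an empty or unexpected output is not mistaken for success.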
On clusters that have upgraded to OneFS 9.10 and are experiencing performance issues:
Cancel and disable the SnapshotDelete job to avoid a Data Unavailability (DU) situation. Then contact Dell Technical Support for assistance with removing the snapshots containing illogical LINs.
To cancel a running SnapshotDelete job:
isi job cancel snapshotdelete
To disable the SnapshotDelete job:
isi job types modify snapshotdelete --enabled=false
Leaving the SnapshotDelete job disabled for too long can cause low disk space issues, because expired snapshots are not removed. Contact Dell Technical Support as soon as possible for assistance with manually removing the snapshots containing illogical LINs before the SnapshotDelete job is reenabled.