PowerScale: Deleting a target quota during a job leads to SyncIQ failures
Summary: SyncIQ fails in the STF_PHASE_CT_DIR_DELS phase after a Target quota has been deleted.
Symptoms
SyncIQ fails in the STF_PHASE_CT_DIR_DELS phase after a Target quota has been deleted.
On the SyncIQ source cluster, a SyncIQ job can fail with an "Unable to delete" error because a quota is applied on the target directory:
Error at target cluster on node [target-1]: Unable to delete /ifs/PSCALE-154122/quota1 (1:005d:13c1::131), Local error : Job failed because the job attempted to delete a directory that a quota has been applied to. You must delete all quotas applied on or under /ifs/t_154122/quota1 before this job can continue.
Once the target quota is removed, the SyncIQ job then fails with the generic "A work item has been restarted too many times" error:
SyncIQ policy failed. A work item has been restarted too many times. This is usually caused by a network failure or a persistent worker crash.
On the SyncIQ target cluster, the following FAILED ASSERTION is logged in the isi_migrate.log file:
isi_migr_sworker: *** FAILED ASSERTION tmp_st.st_ino != entryp->d_fileno @ /b/mnt/src/isilon/bin/isi_migrate/sworker/stf_transfer.c:1253: Tmpdir 1:005f:14ba not expected, and moving to itself
Cause
After a quota is deleted on a SyncIQ target cluster, a leftover temporary working directory (tmp-working-dir) remains inside the directory that SyncIQ is trying to delete.
Resolution
Contact Dell PowerScale Support for assistance with the workaround. Mention this knowledge article.
Future occurrences of this issue can be avoided by setting --delete-quotas=yes on the SyncIQ policy:
isi sync policies modify <policy_name> --delete-quotas=yes
Additional Information
How to find errors on a live cluster:
On the Source cluster, check the error messages in the policy's report:
isi sync reports view <Policy Name> <Report ID>
On the Target cluster, look for the following assertion in the messages log:
isi_for_array -QX 'grep -h "isi_migr.*FAILED ASSERTION tmp_st.st_ino" /var/log/messages' | sort | tail
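When many assertion lines are returned, it can help to reduce them to the distinct Tmpdir LINs they mention. The following is a minimal sketch, not a Dell-provided tool: the function name extract_tmpdir_lins is hypothetical, and it assumes the assertion text has the same shape as the example shown in Symptoms ("Tmpdir <LIN> not expected"). It reads a single log file on the local node only and does not replace the isi_for_array command above.

```shell
# Hypothetical helper (not part of OneFS): print the unique Tmpdir LINs
# found in SyncIQ "FAILED ASSERTION tmp_st.st_ino" lines.
# Argument 1: path to a log file; defaults to /var/log/messages.
extract_tmpdir_lins() {
    logfile="${1:-/var/log/messages}"
    # Same match pattern as the grep in this article, then keep only the
    # LIN that sits between "Tmpdir " and " not expected".
    grep -h 'isi_migr.*FAILED ASSERTION tmp_st.st_ino' "$logfile" \
        | sed -n 's/.*Tmpdir \([0-9a-fA-F:]*\) not expected.*/\1/p' \
        | sort -u
}
```

For example, extract_tmpdir_lins /var/log/messages prints one LIN per line. An empty result means no matching assertion was found in that file.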