In a new environment we see that the snapshot autodelete job run times vary from cluster to cluster.
Are there any figures available for what should be considered reasonable in various environments?
My findings now:
cluster A - 3x NL400 nodes - running OneFS 220.127.116.11
Removing a directory with 500K files causes a snapshotdelete job to run with 130M lins and 49 snapshots. This job took 57 hours to complete, which averages to a deletion rate of 2.3M lins per hour.
cluster B - 3x NL400 nodes - running OneFS 18.104.22.168
Removing a directory with 500K files of the same file type causes a job with 138M lins and 322 snapshots. This job has already been running for 39 hours, but only 16.7M lins have been processed. So the current deletion rate is 0.4M lins per hour, which is much slower.
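For reference, the rates above can be reproduced with a short Python sketch. The function names are just illustrative, and the remaining-time estimate for cluster B assumes the deletion rate stays constant for the rest of the job, which may well not hold.

```python
def lins_per_hour(lins_processed, hours_elapsed):
    """Average deletion rate in lins per hour."""
    return lins_processed / hours_elapsed


def hours_remaining(total_lins, lins_processed, rate):
    """Estimated hours left, ASSUMING the rate stays constant."""
    return (total_lins - lins_processed) / rate


# Cluster A: job finished, 130M lins in 57 hours.
rate_a = lins_per_hour(130e6, 57)

# Cluster B: job still running, 16.7M of 138M lins after 39 hours.
rate_b = lins_per_hour(16.7e6, 39)
eta_b = hours_remaining(138e6, 16.7e6, rate_b)

print(f"cluster A: {rate_a / 1e6:.1f}M lins/hour")
print(f"cluster B: {rate_b / 1e6:.1f}M lins/hour")
print(f"cluster B is {rate_a / rate_b:.1f}x slower than cluster A")
print(f"cluster B estimated time remaining: {eta_b:.0f} hours")
```

At the current rate, cluster B would need well over 200 more hours to finish, compared with 57 hours in total for cluster A.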
In both cases the impact policy has been set to HIGH, so it should be able to use all resources in the clusters.
I also checked what is going on using top, isi stat and isi perfstat. top reports 200% WCPU for isi_job_d. The isi perfstat command reports around 50MB/sec of deletes at the disk-backend level, and the isi statistics command shows 30-40 IOPS per backend disk, with disk utilization around 10% per disk.
Any ideas about this?
Also, is there any additional information available on how the snapshotdelete process works? Does a snapshot delete have to empty (write zeroes to) all 'freed' blocks? Can we say that removing 1000GB of snapshots requires 1000GB of backend writes?
Yes, times of completion can vary due to a number of variables.
Per best practices, snapshot expiration should delete snapshots sequentially, oldest to newest. Snapshot creation policies also shouldn't overlap.
These are the two settings I have most often seen account for the behavior described. I would recommend opening a service request to review both configurations against best practices. Let's get to the bottom of this!