PowerScale: Collect and MultiScan do not reclaim space on excluded devices
Summary: Collect and MultiScan exclude a device from the sweep phase if it becomes unavailable while the job is running.
Symptoms
This KB applies only to running or completed Collect jobs, and to MultiScan jobs that ran Collect.
Collect frees up blocks that were left behind on a device while it was unavailable.
MultiScan at times runs both AutoBalance and Collect. To confirm that MultiScan ran Collect, check the job:
# isi job view <jobID#>
During the mark phase of Collect, blocks are marked for cleanup, which takes place once the job enters the sweep phase.
In some situations, a device may be excluded from the sweep operation of the job during the mark phase.
This can leave the cluster with imbalanced nodes or drives even after the job completes successfully.
The job is canceled if too many drives or nodes are excluded.
Cause
If a device goes down or becomes unavailable, the job excludes it from the sweep phase.
For nodes, this can have various causes, such as a node reboot, a power cycle, or a node split.
For drives, this can happen if a drive stalls or otherwise becomes unavailable.
If a device becomes unavailable while Collect or MultiScan is running, the job adds it to the bam_nosweep excluded devices list.
When the job enters the sweep phase, it sweeps blocks only on devices that were NOT added to the bam_nosweep excluded list.
As a result, the excluded nodes or drives can remain imbalanced, with higher capacity utilization, even after the job completes successfully.
Example 1 shows one device, devid 28, being excluded in the messages log.
2025-03-08T22:09:15.238162-08:00 <0.5> EXAMPLE-1(id25) /boot/kernel.amd64/kernel: [bam_nosweep.c:298](pid 63766="isi_job_d")(tid=104229) bam_nosweep_add_devices: Exclude set { devids (1) = [ 28 ], wdls = [] }
Example 2 shows a drive in devid 18 being excluded, followed at a later timestamp by a drive in devid 24.
2024-11-04T16:20:33.664254-07:00 <0.5> EXAMPLE-12(id12) /boot/kernel.amd64/kernel: [bam_nosweep.c:298](pid 83067="isi_job_d")(tid=103674) bam_nosweep_add_devices: Exclude set { devids (0) = [], wdls (1) = [ (d: 18, unm:00000040 ] }
2024-11-04T17:06:21.738071-08:00 <0.5> EXAMPLE-12(id12) /boot/kernel.amd64/kernel: [bam_nosweep.c:298](pid 83067="isi_job_d")(tid=103674) bam_nosweep_add_devices: Exclude set { devids (0) = [], wdls (2) = [ (d: 18, unm:00000040, (d: 24, unm:00000020 ] }
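When many messages log lines need to be reviewed, the exclude sets shown in Examples 1 and 2 can be extracted with a short script. The following is a minimal Python sketch, not a supported tool. The function name parse_exclude_set is hypothetical, the pattern is based only on the three sample lines above, and it assumes that multiple devids, when present, are comma-separated. The unm value is kept as an opaque string because its meaning is not described here.

```python
import re

# Pattern derived only from the sample log lines in this KB (an assumption:
# other OneFS releases may format the message differently).
EXCLUDE_RE = re.compile(
    r"bam_nosweep_add_devices: Exclude set \{ "
    r"devids(?: \(\d+\))? = \[(?P<devids>[^\]]*)\], "
    r"wdls(?: \(\d+\))? = \[(?P<wdls>[^\]]*)\]"
)
# One drive entry inside the wdls list, for example "(d: 18, unm:00000040".
WDL_RE = re.compile(r"\(d: (\d+), unm:([0-9a-fA-F]+)")


def parse_exclude_set(line):
    """Return the excluded devids and drive entries from one log line.

    Returns None when the line is not a bam_nosweep_add_devices message.
    """
    match = EXCLUDE_RE.search(line)
    if match is None:
        return None
    # Assumption: multiple devids are separated by commas.
    devids = [int(tok) for tok in match.group("devids").split(",") if tok.strip()]
    # Each drive entry is returned as (devid, unm), with unm left as a string.
    wdls = [(int(devid), unm) for devid, unm in WDL_RE.findall(match.group("wdls"))]
    return {"devids": devids, "wdls": wdls}


if __name__ == "__main__":
    samples = [
        "bam_nosweep_add_devices: Exclude set { devids (1) = [ 28 ], wdls = [] }",
        "bam_nosweep_add_devices: Exclude set { devids (0) = [], "
        "wdls (2) = [ (d: 18, unm:00000040, (d: 24, unm:00000020 ] }",
    ]
    for sample in samples:
        print(parse_exclude_set(sample))
```

Feeding each line of the messages log through parse_exclude_set and keeping the non-None results gives a list of the nodes and drives that were left out of the sweep.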
Example 3 shows Collect being canceled because 28.1% of the 32 nodes were excluded, which is above the 25% limit.
2025-11-04T14:08:28.356984+01:00 <0.5> EXAMPLE-8(id8) /boot/kernel.amd64/kernel: [bam_mark.c:1923](pid 3646="isi_job_d")(tid=101140) Mark not permitted with 28.1% of 32 nodes excluded (limit 25.0%)
2025-11-04T14:08:28.356994+01:00 <0.5> EXAMPLE-8(id8) /boot/kernel.amd64/kernel: [bam_mark.c:1837](pid 3646="isi_job_d")(tid=101140) Updated mark for cookie 19:none with error 85
2025-11-04T14:08:28.359093+01:00 <0.5> EXAMPLE-8(id8) /boot/kernel.amd64/kernel: [drv.c:1499](pid 67605="python3.8")(tid=102284) Drive sync in progress for ldnum 21
2025-11-04T14:08:28.365281+01:00 <0.5> EXAMPLE-8(id8) /boot/kernel.amd64/kernel: [lin_mark.c:376](pid 3646="isi_job_d")(tid=101140) lin_end_mark: Ending lin mark (error ECANCELED).
2025-11-04T14:08:28.365300+01:00 <0.5> EXAMPLE-8(id8) /boot/kernel.amd64/kernel: [lin_mark.c:398](pid 3646="isi_job_d")(tid=101140) lin_end_mark: Mark already canceled. (current group: <1,2770> current mark state: LIN_COLLECT_GOOD
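The percentages in the "Mark not permitted" message can be turned back into node counts to see how far over the limit the cluster was. The following is a minimal Python sketch under stated assumptions: the function name parse_mark_refusal is hypothetical, the pattern is based only on the sample line above, and the excluded node count is inferred from the rounded percentage in the log. Whether the job compares against the limit inclusively or exclusively is not stated here, so nodes_at_limit is only the count that corresponds exactly to the limit percentage.

```python
import math
import re

# Pattern derived only from the sample "Mark not permitted" line in this KB.
MARK_RE = re.compile(
    r"Mark not permitted with (?P<pct>\d+(?:\.\d+)?)% of (?P<total>\d+) "
    r"nodes excluded \(limit (?P<limit>\d+(?:\.\d+)?)%\)"
)


def parse_mark_refusal(line):
    """Return the node counts implied by a 'Mark not permitted' log line.

    Returns None when the line does not contain that message.
    """
    match = MARK_RE.search(line)
    if match is None:
        return None
    pct = float(match.group("pct"))
    total = int(match.group("total"))
    limit = float(match.group("limit"))
    return {
        "total_nodes": total,
        # The log rounds the percentage, so round back to a whole node count.
        "excluded_nodes": round(pct * total / 100),
        # Whole-node count that corresponds to the limit percentage.
        "nodes_at_limit": math.floor(limit * total / 100),
    }


if __name__ == "__main__":
    sample = "Mark not permitted with 28.1% of 32 nodes excluded (limit 25.0%)"
    print(parse_mark_refusal(sample))
```

For the sample line, this works out to 9 of 32 nodes excluded (9 / 32 = 28.125%, logged as 28.1%), while 25% of 32 nodes is 8.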
Resolution
If the device that needs to be swept was excluded, a new job must be started so that the device is included in the sweep.
If a separate issue is causing the device to become unavailable frequently, that issue must be investigated further.