
Article number: 000053655


OneFS: Neighborhood split (Gen6) or node pool split (pre-Gen6) can lead to full pools, poor performance, or failed jobs

Summary: After adding nodes to an Isilon cluster in a way that requires splitting neighborhoods or node pools, clusters with little free capacity can suffer from the effects of capacity imbalance.

Article content


Symptoms

After adding nodes to an Isilon cluster in a way that requires splitting node neighborhoods or node pools, clusters with little free capacity can suffer from the effects of capacity imbalance. These effects include jobs that fail due to lack of free space, interruptions to user workflows due to lack of free space, and reduced performance as data layout becomes more difficult in the full pools.

For instance, after adding a 20th node to a Gen6 Isilon cluster, two new neighborhoods are created, each containing 10 nodes. Before Gen6, this split occurred after the 40th node was added, resulting in two 20-node node pools. Further splits occur at the 20-node (Gen6) or 40-node (pre-Gen6) boundaries, up to the maximum supported cluster size for the running OneFS release.
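To see how the cluster's node pools are laid out after a split, and how full each pool is, the storage pool listing can be inspected. This is a sketch, not part of the original article; the exact commands and output vary by OneFS release, so verify them on your cluster:

```shell
# Sketch (assumption, not from the original article): inspect node pool
# layout and capacity after a split. Output varies by OneFS release.

# List node pools with verbose details (on Gen6, includes neighborhoods)
isi storagepool nodepools list -v

# Cluster-wide status, including per-pool capacity usage
isi status
```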

Cause

After the split, a SmartPools or SetProtectPlus job needs to run to update all files and directories on the cluster. Some directories and files remain in the same pools, but others may be moved to the new pools created by the split. By default, SmartPools and SetProtectPlus do not consider the balance of these pools when making layout decisions. This can lead to one of the split pools filling more quickly than the other, particularly if the pools were already full to begin with. These jobs shrink the stripe width of files from the original 19 nodes (Gen6) down to 10, or from 39 nodes (pre-Gen6) down to 20.

The lack of balance consideration by SmartPools and SetProtectPlus can be further exacerbated if a device must be smartfailed, requiring FlexProtect or FlexProtectLin to run. By default, FlexProtect and FlexProtectLin also do not consider balance when they run.

This article provides workarounds to help restore data balance across pools.

Resolution

Sometimes, running AutoBalanceLin can help restore balance to the split pools. In this case, pause any running SmartPools or SetProtectPlus job before running AutoBalanceLin. AutoBalanceLin is preferred over AutoBalance because AutoBalance must scan all drives before it can decide what to balance, whereas AutoBalanceLin decides as it walks files and directories and can begin balancing immediately. Warning: unlike SmartPools and SetProtectPlus, AutoBalanceLin does not update the pool policy applied to a file, which is why the additional workarounds below may be preferable.
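The AutoBalanceLin workaround above can be sketched from the CLI as follows. The job names are the OneFS job types named in this article, but `<job-id>` is a cluster-specific placeholder, and the exact `isi job` syntax should be verified against your OneFS release:

```shell
# Sketch: pause a running SmartPools (or SetProtectPlus) job and
# run AutoBalanceLin instead. Verify syntax on your OneFS release.

# Find the ID of the running SmartPools or SetProtectPlus job
isi job jobs list

# Pause it (replace <job-id> with the ID from the listing)
isi job jobs pause <job-id>

# Start AutoBalanceLin
isi job jobs start AutoBalanceLin
```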

If AutoBalanceLin is not appropriate, choose the alternative below that fits with your cluster's situation:

Devices needing smartfail
If the cluster has devices that must be smartfailed and its pools are full (or nearly full), run FlexProtect with a rebalance goal. This assumes that a FlexProtect job is already running.

NOTE: If your cluster is running FlexProtectLin, it may be better served by pausing that job and running AutoBalanceLin. FlexProtectLin's behavior with find and restripe goals does not match FlexProtect's, and previous versions of this article incorrectly suggested changing the FlexProtectLin goal, which could result in a job that was essentially running AutoBalanceLin on everything in the filesystem.

These steps change the job's configuration and let it continue running with the new goal, avoiding a new job that would have to start over from the beginning. The settings do, however, result in a job that runs more slowly, since it must not only repair data but also balance it across nodes, drives, and pools. Do not forget to revert these settings after the job has completed or once you are satisfied that pool balance is no longer a concern. Also note that, like AutoBalance and AutoBalanceLin, FlexProtect does not update the pool policy applied to a file as SmartPools and SetProtectPlus do.

If FlexProtect fails to restore balance, SmartPools or SetProtectPlus must run in a special job engine configuration mode to restore balance before FlexProtect or FlexProtectLin is resumed.

To reconfigure:

  1. Disable Job engine: isi services -a isi_job_d disable
  2. Set FlexProtect find and restripe goals to rebalance: isi_gconfig -t job-config jobs.types.flexprotect.find_goal=rebalance; isi_gconfig -t job-config jobs.types.flexprotect.restripe_goal=rebalance
  3. Confirm: isi_gconfig -t job-config | grep goal | grep flexprotect
  4. Reenable Job engine: isi services -a isi_job_d enable
  5. If you paused the job outside of these steps, resume it now; otherwise the job continues running with the configuration changes made.
If balance does not improve enough for FlexProtect or FlexProtectLin to succeed, run SmartPools or SetProtectPlus with a rebalance goal, following the instructions in the next section. First, however, set the job engine to run in degraded mode, which allows OneFS 8.1.2 and earlier to run jobs other than FlexProtect or FlexProtectLin while a smartfail is pending:
  1. Put job engine in degraded mode: isi_gconfig -t job-config core.run_degraded=true
  2. Continue to the instructions for "No devices needing smartfail."
  3. Remember to revert the degraded mode setting back to its default with: isi_gconfig -t job-config -R core.run_degraded

No devices needing smartfail
If there are no devices needing smartfail, configure SmartPools and SetProtectPlus to run with a rebalance goal. These settings result in a job that takes longer to run, but that job should balance data among the new pools, preventing them from filling. SmartPools and SetProtectPlus also update the file pool target on each object. Do not forget to revert these settings after the job has completed or once you are satisfied that pool balance is no longer a concern.

To reconfigure:
  1. Disable Job engine: isi services -a isi_job_d disable
  2. Set SmartPools find and restripe goals to rebalance.
    1. OneFS 8.2.0 and later: isi_gconfig -t job-config jobs.types.smartpools.find_goal=rebalance; isi_gconfig -t job-config jobs.types.smartpools.restripe_goal=rebalance
    2. OneFS 8.1.2 and earlier: isi_gconfig -t job-config jobs.types.filepolicy.find_goal=rebalance; isi_gconfig -t job-config jobs.types.filepolicy.restripe_goal=rebalance
  3. Set SetProtectPlus find and restripe goals to rebalance: isi_gconfig -t job-config jobs.types.setprotectplus.find_goal=rebalance; isi_gconfig -t job-config jobs.types.setprotectplus.restripe_goal=rebalance
  4. Confirm settings (Note: We did not change treefilepolicy): isi_gconfig -t job-config | grep goal | egrep "filepolicy|setprotectplus"
  5. Reenable Job engine: isi services -a isi_job_d enable
  6. If you paused the job outside of these steps, resume it now; otherwise the job continues running with the configuration changes made.
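If the job was paused manually, the resume step above can be performed from the CLI. This is a sketch; `<job-id>` is a placeholder for the ID shown by the job listing, and the syntax should be verified on your OneFS release:

```shell
# Sketch: resume the paused job so it continues with the new goals.
isi job jobs list             # note the ID of the paused job
isi job jobs resume <job-id>  # replace <job-id> with that ID
```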

Reverting to default job find and restripe goals
When your cluster has balanced data sufficiently, the job has completed successfully, or balance is no longer a concern, revert these job goals so that each job runs with its default goals. The commands below revert all of the jobs above to their default job engine configuration:

OneFS 8.2.0 and later: isi_gconfig -t job-config -R jobs.types.smartpools.find_goal; isi_gconfig -t job-config -R jobs.types.smartpools.restripe_goal
OneFS 8.1.2 and earlier: isi_gconfig -t job-config -R jobs.types.filepolicy.find_goal; isi_gconfig -t job-config -R jobs.types.filepolicy.restripe_goal

Remainder of commands for all versions:
isi_gconfig -t job-config -R jobs.types.setprotectplus.find_goal; isi_gconfig -t job-config -R jobs.types.setprotectplus.restripe_goal
isi_gconfig -t job-config -R jobs.types.flexprotect.find_goal; isi_gconfig -t job-config -R jobs.types.flexprotect.restripe_goal
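After reverting, the goals can be confirmed with the same pattern used for verification earlier in this article; check that none of the affected job types still shows a rebalance goal:

```shell
# Confirm the find and restripe goals after reverting to defaults.
# None of these job types should still show "rebalance".
isi_gconfig -t job-config | grep goal | egrep "smartpools|filepolicy|setprotectplus|flexprotect"
```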


If the cluster had devices in smartfail and the job engine was run in degraded mode so that SmartPools or SetProtectPlus could run, also revert the degraded mode configuration with:
isi_gconfig -t job-config -R core.run_degraded

Article properties


Affected products

Isilon

Products

Isilon

Last published date

06 July 2023

Version

5

Article type

Solution