Isilon: How to determine if an Isilon cluster is in a window of risk for data loss

Summary: How to determine if an Isilon cluster is in a window of risk for data loss.

This article is not tied to any specific product. Not all product versions are identified in this article.

Symptoms

Introduction

A Window Of Risk (WOR) occurs when enough devices in a cluster, node pool, or disk pool have failed that the protection level has been reached. This condition is also known as being "at protection." When a cluster or pool is in a WOR, data loss has not yet occurred; however, if additional devices fail, data loss might occur. Whether data loss occurs depends on several factors, including whether additional devices fail before FlexProtect can complete and whether the failed devices were the only source of the data in question.

This article describes how protection levels work on the cluster, and how you can tell if your cluster is in a WOR for data loss.

NOTE
For the purposes of WOR calculation, "failed" means devices that are in a "down" or "dead" state. Devices that are "soft_failed" are not counted against protection levels. See the "Procedure" section below for how to determine the number of "down" or "dead" devices.

NOTE
The condition where more devices fail than the number specified as the protection level is called "over protection." In this state, the cluster or node pool/disk pool can no longer successfully re-create all the data stored there.

 

Details

OneFS uses an N+M data protection model. In N+M notation, N represents the number of nodes, and M represents the number of simultaneous node or drive failures that the cluster or node pool/disk pool can sustain without losing data. For example, with N+2 protection, the cluster or pool can lose either two drives on different nodes or two nodes altogether.

OneFS 6.5 and later also support an N+M:B protection model. In N+M:B notation, N represents the number of nodes, M represents the number of down or failed drives, and B represents the number of down or failed nodes that the cluster or node pool/disk pool can handle without losing data. For example, with N+3:1 protection, the cluster or pool can lose three drives or one node without losing data.

Multiple down or failed drives within a single node always count as a single node failure (rather than multiple drive failures) for the purposes of WOR calculation. Here are some examples using an 8-node cluster at N+3:1 protection (a simplified illustration of this counting logic follows the examples):

  • Example 1: Three drives fail, each in a different node. This puts the cluster in a WOR ("at protection").
  • Example 2: Two drives fail within the same node. Because the drives are in the same node, the failures count as a single node failure. This also puts the cluster in a WOR ("at protection").
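
The counting rules above can be summarized in a short helper. The following Python function is not a Dell or OneFS tool; it is only a simplified sketch of the rules described in this article (the function name and inputs are illustrative, and it does not model combinations of simultaneous drive and node failures): each failed drive counts individually, multiple failed drives within one node count as a single node failure, and the totals are compared against the M and B values of the configured protection level.

  def wor_state(protection, failed_drives_by_node, failed_nodes):
      """Classify a cluster or pool as below protection, at protection (WOR),
      or over protection.

      protection            -- requested protection, e.g. "+2" (N+M) or "+3:1" (N+M:B)
      failed_drives_by_node -- dict of node id -> number of down/dead drives in that node
      failed_nodes          -- set of node ids that are down/dead as whole nodes
      """
      level = protection.lstrip("+")
      if ":" in level:                      # N+M:B, e.g. "+3:1"
          m, b = (int(x) for x in level.split(":"))
      else:                                 # N+M tolerates M drive failures or M node failures
          m = b = int(level)

      # Multiple failed drives in one node count as a single node failure.
      node_failures = set(failed_nodes)
      drive_failures = 0
      for node, drives in failed_drives_by_node.items():
          if drives > 1:
              node_failures.add(node)
          elif drives == 1:
              drive_failures += 1

      if drive_failures > m or len(node_failures) > b:
          return "over protection"
      if drive_failures == m or len(node_failures) == b:
          return "at protection (WOR)"
      return "below protection"

  # Example 1 above: three drives failed, each in a different node, at N+3:1.
  print(wor_state("+3:1", {3: 1, 5: 1, 9: 1}, set()))   # at protection (WOR)
  # Example 2 above: two drives failed within the same node, at N+3:1.
  print(wor_state("+3:1", {5: 2}, set()))                # at protection (WOR)

Use the protection level and device states obtained in the Procedure section below as the inputs. The sketch is only an aid to reasoning; it is not a replacement for contacting Dell Technical Support.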

For more information about data protection levels and how they are calculated, see the OneFS Administration Guide.

CAUTION!
If you suspect or determine that your cluster is in a WOR state, contact Dell Technical Support for assistance before taking further action.

IMPORTANT!
A WOR might occur when drives or nodes fail. However, Isilon Engineering advises that you keep failed drives or nodes in the cluster until the FlexProtect operation has completed successfully. Although a device has failed, some or all of its data blocks might still be readable. Leaving the drive or node joined to the cluster provides flexibility if it becomes necessary to attempt data recovery from the failed device.

 

Cause

To determine whether the cluster or node pool/disk pool is currently in a WOR, first determine the level of protection configured on the cluster or pool. Next, determine how many failed nodes and drives exist. For the purposes of WOR calculation, "failed" means devices that are in a "down" or "dead" state. Follow the instructions in the appropriate section below.

Resolution

Procedure

 

    OneFS 7.2, 8.0, 9.0 and above

    1. In the OneFS web administration interface, go to File System > Storage Pools > SmartPools.
    2. Obtain the current protection level from the Tiers & Node Pools table, in the Requested Protection column.
    3. Open an SSH connection to the node and log in using the "root" account.
    4. Determine how many devices are "down" or "dead" by running the following command:

      isi_group_info

      The output looks similar to the following. If there are down or dead devices, they are indicated as "down" or "dead" in the output.

      Example of a down node: efs.gmp.group: { 3-4:0-8, 5:0-6,8, 9:1-2,4-6,8, 12:0-11, down: 6 }

      Example of a down drive: efs.gmp.group: { 1:0-11, 2:0-9,11, 3:0-11, 4:0-10, 5:0-11, 6:0-11, down: 2:10, 4:11, soft_failed: 2:10, 4:11 }

      Example of a dead drive: efs.gmp.group: { 1:0-11, 2:0-9,11, 3:0-11, 4:0-11, 5:0-11, 6:0-11, dead: 2:10 }
    For information about interpreting the output, including how to understand if the down or dead devices are drives or nodes, see: Understanding OneFS Group Changes or Interpreting Group Changes.
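
    If you need to count down or dead devices programmatically (for example, across several clusters), the following Python sketch parses an efs.gmp.group line of the form shown in the examples above. It is not a Dell utility: it handles only the formats shown here, does not expand drive or node ranges that may appear in the down or dead lists, and deliberately ignores "soft_failed" entries because they do not count against the protection level.

      import re

      # Device states that can appear at the end of an efs.gmp.group line.
      # Only "down" and "dead" count against the protection level;
      # "soft_failed" devices do not (see the note above).
      STATES = ("down", "dead", "soft_failed")

      def failed_devices(group_line):
          """Return ({node_id: down/dead drive count}, {down/dead node ids})
          parsed from a single efs.gmp.group line."""
          drives_by_node = {}
          failed_nodes = set()
          for state in ("down", "dead"):
              # Capture the device list between "<state>:" and the next state
              # keyword or the closing brace.
              match = re.search(
                  rf"{state}:\s*(.*?)(?=(?:{'|'.join(STATES)}):|}})", group_line)
              if not match:
                  continue
              for entry in (e.strip() for e in match.group(1).split(",")):
                  if not entry:
                      continue
                  if ":" in entry:                 # "2:10" -> drive 10 in node 2
                      node = int(entry.split(":")[0])
                      drives_by_node[node] = drives_by_node.get(node, 0) + 1
                  else:                            # "6" -> node 6 is down/dead
                      failed_nodes.add(int(entry))
          return drives_by_node, failed_nodes

      # Using the "down drive" example output shown above:
      line = ("efs.gmp.group: { 1:0-11, 2:0-9,11, 3:0-11, 4:0-10, 5:0-11, 6:0-11, "
              "down: 2:10, 4:11, soft_failed: 2:10, 4:11 }")
      print(failed_devices(line))                  # -> ({2: 1, 4: 1}, set())

    The drive counts and node set it returns can then be passed to the wor_state sketch in the Details section above to estimate whether the cluster or pool is at or over protection.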

    OneFS 7.1

    1. In the OneFS web administration interface, go to File System Management > Storage Pools > SmartPools.
    2. Obtain the current protection level from the Node Pools table, in the Requested Protection column.
    3. Open an SSH connection to the node and log in using the "root" account.
    4. Determine how many devices are "down" or "dead" by running the following command:

      isi_group_info

      The output looks similar to the following. If there are down or dead devices, they are indicated as "down" or "dead" in the output.

      Example of a down node: efs.gmp.group: { 3-4:0-8, 5:0-6,8, 9:1-2,4-6,8, 12:0-11, down: 6 }

      Example of a down drive: efs.gmp.group: { 1:0-11, 2:0-9,11, 3:0-11, 4:0-10, 5:0-11, 6:0-11, down: 2:10, 4:11, soft_failed: 2:10, 4:11 }

      Example of a dead drive: efs.gmp.group: { 1:0-11, 2:0-9,11, 3:0-11, 4:0-11, 5:0-11, 6:0-11, dead: 2:10 }

      For information about interpreting the output, including how to understand if the down or dead devices are drives or nodes, see: Understanding OneFS Group Changes or Interpreting Group Changes.

    OneFS 7.0

    1. In the OneFS web administration interface, go to File System Management > SmartPools > Summary.
    2. Obtain the current protection level from the Tiers & Node Pools table, in the Protection column.
    3. Open an SSH connection to the node and log in using the "root" account.
    4. Determine how many devices are "down" or "dead" by running the following command:

      isi_group_info

      The output looks similar to the following. If there are down or dead devices, they are indicated as "down" or "dead" in the output.

      Example of a down node: efs.gmp.group: { 3-4:0-8, 5:0-6,8, 9:1-2,4-6,8, 12:0-11, down: 6 }

      Example of a down drive: efs.gmp.group: { 1:0-11, 2:0-9,11, 3:0-11, 4:0-10, 5:0-11, 6:0-11, down: 2:10, 4:11, soft_failed: 2:10, 4:11 }

      Example of a dead drive: efs.gmp.group: { 1:0-11, 2:0-9,11, 3:0-11, 4:0-11, 5:0-11, 6:0-11, dead: 2:10 }
    For information about interpreting the output, including how to understand if the down or dead devices are drives or nodes, see: Understanding OneFS Group Changes or Interpreting Group Changes.

    OneFS 6.5

    1. In the OneFS web administration interface, go to File System > SmartPools > Disk Pools.
    2. Obtain the current protection level from the Default Protection column.
    3. Open an SSH connection to the node and log in using the "root" account.
    4. Determine how many devices are "down" or "dead" by running the following command:

      isi_group_info

      The output looks similar to the following. If there are down or dead devices, they are indicated as "down" or "dead" in the output.

      Example of a down node: efs.gmp.group: { 3-4:0-8, 5:0-6,8, 9:1-2,4-6,8, 12:0-11, down: 6 }

      Example of a down drive: efs.gmp.group: { 1:0-11, 2:0-9,11, 3:0-11, 4:0-10, 5:0-11, 6:0-11, down: 2:10, 4:11, soft_failed: 2:10, 4:11 }

      Example of a dead drive: efs.gmp.group: { 1:0-11, 2:0-9,11, 3:0-11, 4:0-11, 5:0-11, 6:0-11, dead: 2:10 }
    For information about interpreting the output, including how to understand if the down or dead devices are drives or nodes, see: Understanding OneFS Group Changes or Interpreting Group Changes.

    Affected Products

    PowerScale OneFS

    Products

    Isilon
    Article Properties
    Article Number: 000018892
    Article Type: Solution
    Last Modified: 09 Jul 2025
    Version:  4