Fault sets are an optional feature of ScaleIO that allow an administrator to manage system availability at data center scale. In a system configured with fault sets, ScaleIO can continue serving data if an entire rack or chassis fails. Let's start with an example. The ScaleIO administrator defined fault sets at the rack level because of environmental considerations such as limited power or network redundancy. Fault sets provide units of redundancy by placing physical constraints around where redundant data copies can be held: as data chunks are written to the system, the primary and secondary copies are placed in different fault sets.
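To make that placement rule concrete, here is a minimal Python sketch, not ScaleIO's actual code; the fault-set and node names are made up. It shows only the constraint described above: the primary and secondary copies of a chunk are always drawn from two different fault sets.

import random

def place_chunk(fault_sets):
    """Pick a primary and a secondary node from two different fault sets."""
    primary_fs, secondary_fs = random.sample(list(fault_sets), 2)
    primary = random.choice(fault_sets[primary_fs])
    secondary = random.choice(fault_sets[secondary_fs])
    return primary, secondary

# Example: rack-level fault sets, two nodes each (hypothetical names).
racks = {"rack-A": ["node1", "node2"],
         "rack-B": ["node3", "node4"],
         "rack-C": ["node5", "node6"]}
print(place_chunk(racks))  # the two copies never share a rack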
Because copies are placed in different fault sets, if an entire rack fails, redundant copies will still be available inside other fault sets. Since this system was configured with rack-level fault sets as the unit of redundancy, IO can continue while the remaining nodes work in parallel to rebuild data protection. Once data protection has been rebuilt, the system will be in a state where it can survive the loss of another rack and rebuild again, provided there is sufficient free space. In addition to providing rack-level redundancy, fault sets can also be used to protect against chassis failure when third-party blade servers are used. Fault sets can also be used to place all the nodes in a rack or chassis into maintenance mode for planned operations. If you're already familiar with ScaleIO, you probably noticed that a system with fault sets behaves a lot like a system without fault sets.
The difference is that a system without fault sets protects data by mirroring it on different physical nodes; a system with fault sets, however, places mirrored data in different fault sets. In this example, that means the mirrored data copies shown here would instead have been placed in different fault sets. This is because in a system without fault sets, the unit of redundancy is a node, but in a system with fault sets, the unit of redundancy is a group of nodes. Fault sets are defined inside protection domains. A protection domain establishes the boundaries within which data copies are maintained. A ScaleIO system can have multiple protection domains, and each protection domain can have multiple fault sets.
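The hierarchy just described can be sketched in a few lines of Python; the class and field names below are illustrative, not ScaleIO's API, but they show how the unit of redundancy shifts from a node to a group of nodes once fault sets are defined.

from dataclasses import dataclass, field

@dataclass
class FaultSet:
    name: str
    nodes: list = field(default_factory=list)

@dataclass
class ProtectionDomain:
    name: str
    fault_sets: list = field(default_factory=list)         # groups of nodes
    standalone_nodes: list = field(default_factory=list)   # nodes not in any fault set

    def redundancy_unit(self, node):
        """Return the unit of redundancy that contains this node."""
        for fs in self.fault_sets:
            if node in fs.nodes:
                return fs.name   # with fault sets: the whole group
        return node              # without fault sets: the node itself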
When a system is configured with multiple protection domains and fault sets, it can recover when fault sets in different protection domains fail at the same time. In this configuration, ScaleIO can continue serving data when 40 nodes are lost at once. Today, we will show a scaled-down demo where two nodes fail at once. We'll be working with an eight-node ScaleIO system consisting of Dell PowerEdge 13G servers. All the nodes reside in a single protection domain, Protection Domain 1. There are four fault sets with two nodes each: fault sets A, B, C, and D. Each node uses two 10-gigabit Ethernet ports, and each node has six SSDs and 18 spinning disks. On this system, all the SSDs are in one storage pool and all the spinning disks are in another. A storage pool is an elastic, software-defined collection of physical drives that holds user data.
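For reference, here is the demo layout written out as plain Python data. The node and pool names are placeholders; the counts come straight from the description above (eight nodes, four fault sets of two nodes, six SSDs and 18 spinning disks per node).

protection_domain_1 = {
    "fault_sets": {
        "A": ["node1", "node2"],
        "B": ["node3", "node4"],
        "C": ["node5", "node6"],
        "D": ["node7", "node8"],
    },
    "storage_pools": {
        "ssd_pool": {"drives_per_node": 6,  "media": "SSD"},
        "hdd_pool": {"drives_per_node": 18, "media": "HDD"},
    },
    "network_per_node": "2 x 10 GbE",
}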
If you're new to ScaleIO, consider watching our video on storage pools after you finish this video. With the system under load, we will simulate a rack or chassis failure by halting the hypervisors on two of the nodes at once. Within a few seconds, the remaining nodes in the system will begin a rebuild operation to re-establish data protection. Once the rebuild operation is complete, we will examine the state of the system and wrap up. Here we see the ScaleIO landing page. The ScaleIO system has more than 190 terabytes of raw capacity and is servicing a workload consisting of both reads and writes. The system has a total of 191 drives, consisting of a mix of both SSDs and spinning media, distributed across eight nodes. As we showed in the visualization, all eight nodes are inside the same protection domain, Protection Domain 1.
Here we can see each of the eight Dell PowerEdge 13G nodes that make up the cluster. Notice that the system is divided into four fault sets consisting of two nodes each. In production, each fault set would likely contain all of the nodes in a given rack, or, if third-party blade servers are used, all of the nodes in a given blade chassis. As you can see, the IO load and the capacity are evenly distributed among all the nodes in this protection domain. If you're interested in understanding ScaleIO's even IO and capacity distribution, consider watching our video on storage pools after you finish this video. Today, we will illustrate how ScaleIO can continue serving data even when all of the nodes in a fault set are lost at the same time. We will do this by simulating an unplanned failure that takes down multiple nodes at once, halting the hypervisors running on the first two nodes. After the nodes go down, ScaleIO will begin redirecting user IO to the surviving nodes, and a parallel rebuild operation will begin among the nodes that make up the surviving fault sets.
We've now halted both of the nodes in the first fault set. This video segment is displayed in real time to show the speed with which ScaleIO identifies the failure, redirects user IO, and begins the parallel rebuild operation. The IO has now ramped back up after the ScaleIO cluster redirected the clients to the locations containing redundant data chunks. On the right, the system displays the rebuild activity that occurs as redundancy is re-established. The nodes all work in parallel to re-establish redundancy, but as before, redundant data chunks will not be placed inside the same fault set.
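A rough sketch of that rebuild logic, again in illustrative Python rather than ScaleIO's real implementation: every chunk that lost a copy when a fault set went down is re-mirrored onto a node in a surviving fault set that does not already hold a copy.

import random

def rebuild(chunks, failed_fs, fault_sets):
    """chunks: list of dicts like {"copies": {fault_set_name: node}}."""
    survivors = {fs: nodes for fs, nodes in fault_sets.items() if fs != failed_fs}
    for chunk in chunks:
        chunk["copies"].pop(failed_fs, None)            # drop the lost copy
        if len(chunk["copies"]) < 2:                    # chunk is now degraded
            holding = set(chunk["copies"])              # fault sets that still hold a copy
            target_fs = random.choice([fs for fs in survivors if fs not in holding])
            chunk["copies"][target_fs] = random.choice(survivors[target_fs])
    return chunks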
We will now wait for the rebuild operation to complete. The rebuild is complete. The user IO continued despite the loss of the fault set, and a rebalance operation has evenly distributed the load and capacity across all the surviving members. When the nodes in this fault set are brought back online, the system will rebalance again to make use of all the available capacity. Fault sets can also be used to facilitate planned operations that involve groups of servers by allowing the administrator to place multiple nodes into maintenance mode at the same time. Note that in order to achieve the benefits of fault sets, the system must have enough free space to rebuild the storage pools in the surviving fault sets.
This is why fault sets are an optional feature of ScaleIO; in many deployments, protection domains configured without fault sets are sufficient. Also note that if fault sets are used, a minimum of three fault sets must be defined. Let's wrap up and review the fundamentals of ScaleIO fault sets. Fault sets control units of redundancy inside a protection domain. By default, ScaleIO's mesh mirroring layout works across nodes; when fault sets are defined, ScaleIO's mirrored layout works across fault sets. Fault sets provide resiliency against multiple nodes in a protection domain going down at once.
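Those two prerequisites, at least three fault sets and enough spare capacity to absorb the loss of the largest one, can be expressed as a simple check. The function below is a hypothetical sketch of that reasoning, not a ScaleIO command, and it assumes free capacity is tracked per pool in terabytes.

def can_survive_fault_set_loss(fault_sets, pool_free_tb, used_per_fault_set_tb):
    """Return True if a pool could rebuild after losing any one fault set."""
    if len(fault_sets) < 3:
        return False                      # ScaleIO requires at least three fault sets
    largest = max(used_per_fault_set_tb.values())
    return pool_free_tb >= largest        # room to re-mirror the lost copies

# Example with made-up capacity numbers for the four fault sets A-D.
print(can_survive_fault_set_loss(
    ["A", "B", "C", "D"], pool_free_tb=50,
    used_per_fault_set_tb={"A": 35, "B": 35, "C": 35, "D": 35}))  # True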
Fault sets provide this resiliency by grouping together nodes that are likely to fail together. Fault sets can be used to provide rack-level redundancy in cases where power or network redundancy is an issue. Finally, fault sets can provide chassis-level redundancy when blade servers are in use. ScaleIO fault sets allow an administrator to manage system availability at data center scale, bringing hardware- and location-aware redundancy to enterprise-grade software-defined storage.