Protection domains allow a ScaleIO system to tolerate simultaneous failures by dividing a cluster into smaller elements. Protection domains are more fluid, balanced, and scalable than RAID. Each protection domain in a ScaleIO system contains redundant copies of user data. Data protection is distributed between nodes inside a protection domain to prevent a single point of failure. This allows the ScaleIO system to continue operating when a component is lost. When a node or drive is lost, ScaleIO automatically rebuilds data protection for the lost device. It does this by creating copies of data chunks in new locations across the protection domain. This process re-establishes data protection for those chunks that lost redundancy when the node or device went down.
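As a rough, hedged sketch of that idea (not ScaleIO's actual implementation or API, and with invented names throughout), the snippet below re-protects every chunk whose replica lived on a failed node by recording a new copy on another node in the same protection domain.

```python
# Illustrative sketch only -- hypothetical names, not ScaleIO's internal algorithm.
from dataclasses import dataclass, field

@dataclass
class Chunk:
    chunk_id: int
    replicas: set = field(default_factory=set)  # node names holding a copy of this chunk

def rebuild_protection(chunks, failed_node, surviving_nodes):
    """Re-create a replica for every chunk that lost its copy on failed_node."""
    for chunk in chunks:
        if failed_node not in chunk.replicas:
            continue
        chunk.replicas.discard(failed_node)
        # Choose a new home on any surviving node that does not already hold the chunk,
        # so the two copies never land on the same node.
        target = next(n for n in surviving_nodes if n not in chunk.replicas)
        chunk.replicas.add(target)  # in a real system, the chunk data would be copied here

# Example: chunk 7 was mirrored on node-1 and node-3; node-3 fails.
chunks = [Chunk(7, {"node-1", "node-3"})]
rebuild_protection(chunks, failed_node="node-3",
                   surviving_nodes=["node-1", "node-2", "node-4"])
print(chunks[0].replicas)  # two copies again -- protection is restored
```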
In large deployments, protection domains allow the system to survive the loss of multiple nodes at the same time. This is important because ScaleIO can grow to thousands of nodes. A protection domain can have as many as 128 nodes. Protection domains can also provide performance isolation and data location control for multi-tenancy. Today, we will look at protection domains on a small ScaleIO cluster. Let's take a look at the demo system layout.
This video features an eight-node ScaleIO cluster divided into two protection domains: Protection Domain One and Protection Domain Two. Protection Domain One consists of all-flash nodes and is used for performance-oriented tenants and applications. Protection Domain Two consists of nodes containing only spinning media and is used for capacity-intensive applications. During this video, we will simulate a hardware failure in the first protection domain by halting the hypervisor running on one of the nodes. While the system is recovering from that failure, we will do the same thing on a node in the second protection domain. While the ScaleIO system is recovering from these failures, we will examine the massively parallel reconstruct operation that re-establishes data protection in each of the protection domains.
The ScaleIO dashboard shows the total capacity available to the cluster, the number of protection domains, and the number of storage pools. Storage pools are elastic allocations of storage performance and capacity inside of a protection domain. As we saw in the diagram, this system contains two protection domains. Let's take a look at them. Each protection domain is made up of standard x86-64 servers running ESX, and data is balanced and mirrored between these servers. ScaleIO's mirroring algorithm will not place redundant copies of data on the same node. Data is mirrored within a protection domain; data is not mirrored between separate protection domains. Let's take a look at the storage media that resides on each of these nodes. This cluster uses all-flash nodes in Protection Domain One and spinning media in Protection Domain Two.
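A minimal sketch of that placement rule, assuming two-copy mirroring; the node and domain names are invented for illustration, and this is not ScaleIO code.

```python
# Illustrative placement check -- hypothetical names, not a ScaleIO API.
def valid_mirror_placement(copy_a, copy_b, node_to_domain):
    """Two copies of a chunk are valid only if they sit on different nodes
    within the same protection domain."""
    same_node = copy_a == copy_b
    same_domain = node_to_domain[copy_a] == node_to_domain[copy_b]
    return (not same_node) and same_domain

node_to_domain = {
    "node-1": "PD-1", "node-2": "PD-1", "node-3": "PD-1",
    "node-6": "PD-2", "node-7": "PD-2", "node-8": "PD-2",
}

print(valid_mirror_placement("node-1", "node-2", node_to_domain))  # True: different nodes, same domain
print(valid_mirror_placement("node-1", "node-1", node_to_domain))  # False: both copies on one node
print(valid_mirror_placement("node-1", "node-7", node_to_domain))  # False: copies span protection domains
```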
A ScaleIO cluster may also contain hybrid nodes. Storage pools are collections of physical drives inside a protection domain that can be sized and resized dynamically based on business requirements. The first protection domain contains two separate storage pools; the second protection domain contains a single storage pool. Protection domains manage the protection and availability of storage pools, and both protection domains and storage pools can expand or shrink as needed. In the next step, we will simulate an unplanned failure of one of the nodes in the first protection domain to show a reconstruct operation. To create this failure, we're going to halt the ESX instance running on one of these nodes. ScaleIO is now aware that the node went down; this equates to six SSDs going offline at once.
This has triggered a parallelized rebuild operation among the surviving members of the protection domain. During this rebuild operation, all of the nodes in the protection domain are rebuilding to all of the other nodes in the protection domain. The individual disks in each of these nodes are also working in parallel to speed up the operation. In this demo, we took an entire node offline, but a single device failure within a node is rebuilt in a similar fashion. Because the rebuild operation is parallelized, the system works quickly to rebuild data protection. Protection Domain Two is unaffected by the loss of the node that went down; it remains in a state where no rebuild is currently required. Its data protection relationships exist only within the last three nodes of this cluster. Protection Domain Two can survive the loss of one of its nodes while data protection is being re-established in Protection Domain One.
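To give a feel for the many-to-many nature of that rebuild, here is a rough sketch that spreads lost chunks across every surviving node so each one both reads and writes rebuild traffic in parallel. The round-robin pairing and node names are invented for illustration and ignore which node actually holds the surviving copy, so this is a simplification, not ScaleIO's real rebuild scheduler.

```python
# Rough illustration of a many-to-many rebuild plan -- not ScaleIO's scheduler.
from itertools import cycle

def plan_rebuild(lost_chunks, surviving_nodes):
    """Assign each lost chunk a (source, target) pair so that every surviving
    node participates in the rebuild at the same time."""
    sources = cycle(surviving_nodes)
    targets = cycle(surviving_nodes[1:] + surviving_nodes[:1])  # offset so source != target
    return [(chunk, next(sources), next(targets)) for chunk in lost_chunks]

surviving = ["node-1", "node-2", "node-4", "node-5"]
for chunk, src, dst in plan_rebuild(range(8), surviving):
    print(f"chunk {chunk}: copy from {src} to {dst}")
```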
To illustrate this, I will halt the ESX instance running on a node in Protection Domain Two. A parallelized rebuild operation is now occurring among the surviving members of Protection Domain Two. We will now wait for this operation to complete. The system has rebuilt data protection and rebalanced IOPS and capacity utilization in the storage pools that house the user data. The rebuild operation is fast and automated. As nodes are added to a protection domain, rebuilds get faster because there are more nodes working in parallel to rebuild the data protection relationships. In this example, we allowed the ESX nodes to remain down for the duration of both rebuild operations.
If a node comes back up during the rebuild operation, ScaleIO can opt to allow the data on that node to remain in place, reducing the amount of time required to complete the rebuild. A rebuild operation following an unplanned but brief outage is called a backward rebuild. The rebuild operation shown in this video is called a forward rebuild. Because Protection Domain One contains more than two surviving nodes and sufficient free space, it can rebuild data protection again if another node is lost. This flexibility is a significant differentiator versus more traditional hardware-based approaches to data protection. Let's bring both ESX nodes back up, allowing ScaleIO to add the nodes back into the protection domains. We have brought both downed ESX servers back online, allowing ScaleIO to use their storage resources again. ScaleIO will maintain data protection during the independent rebalance operations.
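As a hedged illustration of the forward versus backward rebuild distinction described above (the function, parameters, and figures are invented, not ScaleIO logic): a backward rebuild resynchronizes only the writes the returning node missed, while a forward rebuild re-creates full copies of everything that node held.

```python
# Simplified illustration of choosing a rebuild style -- not ScaleIO's actual logic.
def choose_rebuild(node_returned, missed_writes, total_chunks_on_node):
    """Backward rebuild: the node is back, so resync only the chunks written while
    it was offline. Forward rebuild: the node is still gone, so re-create full
    copies of its chunks on the surviving nodes."""
    if node_returned:
        return ("backward rebuild", missed_writes)        # resync just the delta
    return ("forward rebuild", total_chunks_on_node)      # rebuild everything it held

print(choose_rebuild(node_returned=True,  missed_writes=1_200, total_chunks_on_node=500_000))
print(choose_rebuild(node_returned=False, missed_writes=0,     total_chunks_on_node=500_000))
```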
We will now wait for the rebalance operations to complete. Both nodes have been brought back online, and the system has rebalanced so all the available performance and capacity can be utilized. We just showed how ScaleIO rebuilds data protection when a node is lost, described how the system returns to a state where data protection can be rebuilt again if another node fails, and discussed how ScaleIO can rebuild even more rapidly if a node goes offline only briefly. We showed how faults are isolated inside of a protection domain, and we showed how a ScaleIO system can continue operating with simultaneous failures in separate protection domains.
To wrap up, let's summarize the advantages and use cases of protection domains. Protection domains serve as data protection boundaries, allowing a cluster to survive multiple simultaneous failures. When a node or device goes offline, the system rebuilds using a massively parallel operation between surviving members of the protection domain. In addition to the ScaleIO quality-of-service implementation, protection domains can provide tenant performance isolation. Finally, protection domains can be used to provide data location control for multi-tenancy. Protection domains in ScaleIO provide a simple, automated approach to data protection and bring enterprise-grade data protection to software-defined storage.