Determining the Availability and Reliability of Storage Configurations
Santosh Shetty (August 2002)
Many configurations for storage are available, from direct attach storage to clusters to storage area networks (SANs). Dell analyzes five models to determine which has the best reliability, availability, and mean time between failures (MTBF).
Administrators can implement storage in several ways. Direct attach storage (DAS) is essentially a storage device (such as a hard disk, RAID array, or tape) directly connected to a server by a cable. A storage area network (SAN) is a high-performance network that moves data between heterogeneous servers and storage resources. It enables an any-to-any interconnection of servers and storage systems.
Storage can also be implemented in a cluster, which is a group of two or more servers joined together to minimize the possibility of system failures. When one server in a cluster fails, another server automatically takes over the activities and applications of the failed server. Clustering leads to high performance and high availability.
To compare the availability and reliability of these different storage configurations, Dell conducted a reliability analysis of model systems that incorporated the Dell™ PowerEdge™ 6450 server, the QLogic® QLA 2200 host bus adapter (HBA), the Dell|EMC FC4700 storage array, and the Dell PowerVault™ 56F Fibre Channel switch (when appropriate).
Establishing reliability models
For each configuration, this study calculated the following:
- Reliability: The probability that a system will perform its required functions under stated conditions for a stated period of time at a given confidence level
- MTBF (mean time between failures): An indicator, typically expressed in hours, of how reliable a repairable system is; the higher the MTBF, the more reliable the system
- Availability: The probability that the system will be up and running correctly at a given time; this study estimates inherent availability, which accounts for system operating time and corrective maintenance and excludes downtime associated with preventive maintenance, logistics, and administration
The computations accounted for only the hardware attributes of availability, reliability, and MTBF. As a basis for the calculations, Dell created reliability block diagrams for each configuration with the following assumptions:
- The redundant HBAs connecting the servers to the PowerVault 56F switches are in standby mode.
- Both the simple SAN and high-availability cluster configurations use standby switching, which enables a server to choose between redundant HBAs; these configurations assume perfect switchover.
- Both PowerVault 56F switches are active at any given time. During a switch failure, the other switch takes over the functions of the failed switch. The same assumption holds true for the dual controllers in the FC4700 storage array.
The non-redundant DAS configuration consists of one PowerEdge 6450 server, one HBA, and one FC4700 storage array (see Figure 1). Figure 2 shows the reliability block diagram and the mathematical formula for evaluating the reliability of this configuration.
Figure 1. Non-redundant DAS configuration
Figure 2. Reliability block diagram and equation for non-redundant DAS
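A serial block diagram like this one translates directly into a product of subsystem reliabilities. A minimal Python sketch of that evaluation, using hypothetical reliability values rather than Dell's measured figures:

```python
import math

def serial_reliability(*reliabilities):
    """A serial system works only if every subsystem works:
    R_system = R_1 * R_2 * ... * R_n."""
    return math.prod(reliabilities)

# Hypothetical one-year subsystem reliabilities (placeholders, not Dell data)
r_server, r_hba, r_array = 0.97, 0.99, 0.95
r_das = serial_reliability(r_server, r_hba, r_array)
print(f"Non-redundant DAS reliability: {r_das:.4f}")
```

Because the subsystems are in series, the system is always less reliable than its weakest component.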
The redundant DAS configuration has failover capability because the server contains two HBAs and the storage array contains two controllers (see Figure 3). Figure 4 shows the reliability block diagram and the mathematical formula for evaluating the reliability of this configuration.
Figure 3. Redundant DAS configuration
Figure 4. Reliability block diagram and equation for redundant DAS
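Redundant pairs fold into the same product form. The sketch below models a redundant pair two ways: active-parallel (both units running, as the study assumes for the switches and storage controllers) and cold standby with perfect switching (as it assumes for the HBAs). These are standard reliability-engineering formulas, not values taken from Dell's figures:

```python
import math

def parallel_pair(r):
    """Active redundant pair: fails only if both units fail.
    R = 1 - (1 - r)^2"""
    return 1.0 - (1.0 - r) ** 2

def standby_pair(failure_rate, t):
    """Cold-standby pair with perfect switching and exponential failures:
    R(t) = e^(-lambda*t) * (1 + lambda*t)"""
    return math.exp(-failure_rate * t) * (1.0 + failure_rate * t)

# Hypothetical inputs: a 0.99-reliable unit treated as an active pair,
# and a standby pair with a 1-per-100,000-hour failure rate over one year
print(parallel_pair(0.99))
print(standby_pair(1e-5, 8760))
```

A standby pair with perfect switching always outperforms a single unit of the same failure rate, which is why the redundant DAS beats the non-redundant DAS in every metric.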
This basic cluster is assumed to be established through Microsoft® Cluster Service. It includes two PowerEdge 6450 servers that can fail over to each other, one HBA per server, and one FC4700 storage array with two controllers (see Figure 5). Figure 6 shows the reliability block diagram and the mathematical formula for evaluating the reliability of this configuration.
Figure 5. Basic cluster configuration
Figure 6. Reliability block diagram and equation for basic cluster
The high-availability cluster (see Figure 7), also established through Microsoft Cluster Service, is more redundant than the basic cluster: each server contains two HBAs with failover capability. The configuration includes a storage array that has two controllers and adds two PowerVault 56F switches. Figure 8 shows the reliability block diagram used to derive the mathematical formula for evaluating the reliability of this configuration (see Figure 9).
Figure 7. High-availability cluster configuration
Figure 8. Reliability block diagram for high-availability cluster
Figure 9. Reliability equation for high-availability cluster
The simple SAN configuration shown in Figure 10 assumes that the two servers are not clustered and thus do not fail over to each other. Therefore, the SAN will be down when one server fails. However, the HBAs in each server, the controllers in the storage array, and the switches each fail over to their respective partners. The two PowerVault 56F switches have no interswitch links. Figure 11 shows the reliability block diagram used to derive the mathematical formula for evaluating the reliability of this configuration (see Figure 12).
Figure 10. Simple SAN configuration
Figure 11. Reliability block diagram for simple SAN
Figure 12. Reliability equation for simple SAN
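Under the stated assumptions, the simple SAN composes serial and redundant building blocks: both unclustered servers must be up, while the HBA pairs, switch pair, and controller pair each fail over. A hedged sketch (it treats every redundant pair as active-parallel for simplicity, and the reliability inputs are placeholders, not the values from Figure 13):

```python
def parallel_pair(r):
    """Active redundant pair: fails only if both units fail."""
    return 1.0 - (1.0 - r) ** 2

def simple_san_reliability(r_server, r_hba, r_switch, r_controller):
    # Both unclustered servers must survive, each with a redundant HBA pair
    server_paths = (r_server * parallel_pair(r_hba)) ** 2
    # The switches and storage controllers each fail over to a partner
    return server_paths * parallel_pair(r_switch) * parallel_pair(r_controller)

# Placeholder one-year reliabilities
print(simple_san_reliability(0.97, 0.99, 0.995, 0.99))
```

The squared server term is what drags the simple SAN's reliability down: redundancy everywhere else cannot compensate for two single points of failure in series.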
Determining subsystem reliability
To calculate the subsystem reliability, Dell used MTBF and availability estimates for the individual subsystems, or components (see Figure 13 ). The MTBF estimates were obtained from the Dell Quality Group. The availability estimates include mean time to repair (MTTR), logistics time, downtime, and administrative time. The availability of the QLA 2200 HBA was assumed to be 0.99996 because the target MTTR for HBAs is eight hours. Dell also assumed that the times to failure of all the subsystems (PowerEdge 6450, PowerVault 56F, Dell|EMC FC4700, and QLA 2200) follow an exponential distribution.
Figure 13. Estimated availability and MTBF for individual subsystems
First, the availability and MTBF figures were used to calculate the MTTR for each subsystem:

MTTR = MTBF × (1 − A) / A

where A is the availability of an individual subsystem and MTBF is the mean time between failures of an individual subsystem.
The MTTR for the entire system is

MTTR_system = Σ(λ_i × t_i) / Σλ_i,  i = 1, …, n

where n is the number of subsystems, λ_i is the failure rate of the i-th unit (the failure rate for each subsystem is 1/MTBF), and t_i is the time to repair the i-th unit.
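Both MTTR computations can be sketched in Python: subsystem_mttr inverts the inherent-availability relation A = MTBF / (MTBF + MTTR), and system_mttr takes the failure-rate-weighted average of the subsystem repair times. The numbers below are illustrative, not the Figure 13 estimates:

```python
def subsystem_mttr(availability, mtbf_hours):
    """Invert A = MTBF / (MTBF + MTTR):  MTTR = MTBF * (1 - A) / A."""
    return mtbf_hours * (1.0 - availability) / availability

def system_mttr(mtbf_hours_list, mttr_hours_list):
    """Failure-rate-weighted average repair time:
    sum(lambda_i * t_i) / sum(lambda_i), with lambda_i = 1 / MTBF_i."""
    rates = [1.0 / m for m in mtbf_hours_list]
    weighted = sum(r * t for r, t in zip(rates, mttr_hours_list))
    return weighted / sum(rates)

# An HBA with availability 0.99996 and a roughly 200,000-hour MTBF
# (placeholder) implies an MTTR near the eight-hour target:
print(subsystem_mttr(0.99996, 200000))
```

Weighting by failure rate means subsystems that fail more often contribute more to the system-level repair time, as they should.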
The results shown in Figure 14 were used to determine the system-level MTTR values: 5.05 hours for one configuration group and 4.80 hours for the other.
Figure 14. Intermediate results used to determine subsystem reliability
Subsystem reliability is given as R_i(t) = e^(−λ_i t). For each subsystem, mission reliability was calculated for mission times of one, two, and three years (8,760, 17,520, and 26,280 hours, respectively). Figure 15 shows the results.
Figure 15. Mission reliability of subsystems
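Under the exponential assumption, each subsystem's mission reliability follows directly from its MTBF. A short sketch with a placeholder MTBF value:

```python
import math

def mission_reliability(mtbf_hours, mission_hours):
    """Exponential time-to-failure model: R(t) = e^(-t / MTBF)."""
    return math.exp(-mission_hours / mtbf_hours)

# One-, two-, and three-year missions for a hypothetical 300,000-hour MTBF
for years, hours in [(1, 8760), (2, 17520), (3, 26280)]:
    print(years, round(mission_reliability(300000, hours), 4))
```

Because the model is memoryless, reliability decays by the same factor each additional year.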
Calculating system reliability, MTBF, and availability
Dell evaluated the system reliability R_system for each configuration by applying the appropriate subsystem reliabilities to the equations shown in Figures 2, 4, 6, 9, and 12. System reliability was used to calculate system MTBF:

MTBF_system = −t / ln(R_system)

where t is the mission time in hours. The inherent system availability is then

A_system = MTBF_system / (MTBF_system + MTTR_system)
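Those two steps (inverting the exponential reliability to recover system MTBF, then combining it with system MTTR for inherent availability) can be sketched as follows, with hypothetical inputs rather than the study's results:

```python
import math

def system_mtbf(reliability, mission_hours):
    """Invert R = e^(-t / MTBF):  MTBF = -t / ln(R)."""
    return -mission_hours / math.log(reliability)

def inherent_availability(mtbf_hours, mttr_hours):
    """A = MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

# Hypothetical: a system with one-year reliability 0.95 and a 5-hour MTTR
mtbf = system_mtbf(0.95, 8760)
print(round(mtbf), round(inherent_availability(mtbf, 5.0), 6))
```

Note that a modest one-year reliability can still correspond to a large MTBF and very high availability, because a short MTTR keeps downtime small even when failures occur.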
Figures 16, 17, and 18 show the results of the MTBF, availability, and reliability calculations for the different configurations.
Figure 16. Comparison of MTBF over three years
Figure 17. Comparison of availability over three years
Figure 18. Comparison of reliability over three years
Clustering and redundancy: Improving reliability
As expected, reliability declines from the first year to the third year for all the configurations. The non-redundant DAS and the simple SAN configurations exhibit the lowest MTBF, availability, and reliability because these configurations are serial in nature. In the non-redundant DAS configuration, any component failure would result in a system failure. Because of its built-in redundancy and clustering, the high-availability cluster configuration demonstrates the highest MTBF, availability, and reliability over a three-year period.
This analysis shows that incorporating clustering or redundancy into the constituent components of complex network configurations, such as a SAN, not only improves MTBF but also increases the availability and reliability of the configuration. These types of configurations can minimize the amount of downtime experienced each year, which translates into significant cost savings.
Santosh Shetty (email@example.com) is a development engineer advisor on the Reliability Engineering team in the Dell Enterprise Systems Group (ESG). Santosh is currently working on a variety of platforms to assist the product development team. He is also working on the hard disk drive qualification process. Santosh has experience as a quality assurance engineer at Bharti Duraline Limited in India. He holds a B.E. in Mechanical Engineering from Goa Engineering College and an M.S. in Industrial Engineering from the University of Arizona.