Dell Power Solutions

Determining the Availability and Reliability of Storage Configurations

Santosh Shetty (August 2002)

Many configurations for storage are available, from direct attach storage to clusters to storage area networks (SANs). Dell analyzes five models to determine which has the best reliability, availability, and mean time between failures (MTBF).

Administrators can implement storage in several ways. Direct attach storage (DAS) is essentially a storage device (such as a hard disk, RAID array, or tape) directly connected to a server by a cable. A storage area network (SAN) is a high-performance network that moves data between heterogeneous servers and storage resources. It enables an any-to-any interconnection of servers and storage systems.

Storage can also be implemented in a cluster, which is a group of two or more servers joined together to minimize the possibility of system failures. When one server in a cluster fails, another server automatically takes over the activities and applications of the failed server. Clustering leads to high performance and high availability.

To compare the availability and reliability of these different storage configurations, Dell conducted a reliability analysis of model systems that incorporated the Dell™ PowerEdge™ 6450 server, the QLogic® QLA 2200 host bus adapter (HBA), the Dell|EMC FC4700 storage array, and the Dell PowerVault™ 56F Fibre Channel switch (when appropriate).

Establishing reliability models

For each configuration, this study calculated the following:

  • Reliability:  The probability that a system will perform its required functions under stated conditions for a stated period of time at a given confidence level
  • MTBF (mean time between failures):  An indicator, typically expressed in hours, of how reliable a repairable system is; the higher the MTBF, the more reliable the system
  • Availability:  The probability that the system will be up and running correctly at a given time; this study estimates inherent availability, which accounts for system operating time and corrective maintenance and excludes downtime associated with preventive maintenance, logistics, and administration

The computations accounted for only the hardware attributes of availability, reliability, and MTBF. As a basis for the calculations, Dell created reliability block diagrams for each configuration with the following assumptions:

  • The redundant HBAs connecting the servers to the PowerVault 56F switches are in standby mode.
  • Both the simple SAN and high-availability cluster configurations contain standby switches, which enable a server to choose between redundant HBAs; these configurations assume perfect switching.
  • Both PowerVault 56F switches are active at any given time. During a switch failure, the other switch takes over the functions of the failed switch. The same assumption holds true for the dual controllers in the FC4700 storage array.

Non-redundant DAS
The non-redundant DAS configuration consists of one PowerEdge 6450 server, one HBA, and one FC4700 storage array (see Figure 1). Figure 2 shows the reliability block diagram and the mathematical formula for evaluating the reliability of this configuration.

Figure 1. Non-redundant DAS configuration

Figure 2. Reliability block diagram and equation for non-redundant DAS
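
The equation in Figure 2 is not reproduced in this text. For a purely serial block diagram, the system works only if every block works, so the system reliability is the product of the block reliabilities. A minimal Python sketch, assuming independent failures; the function name and the sample values are illustrative and are not figures from the study:

```python
from math import prod

def series_reliability(block_reliabilities):
    """Reliability of blocks in series: every block must survive the mission."""
    return prod(block_reliabilities)

# Hypothetical one-year reliabilities for the server, HBA, and storage array
# (placeholders, not values from Figure 15).
r_server, r_hba, r_array = 0.90, 0.96, 0.92
r_system = series_reliability([r_server, r_hba, r_array])
print(f"Non-redundant DAS reliability: {r_system:.4f}")
```

Because every factor is below 1, the product is always lower than the reliability of the weakest block.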

Redundant DAS
The redundant DAS configuration has failover capability because the server contains two HBAs and the storage array contains two controllers (see Figure 3). Figure 4 shows the reliability block diagram and the mathematical formula for evaluating the reliability of this configuration.

Figure 3. Redundant DAS configuration

Figure 4. Reliability block diagram and equation for redundant DAS
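
The equation in Figure 4 is likewise not reproduced here. The sketch below shows the two textbook redundancy building blocks that correspond to the stated assumptions: active parallel (both units running, as assumed for the FC4700 controllers) and standby with perfect switching (as assumed for the redundant HBAs). It assumes independent, identical units with exponentially distributed times to failure, and that an idle standby unit cannot fail while waiting. How Dell actually combined the blocks in Figure 4 is not stated in the text, and all function names and numbers are hypothetical.

```python
import math

def active_parallel(r_a, r_b):
    """Two active units in parallel: the pair fails only if both units fail."""
    return 1.0 - (1.0 - r_a) * (1.0 - r_b)

def standby_pair(mtbf_hours, t_hours):
    """Two identical units, one in standby, with perfect switching.

    Assumes exponential times to failure and no failures while idle:
    R(t) = exp(-lambda * t) * (1 + lambda * t), with lambda = 1 / MTBF.
    """
    lam_t = t_hours / mtbf_hours
    return math.exp(-lam_t) * (1.0 + lam_t)

# Hypothetical unit: MTBF of 200,000 hours over a one-year mission.
t = 8_760
single = math.exp(-t / 200_000)
print(f"single unit:     {single:.5f}")
print(f"active parallel: {active_parallel(single, single):.5f}")
print(f"standby pair:    {standby_pair(200_000, t):.5f}")
```

Under these assumptions, either form of redundancy is more reliable than a single unit, and the standby pair is at least as reliable as the active parallel pair.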

Basic cluster
This basic cluster is assumed to be established through Microsoft® Cluster Service. It includes two PowerEdge 6450 servers that can fail over to each other, one HBA per server, and one FC4700 storage array with two controllers (see Figure 5). Figure 6 shows the reliability block diagram and the mathematical formula for evaluating the reliability of this configuration.

Figure 5. Basic cluster configuration

Figure 6. Reliability block diagram and equation for basic cluster

High-availability cluster
The high-availability cluster (see Figure 7), also established through Microsoft Cluster Service, is more redundant than the basic cluster: each server contains two HBAs with failover capability. The configuration includes a storage array that has two controllers and adds two PowerVault 56F switches. Figure 8 shows the reliability block diagram used to derive the mathematical formula for evaluating the reliability of this configuration (see Figure 9).

Figure 7. High-availability cluster configuration

Figure 8. Reliability block diagram for high-availability cluster

Figure 9. Reliability equation for high-availability cluster

Simple SAN
The simple SAN configuration shown in Figure 10 assumes that the two servers are not clustered and thus do not fail over to each other. Therefore, the SAN will be down when one server fails. However, the HBAs in each server, the controllers in the storage array, and the switches each fail over to their respective partners. The two PowerVault 56F switches have no interswitch links. Figure 11 shows the reliability block diagram used to derive the mathematical formula for evaluating the reliability of this configuration (see Figure 12).

Figure 10. Simple SAN configuration

Figure 11. Reliability block diagram for simple SAN

Figure 12. Reliability equation for simple SAN

Determining subsystem reliability

To calculate the subsystem reliability, Dell used MTBF and availability estimates for the individual subsystems, or components (see Figure 13). The MTBF estimates were obtained from the Dell Quality Group. The availability estimates include mean time to repair (MTTR), logistics time, downtime, and administrative time. The availability of the QLA 2200 HBA was assumed to be 0.99996 because the target MTTR for HBAs is eight hours. Dell also assumed that the times to failure of all the subsystems (PowerEdge 6450, PowerVault 56F, Dell|EMC FC4700, and QLA 2200) follow an exponential distribution.

Figure 13. Estimated availability and MTBF for individual subsystems

First, the availability and MTBF figures were used to calculate the MTTR for each subsystem:

$$\mathrm{MTTR} = \mathrm{MTBF} \cdot \frac{1 - A}{A}$$

where $A$ is the availability of an individual subsystem and MTBF is the mean time between failures of an individual subsystem.
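
As a sanity check on this step, the sketch below rearranges the standard inherent-availability definition, A = MTBF / (MTBF + MTTR), in both directions. Plugging in the HBA figures quoted above (A = 0.99996, target MTTR of 8 hours) gives the MTBF that those two numbers imply; that implied MTBF is derived here for illustration and is not read from Figure 13. The function names are hypothetical.

```python
def mttr_from_availability(mtbf_hours, availability):
    """Solve A = MTBF / (MTBF + MTTR) for MTTR."""
    return mtbf_hours * (1.0 - availability) / availability

def mtbf_from_availability(mttr_hours, availability):
    """Solve A = MTBF / (MTBF + MTTR) for MTBF."""
    return mttr_hours * availability / (1.0 - availability)

# HBA figures quoted in the article: A = 0.99996, target MTTR = 8 hours.
implied_mtbf = mtbf_from_availability(8.0, 0.99996)
print(f"Implied HBA MTBF: {implied_mtbf:,.0f} hours")
print(f"Round-trip MTTR:  {mttr_from_availability(implied_mtbf, 0.99996):.2f} hours")
```

With these inputs the implied MTBF comes out just under 200,000 hours, which shows that the quoted availability and the eight-hour repair target are mutually consistent.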

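The MTTR for the entire system is

$$\mathrm{MTTR}_{\mathrm{system}} = \frac{\sum_{i=1}^{n} \lambda_i t_i}{\sum_{i=1}^{n} \lambda_i}$$

where $n$ is the number of subsystems, $\lambda_i$ is the failure rate of the $i$th unit (the failure rate for each subsystem is 1/MTBF), and $t_i$ is the time to repair the $i$th unit.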
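
The weighting described here can be sketched in a few lines of Python: each subsystem's repair time is weighted by its failure rate, so units that fail more often dominate the system MTTR. The subsystem values below are hypothetical placeholders, not the data from Figure 13 or Figure 14, and the function name is illustrative.

```python
def system_mttr(subsystems):
    """Failure-rate-weighted mean time to repair.

    subsystems: sequence of (mtbf_hours, mttr_hours) pairs, one per unit.
    """
    subsystems = list(subsystems)
    rates = [1.0 / mtbf for mtbf, _ in subsystems]  # lambda_i = 1 / MTBF_i
    weighted = sum(lam * mttr for lam, (_, mttr) in zip(rates, subsystems))
    return weighted / sum(rates)

# Hypothetical (MTBF, MTTR) pairs in hours.
units = [(50_000, 4.0), (200_000, 8.0), (100_000, 6.0)]
print(f"System MTTR: {system_mttr(units):.2f} hours")
```

If every unit has the same repair time, the weighted average reduces to that repair time regardless of the failure rates.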

The results shown in Figure 14 were used to determine that MTTR is 5.05 hours and MTTR is 4.80 hours.

Figure 14. Intermediate results used to determine subsystem reliability

Subsystem reliability is given as $R_i(t) = e^{-\lambda_i t}$. For each subsystem, mission reliability was calculated for mission times of one, two, and three years (8,760, 17,520, and 26,280 hours, respectively). Figure 15 shows the results.
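
For a subsystem with a constant failure rate, the mission reliability follows directly from its MTBF. The sketch below evaluates it at the three mission times used in the study; the 200,000-hour MTBF is a hypothetical input, not a value from Figure 13.

```python
import math

MISSION_HOURS = {"1 year": 8_760, "2 years": 17_520, "3 years": 26_280}

def mission_reliability(mtbf_hours, t_hours):
    """R(t) = exp(-lambda * t) with lambda = 1 / MTBF."""
    return math.exp(-t_hours / mtbf_hours)

# Hypothetical subsystem MTBF in hours.
for label, hours in MISSION_HOURS.items():
    print(f"{label}: {mission_reliability(200_000, hours):.4f}")
```

The printed values fall as the mission time grows, which is the same downward trend the study reports across its one-, two-, and three-year results.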

Figure 15. Mission reliability of subsystems

Calculating system reliability, MTBF, and availability

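Dell evaluated the system reliability $R$ for each configuration by applying the appropriate subsystem reliabilities to the equations shown in Figures 2, 4, 6, 9, and 12. System reliability was used to calculate system MTBF:

$$\mathrm{MTBF}_{\mathrm{system}} = \frac{-t}{\ln R(t)}$$

where $t$ is the mission time in hours. For the inherent system availability,

$$A_{\mathrm{system}} = \frac{\mathrm{MTBF}_{\mathrm{system}}}{\mathrm{MTBF}_{\mathrm{system}} + \mathrm{MTTR}_{\mathrm{system}}}$$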

Figures 16, 17, and 18 show the results of the MTBF, availability, and reliability calculations for the different configurations.
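
A minimal sketch of these last two steps, under two assumptions: that the system is treated as having a constant equivalent failure rate over the mission, so that R(t) = exp(-t / MTBF) can be inverted for MTBF, and that inherent availability follows the standard definition A = MTBF / (MTBF + MTTR). The system reliability used below is hypothetical; the 5.05-hour MTTR is borrowed from the value quoted earlier purely as an example input.

```python
import math

def system_mtbf(reliability, t_hours):
    """Equivalent MTBF from mission reliability: MTBF = -t / ln(R)."""
    if not 0.0 < reliability < 1.0:
        raise ValueError("reliability must be strictly between 0 and 1")
    return -t_hours / math.log(reliability)

def inherent_availability(mtbf_hours, mttr_hours):
    """A = MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

r_one_year = 0.80  # hypothetical one-year system reliability
mtbf = system_mtbf(r_one_year, 8_760)
print(f"System MTBF:  {mtbf:,.0f} hours")
print(f"Availability: {inherent_availability(mtbf, 5.05):.6f}")
```

For a redundant system the equivalent MTBF computed this way depends on the mission time chosen, which would be consistent with the study reporting MTBF separately for each of the three years.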

Figure 16. Comparison of MTBF over three years

Figure 17. Comparison of availability over three years

Figure 18. Comparison of reliability over three years

Clustering and redundancy: Improving reliability

As expected, reliability declines from the first year to the third year for all the configurations. The non-redundant DAS and the simple SAN configurations exhibit the lowest MTBF, availability, and reliability because these configurations are serial in nature. In the non-redundant DAS configuration, any component failure would result in a system failure. Because of its built-in redundancy and clustering, the high-availability cluster configuration demonstrates the highest MTBF, availability, and reliability over a three-year period.

This analysis shows that incorporating clustering or redundancy into the constituent components of complex network configurations, such as a SAN, not only improves MTBF but also increases the availability and reliability of the configuration. These types of configurations can minimize the amount of downtime experienced each year, which translates into significant cost savings.

Santosh Shetty (santosh_shetty@dell.com) is a development engineer advisor on the Reliability Engineering team in the Dell Enterprise Systems Group (ESG). Santosh is currently working on a variety of platforms to assist the product development team. He is also working on the hard disk drive qualification process. Santosh has experience as a quality assurance engineer at Bharti Duraline Limited in India. He holds a B.E. in Mechanical Engineering from Goa Engineering College and an M.S. in Industrial Engineering from the University of Arizona.
