Storage Area Networks: Providing High-Availability Storage for Cluster Environments
By Erik Ottem (March 2002)
As clustering grows in popularity, so does the need for a storage system that matches the high availability of clusters. Storage area networks (SANs) supply highly available storage for such environments. Configuring these SANs with high-speed switches, such as the Gadzoox® Networks 2 Gb Slingshot™ 4210, allows administrators to build in different levels of redundancy and further ensure high availability.
Storage area networks (SANs) are an ideal way to meet the basic storage requirements of high-availability clusters. Clusters consist of multiple computers grouped together to perform the same functions. When servers are clustered, they must have access to the same data if they are to work together to keep an application up and running. Separating the data from the clustered servers by placing it in a SAN provides a central location for shared data and simplifies the process of attaching, expanding, and reallocating storage among multiple servers.
Using Fibre Channel switches and hubs for redundant access paths, SANs also increase overall availability. Any node on the SAN can be connected or disconnected without disrupting service to other nodes. The Gadzoox® Networks 2 Gb Slingshot™ 4210 switch is a Fibre Channel switch specifically designed to address SAN connectivity and to enhance storage availability, ease of use, manageability, and serviceability for clustered environments.
Examining common cluster implementations
Methods for clustering servers vary. One common clustering implementation is the use of a hot spare. In this implementation, two or more servers have access to the same set of data, but only one server is active and in control of the data. If the server with active control of the data goes down, the other server, the hot spare, can access the data and take over service. The only change end users should see is a period of inaccessibility while the hot spare takes over, starts the application, and accesses the information on the SAN. Depending on the design of the front-end application, this interruption can range from a slowdown in response time to an unrecoverable error requiring the application to be restarted.
In Microsoft® Windows® clusters, each application has a specified failover server. If a problem occurs on the initial server, the application will fail over to the alternate server. Administrators may need to restart the application, but the storage for the application persists in the SAN.
In another type of cluster, the standby server does not simply wait for the primary server to fail. Instead, it operates as a standard server, providing access to a different application or function. These applications may be shut down during the failover process if they are of a lower priority. If the servers are in an active/active configuration, the applications will continue running. In an active/active configuration, two servers monitor one another for failure with a heartbeat connection. If a failure occurs, either server can stand in for the failed server while still supporting its own functions.
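The active/active heartbeat logic described above can be sketched in a few lines of Python. This is a minimal illustration, not any cluster product's actual implementation; the node names, timeout value, and takeover behavior are all assumptions:

```python
import time

HEARTBEAT_TIMEOUT = 3.0  # seconds without a heartbeat before declaring failure (assumed value)

class ClusterNode:
    """Hypothetical active/active node: it runs its own service and
    monitors its peer over a heartbeat connection."""

    def __init__(self, name):
        self.name = name
        self.services = {name}          # each node starts with its own workload
        self.last_peer_heartbeat = time.monotonic()

    def receive_heartbeat(self):
        self.last_peer_heartbeat = time.monotonic()

    def check_peer(self, peer, now=None):
        """If the peer has missed its heartbeat window, stand in for it
        while continuing to support this node's own functions."""
        now = time.monotonic() if now is None else now
        if now - self.last_peer_heartbeat > HEARTBEAT_TIMEOUT:
            self.services |= peer.services   # take over the failed peer's workload
        return self.services

# Example: node B stops sending heartbeats, so A absorbs B's services
# while continuing to run its own.
a, b = ClusterNode("A"), ClusterNode("B")
a.receive_heartbeat()                            # B is alive at this point
stale = time.monotonic() + HEARTBEAT_TIMEOUT + 1  # simulate a missed window
print(sorted(a.check_peer(b, now=stale)))        # ['A', 'B']
```

Note that node A keeps serving its own application throughout; only the failed peer's workload moves.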
Building SANs to support high-availability clusters
Building a SAN to support a high-availability cluster involves three steps: clustering the servers, connecting the SAN switches, and integrating the storage system.
The first step for creating high availability in a SAN begins with the servers. Ideally, two Fibre Channel host bus adapters (HBAs) should be connected to each server (see Figure 1). Administrators can choose among several connection options, such as a fixed port configuration, a Gigabit Interface Converter (GBIC), or a Small Form-factor Pluggable (SFP) module for copper or optical capability.
Figure 1. Dual HBAs for alternate path failover
Administrators should choose an HBA that provides hot standby, load sharing, or load balancing capabilities. Hot standby for an HBA is similar to hot standby for a full server. If the primary HBA cannot access the data, the secondary HBA will take over and provide access to the data. Load sharing, the next level of high-availability support, allows both HBAs to be active at the same time, providing two paths to access the data. Neither card, however, can take over the other's traffic if one fails. Load balancing, the optimal HBA configuration, enables each HBA to operate simultaneously and still stand in for the other if a failure occurs.
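The difference between the three modes can be illustrated with a short Python sketch. The mode names and path-selection rules below are simplified assumptions about multipath driver behavior, not any vendor's actual driver logic:

```python
class HbaMultipath:
    """Hypothetical sketch of the three HBA availability modes.
    paths maps an HBA name to its up/down state."""

    def __init__(self, mode, paths):
        self.mode = mode    # "hot_standby" | "load_sharing" | "load_balancing"
        self.paths = paths
        self.rr = 0         # round-robin counter for load balancing

    def pick_path(self, lun=0):
        up = [h for h, ok in self.paths.items() if ok]
        if not up:
            raise IOError("no path to storage")
        if self.mode == "hot_standby":
            return up[0]    # secondary is used only after the primary fails
        if self.mode == "load_sharing":
            # each LUN is pinned to one HBA; no takeover if that HBA fails
            owner = list(self.paths)[lun % len(self.paths)]
            if not self.paths[owner]:
                raise IOError("owning HBA %s is down" % owner)
            return owner
        # load balancing: both HBAs active, and either stands in for the other
        self.rr += 1
        return up[self.rr % len(up)]

mp = HbaMultipath("hot_standby", {"hba0": True, "hba1": True})
print(mp.pick_path())           # hba0 is the active primary
mp.paths["hba0"] = False
print(mp.pick_path())           # the standby hba1 takes over
```

In load-sharing mode the same call would raise an error for any LUN owned by the failed HBA, which is exactly the limitation load balancing removes.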
Connecting SANs and clusters with switch configurations
The next step in a highly available SAN configuration involves connecting the SAN switches. Administrators should connect each HBA from the server to a switch. The switch then uses zoning to enable the sharing of storage resources and to ensure that each application has access only to its authorized storage.
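Conceptually, a zone is simply a set of ports permitted to see one another. The sketch below, using hypothetical zone and port names, shows how zoning restricts each application to its authorized storage:

```python
# Zone table: each zone names the HBA and storage ports allowed to
# communicate with each other (names are illustrative assumptions).
zones = {
    "sql_zone":      {"server1_hba0", "array_port0"},
    "exchange_zone": {"server2_hba0", "array_port1"},
}

def can_access(initiator, target):
    """A pair of ports can communicate only if some zone contains both."""
    return any(initiator in z and target in z for z in zones.values())

print(can_access("server1_hba0", "array_port0"))  # True
print(can_access("server1_hba0", "array_port1"))  # False: different zone
```

Because membership is checked pairwise, adding a server to a second zone grants it access to that zone's storage without exposing any other resources.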
Fibre Channel Switch Fabric-2 (FC-SW-2), a new standard in the Fibre Channel industry, allows the interconnection of fabric switches from multiple vendors. FC-SW-2 also defines how zoning information is communicated from one switch to another to provide the following capabilities:
- Added security, reliability, and manageability
- A method for interconnecting switches through E_Ports (expansion ports)
- Distributed name services for communicating device names from one switch to another
- Distributed management services to manage fabric devices from outside the fabric
- Routing protocols for crossing multiple switches
The Gadzoox Networks Slingshot 4210 switch supports the FC-SW-2 protocol.
Implementing Gadzoox Networks Slingshot 4210 switch for highly available SANs
Administrators can use the Gadzoox Networks Slingshot 4210 switch to build SAN islands around a workgroup or departmental cluster. These clusters typically include up to eight servers sharing a storage array and a tape device. This configuration allows the storage supporting the servers to grow seamlessly and nondisruptively while maintaining high availability if a component fails.
All connectors of the Slingshot 4210 are located on one end of the switch for easy connection or removal. Fault and power indicators also reside on the front and rear of the switch so that, when the Slingshot 4210 sits in a rack with other equipment, administrators can determine which unit needs service. Other usability features of the Slingshot 4210 include auto-loading of World Wide Name (WWN) addresses, manual override on port-type specification, selectable in-order-delivery settings, and no reset requirement for retrieving statistics.
Administrators implementing Slingshot 4210 switches in a cluster can combine operating system (OS) environments through zoning. Some of the servers could be running Microsoft SQL Server while others run Microsoft® Exchange and yet another supports a Sun® Solaris® application.
Using switch redundancy in SAN configurations
In SAN design, a simple method for implementing high availability is to install two HBAs in each server and connect each HBA to a different switch, giving each server at least two routes to the storage (see Figure 2). Each switch connects the same set of servers and storage to provide two independent access paths from server to data. If the SAN experiences a path failure, an alternate data transfer path still exists. For example, if both the connection from the server to the first switch and the connection between the second switch and the storage fail, the server can still reach the data through the second switch and an inter-switch link to the first switch.
Figure 2. Multiple switches for high-availability clusters
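The failure scenario just described can be verified with a small graph search. The link list below models the Figure 2 topology with an inter-switch link included, which is an assumption about the figure rather than something the article specifies:

```python
from collections import deque

# Links of a dual-switch SAN, including one inter-switch link (assumed).
links = {
    ("server", "switch1"), ("server", "switch2"),
    ("switch1", "storage"), ("switch2", "storage"),
    ("switch1", "switch2"),
}

def has_path(links, src, dst):
    """Breadth-first search over the surviving (undirected) links."""
    adj = {}
    for a, b in links:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    seen, queue = {src}, deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            return True
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False

# Fail server->switch1 and switch2->storage: a route still exists
# (server -> switch2 -> switch1 -> storage).
survivors = links - {("server", "switch1"), ("switch2", "storage")}
print(has_path(survivors, "server", "storage"))  # True
```

Removing the inter-switch link from the surviving set would cut the last route, which shows why interconnecting the switches matters in this double-failure case.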
This process of interconnecting switches builds a mesh environment that offers enhanced availability and manageability. A meshed switch environment allows administrators to implement a layer of switches whose sole responsibility is to provide interconnections from one switch to another.
Considering storage system requirements for cluster environments
The final step in the SAN design process involves connecting the storage devices. As with the servers, administrators should make sure that each data storage unit has multiple access paths and supports the same availability features (hot spare, load sharing, or load balancing) as the server's HBA. Administrators should also configure the data with various RAID (redundant array of independent disks) levels to address access speed and availability.
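Different RAID levels trade usable capacity for availability and access speed. A rough Python sketch of the capacity side of that trade-off follows; the disk counts and sizes are assumed purely for illustration:

```python
def usable_capacity(raid_level, disks, disk_gb):
    """Usable capacity in GB for a few common RAID levels (illustrative)."""
    if raid_level == 0:      # striping: maximum speed and capacity, no redundancy
        return disks * disk_gb
    if raid_level == 1:      # mirroring: every block stored twice
        return disks * disk_gb // 2
    if raid_level == 5:      # striping with parity: one disk's worth lost to parity
        return (disks - 1) * disk_gb
    raise ValueError("unsupported RAID level in this sketch")

# Eight disks of 73 GB each (an assumed, era-typical drive size):
print(usable_capacity(0, 8, 73))  # 584
print(usable_capacity(1, 8, 73))  # 292
print(usable_capacity(5, 8, 73))  # 511
```

RAID 0 offers no protection, RAID 1 survives a disk failure at half the capacity, and RAID 5 survives a single failure while losing only one disk's worth of space, which is why administrators mix levels to balance access speed against availability.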
SANs do not burden the LAN with additional traffic when transferring data from disk to disk or from disk to tape. Instead of depending on the server or storage device to move data to two or more locations, the Fibre Channel switches can route data to multiple target storage arrays.
Using Slingshot 4210 switches for inter-cluster linking
Although administrators can connect switches with single cables, a more advanced form of inter-switch link (ISL) is a trunk. Trunking connects SAN islands together to share data at Fibre Channel distances of up to 10 km. Trunks are a unique type of ISL because they are single, virtual pipes composed of multiple physical links.
When any two Slingshot 4210 switches are connected, they recognize each other and form a trunking link. A Slingshot trunk can consist of as many connections as desired, each with 2 Gbps capability. These trunking links also provide failover so that if any connection fails, the data will automatically route over the surviving links (see Figure 3 ).
Figure 3. SAN trunking connections between clusters
Advanced inter-cluster linking: Meshed networks
System administrators can use inter-cluster linking to create several types of meshed networks, including core/edge, skinny tree mesh, and dual-mesh networks.
A core/edge architecture uses two or more high-end fabric switches to build a network. The core switches connect to the switches on the "edge," into which all nodes are plugged. Edge switches can be either loop or fabric switches. In large configurations, the core switches should be director-class devices: fully redundant switches housed entirely in one chassis. Director-class switches, however, are more expensive than standard fabric switches and still must be deployed in redundant pairs to guard against environmental disasters such as fire or flood.
Another type of inter-cluster linking is the skinny tree mesh, which involves three layers of switches—four switches on both the first and third layers, and two switches on the second layer (see Figure 4 ). The number of switches used in a skinny tree mesh may vary depending on the number of available ports per switch and the number of interconnections between each switch. A skinny tree mesh using a Fibre Channel switch with flexible trunking and a high port count can obtain an optimal amount of ports while still providing enough inter-switch connectivity to avoid bandwidth congestion.
Figure 4. Skinny tree mesh architecture
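The port-count trade-off in a skinny tree mesh can be made concrete with a little arithmetic. The 16-port switch size and ISL counts below are assumptions for illustration, since the article does not specify them:

```python
def usable_node_ports(edge_switches, ports_per_switch, isls_per_edge_switch):
    """Ports left for servers and storage on the edge layers after
    reserving ports for inter-switch links (illustrative arithmetic)."""
    return edge_switches * (ports_per_switch - isls_per_edge_switch)

# Figure 4 topology: four switches each on layers one and three, with an
# assumed two ISLs from every edge switch to each of the two middle switches.
edge_switches = 4 + 4
print(usable_node_ports(edge_switches, 16, 2 * 2))  # 96
```

Raising the ISL count per edge switch buys more inter-switch bandwidth at the direct cost of node ports, which is the congestion-versus-capacity balance the skinny tree design tries to optimize.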
Administrators also can set up two separate fabric meshes and connect each node on the network into both meshes. When creating these mesh designs, the primary concern is availability; cost, expandability, and ease of support are secondary. A sound design should also leave an adequate number of free ports so the environment can grow.
Current SCSI standards limit most data connections to about 12 meters, but Fibre Channel connections enable clustered computers to reside in separate buildings up to 10 km apart and to work with redundant locations for greater data protection. With specialized optical connections, SANs can separate servers from storage by 50 km or more.
Meeting the needs of cluster environments
Integrating SANs into clusters provides storage that scales and fails over with the cluster. The Gadzoox Networks 2 Gb Slingshot 4210, specifically designed to allow failover pairs in clustered environments, can ensure SAN connectivity. As clustering grows in popularity, storage connectivity with a highly available SAN becomes increasingly important for both efficiency and effectiveness.
Erik Ottem (firstname.lastname@example.org) is a senior director of worldwide sales with Gadzoox Networks. He has been with Gadzoox Networks for three years, working in product marketing, business development, and sales. Prior to Gadzoox Networks, Erik spent time at both Seagate and IBM in storage and high-end computing. He has a B.S. from the University of California, Davis and an MBA from Washington University in St. Louis.
For more information
For more information, service, and support, visit http://www.gadzoox.com