Architecting Linux High-Availability Clusters - Part 1
By Tau Leng; Jenwei Hsieh, Ph.D.; and Edward Yardumian (Issue 4 2000)
This article, the first in a series on Linux high-availability (HA) clusters, provides an overview of the Linux HA cluster. It describes the two common types of clusters, the Linux Virtual Server for IP HA and load balancing, and HA application clusters. Future articles in the series will cover product-specific implementations and features.
Linux is known as a stable operating system. However, a Linux client/server configuration can have several points of failure, including the server hardware, the networking components, and the server-based applications. As more administrators choose Linux for critical applications, the demand for high-availability (HA) clustering for the Linux platform is increasing.
In response, a number of Linux distributors have designed and implemented bundled HA solutions in their products, and numerous third-party add-ons are now available. However, several aspects of these technologies are not always clear, such as how the technologies work, the types of applications for which they are suitable, and the kinds of hardware required.
HA Clustering Provides Availability, Performance, and Capacity
For networked computers, clustering is the process of connecting multiple systems together to provide greater overall system availability, performance, capacity, or some combination of these. Because the term clustering itself is so broad, other terms—such as load balancing, failover, parallel, and Beowulf—are used to describe specific cluster implementations. For example, Beowulf clusters are designed to provide scalability and parallel processing for computational functions. HA clustering solutions, however, seek to provide enhanced availability for a service or application.
Common Types of HA Clusters in Linux
Two common types of HA clusters are emerging in the Linux environment: HA IP clusters and HA application clusters. HA IP clusters ensure availability for network access points, which typically are IP addresses that clients use to access network services. HA IP clusters on the Linux platform achieve high availability using the Linux Virtual Server (LVS) mechanism. By using this mechanism to support virtual IP addresses, some HA IP cluster implementations can also load-balance certain types of applications if their contents are completely replicated on a pool of application servers. Applications that the HA IP cluster can load-balance include static Web and File Transfer Protocol (FTP) servers and video streaming servers.
HA application clusters, on the other hand, are more suitable for stateful, transactional applications, such as database servers, Web application servers, file servers, and print servers. HA application clusters ensure availability by failing over applications, along with all of the resources that the applications need (such as disks, IP addresses, and software modules), to the remaining servers.
HA IP Clusters: The LVS Presents a Single System
Most HA IP cluster implementations, such as Piranha, Red Hat High Availability Server 1.0, TurboCluster from TurboLinux®, and Ultra Monkey (supported by VA Linux™), use an LVS mechanism as well as a group, or pool, of cloned application servers.
The LVS presents the pool of application servers to network clients as if it were a single system. The LVS is represented by the virtual IP address or addresses clients use to access the clustered services, including the specific port and protocol—either UDP/IP or TCP/IP. An LVS server maintains the LVS identity and dispatches client requests to a physically separate application server or pool of application servers. Network clients are kept unaware of the unique physical IP addresses used by the LVS server nodes or any of the application nodes; the clients access only a virtual IP address managed by the LVS.
The LVS server is responsible for routing client requests to cloned application servers. To accomplish this task, the LVS is configured with scheduling policies that allocate and forward incoming connections to the application servers. High availability is achieved by having multiple destinations capable of processing requests. If one of the application servers fails, one or more application servers will be available to continue service through the same virtual IP address.
Heartbeats Monitor Server Health
To provide uninterrupted service, the LVS continuously monitors the health of the application servers (see Figure 1). Health monitoring between the LVS and application servers ensures timely failure detection and keeps cluster membership status current. This monitoring is performed via a heartbeat mechanism managed by the LVS. Heartbeat packets are sent between cluster nodes at regular intervals (on the order of seconds). If a heartbeat is not received within a predefined period of time, typically a few heartbeat intervals, the absent machine is presumed failed. If this machine is an application node, the LVS server stops routing client requests to it until it is restored. Depending on the implementation, the heartbeat protocol may run over a TTY serial port, over UDP/IP on Ethernet, or even over shared storage connectivity.
Figure 1. Monitoring the Health of Application Servers
Providing continuous service requires that the applications are installed locally in each of the application servers. For this reason, the application servers are often referred to as "clones." Any data, such as Web or FTP content, must be completely replicated to all of the application servers to ensure a consistent response from the application server pool. In the event of a server failure, the LVS heartbeat mechanism detects the failure, makes the necessary changes to the cluster's membership, and continues forwarding requests to the remaining server or servers.
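The failure-detection rule described above can be illustrated with a minimal sketch. The class name, the 2-second interval, and the allowance of three missed intervals are assumptions chosen for illustration, not values taken from any particular LVS implementation. Timestamps are passed in explicitly so that the behavior is easy to follow.

```python
from dataclasses import dataclass, field


@dataclass
class HeartbeatMonitor:
    """Tracks the last heartbeat seen from each node (hypothetical helper)."""

    interval: float = 2.0      # seconds between heartbeats (assumed value)
    missed_allowed: int = 3    # intervals tolerated before presuming failure
    last_seen: dict = field(default_factory=dict)

    def record(self, node, now):
        """Note that a heartbeat from `node` arrived at time `now`."""
        self.last_seen[node] = now

    def live_members(self, now):
        """Return the nodes whose last heartbeat falls inside the timeout."""
        timeout = self.interval * self.missed_allowed
        return sorted(n for n, t in self.last_seen.items() if now - t <= timeout)


monitor = HeartbeatMonitor()
monitor.record("app1", now=0.0)
monitor.record("app2", now=0.0)
monitor.record("app1", now=6.0)          # app2 has gone silent
members = monitor.live_members(now=8.0)  # app2 is now past the 6-second timeout
```

In this sketch the LVS would forward requests only to the nodes returned by `live_members`, and a node rejoins the pool as soon as its heartbeats resume.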
HA IP Clusters Support Load Balancing
HA IP clusters provide high availability for an IP address, and the manner in which client requests are forwarded from the LVS server to the cloned application servers also supports load balancing. In fact, because capacity can be increased simply by adding application nodes when demand increases, load balancing can help suitable applications scale well beyond the capacity of a single server.
To help spread client requests, or workload, across the pool of application servers, each LVS implementation uses a set of basic scheduling policies. The most commonly used policies are round robin and least connections. Round robin simply forwards requests to each application server one at a time and perpetually repeats the process in the same order. With least connections, the LVS assesses the current number of connections on each application server and forwards the request to the server with the fewest number.
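The two policies can be sketched in a few lines. This is an illustrative model only, not the scheduler code of any LVS product; the server names and connection counts are made up.

```python
from itertools import cycle


def round_robin(servers):
    """Return an iterator that yields servers in a fixed, repeating order."""
    return cycle(servers)


def least_connections(active):
    """Pick the server with the fewest active connections.

    `active` maps server name -> current connection count.
    Ties go to the server that appears first in the mapping.
    """
    return min(active, key=active.get)


rr = round_robin(["app1", "app2", "app3"])
rr_picks = [next(rr) for _ in range(5)]

conns = {"app1": 12, "app2": 4, "app3": 9}
lc_pick = least_connections(conns)
conns[lc_pick] += 1  # the scheduler counts the connection it just assigned
```

Round robin needs no knowledge of the servers' state, whereas least connections depends on the LVS keeping an accurate connection count for each server.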
Some Linux distributions or products also use more advanced algorithms that examine the actual load on each application server and distribute the incoming requests accordingly. For Web sites that keep session state only on the Web server itself, most LVS implementations offer a persistency mode that directs each client to the same server throughout a session.
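One way to picture a persistency mode is as a table that remembers which server each client was last sent to. The sketch below is an assumption-laden illustration: the class name, the 300-second timeout, and the use of the client address as the key are all hypothetical, and real implementations may track sessions differently.

```python
class PersistentScheduler:
    """Round-robin scheduler that keeps a client on one server per session."""

    def __init__(self, servers, timeout=300.0):
        self.servers = list(servers)
        self.timeout = timeout      # seconds of idle time before the entry expires
        self.next_index = 0
        self.table = {}             # client -> (server, time of last request)

    def pick(self, client, now):
        """Return the server for `client`, reusing a live table entry if any."""
        entry = self.table.get(client)
        if entry is not None and now - entry[1] <= self.timeout:
            server = entry[0]
        else:
            server = self.servers[self.next_index % len(self.servers)]
            self.next_index += 1
        self.table[client] = (server, now)
        return server


sched = PersistentScheduler(["app1", "app2", "app3"])
first = sched.pick("198.51.100.5", now=0.0)
other = sched.pick("198.51.100.9", now=1.0)
repeat = sched.pick("198.51.100.5", now=100.0)     # within the timeout
expired = sched.pick("198.51.100.5", now=1000.0)   # entry expired, rescheduled
```

The trade-off is that persistence weakens load balancing: a busy client stays on one server even when others are idle.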
At a minimum, an HA IP cluster can be implemented by using one of the application servers for the LVS mechanism (see Figure 2). All requests are first handled by the server running the LVS, which initially determines whether the request will be handled locally or shipped to another application server. In this configuration, however, the LVS mechanism can become a bottleneck because one server handles all routing functions and application requests.
Figure 2. Hosting the LVS Mechanism on an Application Server
An Active/Passive LVS Helps to Prevent Bottlenecks
Assigning smaller loads to the server running the LVS can mitigate the risk of a bottleneck. Ideally, the LVS should be built with a pair of dedicated servers, one actively functioning as the LVS, the other acting as its hot standby (see Figure 3). This configuration is often referred to as an active/passive LVS, because one server actively serves as the LVS and routes requests to the application servers while the passive LVS server waits to assume control only if the active server fails. In a failure scenario, the standby server assumes the virtual IP address of the LVS, while retaining its own unique physical IP address, through a process referred to as IP failover. Clients are automatically reconnected to the LVS running on the other server without reconfiguration.
Figure 3. Redundant LVS Server Mechanism
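The IP failover decision for an active/passive pair can be modeled as follows. This is a simplified sketch: the node names and addresses are hypothetical (the addresses come from a range reserved for documentation), and the code only decides which node should own the virtual IP address. It does not perform the actual address takeover on the network.

```python
from dataclasses import dataclass


@dataclass
class LvsNode:
    name: str
    physical_ip: str       # never changes, even after failover
    healthy: bool = True


def vip_owner(active, standby):
    """Return the node that should answer for the virtual IP address."""
    if active.healthy:
        return active
    if standby.healthy:
        return standby     # IP failover: the standby assumes the virtual IP
    return None            # both LVS servers are down


VIRTUAL_IP = "192.0.2.10"  # the only address clients ever use
lvs1 = LvsNode("lvs1", "192.0.2.11")
lvs2 = LvsNode("lvs2", "192.0.2.12")

owner_before = vip_owner(lvs1, lvs2).name
lvs1.healthy = False       # simulate failure of the active LVS server
new_owner = vip_owner(lvs1, lvs2)
binding = {VIRTUAL_IP: new_owner.name}
```

Because clients only know `VIRTUAL_IP`, moving that address to the standby is what lets them reconnect without reconfiguration.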
Currently, most HA IP cluster distributions do not support active/active configurations. An active/active configuration would include two or more LVS servers, each active and responsible for a different LVS in addition to being available in the case of a failover. Furthermore, no LVS implementations currently support load-balancing configurations in which multiple LVS servers share routing responsibilities for the same virtual IP address.
As traffic to the site grows and the LVS server routes an increasing number of requests, the LVS server may have to be upgraded or replaced to ensure that CPU, memory, or network bandwidth is adequate. To ease the load on an LVS server that typically routes all application server responses back to clients using network address translation (NAT), responses from application servers can be sent directly to clients by creating another physical route and using IP tunneling or direct routing techniques.
Data Replication is a Challenge
Data replication often is the greatest challenge when implementing HA IP clusters with a pool of application servers. If just two application servers are required, a shared storage system can be built using Dell's cluster-ready PERC 2/DC RAID controllers and PowerVault 200S or 210S storage systems with Enclosure Services Expander Module (ESEM) or SCSI Enclosure Management Module (SEMM) cluster modules (see Figure 4). In shared storage configurations, both servers can access the same set of files by using the Global File System (GFS) to share files between application servers. If more than two nodes are required or if shared storage is not desired, the InterMezzo distributed file system enables directory tree replication and can be used to replicate files to the application servers' internal disks.
Figure 4. Two-Node Shared Storage Configuration
When transactions are involved or when mirroring must be nearly instantaneous, complex distributed locking techniques often are required to maintain data integrity and consistency. While replication and mirroring may work for some applications that involve writes, the replication can incur substantial overhead. In the cases of databases, messaging, and most application services, it is difficult to implement HA IP clustering because of their read/write and transactional natures and the complexities in replicating or synchronizing their content. Therefore, these applications are better suited for Linux HA application clusters.
HA Application Clusters
HA application clusters, such as LifeKeeper® from SteelEye®, Convolo Cluster™ from Mission Critical Linux, RSF-1 from High-Availability.com, and VERITAS Cluster Server™, are appropriate for transactional applications, such as databases, groupware, file systems, and other applications containing business logic. While the LVS mechanism is the enabling technology for HA IP clusters, HA application clusters take the concept of the LVS a step further to ensure the availability of applications. HA application clusters achieve this availability by continuously monitoring the health of an application and the resources the application depends on for normal operation, including the server it is running on. Should any of the application resources fail, the HA application cluster will restart, or fail over, the application on one of the remaining servers.
To ensure that all of the resources an application needs are closely monitored and will fail over to one of the remaining servers, they are grouped together in a resource hierarchy. In resource hierarchies, the resources are grouped and arranged in dependency trees so that they can be moved (between physical resources) or restarted on different servers. A dependency tree ensures that the resources come online in the right order. Lower level resources such as disks and IP addresses are brought online first, and application modules are brought up last, after all the dependent resources are ready.
For example, the resource group hierarchy for a file share could include a disk where the files are stored, an IP address, a server name, and file share resources (see Figure 5). The disk resource and the IP address have no dependencies and are brought online first. The disk is among the first resources to come online because it would be futile for clients to connect to the file share before the disk on which the files are stored is online and mounted. Likewise, the IP address must be brought up before the server name that depends on it. Finally, after all of the required dependencies are online, the application services are made available on the network.
Figure 5. Example Resource Group Hierarchy
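The ordering that a dependency tree enforces is a topological sort. The sketch below uses Python's standard `graphlib` module on the file share example. The resource names and the exact dependency edges are assumptions inferred from the text, so the hierarchy in Figure 5 may differ in detail.

```python
from graphlib import TopologicalSorter

# Each resource maps to the resources it depends on (assumed edges).
DEPENDS_ON = {
    "disk": [],
    "ip_address": [],
    "server_name": ["ip_address"],
    "file_share": ["disk", "server_name"],
}


def online_order(depends_on):
    """Return resources so that every dependency precedes its dependents."""
    return list(TopologicalSorter(depends_on).static_order())


def offline_order(depends_on):
    """Take resources offline in the reverse of the online order."""
    return list(reversed(online_order(depends_on)))


order = online_order(DEPENDS_ON)
```

During a failover, a cluster following this model would take the resources offline in `offline_order` on the failing server where possible, and then bring them online in `online_order` on the server taking over.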
Key elements that assist application failover are the resource manager and application recovery kits. The resource manager enables the user to define the resource hierarchy for applications and specify the dependencies among resources. Application recovery kits are tools, or a set of scripts, that provide the mechanism to automatically restart an application and all of its resources, in the proper order, on one of the remaining servers should a failure occur. Application recovery kits are usually provided by vendors of HA application clusters for packaged software, including databases and the most commonly used applications on Linux servers, such as Apache Web server, sendmail, and print services. Moreover, recovery kits at the system level for the Linux file system are becoming more widely available. For applications that have no associated recovery kit, users can create custom scripts by employing application programming interface (API) commands and utilities.
Generally, three approaches exist to ensure that application data or storage remains available to the remaining servers after failover. Figure 6 summarizes the advantages and disadvantages of these approaches.
Figure 6. Approaches to Ensure Data or Storage Availability After Failover
Passive Standby and Active/Active Modes Ensure High Availability
Similar to the LVS concepts of active/passive and active/active, HA application clusters use the passive standby mode and active/active mode terminology. A straightforward approach to achieve high availability is the passive standby mode, in which one server acts as the primary server, while a secondary server remains available for use should the primary server fail. In passive standby mode, the secondary server is not used for any other processing; it simply stands by to take over if the primary server fails. This configuration enables maximum resources to be available to the application in the event of a failure. However, this configuration is expensive to implement because it requires twice the amount of hardware to be purchased.
For all but the most mission-critical applications, active/active configurations can be highly effective. In active/active configurations, each server performs useful processes, while still possessing the ability to take over for another server in the event of a failure. Drawbacks of active/active configurations include increased design complexity and the potential introduction of performance issues upon failover.
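The performance risk upon failover can be estimated with simple arithmetic: the surviving server must carry its own load plus the load of the failed server. The sketch below assumes two identical nodes and assumes that load simply adds, which is a simplification. The utilization figures are made up.

```python
def load_after_failover(utilization, failed):
    """Estimate the survivor's load in a two-node active/active pair.

    `utilization` maps node name -> fraction of capacity in use (0.0 to 1.0).
    Returns the surviving node and its estimated load after failover.
    """
    survivor = next(n for n in utilization if n != failed)
    return survivor, utilization[survivor] + utilization[failed]


node, load = load_after_failover({"nodeA": 0.45, "nodeB": 0.35}, failed="nodeA")
overloaded = load > 1.0  # False here: the survivor runs at roughly 80 percent
```

Under these assumptions, an active/active pair can absorb a failure without degradation only if the combined utilization of both nodes stays below the capacity of one.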
Multinode Solutions Are Now Available
Although the most common HA application cluster configurations are currently for two nodes, several multinode (more than two nodes) failover solutions for Linux have recently become available. With multinode clusters, configurations such as N+1 and cascaded failover help administrators meet high-availability needs in a complex environment while also providing better resource utilization. For example, in an N+1 configuration, a single dedicated server runs in passive mode while the rest of the servers actively process requests for their applications. If a server completely fails, the passive node provides all the resources of an unused server, rather than squeezing the application onto another server already responsible for several applications.
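An N+1 failover decision can be sketched as choosing the least-loaded surviving server, so that an idle spare always wins while it is available. The function name, the node names, and the rule of falling back to the least-loaded active server once the spare is in use are assumptions for illustration; actual products let administrators define their own failover order.

```python
def fail_over(assignments, failed):
    """Return a new assignment map after the server `failed` goes down.

    `assignments` maps server name -> list of applications. The passive
    spare is simply a server with an empty list.
    """
    survivors = {s: list(apps) for s, apps in assignments.items() if s != failed}
    if not survivors:
        raise RuntimeError("no remaining servers to fail over to")
    # Prefer the least-loaded survivor; ties go to the first one listed.
    target = min(survivors, key=lambda s: len(survivors[s]))
    survivors[target].extend(assignments[failed])
    return survivors


cluster = {"node1": ["db"], "node2": ["mail"], "node3": ["web"], "spare": []}
after_first = fail_over(cluster, "node2")       # the spare takes over mail
after_second = fail_over(after_first, "node1")  # spare is busy, node3 takes db
```

The second failure shows why N+1 protects fully against only one failure at a time: once the spare is occupied, applications must share an already active server.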
Multinode configurations can be implemented either by mirroring content locally to each server or through a switched storage fabric. Mirroring often requires complex replication techniques and network overhead to push and pull the content to all of the servers. The technology is now widely available to build switched fabrics by using storage area networks (SANs). SANs (see Figure 7) provide multinode clusters with excellent server-to-storage performance, even as additional servers are added to the SAN, as well as the ability to effectively scale the amount of storage the cluster nodes use.
Figure 7. A Four-Node Switched SAN Configuration
Combine Both HA Cluster Types for a Multitier Solution
High availability and scalability are equally important to the construction of an e-commerce or a business-critical system. Providing continuous service for a distributed, multitier application is possible by deploying HA IP and HA application clusters together. Figure 8 shows an e-commerce configuration in which a pair of LVS servers is responsible for two Web-based applications, one for static content running on a pair of servers, and one for commerce applications running on three servers. Both of these sites are load balanced through HA IP clustering and the active LVS server. High availability for the database component of the site is achieved with an active/active HA application clustering solution. One node of the cluster runs a database used for the site's catalog and inventory, while the other node runs the orders database. Although the two servers cannot process the same database concurrently, either server can run both databases simultaneously if the other server fails.
Figure 8. Using HA IP and HA Application Clusters Together
Support is Growing
Support for high availability and scalable services under Linux is growing. As these technologies mature, we will cover more specific implementations and features in future articles. Presently, Linux is more widely used as the front end of distributed, multitier configurations for stateless mode operations, such as load-balanced Web serving. While the need for high availability and scalability expands far beyond Web farms, the technologies mentioned in this article provide a good starting point for advanced solutions to come. As HA implementations continue their migration from UNIX to Linux, the number of proven options developed in the UNIX space will continue to expand for Linux.
Tau Leng (firstname.lastname@example.org) is a system engineer in the Scale Out Systems Group at Dell. His product development responsibilities include Dell cluster product solutions, including Linux high-performance and high-availability clusters. Tau earned an M.S. in Computer Science from Utah State University. Currently he is a Ph.D. candidate in Computer Science at the University of Houston.
Jenwei Hsieh, Ph.D. (email@example.com) is a member of the Scale Out Systems Group at Dell. He has published extensively in the areas of multimedia computing and communications, high-speed networking, serial storage interfaces, and distributed network computing. Jenwei has a Ph.D. in Computer Science from the University of Minnesota.
Edward Yardumian (firstname.lastname@example.org) is a technologist specializing in distributed systems, cluster computing, and Internet infrastructure in the Scale Out Systems Group in the Enterprise Server Products division at Dell. Previously, Ed was a lead engineer for Dell PowerEdge Clusters.