Achieving 99.999 Percent Uptime on Windows

Dell Magazines

Dell Power Solutions


By Craig Jon Anderson (Issue 1 2001)

Marathon Assured Availability solutions combined with Dell PowerEdge servers can keep your Microsoft Windows infrastructure up and running continuously. This article describes the solution and compares it with traditional clusters for high availability.

All key functions within the enterprise today (inventory control, order placement and confirmation, customer support, communications, accounting, and workgroup collaboration) depend on the information technology (IT) infrastructure. Microsoft® Windows® environments have emerged as one of the major platforms for today's mission-critical applications, such as messaging and collaboration, call center applications, and manufacturing automation software.

New technologies are also contributing to the dramatic increase in the volume of transactions handled by this infrastructure. For example, e-mail can confirm orders, notify customers of delays, and provide ongoing customer support. Every second of downtime equals lost orders, lost revenue, and reduced productivity. Indirect costs include customer dissatisfaction, increased IT workload, and inefficient business processes.

This article describes two key approaches for IT managers to maintain the availability of their Windows infrastructure and provides an overview of the Marathon Assured Availability® solution.

High availability options

IT managers want to protect their infrastructure from downtime. The first step in this process is to carefully consider the costs and consequences of downtime within their organizations, which can help IT managers make informed decisions about the level of availability appropriate to their needs. For example:

  1. An off-the-shelf Windows server delivers 99 percent availability with an expected average downtime per year of 87.6 hours.
  2. Clusters deliver 99.9 percent availability with an expected downtime of 526 minutes per year or 8.76 hours (based on one failure per year).
  3. Dell®  servers with Marathon Assured Availability solutions deliver more than 99.999 percent availability, with an expected downtime of less than 5.26 minutes per year.
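
Expected annual downtime can be derived from an availability percentage with simple arithmetic. The following is a minimal sketch, assuming a 365-day year (8,760 hours); the function name is illustrative and not part of any Marathon or Dell tool. Under that assumption, 99.999 percent availability works out to roughly 5.26 minutes of downtime per year.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours; assumes a non-leap year


def annual_downtime_hours(availability_percent: float) -> float:
    """Expected downtime per year, in hours, for a given availability percentage."""
    unavailable_fraction = 1.0 - availability_percent / 100.0
    return unavailable_fraction * HOURS_PER_YEAR


# Availability levels discussed in the list above.
for label, pct in [
    ("Off-the-shelf server", 99.0),
    ("Cluster", 99.9),
    ("Fault-tolerant system", 99.999),
]:
    hours = annual_downtime_hours(pct)
    print(f"{label}: {pct}% -> {hours:.2f} hours ({hours * 60:.2f} minutes) per year")
```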

This evaluation can help IT managers choose the right approach for their situation. The two major categories, failover systems (standby servers and clusters) and fault-tolerant systems, provide different levels of availability and different restoration processes. The first category recovers from failures; the second computes through failures, making them invisible to the end user.

The first category uses a single system to run the user applications until a failure is detected. The mechanism for handling failures is to fail over to an alternate server. In the simplest recovery systems, the operator physically moves the disks from the failed system to another system and boots the second system.

In more sophisticated technologies, the second system has both knowledge of the applications and users on the failed system and access to the users' data. These systems automatically restart the applications and log users onto the new system, giving them access to saved data. Such systems, called clusters, provide 99.9 percent availability. However, in both cases users see a pause in operation during the failover and restart processes and risk losing unsaved data. Subsequent system performance may also be degraded. Figure 1 shows standby servers and clustered systems as examples of this category.

Figure 1. Simple cluster configuration (Microsoft Cluster Server)

The second category masks failures by using two parallel components to perform the same function at the same time. If one component fails, the other continues, thereby improving overall system reliability. Because these systems have at least two processors viewing and manipulating the same data simultaneously, the failure of any single component is invisible to both the application and the user. These truly fault-tolerant systems detect most faults instantaneously and offer other features that facilitate 24x7 operation, such as online repair and upgrade capabilities. The Marathon Assured Availability solution provides this type and level of availability for Windows environments.
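
To see why duplicating components improves reliability, consider a simplified model in which redundant components fail independently, so the group is unavailable only when every copy is down at once. The sketch below is an illustration under that independence assumption; it is not Marathon's published availability model, and the 99 percent input is used purely as an example value.

```python
def redundant_availability(component_availability: float, copies: int = 2) -> float:
    """Availability of `copies` parallel components, assuming independent failures.

    The group is down only if every copy is down at the same time.
    """
    if not 0.0 <= component_availability <= 1.0:
        raise ValueError("availability must be a fraction between 0 and 1")
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return 1.0 - (1.0 - component_availability) ** copies


single = 0.99  # example value only
pair = redundant_availability(single, copies=2)
print(f"single component: {single:.4%}, duplexed pair: {pair:.4%}")
```

Real systems can fall short of this idealized figure when failures are correlated, for example through shared power or a common software fault, so the result is best read as an upper bound for the simple model.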

The unique Marathon architecture

Marathon's architecture logically divides the Windows operating environment into synchronous components called compute elements (CEs) and asynchronous components called I/O processors (IOPs). (See Figure 2.) Each CE-I/O processor pair functions as a single logical Windows server, and either CE can be paired with either I/O processor. Together they are called an array: a set of duplexed computing and I/O components configured to form a single, logical server.

Figure 2. The Marathon architecture
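
The pairing rule can be illustrated with a small model. This is a hypothetical sketch only: the class and method names are invented for illustration and do not correspond to any Marathon API. It assumes that the logical server remains available as long as at least one healthy CE can be paired with at least one healthy I/O processor.

```python
from dataclasses import dataclass, field
from itertools import product


@dataclass
class Component:
    name: str
    healthy: bool = True


@dataclass
class Array:
    """Hypothetical model of an array: two CEs and two I/O processors."""
    ces: list = field(default_factory=lambda: [Component("CE1"), Component("CE2")])
    iops: list = field(default_factory=lambda: [Component("IOP1"), Component("IOP2")])

    def usable_pairs(self) -> list:
        """Every (CE, IOP) combination in which both components are healthy."""
        return [
            (ce.name, iop.name)
            for ce, iop in product(self.ces, self.iops)
            if ce.healthy and iop.healthy
        ]

    def is_available(self) -> bool:
        """Assumption: one healthy CE plus one healthy IOP keeps the server up."""
        return bool(self.usable_pairs())


array = Array()
array.ces[0].healthy = False   # simulate the failure of one CE
array.iops[1].healthy = False  # and of one I/O processor
print(array.usable_pairs())    # the surviving CE can still pair with the surviving IOP
print(array.is_available())
```

In this model a fully healthy array has four possible pairings, which is why a single CE failure and a single I/O processor failure together still leave one usable pair.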

Features of the Marathon Assured Availability solution include:

  • ComputeThru processing: failure events will not stop a transaction in progress
  • Load 'n Go, NoTouch Recovery: replaced components automatically rejoin the array
  • Continuous data access: data is available even through operating system and hardware failures
  • Uninterrupted connectivity: duplicate network connections
  • Constant performance: each redundant system runs with no degradation of capacity
  • OS fault tolerance for Windows 2000 and Windows NT: patented architecture provides an extra level of OS fault tolerance
  • No need for scripting or cluster application programming interface (API) programming
  • Disaster tolerance: optional SplitSite® capability allows physical separation of redundant systems

Provides high availability
Marathon Assured Availability solutions deliver at least 99.999 percent uptime and eliminate vulnerability to hardware failures and most operating system faults. Rather than failing over, Marathon-enabled servers compute continuously through problems that occur in the Windows I/O system, making such problems invisible to both applications and users.

Isolating the application from the I/O subsystem eliminates software failures related to the I/O hardware and associated software (OS- and kernel-mode device drivers). Note that the largest body of code within both Windows NT and Windows 2000 is the set of device drivers associated with the variety of hardware in the Microsoft Hardware Compatibility List (HCL).

Marathon's redundant I/O processor architecture eliminates many asynchronous, hard-to-reproduce bugs that only show up under the stress of a production environment. By running all I/O on dedicated I/O processors, the application executes instructions without being interrupted by the asynchronous I/O system. This averts production-related bugs and results in a stable application during execution.

Protects data integrity
Marathon also provides data redundancy and eliminates data loss in three ways: identifying errors or failures before any data can be corrupted, isolating them so the system can continue to operate in their presence, and repairing the failed component while the system is running user applications.

Through resynchronization, the failed subsystem is fully restored into the system configuration with minimal or no interruption of service.

Minimizes disaster vulnerability
The optional SplitSite capability allows geographic separation of the CE-I/O processor pairs for maximum protection from flood, fire, and other physical threats. The pairs are connected with high-speed fiber cable to maintain transaction speeds. If one pair is destroyed, the surviving pair computes through the disaster without interruption of service. This is a natural capability of the Marathon architecture.

Offers simple implementation and management
Marathon Assured Availability solutions provide high availability without the failover scripting, complex configuration, or customized application software often required by other solutions. (See Figure 3.) Systems are built on Dell servers with standard Windows 2000 or Windows NT operating systems, delivering high availability to all Windows applications.

Figure 3. Marathon Array versus traditional clusters

Marathon provides assured availability

Marathon Assured Availability solutions deliver continuous infrastructure availability, optimal data protection, and easy implementation. Because the software does not require special application software or scripts and system management is "touch free," it offers high levels of availability with a lower total cost of ownership.

Craig Jon Anderson (craig@marathontechnologies.com) is vice president of market development at Marathon Technologies Corporation. He has more than 20 years of experience in business development, marketing, and professional services. He holds an M.B.A. from the University of Virginia Darden School and a B.S. from Ohio State University.

For more information

For more information, service, and support, visit www.marathontechnologies.com/Dell.htm or call 877-999-9971.