Strategies for Fault-Tolerant Computing for Windows Server 2003 Mehmet Altan AÇIKGÖZ, Ercan SARAÇ
Agenda • Introduction • Fault-tolerant servers • Fault tolerance on Windows • Unique Benefits • Clustering • Comparison of Clusters & Stratus ftServer • FtServer Software Availability • Summary
Introduction • Availability is the percentage of time that a system is capable of serving its intended function. Correlation Between Availability and Annual Downtime
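The correlation between availability and annual downtime referenced above follows directly from the definition; a minimal sketch of the arithmetic (figures computed from the formula, not taken from the original table):

```python
# Convert an availability percentage into expected annual downtime.
# Illustrative helper, not part of the original slides.

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes

def annual_downtime_minutes(availability_pct: float) -> float:
    """Expected downtime per year, in minutes, for a given availability."""
    return (100.0 - availability_pct) / 100.0 * MINUTES_PER_YEAR

for pct in (99.0, 99.9, 99.99, 99.999):
    print(f"{pct}% availability -> {annual_downtime_minutes(pct):.2f} min/year")
```

For example, "five nines" (99.999%) works out to roughly 5.26 minutes of downtime per year, while 99.9% allows over 500 minutes.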
Importance of Availability • Availability of mission-critical information systems is often tied directly to business performance or revenue • Average Cost of Unplanned Downtime for Various Industries
Fault-Tolerant Servers • One way of minimizing the causes of downtime is through the use of fault-tolerant servers, combined with software that supports them • If a primary component fails, the secondary component takes over in a process that is seamless to the application running on the server.
Fault-Tolerant Servers • Most high-end servers employ at least some redundant components to eliminate common points of failure, but they will still fail when a nonredundant component, such as a microprocessor or memory controller, fails • True fault-tolerant servers, however, employ complete redundancy across all system components, ensuring that no single point of failure can compromise system availability.
Traditional Barriers to Adoption • Extremely high hardware costs The typical cost for an entry-level fault-tolerant server running a proprietary operating system was $250,000 prior to 2000 • Complexity and expense of writing software Writing programs for these systems required a deep understanding of transactional semantics and manual “checkpointing” at the application level
Fault Tolerance on Windows • Microsoft designed the Windows Server 2003, Enterprise Edition, operating system to fully support fault-tolerant servers • Specific enhancements in Windows Server 2003 that apply to fault-tolerant servers include the following: • Memory mirroring • Multipath I/O • Improvements in load balancing and failover for miniport drivers • Hot-plug PCI support • Hot-add memory support
Adoption of Fault-Tolerance on Windows • To increase availability for traditional Windows–based solutions: As Windows–based solutions continue to become more mission-critical, some companies are improving their application availability by moving these applications to fault-tolerant servers
Adoption of Fault-Tolerance on Windows • As a cost-effective alternative to proprietary platforms: Companies are realizing lower costs by deploying fault-tolerant servers running Windows for solutions that have traditionally resided on clustered UNIX servers, mainframes, or proprietary fault-tolerant systems
Fault-Tolerance on Windows • According to Stratus Technologies, which first began shipping fault-tolerant servers running a proprietary operating system in 1982 and added a UNIX-based offering in 1995, the company's three-year-old Windows–based ftServer product line has resulted in more than 500 new customers in 2003 alone. NEC Corporation, which began shipping mainframes running proprietary operating systems in 1965, reports similar findings since the introduction of its FT Series fault-tolerant servers for Windows in early 2001.
Complete Solutions • Downtime for Windows Server 2003–based solutions is typically due to hardware failures, bad device drivers, user error, poor change control processes, and so on, with a very small percentage attributable to the core operating system • Several fault-tolerant system vendors go a step further in delivering availability-related services through continuous server monitoring. As an example, every Stratus server continually monitors itself for component and operating system failure, and can be set to immediately call into the company's customer assistance center to report a failure or other important event. NEC provides similar services.
Unique Benefits • Reduced Time to Market Solutions intended to run on Windows–based fault-tolerant servers can be developed and deployed as rapidly as any other Windows–based application. Companies can take advantage of the rich functionality provided in the .NET Framework and the highly productive Microsoft Visual Studio® .NET integrated development system to rapidly develop custom solutions, or they can choose from the full range of off-the-shelf Windows applications
Unique Benefits • Ease of Integration With native support for industry standards such as XML Web services, the Microsoft platform and .NET technologies make it easy to integrate Windows–based solutions running on fault-tolerant servers with other systems. Microsoft BizTalk® Server extends these capabilities even further, with more than 300 plug-in BizTalk Adapters available to simplify enterprise application integration and enable companies to comply with industry-specific electronic transaction formats such as HIPAA or EDI
Unique Benefits • Ease of Management • Windows–based solutions running on fault-tolerant servers can be administered easily using the comprehensive management tools provided in the Microsoft platform. For example, Microsoft Operations Manager enables companies to subject applications running on Windows–based servers to granular real-time monitoring, enabling administrators to detect many problems before they can affect system availability
Unique Benefits • Lower Hardware Costs • Fault-tolerant servers for Windows are available starting at under $20,000, a fraction of the typical $200,000-plus starting price for proprietary fault-tolerant platforms. Combined with the superior cost-effectiveness of the Microsoft platform, this order-of-magnitude decrease in hardware costs makes fault-tolerance on Windows economically justifiable in a far broader range of situations than fault-tolerance on proprietary platforms
Clustering • A cluster connects two or more servers together so that they appear as a single computer to clients. Connecting servers in a cluster allows for workload sharing, enables a single point of operation/management, and provides a path for scaling to meet increased demand. Thus, clustering gives you the ability to build highly available applications
Three Technologies for Clustering • Microsoft servers provide three technologies to support clustering: • Network Load Balancing (NLB), • Component Load Balancing (CLB), and • Microsoft Cluster Service (MSCS).
Network Load Balancing • Network Load Balancing acts as a front-end cluster, distributing incoming IP traffic across a cluster of servers
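The actual NLB driver uses a more sophisticated distributed filtering algorithm, but the core idea, namely that every host applies the same deterministic hash to incoming traffic so that exactly one host handles each client without central coordination, can be sketched as follows (host names are hypothetical):

```python
# Minimal sketch of hash-based traffic distribution, the general idea
# behind NLB's filtering algorithm. Not the real NLB implementation.
import hashlib

HOSTS = ["node1", "node2", "node3"]  # hypothetical cluster members

def owning_host(client_ip: str, hosts=HOSTS) -> str:
    """Every host computes the same deterministic hash of the client
    address, so they all agree on which single host accepts the packet."""
    digest = hashlib.md5(client_ip.encode()).hexdigest()
    return hosts[int(digest, 16) % len(hosts)]

print(owning_host("192.168.1.10"))
```

Because the mapping is deterministic, each host can filter packets locally; no front-end dispatcher is needed, which is why NLB itself has no single point of failure.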
Component Load Balancing • Component Load Balancing distributes workload across multiple servers running a site's business logic. It provides for dynamic balancing of COM+ components across a set of up to eight identical servers. • COM+ is both an object-oriented programming architecture and a set of operating system services.
Cluster Service • Cluster Service acts as a back-end cluster; it provides high availability for applications such as databases, messaging and file and print services. MSCS attempts to minimize the effect of failure on the system as any node (a server in the cluster) fails or is taken offline
Failover Capability Through Microsoft Cluster Service • MSCS failover capability is achieved through redundancy across the multiple connected machines in the cluster, each with independent failure states. • Redundancy requires that applications be installed on multiple servers within the cluster. However, an application is online on only one node at any point in time. If that application fails, or that server is taken down, the application is restarted on another node. • Windows Server 2003, Datacenter Edition supports up to eight nodes in a cluster
Each node has its own memory, system disk, operating system and subset of the cluster's resources. • If a node fails, the other node takes ownership of the failed node's resources (this process is known as "failover"). • Microsoft Cluster Service then registers the network address for the resource on the new node so that client traffic is routed to the system that is available and now owns the resource. When the failed resource is later brought back online, MSCS can be configured to redistribute resources and client requests appropriately
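A toy model of this ownership-and-failover behavior, assuming a hypothetical two-node cluster with a single resource (class and resource names are illustrative, not MSCS APIs):

```python
# Toy model of MSCS-style failover: each resource is online on exactly
# one node; when that node fails, a surviving node takes ownership.
# Purely illustrative; real MSCS also re-registers network addresses.

class Cluster:
    def __init__(self, nodes):
        self.up = set(nodes)   # nodes currently alive
        self.owner = {}        # resource name -> owning node

    def bring_online(self, resource, node):
        self.owner[resource] = node

    def fail_node(self, node):
        self.up.discard(node)
        # Fail over every resource the failed node owned to a survivor.
        for resource, owner in self.owner.items():
            if owner == node and self.up:
                self.owner[resource] = next(iter(self.up))

cluster = Cluster(["A", "B"])
cluster.bring_online("SQL", "A")
cluster.fail_node("A")
print(cluster.owner["SQL"])  # resource now owned by node B
```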
Cluster Service Architecture • The Cluster Service The Cluster Service is the core component and runs as a high-priority system service. • The Cluster Service controls cluster activities and performs such tasks as coordinating event notification, facilitating communication between cluster components, handling failover operations and managing the configuration. • Each cluster node runs its own Cluster Service.
The Resource Monitor The Resource Monitor is an interface between the Cluster Service and the cluster resources, and runs as an independent process. The Cluster Service uses the Resource Monitor to communicate with the resource DLLs
The Resource DLL Every resource uses a resource DLL. The Resource Monitor calls the entry point functions of the resource DLL to check the status of the resource and to bring the resource online and offline. • The resource DLL is responsible for communicating with its resource through any convenient IPC mechanism to implement these methods.
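The real Resource DLL contract is a set of C entry points (Online, Offline, LooksAlive, IsAlive, Terminate, and so on) that the Resource Monitor calls. The sketch below renders that idea in Python with a hypothetical file-based resource, purely as a conceptual illustration of the contract:

```python
# Conceptual rendering of the Resource DLL entry points. The real
# interface is a C DLL; this hypothetical file-based resource only
# illustrates the shape of the contract the Resource Monitor relies on.
import os

class FileShareResource:
    def __init__(self, path):
        self.path = path
        self.online = False

    def Online(self):          # bring the resource into service
        self.online = True

    def Offline(self):         # take the resource out of service
        self.online = False

    def LooksAlive(self):      # cheap check, called frequently
        return self.online

    def IsAlive(self):         # thorough check, called less often
        return self.online and os.path.exists(self.path)

res = FileShareResource(".")
res.Online()
print(res.IsAlive())
```

The LooksAlive/IsAlive split matters: the Resource Monitor polls the cheap check often and reserves the expensive one for periodic verification.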
Comparison: High-Availability Clusters and the Stratus ftServer Family • Clusters are still the most frequent choice for meeting defined targets for availability in Windows server environments • High-availability clusters and Stratus ftServer systems start out with one common characteristic: Both use redundant hardware to eliminate single points of failure. However, similarities between the two approaches end there.
The Stratus ftServer family offers a choice of Dual Modular Redundant (DMR) and Triple Modular Redundant (TMR) configurations. DMR models include two lockstepped CPU/memory units; TMR systems contain a total of three CPU/memory units • In both systems, duplicate or triplicate motherboards execute all instructions in lockstep. If proprietary on-board error detection circuitry identifies a fault, that motherboard is immediately isolated from the system and removed from service
A second level of error detection compares outputs from each CPU/memory unit on each I/O operation. • In a DMR system, if a comparison error occurs with no on-board error indication, software algorithms based on motherboard history are used to determine which board to remove from service. • In a TMR system, "odd-man-out" voting logic is used to identify and isolate additional faults. In either event, processing continues on the remaining motherboards without interruption or performance degradation. • The entire error detection and isolation process occurs in just milliseconds, without any interruption to system operation
In contrast, a high-availability cluster initiates its failover process only after a failed node stops sending "heartbeat" messages. Crucial seconds can elapse before the working node even begins the failover routine, which results in downtime even under the best of circumstances. • After failover initiation, the new cluster is formed, the database is recovered, and applications are restarted. This recovery sequence can span many minutes depending on the complexity and sophistication of the application environment and the cluster configuration.
Conversely, the absence of a heartbeat is not always a reliable indicator of system error or unavailability, as in the case of a node that is merely responding slowly. In such cases, failover sequences may be initiated unnecessarily
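The timing problem described above can be sketched as a simple timeout check (the timeout value is illustrative, not an MSCS default):

```python
# Sketch of heartbeat-based failure detection: a peer is only declared
# dead after a timeout elapses, so seconds pass before failover begins,
# and a merely slow peer can be misdiagnosed as failed.

HEARTBEAT_TIMEOUT = 5.0  # seconds without a heartbeat (illustrative)

def peer_presumed_dead(last_heartbeat: float, now: float,
                       timeout: float = HEARTBEAT_TIMEOUT) -> bool:
    """True once no heartbeat has arrived within the timeout window."""
    return (now - last_heartbeat) > timeout

# A healthy but slow peer whose last heartbeat was 6 s ago looks "dead",
# so the cluster would start an unnecessary failover:
print(peer_presumed_dead(last_heartbeat=100.0, now=106.0))
```

This is the fundamental trade-off of heartbeat detection: a short timeout causes false failovers on slow nodes, while a long timeout adds seconds of guaranteed downtime before any failover begins.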
Software Availability • A frequently mentioned strength of high-availability clusters is their ability to provide a superior degree of software availability compared to traditional hardware fault-tolerant offerings • Stratus recognizes the need to enhance total system availability and delivers a number of unique features that speed recovery in the event of a software outage • Hardened device drivers • ftServer Software Availability Manager • ftServer Active Upgrade™ technology • Quick Dump
Hardened device drivers • Specifically, hardened drivers detect and stop adapter card writes beyond the physical memory allocated; monitor the mean time between failure (MTBF) of the PCI card and remove it from service if thresholds are crossed; and support visual indicators that communicate the device state.
ftServer Software Availability Manager • In addition to Microsoft Windows operating system monitoring, the ftServer Software Availability Manager tracks the activity of CPU, memory, and disk resources against thresholds specified by the system administrator
ftServer Active Upgrade™ technology • Active Upgrade technology adds a new availability dimension beyond the field-proven 99.999% uptime protection for which Stratus servers are known. Unlike cluster online upgrades, the complexities associated with upgrading multiple systems, revision synchronization between systems, and recovery from issues encountered during an upgrade are vastly reduced because the Active Upgrade process takes place on a single fault-tolerant ftServer system
Quick Dump • This facility allows the server to be restarted rapidly after an operating system outage, without sacrificing information needed to analyze the cause
Conclusion • Fault-tolerance and software-based clustering provide two powerful options for achieving mission-critical availability • Windows–based fault-tolerant solutions carry far lower costs than proprietary solutions, enabling companies in all industries to achieve a positive return on investment in a reasonable timeframe across a much broader range of scenarios • As a result, Windows–based fault-tolerant solutions can deliver true high availability when built on fault-tolerant server technologies such as the Stratus ftServer systems