Patterns-based Fault Tolerant CORBA Implementation for Predictable Performance

Aniruddha Gokhale (agokhale@lucent.com), in collaboration with Balachandran Natarajan (bala@cs.wustl.edu), Douglas C. Schmidt (schmidt@uci.edu), & Shalini Yajnik (shalini@precache.com)

Motivation • Distributed applications are becoming more complex & mission-critical • Increasing demand for COTS-based multi-dimensional quality-of-service (QoS) support • e.g., simultaneous requirements for efficiency, predictability, scalability, security, & dependability • Key open challenge is QoS-enabled dependability

Promising Solution: Fault Tolerant (FT) Distributed Object Computing Middleware Challenges • Limitations of non-OO FT strategies that focus on application processes • Techniques based on process-based failure detection & recovery are not applicable to distributed object computing applications due to: • Overly coarse granularity • Inability to restore complex object relationships • Restrictions on process checkpointing & recovery

Overview of Fault Tolerant CORBA Overview • Provides a standard set of CORBA interfaces, policies, & services • Entity redundancy of objects provides fault tolerance via • Replication • Fault detection • Recovery from failure Features • Interoperable Object Group References (IOGR) • Replication Manager • Fault Detector & Notifier • Message logging for recovery • Fault tolerance domains

Interoperable Object Group References • Composite & enhanced Interoperable Object Reference (IOR) for referencing server object groups • Comprises one or more TAG_INTERNET_IOP profiles, each of which must contain a TAG_FT_GROUP component and zero or more TAG_ALTERNATE_IIOP_ADDRESS components • TAG_FT_PRIMARY component in at most one TAG_INTERNET_IOP profile • Client ORBs operate on IOGRs in the same way as on IORs
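The structural rules on this slide can be sketched as a small data model. This is only an illustration: the `Profile` and `IOGR` classes and the `GROUP` and `PRIMARY` constants are hypothetical stand-ins for the group and primary components, not types from any ORB.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical stand-ins for the group and primary components named on the slide.
GROUP = "group-component"
PRIMARY = "primary-component"

@dataclass
class Profile:
    host: str
    port: int
    components: List[str] = field(default_factory=list)

@dataclass
class IOGR:
    profiles: List[Profile]

    def validate(self) -> None:
        """Check the three structural rules listed on the slide."""
        if not self.profiles:
            raise ValueError("an IOGR needs at least one profile")
        if any(GROUP not in p.components for p in self.profiles):
            raise ValueError("every profile must carry the group component")
        if sum(PRIMARY in p.components for p in self.profiles) > 1:
            raise ValueError("at most one profile may be marked primary")

    def primary(self) -> Optional[Profile]:
        """Return the profile marked primary, or None if none is marked."""
        for p in self.profiles:
            if PRIMARY in p.components:
                return p
        return None
```

A client-side sketch would call `validate()` once after parsing and then try `primary()` first, falling back to the remaining profiles.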

DOORS & FT-CORBA • DOORS is a “Distributed OO Reliable Service” developed prior to FT-CORBA • Uses the service strategy to provide FT to CORBA objects • Patterns & mechanisms in DOORS were integrated into the FT-CORBA standard • DOORS implements most of the FT-CORBA standard • Focus on passive replication • Available as open source for non-commercial use from Lucent • Runs atop the TAO open-source real-time ORB • www.theaceorb.com

FT-CORBA Component Interaction 1. External object asks RM to set properties for replica group and create it 2. RM delegates replica creation to local factories 3. The local factories create CORBA objects 4. The local factories send the replica IORs to the RM for it to create the IOGR 5. The RM registers the IOGR with a CORBA Naming Service (NS) 6. The RM asks fault detectors to initiate fault monitoring of replicas 7. Clients contact the NS for IOGR 8. Client sends requests to the primary

Fault Detection and Recovery 1. Fault detector detects failure of primary 2. Detector propagates fault to Notifier 3. Notifier pushes fault to RM 4. RM promotes backup to primary 5. RM requests local factory to create a new backup and gets new IOR 6. RM creates new IOGR and informs all replicas of it 7. RM registers new IOGR with NS 8. Client sends request to old primary 9. Old primary answers with a LOCATION_FORWARD reply 10. Client sends request to new primary
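Steps 8 to 10 amount to a client-side retry loop. The sketch below models the LOCATION_FORWARD step as a Python exception purely for brevity; `Replica`, `LocationForward`, and `invoke` are hypothetical names, and a real ORB performs this retry transparently below the application.

```python
class LocationForward(Exception):
    """Hypothetical stand-in for a LOCATION_FORWARD reply naming the new target."""
    def __init__(self, new_target):
        super().__init__("forwarded")
        self.new_target = new_target

class Replica:
    def __init__(self, name):
        self.name = name
        self.is_primary = False
        self.forward_to = None  # set by the (simulated) RM after a promotion

    def handle(self, request):
        if not self.is_primary:
            raise LocationForward(self.forward_to)
        return f"{self.name} handled {request}"

def invoke(target, request, max_hops=3):
    """Follow forwards until a primary answers or the hop budget runs out."""
    for _ in range(max_hops):
        if target is None:
            break
        try:
            return target.handle(request)
        except LocationForward as fwd:
            target = fwd.new_target
    raise RuntimeError("no primary reachable")
```

The hop limit is a design choice of this sketch: it keeps a stale chain of forwards from looping forever.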

Optimization Opportunities to Improve Fault Tolerant CORBA Performance ORB Core Optimizations • Efficient IOGR parsing & connection establishment • Reliable handling & ordering of GIOP messages • Predictable behavior during transparent connection establishment & retransmission • Tracking requests with respect to the server object group CORBA Service Optimizations • Support for dynamic system configuration • Bounded recovery time • Minimize overhead of FT CORBA components

Effect of Polling Interval on Failure Detection Times Analysis • Failure detection time increases with the polling interval • Average failure detection time is half the polling interval Challenge • Choose a polling interval small enough for fast detection • While minimizing polling message overhead Fault detection time is measured as the time between the failure of a replica & the FaultDetector detecting that failure
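The "half the polling interval" observation can be checked with a short simulation. It assumes polls fire at exact multiples of the interval, a poll detects a dead replica instantly, and failures fall uniformly at random between polls; the function names are illustrative and do not come from the talk.

```python
import random

def detection_delay(failure_time, interval):
    """Time from a failure until the next poll, with polls at 0, T, 2T, ..."""
    polls_before = int(failure_time // interval)
    next_poll = (polls_before + 1) * interval
    return next_poll - failure_time

def mean_detection_delay(interval, trials=20_000, seed=0):
    """Average delay over failures placed uniformly at random in time."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        failure_time = rng.uniform(0.0, 1000.0 * interval)
        total += detection_delay(failure_time, interval)
    return total / trials

def poll_messages_per_second(replicas, interval):
    """Message cost of polling: one poll per replica per interval."""
    return replicas / interval
```

Halving the interval halves the mean delay but doubles `poll_messages_per_second`, which is the trade-off named under Challenge.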

Effect of Polling Interval on Recovery Time Analysis • Average failure detection time is half the polling interval • Replica Group Management time is constant Challenge • Minimize replica group management time Recovery Time = Failure detection time + Replica Group Management Time
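As a worked instance of the recovery-time formula, take a polling interval of 100 ms and a replica group management time of 30 ms. Both numbers are made up for illustration and are not measurements from the talk; the symbols are likewise introduced here.

```latex
\begin{align*}
T_{\text{recovery}} &= T_{\text{detect}} + T_{\text{rgm}} \\
E[T_{\text{detect}}] &= \tfrac{1}{2}\,T_{\text{poll}}, \qquad \max T_{\text{detect}} = T_{\text{poll}} \\
E[T_{\text{recovery}}] &= \tfrac{1}{2}(100\,\text{ms}) + 30\,\text{ms} = 80\,\text{ms} \\
\max T_{\text{recovery}} &= 100\,\text{ms} + 30\,\text{ms} = 130\,\text{ms}
\end{align*}
```

With the management time constant, the polling interval is the only term that sets the gap between average and worst-case recovery.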

Overview of Patterns • Patterns codify expert knowledge to help generate software architectures by capturing recurring structures & dynamics and resolving common design forces (www.posa.uci.edu/) • Design patterns capture the static & dynamic roles & relationships in solutions that occur repeatedly • Architectural patterns express a fundamental structural organization for software systems: they provide a set of predefined subsystems, specify their relationships, & include the rules and guidelines for organizing the relationships between them • Optimization principle patterns document rules for avoiding common design & implementation mistakes that degrade performance

Decoupling Polling and Recovery [Figure: Fault Detector polling thread, four replicas, & Fault Notifier; two elements labeled HANGS] Context • Periodic polling & recovery requests issued from the same polling thread can block that thread • Problem • Blocking can cause missed polls • Forces • Must guarantee polling of other objects while a recovery request is sent • Must minimize concurrency overhead • Solution • Apply the Leader/Followers or AMI architectural pattern
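A minimal sketch of the asynchronous option, assuming a thread pool stands in for CORBA AMI (this is not Leader/Followers, and not how DOORS or TAO implement it). `poll_all`, `is_alive`, and `notify` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor

def poll_all(replicas, is_alive, notify, executor):
    """One polling pass that never waits on the notifier.

    Failure reports are submitted to the executor, so a slow or hung
    notifier delays only its own report, not the polls that follow.
    """
    pending = []
    for replica in replicas:
        if not is_alive(replica):
            pending.append(executor.submit(notify, replica))
    return pending

def run_once(replicas, is_alive, notify, workers=2):
    """Convenience wrapper: poll once, then wait for all reports to finish."""
    with ThreadPoolExecutor(max_workers=workers) as executor:
        futures = poll_all(replicas, is_alive, notify, executor)
        return [f.result() for f in futures]
```

The pool size bounds the concurrency overhead, which speaks to the second force on the slide.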

Decoupling Recovery Initiation From Recovery Execution Context • Replication Manager serializes failure reports • Problem • Reduced responsiveness • Forces • Bounded amount of time for failure recovery irrespective of number of failure reports • Solution • Apply the Active Object design pattern
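The Active Object idea can be sketched with a request queue and one worker thread. The class and method names are hypothetical, and a single worker is a simplification: reporting a failure returns immediately with a future, while recovery runs on the object's own thread.

```python
import queue
import threading
from concurrent.futures import Future

class RecoveryActiveObject:
    def __init__(self, recover):
        self._recover = recover          # the actual recovery routine
        self._requests = queue.Queue()   # activation queue
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def report_failure(self, replica):
        """Proxy method: enqueue the request and return a future at once."""
        result = Future()
        self._requests.put((replica, result))
        return result

    def shutdown(self):
        """Stop the worker after the requests already queued have run."""
        self._requests.put(None)
        self._worker.join()

    def _run(self):
        """Scheduler loop: execute queued recovery requests in order."""
        while True:
            item = self._requests.get()
            if item is None:
                break
            replica, result = item
            try:
                result.set_result(self._recover(replica))
            except Exception as exc:
                result.set_exception(exc)
```

Callers that need the outcome can block on `future.result()`; callers that only report can drop the future.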

Supporting Interchangeable Behavior Context • FT properties can be set statically (as defaults) or set dynamically • Problem • Hard-coding properties makes the FT-CORBA design inflexible & non-extensible • Forces • Need highly extensible services that can be composed transparently from configurable properties • Solution • Apply the Strategy design pattern
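As an illustration of Strategy applied to one FT property, the sketch below makes the replication style a pluggable object. The two strategies and the `ObjectGroup` class are hypothetical simplifications; they only decide which replicas receive a request.

```python
from abc import ABC, abstractmethod

class ReplicationStrategy(ABC):
    @abstractmethod
    def targets(self, primary, backups):
        """Return the replicas that should receive a request."""

class PassiveReplication(ReplicationStrategy):
    def targets(self, primary, backups):
        return [primary]              # only the primary executes requests

class ActiveReplication(ReplicationStrategy):
    def targets(self, primary, backups):
        return [primary, *backups]    # every replica executes requests

class ObjectGroup:
    def __init__(self, primary, backups, strategy):
        self.primary = primary
        self.backups = list(backups)
        self.strategy = strategy      # can be swapped at run time

    def route(self, request):
        chosen = self.strategy.targets(self.primary, self.backups)
        return [(replica, request) for replica in chosen]
```

Changing `group.strategy` switches behavior without touching `ObjectGroup`, which is the extensibility the slide asks for.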

Consolidating Strategies Context • FT-CORBA implementations can have many properties • e.g., membership, replication, consistency, monitoring, number of replicas, etc. • Problem • Risk of combining semantically incompatible properties • Forces • Ensure semantically compatible properties • Simplify management of properties • Solution • Apply the Abstract Factory design pattern
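A small sketch of how an Abstract Factory can keep property combinations consistent: each concrete factory hands out one pre-vetted family. The factory names and the string values are made up for illustration and are not the property values defined by the standard.

```python
from abc import ABC, abstractmethod

class FTPropertyFactory(ABC):
    """Creates one family of mutually compatible FT properties."""

    @abstractmethod
    def replication_style(self): ...

    @abstractmethod
    def consistency_style(self): ...

    @abstractmethod
    def monitoring_style(self): ...

    def build(self):
        # Callers get the whole family at once and cannot mix families.
        return {
            "replication": self.replication_style(),
            "consistency": self.consistency_style(),
            "monitoring": self.monitoring_style(),
        }

class PassiveAppControlled(FTPropertyFactory):
    def replication_style(self): return "warm-passive"
    def consistency_style(self): return "application-controlled"
    def monitoring_style(self): return "pull"

class ActiveInfraControlled(FTPropertyFactory):
    def replication_style(self): return "active"
    def consistency_style(self): return "infrastructure-controlled"
    def monitoring_style(self): return "pull"
```

Adding a new valid combination means adding one factory, so the compatibility check lives in one place.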

Dynamic Configuration Context • There are many potential FT properties that can be used • Problem • Static configuration of properties is inflexible & overly resource intensive • Forces • The behavior of FT-CORBA properties should be decoupled from the time when they are actually configured • Solution • Apply the Component Configurator design pattern
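The Component Configurator idea, deferring the choice of implementation until run time, can be approximated in Python with `importlib`. This is a loose analogy to dynamic linking, with a hypothetical `Configurator` class and a `module:Class` spec format invented for the sketch.

```python
import importlib

def load_component(spec):
    """Resolve a 'package.module:ClassName' spec at run time and instantiate it."""
    module_name, _, class_name = spec.partition(":")
    module = importlib.import_module(module_name)
    return getattr(module, class_name)()

class Configurator:
    def __init__(self):
        self._components = {}

    def activate(self, name, spec):
        """Load a component, replacing any earlier one with the same name."""
        self.deactivate(name)
        self._components[name] = load_component(spec)

    def deactivate(self, name):
        """Drop a component if present; unknown names are ignored."""
        self._components.pop(name, None)

    def get(self, name):
        return self._components[name]
```

Only the components named in the active configuration are ever imported, which addresses the resource concern on the slide.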

Efficient Property Name-Value Lookup Context • FT-CORBA mandates a hierarchical lookup of properties based on strings • Property lookup is required during object group creation & recovery • Problem • Inefficient property lookup degrades QoS • Forces • Efficient lookups of properties guided by the order specified in the FT-CORBA standard • Solution • Use the Chain of Responsibility design pattern & Perfect Hashing optimizations
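The hierarchical lookup can be sketched as a chain of scopes, each answering from its own table or delegating to the next. The three scope names and the property values below are illustrative assumptions, and the ordinary Python `dict` is a general hash table standing in for a perfect hash.

```python
class PropertyScope:
    """One link in the chain: answer locally or pass the query on."""

    def __init__(self, name, properties, fallback=None):
        self.name = name
        self.properties = dict(properties)
        self.fallback = fallback

    def lookup(self, key):
        if key in self.properties:
            return self.properties[key]
        if self.fallback is not None:
            return self.fallback.lookup(key)
        raise KeyError(key)

# Hypothetical chain: most specific scope first, domain defaults last.
domain = PropertyScope("domain", {"MinimumNumberReplicas": 2, "ReplicationStyle": "passive"})
by_type = PropertyScope("type", {"MinimumNumberReplicas": 3}, fallback=domain)
group = PropertyScope("group", {}, fallback=by_type)
```

Here `group.lookup("MinimumNumberReplicas")` is answered by the type scope, while `group.lookup("ReplicationStyle")` falls through to the domain defaults.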

Research Directions • Middleware for Ad hoc/Wireless networks • FT CORBA enhancements for JINI-like systems • CORBA Pluggable Protocol for Bluetooth devices • Middleware enhancements for 3G wireless/mobile internet • Fault tolerance • Sequenced Initialization and Recovery (dealing with object dependencies) • Handling failure groups and collocated groups • Fault Escalation strategies and Fault Analysis • Growth/degrowth, runtime upgrades • QoS-enabled framework of middleware components • Higher level middleware framework shielding applications from lower level middleware • Multi-dimensional QoS support • Patterns-based architecture of plug & play components • Code generation tools for repetitive tasks

Concluding Remarks • Researchers & developers of distributed systems face common challenges, e.g.: • Connection management, service initialization, error handling, flow control, event demuxing, distribution, concurrency control, fault tolerance, synchronization, scheduling, & persistence • The application of patterns, frameworks, & components can help to resolve these challenges • Carefully applying these techniques can yield efficient, scalable, predictable, dependable, & flexible middleware & applications
