Dr. Aniruddha S. Gokhale gokhale@dre.vanderbilt (Co-Advisor)

Jaiganesh Balasubramanian jai@dre.vanderbilt.edu http://www.dre.vanderbilt.edu/~jai FLARe: a Fault-tolerant Lightweight Adaptive Real-time Middleware for Distributed Real-time and Embedded Systems Dr. Douglas C. Schmidt schmidt@dre.vanderbilt.edu (Advisor) Dr. Aniruddha S. Gokhale gokhale@dre.vanderbilt.edu (Co-Advisor) Department of Electrical Engineering and Computer Science Vanderbilt University, Nashville, TN, USA Middleware 2007 Doctoral Symposium (MDS 2007) Newport Beach, CA, USA

Focus: Distributed Real-time Embedded (DRE) Systems • Stringent simultaneous QoS demands, e.g., “never die,” soft real-time, etc. • predominantly stateless, tolerates weaker consistency if stateful • Distributed Object Computing middleware used to design and develop DRE systems • support for highly available systems (e.g., FT-CORBA) • end-to-end predictable behavior for requests (e.g., RT-CORBA) • Goal is to provide real-time fault tolerance to DRE systems • FT uses redundancy; RT assured by resource management

Determining the Replication Scheme for DRE Systems • Active replication • client requests multicast and executed at all the replicas • strong state consistency • deterministic behavior of replicas • very fast recovery • resource-expensive • Passive replication • low resource/execution overhead • better suited for weaker consistency • no restrictions on deterministic behavior • enables making tradeoffs between FT and resource consumption • applies to a class of soft real-time DRE systems • Passive replication better suited for our purpose • Goal is to provide RT+FT for DRE systems using passive replication

Challenges: Using Passive Replication for DRE Systems • Challenge 1: Maintain real-time performance of applications at all times • Focus: Real-time performance after failover • Decision-making algorithms used for electing a new primary • Client response times depend on the loads of the processor hosting the failover target • Task deadlines are met if the CPU utilization is under a threshold • Failure could affect multiple clients – failover to multiple processors

Challenges: Using Passive Replication for DRE Systems • Challenge 2: Fast failover on client side • Focus: Faster and predictable failover • Client-side middleware could maintain static list of references • Round-robin approach of trying out different references • Faster failover – but not appropriate failover • No RT guarantee after failover • Client-side middleware need to be updated with references based on dynamic operating conditions

Challenges: Using Passive Replication for DRE Systems • Challenge 3: FT+RT in spite of resource overloads • Focus: Dynamic reconfigurations and overload management • Long running systems – continued operation through many failures • Periodic loss of resources – simultaneous failures • Graceful degradation of applications • Operate higher priority applications at all times • Overload management – predictable and fast • Alternate degraded and assured functionality

Challenges: Using Passive Replication for DRE Systems • Challenge 4: Resource-aware stateful replication • Focus: State consistency in stateful DRE systems • State transfer requires CPU and network reservations • Support for resource-constrained operations • Different consistency models – strong, weak, and no consistency • Adapt consistency of certain tolerant applications depending on available resources • Utility optimizations – better state consistency for higher priority applications

Our Approach: FLARe RT-FT Middleware • FLARe =Fault-tolerant Lightweight Adaptive Real-time Middleware • Transparent and Fast Failover • Redirection using client-side portable interceptors • catches COMM_FAILURE exceptions and transparently throws LOCATION_FORWARD exceptions • Failure detection can be improved with better protocols – e.g., SCTP

Our Approach: FLARe RT-FT Middleware • Real-time performance after failover • monitor CPU utilizations at hosts where backups are deployed • adaptive failover target selection algorithms operated by a resource manager • failover targets chosen on the least loaded host hosting the backups • better chance to provide RT performance

Our Approach: FLARe RT-FT Middleware • Predictable failover • failover target decisions computed periodically by the resource manager • conveyed to client-side middleware agents – forwarding agents • agents work in tandem with portable interceptors • redirect clients quickly and predictably to appropriate targets • agents periodically/proactively updated when targets change

Current Progress • Current Progress • Initial prototype of FLARe developed using The ACE ORB (TAO) • Stateless FT using passive replication • Implemented a resource-aware adaptive failover target selection algorithm • Compared and contrasted the performance of the FT middleware when using static failover strategies versus adaptive failover strategies • Significant reduction in client response times and system utilization FLARe is open-source and available at www.dre.vanderbilt.edu

Proposed Research and Expected Milestones • Overload management • investigate overload management algorithms that do not degrade application QoS • minimum client disturbance • implemented within the resource manager • extreme resource constrained operating conditions – investigate opportunities to change implementations and reduce overloads • Utility optimizations – when to degrade QoS and when not to • RT/FT trade-offs Deadline : March 2008

Proposed Research and Expected Milestones • State Synchronization • View state synchronization as an aperiodic scheduling problem • more slack available – more time available to synchronize state • slack devoted for higher priority applications always • availability of slack – support for different consistency management schemes (e.g., weak, strong, none) • application informs middleware when to synchronize state Deadline : July 2008

Proposed Research and Expected Milestones • Network Reservations • View real-time fault-tolerance as an end-to-end scheduling problem • network reservations are required for state transfers • without reservations, no predictability • middleware-mediated mechanisms to use external network QoS mechanisms such as DiffServ • network monitors for alternate routes in the presence of failures (leverage existing network research) Deadline : September 2008

Concluding Remarks • Passive replication – a promising approach for DRE systems • Resource-aware adaptive fault-tolerance – required for adapting passive replication for DRE system requirements • Adaptive algorithms required for trading off RT versus FT requirements • Middleware transparently supports FT for applications – works in conjunction with adaptive algorithms to take care of RT requirements as well FLARe is open-source and available at www.dre.vanderbilt.edu

Dr. Aniruddha S. Gokhale gokhale@dre.vanderbilt (Co-Advisor)