DRAFTS Distributed Real-time Applications Fault Tolerant Scheduling

Presentation Transcript

  1. DRAFTS: Distributed Real-time Applications Fault Tolerant Scheduling. Claudio Pinello (pinello@eecs.berkeley.edu)

  2. Motivation
     • Drive-by-Wire applications

  3. Motivation
     • No rods → increased passive safety
     • Interior design freedom
     BMW, Daimler, Citroën, Chrysler, Bertone, SKF, etc.

  4. Problem Overview
     • Safety: system failure must be as unlikely as in traditional systems
     • Fault tolerance: redundancy is key

  5. Faults
     • SW faults: bugs
       • can be reduced by disciplined coding
       • even better by code generation
     • HW faults
       • harsh environment
       • many units (>50 microprocessors in a car; subsystems with 10-15 microprocessors)

  6. Fault Model
     • Silent Faults
       • faults result in omission errors
     • Detectable Faults
       • faults result in detectably corrupted data (e.g. CRC-protected channels)
     • Non-silent Faults
       • faults result in value errors
     • Byzantine Faults
       • malicious attacks, non-silent faults, unbounded delays, etc.

  7. Software Redundancy
     • Space redundancy
       • execute replicas on different HW
       • send results on different/multiple channels

  8. Solution
     • N-copies. Pros: design once. Cons: N× cost, 1× speed
     • Multiple designs. Pros: reduced cost. Cons: degradation, 1× speed
     [Figure: replicated data-flow diagrams built from the blocks Plant, Abstract input, CoarseCTRL, FineCTRL, ArbiterBest, AbstractOut and Iterator]

  9. Redundancy Management
     • Managing a distributed system with multiple results requires careful programming
       • keep N-copies synchronized
       • exchange and apply results
       • detect and isolate faults
       • recover

  10. Possible solutions
      • Off-the-shelf solutions: TTP-based architectures, FT-CORBA middleware
      • Synthesis
      • Debugged and portable libraries
      • Development tools

  11. Automotive Domain
      • Production costs dominate NRE costs
        • multi-vendor supply chain
        • interest in full utilization of architectures
      • Validation and certification are critical
        • validate process
        • validate product

  12. Shortcomings of OTS solutions
      • TTP
        • proprietary communication network
        • network redundancy default is 2-way
        • active replication → potential underutilization of resources
      • FT CORBA
        • middleware with fairly large overhead

  13. Synthesis-based Solution
      • Synthesize only the needed glue code
        • at the extreme: get rid of the OS
      • Customizable replication mechanisms
        • use passive replicas
      • Treat the architecture as a distributed execution machine
        • exploit parallelism to speed up execution

  14. Schedule Synthesis
      [Figure: mapping of the application data-flow graph (Plant, Sens, Input, Abstract input, CoarseCTRL, FineCTRL, ArbiterBest, AbstractOut, Output, Act, Iterator) onto a multi-CPU architecture]

  15. Synthesis-based Solution
      • Enables fast architecture exploration

  16. Contributions
      • Programming Model
      • Metropolis platform
      • Schedule synthesis tool and optimization strategy
      • Verification Tools

  17. Programming Model
      • Definition of a programming model that
        • is amenable to specifying feedback controllers
        • is convenient for analysis, simulation and synthesis
        • supports degraded functionality/accuracy
        • supports redundancy
        • is deterministic

  18. Static Data-flow Model
      • Pros:
        • Deterministic behavior: actors perform deterministic computation (no internal states)
        • Requires all inputs to fire an actor
        • Explicit parallelism
        • Good for periodic algorithms
      • Shortcomings:
        • Requires all inputs to fire an actor, but source actors may fail!
      [Figure: data-flow graph with actors A, B and C]

  19. Pendulum Example
      [Figure: data-flow graph with blocks Plant, Abstract input, CoarseCTRL, FineCTRL, ArbiterBest, AbstractOut and Iterator; controller labels "Bang-Bang" and "Linear"]

  20. Model Extensions
      • Node Criticality
      • Node Typing (sensor, input, arbiter, etc.)
        • Some types (input and arbiter) can fire with missing inputs
      • Tokens have “Epoch” and “Valid” fields
      • Specialized single-place buffer links
        • manage redundant sources (and destinations)

  21. Data Tokens: Epoch
      • Epoch is the iteration index of the periodic algorithm
      • Actors ask for “current” inputs
      • Comparing epochs with >= accounts for missing results (self-synchronization)
      [Figure: token layout with fields Data, Epoch, Valid]

  22. Data Tokens: Valid
      • Valid models the effect of fault detection:
        • True: data was received/produced correctly
        • False: data was not received on time or was corrupted
      • Firing rules (and actors) may use it to change their behavior
      [Figure: token layout with fields Data, Epoch, Valid]
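
The Epoch and Valid fields can be illustrated with a minimal Python sketch. The names `Token`, `usable` and `arbiter_fire`, and the rule "forward the first usable input", are assumptions made for illustration; they are not taken from the FTDF library.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Token:
    data: float
    epoch: int   # iteration index of the periodic algorithm
    valid: bool  # False if the data was late or detectably corrupted


def usable(token: Optional[Token], epoch: int) -> bool:
    """True if the token is valid and at least as recent as `epoch`."""
    return token is not None and token.valid and token.epoch >= epoch


def arbiter_fire(inputs: Sequence[Optional[Token]], epoch: int) -> Token:
    """Hypothetical arbiter rule: forward the first usable input.

    Inputs are assumed to be listed in order of preference, e.g. the
    fine controller before the coarse one. Missing inputs are None.
    """
    for token in inputs:
        if usable(token, epoch):
            return Token(token.data, epoch, True)
    # No usable input: emit an invalid token so downstream actors can react.
    return Token(0.0, epoch, False)
```

Because the arbiter tolerates missing or invalid inputs, it can still fire when the preferred source has failed, which is the behavior the model extensions allow for input and arbiter nodes.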

  23. FTDataFlow modeling
      • Metropolis used as the framework to develop the set of tools
      • FTDF is a platform library in Metropolis
        • modeling, simulation, fault injection
        • supports semi-automatic replication
        • results visualization

  24. Actor Classes
      • DF_SENactor: sensor actor
      • DF_INactor: input actor
      • DF_AINactor: abstract input actor
      • DF_FUNactor: data-flow actor
      • DF_ARBactor: arbiter actor
      • DF_AOUTactor: abstract output actor
      • DF_OUTactor: output actor
      • DF_ACTactor: actuator actor
      • DF_MEM: state memory
      • DF_Injector: fault injection

  25. Pendulum Example
      [Figure: pendulum data-flow graph (Plant, Abstract input, CoarseCTRL, FineCTRL, ArbiterBest, AbstractOut, Iterator) with an added fault-injection block labeled "Inject"]

  26. Simulation output
      [Figure: simulation output with a "Fault" annotation]

  27. Summary on FTDF
      • Extended SDF to deal with
        • missing/redundant inputs
        • different criticality
        • functionality types
      • Developed Metropolis platform
        • modeling, simulation, fault injection, visualization of results
        • support for adding redundancy

  28. Architecture Model
      • Architecture connectivity: bipartite graph
      • Computation and communication times: actor/CPU and data/channel matrices of execution and transmission times
      • Same as the SynDEx model
      [Figure: architecture graph with several CPUs]

  29. Fault Behavior
      • Failure patterns
        • Subsets of the architecture graph that may fail simultaneously
      • For each failure pattern, specify a criticality level
        • i.e. which functionalities must be guaranteed
        • typically, for the empty failure pattern, all functionality must be guaranteed
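
As a rough illustration, the fault behavior can be written down as a mapping from failure patterns to the functionalities that must survive them. The sketch below is an assumption-laden example: the resource names (`CPU0`, `CH0`, ...), the two-channel topology and the helper `surviving` are hypothetical and not part of the presentation's tools.

```python
from typing import Dict, FrozenSet, Set

# Bipartite architecture graph: each CPU maps to the channels it is attached to.
# Three CPUs and two channels are assumed here purely for illustration.
ARCH: Dict[str, Set[str]] = {
    "CPU0": {"CH0", "CH1"},
    "CPU1": {"CH0", "CH1"},
    "CPU2": {"CH0", "CH1"},
}

# Fault behavior: failure pattern -> functionalities that must be guaranteed.
# The empty pattern requires everything; any single fault may drop FineCTRL.
FAULT_BEHAVIOR: Dict[FrozenSet[str], Set[str]] = {
    frozenset(): {"CoarseCTRL", "FineCTRL"},
    **{
        frozenset({resource}): {"CoarseCTRL"}
        for resource in ("CPU0", "CPU1", "CPU2", "CH0", "CH1")
    },
}


def surviving(arch: Dict[str, Set[str]], pattern: FrozenSet[str]) -> Dict[str, Set[str]]:
    """Return the architecture graph with the failed CPUs and channels removed."""
    return {
        cpu: channels - pattern
        for cpu, channels in arch.items()
        if cpu not in pattern
    }
```

A scheduler can then be run once per entry of `FAULT_BEHAVIOR`, on `surviving(ARCH, pattern)`, with only the listed functionalities marked as mandatory.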

  30. Synthesis Problem
      • Given
        • Application
        • Architecture
        • Fault Behavior
      • Derive
        • Redundancy
        • Schedule
      [Figure: mapping of the application data-flow graph (Plant, Sens, Input, Abstract input, CoarseCTRL, FineCTRL, ArbiterBest, AbstractOut, Output, Act, Iterator) onto a multi-CPU architecture]

  31. Pendulum Example
      • Actuator/Sensor location
      • Tolerate any single fault
        • {empty}: all functionality
        • {one CPU}: may drop FineController, and the sensor/actuator on that CPU
        • {one Channel}: may drop FineController
      [Figure: three CPUs with attached sensors (Sens) and actuators (Act)]

  32. Refined I/O
      [Figure: pendulum data-flow graph with refined I/O: Plant, Sens, Input, ArbiterBest, CoarseCTRL, FineCTRL, Output, Act, Iterator]

  33. Full Replication
      [Figure: fully replicated pendulum data-flow graph with multiple copies of Sens, Input, CoarseCTRL, ArbiterBest, Output, Act and Iterator, plus FineCTRL and Plant]

  34. Simulation output
      [Figure: simulation output]

  35. Schedule Synthesis Strategy
      • Leverage existing dataflow scheduling tools (e.g. SynDEx) to achieve a distributed static schedule that is also fault-tolerant
      • At design time (off-line)
        • devise a redundant schedule
      • At run-time
        • trivial reconfiguration: skip actors that cannot fire

  36. Generating Schedules
      • Full architecture
        • Schedule all functionalities
      • For each failure pattern
        • Mark the faulty architecture components (critical functionalities cannot run there)
        • Schedule all functionalities
      • Merge the schedules
      Callouts: "Maximum performance", "Add redundancy"

  37. Generating Schedules
      • Full architecture
        • Schedule all functionalities
      • For each failure pattern
        • Mark the faulty architecture components
        • Schedule the critical functionalities
      • Merge the schedules
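
The procedure on this slide can be sketched in a few lines of Python. This is only an outline under strong assumptions: `greedy_schedule` is a hypothetical stand-in for a real scheduler such as SynDEx, it balances actor counts instead of using execution times, and channels and communication are ignored entirely.

```python
from typing import Dict, FrozenSet, Iterable, List, Set


def greedy_schedule(actors: Iterable[str], cpus: Iterable[str]) -> Dict[str, str]:
    """Placeholder scheduler: put each actor on the currently least-loaded CPU."""
    load = {cpu: 0 for cpu in sorted(cpus)}
    if not load:
        raise ValueError("no healthy CPU left to schedule on")
    mapping: Dict[str, str] = {}
    for actor in actors:
        cpu = min(load, key=lambda c: (load[c], c))
        mapping[actor] = cpu
        load[cpu] += 1
    return mapping


def synthesize(
    actors: List[str],
    critical: Set[str],
    cpus: List[str],
    failure_patterns: List[FrozenSet[str]],
) -> Dict[str, Set[str]]:
    """Merge per-pattern schedules into one redundant actor -> CPUs mapping."""
    merged: Dict[str, Set[str]] = {actor: set() for actor in actors}
    # Step 1: full architecture, all functionalities.
    for actor, cpu in greedy_schedule(actors, cpus).items():
        merged[actor].add(cpu)
    # Step 2: for each failure pattern, schedule only the critical actors
    # on the CPUs that are not marked faulty.
    for pattern in failure_patterns:
        healthy = [cpu for cpu in cpus if cpu not in pattern]
        critical_actors = [a for a in actors if a in critical]
        for actor, cpu in greedy_schedule(critical_actors, healthy).items():
            # Step 3: merging here is a plain union of placements.
            merged[actor].add(cpu)
    return merged
```

With three CPUs and all single-CPU failure patterns, every critical actor ends up with a replica outside each failed CPU, while a non-critical actor such as `FineCTRL` keeps its single placement from the full-architecture schedule.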

  38. Merge into FTS
      • Care must be taken to deal with multiple routings; the merged result is clearly not optimal
      [Figure: two schedules merged into one fault-tolerant schedule (FTS). Nodes: Sensor1, Sensor2, Input receiver (requires 1), Function1 (required), Function2 (optional), Arbiter, Output driver (requires 1), Actuator1, Actuator2, each mapped to ECU0 or ECU1]

  39. Heuristic 1: Limit CPU Load
      • Full architecture
        • Schedule all functionalities
      • For each failure pattern
        • Mark the faulty architecture components (critical functionalities cannot run there)
        • Re-schedule only critical functionalities (constrain non-critical ones as in the full architecture)
      • Merge the schedules
      Callout: "Redundancy for critical only"

  40. Heuristic 2: Limit Bus Load
      • Prune redundant communication
      Heuristic 3: Passive replicas (limit CPU load)
      [Figure: fault-tolerant schedule on ECU0 and ECU1 with replicated Input receiver (requires 1), Function1 (required), Arbiter and Output driver (requires 1); Function2 (optional) only on ECU0; Sensor1 and Actuator1 on ECU0, Sensor2 and Actuator2 on ECU1]

  41. Total Orders
      • For each processor and for each channel, find a total order that is compatible with the partial order of the FTS
      • Prototype: “any compatible total order”
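
One way to obtain "any compatible total order" is to take a single topological order of the whole FTS and restrict it to each resource. The sketch below uses `graphlib.TopologicalSorter` from the Python standard library (Python 3.9+); the function name `total_orders` and the node and resource names are illustrative assumptions, not the prototype's actual implementation.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set


def total_orders(
    precedence: Dict[str, Set[str]],
    placement: Dict[str, str],
) -> Dict[str, List[str]]:
    """Derive one total order per resource from a partial order.

    precedence maps each node to the set of nodes that must run before it.
    placement maps each node to the processor or channel it is assigned to.
    """
    global_order = TopologicalSorter(precedence).static_order()
    per_resource: Dict[str, List[str]] = {}
    for node in global_order:
        per_resource.setdefault(placement[node], []).append(node)
    return per_resource


# Example: a simple chain split across two ECUs.
precedence = {
    "Input": {"Sensor1"},
    "Function1": {"Input"},
    "Arbiter": {"Function1"},
    "Output": {"Arbiter"},
}
placement = {
    "Sensor1": "ECU0",
    "Input": "ECU0",
    "Function1": "ECU0",
    "Arbiter": "ECU1",
    "Output": "ECU1",
}
orders = total_orders(precedence, placement)
```

Any subsequence of a topological order still respects the precedence constraints, so each per-resource list is compatible with the partial order. No attempt is made here to pick the order that minimizes iteration time.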

  42. Schedule optimization
      • Exploit architectural redundancy as a performance boost (in the absence of faults)
        • replica overloading and deallocation
        • passive replicas
        • graceful degradation: reduced functionality (and resource demands) under faults

  43. Active Replicas
      [Figure: a behavior graph with actors A, B, C, D; an architecture with two CPUs (P1, P2); and the active-replication schedule, with each actor replicated on P1 and P2]

  44. Deallocation & Degradation
      [Figure: a behavior graph with actors A, B, C, D; an architecture with two CPUs; and the schedule after deallocation, with messages B->D and C->D. Other labels: P1, P2, C1, C2, K, P]

  45. Aggressive Heuristics
      • Some heuristics can be certified not to break fault tolerance/fault behavior
      • Others may need verification of the results
        • e.g. human inspection and modification

  46. (Off-line) Verification
      • Functional Verification
        • For each failure pattern, the corresponding functionality is correctly executed
      • Timing Verification/Analysis
        • Worst-case iteration time under each fault

  47. Functional Verification
      • Apply equivalence checking methods to the FT schedule, under all fault scenarios (failure patterns)
      • Based on the application DAGs and the architecture graph
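
A much weaker check than the equivalence checking described on this slide, but a useful first filter, is replica coverage: for every failure pattern, each required functionality must keep at least one replica on a surviving resource. The sketch below assumes a schedule given as a functionality-to-ECUs mapping; `verify_coverage` and its inputs are hypothetical and are not the Perl tool mentioned in the presentation.

```python
from typing import Dict, FrozenSet, List, Set, Tuple


def verify_coverage(
    replicas: Dict[str, Set[str]],
    fault_behavior: Dict[FrozenSet[str], Set[str]],
) -> List[Tuple[FrozenSet[str], str]]:
    """Return (failure pattern, functionality) pairs left without a replica."""
    violations: List[Tuple[FrozenSet[str], str]] = []
    for pattern, required in fault_behavior.items():
        for functionality in sorted(required):
            survivors = replicas.get(functionality, set()) - pattern
            if not survivors:
                violations.append((pattern, functionality))
    return violations


# Example: Function1 is required under every fault, Function2 only when fault-free.
fault_behavior = {
    frozenset(): {"Function1", "Function2"},
    frozenset({"ECU0"}): {"Function1"},
    frozenset({"ECU1"}): {"Function1"},
}
good = {"Function1": {"ECU0", "ECU1"}, "Function2": {"ECU0"}}
bad = {"Function1": {"ECU0"}, "Function2": {"ECU0"}}
```

The check does not examine data dependencies or routing, so a schedule that passes it can still fail the DAG-based verification.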

  48. Functional Verification (example, continued)
      • For the full-functionality case, the arbiter must include both functions.
      • The output function only requires that one of the actuators be visible.
      • In the other graphs (which include failures), the arbiter only needs the single required input (Function1).
      [Figure: the merged schedule on ECU0 and ECU1 compared against two task graphs, "Task Graph: Actuator1" and "Task Graph: Actuator2". Each task graph contains Sensor1, Sensor2, Input receiver (requires 1), Function1 (required), Function2 (optional), Arbiter, Output driver (requires 1) and the corresponding actuator]
      Source: Sam Williams

  49. Functional Verification comments
      • Takes milliseconds to run on small cases and a few minutes on large schedules
      • The tool was written in Perl (performance was sufficient)
      • Schedule verification is performed offline (not time critical)
      • Credits: Sam Williams

  50. Conclusions
      • Contributions
        • Programming Model FTDF
        • Metropolis platform
        • Schedule synthesis tool (in collaboration with INRIA)
        • Schedule optimization strategy
        • Functional verification (in collaboration with Sam Williams)
        • Replica determinism analysis (not shown here)