The RAPIDS Project

Israel Koren and C. Mani Krishna, Architecture and Real-time Systems (ARTS) Lab, Dept. of Electrical and Computer Engineering, University of Massachusetts, Amherst MA

Presentation Transcript


  1. The RAPIDS Project
     Israel Koren, C. Mani Krishna
     Architecture and Real-time Systems (ARTS) Lab, Dept. of Electrical and Computer Engineering, University of Massachusetts, Amherst MA

  2. Our Goal
     RAPIDS: to develop a tool that aids in the analysis and enhancement of the performability of real-time systems
     • Performability = performance + reliability
     • Measure performance through detailed monitoring
     • Measure reliability through fault injection and monitoring of fault recovery
     • Provide a framework to explore various configurations within the scope of the available resources
     • Provide an experimental testing capability for REE
     JPL kickoff meeting 2000

  3. Why Is It Important?
     • Real applications are run on real hardware
     • Monitoring the application closely will expose:
       • performance bottlenecks
       • recovery bottlenecks
       • design errors
     • Better understanding of how the application works
     • Making the application more efficient
     It can help us further understand the interaction between hardware and software.

  4. Wait, There’s More!!
     • The combined capabilities of fault injection and recovery monitoring help test applications thoroughly
     • The designer can experiment with various configurations and system parameters until the required performance is obtained
     • More aggressive designs can be implemented
     • Investigators can be assured of the performance, dependability and availability of the REE flight system
     Performability validation helps build confidence.

  5. The Current Version
     RAPIDS 3.0: The Simulator
     • A simulation testbed for evaluating real-time algorithms and software
     • Users can specify:
       • topology and network protocol
       • task set to be run on the system
       • type of (fault) environment
       • various algorithms (allocation, scheduling and fault recovery)
     • Output: the number of deadline misses (among many other results) for the duration of the mission
     • RAPIDS 3.0 has already been installed at JPL
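
A minimal sketch of how a deadline-miss count like the one RAPIDS 3.0 reports can be computed. It is not the RAPIDS simulator: it assumes a single processor, preemptive earliest-deadline-first (EDF) scheduling, periodic tasks whose deadline equals their period, integer time steps, and that a job still unfinished at its deadline is counted once and dropped. `Task` and `count_deadline_misses` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    name: str
    period: int  # release interval; also the relative deadline (implicit-deadline assumption)
    wcet: int    # execution time of every job, in the same integer time units


def count_deadline_misses(tasks, horizon):
    """Simulate preemptive EDF on one processor for `horizon` time units and count missed deadlines."""
    jobs = []    # active jobs as [absolute_deadline, remaining_time]
    misses = 0
    for t in range(horizon):
        # A job that has reached its deadline unfinished is a miss; drop it (assumed policy).
        misses += sum(1 for deadline, _ in jobs if deadline <= t)
        jobs = [job for job in jobs if job[0] > t]
        # Release one new job per task at each multiple of its period.
        for task in tasks:
            if t % task.period == 0:
                jobs.append([t + task.period, task.wcet])
        # EDF: give this time unit to the job with the earliest absolute deadline.
        if jobs:
            current = min(jobs, key=lambda job: job[0])
            current[1] -= 1
            if current[1] == 0:
                jobs = [job for job in jobs if job is not current]
    # Jobs due by the end of the horizon that never finished are also misses.
    misses += sum(1 for deadline, _ in jobs if deadline <= horizon)
    return misses
```

For example, tasks with (period, execution time) of (4, 2) and (6, 3) have a total utilization of 1.0 and miss no deadlines over their 12-unit hyperperiod, while (2, 2) and (4, 1) overload the processor and miss one deadline in the first 4 time units.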

  6. The Next Version
     RAPIDS 4.0 = Emulator + RAPIDS 3.0
     [Figure: block diagram of RAPIDS 4.0. Labels: Configuration Parameters, Algorithms, Simulator, Simulator Configuration, Emulator, Emulator Configuration, Additional Parameters.]

  7. The Emulator Design
     • Two phases of development:
       • Phase I: Monitoring
         • The MPI Wrapper
         • The Monitoring Modules
         • The Graphical User Interface
       • Phase II: Control
         • Configurability (allocation and scheduling algorithms)
         • Fault Injector
         • Synthetic Workload
     The tool should not significantly interfere with the working of the application.

  8. The Process Model
     [Figure: process model showing a Main Display Node (IGP, Main Control Module, GUI) and several Application Nodes. Legend: Application Task, MPI Wrapper, Local Control Module, Monitoring Channel, Control Channel, IGP = Info Gathering Process.]

  9. The MPI Wrapper
     • All applications are assumed to use MPI for communication
     • Important MPI calls are wrapped with a system call that sends relevant information to the display node
     • The lightweight wrapper minimizes:
       • system overhead due to extra system calls
       • network overhead due to extra messages
     • Evaluating the overhead is important
     • The applications will now use the RAPIDS-MPI library
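
The wrapping idea can be sketched in Python as a decorator that forwards the call and then reports it. This only illustrates the interception pattern and is not the RAPIDS-MPI library: `monitored`, `send` and `monitor_channel` are hypothetical, and a local queue stands in for the message sent to the display node. In C, a common hook for this is the profiling interface (PMPI) defined by the MPI standard, where a library defines `MPI_Send` itself and forwards to `PMPI_Send`.

```python
import functools
import time
from queue import Queue

# Stand-in for the monitoring channel to the display node (assumption: a real
# wrapper would send a message over the network instead of queuing locally).
monitor_channel = Queue()


def monitored(call_name):
    """Wrap a communication call so that each invocation is reported after it completes."""
    def decorate(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            result = fn(*args, **kwargs)  # the original call does the real work
            # Keep the record small to limit the extra system and network overhead.
            monitor_channel.put({
                "call": call_name,
                "args": args,
                "elapsed_s": time.perf_counter() - start,
            })
            return result
        return wrapper
    return decorate


@monitored("MPI_Send")
def send(buf, dest, tag):
    """Hypothetical stand-in for an MPI send; returns the number of bytes 'sent'."""
    return len(buf)
```

Because the application keeps calling the same function name, no application source changes are needed beyond linking against the wrapped version, which mirrors the slide's point that applications simply switch to the RAPIDS-MPI library.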

  10. Monitoring Modules
      • IGP: Information Gathering Process
        • one IGP per application
        • collects monitoring messages from the subtasks through MPI calls
        • forwards messages to the main module through IPC
      • MCM: Main Monitoring and Control Module
        • accepts input from the user through the GUI
        • spawns the IGPs and executes the appropriate “mpirun” commands
        • collects and displays all monitoring information
      • The IGPs and the MCM are located in the Main Display Node (MDN), separate from the nodes running the applications
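
The division of labour between the IGPs and the MCM can be sketched with threads and in-memory queues standing in for the separate processes, MPI messages and IPC used in RAPIDS. The function names, the `STOP` sentinel and the message format are hypothetical.

```python
import queue
import threading

STOP = None  # hypothetical sentinel marking the end of an application's monitoring stream


def igp(app_name, inbox, mcm_queue):
    """IGP sketch: forward every monitoring message of one application to the MCM, tagged with its name."""
    while True:
        msg = inbox.get()
        mcm_queue.put((app_name, msg))
        if msg is STOP:
            return


def mcm(app_inboxes):
    """MCM sketch: start one IGP per application and gather all forwarded messages."""
    mcm_queue = queue.Queue()  # stands in for the IPC channel between IGPs and MCM
    threads = [threading.Thread(target=igp, args=(name, inbox, mcm_queue))
               for name, inbox in app_inboxes.items()]
    for th in threads:
        th.start()
    collected = {name: [] for name in app_inboxes}
    finished = 0
    while finished < len(app_inboxes):
        name, msg = mcm_queue.get()
        if msg is STOP:
            finished += 1
        else:
            collected[name].append(msg)  # a real MCM would update the GUI here
    for th in threads:
        th.join()
    return collected
```

Messages from a single application keep their order, because each IGP forwards its own inbox first-in, first-out; messages from different applications may interleave arbitrarily.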

  11. The GUI
      • Provides a detailed pictorial view showing:
        • allocation of the application's subtasks to nodes
        • start and end of each task instance
        • messages sent and received during execution
        • checkpointing epochs
        • faults injected and recovery actions taken
      • The user can choose from various levels of monitoring
      • Extra monitoring handlers allow the user to display other important events or values of key variables
      • The extra handlers are part of the RAPIDS-MPI library

  12. Enhanced GUI & Apps
      • The Enhanced GUI will provide further flexibility:
        • the user can select specific variables and events to be monitored through an easy-to-use interface
        • the display of certain events/variables can be user-defined
      • Two REE applications are being used:
        • OTIS
        • NGST
      • Both have been successfully ported and run

  13. Control Parameters
      • The user can analyze the impact of various system parameters and determine their appropriate values
      • Selectable system configuration parameters:
        • application(s) to be run
        • number of subtasks for each application
        • subset of nodes on which to run the applications
        • task allocation: manual or algorithm-based
        • scheduling of tasks on application nodes, depending on the operating system used
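
As one example of algorithm-based allocation, the sketch below applies first-fit decreasing to subtask utilizations (execution time divided by period). This is a common textbook heuristic chosen for illustration, not necessarily one of the algorithms RAPIDS offers; the function name and the utilization-only model are assumptions. A per-node capacity of 1.0 corresponds to the EDF schedulability bound for independent periodic tasks whose deadlines equal their periods.

```python
def first_fit_decreasing(utilizations, num_nodes, capacity=1.0):
    """Assign subtasks to nodes using first-fit decreasing on utilization.

    utilizations: dict mapping subtask name -> utilization (execution time / period).
    Returns a dict mapping subtask name -> node index.
    Raises ValueError if some subtask fits on no node under `capacity`.
    """
    load = [0.0] * num_nodes
    placement = {}
    # Place the heaviest subtasks first; ties keep their input order (sorted is stable).
    for task, u in sorted(utilizations.items(), key=lambda kv: kv[1], reverse=True):
        for node in range(num_nodes):
            if load[node] + u <= capacity:
                load[node] += u
                placement[task] = node
                break
        else:
            raise ValueError(f"subtask {task!r} does not fit on any node")
    return placement
```

A manual allocation would simply be such a mapping written by hand; the algorithmic variant is useful when the number of subtasks or nodes changes between experiments.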

  14. Task Parameters
      • The user can specify the period of each subtask
      • Synthetic workloads can be used to emulate applications that are unavailable
      • The user can specify synthetic tasks through:
        • a detailed user interface
        • a previously generated task trace
      • Workload surges can be emulated
        • the ability to handle load surges is another measure of dependability
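
A synthetic workload with a surge could be generated along these lines. The sketch assumes strictly periodic subtasks released from time 0 and models a surge as a time window in which execution times are multiplied by a factor; `synthetic_trace`, `Job` and this surge model are hypothetical. The output is a sorted task trace of the kind that could be saved and replayed later.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    task: str
    release: int    # release time
    exec_time: int  # execution demand of this job


def synthetic_trace(periods, exec_times, horizon, surge=None):
    """List the jobs released by periodic synthetic subtasks in [0, horizon).

    periods, exec_times: dicts mapping subtask name -> period / execution time.
    surge: optional (start, end, factor); jobs released in [start, end) have
    their execution time multiplied by `factor` to emulate a load surge.
    """
    trace = []
    for name, period in periods.items():
        for release in range(0, horizon, period):
            demand = exec_times[name]
            if surge is not None and surge[0] <= release < surge[1]:
                demand *= surge[2]
            trace.append(Job(name, release, demand))
    # Order by release time, breaking ties by subtask name, so the trace is replayable.
    trace.sort(key=lambda job: (job.release, job.task))
    return trace
```

For instance, a subtask with period 5 and execution time 2, with a threefold surge from time 10 to 20, produces jobs at times 0, 5, 10 and 15 with demands 2, 2, 6 and 6.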

  15. Fault Parameters
      • Type of fault:
        • register faults
        • memory faults
        • I/O device failures
        • network faults: message corruption, message delay/loss
      • Time and duration of fault:
        • wall-clock time
        • stochastic (based on a distribution)
      • The selectable parameters are determined by the fault injector
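
For the stochastic option, fault times could be drawn as in this sketch, which assumes faults arrive as a Poisson process (exponentially distributed gaps at a given rate) and that fault durations are also exponentially distributed. Both distributions, and the name `fault_schedule`, are illustrative assumptions rather than the RAPIDS fault model.

```python
import random


def fault_schedule(rate_per_s, mission_s, mean_duration_s, seed=None):
    """Return (start_time, duration) pairs, in seconds, for faults during a mission.

    Gaps between fault arrivals are exponential with rate `rate_per_s` (a Poisson
    process); durations are exponential with mean `mean_duration_s`.
    A fixed `seed` makes an experiment reproducible.
    """
    rng = random.Random(seed)
    faults = []
    t = rng.expovariate(rate_per_s)
    while t < mission_s:
        faults.append((t, rng.expovariate(1.0 / mean_duration_s)))
        t += rng.expovariate(rate_per_s)
    return faults
```

The wall-clock option corresponds to passing the injector an explicit list of such pairs instead of drawing them.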

  16. Fault Injector
      • We start by integrating SWIFI into RAPIDS
      • SWIFI capabilities:
        • fault injection into the application’s virtual memory address space: registers, code, data, heap, stack or user-defined regions
        • multiprocessor fault injection
        • some rudimentary monitoring
        • centrally controllable
      • The LCM (Local Control Module) houses SWIFI

  17. SWIFI Status
      • Fault injection:
        • SWIFI4 ported to Linux
        • initial experiments have run successfully
        • initial results show that most faults can cause the process to crash (dump core)
      • SWIFI relies primarily on the ptrace() system call
        • ptrace() was designed for debugging programs:
          • setting and clearing breakpoints
          • reading and writing to the virtual address space, thus emulating faults

  18. SWIFI & ptrace
      • ptrace() drawbacks:
        • the kernel must do four context switches for each fault injection; this interference can slow down the application considerably
        • ptrace() can only be used on child processes, which requires a separate parent process for each MPI task
        • ptrace() requires modifications to the source code: a child process must call ptrace(PTRACE_TRACEME, ...) to enter trace mode

  19. Fault Injection Using the /proc File System
      • The /proc file system contains files for each process
        • information about the process status, memory, network statistics, etc.
      • Location-specific faults can be injected by reading and writing at the appropriate offset of the file
        • no context switches are involved
        • multiple faults in contiguous locations can be injected in one call
      • A separate process for each MPI task is not required
      • The source code of the application is not needed

  20. Fault Injection (cont.)
      • Fault injection through the /proc file system:
        • only one fault injector process is needed per system
        • only the superuser can use this facility
      • Evaluating the fault tolerance of the OS is important:
        • it is rarely swapped out of physical memory
        • ptrace() cannot be used to debug the OS
        • the /proc file system can be used to inject faults directly into physical memory, including the OS
      • The /proc file system approach seems a viable option!

  21. Setup of the ARTS Lab Cluster
      [Figure: network diagram of the ARTS Lab cluster. Labels: Tintin, Nestor, Bianca, Haddock, Calculus, Thomson, Snowy; DNS, DHCP, WWW, NFS, NIS; Windows PC; Eric’s Laptop; Gateway; Firewall; ECS; 100 Ethernet Hub; Myrinet Switch; DeskJet and LaserJet 2100M printers. Legend: Myrinet host card, Ethernet card, parallel port, PC chassis.]

  22. Monitoring Alternatives
      • The user can choose between two alternatives:
        • Myrinet-only: monitoring messages also pass through Myrinet, producing both system and network overhead
        • Myrinet-Ethernet: monitoring messages use only the Ethernet network, so the main overhead comes from the wrapper calls
      • These alternatives provide a way to assess network overhead
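
Comparing the two alternatives against an unmonitored run suggests a simple way to separate the two kinds of overhead. This sketch assumes, as a first approximation, that the Myrinet-Ethernet run adds only wrapper overhead and that the Myrinet-only run adds wrapper plus network overhead, so the difference between the two runs isolates the network share. `split_overhead` and its inputs are hypothetical.

```python
def split_overhead(t_unmonitored, t_myrinet_ethernet, t_myrinet_only):
    """Return (wrapper_overhead, network_overhead) as fractions of the unmonitored run time.

    Assumes the Myrinet-Ethernet run adds only wrapper (system) overhead and the
    Myrinet-only run adds wrapper plus network overhead, so they separate by subtraction.
    """
    wrapper = (t_myrinet_ethernet - t_unmonitored) / t_unmonitored
    network = (t_myrinet_only - t_myrinet_ethernet) / t_unmonitored
    return wrapper, network
```

With hypothetical run times of 100 s unmonitored, 104 s with Myrinet-Ethernet monitoring and 110 s with Myrinet-only monitoring, this attributes 4% to the wrapper and 6% to network traffic; in practice run-to-run variation should be averaged out first.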

  23. Remote Experimentation
      • Restricted remote access to our lab cluster will be provided
      • The user downloads the RAPIDS Java Applet (RAJA)
      • Only encrypted GUI update messages are passed
      [Figure: remote access between JPL and the ARTS Cluster]

  24. Deliverables
      The RAPIDS Emulator
      • The MPI Wrapper
      • The Monitoring and Control Modules
      • The RAPIDS GUI
      • SWIFI integrated into RAPIDS
      • Synthetic workloads
      • Light-weight fault recovery techniques

  25. Conformance with Specs
      • Use of COTS components
        • current development on a Linux/Myrinet system using MPI for message passing
        • plans to use Linux-RT in the future
      • Portability of software is a design requirement
      • Use of the REE fault model and the REE Executive
      • Our timeline fits well into the REE schedule:
        • completion of Phase I: December 2000
        • completion of Phase II: September 2001

  26. Long-Term Plans
      • Integration of the emulator and simulator
      • Fault injection through the /proc file system
      • Light-weight application-specific checkpointing
        • exploitation of application-level information
      • Integrating our application-level fault recovery techniques (ALFT) into RAPIDS
      • Techniques for low-power fault recovery
      • Tools for evaluating power-aware techniques

  27. Ongoing Work in the ARTS Lab
      Application-Level Fault Tolerance
      • Lightweight application-specific fault detection and fault recovery techniques
      • Key idea: exploit application semantics to implement low-overhead fault tolerance
      • Initial results are very promising:
        • the RTHT benchmark requires 15% redundancy
        • redundancy can be tuned to the degree of fault tolerance required
      • Guidelines to develop ALFT for applications

  28. Ongoing Work in the ARTS Lab
      Power-Aware Real-time Systems
      • Design of power-aware algorithms
        • impact of high-level algorithms, such as allocation and scheduling algorithms, on power
      • Voltage/clock scaling: an alternative for reducing energy consumption in embedded systems
      • Initial results reveal that considerable energy savings can be obtained through voltage scaling combined with appropriate allocation and scheduling algorithms
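
Why voltage scaling pays off can be seen from the first-order CMOS dynamic-power model, which ignores leakage. Let $C_{\mathrm{eff}}$ be the effective switched capacitance, $V_{dd}$ the supply voltage, $f$ the clock frequency and $N$ the number of cycles a task needs, and use the common approximation that the attainable clock frequency scales roughly linearly with $V_{dd}$, so halving the voltage also halves the clock:

```latex
\begin{aligned}
P_{\mathrm{dyn}} &= C_{\mathrm{eff}}\, V_{dd}^{2}\, f \\
E &= P_{\mathrm{dyn}} \cdot \frac{N}{f} = C_{\mathrm{eff}}\, V_{dd}^{2}\, N \\
E' &= C_{\mathrm{eff}} \left(\tfrac{V_{dd}}{2}\right)^{2} N = \tfrac{1}{4}\, E,
\qquad t' = \frac{N}{f/2} = 2\, \frac{N}{f}
\end{aligned}
```

Under this model the task's energy drops to a quarter while its execution time doubles, so the saving is usable only when the stretched task still meets its deadline. This is where the allocation and scheduling algorithms come in.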

  29. Summary
      RAPIDS: A multi-faceted tool
      • An integrated platform for the launch, monitoring and validation of real applications on real hardware
      • A framework to test different configurations (parameters, algorithms) to get the best out of the system
      • Monitoring exposes bottlenecks and provides feedback for improvement
      • Validation ensures the system meets the goals of the mission

  30. The Console

  31. Schedule Windows

  32. The Task Editor

  33. Enhanced GUI
