
Deterministic Replay for Real-time Software Systems

This article discusses the major difficulties faced in building real-time embedded applications, including handling concurrent events, timing control, and temporal dependence in program behavior. It also explores the challenges of modeling, analyzing, testing, and reproducing non-deterministic and time-dependent behavior in software systems.


Presentation Transcript


  1. Deterministic Replay for Real-time Software Systems

  2. Background
     • Major difficulties of building real-time embedded applications
       • handling concurrent events (real-world events occur in parallel)
       • timing control and temporal dependence in program behavior
       • asynchronous operations
     • Non-deterministic operation, time-dependent behavior, and race conditions
       • difficult to model, analyze, test, and reproduce
     • Example: NASA Mars Pathfinder spacecraft
       • Total system resets in Mars Pathfinder
       • An overrun of the data collection task → a priority inversion on a mutex semaphore → failure of the communication task → a system reset
       • Took 18 hours to reproduce the failure in a lab replica → the problem became obvious and a fix was installed

  3. Background (Cont’d)
     • Other examples
       • select(2)/accept(2) race condition in TCP servers of NetBSD
         • the bug depends on a specific event and is sometimes difficult to reproduce, particularly if the server is very fast and the network is relatively slow
       • The Delphi Bug Report 459
         • the bug is difficult to reproduce because the timing of the two threads (one being destroyed and one being created) has to be “right” for it to occur
     • It is easy to identify the faults and fix them once the failing sequences are reproduced (or observed)
     • The failures are rooted in the interaction of multiple concurrent operations/threads and are based on timing dependencies

  4. Deterministic Replay
     [Figure labels: Execution/Instrumentation; Execution, D. replay/Instrumentation; Execution/Observation/Assertion; Execution, D. replay/Observation/Assertion; Execution/Checkpointing/Msg logging; Rollback/D. replay]
     • Can we reproduce the exact execution behavior with additional delays in a controlled environment?
       • the delays may be caused by instrumentation and breakpoints
     • For multiple purposes:
       • Test analysis
       • Debugging
       • Recovery

  5. Deterministic Replay (Cont’d)
     [Figure: timelines labeled "real-time execution" and "deterministic replay"; in both, interrupt_1 occurs at PC=1000 and interrupt_2 at PC=2000. Other labels: intrusions, time.]
     • Programs read in the same input values (timer, DAQ, status, etc.)
     • Interrupts occur at the same program execution instants
     • Need to log external events during real-time execution and re-submit the events during replay
       • recording and replaying stages
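
The record-then-resubmit idea on this slide can be illustrated with a minimal Python sketch. It is not the authors' tool: `EventLog`, `control_loop`, `record`, and `replay` are hypothetical names, and the "program execution instant" is simplified to a loop step index rather than a real program counter.

```python
from dataclasses import dataclass, field


@dataclass
class EventLog:
    """External events captured during the recording run (hypothetical format)."""
    inputs: list = field(default_factory=list)      # input values, in read order
    interrupts: dict = field(default_factory=dict)  # step index -> interrupt name


def control_loop(read_input, pending_interrupt, steps):
    """Toy program: accumulates inputs; an interrupt resets the accumulator."""
    trace, acc = [], 0
    for step in range(steps):
        irq = pending_interrupt(step)
        if irq is not None:
            acc = 0
            trace.append((step, irq))
        acc += read_input()
        trace.append((step, acc))
    return trace


def record(live_inputs, live_interrupts, steps):
    """Run against 'live' sources and log every external event that is consumed."""
    log = EventLog()
    source = iter(live_inputs)

    def read_input():
        value = next(source)
        log.inputs.append(value)
        return value

    def pending_interrupt(step):
        irq = live_interrupts.get(step)
        if irq is not None:
            log.interrupts[step] = irq
        return irq

    return control_loop(read_input, pending_interrupt, steps), log


def replay(log, steps):
    """Re-run the same program, feeding it the logged events instead of live ones."""
    source = iter(log.inputs)
    return control_loop(lambda: next(source), log.interrupts.get, steps)


recorded_trace, log = record([3, 5, 2, 7], {2: "interrupt_1"}, steps=4)
replayed_trace = replay(log, steps=4)
```

Because the replay run sees the same input values and the same interrupt at the same step, `replayed_trace` equals `recorded_trace` no matter how slowly the replay runs, which is what makes heavier instrumentation affordable in the replay stage.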

  6. Testing Analysis and Timing Intrusion
     • Software quality analysis and test coverage
     • Instrumentation at source programs
       • program behavior may be changed due to timing intrusion
       • test a robotic controller in the target system: hardware and human-in-the-loop operations
     • Some solutions:
       • hardware-based trace collection (Applied Microsystems)
       • special data logging, monitoring, and test facility (SVF for NASA ISS)
     • Apply instrumentation during deterministic replay
       • if the overhead of logging external events can be minimized

  7. Our Approach: A Two-Stage Instrumentation
     • Instrumentation based on RTOS: for context switches, interrupts, events, and task communication
     • Annotation for device drivers
     • Synchronize program execution with external events
       • cannot rely on the program counter alone
         • an interrupt during a loop (need loop count and program counter)
       • simulated time
         • must be adjusted to match the real execution time
       • determine when an event occurs
         • if there is no data dependence, it can occur at any instant during a block execution
         • else, need to know the corresponding statement

  8. Software Instruction Counter
     • Exact instant in program execution
       • specified by program counter (PC)
     [Figure labels: I/O status changed; read I/O; check value; read I/O; check value]
     • Software instruction counter (SIC)
       • incremented on a backward jump or procedure call
       • software or hardware implemented
     • Has been applied to recovery and debugging
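
To see why a counter incremented on backward jumps and procedure calls helps, here is a small Python sketch of a toy interpreter. The instruction format (`"nop"`, `"call"`, `"loop"`) and the function `run` are assumptions made for illustration only; they do not model the MPC860 or any real instruction set.

```python
def run(program, max_steps=1000):
    """Execute a toy program and return the (sic, pc) stamp of every step.

    Hypothetical instruction format:
      ("nop",)                no effect
      ("call",)               procedure call (body elided), increments SIC
      ("loop", target, count) jump back to `target` until taken `count` times
    """
    pc, sic = 0, 0
    remaining = {}   # loop instruction address -> backward jumps still to take
    stamps = []
    while pc < len(program) and len(stamps) < max_steps:
        stamps.append((sic, pc))
        op = program[pc]
        if op[0] == "loop":
            _, target, count = op
            left = remaining.setdefault(pc, count)
            if left > 0:
                remaining[pc] = left - 1
                sic += 1      # backward jump: bump the counter
                pc = target
                continue
        elif op[0] == "call":
            sic += 1          # procedure call: bump the counter
        pc += 1
    return stamps


program = [("nop",), ("nop",), ("loop", 0, 2), ("call",), ("nop",)]
stamps = run(program)
pcs = [pc for _, pc in stamps]
```

In this run the PC value 1 is visited three times, so a PC alone cannot say which loop iteration an interrupt hit. Between two SIC increments the toy program only moves forward, so each `(sic, pc)` pair occurs exactly once.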

  9. Current Status
     [Figure: tool-flow diagram. Labels: source program; code instrumentation; code analyzer; execution trace; ESIC and replay instrumentation; ESIC, system, and event instrumentation; target - record environment; target - replay environment; instrumented program_1; instrumented program_2; PC stamp converter; event trace_1; event trace_2]

  10. Current Status (Cont’d)
     • Works for a single execution thread in the whole system (VxWorks + MPC860)
     • There are kernel and non-instrumented threads
       • test analysis of one program in a multitasking environment
       • debug a program which calls library routines
       • system calls to RTOS
     • Can we still reach deterministic replay if the execution of the instrumented thread is interleaved with other threads?
     • If interrupts (input) → thread_1 → thread_2, then both threads must be instrumented
     [Figure labels: instrumented program; RTOS; semTake(); the other thread; ISR; interrupt; semGive()]

  11. Current Status (Cont’d)
     • If interrupts (input) → thread_2 and thread_1 → thread_2,
       • thread_1 doesn’t need to be instrumented
       • however, interrupts can occur while thread_1 is running (i.e., execution is not in the instrumentation region due to a blocked system call or library call)
     • Solution:
       • check the thread id when an interrupt occurs
       • if the interrupted instruction is in the instrumentation region, use PC+SIC for replay
       • else, replay the interrupt just before the call (RTOS or library)
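
The decision rule in the solution above can be written down as a short Python sketch. The record layout (`InterruptRecord`), the address-range test for the instrumentation region, and the idea of storing the stamp of the most recent outgoing call are all assumptions introduced for illustration; the slides do not specify how the tool represents them.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class InterruptRecord:
    """What the recorder is assumed to capture when an interrupt occurs."""
    thread_id: int                    # thread running when the interrupt hit
    pc: int                           # interrupted program counter
    sic: int                          # software instruction counter at that point
    last_call_stamp: Tuple[int, int]  # (sic, pc) of the instrumented thread's
                                      # most recent RTOS or library call


def replay_point(rec, instrumented_thread, region):
    """Decide where the interrupt should be re-submitted during replay."""
    lo, hi = region  # assumed address range of the instrumented code
    in_region = rec.thread_id == instrumented_thread and lo <= rec.pc < hi
    if in_region:
        # Interrupted inside instrumented code: PC+SIC pins down the instant.
        return ("at", rec.sic, rec.pc)
    # Interrupted in another thread, or inside an RTOS/library call:
    # replay the interrupt just before that call.
    return ("before_call", *rec.last_call_stamp)


REGION = (0x1000, 0x2000)
inside = InterruptRecord(thread_id=2, pc=0x1400, sic=7, last_call_stamp=(5, 0x1200))
outside = InterruptRecord(thread_id=1, pc=0x8000, sic=0, last_call_stamp=(9, 0x1800))

inside_point = replay_point(inside, instrumented_thread=2, region=REGION)
outside_point = replay_point(outside, instrumented_thread=2, region=REGION)
```

Here `inside_point` is placed at the exact `(sic, pc)` stamp, while `outside_point` falls back to the stamp of the blocked call, since no PC+SIC value is available for code outside the instrumentation region.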

  12. Current Tasks
     • Tool integration and GUI
     • Experiments
       • joystick program with input and timer
       • DC motor controller with a LabVIEW-based simulator
     • Applications in JSC
       • X38
       • AERCam
     • Porting
       • VxWorks and Suds on MBX860 embedded controller
       • porting to RTLinux and other platforms
     • Documentation and dissemination
