
Parallel Programming and Timing Analysis on Embedded Multicores



  1. Parallel Programming and Timing Analysis on Embedded Multicores Eugene Yip The University of Auckland Supervisors: Dr. Partha Roop, Dr. Morteza Biglari-Abhari (UoA) Advisor: Dr. Alain Girault (INRIA)

  2. Outline • Introduction • ForeC Language • Timing Analysis • Results • Conclusions

  3. Outline • Introduction • ForeC Language • Timing Analysis • Results • Conclusions

  4. Introduction • Safety-critical systems: • Perform specific real-time tasks. • Comply with strict safety standards [IEC 61508, DO-178]. • Time-predictability is useful in real-time designs. (Diagram: embedded systems at the intersection of safety-critical concerns and timing/functionality requirements.) [Paolieri et al 2011] Towards Functional-Safe Timing-Dependable Real-Time Architectures.

  5. Introduction • Safety-critical systems: • Shift from single-core to multicore processors. • Cheaper, with a better power versus execution performance trade-off. (Diagram: cores 0 to n and shared resources connected by a shared system bus.) [Blake et al 2009] A Survey of Multicore Processors. [Cullmann et al 2010] Predictability Considerations in the Design of Multi-Core Embedded Systems.

  6. Introduction • Parallel programming: • From supercomputers to mainstream computers. • Frameworks designed for systems without resource constraints or safety concerns. • Optimised for average-case performance (FLOPS), not time-predictability. • Threaded programming model. • Pthreads, OpenMP, Intel Cilk Plus, ParC, ... • Non-deterministic thread interleaving makes understanding and debugging hard. [Lee 2006] The Problem with Threads.

  7. Introduction • Parallel programming: • Programmer responsible for shared resources. • Concurrency errors: • Deadlock, Race condition, Atomic violation, Order violation. [McDowell et al 1989] Debugging Concurrent Programs. [Lu et al 2008] Learning from Mistakes: A Comprehensive Study on Real World Concurrency Bug Characteristics.

  8. Introduction • Synchronous languages: • Deterministic concurrency (formal semantics). • Execution model similar to digital circuits. • Threads execute in lock-step to a global clock. • Threads communicate via instantaneous signals. • Concurrency is logical. Typically compiled away. (Diagram: inputs sampled and outputs emitted at global ticks 1 to 4.) [Benveniste et al 2003] The Synchronous Languages 12 Years Later.

  9. Introduction • Synchronous languages: Must validate: max(reaction time) < min(time for each tick), where the minimum time for a tick is specified by the system's timing requirements. (Diagram: reaction times fitting within ticks at 1 s, 2 s, 3 s, and 4 s of physical time.) [Benveniste et al 2003] The Synchronous Languages 12 Years Later.

  10. Introduction • Synchronous languages • Esterel, Lustre, Signal • Synchronous extensions to C: • PRET-C • Reactive Shared Variables • Synchronous C • Esterel C Language. These retain the essence of C and add deterministic concurrency and thread communication. [Roop et al 2009] Tight WCRT Analysis of Synchronous C Programs. [Boussinot 1993] Reactive Shared Variables Based Systems. [Hanxleden et al 2009] SyncCharts in C - A Proposal for Light-Weight, Deterministic Concurrency. [Lavagno et al 1999] ECL: A Specification Environment for System-Level Design.

  11. Introduction • Synchronous languages • Esterel, Lustre, Signal • Synchronous extensions to C: • PRET-C • Reactive Shared Variables • Synchronous C • Esterel C Language. Concurrent threads are scheduled sequentially in a cooperative manner. This ensures thread-safe access to shared variables. Semantics designed to facilitate static analysis. [Roop et al 2009] Tight WCRT Analysis of Synchronous C Programs. [Boussinot 1993] Reactive Shared Variables Based Systems. [Hanxleden et al 2009] SyncCharts in C - A Proposal for Light-Weight, Deterministic Concurrency. [Lavagno et al 1999] ECL: A Specification Environment for System-Level Design.

  12. Introduction • Synchronous languages • Esterel, Lustre, Signal • Synchronous extensions to C: • PRET-C • Reactive Shared Variables • Synchronous C • Esterel C Language. Read phase followed by write phase for shared variables. Multiple writes to the same shared variable are combined using an associative and commutative “combine function”. [Roop et al 2009] Tight WCRT Analysis of Synchronous C Programs. [Boussinot 1993] Reactive Shared Variables Based Systems. [Hanxleden et al 2009] SyncCharts in C - A Proposal for Light-Weight, Deterministic Concurrency. [Lavagno et al 1999] ECL: A Specification Environment for System-Level Design.

  13. Introduction • Synchronous languages • Esterel, Lustre, Signal • Synchronous extensions to C: • PRET-C • Reactive Shared Variables • Synchronous C • Esterel C Language. More expressive than PRET-C, but static timing analysis has not yet been formulated. [Roop et al 2009] Tight WCRT Analysis of Synchronous C Programs. [Boussinot 1993] Reactive Shared Variables Based Systems. [Hanxleden et al 2009] SyncCharts in C - A Proposal for Light-Weight, Deterministic Concurrency. [Lavagno et al 1999] ECL: A Specification Environment for System-Level Design.

  14. Introduction • Synchronous languages • Esterel, Lustre, Signal • Synchronous extensions to C: • PRET-C • Reactive Shared Variables • Synchronous C • Esterel C Language. Sequential execution semantics: unsuitable for parallel execution. [Roop et al 2009] Tight WCRT Analysis of Synchronous C Programs. [Boussinot 1993] Reactive Shared Variables Based Systems. [Hanxleden et al 2009] SyncCharts in C - A Proposal for Light-Weight, Deterministic Concurrency. [Lavagno et al 1999] ECL: A Specification Environment for System-Level Design.

  15. Introduction • Synchronous languages • Esterel, Lustre, Signal • Synchronous extensions to C: • PRET-C • Reactive Shared Variables • Synchronous C • Esterel C Language. Compilation produces sequential programs: unsuitable for parallel execution. [Roop et al 2009] Tight WCRT Analysis of Synchronous C Programs. [Boussinot 1993] Reactive Shared Variables Based Systems. [Hanxleden et al 2009] SyncCharts in C - A Proposal for Light-Weight, Deterministic Concurrency. [Lavagno et al 1999] ECL: A Specification Environment for System-Level Design.

  16. Outline • Introduction • ForeC Language • Timing Analysis • Results • Conclusions

  17. ForeC Language ForeC (pronounced “foresee”) • C-based, multi-threaded, synchronous language. Inspired by PRET-C and Esterel. • Deterministic parallel execution on embedded multicores. • Fork/join parallelism and shared-memory thread communication. • Program behaviour is independent of the chosen thread scheduling.

  18. ForeC Language • Additional constructs to C: • pause: Synchronisation barrier. Pauses the thread's execution until all threads have paused. • par(st1, ..., stn): Forks each statement to execute as a parallel thread. Each statement is implicitly scoped. • [weak] abort st when [immediate] exp: Preempts the statement st when exp evaluates to a non-zero value. exp is evaluated in each global tick before st is executed.
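  To make the constructs concrete, here is a hypothetical ForeC sketch (the thread body is illustrative, not from the slides): two counter threads run in lock-step, and an abort preempts each loop once its counter reaches 3.

    void counter(void) {
      int n = 0;
      abort {
        while (1) {
          n = n + 1;    /* one increment per global tick */
          pause;        /* end this thread's local tick */
        }
      } when (n >= 3);  /* evaluated at each global tick before the body resumes */
    }

    void main(void) {
      par(counter(), counter());  /* fork two threads; par joins when both are preempted */
    }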

  19. ForeC Language • Additional variable type-qualifiers to C: • input and output: Declares a variable whose value is updated from, or emitted to, the environment at each global tick.
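  A minimal sketch of the input and output qualifiers (the variable names are illustrative): in is refreshed from the environment at each global tick, and out is emitted back to it.

    input int in;    /* resampled from the environment at each global tick */
    output int out;  /* emitted to the environment at each global tick */

    void main(void) {
      while (1) {
        out = in * 2;  /* compute this tick's output from this tick's input */
        pause;         /* end the tick: out is emitted, in is resampled */
      }
    }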

  20. ForeC Language • Additional variable type-qualifiers to C: • shared: Declares a shared variable that can be accessed by multiple threads.

  21. ForeC Language • Additional variable type-qualifiers to C: • shared: Declares a shared variable that can be accessed by multiple threads. • Threads make local copies of the shared variables they may use at the start of their local ticks. • Threads only modify their local copies during execution. • When a par statement terminates: • Modified copies from the child threads are combined (using a commutative & associative function) and assigned to the parent. • When the global tick ends: • The modified copies are combined and assigned to the actual shared variables.

  22. Execution Example shared int sum = 1 combine with plus; int plus(int copy1, int copy2) { return (copy1 + copy2); } void main(void) { par(f(1), f(2)); } void f(int i) { sum = sum + i; pause; ... } (Callouts: shared variable; commutative and associative combine function; fork-join; synchronisation.)

  23. Execution Example shared int sum = 1 combine with plus; int plus(int copy1, int copy2) { return (copy1 + copy2); } void main(void) { par(f(1), f(2)); } void f(int i) { sum = sum + i; pause; ... }

  24. Execution Example shared int sum = 1 combine with plus; int plus(int copy1, int copy2) { return (copy1 + copy2); } void main(void) { par(f(1), f(2)); } void f(int i) { sum = sum + i; pause; ... } (Diagram: global tick start.)

  25. Execution Example shared int sum = 1 combine with plus; int plus(int copy1, int copy2) { return (copy1 + copy2); } void main(void) { par(f(1), f(2)); } void f(int i) { sum = sum + i; pause; ... } (Diagram: global tick start.)

  26. Execution Example shared int sum = 1 combine with plus; int plus(int copy1, int copy2) { return (copy1 + copy2); } void main(void) { par(f(1), f(2)); } void f(int i) { sum = sum + i; pause; ... } (Diagram: global tick start.)

  27. Execution Example shared int sum = 1 combine with plus; int plus(int copy1, int copy2) { return (copy1 + copy2); } void main(void) { par(f(1), f(2)); } void f(int i) { sum = sum + i; pause; ... } (Diagram: global tick start, global tick end.)

  28. Execution Example shared int sum = 1 combine with plus; int plus(int copy1, int copy2) { return (copy1 + copy2); } void main(void) { par(f(1), f(2)); } void f(int i) { sum = sum + i; pause; ... } (Diagram: global tick start, global tick end.)

  29. Execution Example shared int sum = 1 combine with plus; int plus(int copy1, int copy2) { return (copy1 + copy2); } void main(void) { par(f(1), f(2)); } void f(int i) { sum = sum + i; pause; ... } (Diagram: global tick start, global tick end, global tick start.)
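  Tracing the example through its first global tick: both threads copy sum = 1 at the start of their local ticks; f(1) sets its copy to 1 + 1 = 2 and f(2) sets its copy to 1 + 2 = 3. When the global tick ends, the modified copies are combined, so sum becomes plus(2, 3) = 5.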

  30. Execution Example Shared variables: • Threads modify local copies of shared variables. • Isolation of thread execution allows threads to truly execute in parallel. • Thread interleaving does not affect the program's behaviour. • Prevents most concurrency errors: • Deadlock, race condition: no locks. • Atomic and order violation: local copies. • Copies of a shared variable can be split into groups and combined in parallel.

  31. Execution Example Shared variables: • Programmer has to define a suitable combine function for each shared variable. • Must ensure the combine function is indeed commutative & associative. • Notion of “combine functions” is not entirely new: • Intel Cilk Plus, OpenMP, MPI, UPC, X10 • Esterel, Reactive Shared Variables [Intel Cilk Plus] http://software.intel.com/en-us/intel-cilk-plus [OpenMP] http://openmp.org [MPI] http://www.mcs.anl.gov/research/projects/mpi/ [Unified Parallel C] http://upc.lbl.gov/ [X10] http://x10-lang.org/ [Berry et al 1992] The Esterel Synchronous Programming Language: Design, Semantics and Implementation. [Boussinot 1993] Reactive Shared Variables Based Systems.

  32. Execution Example Shared variables: • Programmer has to define a suitable combine function for each shared variable. • Must ensure the combine function is indeed commutative & associative. • Notion of “combine functions” is not entirely new: • Intel Cilk Plus, OpenMP, MPI, UPC, X10 • Esterel, Reactive Shared Variables (Comparison of aggregation mechanisms: Intel Cilk Plus: cilk::reducer_op, cilk::holder_op; OpenMP: shared var, reduction(operator: var); MPI: MPI_Reduce, MPI_Gather; UPC: shared var, collectives; X10: aggregates.) [Intel Cilk Plus] http://software.intel.com/en-us/intel-cilk-plus [OpenMP] http://openmp.org [MPI] http://www.mcs.anl.gov/research/projects/mpi/ [Unified Parallel C] http://upc.lbl.gov/ [X10] http://x10-lang.org/ [Berry et al 1992] The Esterel Synchronous Programming Language: Design, Semantics and Implementation. [Boussinot 1993] Reactive Shared Variables Based Systems.
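  For comparison, a plain C/OpenMP version of the running example, using OpenMP's reduction clause in place of a ForeC combine function:

    #include <stdio.h>

    int main(void) {
      int sum = 1;
      /* Each thread gets a private copy of sum; the copies are combined
         with "+" when the parallel region ends, much like a ForeC combine. */
      #pragma omp parallel for reduction(+: sum)
      for (int i = 1; i <= 2; i++) {
        sum += i;
      }
      printf("%d\n", sum);  /* prints 4: the original 1 plus 1 + 2 */
      return 0;
    }

  Note one semantic difference: OpenMP initialises each private copy to the operator's identity (0 for +), whereas ForeC threads copy the shared variable's current value at the start of their local ticks.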

  33. Execution Example Shared variables: • Programmer has to define a suitable combine function for each shared variable. • Must ensure the combine function is indeed commutative & associative. • Notion of “combine functions” is not entirely new: • Intel Cilk Plus, OpenMP, MPI, UPC, X10 • Esterel, Reactive Shared Variables (Comparison: Reactive Shared Variables: shared var with a combine operator; Esterel: valued signals with a combine operator.) [Intel Cilk Plus] http://software.intel.com/en-us/intel-cilk-plus [OpenMP] http://openmp.org [MPI] http://www.mcs.anl.gov/research/projects/mpi/ [Unified Parallel C] http://upc.lbl.gov/ [X10] http://x10-lang.org/ [Berry et al 1992] The Esterel Synchronous Programming Language: Design, Semantics and Implementation. [Boussinot 1993] Reactive Shared Variables Based Systems.

  34. Shared Variable Design Patterns • Point-to-point • Broadcast • Software pipelining • Divide and conquer • Scatter/Gather • Map/Reduce
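  As one instance of the map/reduce pattern, a hypothetical ForeC sketch (array contents and function names are illustrative): two threads each sum half of an array into their local copies of a shared accumulator, and the par join combines the copies.

    shared int total = 0 combine with plus;
    int data[8] = {1, 2, 3, 4, 5, 6, 7, 8};

    int plus(int copy1, int copy2) { return (copy1 + copy2); }

    void sumRange(int from, int to) {
      for (int i = from; i < to; i++) {
        total = total + data[i];  /* each thread updates only its local copy */
      }
    }

    void main(void) {
      par(sumRange(0, 4), sumRange(4, 8));
      /* the copies are 10 and 26, so total becomes plus(10, 26) = 36 */
    }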

  35. Overview of the Framework

  36. Concurrent Control Flow Graph shared int sum = 1 combine with plus; int plus(int copy1, int copy2) { return (copy1 + copy2); } void main(void) { par(f(1), f(2)); } void f(int i) { sum = sum + i; pause; ... }

  37. Scheduling • Light-Weight Static Scheduling: • Take advantage of multicore performance while delivering time-predictability. • Generate code to execute directly on hardware (bare metal/no OS). • Thread allocation and scheduling order on each core decided at compile time by the programmer. • Develop a WCRT-aware scheduling heuristic. • Thread isolation allows for scheduling flexibility. • Cooperative (non-preemptive) scheduling.

  38. Scheduling • Cores synchronise to fork/join threads and to end each global tick. • One core performs the housekeeping tasks at the end of the global tick: • Combining shared variables. • Emitting outputs. • Sampling inputs and triggering the next global tick.
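  A rough C sketch of the shape the generated per-core code might take (the helper names are assumptions, not the actual ForeC runtime API):

    /* Hypothetical per-core main loop of the statically scheduled code. */
    extern void run_local_ticks(int core_id);  /* run this core's threads to their pauses */
    extern void barrier(void);                 /* inter-core synchronisation via global memory */
    extern void combine_shared_variables(void);
    extern void emit_outputs(void);
    extern void sample_inputs(void);

    void core_main(int core_id) {
      for (;;) {
        run_local_ticks(core_id);  /* cooperative execution of the assigned threads */
        barrier();                 /* all cores reach the end of the global tick */
        if (core_id == 0) {        /* one core performs the housekeeping tasks */
          combine_shared_variables();
          emit_outputs();
          sample_inputs();
        }
        barrier();                 /* release all cores into the next global tick */
      }
    }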

  39. Outline • Introduction • ForeC Language • Timing Analysis • Results • Conclusions

  40. Timing Analysis Compute the program's worst-case reaction time (WCRT). Must validate: max(reaction time) < min(time for each tick), where the minimum time for a tick is specified by the system's timing requirements. (Diagram: reaction times fitting within ticks at 1 s, 2 s, 3 s, and 4 s of physical time.) [Benveniste et al 2003] The Synchronous Languages 12 Years Later.

  41. Timing Analysis Existing approaches for synchronous programs: • Integer Linear Programming (ILP) • “Coarse-grained” Reachability (Max-Plus) • Model Checking One existing approach for analysing the WCRT of synchronous programs on multicores: • [Ju et al 2010] Timing Analysis of Esterel Programs on General-Purpose Multiprocessors. • Uses ILP; offers no tightness result; all experiments were performed on a 4-core processor.

  42. Timing Analysis Existing approaches for synchronous programs. • Integer Linear Programming (ILP) • Execution time of the program described as a set of integer equations. • Solving ILP is NP-complete. [Ju et al 2010] Timing Analysis of Esterel Programs on General-Purpose Multiprocessors.

  43. Timing Analysis Existing approaches for synchronous programs. • “Coarse-grained” Reachability (Max-Plus) • Compute the WCRT of each thread. • From the thread WCRTs, compute the WCRT of the program. • Assumes there is a global tick in which all threads exhibit their worst-case behaviour. [M. Boldt et al 2008] Worst Case Reaction Time Analysis of Concurrent Reactive Programs.
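  A worked instance of that assumption (numbers hypothetical): if thread t1 has a local-tick WCRT of 20 and t2 a local-tick WCRT of 30, and both run sequentially on one core, Max-Plus reports WCRT(program) = 20 + 30 = 50, even if t1 and t2 can never reach their worst-case local ticks in the same global tick.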

  44. Timing Analysis Existing approaches for synchronous programs. • Model Checking • Computes the execution time along all possible execution paths. • State-space explosion problem. • Binary search: check that the WCRT is less than “x”. • Trades off analysis time for precision. • Counterexample: execution trace for the WCRT. [P. S. Roop et al 2009] Tight WCRT Analysis of Synchronous C Programs.

  45. Timing Analysis Proposed “fine-grained” Reachability approach: • Only consider local ticks that can execute together in the same global tick. • Produces a timed execution trace for the WCRT. • To handle the state-space explosion: • Reduce the program's CCFG before analysis. (Flow: program binary (annotated) → reconstruct the program's CCFG → find all global ticks (reachability) → WCRT.)
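  A small illustration of the gain (numbers hypothetical): suppose thread A's successive local ticks cost 5 then 20, thread B's cost 20 then 5, and the threads advance in lock-step, so A's first tick only ever runs with B's first. On one core, coarse-grained Max-Plus reports 20 + 20 = 40, while fine-grained reachability pairs only the ticks that can execute together and reports max(5 + 20, 20 + 5) = 25.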

  46. Timing Analysis Programs executed on the following multicore architecture: (Diagram: cores 0 to n, each with private instruction and data memory, connected to a global memory over a TDMA shared bus.)

  47. Timing Analysis Computing the execution time: • Overlapping of thread execution time from parallelism and inter-core synchronisations. • Scheduling overheads. • Variable delay in accessing the shared bus.
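  The bus delay is variable but bounded. As a rough illustration (slot sizes hypothetical): with a TDMA bus, n cores each own a slot of s cycles in a round of n × s cycles, so a core that has just missed its slot waits up to (n − 1) × s cycles before its next slot begins; e.g. 4 cores with 8-cycle slots wait at most 3 × 8 = 24 cycles per access.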

  48. Timing Analysis • Overlapping of thread execution time from parallelism and inter-core synchronisations. • An integer counter tracks each core's execution time. • Synchronisation occurs when forking/joining and when ending the global tick. • Advance the execution time of the participating cores. (Diagram: threads main, f1, and f2 allocated across cores 1 and 2.)
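  A minimal C sketch of this bookkeeping (the names are illustrative): the analysis keeps one counter per core and, at each synchronisation point, advances every participating core to the slowest one.

    #define NUM_CORES 2

    static unsigned long core_time[NUM_CORES];  /* per-core execution time counters */

    /* Accumulate the cost of code executed on one core. */
    static void account(int core, unsigned long cycles) {
      core_time[core] += cycles;
    }

    /* At a fork/join or global-tick boundary, every participating core waits
       for the slowest one, so all counters jump to the maximum. */
    static void synchronise(void) {
      unsigned long latest = 0;
      for (int c = 0; c < NUM_CORES; c++)
        if (core_time[c] > latest) latest = core_time[c];
      for (int c = 0; c < NUM_CORES; c++)
        core_time[c] = latest;
    }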

  49. Timing Analysis • Scheduling overheads. • Synchronisation: fork/join and global tick. • Via global memory. • Thread context-switching. • Copying of shared variables at the start of the thread's local tick, via global memory. (Diagram: synchronisation, thread context-switching, and global-tick overheads for main, f1, and f2 on cores 1 and 2.)

  50. Timing Analysis • Scheduling overheads. • Required scheduling routines are statically known. • Analyse the scheduling control-flow. • Compute the execution time of each scheduling overhead. (Diagram: scheduling control-flow for main, f1, and f2 on cores 1 and 2.)
