Code Generation Framework for Process Network Models onto Parallel Platforms Man-Kit Leung, Isaac Liu, Jia Zou Final Project Presentation
Outline • Motivation • Demo • Code Generation Framework • Application and Results • Conclusion
Motivation • Parallel programming is difficult… • Functional correctness • Performance debugging + tuning (basically, trial & error) • Code generation as a tool • Systematically explore the implementation space • Rapid development / prototyping • Optimize performance • Maximize (programming) reusability • Correct-by-construction [E. Dijkstra ’70] • Minimize human errors (bugs) • Eliminate the need for low-level testing • Otherwise, manual coding is too costly • Especially true for multiprocessor/distributed platforms
Higher-Level Programming Model • A Kahn Process Network (KPN) is a distributed model of computation (MoC) in which a group of processes is connected by communication channels to form a network • The communication channels are FIFO queues • “The Semantics of a Simple Language For Parallel Programming” [GK ’74] • Deterministic • Inherently parallel • Expressive • [Figure: two source actors feeding a sink actor through implicit buffers]
MPI Code Generation Workflow • Given a (KPN) model • Analyze & annotate the model • Assume weights on nodes & edges • Generate cluster info (buffers & grouping) • Partitioning (Mapping) • Code Generation • Generate MPI code • SPMD (Single Program Multiple Data) • Executable • Execute the code • Obtain execution statistics for tuning
Demo The codegen facility is in the Ptolemy II nightly release - http://chess.eecs.berkeley.edu/ptexternal/nightly/
Role of Code Generation • Platform-based Design [AS ‘02] • [Figure: Models → Partitioning (Mapping) → Code Generation → Executable; the modeling and code generation stages run inside Ptolemy II]
Implementation Space for Distributed Environment • Mapping • # of logical processing units • # of cores / processors • Network costs • Latency • Throughput • Memory Constraint • Communication buffer size • Minimization metrics • Costs • Power consumption • …
Partition • Node and edge weights abstract the computation and communication costs • The weights are annotated on the model • From the model, the input file to Chaco is generated (see the sketch below) • After Chaco produces the output file, the partitions are automatically annotated back onto the model
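A minimal sketch (in C) of how that input file might be emitted, assuming Chaco's plain-text graph format: a header line with the vertex count, edge count, and a "011" flag signalling that vertex and edge weights follow, then one line per vertex listing its weight and its (1-based neighbor, edge weight) pairs. The actor weights and helper names below are hypothetical, not the tool's actual output code:

    #include <stdio.h>

    /* Hypothetical model data: 3 actors with computation weights and
     * a channel-weight matrix (tokens exchanged per iteration). */
    #define N 3
    static int nodeWeight[N] = {5, 3, 2};
    static int edgeWeight[N][N] = {
        {0, 4, 0},
        {0, 0, 7},
        {0, 0, 0}
    };

    /* Emit the model graph in Chaco's input format (assumed as above). */
    static void writeChacoInput(const char *path) {
        int edges = 0;
        for (int i = 0; i < N; i++)
            for (int j = i + 1; j < N; j++)
                if (edgeWeight[i][j] + edgeWeight[j][i] > 0) edges++;
        FILE *f = fopen(path, "w");
        if (!f) return;
        fprintf(f, "%d %d 011\n", N, edges);    /* vertices, edges, weight flags */
        for (int i = 0; i < N; i++) {
            fprintf(f, "%d", nodeWeight[i]);    /* vertex weight first */
            for (int j = 0; j < N; j++) {
                int w = edgeWeight[i][j] + edgeWeight[j][i];  /* undirected view */
                if (w > 0) fprintf(f, " %d %d", j + 1, w);    /* neighbor, weight */
            }
            fprintf(f, "\n");
        }
        fclose(f);
    }

    int main(void) { writeChacoInput("model.graph"); return 0; }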
Multiprocessor Architectures • Shared Memory vs. Message Passing • We want to generate code that runs on both kinds of architectures • Message passing: • Message Passing Interface (MPI) as the implementation • Shared memory: • Pthread implementation available for comparison • UPC and OpenMP as future work
Pthread Implementation

#include <pthread.h>

void *Actor1(void *arg) { ... }
void *Actor2(void *arg) { ... }

void Model(void) {
    pthread_t actor1, actor2;
    /* Spawn one thread per actor. */
    pthread_create(&actor1, NULL, Actor1, NULL);
    pthread_create(&actor2, NULL, Actor2, NULL);
    /* Wait for all actors to terminate. */
    pthread_join(actor1, NULL);
    pthread_join(actor2, NULL);
}
MPI Code Generation • KPN scheduling: • Determine when actors are safe to fire • Actors can't block other actors on the same partition • Termination is based on a firing count • Communication: MPI send/recv over local buffers, with MPI tag matching (see the sketch below)
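The slides do not spell out the tag scheme; below is a minimal compilable sketch (in C) of one plausible approach, assuming each (producer actor, output channel) pair is encoded into the MPI tag so every FIFO channel gets its own matching message stream. The channelTag helper, MAX_CHANNELS, and the rank/actor assignments are illustrative, not the generated code itself:

    #include <mpi.h>
    #include <stdio.h>

    #define MAX_CHANNELS 16   /* assumed bound on output channels per actor */

    /* Encode (producer actor, channel) into a unique MPI tag. */
    static int channelTag(int producerActor, int channel) {
        return producerActor * MAX_CHANNELS + channel;
    }

    int main(int argc, char *argv[]) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            /* Producer: actor 0 sends one token on its channel 2. */
            int token = 42;
            MPI_Send(&token, 1, MPI_INT, 1, channelTag(0, 2), MPI_COMM_WORLD);
        } else if (rank == 1) {
            /* Consumer: matches only messages for (actor 0, channel 2). */
            int token;
            MPI_Recv(&token, 1, MPI_INT, 0, channelTag(0, 2), MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("received %d\n", token);
        }
        MPI_Finalize();
        return 0;
    }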
Sample MPI Program

#include <mpi.h>

int main(int argc, char *argv[]) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    /* Each rank fires only the actors mapped to its partition. */
    if (rank == 0) { Actor0(); Actor1(); }
    if (rank == 1) { Actor2(); }
    ...
    MPI_Finalize();
}

/* Schematic actor body (pseudocode): fire only when input has arrived
   and the outgoing buffer has room. */
Actor#() {
    MPI_Irecv(input);
    if (hasInput && !sendBufferFull) {
        output = localCalc();
        MPI_Isend(1, output);
    }
}
Conclusion & Future Work • Conclusion • Framework for code generation to parallel platforms • Generates scalable MPI code from Kahn Process Network models • Future Work • Target more platforms (UPC, OpenMP, etc.) • Additional profiling techniques • Support more partitioning tools • Improve the performance of generated code
Acknowledgments • Edward Lee • Horst Simon • Shoaib Kamil • Ptolemy II developers • NERSC • John Kubiatowicz Questions / Comments
Why MPI • Message passing • Good for distributed (shared-nothing) systems • Very generic • Easy to set up • Required setup (i.e., mpicc, etc.) only on one “master” node • Worker nodes only need SSH • Flexible (explicit) • Nonblocking + blocking send/recv • Con: requires explicit syntax modification (as opposed to OpenMP, Erlang, etc.) • Solution: automatic code generation
Actor-oriented design: a formalized model of concurrency • Object oriented vs. actor oriented • Actor-oriented design hides the state of each actor, making it inaccessible from other actors • The emphasis on data flow over control flow leads to conceptually concurrent execution of actors • Interaction between actors happens in a highly disciplined way • Threads and mutexes become implementation mechanisms instead of part of the programming model
Pthread implementation • Each actor runs as a separate thread • Implicit buffers (see the sketch below) • Each buffer has a read and a write count • A condition variable sleeps and wakes up threads • Each buffer has a fixed capacity • A global notion of scheduling exists at the OS level • All actors being in blocking-read mode implies the model should terminate
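A minimal sketch of one such implicit buffer, assuming a fixed-capacity circular FIFO guarded by a mutex and a single condition variable; the Channel type and function names are illustrative, not the generated code:

    #include <pthread.h>

    #define CAPACITY 8

    typedef struct {
        int data[CAPACITY];
        unsigned long readCount, writeCount; /* tokens consumed / produced */
        pthread_mutex_t lock;     /* init with PTHREAD_MUTEX_INITIALIZER */
        pthread_cond_t changed;   /* init with PTHREAD_COND_INITIALIZER */
    } Channel;

    /* Producer side: blocks while the buffer is full. */
    void channelPut(Channel *c, int token) {
        pthread_mutex_lock(&c->lock);
        while (c->writeCount - c->readCount == CAPACITY)
            pthread_cond_wait(&c->changed, &c->lock);
        c->data[c->writeCount++ % CAPACITY] = token;
        pthread_cond_broadcast(&c->changed);  /* wake any blocked reader */
        pthread_mutex_unlock(&c->lock);
    }

    /* Consumer side: the KPN blocking read; blocks while empty. */
    int channelGet(Channel *c) {
        pthread_mutex_lock(&c->lock);
        while (c->writeCount == c->readCount)
            pthread_cond_wait(&c->changed, &c->lock);
        int token = c->data[c->readCount++ % CAPACITY];
        pthread_cond_broadcast(&c->changed);  /* wake any blocked writer */
        pthread_mutex_unlock(&c->lock);
        return token;
    }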
MPI Implementation • A mapping of actors to cores is needed • This is a classic graph-partitioning problem: • Nodes: actors • Edges: messages • Node weights: computation on each actor • Edge weights: amount of messages communicated • Partitions: processors • Chaco was chosen as the graph partitioner
Partition Profiling • Challenge: giving the user enough information that node and edge weights can be annotated and modified to achieve load balancing • Solution 1: Static analysis • Solution 2: Simulation • Solution 3: Dynamic load balancing • Solution 4: Profile the current run and feed the information back to the user