A Case for Language Support for Implicitly Parallel Programming



1. A Case for Language Support for Implicitly Parallel Programming
Christopher Rodrigues
Joint work with Prof. Shobha Vasudevan
Advisor: Wen-Mei Hwu
UPCRC Seminar

2. Outline
This presentation examines automatic parallelization from a compiler-centric point of view.
• Why parallel algorithms become sequential programs
• A better expression of parallelism
• Checking correctness and parallelism
• Implementation status
• Moving forward
[Figure: compilation flow from Algorithm to Source code to Sequential IR to Parallel IR to Parallel Executable]

3. Parallelism Lost
[Figure: DCT block transforms in JPEG encoding. Programmer's view: each block is transformed independently (parallel!). Compiler's view after pointer analysis: all input blocks fall into points-to set #1000 and all output blocks into points-to set #1001 (sequential?).]

    for (i = 0; i < N; i++) {
        BlockDCT(in[i], out[i]);
    }

The compiler knows much less about a program's parallelism than the software developer does.
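To make the gap concrete, here is a minimal sketch of the compiler's predicament; the declarations and the constant N are illustrative assumptions, not VLFeat or libjpeg code.

    /* Illustrative sketch: when blocks are reached through arrays of
       pointers, pointer analysis lumps all in[i] into one points-to
       set and all out[i] into another, so the compiler must assume
       iterations may conflict. */
    #define N 1200                               /* number of blocks (assumed) */

    void BlockDCT(const float *in, float *out);  /* per-block transform */

    void encode(float *in[N], float *out[N]) {
        for (int i = 0; i < N; i++)
            BlockDCT(in[i], out[i]);  /* parallel to the programmer,
                                         sequential to the compiler */
    }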

4. Parallelism Lost
• Developers manage complexity using high-level abstractions that provide:
  • Compositionality: the developer can reason about each module in isolation, because modules don't interact
  • Separation: functions only interact with a few pieces of data, and are independent of everything else
• Independence is a way to manage complexity
• However, these abstractions are lost in source code
• Translation to source code introduces dependences

5. An Example of Parallel Algorithms Becoming Sequential: SIFT
• Scale-Invariant Feature Transform (SIFT) is a parallelizable image processing application
• A sequential C implementation is provided in VLFeat
  • Open source, download from www.vlfeat.org
• SIFT is a feature detector
  • Feature: something in an image that helps to identify it
• Each SIFT feature consists of:
  • A keypoint: the feature's location and orientation
  • A descriptor: the feature's distinguishing characteristics

6. SIFT Execution Time Profile
• An arbitrary 640x480 picture was used for profiling (the stages below account for 98.5% of total time)
• Three major parallel sections:
  • Scale and gradient images computed by convolution (~30% of time)
  • Each descriptor is a histogram (~60% of time)
• Will focus on the parallel descriptor calculation (the highlighted loop below)

    for each file:
        for each octave:
            compute scale images           (25%)
            compute gradient images        (6%)
            for each scale:
                find keypoints             (3%)
                refine keypoints           (2.5%)
            for each scale:
                for each keypoint:
                    calculate orientations
                    for each orientation:
                        calculate descriptor   (62%)
                        output descriptor

7. SIFT's Descriptor Computation Pipeline

    for (; i < nkeys; i++) {
        <get next keypoint>;
        <get its orientations>;
        for (q = 0; q < nangles; q++) {
            Descriptor descr;
            <descriptor calculation>;
            <write output to files>;
        }
    }

Stage 1: get keypoint
Stage 2: compute orientations (parallelizable)
Stage 3: compute descriptor (parallelizable)
Stage 4: write output (sequential)

8. Developer Introduces Sequential Dependences
The code below cannot be parallelized in its current form (a sketch of the fixed version follows this slide):
• Buffer reuse
  • Cannot parallelize because buffer d only holds one result at a time
  • Solution: privatization
• Sequential I/O
  • Cannot parallelize the loop because stage 4 is sequential
  • Solution: loop fission
• Lazy update
  • Data is computed on demand and cached for reuse
  • Solution: precomputation

    for (...) {
        Descr d;

        // lazy update (used in do_descriptor)
        if (grad == NULL)
            grad = do_gradient();

        // stage 3: write to d
        do_descriptor(k, a, &d);

        // stage 4: read from d
        // write output to a file
        write_output(k, a, &d);
    }
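For concreteness, here is a sketch of what the three fixes would produce together; the helper functions and the global grad come from the snippet above, while Keypoint, the arrays k[] and a[], and the bound n are illustrative assumptions.

    #include <stdlib.h>

    void process_keypoints(Keypoint *k[], double a[], int n) {
        grad = do_gradient();               /* precomputation: no lazy
                                               update inside the loop */

        Descr *d = malloc(n * sizeof *d);   /* privatization: one
                                               buffer per iteration */
        for (int i = 0; i < n; i++)         /* stage 3 alone: now
                                               parallelizable */
            do_descriptor(k[i], a[i], &d[i]);

        for (int i = 0; i < n; i++)         /* stage 4 after loop
                                               fission: still serial */
            write_output(k[i], a[i], &d[i]);

        free(d);
    }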

9. Compiler Analysis Cannot Recover Parallelism
Software complexity prevents the transformations:
• Privatization analysis fails here
  • It tries to detect dead (definitely overwritten) data
  • Silenced errors: if the input is out of range, no output is written
  • Dynamic array size: the number of array elements written and read varies across iterations
  • Conditional execution: data is written and read only if a flag is set
• Without privatization, loop fission is impossible
• To my knowledge, no compiler will perform precomputation here
(A sketch of the patterns that defeat privatization follows.)
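As a minimal illustration of the first three failure modes, consider the following sketch; Descr, in_range, and count_bins are hypothetical stand-ins, not VLFeat code.

    #include <stdbool.h>

    typedef struct { double bins[128]; } Descr;   /* hypothetical */

    bool in_range(int k);     /* hypothetical validity check */
    int  count_bins(int k);   /* hypothetical per-iteration size */

    void iteration(int k, Descr *d) {
        if (!in_range(k))     /* silenced error: on bad input nothing is
                                 written, so *d is not provably dead */
            return;
        int n = count_bins(k);   /* dynamic array size: the extent of
                                    the write varies across iterations */
        for (int j = 0; j < n; j++)
            d->bins[j] = 0.0;    /* conditional, variable-extent write
                                    that the analysis cannot prove dead */
    }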

10. Parallelism Regained
Don't ask the compiler to reverse-engineer low-level code.
• Avoid introducing dependences by providing libraries that match programming abstractions:
  • Pipelines
  • Container data structures
  • Others...
• Provide ways to communicate high-level abstractions to the compiler:
  • Access permissions → separation (data independence)
  • Parametric polymorphism → context-sensitive behavior
  • Algorithmic skeletons → control abstractions
  • Data encapsulation → data structures
  • Proof objects → integer value ranges
  • Dependent types → variable-size arrays, conditional effects, proof objects

11. Introducing a Software Pipeline Library
Instead of writing loops, can we let programmers write a pipeline?

    for (; i < nkeys; i++) {
        <get next keypoint>;
        <get its orientations>;
        for (q = 0; q < nangles; q++) {
            Descriptor descr;
            <descriptor calculation>;
            <write output to files>;
        }
    }

becomes the pipeline:

    <get next keypoint>
      → keypoint stream
    <get its orientations>
      → angle stream
    <descriptor calculation>
      → descriptor stream
    <write output to files>

12. Software Pipeline Definitions
• Stream: a computation producing a sequence of values
• Filter: a computation that transforms an input stream to an output stream
• Streams can contain filters
• Pipeline: a stream connected to a consumer
• Stage: a stream, filter, or consumer
• Stages may have internal state
[Figure: a stream producing the values [43, 42, ..., 2]; a filter turning a stream of a's into a stream of b's; a pipeline connecting a stream to a consumer]
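To make the definitions concrete, here is one possible C representation of the core types; the slides define the concepts but not the data layout, so everything below is an illustrative assumption.

    /* Illustrative sketch of the library's core types (assumed layout). */
    typedef struct Stage {
        void *(*worker)(void *item, void *state); /* per-item computation */
        void  *state;                             /* optional internal state */
    } Stage;

    /* A stream is a producer stage, possibly fed by an upstream stream
       through a chain of filters. */
    typedef struct Stream {
        Stage          stage;
        struct Stream *source;   /* NULL for a generator such as p_range */
    } Stream;

    /* A pipeline connects a stream to a consumer. */
    typedef struct Pipeline {
        Stream *stream;
        Stage   consumer;
    } Pipeline;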

13. Design Methodology of the Pipeline Library API
• The developer calls library functions to build stages and pipelines
• Stages and pipelines are data types
• Stages wrap "worker" functions that do the real computation:

    my_pipeline_stage = p_map(my_worker_func);

• Similar libraries and languages exist (TBB, StreamIt)
  • The library functionality here is similar to previous work
  • Different motivation: to enable automatic parallelization by checking side effects
  • This will lead to different programming language features

14. Pipeline-Building Library Functions
• p_range: generate a stream of the integers [0, 1, ..., n-1]
• p_bind: add a pipeline stage to a stream, creating a new stream
• p_map: apply a transformation to each element of a stream (may be sequential or parallel)
• p_cmap: generalization of p_map that can produce multiple outputs per input
• p_fold: connect a stream to a consumer
• p_run: execute a pipeline
(A usage sketch follows.)
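Here is a hedged sketch of how these functions might compose, reusing the assumed types from the earlier sketch; the header name and the worker signatures are assumptions, since the talk only names the functions.

    #include <stdio.h>
    #include "pipeline.h"   /* hypothetical header for the library */

    /* Hypothetical workers; generic void* items are an assumption. */
    static void *square_worker(void *item) {
        int *x = item;
        *x = *x * *x;
        return x;
    }

    static void *print_worker(void *item) {
        printf("%d\n", *(int *)item);
        return NULL;
    }

    int main(void) {
        Stream   *s  = p_range(100);                    /* [0, 1, ..., 99] */
        Stream   *s2 = p_bind(s, p_map(square_worker)); /* square each value */
        Pipeline *pl = p_fold(s2, print_worker);        /* attach consumer */
        p_run(pl);                                      /* execute pipeline */
        return 0;
    }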

15. Software Pipeline Library Execution
• The library manages communication and execution order
• Stateless stages can run sequentially or in parallel:
  • Sequential: as soon as an output is produced, run the next stage
  • Parallel: save outputs in an array
[Figure: sequential vs. parallel execution order over time]
(A sketch of the two strategies follows.)
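The two strategies can be sketched as follows, continuing the assumed types from above; run_stage is a hypothetical helper that pushes one item through the stream's chain of stages.

    /* Hypothetical helper: run one item through a stream's stage chain. */
    void *run_stage(Stream *s, void *item);

    /* Sequential: consume each output as soon as it is produced. */
    void run_sequential(Pipeline *pl, void **inputs, int n) {
        for (int i = 0; i < n; i++) {
            void *out = run_stage(pl->stream, inputs[i]);
            pl->consumer.worker(out, pl->consumer.state);
        }
    }

    /* Parallel: stateless stages process all items independently;
       outputs are saved in an array, then handed to the consumer
       in sequential order. */
    void run_parallel(Pipeline *pl, void **inputs, void **outputs, int n) {
        #pragma omp parallel for   /* any parallel-for mechanism works */
        for (int i = 0; i < n; i++)
            outputs[i] = run_stage(pl->stream, inputs[i]);
        for (int i = 0; i < n; i++)
            pl->consumer.worker(outputs[i], pl->consumer.state);
    }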

16. The SIFT Descriptor Pipeline
Here's how SIFT would look using the pipeline library:

    // define pipeline functions...
    Keypoint *get_keypoint(int *n) { ... }

    void do_orientations(Keypoint *k,
        void (*send)(struct {Keypoint *k; double angle;} *s)) { ... }

    // then build and run the pipeline
    if (mode)
        start = p_unfold(lookup_keypoint);
    else
        start = p_bind(p_range(nkeys), p_map(get_keypoint));

    pl = p_fold(p_bind(p_bind(start, p_cmap(do_orientations)),
                       p_map(do_descriptor)),
                write_output);
    p_run(pl);

17. Going from Explicit Parallelism to Implicit Parallelism
• So far, this has been an explicitly parallel pipeline library
  • Explicitly parallel: the developer declares which stages are parallel
  • If the developer was wrong, the program will probably have race conditions resulting in nondeterministic behavior
• Next: how to make it implicitly parallel
  • Implicitly parallel: the developer writes a pipeline and provides some dependence information
  • Parallel execution is guaranteed to produce the same result as sequential execution

18. Side Effect Conditions in the Pipeline API
• Use of a pipeline indicates the developer's intention to use a restricted communication pattern
  • Stages are independent, except for input and output
  • Different iterations of stateless stages are independent
• The restricted communication pattern is part of the API
  • Code that does not respect the interface is incorrect
  • Code that respects the interface is safe to run in parallel
Checking correctness is detecting parallelism.

19. Preliminaries to Checking Correctness
• First, define a language semantics
  • A framework for reasoning about whether parallel execution of sequential code is safe
  • Defined in terms of computations and permissions
• Then reify the semantics within the language
  • Turn this framework into a type system
  • The compiler can use the type system to reason about parallelism
• A pipeline example follows

20. Language Semantics: Computations

    for (...) {
        foo();
        bar();
    }

[Figure: the loop's code expands into nested computations: iteration 0 {foo, bar}, iteration 1 {foo, bar}, ...]
• Programs are structured
  • Nested blocks of code (more or less)
• A computation is an execution of a block of code
  • Computations nest
  • Canonical execution order
  • Execution order within a library function is specified by the library
• Sibling computations are candidates for parallel execution

21. Language Semantics: Permissions
• Keep track of data using access permissions
  • Also called capabilities
• Permissions are first-class values
• Performing a memory read or write requires:
  • A pointer to the data
  • A permission to access the data
• Permissions are compile-time bookkeeping
  • No run-time overhead

22. Language Semantics: Writable Permissions
• Computations interfere if running them in parallel produces nondeterministic results
  • Dependences are contention for access to a piece of data
• Prevent interference by restricting access permissions
• A computation needs a writable permission to write data
• Writable permissions are linear values
  • A permission cannot be duplicated, so only one computation owns any part of memory

23. Language Semantics: Readable Permissions
• Also want to support shared read-only data access
• Writable permissions can temporarily become read-only permissions
• A computation requires a read-only permission for the data it reads
• Read-only permissions can be duplicated or discarded
  • But they cannot be returned from the computation that holds them, so the writable permission can safely be recovered afterward

24. Language Semantics: Summary
• A program runs as a hierarchy of nested computations
• All side effects require permissions
  • Writable permissions are linear values
  • Readable permissions are commutative effects
  • Can generalize to transactions, I/O, etc.
• Permission accounting also detects leaks, type errors, and dangling pointers
• Computations can run in parallel if their permissions can be provided simultaneously

25. Permissions Are Statically Tracked in the Type System
• The type system statically keeps track of what data is guarded by a permission value
• Permission types are written as: data type @ address
• For example, the permission to access an array of 100 integers at address b:

    array 100 int @ b
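As a hedged illustration of how the notation might attach to code (the slide gives only the notation, so the annotation style below is an assumption):

    /* Sketch: writing every element of an array requires the writable
       permission  array n int @ p  (notation from this slide). */

    /* fill : (p : pointer, n : int) requires (array n int @ p) */
    void fill(int *p, int n) {
        for (int i = 0; i < n; i++)
            p[i] = 0;   /* each write is covered by the permission */
    }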

26. Permission Tracking in the Pipeline Library
• Use the type system to describe how pipelines behave when run
• A stream s:
  • Produces outputs of type α
  • Can access private writable data s1 (writable permission s1 @ a)
  • Can read shared data r1 (read-only permissions)
  • Type: s : Stream α r1 (s1 @ a)
• A filter p:
  • Reads inputs of type α and produces outputs of type β
  • Can access private writable data s2 (writable permission s2 @ b)
  • Can read shared data r2 (read-only permissions)
  • Type: p : Filter α β r2 (s2 @ b)

27. Permission Tracking in the Pipeline Library: Building Pipelines
• Can connect a stream to a filter if the input and output types match
• The resulting stream requires the combined permissions of its parts:
  • Union of the read-only permissions: r1 ∪ r2
  • Separating conjunction of the writable permissions: s1 * s2 (s1 and s2 are disjoint)
• The type system propagates side effects through library calls:

    s : Stream α r1 (s1 @ a)
    p : Filter α β r2 (s2 @ b)
    s3 = p_bind(s, p);
    s3 : Stream β (r1 ∪ r2) (s1 * s2)   // combined permissions of both stages
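The same propagation can be restated as an inference rule; the rule form below is just a rewriting of the types above in standard notation, not notation from the talk:

    \[
    \frac{s : \mathsf{Stream}\;\alpha\;r_1\;(s_1 \,@\, a)
          \qquad
          p : \mathsf{Filter}\;\alpha\;\beta\;r_2\;(s_2 \,@\, b)}
         {\texttt{p\_bind}(s, p) : \mathsf{Stream}\;\beta\;(r_1 \cup r_2)\;(s_1 * s_2)}
    \]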

28. Permission Tracking in the Pipeline Library: Running Pipelines
• Running a pipeline requires all of its permissions to be available
• A pipeline with a race condition cannot be run:
  • Two stages want the same piece of data
  • Running would require two pieces of data at the same address, but only one is available

    s : Stream α empty (int @ c)
    p : Filter α β empty (int @ c)
    s2 = p_bind(s, p);
    s2 : Stream β empty (int @ c * int @ c)   // requires two copies of
                                              // the same writable permission!

The pipeline API's side effect conditions are checked statically.

29. Making the Type System Useful for Real Programs
• Parallel programs employ a variety of software techniques in their parallel sections
• A general-purpose type system requires some difficult (but not fundamentally new) solutions:
  • Implementing linear and dependent types
  • Logical conditions in types
  • Parallelizing loop nests over arrays
  • Permitting user-defined data types
• These solutions have been employed (separately) in:
  • Proof-theoretic programming languages
  • Parallelizing FORTRAN compilers
  • Shape analyses
  • Often as an analysis rather than a type system

30. Implementation Status of the Pipeline Library
• A sequential pipeline library is implemented in C
  • It runs in both sequential and parallel execution order
  • SIFT is not parallelized yet
• Modest overhead for using the library
  • Each pipeline stage invocation involves two indirect function calls
  • Stream outputs are heap-allocated
• The overhead is easily amortized
  • SIFT takes >1 ms of computation time per loop iteration, much greater than the overhead

31. Ongoing Work
• Building compiler infrastructure for the type system
• Bridging high-level source code and the type system
  • The type system has programmer-unfriendly features
    • Linear and dependent types are incompatible with mutability
    • Management of permissions and proof objects is tedious
  • Use the type system as an IR produced from source code
  • Investigating lightweight annotations and analysis to produce type information
• Optimizations and parallelization
  • Exploit the extra information in the type system for more powerful code and data transformations
  • Compiler-generated message passing, unboxing, layout transformations
