This talk explores various compiler transformations that simplify parallel programming. It covers transformations like loop distribution, reassociation, and scalar expansion, emphasizing their role in reducing the tedium of parallel implementation. The discussion is limited to cache-coherent shared-memory environments and focuses on the expressivity of programming languages in handling many-core systems. We delve into arbitrarily complex control flows within loops and provide examples illustrating the application of these principles in practical scenarios, enhancing programmer productivity and performance.
Simplifying Parallel Programming with Compiler Transformations
Matt Frank, University of Illinois
mif@illinois.edu
What I’m ranting about
• Transformations that alleviate tedium
• Analogous to: code generation, register allocation, and instruction scheduling
• (Not really “optimizations”)
• Mainly: loop distribution, reassociation, “scalar” expansion, inspector-executor, hashing
• These cover much more than you might think
• Parallel-language expressivity
Assumptions
• Cache-coherent shared-memory many-cores
• (I’m not addressing distributed memory issues)
• Synchronization somewhat expensive
• Don’t use barriers gratuitously (but don’t avoid them at all costs)
• Analysis is not my problem: the programmer annotates
• Non-determinism is outside the realm of this talk
• No race detection in this talk either
Compiler Flow
[Diagram: front-end type systems and whole-program analysis → dependence-graph (PDG) based compiler → runtime/execution platform, with run-time feedback flowing back to the compiler]
• New information: type systems (e.g. DPJ), domain-specific objects, run-time feedback
• Program analysis supplies high-level program invariants for more efficient coherence, checkpointing, q.o.s.
• New capabilities: checkpointing, q.o.s. guarantees
I’m leaving out locality
[Diagram: front-end type systems and whole-program analysis → parallelism-exposing transformations → tiling, etc. → runtime/execution platform]
What’s enabled?
• Loops that contain arbitrary control flow
  – Including early exits, arbitrary function calls, etc.
• Arbitrary iterators (even sequential ones)
  – Can’t depend on the main body of the computation, though
• Arbitrary combinations of data-parallel work, scans, and reductions
  – Can use “partial sums” inside the loop
• Buffered printf
The transformations
• Scalar expansion
  – Eliminates anti and output dependences
  – Can be applied to properly scoped aggregates
• Reassociation
  – Integer reassociation extraordinarily useful
  – Can use partial sums later in the loop!
• Loop distribution
  – Think of it as scheduling
• Inspector-executor
  – As long as the data access pattern is invariant in the loop
You’ve heard of map-reduce

Before:
  doall i(1..n)
    private j = f(X[i])
    total = total + j

After:
  shared j[n]
  doall i(1..n)
    j[i] = f(X[i])
  do i(1..n)
    total = total + j[i]
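A minimal runnable sketch of the slide’s transformation: the private scalar `j` is expanded into one array slot per iteration, so the parallel phase is race-free, and the reduction runs sequentially afterward. `f` here is a stand-in for whatever the real loop body computes.

```python
from concurrent.futures import ThreadPoolExecutor

def f(x):
    # stand-in for the loop body's computation (assumption)
    return x * x

def map_reduce(X):
    n = len(X)
    j = [0] * n  # scalar j expanded to an array: one slot per iteration
    with ThreadPoolExecutor() as pool:
        # doall: each iteration writes only its own j[i], so there are no races
        list(pool.map(lambda i: j.__setitem__(i, f(X[i])), range(n)))
    # sequential reduction over the expanded array
    total = 0
    for i in range(n):
        total += j[i]
    return total
```

Reassociation is what licenses the split: because `+` is associative, the per-iteration values can be summed in any order after the doall completes.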
How ‘bout scan-map?

Before:
  struct { data; *next; } *p;
  doall p != NULL
    modify(p->data)
    p = p->next

After:
  n = 0
  do p != NULL
    a[n++] = p
    p = p->next
  doall i(0..n)
    modify(a[i]->data)
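The same split, sketched in Python: a sequential inspector chases the pointers into an indexable array, after which the per-node work is independent and can run as a doall. `Node` and `modify` are illustrative stand-ins, not from the talk.

```python
from concurrent.futures import ThreadPoolExecutor

class Node:
    def __init__(self, data, next=None):
        self.data = data
        self.next = next

def modify(node):
    # stand-in for the real per-node work (assumption)
    node.data += 1

def scan_then_doall(head):
    # inspector: sequential pointer chase collects the nodes into an array
    a = []
    p = head
    while p is not None:
        a.append(p)
        p = p.next
    # executor: node bodies are now independent, so run them in parallel
    with ThreadPoolExecutor() as pool:
        list(pool.map(modify, a))
```

This is legal only under the slide’s earlier caveat: the iterator (the `next` chain) must not depend on the main body of the computation.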
Sparse matrix construction

  scan int ptr = 0
  shared float data[m]
  shared int rows[n]
  doall row (1..n)
    private j
    rows[row] = ptr
    for j in non_zeros(row)
      data[ptr] = foo(row, j)
      ptr++

[Diagram: rows holds each row’s starting offset (ptr) into data]
Partial Sum Expansion

Before:
  scan int ptr = 0
  shared float data[m]
  shared int rows[n]
  doall row (1..n)
    private j
    rows[row] = ptr
    for j in non_zeros(row)
      data[ptr] = foo(row, j)
      ptr++

After (scalar expand ptr; expand the partial sum):
  scan int ptr[n]
  shared float data[m]
  shared int rows[n]
  doall row (1..n)
    private j
    ptr[row] = 0
    rows[row] = rows[row-1] + ptr[row-1]
    for j in non_zeros(row)
      data[rows[row] + ptr[row]] = foo(row, j)
      ptr[row]++
Scalar Expansion (and inner loop fission)

Before:
  scan int ptr[n]
  shared float data[m]
  shared int rows[n]
  doall row (1..n)
    private j
    ptr[row] = 0
    rows[row] = rows[row-1] + ptr[row-1]
    for j in non_zeros(row)
      data[rows[row] + ptr[row]] = foo(row, j)
      ptr[row]++

After:
  scan int ptr[n]
  shared float data[m]
  shared int rows[n]
  doall row (1..n)
    private j
    private vector mydata
    ptr[row] = 0
    rows[row] = rows[row-1] + ptr[row-1]
    for j in non_zeros(row)
      mydata.pushback(foo(row, j))
      ptr[row]++
    for j (rows[row], rows[row]+ptr[row])
      data[j] = mydata.popfront()
Outer Loop Fission

Before:
  scan int ptr[n]
  shared float data[m]
  shared int rows[n]
  doall row (1..n)
    private j
    private vector mydata
    ptr[row] = 0
    rows[row] = rows[row-1] + ptr[row-1]
    for j in non_zeros(row)
      mydata.pushback(foo(row, j))
      ptr[row]++
    for j (rows[row], rows[row]+ptr[row])
      data[j] = mydata.popfront()

After:
  scan int ptr[n]
  shared float data[m]
  shared int rows[n]
  doall row (1..n)
    private j
    private vector mydata
    ptr[row] = 0
    for j in non_zeros(row)
      mydata.pushback(foo(row, j))
      ptr[row]++
  do row (1..n)
    rows[row] = rows[row-1] + ptr[row-1]
  doall row (1..n)
    for j (rows[row], rows[row]+ptr[row])
      data[j] = mydata.popfront()
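The fully fissioned version can be sketched as three phases in Python: a parallel count-and-buffer pass, a sequential prefix sum (the scan over `ptr`), and a parallel copy-out into disjoint slices of `data`. `foo` and `non_zeros` are stand-ins for the slide’s unspecified element computation and iteration space.

```python
from concurrent.futures import ThreadPoolExecutor

def foo(row, j):
    # stand-in for the per-element computation (assumption)
    return float(row * 10 + j)

def build_csr(n, non_zeros):
    buf = [None] * n  # per-row private vectors ("mydata"), expanded to an array
    ptr = [0] * n     # per-row counts (the expanded scan variable)

    def count_and_buffer(row):
        buf[row] = [foo(row, j) for j in non_zeros(row)]
        ptr[row] = len(buf[row])

    with ThreadPoolExecutor() as pool:  # doall: fill private buffers
        list(pool.map(count_and_buffer, range(n)))

    rows = [0] * n  # sequential scan: exclusive prefix sum of ptr
    for row in range(1, n):
        rows[row] = rows[row - 1] + ptr[row - 1]

    data = [0.0] * (rows[-1] + ptr[-1])

    def copy_out(row):
        # doall: each row writes a disjoint slice of data, so no races
        data[rows[row]:rows[row] + ptr[row]] = buf[row]

    with ThreadPoolExecutor() as pool:
        list(pool.map(copy_out, range(n)))
    return rows, data
```

Only the short prefix-sum loop is sequential; everything proportional to the number of nonzeros runs in the two doall phases.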
Concatenation
[Diagram: per-row data buffers concatenated into the shared data array; building the buffers is parallel, the prefix sum of row pointers is sequential]
printf() is the same pattern

  doall i (1..n)
    private mystring = s(i)
    printf(mystring)

[Diagram: private mystrings buffered per iteration, then concatenated into the stdout buffer]
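A minimal sketch of buffered printf: each iteration writes into a private slot instead of calling printf directly, and a sequential concatenation afterward reproduces the original output order. `s` is a stand-in for whatever string the loop body formats.

```python
import sys
from concurrent.futures import ThreadPoolExecutor

def s(i):
    # stand-in for the string each iteration prints (assumption)
    return f"iteration {i}\n"

def buffered_print(n):
    mystrings = [None] * n  # one private buffer per iteration
    with ThreadPoolExecutor() as pool:
        # doall: format privately instead of calling printf inside the loop
        list(pool.map(lambda i: mystrings.__setitem__(i, s(i)), range(n)))
    out = "".join(mystrings)  # sequential concatenation preserves order
    sys.stdout.write(out)
    return out
```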
Sparse array updates

  doall i(1..n)
    private j
    for j in neighbors_of(i)
      private temp = foo(i, j)
      x[i] += temp
      x[j] += temp
Becomes

  doall i(1..n)
    private j
    for j in neighbors_of(i)
      private temp = foo(i, j)
      continue[hash(i)][myproc].push(i, temp)
      continue[hash(j)][myproc].push(j, temp)
  doall p(1..P)
    for t (1..P)
      private (ptr, val) = continue[p][t]
      x[ptr] += val

[Diagram: the P×P continuation matrix, indexed by destination bucket and producing thread]
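A sketch of the continuation-matrix idea in Python, under assumed details the slide leaves open: the hash is `i % P`, and thread t handles the iterations with `i % P == t`. Each producer thread appends only to column t, and each consumer bucket p owns all indices hashing to p, so neither phase needs locks.

```python
from concurrent.futures import ThreadPoolExecutor

def foo(i, j):
    # stand-in for the per-edge contribution (assumption)
    return 1.0

def sparse_update(n, P, neighbors_of, x):
    # continue_[p][t]: updates produced by thread t that hash to bucket p
    continue_ = [[[] for _ in range(P)] for _ in range(P)]
    hash_ = lambda i: i % P  # assumed bucketing function

    def produce(t):
        # thread t handles the iterations i with i % P == t
        for i in range(t, n, P):
            for j in neighbors_of(i):
                temp = foo(i, j)
                continue_[hash_(i)][t].append((i, temp))
                continue_[hash_(j)][t].append((j, temp))

    with ThreadPoolExecutor() as pool:  # doall over producer threads
        list(pool.map(produce, range(P)))

    def consume(p):
        # bucket p owns every index with hash_(i) == p, so no locks needed
        for t in range(P):
            for ptr, val in continue_[p][t]:
                x[ptr] += val

    with ThreadPoolExecutor() as pool:  # doall over buckets
        list(pool.map(consume, range(P)))
```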
Graph updates

  doall i(1..n)
    newvalue = value[i]
    for pred in predecessors[i]
      newvalue = f(newvalue, value[pred])
    value[i] = newvalue
Inspector Executor
Polychronopoulos ’88, Saltz ’91, Leung/Zahorjan ’93

  int wavefront[n] = {0}
  do i(1..n)
    wavefront[i] = max(wavefront[i’s predecessors]) + 1
  do w(1..maxdepth)
    doall i suchthat wavefront[i] = w
      newvalue = value[i]
      for pred in predecessors[i]
        newvalue = f(newvalue, value[pred])
      value[i] = newvalue
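A runnable sketch of the wavefront schedule, under the assumption that nodes are numbered so predecessors come first: the inspector assigns each node a level one past its deepest predecessor, then the executor runs sequentially over levels and in parallel within each level.

```python
from concurrent.futures import ThreadPoolExecutor

def wavefront_schedule(n, predecessors):
    # inspector: node i's wavefront is 1 + the max wavefront of its predecessors
    wf = [0] * n
    for i in range(n):  # assumes predecessors of i have indices < i
        wf[i] = max((wf[p] + 1 for p in predecessors[i]), default=0)
    return wf

def executor(n, predecessors, value, f):
    wf = wavefront_schedule(n, predecessors)
    for w in range(max(wf) + 1):  # sequential over wavefronts
        level = [i for i in range(n) if wf[i] == w]

        def update(i):
            newvalue = value[i]
            for pred in predecessors[i]:
                newvalue = f(newvalue, value[pred])
            value[i] = newvalue

        with ThreadPoolExecutor() as pool:  # doall within one wavefront
            list(pool.map(update, level))
```

All predecessors of a level-w node sit in strictly earlier wavefronts, so their values are final before the doall that reads them; this is exactly why the schedule must be invariant in the loop.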
Limits of what we know

  doall node in worklist
    modify graph structure
What I’ve shown you
• Scalar expansion
  – Eliminates anti and output dependences
  – Can be applied to properly scoped aggregates
• Reassociation
  – Integer reassociation extraordinarily useful
  – Can use partial sums later in the loop!
• Loop distribution
  – Think of it as scheduling
• Inspector-executor
  – As long as the data access pattern is invariant in the loop
Where next?
• Relieve tedium
  – (build the compiler, or frameworks, or …)
• Find new patterns
  – Delaunay triangulation
  – Pick an example application: there will be something new you wish could be transformed automatically
• Parallel languages beyond “doall” and “reduce”