Multi-Grid
Multi-Grid Esteban Pauli 4/25/06
Overview • Problem Description • Implementation • Shared Memory • Distributed Memory • Other • Performance • Conclusion
Problem Description • Same input and output as Jacobi • Try to speed up the algorithm by spreading boundary values faster • Coarsen down to a small problem, then successively solve and refine back up • Algorithm: • for i in 1 .. levels - 1: coarsen level i to i + 1 • for i in levels .. 2, -1: solve level i; refine level i to i - 1 • solve level 1
Problem Description • Flow for three levels (sketched below): Coarsen → Coarsen → Solve → Refine → Solve → Refine → Solve
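A minimal sequential sketch of this driver loop in C. This is an illustration, not the presenter's code: the LEVELS constant and the coarsen/refine/solve helpers are hypothetical, and levels are numbered from 0 (finest) rather than from 1 as on the slide.

/* Driver for the coarsen / solve / refine sequence shown above.
   Level 0 is the finest grid, LEVELS - 1 the coarsest. */
#define LEVELS 3

void coarsen(int from, int to);      /* average cells of level 'from' into level 'to' */
void refine (int from, int to);      /* interpolate level 'from' back onto level 'to' */
void solve  (int level, int iters);  /* run 'iters' Jacobi sweeps on one level        */

void multigrid(int iters)
{
    for (int i = 0; i < LEVELS - 1; i++)      /* coarsen all the way down         */
        coarsen(i, i + 1);

    for (int i = LEVELS - 1; i >= 1; i--) {   /* coarsest level first             */
        solve(i, iters);
        refine(i, i - 1);                     /* carry the result one level finer */
    }
    solve(0, iters);                          /* final solve on the finest grid   */
}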
Implementation – Key Ideas • Assign a chunk to each processor • Coarsen, refine operations done locally • Solve steps done like Jacobi
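As a concrete (hypothetical) picture of the "chunk per processor" idea, each PE can hold its own block of rows at every level, plus a ghost row on each side for the solve step. None of these names come from the talk; this is just one plausible layout.

/* One PE's chunk: a block of rows at every level, with two ghost rows. */
typedef struct {
    int     n;          /* this level is n x n overall                          */
    int     my_rows;    /* rows owned by this PE at this level                  */
    double *cur, *next; /* (my_rows + 2) x n arrays; rows 0 and my_rows + 1 are
                           ghost rows filled from the neighbouring PEs          */
} Level;

typedef struct {
    int   nlevels;
    Level level[8];     /* level[0] is the finest, level[nlevels-1] the coarsest */
} Chunk;

Coarsening and refining only read and write rows inside one PE's own chunk, which is why those steps need no communication; only the Jacobi-style solve touches the ghost rows.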
Shared Memory Implementations • for i in 1 .. levels - 1 • coarsen level i to i + 1 (in parallel) • barrier • for i in levels .. 2, -1 • solve level i (in parallel) • refine level i to i – 1 (in parallel) • barrier • solve level 1 (in parallel)
Shared Memory Details • Solve is like shared-memory Jacobi: there is true sharing • /* the my_* variables are all locals */ • for my_i = my_start_i .. my_end_i • for my_j = my_start_j .. my_end_j • current[my_i][my_j][level] = … • Coarsen, Refine access only local data: only false sharing is possible • for my_i = my_start_i .. my_end_i • for my_j = my_start_j .. my_end_j • current[my_i][my_j][level] = …[level ± 1]
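A sketch of the structure from the two slides above in C with OpenMP, assuming the row-block layout described earlier. The helpers nrows(), coarsen_row(), refine_row(), and jacobi_row() are hypothetical, and the current/next buffer swap is omitted for brevity.

#include <omp.h>

int  nrows(int level);                        /* rows at this level              */
void coarsen_row(int from, int to, int row);  /* local: reads level 'from' only  */
void refine_row (int from, int to, int row);  /* local: reads level 'from' only  */
void jacobi_row (int level, int row);         /* true sharing: reads neighbours  */

void multigrid_shared(int levels, int iters)
{
    #pragma omp parallel
    {
        for (int i = 0; i < levels - 1; i++) {
            #pragma omp for                   /* coarsen in parallel             */
            for (int r = 0; r < nrows(i + 1); r++)
                coarsen_row(i, i + 1, r);
        }                                     /* implicit barrier after each for */

        for (int i = levels - 1; i >= 1; i--) {
            for (int it = 0; it < iters; it++) {
                #pragma omp for               /* solve this level in parallel    */
                for (int r = 0; r < nrows(i); r++)
                    jacobi_row(i, r);
            }
            #pragma omp for                   /* refine in parallel              */
            for (int r = 0; r < nrows(i - 1); r++)
                refine_row(i, i - 1, r);
        }

        for (int it = 0; it < iters; it++) {
            #pragma omp for                   /* final solve on the finest grid  */
            for (int r = 0; r < nrows(0); r++)
                jacobi_row(0, r);
        }
    }
}

The only synchronization is the implicit barrier at the end of each work-shared loop, which is exactly the structure the pseudocode above calls for.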
Shared Memory Paradigms • A barrier is all you really need, so this should be easy to program in any shared-memory paradigm (UPC, OpenMP, HPF, etc.) • Being able to control the data distribution (CAF, GA) should help • If the problem is small enough, only the initial misses matter • If it is larger, data gets pushed out of cache and has to be brought back over the network • Having to switch to a different syntax to access remote memory is a minus for elegance, but a plus in that it makes communication explicit
Distributed Memory (MPI) • Almost all work local, only communicate to solve a given level • Algorithm at each PE (looks very sequential): • for i in 1 .. levels - 1 • coarsen level i to i + 1 // local • for i in levels .. 2, -1 • solve level i // see next slide • refine level i to i – 1 // local • solve level 1 // see next slide
MPI Solve function • "Dumb" version: send my edges, receive edges, compute • "Smarter" version: send my edges, compute the middle, receive edges, compute the boundaries • Any other optimization that works for Jacobi can be applied here as well
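A sketch of the "smarter" variant in C with MPI, again assuming a row-block decomposition with one ghost row on each side. The function name, update_row(), and the up/down neighbour ranks are illustrative, not the talk's code; the idea is just to post the ghost exchange, update the interior while messages are in flight, and finish the two boundary rows last.

#include <mpi.h>

void update_row(const double *cur, double *next, int row, int n);  /* one Jacobi row */

/* cur/next are (my_rows + 2) x n row-major arrays; rows 0 and my_rows + 1 are
   ghost rows.  'up' and 'down' are neighbour ranks, or MPI_PROC_NULL at the ends. */
void solve_level(double *cur, double *next, int my_rows, int n, int up, int down)
{
    MPI_Request req[4];

    /* send my top and bottom rows; post receives into the ghost rows */
    MPI_Isend(&cur[1 * n],             n, MPI_DOUBLE, up,   0, MPI_COMM_WORLD, &req[0]);
    MPI_Isend(&cur[my_rows * n],       n, MPI_DOUBLE, down, 1, MPI_COMM_WORLD, &req[1]);
    MPI_Irecv(&cur[0 * n],             n, MPI_DOUBLE, up,   1, MPI_COMM_WORLD, &req[2]);
    MPI_Irecv(&cur[(my_rows + 1) * n], n, MPI_DOUBLE, down, 0, MPI_COMM_WORLD, &req[3]);

    /* compute the middle rows, which need no remote data */
    for (int r = 2; r < my_rows; r++)
        update_row(cur, next, r, n);

    /* wait for the ghost rows, then compute the two boundary rows */
    MPI_Waitall(4, req, MPI_STATUSES_IGNORE);
    update_row(cur, next, 1, n);
    update_row(cur, next, my_rows, n);
}

Because the PEs at the top and bottom of the grid use MPI_PROC_NULL as their missing neighbour, the same code handles the physical boundaries without special cases.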
Distributed Memory (Charm++) • Again, done like Jacobi • Flow of control is hard to show here • Can send just one message to do all the coarsening (as in MPI) • Might get some benefit from overlapping computation and communication by waiting for smaller messages • No benefit from load balancing
Other paradigms • BSP model (local computation, global communication, barrier): good fit • STAPL (parallel STL): not a good fit (could use parallel for_each, but lack of 2D data structure would make this awkward) • Treadmarks, CID, CASHMERe (distributed shared memory): getting a whole page to get just the boundaries might be too expensive, probably not a good fit • Cilk (spawn processes for graph search): not a good fit
Performance • 1024x1024 grid coarsened down to a 256x256 grid, 500 iterations at each level • Sequential time: 42.83 seconds • [Result tables not reproduced in the transcript: left table is for 4 PEs, right table for 16 PEs]
Summary • Almost identical to Jacobi • Very predictable application • Easy load balancing • Good fit for shared memory and MPI • Charm++: virtualization helps, but more data points are probably needed to see if it can beat MPI • DSM: false sharing might be too high a cost • Paradigms aimed at irregular programs are not a good fit