
AUTOMATICALLY TUNING PARALLEL AND PARALLELIZED PROGRAMS


Presentation Transcript


  1. AUTOMATICALLY TUNING PARALLEL AND PARALLELIZED PROGRAMS Chirag Dave and Rudolf Eigenmann Purdue University

  2. GOALS • Automatic parallelization without loss of performance • Use automatic detection of parallelism • Parallelization is overzealous • Remove overhead-inducing parallelism • Ensure no performance loss over original program • Generic tuning framework • Empirical approach • Use program execution to measure benefits • Offline tuning

  3. AUTO vs. MANUAL PARALLELIZATION • Manual: the source program is hand parallelized into a parallel program and the user tunes it for performance, at significant development time • Automatic: a parallelizing compiler transforms the source program; state-of-the-art auto-parallelization runs in the order of minutes

  4. AUTO-PARALLELISM OVERHEAD • Loop-level parallelism:

    int foo() {
        #pragma omp parallel for private(i,j,t)
        for (i=0; i<10; i++) {
            a[i] = c;
            #pragma omp parallel for private(j,t)
            for (j=0; j<10; j++) {
                t = a[i-1];
                b[j] = (t*b[j])/2.0;
            }
        }
    }

  • Fork/join overheads • Load balancing • Work in the parallel section (between fork and join)
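  The tuner's remedy for this kind of overhead is simply to keep a loop serial when its parallel version does not pay off. As an illustration (not from the slides), the tuned output for the example above could leave the small inner loop un-annotated, so the fork/join cost is no longer paid on every outer iteration:

    int foo() {                          /* a, b, c declared elsewhere, as on the slide */
        for (i = 0; i < 10; i++) {
            a[i] = c;
            for (j = 0; j < 10; j++) {   /* left serial by the tuner */
                t = a[i-1];
                b[j] = (t*b[j])/2.0;
            }
        }
    }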

  5. NEED FOR AUTOMATIC TUNING • Identify, at compile time, the optimization strategy for maximum performance • Beneficial parallelism • Which loops to parallelize • Parallel loop coverage

  6. OUR APPROACH • Best combination of loops to parallelize • Offline tuning • Decisions based on actual execution time

  7. CETUS: VERSION GENERATION

  8. SEARCH SPACE NAVIGATION • Search space -> the set of parallelizable loops • Generic tuning algorithm • Capture interaction • Use program execution time as decision metric • COMBINED ELIMINATION • Each loop is an on/off optimization (see the encoding sketch after this slide) • Selective parallelization • Pan, Z., Eigenmann, R.: Fast and effective orchestration of compiler optimizations for automatic performance tuning. In: The 4th Annual International Symposium on Code Generation and Optimization (CGO), March 2006, pp. 319–330
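  To make the on/off view concrete, here is a minimal C sketch (the helper name and buffer sizes are illustrative, not part of Cetus) of how a point in this search space can be encoded as the comma-separated 0/1 string that the Cetune interface on slide 10 passes to -tune-ompGen:

    #include <stdio.h>

    /* One 0/1 decision per parallelizable loop, encoded as the string
       accepted by -tune-ompGen (see slide 10). */
    static void config_to_tune_string(const int *on, int nloops,
                                      char *buf, size_t len)
    {
        size_t used = 0;
        buf[0] = '\0';
        for (int i = 0; i < nloops; i++) {
            int w = snprintf(buf + used, len - used, i ? ",%d" : "%d", on[i]);
            if (w < 0 || (size_t)w >= len - used)
                break;                       /* buffer full: stop encoding */
            used += (size_t)w;
        }
    }

    int main(void)
    {
        int on[] = { 1, 0, 1, 1 };           /* parallelize loops 0, 2, 3 only */
        char arg[64];
        config_to_tune_string(on, 4, arg, sizeof arg);
        printf("cetus -ompGen -tune-ompGen=\"%s\"\n", arg);
        return 0;
    }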

  9. TUNING ALGORITHM • BATCH ELIMINATION: considers the effects of each optimization separately; eliminates detrimental ones instantly • ITERATIVE ELIMINATION: considers interactions; more tuning time • COMBINED ELIMINATION: considers interactions amongst a subset; iterates over the smaller subset and performs batch elimination, taking each improved configuration as the new base case (see the sketch below)
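  The following is a minimal, self-contained sketch of Combined Elimination applied to per-loop parallelization decisions, after the idea cited on the previous slide (Pan & Eigenmann, CGO 2006). The time_config() stub, the NLOOPS constant, and its numbers are illustrative assumptions so the sketch runs stand-alone; the real measurement step generates the version with Cetus, compiles it, and times it on the train data set:

    #include <stdio.h>

    #define NLOOPS 8   /* illustrative search-space size */

    /* Stub measurement: pretends loops 0 and 1 are profitable to
       parallelize and the rest add pure overhead. */
    static double time_config(const int on[NLOOPS])
    {
        double t = 10.0;
        for (int i = 0; i < NLOOPS; i++)
            if (on[i]) t += (i < 2) ? -2.0 : 0.5;
        return t;
    }

    /* Combined Elimination over the per-loop on/off decisions. */
    static void combined_elimination(int on[NLOOPS])
    {
        for (int i = 0; i < NLOOPS; i++) on[i] = 1;   /* base: parallelize all loops */
        double base = time_config(on);

        int changed = 1;
        while (changed) {
            changed = 0;
            double rip[NLOOPS];

            /* Relative Improvement Percentage (RIP) of switching each loop
               off, measured against the current base configuration. */
            for (int i = 0; i < NLOOPS; i++) {
                rip[i] = 0.0;
                if (!on[i]) continue;
                on[i] = 0;
                rip[i] = (time_config(on) - base) / base * 100.0;  /* < 0: faster when off */
                on[i] = 1;
            }

            /* Permanently serialize loops whose removal still helps, most
               beneficial first, updating the base case after each kept change. */
            for (;;) {
                int best = -1;
                for (int i = 0; i < NLOOPS; i++)
                    if (on[i] && rip[i] < 0.0 && (best < 0 || rip[i] < rip[best]))
                        best = i;
                if (best < 0) break;

                on[best] = 0;
                double t = time_config(on);
                if (t <= base) { base = t; changed = 1; }   /* keep: new base case */
                else           { on[best] = 1; }            /* revert */
                rip[best] = 0.0;   /* do not reconsider this loop in this round */
            }
        }

        printf("final configuration (1 = parallel):");
        for (int i = 0; i < NLOOPS; i++) printf(" %d", on[i]);
        printf("\nexecution time: %.2f s\n", base);
    }

    int main(void)
    {
        int on[NLOOPS];
        combined_elimination(on);
        return 0;
    }

  With the stub above, the sketch settles on parallelizing only the two profitable loops; in the real system each decision comes from actually executing the generated version.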

  10. CETUNE INTERFACE

    int foo() {
        #pragma cetus parallel …
        for (i=0; i<50; i++) {
            t = a[i];
            a[i+50] = t + (a[i+50] + b[i])/2.0;
        }
        for (i=0; i<10; i++) {
            a[i] = c;
            #pragma cetus parallel …
            for (j=0; j<10; j++) {
                t = a[i-1];
                b[j] = (t*b[j])/2.0;
            }
        }
    }

    cetus -ompGen -tune-ompGen="1,1"   -> parallelize both loops
    cetus -ompGen -tune-ompGen="1,0"   -> parallelize one loop and serialize the other
    cetus -ompGen -tune-ompGen="0,1"   -> parallelize one loop and serialize the other
    cetus -ompGen -tune-ompGen="0,0"   -> serialize both loops

  11. EMPIRICAL MEASUREMENT • Input source code (train data set) • Automatic parallelization using Cetus -> start configuration • Tuning loop: next point in the search space -> version generation using tuner input -> back-end code generation (ICC) -> runtime performance measurement on an Intel Xeon dual quad-core with the train data set -> decision based on RIP (relative improvement percentage) • Final configuration
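  A sketch of what one trip around this measurement loop could look like as a small C driver. Only the cetus -ompGen / -tune-ompGen options appear on the slides; the benchmark file names, the cetus_output/ location, the icc flags, and the train.in input are assumptions made for illustration:

    #define _POSIX_C_SOURCE 199309L   /* for clock_gettime */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    /* Generate one version with Cetus, compile it with the back-end
       compiler, run it on the train input, and return its wall-clock time
       in seconds (a huge value on failure, so the tuner discards it). */
    static double measure_version(const char *tune_string)
    {
        char cmd[512];

        snprintf(cmd, sizeof cmd,
                 "cetus -ompGen -tune-ompGen=\"%s\" bench.c", tune_string);
        if (system(cmd) != 0) return 1e30;

        /* Back-end code generation with ICC (assumed file names and flags). */
        if (system("icc -qopenmp -O3 -o bench_tuned cetus_output/bench.c") != 0)
            return 1e30;

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        if (system("./bench_tuned < train.in > /dev/null") != 0) return 1e30;
        clock_gettime(CLOCK_MONOTONIC, &t1);

        return (double)(t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    }

    int main(void)
    {
        printf("both loops parallel: %.3f s\n", measure_version("1,1"));
        printf("both loops serial:   %.3f s\n", measure_version("0,0"));
        return 0;
    }

  A full tuner would invoke this measurement for each configuration proposed by Combined Elimination and keep the fastest one as the final configuration.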

  12. RESULTS

  13. RESULTS

  14. RESULTS

  15. CONTRIBUTIONS • Described a compiler + empirical system that detects parallel loops in serial and parallel programs and selects the combination of parallel loops that gives the highest performance • Finding profitable parallelism can be done using a generic tuning method • The method can be applied on a section-by-section basis, allowing fine-grained tuning of program sections • Using a set of NAS and SPEC OMP 2001 benchmarks, we show that the auto-parallelized and tuned version matches or improves upon the performance of the original serial or parallel program

  16. THANK YOU!
