Semester report summary


  1. Semester report summary
     Adam Leko, 1/25/2005
     HCS Research Laboratory, University of Florida

  2. Programming practices overview

  3. Programming practices: CAMEL
  • CAMEL: parallelization of an existing cipher written by members of the HCS lab
  • MPI and UPC versions written
  • Spinlock implementation in the MPI version forced a rewrite of the master/worker-style code (a generic sketch of that style follows this slide)
  • Relatively easy to port the existing C code; only slight restructuring of the application was required
  • Conclusions
    • Good overall performance
    • Not much difference between MPI/UPC or platforms
    • MPI code longer (100 LoC) than UPC
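A minimal sketch of a master/worker scheme in MPI, of the style mentioned on the slide above; the block count, tags, and work() routine are placeholders, not taken from the CAMEL source.

    /* Hypothetical MPI master/worker skeleton: the master hands out block
     * indices on demand and workers return results with their next request. */
    #include <mpi.h>

    #define NUM_BLOCKS 64
    #define TAG_WORK 1
    #define TAG_DONE 2

    static long work(int block) { return (long)block * block; }  /* placeholder */

    int main(int argc, char **argv) {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (rank == 0) {                  /* master: no spinning, just Recv/Send */
            int next = 0, done = 0;
            long result;
            MPI_Status st;
            while (done < size - 1) {
                /* each incoming message is both a result and a work request */
                MPI_Recv(&result, 1, MPI_LONG, MPI_ANY_SOURCE, MPI_ANY_TAG,
                         MPI_COMM_WORLD, &st);
                if (next < NUM_BLOCKS) {
                    MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_WORK,
                             MPI_COMM_WORLD);
                    next++;
                } else {
                    int stop = -1;
                    MPI_Send(&stop, 1, MPI_INT, st.MPI_SOURCE, TAG_DONE,
                             MPI_COMM_WORLD);
                    done++;
                }
            }
        } else {                          /* worker: request, compute, repeat */
            long result = 0;
            int block;
            MPI_Status st;
            for (;;) {
                MPI_Send(&result, 1, MPI_LONG, 0, TAG_WORK, MPI_COMM_WORLD);
                MPI_Recv(&block, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
                if (st.MPI_TAG == TAG_DONE) break;
                result = work(block);
            }
        }
        MPI_Finalize();
        return 0;
    }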

  4. Programming practices: Bench9 (Mod 2N inverse) & convolution
  • Convolution: simple image/signal processing operation
    • Embarrassingly parallel operation (a minimal sketch of the decomposition follows this slide)
    • MPI, UPC, and SHMEM versions written
  • Bench9: part of the NSA benchmark suite
    • Quick, embarrassingly parallel computation (memory intensive)
    • Bandwidth-intensive, sequential check phase
    • MPI, UPC, and SHMEM versions written
  • Conclusions
    • UPC compiler can add overhead
    • MPI most difficult to write (necessary to map out all communication manually)
    • One-sided SHMEM get/put simplified things
    • Bench9
      • UPC easiest to write, but worst performance
      • UPC also most sensitive to performance optimizations
    • Convolution
      • Near-linear speedup obtained for all platforms and versions
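The sketch below shows the kind of embarrassingly parallel decomposition described on the slide, as a 1-D convolution in C with MPI; the signal length, kernel values, and the assumption that the length divides evenly among ranks are all illustrative, not taken from the benchmark source.

    #include <mpi.h>
    #include <stdlib.h>

    #define N 1024      /* signal length (assumed divisible by the rank count) */
    #define K 5         /* kernel length */

    int main(int argc, char **argv) {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int chunk = N / size;
        double *signal = malloc(N * sizeof(double));     /* replicated input  */
        double *local  = calloc(chunk, sizeof(double));  /* this rank's strip */
        double kern[K] = {0.1, 0.2, 0.4, 0.2, 0.1};

        if (rank == 0)                          /* fill the input on one rank */
            for (int i = 0; i < N; i++) signal[i] = (double)i;
        MPI_Bcast(signal, N, MPI_DOUBLE, 0, MPI_COMM_WORLD);

        /* independent computation: no communication inside the loop */
        for (int i = rank * chunk; i < (rank + 1) * chunk; i++)
            for (int k = 0; k < K; k++)
                if (i - k >= 0)
                    local[i - rank * chunk] += kern[k] * signal[i - k];

        /* collect the strips on rank 0 */
        double *out = (rank == 0) ? malloc(N * sizeof(double)) : NULL;
        MPI_Gather(local, chunk, MPI_DOUBLE, out, chunk, MPI_DOUBLE, 0,
                   MPI_COMM_WORLD);

        MPI_Finalize();
        return 0;
    }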

  5. Programming practices: Concurrent wave equation
  • Wave equation: parallelization of an existing program to simulate waveforms of a stationary plucked string
  • Compute-bound, memory-intensive algorithm
  • Conclusions
    • Near-linear speedup obtained
    • Construct performance difference: array+j performed slightly better than &(array[j]) (the two forms are shown below)
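The two constructs compared on the slide are semantically identical in C and UPC; the reported difference is only in the code one compiler generated for each spelling. A trivial, hypothetical illustration:

    /* Both pointers name the same element; only the spelling differs. */
    void update(double *array, int j, double value) {
        double *p1 = array + j;      /* pointer-arithmetic form   */
        double *p2 = &(array[j]);    /* address-of-subscript form */
        /* p1 == p2 always holds */
        *p1 = value;
        *p2 = value;
    }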

  6. Programming practices: Depth-first search
  • Depth-first search: tree searching algorithm
    • Represent the 2-ary tree via an array (layout sketched below)
    • Simple to implement sequentially
  • UPC implementation strategy: "spawn" workers as the depth of the search increases
  • Conclusions
    • UPC doesn't directly support dynamically spawning threads!
    • Optimizations can have a large effect (see figure on the original slide)
    • Negative speedup obtained due to communication overhead
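A small sketch of the array-based binary tree representation and a sequential depth-first search over it; the tree size, node values, and search target are made up for illustration, and the UPC worker-spawning strategy is not shown.

    #include <stdio.h>

    #define TREE_SIZE 15   /* complete binary tree: children of i are 2i+1, 2i+2 */

    static int found = -1;

    void dfs(const int *tree, int i, int target) {
        if (i >= TREE_SIZE || found >= 0) return;
        if (tree[i] == target) { found = i; return; }
        dfs(tree, 2 * i + 1, target);   /* left child  */
        dfs(tree, 2 * i + 2, target);   /* right child */
    }

    int main(void) {
        int tree[TREE_SIZE];
        for (int i = 0; i < TREE_SIZE; i++) tree[i] = i * 7 % 13;  /* arbitrary */
        dfs(tree, 0, 11);
        printf("target found at index %d\n", found);
        return 0;
    }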

  7. Optimizations overview

  8. Optimizations reviewed
  • Broken into categories according to when the optimization is performed
    • Sequential compiler optimizations – specific to sequential compilers; includes techniques such as loop unrolling, software pipelining, etc.
    • Pre-compilation optimization methods – deal with high-level issues such as data placement and load balancing
    • Compile-time optimizations – strategies used by HPF, Co-Array Fortran, and OpenMP compilers
    • Runtime optimizations – dynamic load balancing, etc.
    • Post-runtime optimizations – analysis of trace files, etc.

  9. Sequential compiler optimizations
  • Reduction transformations
    • Purpose
      • Eliminate duplicated work
      • Transform individual statements into equivalent statements of lesser cost
    • Examples
      • Replace X^2 with X * X (algebraic simplification and strength reduction)
      • Store common subexpressions so they are computed only once (common subexpression elimination)
      • Short-circuit the evaluation of boolean expressions (short-circuiting)
  • Function transformations
    • Purpose: reduce the overhead of function calls
    • Examples
      • Store arguments to functions in registers instead of on the stack (parameter promotion)
      • Replicate function code to eliminate function call overhead (function inlining)
      • Store results from functions that have no side effects (function memoization)
  • Transforming code loops
    • Purpose
      • Reduce computational complexity
      • Increase parallelism
      • Improve memory access characteristics
    • Examples
      • Move loop-invariant code outside of the loop to reduce computation per iteration (loop-invariant code motion)
      • Reorder instructions to pipeline memory accesses (loop pipelining)
      • Merge different loops to reduce loop counter overhead (loop fusion)
      • Split loops into different pieces to vectorize operations (strip mining, loop tiling)
  • Memory access transformations
    • Purpose
      • Reduce the cost of memory operations
      • Restructure the program to reduce the number of memory operations
    • Examples
      • Pad arrays so they fit in cache line sizes (array padding)
      • Replicate code in the binary to improve I-cache efficiency (code co-location)
      • Keep commonly used memory locations pegged in registers (scalar replacement)
  (A few of these transformations are written out by hand after this slide.)
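Before/after pairs for three of the transformations listed above, written out by hand in C for illustration (a real compiler performs these rewrites automatically; the function names are invented):

    #include <math.h>

    /* Strength reduction / algebraic simplification: x^2 becomes x * x. */
    double square_before(double x) { return pow(x, 2.0); }
    double square_after(double x)  { return x * x; }

    /* Common subexpression elimination: a * b is computed only once. */
    double cse_before(double a, double b, double c) { return a * b + c / (a * b); }
    double cse_after(double a, double b, double c) {
        double t = a * b;              /* common subexpression stored once */
        return t + c / t;
    }

    /* Loop-invariant code motion: scale * offset does not depend on i,
     * so it is hoisted out of the loop. */
    void licm_before(double *v, int n, double scale, double offset) {
        for (int i = 0; i < n; i++)
            v[i] = v[i] * (scale * offset);
    }
    void licm_after(double *v, int n, double scale, double offset) {
        double k = scale * offset;     /* hoisted invariant */
        for (int i = 0; i < n; i++)
            v[i] = v[i] * k;
    }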

  10. Pre-compilation optimizations
  • Tiling
    • Purpose: automatically parallelize sequential loops
    • Similar to the loop tiling performed by vectorizing sequential compilers (a minimal sketch follows this slide)
    • Works for programs that make heavy use of nested for loops
    • Takes loops and transforms them into atomic pieces that can be executed independently
    • Issues: tile shapes, mapping tiles to processors
  • Augmented data access descriptors (ADADs)
    • Purpose: automatically parallelize Fortran do loops
    • Instead of analyzing loop dependencies, ADADs represent how sections of code affect each other
    • Loop fusion and other loop parallelization techniques can be applied directly to ADADs
    • Potentially lets compilers use ADADs to choose between different optimization techniques
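A minimal sketch of what tiling does to a doubly nested loop, assuming a simple element-wise update and a made-up tile size B; each B x B tile is an independent unit of work that could be handed to a different processor. Real tiling frameworks also decide tile shape and the tile-to-processor mapping, as noted on the slide.

    #define B 32    /* tile size: a tunable placeholder */

    void add_tiled(int n, double *a, const double *b) {
        for (int ii = 0; ii < n; ii += B)
            for (int jj = 0; jj < n; jj += B)
                /* one tile: a bounded, independent block of iterations */
                for (int i = ii; i < ii + B && i < n; i++)
                    for (int j = jj; j < jj + B && j < n; j++)
                        a[i * n + j] += b[i * n + j];
    }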

  11. Compile-time optimizations
  • General compile-time strategies
    • Purpose
      • Eliminate unnecessary communication
      • Reduce the cost of communication
    • Examples
      • Align arrays to fit shared-memory cache line sizes (cache alignment)
      • Group data together before sending to reduce the number of messages sent (message vectorization, coalescing, and aggregation); a hand-coded illustration follows this slide
      • Overlap communication and computation by splitting the receive operation into two phases (message pipelining)
  • Existing compilers
    • PARADIGM
      • HPF compiler that uses an abstract model to determine how to decompose HPF statements
      • Optimizations performed: message coalescing, vectorization, and pipelining; overlapping of loops that cannot be parallelized due to loop-carried dependencies (coarse-grained pipelining)
    • McKinley's algorithm
      • Splits the compilation phase into 4 stages: optimization, fusing, parallelization, and enabling
      • Uses a wide variety of optimization techniques
      • Author argues all techniques are necessary to get good performance out of "dusty deck" (unmodified sequential) codes
    • ASTI compiler
      • Existing sequential compiler developed by IBM, extended to support SMP machines
      • Uses
        • Models of cache misses and TLB access costs, in addition to many sequential optimizations
        • "Function outlining" (the opposite of function inlining) to simplify thread storage
        • A dynamic self-scheduling load-balancing library
      • Not very good results on a 4-CPU machine compared to hand-tuned code
    • dHPF compiler
      • High Performance Fortran compiler developed at Rice to automatically parallelize HPF code
      • Uses many of the (previously listed) existing communication optimizations
      • Adds two that are necessary for good performance on the NAS benchmarks
        • Bringing in local copies of read-only, loop-invariant (minus antidependencies) variables for each thread
        • Replication of computation via a special LOCALIZE statement to reduce unnecessary communication for quick computations
      • Competitive results obtained on the NAS benchmarks
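A hand-coded illustration of the message vectorization/coalescing idea from the slide: instead of sending one boundary element per message, the elements are packed into a buffer and shipped as a single message. The grid layout and sizes are assumptions made for the example, not taken from any of the compilers discussed.

    #include <mpi.h>

    #define NROWS 256

    /* Naive version: one MPI_Send per boundary element (NROWS small messages). */
    void send_boundary_naive(double (*grid)[NROWS], int dest) {
        for (int i = 0; i < NROWS; i++)
            MPI_Send(&grid[i][NROWS - 1], 1, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD);
    }

    /* Coalesced version: gather the boundary into a buffer, send one message. */
    void send_boundary_coalesced(double (*grid)[NROWS], int dest) {
        double buf[NROWS];
        for (int i = 0; i < NROWS; i++)
            buf[i] = grid[i][NROWS - 1];
        MPI_Send(buf, NROWS, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD);
    }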

  12. Runtime optimizations
  • Why do optimizations at runtime?
    • Optimizations are less costly when done earlier (at or before compile time)
    • But for irregular applications, runtime is the only choice
  • Inspector/executor scheme (a small sketch follows this slide)
    • Created for applications whose work distribution is not known until runtime
    • The inspector creates a "plan" for work distribution at runtime
    • The executor is in charge of orchestrating execution of the plan created by the inspector
    • The overhead of the inspector must be balanced against the overall work distribution
    • Implemented in the PARTI library
  • Nikolopoulos' method
    • OpenMP-specific method that uses unmodified OpenMP APIs
    • Uses a few short probing iterations
    • Probing iterations indicate where work imbalance exists
    • A greedy method redistributes work among processors to even things out
    • Worked well for such a simple method (within 33% of hand-tuned MPI code)
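A very small inspector/executor sketch in the spirit of the scheme described above, assuming an owner-computes loop that reads through an indirection array; fetch_remote() is a stub standing in for the communication layer, not a PARTI API.

    #include <stdlib.h>

    typedef struct { int count; int *remote_idx; } plan_t;

    /* Inspector: scan the indirection array once at runtime and record which
     * referenced indices fall outside this process's owned range [lo, hi). */
    plan_t inspect(const int *idx, int n, int lo, int hi) {
        plan_t p;
        p.count = 0;
        p.remote_idx = malloc(n * sizeof(int));
        for (int i = 0; i < n; i++)
            if (idx[i] < lo || idx[i] >= hi)
                p.remote_idx[p.count++] = idx[i];
        return p;
    }

    /* Stub standing in for the communication layer (e.g., a one-sided get). */
    static double fetch_remote(int global_index) { return (double)global_index; }

    /* Executor: carry out the plan (fetch the remote data), then compute. */
    void execute(const plan_t *p, double *scratch) {
        for (int i = 0; i < p->count; i++)
            scratch[i] = fetch_remote(p->remote_idx[i]);
        /* ... local computation using owned data plus the fetched values ... */
    }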

  13. Post-runtime optimizations
  • Ad-hoc methods
    • Rely on rudimentary analysis to guide the programmer on what to work on
    • Use a code, instrument, run, analyze, code, instrument, ... loop
    • Rely heavily on luck and the skill of the programmer
    • Most widely used method today!
  • PARADISE
    • Analyzes trace files generated by the Charm++ parallel library/runtime system (developed at UIUC)
    • Suggested optimizations deal with distributed object-based systems
    • Optimizations can be applied automatically by means of a "hint" file given to the Charm++ runtime system
  • KAPPA-PI
    • Knowledge-based system that identifies bottlenecks using a rule-based system
    • Bottlenecks are presented to the user, correlated with source code
    • Also gives recommendations on how to fix the problems identified
    • Seems very rudimentary and aimed at novice programmers
  • Difficult problem, but seems potentially very valuable

  14. Performance modeling overview

  15. Performance modeling overview
  • Why? Several reasons
    • Grid systems: need a way to estimate how long a program will take (billing/scheduling issues)
    • Could be used in conjunction with optimization methods to suggest improvements to the user
    • Can also guide the user on what kind of benefit to expect from optimizing aspects of the code
    • Figure out how far code is from optimal performance
    • Indirectly detect problems: if a section of code is not performing as predicted, it probably has cache locality problems, etc.
  • Challenge
    • Many models already exist, with varying degrees of accuracy and speed
    • Choose the best model to fit into the UPC/SHMEM PAT
  • Existing performance models fall into different categories
    • Formal models (process algebras, Petri nets)
    • General models that provide "mental pictures" of hardware/performance
    • Predictive models that try to estimate timing information

  16. Formal performance models
  • Least useful for our purposes
    • Formal methods are strongly rooted in math
    • Can make strong statements and guarantees
    • However, difficult to adapt and automate for new programs
  • Petri nets
    • Specialized graphs that represent processes and systems
    • Very generic method of modeling many different things
    • Older (invented 1962) and more mature, but Petri nets don't provide much groundwork for parallel program modeling
  • Process algebras
    • Formal algebras for specifying parallel processes and how they interact
    • Hoare's CSP, Milner's CCS
    • Entire books are devoted to this subject
    • Complicated to use, but can prove properties such as an algorithm being deadlock-free
  • Queuing theory
    • Very strongly rooted in math (ECE course on the subject)
    • Hard to apply to real-world programs
  • PAMELA
    • C-style imperative language used to model concurrent and time-related operations
    • Similar to process algebras, but geared towards simulation of models created in the PAMELA language
    • Much work is required to create PAMELA models directly from source code or trace files
    • Models encode high-level parallel information about what is going on in a program
    • Reductions are necessary to shrink PAMELA models to feasible simulation times

  17. General performance models
  • Provide the user with a "mental picture"
    • Rules of thumb for the cost of operations
    • Guide strategies used while creating programs
  • PRAM
    • Classic model that uses unit-cost operations for all memory accesses
    • Useful for determining the parallel complexity of an algorithm
    • Very easy to work with
    • Not very accurate
      • No synchronization costs
      • Uniform memory access cost
      • Simplistic contention model (combinations of concurrent/exclusive reads and writes)
  • BSP
    • Aims to provide a bridging tool between software and hardware, much as the von Neumann model has done for sequential programming
    • Breaks communication and computation into phases (supersteps)
    • Barriers are performed between all supersteps
    • Uses a simplistic communication model (processors are assumed to send a fixed number of messages in each superstep)
    • Reasonable accuracy (~20% for CFD)
  • LogP
    • Model that only takes communication cost into consideration
      • Latency, overhead, gap, number of processors
    • Simple model to work with
    • Predicts network performance well (with extensions as needed for modern networks)
    • Has been applied to predict memory performance in the past with only moderate success (memory LogP)
  • Other interesting finds
    • One paper modeled the compiler overhead introduced by Dataparallel C compilers (a scaling factor)
    • Application-specific models are not useful for a PAT
    • Adding in tons of parameters from microbenchmarks gives complicated equations but not necessarily better accuracy
  (The standard BSP and LogP cost expressions are recalled after this slide.)
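For reference, the textbook cost expressions behind the BSP and LogP summaries above (standard forms, not taken from the report itself):

    \[ T_{\text{superstep}} = w + g \cdot h + l \]

BSP: w is the maximum local work in the superstep, h is the largest number of words any processor sends or receives, g is the per-word communication gap (inverse bandwidth), and l is the barrier cost; the total predicted time is the sum over all supersteps.

    \[ T_{\text{small message}} = L + 2o \]

LogP: a small point-to-point message costs the network latency L plus one send and one receive overhead o, with at most \lceil L/g \rceil messages in flight per processor.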

  18. Predictive performance models [1]
  • Models that specifically predict the performance of parallel codes
  • Lost cycles analysis
    • Geared towards real-world usage
    • Simple idea: anything that is not computation is not useful
    • Program state recorded by setting flags (manual(?) instrumentation)
    • States are sampled or logged
    • Predicates are used to determine if a program is losing cycles
      • E.g.: LoadImbalance(x) ≡ WorkExists ∧ ProcessorsIdle(x)
    • Authors assert the predicates they use are orthogonal and complete
    • Good accuracy (within ~12.5% for FFT)
    • Not clear how to relate the information to the source level
  • Task graphs
    • Common technique similar to process algebras and PAMELA
    • Graphically model the amount of parallelism inherent in a given program
    • Also take into account dependencies between tasks
    • Complex program control is approximated via mean execution times (assumed deterministic)
    • Graphs are used in conjunction with system models to quickly predict performance
    • Good accuracy (although it depends on the quality of the system models)
    • Open-ended enough to adapt to a PAT
    • Can also incorporate analysis into the task graph, since it represents program structure
    • Generating task graphs may be difficult, even from program traces
  • VFCS
    • Vienna Fortran Compilation System, a parallelizing Fortran compiler that uses a predictive model to parallelize code
    • Uses a profiling stage on sequential code to determine sequential code characteristics
    • The predictive model uses analytic techniques (an earlier version used simulation)
    • GUI/IDE incorporates the "cost" of each statement during the coding phase
    • Cannot be extended (old, large code base), but useful for examining the techniques used by the system

  19. Predictive performance models [2]
  • PACE
    • Novel idea: generate predictive traces that can be viewed with existing tools (SvPablo)
    • Geared towards grid applications
    • Uses the performance language CHIP3S to model program performance
    • Models are compiled and can be evaluated quickly
    • No standard way of creating performance models is illustrated
  • Convolution
    • Uses several existing tools to predict the performance of MPI applications
    • "Convolves" system characteristics with application characteristics
      • System: memory performance (MAPS) and network performance (PMB)
      • Application: memory accesses (MetaSim tracer) and network accesses (MPITrace)
    • Fairly good accuracy (within ~20% for predicting matrix multiply on another platform)
    • Currently limited to the Alpha platform
    • Requires programs to be run before predictions are made
    • The convolution method is not detailed in any available papers
  • Several other models considered (see report, section 5)
