
An Open64-based Compiler Approach to Performance Prediction and Performance Sensitivity Analysis for Scientific Codes

Jeremy Abramson and Pedro C. Diniz, University of Southern California / Information Sciences Institute, 4676 Admiralty Way, Suite 1001, Marina del Rey, California 90292.





Presentation Transcript


  1. An Open64-based Compiler Approach to Performance Prediction and Performance Sensitivity Analysis for Scientific Codes Jeremy Abramson and Pedro C. Diniz University of Southern California / Information Sciences Institute 4676 Admiralty Way, Suite 1001 Marina del Rey, California 90292

  2. Motivation
  • Performance analysis is conceptually easy
    • Just run the program!
    • The “what” of performance. Is this interesting? Is that realistic?
    • Huge programs with large data sets
    • “Uncertainty principle” and intractability of profiling/instrumenting
  • Performance prediction and analysis is in practice very hard
    • Not just interested in wall clock time
    • The “why” of performance is a big concern
    • How to accurately characterize program behavior?
    • What about architecture effects?
    • Can’t reuse wall clock time
    • Can reuse program characteristics

  3. Motivation (2)
  • What about the future?
    • Different architecture = better results?
    • Compiler transformations (loop unrolling)
  • Need a fast, scalable, automated way of determining program characteristics
  • Determine what causes poor performance
    • What does profiling tell us?
    • How can the programmer use profiling (low-level) information?

  4. Overview
  • Approach
    • High level / low level synergy
    • Not architecture-bound
  • Experimental results
    • CG core
  • Caveats and future work
  • Conclusion

  5. Low versus High level information

      la   $r0, a
      lw   $r1, i
      mult $offset, $r1, 4
      add  $offset, $offset, $r0
      lw   $r2, $offset
      add  $r3, $r2, 1
      la   $r4, b
      sw   $r3, $r4

  • Which can provide meaningful performance information to a programmer?
  • How do we capture the information at a low level while maintaining the structure of high level source?

  6. Low versus High level information (2)
  • Drawbacks of looking at low-level
    • Too much data!
    • You found a “problem” spot. What now?
    • How do programmers relate information back to source level?
  • Drawbacks of looking at source-level
    • What about the compiler? Code may look very different
    • Architecture impacts?
  • Solution: Look at high-level structure, try to anticipate compiler

  7. Experimental Approach
  • Goal: Derive performance expectations from source code for different architectures
    • What should the performance be and why?
    • What is limiting the performance? Data-dependencies? Architecture limitations?
  • Use high level information
    • WHIRL intermediate representation in Open64
    • Arrays not lowered
  • Construct DFG
    • Decorate graph with latency information
  • Schedule the DFG
    • Compute as-soon-as-possible schedule
    • Variable number of functional units (ALU, Load/Store, Registers)
    • Pipelining of operations

  8. Compilation process
  1. Source (C/Fortran): ... B = A[i] + 1 ... inside a loop
  2. Open64 WHIRL (High-level):
       OPR_STID: B
         OPR_ADD
           OPR_ARRAY
             OPR_LDA: A
             OPR_LDID: i
           OPR_CONST: 1
  3. Annotated DFG

  9. Memory modeling approach
  • i is a loop induction variable; the Array node represents the address calculation at a high level
  • Register hit? Assign latency 0
  • Array expression is affine? Assume a cache hit, and assign latency accordingly

  10. Example: CG

        do 200 j = 1, n
           xj = x(j)
           do 100 k = colstr(j), colstr(j+1)-1
              y(rowidx(k)) = y(rowidx(k)) + a(k)*xj
  100    continue
  200 continue

  11. CG Analysis Results
  Figure 4. Validation results of CG on a MIPS R10000 machine
  • Prediction results consistent with un-optimized version of the code

  12. CG Analysis Results (2)
  Figure 5. Cycle time for an iteration of CG with varying architectural configurations
  • What’s the best way to use processor space?
    • Pipelined ALUs?
    • Replicate standard ALUs?

  13. Caveats, Future Work
  • More compiler-like features are needed to improve accuracy
  • Control flow
    • Implement trace scheduling
    • Multiple paths can give upper/lower performance bounds
  • Simple compiler transformations
    • Common sub-expression elimination
    • Strength reduction
    • Constant folding
  • Register allocation
    • “Distance”-based methods?
    • Anticipate cache for spill code
  • Software pipelining?
    • Unrolling exploits ILP
  • Run-time data?
    • Array references, loop trip counts, access patterns from performance skeletons

  14. Conclusions
  • SLOPE provides very fast performance prediction and analysis results
  • High-level approach gives more meaningful information
    • Still try to anticipate compiler and memory hierarchy
  • More compiler transformations to be added
    • Maintain high-level approach, refine low-level accuracy

  15. An Open64-based Compiler Approach to Performance Prediction and Performance Sensitivity Analysis for Scientific Codes Jeremy Abramson and Pedro C. Diniz University of Southern California / Information Sciences Institute 4676 Admiralty Way, Suite 1001 Marina del Rey, California 90292
