
Performance Tools for Empirical Autotuning

Presentation Transcript


  1. Performance Tools for Empirical Autotuning
  Allen D. Malony, Nick Chaimov, Kevin Huck, Scott Biersdorff, Sameer Shende
  {malony,nchaimov,khuck,scott,sameer}@cs.uoregon.edu
  University of Oregon

  2. Outline
  • Motivation
  • Performance engineering and autotuning
  • Performance tool integration with the autotuning process
  • TAU performance system overview
  • Performance database (TAUdb)
  • Framework for empirical-based performance tuning
  • Integration with CHiLL and Active Harmony
  • Integration with Orio (preliminary)
  • Conclusions and future directions

  3. Parallel Performance Engineering
  • Scalable, optimized applications deliver on the promise of HPC
  • Optimization through a performance engineering process
  • Understand performance complexity and inefficiencies
  • Tune the application to run optimally on high-end machines
  • How to make the process more effective and productive?
  • What is the nature of performance problem solving?
  • What performance technology should be applied?
  • Performance tool efforts have focused on performance observation, analysis, and problem diagnosis
  • Application development and optimization productivity
  • Programmability, reusability, portability, robustness
  • Performance technology is part of a larger programming system

  4. Parallel Performance Engineering Process
  • Traditionally an empirically-based approach: Performance Observation → (characterization) → Performance Experimentation → (properties) → Performance Diagnosis → (hypotheses) → Performance Tuning
  • Performance technology developed for each level:
  • Observation: instrumentation, measurement, analysis, visualization
  • Experimentation: experiment management, performance storage
  • Diagnosis: data mining, models, expert systems

  5. Parallel Performance Diagnosis

  6. “Extreme” (Progressive) Performance Engineering
  • Increased performance complexity forces the engineering process to be more intelligent and automated
  • Automate performance data analysis / mining / learning
  • Automated performance problem identification
  • Performance engineering tools and practice must incorporate a performance knowledge discovery process
  • Model-oriented knowledge
  • Computational semantics of the application
  • Symbolic models for algorithms
  • Performance models for system architectures / components
  • Application developers can be more directly involved in the performance engineering process

  7. “Extreme” Performance Engineering
  • Empirical performance data evaluated with respect to performance expectations at various levels of abstraction

  8. Autotuning is a Performance Engineering Process
  • Autotuning methodology incorporates aspects common to “traditional” application performance engineering
  • Empirical performance observation
  • Experiment-oriented
  • Autotuning embodies progressive engineering techniques
  • Automated experimentation and performance testing
  • Optimization guided by (intelligent) search space exploration
  • Model-based (domain-specific) computational semantics
  • Autotuning is a different style of performance engineering from diagnosis-oriented approaches
  • There are shared objectives for performance technology and opportunities for tool integration

  9. TAU Performance System® (http://tau.uoregon.edu)
  • Parallel performance framework and toolkit
  • Supports all HPC platforms, compilers, and runtime systems
  • Provides portable instrumentation, measurement, and analysis

  10. TAU Components
  • Instrumentation
  • Fortran, C, C++, UPC, Chapel, Python, Java
  • Source, compiler, library wrapping, binary rewriting
  • Automatic instrumentation
  • Measurement
  • MPI, OpenSHMEM, ARMCI, PGAS
  • Pthreads, OpenMP, other thread models
  • GPU: CUDA, OpenCL, OpenACC
  • Performance data (timing, counters) and metadata
  • Parallel profiling and tracing
  • Analysis
  • Performance database technology (TAUdb, formerly PerfDMF)
  • Parallel profile analysis (ParaProf)
  • Performance data mining / machine learning (PerfExplorer)
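  To make the instrumentation component concrete, here is a minimal sketch of manual source instrumentation with TAU's C API (TAU_PROFILE, TAU_PROFILE_SET_NODE); the routine and computation are placeholders, and the program would be built with a TAU compiler wrapper such as tau_cc.sh so the measurement library is linked in:

    #include <TAU.h>

    /* Sketch: manual TAU instrumentation. A profile per process/thread
     * is written when the program exits. */
    int work(int n) {
        TAU_PROFILE("work", "int (int)", TAU_USER);
        int sum = 0;
        for (int i = 0; i < n; i++) sum += i;   /* placeholder computation */
        return sum;
    }

    int main(int argc, char **argv) {
        TAU_PROFILE("main", "int (int, char **)", TAU_DEFAULT);
        TAU_PROFILE_SET_NODE(0);   /* single-process run */
        return work(1000) > 0 ? 0 : 1;
    }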

  11. TAU Performance Database – TAUdb
  • Started in 2004 (Huck et al., ICPP 2005) as the Performance Data Management Framework (PerfDMF)
  • Database schema and Java API: profile parsing, database queries, conversion utilities (parallel profiles from other tools)
  • Provides DB support for TAU profile analysis tools: ParaProf, PerfExplorer, Eclipse PTP
  • Used as the regression testing database for TAU
  • Used as a performance regression database
  • Ported to several DBMSs: PostgreSQL, MySQL, H2, Derby, Oracle, DB2

  12. TAUdb Database Schema
  • Parallel performance profiles
  • Timer and counter measurements with 5 dimensions:
  • Physical location: process / thread
  • Static code location: function / loop / block / line
  • Dynamic location: current callpath and context (parameters)
  • Time context: iteration / snapshot / phase
  • Metric: time, HW counters, derived values
  • Measurement metadata
  • Properties of the experiment
  • Anything from name:value pairs to nested, structured data
  • Single value for whole experiment or full context (tuple of thread, timer, iteration, timestamp)
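  As a schematic illustration only (not the actual TAUdb schema), a measurement record keyed by the five dimensions above could be modeled in C like this; all field names are assumptions for the sketch:

    /* Schematic sketch of a TAUdb-style measurement record keyed by the
     * five dimensions described above. Field names are illustrative. */
    typedef struct {
        int process, thread;     /* physical location */
        const char *timer;       /* static code location: function/loop/block/line */
        const char *callpath;    /* dynamic location: callpath + context parameters */
        int iteration;           /* time context: iteration/snapshot/phase */
        const char *metric;      /* "TIME", a HW counter name, a derived value, ... */
        double value;            /* the measurement itself */
    } measurement;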

  13. TAUdb Programming APIs
  • Java
  • Original API; basis for in-house analysis tool support
  • Command line tools for batch loading into the database
  • Parses 15+ profile formats: TAU, gprof, Cube, HPCT, mpiP, DynaProf, PerfSuite, …
  • Supports Java embedded databases (H2, Derby)
  • C programming interface under development
  • PostgreSQL support first, others as requested
  • Query prototype developed
  • Plan full-featured API: query, insert, and update
  • Evaluating SQLite support
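  Since the C interface was still under development at this point, the following is a purely hypothetical usage sketch: the function and type names (taudb_connect_config, taudb_query_trials, TAUDB_TRIAL, and so on) are assumptions about what such an API could look like, not a confirmed interface.

    #include <stdio.h>
    #include "taudb_api.h"   /* assumption: TAUdb C API header name */

    int main(void) {
        /* Connect using a named configuration (host, database, credentials). */
        TAUDB_CONNECTION *conn = taudb_connect_config("demo");   /* hypothetical */
        taudb_check_connection(conn);                            /* hypothetical */

        /* Query trials matching a filter; here, a single trial by id. */
        TAUDB_TRIAL *filter = taudb_create_trials(1);            /* hypothetical */
        filter->id = 42;
        TAUDB_TRIAL *trials = taudb_query_trials(conn, 0, filter);
        printf("Trial name: %s\n", trials[0].name);

        taudb_disconnect(conn);
        return 0;
    }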

  14. TAUdb Tool Support
  • ParaProf
  • Parallel profile viewer / analyzer
  • 2D and 3D visualizations
  • Single-experiment analysis
  • PerfExplorer
  • Data mining framework: clustering, correlation
  • Multi-experiment analysis
  • Scripting engine
  • Expert system

  15. TAU Integration with CHiLL and Active Harmony
  • Major goals:
  • Integrate TAU with existing autotuning frameworks
  • Use TAU to gather performance data for autotuning/specialization
  • Store performance data tagged with metadata about the execution environment and input in a centralized database
  • Use machine learning and data mining techniques to increase the level of automation of autotuning and specialization
  • Using TAU in two ways:
  • Using multi-parameter-based profiling support to generate separate profiles based on function parameters (or outlined code)
  • Using TAU metrics stored in PerfDMF/TAUdb as performance measures in optimization

  16. Components
  • ROSE outliner
  • ROSE is a compiler with built-in support for source-to-source transformations
  • The ROSE outliner, given a reference to an AST node, extracts that node into its own function or file
  • CHiLL
  • Provides a domain-specific language for specifying transformations on loops
  • Active Harmony
  • Searches the space of parameters to transformation recipes
  • TAU
  • Performance instrumentation and measurement

  17. Multi-Parameter Profiling
  • Added multi-parameter-based profiling in TAU to support specialization
  • User can select which parameters are of interest using a selective instrumentation file
  • Consider a matrix multiply function: for void matmult(float **c, float **a, float **b, int L, int M, int N), we can generate profiles based on the matrix dimensions encountered during execution by parameterizing on L, M, and N (see the sketch below)
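  Alongside the selective-instrumentation route shown on the next slide, TAU also offers parameter macros for manual instrumentation. A minimal sketch, assuming TAU's TAU_PROFILE and TAU_PROFILE_PARAM1L macros; the loop body is just a naive matrix multiply for illustration:

    #include <TAU.h>

    /* Sketch: manually parameterized profiling of matmult. Each distinct
     * (L, M, N) combination gets its own profile entry, mirroring what the
     * selective instrumentation file does automatically. */
    void matmult(float **c, float **a, float **b, int L, int M, int N) {
        TAU_PROFILE("matmult", "void (float**, float**, float**, int, int, int)", TAU_USER);
        TAU_PROFILE_PARAM1L(L, "L");
        TAU_PROFILE_PARAM1L(M, "M");
        TAU_PROFILE_PARAM1L(N, "N");
        for (int i = 0; i < L; i++)
            for (int j = 0; j < N; j++) {
                c[i][j] = 0.0f;
                for (int k = 0; k < M; k++)
                    c[i][j] += a[i][k] * b[k][j];
            }
    }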

  18. Using Parameterized Profiling in TAU
  Parameterized profile entries:
    int matmult(float **, float **, float **, int, int, int) <L=100, M=8, N=8> C
    int matmult(float **, float **, float **, int, int, int) <L=10, M=100, N=8> C
    int matmult(float **, float **, float **, int, int, int) <L=10, M=8, N=8> C
  Selective instrumentation file:
    BEGIN_INCLUDE_LIST
    matmult
    END_INCLUDE_LIST
    BEGIN_INSTRUMENT_SECTION
    loops file="foo.c" routine="matrix#"
    param file="foo.c" routine="matmult" param="L" param="M" param="N"
    END_INSTRUMENT_SECTION

  19. Parameterized Profiling / Autotuning with TAUdb

  20. Autotuning with TAUdb Methodology
  • Each time the program executes a code variant, we store metadata in the performance database indicating how the variant was produced:
  • Source function
  • Name of CHiLL recipe
  • Parameters to CHiLL recipe
  • The database also contains metadata on the parameters with which the variant was called and on the execution environment:
  • OS name, version, release, native architecture
  • CPU vendor, ID, clock speed, cache sizes, # cores
  • Memory size
  • Any metadata specified by the end user
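  User-specified metadata can be attached from the application itself. A minimal sketch, assuming TAU's TAU_METADATA name/value macro; the metadata names used here are illustrative, chosen to match the fields listed on the next slide:

    #include <stdio.h>
    #include <TAU.h>

    /* Sketch: attaching user-specified name:value metadata to a trial, so
     * it is stored alongside the automatically collected environment data. */
    void record_run_metadata(const char *recipe, int param_n) {
        char buf[32];
        snprintf(buf, sizeof(buf), "%d", param_n);
        TAU_METADATA("CHiLL recipe", recipe);   /* which recipe produced the variant */
        TAU_METADATA("param<N>", buf);          /* input-size parameter */
    }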

  21. Machine Learning
  • Given all these data stored in TAUdb …
  • OS name, OS release, CPU type, CPU MHz, CPU cores
  • param<N>, param<M>
  • CHiLL script
  • Metric<TIME>
  • … we can build a decision tree which selects the best-performing code variant given information available at run-time

  22. Decision Tree
  • PerfExplorer already has an interface to Weka
  • Use Weka to generate decision trees based upon the data in the performance database

  23. Wrapper Generation
  • Use a ROSE-based tool to generate a wrapper function
  • The wrapper carries out the decisions in the decision tree and executes the best code variant (see the sketch below)
  • The decision tree code generation tool takes a Weka-generated decision tree and a set of decision functions as input
  • If using custom metadata, the user needs to provide a custom decision function
  • Decision functions for metadata automatically collected by TAU are provided
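  A minimal sketch of what such a generated wrapper might look like, assuming two hypothetical specialized variants (matmult_small, matmult_general) and a decision tree learned over the matrix dimensions; everything here is illustrative, not the tool's actual output:

    /* Hypothetical specialized variants produced by CHiLL (names illustrative). */
    void matmult_small(float **c, float **a, float **b, int L, int M, int N);
    void matmult_general(float **c, float **a, float **b, int L, int M, int N);

    /* Sketch of a generated wrapper: encodes a Weka-style decision tree over
     * run-time information (here, the dimensions M and N) and dispatches to
     * the best-performing variant for that region of the input space. */
    void matmult(float **c, float **a, float **b, int L, int M, int N) {
        if (N <= 16 && M <= 16)            /* small-matrix region of the tree */
            matmult_small(c, a, b, L, M, N);
        else
            matmult_general(c, a, b, L, M, N);
    }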

  24. Example: Sparse MM Specialization with CHiLL
  • Previous study: CHiLL and Active Harmony were used to specialize matrix multiply in Nek5000 (a fluid dynamics solver) for small matrices based on input size
  • Limitations: the histogram of input sizes was generated manually, and the code to evaluate input data and select a specialized variant was written manually
  • We can automate these processes with parameterized profiling and machine learning over the collected data
  • Replicated the small-matrix specialization study using TAU and TAUdb

  25. Introduction to Orio
  • Orio is an annotation-based empirical performance tuning framework
  • Source code annotations allow Orio to generate a set of low-level performance optimizations
  • After each optimization (or transformation) is applied, the kernel is run
  • The set of optimizations is searched for the best transformations to apply to a given kernel
  • First effort to integrate Orio with TAU: collect performance data about each experiment that Orio runs
  • Move performance data from Orio into TAUdb
  • Orio could read from TAUdb in the future
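  To make the annotation idea concrete, a minimal sketch of Orio-style annotations embedded in C comments; the exact annotation grammar used here (Loop / Unroll / ufactor) is an assumption based on Orio's documented style, not taken from this talk:

    /* Sketch: Orio reads annotations embedded in C comments, generates code
     * variants (here, different unroll factors), runs each one empirically,
     * and keeps the best-performing variant. */
    void vecmult(int n, double *x, const double *a, const double *b) {
        /*@ begin Loop(transform Unroll(ufactor=4)) @*/
        for (int i = 0; i < n; i++)
            x[i] = a[i] * b[i];
        /*@ end @*/
    }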

  26. TAU's GPU Measurement Library
  • Focused on Orio's CUDA kernel transformations
  • TAU uses NVIDIA's CUPTI interface to gather information about GPU execution:
  • Memory transfers
  • Kernels (runtime performance, performance attributes)
  • GPU counters
  • Using the CUPTI interface does not require any recompiling or re-linking of the application
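  For context, a condensed sketch of the CUPTI Activity API pattern that measurement tools of this kind use; this is a generic CUPTI usage sketch, not TAU's internal code, and error handling is omitted:

    #include <stdio.h>
    #include <stdlib.h>
    #include <cupti.h>

    /* CUPTI asks the tool for buffers, fills them with activity records as
     * the application runs, and hands them back for processing. */
    static void CUPTIAPI buffer_requested(uint8_t **buffer, size_t *size,
                                          size_t *max_num_records) {
        *size = 1024 * 1024;
        *buffer = (uint8_t *)malloc(*size);
        *max_num_records = 0;   /* 0 = as many records as fit */
    }

    static void CUPTIAPI buffer_completed(CUcontext ctx, uint32_t stream_id,
                                          uint8_t *buffer, size_t size,
                                          size_t valid_size) {
        CUpti_Activity *record = NULL;
        /* Walk the records CUPTI collected: kernel launches, memcpys, ... */
        while (cuptiActivityGetNextRecord(buffer, valid_size, &record) == CUPTI_SUCCESS) {
            if (record->kind == CUPTI_ACTIVITY_KIND_KERNEL)
                printf("kernel record\n");
        }
        free(buffer);
    }

    void start_gpu_measurement(void) {
        cuptiActivityEnable(CUPTI_ACTIVITY_KIND_KERNEL);
        cuptiActivityEnable(CUPTI_ACTIVITY_KIND_MEMCPY);
        cuptiActivityRegisterCallbacks(buffer_requested, buffer_completed);
    }

  Because CUPTI attaches to the CUDA runtime, this kind of measurement can be injected at load time, which is why no recompiling or re-linking of the application is needed.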

  27. Orio and TAU Integration

  28. Orio Tuning of Vector Multiplication
  • Orio tuning of a simple 3D vector multiplication
  • 2,048 experiments fed into TAUdb
  • Used TAU PerfExplorer with Weka to do component analysis
  [Chart: correlation of tuning parameters — threads per block, # of blocks, unroll factor, preferred L1 size, CFLAG, warps per SM — with kernel execution time; some parameters show only small correlation with runtime, while others are better correlated]

  29. Number of Threads per Kernel
  • GPU occupancy (# warps) increases with larger # threads
  • Greater occupancy improves memory latency hiding, resulting in faster execution time
  [Plot: kernel execution time (µs) and GPU occupancy (warps) vs. number of threads]

  30. Conclusions
  • Autotuning IS a performance engineering process
  • It is complementary with performance engineering for empirically-based performance diagnosis and optimization
  • There are opportunities to integrate parallel application performance tools with autotuning methodologies
  • Performance experiment instrumentation and measurement
  • Performance data/metadata, analysis, data mining
  • Knowledge engineering is key (at all levels)
  • Performance + parallel computation + system
  • Represented in a form the tools can reason about
  • Bridge application performance characterization methodology with autotuning methodology
  • Integration will help to explain performance

  31. Future Work
  • DOE SUPER SciDAC project
  • Integration of TAU with autotuning frameworks: CHiLL, Active Harmony, Orio
  • Apply tools for end-to-end application performance
  • Build performance databases
  • Enable exploration and understanding of search spaces
  • Enable association of multi-dimensional data/metadata
  • Relate performance across compilers, platforms, …
  • Feed back semantic information to explain performance
  • Explore data mining and machine learning techniques
  • Discovery of relationships, factors, latent variables, …
  • Create performance models and feed back to autotuning
  • Learn optimal algorithm parameters for specific scenarios
  • Bridge between model-based and experiment-based approaches
  • Create a knowledge-based integrated performance system
