Performance Tools for Empirical Autotuning Allen D. Malony, Nick Chaimov, Kevin Huck, Scott Biersdorff, Sameer Shende {malony,nchaimov,khuck,scott,sameer}@cs.uoregon.edu University of Oregon
Outline • Motivation • Performance engineering and autotuning • Performance tool integration with autotuning process • TAU performance system overview • Performance database (TAUdb) • Framework for empirical-based performance tuning • Integration with CHiLL and Active Harmony • Integration with Orio (preliminary) • Conclusions and future directions
Parallel Performance Engineering • Scalable, optimized applications deliver on the promise of HPC • Optimization through a performance engineering process • Understand performance complexity and inefficiencies • Tune the application to run optimally on high-end machines • How can the process be made more effective and productive? • What is the nature of the performance problem solving? • What performance technology should be applied? • Performance tool efforts have focused on performance observation, analysis, and problem diagnosis • Application development and optimization productivity • Programmability, reusability, portability, robustness • Performance technology is part of a larger programming system
Parallel Performance Engineering Process • Traditionally an empirically-based approach: observation → experimentation → diagnosis → tuning • Performance Observation yields a characterization, Performance Experimentation establishes properties, Performance Diagnosis forms hypotheses, and Performance Tuning acts on them • Performance technology developed for each level: • Observation: instrumentation, measurement, analysis, visualization • Experimentation: experiment management, performance storage • Diagnosis: data mining, models, expert systems
“Extreme” (Progressive) Performance Engineering • Increased performance complexity forces the engineering process to be more intelligent and automated • Automated performance data analysis / mining / learning • Automated performance problem identification • Performance engineering tools and practice must incorporate a performance knowledge discovery process • Model-oriented knowledge • Computational semantics of the application • Symbolic models for algorithms • Performance models for system architectures / components • Application developers can be more directly involved in the performance engineering process
“Extreme” Performance Engineering • Empirical performance data evaluated with respect to performance expectations at various levels of abstraction
Autotuning is a Performance Engineering Process • Autotuning methodology incorporates aspects common to “traditional” application performance engineering • Empirical performance observation • Experiment-oriented • Autotuning embodies progressive engineering techniques • Automated experimentation and performance testing • Guided optimization by (intelligent) search space exploration • Model-based (domain-specific) computational semantics • Autotuning is a different style of performance engineering from the performance-diagnosis approach • There are shared objectives for performance technology and opportunities for tool integration
TAU Performance System® (http://tau.uoregon.edu) • Parallel performance framework and toolkit • Supports all HPC platforms, compilers, and runtime systems • Provides portable instrumentation, measurement, analysis
TAU Components • Instrumentation • Fortran, C, C++, UPC, Chapel, Python, Java • Source, compiler, library wrapping, binary rewriting • Automatic instrumentation • Measurement • MPI, OpenSHMEM, ARMCI, PGAS • Pthreads, OpenMP, other thread models • GPU, CUDA, OpenCL, OpenACC • Performance data (timing, counters) and metadata • Parallel profiling and tracing • Analysis • Performance database technology (TAUdb, formerly PerfDMF) • Parallel profile analysis (ParaProf) • Performance data mining / machine learning (PerfExplorer)
TAU Performance Database – TAUdb • Started in 2004 (Huck et al., ICPP 2005) • Performance Data Management Framework (PerfDMF) • Database schema and Java API • Profile parsing • Database queries • Conversion utilities (parallel profiles from other tools) • Provides DB support for TAU profile analysis tools • ParaProf, PerfExplorer, Eclipse PTP • Used as TAU's regression testing database and as a performance regression database • Ported to several DBMSs • PostgreSQL, MySQL, H2, Derby, Oracle, DB2
TAUdb Database Schema • Parallel performance profiles • Timer and counter measurements with 5 dimensions • Physical location: process / thread • Static code location: function / loop / block / line • Dynamic location: current callpath and context (parameters) • Time context: iteration / snapshot / phase • Metric: time, HW counters, derived values • Measurement metadata • Properties of the experiment • Anything from name:value pairs to nested, structured data • Single value for the whole experiment or for a full context (tuple of thread, timer, iteration, timestamp)
TAUdb Programming APIs • Java • Original API • Basis for in-house analysis tool support • Command line tools for batch loading into the database • Parses 15+ profile formats • TAU, gprof, Cube, HPCT, mpiP, DynaProf, PerfSuite, … • Supports Java embedded databases (H2, Derby) • C programming interface under development • PostgreSQL support first, others as requested • Query prototype developed • Full-featured API planned: query, insert, and update • Evaluating SQLite support
TAUdb Tool Support • ParaProf • Parallel profile viewer / analyzer • 2D and 3D+ visualizations • Single experiment analysis • PerfExplorer • Data mining framework • Clustering, correlation • Multi-experiment analysis • Scripting engine • Expert system
TAU Integration with CHiLL and Active Harmony • Major goals: • Integrate TAU with existing autotuning frameworks • Use TAU to gather performance data for autotuning/specialization • Store performance data tagged with metadata about the execution environment and input in a centralized database • Use machine learning and data mining techniques to increase the level of automation of autotuning and specialization • Using TAU in two ways: • Using multi-parameter-based profiling support to generate separate profiles based on function parameters (or outlined code) • Using TAU metrics stored in PerfDMF/TAUdb as performance measures in optimization
Components • ROSE Outliner • ROSE is a compiler with built-in support for source-to-source transformations • ROSE outliner, given a reference to an AST node, extracts the AST node into its own function or file • CHiLL • provides a domain specific language for specifying transformations on loops • Active Harmony • Searches space of parameters to transformation recipes • TAU • Performance instrumentation and measurement
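To make the outlining step concrete, here is a minimal sketch of a hot loop before and after outlining (the generated function name is illustrative; ROSE emits its own identifiers such as OUT__1__<id>__):

/* Before outlining: the hot loop sits inline in the caller. */
void compute(int n, float *a, float *b, float *c)
{
    int i;
    for (i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

/* After outlining (sketch): the loop lives in its own function,
 * which CHiLL can transform and Active Harmony can tune in
 * isolation from the rest of the application. */
static void OUT__1__(int n, float *a, float *b, float *c)
{
    int i;
    for (i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

void compute_outlined(int n, float *a, float *b, float *c)
{
    OUT__1__(n, a, b, c);
}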
Multi-Parameter Profiling • Added multi-parameter-based profiling in TAU to support specialization • The user can select which parameters are of interest using a selective instrumentation file • Consider a matrix multiply function void matmult(float **c, float **a, float **b, int L, int M, int N) • We can generate separate profiles for the matrix dimensions encountered during execution by parameterizing on L, M, and N, as shown below
Using Parameterized Profiling in TAU • Selective instrumentation file:

BEGIN_INCLUDE_LIST
matmult
END_INCLUDE_LIST

BEGIN_INSTRUMENT_SECTION
loops file="foo.c" routine="matrix#"
param file="foo.c" routine="matmult" param="L" param="M" param="N"
END_INSTRUMENT_SECTION

• Resulting parameterized profile entries:

int matmult(float **, float **, float **, int, int, int) <L=100, M=8, N=8> C
int matmult(float **, float **, float **, int, int, int) <L=10, M=100, N=8> C
int matmult(float **, float **, float **, int, int, int) <L=10, M=8, N=8> C
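For comparison, a minimal sketch of instrumenting the routine by hand with TAU's C measurement API (TAU_PROFILE_TIMER/START/STOP and TAU_PROFILE_PARAM1L are documented TAU macros; the loop body is illustrative):

#include <TAU.h>

void matmult(float **c, float **a, float **b, int L, int M, int N)
{
    int i, j, k;
    TAU_PROFILE_TIMER(t, "matmult", "", TAU_USER);
    TAU_PROFILE_START(t);
    /* Each distinct (L, M, N) tuple produces its own profile entry,
     * as in the listing above. */
    TAU_PROFILE_PARAM1L(L, "L");
    TAU_PROFILE_PARAM1L(M, "M");
    TAU_PROFILE_PARAM1L(N, "N");

    for (i = 0; i < L; i++)
        for (j = 0; j < N; j++) {
            c[i][j] = 0.0f;
            for (k = 0; k < M; k++)
                c[i][j] += a[i][k] * b[k][j];
        }

    TAU_PROFILE_STOP(t);
}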
Autotuning with TAUdb Methodology • Each time the program executes a code variant, we store metadata in the performance database indicating how the variant was produced: • Source function • Name of the CHiLL recipe • Parameters to the CHiLL recipe • The database also contains metadata on the parameters with which the variant was invoked and on the execution environment: • OS name, version, release, native architecture • CPU vendor, ID, clock speed, cache sizes, # cores • Memory size • Any metadata specified by the end user
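As a sketch of the last point, an application can attach its own metadata with TAU's TAU_METADATA call (a real TAU macro taking name/value strings; the field names below are illustrative, not a fixed schema):

#include <stdio.h>
#include <TAU.h>

/* Tag the running code variant with the recipe and parameters that
 * produced it, so TAUdb can associate measurements with them. */
void tag_variant(const char *recipe, int tile, int unroll)
{
    char value[64];
    TAU_METADATA("CHiLL recipe", recipe);
    snprintf(value, sizeof(value), "%d", tile);
    TAU_METADATA("tile size", value);
    snprintf(value, sizeof(value), "%d", unroll);
    TAU_METADATA("unroll factor", value);
}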
Machine Learning • Given all these data stored in TAUdb… • OS name, OS release, CPU type, CPU MHz, CPU cores • param<N>, param<M> • Chillscript • Metric<TIME> • …we can build a decision tree that selects the best-performing code variant given information available at run time
Decision Tree • PerfExplorer already has an interface to Weka • Use Weka to generate decision trees based upon the data in the performance database
Wrapper Generation • Use a ROSE-based tool to generate a wrapper function that carries out the decisions in the decision tree and executes the best code variant (see the sketch below) • The decision tree code generation tool takes a Weka-generated decision tree and a set of decision functions as input • If using custom metadata, the user needs to provide a custom decision function • Decision functions for metadata automatically collected by TAU are provided
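A hedged sketch of what such a generated wrapper might look like (the variant names and thresholds are hypothetical; a real wrapper is emitted from the Weka-produced tree):

/* Hypothetical specialized variants produced by CHiLL. */
void matmult_small(float **c, float **a, float **b, int L, int M, int N);
void matmult_generic(float **c, float **a, float **b, int L, int M, int N);

/* Wrapper encoding a (made-up) decision tree: dispatch on the
 * run-time parameters the tree found most predictive. */
void matmult(float **c, float **a, float **b, int L, int M, int N)
{
    if (M <= 16 && N <= 16)              /* small-matrix branch */
        matmult_small(c, a, b, L, M, N);
    else
        matmult_generic(c, a, b, L, M, N);
}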
Example: Sparse MM Specialization with CHiLL • Previous study: CHiLL and Active Harmony were used to specialize matrix multiply in Nek5000 (a fluid dynamics solver) for small matrices based on input size • Limitations: the histogram of input sizes was generated manually, and the code to evaluate input data and select the specialized variant was written manually • We can automate these processes with parameterized profiling and machine learning over the collected data • We replicated the small-matrix specialization study using TAU and TAUdb
Introduction to Orio • Orio is an annotation-based empirical performance tuning framework • Source code annotations allow Orio to generate a set of low-level performance optimizations • After each optimization (or transformation) is applied, the kernel is run • The set of optimizations is searched for the best transformations to apply to a given kernel • The first effort to integrate Orio with TAU was to collect performance data about each experiment that Orio runs • Move performance data from Orio into TAUdb • Orio could read from TAUdb in the future
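For flavor, a sketch in the spirit of Orio's annotation syntax (the Loop module and Unroll transform appear in published Orio examples, but directive details may differ across versions); here the unroll factor UF would be bound by a tuning specification that defines its search space:

void vecmul(int n, double *x, double *y, double *z)
{
    int i;
    /*@ begin Loop ( transform Unroll(ufactor=UF) ) @*/
    for (i = 0; i <= n - 1; i++)
        z[i] = x[i] * y[i];
    /*@ end @*/
}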
TAU's GPU Measurement Library • Focused on Orio’s CUDA kernel transformations • TAU uses NVIDIA's CUPTI interface to gather information about the GPU execution • Memory transfers • Kernels (runtime performance, performance attributes) • GPU counters • Using the CUPTI interface does not require any recompiling or re-linking of the application
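A condensed sketch of the CUPTI Activity API pattern this builds on (real CUPTI calls; buffer handling trimmed and error checking elided):

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <cupti.h>

#define BUF_SIZE (32 * 1024)

static void CUPTIAPI bufferRequested(uint8_t **buffer, size_t *size,
                                     size_t *maxNumRecords)
{
    *buffer = (uint8_t *)malloc(BUF_SIZE);  /* CUPTI fills this buffer */
    *size = BUF_SIZE;
    *maxNumRecords = 0;                     /* no record-count limit */
}

static void CUPTIAPI bufferCompleted(CUcontext ctx, uint32_t streamId,
                                     uint8_t *buffer, size_t size,
                                     size_t validSize)
{
    CUpti_Activity *record = NULL;
    /* Drain the buffer; a tool like TAU extracts timestamps here. */
    while (cuptiActivityGetNextRecord(buffer, validSize, &record) ==
           CUPTI_SUCCESS) {
        if (record->kind == CUPTI_ACTIVITY_KIND_KERNEL)
            printf("kernel activity record\n");
    }
    free(buffer);
}

int main(void)
{
    cuptiActivityRegisterCallbacks(bufferRequested, bufferCompleted);
    cuptiActivityEnable(CUPTI_ACTIVITY_KIND_KERNEL);  /* kernel timings */
    cuptiActivityEnable(CUPTI_ACTIVITY_KIND_MEMCPY);  /* memory transfers */
    /* ... CUDA work would run here; records arrive asynchronously ... */
    cuptiActivityFlushAll(0);
    return 0;
}

Because CUPTI attaches through run-time callbacks like these, a preloaded measurement library can collect the records without recompiling or re-linking the application, which is what the slide above refers to.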
Orio Tuning of Vector Multiplication • Orio tuning of a simple 3D vector multiplication • 2,048 experiments fed into TAUdb • Use TAU PerfExplorer with Weka to do component analysis • [Figure: correlation of tuning parameters (threads per block, number of blocks, unroll factor, preferred L1 size, CFLAG, warps per SM) with kernel execution time; some parameters show only small correlation with runtime, while others are better correlated]
Number of Threads per Kernel • GPU occupancy (# warps) increases with a larger number of threads • Greater occupancy improves memory latency hiding, resulting in faster execution times • [Figure: kernel execution time (µs) and GPU occupancy (warps) vs. number of threads]
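A rough worked example (device limits are illustrative): a CUDA warp is 32 threads, so a 256-thread block contributes 8 warps; if an SM hosts 4 such blocks concurrently against a hypothetical 48-warp limit, occupancy is 32/48 ≈ 67%. Raising the thread count toward the limit gives the scheduler more warps to swap in while others wait on memory, which is the latency-hiding effect measured above.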
Conclusions • Autotuning IS a performance engineering process • It is complementary to performance engineering for empirical-based performance diagnosis and optimization • There are opportunities to integrate application parallel performance tools with autotuning methodologies • Performance experiment instrumentation and measurement • Performance data/metadata, analysis, data mining • Knowledge engineering is key (at all levels) • Performance + parallel computation + system • Represented in a form the tools can reason about • Bridge application performance characterization methodology with autotuning methodology • Integration will help to explain performance
Future Work • DOE SUPER SciDAC project • Integration of TAU with autotuning frameworks • CHiLL, Active Harmony, Orio • Apply tools for end-to-end application performance • Build performance databases • Enable exploration and understanding of search spaces • Enable association of multi-dimensional data/metadata • Relate performance across compilers, platforms, … • Feed back semantic information to explain performance • Explore data mining and machine learning techniques • Discovery of relationships, factors, latent variables, … • Create performance models and feed them back to autotuning • Learn optimal algorithm parameters for specific scenarios • Bridge between model-based and experiment-based approaches • Create a knowledge-based integrated performance system