
Performance Tools for Empirical Autotuning

Presentation Transcript


  1. Performance Tools for Empirical Autotuning
  Allen D. Malony, Nick Chaimov, Kevin Huck, Scott Biersdorff, Sameer Shende
  {malony,nchaimov,khuck,scott,sameer}@cs.uoregon.edu
  University of Oregon

  2. Outline
  • Motivation
  • Performance engineering and autotuning
  • Performance tool integration with the autotuning process
  • TAU performance system overview
  • Performance database (TAUdb)
  • Framework for empirical-based performance tuning
  • Integration with CHiLL and Active Harmony
  • Integration with Orio (preliminary)
  • Conclusions and future directions

  3. Parallel Performance Engineering
  • Scalable, optimized applications deliver on the promise of HPC
  • Optimization through a performance engineering process
  • Understand performance complexity and inefficiencies
  • Tune the application to run optimally on high-end machines
  • How to make the process more effective and productive?
  • What is the nature of performance problem solving?
  • What performance technology should be applied?
  • Performance tool efforts have focused on performance observation, analysis, and problem diagnosis
  • Application development and optimization productivity
  • Programmability, reusability, portability, robustness
  • Performance technology is part of a larger programming system

  4. Parallel Performance Engineering Process
  • Traditionally an empirically-based approach: Performance Observation → (characterization) → Performance Experimentation → (properties) → Performance Diagnosis → (hypotheses) → Performance Tuning
  • Performance technology developed for each level:
  • Observation: instrumentation, measurement, analysis, visualization
  • Experimentation: experiment management, performance storage
  • Diagnosis: data mining, models, expert systems

  5. Parallel Performance Diagnosis

  6. “Extreme” (Progressive) Performance Engineering
  • Increased performance complexity forces the engineering process to be more intelligent and automated
  • Automate performance data analysis / mining / learning
  • Automated performance problem identification
  • Performance engineering tools and practice must incorporate a performance knowledge discovery process
  • Model-oriented knowledge
  • Computational semantics of the application
  • Symbolic models for algorithms
  • Performance models for system architectures / components
  • Application developers can be more directly involved in the performance engineering process

  7. “Extreme” Performance Engineering
  • Empirical performance data evaluated with respect to performance expectations at various levels of abstraction

  8. Autotuning is a Performance Engineering Process
  • Autotuning methodology incorporates aspects common to “traditional” application performance engineering
  • Empirical performance observation
  • Experiment-oriented
  • Autotuning embodies progressive engineering techniques
  • Automated experimentation and performance testing
  • Optimization guided by (intelligent) search space exploration
  • Model-based (domain-specific) computational semantics
  • Autotuning is a different style of performance engineering from diagnosis-oriented approaches
  • There are shared objectives for performance technology and opportunities for tool integration

  9. TAU Performance System® (http://tau.uoregon.edu)
  • Parallel performance framework and toolkit
  • Supports all HPC platforms, compilers, and runtime systems
  • Provides portable instrumentation, measurement, and analysis

  10. TAU Components
  • Instrumentation
  • Fortran, C, C++, UPC, Chapel, Python, Java
  • Source, compiler, library wrapping, binary rewriting
  • Automatic instrumentation
  • Measurement
  • MPI, OpenSHMEM, ARMCI, PGAS
  • Pthreads, OpenMP, other thread models
  • GPU: CUDA, OpenCL, OpenACC
  • Performance data (timing, counters) and metadata
  • Parallel profiling and tracing
  • Analysis
  • Performance database technology (TAUdb, formerly PerfDMF)
  • Parallel profile analysis (ParaProf)
  • Performance data mining / machine learning (PerfExplorer)
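  To make the instrumentation component concrete, here is a minimal sketch of manual source instrumentation with TAU's C API (TAU_PROFILE, TAU_PROFILE_SET_NODE); the routine and computation are placeholders, and the program would be built with a TAU compiler wrapper such as tau_cc.sh so the measurement library is linked in:

    #include <TAU.h>

    /* Sketch: manual TAU instrumentation. A profile per process/thread
     * is written when the program exits. */
    int work(int n) {
        TAU_PROFILE("work", "int (int)", TAU_USER);
        int sum = 0;
        for (int i = 0; i < n; i++) sum += i;   /* placeholder computation */
        return sum;
    }

    int main(int argc, char **argv) {
        TAU_PROFILE("main", "int (int, char **)", TAU_DEFAULT);
        TAU_PROFILE_SET_NODE(0);   /* single-process run */
        return work(1000) > 0 ? 0 : 1;
    }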

  11. TAU Performance Database – TAUdb
  • Started in 2004 (Huck et al., ICPP 2005) as the Performance Data Management Framework (PerfDMF)
  • Database schema and Java API: profile parsing, database queries, conversion utilities (parallel profiles from other tools)
  • Provides DB support for TAU profile analysis tools: ParaProf, PerfExplorer, Eclipse PTP
  • Used as the regression testing database for TAU
  • Used as a performance regression database
  • Ported to several DBMSs: PostgreSQL, MySQL, H2, Derby, Oracle, DB2

  12. TAUdb Database Schema
  • Parallel performance profiles
  • Timer and counter measurements with 5 dimensions:
  • Physical location: process / thread
  • Static code location: function / loop / block / line
  • Dynamic location: current callpath and context (parameters)
  • Time context: iteration / snapshot / phase
  • Metric: time, HW counters, derived values
  • Measurement metadata
  • Properties of the experiment
  • Anything from name:value pairs to nested, structured data
  • Single value for whole experiment or full context (tuple of thread, timer, iteration, timestamp)
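  As a schematic illustration only (not the actual TAUdb schema), a measurement record keyed by the five dimensions above could be modeled in C like this; all field names are assumptions for the sketch:

    /* Schematic sketch of a TAUdb-style measurement record keyed by the
     * five dimensions described above. Field names are illustrative. */
    typedef struct {
        int process, thread;     /* physical location */
        const char *timer;       /* static code location: function/loop/block/line */
        const char *callpath;    /* dynamic location: callpath + context parameters */
        int iteration;           /* time context: iteration/snapshot/phase */
        const char *metric;      /* "TIME", a HW counter name, a derived value, ... */
        double value;            /* the measurement itself */
    } measurement;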

  13. TAUdb Programming APIs
  • Java
  • Original API; basis for in-house analysis tool support
  • Command line tools for batch loading into the database
  • Parses 15+ profile formats: TAU, gprof, Cube, HPCT, mpiP, DynaProf, PerfSuite, …
  • Supports Java embedded databases (H2, Derby)
  • C programming interface under development
  • PostgreSQL support first, others as requested
  • Query prototype developed
  • Plan full-featured API: query, insert, and update
  • Evaluating SQLite support
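  Since the C interface was still under development at this point, the following is a purely hypothetical usage sketch: the function and type names (taudb_connect_config, taudb_query_trials, TAUDB_TRIAL, and so on) are assumptions about what such an API could look like, not a confirmed interface.

    #include <stdio.h>
    #include "taudb_api.h"   /* assumption: TAUdb C API header name */

    int main(void) {
        /* Connect using a named configuration (host, database, credentials). */
        TAUDB_CONNECTION *conn = taudb_connect_config("demo");   /* hypothetical */
        taudb_check_connection(conn);                            /* hypothetical */

        /* Query trials matching a filter; here, a single trial by id. */
        TAUDB_TRIAL *filter = taudb_create_trials(1);            /* hypothetical */
        filter->id = 42;
        TAUDB_TRIAL *trials = taudb_query_trials(conn, 0, filter);
        printf("Trial name: %s\n", trials[0].name);

        taudb_disconnect(conn);
        return 0;
    }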

  14. TAUdb Tool Support
  • ParaProf
  • Parallel profile viewer / analyzer
  • 2D and 3D visualizations
  • Single-experiment analysis
  • PerfExplorer
  • Data mining framework: clustering, correlation
  • Multi-experiment analysis
  • Scripting engine
  • Expert system

  15. TAU Integration with CHiLL and Active Harmony
  • Major goals:
  • Integrate TAU with existing autotuning frameworks
  • Use TAU to gather performance data for autotuning/specialization
  • Store performance data tagged with metadata about the execution environment and input in a centralized database
  • Use machine learning and data mining techniques to increase the level of automation of autotuning and specialization
  • Using TAU in two ways:
  • Using multi-parameter-based profiling support to generate separate profiles based on function parameters (or outlined code)
  • Using TAU metrics stored in PerfDMF/TAUdb as performance measures in optimization

  16. Components
  • ROSE outliner
  • ROSE is a compiler with built-in support for source-to-source transformations
  • The ROSE outliner, given a reference to an AST node, extracts that node into its own function or file
  • CHiLL
  • Provides a domain-specific language for specifying transformations on loops
  • Active Harmony
  • Searches the space of parameters to transformation recipes
  • TAU
  • Performance instrumentation and measurement

  17. Multi-Parameter Profiling
  • Added multi-parameter-based profiling in TAU to support specialization
  • User can select which parameters are of interest using a selective instrumentation file
  • Consider a matrix multiply function: for void matmult(float **c, float **a, float **b, int L, int M, int N), we can generate profiles based on the matrix dimensions encountered during execution by parameterizing on L, M, and N (see the sketch below)
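  Alongside the selective-instrumentation route shown on the next slide, TAU also offers parameter macros for manual instrumentation. A minimal sketch, assuming TAU's TAU_PROFILE and TAU_PROFILE_PARAM1L macros; the loop body is just a naive matrix multiply for illustration:

    #include <TAU.h>

    /* Sketch: manually parameterized profiling of matmult. Each distinct
     * (L, M, N) combination gets its own profile entry, mirroring what the
     * selective instrumentation file does automatically. */
    void matmult(float **c, float **a, float **b, int L, int M, int N) {
        TAU_PROFILE("matmult", "void (float**, float**, float**, int, int, int)", TAU_USER);
        TAU_PROFILE_PARAM1L(L, "L");
        TAU_PROFILE_PARAM1L(M, "M");
        TAU_PROFILE_PARAM1L(N, "N");
        for (int i = 0; i < L; i++)
            for (int j = 0; j < N; j++) {
                c[i][j] = 0.0f;
                for (int k = 0; k < M; k++)
                    c[i][j] += a[i][k] * b[k][j];
            }
    }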

  18. Using Parameterized Profiling in TAU
  Parameterized profile entries:
    int matmult(float **, float **, float **, int, int, int) <L=100, M=8, N=8> C
    int matmult(float **, float **, float **, int, int, int) <L=10, M=100, N=8> C
    int matmult(float **, float **, float **, int, int, int) <L=10, M=8, N=8> C
  Selective instrumentation file:
    BEGIN_INCLUDE_LIST
    matmult
    END_INCLUDE_LIST
    BEGIN_INSTRUMENT_SECTION
    loops file="foo.c" routine="matrix#"
    param file="foo.c" routine="matmult" param="L" param="M" param="N"
    END_INSTRUMENT_SECTION

  19. Parameterized Profiling / Autotuning with TAUdb

  20. Autotuning with TAUdb Methodology
  • Each time the program executes a code variant, we store metadata in the performance database indicating how the variant was produced:
  • Source function
  • Name of CHiLL recipe
  • Parameters to CHiLL recipe
  • The database also contains metadata on the parameters with which the variant was called and on the execution environment:
  • OS name, version, release, native architecture
  • CPU vendor, ID, clock speed, cache sizes, # cores
  • Memory size
  • Any metadata specified by the end user
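  User-specified metadata can be attached from the application itself. A minimal sketch, assuming TAU's TAU_METADATA name/value macro; the metadata names used here are illustrative, chosen to match the fields listed on the next slide:

    #include <stdio.h>
    #include <TAU.h>

    /* Sketch: attaching user-specified name:value metadata to a trial, so
     * it is stored alongside the automatically collected environment data. */
    void record_run_metadata(const char *recipe, int param_n) {
        char buf[32];
        snprintf(buf, sizeof(buf), "%d", param_n);
        TAU_METADATA("CHiLL recipe", recipe);   /* which recipe produced the variant */
        TAU_METADATA("param<N>", buf);          /* input-size parameter */
    }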

  21. Machine Learning
  • Given all these data stored in TAUdb …
  • OS name, OS release, CPU type, CPU MHz, CPU cores
  • param<N>, param<M>
  • CHiLL script
  • Metric<TIME>
  • … we can build a decision tree which selects the best-performing code variant given information available at run-time

  22. Decision Tree
  • PerfExplorer already has an interface to Weka
  • Use Weka to generate decision trees based upon the data in the performance database

  23. Wrapper Generation
  • Use a ROSE-based tool to generate a wrapper function
  • The wrapper carries out the decisions in the decision tree and executes the best code variant (see the sketch below)
  • The decision tree code generation tool takes a Weka-generated decision tree and a set of decision functions as input
  • If using custom metadata, the user needs to provide a custom decision function
  • Decision functions for metadata automatically collected by TAU are provided
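  A minimal sketch of what such a generated wrapper might look like, assuming two hypothetical specialized variants (matmult_small, matmult_general) and a decision tree learned over the matrix dimensions; everything here is illustrative, not the tool's actual output:

    /* Hypothetical specialized variants produced by CHiLL (names illustrative). */
    void matmult_small(float **c, float **a, float **b, int L, int M, int N);
    void matmult_general(float **c, float **a, float **b, int L, int M, int N);

    /* Sketch of a generated wrapper: encodes a Weka-style decision tree over
     * run-time information (here, the dimensions M and N) and dispatches to
     * the best-performing variant for that region of the input space. */
    void matmult(float **c, float **a, float **b, int L, int M, int N) {
        if (N <= 16 && M <= 16)            /* small-matrix region of the tree */
            matmult_small(c, a, b, L, M, N);
        else
            matmult_general(c, a, b, L, M, N);
    }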

  24. Example: Sparse MM Specialization with CHiLL
  • Previous study: CHiLL and Active Harmony were used to specialize matrix multiply in Nek5000 (a fluid dynamics solver) for small matrices based on input size
  • Limitations: the histogram of input sizes was generated manually, and the code to evaluate input data and select a specialized variant was written manually
  • We can automate these processes with parameterized profiling and machine learning over the collected data
  • Replicated the small-matrix specialization study using TAU and TAUdb

  25. Introduction to Orio
  • Orio is an annotation-based empirical performance tuning framework
  • Source code annotations allow Orio to generate a set of low-level performance optimizations
  • After each optimization (or transformation) is applied, the kernel is run
  • The set of optimizations is searched for the best transformations to apply to a given kernel
  • First effort to integrate Orio with TAU: collect performance data about each experiment that Orio runs
  • Move performance data from Orio into TAUdb
  • Orio could read from TAUdb in the future
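  To make the annotation idea concrete, a minimal sketch of Orio-style annotations embedded in C comments; the exact annotation grammar used here (Loop / Unroll / ufactor) is an assumption based on Orio's documented style, not taken from this talk:

    /* Sketch: Orio reads annotations embedded in C comments, generates code
     * variants (here, different unroll factors), runs each one empirically,
     * and keeps the best-performing variant. */
    void vecmult(int n, double *x, const double *a, const double *b) {
        /*@ begin Loop(transform Unroll(ufactor=4)) @*/
        for (int i = 0; i < n; i++)
            x[i] = a[i] * b[i];
        /*@ end @*/
    }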

  26. TAU's GPU Measurement Library
  • Focused on Orio's CUDA kernel transformations
  • TAU uses NVIDIA's CUPTI interface to gather information about GPU execution:
  • Memory transfers
  • Kernels (runtime performance, performance attributes)
  • GPU counters
  • Using the CUPTI interface does not require any recompiling or re-linking of the application
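  For context, a condensed sketch of the CUPTI Activity API pattern that measurement tools of this kind use; this is a generic CUPTI usage sketch, not TAU's internal code, and error handling is omitted:

    #include <stdio.h>
    #include <stdlib.h>
    #include <cupti.h>

    /* CUPTI asks the tool for buffers, fills them with activity records as
     * the application runs, and hands them back for processing. */
    static void CUPTIAPI buffer_requested(uint8_t **buffer, size_t *size,
                                          size_t *max_num_records) {
        *size = 1024 * 1024;
        *buffer = (uint8_t *)malloc(*size);
        *max_num_records = 0;   /* 0 = as many records as fit */
    }

    static void CUPTIAPI buffer_completed(CUcontext ctx, uint32_t stream_id,
                                          uint8_t *buffer, size_t size,
                                          size_t valid_size) {
        CUpti_Activity *record = NULL;
        /* Walk the records CUPTI collected: kernel launches, memcpys, ... */
        while (cuptiActivityGetNextRecord(buffer, valid_size, &record) == CUPTI_SUCCESS) {
            if (record->kind == CUPTI_ACTIVITY_KIND_KERNEL)
                printf("kernel record\n");
        }
        free(buffer);
    }

    void start_gpu_measurement(void) {
        cuptiActivityEnable(CUPTI_ACTIVITY_KIND_KERNEL);
        cuptiActivityEnable(CUPTI_ACTIVITY_KIND_MEMCPY);
        cuptiActivityRegisterCallbacks(buffer_requested, buffer_completed);
    }

  Because CUPTI attaches to the CUDA runtime, this kind of measurement can be injected at load time, which is why no recompiling or re-linking of the application is needed.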

  27. Orio and TAU Integration

  28. Orio Tuning of Vector Multiplication
  • Orio tuning of a simple 3D vector multiplication
  • 2,048 experiments fed into TAUdb
  • Used TAU PerfExplorer with Weka to do component analysis
  [Chart: correlation of tuning parameters — threads per block, # of blocks, unroll factor, preferred L1 size, CFLAG, warps per SM — with kernel execution time; some parameters show only small correlation with runtime, while others are better correlated]

  29. Number of Threads per Kernel
  • GPU occupancy (# warps) increases with larger # threads
  • Greater occupancy improves memory latency hiding, resulting in faster execution time
  [Plot: kernel execution time (µs) and GPU occupancy (warps) vs. number of threads]

  30. Conclusions
  • Autotuning IS a performance engineering process
  • It is complementary with performance engineering for empirically-based performance diagnosis and optimization
  • There are opportunities to integrate parallel application performance tools with autotuning methodologies
  • Performance experiment instrumentation and measurement
  • Performance data/metadata, analysis, data mining
  • Knowledge engineering is key (at all levels)
  • Performance + parallel computation + system
  • Represented in a form the tools can reason about
  • Bridge application performance characterization methodology with autotuning methodology
  • Integration will help to explain performance

  31. Future Work
  • DOE SUPER SciDAC project
  • Integration of TAU with autotuning frameworks: CHiLL, Active Harmony, Orio
  • Apply tools for end-to-end application performance
  • Build performance databases
  • Enable exploration and understanding of search spaces
  • Enable association of multi-dimensional data/metadata
  • Relate performance across compilers, platforms, …
  • Feed back semantic information to explain performance
  • Explore data mining and machine learning techniques
  • Discovery of relationships, factors, latent variables, …
  • Create performance models and feed back to autotuning
  • Learn optimal algorithm parameters for specific scenarios
  • Bridge between model-based and experiment-based approaches
  • Create a knowledge-based integrated performance system
