
TAU Parallel Performance System

TAU Parallel Performance System. Allen D. Malony, Sameer S. Shende, Alan Morris, Robert Bell, Kevin Huck, Nick Trebon, Suravee Suthikulpanit, Kai Li, Li Li {malony,sameer,amorris,khuck,ntrebon,suravee}@cs.uoregon.edu Department of Computer and Information Science



  1. TAU Parallel Performance System Allen D. Malony, Sameer S. Shende, Alan Morris, Robert Bell, Kevin Huck, Nick Trebon, Suravee Suthikulpanit, Kai Li, Li Li {malony,sameer,amorris,khuck,ntrebon,suravee}@cs.uoregon.edu Department of Computer and Information Science Performance Research Laboratory University of Oregon

  2. Outline • Motivation • TAU architecture and toolkit • Instrumentation • Measurement • Analysis • Example DOE ASC applications • TAU status • Conclusion

  3. Problem Domain • ASC defines leading edge parallel systems and software • Large-scale systems and heterogeneous platforms • Multi-model, multi-module simulation • Complex, multi-layered software integration • Multi-language programming • Mixed-model parallelism and hybrid parallelism • Complexity challenges performance analysis tools • System diversity requires portable tools • Need for cross-language support • Support different parallel computation models • Operate at scale

  4. Research Motivation • Tools for parallel performance problem solving • Empirical-based performance optimization process • Performance technology concerns [Diagram labels: Performance Tuning, Performance Diagnosis, Performance Experimentation, Performance Observation; hypotheses, properties, characterization; Performance Technology: experiment management, performance database; Performance Technology: instrumentation, measurement, analysis, visualization]

  5. TAU Performance System • Tuning and Analysis Utilities (12+ year project effort) • Performance system framework for HPC systems • Integrated, scalable, flexible, and parallel • Targets a general complex system computation model • Entities: nodes / contexts / threads • Multi-level: system / software / parallelism • Measurement and analysis abstraction • Integrated toolkit for performance instrumentation, measurement, analysis, and visualization • Portable performance profiling and tracing facility • Open software approach with technology integration • University of Oregon, Research Center Jülich, LANL

  6. TAU Performance System Objectives • Multi-level performance instrumentation • Multi-language automatic source instrumentation • Flexible and configurable performance measurement • Widely-ported parallel performance profiling system • Computer system architectures and operating systems • Different programming languages and compilers • Support for multiple parallel programming paradigms • Multi-threading, message passing, mixed-mode, hybrid • Enable performance mapping across semantic levels • Support for object-oriented and generic programming • Integration in complex software systems and applications

  7. General Complex System Computation Model • Node: physically distinct shared memory machine • Message passing node interconnection network • Context: distinct virtual memory space within node • Thread: execution threads (user/system) in context [Figure labels: physical view (SMP nodes, node memory, interconnection network, inter-node message communication); model view (node, context as VM space, threads)]
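
The node / context / thread model on this slide can be read as a three-part key under which performance data is stored. The following is a minimal sketch of that idea, not TAU's data structures; the names `ThreadId` and `ProfileStore` are hypothetical and exist only for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ThreadId:
    """Hypothetical key: which node, which context (address space), which thread."""
    node: int
    context: int
    thread: int


class ProfileStore:
    """Accumulates time per (thread, event) and aggregates across the hierarchy."""

    def __init__(self):
        self._time = defaultdict(float)

    def add(self, where, event, seconds):
        self._time[(where, event)] += seconds

    def total(self, event, node=None):
        # Sum over all threads, or only over the threads of one node.
        return sum(
            t for (where, name), t in self._time.items()
            if name == event and (node is None or where.node == node)
        )


store = ProfileStore()
store.add(ThreadId(0, 0, 0), "solve", 1.5)
store.add(ThreadId(0, 0, 1), "solve", 2.0)
store.add(ThreadId(1, 0, 0), "solve", 0.5)
print(store.total("solve"), store.total("solve", node=0))
```

Keeping the key explicit makes it straightforward to aggregate per thread, per context, or per node without changing how events are recorded.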

  8. TAU Performance System Architecture

  9. TAU Performance System Architecture

  10. TAU Instrumentation Approach • Support for standard program events • Routines and statement-level blocks • Classes and templates • Based on begin/end (paired) event semantics • Support for user-defined events • “User-defined timers” (begin/end semantics) • Atomic events (standard and user-defined) • Selection of event statistics • Support definition of “semantic” entities for mapping • Support for event groups • Instrumentation optimization
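
The two event kinds named on this slide differ in what they record: a paired (begin/end) event measures an interval, while an atomic event records a single value each time it is triggered and keeps summary statistics. The sketch below illustrates that distinction under assumed names (`Timer`, `AtomicEvent`, `trigger`); it is not the TAU API.

```python
import time


class Timer:
    """Paired event: begin() and end() must alternate; elapsed time accumulates."""

    def __init__(self, name, clock=time.perf_counter):
        self.name = name
        self.clock = clock
        self.calls = 0
        self.total = 0.0
        self._start = None

    def begin(self):
        if self._start is not None:
            raise RuntimeError(f"{self.name}: begin() called twice without end()")
        self._start = self.clock()

    def end(self):
        if self._start is None:
            raise RuntimeError(f"{self.name}: end() called without begin()")
        self.total += self.clock() - self._start
        self._start = None
        self.calls += 1


class AtomicEvent:
    """Atomic event: each trigger records one value; statistics are kept online."""

    def __init__(self, name):
        self.name = name
        self.count = 0
        self.total = 0.0
        self.minimum = float("inf")
        self.maximum = float("-inf")

    def trigger(self, value):
        self.count += 1
        self.total += value
        self.minimum = min(self.minimum, value)
        self.maximum = max(self.maximum, value)

    @property
    def mean(self):
        return self.total / self.count if self.count else 0.0


timer = Timer("compute")
timer.begin()
sum(range(1000))  # stand-in for real work
timer.end()

msg_size = AtomicEvent("message size (bytes)")
for size in (100, 300, 200):
    msg_size.trigger(size)
print(timer.calls, msg_size.count, msg_size.mean)
```

The clock is passed in as a parameter so that a deterministic clock can be substituted when testing.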

  11. TAU Instrumentation Mechanisms • Flexible instrumentation mechanisms at multiple levels • Source code • manual • automatic • C, C++, F77/90/95 (Program Database Toolkit (PDT)) • OpenMP (directive rewriting (Opari), POMP spec) • Object code • pre-instrumented libraries (e.g., MPI using PMPI) • statically-linked and dynamically-linked • Executable code • dynamic instrumentation (pre-execution) (DyninstAPI) • virtual machine instrumentation (e.g., Java using JVMPI) • TAU_COMPILER to automate instrumentation process

  12. Multi-Level Instrumentation and Mapping • Multiple interfaces • Simultaneously active • Information sharing between interfaces • Selective instrumentation • Within/between levels • Associate performance data with high-level semantic abstractions [Figure labels: user-level abstractions, problem domain; instrumentation at the source code, preprocessor, compiler, object code, libraries, linker, executable, OS, runtime image, and VM levels; performance data produced by the run]

  13. TAU Source Instrumentation • Automatic source instrumentation (tau_instrumentor) • Routine entry/exit and class method entry/exit • Block entry/exit and statement level (to be added) • Uses an instrumentation specification file • Include/exclude list for events and files • Uses command line options for group selection • Instrumentation event selection (tau_select) • Automatic generation of instrumentation specification file • Instrumentation language to describe event constraints • Event identity and location • Event performance properties (e.g., overhead analysis) • Create tau_select scripts for performance experiments
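
An include/exclude list, as mentioned on this slide, amounts to a filter applied to each candidate routine before instrumentation is inserted. The sketch below shows one plausible way to evaluate such a list with shell-style wildcards. The function name and the precedence rule (exclude wins; a non-empty include list restricts instrumentation to its matches) are assumptions for illustration and do not describe the format or semantics of TAU's specification file.

```python
from fnmatch import fnmatchcase


def should_instrument(routine, include=(), exclude=()):
    """Decide whether a routine is instrumented (hypothetical rules, see lead-in)."""
    if any(fnmatchcase(routine, pattern) for pattern in exclude):
        return False
    if include:
        return any(fnmatchcase(routine, pattern) for pattern in include)
    return True


routines = ["main", "solve_pressure", "solve_velocity", "vec_dot", "vec_axpy"]
selected = [r for r in routines
            if should_instrument(r, include=["main", "solve_*"], exclude=["*_velocity"])]
print(selected)
```

Excluding small, frequently called routines (here the hypothetical `vec_*` helpers) is a common way to keep measurement overhead down.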

  14. Program Database Toolkit (PDT) • Program code analysis framework • Use to develop source-based tools • High-level interface to source code information • Integrated toolkit for source code parsing, database creation, and database query • Commercial grade front-end parsers • Portable IL analyzer, database format, and access API • Open software approach for tool development • Multiple source languages • Implement automatic performance instrumentation tools • tau_instrumentor

  15. Program Database Toolkit (PDT) [Figure labels: Application / Library; C / C++ parser; Fortran parser (F77/90/95); IL; C / C++ IL analyzer; Fortran IL analyzer; Program Database Files; DUCTAPE; PDBhtml (program documentation); SILOON (application component glue); CHASM (C++ / F90/95 interoperability); TAU_instr (automatic source instrumentation)]

  16. PDT Status • Cleanscape Flint parser fully integrated for F90/95 • Flint parser is very robust • Produces PDB records for TAU instrumentation • Linux x86, HP Tru64, IBM AIX • Tested on SAGE, POP, ESMF, PET benchmarking codes • C++ and Fortran statement-level information • for/while loops, declarations, initialization, assignment,… • PDB records defined for most constructs • PDT applications • CHASM: C++ / Fortran 90/95 interoperability • CCA: proxy generation, component instrumentation

  17. TAU Performance Measurement • TAU supports profiling and tracing measurement • Robust timing and hardware performance support • Online profile access and sampling • Extension of TAU measurement for multiple counters • User-defined TAU counters and system-level metrics • Integration with trace measurement • Support for memory and callpath profiling • Fully portable parallel performance tracing solution • Hierarchical trace merging and trace translation • Component software monitoring • Online performance profile overhead compensation
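
The arithmetic behind overhead compensation can be sketched under a deliberately simple model: assume every instrumented event adds a fixed measurement cost, and that a routine's measured inclusive time contains that cost once for every event executed beneath it. This model and the function name are assumptions for illustration; the slide does not describe TAU's actual compensation algorithm.

```python
def compensate_inclusive(measured_us, descendant_events, overhead_per_event_us):
    """Remove the assumed fixed per-event cost from a measured inclusive time.

    All times are integer microseconds so the arithmetic is exact.
    """
    corrected = measured_us - descendant_events * overhead_per_event_us
    return max(corrected, 0)  # never report a negative time


# Hypothetical numbers: 10 s measured, 1000 nested events, 2 us overhead each.
print(compensate_inclusive(10_000_000, 1000, 2))
```

The correction matters most for routines that call many short instrumented children, where the accumulated measurement cost can rival the real work.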

  18. TAU Measurement Mechanisms • Performance data sources • High-resolution timer library (real-time / virtual clocks) • General software counter library (user-defined events) • Hardware performance counters • PCL (Performance Counter Library) (ZAM, Germany) • PAPI (Performance API) (UTK, Ptools Consortium) • consistent, abstract, portable API • Organization • Node, context, thread levels • Profile groups for collective events (runtime selective) • Performance data mapping between software levels
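
Runtime-selective profile groups can be pictured as a set of flags consulted before an event is recorded. The sketch below uses a bit-flag enumeration for that purpose. The group names and the `GroupFilter` class are hypothetical and are not taken from TAU.

```python
from enum import Flag, auto


class Group(Flag):
    NONE = 0
    IO = auto()
    COMM = auto()
    COMPUTE = auto()


class GroupFilter:
    """Holds the set of groups whose events are currently measured."""

    def __init__(self, enabled=Group.NONE):
        self.enabled = enabled

    def enable(self, group):
        self.enabled |= group

    def disable(self, group):
        self.enabled &= ~group

    def is_active(self, group):
        return bool(self.enabled & group)


flt = GroupFilter(Group.IO | Group.COMM)
flt.disable(Group.COMM)      # stop measuring communication events at runtime
flt.enable(Group.COMPUTE)
print(flt.is_active(Group.IO), flt.is_active(Group.COMM), flt.is_active(Group.COMPUTE))
```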

  19. TAU Measurement Mechanisms (continued) • Parallel profiling • Function-level, block-level, statement-level • Supports user-defined events • TAU parallel profile data stored during execution • Hardware counts values • Support for multiple counters • Support for callgraph and callpath profiling • Tracing • All profile-level events • Inter-process communication events • Inclusion of counter data in traced events
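
Callpath profiling attributes time to the chain of callers that led to a routine rather than to the routine alone. A minimal sketch, assuming a single thread and strictly nested enter/exit events, keeps a stack of active routines and uses the joined stack as the profile key. The class name and the " => " separator are choices made for this illustration.

```python
import time
from collections import defaultdict


class CallpathProfiler:
    def __init__(self, clock=time.perf_counter):
        self.clock = clock
        self.stack = []                      # (routine, start_time) of active calls
        self.inclusive = defaultdict(float)  # callpath -> accumulated inclusive time
        self.calls = defaultdict(int)        # callpath -> number of completed calls

    def enter(self, routine):
        self.stack.append((routine, self.clock()))

    def exit(self, routine):
        if not self.stack or self.stack[-1][0] != routine:
            raise RuntimeError(f"unmatched exit for {routine}")
        path = " => ".join(name for name, _ in self.stack)
        _, start = self.stack.pop()
        self.inclusive[path] += self.clock() - start
        self.calls[path] += 1


# Deterministic clock for the example: timestamps 0, 1, 4, 10.
ticks = iter([0.0, 1.0, 4.0, 10.0])
prof = CallpathProfiler(clock=lambda: next(ticks))
prof.enter("main")
prof.enter("solve")
prof.exit("solve")
prof.exit("main")
print(dict(prof.inclusive))
```

With this keying, the same routine reached through two different callers produces two separate profile entries, which is the point of a callpath profile.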

  20. Performance Analysis and Visualization • Analysis of parallel profile and trace measurement • Parallel profile analysis • ParaProf: parallel profile analysis and presentation • ParaVis: parallel performance data visualization (proto) • Profile generation from trace data • Performance data management framework (PerfDMF) • Parallel trace analysis • Translation to VTF (V3.0) and EPILOG formats • Integration with VNG (Technical University of Dresden) • Online parallel analysis and visualization • Integration with CUBE browser (UTK, FZJ)
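
Generating a profile from trace data means replaying the time-ordered enter/exit records and accumulating per-routine totals. The sketch below does this for one thread, computing inclusive and exclusive time. The record layout `(timestamp, kind, routine)` is an assumption for illustration and is not a real trace format; recursive calls would have their inclusive time counted more than once under this simple scheme.

```python
def profile_from_trace(events):
    """Build {routine: {"calls", "inclusive", "exclusive"}} from enter/exit records."""
    stack = []    # each entry: [routine, enter_time, time_spent_in_children]
    profile = {}
    for timestamp, kind, routine in events:
        if kind == "enter":
            stack.append([routine, timestamp, 0.0])
        elif kind == "exit":
            if not stack or stack[-1][0] != routine:
                raise ValueError(f"unmatched exit for {routine}")
            _, start, child_time = stack.pop()
            inclusive = timestamp - start
            entry = profile.setdefault(
                routine, {"calls": 0, "inclusive": 0.0, "exclusive": 0.0})
            entry["calls"] += 1
            entry["inclusive"] += inclusive
            entry["exclusive"] += inclusive - child_time
            if stack:
                stack[-1][2] += inclusive  # charge this call to its parent
        else:
            raise ValueError(f"unknown event kind: {kind}")
    if stack:
        raise ValueError("trace ended with routines still active")
    return profile


trace = [
    (0.0, "enter", "main"),
    (1.0, "enter", "MPI_Recv"),
    (4.0, "exit", "MPI_Recv"),
    (5.0, "enter", "MPI_Recv"),
    (7.0, "exit", "MPI_Recv"),
    (10.0, "exit", "main"),
]
print(profile_from_trace(trace))
```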

  21. ParaProf Framework Architecture • Portable, extensible, and scalable tool for profile analysis • Try to offer “best of breed” capabilities to analysts • Build as profile analysis framework for extensibility

  22. TAU PerfDMF Architecture

  23. Selected Applications of TAU • Center for Simulation of Accidental Fires and Explosions • University of Utah, ASCI ASAP Center, C-SAFE • Uintah Computational Framework (UCF) (C++) • Center for Simulation of Dynamic Response of Materials • California Institute of Technology, ASCI ASAP Center • Virtual Testshock Facility (VTF) (Python, Fortran 90) • Earth Systems Modeling Framework (ESMF) • NSF, NOAA, DOE, NASA, … • Instrumentation for ESMF framework and applications • C, C++, and Fortran 95 code modules • MPI wrapper library for MPI calls

  24. Selected Applications of TAU (continued) • Lawrence Livermore National Lab • Hydrodynamics (Miranda) • Radiation diffusion (KULL) • C++ automatic instrumentation, callpath profiling • Sandia National Lab • DOE CCTTSS SciDAC project • Common component architecture (CCA) integration • Combustion code (C++, Fortran 90, GrACE, MPI) • Los Alamos National Lab • Monte Carlo transport (MCNP) (Susan Post) • ASCI Q validation and scaling • SAIC’s Adaptive Grid Eulerian (SAGE) (Jack Horner) • Fortran 90 automatic instrumentation testcase

  25. Component-Based Scientific Applications • How to support performance analysis and tuning process consistent with application development methodology? • Common Component Architecture (CCA) applications • Performance tools should integrate with software • Design performance observation component • Measurement port and measurement interfaces • Build support for application component instrumentation • Interpose a proxy component for each port • Inside proxy, track caller/callee invocations and timings • Automate the process of proxy component creation • using PDT for static analysis of components • include support for selective instrumentation
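
The proxy idea on this slide, interposing an object between caller and callee that forwards each call and records who called what and for how long, can be sketched generically. The code below is not CCA or TAU code; `TimingProxy`, `Solver`, and the log layout are hypothetical.

```python
import time


class TimingProxy:
    """Forwards method calls to a target and logs (caller, callee, seconds)."""

    def __init__(self, target, caller, log, clock=time.perf_counter):
        self._target = target
        self._caller = caller
        self._log = log
        self._clock = clock

    def __getattr__(self, name):
        # Only reached for names not defined on the proxy itself.
        attr = getattr(self._target, name)
        if not callable(attr):
            return attr

        def wrapper(*args, **kwargs):
            start = self._clock()
            try:
                return attr(*args, **kwargs)
            finally:
                callee = f"{type(self._target).__name__}.{name}"
                self._log.append((self._caller, callee, self._clock() - start))

        return wrapper


class Solver:
    """Stand-in for a component that provides a port."""

    def step(self, n):
        return n * 2


log = []
port = TimingProxy(Solver(), caller="Driver", log=log)
print(port.step(3), log[0][:2])
```

Because the proxy exposes the same methods as the component it wraps, the caller needs no changes, which is what makes automatic proxy generation attractive.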

  26. Flame Reaction-Diffusion (Sandia, J. Ray) CCAFFEINE

  27. Earth Systems Modeling Framework • Coupled modeling with modular software framework • Instrumentation for ESMF framework and applications • PDT automatic instrumentation • Fortran 95 code modules • C / C++ code modules • MPI wrapper library for MPI calls • ESMF Component instrumentation (using CCA) • CCA measurement port manual instrumentation • Proxy generation using PDT and runtime interposition • Significant use of callpath profiling by ESMF team

  28. TAU’s ParaProf Profile Browser (ESMF Data) [Figure panels: global profile; callpath profile]

  29. CUBE Browser (UTK, FZJ) (ESMF Data) • TAU callpath profile data converted to CUBE form [Figure panels: metric, calltree, location]

  30. TAU Traces with Hardware Counters (ESMF)

  31. TAU Traces with User-Defined Counters

  32. Uintah Computational Framework (UCF) • University of Utah, Center for Simulation of Accidental Fires and Explosions (C-SAFE), DOE ASCI Center • UCF analysis • Scheduling • MPI library • Components • Performance mapping • Use for online and offline visualization • ParaVis tools [Figure: 500 processes]

  33. Scatterplot Displays (UCF, 500 processes) • Each point coordinate determined by three values: MPI_Reduce, MPI_Recv, MPI_Waitsome • Min/max value range • Effective for cluster analysis [Figure: relation between MPI_Recv and MPI_Waitsome]
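
One way to read this slide is that each process becomes a point whose three coordinates are its times in the three MPI routines, with each axis scaled to that metric's min/max range. The sketch below performs that scaling. The function names and sample values are hypothetical, and the exact scaling used by the ParaVis display is an assumption.

```python
def normalize(values):
    """Scale values linearly so the minimum maps to 0.0 and the maximum to 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)  # a constant metric carries no spread
    return [(v - lo) / (hi - lo) for v in values]


def scatter_points(reduce_t, recv_t, waitsome_t):
    """One (x, y, z) point per process, each axis scaled to its own range."""
    return list(zip(normalize(reduce_t), normalize(recv_t), normalize(waitsome_t)))


# Hypothetical per-process times in seconds for three processes.
points = scatter_points([2.0, 4.0, 6.0], [1.0, 1.0, 1.0], [8.0, 0.0, 4.0])
print(points)
```

Scaling each axis independently keeps a metric with a large absolute range from hiding clusters along the other two axes.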

  34. Online Uintah Performance Profiling • Demonstration of online profile sampling capability • Multiple profile samples • Each profile taken at major iteration (~ 60 seconds) • Colliding elastic disks C-SAFE application • Test material point method (MPM) code • Executed on 512 processors of ASCI Blue Pacific at LLNL • Example • 3D bargraph visualization • MPI execution time • Performance mapping • Multiple time steps

  35. Online Uintah Performance Profiling

  36. Miranda Performance Analysis • Miranda is a research hydrodynamics code at LLNL • Fortran 95, MPI • Mostly synchronous • MPI_ALLTOALL on Np x,y communicators • Some MPI reductions and broadcasts for statistics • Good communications scaling • ACL and MCR Linux cluster • Up to 1728 CPUs • Fixed workload per CPU • Ported to BlueGene/L

  37. Profiling of Miranda on BG/L (Miller, LLNL) • Profile code performance (automatic instrumentation) • Scaling studies (problem size, number of processors) • Run on 8K and 16K processors just two weeks ago! [Figure panels: 128 nodes, 512 nodes, 1024 nodes]

  38. Fine Grained Profiling via Tracing on Miranda • Use TAU to generate VTF3 traces for Vampir analysis • Combines MPI calls with HW counter information • Detailed code behavior to focus optimization efforts

  39. Memory Usage Analysis in Miranda on BG/L • BG/L will have limited memory per node (512 MB) • Miranda uses TAU to profile memory usage • Streamlines code • Squeeze larger problems on the machine • TAU’s footprint is small • Approximately 60 bytes per event per thread [Figure: max heap memory (KB) used for 128³ problem on 16 processors of ASC Frost at LLNL]
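
The footprint figure on this slide (approximately 60 bytes per event per thread) lends itself to a quick estimate. The sketch below multiplies it out. The event and thread counts are hypothetical, and the result is only as accurate as the approximate per-event figure.

```python
BYTES_PER_EVENT_PER_THREAD = 60  # approximate figure quoted on the slide


def estimate_footprint_bytes(events, threads, per_event=BYTES_PER_EVENT_PER_THREAD):
    """Rough measurement-data footprint: events x threads x bytes per event."""
    return events * threads * per_event


# Hypothetical case: 500 instrumented events, 4 threads in the process.
footprint = estimate_footprint_bytes(500, 4)
print(footprint, "bytes, about", round(footprint / 1024), "KB")
```

For these assumed counts the estimate is 120,000 bytes, roughly 117 KB, which is small next to the 512 MB per node mentioned above.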

  40. TAU Performance System Status • Computing platforms (selected) • IBM SP / pSeries, SGI Origin 2K/3K, Cray T3E / SV-1 / X1, HP (Compaq) SC (Tru64), Sun, Hitachi SR8000, NEC SX-5/6, Linux (IA-32/64, Alpha, PPC, PA-RISC, Power, Opteron), Apple (G4/5, OS X), Windows, BG/L • Programming languages • C, C++, Fortran 77/90/95, HPF, Java, OpenMP, Python • Thread libraries • pthreads, SGI sproc, Java, Windows, OpenMP • Compilers (selected) • Intel (KCC, KAP/Pro), PGI, GNU, Fujitsu, Sun, Microsoft, SGI, Cray, IBM (xlc, xlf), HP, NEC, Absoft

  41. Concluding Remarks • Complex ASC parallel systems and software pose challenging performance analysis problems that require robust methodologies and tools • To build more sophisticated performance tools, existing proven performance technology must be utilized • Performance tools must be integrated with software and systems models and technology • Performance engineered software • Function consistently and coherently • TAU performance system offers robust performance technology that can be broadly integrated in next-generation scalable software and systems

  42. Acknowledgements • Department of Energy (DOE) • MICS office • “Performance Technology for Tera-class Parallel Computer Systems: Evolution of the TAU Performance System” • “Performance Analysis of Parallel Component Software” • NNSA/ASC • University of Utah DOE ASCI Level 1 sub-contract • ASCI Level 3 project (LANL, LLNL, SNL) • Research Centre Juelich • John von Neumann Institute for Computing • Dr. Bernd Mohr • Los Alamos National Laboratory
