

  1. SCALASCA A Parallel Approach for Scalable, Trace-Based Performance Analysis

  2. Outline • Introduction • Parallel Trace-Based Performance Analysis • Early Results • Conclusion • Future Work

  3. Event Tracing • Post-mortem analysis of program behavior • Recording of time-stamped events at runtime • Entering/leaving a function • Sending/receiving a message • Collective operations • Synchronization • Accomplished by source-code instrumentation • High level of detail
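
  As a rough illustration, a time-stamped event record and an instrumentation hook might look like the C sketch below. The type and helper names (Event, trace_enter, the buffer) are invented for illustration and do not reflect KOJAK's actual trace format:

    #include <mpi.h>

    /* Illustrative event record -- field names are assumptions,
       not KOJAK's actual trace format. */
    typedef enum { ENTER, EXIT, SEND, RECV } EventType;

    typedef struct {
        double    timestamp;   /* when the event occurred               */
        EventType type;        /* enter/exit region, send/recv message  */
        int       region_id;   /* function entered/left, or message peer */
    } Event;

    static Event buffer[1 << 20];   /* per-process in-memory event buffer */
    static int   n_events = 0;

    /* Hook that source-code instrumentation inserts at function entry;
       a matching trace_exit() would record EXIT events. */
    void trace_enter(int region_id) {
        Event ev = { MPI_Wtime(), ENTER, region_id };
        if (n_events < (int)(sizeof buffer / sizeof buffer[0]))
            buffer[n_events++] = ev;   /* flushed to the local trace file later */
    }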

  4. Discovery of Wait States [Screenshot: time-line visualization, zoomed in on a wait state] • Time-line visualization • "Serial" human client

  5. Automatic Off-Line Trace Analysis [Diagram: reduction of a low-level event trace to a high-level profile along the problem, call-tree, and system dimensions] • Idea: • Automatic search for patterns of inefficient behavior • Quantification of significance • Data distillation • Guaranteed to cover the entire trace

  6. KOJAK Project • Software package for automatic performance analysis of parallel applications • Message passing and multi-threading (MPI, OpenMP, SHMEM) • Parallel performance • CPU and memory performance • Collaborative research project between • Forschungszentrum Jülich, Germany • University of Tennessee, USA • URLs: http://www.fz-juelich.de/zam/kojak/ http://icl.cs.utk.edu/kojak/

  7. Analysis Process [Workflow diagram] • Instrumentation: source code → automatic multilevel instrumentation → executable • Measurement: execution on parallel machine → local definition & trace files → merge → global definition & trace files • Analysis: sequential analyzer (EXPERT) → global CUBE file

  8. Analysis Report • Which type of problem? • Where in the source code? Which call path? • Which process / thread?

  9. Scalability Problems • Long traces: large problem size, high granularity / event rate, full (rather than partial or disabled) temporal coverage • Wide traces: large number of processes (trace width)

  10. SCALASCA • Follow-up project • Funded by German Helmholtz Association • Specifically addresses scalability • Basic idea: Parallelization of analysis • Current focus: Single-threaded MPI-1 applications • URL http://www.scalasca.org/

  11. Sequential Analysis Process [Same workflow diagram as slide 7] • Instrumentation: source code → automatic multilevel instrumentation → executable • Measurement: execution on parallel machine → local definition & trace files → merge → global definition & trace files • Analysis: sequential analyzer (EXPERT) → global CUBE file

  12. Parallel Analysis Process [Workflow diagram: the sequential merge step and sequential analyzer are replaced] • Instrumentation: source code → automatic multilevel instrumentation → executable • Measurement: execution on parallel machine → local definition & trace files • Analysis: parallel analyzer → global CUBE file

  13. Current Prototype [Workflow diagram] • Instrumentation: source code → automatic multilevel instrumentation → executable • Measurement: execution on parallel machine → local definition files + local trace files • Unification: local definition files → ID mapping tables • Analysis: parallel analyzer reads local trace files and mapping tables → local results → combine → global CUBE file
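
  As an illustration of the unification step, the sketch below translates local definition identifiers in a process-local event stream to unified global ones via a per-process mapping table. The data structures (IdMap, and Event from the earlier sketch) are assumptions, not the prototype's actual formats:

    /* Event is the illustrative record type from the earlier sketch. */
    typedef struct {
        int *local_to_global;  /* mapping table produced by unification */
        int  size;             /* number of local definition ids        */
    } IdMap;

    /* Rewrite local region ids in an event stream to global ids. */
    void apply_mapping(Event *events, int n, const IdMap *map) {
        for (int i = 0; i < n; ++i)
            events[i].region_id = map->local_to_global[events[i].region_id];
    }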

  14. Parallel Pattern Analysis • Analyze separate local trace files in parallel • Exploits distributed memory • Often allows keeping the whole trace in main memory • Parallel replay of the target application's communication behavior • Analyze communication with an operation of the same type • Parallel traversal of event streams • Exchange of data at synchronization points of the target application

  15. Example: Late Sender [Time-line diagram: the receiver's waiting time before a late send] • Sequential approach (EXPERT) • Scan the trace in sequential order • Watch out for the receive event • Use links to access the other constituents • Calculate the waiting time • New parallel approach • Each process identifies its local constituents • Sender sends its local constituents to the receiver • Receiver calculates the waiting time
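
  Translated into code, replaying one message might look like the C/MPI sketch below. The function and variable names are illustrative; the enter timestamps are assumed to come from the matching send/receive events in the local traces:

    #include <mpi.h>

    /* Sketch: Late Sender detection for one replayed message.
       enter_ts is the local enter timestamp of MPI_Send (on the
       sender) or MPI_Recv (on the receiver); peer/tag/comm mirror
       the original message. Names are illustrative. */
    double late_sender_waiting(int is_sender, double enter_ts,
                               int peer, int tag, MPI_Comm comm) {
        if (is_sender) {
            /* re-enact the message, carrying the send's enter timestamp */
            MPI_Send(&enter_ts, 1, MPI_DOUBLE, peer, tag, comm);
            return 0.0;
        } else {
            double enter_send;
            MPI_Recv(&enter_send, 1, MPI_DOUBLE, peer, tag, comm,
                     MPI_STATUS_IGNORE);
            /* the receiver waited if it entered MPI_Recv before the
               sender entered MPI_Send */
            return (enter_send > enter_ts) ? enter_send - enter_ts : 0.0;
        }
    }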

  16. Example: Wait at N×N • Waiting time due to inherent synchronization in N-to-N operations (e.g. MPI_Allreduce) • Parallel analysis: • Identify the local exit event of the collective N-to-N operation • Determine the timestamp of the latest enter event (a single call to MPI_Allreduce is sufficient!) • Calculate the local waiting time
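
  A minimal sketch of this step, assuming enter_ts holds this process's enter timestamp of the collective operation (names are illustrative):

    #include <mpi.h>

    /* Sketch: local waiting time in an N-to-N operation, using a
       single MPI_Allreduce as the slide suggests. */
    double wait_at_nxn(double enter_ts, MPI_Comm comm) {
        double latest_enter;
        /* determine the timestamp of the latest enter event */
        MPI_Allreduce(&enter_ts, &latest_enter, 1, MPI_DOUBLE,
                      MPI_MAX, comm);
        /* time this process spent waiting for the last one to arrive */
        return latest_enter - enter_ts;
    }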

  17. Experimental Evaluation • ASC SMG2000 benchmark • Semi-coarsening multigrid solver • Fixed 64×64×32 problem size per process, 5 solver iterations • Weak scaling behavior • PEPC-B • Parallel tree code for computing long-range forces in N-body problems • Fixed overall problem size • Strong scaling behavior

  18. Test Platform: Jülich BlueGene/L (JuBL) • 8 racks with 8192 dual-core nodes • 288 I/O nodes • p720 service and login nodes (8 × 1.6 GHz Power5 CPUs each)

  19. Results: SMG2000

  20. Results: PEPC-B

  21. Conclusion • Scalability can be addressed by parallelization • Process local trace files in parallel • Replay the target application's communication behavior • Promising results with the prototype implementation

  22. Future Work • Integrate & parallelize remaining sequential parts • Resolve file I/O issues • Address long traces • Selective tracing • Sophisticated data structures (e.g. cCCGs) • Extend to other programming paradigms • OpenMP • MPI-2

  23. Thank you! Questions?
