
Performance Modeling and Analysis with PEBIL

Michael Laurenzano, Ananta Tiwari, Laura Carrington. Performance Modeling and Characterization (PMaC) Laboratory, San Diego Supercomputer Center.


Presentation Transcript


  1. Performance Modeling and Analysis with PEBIL. Michael Laurenzano, Ananta Tiwari, Laura Carrington. Performance Modeling and Characterization (PMaC) Laboratory, San Diego Supercomputer Center

  2. Outline • Motivation • Performance modeling in High Performance Computing (HPC) • How does binary instrumentation fit in? • PEBIL = PMaC’s Efficient Binary Instrumentation for Linux/x86 • Binary instrumentation overview • Use case: memory tracing • Use case: function profiling

  3. PMaC HPC Performance Models • Performance Model – a calculable expression of the runtime, efficiency, memory use, etc. of an HPC program on some machine • Application Signature – detailed summaries of the fundamental operations to be carried out by the application • Machine Profile – characterizations of the rates at which a machine can carry out fundamental operations • Requirements of HPC Application (Application Signature): collected via trace tools • Characteristics of HPC Target System (Machine Profile): measured or projected via simple benchmarks on 1-2 nodes of the system • Convolution methods map Application Signatures to Machine Profiles to produce a performance prediction: the performance of the application on the target system
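To make the idea of combining an application signature with a machine profile concrete, here is a minimal sketch. It is not PMaC's convolution method: it assumes the simplest possible model, in which each class of operation contributes `count / rate` seconds and nothing overlaps. The names `OpClass` and `predict_seconds` are hypothetical.

```cpp
#include <cassert>
#include <vector>

// One class of fundamental operation (hypothetical; e.g. "L1 loads", "FP adds").
struct OpClass {
    double count;  // application signature: how many such operations the program issues
    double rate;   // machine profile: how many such operations per second the machine sustains
};

// Simplest possible "convolution": sum count/rate over all operation classes.
// Assumption: no overlap between classes, which a real model would account for.
double predict_seconds(const std::vector<OpClass>& ops) {
    double seconds = 0.0;
    for (const OpClass& op : ops) {
        assert(op.rate > 0.0);  // a rate of zero would make the prediction meaningless
        seconds += op.count / op.rate;
    }
    return seconds;
}
```

Under these assumptions, a program with 2e9 operations of a class the machine runs at 1e9 per second, plus 1e9 operations of a class it runs at 5e8 per second, is predicted to take 2 s + 2 s.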

  4. Application Signature • Application signature – fundamental operations used by the application • Requires low-level details of application • Details attached to specific structures within the application • Measurement? (e.g. timers or hardware counters) • Measuring at fine grain with reasonable overheads & transparently is HARD • Use binary instrumentation

  5. Binary Instrumentation • Instrumentation – inserting extra code into a program, usually to inspect some aspect of behavior • Binary instrumentation – instrumentation of the compiled object/executable

      void incrementby(int& n, int c){
          counter++; // instrumentation code
          n += c;
      }

  6. The Case for Binary Instrumentation • Low-level details of application • Program is in its binary form • Compilers transform and optimize • Basic program structure • Memory access • Vectorization • Data dependencies • The executable might be all we have • Easy to tie details to application structures

      int identity(int n){
          int c = 0;
          while (c < n) c++;
          return c;
      }

      int identity(int n){
          return n;
      }

  7. Runtime Overhead is a Big Deal • PEBIL… the E stands for Efficient • We want to model real HPC applications • Relatively long runtimes: minutes, hours, days? • Lots of CPUs: O(10^5) in largest supercomputers • High slowdowns create problems • Too long for queue • Unsympathetic administrators/managers • Inconvenience • Unnecessarily use resources • Mitigate problems by minimizing runtime overhead

  8. Example Use Cases • Memory address trace collection • Capture all application loads/stores • Use a buffer, batch process them • Very widely used • Performance/energy models (e.g., PMaC) • Cache design • Memory bug detection • For efficiency, this is often used with sampling • Function/loop measurement • Insert calls to measurement routines around functions/loops • TAU uses this feature

  9. PEBIL Design • Efficiency is priority #1 • Designed around a few use cases • Execution counting • Memory tracing • Static binary rewriter • Write instrumented + runnable executable to disk • Keep original behavior intact • Gather information as a side-effect • Instrument once, run many times • No instrumentation cost at runtime • Code patching (not just-in-time compiled!)

  10. How Binary Instrumentation Works (Basic block counting)

      Original:

      0000c000 <foo>:
        c000: 48 89 7d f8      mov %rdi,-0x8(%rbp)    # Basic Block 1
        c004: 5e               pop %rsi               # Basic Block 2
        c005: 75 f8            jne 0xc004
        c007: c9               leaveq                 # Basic Block 3
        c008: c3               retq

      Instrumented:

      0000d000 <foo>:
        d000: e9 de ad be ef   jmp 0x1000             # to instrumentation
        d005: 48 89 7d f8      mov %rdi,-0x8(%rbp)
        d009: e9 de ad be ef   jmp 0x1010             # to instrumentation
        d00e: 5e               pop %rsi
        d00f: 75 00 00 00 f8   jne 0xd009
        d014: e9 de ad be ef   jmp 0x1020             # to instrumentation
        d019: c9               leaveq
        d01a: c3               retq

      Instrumentation code (at each jump target):
        // do stuff
        // jump back
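A source-level analogue of the basic block counting on this slide may help. The sketch below is an illustration only: PEBIL patches machine code, whereas this imitates the effect in C++. The names `block_counts`, `count_block`, and `foo_instrumented` are hypothetical, and "do stuff" is assumed to be a single counter increment.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// One execution counter per basic block of foo (three blocks on the slide).
static std::array<std::uint64_t, 3> block_counts = {0, 0, 0};

// Stand-in for the code at each jump target: "do stuff" (bump a counter),
// then return, which plays the role of "jump back".
inline void count_block(std::size_t block_id) {
    assert(block_id < block_counts.size());
    ++block_counts[block_id];
}

// Imitation of the instrumented foo: every basic block starts with a call
// into the counting code. Block 2 loops; it runs at least once.
void foo_instrumented(int iterations) {
    count_block(0);              // basic block 1: mov
    do {
        count_block(1);          // basic block 2: pop; jne back to block 2
    } while (--iterations > 0);
    count_block(2);              // basic block 3: leaveq; retq
}
```

After a call such as `foo_instrumented(4)`, `block_counts` holds one execution each for the first and last block and four for the loop block. This is the kind of per-block execution count the instrumented binary gathers as a side effect of a normal run.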

  11. Use case: Memory Address Collection • Collect the address of every load/store issued by the application • Put addresses in a buffer, process addresses in batch • Fewer function calls • Less cache pollution

      Application loop:

      for (i = 0; i < n; i++){
          A[i] = B[i];
      }

      Instrumentation inserted for the loop body:

      if (cur + 2 > BUF_SIZE) clear_buf();
      buffer[cur + 0] = &(A[i]);
      buffer[cur + 1] = &(B[i]);
      cur += 2;
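The buffering scheme on this slide can be sketched as a small program. This is an illustration, not PEBIL's implementation: the instrumentation is written by hand at source level, `BUF_SIZE` is deliberately tiny, and the batch-processing step in `clear_buf` is a placeholder that only counts addresses. The names `processed`, `flushes`, and `copy_traced` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t BUF_SIZE = 8;      // assumption: tiny, to make flushes visible
static std::uintptr_t buffer[BUF_SIZE];  // collected addresses awaiting processing
static std::size_t cur = 0;              // next free slot in buffer
static std::size_t processed = 0;        // addresses handed to the analysis so far
static std::size_t flushes = 0;          // number of batch-processing calls

// Batch-process everything collected so far, then reset the buffer.
void clear_buf() {
    processed += cur;  // placeholder for cache simulation, locality analysis, ...
    ++flushes;
    cur = 0;
}

// The copy loop from the slide with the address collection written inline.
void copy_traced(int* A, const int* B, int n) {
    assert(n >= 0);
    for (int i = 0; i < n; i++) {
        if (cur + 2 > BUF_SIZE) clear_buf();  // make room for two addresses
        buffer[cur + 0] = reinterpret_cast<std::uintptr_t>(&A[i]);  // store target
        buffer[cur + 1] = reinterpret_cast<std::uintptr_t>(&B[i]);  // load source
        cur += 2;
        A[i] = B[i];  // original loop body
    }
}
```

The analysis function is called once per full buffer rather than once per memory access, which is where the savings in function calls and cache pollution come from. A final `clear_buf()` at program exit is needed to process whatever is left in the buffer.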

  12. Optimization – Sampling w/ Instrumentation Point Disabling • Processing addresses is usually expensive • Cache simulation (multiple caches), locality analysis, address stream compression • Use interval-based sampling • Process the first X of every Y addresses (Y >= X) • Obvious result: reduced processing overhead • Not so obvious: reduced collection overhead by skipping address collection outside the sampled regions • Different approaches • PEBIL – swap instrumentation with nops • Very lightweight, limited functionality • PIN / Dyninst – Arbitrarily remove, re-instrument • Heavyweight, rich functionality
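The "first X of every Y" rule can be sketched as follows. This models only the sampling decision, not the disabling of instrumentation points: in PEBIL the collection code itself is swapped for nops, so an unsampled access costs close to nothing, whereas here every access still reaches `on_access`. The class name `IntervalSampler` and its members are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Interval-based sampling: process the first x of every y addresses.
class IntervalSampler {
public:
    IntervalSampler(std::uint64_t x, std::uint64_t y) : x_(x), y_(y) {
        assert(y_ > 0 && y_ >= x_);
    }

    // Called once per memory access. Returns true if the address fell in the
    // sampled part of the current interval and was therefore processed.
    bool on_access(std::uintptr_t addr) {
        const bool sampled = pos_ < x_;
        if (sampled) {
            last_ = addr;  // placeholder for the expensive processing step
            ++processed_;
        }
        pos_ = (pos_ + 1) % y_;  // advance and wrap to the next interval
        return sampled;
    }

    std::uint64_t processed() const { return processed_; }

private:
    std::uint64_t x_;              // addresses processed per interval
    std::uint64_t y_;              // interval length
    std::uint64_t pos_ = 0;        // position inside the current interval
    std::uint64_t processed_ = 0;  // total addresses processed
    std::uintptr_t last_ = 0;      // most recently processed address
};
```

With `IntervalSampler s(2, 5)`, accesses 0, 1, 5, 6, 10, 11, and so on are processed and the rest are skipped, so processing cost scales with X/Y. The additional saving described on the slide comes from not even collecting the skipped addresses.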

  13. Memory Trace Overhead w/ Sampling • [Figure: OpenMP NAS Parallel Benchmarks (8 threads)]

  14. Use case: Inserting Profiling Routines • Insert calls to timers/tracking code around functions and loops • Want low overhead, especially where no instrumentation is introduced • Small overhead = accurate profile • “throttle” instrumentation points that are called too frequently • Don’t just ignore them, disable them! • Collaboration w/ Tuning and Analysis Utilities (TAU) project

      void compute(){                      // function id 0
          profile_begin(0);                // inserted
          profile_begin(1);                // inserted
          for (i = 0; i < n; i++){         // loop id 1
              A[i] = B[i];
          }
          profile_end(1);                  // inserted
          profile_begin(2);                // inserted
          for (i = 0; i < n; i++){         // loop id 2
              A[i] += C[i];
          }
          profile_end(2);                  // inserted
          profile_end(0);                  // inserted
      }
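A minimal sketch of throttling follows, under stated assumptions. The slide names `profile_begin` and `profile_end` but not their implementation, so the bodies here are hypothetical, as are `Point`, `NUM_IDS`, and the `THROTTLE_CALLS` threshold. A point is throttled purely on call count. In this sketch a throttled point still pays for a call and a branch; the slide's point is that a binary rewriter can go further and disable the call site itself.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

constexpr int NUM_IDS = 3;                      // function id 0, loop ids 1 and 2
constexpr std::uint64_t THROTTLE_CALLS = 1000;  // assumed threshold, not from the slides

using Clock = std::chrono::steady_clock;

struct Point {
    std::uint64_t calls = 0;    // completed begin/end pairs
    Clock::duration total{};    // accumulated time inside the region
    Clock::time_point start{};  // timestamp of the pending profile_begin
    bool enabled = true;        // false once the point has been throttled
};
static Point points[NUM_IDS];

void profile_begin(int id) {
    assert(id >= 0 && id < NUM_IDS);
    Point& p = points[id];
    if (!p.enabled) return;  // throttled: skip the timer entirely
    p.start = Clock::now();
}

void profile_end(int id) {
    assert(id >= 0 && id < NUM_IDS);
    Point& p = points[id];
    if (!p.enabled) return;
    p.total += Clock::now() - p.start;
    if (++p.calls >= THROTTLE_CALLS) p.enabled = false;  // called too often: throttle
}
```

A region entered 1500 times is measured for its first 1000 entries and then left alone, while rarely entered regions stay fully measured.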

  15. Contact Info • download: https://github.com/mlaurenzano/PEBIL • email: michaell@sdsc.edu, lcarring@sdsc.edu
