
Performance Monitoring Tools on TCS




  1. Performance Monitoring Tools on TCS Roberto Gomez and Raghu Reddy Pittsburgh Supercomputing Center David O’Neal National Center for Supercomputing Applications

  2. Objective • Measure single PE performance • Operation counts, wall time, MFLOP rates • Cache utilization ratio • Study scalability • Time spent in MPI calls vs. computation • Time spent in OpenMP parallel sections

  3. Atom Tools • atom(1) • Various tools • Low overhead • No recompiling or re-linking in some cases

  4. Useful Tools • Flop2: • Floating point operations count • Timer5: • Wall time (inclusive & exclusive) per routine • Calltrace: • Detailed statistics of calls and their arguments • Developed by Dick Foster @ Compaq

  5. Instrumentation • Load atom module • module load atom • Create routines file • nm -g a.out | awk '{if($5=="T") print $1}' > routines • Edit routines file • place main routine first; remove unwanted ones • Instrument executable • cat routines | atom -tool flop2 a.out • cat routines | atom -tool timer5 a.out • Execute • a.out.[flop2,timer5] to create fprof.* and tprof.*
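The nm/awk pipeline above just keeps the names of global text (code) symbols for atom to instrument. A minimal Python sketch of that filter is below; the sample nm output is illustrative only (our fabrication), assuming, as the awk program does, that the symbol name is in field 1 and the type letter ("T" for text) in field 5.

```python
# Mimic: nm -g a.out | awk '{if($5=="T") print $1}' > routines
# Sample lines below are illustrative, not real Tru64 nm output;
# the assumed column layout puts the name first and the type
# letter fifth, matching the awk filter on slide 5.
nm_output = """\
main                0x120001a40  unknown  quad  T
errno               0x140000000  unknown  quad  D
$null_evol$null_j_  0x120002b80  unknown  quad  T
"""

def text_symbols(nm_text):
    """Return names of global text ('T') symbols, in input order."""
    routines = []
    for line in nm_text.splitlines():
        fields = line.split()
        if len(fields) >= 5 and fields[4] == "T":
            routines.append(fields[0])
    return routines

print(text_symbols(nm_output))
```

The resulting list would then be edited by hand (main routine first, unwanted routines removed) before being piped to atom.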

  6. Single PE Performance Analysis Sample Timer5 output file:

      Procedure               Calls    Self Time   Total Time
      =========               =====    =========   ==========
      $null_evol$null_j_       3072     60596709     79880903
      $null_eth$null_d1_      72458     45499161     45499161
      $null_hyper_u$null_u_    3328     39889655     44500045
      $null_hyper_w$null_w_    3328     19195271     33769541
      ...                       ...          ...          ...
      =====================  ======    =========    =========
      Total                 1961226    248258934    248258934

  7. Single PE Performance Analysis Sample Flop2 output file:

      Procedure               Calls         Fops
      =========               =====         ====
      $null_evol$null_j_       3072  20406036288
      $null_eth$null_d1_      72458  20220926518
      $null_hyper_u$null_u_    3328  14062774258
      $null_hyper_w$null_w_    3328   3823795456
      ...                       ...          ...
      =====================  ======  ===========
      Total                 1936818  70876179927

  Obtain MFLOPS = Fops/(Self Time)
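The final step joins the two tables on procedure name: flop2 supplies the operation counts, timer5 the exclusive (self) times. A minimal sketch, using the values from the sample output above and assuming (our assumption; the slides do not state units) that timer5 self times are in microseconds, so Fops / Self Time yields MFLOPS directly:

```python
# Per-routine MFLOP rates from the flop2 and timer5 sample tables.
# Self times are assumed to be microseconds, so ops/us == Mops/s.
flops = {
    "$null_evol$null_j_":    20406036288,
    "$null_eth$null_d1_":    20220926518,
    "$null_hyper_u$null_u_": 14062774258,
}
self_time_us = {
    "$null_evol$null_j_":    60596709,
    "$null_eth$null_d1_":    45499161,
    "$null_hyper_u$null_u_": 39889655,
}

mflops = {name: flops[name] / self_time_us[name] for name in flops}

for name, rate in sorted(mflops.items(), key=lambda kv: -kv[1]):
    print(f"{name:24s} {rate:8.1f} MFLOPS")
```

Note that the rate uses Self Time, not Total Time: inclusive time would double-count work already attributed to callees.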

  8. MPI calltrace • module load atom • cat $ATOMPATH/mpicalls | atom -tool calltrace a.out • Execute a.out.calltrace to generate one trace file per PE • Gather timings for desired MPI routines • Repeat for increasing number of processors

  9. Sample calltrace statistics:

      Number of processors      8 PEs    128 PEs    256 PEs
      Processor grid            2x2x2      8x4x4      8x8x4
      Total Run time          277.028    314.857    422.170
      MPI_ISEND                 1.250      1.498      2.265
      MPI_RECV                  4.349     19.779     26.537
      MPI_WAIT                  9.172     16.311     20.150
      MPI_ALLTOALL              5.072      9.433     12.894
      MPI_REDUCE                0.013      0.162      0.002
      MPI_ALLREDUCE             0.391      2.073     10.313
      MPI_BCAST                 0.061      1.135      1.382
      MPI_BARRIER              14.959     28.694     62.028
      ____________________________________________________
      Total MPI Time           35.267     79.085    135.571
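The point of repeating the run at increasing PE counts is visible in the table: communication grows faster than total run time. A quick sketch of the MPI fraction of run time, computed from the sample numbers above (times taken to be in seconds, which the slide does not state explicitly):

```python
# MPI time as a fraction of total run time, per PE count,
# using the calltrace sample statistics from slide 9.
runs = {
    8:   {"total": 277.028, "mpi":  35.267},
    128: {"total": 314.857, "mpi":  79.085},
    256: {"total": 422.170, "mpi": 135.571},
}

fracs = {pes: t["mpi"] / t["total"] for pes, t in runs.items()}

for pes in sorted(fracs):
    print(f"{pes:4d} PEs: {100 * fracs[pes]:5.1f}% of run time in MPI")
```

For this code the MPI share climbs from roughly 13% at 8 PEs to about 32% at 256 PEs, with MPI_BARRIER the largest single contributor; that is the scalability signal these measurements are meant to expose.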

  10. calltrace timings graph

  11. DCPI • Digital Continuous Profiling Infrastructure • daemon and profiling utilities • Very low overhead (1-2%) • Aggregate or per-process data and analysis • No code modifications • Requires interactive access to compute nodes

  12. DCPI Example • Driver script • creates map file and host list • calls daemon and profiling scripts • Daemon startup script • starts daemon with selected options • Daemon shutdown script • halts daemon • Profiling script • executes post-processing utility with selected options

  13. DCPI Driver Script • PBS job file • dcpi.pbs • Creates map file and host list • Image map generated by dcpiscan(1) • Host list used by dsh(1) commands • Executes daemon and profiling scripts • Start daemon, run test executable, stop daemon, post-process

  14. DCPI Startup Script • C shell script • dcpi_start.csh • Three arguments defined by driver job • MAP, WORK, EXE • Creates database directory (DCPIDB) • Derived from WORK + hostname • Starts dcpid(1) process • Events of interest are specified here

  15. DCPI Stop Script • C shell script • dcpi_stop.csh • No arguments • dcpiquit(1) flushes buffers and halts the daemon process

  16. DCPI Profiling Script • C shell script • dcpi_post.csh • Three arguments defined by driver job • MAP, WORK, EXE • Determines database location (as before) • Uses dcpiprof(1) to post-process database files • Profile selection(s) must be consistent with daemon startup options

  17. DCPI Example Output • Profiler writes to stdout by default • dcpi.output • Single node output in four sections • Start daemon, run test, halt daemon • Basic dcpiprof output • Memory operations (MOPS) • Floating point operations (FOPS) • Reference profiling script for details

  18. Other DCPI Options • Per-process output files • See dcpid(1) –bypid option • Trim output • See dcpiprof(1) –keep option • Host list can also be cropped • ProfileMe events for EV67 and later • Focus on –pm events • See dcpiprofileme(1) options

  19. Common DCPI Problems • Login denied (dsh) • Requires permission to login on compute nodes • Daemon not started in background • NFS is flaky for larger node counts (100+) • Set filemode of DCPIDB directory correctly • Mismatch between startup configuration and profiling specifications • See dcpid(1), dcpiprof(1), and dcpiprofileme(1)

  20. Summary • Low-level interfaces provide access to hardware counters • Very effective, but require experience • Minimal overhead costs • Report timings, flop counts, MFLOP rates for user code and library calls, e.g. MPI • More information available, e.g. message sizes, time variability, etc.
