
Benchmarks on BG/L: Parallel and Serial


Presentation Transcript


  1. Benchmarks on BG/L: Parallel and Serial. John A. Gunnels, Mathematical Sciences Dept., IBM T. J. Watson Research Center

  2. Overview • Single node benchmarks • Architecture • Algorithms • Linpack • Dealing with a bottleneck • Communication operations • Benchmarks of the Future

  3. Compute Node: BG/L • Dual core • Dual FPU/SIMD; alignment issues • Three-level cache; pre-fetching • Non-coherent L1 caches: 32 KB, 64-way, round-robin • L2 & L3 caches coherent • Limited number of outstanding L1 misses

  4. Programming Options (High → Low Level) • Compiler optimization to find SIMD parallelism • User input for specifying memory alignment and lack of aliasing • alignx assertion • disjoint pragma • Dual FPU intrinsics (“built-ins”) • Complex data type used to model a pair of double-precision numbers that occupy a (P, S) register pair • Compiler responsible for register allocation and scheduling • In-line assembly • User responsible for instruction selection, register allocation, and scheduling
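The following is a minimal sketch, not taken from the talk, of how the middle rungs of this ladder might look in source code. The pragma and built-in names (#pragma disjoint, __alignx, __lfpd, __fpadd, __stfpd) follow my recollection of the XL C extensions for the BG/L dual FPU and should be checked against the compiler manual; the routine itself (vadd) is a made-up example.

```c
/* Hypothetical example: pairwise vector add using an alignment assertion,
 * the disjoint pragma, and dual-FPU built-ins.  Intrinsic names are from
 * memory of XL C for BG/L and are not guaranteed verbatim.
 * Assumes n is even and all pointers are 16-byte aligned.              */
void vadd(double *x, double *y, double *z, int n)
{
#pragma disjoint(*x, *y, *z)          /* promise: no aliasing among x, y, z */
    __alignx(16, x);                  /* assert 16-byte alignment so a      */
    __alignx(16, y);                  /* (P, S) register pair can be loaded */
    __alignx(16, z);                  /* with a single quadword access      */

    for (int i = 0; i < n; i += 2) {
        double _Complex xv = __lfpd(&x[i]);   /* load two doubles as a pair */
        double _Complex yv = __lfpd(&y[i]);
        __stfpd(&z[i], __fpadd(xv, yv));      /* parallel add, store pair   */
    }
}
```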

  5. STREAM Performance • Out-of-box performance is 50-65% of tuned performance • Lessons learned in tuning will be transferred to the compiler where possible • Comparison with commodity microprocessors is competitive

  6. DAXPY Bandwidth Utilization • 16 bytes/cycle (L1 bandwidth) • 5.3 bytes/cycle (L3 bandwidth)
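A back-of-envelope reading of these numbers (my arithmetic, not from the slide): DAXPY does two flops per element but moves roughly 24 bytes, so the sustainable rate is bandwidth-limited far below the FPU peak.

```c
/* DAXPY: y <- a*x + y.  Per element: 2 flops, ~24 bytes of traffic
 * (load x, load y, store y), i.e. about 12 bytes/flop.
 *   From L1: 16  bytes/cycle / 12 bytes/flop ~ 1.3  flops/cycle
 *   From L3: 5.3 bytes/cycle / 12 bytes/flop ~ 0.44 flops/cycle
 * Both are well below the dual FPU's ~4 flops/cycle per-core peak,
 * so DAXPY is bandwidth-bound at every level of the hierarchy.      */
void daxpy(int n, double a, const double *x, double *y)
{
    for (int i = 0; i < n; i++)
        y[i] += a * x[i];
}
```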

  7. Matrix Multiplication: Tiling for Registers (Analysis) [figure: register-blocking diagram with blocks M1, M2, F1, F2 of dimensions 8 and 16] • Latency tolerance (not bandwidth) • Take advantage of the register count • Unroll by a factor of two: 24 register pairs, 32 cycles per unrolled iteration, 15-cycle load-to-use latency (L2 hit) • Could go to 3-way unroll if needed: 32 register pairs, 32 cycles per unrolled iteration, 31-cycle load-to-use latency
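A scalar sketch of the register-tiling idea described here (block sizes and packing are illustrative, not the exact 2-way-unrolled BG/L kernel): a small block of C stays in registers across the k loop, so loads of A and B can be issued well ahead of their use and the load-to-use latency is hidden.

```c
/* Illustrative 4x2 register-blocked micro-kernel: C(4x2) += A(4xK) * B(Kx2).
 * Assumed packing: A in 4-element column slivers, B in 2-element row
 * slivers; C is column-major with leading dimension ldc.                 */
void dgemm_4x2_kernel(int K, const double *A, const double *B,
                      double *C, int ldc)
{
    double c00 = 0, c01 = 0, c10 = 0, c11 = 0,
           c20 = 0, c21 = 0, c30 = 0, c31 = 0;   /* C block lives in registers */

    for (int k = 0; k < K; k++) {
        double b0 = B[2*k + 0], b1 = B[2*k + 1]; /* one row of B    */
        double a0 = A[4*k + 0], a1 = A[4*k + 1], /* one column of A */
               a2 = A[4*k + 2], a3 = A[4*k + 3];
        c00 += a0 * b0;  c01 += a0 * b1;
        c10 += a1 * b0;  c11 += a1 * b1;
        c20 += a2 * b0;  c21 += a2 * b1;
        c30 += a3 * b0;  c31 += a3 * b1;
    }
    C[0*ldc + 0] += c00;  C[1*ldc + 0] += c01;
    C[0*ldc + 1] += c10;  C[1*ldc + 1] += c11;
    C[0*ldc + 2] += c20;  C[1*ldc + 2] += c21;
    C[0*ldc + 3] += c30;  C[1*ldc + 3] += c31;
}
```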

  8. Recursive Data Format [figure: nested blocking, from dual register blocks to register-set blocks to L1-cache blocks to L3-cache blocks] • Mapping 2-D (matrix) to 1-D (RAM) • C/Fortran orderings do not map well • Space-filling-curve approximation • Recursive tiling • Enables streaming/pre-fetching and dual-core “scaling”
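One common space-filling-curve approximation is Z (Morton) ordering of the cache blocks; the sketch below is a generic illustration of that idea, not necessarily the exact recursive format used in this work.

```c
/* Generic sketch: Z/Morton ordering of cache blocks, a recursive tiling
 * that keeps nearby (bi, bj) blocks nearby in memory.                  */
static unsigned interleave_bits(unsigned x)   /* spread the low 16 bits of x
                                                 into the even bit positions */
{
    unsigned r = 0;
    for (int b = 0; b < 16; b++)
        r |= ((x >> b) & 1u) << (2 * b);
    return r;
}

/* 1-D position of block (bi, bj) when blocks are stored in Morton order */
unsigned morton_block_index(unsigned bi, unsigned bj)
{
    return interleave_bits(bi) | (interleave_bits(bj) << 1);
}
```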

  9. Dual Core • Why? • It’s an effortless way to double your performance

  10. Dual Core • Why? • It exploits the architecture and, in some cases/regions, may allow you to double your code’s performance

  11. Single-Node DGEMM Performance at 92% of Peak • Near-perfect scalability (1.99×) going from single core to dual core • Dual-core code delivers 92.27% of peak flops (8 flops/pclk) • Performance (as a fraction of peak) competitive with that of Power3 and Power4

  12. Performance Scales Linearly with Clock Frequency • Measured performance of DGEMM and STREAM scales linearly with frequency • DGEMM at 650 MHz delivers 4.79 Gflop/s • STREAM COPY at 670 MHz delivers 3579 MB/s
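Consistency check (my arithmetic, not on the slide): at 650 MHz the node peak is 8 flops/pclk × 0.65 GHz = 5.2 Gflop/s, so 4.79 Gflop/s is about 92% of peak, matching slide 11; and 3579 MB/s ÷ 670 MHz ≈ 5.3 bytes/cycle, matching the L3 bandwidth figure on slide 6.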

  13. The Linpack Benchmark

  14. LU Factorization: Brief Review [figure: partitioned matrix showing the already-factored part, the current block, the pivot-and-scale step on the panel columns, the DTRSM update of the block row, and the DGEMM update of the trailing matrix]
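For reference, here is a minimal sketch of one step of the right-looking blocked LU that this picture reviews, written against generic CBLAS calls (column-major storage; pivoting and the panel factorization itself are omitted, and the function name is mine, not from the talk).

```c
#include <cblas.h>

/* One trailing-matrix update of blocked LU on an n x n column-major A,
 * after the nb-wide panel starting at column j has been factored.      */
void lu_trailing_update(double *A, int lda, int n, int j, int nb)
{
    int r = n - j - nb;                       /* size of the trailing matrix */

    /* U12 := L11^{-1} * A12  (DTRSM against the unit-lower panel block)  */
    cblas_dtrsm(CblasColMajor, CblasLeft, CblasLower, CblasNoTrans, CblasUnit,
                nb, r, 1.0, &A[j + j*lda], lda, &A[j + (j+nb)*lda], lda);

    /* A22 := A22 - L21 * U12  (the DGEMM update of the trailing matrix)  */
    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                r, r, nb, -1.0, &A[(j+nb) + j*lda], lda,
                &A[j + (j+nb)*lda], lda, 1.0, &A[(j+nb) + (j+nb)*lda], lda);
}
```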

  15. LINPACK: Problem Mapping [figure: the size-N problem mapped onto the machine, shown as n repetitions of a basic tile in one dimension and 16n in the other]

  16. Panel Factorization: Option #1 • Stagger the computations • Panel factorization is distributed over relatively few processors and may take as long as several DGEMM updates • DGEMM load imbalance: block size trades balance for speed • Use collective communication primitives • May require no “holes” in the communication fabric

  17. Speed-up Option #2 • Change the data distribution • Decrease the critical-path length • Consider the communication abilities of the machine • Complements Option #1 • Memory size (small favors #2; large favors #1) • Memory hierarchy (higher latency favors #1) • The two options can be used in concert

  18. Communication Routines • Broadcasts precede the DGEMM update • The broadcast needs to be architecturally aware • Multiple “pipes” connect processors • Physical-to-logical mapping • Careful orchestration is required to take advantage of the machine’s considerable abilities

  19.–28. Row Broadcast: Mesh (animation sequence; the broadcast steps across the mesh frame by frame)

  29. Row Broadcast: Mesh (Recv 2 / Send 4 at one node: hot spot!)

  30. Row Broadcast: Mesh (Recv 2 / Send 3)

  31.–34. Row Broadcast: Torus (animation sequence)

  35. Row Broadcast: Torus (sorry for the “fruit salad”)

  36. Broadcast • Bandwidth/latency • Bandwidth: 2 bytes/cycle per wire • Latency: sqrt(p), pipelined (large messages); deposit bit: 3 hops • Mesh: Recv 2 / Send 3 • Torus: Recv 4 / Send 4 (no “hot spot”); Recv 2 / Send 2 (red-blue only … again, no bottleneck) • Pipe: Recv/Send 1/1 on mesh, 2/2 on torus
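A generic sketch of the pipelined idea behind the “pipe” numbers above: the message is cut into chunks and relayed along a line of ranks, so in steady state each hop forwards one chunk per step. Plain MPI point-to-point calls are used for illustration; this is not the torus-aware BG/L implementation used for Linpack.

```c
#include <mpi.h>

/* Pipelined (store-and-forward) broadcast from rank 0 along a line of
 * ranks: each rank receives a chunk from its left neighbor and forwards
 * it to its right neighbor while later chunks are still in flight.      */
void pipeline_bcast(char *buf, int nbytes, int chunk, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    for (int off = 0; off < nbytes; off += chunk) {
        int len = (nbytes - off < chunk) ? nbytes - off : chunk;
        if (rank > 0)                      /* receive this chunk from the left */
            MPI_Recv(buf + off, len, MPI_BYTE, rank - 1, 0, comm,
                     MPI_STATUS_IGNORE);
        if (rank < size - 1)               /* forward it to the right          */
            MPI_Send(buf + off, len, MPI_BYTE, rank + 1, 0, comm);
    }
}
```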

  37. What Else? • It’s a(n) … • FPU Test • Memory Test • Power Test • Torus Test • Mode Test

  38. Conclusion • 1.435 TF Linpack • #73 in the TOP500 list (11/2003) • Limited machine-access time made analysis (prediction) more important • 500 MHz chip • A 1.507 TF run at 525 MHz demonstrates scaling • Would achieve >2 TF at 700 MHz • 1 TF even if the machine were used in “true” heater mode

  39. Conclusion

  40. Additional Conclusions • Models, extrapolated data • Use models to the extent that the architecture and algorithm are understood • Extrapolate from small processor sets • Vary as many (yes) parameters as possible at the same time • Consider how they interact and how they don’t • Also remember that instrumentation affects timing • Often one can compensate (though the resulting answer is not strictly correct) • Utilize observed “eccentricities” with caution (MPI_Reduce)

  41. Current Fronts • HPC Challenge Benchmark Suite (STREAM, HPL, etc.) • HPCS Productivity Benchmarks • Math Libraries • Focused feedback to Toronto • PERCS Compiler/Persistent Optimization • Linpack algorithm on other machines

  42. Thanks to … • Leonardo Bachega: BLAS-1, performance results • Sid Chatterjee, Xavier Martorell: Coprocessor, BLAS-1 • Fred Gustavson, James Sexton: Data-structure investigations, design, sanity tests

  43. Thanks to … • Gheorghe Almasi, Phil Heidelberger & Nils Smeds: MPI/Communications • Vernon Austel: Data copy routines • Gerry Kopcsay & Jose Moreira: System & machine configuration • Derek Lieber & Martin Ohmacht: Refined memory settings • Everyone else: System software, hardware, & Machine time!

  44. Benchmarks on BG/L: Parallel and Serial. John A. Gunnels, Mathematical Sciences Dept., IBM T. J. Watson Research Center
