Benchmarks for Parallel Systems

wendi

Presentation Transcript


  1. Benchmarks for Parallel Systems
     Sources/Credits:
     • “Performance of Various Computers Using Standard Linear Equations Software”, Jack Dongarra, University of Tennessee, Knoxville, TN 37996, Computer Science Technical Report CS-89-85, April 8, 2004. URL: http://www.netlib.org/benchmark/performance.ps
     • Top500: http://www.top500.org (courtesy: Jack Dongarra)
     • LINPACK FAQ: http://www.netlib.org/utk/people/JackDongarra/faq-linpack.html
     • “The LINPACK Benchmark: Past, Present, and Future”, Jack Dongarra, Piotr Luszczek, and Antoine Petitet
     • NAS Parallel Benchmarks: http://www.nas.nasa.gov/Software/NPB/

  2. LINPACK (Dongarra, 1979)
     • Solves a dense system of linear equations
     • Originally appeared in the user’s guide for the LINPACK package (1979)
     • Three variants: the N=100 benchmark, the N=1000 benchmark, and the Highly Parallel Computing benchmark

  3. LINPACK benchmark
     • Implemented on top of the Level 1 BLAS
     • Two main routines: DGEFA (Gaussian elimination, O(n^3)) and DGESL (solves Ax = b from the factors, O(n^2))
     • Major operation (97%): DAXPY, y = y + α·x
     • The DAXPY multiply-add is executed about n^3/3 + n^2 times at 2 flops each, hence about 2n^3/3 + 2n^2 flops in total
     • 64-bit floating-point arithmetic
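The DAXPY update and the nominal operation count above can be sketched in a few lines. This is only an illustration in pure Python, not the BLAS routine itself; the function names `daxpy`, `linpack_flops`, and `mflops` are chosen here for the example.

```python
def daxpy(alpha, x, y):
    """Return y + alpha * x element-wise (pure-Python stand-in for BLAS DAXPY)."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    return [yi + alpha * xi for xi, yi in zip(x, y)]


def linpack_flops(n):
    """Nominal LINPACK operation count for an n x n system: 2n^3/3 + 2n^2."""
    return 2.0 * n**3 / 3.0 + 2.0 * n**2


def mflops(n, seconds):
    """Rate in Mflop/s from the nominal count and a measured solve time."""
    return linpack_flops(n) / seconds / 1.0e6
```

Because the rate is derived from the nominal count rather than from the operations actually performed, two implementations can be compared on the same scale once their solve times are known.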

  4. LINPACK
     • N=100: a 100x100 system of equations. No changes to the code are allowed; the user supplies a timing routine called SECOND, and only compiler optimizations are permitted
     • N=1000: a 1000x1000 system. The user can implement any code, provided it delivers the required accuracy: Towards Peak Performance (TPP). The driver program always uses 2n^3/3 + 2n^2 as the operation count, whatever method is used
     • “Highly Parallel Computing” benchmark: any software may be used and the matrix size can be chosen freely. Used in the Top500
     • Based on 64-bit floating-point arithmetic

  5. LINPACK
     • 100x100: inner-loop optimization
     • 1000x1000: three-loop/whole-program optimization
     • Scalable parallel program: largest problem that can fit in memory
     • Template of Linpack code:
       - Generate
       - Solve
       - Check
       - Time
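The generate/solve/check/time template can be sketched as follows. This is a minimal pure-Python illustration, not the LINPACK driver: the function names are invented for the example, the right-hand side is built so that the exact solution is all ones, and the check assumes the normalized residual ||Ax - b|| / (n ||A|| ||x|| eps) with infinity norms.

```python
import random
import time


def generate(n, seed=1):
    """Generate a random dense matrix A and right-hand side b = A * ones."""
    rng = random.Random(seed)
    a = [[rng.uniform(-1.0, 1.0) for _ in range(n)] for _ in range(n)]
    b = [sum(row) for row in a]  # exact solution is then all ones
    return a, b


def solve(a, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(a)
    a = [row[:] for row in a]  # work on copies, keep the inputs for the check
    b = b[:]
    for k in range(n):
        # Pivot: pick the row with the largest entry in column k.
        p = max(range(k, n), key=lambda i: abs(a[i][k]))
        if a[p][k] == 0.0:
            raise ZeroDivisionError("matrix is singular")
        a[k], a[p] = a[p], a[k]
        b[k], b[p] = b[p], b[k]
        for i in range(k + 1, n):
            m = a[i][k] / a[k][k]
            # DAXPY-style row update: row_i = row_i - m * row_k
            for j in range(k + 1, n):
                a[i][j] -= m * a[k][j]
            a[i][k] = 0.0
            b[i] -= m * b[k]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = b[i] - sum(a[i][j] * x[j] for j in range(i + 1, n))
        x[i] = s / a[i][i]
    return x


def check(a, b, x):
    """Return the normalized residual ||Ax - b|| / (n * ||A|| * ||x|| * eps)."""
    n = len(a)
    eps = 2.0 ** -52  # double-precision machine epsilon
    resid = max(abs(sum(a[i][j] * x[j] for j in range(n)) - b[i]) for i in range(n))
    norm_a = max(sum(abs(v) for v in row) for row in a)
    norm_x = max(abs(v) for v in x)
    return resid / (n * norm_a * norm_x * eps)


def run(n):
    """Generate, solve, check, and time one n x n problem."""
    a, b = generate(n)
    start = time.perf_counter()
    x = solve(a, b)
    elapsed = time.perf_counter() - start
    return x, check(a, b, x), elapsed
```

Only the solve step is timed; generation and checking sit outside the timed region, which matches the order of the template.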

  6. HPL (Implementation of HPLinpack Benchmark)

  7. HPL Algorithm
     • 2-D block-cyclic data distribution
     • Right-looking LU
     • Panel factorization, with various options:
       - Crout, left-looking, or right-looking recursive variants based on matrix multiply
       - Number of sub-panels
       - Recursive stopping criteria
       - Pivot search and broadcast by binary exchange
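The 2-D block-cyclic distribution can be illustrated with a small index-mapping sketch. It assumes square nb x nb blocks dealt out round-robin over a p x q process grid starting at process (0, 0); the function names are hypothetical and this is not HPL's own code.

```python
def owner(i, j, nb, p, q):
    """Process-grid coordinates (row, col) that own global entry (i, j)."""
    return (i // nb) % p, (j // nb) % q


def local_index(g, nb, nprocs):
    """Local index of global index g along one dimension of the grid."""
    block = g // nb                      # global block number
    return (block // nprocs) * nb + g % nb
```

For example, with nb = 2 on a 2 x 3 grid, `owner(2, 4, 2, 2, 3)` gives process (1, 2). Dealing blocks out cyclically keeps the shrinking trailing matrix spread over all processes as the factorization advances.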

  8. HPL algorithm
     • Panel broadcast
     • Update of trailing matrix: look-ahead pipeline
     • Validity check: the scaled residual should be O(1)

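  9. Top500 (www.top500.org)
     • The Top500 list started in 1993
     • Published twice a year, in June and November
     • Reports Rmax (best measured LINPACK performance), Nmax (problem size at which Rmax was achieved), N1/2 (problem size at which half of Rmax is achieved), and Rpeak (theoretical peak performance)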

  10. India and Top 500

  11. NAS Parallel Benchmarks (NPB)
     • Also used for evaluating supercomputers
     • A set of 8 programs derived from CFD
     • 5 kernels and 3 pseudo-applications
     • NPB 1: original benchmarks
     • NPB 2: NAS’s MPI implementation. NPB 2.4 Class D has more work and more I/O
     • NPB 3: based on OpenMP, HPF, and Java
     • GridNPB3: for computational grids
     • NPB 3 multi-zone: for hybrid parallelism

  12. NPB 1.0 (March 1994)
     • Defines Class A and Class B versions
     • “Paper and pencil” algorithmic specifications
     • Generic benchmarks, in contrast to the MPI-based LINPACK
     • General rules for implementations: Fortran 90 or C, 64-bit arithmetic, etc.
     • Sample implementations provided

  13. Kernel Benchmarks
     • EP: embarrassingly parallel
     • MG: multigrid. Regular communication
     • CG: conjugate gradient. Irregular long-distance communication
     • FT: a 3-D PDE solved using FFTs. Rigorous test of long-distance communication
     • IS: large integer sort
     • Detailed rules cover:
       - a brief statement of the problem
       - the algorithm to be used
       - validation of results
       - where to insert timing calls
       - the method for generating random numbers
       - submission of results
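As an illustration of a prescribed random-number method, the sketch below assumes the linear congruential scheme described in the NPB specification, x_{k+1} = a·x_k mod 2^46 with a = 5^13, returning x / 2^46 as the uniform deviate. The function names are modeled loosely on the NPB sources but this is not NPB code, and any seed passed in is the caller's choice.

```python
A = 5 ** 13   # multiplier (assumed from the NPB specification)
M = 2 ** 46   # modulus (assumed from the NPB specification)


def randlc(x):
    """Advance the state once; return (new_state, uniform deviate in (0, 1))."""
    x = (A * x) % M
    return x, x / M


def skip_ahead(seed, k):
    """State after k steps, computed directly with modular exponentiation."""
    return (pow(A, k, M) * seed) % M


def stream(seed, count):
    """Return `count` successive deviates starting from `seed`."""
    values = []
    x = seed
    for _ in range(count):
        x, r = randlc(x)
        values.append(r)
    return values
```

The skip-ahead form is what makes a kernel such as EP easy to parallelize under this scheme: each process can jump straight to its own segment of the sequence without generating the numbers before it.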

  14. Pseudo-applications / Synthetic CFDs
     • Benchmark 1: perform a few iterations of the approximate factorization algorithm (BT)
     • Benchmark 2: perform a few iterations of the diagonal form of the approximate factorization algorithm (SP)
     • Benchmark 3: perform a few iterations of SSOR (LU)

  15. Class A and Class B (table with columns Class A, Sample Code, and Class B; contents not preserved in the transcript)

  16. NPB 2.0 (1995)
     • MPI and Fortran 77 implementations
     • 2 parallel kernels (MG, FT) and 3 simulated applications (LU, SP, BT)
     • Class C: a larger problem size
     • Benchmark rules: results are categorized by 0%, up to 5%, or more than 5% change in the source code

  17. NPB 2.2 (1996), 2.4 (2002), 2.4 I/O (Jan 2003)
     • EP and IS added
     • FT rewritten
     • NPB 2.4: Class D, with a rationale for the Class D sizes
     • NPB 2.4 I/O: a new benchmark problem based on BT (BTIO) to test output capabilities
     • An MPI implementation of the same (MPI-IO), with different options such as whether or not to use collective buffering

  18. Thank You!