Profiling LINPACK Benchmark for Parallel Systems Evaluation

Benchmarks for Parallel Systems
Sources/Credits:
• “Performance of Various Computers Using Standard Linear Equations Software”, Jack Dongarra, University of Tennessee, Knoxville TN 37996, Computer Science Technical Report CS-89-85, April 8, 2004. http://www.netlib.org/benchmark/performance.ps
• Top500: http://www.top500.org (courtesy: Jack Dongarra)
• LINPACK FAQ: http://www.netlib.org/utk/people/JackDongarra/faq-linpack.html
• “The LINPACK Benchmark: Past, Present, and Future”, Jack Dongarra, Piotr Luszczek, and Antoine Petitet
• NAS Parallel Benchmarks: http://www.nas.nasa.gov/Software/NPB/

LINPACK (Dongarra: 1979) • Solves a dense system of linear equations • The benchmark first appeared in 1979 as an appendix to the LINPACK Users' Guide • Three variants: N=100 benchmark, N=1000 benchmark, Highly Parallel Computing benchmark

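LINPACK benchmark • Implemented on top of BLAS1 (Level 1 BLAS) • 2 main operations – DGEFA (Gaussian elimination with partial pivoting, O(n^3)) and DGESL (solves Ax = b from the computed factors, O(n^2)) • Major operation (97% of the time) – DAXPY: y = y + α·x • The multiply-add inside DAXPY executes about n^3/3 + n^2 times; at 2 flops each, that gives approximately 2n^3/3 + 2n^2 flops • 64-bit floating point arithmetic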
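The DGEFA/DGESL split above can be illustrated with a minimal pure-Python sketch. It is not the Fortran LINPACK source: the names daxpy, dgefa and dgesl are borrowed only for readability, the matrix is stored as a list of rows (the original is column-oriented), and whole rows are swapped during pivoting, which is an assumption made to keep the code short.

```python
def daxpy(alpha, x, y, start=0):
    """In place: y[j] = y[j] + alpha * x[j] for j >= start (2 flops per element)."""
    for j in range(start, len(y)):
        y[j] += alpha * x[j]


def dgefa(a):
    """Factor the square matrix a (list of rows) in place, with partial pivoting.

    After the call, the upper triangle holds U and the strict lower triangle
    holds the multipliers of L. Returns the pivot row chosen at each step.
    """
    n = len(a)
    pivots = []
    for k in range(n - 1):
        # Pivot search: largest magnitude entry in column k, rows k..n-1.
        p = max(range(k, n), key=lambda i: abs(a[i][k]))
        if a[p][k] == 0.0:
            raise ZeroDivisionError("matrix is singular")
        pivots.append(p)
        a[k], a[p] = a[p], a[k]
        for i in range(k + 1, n):
            m = a[i][k] / a[k][k]
            a[i][k] = m  # keep the multiplier where the zero would go
            daxpy(-m, a[k], a[i], k + 1)  # the O(n^3) work happens here
    return pivots


def dgesl(a, pivots, b):
    """Solve A x = b using the factors from dgefa; O(n^2) work."""
    n = len(a)
    b = b[:]
    for k, p in enumerate(pivots):  # apply the recorded row swaps to b
        b[k], b[p] = b[p], b[k]
    for k in range(n - 1):  # forward substitution with unit lower triangle L
        for i in range(k + 1, n):
            b[i] -= a[i][k] * b[k]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution with U
        s = sum(a[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (b[i] - s) / a[i][i]
    return x
```

For example, factoring A = [[2, 1, 1], [4, -6, 0], [-2, 7, 2]] and solving with b = [5, -2, 9] should recover x close to [1, 1, 2].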

LINPACK • N=100: 100x100 system of equations. No change to the source code is allowed; the user supplies a timing routine called SECOND; only compiler optimizations may be used • N=1000: 1000x1000 system. The user may implement any code, provided it delivers the required accuracy: Towards Peak Performance (TPP). The driver program always uses 2n^3/3 + 2n^2 as the operation count • “Highly Parallel Computing” benchmark: any software may be used and the matrix size can be chosen freely. Used in Top500 • Based on 64-bit floating point arithmetic

LINPACK • 100x100 – inner loop optimization • 1000x1000 – three-loop/whole program optimization • Scalable parallel program – Largest problem that can fit in memory • Template of Linpack code • Generate • Solve • Check • Time
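The Generate / Solve / Check / Time template can be sketched as follows. This is an illustrative pure-Python driver, not the official benchmark: the function names, the seed, and the choice of a right-hand side equal to the row sums (so that the exact solution is all ones) are assumptions of this sketch. The reported rate uses the fixed operation count 2n^3/3 + 2n^2 regardless of how the solver actually works.

```python
import random
import time


def generate(n, seed=1):
    """Generate: random dense matrix and a right-hand side with known solution."""
    rng = random.Random(seed)
    a = [[rng.uniform(-0.5, 0.5) for _ in range(n)] for _ in range(n)]
    b = [sum(row) for row in a]  # row sums, so the exact solution is all ones
    return a, b


def solve(a, b):
    """Solve: Gaussian elimination with partial pivoting on private copies."""
    n = len(a)
    a = [row[:] for row in a]
    b = b[:]
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(a[i][k]))
        a[k], a[p] = a[p], a[k]
        b[k], b[p] = b[p], b[k]
        for i in range(k + 1, n):
            m = a[i][k] / a[k][k]
            for j in range(k + 1, n):
                a[i][j] -= m * a[k][j]
            b[i] -= m * b[k]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(a[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (b[i] - s) / a[i][i]
    return x


def check(x):
    """Check: largest deviation from the known solution of all ones."""
    return max(abs(xi - 1.0) for xi in x)


def run_benchmark(n, seed=1):
    """Time: wall-clock the solve and convert to Mflop/s with the fixed count."""
    a, b = generate(n, seed)
    start = time.perf_counter()
    x = solve(a, b)
    elapsed = max(time.perf_counter() - start, 1e-9)
    flops = 2.0 * n ** 3 / 3.0 + 2.0 * n ** 2
    return {"n": n, "seconds": elapsed,
            "mflops": flops / elapsed / 1e6, "max_error": check(x)}
```

Calling run_benchmark(100) mimics the N=100 case in spirit; an interpreted solver will report a rate far below what compiled BLAS-based code achieves.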

HPL (Implementation of HPLinpack Benchmark)

HPL Algorithm • 2-D block-cyclic data distribution • Right-looking LU • Panel factorization: various options - Crout, left- or right-looking recursive variants based on matrix multiply - number of sub-panels - recursive stopping criteria - pivot search and broadcast by binary-exchange
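To make the 2-D block-cyclic data distribution concrete, the sketch below maps a global matrix entry to the process that owns it and to its local row index. The names owner and local_index, and the parameters nb (block size), p and q (process grid dimensions), are hypothetical and chosen for this illustration; this is not code from HPL itself.

```python
def owner(i, j, nb, p, q):
    """Process-grid coordinates (row, col) owning global entry (i, j).

    Blocks of size nb x nb are dealt out cyclically over a p x q grid.
    """
    return (i // nb) % p, (j // nb) % q


def local_index(i, nb, p):
    """Local row (or column) index of global index i on its owning process.

    Each full cycle of p blocks contributes nb local entries.
    """
    return (i // (nb * p)) * nb + i % nb


def grid_map(n, nb, p, q):
    """Owner of every entry of an n x n matrix, for small visual checks."""
    return [[owner(i, j, nb, p, q) for j in range(n)] for i in range(n)]
```

With nb=2 on a 2x3 grid, process row 0 holds global rows 0, 1, 4, 5, 8, 9 and so on, which is why global row 5 becomes local row 3. Spreading blocks cyclically keeps every process busy as the trailing matrix shrinks during the right-looking factorization.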

HPL algorithm • Panel broadcast • Update of trailing matrix: - look-ahead pipeline • Validity check - the scaled residual of the computed solution should be O(1)
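A minimal sketch of such a validity check is shown below. It assumes the scaled residual form ||Ax - b||_inf / (eps · (||A||_inf · ||x||_inf + ||b||_inf) · n) with a pass threshold of 16, which is how recent HPL releases are commonly described; the exact norms, the value of eps and the threshold in a given HPL version may differ, and the function names are made up for this example.

```python
import sys


def inf_norm_vec(v):
    """Infinity norm of a vector: largest absolute entry."""
    return max(abs(e) for e in v)


def inf_norm_mat(a):
    """Infinity norm of a matrix: largest absolute row sum."""
    return max(sum(abs(e) for e in row) for row in a)


def scaled_residual(a, x, b):
    """Residual of A x = b, scaled so that a good solution gives an O(1) value."""
    n = len(b)
    r = [sum(a[i][j] * x[j] for j in range(n)) - b[i] for i in range(n)]
    eps = sys.float_info.epsilon  # assumption: double precision machine epsilon
    denom = eps * (inf_norm_mat(a) * inf_norm_vec(x) + inf_norm_vec(b)) * n
    return inf_norm_vec(r) / denom


def passes(a, x, b, threshold=16.0):
    """True when the scaled residual is below the assumed threshold."""
    return scaled_residual(a, x, b) < threshold
```

Dividing by eps, the norms and n removes the dependence on problem size and data magnitude, so a numerically sound solve yields a small constant while a wrong answer yields a value many orders of magnitude larger.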

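Top500 (www.top500.org) • List first published in 1993 • Updated twice a year – June and November • Reports for each system: Rmax (best LINPACK performance achieved), Nmax (problem size at which Rmax was achieved), N1/2 (problem size at which half of Rmax is achieved), Rpeak (theoretical peak performance)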
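The four reported quantities can be derived from a sweep of benchmark runs, as in the hedged sketch below. The measurements and machine parameters in the usage note are invented for illustration, the helper names are hypothetical, and estimating N1/2 by linear interpolation between neighbouring runs is an assumption of this sketch rather than a Top500 rule.

```python
def rpeak_gflops(nodes, cores_per_node, ghz, flops_per_cycle):
    """Theoretical peak: every core retiring its maximum flops each cycle."""
    return nodes * cores_per_node * ghz * flops_per_cycle


def summarize(runs):
    """Derive Nmax, Rmax and N1/2 from (n, gflops) measurements.

    Assumes the rate rises with n up to the point where half of Rmax is
    first reached, which is the usual shape of a LINPACK sweep.
    """
    runs = sorted(runs)
    nmax, rmax = max(runs, key=lambda r: r[1])
    half = rmax / 2.0
    n_half = None
    prev = None
    for n, rate in runs:
        if rate >= half:
            if prev is None:
                n_half = float(n)  # smallest run already reaches Rmax/2
            else:
                n0, r0 = prev
                # Linear interpolation between the two bracketing runs.
                n_half = n0 + (half - r0) / (rate - r0) * (n - n0)
            break
        prev = (n, rate)
    return {"Nmax": nmax, "Rmax": rmax, "N1/2": n_half}
```

For the made-up sweep [(1000, 10.0), (2000, 30.0), (4000, 45.0), (8000, 50.0)], Rmax is 50 Gflop/s at Nmax = 8000 and the interpolated N1/2 is 1750. Against a hypothetical Rpeak of rpeak_gflops(1, 4, 2.0, 8) = 64 Gflop/s, that corresponds to an efficiency Rmax/Rpeak of about 78%.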

India and Top 500

NAS Parallel Benchmarks - NPB • Also for evaluation of Supercomputers • A set of 8 programs from CFD • 5 kernels, 3 pseudo applications • NPB 1 – Original benchmarks • NPB 2 – NAS’s MPI implementation. NPB 2.4 Class D has more work and more I/O • NPB 3 – based on OpenMP, HPF, Java • GridNPB3 – for computational grids • NPB 3 multi-zone – for hybrid parallelism

NPB 1.0 (March 1994) • Defines class A and class B versions • “Paper and pencil” algorithmic specifications • Generic benchmarks as compared to MPI-based LINPACK • General rules for implementations – Fortran90 or C, 64-bit arithmetic etc. • Sample implementations provided

Kernel Benchmarks • EP – embarrassingly parallel • MG – multigrid. Regular communication • CG – conjugate gradient. Irregular long distance communication • FT – a 3-D PDE solved using FFT. Rigorous test of long distance communication • IS – large integer sort • Detailed rules cover: - brief statement of the problem - algorithm to be used - validation of results - where to insert timing calls - method for generating random numbers - submission of results
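As an illustration of the EP kernel and the prescribed random number method, here is a small sketch. It assumes the linear congruential generator x_{k+1} = a · x_k mod 2^46 with a = 5^13 and the EP seed 271828183, as given in the NPB specification, and it follows the EP idea of turning uniform pairs into Gaussian deviates and tallying them in square annuli. The function names and the tiny problem size are choices made for this example; it is not a conforming NPB implementation.

```python
import math

A = 5 ** 13    # multiplier of the assumed NPB generator
MOD = 2 ** 46  # modulus of the assumed NPB generator


def npb_randoms(seed, count):
    """Yield count uniform deviates in (0, 1) from x = A * x mod 2^46."""
    x = seed
    for _ in range(count):
        x = (A * x) % MOD
        yield x / MOD


def ep_sketch(pairs, seed=271828183):
    """Generate Gaussian pairs by the polar method and count them per annulus.

    Returns (accepted pairs, counts for annuli 0..9, sum of X, sum of Y).
    Each pair is independent of the others, hence "embarrassingly parallel".
    """
    stream = npb_randoms(seed, 2 * pairs)
    counts = [0] * 10
    accepted = 0
    sum_x = sum_y = 0.0
    for _ in range(pairs):
        x = 2.0 * next(stream) - 1.0
        y = 2.0 * next(stream) - 1.0
        t = x * x + y * y
        if 0.0 < t <= 1.0:  # keep only points inside the unit circle
            f = math.sqrt(-2.0 * math.log(t) / t)
            gx, gy = x * f, y * f
            accepted += 1
            sum_x += gx
            sum_y += gy
            annulus = int(max(abs(gx), abs(gy)))
            if annulus < len(counts):
                counts[annulus] += 1
    return accepted, counts, sum_x, sum_y
```

Roughly π/4 of the candidate pairs fall inside the unit circle and are accepted. Because the generator is fully specified, every implementation can be validated against the same sums and counts, which is the purpose of the validation rules listed above.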

Pseudo applications / Synthetic CFDs • Benchmark 1 – perform a few iterations of the approximate factorization algorithm (BT, block tridiagonal) • Benchmark 2 - perform a few iterations of the diagonal form of the approximate factorization algorithm (SP, scalar pentadiagonal) • Benchmark 3 - perform a few iterations of SSOR (LU)

Class A and Class B [table with columns: Class A, Sample Code, Class B]

NPB 2.0 (1995) • MPI and Fortran 77 implementations • 2 parallel kernels (MG, FT) and 3 simulated applications (LU, SP, BT) • Class C – bigger problem size • Benchmark rules – results are categorized by the amount of source code changed: 0%, up to 5%, more than 5%

NPB 2.2 (1996), 2.4 (2002), 2.4 I/O (Jan 2003) • EP and IS added • FT rewritten • NPB 2.4 – class D and rationale for class D sizes • 2.4 I/O – a new benchmark problem based on BT (BTIO) to test output capabilities • An MPI implementation of the same (MPI-IO) – different options, such as with or without collective buffering

Thank You !
