Profiling LINPACK Benchmark for Parallel Systems Evaluation
Explore history, implementation strategies, and importance of LINPACK benchmark in parallel system evaluation. Learn about performance metrics and top benchmarks like HPL and NPB.
Presentation Transcript
Benchmarks for Parallel Systems Sources/Credits: • “Performance of Various Computers Using Standard Linear Equations Software”, Jack Dongarra, University of Tennessee, Knoxville TN 37996, Computer Science Technical Report CS-89-85, April 8, 2004, http://www.netlib.org/benchmark/performance.ps • Top500: http://www.top500.org (courtesy: Jack Dongarra) • LINPACK FAQ: http://www.netlib.org/utk/people/JackDongarra/faq-linpack.html • “The LINPACK Benchmark: Past, Present, and Future”, Jack Dongarra, Piotr Luszczek, and Antoine Petitet • NAS Parallel Benchmarks: http://www.nas.nasa.gov/Software/NPB/
LINPACK (Dongarra: 1979) • Solves a dense system of linear equations • The benchmark first appeared in the user’s guide for the LINPACK package (1979) • Three variants: N=100 benchmark, N=1000 benchmark, Highly Parallel Computing benchmark
LINPACK benchmark • Implemented on top of BLAS1 • 2 main operations – DGEFA (Gaussian elimination, O(n^3)) and DGESL (solves Ax = b, O(n^2)) • Major operation (97%) – DAXPY: y = y + α·x • Executed n^3/3 + n^2 times, one multiply and one add each. Hence 2n^3/3 + 2n^2 flops (approx.) • 64-bit floating point arithmetic
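To make the DGEFA/DGESL split concrete, here is a minimal pure-Python sketch of a column-oriented LU factorization with partial pivoting, followed by the triangular solves, both built on a DAXPY-style update. The names `dgefa_like` and `dgesl_like` are illustrative only; this is not the LINPACK Fortran source, just the same idea in miniature.

```python
def daxpy(alpha, x, y):
    """In-place y <- y + alpha * x (the BLAS1 DAXPY operation)."""
    for i in range(len(y)):
        y[i] += alpha * x[i]


def dgefa_like(cols):
    """LU-factor a matrix stored as a list of columns (overwritten in place).

    Returns the pivot row chosen at each step.
    """
    n = len(cols)
    pivots = []
    for k in range(n - 1):
        # Pivot search in column k, rows k..n-1.
        p = max(range(k, n), key=lambda i: abs(cols[k][i]))
        pivots.append(p)
        if cols[k][p] == 0.0:
            raise ZeroDivisionError("matrix is singular")
        cols[k][k], cols[k][p] = cols[k][p], cols[k][k]
        # Store the negated multipliers below the diagonal.
        mult = [-v / cols[k][k] for v in cols[k][k + 1:]]
        cols[k][k + 1:] = mult
        # Update every remaining column with one DAXPY.
        for j in range(k + 1, n):
            cols[j][k], cols[j][p] = cols[j][p], cols[j][k]
            tail = cols[j][k + 1:]
            daxpy(cols[j][k], mult, tail)
            cols[j][k + 1:] = tail
    pivots.append(n - 1)
    if cols[n - 1][n - 1] == 0.0:
        raise ZeroDivisionError("matrix is singular")
    return pivots


def dgesl_like(cols, pivots, b):
    """Solve Ax = b from the factors: forward elimination, then back substitution."""
    n = len(cols)
    x = list(b)
    for k in range(n - 1):
        p = pivots[k]
        x[k], x[p] = x[p], x[k]
        tail = x[k + 1:]
        daxpy(x[k], cols[k][k + 1:], tail)
        x[k + 1:] = tail
    for k in range(n - 1, -1, -1):
        x[k] /= cols[k][k]
        head = x[:k]
        daxpy(-x[k], cols[k][:k], head)
        x[:k] = head
    return x
```

For example, factoring the columns `[[2, 4, -2], [1, -6, 7], [1, 0, 2]]` and solving with `b = [5, -2, 9]` recovers `x = [1, 1, 2]`. Storing the matrix by columns mirrors the Fortran layout, which is why the inner update is a contiguous vector operation.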
LINPACK • N=100: 100x100 system of equations. No change to the source code is allowed; the user supplies only a timing routine called SECOND, and any speedup must come from the compiler • N=1000: 1000x1000 system. The user can implement any code, provided it gives the required accuracy: Towards Peak Performance (TPP). The driver program always uses 2n^3/3 + 2n^2 as the operation count • “Highly Parallel Computing” benchmark – any software may be used and the matrix size can be chosen. Used in Top500 • Based on 64-bit floating point arithmetic
LINPACK • 100x100 – inner loop optimization • 1000x1000 – three-loop/whole program optimization • Scalable parallel program – Largest problem that can fit in memory • Template of Linpack code • Generate • Solve • Check • Time
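The generate/solve/check/time template can be sketched as below. This is a toy driver under stated assumptions, not the benchmark driver itself: the function names, the random matrix range, and the choice of right-hand side (row sums, so the exact solution is all ones) are illustrative. The rate uses the fixed operation count 2n^3/3 + 2n^2 regardless of how the solve is implemented.

```python
import random
import time


def generate(n, seed=1):
    """Generate: random dense A and right-hand side b = A * ones."""
    rng = random.Random(seed)
    a = [[rng.uniform(-0.5, 0.5) for _ in range(n)] for _ in range(n)]
    b = [sum(row) for row in a]
    return a, b


def solve(a, b):
    """Solve: Gaussian elimination with partial pivoting on copies of a and b."""
    n = len(a)
    a = [row[:] for row in a]
    x = b[:]
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(a[i][k]))
        a[k], a[p] = a[p], a[k]
        x[k], x[p] = x[p], x[k]
        for i in range(k + 1, n):
            m = a[i][k] / a[k][k]
            for j in range(k, n):
                a[i][j] -= m * a[k][j]
            x[i] -= m * x[k]
    for k in range(n - 1, -1, -1):
        s = sum(a[k][j] * x[j] for j in range(k + 1, n))
        x[k] = (x[k] - s) / a[k][k]
    return x


def check(a, b, x):
    """Check: largest absolute entry of the residual A*x - b."""
    return max(abs(sum(aij * xj for aij, xj in zip(row, x)) - bi)
               for row, bi in zip(a, b))


def run(n):
    """Time: report Mflop/s using the fixed count 2n^3/3 + 2n^2."""
    a, b = generate(n)
    start = time.perf_counter()
    x = solve(a, b)
    elapsed = time.perf_counter() - start
    flops = 2 * n**3 / 3 + 2 * n**2
    mflops = flops / elapsed / 1e6 if elapsed > 0 else float("inf")
    return check(a, b, x), elapsed, mflops
```

Calling `run(100)` returns the residual, the elapsed time, and the rate. Only the solve phase is timed; generation and checking stay outside the timed region.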
HPL Algorithm • 2-D block-cyclic data distribution • Right-looking LU • Panel factorization: various options • - Crout, left or right-looking recursive variants based on matrix multiply • - Number of sub-panels • - recursive stopping criteria • - pivot search and broadcast by binary-exchange
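A 2-D block-cyclic distribution deals fixed-size blocks of the matrix out to a P x Q process grid in round-robin fashion along each dimension. The following sketch shows the index arithmetic; the helper names and the parameters `nb` (block size), `p` and `q` (grid dimensions) are illustrative and not taken from the HPL source.

```python
def owner(i, j, nb, p, q):
    """Process-grid coordinates (row, col) that own global entry (i, j)."""
    return (i // nb) % p, (j // nb) % q


def local_index(g, nb, nprocs):
    """Local index of global index g along one dimension of the grid."""
    block = g // nb
    return (block // nprocs) * nb + g % nb


def layout(n, nb, p, q):
    """Owner of every entry of an n x n matrix, as a grid of (row, col) pairs."""
    return [[owner(i, j, nb, p, q) for j in range(n)] for i in range(n)]
```

With `nb = 2` on a 2 x 3 grid, entry (2, 4) belongs to process (1, 2) and entry (4, 6) wraps around to process (0, 0). The cyclic wrap is what keeps the load balanced as the factorization shrinks the active trailing matrix.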
HPL algorithm • Panel broadcast: - • Update of trailing matrix: - look-ahead pipeline • Validity check – the scaled residual should be O(1)
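One way to read "should be O(1)" is that the residual, once divided by machine epsilon and the natural scale of the problem, comes out as a small number. The sketch below uses the scaling reported by HPL 2.x as far as I recall it, ||Ax − b||∞ / (ε · (||A||∞ · ||x||∞ + ||b||∞) · N); treat both the exact form and the definition of ε used here (`sys.float_info.epsilon`) as assumptions to verify against the HPL documentation.

```python
import sys


def scaled_residual(a, x, b):
    """Infinity-norm residual scaled by eps, the problem norms, and N."""
    n = len(a)
    eps = sys.float_info.epsilon  # assumption: HPL may define eps differently
    r = [sum(aij * xj for aij, xj in zip(row, x)) - bi
         for row, bi in zip(a, b)]
    norm_r = max(abs(v) for v in r)
    norm_a = max(sum(abs(v) for v in row) for row in a)  # max row sum
    norm_x = max(abs(v) for v in x)
    norm_b = max(abs(v) for v in b)
    return norm_r / (eps * (norm_a * norm_x + norm_b) * n)
```

An exact solution gives 0, while a solution that is off by 1e-6 in one component of a small system gives a value many orders of magnitude above 1, which is the behaviour the check relies on.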
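Top500 (www.top500.org) • Top500 – started in 1993 • Updated twice a year – June and November • For each system, Top500 gives Nmax (problem size at which Rmax is achieved), Rmax (best measured LINPACK performance), N1/2 (problem size at which half of Rmax is achieved), and Rpeak (theoretical peak performance)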
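The relationship between these four numbers can be illustrated with a short sketch. The machine parameters and measurements in the usage note are made up for illustration, and `n_half` only picks the smallest measured size reaching half of Rmax rather than interpolating.

```python
def rpeak_gflops(nodes, cores_per_node, ghz, flops_per_cycle):
    """Theoretical peak: every core retiring its maximum flops on every cycle."""
    return nodes * cores_per_node * ghz * flops_per_cycle


def efficiency(rmax, rpeak):
    """Fraction of the theoretical peak achieved by the benchmark run."""
    return rmax / rpeak


def n_half(samples, rmax):
    """Smallest measured problem size whose rate reaches half of rmax.

    samples is a list of (n, rate) pairs; returns None if no size qualifies.
    """
    for n, rate in sorted(samples):
        if rate >= rmax / 2:
            return n
    return None
```

For a hypothetical machine with 4 nodes, 16 cores per node, 2.5 GHz, and 8 flops per cycle, `rpeak_gflops` gives 1280 Gflop/s; a measured Rmax of 960 Gflop/s would then correspond to 75% efficiency.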
NAS Parallel Benchmarks - NPB • Also for evaluation of Supercomputers • A set of 8 programs from CFD • 5 kernels, 3 pseudo applications • NPB 1 – Original benchmarks • NPB 2 – NAS’s MPI implementation. NPB 2.4 Class D has more work and more I/O • NPB 3 – based on OpenMP, HPF, Java • GridNPB3 – for computational grids • NPB 3 multi-zone – for hybrid parallelism
NPB 1.0 (March 1994) • Defines class A and class B versions • “Paper and pencil” algorithmic specifications • Generic benchmarks as compared to MPI-based LINPACK • General rules for implementations – Fortran 90 or C, 64-bit arithmetic etc. • Sample implementations provided
Kernel Benchmarks • EP – embarrassingly parallel • MG – multigrid. Regular communication • CG – conjugate gradient. Irregular long distance communication • FT – a 3-D PDE using FFT. Rigorous test of long distance communication • IS – large integer sort • Detailed rules regarding - brief statement of the problem - algorithm to be used - validation of results - where to insert timing calls - method for generating random numbers - submission of results
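The rule about random numbers matters because every implementation must produce the same sequence for results to be comparable. As a sketch, the NPB specification uses the linear congruential recurrence x_{k+1} = a · x_k mod 2^46 with a = 5^13, and EP turns pairs of these uniforms into Gaussian deviates that are tallied by square annulus. The code below follows that outline from memory; the function names, the bin count, and the omission of per-process seed skipping are simplifications, so treat it as an illustration rather than a conforming EP implementation.

```python
import math

A = 5 ** 13      # multiplier of the NPB generator
MOD = 2 ** 46    # modulus of the NPB generator


def randlc(state):
    """Advance the generator once; return (new_state, uniform in (0, 1))."""
    state = (A * state) % MOD
    return state, state / MOD


def ep_counts(pairs, seed=271828183, bins=10):
    """Tally accepted Gaussian pairs by annulus l <= max(|X|, |Y|) < l + 1."""
    counts = [0] * bins
    state = seed
    for _ in range(pairs):
        state, u1 = randlc(state)
        state, u2 = randlc(state)
        x, y = 2.0 * u1 - 1.0, 2.0 * u2 - 1.0
        t = x * x + y * y
        if 0.0 < t <= 1.0:  # keep only points inside the unit circle
            f = math.sqrt(-2.0 * math.log(t) / t)
            l = int(max(abs(x * f), abs(y * f)))
            if l < bins:
                counts[l] += 1
    return counts
```

Python integers are arbitrary precision, so the 46-bit product is exact here; the Fortran and C reference codes have to split the multiplication to get the same result in double precision. Each pair is independent of the others, which is why the kernel needs almost no communication.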
Pseudo applications / Synthetic CFDs • Benchmark 1 – perform a few iterations of the approximate factorization algorithm (BT, block tridiagonal) • Benchmark 2 – perform a few iterations of the diagonal form of the approximate factorization algorithm (SP, scalar pentadiagonal) • Benchmark 3 – perform a few iterations of SSOR (LU)
Class A and Class B • [Slide graphics not preserved in the transcript; panel titles: Class A, Sample Code, Class B]
NPB 2.0 (1995) • MPI and Fortran 77 implementations • 2 parallel kernels (MG, FT) and 3 simulated applications (LU, SP, BT) • Class C – bigger size • Benchmark rules – 0%, 5%, >5% change in source code
NPB 2.2 (1996), 2.4 (2002), 2.4 I/O (Jan 2003) • EP and IS added • FT rewritten • NPB 2.4 – class D and rationale for class D sizes • 2.4 I/O – a new benchmark problem based on BT (BTIO) to test the output capabilities • An MPI implementation of the same (MPI-IO) – different options, e.g. with or without collective buffering