This paper explores the performance evaluation of the Parallel Fast Multipole Algorithm (PFMA) through the lens of the Optimal Effectiveness metric. The study focuses on scalability analysis, architecture-dependent performance factors, and the impact of various computational structures on algorithm effectiveness. With an emphasis on large, irregular scientific applications, the authors present experimental results highlighting communication patterns, load balancing, and performance degradation. Insights into algorithmic efficiency and cost-performance trade-offs are provided, alongside future work directions in parallel computing.
Performance Evaluation of the Parallel Fast Multipole Algorithm Using the Optimal Effectiveness Metric Ioana Banicescu and Mark Bilderback Department of Computer Science and NSF/ERC for Computational Field Simulation Mississippi State University
Overview • Scientific Applications • Performance Evaluation • Scalability Analysis • Optimal Effectiveness Metric • Parallel Fast Multipole Algorithm • Experimental Results • Conclusions and Future Work
Scientific Applications • Large, computationally intensive, irregular • Parallel Implementation (various algorithms) • Performance degradation factors • Communication and load imbalance • architecture independent • architecture dependent
Architecture Independent Factors • Problem characteristics • nonuniformity of input data • Algorithmic • serial section • communication patterns • local / non-local dependencies
Architecture Dependent Factors • Architectural characteristics • Language, OS • Interconnection network • Characteristics of each component processor • speed, memory, etc.
Performance Evaluation • Parallel Applications • Scalability • algorithm, architecture, mapping • Evaluation • Isolated to particular applications • Different types of performance metrics • Performance metric characteristics • Relevant, consistent, quantitative, predictive
Performance Metrics • Commonly used (time, speedup, efficiency, cost) • Speedup [Amdahl ‘67] • Scaled Speedup [Gustafson ‘88] • Fixed time size-up [Sun and Gustafson ‘91] • Isoefficiency [Gupta & Kumar ‘93] • Optimal effectiveness [Luke, Banicescu, Li ‘98]
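These commonly used metrics fall directly out of wall-clock timings. A minimal sketch in Python; the serial time T1 and the per-processor timings are hypothetical numbers for illustration only.

```python
# Hypothetical timings: T1 is the serial time, times[p] the time on p processors.
T1 = 100.0
times = {4: 27.0, 8: 15.0, 16: 9.0, 32: 6.5, 64: 5.5}

for p, Tp in sorted(times.items()):
    speedup = T1 / Tp          # S(p) = T(1) / T(p)
    efficiency = speedup / p   # E(p) = S(p) / p
    cost = p * Tp              # C(p) = p * T(p)
    print(f"p={p:3d}  S={speedup:6.2f}  E={efficiency:5.2f}  C={cost:7.1f}")
```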
Isoefficiency • Algorithms that can add processors at a faster rate are able to achieve higher performance. • Does not identify the number of processors required before an algorithm becomes an effective option. • Discounts valuable parallel algorithms for which an isoefficiency function does not exist.
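To make the isoefficiency idea concrete, here is a minimal sketch under an assumed cost model T(W, p) = W/p + 2·log2(p) (computation plus a hypothetical communication overhead), solving for how fast the problem size W must grow to hold efficiency fixed as processors are added:

```python
import math

# Assumed model: T(W, p) = W/p + 2*log2(p), so
# E(W, p) = T(W, 1) / (p * T(W, p)) = W / (W + 2*p*log2(p)).
# Holding E fixed at E0 and solving for W gives the isoefficiency function.
E0 = 0.8
for p in [4, 8, 16, 32, 64]:
    overhead = 2 * p * math.log2(p)
    W = E0 / (1 - E0) * overhead   # from W / (W + overhead) = E0
    print(f"p={p:3d}: W must reach {W:9.1f} to keep E = {E0}")
```

Note that this says nothing about whether running on 64 processors is worth their cost, which is the gap the optimal effectiveness metric addresses.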
Performance - Cost Tradeoffs • High-performance applications seek a performance-cost balance. • Scalability analysis - theoretical, experimental. • Optimal effectiveness [Luke, Banicescu, Li ‘98] • Similar to (E*S)max [Tang, Li ‘90] • Asymptotic relationship between isoefficiency and (E*S)max
Optimal Effectiveness • Cost Effectiveness: Γ(p), the performance delivered per unit cost p·T(p); closely related to the product E(p)·S(p) • Optimal Effectiveness: Γopt = max over p of Γ(p), attained at p = Popt
Optimal Effectiveness (contd.) • Compare the performance of different parallel algorithms. • Identify specific conditions of problem size and number of processors that characterize crossover points and intervals where one algorithm becomes more cost effective than another. • Prescribe the number of processors relevant to a particular problem size: Popt.
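A minimal sketch of reading Popt off measured timings, assuming the E(p)·S(p) form of the cost-effectiveness curve suggested by the (E*S)max comparison above (the exact definition in Luke, Banicescu, Li ‘98 may differ); all timings are hypothetical:

```python
# Locate Popt from measured run times, assuming Gamma(p) = E(p) * S(p).
T1 = 100.0
times = {4: 27.0, 8: 15.0, 16: 9.0, 32: 6.5, 64: 5.5}   # hypothetical

def gamma(p, Tp):
    S = T1 / Tp        # speedup
    E = S / p          # efficiency
    return E * S       # equivalently T1**2 / (p * Tp**2)

p_opt = max(times, key=lambda p: gamma(p, times[p]))
print({p: round(gamma(p, t), 2) for p, t in sorted(times.items())})
print("Popt =", p_opt, " Gamma_opt =", round(gamma(p_opt, times[p_opt]), 2))
```

With these numbers Γ(p) rises, peaks at p = 16, then falls; the peak value is Γopt and its location is Popt, the allocation the metric prescribes.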
The N-body Problem • [Figure: resulting force on a particle] • Problem: Simulate the evolution of N particles over time (given initial positions and velocities) • Compute new positions and velocities of the N particles after one time step • Applications: astrophysics, molecular dynamics • Naive algorithm: O(N²)
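A minimal sketch of the naive direct-sum step in Python (gravitational interactions with an explicit Euler update; the constants and the softening term EPS are illustrative choices, not from the paper):

```python
import math

G, DT, EPS = 1.0, 0.01, 1e-3   # illustrative: gravity constant, time step, softening

def step(pos, vel, mass):
    """Advance N bodies one time step with the O(N^2) all-pairs force sum."""
    n = len(pos)
    acc = [[0.0, 0.0, 0.0] for _ in range(n)]
    for i in range(n):                 # every pair: the O(N^2) bottleneck
        for j in range(n):
            if i == j:
                continue
            d = [pos[j][k] - pos[i][k] for k in range(3)]
            r2 = d[0]**2 + d[1]**2 + d[2]**2 + EPS**2
            f = G * mass[j] / (r2 * math.sqrt(r2))
            for k in range(3):
                acc[i][k] += f * d[k]
    for i in range(n):                 # new velocities and positions
        for k in range(3):
            vel[i][k] += DT * acc[i][k]
            pos[i][k] += DT * vel[i][k]
```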
Approximation Algorithms • O(N) [Appel85] • O(N log N) [Barnes-Hut86] • O(N) Fast Multipole Algorithm (FMA) [Greengard87a] • Approximates particle interactions within a specified accuracy (Zhao, Board, Pringle, ...) • O(N) Adaptive Fast Multipole Algorithm (AFMA) [Greengard87b] • Singh et al., Nyland et al., etc.
The Greengard Algorithm • Two traversals: • upward • downward • 2D: Quad-tree • 3D: Oct-tree
Traversing the Tree Upwards • [Figure: a well-separated group of particles replaced by an equivalent particle at an evaluation point] • Computing combined field effects of particles in regions • Multipole expansion
Traversing the Tree Downwards • [Figure: passing field information from higher levels to lower levels of the tree]
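The authors' 3D-PFMA is not reproduced here. As a much-simplified sketch of the two traversals, the following builds a 2D quad-tree, forms an "equivalent particle" per cell on the upward pass (the monopole term only, standing in for the full multipole expansion), and evaluates well-separated cells with it on the way down; the acceptance test is Barnes-Hut-style rather than the true FMA downward pass with local expansions.

```python
import random

class Cell:
    def __init__(self, cx, cy, half):
        self.cx, self.cy, self.half = cx, cy, half   # center and half-width
        self.children, self.bodies = [], []
        self.mass, self.comx, self.comy = 0.0, 0.0, 0.0

def build(cell, bodies, leaf_size=4, depth=0):
    # Recursively split the domain into quadrants (the quad-tree).
    cell.bodies = bodies
    if len(bodies) <= leaf_size or depth > 32:
        return
    h = cell.half / 2
    quads = {(sx, sy): [] for sx in (False, True) for sy in (False, True)}
    for b in bodies:
        quads[(b[0] >= cell.cx, b[1] >= cell.cy)].append(b)
    for (sx, sy), qb in quads.items():
        if qb:
            child = Cell(cell.cx + (h if sx else -h),
                         cell.cy + (h if sy else -h), h)
            build(child, qb, leaf_size, depth + 1)
            cell.children.append(child)

def upward(cell):
    # Upward traversal: form each cell's equivalent particle (total mass
    # at the center of mass) from the bodies it contains.
    for c in cell.children:
        upward(c)
    cell.mass = sum(m for _, _, m in cell.bodies)
    if cell.mass:
        cell.comx = sum(x * m for x, _, m in cell.bodies) / cell.mass
        cell.comy = sum(y * m for _, y, m in cell.bodies) / cell.mass

def potential(cell, x, y, theta=0.5):
    # Downward evaluation: a well-separated cell contributes through its
    # equivalent particle; otherwise descend (direct sum at the leaves).
    dx, dy = cell.comx - x, cell.comy - y
    r = (dx * dx + dy * dy) ** 0.5
    if r > 0 and cell.half / r < theta:          # well-separated
        return cell.mass / r
    if not cell.children:                        # leaf: direct sum
        return sum(m / ((bx - x) ** 2 + (by - y) ** 2) ** 0.5
                   for bx, by, m in cell.bodies if (bx, by) != (x, y))
    return sum(potential(c, x, y, theta) for c in cell.children)

# Usage (illustrative): 1000 unit-mass bodies on the unit square.
bodies = [(random.random(), random.random(), 1.0) for _ in range(1000)]
root = Cell(0.5, 0.5, 0.5)
build(root, bodies)
upward(root)
print(potential(root, 0.25, 0.25))
```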
Implementation • 3D-PFMA, LB[Duke], Fractiling • KSR-1, IBM-SP2, SuperMSPARC • Pthreads, MPI • Uniform, Nonuniform (Gaussian, Corner) • 4 - 64 processors, 1k - 100k particles
3-d Cost: nonuniform (corner) (KSR1) • [Plot: cost in seconds vs. number of processors] • Densely packed (50K5) • Lightly packed (50K6) • LB better for 4-16 processors
Cost vs. Cost Effectiveness • 10k particles, nonuniform (corner) distribution • Fractiling cost < LB cost < PFMA cost (regardless of the number of processors). • The IDEAL number of processors to use for a cost-effective execution is unknown. • Allocate only Popt processors and leave the rest for other simultaneously executing applications.
Conclusions • Cost effectiveness analysis - a novel approach. • Qualitative and quantitative characteristics. • Optimal effectiveness derived from cost effectiveness curves. • Measurement of Γopt gives the exact number of processors relevant to a particular problem size.
Conclusions (contd.) • Cost effectiveness / Optimal effectiveness: • Quantifies specific conditions that make a particular algorithm optimal. • Can compare any set of algorithms regardless of whether an isoefficiency function exists. • Γopt shows the point at which using one algorithm becomes more advantageous than using another.
Conclusions (contd.) • Cost effectiveness / Optimal effectiveness: • Allows intelligent allocation of available processors to other applications. • Improved throughput for the entire system. • Captures the impact and tradeoff in complexity of the conditions that dictate performance.