This paper explores the performance evaluation of the Parallel Fast Multipole Algorithm (PFMA) through the lens of the Optimal Effectiveness metric. The study focuses on scalability analysis, architecture-dependent performance factors, and the impact of various computational structures on algorithm effectiveness. With an emphasis on large, irregular scientific applications, the authors present experimental results highlighting communication patterns, load balancing, and performance degradation. Insights into algorithmic efficiency and cost-performance trade-offs are provided, alongside future work directions in parallel computing.
Performance Evaluation of the Parallel Fast Multipole Algorithm Using the Optimal Effectiveness Metric Ioana Banicescu and Mark Bilderback Department of Computer Science and NSF/ERC for Computational Field Simulation Mississippi State University
Overview • Scientific Applications • Performance Evaluation • Scalability Analysis • Optimal Effectiveness Metric • Parallel Fast Multipole Algorithm • Experimental Results • Conclusions and Future Work
Scientific Applications • Large, computationally intensive, irregular • Parallel Implementation (various algorithms) • Performance degradation factors • Communication and load imbalance • architecture independent • architecture dependent
Architecture Independent Factors • Problem characteristics • nonuniformity of input data • Algorithmic • serial section • communication patterns • local / non-local dependencies
Architecture Dependent Factors • Architectural characteristics • Language, OS • Interconnection network • Characteristics of each component processor • speed, memory, etc.
Performance Evaluation • Parallel Applications • Scalability • algorithm, architecture, mapping • Evaluation • Isolated to particular applications • Different types of performance metrics • Performance metric characteristics • Relevant, consistent, quantitative, predictive
Performance Metrics • Commonly used (time, speedup, efficiency, cost) • Speedup [Amdahl ‘67] • Scaled Speedup [Gustafson ‘88] • Fixed time size-up [Sun and Gustafson ‘91] • Isoefficiency [Gupta & Kumar ‘93] • Optimal effectiveness [Luke, Banicescu, Li ‘98]
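These commonly used metrics fall directly out of wall-clock timings. A minimal sketch in Python; the serial time T1 and the per-processor timings are hypothetical numbers for illustration only.

```python
# Hypothetical timings: T1 is the serial time, times[p] the time on p processors.
T1 = 100.0
times = {4: 27.0, 8: 15.0, 16: 9.0, 32: 6.5, 64: 5.5}

for p, Tp in sorted(times.items()):
    speedup = T1 / Tp          # S(p) = T(1) / T(p)
    efficiency = speedup / p   # E(p) = S(p) / p
    cost = p * Tp              # C(p) = p * T(p)
    print(f"p={p:3d}  S={speedup:6.2f}  E={efficiency:5.2f}  C={cost:7.1f}")
```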
Isoefficiency • Algorithms that can add processors at a faster rate are able to achieve higher performance. • Does not identify the number of processors required before an algorithm becomes an effective option. • Discounts valuable parallel algorithms for which an isoefficiency function does not exist.
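To make the isoefficiency idea concrete, here is a minimal sketch under an assumed cost model T(W, p) = W/p + 2·log2(p) (computation plus a hypothetical communication overhead), solving for how fast the problem size W must grow to hold efficiency fixed as processors are added:

```python
import math

# Assumed model: T(W, p) = W/p + 2*log2(p), so
# E(W, p) = T(W, 1) / (p * T(W, p)) = W / (W + 2*p*log2(p)).
# Holding E fixed at E0 and solving for W gives the isoefficiency function.
E0 = 0.8
for p in [4, 8, 16, 32, 64]:
    overhead = 2 * p * math.log2(p)
    W = E0 / (1 - E0) * overhead   # from W / (W + overhead) = E0
    print(f"p={p:3d}: W must reach {W:9.1f} to keep E = {E0}")
```

Note that this says nothing about whether running on 64 processors is worth their cost, which is the gap the optimal effectiveness metric addresses.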
Performance - Cost Tradeoffs • High-performance applications seek a performance-cost balance. • Scalability analysis - theoretical, experimental. • Optimal effectiveness [Luke, Banicescu, Li ‘98] • Similar to (E*S)max [Tang, Li ‘90] • Asymptotic relationship between isoefficiency and (E*S)max
Optimal Effectiveness • Cost Effectiveness: Γ(p), the performance delivered per unit cost p·T(p); closely related to the product E(p)·S(p) • Optimal Effectiveness: Γopt = max over p of Γ(p), attained at p = Popt
Optimal Effectiveness (contd.) • Compare the performance of different parallel algorithms. • Identify specific conditions of problem size and number of processors that characterize crossover points and intervals where one algorithm becomes more cost effective than another. • Prescribe the number of processors relevant to a particular problem size: Popt.
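A minimal sketch of reading Popt off measured timings, assuming the E(p)·S(p) form of the cost-effectiveness curve suggested by the (E*S)max comparison above (the exact definition in Luke, Banicescu, Li ‘98 may differ); all timings are hypothetical:

```python
# Locate Popt from measured run times, assuming Gamma(p) = E(p) * S(p).
T1 = 100.0
times = {4: 27.0, 8: 15.0, 16: 9.0, 32: 6.5, 64: 5.5}   # hypothetical

def gamma(p, Tp):
    S = T1 / Tp        # speedup
    E = S / p          # efficiency
    return E * S       # equivalently T1**2 / (p * Tp**2)

p_opt = max(times, key=lambda p: gamma(p, times[p]))
print({p: round(gamma(p, t), 2) for p, t in sorted(times.items())})
print("Popt =", p_opt, " Gamma_opt =", round(gamma(p_opt, times[p_opt]), 2))
```

With these numbers Γ(p) rises, peaks at p = 16, then falls; the peak value is Γopt and its location is Popt, the allocation the metric prescribes.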
The N-body Problem • [Figure: resulting force on a particle] • Problem: Simulate the evolution of N particles over time (given initial positions and velocities) • Compute new positions and velocities of the N particles after one time step • Applications: astrophysics, molecular dynamics • Naive algorithm: O(N²)
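A minimal sketch of the naive direct-sum step in Python (gravitational interactions with an explicit Euler update; the constants and the softening term EPS are illustrative choices, not from the paper):

```python
import math

G, DT, EPS = 1.0, 0.01, 1e-3   # illustrative: gravity constant, time step, softening

def step(pos, vel, mass):
    """Advance N bodies one time step with the O(N^2) all-pairs force sum."""
    n = len(pos)
    acc = [[0.0, 0.0, 0.0] for _ in range(n)]
    for i in range(n):                 # every pair: the O(N^2) bottleneck
        for j in range(n):
            if i == j:
                continue
            d = [pos[j][k] - pos[i][k] for k in range(3)]
            r2 = d[0]**2 + d[1]**2 + d[2]**2 + EPS**2
            f = G * mass[j] / (r2 * math.sqrt(r2))
            for k in range(3):
                acc[i][k] += f * d[k]
    for i in range(n):                 # new velocities and positions
        for k in range(3):
            vel[i][k] += DT * acc[i][k]
            pos[i][k] += DT * vel[i][k]
```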
Approximation Algorithms • O(N) [Appel85] • O(N log N) [Barnes-Hut86] • O(N) Fast Multipole Algorithm (FMA) [Greengard87a] • Approximates particle interactions within a specified accuracy (Zhao, Board, Pringle, ...) • O(N) Adaptive Fast Multipole Algorithm (AFMA) [Greengard87b] • Singh et al., Nyland et al., etc.
The Greengard Algorithm • Two traversals: • upward • downward • 2D: Quad-tree • 3D: Oct-tree
Traversing the Tree Upwards • [Figure: a well-separated group of particles replaced by an equivalent particle at an evaluation point] • Computing combined field effects of particles in regions • Multipole expansion
Traversing the Tree Downwards • [Figure: passing field information from higher levels to lower levels of the tree]
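The authors' 3D-PFMA is not reproduced here. As a much-simplified sketch of the two traversals, the following builds a 2D quad-tree, forms an "equivalent particle" per cell on the upward pass (the monopole term only, standing in for the full multipole expansion), and evaluates well-separated cells with it on the way down; the acceptance test is Barnes-Hut-style rather than the true FMA downward pass with local expansions.

```python
import random

class Cell:
    def __init__(self, cx, cy, half):
        self.cx, self.cy, self.half = cx, cy, half   # center and half-width
        self.children, self.bodies = [], []
        self.mass, self.comx, self.comy = 0.0, 0.0, 0.0

def build(cell, bodies, leaf_size=4, depth=0):
    # Recursively split the domain into quadrants (the quad-tree).
    cell.bodies = bodies
    if len(bodies) <= leaf_size or depth > 32:
        return
    h = cell.half / 2
    quads = {(sx, sy): [] for sx in (False, True) for sy in (False, True)}
    for b in bodies:
        quads[(b[0] >= cell.cx, b[1] >= cell.cy)].append(b)
    for (sx, sy), qb in quads.items():
        if qb:
            child = Cell(cell.cx + (h if sx else -h),
                         cell.cy + (h if sy else -h), h)
            build(child, qb, leaf_size, depth + 1)
            cell.children.append(child)

def upward(cell):
    # Upward traversal: form each cell's equivalent particle (total mass
    # at the center of mass) from the bodies it contains.
    for c in cell.children:
        upward(c)
    cell.mass = sum(m for _, _, m in cell.bodies)
    if cell.mass:
        cell.comx = sum(x * m for x, _, m in cell.bodies) / cell.mass
        cell.comy = sum(y * m for _, y, m in cell.bodies) / cell.mass

def potential(cell, x, y, theta=0.5):
    # Downward evaluation: a well-separated cell contributes through its
    # equivalent particle; otherwise descend (direct sum at the leaves).
    dx, dy = cell.comx - x, cell.comy - y
    r = (dx * dx + dy * dy) ** 0.5
    if r > 0 and cell.half / r < theta:          # well-separated
        return cell.mass / r
    if not cell.children:                        # leaf: direct sum
        return sum(m / ((bx - x) ** 2 + (by - y) ** 2) ** 0.5
                   for bx, by, m in cell.bodies if (bx, by) != (x, y))
    return sum(potential(c, x, y, theta) for c in cell.children)

# Usage (illustrative): 1000 unit-mass bodies on the unit square.
bodies = [(random.random(), random.random(), 1.0) for _ in range(1000)]
root = Cell(0.5, 0.5, 0.5)
build(root, bodies)
upward(root)
print(potential(root, 0.25, 0.25))
```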
Implementation • 3D-PFMA, LB[Duke], Fractiling • KSR-1, IBM-SP2, SuperMSPARC • Pthreads, MPI • Uniform, Nonuniform (Gaussian, Corner) • 4 - 64 processors, 1k - 100k particles
3-d Cost: nonuniform (corner) (KSR1) • [Plot: cost in seconds vs. number of processors] • Densely packed (50K5) • Lightly packed (50K6) • LB better for 4-16 processors
Cost vs. Cost Effectiveness • 10k particles, nonuniform (corner) distribution • Fractiling cost < LB cost < PFMA cost (regardless of the number of processors). • The IDEAL number of processors to use for a cost-effective execution is unknown. • Allocate only Popt processors and leave the rest for other simultaneously executing applications.
Conclusions • Cost effectiveness analysis - a novel approach. • Qualitative and quantitative characteristics. • Optimal effectiveness derived from cost effectiveness curves. • Measurement of Γopt gives the exact number of processors relevant to a particular problem size.
Conclusions (contd.) • Cost effectiveness / Optimal effectiveness: • Quantifies specific conditions that make a particular algorithm optimal. • Can compare any set of algorithms regardless of whether an isoefficiency function exists. • Γopt shows the point at which using one algorithm becomes more advantageous than using another.
Conclusions (contd.) • Cost effectiveness / Optimal effectiveness: • Allows intelligent allocation of available processors to other applications. • Improved throughput for the entire system. • Captures the impact and tradeoff in complexity of the conditions that dictate performance.