This presentation examines the NAS Parallel Benchmarks suite 2.2 (NPB), a widely used tool for evaluating modern parallel systems. It studies how the benchmarks scale on the Network of Workstations (NOW) and the SGI Origin 2000, and reports findings on computation and communication efficiency. It covers the hardware configurations, time breakdowns, and benchmark comparisons, and it shows why speedup alone does not capture scalability.
Understanding Application Scaling: NAS Parallel Benchmarks 2.2 on NOW and SGI Origin 2000 • Frederick Wong, Rich Martin, Remzi Arpaci-Dusseau, David Wu, and David Culler • {fredwong, rmartin, remzi, davidwu, culler}@CS.Berkeley.EDU • Department of Electrical Engineering and Computer Science, Computer Science Division, University of California, Berkeley • June 15th, 1998
Introduction • NAS Parallel Benchmarks suite 2.2 (NPB) has been widely used to evaluate modern parallel systems • 7 scientific benchmarks that represent the most common computation kernels • NPB is written on top of Message Passing Interface (MPI) for portability • NPB is a Constant Problem Size (CPS) scaling benchmark suite • This study focuses on understanding NPB scaling on both NOW and SGI Origin 2000
Motivation • Early study on NPB shows ideal speedup on NOW! • Scaling as good as T3D and better than SP-2 • Per node performance better than T3D, close to SP-2 • Submitted results for Origin 2000 show a spread
Presentation Outline • Hardware Configuration • Time Breakdown of the Applications • Communication Performance • Computation Performance • Conclusion
Hardware Configuration • SGI Origin 2000 (64 nodes) • MIPS R10000 processor, 195 MHz, 32KB/32KB L1 • 4MB external L2 cache per processor • 16GB memory total • MPI performance: 13 μsec one-way latency, 150 MB/s peak bandwidth, half-power point at 8KB message size • Network Of Workstations (NOW) • UltraSPARC I processor, 167 MHz, 16KB/16KB L1 • 512KB external L2 cache per processor • 128MB memory per processor • MPI performance: 22 μsec one-way latency, 27 MB/s peak bandwidth, half-power point at 4KB message size
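The peak-bandwidth and half-power figures above can be turned into a rough estimate of delivered bandwidth per message size. The sketch below uses a Hockney-style model, which is an assumption here and not something the slides state; it also assumes the peak figures are in MB/s. The function name and the chosen message sizes are illustrative only.

```python
def delivered_bandwidth(msg_bytes, peak_mb_s, half_power_bytes):
    """Hockney-style estimate (assumed model, not from the slides):
    bandwidth reaches half of peak at half_power_bytes."""
    return peak_mb_s * msg_bytes / (msg_bytes + half_power_bytes)


# Figures taken from the hardware slide; MB/s units are an assumption.
PLATFORMS = {
    "Origin 2000": {"peak_mb_s": 150.0, "half_power_bytes": 8 * 1024},
    "NOW": {"peak_mb_s": 27.0, "half_power_bytes": 4 * 1024},
}

if __name__ == "__main__":
    for name, params in PLATFORMS.items():
        for size in (1024, 4096, 8192, 65536):
            bw = delivered_bandwidth(size, **params)
            print(f"{name:12s} {size:6d} B -> {bw:6.1f} MB/s")
```

Under this model, a message at the half-power size gets exactly half the peak, so small messages see far less than the quoted peak on either platform.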
Time Breakdown -- LU • Black line -- total running time • Analogy: a job takes one worker 10 secs • ideally, 2 workers need 5 secs each • total amount of work stays at 10 secs • In practice: more total work, plus the need for communication
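The accounting behind the analogy can be sketched as follows: total work is wall-clock time summed over all processors, and anything above the single-processor time is extra work (communication or added computation). The function names are illustrative, and the 6-second run is a hypothetical value, not a measurement from the study.

```python
def total_work(wall_time_s, processors):
    """Total work = wall-clock time summed over all processors."""
    return wall_time_s * processors


def extra_work(serial_time_s, wall_time_s, processors):
    """Work beyond what the single-processor run needed."""
    return total_work(wall_time_s, processors) - serial_time_s


if __name__ == "__main__":
    serial = 10.0            # one worker: 10 secs (from the slide)
    ideal = serial / 2       # two workers, ideal case: 5 secs each
    print(extra_work(serial, ideal, 2))   # ideal scaling adds no work
    # Hypothetical non-ideal run: each of 2 workers takes 6 secs.
    print(extra_work(serial, 6.0, 2))
```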
Communication Performance • Micro-benchmarks show that SGI O2000 has better pt2pt comm. performance than NOW
Communication Efficiency • absolute bandwidths delivered are close • SP/32 on NOW -- 215s • SP/32 on SGI -- 289s • comm. on SGI achieved only 30% of potential bandwidth • protocol tradeoffs are pronounced • hand-shake vs. bulk-send in pt2pt • collective ops
Computation Performance • Relative performance of the benchmarks on a single node is roughly in line with the processor performance difference • Both computational CPI and L2 misses change significantly on both platforms when scaled
Recap on CPS Scaling • [Figure not reproduced; axis ticks: 4, 8, 16, 32, 64, 128, 256]
LU Working Set • 4-processor • Knee starts at 256KB • 8-processor • Knee starts at 128KB • 16-processor • Knee starts at 64KB • 32-processor • Knee starts at 32KB • miss rate drops as the global cache grows from 2MB to 4MB
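The knee positions above halve each time the processor count doubles, which is what constant-problem-size scaling would predict. A minimal sketch of that relationship follows. It assumes the knee scales exactly as 1/P from the 4-processor reference point, and that the global cache is the sum of per-processor L2 caches on NOW (512KB each, from the hardware slide); the function names are illustrative.

```python
KB = 1024
MB = 1024 * KB
L2_PER_PROCESSOR = 512 * KB  # NOW L2 size, from the hardware slide


def knee_bytes(processors, ref_processors=4, ref_knee=256 * KB):
    """Per-processor working-set knee, assuming it scales as 1/P."""
    return ref_knee * ref_processors // processors


def global_cache_bytes(processors):
    """Aggregate L2 capacity across all processors (assumed definition)."""
    return processors * L2_PER_PROCESSOR


if __name__ == "__main__":
    for p in (4, 8, 16, 32):
        print(f"{p:2d} processors: knee {knee_bytes(p) // KB:3d} KB, "
              f"global cache {global_cache_bytes(p) // MB:2d} MB")
```

With these assumptions, 4 processors give 2MB of global cache and 8 processors give 4MB, which matches the range named in the last bullet.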
SP Working Set • Cost under scaling • extra work worsens the memory system's performance • total memory references on SGI • 4-processor run has 64.38 billion memory references • 25-processor run has 72.35 billion memory references • 12.38% increase • [Figure legend: Cost, Benefit]
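The 12.38% figure follows directly from the two reference counts. A short check, with an illustrative function name:

```python
def percent_increase(before, after):
    """Relative growth from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


if __name__ == "__main__":
    refs_4 = 64.38   # billion memory references, 4 processors (from the slide)
    refs_25 = 72.35  # billion memory references, 25 processors (from the slide)
    print(f"{percent_increase(refs_4, refs_25):.2f}% more memory references")
```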
Conclusion • NPB • comm. performance is hard to predict from μ-benchmarks • increases in global cache effectively reduce comp. time • sequential node arch. is a dominant factor in NPB perf. • NOW • an inexpensive way to go parallel • absolute performance is excellent • MPI on NOW has good scalability and performance • NOW vs. proprietary system -- detailed instrumentation ability • speedup cannot tell the whole story; scalability involves: • the interplay of program and machine scaling • delivered comm. performance, not μ-benchmarks • complicated memory system performance