This presentation explores the performance benchmarking of Unified Parallel C (UPC) across various platforms, as discussed at the 4th PMEO-PDS Workshop in Denver, Colorado. It covers background on UPC, its implementations, previous performance studies, and results from experiments conducted using synthetic and application benchmarks. Notable findings include performance comparisons of the MuPC, HP UPC, and Berkeley UPC compilers across platforms like Cray, SGI, and Linux clusters. The analysis highlights the impact of memory access patterns on performance, emphasizing the need for optimization in shared memory references.
Presentation at the 4th PMEO-PDS Workshop
Benchmark Measurements of Current UPC Platforms
Zhang Zhang and Steve Seidel, Michigan Technological University
Denver, Colorado, 3/22/2005
Presentation Outline
• Background
  • Unified Parallel C, implementations and users.
  • Previous UPC performance studies.
• Experiments
  • Available UPC platforms
  • Benchmarks
• Performance measurements
• Conclusions
UPC Overview
• UPC is an extension of C for partitioned shared memory parallel programming.
  • A special case of the shared memory programming model.
  • Similar languages: Co-Array Fortran, Titanium.
  • UPC homepage: http://www.upc.gwu.edu
• Platforms supported:
  • Cray X1, Cray T3E, SGI Origin, HP AlphaServer, HP-UX, Linux clusters, IBM SP.
• UPC compilers:
  • Open source: MuPC, Berkeley UPC, Intrepid UPC
  • Commercial: HP UPC, Cray UPC
• Users:
  • LBNL, IDA, AHPCRC, …
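To make the partitioned shared memory model concrete, here is a minimal UPC sketch (a hypothetical example, not code from the talk): a shared array is distributed cyclically across threads, and `upc_forall` assigns each iteration to the thread whose partition holds that element.

```c
/* Minimal UPC sketch (hypothetical, not from the talk).
   Compile with a UPC compiler, e.g. Berkeley upcc. */
#include <upc_relaxed.h>
#include <stdio.h>

#define N 1024

shared double a[N];   /* distributed cyclically across THREADS */

int main(void)
{
    int i;

    /* Each thread executes only the iterations whose element has
       affinity to it (selected by the &a[i] affinity clause). */
    upc_forall (i = 0; i < N; i++; &a[i])
        a[i] = 2.0 * i;

    upc_barrier;      /* all threads synchronize here */

    if (MYTHREAD == 0)
        printf("threads = %d, a[10] = %g\n", THREADS, a[10]);
    return 0;
}
```

The affinity clause is what makes locality explicit: whether a reference like `a[i]` is local or remote depends on the array's layout, which is exactly the distinction the benchmarks in this talk measure.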
Related UPC Performance Studies
• Performance benchmark suites
  • UPC_Bench (GWU)
    • Synthetic microbenchmark based on the STREAM benchmark.
    • Application benchmarks: Sobel edge detection, matrix multiplication, N-Queens problem
  • UPC NAS Parallel Benchmarks (GWU)
• Performance monitoring
  • Performance analysis for the HP UPC compiler (GWU)
  • Performance of Berkeley UPC on HP AlphaServer (Berkeley)
  • Performance of Intrepid UPC on SGI Origin (GWU)
Benchmarking UPC Systems
• Extended shared memory bandwidth microbenchmarks to cover various reference patterns:
  • Scalar references: 11 access patterns
  • Block memory operations: 9 access patterns
• Benchmarked six combinations of available UPC compilers and platforms using both the UPC STREAM (MTU code) and the UPC NAS Parallel Benchmarks (GWU code).
  • Compilers: MuPC, HP UPC, Berkeley UPC and Intrepid UPC
  • Platforms: Myrinet Linux cluster, HP AlphaServer SC, and Cray T3E
• The first comparison of performance for currently available UPC implementations.
• The first report on MuPC performance.
Benchmarks
• Synthetic benchmarks:
  • The STREAM microbenchmark was rewritten in UPC with a wider variety of shared memory access patterns:
    • Local shared read / write
    • Unit-stride shared read / write / copy
    • Random shared read / write / copy
    • Stride-n shared read / write / copy
  • Block transfers with variations of source and sink affinities.
• NAS Parallel Benchmark Suite v2.4
  • The UPC version was developed at GWU.
  • Five kernels: CG, EP, FT, IS and MG.
  • Two variations: naïve version and hand-tuned version.
  • Input size: Class A workload.
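The scalar access-pattern categories above can be sketched as simple kernels. The following is a hypothetical illustration of three of them (illustrative only; the actual MTU STREAM code differs): the distinction being measured is whether the referenced shared elements are owned locally, consecutive (unit stride), or strided through shared space.

```c
/* Hypothetical sketches of three scalar access patterns in the
   extended STREAM microbenchmark (not the actual MTU code). */
#include <upc_relaxed.h>

#define N (1 << 20)

shared double src[N], dst[N];

/* Local shared read: each thread touches only elements it owns,
   so no network traffic is generated. */
double local_shared_read(void)
{
    int i;
    double s = 0.0;
    upc_forall (i = 0; i < N; i++; &src[i])
        s += src[i];
    return s;
}

/* Unit-stride shared read: consecutive elements; under the default
   cyclic layout most references are remote when THREADS > 1. */
double unit_stride_read(void)
{
    int i;
    double s = 0.0;
    for (i = 0; i < N; i++)
        s += src[i];
    return s;
}

/* Stride-n shared copy: every stride-th element is copied, which
   defeats remote-reference caching for large strides. */
void stride_n_copy(int stride)
{
    int i;
    for (i = 0; i < N; i += stride)
        dst[i] = src[i];
}
```

Timing loops like these against the element count and size yields the bandwidth figures reported in the following slides.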
Local Shared References
• Intrepid UPC: performance is poor on local shared accesses.
• HP UPC: cache state has significant effects on local shared accesses.
Remote Shared References
• HP UPC and MuPC: caches help unit-stride remote shared accesses.
• Intrepid UPC performs best on remote shared accesses.
Block Memory Operations
• HP UPC: performance is poor on certain string functions.
• Intrepid UPC: low performance in all categories.
NPB – CG
• The only case that scales well: Berkeley UPC + optimized code.
NPB – FT
• HP, Berkeley and MuPC: performance is comparable.
NPB – IS
• HP, Berkeley and MuPC: performance is comparable.
NPB – MG
• MG performance is very inconsistent.
Conclusions
• STREAM benchmarking:
  • UPC language overhead reduces the performance of local shared references.
  • Remote reference caching helps stride-1 accesses.
  • Copying between two locations with the same affinity to a remote thread needs optimization.
• NPB benchmarking:
  • Some implementations failed on some benchmarks; more stable and reliable implementations are needed.
  • Hand-tuning techniques (e.g. prefetching) are critical to performance.
  • Berkeley UPC is the best at handling unstructured, fine-grained references.
  • MuPC experience shows that optimizing remote shared references is more rewarding than improving network interconnects.
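The prefetching point can be made concrete. A common UPC hand-tuning idiom (sketched here hypothetically, not taken from the GWU NPB code) replaces many fine-grained remote reads with a single bulk `upc_memget` into private memory:

```c
/* Hypothetical hand-tuning sketch: fetch a remote block in one bulk
   transfer instead of reading it element by element. */
#include <upc.h>

#define B 4096

/* Blocked layout: thread t owns the contiguous block
   data[t*B .. t*B + B-1]. */
shared [B] double data[B * THREADS];

double sum_remote_block(int t)
{
    double buf[B];   /* private (local) buffer */
    double s = 0.0;
    int i;

    /* One bulk transfer replaces B fine-grained remote reads. */
    upc_memget(buf, &data[t * B], B * sizeof(double));

    for (i = 0; i < B; i++)
        s += buf[i];
    return s;
}
```

Whether a compiler can perform this transformation automatically is exactly the gap between the naïve and hand-tuned NPB variants measured above.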
Thank you! For more information: http://www.upc.mtu.edu