Evaluation of Modern Parallel Vector Architectures
Leonid Oliker
Future Technologies Group, Computational Research Division, LBNL
www.nersc.gov/~oliker
Previous Research
• Examined complex interactions between high-level algorithms, leading programming paradigms, and modern architectural platforms
• Evaluated three parallelization strategies for a dynamic unstructured mesh adaptation algorithm
• Examined two major classes of adaptive applications under three parallel programming models (UMA and N-Body)
• Investigated effects of algorithmic orderings on sparse matrix computations
• Evaluated performance of shared-virtual-memory systems on PC-SMP clusters using six application kernels (structured and unstructured)
• Architectures examined: T3E, Origin2000, SP, PC cluster, MTA
• Examined scientific kernels on emerging microarchitectures: VIRAM (Berkeley PIM) and Imagine (Stanford stream architecture)
• Programming paradigms: MPI, OpenMP, hybrid, SHMEM, shared memory, multithreading, vectorization, streaming
New Evaluation Project: Modern Parallel Vector Systems
• Vector architectures: SX6, X1, and ES
• Plan to study key factors of modern parallel vector systems: runtime, scalability, programmability, portability, and memory overhead, while identifying potential bottlenecks
• Examine microbenchmarks, kernels, and application codes
• Key questions: What fraction of scientific codes is suitable for these architectures? What is the best programming paradigm? What algorithmic modifications are required? What factors limit scalability? What migration issues arise in terms of performance portability?
Microbenchmark and Kernel Codes
• Examine memory bandwidth within a node for simple and complex array addressing (see the sketch below)
• Examine low-level message-passing characteristics: point-to-point, intra-node, and inter-node performance, aggregate (collective) operations, and one-sided communication, as well as I/O
• Task and thread performance: thread creation, task management, locks, semaphores, and barriers; explicit threads vs. implicit OpenMP
• Evaluate the NAS Parallel Benchmarks using MPI, OpenMP, and hybrid programming; new Class D and E problem sizes are being developed by Rob Van der Wijngaart at NASA Ames
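To illustrate what "simple and complex array addressing" means in the memory-bandwidth microbenchmark, the following is a minimal single-node sketch in C contrasting unit-stride, strided, and indexed (gather) access. The array size N, the stride, and the index permutation are illustrative assumptions, not the actual benchmark suite used in this study; a real benchmark would also repeat each loop and guard against dead-code elimination.

```c
/* Sketch: single-node bandwidth for unit-stride, strided, and gather access.
 * Sizes, stride, and index pattern are illustrative only. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1 << 24)          /* assumed working-set size (16M doubles) */
#define STRIDE 8             /* assumed non-unit stride */

static double now(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + 1e-9 * ts.tv_nsec;
}

int main(void) {
    double *a = malloc(N * sizeof *a), *b = malloc(N * sizeof *b);
    int *idx = malloc(N * sizeof *idx);
    for (long i = 0; i < N; i++) {
        a[i] = 1.0; b[i] = 0.0;
        idx[i] = (int)((i * 7919L) % N);   /* pseudo-random permutation */
    }

    double t0 = now();
    for (long i = 0; i < N; i++) b[i] = a[i];          /* simple: unit stride */
    double t1 = now();
    for (long i = 0; i < N; i += STRIDE) b[i] = a[i];  /* complex: strided */
    double t2 = now();
    for (long i = 0; i < N; i++) b[i] = a[idx[i]];     /* complex: gather */
    double t3 = now();

    double mb = N * sizeof(double) / 1e6;              /* MB moved per array pass */
    printf("unit-stride: %.1f MB/s\n", 2 * mb / (t1 - t0));
    printf("stride-%d:   %.1f MB/s\n", STRIDE, 2 * (mb / STRIDE) / (t2 - t1));
    printf("gather:      %.1f MB/s\n", 2 * mb / (t3 - t2));
    printf("checksum:    %f\n", b[N - 1]);             /* keep results live */
    free(a); free(b); free(idx);
    return 0;
}
```

On vector machines such as the SX6 and X1, the gap between the unit-stride and gather loops is one indication of how well the memory subsystem handles irregular access, which is exactly what distinguishes "simple" from "complex" addressing in this context.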
Application Codes
• Astrophysics:
  • MADCAP: Microwave Anisotropy Dataset Computational Analysis Package. Analyzes cosmic microwave background radiation datasets to extract the maximum-likelihood angular power spectrum. (Julian Borrill, LBNL)
  • CACTUS: Direct evolution of Einstein's equations, involving a coupled set of non-linear hyperbolic and elliptic equations with thousands of terms. (John Shalf, LBNL)
• Climate:
  • CCM3: Community Climate Model. (Michael Wehner, LBNL)
• Fluid Dynamics:
  • OverflowD: Overset-grid Navier-Stokes solver. Simulates complex rotorcraft vortex dynamics problems. (Mohammad Djomehri, NASA)
Application Codes (cont.)
• Fusion:
  • GTC: Gyrokinetic Toroidal Code. 3D particle-in-cell code to study microturbulence in magnetic confinement fusion. (Stephane Ethier, Princeton Plasma Physics Laboratory)
  • TLBE: Thermal lattice Boltzmann equation solver for modeling turbulence and collisions in plasma. (Jonathan Carter, LBNL)
• Material Science:
  • PARATEC: PARAllel Total Energy Code. Electronic structure code that performs ab-initio quantum-mechanical total-energy calculations. (Andrew Canning, LBNL)
• Molecular Dynamics:
  • NAMD: Object-oriented molecular dynamics code designed for simulation of large biomolecular systems. (David Skinner, LBNL)
Benchmarking Timeline and Evaluation Goals
• Currently porting codes to a single-node SX6 (USA)
• Will soon have multi-node SX6 access at DKRZ (Germany)
• Early access to the Cray X1 expected in early February (ORNL)
• Hope to gain Earth Simulator access in summer 2003
• This opportunity will allow us to compare performance and programmability with leading conventional architectures (Power4, Alpha EV67)
• It also allows comparison with the significantly different X1 system:
  • X1 vector pipes are "distributed" within the X1 multistreaming processor
  • Cache-based architecture with support for globally addressable memory
  • The compiler must identify both streaming (microtasking) and vectorization, while maximizing cache reuse (see the sketch below)
  • Is the same programming style effective on both the X1 and the ES?
• Help guide future system acquisitions and scientific code development
• Potential to run applications at unprecedented scale
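As a rough illustration of the loop structure these compilers target, consider the dense matrix-vector kernel below: the inner loop is unit-stride and dependence-free, so it is a candidate for vectorization, while the independent outer iterations can be multistreamed. This is only a sketch with illustrative sizes; OpenMP is used here as a stand-in for the X1's streaming/microtasking directives, which are not covered in these slides.

```c
/* Sketch of a loop nest of the kind the X1/ES compilers target.
 * Inner loop -> vectorization; outer loop -> multistreaming/threading.
 * Blocking over j would additionally improve cache reuse of x[] on the X1. */
#include <stdio.h>

#define NI 1024
#define NJ 1024

static double a[NI][NJ], x[NJ], y[NI];

int main(void) {
    for (int j = 0; j < NJ; j++) x[j] = 1.0;
    for (int i = 0; i < NI; i++)
        for (int j = 0; j < NJ; j++) a[i][j] = 0.001 * (i + j);

    /* Outer loop: independent iterations, suitable for streaming/threads */
    #pragma omp parallel for
    for (int i = 0; i < NI; i++) {
        double sum = 0.0;
        /* Inner loop: unit stride, no loop-carried dependence -> vectorizes */
        for (int j = 0; j < NJ; j++)
            sum += a[i][j] * x[j];
        y[i] = sum;
    }

    printf("y[0] = %f, y[NI-1] = %f\n", y[0], y[NI - 1]);
    return 0;
}
```

The open question noted above is whether the same source code and programming style let the compiler exploit both levels of parallelism effectively on the X1 and the ES, or whether platform-specific restructuring is needed.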