Performance and Productivity of Emerging Architectures

Performance and Productivityof Emerging Architectures Jeremy Meredith Sadaf Alam Jeffrey Vetter Future Technologies

Overview 300 GPU 200 Performance (GF) 100 CPU 0 2004 1998 2000 2002 2006

Dual-core (Woodcrest) Quad-core (Clovertown) Programming models MPI OpenMP Pthreads Recourse contention Memory bandwidth I/O bandwidth x8 x8 SunriseLake PCI-X 10 GbEI/OATiSCSI SAS/SATA-2 PCI-X 10 GbE Intel Xeon multicore CPU DempseyWoodcrestClovertown DIB1066/1333 MHz8.5/10.5 GB/s Up to 21 GB/s ESB-2I/Obridge BlackfordMCH FBD FBD ESI FBD FBD FBD FBD FBD FBD FBD FBD FBD FBD x8 FBD FBD FBD FBD Configurable set of PCIe ports PCIe slot

IBM Cell broadband engine Cell heterogeneous multicore processor 8 synergistic processing element (SPE) cores 200 GFLOPS at 3.2 GHz 200-GB/s memory bandwidth Programming models Multisource compilation Threadlike launch semantics for SPEs SPU SPU SPU SPU SPU SPU SPU SXU SXU SXU SXU SXU SXU SXU LS LS LS LS LS LS LS MFC MFC MFC MFC MFC MFC MFC SPU SXU LS MFC EIB (up to 96 B/cycle) SPE 16 B/cycle 16 B/cycle 16 B/cycle (2x) PPE 16 B/cycle BIC BIC L2 PPU LI PXU 32 B/cycle 16 B/cycle Flex10 Dual XDR Source: M. Gschwiind et al., Hot Chips-17, August 2005

Graphics processing unit (GPU) NVIDIA 7900GTX 24 SIMD pixelpipelines 200 GFLOPS at 650 MHz 50-GB/s memorybandwidth Programming models Multiple-sourcecompilation Gather semantics define parallelism Host CPU drives program setup and data transfer Memorypartition Memorypartition Memorypartition Vertexshader units Triangle setup Z-cull Shader instruction dispatch Pixel pipelines L2 Tex Fragment crossbar ROP engine Memorypartition

Cray XMT system MTA-I and MTA-II processor architecture XT3/4 scalable infrastructure AMD Torrenzatechnology Fine-grain multithreading(128 concurrent threads) Uniform memory hierarchy in MTA-Iand MTA-II Extreme multithreading with Cray MTA-II Programs runningin parallel 1 2 3 4 i=n i=n Sub- problemA SerialCode i=3 i=1 Concurrent threads of computation Sub- problemB i=2 i=0 i=1 Sub-problem A Hardwarestreams (128) Unusedstreams Instructionready pool Pipeline of executinginstructions

Application kernels

Performance OpenMP improves performance moderately on commodity CPUs High parallelismof Cell and GPUcan improve performance, but more sowhen memoryaccess is regular Reference (Woodcrest, single thread) OpenMP (Woodcrest, 2 cores) OpenMP (Clovertown, 4 cores) 8 Cell (8 SPEs) GPU (NVIDIA 7900GTX) MTA-2 (1 processor, 128 streams) 6 Speedup 4 2 0 Molecular Dynamics Covariance Matrix

Productivity Despite small performance increases through OpenMP, the ease of using it means that productivity can still be increased Conversely, the high speed of Cell and GPU means that even substantial effortresults in higher productivity Reference (Woodcrest, single thread) OpenMP (Woodcrest, 2 cores) OpenMP (Clovertown, 4 cores) Cell (8 SPEs) GPU (NVIDIA 7900GTX) 3.0 MTA-2 (1 processor, 128 streams) 2.0 Productivity improvement 1.0 0.0 Molecular Dynamics Covariance Matrix

Productivity scaling Increased parallelismfrom all architectureswith littleincreased effort results in higher productivity Reference (Woodcrest, single thread) OpenMP (Woodcrest, 2 cores) OpenMP (Clovertown, 4 cores) MTA-2 (32 processors) 3 2 Speedup 1 0 Molecular Dynamics Covariance Matrix Scaling when accessing multiple devices

Contacts Jeremy Meredith Future Technologies Group (865) 241-5842 jsmeredith@ornl.gov Sadaf Alam Future Technologies Group (865) 241-1533 alamrs@ornl.gov Jeffrey Vetter Future Technologies Group (865) 576-7115 vetter@ornl.gov 11 Meredith_Architectures_SC07

Performance and Productivity of Emerging Architectures

Performance and Productivity of Emerging Architectures

Presentation Transcript

Performance Analysis of Software Architectures

Performance Analysis of Multiprocessor Architectures

Productivity Performance and Government Policy

Evolution of High Performance Cluster Architectures

ERD ITWG Emerging Research Architectures

X10: Performance and Productivity at Scale

Emerging architectures and standards in telecommunications

Performance and Productivity: NWChem

Performance Evaluation of Architectures

Optimization of Coupled Systems on Emerging Architectures

Lecture 26 Emerging Architectures

PVTOL: Designing Portability, Productivity and Performance for Multicore Architectures

Productivity and Performance Role of HR

Design and Performance Evaluation of Networked Storage Architectures

Digital and institutional repositories: emerging architectures

Compiler Technology for Productivity and Performance

PVTOL: Designing Portability, Productivity and Performance for Multicore Architectures

Evolution of High Performance Cluster Architectures