
Parallel Computers Today


Presentation Transcript


  1. Parallel Computers Today • Two Nvidia 8800 GPUs > 1 TFLOPS • LANL / IBM Roadrunner > 1 PFLOPS • Intel 80-core chip > 1 TFLOPS • TFLOPS = 10^12 floating point ops/sec • PFLOPS = 10^15 floating point ops/sec (1,000,000,000,000,000 / sec)
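
To put these units in scale (a back-of-the-envelope sketch, not from the slides): a dense LU factorization of an n-by-n matrix costs about (2/3)n^3 floating-point operations, so a hypothetical n = 100,000 problem takes minutes at 1 TFLOPS but under a second at 1 PFLOPS.

```python
# Rough timing estimate; the problem size n is purely illustrative.
n = 100_000
flops = (2 / 3) * n**3              # dense LU costs ~(2/3) * n^3 flops, ~6.7e14 here

for label, rate in [("1 TFLOPS", 1e12), ("1 PFLOPS", 1e15)]:
    print(f"{label}: {flops / rate:7.1f} seconds")
# 1 TFLOPS:   666.7 seconds  (about 11 minutes)
# 1 PFLOPS:     0.7 seconds
```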

  2. Columbia (10240-processor SGI Altix, 50 Teraflops, NASA Ames Research Center)

  3. Beowulf (18-processor cluster, lab machine)

  4. AMD Opteron quad-core die

  5. The Nvidia G80 GPU • 128 streaming floating-point processors @ 1.5 GHz • 1.5 GB shared RAM with 86 GB/s bandwidth • 500 Gflops on one chip (single precision)
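
As a sanity check on the 500 Gflops figure (my own arithmetic, not from the slides): theoretical peak is roughly processors x clock x flops issued per processor per cycle, and the per-cycle count is the uncertain part: 2 for a multiply-add alone, 3 if an extra multiply can be dual-issued.

```python
# Back-of-the-envelope peak for the G80 numbers quoted above.
processors = 128
clock_hz = 1.5e9

for flops_per_cycle in (2, 3):       # multiply-add only vs. multiply-add + multiply
    peak = processors * clock_hz * flops_per_cycle
    print(f"{flops_per_cycle} flops/cycle -> {peak / 1e9:.0f} Gflops peak")
# 2 flops/cycle -> 384 Gflops peak
# 3 flops/cycle -> 576 Gflops peak
```

The ~500 Gflops quoted on the slide falls between these two bounds; the exact figure depends on the shipped shader clock and instruction mix.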

  6. The Computer Architecture Challenge • Most high-performance computer designs allocate resources to optimize Gaussian elimination on large, dense matrices. • Originally, because linear algebra is the middleware of scientific computing. • Nowadays, mostly for bragging rights. (Figure: PA = LU; Gaussian elimination factors a row-permuted dense matrix A into triangular factors L and U.)
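
The kernel being optimized here is LU factorization with partial pivoting (the core of the LINPACK/HPL benchmark behind the Top 500 list on the next slide). Below is a minimal, un-tuned NumPy sketch of the algorithm; production HPC codes block it for the cache hierarchy and distribute it across nodes.

```python
import numpy as np

def lu_partial_pivot(A):
    """Gaussian elimination with partial pivoting: returns P, L, U with P @ A = L @ U."""
    A = np.array(A, dtype=float)            # work on a copy, overwritten in place
    n = A.shape[0]
    piv = np.arange(n)                      # bookkeeping for the row permutation P
    for k in range(n - 1):
        p = k + np.argmax(np.abs(A[k:, k]))             # pivot: largest |entry| in column k
        if p != k:                                      # swap rows of A and of the permutation
            A[[k, p]] = A[[p, k]]
            piv[[k, p]] = piv[[p, k]]
        A[k+1:, k] /= A[k, k]                           # multipliers become a column of L
        A[k+1:, k+1:] -= np.outer(A[k+1:, k], A[k, k+1:])   # rank-1 update of the trailing block
    L = np.tril(A, -1) + np.eye(n)          # unit lower triangular factor
    U = np.triu(A)                          # upper triangular factor
    P = np.eye(n)[piv]                      # permutation matrix
    return P, L, U

# Sanity check on a small random dense matrix.
M = np.random.rand(5, 5)
P, L, U = lu_partial_pivot(M)
assert np.allclose(P @ M, L @ U)
```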

  7. Top 500 List • http://www.top500.org/list/2008/11/100

  8. Generic Parallel Machine Architecture • Key architecture question: Where is the interconnect, and how fast? • Key algorithm question: Where is the data? (Figure: storage hierarchy with a processor, L1 cache, L2 cache, L3 cache, and memory per node, and potential interconnects at each level.)
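
One way to quantify "where is the data" is a roofline-style bound (my sketch, not on the slide): attainable performance is capped by either the machine's peak flop rate or by memory bandwidth times the kernel's arithmetic intensity (flops per byte moved). The machine numbers and intensities below are hypothetical and only illustrative.

```python
def attainable_gflops(peak_gflops, bandwidth_gb_per_s, flops_per_byte):
    """Roofline-style upper bound: compute-limited or bandwidth-limited, whichever bites first."""
    return min(peak_gflops, bandwidth_gb_per_s * flops_per_byte)

# Hypothetical node: 75 Gflop/s peak and 21.3 GB/s to DRAM (made-up peak;
# the bandwidth is the kind of figure quoted on the next slide).
PEAK, BW = 75.0, 21.3

for kernel, intensity in [("blocked dense matmul", 8.0),     # rough, illustrative intensities
                          ("7-point stencil sweep", 0.5),
                          ("dot product", 0.125)]:
    print(f"{kernel:22s} <= {attainable_gflops(PEAK, BW, intensity):5.1f} Gflop/s")
```

Low-intensity kernels (like the dot product) are limited by the memory system no matter how many flops the chip can issue, which is why the interconnect question above matters as much as peak Gflops.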

  9. Multicore SMP Systems (Figure: block diagrams of four multicore SMP nodes: Intel Clovertown (eight Core2 cores, 4 MB shared L2 per core pair, front-side buses into a chipset, fully buffered DRAM at 21.3 GB/s read / 10.6 GB/s write), AMD Opteron (1 MB victim cache per core, on-chip memory controllers to DDR2 DRAM at 10.6 GB/s, HyperTransport links at 4 GB/s each direction), Sun Niagara2 (eight multithreaded UltraSparc cores behind a crossbar and a 4 MB shared 16-way L2, 4x128b FBDIMM controllers at 42.7 GB/s read / 21.3 GB/s write), and IBM Cell Blade (PPEs plus SPEs with 256 KB local stores on EIB ring networks, XDR DRAM at 25.6 GB/s per socket).)

  10. More Detail on GPU Architecture

  11. Michael Perrone (IBM): Proper Care and Feeding of Multicore Beasts • http://www.csm.ornl.gov/workshops/HPA/documents/1-arch/feeding_the_beast_perrone.pdf

  12. Cray XMT (highly multithreaded shared memory)
