Stanford Streaming Supercomputer (SSS) Project Meeting
Bill Dally, Pat Hanrahan, and Ron Fedkiw
Computer Systems Laboratory, Stanford University
October 2, 2001
Agenda • Introductions (now) • Vision – subset of ASCI review slides • Goals for the quarter • Schedule of meetings for the quarter
Computation is inexpensive and plentiful • NVIDIA GeForce3: ~80 GFLOPS, ~800 GOPS • Velio VC3003: 1 Tb/s I/O bandwidth • DRAM: < $0.20/MB
But supercomputers are very expensive • Cost more per GFLOPS, GUPS, and GByte than low end machines • Hard to achieve high fraction of peak performance on global problems • Based on clusters of CPUs that are scaling at only 20%/year vs. 50% historically
Microprocessors no longer realize the potential of VLSI [Chart: VLSI potential grows ~74%/year while microprocessor performance has slowed from 52%/year to 19%/year; the resulting gaps reach 30:1, 1,000:1, and 30,000:1]
Streaming processors leverage emerging technology • Streaming supercomputer can achieve • $20/GFLOPS, $2/M-GUPS • Scalable to PFLOPS and 10^13 GUPS • Enabled by • Stream architecture • Exposes and exploits parallelism and locality • High arithmetic intensity (ops/BW) • Hides latency • Efficient interconnection networks • High global bandwidth • Low latency
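The slide's cost targets can be sanity-checked with back-of-envelope arithmetic. The sketch below applies the quoted $20/GFLOPS figure to the desktop and machine-room configurations mentioned later in the deck; the function name and the idea of pricing purely by arithmetic rate are illustrative assumptions, not part of the slides.

```python
# Back-of-envelope cost estimate from the slide's $20/GFLOPS target.
# machine_cost() is a hypothetical helper for illustration only.
def machine_cost(gflops, dollars_per_gflops=20.0):
    """Estimated arithmetic cost of a streaming machine at $20/GFLOPS."""
    return gflops * dollars_per_gflops

desktop = machine_cost(1e3)       # 1 TFLOPS desktop node
machine_room = machine_cost(1e6)  # 1 PFLOPS machine-room system
print(f"desktop: ${desktop:,.0f}, machine room: ${machine_room:,.0f}")
```

At these rates a 1 TFLOPS desktop would cost about $20,000 in arithmetic, and a 1 PFLOPS machine about $20M, which is the "order of magnitude cost/performance improvement" the conclusion slide claims.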
What is stream processing? [Diagram: depth-extraction pipeline; Image 0 and Image 1 each pass through convolve kernels, then a SAD kernel combines the filtered streams into a Depth Map] • Streams expose data parallelism • Operations within a kernel operate on local data • Kernels can be partitioned across chips to exploit control parallelism
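The slide's depth-extraction pipeline can be sketched as generators chained kernel to kernel, which mirrors the stream model: each kernel reads a stream, works on local data, and emits a stream. The kernel bodies below (a toy 1-D convolution and a per-element absolute difference standing in for SAD) are simplified illustrations, not the real stereo algorithms.

```python
# Toy version of the slide's pipeline: two image streams pass through
# convolve kernels, then a SAD-style kernel produces a depth-map stream.

def convolve(stream, weights=(0.25, 0.5, 0.25)):
    """Kernel: sliding-window 1-D convolution over a stream of samples."""
    window = []
    for x in stream:
        window.append(x)
        if len(window) == len(weights):
            yield sum(w * v for w, v in zip(weights, window))
            window.pop(0)

def sad(left, right):
    """Kernel: per-element absolute difference of two filtered streams."""
    for a, b in zip(left, right):
        yield abs(a - b)

image0 = [1.0, 2.0, 3.0, 4.0, 5.0]
image1 = [1.0, 1.0, 2.0, 2.0, 3.0]
depth = list(sad(convolve(image0), convolve(image1)))
```

Each element of `depth` depends only on a small window of the inputs, so the work is data-parallel across stream elements, and the two `convolve` kernels are independent, so they could run on separate chips: the two forms of parallelism the slide names.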
Why does it get good performance – easily? [Diagram: Imagine bandwidth hierarchy; SDRAM channels at 2 GB/s feed a Stream Register File at 32 GB/s, which feeds the ALU clusters at 544 GB/s]
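The bandwidth hierarchy explains the earlier "arithmetic intensity (ops/BW)" point: each level must supply enough operands per byte or the ALUs stall. The sketch below computes the ops/byte a kernel would need at each level to sustain an assumed peak rate; the 20 GOPS peak is an illustrative number, not from the slide.

```python
# Ops per byte needed at each memory level to keep the ALUs busy,
# using the slide's bandwidth figures (2, 32, and 544 GB/s).
def required_intensity(peak_gops, bandwidth_gbs):
    """Minimum arithmetic intensity (ops/byte) to sustain peak_gops."""
    return peak_gops / bandwidth_gbs

peak = 20.0  # GOPS: an assumed illustrative peak rate
for name, bw in [("DRAM", 2.0), ("SRF", 32.0), ("local", 544.0)]:
    print(f"{name}: {required_intensity(peak, bw):.2f} ops/byte")
```

The point of the hierarchy is that most operand traffic is served by the 544 GB/s local level, where well under one op per byte suffices, while only long-lived streams touch the scarce 2 GB/s off-chip bandwidth.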
Domain-specific language example: Marble shader in RTSL

float turbulence4_imagine_scalar(texref noise, float4 pos) {
  fragment float4 addr1 = pos;
  fragment float4 addr2 = pos * {2, 2, 2, 1};
  fragment float4 addr3 = pos * {4, 4, 4, 1};
  fragment float4 addr4 = pos * {8, 8, 8, 1};
  fragment float val;
  val = 0.5 * texture(noise, addr1)[0];
  val = val + 0.25 * texture(noise, addr2)[0];
  val = val + 0.125 * texture(noise, addr3)[0];
  val = val + 0.0625 * texture(noise, addr4)[0];
  return val;
}

float3 marble_color(float x) {
  float x2;
  x = sqrt(x + 1.0) * .7071;
  x2 = sqrt(x);
  return { .30 + .6*x2, .30 + .8*x, .60 + .4*x2 };
}

surface shader float4 shiny_marble_imagine(texref noise) {
  float4 Cd = lightmodel_diffuse({ 0.4, 0.4, 0.4, 1 }, { 0.5, 0.5, 0.5, 1 });
  float4 Cs = lightmodel_specular({ 0.35, 0.35, 0.35, 1 }, Zero, 20);
  fragment float y;
  fragment float4 pos = Pobj * {10, 10, 10, 1};
  y = pos[1] + 3.0 * turbulence4_imagine_scalar(noise, pos);
  y = sin(y * pi);
  return ({marble_color(y), 1.0f} * Cd + Cs);
}
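The core of the shader is the four-octave turbulence sum: each octave doubles the sampling frequency and halves the weight (1/2, 1/4, 1/8, 1/16, matching the RTSL constants). The Python paraphrase below uses a cheap deterministic sine hash as a stand-in for the noise texture lookup; the stand-in is an assumption for illustration, only the octave structure comes from the slide.

```python
import math

# Python paraphrase of turbulence4_imagine_scalar: four octaves of
# noise at scales 1, 2, 4, 8 with weights 0.5, 0.25, 0.125, 0.0625.
def noise(x, y, z):
    """Stand-in for the noise texture: deterministic value in [0, 1]."""
    return 0.5 + 0.5 * math.sin(x * 12.9898 + y * 78.233 + z * 37.719)

def turbulence4(x, y, z):
    val = 0.0
    weight = 0.5
    for octave in range(4):
        s = 2 ** octave                       # scales 1, 2, 4, 8
        val += weight * noise(x * s, y * s, z * s)
        weight *= 0.5                         # weights halve each octave
    return val
```

Because the per-fragment work is a pure function of position, every fragment can be evaluated independently, which is what makes shading a natural stream kernel.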
Stream-level application description example: SHARP Raytracer • Computation expressed as streams of records passing through kernels • Similar to computation required for Monte-Carlo radiation transport [Diagram: Camera emits Rays into Ray Gen; a Traverser walks the Grid producing VoxIDs; an Intersector tests Triangles producing Hits; a Shader applies Lights, Normals, & Materials and emits Pixels]
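The "streams of records passing through kernels" organization can be sketched as a chain of generators over ray records. The record fields, the modulo "grid walk", and the hit test below are invented stand-ins; only the kernel names (Ray Gen, Traverser, Intersector, Shader) come from the slide.

```python
from dataclasses import dataclass

# Sketch of the SHARP-style organization: ray records flow kernel to
# kernel, and each kernel may filter or transform the stream.
@dataclass
class Ray:
    pixel: int
    voxel: int = 0

def ray_gen(n):
    """Kernel: emit one primary ray per pixel."""
    for p in range(n):
        yield Ray(pixel=p)

def traverser(rays):
    """Kernel: stand-in grid walk assigning a voxel ID to each ray."""
    for r in rays:
        r.voxel = r.pixel % 4
        yield r

def intersector(rays):
    """Kernel: stand-in triangle test; pass only rays that hit."""
    for r in rays:
        if r.voxel != 0:
            yield r

def shader(hits):
    """Kernel: stand-in shading producing (pixel, intensity) records."""
    for h in hits:
        yield (h.pixel, 255)

pixels = list(shader(intersector(traverser(ray_gen(8)))))
```

The same shape, records filtered and transformed as they pass through kernels, is why the slide notes the similarity to Monte-Carlo radiation transport, where particle records flow through analogous scatter/absorb stages.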
Expected application performance • Arithmetic-limited applications • Includes applications where domain decomposition can be applied • Like TFLO and LES • Expected to achieve a large fraction of peak performance • Communication-limited applications • Such as applications requiring matrix solution Ax = b • At the very least will benefit from high global bandwidth • We hope to find new methods to solve matrix equations using streaming
Conclusion • Computation is cheap yet supercomputing is expensive • Streams enable supercomputing to exploit advantages of emerging technology • by exposing locality and concurrency • Order of magnitude cost/performance improvement for both arithmetic-limited and communication-limited codes • $20/GFLOPS and $2/M-GUPS • Scalable from desktop (1 TFLOPS) to machine room (1 PFLOPS) • A layered software system using domain-specific languages simplifies stream programming • MCRT, ODEs, PDEs • Early results on graphics and image processing are encouraging
Project Goals for Fall Quarter AY2001-2002 • Map two applications to the stream model • Fluid flow (TFLO), and molecular dynamics candidates • Define a high-level stream programming language • Generalize stream access without destroying locality • Draft strawman SSS architecture and identify key issues
Meeting Schedule Fall Quarter AY2001-2002 Goal: shared knowledge base and vision across the project • 10/9 – TFLO (Juan) • 10/16 – RTSL (Bill M.) • 10/23 – Molecular Dynamics (Eric) • 10/30 – Imagine and its programming system (Ujval) • 11/6 – C*, ZPL, etc… + SPL brainstorming (Ian) • 11/13 – Metacompilation (Ben C.) • 11/20 – Application followup (Ron/Heinz) • 11/27 – Strawman architecture (Ben S.) • 12/4 – Streams vs. CMP (Blue Gene/Light, etc…) (Bill D.)