On October 2, 2001, Bill Dally, Pat Hanrahan, and Ron Fedkiw convened at Stanford University for a meeting of the Stanford Streaming Supercomputer (SSS) project. The meeting covered the goals for the fall quarter, the architectural vision behind streaming supercomputers, and their potential cost/performance benefits. With commodity chips such as the NVIDIA GeForce3 delivering on the order of 800 Gops/sec at low cost, the project aims to exploit stream architectures to accelerate applications such as fluid dynamics, ray tracing, and image processing, while challenging the traditional cluster-of-CPUs supercomputing model.
Stanford Streaming Supercomputer (SSS) Project Meeting
Bill Dally, Pat Hanrahan, and Ron Fedkiw
Computer Systems Laboratory, Stanford University
October 2, 2001
Agenda
• Introductions (now)
• Vision – subset of ASCI review slides
• Goals for the quarter
• Schedule of meetings for the quarter
Computation is inexpensive and plentiful
• NVIDIA GeForce3: ~80 GFLOPS, ~800 Gops/sec
• Velio VC3003: 1 Tb/s I/O bandwidth
• DRAM: < $0.20/MB
But supercomputers are very expensive
• Cost more per GFLOPS, GUPS, and GByte than low-end machines
• Hard to achieve a high fraction of peak performance on global problems
• Based on clusters of CPUs that are scaling at only 20%/year vs. 50% historically
Microprocessors no longer realize the potential of VLSI
[chart: growth-rate annotations of 52%/year, 19%/year, and 74%/year, with gaps of 30:1, 1,000:1, and 30,000:1]
Streaming processors leverage emerging technology
• A streaming supercomputer can achieve
  • $20/GFLOPS, $2/M-GUPS
  • Scalable to PFLOPS and 10^13 GUPS
• Enabled by
  • Stream architecture
    • Exposes and exploits parallelism and locality
    • High arithmetic intensity (ops/BW)
    • Hides latency
  • Efficient interconnection networks
    • High global bandwidth
    • Low latency
What is stream processing?
• Streams expose data parallelism
• Operations within a kernel operate on local data
• Kernels can be partitioned across chips to exploit control parallelism
[diagram: stereo depth extraction – Image 0 and Image 1 each pass through convolve kernels, whose outputs feed an SAD kernel that produces a Depth Map]
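To make the pattern above concrete, here is a minimal C++ sketch of the stereo-depth pipeline in the diagram: streams are arrays of elements, and each kernel touches only a local neighborhood of its input. The 1-D rows, 3-tap filter weights, and disparity search range are illustrative choices, not taken from the slide.

// Minimal sketch of the stream/kernel pattern from the slide:
// streams of elements flow through kernels that touch only local data.
#include <cmath>
#include <cstdio>
#include <vector>

using Stream = std::vector<float>;

// Kernel: 3-tap convolution; each output element depends only on a
// local neighborhood of the input (data parallelism across elements).
Stream convolve(const Stream& in, float k0, float k1, float k2) {
    Stream out(in.size(), 0.0f);
    for (size_t i = 1; i + 1 < in.size(); ++i)
        out[i] = k0 * in[i - 1] + k1 * in[i] + k2 * in[i + 1];
    return out;
}

// Kernel: absolute-difference matching over a small disparity range;
// emits the best disparity per element (the "depth map").
Stream sad(const Stream& left, const Stream& right, int maxDisp) {
    Stream depth(left.size(), 0.0f);
    for (size_t i = 0; i < left.size(); ++i) {
        float best = 1e30f; int bestD = 0;
        for (int d = 0; d <= maxDisp && i + d < right.size(); ++d) {
            float cost = std::abs(left[i] - right[i + d]);
            if (cost < best) { best = cost; bestD = d; }
        }
        depth[i] = static_cast<float>(bestD);
    }
    return depth;
}

int main() {
    Stream image0 = {1, 2, 3, 4, 5, 6, 7, 8};
    Stream image1 = {2, 3, 4, 5, 6, 7, 8, 9};
    // Pipeline: convolve each image, then match the filtered streams.
    Stream f0 = convolve(image0, 0.25f, 0.5f, 0.25f);
    Stream f1 = convolve(image1, 0.25f, 0.5f, 0.25f);
    Stream depthMap = sad(f0, f1, /*maxDisp=*/3);
    for (float d : depthMap) std::printf("%.0f ", d);
    std::printf("\n");
    return 0;
}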
Why does it get good performance – easily?
[diagram: bandwidth hierarchy – off-chip SDRAM at 2 GB/s, on-chip Stream Register File at 32 GB/s, and ALU clusters with 544 GB/s of local bandwidth]
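One way to read the bandwidth numbers above: good performance comes from keeping most data references at the fast, local end of the hierarchy rather than in off-chip SDRAM. The C++ sketch below contrasts a two-pass formulation, whose intermediate stream must be written to and re-read from memory, with a fused kernel that keeps the intermediate in a register; the pass structure and function names are illustrative and not the Imagine programming model.

// Illustrative contrast (not the actual Imagine tools): a fused kernel keeps
// intermediates in registers, so most references hit the fast end of the
// bandwidth hierarchy instead of off-chip SDRAM.
#include <vector>

// Two passes: the intermediate stream 'tmp' must be stored and re-read,
// consuming scarce off-chip bandwidth.
std::vector<float> twoPasses(const std::vector<float>& in) {
    std::vector<float> tmp(in.size()), out(in.size());
    for (size_t i = 0; i < in.size(); ++i) tmp[i] = in[i] * 2.0f;   // pass 1
    for (size_t i = 0; i < in.size(); ++i) out[i] = tmp[i] + 1.0f;  // pass 2
    return out;
}

// Fused kernel: the intermediate lives in a register ('t'), raising the
// ops-per-byte-of-memory-traffic ratio (arithmetic intensity).
std::vector<float> fusedKernel(const std::vector<float>& in) {
    std::vector<float> out(in.size());
    for (size_t i = 0; i < in.size(); ++i) {
        float t = in[i] * 2.0f;  // stays local
        out[i] = t + 1.0f;
    }
    return out;
}

int main() {
    std::vector<float> in = {1, 2, 3, 4};
    // Same result either way; the fused version just moves less data.
    return fusedKernel(in) == twoPasses(in) ? 0 : 1;
}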
Domain-specific language example: Marble shader in RTSL

float turbulence4_imagine_scalar(texref noise, float4 pos) {
  fragment float4 addr1 = pos;
  fragment float4 addr2 = pos * {2, 2, 2, 1};
  fragment float4 addr3 = pos * {4, 4, 4, 1};
  fragment float4 addr4 = pos * {8, 8, 8, 1};
  fragment float val;
  val = (0.5) * texture(noise, addr1)[0];
  val = val + (0.25) * texture(noise, addr2)[0];
  val = val + (0.125) * texture(noise, addr3)[0];
  val = val + (0.0625) * texture(noise, addr4)[0];
  return val;
}

float3 marble_color(float x) {
  float x2;
  x = sqrt(x + 1.0) * .7071;
  x2 = sqrt(x);
  return { .30 + .6*x2, .30 + .8*x, .60 + .4*x2 };
}

surface shader float4 shiny_marble_imagine(texref noise) {
  float4 Cd = lightmodel_diffuse({ 0.4, 0.4, 0.4, 1 }, { 0.5, 0.5, 0.5, 1 });
  float4 Cs = lightmodel_specular({ 0.35, 0.35, 0.35, 1 }, Zero, 20);
  fragment float y;
  fragment float4 pos = Pobj * {10, 10, 10, 1};
  y = pos[1] + 3.0 * turbulence4_imagine_scalar(noise, pos);
  y = sin(y*pi);
  return ({marble_color(y), 1.0f} * Cd + Cs);
}
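For readers unfamiliar with RTSL, the fragment qualifier marks per-element values, so the shader body conceptually runs as an independent computation for every fragment. Below is a rough C++ analogue of turbulence4_imagine_scalar with a stand-in noise() function; it is a sketch of the data-parallel semantics, not the output of the RTSL compiler.

// Rough C++ analogue of the RTSL turbulence kernel: each fragment's value
// is computed independently, which is the data parallelism streams expose.
// noise() is a placeholder for the texture(noise, addr)[0] lookup.
#include <cmath>
#include <vector>

struct Float4 { float x, y, z, w; };

float noise(const Float4& p) {  // stand-in for a noise texture lookup
    return 0.5f * (std::sin(p.x * 12.9898f + p.y * 78.233f + p.z * 37.719f) + 1.0f);
}

float turbulence4(const Float4& pos) {
    Float4 a2 = {pos.x * 2, pos.y * 2, pos.z * 2, pos.w};
    Float4 a3 = {pos.x * 4, pos.y * 4, pos.z * 4, pos.w};
    Float4 a4 = {pos.x * 8, pos.y * 8, pos.z * 8, pos.w};
    return 0.5f * noise(pos) + 0.25f * noise(a2)
         + 0.125f * noise(a3) + 0.0625f * noise(a4);
}

// The "fragment" semantics: apply the kernel independently to every element.
std::vector<float> runOverFragments(const std::vector<Float4>& positions) {
    std::vector<float> out;
    out.reserve(positions.size());
    for (const Float4& p : positions) out.push_back(turbulence4(p));
    return out;
}

int main() {
    std::vector<Float4> fragments = {{0, 0, 0, 1}, {1, 2, 3, 1}};
    return runOverFragments(fragments).size() == 2 ? 0 : 1;
}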
Stream-level application description example: SHARP Raytracer
• Computation expressed as streams of records passing through kernels
• Similar to the computation required for Monte Carlo radiation transport
[diagram: kernels Ray Gen → Traverser → Intersector → Shader, with Camera, Grid, Triangles, and Lights/Normals/Materials as inputs; streams of Rays, VoxIDs, and Hits flow between kernels, and Pixels are the output]
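A minimal C++ sketch of the "streams of records through kernels" structure named on this slide, using the kernel names from the diagram; the record fields and kernel bodies are simplified placeholders rather than the SHARP implementation.

// Sketch of a stream-level pipeline: records (rays, hits) flow between
// kernels named after the slide (Ray Gen -> Traverser -> Intersector ->
// Shader). Record fields and kernel bodies are simplified placeholders.
#include <vector>

struct Ray   { float ox, oy, oz, dx, dy, dz; int pixel; };
struct Hit   { int pixel; float t; int triangle; };
struct Pixel { int id; float r, g, b; };

std::vector<Ray> rayGen(int width, int height) {            // camera -> rays
    std::vector<Ray> rays;
    for (int y = 0; y < height; ++y)
        for (int x = 0; x < width; ++x)
            rays.push_back({0, 0, 0, float(x), float(y), 1.0f, y * width + x});
    return rays;
}

std::vector<Ray> traverse(const std::vector<Ray>& rays) {   // grid traversal (stub)
    return rays;                                            // would attach voxel IDs
}

std::vector<Hit> intersect(const std::vector<Ray>& rays) {  // ray/triangle tests (stub)
    std::vector<Hit> hits;
    for (const Ray& r : rays) hits.push_back({r.pixel, 1.0f, 0});
    return hits;
}

std::vector<Pixel> shade(const std::vector<Hit>& hits) {    // lights/materials (stub)
    std::vector<Pixel> pixels;
    for (const Hit& h : hits) pixels.push_back({h.pixel, 0.5f, 0.5f, 0.5f});
    return pixels;
}

int main() {
    auto pixels = shade(intersect(traverse(rayGen(4, 4))));
    return pixels.size() == 16 ? 0 : 1;
}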
Expected application performance
• Arithmetic-limited applications
  • Include applications where domain decomposition can be applied, like TFLO and LES
  • Expected to achieve a large fraction of peak performance
• Communication-limited applications
  • Such as applications requiring a matrix solution Ax = b
  • Will at the very least benefit from high global bandwidth
  • We hope to find new methods of solving matrix equations using streaming
Conclusion
• Computation is cheap, yet supercomputing is expensive
• Streams enable supercomputing to exploit the advantages of emerging technology by exposing locality and concurrency
• Order-of-magnitude cost/performance improvement for both arithmetic-limited and communication-limited codes
  • $20/GFLOPS and $2/M-GUPS
  • Scalable from desktop (1 TFLOPS) to machine room (1 PFLOPS)
• A layered software system using domain-specific languages simplifies stream programming
  • MCRT, ODEs, PDEs
• Early results on graphics and image processing are encouraging
Project Goals for Fall Quarter AY2001-2002
• Map two applications to the stream model
  • Fluid flow (TFLO) and molecular dynamics are candidates
• Define a high-level stream programming language
  • Generalize stream access without destroying locality
• Draft a strawman SSS architecture and identify key issues
Meeting Schedule, Fall Quarter AY2001-2002
Goal: a shared knowledge base and vision across the project
• 10/9 – TFLO (Juan)
• 10/16 – RTSL (Bill M.)
• 10/23 – Molecular Dynamics (Eric)
• 10/30 – Imagine and its programming system (Ujval)
• 11/6 – C*, ZPL, etc. + SPL brainstorming (Ian)
• 11/13 – Metacompilation (Ben C.)
• 11/20 – Application follow-up (Ron/Heinz)
• 11/27 – Strawman architecture (Ben S.)
• 12/4 – Streams vs. CMP (Blue Gene/Light, etc.) (Bill D.)