On October 2, 2001, Bill Dally, Pat Hanrahan, and Ron Fedkiw convened at Stanford University for a meeting of the Stanford Streaming Supercomputer (SSS) project. The meeting covered the goals for the fall quarter, the architectural vision behind streaming supercomputers, and their potential cost/performance benefits. With commodity chips such as the NVIDIA GeForce3 delivering on the order of 800 Gops/sec at low cost, the project aims to exploit stream architectures to accelerate applications such as fluid dynamics, ray tracing, and image processing, while challenging the traditional cluster-of-CPUs supercomputing model.
Stanford Streaming Supercomputer (SSS) Project Meeting
Bill Dally, Pat Hanrahan, and Ron Fedkiw
Computer Systems Laboratory, Stanford University
October 2, 2001
Agenda
• Introductions (now)
• Vision – subset of ASCI review slides
• Goals for the quarter
• Schedule of meetings for the quarter
Computation is inexpensive and plentiful
• NVIDIA GeForce3: ~80 GFLOPS, ~800 Gops/sec
• Velio VC3003: 1 Tb/s I/O bandwidth
• DRAM: < $0.20/MB
But supercomputers are very expensive
• Cost more per GFLOPS, GUPS, and GByte than low-end machines
• Hard to achieve a high fraction of peak performance on global problems
• Based on clusters of CPUs that are scaling at only 20%/year vs. 50% historically
Microprocessors no longer realize the potential of VLSI
[chart: growth-rate annotations of 52%/year, 19%/year, and 74%/year, with gaps of 30:1, 1,000:1, and 30,000:1]
Streaming processors leverage emerging technology
• A streaming supercomputer can achieve
  • $20/GFLOPS, $2/M-GUPS
  • Scalable to PFLOPS and 10^13 GUPS
• Enabled by
  • Stream architecture
    • Exposes and exploits parallelism and locality
    • High arithmetic intensity (ops/BW)
    • Hides latency
  • Efficient interconnection networks
    • High global bandwidth
    • Low latency
What is stream processing?
• Streams expose data parallelism
• Operations within a kernel operate on local data
• Kernels can be partitioned across chips to exploit control parallelism
[diagram: stereo depth extraction – Image 0 and Image 1 each pass through convolve kernels, whose outputs feed an SAD kernel that produces a Depth Map]
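To make the pattern above concrete, here is a minimal C++ sketch of the stereo-depth pipeline in the diagram: streams are arrays of elements, and each kernel touches only a local neighborhood of its input. The 1-D rows, 3-tap filter weights, and disparity search range are illustrative choices, not taken from the slide.

// Minimal sketch of the stream/kernel pattern from the slide:
// streams of elements flow through kernels that touch only local data.
#include <cmath>
#include <cstdio>
#include <vector>

using Stream = std::vector<float>;

// Kernel: 3-tap convolution; each output element depends only on a
// local neighborhood of the input (data parallelism across elements).
Stream convolve(const Stream& in, float k0, float k1, float k2) {
    Stream out(in.size(), 0.0f);
    for (size_t i = 1; i + 1 < in.size(); ++i)
        out[i] = k0 * in[i - 1] + k1 * in[i] + k2 * in[i + 1];
    return out;
}

// Kernel: absolute-difference matching over a small disparity range;
// emits the best disparity per element (the "depth map").
Stream sad(const Stream& left, const Stream& right, int maxDisp) {
    Stream depth(left.size(), 0.0f);
    for (size_t i = 0; i < left.size(); ++i) {
        float best = 1e30f; int bestD = 0;
        for (int d = 0; d <= maxDisp && i + d < right.size(); ++d) {
            float cost = std::abs(left[i] - right[i + d]);
            if (cost < best) { best = cost; bestD = d; }
        }
        depth[i] = static_cast<float>(bestD);
    }
    return depth;
}

int main() {
    Stream image0 = {1, 2, 3, 4, 5, 6, 7, 8};
    Stream image1 = {2, 3, 4, 5, 6, 7, 8, 9};
    // Pipeline: convolve each image, then match the filtered streams.
    Stream f0 = convolve(image0, 0.25f, 0.5f, 0.25f);
    Stream f1 = convolve(image1, 0.25f, 0.5f, 0.25f);
    Stream depthMap = sad(f0, f1, /*maxDisp=*/3);
    for (float d : depthMap) std::printf("%.0f ", d);
    std::printf("\n");
    return 0;
}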
Why does it get good performance – easily?
[diagram: bandwidth hierarchy – off-chip SDRAM at 2 GB/s, on-chip Stream Register File at 32 GB/s, and ALU clusters with 544 GB/s of local bandwidth]
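One way to read the bandwidth numbers above: good performance comes from keeping most data references at the fast, local end of the hierarchy rather than in off-chip SDRAM. The C++ sketch below contrasts a two-pass formulation, whose intermediate stream must be written to and re-read from memory, with a fused kernel that keeps the intermediate in a register; the pass structure and function names are illustrative and not the Imagine programming model.

// Illustrative contrast (not the actual Imagine tools): a fused kernel keeps
// intermediates in registers, so most references hit the fast end of the
// bandwidth hierarchy instead of off-chip SDRAM.
#include <vector>

// Two passes: the intermediate stream 'tmp' must be stored and re-read,
// consuming scarce off-chip bandwidth.
std::vector<float> twoPasses(const std::vector<float>& in) {
    std::vector<float> tmp(in.size()), out(in.size());
    for (size_t i = 0; i < in.size(); ++i) tmp[i] = in[i] * 2.0f;   // pass 1
    for (size_t i = 0; i < in.size(); ++i) out[i] = tmp[i] + 1.0f;  // pass 2
    return out;
}

// Fused kernel: the intermediate lives in a register ('t'), raising the
// ops-per-byte-of-memory-traffic ratio (arithmetic intensity).
std::vector<float> fusedKernel(const std::vector<float>& in) {
    std::vector<float> out(in.size());
    for (size_t i = 0; i < in.size(); ++i) {
        float t = in[i] * 2.0f;  // stays local
        out[i] = t + 1.0f;
    }
    return out;
}

int main() {
    std::vector<float> in = {1, 2, 3, 4};
    // Same result either way; the fused version just moves less data.
    return fusedKernel(in) == twoPasses(in) ? 0 : 1;
}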
Domain-specific language example: Marble shader in RTSL

float turbulence4_imagine_scalar(texref noise, float4 pos) {
  fragment float4 addr1 = pos;
  fragment float4 addr2 = pos * {2, 2, 2, 1};
  fragment float4 addr3 = pos * {4, 4, 4, 1};
  fragment float4 addr4 = pos * {8, 8, 8, 1};
  fragment float val;
  val = (0.5) * texture(noise, addr1)[0];
  val = val + (0.25) * texture(noise, addr2)[0];
  val = val + (0.125) * texture(noise, addr3)[0];
  val = val + (0.0625) * texture(noise, addr4)[0];
  return val;
}

float3 marble_color(float x) {
  float x2;
  x = sqrt(x + 1.0) * .7071;
  x2 = sqrt(x);
  return { .30 + .6*x2, .30 + .8*x, .60 + .4*x2 };
}

surface shader float4 shiny_marble_imagine(texref noise) {
  float4 Cd = lightmodel_diffuse({ 0.4, 0.4, 0.4, 1 }, { 0.5, 0.5, 0.5, 1 });
  float4 Cs = lightmodel_specular({ 0.35, 0.35, 0.35, 1 }, Zero, 20);
  fragment float y;
  fragment float4 pos = Pobj * {10, 10, 10, 1};
  y = pos[1] + 3.0 * turbulence4_imagine_scalar(noise, pos);
  y = sin(y*pi);
  return ({marble_color(y), 1.0f} * Cd + Cs);
}
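For readers unfamiliar with RTSL, the fragment qualifier marks per-element values, so the shader body conceptually runs as an independent computation for every fragment. Below is a rough C++ analogue of turbulence4_imagine_scalar with a stand-in noise() function; it is a sketch of the data-parallel semantics, not the output of the RTSL compiler.

// Rough C++ analogue of the RTSL turbulence kernel: each fragment's value
// is computed independently, which is the data parallelism streams expose.
// noise() is a placeholder for the texture(noise, addr)[0] lookup.
#include <cmath>
#include <vector>

struct Float4 { float x, y, z, w; };

float noise(const Float4& p) {  // stand-in for a noise texture lookup
    return 0.5f * (std::sin(p.x * 12.9898f + p.y * 78.233f + p.z * 37.719f) + 1.0f);
}

float turbulence4(const Float4& pos) {
    Float4 a2 = {pos.x * 2, pos.y * 2, pos.z * 2, pos.w};
    Float4 a3 = {pos.x * 4, pos.y * 4, pos.z * 4, pos.w};
    Float4 a4 = {pos.x * 8, pos.y * 8, pos.z * 8, pos.w};
    return 0.5f * noise(pos) + 0.25f * noise(a2)
         + 0.125f * noise(a3) + 0.0625f * noise(a4);
}

// The "fragment" semantics: apply the kernel independently to every element.
std::vector<float> runOverFragments(const std::vector<Float4>& positions) {
    std::vector<float> out;
    out.reserve(positions.size());
    for (const Float4& p : positions) out.push_back(turbulence4(p));
    return out;
}

int main() {
    std::vector<Float4> fragments = {{0, 0, 0, 1}, {1, 2, 3, 1}};
    return runOverFragments(fragments).size() == 2 ? 0 : 1;
}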
Stream-level application description example: SHARP Raytracer
• Computation expressed as streams of records passing through kernels
• Similar to the computation required for Monte Carlo radiation transport
[diagram: kernels Ray Gen → Traverser → Intersector → Shader, with Camera, Grid, Triangles, and Lights/Normals/Materials as inputs; streams of Rays, VoxIDs, and Hits flow between kernels, and Pixels are the output]
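A minimal C++ sketch of the "streams of records through kernels" structure named on this slide, using the kernel names from the diagram; the record fields and kernel bodies are simplified placeholders rather than the SHARP implementation.

// Sketch of a stream-level pipeline: records (rays, hits) flow between
// kernels named after the slide (Ray Gen -> Traverser -> Intersector ->
// Shader). Record fields and kernel bodies are simplified placeholders.
#include <vector>

struct Ray   { float ox, oy, oz, dx, dy, dz; int pixel; };
struct Hit   { int pixel; float t; int triangle; };
struct Pixel { int id; float r, g, b; };

std::vector<Ray> rayGen(int width, int height) {            // camera -> rays
    std::vector<Ray> rays;
    for (int y = 0; y < height; ++y)
        for (int x = 0; x < width; ++x)
            rays.push_back({0, 0, 0, float(x), float(y), 1.0f, y * width + x});
    return rays;
}

std::vector<Ray> traverse(const std::vector<Ray>& rays) {   // grid traversal (stub)
    return rays;                                            // would attach voxel IDs
}

std::vector<Hit> intersect(const std::vector<Ray>& rays) {  // ray/triangle tests (stub)
    std::vector<Hit> hits;
    for (const Ray& r : rays) hits.push_back({r.pixel, 1.0f, 0});
    return hits;
}

std::vector<Pixel> shade(const std::vector<Hit>& hits) {    // lights/materials (stub)
    std::vector<Pixel> pixels;
    for (const Hit& h : hits) pixels.push_back({h.pixel, 0.5f, 0.5f, 0.5f});
    return pixels;
}

int main() {
    auto pixels = shade(intersect(traverse(rayGen(4, 4))));
    return pixels.size() == 16 ? 0 : 1;
}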
Expected application performance
• Arithmetic-limited applications
  • Include applications where domain decomposition can be applied, like TFLO and LES
  • Expected to achieve a large fraction of peak performance
• Communication-limited applications
  • Such as applications requiring a matrix solution Ax = b
  • Will at the very least benefit from high global bandwidth
  • We hope to find new methods of solving matrix equations using streaming
Conclusion
• Computation is cheap, yet supercomputing is expensive
• Streams enable supercomputing to exploit the advantages of emerging technology by exposing locality and concurrency
• Order-of-magnitude cost/performance improvement for both arithmetic-limited and communication-limited codes
  • $20/GFLOPS and $2/M-GUPS
  • Scalable from desktop (1 TFLOPS) to machine room (1 PFLOPS)
• A layered software system using domain-specific languages simplifies stream programming
  • MCRT, ODEs, PDEs
• Early results on graphics and image processing are encouraging
Project Goals for Fall Quarter AY2001-2002
• Map two applications to the stream model
  • Fluid flow (TFLO) and molecular dynamics are candidates
• Define a high-level stream programming language
  • Generalize stream access without destroying locality
• Draft a strawman SSS architecture and identify key issues
Meeting Schedule, Fall Quarter AY2001-2002
Goal: a shared knowledge base and vision across the project
• 10/9 – TFLO (Juan)
• 10/16 – RTSL (Bill M.)
• 10/23 – Molecular Dynamics (Eric)
• 10/30 – Imagine and its programming system (Ujval)
• 11/6 – C*, ZPL, etc. + SPL brainstorming (Ian)
• 11/13 – Metacompilation (Ben C.)
• 11/20 – Application follow-up (Ron/Heinz)
• 11/27 – Strawman architecture (Ben S.)
• 12/4 – Streams vs. CMP (Blue Gene/Light, etc.) (Bill D.)