Stream Architecture: Rethinking Media Processor Design

Scott Rixner
April 9, 2001
Rice University Computer Systems Laboratory

Media Processing
• Video/image compression & decompression: MPEG, JPEG, ...
• Signal processing: DSL modems, cellular base stations, ...
• Image synthesis: polygon rendering, image-based rendering, ...
• Image understanding: face recognition, depth extraction, ...

Stereo Depth Extraction
[Figure: left camera image and right camera image, producing a depth map]
• 640x480 @ 30 fps
• Requirements
  • 11 GOPS
• Imagine stream processor
  • 12.1 GOPS, 4.6 GOPS/W
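As a rough sanity check on these figures, the sketch below derives the operation budget per pixel and the power implied by the quoted efficiency. It assumes the 11 GOPS requirement is spread evenly over one 640x480 depth map per frame at 30 fps; the function names are illustrative and do not come from the talk.

```python
# Figures quoted on the slide; everything derived from them is an estimate.
WIDTH, HEIGHT, FPS = 640, 480, 30
REQUIRED_GOPS = 11.0
IMAGINE_GOPS = 12.1
IMAGINE_GOPS_PER_WATT = 4.6


def ops_per_pixel(gops, width, height, fps):
    """Operations per pixel at a given rate (assumes one output pixel per input pixel)."""
    pixels_per_second = width * height * fps
    return gops * 1e9 / pixels_per_second


def implied_power_watts(gops, gops_per_watt):
    """Power implied by a throughput figure and an efficiency figure."""
    return gops / gops_per_watt


required = ops_per_pixel(REQUIRED_GOPS, WIDTH, HEIGHT, FPS)
power = implied_power_watts(IMAGINE_GOPS, IMAGINE_GOPS_PER_WATT)
print(f"about {required:.0f} operations per pixel required")
print(f"about {power:.2f} W implied at {IMAGINE_GOPS} GOPS")
```

Under these assumptions the requirement works out to roughly 1,200 operations per pixel, which is consistent with the later claim that media applications are compute intensive.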

Outline
• Stream Processing
• VLSI Constraints
• Register Organization
• Imagine
• Conclusions

Media Processing Characteristics
• Low-precision data
  • 24% 8-bit integer operations
  • 29% 16-bit integer operations
• Abundant data-parallelism
• Little global data reuse
  • Average of 1.5 references per global data word
• Numerous computations per global reference
  • 50-500 operations per global data reference

Stream Processing
[Figure: Image 0 and Image 1 (input data) each pass through two convolve kernels; the resulting streams feed a SAD kernel that produces the depth map (output data)]
• Little data reuse (pixels never revisited)
• Highly data parallel (output pixels not dependent on other output pixels)
• Compute intensive (>60 operations per memory reference)
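The kernel-and-stream structure in the figure can be sketched with Python generators. This is a toy illustration, not the Imagine programming model: the kernels work on 1-D lists of integers, the filter taps and window size are made up, and the SAD kernel compares the two streams at a single alignment rather than searching over disparities as a real depth extractor would.

```python
from collections import deque


def convolve(stream, taps):
    """Kernel: 1-D FIR filter. Holds only a small local window of the stream."""
    window = [0] * len(taps)
    for sample in stream:
        window = [sample] + window[:-1]  # shift in the newest element
        yield sum(t * w for t, w in zip(taps, window))


def sad(left, right, window=3):
    """Kernel: sum of absolute differences over a sliding window of two streams."""
    recent = deque(maxlen=window)
    for a, b in zip(left, right):
        recent.append(abs(a - b))
        yield sum(recent)


# Hypothetical input rows standing in for Image 0 and Image 1.
image0 = [1, 2, 3, 4, 5]
image1 = [1, 2, 4, 4, 6]
taps = [1, 1]

# Each image passes through two convolve kernels before reaching SAD,
# mirroring the chain in the figure. Elements flow through once and are
# never revisited.
filtered0 = convolve(convolve(iter(image0), taps), taps)
filtered1 = convolve(convolve(iter(image1), taps), taps)
result = list(sad(filtered0, filtered1))
print(result)
```

Each kernel touches only its own small window, and the only data passed between kernels is the stream itself, which is the locality the slide describes.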

Locality and Concurrency
[Figure: Image 0 and Image 1 each pass through two convolve kernels, then a SAD kernel produces the depth map]
• Operations within a kernel operate on local data
• Kernels can be partitioned across chips to exploit control parallelism
• Streams expose data parallelism

Sony PlayStation2
[Block diagram labels: Emotion Engine, FPU, VPU0, VPU1, MIPS core, IPU, Graphics Synthesizer, display; RDRAM, I/O, DMAC, etc.]

Special vs. General Purpose
[Figure labels: instruction cache, IP, IR, registers]
• Special Purpose
  • Fixed function
  • High performance
• General Purpose
  • Programmable
  • Insufficient performance

Register Files Dwarf ALUs

Register File Area
[Figure: register bit cell]
• Each cell requires:
  • 1 word line per port
  • 1 bit line per port
• Each cell grows as $p^2$
• $R$ registers in the file
• Area: $p^2 R \propto N^3$

Register File Access Delay
[Figure: register file]
• Signal must traverse:
  • Word line to access cell
  • Bit line to transfer data
• Wire capacitance dominates
• Delay: $p R^{1/2} \propto N^{3/2}$

Register File Power Dissipation
[Figure: register file]
• 100% utilization requires driving all $p R^{1/2}$ bit lines
• Wire capacitance dominates
• Power: $p^2 R \propto N^3$

Centralized Register Organization
• Area, Power $\propto N^3$; Delay $\propto N^{3/2}$
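The scaling relations from the register file slides can be tabulated in a few lines. This is a minimal sketch, assuming that both the port count p and the register count R grow linearly with the number of ALUs N; the constants `ports_per_alu` and `regs_per_alu` are hypothetical, and the results are unitless proportionality terms rather than physical measurements.

```python
def central_register_file(n_alus, ports_per_alu=3, regs_per_alu=16):
    """First-order cost terms for one register file shared by all ALUs.

    ports_per_alu and regs_per_alu are illustrative assumptions.
    """
    p = ports_per_alu * n_alus   # ports grow with the number of ALUs
    r = regs_per_alu * n_alus    # so does the number of registers
    return {
        "area": p**2 * r,        # area  ~ p^2 R   ~ N^3
        "delay": p * r**0.5,     # delay ~ p R^1/2 ~ N^(3/2)
        "power": p**2 * r,       # power ~ p^2 R   ~ N^3
    }


base = central_register_file(8)
doubled = central_register_file(16)
growth = {name: doubled[name] / base[name] for name in base}
print(growth)
```

Doubling the number of ALUs multiplies area and power by 8 and delay by about 2.8, whatever constants are chosen, which is why a centralized file stops scaling.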

Partitioned Organizations
• SIMD
  • Data-parallel axis
• Distributed Register Files (DRF)
  • Instruction-level parallel axis
• Hierarchical
  • Memory hierarchy axis
• Stream
  • Optimizing for streams

SIMD Register Organization
• Area, Power $\propto N^3/C^2$; Delay $\propto (N/C)^{3/2}$

Distributed Register Organization
• Area, Power $\propto N^2$; Delay $\propto N$

Combining SIMD and DRF
[Figure labels: Scalar, SIMD, Central, DRF]

Hierarchical Register Organization
• Area, Power $\propto N^3$; Delay $\propto N^{3/2}$
[Figure label: Hierarchical, T=40]

Hierarchical Organizations
[Figure labels: Scalar, SIMD, Central, DRF]

Stream Register Organization
• Area, Power $\propto N^2/C$; Delay $\propto N/C$

Stream Organizations
[Figure labels: Scalar, SIMD, Central, DRF]

Comparison of Organizations
• 48 ALUs (32-bit), 500 MHz
• Stream organization improves on the central organization by:
  • Area: 195x
  • Delay: 20x
  • Power: 430x
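To see how the organizations diverge, the asymptotic terms quoted on the earlier slides can be evaluated side by side. The sketch below assumes N = 48 ALUs (from this slide) and C = 8 clusters (the cluster count shown for Imagine); it drops every constant factor, so it illustrates trends only and is not expected to reproduce the 195x, 20x, and 430x figures above.

```python
def scaling_terms(n, c):
    """Leading-order area (same as power) and delay terms per organization.

    n is the number of ALUs, c the number of SIMD clusters.
    Constant factors are deliberately omitted.
    """
    return {
        "central": {"area": n**3, "delay": n**1.5},
        "simd": {"area": n**3 / c**2, "delay": (n / c) ** 1.5},
        "drf": {"area": n**2, "delay": n},
        "stream": {"area": n**2 / c, "delay": n / c},
    }


terms = scaling_terms(n=48, c=8)
central = terms["central"]
for name, t in terms.items():
    area_gain = central["area"] / t["area"]
    delay_gain = central["delay"] / t["delay"]
    print(f"{name:8s} area/power gain {area_gain:7.1f}x  delay gain {delay_gain:5.1f}x")
```

Even without constants, the ordering matches the talk's argument: SIMD and DRF partitioning each remove one source of growth, and the stream organization combines both.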

Performance
[Figure annotations: 16% performance drop (8% with latency constraints); 180x improvement]

Stream Architecture
• Stream Processing
  • Matched to media processing
  • Exposes locality and concurrency
• Stream Register Organization
  • Efficiency of special-purpose hardware
  • Optimized for streaming applications
• Data bandwidth
  • Bandwidth hierarchy
  • Memory access scheduling
  • Conditional streams

The Imagine Stream Processor
[Block diagram labels: host processor, network, network interface, stream controller, microcontroller, stream register file, streaming memory system with four SDRAMs, ALU clusters 0 through 7]

Arithmetic Clusters
[Block diagram labels: local register file, six arithmetic units (+, +, +, *, *, /), scratch-pad register file, communication unit (CU), cross point, intercluster network, connections to and from the SRF]

Bandwidth Hierarchy
• 41.2 32-bit operations per word of memory bandwidth
[Figure: SDRAM (2 GB/s), stream register file (32 GB/s), ALU clusters (544 GB/s)]
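The three bandwidth figures are easier to compare as ratios. The sketch below assumes that 2 GB/s is the SDRAM level, 32 GB/s the stream register file, and 544 GB/s the ALU clusters (the slide lists the numbers without an explicit mapping), that a word is 4 bytes, and that 1 GB/s means 10^9 bytes per second. The operation rate it derives is an inference from the slide's own numbers, not a figure quoted in the talk.

```python
BYTES_PER_WORD = 4  # 32-bit words, as in the "operations per word" figure
OPS_PER_MEMORY_WORD = 41.2

# Assumed mapping of the slide's figures to hierarchy levels (GB/s).
bandwidth_gb_s = {
    "SDRAM": 2.0,
    "stream register file": 32.0,
    "ALU clusters": 544.0,
}

# Bandwidth at each level relative to off-chip memory.
ratios = {level: bw / bandwidth_gb_s["SDRAM"] for level, bw in bandwidth_gb_s.items()}

# Operation rate needed to match 41.2 operations per word of memory traffic.
memory_gwords_s = bandwidth_gb_s["SDRAM"] / BYTES_PER_WORD
implied_gops = OPS_PER_MEMORY_WORD * memory_gwords_s

for level, ratio in ratios.items():
    print(f"{level}: {ratio:.0f}x memory bandwidth")
print(f"implied rate: {implied_gops:.1f} GOPS")
```

Under these assumptions each level of the hierarchy supplies roughly 16 times the bandwidth of the level below it, which only pays off if most data stays in the upper levels.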

Stream Recirculation

Bandwidth Demands of FIR Filter

Bandwidth Utilization of FIR Filter

Performance
[Figure labels: 16-bit kernels, floating-point kernel, 16-bit applications, floating-point application]

Power
GOPS/W: 4.6, 6.9, 4.1, 10.2, 9.6, 2.4, 6.3

Relative Performance and Power Efficiency
[Figure labels: FFT, performance, power efficiency]

Imagine Floorplan
• Tapeout ~Q2 '01
• 21 million transistors
  • 6M SRF SRAM
  • 6M UC SRAM
  • 6M Clusters
  • 3M Other
• Target: 32 FO4
  • 300 MHz at SSSS
  • 500 MHz at TTSS
• TI GS30KA: 0.15 µm $L_{drawn}$
• 457 Signal Pins

Imagine Team
William J. Dally, Ujval Kapasi, Brucek Khailany, Peter Mattson, Jinyung Namkoong, John Owens, Ben Serebrin, Brian Towles, Scott Rixner, Don Alpert (Intel), Ghazi Ben Amor, Chris Buehler (MIT), JP Grossman (MIT), Brad Johanson, Abelardo Lopez-Lagunas, Ben Mowery, Manman Ren

Conclusions
• Media Processing
  • Little data reuse
  • Highly data parallel
  • Compute intensive
• VLSI
  • Stream register organization
  • Bandwidth hierarchy
• Imagine
  • Stream architecture
  • 10 GOPS sustained application performance
  • 5 GOPS/W application power efficiency
