Evaluating The Raw Microprocessor: Scalability and Versatility

Michael Taylor, Walter Lee, Jason Miller, David Wentzlaff, Ian Bratt, Ben Greenwald, Henry Hoffmann, Paul Johnson, Jason Kim, James Psota, Arvind Saraf, Nathan Shnidman, Volker Strumpen, Matt Frank, Saman Amarasinghe, and Anant Agarwal. M.I.T.

Could processors be even more general purpose?
• “General Purpose” microprocessor: Spec, Office
• Custom chip: Video/3D Graphics, Network, Encryption, Wireless/Cell Phone, Digital Camera, MP3 Player, Automotive
• A square inch of silicon gets more powerful every generation.
• Why can custom chips run these apps?

Custom Chips: Efficient Extraction of Parallelism
• 10’s, 100’s or 1000’s of parallel operators (GP micro: 3-8)
• 10’s or 100’s of parallel memory ports (GP micro: 2)
• 10’s or 100’s of parallel I/O ops (GP micro: 1)
• Customized placement and routing of operators & operands
  - High locality
  - Minimum control
  - Operands routed over wires, not through register files
• Area and power efficient
But, not general purpose! Can’t run GCC.

The Raw Goal
Create an architecture that:
• Scales to 100’s-1000’s of functional units and memory ports by exploiting custom-chip-like features, in particular application-specific routing of operands
• … while being “general purpose”:
  - Run ILP-based sequential programs
  - Support standard general-purpose abstractions, like context switching, caching and instruction virtualization
[IEEE Micro, “Billion Transistor” Issue, 1997]

Un-buildable Super-Wide Issue GP
[Figure: PC, Control and Wide Fetch (16 inst) feeding 16 ALUs through a single RF and Bypass Net, with a Unified Load/Store Queue]

Area and Frequency Scalability Problems
[Figure: N ALUs connected to an RF and a Bypass Net, with area annotations ~N^2 and ~N^3. Ex: Itanium 2]
Without modification, freq decreases linearly or worse.
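The growth rates on this slide can be made concrete with a small sketch. It assumes only that a structure's cost grows as N^k in the number of ALUs N. The function name and the normalization to a 1-ALU baseline are illustrative, and the slide text as extracted does not say which structure carries which exponent.

```python
def relative_cost(n_alus: int, exponent: int) -> int:
    """Cost of an N-ALU structure relative to a 1-ALU one, assuming cost ~ N**exponent."""
    if n_alus < 1:
        raise ValueError("need at least one ALU")
    return n_alus ** exponent


if __name__ == "__main__":
    # Tabulate both growth rates for a few ALU counts.
    for n in (1, 2, 4, 8, 16):
        print(f"N={n:2d}  ~N^2: {relative_cost(n, 2):5d}  ~N^3: {relative_cost(n, 3):5d}")
```

Under this model, going from 4 to 16 ALUs multiplies an ~N^2 structure by 16 and an ~N^3 structure by 64, while the ALU count itself only grows by 4.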

Operand Routing is Global
[Figure: 16 ALUs sharing one RF and Bypass Net; an operand passes from a + instruction to a >> instruction across the global bypass network]

Idea: Exploit Locality
[Figure: 16 ALUs with RF and Bypass Net]

Replace the crossbar with a point-to-point, pipelined, routed network.
[Figure: 16 ALUs and RF connected by the routed network]

Replace the crossbar with a point-to-point, pipelined, routed network.
[Figure: the same ALU array and RF, with + and >> instructions placed on ALUs]

Operand Transport Scaling - Bandwidth and Area
• Scales as 2-D VLSI.
• If we want to keep our ALUs busy, we had better map communicating instructions nearby so communication is local.

Operand Transport Scaling - Latency
• Latency: time for an operand to travel between instructions mapped to different ALUs.
• If we want to make sure that a latency-bound program doesn’t slow down when more ALUs are added, we must map the instructions to ALUs in a local fashion. [ASPLOS98]
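Why local mapping matters for latency can be sketched with a toy distance model. The assumptions here are mine, not the slides': ALUs sit on a square 2-D mesh with row-major ids, and operand latency is proportional to the Manhattan hop count. The function names are hypothetical.

```python
import math


def hops(src: int, dst: int, side: int) -> int:
    """Manhattan distance between two ALUs on a side x side mesh (row-major ids)."""
    src_row, src_col = divmod(src, side)
    dst_row, dst_col = divmod(dst, side)
    return abs(src_row - dst_row) + abs(src_col - dst_col)


def worst_case_hops(n_alus: int) -> int:
    """Corner-to-corner hop count for a square mesh of n_alus ALUs."""
    side = math.isqrt(n_alus)
    if side * side != n_alus:
        raise ValueError("this sketch assumes a square mesh")
    return hops(0, n_alus - 1, side)


if __name__ == "__main__":
    for n in (16, 64, 1024):
        side = math.isqrt(n)
        print(f"{n:5d} ALUs: neighbor = {hops(0, 1, side)} hop, "
              f"corner to corner = {worst_case_hops(n)} hops")
```

In this model a pair of communicating instructions mapped to adjacent ALUs stays one hop apart however large the array grows, while a careless mapping pays a distance that grows with the side of the mesh.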

Distribute the Register File
[Figure: 16 ALUs, each paired with its own RF]

[Figure: PC, Control, Wide Fetch (16 inst) and Unified Load/Store Queue alongside the distributed RFs and ALUs; annotation: SCALABLE]

More Scalability Problems
[Figure: PC, Wide Fetch (16 inst), Control, Unified Load/Store Queue]

Distribute the rest. [ISCA99]
[Figure: the centralized PC, Wide Fetch (16 inst), Control and Unified Load/Store Queue replaced by 16 copies of PC, RF, I$, ALU and D$]

Tiles!
[Figure: 16 tiles, each with its own PC, RF, I$, ALU and D$]

Tiled Processor Architectures
- composed of a replicated tile
- all signals registered at tile boundaries
- NO global signals
- wire delay problem much easier
- easy scalability story
Easier to Tune the Frequency
Easier to Verify
Easier to do the Physical Design

Raw Compute Internals
[Figure: tile array of PC, RF, I$, ALU and D$ blocks; registers r24, r25, r26 and r27; pipeline-stage labels IF, D, A, TL, E, M1, M2, F, P, U]

We could not find this type of network in Patterson & Hennessy.
• Optimizes time for delivery of scalar operands between functional units
• We conceptualized this idea into the term “scalar operand network”, or SON
• CMP: 15-100 cycles
• iWarp: 12 cycles
• Raw: 3 cycles
• Alpha 21264: 1 cycle
• Superscalar: 0 cycles
[Figure: tile array of PC, RF, I$, ALU and D$ blocks; annotations “scalable” and “Intended for use as SON”]
HPCA 2003 – “Scalar Operand Networks”
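The cycle counts above can be put in perspective with a rough sketch. It reads them as per-transfer operand latencies and applies a simple additive model to a chain of dependent operations, each mapped to a different ALU. The model and the name chain_cycles are assumptions for illustration, not measurements from the talk.

```python
# Operand-delivery latencies in cycles, as listed on the slide.
# CMP is given as a range (15-100), so both ends are kept.
SON_LATENCY_CYCLES = {
    "CMP (low)": 15,
    "CMP (high)": 100,
    "iWarp": 12,
    "Raw": 3,
    "Alpha 21264": 1,
    "Superscalar": 0,
}


def chain_cycles(n_ops: int, op_cycles: int, transfer_cycles: int) -> int:
    """Cycles for a chain of n_ops dependent ops, each on a different ALU.

    Assumed model: every op takes op_cycles, and every hand-off between
    consecutive ops adds transfer_cycles.
    """
    if n_ops < 1:
        raise ValueError("need at least one operation")
    return n_ops * op_cycles + (n_ops - 1) * transfer_cycles


if __name__ == "__main__":
    for name, latency in SON_LATENCY_CYCLES.items():
        print(f"{name:12s} 10-op chain: {chain_cycles(10, 1, latency)} cycles")
```

Under this model a 10-operation chain of single-cycle ops costs 10 cycles with a 0-cycle network, 37 cycles at 3 cycles per transfer, and 145 cycles at 15 cycles per transfer, which is why a few cycles of operand latency decide whether fine-grained parallelism is worth distributing.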

Evaluation of Raw
- holistic approach
- design a complete architecture
- design and build the processor and enclosing system
- build the compilers
- used the chip in real systems
- head-to-head versus Intel chip in same litho generation

Raw 180 nm ASIC (IBM SA-27E)
• 16 tiles
• Core frequency: 425 MHz @ 1.8 V, 500 MHz @ 2.2 V
• Frequency competitive with IBM-implemented PowerPCs in same process
• 18 W (vpenta)
• Critical path: ≈ single-ported 32 KB SRAM + 14-bit mux + flip-flop

Raw Chips October 02

Raw motherboard
Support chipset implemented in FPGA (vs. custom ASICs for P3)

Comparison to Pentium 3
Honest: Self-comparisons hide architectural and compiler inefficiency. People can now compare to P3 and, by extension, to Raw.
What’s hard: Normalization between processors is very tricky, especially for academic projects versus indu$try.
- ASIC cannot attain the same frequencies.
Our solution:
- Pick closest Intel processor implementation
- Don’t scale any numbers in any way.

Methodology - HW
Intel: Pentium III Coppermine 600 MHz
- Dell Precision 410, stocked with 2-2-2 PC100 DRAM
Raw: Validated cycle-accurate simulator
- Matches RTL for Raw chip to the precise cycle for all 200,000+ lines of test code
Simulator used so we could:
- Normalize motherboard + DRAM timings
- Replace (research) software i-caching system with conventional hardware i-cache

Methodology - SW
When applicable:
- normalize compiler:
  P3: gcc 3.3 -O3 -march=pentium3 -mfpmath=sse
  Raw: gcc 3.3 -O3 (non-parallelizing)
- normalize stdio/stdlib:
  P3 & Raw: Newlib 1.9.0 w/ Deionizer
P3: Intel Performance Primitives LAPACK/BLAS with SSE for linear algebra routines
Raw:
- rawcc: home-brew parallelizing compiler
- StreamIt: home-brew parallelizing compiler
- gcc 3.3 + snippets of inline assembly for some parallel apps

Performance Survey

Sources of Speedup vs. P3 or 1 Tile

Future Work: Raw supercomputing fabric
Emulator of a 1K-tile Raw chip, circa 2010
… Ultimate test of scaling

Related Work: AsTrO Taxonomy
• Assignment (Static/Dynamic): Is instruction assignment to ALUs predetermined?
• Transport (Static/Dynamic): Are operand routes predetermined?
• Ordering (Static/Dynamic): Is the execution order of instructions assigned to a node predetermined?
[Figure: ALUs holding instructions such as +, /, &, % and >>]
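The three questions of the taxonomy map naturally onto a small data type. The sketch below is only an illustration: the class names, the one-letter codes and the helper all_classes are hypothetical, and no real processor is classified here because the slides' tree figure did not survive extraction.

```python
from dataclasses import dataclass
from enum import Enum
from itertools import product


class Binding(Enum):
    STATIC = "S"    # decided ahead of time (answer "yes, predetermined")
    DYNAMIC = "D"   # decided at run time


@dataclass(frozen=True)
class AsTrO:
    assignment: Binding  # Is instruction assignment to ALUs predetermined?
    transport: Binding   # Are operand routes predetermined?
    ordering: Binding    # Is execution order on a node predetermined?

    def code(self) -> str:
        """Three-letter summary, e.g. 'SSS' for static on all three axes."""
        return self.assignment.value + self.transport.value + self.ordering.value


def all_classes() -> list[AsTrO]:
    """Enumerate the 2 x 2 x 2 = 8 points of the design space."""
    return [AsTrO(a, t, o) for a, t, o in product(Binding, repeat=3)]


if __name__ == "__main__":
    print(" ".join(c.code() for c in all_classes()))
```

Each architecture in a comparison then becomes a single AsTrO value, which makes it easy to group designs that share an axis.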

How Raw relates to other distributed microprocessors using AsTrO taxonomy
[Figure: decision tree over Assignment, Transport and Ordering, each Static or Dynamic, with leaves GRID [01], WaveScalar [03], OOO-Superscalar, RawDyn [00], Raw [97], Scale [04], ILDP [00]]

Conclusions
• VLSI-scalable microprocessors are possible.
• Constant factors are beginning to give way to asymptotics:
  - 16 ALU Raw: Oct 2002
  - 64 ALU Raw: Now
  - 1,024 ALU Raw: 2010
  - 32,768 ALU Raw: If Moore’s Law makes it to 2 nm
• There is an opportunity to make processors more “versatile”, i.e., steal applications from custom chips.
• Tiled Processor Architectures are a promising approach and merit further research.

Embedded system: 1020 Element Microphone Array
