
An Evaluation of the TRIPS Computer System


Presentation Transcript


1. An Evaluation of the TRIPS Computer System
   Mark Gebhart
   Bertrand A. Maher, Katherine E. Coons, Jeff Diamond, Paul Gratz, Mario Marino, Nitya Ranganathan, Behnam Robatmili, Aaron Smith, James Burrill, Stephen W. Keckler, Doug Burger, Kathryn S. McKinley
   The University of Texas at Austin
   ASPLOS '09, March 9, 2009

2. Microprocessor Challenges
   • Well-known challenges
     • Single-threaded performance, communication, concurrency, power efficiency, agility
   • Hypothesis: revisiting the HW/SW boundary could address these challenges
     • ISA: statically encode data dependences to expose concurrency
     • Microarchitecture: distributed to eliminate global wires
   • Question: how well does the overall system actually perform?
     • ISA
     • Microarchitecture
     • Performance comparison to industrial processors

3. TRIPS Prototype
   • Partitioned chip design
   • Processor resources: 16 FPUs, 128 registers [MICRO '06]
   • NUCA cache: 32 KB L1 D-cache, 80 KB L1 I-cache, 1 MB L2 cache [ASPLOS '02]
   • Evaluation focuses on 1 processor

4. Single Processor Microarchitecture
   • Distributed design
     • Partitioned memory system
     • On-chip networks
   • Large instruction window
     • Window size of 1024 instructions
     • Order of magnitude larger than conventional processors when full
   • TRIPS tiles: G – Global Control Tile, I – L1 Instruction Cache, D – L1 Data Cache, R – Register File, E – Execution Unit

5. TRIPS Execution Model
   • EDGE: Explicit Data Graph Execution
     • Block-atomic execution model
     • Dataflow execution within blocks
   • Block-atomic execution model
     • Up to 128 instructions execute as one atomic unit
     • Amortizes execution bookkeeping
     • Single entry point, multiple exit points
   • Speculative microarchitecture: up to 8 blocks in flight
     • 1 non-speculative block
     • 7 speculative blocks
   [Figure: the example "int A, B, C; if (B > C) A = B + C; else A = B - C;" mapped across Block 1, Block 2, and Block 3 (a C sketch of this example follows)]
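To make the block-atomic model concrete, the if/else from the figure can also be written in the branch-free form shown on the next slide. A minimal C sketch of the two equivalent forms (the variable values here are illustrative, not from the paper):

    #include <stdio.h>

    int main(void) {
        int B = 7, C = 3, A;

        /* Branchy form: a compare and a conditional branch on a conventional ISA. */
        if (B > C)
            A = B + C;
        else
            A = B - C;
        printf("branchy:    A = %d\n", A);

        /* Predicated form: both the add and the subtract become instructions in
           the same TRIPS block, each guarded by the result of the test (B > C);
           only the instruction whose predicate matches produces A. */
        A = (B > C) ? B + C : B - C;
        printf("predicated: A = %d\n", A);

        return 0;
    }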

6. TRIPS ISA
   • Predication
     • Almost all instructions can be predicated
     • Multiple control paths execute within a single block
   • Direct instruction communication
     • Values passed directly from producer to consumer instructions
     • Reduces accesses to global structures
   • Overheads
     • Fanout instructions when there are a large number of consumers (see the sketch below)
     • Conditional moves to facilitate predication
   [Figure: dataflow graph for A = (B > C) ? B + C : B - C — reads of &B and &C, two loads, a test (>), a predicated add (true path) and sub (false path), and a store to &A]
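The fanout overhead can be seen in a minimal C sketch (the variables are illustrative, not from the paper): a single produced value with several consumers costs nothing extra on a register-based ISA, but with direct producer-to-consumer communication the compiler must insert move (fanout) instructions to replicate the operand.

    #include <stdio.h>

    int main(void) {
        int t = 42;       /* single producer of t */

        /* Four consumers of t. A register-based ISA simply reads the same
           register four times; with direct instruction communication the
           compiler inserts fanout (mov) instructions to deliver t to each
           consumer, which is the overhead the slide describes. */
        int a = t + 1;
        int b = t * 2;
        int c = t - 3;
        int d = t & 7;

        printf("%d %d %d %d\n", a, b, c, d);
        return 0;
    }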

7. Instruction Scheduling
   • TRIPS scheduler: exploit concurrency, minimize communication delays (a toy cost sketch follows below)
   • Scheduler automatically generates efficient schedules [ASPLOS '06]
   [Figure: the dataflow graph from the previous slide placed by the TRIPS scheduler onto processor tiles — global control, data cache, register file, and execution units]
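A toy cost model illustrates the placement trade-off the scheduler navigates (a hedged sketch, not the published scheduling algorithm): dependent instructions placed on nearby execution tiles pay fewer operand-network hops, assuming the one-cycle-per-hop static delay cited later in the deck.

    #include <stdio.h>
    #include <stdlib.h>

    /* Toy model: instructions placed on a 4x4 grid of execution tiles.
       Static operand-network delay is assumed to be one cycle per hop
       (Manhattan distance). */
    struct placement { int row, col; };

    static int hop_delay(struct placement producer, struct placement consumer) {
        return abs(producer.row - consumer.row) + abs(producer.col - consumer.col);
    }

    int main(void) {
        struct placement producer      = {0, 0};
        struct placement near_consumer = {0, 1};   /* adjacent tile */
        struct placement far_consumer  = {3, 3};   /* opposite corner */

        printf("near placement: %d cycle(s) of network delay\n",
               hop_delay(producer, near_consumer));
        printf("far placement:  %d cycle(s) of network delay\n",
               hop_delay(producer, far_consumer));
        return 0;
    }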

8. Outline
   • TRIPS Background
   • Experimental Results
     • Evaluation Goals
     • Methodology
     • TRIPS ISA
     • Microarchitecture
     • Performance Comparison to Conventional Processors
   • Lessons Learned
   • Conclusions

9. ISA Questions
   • How large are blocks?
     • Larger blocks generally improve performance
     • Larger blocks complicate instruction placement
     • Larger blocks typically require more predication
   • What fraction of a block's instructions are useful?
     • Because of predication, not all instructions fetched and executed are needed
     • Dataflow overhead instructions
   • What is the effect of using direct instruction communication?
     • Reduces accesses to global structures
     • Overhead of fanning out results to a large number of consumers

10. Microarchitecture Questions
   • What is the utilization of the large instruction window?
     • Block size
     • Microarchitectural flushes / stalls
       • Branch mispredictions
       • Instruction cache misses
   • What is the effect of the distributed design?
     • Delays from on-chip networks
       • Static delay (1 cycle per hop)
       • Dynamic delay (from contention)
     • Bandwidth of the partitioned memory system

11. Methodology
   • Benchmarks
     • 53 compiled benchmarks [CGO '06]
       • SPEC2K, EEMBC, VersaBench, signal processing
     • 13 hand-optimized benchmarks
       • Ideal loop unrolling
       • Inlining
       • Advanced register allocation policies
       • Block merging
     • 2 hand-optimized and hand-scheduled kernels
       • Minimize OPN contention
       • Maximize bandwidth utilization
   • Measurements
     • ISA and microarchitecture studies: hardware and simulation
     • Performance (cycles): hardware

12. Block Size
   • Compiler (generally) produces large blocks
   • Hand optimizations
     • Blocks sometimes smaller due to better scalar optimizations
     • Potential for larger blocks

13. Block Composition
   • Speculation leads to useless instructions, especially on control-intensive benchmarks
   • Dataflow overhead instructions are a large component of blocks

14. ISA Evaluation
   • Direct instruction communication
     • Memory ops → register ops
     • Register ops → direct operand communication
   • Dynamic code size increases
     • 11x Alpha
       • 55% from nops, read/write instructions, and block headers
       • 18% from useless (predicated) instructions
       • 27% from instruction replication optimizations
     • 4 SPEC benchmarks have I-cache miss rates higher than 10%, with a maximum of 40%
     • 6x Alpha when using variable-sized blocks
   [Chart: results normalized to RISC (Alpha) for SPEC2K FP (compiled), SPEC2K INT (compiled), EEMBC (compiled), and simple benchmarks (compiled and hand-optimized)]

15. Instruction Window
   • Speculation leads to useless instructions in the window
   • Microarchitectural events lead to low utilization
     • Branch mispredictions (>10%): crafty, gcc, perlbmk
     • I-cache misses (>14%): crafty, twolf, apsi, mesa

16. Distributed Microarchitecture Bottlenecks
   • Operand network latency
     • Static delay (hop count): 30% performance loss
     • Dynamic delay (congestion): 12% performance loss
   • Partitioned L1 cache bandwidth
     • Vector add (see the C sketch below)
       • Compiler: ~2 memory ops/cycle; OPN contention is the bottleneck
       • Hand scheduling: 4 memory ops/cycle (4 is peak)
     • Matrix multiply [PPoPP '08]
       • Compiler: ~0.5 memory ops/cycle, 1 FLOP/cycle; bottlenecks are OPN contention, address fanout, and block formation
       • Hand scheduled: 3 memory ops/cycle, 7 FLOPs/cycle, 10 total ops/cycle
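For reference, the vector-add kernel in the bandwidth experiment has this simple shape (a minimal C sketch; array names and sizes are illustrative, and the hand-scheduled version discussed above was tuned at the TRIPS ISA level). Each element requires two loads and one store, so sustaining the peak of 4 memory operations per cycle depends on spreading those accesses across the partitioned L1 data-cache banks.

    #include <stdio.h>

    #define N 1024

    /* Vector add: each iteration performs two loads and one store, so the
       kernel is limited by L1 bandwidth rather than by arithmetic. */
    static void vector_add(const int *a, const int *b, int *c, int n) {
        for (int i = 0; i < n; i++)
            c[i] = a[i] + b[i];
    }

    int main(void) {
        static int a[N], b[N], c[N];
        for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2 * i; }
        vector_add(a, b, c, N);
        printf("c[10] = %d\n", c[10]);
        return 0;
    }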

17. Hardware Comparison
   • Cycle counts for benchmarks on all platforms
     • Comparing architectures, not technology generations
     • Wall-clock time (which factors in cycle time) does not make sense
   • Prototype simplifications
     • Lack of FP divide, 32-bit floats
     • Branch predictor table sizes

18. Hand-Optimized Performance
   • Compiled versions competitive with Core 2
   • Hand-optimized benchmarks outperform Core 2 by almost 3x

19. Compiled Code Performance
   • Compiled code is 60% slower than Core 2 on SPEC INT
   • TRIPS SPEC FP performance is nearly equal to Core 2

20. Lessons Learned
   • What worked well:
     • Dramatic reduction in data cache and register accesses
     • Compiler able to compile complex applications to a new architecture
     • Prototype competitive with Core 2 on simple and SPEC FP compiled codes
     • Distributed execution protocols (fetch, commit) generally off the critical path

21. Lessons Learned
   • Addressed challenges:
     • Distance to storage structures [MICRO '07]
     • Fixed execution width [MICRO '07]
     • Scalability of load/store queues [MICRO '07, ISCA '07]
     • Dependence prediction [ISCA '08]
     • Static routing latencies [MICRO '08]
   • Remaining challenges:
     • Dataflow overhead instructions
     • Dynamic code size
     • I-cache efficiency
     • Predication overhead
     • Generating efficient code for control-intensive benchmarks

22. Conclusions
   • Prototype was a successful research vehicle
     • Much better performance on regular, hand-optimized code
     • Performance competitive on compiled FP code
     • Noticeably worse performance on compiled INT code
   • Future work
     • Operand multicasts to reduce dataflow overhead instructions
     • Predicate prediction to speed the evaluation of predicated instructions
     • Variable-sized blocks to increase I-cache efficiency

23. Acknowledgements
   • TRIPS HW Team: Saurabh Drolia, Sibi Govindan, Divya Gulati, Paul Gratz, Heather Hanson, Haiming Liu, Changkyu Kim, Robert McDonald, Ramdas Nagarajan, Nitya Ranganathan, Karu Sankaralingam, Simha Sethumadhavan, Premkishore Shivakumar
   • TRIPS SW Team: Jim Burrill, Kevin Bush, Xia Chen, Katie Coons, Jeff Diamond, Sundeep Kushwaha, Madhavi Krishnan, Bert Maher, Mario Marino, Behnam Robatmili, Sadia Sharif, Aaron Smith, Bill Yoder
   • Supervising Professors: Doug Burger, Steve Keckler, Kathryn McKinley
