
Spatial Computation


Presentation Transcript


  1. Spatial Computation Mihai Budiu CMU CS CALCM Seminar, Oct 21, 2003

  2. CPU Problems • Design Complexity • Power • Global Signals • Limited issue window ⇒ limited ILP

  3. Communication vs. Computation [figure: wire vs. gate delay, 5 ps and 20 ps] Power consumption on wires is also dominant

  4. Global Communication [figure: conventional processor datapath with instruction unit, register file, and interconnection network]

  5. Our Approach: ASH (Application-Specific Hardware)

  6. 1) Unroll Pipeline [figure: the original processor (instruction unit, register file, network) unrolled into a chain of Reg/Network stages]

  7. Resource Binding Time [figure: when programs are bound to hardware resources, CPU vs. ASH]

  8. 2) Specialize Pipeline [figure: the unrolled pipeline (instruction unit plus Reg/Network stages) specialized for a fixed program]

  9. 2) Specialize Pipeline: Functional Units [figure: the functional units specialized for the fixed program]

  10. 2) Specialize Pipeline: Interconnection Network [figure: the general interconnection network specialized away; the instruction unit and Reg stages remain]

  11. 2) Specialize Pipeline: Register Files [figure: the register files specialized away; the instruction unit and constant values (0, 1) remain]

  12. 2) Specialize Pipeline: Shrink Wires [figure: the remaining wires shortened to local, point-to-point connections]

  13. 2) Specialize Pipeline: No Instruction Fetch, Decode, Issue [figure: the instruction unit removed; only the specialized datapath remains]

  14. Loops [figure: the specialized datapath with a back edge implementing the loop]

  15. Memory [figure: the spatial computation accesses memory through an LSQ (load-store queue)]

  16. Outline • Introduction • CASH: Compiling for ASH • ASH vs CPU • Analyzing the Results • Conclusions

  17. Application-Specific Hardware [figure: compilation flow: C program → Compiler → Dataflow IR → dataflow machine, realized as reconfigurable/custom hw]
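
  A rough, purely illustrative sketch of the C-to-dataflow step (the function name and the node/edge notation below are invented for exposition; this is not the actual CASH IR):

      /* Hypothetical example: straight-line C turned into a dataflow graph. */
      int saxpy(int a, int x, int y)
      {
          return a * x + y;          /* two operations, no control flow */
      }

      /* One possible dataflow rendering of the same function:
       *
       *    a ----\
       *           (*) ----\
       *    x ----/         (+) ----> return
       *    y --------------/
       *
       * Each operator becomes its own hardware node and each value becomes
       * a point-to-point wire, so there is no shared register file and no
       * instruction fetch/decode/issue. */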

  18. Asynchronous Computation [figure: an adder with an output latch; "data valid" and "ack" handshake signals connect producer and consumer]
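
  A minimal software model of the data-valid/ack handshake, assuming a single-slot channel between one producer and one consumer; the struct and function names are illustrative and do not correspond to the actual ASH control circuitry:

      #include <stdbool.h>
      #include <stdio.h>

      /* Toy single-slot channel: the producer may latch a new value only
       * after the consumer has acknowledged the previous one. */
      typedef struct {
          int  data;
          bool valid;   /* "data valid", raised by the producer */
          bool ack;     /* acknowledgement, raised by the consumer */
      } channel;

      static bool producer_put(channel *c, int value)
      {
          if (c->valid && !c->ack)
              return false;          /* previous item not yet consumed */
          c->data  = value;
          c->valid = true;
          c->ack   = false;
          return true;
      }

      static bool consumer_get(channel *c, int *out)
      {
          if (!c->valid)
              return false;          /* nothing to consume yet */
          *out     = c->data;
          c->ack   = true;           /* tell the producer the slot is free */
          c->valid = false;
          return true;
      }

      int main(void)
      {
          channel ch = {0};
          int v;
          producer_put(&ch, 42);
          if (consumer_get(&ch, &v))
              printf("received %d\n", v);
          return 0;
      }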

  19. Distributed Control Logic [figure: each operator (+, -) carries its own small FSM, connected to its neighbors by rdy/ack handshake signals] (more info in the backup slides)

  20. Forward Branches if (x > 0) y = -x; else y = b*x; [figure: dataflow graph of the conditional, with nodes for *, -, > and a select producing y] Conditionals ⇒ Speculation
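
  The "Conditionals ⇒ Speculation" point can be sketched in plain C: both arms of the if are evaluated eagerly and a predicate selects the result, like a multiplexer. The function names are made up for illustration; this is not compiler output:

      /* Source form: only one arm executes. */
      int y_branching(int x, int b)
      {
          int y;
          if (x > 0)
              y = -x;
          else
              y = b * x;
          return y;
      }

      /* Speculative/dataflow form: both arms are computed regardless of the
       * predicate (safe here because both are pure arithmetic), and a
       * predicate-controlled select picks the result. */
      int y_speculative(int x, int b)
      {
          int then_val = -x;          /* computed unconditionally */
          int else_val = b * x;       /* computed unconditionally */
          int p        = (x > 0);     /* predicate */
          return p ? then_val : else_val;   /* multiplexer */
      }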

  21. Control Flow ⇒ Data Flow [figure: the dataflow equivalents of control flow: split (branch), merge, and gateway nodes, taking data and predicate (p) inputs]

  22. Loops
      int sum = 0, i;
      for (i = 0; i < 100; i++)
          sum += i*i;
      return sum;
  [figure: the corresponding dataflow graph, with merge nodes for i and sum, a multiply, a +1, a < 100 loop predicate, and a return node]
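
  A rough software analogue of that dataflow loop, assuming the merge/predicate structure sketched in the figure; the function name and the "token" framing are invented for illustration:

      #include <stdio.h>

      /* The loop-carried values i and sum circulate like tokens: a merge
       * chooses between the initial value and the value coming back around
       * the loop, and the predicate i < 100 steers tokens either into the
       * body or out to the return. */
      int sum_of_squares_dataflow(void)
      {
          int i_token   = 0;      /* initial token entering the i merge */
          int sum_token = 0;      /* initial token entering the sum merge */

          for (;;) {
              int p = (i_token < 100);       /* loop predicate node */
              if (!p)
                  return sum_token;          /* token steered to the exit */

              int sq = i_token * i_token;    /* multiply node */
              sum_token = sum_token + sq;    /* add node, back edge to merge */
              i_token   = i_token + 1;       /* +1 node, back edge to merge */
          }
      }

      int main(void)
      {
          printf("%d\n", sum_of_squares_dataflow());   /* prints 328350 */
          return 0;
      }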

  23. Outline • Introduction • Compiling for ASH • ASH vs CPU • Analyzing the Results • Conclusions

  24. ASH vs: • 4- & 8-wide VLIWs • Superscalar, media kernels • Superscalar, SpecInt95

  25. OpenDIVX IDCT, Normalized Running Time

  26. OpenDIVX IDCT, Sustained IPC [chart; annotations: "includes speculative ops", "no data"]

  27. Media Kernels, vs 4-way OOO

  28. Media Kernels, IPC

  29. Cost of Performance

  30. This Is Obvious! ASH runs at full dataflow speed, so the CPU cannot do any better (if the compilers are equally good). Wrong!

  31. SpecInt95, ASH vs 4-way OOO

  32. Outline • Introduction: spatial computation • CASH: Compiling for ASH • ASH vs CPU • Dissection • Conclusions

  33. The (Loop) Body
      for (i = 0; i < 64; i++) {
          for (j = 0; X[j].r != 0xF; j++)
              if (X[j].r == i)
                  break;
          Y[i] = X[j].q;
      }
  (SpecINT95: 124.m88ksim: init_processor, stylized)

  34. Dynamic Critical Path (definition) [figure: dataflow graph of the inner loop, annotated with sizeof(X[j]), the load, the break predicate, and the loop predicate] for (j = 0; X[j].r != 0xF; j++) if (X[j].r == i) break;

  35. MIPS gcc Code
      LOOP:
      L1: beq   $v0,$a1,EXIT   ; X[j].r == i
      L2: addiu $v1,$v1,20     ; &X[j+1].r
      L3: lw    $v0,0($v1)     ; X[j+1].r
      L4: addiu $a0,$a0,1      ; j++
      L5: bne   $v0,$a3,LOOP   ; X[j+1].r == 0xF
      EXIT:
  for (j = 0; X[j].r != 0xF; j++) if (X[j].r == i) break;
  L1 → L2 → L3 → L5 → L1: a 4-instruction loop-carried dependence

  36. If Branch Prediction Correct
      LOOP:
      L1: beq   $v0,$a1,EXIT   ; X[j].r == i
      L2: addiu $v1,$v1,20     ; &X[j+1].r
      L3: lw    $v0,0($v1)     ; X[j+1].r
      L4: addiu $a0,$a0,1      ; j++
      L5: bne   $v0,$a3,LOOP   ; X[j+1].r == 0xF
      EXIT:
  for (j = 0; X[j].r != 0xF; j++) if (X[j].r == i) break;
  With L1 → L2 → L3 → L5 → L1 correctly predicted, the superscalar is issue-limited: 5 instructions per iteration on a 4-wide machine means at least 2 cycles/iteration sustained.

  37. SpecInt95, perfect prediction

  38. Critical Path with Prediction [figure annotation: loads are not speculative] for (j = 0; X[j].r != 0xF; j++) if (X[j].r == i) break;

  39. Prediction + Load Speculation [figure annotations: ack edge; ~4 cycles!; load not pipelined (self-anti-dependence)] for (j = 0; X[j].r != 0xF; j++) if (X[j].r == i) break;

  40. OOO Pipe Snapshot
      LOOP:
      L1: beq   $v0,$a1,EXIT   ; X[j].r == i
      L2: addiu $v1,$v1,20     ; &X[j+1].r
      L3: lw    $v0,0($v1)     ; X[j+1].r
      L4: addiu $a0,$a0,1      ; j++
      L5: bne   $v0,$a3,LOOP   ; X[j+1].r == 0xF
      EXIT:
  [figure: pipeline snapshot; register renaming lets instructions from several iterations occupy the IF, DA, EX, WB, and CT stages at once]

  41. Unrolling?
      for (i = 0; i < 64; i++) {
          for (j = 0; X[j].r != 0xF; j += 2) {
              if (X[j].r == i) break;
              if (X[j+1].r == 0xF) break;
              if (X[j+1].r == i) break;
          }
          Y[i] = X[j].q;
      }
  when 1 iteration

  42. ASH Problems • Both branch and join not free • Static dataflow (no re-issue of same instr) • Memory is “far” • Fully static • No branch prediction • No dynamic unrolling • No register renaming • Calls/returns not lenient • No virtualization • No dynamic optimization

  43. Outline • Introduction: spatial computation • CASH: Compiling for ASH • ASH vs CPU • Result Analysis • Conclusions

  44. Conclusions • ASH promising for media processing; still to evaluate: power, performance, cost • Prediction does much more than avoid issue stalls • The von Neumann model of computation is very powerful: hardware resources are not everything

  45. Backup Slides • Evaluation model • Control logic • Pipeline balancing • Lenient execution • Dynamic Critical Path

  46. How Performance Is Evaluated [figure: evaluation model: circuit with unlimited ILP, an LSQ with limited bandwidth (2 words/cycle), 8K L1, 1/4M L2, and main memory; latencies of 2, 8, and 72 cycles shown in the figure]

  47. Simulation Parameters • Compared to 4-wide OOO SimpleScalar • Same operation latencies • Same cache hierarchy • No measurements in library functions/OS • 3-cycle multiply, 20-cycle divide

  48. Control Logic [figure: handshake control circuit built from C and D elements around a data register, wired with rdyin/ackin, rdyout/ackout, datain, and dataout signals]

  49. Outline • Introduction • Compiling for ASH • ASH at run-time • ASH vs CPU • Conclusions

  50. Critical Paths if (x > 0) y = -x; else y = b*x; [figure: dataflow graph of the conditional (nodes for *, -, > and y) with its critical paths]
