
Increasing the Number of Effective Registers in a Low-Power Processor Using a Windowed Register File
Rajiv A. Ravindran, Robert M. Senger, Eric D. Marsman, Ganesh S. Dasika, Matthew R. Guthaus, Scott A. Mahlke, Richard B. Brown
Department of Electrical Engineering and Computer Science
University of Michigan, Ann Arbor

Architected Registers: More or Less?
• Fewer registers:
  • Smaller hardware structures: more power efficient
  • Tighter instruction encoding, smaller memory footprint
  • However, more loads/stores to memory
    • Reduced performance, increased power
• More registers:
  • Larger hardware structures: less power efficient
  • Increase in code size, larger memory footprint
  • However,
    • Map more variables from memory to registers: reduce power
    • Enable ILP optimizations

Objective of this Work
• Provide a large number of architected registers
  • But maintain instruction encoding and thus code size
• Use a windowed register file architecture
  • But in an unconventional way
• Traditional register window
  • Reduce function save/restore overhead
• Our approach
  • Large register file partitioned into multiple windows
  • Appearance of a large register file

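Windowed Register File Architecture
• 16-register file split into two 8-register windows, feeding the FU
• Machine Status Register (MSR) holds the window status bit
• Window status bit + 3-bit operand field select the register
• Example sequence:
    add r1, r2, r3
    iw-mov r9, r1
    win-swap #2
    sub r2, r1, r3
• Window status bit 0, operand field 001: r1 is register 1 in the register file
• win-swap toggles the active window
• Window status bit 1, operand field 001: r1 is register 9 in the register file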
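
The register selection on this slide can be sketched in a few lines of Python. This is only an illustration that assumes the window status bit is concatenated above the 3-bit operand field; the function name `physical_index` and the constants are hypothetical and do not come from the WIMS design.

```python
OPERAND_BITS = 3                       # 3-bit operand field, as on the slide
REGS_PER_WINDOW = 1 << OPERAND_BITS    # 8 registers visible per window


def physical_index(window_bit: int, operand_field: int) -> int:
    """Assumed mapping: window status bit placed above the operand field."""
    if window_bit not in (0, 1):
        raise ValueError("this sketch models two windows only")
    if not 0 <= operand_field < REGS_PER_WINDOW:
        raise ValueError("operand field must fit in 3 bits")
    return (window_bit << OPERAND_BITS) | operand_field


# r1 with window status bit 0, then again after the window is toggled.
before_swap = physical_index(0, 0b001)
after_swap = physical_index(1, 0b001)
print(before_swap, after_swap)
```

With these assumptions the same operand encoding names register 1 before the swap and register 9 after it, which matches the two annotations on the slide.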

Wireless Integrated Microsystems (WIMS)
• Developed at the University of Michigan (Robert Senger et al., DAC 2003)

Related Work
• Traditional use of register windows
  • Reduce save/restore, context switch overhead
  • SPARC, IA-64, ADSP-219x, Tensilica
• Procedure call overhead small in embedded domain
  • Procedure inlining reduces call/return overhead
  • ~2% increase in performance using infinite register windows in WIMS
• Register connects [Kiyohara:93], register queues [Smelyanskiy:01]
  • Fixed ISA, provide more registers than allowed
  • Layer of indirection to access every operand

Motivating Example

1-window of 8-registers
    loop:
    ADD R1-3, R1-0, R1-6
    LOAD R1-2, [R1-3]
    ADD R1-3, R1-0, R1-7
    LOAD R1-4, [R1-3]
    MPY R1-3, R1-2, R1-4
    ADD R1-2, R1-0, R1-5
    ADD R1-1, R1-1, #1
    LOAD R1-4, [R1-2]
    ADD R1-2, R1-3, R1-4
    STORE [R1-0], R1-2
    ADD R1-0, R1-0, #4
    CMP R1-1, #100
    BRCT loop

1-window of 4-registers
    loop:
    LOAD R1-1, [SP, #24]
    ADD R1-0, R1-3, R1-1
    LOAD R1-0, [R1-0]
    LOAD R1-1, [SP, #32]
    STORE [SP, #72], R1-0
    ADD R1-0, R1-3, R1-1
    LOAD R1-0, [R1-0]
    LOAD R1-1, [SP, #72]
    MPY R1-0, R1-1, R1-0
    STORE [SP, #40], R1-0
    LOAD R1-0, [SP, #16]
    ADD R1-1, R1-3, R1-0
    LOAD R1-0, [R1-1]
    LOAD R1-1, [SP, #40]
    ADD R1-0, R1-1, R1-0
    LOAD R1-1, [SP, #80]
    STORE [R1-3], R1-0
    ADD R1-0, R1-1, #1
    ADD R1-3, R1-3, #4
    CMP R1-0, #100
    BRCT loop

2-window of 4-registers each
    loop:
    IW-MOV R1-0, R2-1
    WIN-SWAP #1
    ADD R1-3, R1-2, R1-0
    IW-MOV R1-0, R2-2
    LOAD R1-1, [R1-3]
    ADD R1-3, R1-2, R1-0
    LOAD R1-0, [R1-3]
    MPY R1-3, R1-1, R1-0
    IW-MOV R1-1, R2-3
    ADD R1-1, R1-2, R1-1
    LOAD R1-1, [R1-0]
    ADD R1-0, R1-3, R1-1
    STORE [R1-2], R1-0
    WIN-SWAP #2
    ADD R2-0, R2-0, #1
    WIN-SWAP #1
    ADD R1-2, R1-2, #4
    WIN-SWAP #2
    CMP R2-0, #100
    BRCT loop

Tradeoffs for the Compiler

Register Utilization
• Move variables from memory to register
  • Reduces spill code
• Distribute program variables and temporaries to all available registers in multiple windows

VS

Register Management
• Reduce overhead due to window management instructions
  • Activate windows (swaps)
  • Data transfer (iw-moves)
• Bundle accesses to same window
  • Fewer transitions between windows

Balance these issues in an intelligent manner

Register Window Partitioning
[Figure: graph of virtual registers VR1 to VR6 divided between Partition-1 and Partition-2]
• Weight Calculation
  • Partition weight: over-commitment of register resources
  • Edge weight: penalty of separating VRs
• Partitioning algorithm:
  • Move nodes between partitions to minimize partition + edge weights
  • Modified FM graph partitioning algorithm

Edge Weight Calculation: Move Cost
• Computed once before partitioning
• edge weight = move-cost + swap-cost
• Example loop (frequency labels in the figure: 3104 and 1):
    loop:
    1 ADD VR34, VR27, VR32
    2 LOAD VR6, [VR34]
    3 LOAD VR9, [VR27]
    4 MPY VR10, VR6, VR9
    5 ADD VR20, VR20, VR10
    6 ADD VR2, VR2, #1
    7 ADD VR27, VR27, #4
    8 CMP VR2, 32
    9 BRCT loop
• Edge VR6-VR9: separating the two VRs turns MPY VR10, VR6, VR9 (executed 3104 times) into
    IW-MOVE VR100, VR9 (x 3104)
    MPY VR10, VR6, VR100

Edge Weight Calculation: Swap Cost
• Computed once before partitioning
• edge weight = move-cost + swap-cost
• Example loop (frequency labels in the figure: 3104 and 1):
    loop:
    1 ADD VR34, VR27, VR32
    2 LOAD VR6, [VR34]
    3 LOAD VR9, [VR27]
    4 MPY VR10, VR6, VR9
    5 ADD VR20, VR20, VR10
    6 ADD VR2, VR2, #1
    7 ADD VR27, VR27, #4
    8 CMP VR2, 32
    9 BRCT loop
• Edge VR6-VR9: separating the two VRs changes the active window twice per iteration
    LOAD VR6, [VR34]
    SWAP
    LOAD VR9, [VR27]
    SWAP
    MPY VR10, VR6, VR9
• swap cost: 2 x 3104 = 6208
• edge weight = 3104 + 6208 = 9312
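
The arithmetic on the two edge-weight slides can be reproduced with a short Python sketch. The function names and the per-iteration counts (one inter-window move, two swaps) are read off the VR6-VR9 example and are assumptions of this illustration, not a general statement of how the compiler counts them.

```python
LOOP_FREQ = 3104  # execution frequency of the loop body, from the slides


def move_cost(freq: int, moves_per_iteration: int = 1) -> int:
    """Cost of the inter-window moves needed if the two VRs are separated."""
    return freq * moves_per_iteration


def swap_cost(freq: int, swaps_per_iteration: int = 2) -> int:
    """Cost of the window swaps needed if the two VRs are separated."""
    return freq * swaps_per_iteration


def edge_weight(freq: int) -> int:
    """edge weight = move-cost + swap-cost, as stated on the slide."""
    return move_cost(freq) + swap_cost(freq)


print(move_cost(LOOP_FREQ), swap_cost(LOOP_FREQ), edge_weight(LOOP_FREQ))
```

For the VR6-VR9 edge this gives a move cost of 3104, a swap cost of 6208, and an edge weight of 9312, the same values as on the slide.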

Partition Weight Calculation
[Figure: graph nodes VR6, VR9, VR2, VR10, VR32, VR20, VR27, VR34; VR20, VR2, VR34, VR27, VR32 and VR6, VR9, VR10 drawn beside the loop; frequency labels 3104 and 1]
    loop:
    1 ADD VR34, VR27, VR32
    2 LOAD VR6, [VR34]
    3 LOAD VR9, [VR27]
    4 MPY VR10, VR6, VR9
    5 ADD VR20, VR20, VR10
    6 ADD VR2, VR2, #1
    7 ADD VR27, VR27, #4
    8 CMP VR2, 32
    9 BRCT loop
• Estimates the spill pressure using crude register allocation
• Partition weight = sum of the cost of all the spilled VRs
• Computed dynamically during node assignment process

Partition Weight Calculation: Example
• Assume 3 registers per window/partition, and all VRs are assigned to one partition
• Loop (frequency labels in the figure: 3104 and 1), instruction 1 under consideration:
    loop:
    1 ADD VR34, VR27, VR32
    2 LOAD VR6, [VR34]
    3 LOAD VR9, [VR27]
    4 MPY VR10, VR6, VR9
    5 ADD VR20, VR20, VR10
    6 ADD VR2, VR2, #1
    7 ADD VR27, VR27, #4
    8 CMP VR2, 32
    9 BRCT loop
• VRs shown at instruction 1: VR20, VR2, VR34, VR27, VR32
  • VRs 32, 20 are spilled
• Spill cost
  • VR32: 3104
  • VR2: 9312
  • VRs 10, 6, 20, 34: 6208
  • VR27: 12416
• Spilled VRs = {32, 20}

Partition Weight Calculation: Example
• Assume 3 registers per window/partition, and all VRs are assigned to one partition
• Loop (frequency labels in the figure: 3104 and 1), instruction 2 under consideration:
    loop:
    1 ADD VR34, VR27, VR32
    2 LOAD VR6, [VR34]
    3 LOAD VR9, [VR27]
    4 MPY VR10, VR6, VR9
    5 ADD VR20, VR20, VR10
    6 ADD VR2, VR2, #1
    7 ADD VR27, VR27, #4
    8 CMP VR2, 32
    9 BRCT loop
• VRs 32, 20 are spilled (instruction 1)
• VR6 is spilled (instruction 2)
• Spill cost
  • VR32: 3104
  • VR2: 9312
  • VRs 10, 6, 20, 34: 6208
  • VR27: 12416
• Spilled VRs = {32, 20, 6}
• Continuing further, partition weight = spill cost of VRs 32, 20, 6, 10 = 21728
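
The spill costs listed on these two slides are consistent with charging each VR its number of references in the loop times the loop frequency. The slides do not state this formula, so the Python sketch below treats it as an assumption; the tuple encoding of the loop and the helper name `spill_costs` are hypothetical, and the spilled set is taken from the slide rather than computed.

```python
from collections import Counter

LOOP_FREQ = 3104

# Register operands of each loop instruction (definitions and uses);
# immediates and the branch are omitted because they reference no VR.
LOOP_BODY = [
    ("ADD", ["VR34", "VR27", "VR32"]),
    ("LOAD", ["VR6", "VR34"]),
    ("LOAD", ["VR9", "VR27"]),
    ("MPY", ["VR10", "VR6", "VR9"]),
    ("ADD", ["VR20", "VR20", "VR10"]),
    ("ADD", ["VR2", "VR2"]),
    ("ADD", ["VR27", "VR27"]),
    ("CMP", ["VR2"]),
]


def spill_costs(body, freq):
    """Assumed model: spill cost = (number of references) x (block frequency)."""
    refs = Counter(vr for _, operands in body for vr in operands)
    return {vr: count * freq for vr, count in refs.items()}


costs = spill_costs(LOOP_BODY, LOOP_FREQ)

# Spilled set as reported on the slide for a 3-register partition.
SPILLED = ["VR32", "VR20", "VR6", "VR10"]
partition_weight = sum(costs[vr] for vr in SPILLED)
print(costs["VR27"], costs["VR2"], partition_weight)
```

Under this assumption the per-VR costs match the slide (VR32: 3104, VR2: 9312, VR27: 12416, the rest 6208) and the four spilled VRs sum to the stated partition weight of 21728.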

Node Partitioning: Example
• Total Gain = Δ Partition Weight + Δ Edge Weight
[Figure: partitions P1 and P2 with nodes VR6, VR9, VR2, VR10, VR32, VR20, VR27, VR34; VR2 singled out]
• Partition weight of P1 = sum of spill cost of VRs 32, 20, 6, 10 = 21728
• Partition weight of P2 = 0
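
A toy Python sketch of the gain computation may help make the formula concrete. It is not the authors' algorithm: the partition weight here is a deliberately simplified stand-in that ignores liveness and just spills the cheapest VRs beyond the window capacity, positive gain is assumed to mean a reduction in total cost, and the four-VR instance is hypothetical apart from the spill costs and the VR6-VR9 edge weight taken from the earlier slides.

```python
def partition_weight(vrs, spill_cost, capacity):
    """Stand-in estimate: spill the cheapest VRs that exceed the capacity."""
    overflow = max(0, len(vrs) - capacity)
    return sum(sorted(spill_cost[v] for v in vrs)[:overflow])


def total_cost(assignment, edge_weights, spill_cost, capacity):
    """Sum of partition weights plus weights of edges cut by the partition."""
    partitions = {}
    for vr, part in assignment.items():
        partitions.setdefault(part, set()).add(vr)
    weights = sum(partition_weight(vrs, spill_cost, capacity)
                  for vrs in partitions.values())
    cut = sum(w for (a, b), w in edge_weights.items()
              if assignment[a] != assignment[b])
    return weights + cut


def move_gain(assignment, vr, target, edge_weights, spill_cost, capacity):
    """Assumed sign convention: gain = cost before the move - cost after."""
    moved = {**assignment, vr: target}
    return (total_cost(assignment, edge_weights, spill_cost, capacity)
            - total_cost(moved, edge_weights, spill_cost, capacity))


SPILL_COST = {"VR6": 6208, "VR9": 6208, "VR10": 6208, "VR2": 9312}
EDGE_WEIGHTS = {("VR6", "VR9"): 9312}
start = {vr: "P1" for vr in SPILL_COST}   # everything starts in P1

gain_vr2 = move_gain(start, "VR2", "P2", EDGE_WEIGHTS, SPILL_COST, capacity=3)
gain_vr9 = move_gain(start, "VR9", "P2", EDGE_WEIGHTS, SPILL_COST, capacity=3)
print(gain_vr2, gain_vr9)
```

In this toy instance moving VR2 relieves the over-committed partition without cutting any edge, so its gain is positive, while moving VR9 cuts the heavy VR6-VR9 edge and yields a negative gain. An FM-style pass would prefer the first move.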

Node Partitioning: Final Example
[Figure: nodes VR6, VR9, VR2, VR32, VR20, VR27, VR10, VR34 distributed over partitions P1 and P2; frequency labels 1 and 3104]
    loop:
    1 WIN_SWAP #1
    2 LOAD 32:R1-0, [SP, #0]
    3 ADD 34:R1-3, 27:R1-1, 32:R1-0
    4 LOAD 9:R1-3, [27:R1-1]
    5 LOAD 6:R1-2, [34:R1-3]
    6 MPY 39:R1-0, 6:R1-2, 9:R1-3
    7 IW_MOV 10:R2-2, 39:R1-0
    8 WIN_SWAP #2
    9 ADD 20:R2-1, 20:R2-1, 10:R2-2
    10 ADD 2:R2-0, 2:R2-0, #1
    11 WIN_SWAP #1
    12 ADD 27:R1-1, 27:R1-1, #4
    13 WIN_SWAP #2
    14 CMP 2:R2-0, #32
    15 BRCT loop
• Partition weight of P1 = spill cost of VR32 = 3104
• Partition weight of P2 = 0
• Reduced from 6 spill operations to 1
• Added 5 window management instructions
• Performance remains the same, but power decreases

Performance of WIMS: 8 registers/window, 1-window vs 2 and 4 windows
[Bar chart: % cycles (-50 to 50) per benchmark, with series Performance, Spill benefit, and Swap and move overhead. Benchmarks: fir, sha, rawd, rawc, yacc, djpeg, cjpeg, unepic, gsmdec, gsmenc, g721dec, g721enc, mpg2dec, compress, average]

Performance of VLIW: 8 registers/window, 1-window vs 2 and 4 windows
[Bar chart: % cycles (-50 to 50) per benchmark, with series Performance, Spill benefit, and Swap and move overhead. Benchmarks: fir, sha, rawd, rawc, yacc, djpeg, cjpeg, unepic, gsmdec, gsmenc, g721dec, g721enc, mpg2dec, compress, average]

Power savings on the 8-register WIMS: 1-window vs 2 and 4-window machine
[Bar chart: % power savings (0 to 20) per benchmark, with series 2-window and 4-window. Benchmarks: fir, sha, yacc, rawc, rawd, cjpeg, djpeg, unepic, gsmdec, gsmenc, g721enc, g721dec, compress, mpeg2dec, average]

Conclusion
• A novel graph-partitioning-based compiler algorithm to exploit windowed register files within a single procedure
• Hardware/software solution for reducing code size while maintaining an effectively large number of registers
• Average improvement in performance
• 7% reduction in power for the 8-register case on WIMS

Swap Cost Over-Counting
[Figure: edges among vr9, vr27, vr10, vr6; frequency labels 3104 and 1]
    loop:
    1 ADD VR34, VR27, VR32
    2 LOAD VR6, [VR34]
    3 LOAD VR9, [VR27]
    4 MPY VR10, VR6, VR9
    5 ADD VR20, VR20, VR10
    6 ADD VR2, VR2, #1
    7 ADD VR27, VR27, #4
    8 CMP VR2, 32
    9 BRCT loop
• Between LOAD VR9, [VR27] and MPY VR10, VR6, VR9, pairwise counting gives 4 swaps:
    SWAP - VR9, VR6
    SWAP - VR9, VR10
    SWAP - VR27, VR10
    SWAP - VR27, VR6
• In reality, only 1 swap is required
• Solution: normalize swap cost
  • Swap cost between VRs 6 and 9 = 1/4 of cost of single swap = 1/4 * 3104 = 776
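
The normalization on this slide can be sketched in Python as follows. The sketch assumes the divisor is the number of VR pairs formed across the two adjacent operations, which matches the four SWAP pairs listed above; the function name `normalized_swap_costs` is hypothetical.

```python
from itertools import product

LOOP_FREQ = 3104


def normalized_swap_costs(first_operands, second_operands, freq):
    """Split the cost of one swap evenly over every VR pair that could cause it."""
    pairs = list(product(first_operands, second_operands))
    share = freq / len(pairs)
    return {pair: share for pair in pairs}


# Operands of LOAD VR9, [VR27] against the other operands of MPY VR10, VR6, VR9.
costs = normalized_swap_costs(["VR9", "VR27"], ["VR10", "VR6"], LOOP_FREQ)
print(costs[("VR9", "VR6")], sum(costs.values()))
```

Each of the four pairs is charged 776, and together they add up to the cost of the single swap that is actually needed (3104), so the edge weights no longer count it four times.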

Swap Insertion & Optimization
    WIN_SWAP #1
    mov r1, #0
    load r4, [_a]
    mul r2, r3, r4
    WIN_SWAP #1
    add r1, r2, r3
    sub r4, r1, r2
    load r1, [r4]
    IW_MOV r9, r1
    WIN_SWAP #2
    shl r3, r4, r5
    add r3, r9, #2
    load r2, [r3]
    Brl _foo()
    WIN_SWAP #1
    load r4, [r5]
    add r4, r4, #4
• Remove redundant swaps
• Hoist swaps to less frequently executed region
• Combine swaps with other instructions
  • BRL/RTS optimization
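
The first optimization, removing redundant swaps, can be sketched in Python for straight-line code. This is a simplified illustration, not the compiler's pass: it assumes a single basic block, treats instructions as plain strings, and assumes nothing other than WIN_SWAP changes the active window. The function name `remove_redundant_swaps` is hypothetical.

```python
def remove_redundant_swaps(instructions):
    """Drop a WIN_SWAP whose target window is already the active one."""
    active = None          # active window is unknown on entry to the block
    kept = []
    for ins in instructions:
        if ins.startswith("WIN_SWAP"):
            target = ins.split("#", 1)[1].strip()
            if target == active:
                continue   # redundant: this window is already active
            active = target
        kept.append(ins)
    return kept


block = [
    "WIN_SWAP #1",
    "mov r1, #0",
    "WIN_SWAP #1",        # redundant, window 1 is already active
    "add r1, r2, r3",
    "WIN_SWAP #2",
    "shl r3, r4, r5",
]
optimized = remove_redundant_swaps(block)
print(optimized)
```

Handling control flow, hoisting swaps out of frequently executed regions, and folding swaps into BRL/RTS would need data-flow information across blocks and are outside this sketch.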

Performance of WIMS: 2-window 8-register vs 1-window 16-register

Overall Compilation System
Stages in the flow diagram:
• FRONTEND
• PREPASS SCHEDULING
• REGISTER PARTITIONING
  • CALCULATE PARTITION WEIGHTS
  • CALCULATE EDGE WEIGHTS
  • MOVE NODES
• CODE GENERATION
• REGISTER ALLOCATION
• SWAP INSERTION
  • NAIVE SWAP INSERTION
  • SWAP OPTIMIZATION
• POSTPASS SCHEDULING
