Exploiting Hardware for Loop Optimizations Using Zero Overhead Loop Buffers

Project Presentation by Joshua George
Advisor: Dr. Jack Davidson

Exploiting hardware for loop optimizations
• ZOLB: Zero Overhead Loop Buffers
• Decrement, compare and jump instructions
• Compare with zero instructions

VPO - Very Portable Optimizer
• VPO is a retargetable optimizer that operates on a low-level, machine-independent representation called RTLs (register transfer lists)
• VPO is retargeted by providing a machine description (MD) of the target machine and revising a few machine-dependent routines
• VPO is small, easily extended, and extremely effective

Implementation - guidelines
• Add as little code as possible to the machine-dependent part (MDP), and do most of the implementation in the machine-independent part (LIB)
• Design the interface between LIB and MDP so that it can accommodate issues that may arise on other targets
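To make the LIB/MDP split concrete, here is a minimal sketch in C of what such an interface could look like. Everything in it is hypothetical: the struct `loop_hw_hooks`, its fields, and the functions `can_use_hw_loop` and `example_body_fits` are illustrative names, not VPO's actual interface, and the limits used are made up for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical hooks supplied by the machine-dependent part (MDP).
 * These names are illustrative only and are not taken from VPO. */
struct loop_hw_hooks {
    /* Largest trip count the hardware loop supports; 0 means the
     * target has no hardware loop support at all. */
    unsigned max_repeat_count;
    /* Returns true if a loop body of this many instructions can be
     * handled by the target's hardware loop. */
    bool (*body_fits)(unsigned body_insns);
};

/* Machine-independent (LIB) decision: all target knowledge comes in
 * through the hooks, so this function needs no per-target changes. */
bool can_use_hw_loop(const struct loop_hw_hooks *md,
                     unsigned trip_count, unsigned body_insns)
{
    assert(md != NULL);
    if (md->max_repeat_count == 0 || md->body_fits == NULL)
        return false;               /* target offers no hardware loop */
    if (trip_count == 0 || trip_count > md->max_repeat_count)
        return false;               /* count not representable */
    return md->body_fits(body_insns);
}

/* Example hook for an imaginary target whose repeat instruction
 * accepts exactly one instruction as the loop body. */
bool example_body_fits(unsigned body_insns)
{
    return body_insns == 1;
}
```

A target would fill in one `loop_hw_hooks` value and pass it to the shared code; adding a new target then means writing new hooks rather than editing `can_use_hw_loop`.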

ZOLB - Zero Overhead Loop Buffer
• DSP algorithms are loop intensive (e.g. FIR filter)
• DSPs have power consumption and code size constraints
• ZOLB hardware is a compiler-managed loop cache
• No branch
• Instructions executed from buffer
• Doesn't need more power
• Reduces code size
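As an illustration of the kind of loop these slides have in mind, here is a small FIR filter kernel in C. It is only a sketch: the function name `fir_sample` and the data layout (coefficients in `h`, the matching window of input samples in `x`) are assumptions made for the example, not something the presentation specifies.

```c
#include <assert.h>
#include <stddef.h>

/* One output sample of an ntaps-tap FIR filter: y = sum of h[k] * x[k].
 * Assumed layout for this sketch: x[k] is the input sample that pairs
 * with coefficient h[k]. */
long fir_sample(const int *h, const int *x, size_t ntaps)
{
    long acc = 0;

    assert(h != NULL && x != NULL);
    /* The trip count is known before the loop starts and the body is a
     * short straight-line multiply-accumulate, which is the shape of
     * loop a loop buffer is meant to run without branch overhead. */
    for (size_t k = 0; k < ntaps; k++)
        acc += (long)h[k] * x[k];
    return acc;
}
```

In a loop like this, the increment, compare and branch are pure overhead next to the single multiply-accumulate in the body, which is why removing them matters on a DSP.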

ZOLB on TMS320C54X
• TMS320C54X: popular DSP from TI
• Has block repeat and single instruction repeat
• Single repeat
    rpt #127
    st #0, *ar0+

ZOLB on TMS320C54X
• Block repeat
    stm #127, brc
    rptb L2
    L1: ….
    L2:

Conversion example
RTLs with explicit compare and branch:
    w[0]=_A;
    L1: w[0]=w[0]+1;W[w[0]]=0;
    r[0] = (w[0]{24)}24;
    r[0] = r[0] - 10 - _A;
    PC = r[0]<0,L1;
RTLs using the b[] registers:
    w[0] = _A;
    b[1]=L1; b[2]=EN[L1]; b[0]=9;
    L1: w[0]=w[0]+1;W[w[0]]=0;
    PC=b[0]>0,L1;b[0]=b[0]-1;
Assembly with explicit compare and branch:
    stm #_A, ar0
    L1: st #0, *ar0+
    ld *(ar0), A
    sub (_A+#10), A
    bc L1, Alt
Assembly using rpt:
    stm #_A, ar0
    rpt #9
    st #0, *ar0+
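The ten-iteration loop above turns into `rpt #9` because the C54x repeat instruction executes the following instruction one more time than its operand. The sketch below models that relationship in C. The helper names `repeat_operand` and `model_rpt_store_zero` are hypothetical, and the model only mirrors the `rpt` / `st #0, *ar0+` pair, not the optimizer's actual code.

```c
#include <assert.h>
#include <stddef.h>

/* Operand for a repeat instruction whose body must run trip_count times.
 * The body runs (operand + 1) times, so a loop that executes zero times
 * cannot be expressed this way and is rejected here. */
unsigned repeat_operand(unsigned trip_count)
{
    assert(trip_count >= 1);
    return trip_count - 1;
}

/* Software model of "rpt #operand" followed by "st #0, *ar0+":
 * store zero through the pointer and post-increment it, repeated
 * (operand + 1) times. Returns the final pointer value. */
int *model_rpt_store_zero(int *ar0, unsigned operand)
{
    unsigned remaining = operand;   /* models the repeat counter */

    assert(ar0 != NULL);
    for (;;) {
        *ar0++ = 0;                 /* st #0, *ar0+ */
        if (remaining == 0)
            break;
        remaining--;
    }
    return ar0;
}
```

Calling `model_rpt_store_zero(a, repeat_operand(10))` clears `a[0]` through `a[9]` and leaves the pointer at `a + 10`, matching the ten stores of the original loop.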

Retargetability
• 205 lines of C code added to MDP
• Various other parts of MDP re-used, e.g. the code that returns details of a comparison instruction
• 322 lines of C code added to LIB
• Various other parts of VPO re-used, e.g. the loop analysis code

Future work
• How to prevent VPO from changing the block size (e.g. when spills are added)?
• How to add support for the auto-increment direct addressing mode in the single repeat instruction? E.g.
    rpt #123
    mvdk *ar1, #800h

Count down loops
• Objective: convert loops to count down to zero, instead of counting up to a constant or counting down to a non-zero constant
• Reasoning
• Most architectures have a single compare-to-zero instruction. Comparing to other values needs at least one more instruction.
• Some architectures can decrement, compare and jump in a single instruction
• Sometimes it is possible to use one less register in the loop when counting down
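The transformation can be illustrated at the source level with a short C sketch. It assumes the counter is used only to control the loop, so the array is walked with a pointer in the count-down version. The function names `trip_count`, `sum_count_up` and `sum_count_down` are hypothetical and chosen for this example only.

```c
#include <assert.h>

/* Trip count of "for (i = init; i < limit; i += step)" with step > 0.
 * Returns 0 when the body never runs. */
unsigned trip_count(int init, int limit, int step)
{
    assert(step > 0);
    if (init >= limit)
        return 0;
    /* Unsigned arithmetic avoids signed overflow for distant bounds. */
    return ((unsigned)limit - (unsigned)init + (unsigned)step - 1u)
           / (unsigned)step;
}

/* Count-up form: every iteration compares the counter with n. */
long sum_count_up(const int *v, int n)
{
    long sum = 0;

    for (int i = 0; i < n; i++)
        sum += v[i];
    return sum;
}

/* Count-down form: the counter starts at the trip count and is only
 * ever compared with zero, so decrement, compare and branch can map
 * onto a single instruction where the target has one. */
long sum_count_down(const int *v, int n)
{
    long sum = 0;
    unsigned remaining = trip_count(0, n, 1);

    if (remaining == 0)
        return 0;                   /* guard: do-while runs at least once */
    do {
        sum += *v++;
    } while (--remaining != 0);
    return sum;
}
```

Both functions return the same result for any `n`. The count-down version no longer needs `n` inside the loop, which is the source of the freed register mentioned in the last bullet.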

Example - TMS320C54X
Exploiting the banz instruction
Before conversion:
    w[0] = 0;
    L1: …
    w[0]=w[0] + 1;
    r[0] = (w[0]{24)}24;
    r[0] = r[0] - 10;
    PC = r[0]<0,L1;
After conversion:
    w[0] = 10;
    L1: …
    w[0]=w[0] - 1;
    r[0] = (w[0]{24)}24;
    r[0] = r[0] - 0;
    PC = r[0]!0,L1;
After folding down:
    w[0] = 10;
    L1: …
    PC=(w[0]-1)!0,L1;w[0]=w[0]-1; (banz *ar0-,L1)

Example - x86
Exploiting the loop instruction. One register (r[6]) freed from the loop.
Before conversion:
    r[4] = 0;
    L1: ….
    r[4] = r[4] + 1;
    n[0] = r[6] ? r[4];
    PC = n[0]<0,L1;
After conversion:
    r[4] = r[6];
    L1: ….
    r[4] = r[4] - 1;
    n[0] = 0 ? r[4];
    PC = n[0]!0,L1;
After folding down:
    r[4] = r[6];
    L1: …
    PC=0?r[4]-1!0,L1;r[4]=r[4]-1; (loop L1)

Example - sparc
Exploiting the subcc instruction
Before conversion:
    r[16]=0;
    L6: ST=test;
    r[16]=r[16]+1;
    IC=r[16]?2;
    PC=IC<0,L6; IC
After conversion:
    r[16]=2;
    L6: ST=test;
    IC=r[16]-1:0;r[16]=r[16]-1; (subcc)
    PC=IC!0,L6; IC

Retargetability
• Lines of C code added to MDP (including the elaborate comments!)
• 83 on TMS320C54X
• 87 on x86
• 78 on sparc
• 634 lines of C added to LIB
• Compared to ZOLB support, this optimization is almost completely implemented in LIB

Performance - SPEC on x86

Analysis
• Average performance has improved after applying the count down optimization

Conclusion
• More fine-tuning needed to realize substantial performance gains
• Primary objective of adding easily retargetable support for these loop optimizations accomplished: retargeted to 3 targets

Acknowledgements
• Dr. Jack Davidson (advisor)
• Jason Hiser
• Clark Coleman
