This presentation discusses the exploitation of hardware for optimizing loops using Zero Overhead Loop Buffers (ZOLB). It introduces the concept of decrement, compare, and jump instructions and highlights the Very Portable Optimizer (VPO), which operates on a machine-independent representation called RTLs. The presentation outlines the design and implementation guidelines for VPO on the TMS320C54X DSP, discussing performance improvements for loop-intensive algorithms. Future work will focus on enhancing support for auto-increment addressing modes and managing block sizes during optimization.
Project Presentation by Joshua George Advisor – Dr. Jack Davidson
Exploiting hardware for loop optimizations • ZOLB – Zero Overhead Loop Buffers • Decrement, compare and jump instructions • Compare with zero instructions
VPO - Very Portable Optimizer • VPO is a retargetable optimizer that operates on a low-level, machine-independent representation called RTLs (register transfer lists) • VPO is retargeted by providing a machine description (MD) of the target machine, and revising a few machine-dependent routines • VPO is small, easily extended, and extremely effective
Implementation - guidelines • Add as little code as possible to the machine-dependent part (MDP), and do most of the implementation in the machine-independent part (LIB) • Design the interface between LIB and MDP to allow for possible issues with other targets
ZOLB - Zero Overhead Loop Buffer • DSP algorithms – loop intensive • E.g. FIR filter • DSPs – power consumption and code size constraints • ZOLB hardware – compiler-managed loop cache • No branch • Instructions executed from buffer • Doesn’t need more power • Reduces code size
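To make the "loop intensive" point concrete, here is a minimal C sketch of the inner loop of an FIR filter. The function name `fir_sample` and the data layout are assumptions made for illustration and do not come from the presentation. The loop body is tiny and the trip count is known before the loop starts, which is the shape of loop a zero overhead loop buffer is meant for.

```c
#include <assert.h>
#include <stddef.h>

/* Compute one FIR output sample: the sum of coeff[k] * x[k] over all taps.
 * Hypothetical layout: x points at the ntaps samples to be filtered. */
long fir_sample(const int *coeff, const int *x, size_t ntaps)
{
    long acc = 0;

    assert(coeff != NULL && x != NULL);

    /* One multiply-accumulate per iteration. Without loop hardware, the
     * increment, compare and branch cost about as much as the body. */
    for (size_t k = 0; k < ntaps; k++) {
        acc += (long)coeff[k] * x[k];
    }
    return acc;
}
```

With a repeat instruction or ZOLB, the increment, compare and branch in this loop would be handled by hardware rather than by instructions in the loop body.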
ZOLB on TMS320C54X • TMS320C54X - popular DSP from TI • Has block repeat and single instruction repeat • Single repeat
    rpt #127
    st #0, *ar0+
ZOLB on TMS320C54X • Block repeat
        stm #127, brc
        rptb L2
    L1: ….
    L2:
Conversion example
    w[0] = _A; b[1]=L1; b[2]=EN[L1]; b[0]=9;
    L1: w[0]=w[0]+1;W[w[0]]=0;
    PC=b[0]>0,L1;b[0]=b[0]-1;

    w[0]=_A;
    L1: w[0]=w[0]+1;W[w[0]]=0;
    r[0] = (w[0]{24)}24;
    r[0] = r[0] - 10 - _A;
    PC = r[0]<0,L1;

    stm #_A, ar0
    L1: st #0, *ar0+
    ld *(ar0), A
    sub (_A+#10), A
    bc L1, Alt

    stm #_A, ar0
    rpt #9
    st #0, *ar0+
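The conversion example can be modelled in C as follows. This is only a sketch: the function names are hypothetical, and C loops stand in for the hardware repeat counter. It illustrates that the single repeat operand is the trip count minus one, because `rpt #k` executes the following instruction k + 1 times. That is why clearing 10 words uses `rpt #9`.

```c
#include <assert.h>

/* Original shape: the loop is controlled by comparing the pointer with
 * the end address (the ld / sub / bc sequence in the slide's assembly). */
void zero_by_compare(int *a, int n)
{
    int *p = a;

    assert(n > 0);
    do {
        *p++ = 0;                 /* models st #0, *ar0+ */
    } while (p - a < n);
}

/* Converted shape: the trip count n is known before the loop starts, so
 * a repeat counter can drive it. Returns the operand that would be given
 * to rpt, which is n - 1. */
int zero_by_repeat(int *a, int n)
{
    int repeat_operand = n - 1;   /* e.g. 9 for 10 words */
    int *p = a;

    assert(n > 0);
    for (int count = repeat_operand; count >= 0; count--) {
        *p++ = 0;                 /* models st #0, *ar0+ */
    }
    return repeat_operand;
}
```

Both functions clear the same n words. Only the second has a loop whose control could be handed over entirely to the repeat hardware.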
Retargetability • 205 lines of C code added to MDP • Various other parts of MDP re-used, e.g. the code that returns details of a comparison instruction • 322 lines of C code added to LIB • Various other parts of VPO re-used, e.g. the loop analysis code
Future work • How can VPO be prevented from changing the block size (e.g. when spills are added)? • How can support for the auto-increment direct addressing mode be added to the single repeat instruction? • E.g.
    rpt #123
    mvdk *ar1, #800h
Count down loops • Objective – convert loops to count down to zero, instead of counting up to a constant or counting down to a nonzero constant • Reasoning • Most architectures have a single compare-to-zero instruction. Comparing to other values needs at least one more instruction. • Some architectures can decrement, compare and jump in a single instruction! • Sometimes it is possible to use one less register in the loop when counting down.
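A minimal C sketch of the count-down idea follows. The helper names are hypothetical, and the sketch assumes the loop body does not need the original induction variable, an assumption the slides do not spell out. The trip count is computed once before the loop, and the loop then tests against zero only.

```c
#include <assert.h>

/* Trip count of: for (i = start; i < limit; i += step), with step > 0. */
int trip_count(int start, int limit, int step)
{
    assert(step > 0);
    if (start >= limit) {
        return 0;
    }
    return (limit - start + step - 1) / step;
}

/* Count-up form: every iteration compares i against limit, so both
 * values have to stay available for the whole loop. */
int run_count_up(int limit)
{
    int iterations = 0;

    for (int i = 0; i < limit; i++) {
        iterations++;             /* stands in for the loop body */
    }
    return iterations;
}

/* Count-down form: one counter, decremented and tested against zero. */
int run_count_down(int limit)
{
    int iterations = 0;
    int n = trip_count(0, limit, 1);

    if (n == 0) {
        return 0;                 /* the loop would not execute at all */
    }
    do {
        iterations++;             /* stands in for the loop body */
    } while (--n != 0);           /* decrement, compare with zero, jump */
    return iterations;
}
```

In the count-down form the limit is no longer needed once the counter has been initialized, which is where the possible register saving comes from.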
Example – TMS320C54X
Exploiting the banz instruction
Before conversion
    w[0] = 0;
    L1: …
    w[0]=w[0] + 1;
    r[0] = (w[0]{24)}24;
    r[0] = r[0] - 10;
    PC = r[0]<0,L1;
After conversion
    w[0] = 10;
    L1: …
    w[0]=w[0] - 1;
    r[0] = (w[0]{24)}24;
    r[0] = r[0] - 0;
    PC = r[0]!0,L1;
After folding down
    w[0] = 10;
    L1: …
    PC=(w[0]-1)!0,L1;w[0]=w[0]-1;   (banz *ar0-,L1)
Example – x86
Exploiting the loop instruction. One register (r[6]) freed from the loop.
Before conversion
    r[4] = 0;
    L1: ….
    r[4] = r[4] + 1;
    n[0] = r[6] ? r[4];
    PC = n[0]<0,L1;
After conversion
    r[4] = r[6];
    L1: ….
    r[4] = r[4] - 1;
    n[0] = 0 ? r[4];
    PC = n[0]!0,L1;
After folding down
    r[4] = r[6];
    L1: …
    PC=0?r[4]-1!0,L1;r[4]=r[4]-1;   (loop L1)
Example – sparc
Exploiting the subcc instruction
Before conversion
    r[16]=0;
    L6: ST=test;
    r[16]=r[16]+1;
    IC=r[16]?2;
    PC=IC<0,L6; IC
After conversion
    r[16]=2;
    L6: ST=test;
    IC=r[16]-1:0;r[16]=r[16]-1;   (subcc)
    PC=IC!0,L6; IC
Retargetability • Lines of C code added to MDP (including the elaborate comments!) • 83 on TMS320C54X • 87 on x86 • 78 on sparc • 634 lines of C added to LIB • Compared to ZOLB support, this optimization is almost completely implemented in LIB
Analysis • Average performance has improved after applying the count down optimization
Conclusion • More fine-tuning needed to realize substantial performance gains • Primary objective of adding easily retargetable support for these loop optimizations accomplished – retargeted to 3 targets!
Acknowledgements Dr. Jack Davidson (advisor) Jason Hiser Clark Coleman