
Project Presentation

Project Presentation. by Joshua George Advisor – Dr. Jack Davidson. Exploiting hardware for loop optimizations. ZOLB – Zero Overhead Loop Buffers Decrement, compare and jump instructions Compare with zero instructions. VPO - Very Portable Optimizer.


Presentation Transcript


  1. Project Presentation • by Joshua George • Advisor: Dr. Jack Davidson

  2. Exploiting hardware for loop optimizations • ZOLB – Zero Overhead Loop Buffers • Decrement, compare and jump instructions • Compare with zero instructions

  3. VPO - Very Portable Optimizer • VPO is a retargetable optimizer that operates on a low-level, machine-independent representation called RTLs (register transfer lists) • VPO is retargeted by providing a machine description (MD) of the target machine, and revising a few machine-dependent routines • VPO is small, easily extended, and extremely effective

  4. Implementation - guidelines • Add the minimum possible code to the machine-dependent part (MDP), and do most of the implementation in the machine-independent part (LIB) • Design the interface between LIB and MDP to allow for possible issues with other targets

  5. ZOLB - Zero Overhead Loop Buffer • DSP algorithms are loop intensive (e.g. an FIR filter) • DSPs have power consumption and code size constraints • ZOLB hardware is a compiler-managed loop cache • No branch • Instructions executed from buffer • Doesn't need more power • Reduces code size
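
To make the "loop intensive" point concrete, here is a minimal C sketch of the inner loop of an FIR filter. It is not taken from the presentation; the names fir_sample, coeff, x, and ntaps are hypothetical. The loop has a trip count known on entry, a short body, and no calls, which is the kind of loop a zero overhead loop buffer is meant for.

```c
#include <stddef.h>

/* Hypothetical FIR kernel (illustrative names, not from the slides):
 * computes one output sample as the sum of coeff[k] * x[k]. */
long fir_sample(const int *coeff, const int *x, size_t ntaps)
{
    long acc = 0;

    /* Trip count is known before the loop starts and the body is a
     * single multiply-accumulate, so the increment, compare, and branch
     * are pure overhead that loop hardware could absorb. */
    for (size_t k = 0; k < ntaps; k++) {
        acc += (long)coeff[k] * x[k];
    }
    return acc;
}
```

With a ZOLB, the compiler would aim to keep only the multiply-accumulate in the loop body and let the hardware do the counting.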

  6. ZOLB on TMS320C54X • TMS320C54X - popular DSP from TI • Has block repeat and single instruction repeat • Single repeat:
       rpt #127
       st #0, *ar0+
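
The single repeat shown above can be modeled in C as follows. This is a sketch under one stated assumption: that rpt #N executes the following instruction N+1 times (which agrees with the later conversion example, where a 10-iteration loop becomes rpt #9). The function name rpt_store_zero is hypothetical.

```c
#include <stddef.h>

/* Model of "rpt #count" followed by "st #0, *ar0+".
 * Assumption: the repeated instruction runs count + 1 times.
 * Returns how many times the store was executed. */
size_t rpt_store_zero(int *ar0, unsigned count)
{
    size_t executed = 0;

    for (unsigned i = 0; i <= count; i++) {
        *ar0++ = 0;   /* st #0, *ar0+ : store zero, post-increment pointer */
        executed++;
    }
    return executed;
}
```

Under that assumption, rpt #127 clears 128 consecutive words starting at the address in ar0, with no branch instruction in the loop.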

  7. ZOLB on TMS320C54X • Block repeat:
       stm #127, brc
       rptb L2
       L1: ….
       L2:

  8. Conversion example
       RTLs:
         w[0] = _A; b[1]=L1; b[2]=EN[L1]; b[0]=9;
         L1: w[0]=w[0]+1; W[w[0]]=0;
         PC=b[0]>0,L1; b[0]=b[0]-1;
       RTLs:
         w[0]=_A;
         L1: w[0]=w[0]+1; W[w[0]]=0;
         r[0] = (w[0]{24)}24;
         r[0] = r[0] - 10 - _A;
         PC = r[0]<0,L1;
       Assembly:
         stm #_A, ar0
         L1: st #0, *ar0+
         ld *(ar0), A
         sub (_A+#10), A
         bc L1, Alt
       Assembly:
         stm #_A, ar0
         rpt #9
         st #0, *ar0+

  9. Retargetability • 205 lines of C code added to MDP • Various other parts of MDP re-used, for example the code that returns details of a comparison instruction • 322 lines of C code added to LIB • Various other parts of VPO re-used, for example the loop analysis code

  10. Future work • How to prevent VPO from changing the block size (for example, when spills are added)? • How to add support for the auto-increment direct addressing mode in the single repeat instruction? • E.g. rpt #123 mvdk *ar1, #800h

  11. Count down loops • Objective: convert loops to count down to zero, instead of counting up to a constant or counting down to a constant • Reasoning • Most architectures have a single compare-to-zero instruction; comparing to other values needs at least one more instruction • Some architectures can decrement, compare and jump in a single instruction • Sometimes it is possible to use one less register in the loop when counting down
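
The objective above can be illustrated at the source level with a small C sketch. The presentation performs this transformation on RTLs inside the optimizer, not on C source, so this is only an analogy; the function names are hypothetical. It assumes the loop counter is not used inside the body and that the trip count is known before the loop is entered.

```c
/* Count-up form: the exit test compares the counter against n,
 * so n must stay live for the whole loop. */
void clear_count_up(int *p, int n)
{
    for (int i = 0; i < n; i++) {
        *p++ = 0;
    }
}

/* Count-down form: the counter starts at n and the exit test is a
 * compare with zero, which can fold into a decrement-and-branch
 * instruction on targets that have one. n is no longer needed
 * once the counter is initialized. */
void clear_count_down(int *p, int n)
{
    if (n <= 0) {
        return;               /* keep zero-trip behaviour identical */
    }
    int i = n;
    do {
        *p++ = 0;
    } while (--i != 0);       /* decrement, compare with zero, branch */
}
```

Both functions clear the same n words; only the shape of the loop control differs.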

  12. Example – TMS320C54X: exploiting the banz instruction
       Before conversion:
         w[0] = 0;
         L1: …
         w[0]=w[0] + 1;
         r[0] = (w[0]{24)}24;
         r[0] = r[0] - 10;
         PC = r[0]<0,L1;
       After conversion:
         w[0] = 10;
         L1: …
         w[0]=w[0] - 1;
         r[0] = (w[0]{24)}24;
         r[0] = r[0] - 0;
         PC = r[0]!0,L1;
       After folding down:
         w[0] = 10;
         L1: …
         PC=(w[0]-1)!0,L1; w[0]=w[0]-1;   (banz *ar0-,L1)

  13. Example – x86: exploiting the loop instruction. One register (r[6]) freed from the loop.
       Before conversion:
         r[4] = 0;
         L1: ….
         r[4] = r[4] + 1;
         n[0] = r[6] ? r[4];
         PC = n[0]<0,L1;
       After conversion:
         r[4] = r[6];
         L1: ….
         r[4] = r[4] - 1;
         n[0] = 0 ? r[4];
         PC = n[0]!0,L1;
       After folding down:
         r[4] = r[6];
         L1: …
         PC=0?r[4]-1!0,L1; r[4]=r[4]-1;   (loop L1)

  14. Example – SPARC: exploiting the subcc instruction
       Before conversion:
         r[16]=0;
         L6: ST=test;
         r[16]=r[16]+1;
         IC=r[16]?2;
         PC=IC<0,L6; IC
       After conversion:
         r[16]=2;
         L6: ST=test;
         IC=r[16]-1:0; r[16]=r[16]-1;   (subcc)
         PC=IC!0,L6; IC

  15. Retargetability • Lines of C code added to MDP (including the elaborate comments!) • 83 on TMS320C54X • 87 on x86 • 78 on SPARC • 634 lines of C added to LIB • Compared to ZOLB support, this optimization is almost completely implemented in LIB

  16. Performance – SPEC on x86

  17. Analysis • Average performance has improved after applying the count down optimization

  18. Conclusion • More fine-tuning needed to realize substantial performance gains • Primary objective of adding easily retargetable support for these loop optimizations accomplished – retargeted to 3 targets!

  19. Acknowledgements • Dr. Jack Davidson (advisor) • Jason Hiser • Clark Coleman
