Retargetting of VPO to the tms320c54x - a status report
Presentation Transcript

  1. Retargetting of VPO to the tms320c54x - a status report
     Presented by Joshua George
     Advisor: Dr. Jack Davidson

  2. Status
     • Register assignment and allocation
     • Common sub-expression elimination
     • Constant propagation/Copy propagation
     • Induction variable elimination
     • Code motion
     • Recurrence detection

  3. Status (continued)
     • Strength reduction
     • Instruction selection
     • Dead code elimination
     • Constant folding (simp())
     • Branch minimization
     • Support for repeat blocks

  4. The tms320c54x
     • One 40-bit ALU and two 40-bit accumulators (A, B), which are r[0] and r[2] in vpo
     • One 17x17-bit parallel multiplier with an adder for single-cycle MAC operation
     • One barrel shifter
     • Eight 16-bit address registers (AR0-AR7), which are w[0]..w[7] in vpo

  5. Compiler writer woes
     • Address arithmetic: only a constant can be added to an address register. This complicates the optimizer (e.g. the strength reduction code).
     • Interesting note:
         r[0]=(w[0]{24)}24;
         r[0]=r[0]+1;
         w[1]=r[0];        /* w[1]=w[0]+1 gets rejected */
         W[w[1]]=50;       /* by instruction selection */
         ---------------------------
         w[0]=w[0]+1;
         W[w[0]+1]=50;
       The first sequence cannot normally collapse into the more efficient second sequence. After minimize_registers, however, instruction selection is able to fold them into a single instruction.

  6. Compiler writer woes
     • 16-bit word addressing required special-case handling in the lcc frontend.
     • Only two accumulator registers.
     • The Local Register Assigner had to be fixed to handle this.
     • Lots of spills. vpo was refined to use memory disambiguation techniques in instruction selection (maybe_same()).
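The slide names maybe_same() but does not show how it decides. As a minimal sketch of the general idea only (not vpo's actual code), a conservative disambiguation test can treat two references as distinct only when it can prove they differ. The `MemRef` type, the function name `maybe_same_sketch`, and the base-plus-constant-offset model are all assumptions made for illustration.

```python
from typing import NamedTuple


class MemRef(NamedTuple):
    """Hypothetical model of a memory reference: W[base + offset]."""
    base: str    # base register name, e.g. "w[7]"
    offset: int  # constant word offset from the base


def maybe_same_sketch(a: MemRef, b: MemRef) -> bool:
    """Return False only when a and b provably name different words.

    Assumes the base register is not modified between the two references.
    """
    if a.base == b.base:
        # Same base: the words coincide exactly when the offsets match.
        return a.offset == b.offset
    # Different bases: nothing is known, so conservatively assume overlap.
    return True


if __name__ == "__main__":
    spill_slot = MemRef("w[7]", 1)
    other_slot = MemRef("w[7]", 2)
    print(maybe_same_sketch(spill_slot, other_slot))
    print(maybe_same_sketch(spill_slot, MemRef("w[3]", 0)))
```

With a test of this kind, a spill store to one stack slot need not block combining instructions that touch a different slot, which matters when two accumulators force many spills.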

  7. Compiler writer woes
     • No pipeline interlocks, so pipeline conflicts are unprotected.
     • 40-bit accumulator. This needed a major change to simp(), and the machine description is complicated by sign-extends and ANDs.
     • Global data is placed in a special cinit section and is relocated to RAM at run-time. VISTA/EASE code instrumentation had to be done differently from other targets.

  8. Compiler writer woes
     • Compare and jump has the induction variable and the value to compare with spread over two instructions. All targets until now had a simple compare and jump. This resulted in a small change to the vpo lib/md interface.
     • E.g. AR1 (w[1]) is the induction variable and runs from 0 to 9. The loop exit check:
         SSBX SXM         // s[0]=1; (set sign-ext on)
         LD *(AR1),A ;    // r[0]=(w[1]{24)}24;
         SUB #10,A,A ;    // r[0]=r[0]-10;
         BC L1,ALT ;      // PC=r[0],0?L1;
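The exit check above can be modelled in a few lines to show why the comparison needs a separate subtract before the branch. This is a sketch under stated assumptions: the accumulator is treated as a 40-bit two's-complement value, ALT is read as "accumulator less than zero", and the helper names are invented for illustration.

```python
ACC_BITS = 40
ACC_MASK = (1 << ACC_BITS) - 1


def sign_extend16(word: int) -> int:
    """Load a 16-bit word into a 40-bit accumulator with sign extension on."""
    word &= 0xFFFF
    return (word - 0x10000 if word & 0x8000 else word) & ACC_MASK


def acc_is_negative(acc: int) -> bool:
    """True when bit 39 of the accumulator is set."""
    return bool(acc & (1 << (ACC_BITS - 1)))


def branch_taken(ar1: int, bound: int = 10) -> bool:
    """Model of LD / SUB #bound / BC L1,ALT for induction variable ar1."""
    acc = sign_extend16(ar1)        # r[0]=(w[1]{24)}24;
    acc = (acc - bound) & ACC_MASK  # r[0]=r[0]-10;
    return acc_is_negative(acc)     # branch when the accumulator is < 0


if __name__ == "__main__":
    print([i for i in range(12) if branch_taken(i)])
```

The induction variable (AR1) appears in the load and the bound (10) in the subtract, so an optimizer looking for both in one compare-and-jump instruction has to inspect the preceding instructions as well.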

  9. Timeline of progress on this project
     • Spring 2002
       • Code-expander completed.
       • Only basic addressing modes and instructions supported.
       • Stack layout
       • Calling sequence
       • Data declarations
       • Structure operations
       • Passes ctests/ptests with instruction selection.
       • Support for stdargs added.

  10. Timeline of progress on this project
      • Fall 2002
        • Major changes to simp() to handle 40-bit arithmetic.
        • Enabled Register Coloring and CSE.
        • A lot of work on comp() to allow better instruction selection and other optimizations. E.g. w[1]=(((w[1]{24)}24)+1)&65535 folds down to w[1]=w[1]+1; only after this folding can strength reduction detect the induction variable.
        • Integrated VISTA into mainline vpo.
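The folding mentioned for comp() relies on the sign-extend, add, and mask sequence being equivalent to a plain 16-bit increment. The sketch below checks that equivalence exhaustively. It assumes that `{` and `}` in the RTLs denote a left shift and an arithmetic right shift on a 40-bit value, and that an address register wraps modulo 2^16; the function names are made up for this example.

```python
ACC_BITS = 40
ACC_MASK = (1 << ACC_BITS) - 1


def shl(x: int, n: int) -> int:
    """RTL '{' (assumed): left shift within 40 bits."""
    return (x << n) & ACC_MASK


def sar(x: int, n: int) -> int:
    """RTL '}' (assumed): arithmetic right shift of a 40-bit value."""
    x &= ACC_MASK
    if x & (1 << (ACC_BITS - 1)):
        x -= 1 << ACC_BITS  # reinterpret as a negative Python int
    return (x >> n) & ACC_MASK


def unfolded_increment(w: int) -> int:
    """w[1]=(((w[1]{24)}24)+1)&65535 evaluated step by step."""
    acc = sar(shl(w & 0xFFFF, 24), 24)
    return (acc + 1) & 65535


def folded_increment(w: int) -> int:
    """w[1]=w[1]+1 on a 16-bit address register."""
    return (w + 1) & 0xFFFF


if __name__ == "__main__":
    same = all(unfolded_increment(w) == folded_increment(w)
               for w in range(1 << 16))
    print(same)
```

Because the final mask discards everything above bit 15, the sign extension cannot affect the result, which is what makes the fold safe.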

  11. Timeline of progress on this project
      • Spring 2003
        • Enabled Code motion & Strength reduction.
        • Further refined the machine description/grammar.
        • Started work on Zero Overhead Loop Buffer (ZOLB) support.
        • Second merge of VISTA with vpo done.
        • Retargeted VISTA to the tms320c54x.

  12. To-Dos/Future work
      • Parallel instructions
      • Issues with ZOLB (details later)
      • Scheduling
      • The banz instruction (very useful for loops), which allows comparison of an address register with zero
      • Circular addressing

  13. TI's compiler cl500 has:
      • Inter-procedural analysis
        • For example, if the parameters to a function are constants or globals, the actual parameters are substituted into the function, thus avoiding expensive stack frame setup.
      • Inline expansion of runtime-support library functions.

  14. Code comparison
      Code fragment: get the address of local _a
      VPO:
          r[2]=(w[7]{24)}24;
          r[2]=r[2]+_l0_2_a;
          w[3]=r[2]&65535;       // w[3]=w[7]+_l0_2_a
      cl500 (TI-compiler):
          w[3]=w[7];
          w[3]=w[3]+_l0_2_a;

  15. Code comparison
      • Code fragment:
          for (i = 0; i < STRUCTSIZE; i++)   // STRUCTSIZE=2
              sum += b.field[i];
        Because vpo maintains the running sum in a 16-bit register (an address register), it uses 2 extra instructions and loses the opportunity to convert the loop into a repeat-single instruction. The TI-compiler maintains the sum in an accumulator register.

  16. VPO: AR3 (w[3]) points to the start of the array. AR1 maintains the running sum.
          brc=1;
          rptb .L10_rpt_end-1
      .L10:
          ld *(AR1),A      // r[0]=(w[1]{24)}24;
          add *AR3+,A      // r[0]=r[0]+(W[w[3]]{24)}24;w[3]=w[3]+1;
          stl A,*(AR1)     // w[1]=r[0]&65535;
      .L10_rpt_end:
      --------------------------------------------
      cl500 (TI-compiler): AR3 (w[3]) points to the start of the array. A (r[0]) maintains the running sum.
          RPT #1
      L5: ADD *AR3+,A      // r[0]=r[0]+(W[w[3]]{24)}24;w[3]=w[3]+1;
      L6:
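To make the cost difference between the two listings concrete, here is a small simulation of both loops that counts executed instructions. It is a sketch under assumptions: a repeat counter of n is taken to mean n+1 executions (so brc=1 and RPT #1 both run the body twice, matching STRUCTSIZE=2), memory is a plain Python list, and the function names are hypothetical.

```python
def sext16(word: int) -> int:
    """Interpret a 16-bit word as a signed value."""
    word &= 0xFFFF
    return word - 0x10000 if word & 0x8000 else word


def vpo_repeat_block(mem, ar3, ar1, brc):
    """Running sum kept in AR1; three instructions per iteration."""
    executed = 0
    for _ in range(brc + 1):      # assumed: block runs brc + 1 times
        acc = sext16(ar1)         # ld *(AR1),A
        executed += 1
        acc += sext16(mem[ar3])   # add *AR3+,A
        ar3 += 1
        executed += 1
        ar1 = acc & 0xFFFF        # stl A,*(AR1)
        executed += 1
    return ar1, executed


def cl500_repeat_single(mem, ar3, acc, count):
    """Running sum kept in accumulator A; one instruction per iteration."""
    executed = 0
    for _ in range(count + 1):    # assumed: RPT #count runs count + 1 times
        acc += sext16(mem[ar3])   # ADD *AR3+,A
        ar3 += 1
        executed += 1
    return acc, executed


if __name__ == "__main__":
    field = [3, 4]
    print(vpo_repeat_block(field, 0, 0, 1))
    print(cl500_repeat_single(field, 0, 0, 1))
```

Both versions compute the same sum; the VPO version executes the extra load and store on every iteration, which is the "2 extra instructions" noted on the previous slide.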

  17. Zero Overhead Loop Buffers
      • Loops are buffered in a special internal buffer using a rpt instruction whose parameters are the start label, end label and loop count. Access to this buffer may be faster than fetching the instructions from memory.
      • The usual branch instruction at the end of the loop is no longer necessary when using a repeat instruction, and hence pipeline bubbles are avoided.
      • On the tms320c54x, a single-instruction rpt allows memory block copies/initializations without using an address register.

  18. Detail on ZOLB
      • Advantage of doing it in vpo
        • Can make use of all the information that vpo has already collected about the loop.
      • Easily retargetable
        • Code in the machine-independent part is reused.
        • Code in the machine-dependent part for one target provides a framework for the new target.
      • After conversion to a Repeat Block, registers may be freed up. Other optimizations may get enabled.

  19. Status of ZOLB
      • Repeat Blocks with a loop iteration count known at compile time are implemented.
      • Plan to implement the banz instruction, which is the next best option to ZOLB.

  20. Acknowledgements
      Dr. Jack Davidson (advisor)
      Jason Hiser
      Clark Coleman