
Computer Architecture


Presentation Transcript


  1. Computer Architecture, Lecture 7: Compiler Considerations and Optimizations

  2. Structure of Recent Compilers
  • Front End: transforms the source language into a common intermediate form (language dependent, machine independent). Note: only a few companies build their own front ends; a C++ front end's source code is about 30 times larger than a C front end's, so most front ends down-convert C++ to C before compilation.
  • High-Level Optimization: high-level and loop transformations, for example procedure in-lining (language dependent, machine independent).
  • Global Optimization: global and local optimization plus register allocation (small language dependence, small machine dependence).
  • Code Generation: detailed instruction selection and machine-dependent optimization (no language dependence, highly machine dependent).

  3. Compiler Primary Targets • Program correctness • Speed of the compiled code • Compilation time? • Splitting the compiler into phases helps write bug-free compiler code

  4. Optimizations • High-level • Local (Basic Block) • Global (across branches) • Register Allocation, Live Range Analysis • Processor Dependent

  5. Optimization Names
  • Procedure Integration (procedure in-lining)
  • Common Sub-expression Elimination / Dead Code Elimination:
        A = b + c   ; dead code, eliminated: A is overwritten before this result is ever used
        A = x + y
    Similarly, a call to a procedure that returns no value and uses only local variables can be eliminated. (Test this in VC++.)
  • Constant Propagation: a variable that always holds the same constant is treated as that constant. (Osborn's Law: "Constants aren't; variables won't.")
  • Global Sub-expression Elimination
  • Copy Propagation: after a = b, later uses of a are replaced by b.
  • Code Motion: code whose result does not change with the loop index is moved out of the loop.
  • Induction Variable Elimination: A = A + 5 in a loop that runs n times is replaced by A = A + 5 * n and moved out of the loop, provided A is not otherwise used inside the loop.
  • Strength Reduction: an expensive operation is replaced by a cheaper one, e.g., a multiply by shifts and adds where possible, or a*25 + b*25 by (a + b) * 25.
  • Pipeline Scheduling
  • Branch Optimization
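
To make a few of these concrete, here is a small added C sketch (not from the lecture; the function and variable names are invented): the second version is what a compiler could produce from the first by copy propagation, code motion of the loop-invariant offset, and strength reduction of the constant multiplies.

    #include <stddef.h>

    /* Before optimization: a copy, a loop-invariant computation inside the
       loop, and multiplications by constants.                               */
    void scale_before(unsigned *x, unsigned base, size_t n) {
        for (size_t i = 0; i < n; i++) {
            unsigned b = base;
            unsigned a = b;            /* copy: a is just another name for base */
            unsigned offset = a * 8;   /* loop invariant: does not depend on i  */
            x[i] = x[i] * 4 + offset;
        }
    }

    /* After: copy propagation removes a and b, code motion hoists base * 8
       out of the loop, and strength reduction turns * 8 and * 4 into shifts. */
    void scale_after(unsigned *x, unsigned base, size_t n) {
        unsigned offset = base << 3;        /* base * 8, hoisted out of the loop */
        for (size_t i = 0; i < n; i++)
            x[i] = (x[i] << 2) + offset;    /* x[i] * 4 as a shift               */
    }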

  6. Problems with Pointers A = 5; p = x + y; *p = 9; only the programmer knows whether p points to A (p == &A). Because the store through p may modify A, the compiler cannot safely keep A in a register.
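
A small C sketch of the same aliasing problem (my own illustration, not from the slides): the compiler must assume the store through p may change A, so A cannot stay in a register across it; the C99 restrict qualifier is one way the programmer can rule aliasing out.

    int A;                         /* a global: any int pointer might point at it */

    int pointer_problem(int *p) {
        A = 5;                     /* compiler would like to keep A in a register */
        *p = 9;                    /* only the programmer knows whether p == &A   */
        return A;                  /* A must be reloaded from memory: it may be 9 */
    }

    /* With restrict, the programmer promises dst and src never alias, so the
       compiler may keep src[0] in a register for the whole loop.              */
    void add_first(int * restrict dst, const int * restrict src, int n) {
        for (int i = 0; i < n; i++)
            dst[i] = dst[i] + src[0];
    }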

  7. Architecture Help
  • Provide orthogonality: the operations, the data types, the addressing modes, and the register functions should be orthogonal.
  • Simplify trade-offs between alternatives (with caches and pipelining, trade-offs have become very complex). For example, the most difficult one in a register-memory architecture: how many times must a variable be referenced before it is worth assigning it a register?
  • Provide instructions that bind quantities known at compile time as constants.
  • Most SIMD kernels are hand-coded because there is little compiler support for them.

  8. Hand-Coded vs. Compiler-Generated Code on the TMS320C6203 (VLIW CPU), reported May 2000

  9. Basic Compiler Techniques • Basic pipelining • Static loop unrolling (example on the following slides)

  10. Example (Contd…)
  Loop: L.D    F0, 0(R1)      ; load x[i]
        ADD.D  F4, F0, F2     ; x[i] + s (F2 holds the scalar s)
        S.D    F4, 0(R1)      ; store the result back to x[i]
        DADDUI R1, R1, #-8    ; step the pointer down by one double word
        BNE    R1, R2, Loop   ; repeat until R1 reaches the loop bound in R2

  11. Example (Without Scheduling)
  Loop: L.D    F0, 0(R1)      ; cycle 1
        stall                 ; cycle 2  (load-use delay)
        ADD.D  F4, F0, F2     ; cycle 3
        stall                 ; cycle 4
        stall                 ; cycle 5
        S.D    F4, 0(R1)      ; cycle 6
        DADDUI R1, R1, #-8    ; cycle 7
        stall                 ; cycle 8
        BNE    R1, R2, Loop   ; cycle 9
        stall                 ; cycle 10 (branch successor flushed)
  Total: 10 clock cycles per iteration.

  12. Example (With Scheduling)
  Loop: L.D    F0, 0(R1)
        DADDUI R1, R1, #-8
        ADD.D  F4, F0, F2
        stall
        BNE    R1, R2, Loop
        S.D    F4, 8(R1)      ; branch delay slot; offset is now +8 because R1 was already decremented
  Total: 6 clock cycles per iteration (3 for the data operations, 3 overhead).

  13. Example (Static Loop Unrolling, 4 times)
  Loop: L.D    F0, 0(R1)
        L.D    F6, -8(R1)
        L.D    F10, -16(R1)
        L.D    F14, -24(R1)
        ADD.D  F4, F0, F2
        ADD.D  F8, F6, F2
        ADD.D  F12, F10, F2
        ADD.D  F16, F14, F2
        S.D    F4, 0(R1)
        S.D    F8, -8(R1)
        DADDUI R1, R1, #-32
        S.D    F12, 16(R1)    ; offset adjusted (was -16) because R1 was already decremented
        BNE    R1, R2, Loop
        S.D    F16, 8(R1)     ; branch delay slot; offset adjusted (was -24)
  Total: 14 cycles for 4 elements = 3.5 clock cycles per element.
  Compiler considerations (see the source-level sketch below):
  • Use of the branch delay slot
  • Loop-level independence of the iterations
  • Register assignment (a different register for each unrolled copy)
  • Proper loop adjustment (bounds, offsets, and the pointer update)
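
At the source level the same transformation looks roughly like the added C sketch below (my rendering of the x[i] = x[i] + s loop from slide 17; it assumes the trip count is a multiple of 4, which is why "proper loop adjustment" matters in general).

    /* Original loop: loop overhead (decrement, branch) is paid for every element. */
    void add_scalar(double *x, double s) {
        for (int i = 1000; i > 0; i = i - 1)
            x[i] = x[i] + s;
    }

    /* Unrolled 4 times: overhead is paid once per 4 elements, and the 4
       independent statements can be scheduled together. Assumes the trip
       count (1000) is a multiple of 4.                                     */
    void add_scalar_unrolled(double *x, double s) {
        for (int i = 1000; i > 0; i = i - 4) {
            x[i]     = x[i]     + s;
            x[i - 1] = x[i - 1] + s;
            x[i - 2] = x[i - 2] + s;
            x[i - 3] = x[i - 3] + s;
        }
    }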

  14. Example (Static Dual Issue: 1 integer and 1 FP instruction per clock cycle, unrolled 5 times)
        Integer instruction        FP instruction
  Loop: L.D    F0, 0(R1)
        L.D    F6, -8(R1)
        L.D    F10, -16(R1)        ADD.D F4, F0, F2
        L.D    F14, -24(R1)        ADD.D F8, F6, F2
        L.D    F18, -32(R1)        ADD.D F12, F10, F2
        S.D    F4, 0(R1)           ADD.D F16, F14, F2
        S.D    F8, -8(R1)          ADD.D F20, F18, F2
        S.D    F12, -16(R1)
        DADDUI R1, R1, #-40
        S.D    F16, 16(R1)
        BNE    R1, R2, Loop
        S.D    F20, 8(R1)          ; branch delay slot
  Total: 12 cycles for 5 elements = 2.4 clock cycles per element. (Each ADD.D is scheduled two cycles after its load because of the load-use delay.)

  15. VLIW • The compiler formats the issue packets • The compiler ensures that dependences are not present within an issue packet • Instructions are 64 to 200 bits long

  16. Example (VLIW: 1 integer, 2 FP, and 2 LD/ST slots per clock cycle, 5 slots total)
  Result: 1.29 clock cycles per element, with 23 slots used out of a potential 45 (the loop is unrolled 7 times and finishes in 9 cycles, so 9/7 ≈ 1.29 and 9 cycles × 5 slots = 45).

  17. Loop-Level Parallelism
  • Loop-carried dependence: data computed in one loop iteration is required by a later iteration.
  • A parallel loop (no loop-carried dependence):
    for (i = 1000; i > 0; i = i - 1)
        x[i] = x[i] + s;
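
For contrast, here is an added C illustration (not from the slides) of a loop that does carry a dependence next to the slide's parallel loop:

    /* Loop-carried dependence: iteration i reads x[i + 1], which the previous
       iteration (i + 1) just wrote, so the iterations must run in order.      */
    void carried(double *x, double s) {
        for (int i = 999; i > 0; i = i - 1)
            x[i] = x[i + 1] + s;
    }

    /* No loop-carried dependence: each iteration touches only its own x[i],
       so all iterations are independent (the slide's parallel loop).         */
    void parallel(double *x, double s) {
        for (int i = 1000; i > 0; i = i - 1)
            x[i] = x[i] + s;
    }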

  18. Example
  for (i = 1; i <= 100; i = i + 1) {
      A[i+1] = A[i] + C[i];      /* S1 */
      B[i+1] = B[i] + A[i+1];    /* S2 */
  }
  Dependences?
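
One way to work the exercise (my annotation, not given on the slide): repeating the loop with the dependences marked as comments.

    for (int i = 1; i <= 100; i = i + 1) {
        A[i + 1] = A[i] + C[i];      /* S1 reads A[i], which S1 wrote in iteration i-1
                                        (as A[(i-1)+1]): a loop-carried dependence     */
        B[i + 1] = B[i] + A[i + 1];  /* S2 reads B[i], written by S2 in iteration i-1
                                        (loop-carried), and A[i+1], written by S1 in
                                        the SAME iteration (not loop-carried)          */
    }
    /* The loop-carried dependences (S1 on S1, S2 on S2) are circular, so
       successive iterations cannot be overlapped: this loop is not parallel. */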

  19. Example 2
  • Make the following loop parallel (one worked answer follows below).
  for (i = 1; i <= 100; i = i + 1) {
      A[i] = A[i] + B[i];      /* S1 */
      B[i+1] = C[i] + D[i];    /* S2 */
  }
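
A worked transformation (my answer, not shown on the slide): S1 needs B[i], which S2 produced in the previous iteration, but nothing feeds back from S1 to S2, so the dependence can be absorbed by peeling the first S1 and the last S2.

    A[1] = A[1] + B[1];                  /* first S1, peeled off the front          */
    for (int i = 1; i <= 99; i = i + 1) {
        B[i + 1] = C[i] + D[i];          /* old S2                                  */
        A[i + 1] = A[i + 1] + B[i + 1];  /* old S1 of the next iteration: the B it
                                            needs is now produced in the same
                                            iteration, so nothing is loop-carried   */
    }
    B[101] = C[100] + D[100];            /* last S2, peeled off the end             */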

  20. The GCD Test
  • Suppose a loop stores into x[a*j + b] and later fetches from x[c*k + d].
  • A simple test: if a loop-carried dependence exists, then GCD(c, a) must divide (d - b) with no remainder; equivalently, if GCD(c, a) does not divide (d - b), no dependence is possible.
  for (i = 1; i <= 100; i = i + 1)
      x[2*i + 3] = x[2*i] * 5;
  Here a = 2, b = 3, c = 2, d = 0: GCD(2, 2) = 2 does not divide d - b = -3, so there is no loop-carried dependence.
  This test ignores the loop bounds.
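
A minimal C sketch of the test (my own code with invented helper names; as the slide says, it checks only divisibility and ignores the loop bounds):

    #include <stdio.h>

    /* Greatest common divisor, used only by this sketch. */
    static int gcd(int a, int b) {
        while (b != 0) { int t = a % b; a = b; b = t; }
        return a < 0 ? -a : a;
    }

    /* GCD test: a store to x[a*j + b] and a load from x[c*k + d] can be
       loop-carried dependent only if gcd(c, a) evenly divides (d - b).   */
    static int dependence_possible(int a, int b, int c, int d) {
        return (d - b) % gcd(c, a) == 0;
    }

    int main(void) {
        /* x[2*i + 3] = x[2*i] * 5  ->  a = 2, b = 3, c = 2, d = 0 */
        printf("dependence possible? %s\n",
               dependence_possible(2, 3, 2, 0) ? "yes" : "no");   /* prints "no" */
        return 0;
    }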

  21. Example 3
  • Use renaming to find ILP (one renamed version follows below).
  for (i = 1; i <= 100; i = i + 1) {
      Y[i] = X[i] / c1;    /* S1 */
      X[i] = X[i] + c2;    /* S2 */
      Z[i] = Y[i] + c3;    /* S3 */
      Y[i] = c4 - Y[i];    /* S4 */
  }
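
One possible renamed version (my answer, not on the slide; T and X1 are new arrays introduced for the illustration): renaming removes the output and anti-dependences on Y and X, leaving only the true dependences of S3 and S4 on S1.

    for (int i = 1; i <= 100; i = i + 1) {
        T[i]  = X[i] / c1;    /* was S1: Y renamed to T                              */
        X1[i] = X[i] + c2;    /* was S2: result renamed X -> X1, removing the
                                 anti-dependence with S1's read of X[i]              */
        Z[i]  = T[i] + c3;    /* was S3: reads the renamed T                         */
        Y[i]  = c4 - T[i];    /* was S4: reads T; the output dependence with S1 and
                                 the anti-dependence with S3 (both on Y) are gone    */
    }
    /* Only true dependences remain (S1 -> S3, S1 -> S4); any later uses of X
       outside the loop would have to read X1 instead.                          */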

  22. Other Techniques
  • Copy propagation with constant folding:
        ADDI R1, R2, #4
        ADDI R1, R1, #4
    becomes
        ADDI R1, R2, #8
  • Tree height reduction (valid when the intermediate sums are not needed elsewhere):
        ADD R1, R2, R3
        ADD R4, R1, R5
        ADD R7, R4, R8
    becomes
        ADD R1, R2, R3
        ADD R4, R5, R8
        ADD R7, R1, R4
  • Recurrence optimization (see the C sketch below): the recurrence sum = sum + x[i], unrolled, is re-associated as
        sum = (sum + x[1]) + (x[2] + x[3]) + (x[4] + x[5])
    so the parenthesized pairs can be computed in parallel.
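
A C-level sketch of the recurrence optimization (my own added illustration): splitting the accumulator shortens the serial dependence chain so independent partial sums can overlap; it assumes n is even and that re-association is acceptable (for floating point it can change rounding).

    /* Serial recurrence: every addition must wait for the previous one. */
    long sum_serial(const long *x, int n) {
        long sum = 0;
        for (int i = 0; i < n; i++)
            sum = sum + x[i];
        return sum;
    }

    /* Recurrence optimization: two independent partial sums halve the length
       of the dependence chain (assumes n is a multiple of 2).               */
    long sum_split(const long *x, int n) {
        long sum0 = 0, sum1 = 0;
        for (int i = 0; i < n; i += 2) {
            sum0 = sum0 + x[i];
            sum1 = sum1 + x[i + 1];
        }
        return sum0 + sum1;
    }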
