
Advanced Compiler Techniques


Presentation Transcript


  1. Advanced Compiler Techniques: Compiler for Modern Architectures. LIU Xianhua, School of EECS, Peking University

  2. Outline • Introduction & Basic Concepts • Superscalar & VLIW • Accelerator-Based Computing • Compiler for Embedded Systems • Compiler Optimization for Many-Core

  3. Features of Machine Architectures • Pipelining • Multiple execution units • pipelined • Vector operations • Parallel processing • Shared memory, distributed memory, message-passing • VLIW and Superscalar instruction issue • Registers • Cache hierarchy • Combinations of the above • Parallel-vector machines

  4. Moore’s Law

  5. Instruction Pipelining • Instruction pipelining • DLX Instruction Pipeline • What is the performance challenge?

  6. Replicated Execution Logic • Pipelined Execution Units • Multiple Execution Units • What is the performance challenge?

  7. Vector Operations • Apply same operation to different positions of one or more arrays • Goal: keep pipelines of execution units full • Example:
        VLOAD  V1,A
        VLOAD  V2,B
        VADD   V3,V1,V2
        VSTORE V3,C
     • How do we specify vector operations in Fortran 77?
        DO I = 1, 64
          C(I) = A(I) + B(I)
          D(I+1) = D(I) + C(I)
        ENDDO
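A hedged sketch (not from the original slides) of how a vectorizer would treat the loop above: the first statement is independent across iterations, but the second carries a recurrence on D, so the loop is distributed and only the first statement becomes a vector operation, shown here in Fortran 90 array syntax.

      ! C(I) = A(I) + B(I) has no cross-iteration dependence: vectorize it.
      C(1:64) = A(1:64) + B(1:64)
      ! D(I+1) = D(I) + C(I) is a recurrence: each iteration needs the
      ! previous iteration's result, so it stays as a sequential loop.
      DO I = 1, 64
        D(I+1) = D(I) + C(I)
      ENDDO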

  8. Superscalar & VLIW • Multiple instruction issue on the same cycle • Wide word instruction (or superscalar) • Usually one instruction slot per functional unit • What are the performance challenges? • Finding enough parallel instructions • Avoiding interlocks • Scheduling instructions early enough

  9. SMP Parallelism • Multiple processors with uniform shared memory • Task Parallelism • Independent tasks • Data Parallelism • the same task on different data • What is the performance challenge?
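A minimal illustration (my sketch, using the PARALLEL DO pseudo-notation of the later slides; PARALLEL SECTIONS and the two subroutines are likewise illustrative, not constructs from the slides):

      ! Data parallelism: the same operation applied to different data.
      PARALLEL DO I = 1, N
        Y(I) = 2.0 * X(I) + Y(I)
      END PARALLEL DO

      ! Task parallelism: independent tasks run concurrently (the two
      ! hypothetical subroutines are assumed to touch disjoint data).
      PARALLEL SECTIONS
        CALL SOLVE_PRESSURE(P, N)
        CALL SOLVE_ENERGY(E, N)
      END PARALLEL SECTIONS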

  10. Bernstein’s Conditions • When is it safe to run two tasks R1 and R2 in parallel? • If none of the following holds: • R1 writes into a memory location that R2 reads • R2 writes into a memory location that R1 reads • Both R1 and R2 write to the same memory location • How can we convert this to loop parallelism? • Think of loop iterations as tasks
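A small worked example (mine, not from the slides) of applying Bernstein's conditions to loop iterations treated as tasks:

      ! Independent: iteration I writes only A(I) and reads only B(I), C(I),
      ! so no iteration writes a location that another iteration reads or writes.
      DO I = 1, N
        A(I) = B(I) + C(I)        ! iterations may run in parallel
      ENDDO

      ! Not independent: iteration I reads A(I-1), which iteration I-1 writes,
      ! violating the first condition, so the iterations must run in order.
      DO I = 2, N
        A(I) = A(I-1) + B(I)
      ENDDO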

  11. Memory Hierarchy • Problem: memory is moving farther away in processor cycles • Latency and bandwidth difficulties • Solution • Reuse data in cache and registers • Challenge: How can we enhance reuse? • Graph-coloring register allocation works well • But only for scalars • Example:
        DO I = 1, N
          DO J = 1, N
            C(I) = C(I) + A(J)
          ENDDO
        ENDDO
     • Strip mining to reuse data from cache (sketched below)
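A minimal sketch of the strip-mine-and-interchange idea mentioned above (the strip size of 64 is an illustrative assumption, not from the slides):

      ! Each 64-element strip of A now stays in cache while it is reused
      ! across every iteration of I, instead of A being streamed in full
      ! from memory once per iteration of I.
      DO JJ = 1, N, 64
        DO I = 1, N
          DO J = JJ, MIN(JJ+63, N)
            C(I) = C(I) + A(J)
          ENDDO
        ENDDO
      ENDDO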

  12. Distributed Memory • Memory packaged with processors • Message passing • Distributed shared memory • SMP clusters • Shared memory on node, message passing off node • What are the performance issues? • Minimizing communication • Data placement • Optimizing communication • Aggregation • Overlap of communication and computation
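A hedged sketch of the "overlap of communication and computation" point, written as a Fortran fragment with nonblocking MPI calls (MPI_IRECV, MPI_ISEND, and MPI_WAITALL are the real MPI routines; the halo-exchange setting, the arrays, and the UPDATE_* subroutines are illustrative assumptions):

      INCLUDE 'mpif.h'
      INTEGER REQ(2), STATS(MPI_STATUS_SIZE,2), IERR
      ! HALO, EDGE, U, NB, N, LEFT, RIGHT are assumed to be declared elsewhere.
      ! Post the boundary exchange first ...
      CALL MPI_IRECV(HALO, NB, MPI_REAL, LEFT,  0, MPI_COMM_WORLD, REQ(1), IERR)
      CALL MPI_ISEND(EDGE, NB, MPI_REAL, RIGHT, 0, MPI_COMM_WORLD, REQ(2), IERR)
      ! ... compute on interior points that do not need the incoming halo ...
      CALL UPDATE_INTERIOR(U, N)
      ! ... and wait only when the boundary points are actually needed.
      CALL MPI_WAITALL(2, REQ, STATS, IERR)
      CALL UPDATE_BOUNDARY(U, HALO, N)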

  13. Compiler Technologies • Program Transformations • Most of these architectural issues can be dealt with by restructuring transformations that can be reflected in source • Vectorization, parallelization, cache reuse enhancement • Challenges: • Determining when transformations are legal • Selecting transformations based on profitability • Low level code generation • Some issues must be dealt with at a low level • Prefetch insertion • Instruction scheduling • All require some understanding of the ways that instructions and statements depend on one another (share data)

  14. A Common Problem: Matrix Multiply
        DO I = 1, N
          DO J = 1, N
            C(J,I) = 0.0
            DO K = 1, N
              C(J,I) = C(J,I) + A(J,K) * B(K,I)
            ENDDO
          ENDDO
        ENDDO

  15. Problem for Pipelines • Inner loop of matrix multiply is a reduction • Solution: • work on several iterations of the J-loop simultaneously

  16. MatMult for a Pipelined Machine
        DO I = 1, N
          DO J = 1, N, 4
            C(J,I)   = 0.0    ! Register 1
            C(J+1,I) = 0.0    ! Register 2
            C(J+2,I) = 0.0    ! Register 3
            C(J+3,I) = 0.0    ! Register 4
            DO K = 1, N
              C(J,I)   = C(J,I)   + A(J,K)   * B(K,I)
              C(J+1,I) = C(J+1,I) + A(J+1,K) * B(K,I)
              C(J+2,I) = C(J+2,I) + A(J+2,K) * B(K,I)
              C(J+3,I) = C(J+3,I) + A(J+3,K) * B(K,I)
            ENDDO
          ENDDO
        ENDDO

  17. Matrix Multiply on Vector Machines
        DO I = 1, N
          DO J = 1, N
            C(J,I) = 0.0
            DO K = 1, N
              C(J,I) = C(J,I) + A(J,K) * B(K,I)
            ENDDO
          ENDDO
        ENDDO

  18. Problems for Vectors • Inner loop must be vector • And should be stride 1 • Vector registers have finite length (Cray: 64 elements) • Would like to reuse vector register in the compute loop • Solution • Strip mine the loop over the stride-one dimension to 64 • Move the iterate over strip loop to the innermost position • Vectorize it there

  19. Vectorizing Matrix Multiply
        DO I = 1, N
          DO J = 1, N, 64
            DO JJ = 0, 63
              C(J+JJ,I) = 0.0
              DO K = 1, N
                C(J+JJ,I) = C(J+JJ,I) + A(J+JJ,K) * B(K,I)
              ENDDO
            ENDDO
          ENDDO
        ENDDO

  20. Vectorizing Matrix Multiply
        DO I = 1, N
          DO J = 1, N, 64
            DO JJ = 0, 63
              C(J+JJ,I) = 0.0
            ENDDO
            DO K = 1, N
              DO JJ = 0, 63
                C(J+JJ,I) = C(J+JJ,I) + A(J+JJ,K) * B(K,I)
              ENDDO
            ENDDO
          ENDDO
        ENDDO

  21. MatMult for a Vector Machine
        DO I = 1, N
          DO J = 1, N, 64
            C(J:J+63,I) = 0.0
            DO K = 1, N
              C(J:J+63,I) = C(J:J+63,I) + A(J:J+63,K) * B(K,I)
            ENDDO
          ENDDO
        ENDDO

  22. Matrix Multiply on Parallel SMPs
        DO I = 1, N
          DO J = 1, N
            C(J,I) = 0.0
            DO K = 1, N
              C(J,I) = C(J,I) + A(J,K) * B(K,I)
            ENDDO
          ENDDO
        ENDDO

  23. Matrix Multiply on Parallel SMPs
        DO I = 1, N    ! Independent for all I
          DO J = 1, N
            C(J,I) = 0.0
            DO K = 1, N
              C(J,I) = C(J,I) + A(J,K) * B(K,I)
            ENDDO
          ENDDO
        ENDDO

  24. Problems on a Parallel Machine • Parallelism must be found at the outer loop level • But how do we know? • Solution • Bernstein’s conditions • Can we apply them to loop iterations? • Yes, with dependence • Statement S2 depends on statement S1 if • S2 comes after S1 • S2 must come after S1 in any correct reordering of statements • Dependence is usually keyed to memory • There is a path from S1 to S2, and • S1 writes and S2 reads the same location, or • S1 reads and S2 writes the same location, or • S1 and S2 both write the same location (see the illustration below)
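A small illustration (mine, not from the slides) of the three memory-based cases, with S1 and S2 being the two statements in each loop body:

      DO I = 1, N
        T    = A(I) + B(I)      ! S1 writes T
        C(I) = T * T            ! S2 reads T:  S1 writes, S2 reads (flow dependence)
      ENDDO

      DO I = 1, N
        C(I) = T + A(I)         ! S1 reads T
        T    = B(I)             ! S2 writes T: S1 reads, S2 writes (antidependence)
      ENDDO

      DO I = 1, N
        T = 0.0                 ! S1 writes T
        T = A(I)                ! S2 writes T: both write (output dependence)
      ENDDO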

  25. MatMult on a Shared-Memory MP
        PARALLEL DO I = 1, N
          DO J = 1, N
            C(J,I) = 0.0
            DO K = 1, N
              C(J,I) = C(J,I) + A(J,K) * B(K,I)
            ENDDO
          ENDDO
        END PARALLEL DO

  26. MatMult on a Vector SMP
        PARALLEL DO I = 1, N
          DO J = 1, N, 64
            C(J:J+63,I) = 0.0
            DO K = 1, N
              C(J:J+63,I) = C(J:J+63,I) + A(J:J+63,K) * B(K,I)
            ENDDO
          ENDDO
        END PARALLEL DO

  27. Matrix Multiply for Cache Reuse
        DO I = 1, N
          DO J = 1, N
            C(J,I) = 0.0
            DO K = 1, N
              C(J,I) = C(J,I) + A(J,K) * B(K,I)
            ENDDO
          ENDDO
        ENDDO

  28. Problems on Cache • There is reuse of C but no reuse of A and B • Solution • Block the loops so you get reuse of both A and B • Multiply a block of A by a block of B and add to block of C • When is it legal to interchange the iterate over block loops to the inside?

  29. MatMult on a Uniprocessor with Cache
        DO I = 1, N, S
          DO J = 1, N, S
            DO p = I, I+S-1
              DO q = J, J+S-1
                C(q,p) = 0.0
              ENDDO
            ENDDO
            DO K = 1, N, T
              DO p = I, I+S-1
                DO q = J, J+S-1
                  DO r = K, K+T-1
                    C(q,p) = C(q,p) + A(q,r) * B(r,p)
                  ENDDO
                ENDDO
              ENDDO
            ENDDO
          ENDDO
        ENDDO
     (Slide annotation on block cache footprints: S×T elements, S×T elements, S² elements)

  30. MatMult on a Distributed-Memory MP
        PARALLEL DO I = 1, N
          PARALLEL DO J = 1, N
            C(J,I) = 0.0
          ENDDO
        ENDDO
        PARALLEL DO I = 1, N, S
          PARALLEL DO J = 1, N, S
            DO K = 1, N, T
              DO p = I, I+S-1
                DO q = J, J+S-1
                  DO r = K, K+T-1
                    C(q,p) = C(q,p) + A(q,r) * B(r,p)
                  ENDDO
                ENDDO
              ENDDO
            ENDDO
          ENDDO
        ENDDO

  31. Compiler Optimization for Superscalar & VLIW

  32. What is Superscalar? A Superscalar machine executes multiple independent instructions in parallel. They are pipelined as well. • “Common” instructions (arithmetic, load/store, conditional branch) can be executed independently. • Equally applicable to RISC & CISC, but more straightforward in RISC machines. • The order of execution is usually assisted by the compiler.

  33. Superscalar vs. Superpipeline

  34. Limitations of Superscalar • Dependent upon: - Instruction level parallelism possible - Compiler based optimization - Hardware support • Limited by • Data dependency • Procedural dependency • Resource conflicts

  35. Effect of Dependencies on Superscalar Operation • Notes: 1) Superscalar operation is doubly impacted by a stall. 2) CISC machines typically have variable-length instructions, so an instruction must be at least partially decoded before the next can be fetched – not good for superscalar operation

  36. In-Order Issue – In-Order Completion (Example) • Assume: • I1 requires 2 cycles to execute • I3 & I4 conflict for the same functional unit • I5 depends upon value produced by I4 • I5 & I6 conflict for a functional unit

  37. In-Order Issue – Out-of-Order Completion (Example) • Again: • I1 requires 2 cycles to execute • I3 & I4 conflict for the same functional unit • I5 depends upon value produced by I4 • I5 & I6 conflict for a functional unit • How does this affect interrupts?

  38. Out-of-Order Issue – Out-of-Order Completion (Example) • Again: • I1 requires 2 cycles to execute • I3 & I4 conflict for the same functional unit • I5 depends upon value produced by I4 • I5 & I6 conflict for a functional unit • Note: I5 depends upon I4, but I6 does not

  39. Speedups of Machine Organizations (Without Procedural Dependencies) • Not worth duplication of functional units without register renaming • Need instruction window large enough (more than 8, probably not more than 32)

  40. View of Superscalar Execution

  41. Example: Pentium 4, A Superscalar CISC Machine

  42. Pentium 4 Alternative View

  43. Pentium 4 Pipeline • 20 stages!

  44. VLIW Approach • Superscalar hardware is tough • E.g., for a two-issue superscalar: every cycle the hardware examines the opcodes of 2 instructions and their 6 register specifiers, dynamically determines whether one or two instructions can issue, and dispatches them to the appropriate functional units (FUs) • VLIW approach • Packages multiple independent instructions into one very long instruction • Burden of finding independent instructions is on the compiler • No dependence within an issue package, or at least indicate when a dependence is present • Simpler hardware

  45. Superscalar vs. VLIW

  46. Superscalar vs. VLIW

  47. Role of the Compiler for Superscalar and VLIW

  48. What is EPIC? • Explicitly Parallel Instruction Computing – a new hardware architecture paradigm that is the successor to VLIW (Very Long Instruction Word). • EPIC features: • Parallel instruction encoding • Flexible instruction grouping • Large register file • Predicated instruction set • Control and data speculation • Compiler control of memory hierarchy

  49. History • A timeline of architectures: hard-wired discrete logic, then the microprocessor (advent of VLSI), CISC (microcode), RISC (microcode bad!), then superpipelined and superscalar designs, VLIW, and EPIC • Hennessy and Patterson were early RISC proponents. • VLIW (very long instruction word) architectures, 1981: • Alan Charlesworth (Floating Point Systems) • Josh Fisher (Yale / Multiflow – Multiflow Trace machines) • Bob Rau (TRW / Cydrome – Cydra-5 machine)

  50. History (cont.) • Superscalar processors • Tilak Agerwala and John Cocke of IBM coined the term • First superscalar microprocessor (Intel i960CA) and workstation (IBM RS/6000) introduced in 1989. • Intel Pentium in 1993. • Foundations of EPIC - HP • HP FAST (Fine-grained Architecture and Software Technologies) research project in 1989 included Bob Rau; this work later developed into the PlayDoh architecture. • In 1990 Bill Worley started the PA-Wide Word project (PA-WW). Josh Fisher, also hired by HP, made contributions to these projects. • HP teamed with Intel in 1994 – announced EPIC, IA-64 architecture in 1997. • First implementation was Merced / Itanium, announced in 1999.
