


1. Optimizations (continued):
   • Common subexpression elimination
   • Code motion
   • Strength reduction (sketch below)
   • Induction variable elimination
   • Instruction selection
   • Register allocation
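Strength reduction and induction variable elimination appear in this list but are not demonstrated on the later slides, so here is a minimal C sketch (not from the slides; names are illustrative): the per-iteration multiply hidden in the array indexing is reduced to a pointer increment, after which the loop counter itself can be eliminated.

    /* Before: indexing a[i] recomputes the address a + i*sizeof(int),
       i.e. a multiply by the element size, on every iteration.        */
    void zero_before(int *a, int n)
    {
        for (int i = 0; i < n; i++)
            a[i] = 0;
    }

    /* After strength reduction and induction variable elimination:
       the multiply becomes a pointer bumped by one element per
       iteration, and the counter i disappears entirely.               */
    void zero_after(int *a, int n)
    {
        for (int *p = a; p < a + n; p++)
            *p = 0;
    }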

2. Branch chaining:
   • While c do if a then b:

   Before:
   L2:  c
        If (c = false) goto L3
        a
        If (a = false) goto L1
        b
   L1:  goto L2
   L3:

   After (the branch to L1, which only jumps to L2, is retargeted directly to L2):
   L2:  c
        If (c = false) goto L3
        a
        If (a = false) goto L2
        b
   L1:  goto L2
   L3:

3. • Jump elimination:

      Before:
           If (a = 0) goto L1
           goto L2
      L1:

      After:
           If (a != 0) goto L2
      L1:

   • Instruction scheduling (the independent increment is moved between the
     load into R2 and its use):

      Before:
           R2 = j
           K  = R2
           R1 = R1 + 1

      After:
           R2 = j
           R1 = R1 + 1
           K  = R2

4. • Loop unrolling:

      Before:
           For (I = 0; I < 4; I++)
               a(I) = 0;

      After:
           a(0) = 0;
           a(1) = 0;
           a(2) = 0;
           a(3) = 0;

   • Loop fusion:

      Before:
           For (I = 0; I < 4; I++)
               a(I) = 0;
           For (I = 0; I < 4; I++)
               b(I) = 0;

      After:
           For (I = 0; I < 4; I++) {
               a(I) = 0;
               b(I) = 0;
           }

5. Difference between local optimization and global optimization:
   • Common subexpression elimination:
     • Local CSE: a DAG built for the basic block solves the problem.
     • Global CSE: needs (data) flow analysis (see the sketch below).

   The same subexpression (I * 2, and the load from b) is computed twice:
        R1 = I * 2
        R1 = b[R1]
        ...
        R2 = I * 2
        R2 = b[R2]
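A hedged C sketch of the local/global distinction (functions and names are illustrative, not from the slides): when both computations of I * 2 sit in one basic block, a DAG or value numbering over that block exposes the redundancy; once they straddle a branch, proving the value is still available needs dataflow (available-expressions) analysis.

    /* Local CSE: both computations of i * 2 are in one basic block,
       so a DAG built for the block finds the redundancy.              */
    int local_example(int i, const int *b)
    {
        int t1 = i * 2;
        int x  = b[t1];
        int t2 = i * 2;   /* redundant: a local pass rewrites this as t2 = t1 */
        int y  = b[t2];
        return x + y;
    }

    /* Global CSE: the second computation sits in a different basic
       block (after the branch), so deciding whether i * 2 is still
       available requires dataflow analysis across blocks.             */
    int global_example(int i, const int *b, int flag)
    {
        int t1 = i * 2;
        int x  = b[t1];
        if (flag)
            x = 0;         /* control flow splits and rejoins here     */
        int t2 = i * 2;    /* redundant, but only provable globally    */
        return x + b[t2];
    }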

6. Compilation techniques for distributed memory machines – problems:
   • How to allocate memory?
     • Shared arrays are allocated across different machines.
   • Who performs the operations?
     • The owner-computes rule.
     • Referencing remote array elements results in explicit communication.

   For (I = 1; I < 100; I++)
       a(I) = b(I-1)

   A:   0 … 24 | 25 … 49 | 50 … 74 | 75 … 99
   B:   0 … 24 | 25 … 49 | 50 … 74 | 75 … 99
          P0        P1        P2        P3

7. Issues for compiling for distributed memory machines:
   • Data partitioning
   • Computation partitioning
   • Communication optimization
   • Efficient code generation

8. To motivate the problem, let us look at a straightforward implementation of
   computation partitioning (Pingali, Cornell, 1992):
   • Assuming that all processors have a ghost image of all the shared arrays:

   For (I = 1; I < 100; I++)
       If (I own b(I-1) but not a(I)) then
           send b(I-1) to the processor that owns a(I)
       If (I own a(I) but not b(I-1)) then
           receive b(I-1) from the processor that owns b(I-1)
       If (I own a(I)) then
           perform a(I) = b(I-1)

   • This assumption is exactly what we don’t want.
   • Too many communications with small messages.

9. Data partitioning:
   • Specified by hand – HPF (High Performance Fortran), Fortran D, Craft Fortran, Fortran 90D.
     • Allowing manual specification of data alignment and data distribution.
     • Block, cyclic, and block-cyclic distributions (sketched below).
   • Automatically determined by the compiler:
     • In a number of research compilers (Fx (CMU), Paradigm (UIUC), …).
     • The compiler looks at the program and determines the data distribution
       that would minimize the communication cost.
     • Different loops usually require different distributions.
     • Fancy distributions are usually useless.
     • Related to another difficult problem: communication optimization.
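As a hedged sketch (function and parameter names are illustrative, not HPF syntax), the three standard distributions can be summarized by the mapping from a global array index to its owning processor:

    /* Owner of global index i for an n-element array on p processors.
       These are the usual textbook formulas; the names are illustrative. */

    /* BLOCK: contiguous chunks of ceil(n/p) elements per processor.   */
    int owner_block(int i, int n, int p)
    {
        int block = (n + p - 1) / p;      /* block size, rounded up    */
        return i / block;
    }

    /* CYCLIC: elements dealt out round-robin, one at a time.          */
    int owner_cyclic(int i, int p)
    {
        return i % p;
    }

    /* BLOCK-CYCLIC(b): blocks of b elements dealt out round-robin.    */
    int owner_block_cyclic(int i, int b, int p)
    {
        return (i / b) % p;
    }

With n = 100 and p = 4, owner_block reproduces the 25-element chunks shown on slide 6 (for example, owner_block(26, 100, 4) = 1).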

10. Computation partitioning:
    • With the owner-computes rule, the computation partitioning depends only on
      the data partitioning (see the sketch below).
    • Communication is only needed for reads.
    • Other computation partitioning techniques are also possible.
      • Communication may then also be needed for writes.
    • The basic method for compiling a program is the same regardless of the
      computation partitioning method.
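A minimal sketch of the owner-computes guard for the a(I) = b(I-1) loop, before any communication optimization (block distribution assumed; a and b are full-size ghost copies as on slide 8; names are illustrative, and the remote reads are only noted in a comment):

    /* Owner-computes guard: every processor scans the whole iteration
       space but performs only the assignments whose left-hand side
       a(i) it owns (block distribution, block size ceil(n/p)).        */
    void assign_owner_computes(double *a, const double *b,
                               int n, int p, int my_rank)
    {
        int block = (n + p - 1) / p;      /* block size of a and b     */
        for (int i = 1; i < n; i++) {
            if (i / block == my_rank) {   /* do I own a(i)?            */
                /* b[i-1] may live on another processor; compiler-generated
                   code would insert a receive for it before this use
                   (as on slide 8).                                     */
                a[i] = b[i - 1];
            }
        }
    }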

11. Communication optimization:
    • Message vectorization: combine the small messages inside a loop into one
      large message before the loop (sketched below).
    • Redundant communication elimination: eliminate communication of data that
      has already been communicated.
      • Can be done partially.
    • Communication scheduling: combine messages with the same pattern into one
      larger message.
    • Overlap communication with computation:
      • Separate the send and the matching receive as far as possible.
      • Exploit pipelined communication.
      • Allow small messages instead of large messages.
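A hedged sketch of message vectorization for the same a(I) = b(I-1) loop under the block distribution of slide 6 (send_to and recv_from are placeholders for whatever runtime the compiler targets; names are illustrative): each processor needs exactly one remote value, the last b element of its left neighbor, so all communication is hoisted out of the loop into a single neighbor exchange instead of per-iteration sends and receives as on slide 8.

    /* Placeholder point-to-point primitives standing in for the runtime. */
    void send_to(int dest, const double *buf, int count);
    void recv_from(int src, double *buf, int count);

    /* Message-vectorized code for:  For (I = 1; I < n; I++) a(I) = b(I-1)
       with block elements of a and b per processor (n = block * nprocs). */
    void vectorized_shift(double *a_local, const double *b_local,
                          int block, int my_rank, int nprocs)
    {
        double ghost = 0.0;        /* b element owned by the left neighbor */

        /* One message per neighboring pair, hoisted before the loop.   */
        if (my_rank < nprocs - 1)
            send_to(my_rank + 1, &b_local[block - 1], 1);
        if (my_rank > 0)
            recv_from(my_rank - 1, &ghost, 1);

        /* Purely local loop: global index i = my_rank * block + j.     */
        for (int j = 0; j < block; j++) {
            if (my_rank == 0 && j == 0)
                continue;          /* global i = 0 has no b(-1)          */
            a_local[j] = (j == 0) ? ghost : b_local[j - 1];
        }
    }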

12. Generating efficient communication code:
    • Buffer management.
    • Packing and unpacking messages (sketched below).
    • Fancy data distribution methods usually result in high overheads.
    • Consider the features of the underlying communication system.
    • General note: very difficult; the main results come from the HPF group at
      Rice University.
    • Future of HPF:
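A minimal sketch of the pack-and-send pattern (the transport call and all names are placeholders, not a specific library API): under a cyclic distribution the elements destined for one processor are strided in memory, so the generated code gathers them into a contiguous staging buffer and ships them as one message; the receiver unpacks symmetrically.

    #include <stdlib.h>

    /* Placeholder transport primitive. */
    void send_to(int dest, const double *buf, int count);

    /* Pack the strided elements of a cyclically distributed n-element
       array that belong to processor `dest` (element i lives on
       processor i % p) and send them as one contiguous message.       */
    void pack_and_send(const double *a, int n, int p, int dest)
    {
        int count = (n - dest + p - 1) / p;        /* elements going to dest */
        double *buf = malloc(count * sizeof *buf); /* staging buffer         */

        int k = 0;
        for (int i = dest; i < n; i += p)          /* strided gather (pack)  */
            buf[k++] = a[i];

        send_to(dest, buf, k);                     /* one message, not k     */
        free(buf);
    }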
