


1. Optimizations (continued):
   • Common subexpression elimination
   • Code motion
   • Strength reduction (sketch below)
   • Induction variable elimination
   • Instruction selection
   • Register allocation
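Strength reduction and induction variable elimination appear in this list but are not demonstrated on the later slides, so here is a minimal C sketch (not from the slides; names are illustrative): the per-iteration multiply hidden in the array indexing is reduced to a pointer increment, after which the loop counter itself can be eliminated.

    /* Before: indexing a[i] recomputes the address a + i*sizeof(int),
       i.e. a multiply by the element size, on every iteration.        */
    void zero_before(int *a, int n)
    {
        for (int i = 0; i < n; i++)
            a[i] = 0;
    }

    /* After strength reduction and induction variable elimination:
       the multiply becomes a pointer bumped by one element per
       iteration, and the counter i disappears entirely.               */
    void zero_after(int *a, int n)
    {
        for (int *p = a; p < a + n; p++)
            *p = 0;
    }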

2. Branch chaining:
   • While c do if a then b:

   Before:
   L2:  c
        If (c = false) goto L3
        a
        If (a = false) goto L1
        b
   L1:  goto L2
   L3:

   After (the branch to L1, which only jumps to L2, is retargeted directly to L2):
   L2:  c
        If (c = false) goto L3
        a
        If (a = false) goto L2
        b
   L1:  goto L2
   L3:

3. • Jump elimination:

      Before:
           If (a = 0) goto L1
           goto L2
      L1:

      After:
           If (a != 0) goto L2
      L1:

   • Instruction scheduling (the independent increment is moved between the
     load into R2 and its use):

      Before:
           R2 = j
           K  = R2
           R1 = R1 + 1

      After:
           R2 = j
           R1 = R1 + 1
           K  = R2

4. • Loop unrolling:

      Before:
           For (I = 0; I < 4; I++)
               a(I) = 0;

      After:
           a(0) = 0;
           a(1) = 0;
           a(2) = 0;
           a(3) = 0;

   • Loop fusion:

      Before:
           For (I = 0; I < 4; I++)
               a(I) = 0;
           For (I = 0; I < 4; I++)
               b(I) = 0;

      After:
           For (I = 0; I < 4; I++) {
               a(I) = 0;
               b(I) = 0;
           }

5. Difference between local optimization and global optimization:
   • Common subexpression elimination:
     • Local CSE: a DAG built for the basic block solves the problem.
     • Global CSE: needs (data) flow analysis (see the sketch below).

   The same subexpression (I * 2, and the load from b) is computed twice:
        R1 = I * 2
        R1 = b[R1]
        ...
        R2 = I * 2
        R2 = b[R2]
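A hedged C sketch of the local/global distinction (functions and names are illustrative, not from the slides): when both computations of I * 2 sit in one basic block, a DAG or value numbering over that block exposes the redundancy; once they straddle a branch, proving the value is still available needs dataflow (available-expressions) analysis.

    /* Local CSE: both computations of i * 2 are in one basic block,
       so a DAG built for the block finds the redundancy.              */
    int local_example(int i, const int *b)
    {
        int t1 = i * 2;
        int x  = b[t1];
        int t2 = i * 2;   /* redundant: a local pass rewrites this as t2 = t1 */
        int y  = b[t2];
        return x + y;
    }

    /* Global CSE: the second computation sits in a different basic
       block (after the branch), so deciding whether i * 2 is still
       available requires dataflow analysis across blocks.             */
    int global_example(int i, const int *b, int flag)
    {
        int t1 = i * 2;
        int x  = b[t1];
        if (flag)
            x = 0;         /* control flow splits and rejoins here     */
        int t2 = i * 2;    /* redundant, but only provable globally    */
        return x + b[t2];
    }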

6. Compilation techniques for distributed memory machines – problems:
   • How to allocate memory?
     • Shared arrays are allocated across different machines.
   • Who performs the operations?
     • The owner-computes rule.
     • Referencing remote array elements results in explicit communication.

   For (I = 1; I < 100; I++)
       a(I) = b(I-1)

   A:   0 … 24 | 25 … 49 | 50 … 74 | 75 … 99
   B:   0 … 24 | 25 … 49 | 50 … 74 | 75 … 99
          P0        P1        P2        P3

7. Issues for compiling for distributed memory machines:
   • Data partitioning
   • Computation partitioning
   • Communication optimization
   • Efficient code generation

8. To motivate the problem, let us look at a straightforward implementation of
   computation partitioning (Pingali, Cornell, 1992):
   • Assuming that all processors have a ghost image of all the shared arrays:

   For (I = 1; I < 100; I++)
       If (I own b(I-1) but not a(I)) then
           send b(I-1) to the processor that owns a(I)
       If (I own a(I) but not b(I-1)) then
           receive b(I-1) from the processor that owns b(I-1)
       If (I own a(I)) then
           perform a(I) = b(I-1)

   • This assumption is exactly what we don’t want.
   • Too many communications with small messages.

9. Data partitioning:
   • Specified by hand – HPF (High Performance Fortran), Fortran D, Craft Fortran, Fortran 90D.
     • Allowing manual specification of data alignment and data distribution.
     • Block, cyclic, and block-cyclic distributions (sketched below).
   • Automatically determined by the compiler:
     • In a number of research compilers (Fx (CMU), Paradigm (UIUC), …).
     • The compiler looks at the program and determines the data distribution
       that would minimize the communication cost.
     • Different loops usually require different distributions.
     • Fancy distributions are usually useless.
     • Related to another difficult problem: communication optimization.
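As a hedged sketch (function and parameter names are illustrative, not HPF syntax), the three standard distributions can be summarized by the mapping from a global array index to its owning processor:

    /* Owner of global index i for an n-element array on p processors.
       These are the usual textbook formulas; the names are illustrative. */

    /* BLOCK: contiguous chunks of ceil(n/p) elements per processor.   */
    int owner_block(int i, int n, int p)
    {
        int block = (n + p - 1) / p;      /* block size, rounded up    */
        return i / block;
    }

    /* CYCLIC: elements dealt out round-robin, one at a time.          */
    int owner_cyclic(int i, int p)
    {
        return i % p;
    }

    /* BLOCK-CYCLIC(b): blocks of b elements dealt out round-robin.    */
    int owner_block_cyclic(int i, int b, int p)
    {
        return (i / b) % p;
    }

With n = 100 and p = 4, owner_block reproduces the 25-element chunks shown on slide 6 (for example, owner_block(26, 100, 4) = 1).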

10. Computation partitioning:
    • With the owner-computes rule, the computation partitioning depends only on
      the data partitioning (see the sketch below).
    • Communication is only needed for reads.
    • Other computation partitioning techniques are also possible.
      • Communication may then also be needed for writes.
    • The basic method for compiling a program is the same regardless of the
      computation partitioning method.
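A minimal sketch of the owner-computes guard for the a(I) = b(I-1) loop, before any communication optimization (block distribution assumed; a and b are full-size ghost copies as on slide 8; names are illustrative, and the remote reads are only noted in a comment):

    /* Owner-computes guard: every processor scans the whole iteration
       space but performs only the assignments whose left-hand side
       a(i) it owns (block distribution, block size ceil(n/p)).        */
    void assign_owner_computes(double *a, const double *b,
                               int n, int p, int my_rank)
    {
        int block = (n + p - 1) / p;      /* block size of a and b     */
        for (int i = 1; i < n; i++) {
            if (i / block == my_rank) {   /* do I own a(i)?            */
                /* b[i-1] may live on another processor; compiler-generated
                   code would insert a receive for it before this use
                   (as on slide 8).                                     */
                a[i] = b[i - 1];
            }
        }
    }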

11. Communication optimization:
    • Message vectorization: combine the small messages inside a loop into one
      large message before the loop (sketched below).
    • Redundant communication elimination: eliminate communication of data that
      has already been communicated.
      • Can be done partially.
    • Communication scheduling: combine messages with the same pattern into one
      larger message.
    • Overlap communication with computation:
      • Separate the send and the matching receive as far as possible.
      • Exploit pipelined communication.
      • Allow small messages instead of large messages.
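A hedged sketch of message vectorization for the same a(I) = b(I-1) loop under the block distribution of slide 6 (send_to and recv_from are placeholders for whatever runtime the compiler targets; names are illustrative): each processor needs exactly one remote value, the last b element of its left neighbor, so all communication is hoisted out of the loop into a single neighbor exchange instead of per-iteration sends and receives as on slide 8.

    /* Placeholder point-to-point primitives standing in for the runtime. */
    void send_to(int dest, const double *buf, int count);
    void recv_from(int src, double *buf, int count);

    /* Message-vectorized code for:  For (I = 1; I < n; I++) a(I) = b(I-1)
       with block elements of a and b per processor (n = block * nprocs). */
    void vectorized_shift(double *a_local, const double *b_local,
                          int block, int my_rank, int nprocs)
    {
        double ghost = 0.0;        /* b element owned by the left neighbor */

        /* One message per neighboring pair, hoisted before the loop.   */
        if (my_rank < nprocs - 1)
            send_to(my_rank + 1, &b_local[block - 1], 1);
        if (my_rank > 0)
            recv_from(my_rank - 1, &ghost, 1);

        /* Purely local loop: global index i = my_rank * block + j.     */
        for (int j = 0; j < block; j++) {
            if (my_rank == 0 && j == 0)
                continue;          /* global i = 0 has no b(-1)          */
            a_local[j] = (j == 0) ? ghost : b_local[j - 1];
        }
    }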

12. Generating efficient communication code:
    • Buffer management.
    • Packing and unpacking messages (sketched below).
    • Fancy data distribution methods usually result in high overheads.
    • Consider the features of the underlying communication system.
    • General note: very difficult; the main results come from the HPF group at
      Rice University.
    • Future of HPF:
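A minimal sketch of the pack-and-send pattern (the transport call and all names are placeholders, not a specific library API): under a cyclic distribution the elements destined for one processor are strided in memory, so the generated code gathers them into a contiguous staging buffer and ships them as one message; the receiver unpacks symmetrically.

    #include <stdlib.h>

    /* Placeholder transport primitive. */
    void send_to(int dest, const double *buf, int count);

    /* Pack the strided elements of a cyclically distributed n-element
       array that belong to processor `dest` (element i lives on
       processor i % p) and send them as one contiguous message.       */
    void pack_and_send(const double *a, int n, int p, int dest)
    {
        int count = (n - dest + p - 1) / p;        /* elements going to dest */
        double *buf = malloc(count * sizeof *buf); /* staging buffer         */

        int k = 0;
        for (int i = dest; i < n; i += p)          /* strided gather (pack)  */
            buf[k++] = a[i];

        send_to(dest, buf, k);                     /* one message, not k     */
        free(buf);
    }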
