Parallelizing Iterative Computation for Multiprocessor Architectures


Presentation Transcript


  1. Parallelizing Iterative Computation for Multiprocessor Architectures Peter Cappello

  2. What is the problem? Creating programs for multi-processor units (MPUs) • Multicore processors • Graphics processing units (GPUs)

  3. For whom is it a problem? The compiler designer. EASY: Application Program → Compiler → Executable → CPU

  4. For whom is it a problem? The compiler designer. HARDER: Application Program → Compiler → Executable → MPU

  5. For whom is it a problem? The compiler designer. MUCH HARDER: Application Program → Compiler → Executable → MPU

  6. For whom is it a problem? The application programmer: Application Program → Compiler → Executable → MPU

  7. Complex Machine Consequences • Programmer needs to be highly skilled • Programming is error-prone These consequences imply . . . Increased parallelism ⇒ increased development cost!

  8. Amdahl’s Law The speedup of a program is bounded by its inherently sequential part. (http://en.wikipedia.org/wiki/Amdahl's_law) If • A program needs 20 hours using a CPU • 1 hour cannot be parallelized Then • Minimum execution time ≥ 1 hour. • Maximum speedup ≤ 20.
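
A quick check of the arithmetic above, as a runnable sketch (the formula is Amdahl's law; the 20-hour / 1-hour figures come from the slide):

    // Amdahl's law: speedup(p) = 1 / (f + (1 - f) / p), f = serial fraction.
    public class Amdahl {
        public static void main(String[] args) {
            double total  = 20.0;      // total running time on one CPU, in hours
            double serial = 1.0;       // the hour that cannot be parallelized
            double f = serial / total; // serial fraction = 0.05
            for (int p : new int[] { 2, 4, 16, 1024 }) {
                double speedup = 1.0 / (f + (1.0 - f) / p);
                System.out.printf("p = %4d  speedup = %5.2f%n", p, speedup);
            }
            System.out.printf("bound as p grows: %.1f%n", 1.0 / f); // 20.0
        }
    }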

  9. (Figure: Amdahl's-law speedup curves; http://en.wikipedia.org/wiki/Amdahl's_law)

  10. Parallelization opportunities Scalable parallelism resides in 2 sequential program constructs: • Divide-and-conquer recursion • Iterative statements (for)

  11. 2 schools of thought • Create a general solution (Address everything somewhat well) • Create a specific solution (Address one thing very well)

  12. Focus on iterative statements (for)

      float[] x = new float[n];
      float[] b = new float[n];
      float[][] a = new float[n][n];
      . . .
      for ( int i = 0; i < n; i++ ) {
          b[i] = 0;
          for ( int j = 0; j < n; j++ )
              b[i] += a[i][j]*x[j];
      }
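
The outer loop above is exactly the parallelization opportunity: each iteration writes only b[i], so the rows are independent. A minimal sketch using Java's parallel streams (one possible mechanism; the slides do not prescribe it):

    import java.util.stream.IntStream;

    // Rows are independent: iteration i reads a[i][*] and x, writes only b[i].
    IntStream.range(0, n).parallel().forEach(i -> {
        b[i] = 0;
        for (int j = 0; j < n; j++)
            b[i] += a[i][j] * x[j];
    });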

  13. Matrix-Vector Product b = Ax, illustrated with a 3×3 matrix A.
      b1 = a11*x1 + a12*x2 + a13*x3
      b2 = a21*x1 + a22*x2 + a23*x3
      b3 = a31*x1 + a32*x2 + a33*x3

  14. (Figure: the computation of slide 13 drawn as a dependence graph; row i combines a_i1 x1, a_i2 x2, a_i3 x3 into b_i.)

  15. (Figure: the same dependence graph laid out against SPACE and TIME axes.)

  16. (Figure: an alternative SPACE/TIME layout of the same graph.)

  17. (Figure: a skewed SPACE/TIME layout of the same graph; the a_ij are staggered along diagonals.)

  18. Matrix Product C = AB, illustrated with 2×2 matrices.
      c11 = a11*b11 + a12*b21
      c12 = a11*b12 + a12*b22
      c21 = a21*b11 + a22*b21
      c22 = a21*b12 + a22*b22
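
For reference, the same product written as the familiar triple loop, in the style of slide 12:

    float[][] a = new float[n][n];
    float[][] b = new float[n][n];
    float[][] c = new float[n][n];
    . . .
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            c[i][j] = 0;
            for (int k = 0; k < n; k++)
                c[i][j] += a[i][k] * b[k][j];
        }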

  19. (Figure: the 2×2 matrix product drawn as a dependence graph over the row, col, and k axes.)

  20. (Figure: one space-time (S, T) embedding of the matrix-product graph.)

  21. (Figure: an alternative space-time (S, T) embedding of the same graph.)

  22. Declaring an iterative computation • Index set • Data network • Functions • Space-time embedding

  23. Declaring an Index set
      I1: 1 ≤ i ≤ j ≤ n
      I2: 1 ≤ i ≤ n, 1 ≤ j ≤ n
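
In loop form the two index sets are (a sketch; visit() is a hypothetical stand-in for the point-wise computation):

    // I1: the triangular set { (i, j) : 1 <= i <= j <= n }
    for (int i = 1; i <= n; i++)
        for (int j = i; j <= n; j++)
            visit(i, j);

    // I2: the full square set { (i, j) : 1 <= i <= n, 1 <= j <= n }
    for (int i = 1; i <= n; i++)
        for (int j = 1; j <= n; j++)
            visit(i, j);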

  24. Declaring a Data network
      D1: x: [-1, 0]; b: [0, -1]; a: [0, 0];
      D2: x: [-1, 0]; b: [-1, -1]; a: [0, -1];

  25. Declaring an Index set + Data network
      I1: 1 ≤ i ≤ j ≤ n, with D1: x: [-1, 0]; b: [0, -1]; a: [0, 0];
      (Figure: the D1 edges drawn over the triangular index set, on i and j axes.)

  26. Declaring the Functions
      F1: float x'(float x) { return x; }
          float b'(float b, float x, float a) { return b + a*x; }
      F2: char x'(char x) { return x; }
          boolean b'(boolean b, char x, char a) { return b && a == x; }

  27. Declaring a Spacetime embedding
      E1: space = -i + j; time = i + j.
      E2: space1 = i; space2 = j; time = i + j.

  28. Declaring an iterative computation: Upper-triangular matrix-vector product. UTMVP = (I1, D1, F1, E1). (Figure: its space-time diagram.)
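
Putting slides 22 through 28 together, an iterative computation is a 4-tuple. A minimal sketch of how such a declaration might look as Java data (the type names below are hypothetical illustrations, not the talk's notation; the functions F1 stay as written on slide 26):

    // Hypothetical encoding of the 4-tuple (index set, data network,
    // functions, space-time embedding).
    record Dependence(int di, int dj) {}   // one data-network edge

    interface IndexSet  { boolean contains(int i, int j); }
    interface Embedding { int space(int i, int j); int time(int i, int j); }

    int n = 4;                                            // problem size, fixed for brevity
    IndexSet I1 = (i, j) -> 1 <= i && i <= j && j <= n;   // slide 23
    Dependence x = new Dependence(-1, 0),                 // D1, slide 24
               b = new Dependence(0, -1),
               a = new Dependence(0, 0);
    Embedding E1 = new Embedding() {                      // slide 27
        public int space(int i, int j) { return -i + j; }
        public int time (int i, int j) { return  i + j; }
    };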

  29. Declaring an iterative computation: Full matrix-vector product = (I2, D1, F1, E1). (Figure: its space-time diagram.)

  30. Declaring an iterative computation: Convolution (polynomial product) = (I2, D2, F1, E1). (Figure: its space-time diagram.)

  31. Declaring an iterative computation: String pattern matching = (I2, D2, F2, E1). (Figure: its space-time diagram.)

  32. Declaring an iterative computation: Pipelined string pattern matching = (I2, D2, F2, E2). (Figure: its space-time diagram over space1, space2, and time.)

  33. Iterative computation specification. A declarative specification • is a 4-dimensional design space (actually 5-dimensional: the space embedding is independent of the time embedding) • facilitates reuse of design components.

  34. Starting with an existing language … • Can infer: index set, data network, functions • Cannot infer: space embedding, time embedding

  35. Spacetime embedding • Start with it as a program annotation • More advanced: the compiler optimizes the embedding based on a program-annotated figure of merit.

  36. Work • Work out details of the notation • Implement in Java, C, Matlab, HDL, … • Map the virtual processor network onto the actual processor network • Java: map processors to Threads, links to Channels (see the sketch below) • GPU: map processors to GPU processing elements • Challenge: the spacetime embedding depends on the underlying architecture
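
A minimal sketch of the Java mapping named above: each virtual processor becomes a Thread, and each data-network link becomes a channel, modeled here with a BlockingQueue (the wiring is an illustrative assumption, not the talk's design):

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    // One UTMVP cell: receive x, apply F1's step b' = b + a*x, forward x.
    BlockingQueue<Float> xIn  = new ArrayBlockingQueue<>(1);  // the x: [-1, 0] link
    BlockingQueue<Float> xOut = new ArrayBlockingQueue<>(1);
    final float a = 1.0f;   // this cell's coefficient (placeholder value)

    Thread cell = new Thread(() -> {
        try {
            float x = xIn.take();   // receive x on the inbound channel
            float b = a * x;        // b starts at 0 at this boundary cell
            xOut.put(x);            // forward x to the neighboring processor
            // b would be forwarded along its own b: [0, -1] link
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    });
    cell.start();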

  37. Work … • The output of one iterative computation is the input to another. • Develop a notation for specifying composite iterative computations.

  38. Thanks for listening! Questions?
