trish
Uploaded by
55 SLIDES
701 VUES
560LIKES

Automatic Task Generation for DAGuE: A Dataflow Representation Tool for QR Factorization

DESCRIPTION

This work presents the DAGuE system, an innovative tool designed to generate automatic task representations for dataflow applications, specifically focusing on QR factorization. The DAGuE compiler facilitates the transformation from serial code into a dataflow representation via QUARK-specific syntax. It includes an annotated example using various tasks like `Insert_Task` for task dependencies and memory management to enable maximum parallelism. The paper discusses the sequential code representation, task isolation, and synchronization edges essential for optimizing computational tasks within high-performance computing environments.

1 / 55

Télécharger la présentation

Automatic Task Generation for DAGuE: A Dataflow Representation Tool for QR Factorization

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript

Playing audio...

  1. Automatic task generation for DAGuE http://icl.utk.edu/dague George Bosilca, AurelienBouteiller, Anthony Danalis, Mathieu Faverge, Thomas Herault, Jack Dongarra

  2. The DAGuE system

  3. DAGuE Compiler Serial Code to Dataflow Representation

  4. Example: QR Factorization

  5. Input Format – Quark (PLASMA) for (k = 0; k < MT; k++) { Insert_Task( zgeqrt, A[k][k], INOUT, T[k][k], OUTPUT); for (m = k+1; m < MT; m++) { Insert_Task( ztsqrt, A[k][k], INOUT | REGION_D|REGION_U, A[m][k], INOUT | LOCALITY, T[m][k], OUTPUT); } for (n = k+1; n < NT; n++) { Insert_Task( zunmqr, A[k][k], INPUT | REGION_L, T[k][k], INPUT, A[k][m], INOUT); for (m = k+1; m < MT; m++) { Insert_Task( ztsmqr, A[k][n], INOUT, A[m][n], INOUT | LOCALITY, A[m][k], INPUT, T[m][k], INPUT); } } } • Sequential C code • Annotated through QUARK-specific syntax • Insert_Task • INOUT, OUTPUT, INPUT • REGION_L, REGION_U, REGION_D, … • LOCALITY • StarPU syntax in progress

  6. DAGuE Compiler analysis steps for (k = 0; k < MT; k++) { Insert_Task( zgeqrt, A[k][k], INOUT, T[k][k], OUTPUT); for (m = k+1; m < MT; m++) { Insert_Task( ztsqrt, A[k][k], INOUT | REGION_D|REGION_U, A[m][k], INOUT | LOCALITY, T[m][k], OUTPUT); } for (n = k+1; n < NT; n++) { Insert_Task( zunmqr, A[k][k], INPUT | REGION_L, T[k][k], INPUT, A[k][m], INOUT); for (m = k+1; m < MT; m++) { Insert_Task( ztsmqr, A[k][n], INOUT, A[m][n], INOUT | LOCALITY, A[m][k], INPUT, T[m][k], INPUT); } } } • Record all USEs • Record all DEFinitions • Formulate as Omega Relations: • all true (flow) dependencies • all output dependencies • Compute the differences • Formulate all anti-dependencies • Finalize synchronization edges

  7. Traditional Compiler (Control Flow) for(k… for (k = 0; k < MT; k++) { Insert_Task( zgeqrt, A[k][k], INOUT, T[k][k], OUTPUT); for (m = k+1; m < MT; m++) { Insert_Task( ztsqrt, A[k][k], INOUT | REGION_D|REGION_U, A[m][k], INOUT | LOCALITY, T[m][k], OUTPUT); } for (n = k+1; n < NT; n++) { Insert_Task( zunmqr, A[k][k], INPUT | REGION_L, T[k][k], INPUT, A[k][m], INOUT); for (m = k+1; m < MT; m++) { Insert_Task( ztsmqr, A[k][n], INOUT, A[m][n], INOUT | LOCALITY, A[m][k], INPUT, T[m][k], INPUT); } } } for(m… Control Flow Graph for(n… for(m…

  8. Data Flow imposes ordering for (k = 0; k < MT; k++) { Insert_Task( zgeqrt, A[k][k], INOUT, T[k][k], OUTPUT); for (m = k+1; m < MT; m++) { Insert_Task( ztsqrt, A[k][k], INOUT | REGION_D|REGION_U, A[m][k], INOUT | LOCALITY, T[m][k], OUTPUT); } for (n = k+1; n < NT; n++) { Insert_Task( zunmqr, A[k][k], INPUT | REGION_L, T[k][k], INPUT, A[k][m], INOUT); for (m = k+1; m < MT; m++) { Insert_Task( ztsmqr, A[k][n], INOUT, A[m][n], INOUT | LOCALITY, A[m][k], INPUT, T[m][k], INPUT); } } }

  9. Dataflow Analysis MEM Incoming Data • Example on task DGEQRT of QR Outgoing Data k=0 for k = 0 .. N-1 A[k][k], T[k][k] < - GEQRT( A[k][k] ) for m = k+1 .. N-1 A[k][k] | U, A[m][k], T[m][k] < - TSQRT( A[k][k] | U, A[m][k], T[m][k]) for n = k+1 .. N-1 A[k][n] < - UNMQR( A[k][k] | L, T[k][k], A[k][n] ) for m = k+1 .. N-1 A[k][n], A[m][n] < - TSMQR( A[m][k], T[m][k], A[k][n], A[m][n] ) n=k+1 m=k+1

  10. Dataflow Analysis MEM Incoming Data • Example on task DGEQRT of QR • Polyhedral Analysis through Omega • Compute algebraic expressions for: • Source and destination tasks • Necessary conditions for that data flow to exist k=SIZE-1 Outgoing Data k=0 for k = 0 .. N-1 A[k][k], T[k][k] < - GEQRT( A[k][k] ) for m = k+1 .. N-1 A[k][k] | U, A[m][k], T[m][k] < - TSQRT( A[k][k] | U, A[m][k], T[m][k]) for n = k+1 .. N-1 A[k][n] < - UNMQR( A[k][k] | L, T[k][k], A[k][n] ) for m = k+1 .. N-1 A[k][n], A[m][n] < - TSMQR( A[m][k], T[m][k], A[k][n], A[m][n] ) U L n=k+1 m=k+1

  11. Intermediate Representation: Job Data Flow GEQRT(k)  /* Execution space */  k = 0..( MT < NT ) ? MT-1 : NT-1 )  /* Locality */  : A(k, k) RWA <- (k == 0) ? A(k, k) : A1TSMQR(k-1, k, k)          -> (k < NT-1) ? AUNMQR(k, k+1 .. NT-1) [type = LOWER]          -> (k < MT-1) ? A1TSQRT(k, k+1)         [type = UPPER]          -> (k == MT-1) ? A(k, k)                  [type = UPPER] WRITET <- T(k, k)          -> T(k, k)          -> (k <  NT-1) ? TUNMQR(k, k+1 .. NT-1)  /* Priority */  ;(NT-k)*(NT-k)*(NT-k) BODY zgeqrt( A, T ) END Control flow is eliminated, therefore maximum parallelism is possible

  12. Intermediate Representation: Job Data Flow GEQRT(k)  /* Execution space */  k = 0..( MT < NT ) ? MT-1 : NT-1 )  /* Locality */  : A(k, k) RWA <- (k == 0) ? A(k, k) : A1TSMQR(k-1, k, k)          -> (k < NT-1) ? AUNMQR(k, k+1 .. NT-1) [type = LOWER]          -> (k < MT-1) ? A1TSQRT(k, k+1)         [type = UPPER]          -> (k == MT-1) ? A(k, k)                  [type = UPPER] WRITET <- T(k, k)          -> T(k, k)          -> (k <  NT-1) ? TUNMQR(k, k+1 .. NT-1)  /* Priority */  ;(NT-k)*(NT-k)*(NT-k) BODY zgeqrt( A, T ) END Control flow is eliminated, therefore maximum parallelism is possible

  13. Dataflow Analysis USE for k = 0 .. N-1 { A[k][k], T[k][k] < - GEQRT( A[k][k] ) for m = k+1 .. N-1 { A[k][k] | U, A[m][k], T[m][k] < - TSQRT( A[k][k] | U, A[m][k], T[m][k]) } for n = k+1 .. N-1 { A[k][n] < - UNMQR( A[k][k] | L, T[k][k], A[k][n] ) for m = k+1 .. N-1 { A[k][n], A[m][n] < - TSMQR( A[m][k], T[m][k], A[k][n], A[m][n] ) } } } A[k’][k’] : 0 <= k’ <N-1 DEF A[m][n] : k+1<= m < N-1 k+1<= n < N-1 0 <= k < N-1

  14. Dataflow Analysis USE for k = 0 .. N-1 { A[k][k], T[k][k] < - GEQRT( A[k][k] ) for m = k+1 .. N-1 { A[k][k] | U, A[m][k], T[m][k] < - TSQRT( A[k][k] | U, A[m][k], T[m][k]) } for n = k+1 .. N-1 { A[k][n] < - UNMQR( A[k][k] | L, T[k][k], A[k][n] ) for m = k+1 .. N-1 { A[k][n], A[m][n] < - TSMQR( A[m][k], T[m][k], A[k][n], A[m][n] ) } } } A[k’][k’] : 0 <= k’ <N-1 Ctrl Flow k’ > k DEF A[m][n] : k+1<= m < N-1 k+1<= n < N-1 0 <= k < N-1

  15. Dataflow Analysis for k = 0 .. N-1 { A[k][k], T[k][k] < - GEQRT( A[k][k] ) for m = k+1 .. N-1 { A[k][k] | U, A[m][k], T[m][k] < - TSQRT( A[k][k] | U, A[m][k], T[m][k]) } for n = k+1 .. N-1 { A[k][n] < - UNMQR( A[k][k] | L, T[k][k], A[k][n] ) for m = k+1 .. N-1 { A[k][n], A[m][n] < - TSMQR( A[m][k], T[m][k], A[k][n], A[m][n] ) } } } Flow Dependency (RAW) Relation [k,m,n] -> [k’] : k+1<= m < N-1 k+1<= n < N-1 0 <= k < N-1 0 <= k’ < N-1 k < k’ m = k’ n = k’

  16. Dataflow Analysis for k = 0 .. N-1 { A[k][k], T[k][k] < - GEQRT( A[k][k] ) for m = k+1 .. N-1 { A[k][k] | U, A[m][k], T[m][k] < - TSQRT( A[k][k] | U, A[m][k], T[m][k]) } for n = k+1 .. N-1 { A[k][n] < - UNMQR( A[k][k] | L, T[k][k], A[k][n] ) for m = k+1 .. N-1 { A[k][n], A[m][n] < - TSMQR( A[m][k], T[m][k], A[k][n], A[m][n] ) } } } Omega Simplified {[k,m,m] -> [m] : k+1, 0 <= m < N}

  17. Dataflow Analysis for k = 0 .. N-1 { A[k][k], T[k][k] < - GEQRT( A[k][k] ) for m = k+1 .. N-1 { A[k][k] | U, A[m][k], T[m][k] < - TSQRT( A[k][k] | U, A[m][k], T[m][k]) } for n = k+1 .. N-1 { A[k][n] < - UNMQR( A[k][k] | L, T[k][k], A[k][n] ) for m = k+1 .. N-1 { A[k][n], A[m][n] < - TSMQR( A[m][k], T[m][k], A[k][n], A[m][n] ) } } } Omega Simplified {[k,m,m] -> [m] : k+1, 0 <= m < N} Output Dependency (WAW)

  18. Dataflow Analysis for k = 0 .. N-1 { A[k][k], T[k][k] < - GEQRT( A[k][k] ) for m = k+1 .. N-1 { A[k][k] | U, A[m][k], T[m][k] < - TSQRT( A[k][k] | U, A[m][k], T[m][k]) } for n = k+1 .. N-1 { A[k][n] < - UNMQR( A[k][k] | L, T[k][k], A[k][n] ) for m = k+1 .. N-1 { A[k][n], A[m][n] < - TSMQR( A[m][k], T[m][k], A[k][n], A[m][n] ) } } } Real Edge: Flow - Output {[k,k+1,k+1] -> [k+1] : 0<= k <= N-2} n=k+1 m=k+1

  19. Dataflow Analysis for k = 0 .. N-1 { A[k][k], T[k][k] < - GEQRT( A[k][k] ) for m = k+1 .. N-1 { A[k][k] | U, A[m][k], T[m][k] < - TSQRT( A[k][k] | U, A[m][k], T[m][k]) } for n = k+1 .. N-1 { A[k][n] < - UNMQR( A[k][k] | L, T[k][k], A[k][n] ) for m = k+1 .. N-1 { A[k][n], A[m][n] < - TSMQR( A[m][k], T[m][k], A[k][n], A[m][n] ) } } } Real Edge: Flow - Output {[k,k+1,k+1] -> [k+1] : 0<= k <= N-2} GEQRT’s incoming edge {[k-1,k, k] -> [k] : 0 < k <= N-1} n=k+1 m=k+1

  20. Dataflow Analysis for k = 0 .. N-1 { A[k][k], T[k][k] < - GEQRT( A[k][k] ) for m = k+1 .. N-1 { A[k][k] | U, A[m][k], T[m][k] < - TSQRT( A[k][k] | U, A[m][k], T[m][k]) } for n = k+1 .. N-1 { A[k][n] < - UNMQR( A[k][k] | L, T[k][k], A[k][n] ) for m = k+1 .. N-1 { A[k][n], A[m][n] < - TSMQR( A[m][k], T[m][k], A[k][n], A[m][n] ) } } } Real Edge: Flow - Output {[k,k+1,k+1] -> [k+1] : 0<= k <= N-2} GEQRT’s incoming edge {[k-1,k, k] -> [k] : 0 < k <= N-1} (k>0) ? TSMQR(k-1,k,k) n=k+1 m=k+1

  21. Dataflow Analysis for k = 0 .. N-1 { A[k][k], T[k][k] < - GEQRT( A[k][k] ) for m = k+1 .. N-1 { A[k][k] | U, A[m][k], T[m][k] < - TSQRT( A[k][k] | U, A[m][k], T[m][k]) } for n = k+1 .. N-1 { A[k][n] < - UNMQR( A[k][k] | L, T[k][k], A[k][n] ) for m = k+1 .. N-1 { A[k][n], A[m][n] < - TSMQR( A[m][k], T[m][k], A[k][n], A[m][n] ) } } } GEQRT - > UNMQR {[k] -> [k, n] : 0 <= k <= N-2 && k+1 <= n <=N-1 }

  22. Dataflow Analysis for k = 0 .. N-1 { A[k][k], T[k][k] < - GEQRT( A[k][k] ) for m = k+1 .. N-1 { A[k][k] | U, A[m][k], T[m][k] < - TSQRT( A[k][k] | U, A[m][k], T[m][k]) } for n = k+1 .. N-1 { A[k][n] < - UNMQR( A[k][k] | L, T[k][k], A[k][n] ) for m = k+1 .. N-1 { A[k][n], A[m][n] < - TSMQR( A[m][k], T[m][k], A[k][n], A[m][n] ) } } } GEQRT - > UNMQR {[k] -> [k, n] : 0 <= k <= N-2 && k+1 <= n <=N-1 } -> (k<N-1) ? UNMQR(k, k+1..N-1)

  23. Anti-dependencies • In theory, anti-deps do not matter in distributed memory • But real machines are distributed/shared memory hybrids • Anti-deps must create synchronization edges • Overestimating anti-deps is safe (albeit slow) • Output deps should be treated the same … in theory

  24. Anti-dependencies in QR? for k = 0 .. N-1 { A[k][k], T[k][k] < - GEQRT( A[k][k] ) for m = k+1 .. N-1 { A[k][k] | U, A[m][k], T[m][k] < - TSQRT( A[k][k] | U, A[m][k], T[m][k]) } for n = k+1 .. N-1 { A[k][n] < - UNMQR( A[k][k] | L, T[k][k], A[k][n] ) for m = k+1 .. N-1 { A[k][n], A[m][n] < - TSMQR( A[m][k], T[m][k], A[k][n], A[m][n] ) } } }

  25. Anti-dependencies in QR? for k = 0 .. N-1 { A[k][k], T[k][k] < - GEQRT( A[k][k] ) for m = k+1 .. N-1 { A[k][k] | U, A[m][k], T[m][k] < - TSQRT( A[k][k] | U, A[m][k], T[m][k]) } for n = k+1 .. N-1 { A[k][n] < - UNMQR( A[k][k] | L, T[k][k], A[k][n] ) for m = k+1 .. N-1 { A[k][n], A[m][n] < - TSMQR( A[m][k], T[m][k], A[k][n], A[m][n] ) } } } n=k+1 m=k+1

  26. Anti-dependencies in QR? for k = 0 .. N-1 { A[k][k], T[k][k] < - GEQRT( A[k][k] ) for m = k+1 .. N-1 { A[k][k] | U, A[m][k], T[m][k] < - TSQRT( A[k][k] | U, A[m][k], T[m][k]) } for n = k+1 .. N-1 { A[k][n] < - UNMQR( A[k][k] | L, T[k][k], A[k][n] ) for m = k+1 .. N-1 { A[k][n], A[m][n] < - TSMQR( A[m][k], T[m][k], A[k][n], A[m][n] ) } } } TSMQR - > GEQRT {[k, m, n] -> [k‘] : n=m=k‘}

  27. Anti-dependencies in QR? for k = 0 .. N-1 { A[k][k], T[k][k] < - GEQRT( A[k][k] ) for m = k+1 .. N-1 { A[k][k] | U, A[m][k], T[m][k] < - TSQRT( A[k][k] | U, A[m][k], T[m][k]) } for n = k+1 .. N-1 { A[k][n] < - UNMQR( A[k][k] | L, T[k][k], A[k][n] ) for m = k+1 .. N-1 { A[k][n], A[m][n] < - TSMQR( A[m][k], T[m][k], A[k][n], A[m][n] ) } } } TSMQR - > GEQRT {[k, m, n] -> [k‘] : n=m=k‘} n=k+1 m=k+1

  28. TSMQR(k,m,n) k = 0..((mt < nt) ? mt-1:nt-1 ) m = k+1..mt-1 n = k+1..nt-1 GEQRT(k) k = 0..((mt<nt) ? mt-1:nt-1 ) {[k,m,n]->[n] : k+1==n && k+1==m} {[k,m,n]->[k+1,m,n]: n>1+k && m>1+k} {[k,m,n]->[k,m+1,n]: m<mt-1} {[k,m,n]->[k+1,n] : m==k+1 && n>m} {[k]->[k,k+1] : mt >= (k+2)} {[k]->[k,n] : k < n < nt && k < nt-1} {[k,m]->[k,m,n] : k<nt-1 && k<n< nt {[k,m,n]->[n,m] : n==(k+1) && m>n} {[k,n]->[k,k+1,n]: k < mt-1} {[k,m]->[k,m+1]: m<mt-1} UNMQR(k,n) k = 0..(( mt < nt ) ? mt-1:nt-1) n = k+1..nt-1 TSQRT(k,m) k = 0..((mt < nt) ? mt-1:nt-1 ) m = k+1..mt-1

  29. TSMQR(k,m,n) k = 0..((mt < nt) ? mt-1:nt-1 ) m = k+1..mt-1 n = k+1..nt-1 GEQRT(k) k = 0..((mt<nt) ? mt-1:nt-1 ) {[k,m,n]->[n] : k+1==n && k+1==m} anti-dep: {[k,m,n] -> [k‘] : n=m=k‘} {[k,m,n]->[k+1,m,n]: n>1+k && m>1+k} {[k,m,n]->[k,m+1,n]: m<mt-1} {[k,m,n]->[k+1,n] : m==k+1 && n>m} {[k]->[k,k+1] : mt >= (k+2)} {[k]->[k,n] : k < n < nt && k < nt-1} {[k,m]->[k,m,n] : k<nt-1 && k<n< nt {[k,m,n]->[n,m] : n==(k+1) && m>n} {[k,n]->[k,k+1,n]: k < mt-1} {[k,m]->[k,m+1]: m<mt-1} UNMQR(k,n) k = 0..(( mt < nt ) ? mt-1:nt-1) n = k+1..nt-1 TSQRT(k,m) k = 0..((mt < nt) ? mt-1:nt-1 ) m = k+1..mt-1

  30. TSMQR(k,m,n) k = 0..((mt < nt) ? mt-1:nt-1 ) m = k+1..mt-1 n = k+1..nt-1 GEQRT(k) k = 0..((mt<nt) ? mt-1:nt-1 ) {[k,m,n]->[n] : k+1==n && k+1==m} anti-dep: {[k,m,n] -> [k‘] : n=m=k‘} {[k,m,n]->[k+1,m,n]: n>1+k && m>1+k} {[k,m,n]->[k,m+1,n]: m<mt-1} {[k,m,n]->[k+1,n] : m==k+1 && n>m} {[k]->[k,k+1] : mt >= (k+2)} {[k]->[k,n] : k < n < nt && k < nt-1} {[k,m]->[k,m,n] : k<nt-1 && k<n< nt {[k,m,n]->[n,m] : n==(k+1) && m>n} {[k,n]->[k,k+1,n]: k < mt-1} {[k,m]->[k,m+1]: m<mt-1} UNMQR(k,n) k = 0..(( mt < nt ) ? mt-1:nt-1) n = k+1..nt-1 TSQRT(k,m) k = 0..((mt < nt) ? mt-1:nt-1 ) m = k+1..mt-1

  31. TSMQR(k,m,n) k = 0..((mt < nt) ? mt-1:nt-1 ) m = k+1..mt-1 n = k+1..nt-1 GEQRT(k) k = 0..((mt<nt) ? mt-1:nt-1 ) {[k,m,n]->[n] : k+1==n && k+1==m} {[k,m,n]->[k+1,m,n]: n>1+k && m>1+k} {[k,m,n]->[k,m+1,n]: m<mt-1} {[k,m,n]->[k+1,n] : m==k+1 && n>m} {[k]->[k,k+1] : mt >= (k+2)} {[k]->[k,n] : k < n < nt && k < nt-1} {[k,m]->[k,m,n] : k<nt-1 && k<n< nt {[k,m,n]->[n,m] : n==(k+1) && m>n} {[k,n]->[k,k+1,n]: k < mt-1} {[k,m]->[k,m+1]: m<mt-1} UNMQR(k,n) k = 0..(( mt < nt ) ? mt-1:nt-1) n = k+1..nt-1 TSQRT(k,m) k = 0..((mt < nt) ? mt-1:nt-1 ) m = k+1..mt-1

  32. TSMQR(k,m,n) k = 0..((mt < nt) ? mt-1:nt-1 ) m = k+1..mt-1 n = k+1..nt-1 GEQRT(k) k = 0..((mt<nt) ? mt-1:nt-1 ) {[k,m,n]->[n] : k+1==n && k+1==m} {[k,m,n]->[k+1,m,n]: n>1+k && m>1+k} {[k,m,n]->[k,m+1,n]: m<mt-1} {[k,m,n]->[k+1,n] : m==k+1 && n>m} {[k]->[k,k+1] : mt >= (k+2)} {[k]->[k,n] : k < n < nt && k < nt-1} {[k,m]->[k,m,n] : k<nt-1 && k<n< nt {[k,m,n]->[n,m] : n==(k+1) && m>n} {[k,n]->[k,k+1,n]: k < mt-1} {[k,m]->[k,m+1]: m<mt-1} UNMQR(k,n) k = 0..(( mt < nt ) ? mt-1:nt-1) n = k+1..nt-1 TSQRT(k,m) k = 0..((mt < nt) ? mt-1:nt-1 ) m = k+1..mt-1

  33. FinalizingAnti-deps

  34. FinalizingAnti-deps

  35. FinalizingAnti-deps Transitive Closure is undecidable!

  36. IsDescendantOf()

  37. Current/Future Work Can we address non-affine codes?

  38. Example: Reduction Operation • Reduction: apply a user defined operator on each data and store the result in a single location. (Suppose the operator is associative and commutative)

  39. Example: Reduction Operation • Reduction: apply a user defined operator on each data and store the result in a single location. (Suppose the operator is associative and commutative) for(s = 1; s < N/2; s = 2*s) for(i = 0; i < N-s; i+= 2*s) V[i] = op(V[i], V[i+s]) Issue: Non-affine loops lead to non-polyhedral array accessing

  40. Example: Reduction Operation 0 reduce(l, p) : V(p) l = 1 .. depth+1 p = 0 .. (MT / (1<<l)) RW A <- (1 == l) ? V(2*p) : Areduce( l-1, 2*p )        -> ((depth+1) == l) ? V(0)        -> (0 == (p%2))? Areduce(l+1, p/2) : Breduce(l+1, p/2) READB <- ((p*(1<<l) + (1<<(l-1))) > MT) ? V(0)        <- (1 == l) ? V(2*p+1)        <- (1 != l) ? Areduce( l-1, p*2+1 ) BODY operator(A, B); END 1 2 3 Current Solution: Hand-writing of the data dependency using the intermediate Data Flow representation

  41. Handling Reduction for(k=0; k<NT; k++){ for (i = 1; i < NT/2; i*=2 ) { for (j = 0; j < NT-i; j+=2*i) { Task_Rdc: A[k][j] += A[k][j+i]; } } }

  42. Handling Reduction for(k=0; k<NT; k++){ for (i = 1; i < NT/2; i*=2 ) { for (j = 0; j < NT-i; j+=2*i) { Task_Rdc: A[k][j] += A[k][j+i]; } } } Loop Canonicalization for (k=0; k<NT; k++) { for (ii = log(1); ii < log(NT/2); ii++) { i = 2**ii for (jj = 0; jj < (NT-i)/(2*i); jj++) { j = jj*2*i; Task_Rdc: A[k][j] += A[k][j+i]; } } }

  43. Handling Reduction for(k=0; k<NT; k++){ for (i = 1; i < NT/2; i*=2 ) { for (j = 0; j < NT-i; j+=2*i) { Task_Rdc: A[k][j] += A[k][j+i]; } } } Loop Canonicalization for (k=0; k<NT; k++) { for (ii = log(1); ii < log(NT/2); ii++) { i = 2**ii for (jj = 0; jj < (NT-i)/(2*i); jj++) { j = jj*2*i; Task_Rdc: A[k][j] += A[k][j+i]; } } }

  44. Handling Reduction for (k=0; k<NT; k++) { for (ii = log(1); ii < log(NT/2); ii++) { i = 2**ii for (jj = 0; jj < (NT-i)/(2*i); jj++) { j = jj*2*i; Task_Rdc: A[k][j] = A[k][j] + A[k][j+i]; } } }

  45. Handling Reduction for (k=0; k<NT; k++) { for (ii = log(1); ii < log(NT/2); ii++) { i = 2**ii for (jj = 0; jj < (NT-i)/(2*i); jj++) { j = jj*2*i; Task_Rdc: A[k][j] = A[k][j] + A[k][j+i]; } } } For an edge to exist it must be: j = j'+i' => jj' = (jj * 2**ii) / (2**ii') - 1/2 OR j = j'=> jj' = (jj * 2**ii) / (2**ii')

  46. Handling Reduction for (k=0; k<NT; k++) { for (ii = log(1); ii < log(NT/2); ii++) { i = 2**ii for (jj = 0; jj < (NT-i)/(2*i); jj++) { j = jj*2*i; Task_Rdc: A[k][j] = A[k][j] + A[k][j+i]; } } } For an edge to exist it must be: j = j'+i' => jj' = (jj * 2**ii) / (2**ii') - 1/2 OR j = j'=> jj' = (jj * 2**ii) / (2**ii') Hard to solve statically

  47. Handling Reduction for (k=0; k<NT; k++) { for (ii = log(1); ii < log(NT/2); ii++) { i = 2**ii for (jj = 0; jj < (NT-i)/(2*i); jj++) { j = jj*2*i; Task_Rdc: A[k][j] = A[k][j] + A[k][j+i]; } } } For an edge to exist it must be: j = j'+i' => jj' = (jj * 2**ii) / (2**ii') - 1/2 OR j = j'=> jj' = (jj * 2**ii) / (2**ii') But a given (source) task has fixed {ii, jj}

  48. Handling Reduction for (k=0; k<NT; k++) { for (ii = log(1); ii < log(NT/2); ii++) { i = 2**ii for (jj = 0; jj < (NT-i)/(2*i); jj++) { j = jj*2*i; Task_Rdc: A[k][j] = A[k][j] + A[k][j+i]; } } } constant jj' = (jj * 2**ii) / (2**ii') – 1/2 jj' = (jj * 2**ii) / (2**ii') But a given (source) task has fixed {ii, jj} constant

  49. Handling Reduction for (k=0; k<NT; k++) { for (ii = log(1); ii < log(NT/2); ii++) { i = 2**ii for (jj = 0; jj < (NT-i)/(2*i); jj++) { j = jj*2*i; Task_Rdc: A[k][j] = A[k][j] + A[k][j+i]; } } } jj' = C / (2**ii') – 1/2 jj' = C / (2**ii') But a given (source) task has fixed {ii, jj}

  50. Handling Reduction Finding a destination task means finding integers that satisfy either equation. Run-time upper bound for cost: log(NT/2) jj' = C / (2**ii') – 1/2 jj' = C / (2**ii')

More Related