Automatic Task Generation for DAGuE: A Dataflow Representation Tool for QR Factorization
This work presents the DAGuE system, an innovative tool designed to generate automatic task representations for dataflow applications, specifically focusing on QR factorization. The DAGuE compiler facilitates the transformation from serial code into a dataflow representation via QUARK-specific syntax. It includes an annotated example using various tasks like `Insert_Task` for task dependencies and memory management to enable maximum parallelism. The paper discusses the sequential code representation, task isolation, and synchronization edges essential for optimizing computational tasks within high-performance computing environments.
Automatic Task Generation for DAGuE: A Dataflow Representation Tool for QR Factorization
E N D
Presentation Transcript
Automatic task generation for DAGuE http://icl.utk.edu/dague George Bosilca, AurelienBouteiller, Anthony Danalis, Mathieu Faverge, Thomas Herault, Jack Dongarra
DAGuE Compiler Serial Code to Dataflow Representation
Input Format – Quark (PLASMA) for (k = 0; k < MT; k++) { Insert_Task( zgeqrt, A[k][k], INOUT, T[k][k], OUTPUT); for (m = k+1; m < MT; m++) { Insert_Task( ztsqrt, A[k][k], INOUT | REGION_D|REGION_U, A[m][k], INOUT | LOCALITY, T[m][k], OUTPUT); } for (n = k+1; n < NT; n++) { Insert_Task( zunmqr, A[k][k], INPUT | REGION_L, T[k][k], INPUT, A[k][m], INOUT); for (m = k+1; m < MT; m++) { Insert_Task( ztsmqr, A[k][n], INOUT, A[m][n], INOUT | LOCALITY, A[m][k], INPUT, T[m][k], INPUT); } } } • Sequential C code • Annotated through QUARK-specific syntax • Insert_Task • INOUT, OUTPUT, INPUT • REGION_L, REGION_U, REGION_D, … • LOCALITY • StarPU syntax in progress
DAGuE Compiler analysis steps for (k = 0; k < MT; k++) { Insert_Task( zgeqrt, A[k][k], INOUT, T[k][k], OUTPUT); for (m = k+1; m < MT; m++) { Insert_Task( ztsqrt, A[k][k], INOUT | REGION_D|REGION_U, A[m][k], INOUT | LOCALITY, T[m][k], OUTPUT); } for (n = k+1; n < NT; n++) { Insert_Task( zunmqr, A[k][k], INPUT | REGION_L, T[k][k], INPUT, A[k][m], INOUT); for (m = k+1; m < MT; m++) { Insert_Task( ztsmqr, A[k][n], INOUT, A[m][n], INOUT | LOCALITY, A[m][k], INPUT, T[m][k], INPUT); } } } • Record all USEs • Record all DEFinitions • Formulate as Omega Relations: • all true (flow) dependencies • all output dependencies • Compute the differences • Formulate all anti-dependencies • Finalize synchronization edges
Traditional Compiler (Control Flow) for(k… for (k = 0; k < MT; k++) { Insert_Task( zgeqrt, A[k][k], INOUT, T[k][k], OUTPUT); for (m = k+1; m < MT; m++) { Insert_Task( ztsqrt, A[k][k], INOUT | REGION_D|REGION_U, A[m][k], INOUT | LOCALITY, T[m][k], OUTPUT); } for (n = k+1; n < NT; n++) { Insert_Task( zunmqr, A[k][k], INPUT | REGION_L, T[k][k], INPUT, A[k][m], INOUT); for (m = k+1; m < MT; m++) { Insert_Task( ztsmqr, A[k][n], INOUT, A[m][n], INOUT | LOCALITY, A[m][k], INPUT, T[m][k], INPUT); } } } for(m… Control Flow Graph for(n… for(m…
Data Flow imposes ordering for (k = 0; k < MT; k++) { Insert_Task( zgeqrt, A[k][k], INOUT, T[k][k], OUTPUT); for (m = k+1; m < MT; m++) { Insert_Task( ztsqrt, A[k][k], INOUT | REGION_D|REGION_U, A[m][k], INOUT | LOCALITY, T[m][k], OUTPUT); } for (n = k+1; n < NT; n++) { Insert_Task( zunmqr, A[k][k], INPUT | REGION_L, T[k][k], INPUT, A[k][m], INOUT); for (m = k+1; m < MT; m++) { Insert_Task( ztsmqr, A[k][n], INOUT, A[m][n], INOUT | LOCALITY, A[m][k], INPUT, T[m][k], INPUT); } } }
Dataflow Analysis MEM Incoming Data • Example on task DGEQRT of QR Outgoing Data k=0 for k = 0 .. N-1 A[k][k], T[k][k] < - GEQRT( A[k][k] ) for m = k+1 .. N-1 A[k][k] | U, A[m][k], T[m][k] < - TSQRT( A[k][k] | U, A[m][k], T[m][k]) for n = k+1 .. N-1 A[k][n] < - UNMQR( A[k][k] | L, T[k][k], A[k][n] ) for m = k+1 .. N-1 A[k][n], A[m][n] < - TSMQR( A[m][k], T[m][k], A[k][n], A[m][n] ) n=k+1 m=k+1
Dataflow Analysis MEM Incoming Data • Example on task DGEQRT of QR • Polyhedral Analysis through Omega • Compute algebraic expressions for: • Source and destination tasks • Necessary conditions for that data flow to exist k=SIZE-1 Outgoing Data k=0 for k = 0 .. N-1 A[k][k], T[k][k] < - GEQRT( A[k][k] ) for m = k+1 .. N-1 A[k][k] | U, A[m][k], T[m][k] < - TSQRT( A[k][k] | U, A[m][k], T[m][k]) for n = k+1 .. N-1 A[k][n] < - UNMQR( A[k][k] | L, T[k][k], A[k][n] ) for m = k+1 .. N-1 A[k][n], A[m][n] < - TSMQR( A[m][k], T[m][k], A[k][n], A[m][n] ) U L n=k+1 m=k+1
Intermediate Representation: Job Data Flow GEQRT(k) /* Execution space */ k = 0..( MT < NT ) ? MT-1 : NT-1 ) /* Locality */ : A(k, k) RWA <- (k == 0) ? A(k, k) : A1TSMQR(k-1, k, k) -> (k < NT-1) ? AUNMQR(k, k+1 .. NT-1) [type = LOWER] -> (k < MT-1) ? A1TSQRT(k, k+1) [type = UPPER] -> (k == MT-1) ? A(k, k) [type = UPPER] WRITET <- T(k, k) -> T(k, k) -> (k < NT-1) ? TUNMQR(k, k+1 .. NT-1) /* Priority */ ;(NT-k)*(NT-k)*(NT-k) BODY zgeqrt( A, T ) END Control flow is eliminated, therefore maximum parallelism is possible
Intermediate Representation: Job Data Flow GEQRT(k) /* Execution space */ k = 0..( MT < NT ) ? MT-1 : NT-1 ) /* Locality */ : A(k, k) RWA <- (k == 0) ? A(k, k) : A1TSMQR(k-1, k, k) -> (k < NT-1) ? AUNMQR(k, k+1 .. NT-1) [type = LOWER] -> (k < MT-1) ? A1TSQRT(k, k+1) [type = UPPER] -> (k == MT-1) ? A(k, k) [type = UPPER] WRITET <- T(k, k) -> T(k, k) -> (k < NT-1) ? TUNMQR(k, k+1 .. NT-1) /* Priority */ ;(NT-k)*(NT-k)*(NT-k) BODY zgeqrt( A, T ) END Control flow is eliminated, therefore maximum parallelism is possible
Dataflow Analysis USE for k = 0 .. N-1 { A[k][k], T[k][k] < - GEQRT( A[k][k] ) for m = k+1 .. N-1 { A[k][k] | U, A[m][k], T[m][k] < - TSQRT( A[k][k] | U, A[m][k], T[m][k]) } for n = k+1 .. N-1 { A[k][n] < - UNMQR( A[k][k] | L, T[k][k], A[k][n] ) for m = k+1 .. N-1 { A[k][n], A[m][n] < - TSMQR( A[m][k], T[m][k], A[k][n], A[m][n] ) } } } A[k’][k’] : 0 <= k’ <N-1 DEF A[m][n] : k+1<= m < N-1 k+1<= n < N-1 0 <= k < N-1
Dataflow Analysis USE for k = 0 .. N-1 { A[k][k], T[k][k] < - GEQRT( A[k][k] ) for m = k+1 .. N-1 { A[k][k] | U, A[m][k], T[m][k] < - TSQRT( A[k][k] | U, A[m][k], T[m][k]) } for n = k+1 .. N-1 { A[k][n] < - UNMQR( A[k][k] | L, T[k][k], A[k][n] ) for m = k+1 .. N-1 { A[k][n], A[m][n] < - TSMQR( A[m][k], T[m][k], A[k][n], A[m][n] ) } } } A[k’][k’] : 0 <= k’ <N-1 Ctrl Flow k’ > k DEF A[m][n] : k+1<= m < N-1 k+1<= n < N-1 0 <= k < N-1
Dataflow Analysis for k = 0 .. N-1 { A[k][k], T[k][k] < - GEQRT( A[k][k] ) for m = k+1 .. N-1 { A[k][k] | U, A[m][k], T[m][k] < - TSQRT( A[k][k] | U, A[m][k], T[m][k]) } for n = k+1 .. N-1 { A[k][n] < - UNMQR( A[k][k] | L, T[k][k], A[k][n] ) for m = k+1 .. N-1 { A[k][n], A[m][n] < - TSMQR( A[m][k], T[m][k], A[k][n], A[m][n] ) } } } Flow Dependency (RAW) Relation [k,m,n] -> [k’] : k+1<= m < N-1 k+1<= n < N-1 0 <= k < N-1 0 <= k’ < N-1 k < k’ m = k’ n = k’
Dataflow Analysis for k = 0 .. N-1 { A[k][k], T[k][k] < - GEQRT( A[k][k] ) for m = k+1 .. N-1 { A[k][k] | U, A[m][k], T[m][k] < - TSQRT( A[k][k] | U, A[m][k], T[m][k]) } for n = k+1 .. N-1 { A[k][n] < - UNMQR( A[k][k] | L, T[k][k], A[k][n] ) for m = k+1 .. N-1 { A[k][n], A[m][n] < - TSMQR( A[m][k], T[m][k], A[k][n], A[m][n] ) } } } Omega Simplified {[k,m,m] -> [m] : k+1, 0 <= m < N}
Dataflow Analysis for k = 0 .. N-1 { A[k][k], T[k][k] < - GEQRT( A[k][k] ) for m = k+1 .. N-1 { A[k][k] | U, A[m][k], T[m][k] < - TSQRT( A[k][k] | U, A[m][k], T[m][k]) } for n = k+1 .. N-1 { A[k][n] < - UNMQR( A[k][k] | L, T[k][k], A[k][n] ) for m = k+1 .. N-1 { A[k][n], A[m][n] < - TSMQR( A[m][k], T[m][k], A[k][n], A[m][n] ) } } } Omega Simplified {[k,m,m] -> [m] : k+1, 0 <= m < N} Output Dependency (WAW)
Dataflow Analysis for k = 0 .. N-1 { A[k][k], T[k][k] < - GEQRT( A[k][k] ) for m = k+1 .. N-1 { A[k][k] | U, A[m][k], T[m][k] < - TSQRT( A[k][k] | U, A[m][k], T[m][k]) } for n = k+1 .. N-1 { A[k][n] < - UNMQR( A[k][k] | L, T[k][k], A[k][n] ) for m = k+1 .. N-1 { A[k][n], A[m][n] < - TSMQR( A[m][k], T[m][k], A[k][n], A[m][n] ) } } } Real Edge: Flow - Output {[k,k+1,k+1] -> [k+1] : 0<= k <= N-2} n=k+1 m=k+1
Dataflow Analysis for k = 0 .. N-1 { A[k][k], T[k][k] < - GEQRT( A[k][k] ) for m = k+1 .. N-1 { A[k][k] | U, A[m][k], T[m][k] < - TSQRT( A[k][k] | U, A[m][k], T[m][k]) } for n = k+1 .. N-1 { A[k][n] < - UNMQR( A[k][k] | L, T[k][k], A[k][n] ) for m = k+1 .. N-1 { A[k][n], A[m][n] < - TSMQR( A[m][k], T[m][k], A[k][n], A[m][n] ) } } } Real Edge: Flow - Output {[k,k+1,k+1] -> [k+1] : 0<= k <= N-2} GEQRT’s incoming edge {[k-1,k, k] -> [k] : 0 < k <= N-1} n=k+1 m=k+1
Dataflow Analysis for k = 0 .. N-1 { A[k][k], T[k][k] < - GEQRT( A[k][k] ) for m = k+1 .. N-1 { A[k][k] | U, A[m][k], T[m][k] < - TSQRT( A[k][k] | U, A[m][k], T[m][k]) } for n = k+1 .. N-1 { A[k][n] < - UNMQR( A[k][k] | L, T[k][k], A[k][n] ) for m = k+1 .. N-1 { A[k][n], A[m][n] < - TSMQR( A[m][k], T[m][k], A[k][n], A[m][n] ) } } } Real Edge: Flow - Output {[k,k+1,k+1] -> [k+1] : 0<= k <= N-2} GEQRT’s incoming edge {[k-1,k, k] -> [k] : 0 < k <= N-1} (k>0) ? TSMQR(k-1,k,k) n=k+1 m=k+1
Dataflow Analysis for k = 0 .. N-1 { A[k][k], T[k][k] < - GEQRT( A[k][k] ) for m = k+1 .. N-1 { A[k][k] | U, A[m][k], T[m][k] < - TSQRT( A[k][k] | U, A[m][k], T[m][k]) } for n = k+1 .. N-1 { A[k][n] < - UNMQR( A[k][k] | L, T[k][k], A[k][n] ) for m = k+1 .. N-1 { A[k][n], A[m][n] < - TSMQR( A[m][k], T[m][k], A[k][n], A[m][n] ) } } } GEQRT - > UNMQR {[k] -> [k, n] : 0 <= k <= N-2 && k+1 <= n <=N-1 }
Dataflow Analysis for k = 0 .. N-1 { A[k][k], T[k][k] < - GEQRT( A[k][k] ) for m = k+1 .. N-1 { A[k][k] | U, A[m][k], T[m][k] < - TSQRT( A[k][k] | U, A[m][k], T[m][k]) } for n = k+1 .. N-1 { A[k][n] < - UNMQR( A[k][k] | L, T[k][k], A[k][n] ) for m = k+1 .. N-1 { A[k][n], A[m][n] < - TSMQR( A[m][k], T[m][k], A[k][n], A[m][n] ) } } } GEQRT - > UNMQR {[k] -> [k, n] : 0 <= k <= N-2 && k+1 <= n <=N-1 } -> (k<N-1) ? UNMQR(k, k+1..N-1)
Anti-dependencies • In theory, anti-deps do not matter in distributed memory • But real machines are distributed/shared memory hybrids • Anti-deps must create synchronization edges • Overestimating anti-deps is safe (albeit slow) • Output deps should be treated the same … in theory
Anti-dependencies in QR? for k = 0 .. N-1 { A[k][k], T[k][k] < - GEQRT( A[k][k] ) for m = k+1 .. N-1 { A[k][k] | U, A[m][k], T[m][k] < - TSQRT( A[k][k] | U, A[m][k], T[m][k]) } for n = k+1 .. N-1 { A[k][n] < - UNMQR( A[k][k] | L, T[k][k], A[k][n] ) for m = k+1 .. N-1 { A[k][n], A[m][n] < - TSMQR( A[m][k], T[m][k], A[k][n], A[m][n] ) } } }
Anti-dependencies in QR? for k = 0 .. N-1 { A[k][k], T[k][k] < - GEQRT( A[k][k] ) for m = k+1 .. N-1 { A[k][k] | U, A[m][k], T[m][k] < - TSQRT( A[k][k] | U, A[m][k], T[m][k]) } for n = k+1 .. N-1 { A[k][n] < - UNMQR( A[k][k] | L, T[k][k], A[k][n] ) for m = k+1 .. N-1 { A[k][n], A[m][n] < - TSMQR( A[m][k], T[m][k], A[k][n], A[m][n] ) } } } n=k+1 m=k+1
Anti-dependencies in QR? for k = 0 .. N-1 { A[k][k], T[k][k] < - GEQRT( A[k][k] ) for m = k+1 .. N-1 { A[k][k] | U, A[m][k], T[m][k] < - TSQRT( A[k][k] | U, A[m][k], T[m][k]) } for n = k+1 .. N-1 { A[k][n] < - UNMQR( A[k][k] | L, T[k][k], A[k][n] ) for m = k+1 .. N-1 { A[k][n], A[m][n] < - TSMQR( A[m][k], T[m][k], A[k][n], A[m][n] ) } } } TSMQR - > GEQRT {[k, m, n] -> [k‘] : n=m=k‘}
Anti-dependencies in QR? for k = 0 .. N-1 { A[k][k], T[k][k] < - GEQRT( A[k][k] ) for m = k+1 .. N-1 { A[k][k] | U, A[m][k], T[m][k] < - TSQRT( A[k][k] | U, A[m][k], T[m][k]) } for n = k+1 .. N-1 { A[k][n] < - UNMQR( A[k][k] | L, T[k][k], A[k][n] ) for m = k+1 .. N-1 { A[k][n], A[m][n] < - TSMQR( A[m][k], T[m][k], A[k][n], A[m][n] ) } } } TSMQR - > GEQRT {[k, m, n] -> [k‘] : n=m=k‘} n=k+1 m=k+1
TSMQR(k,m,n) k = 0..((mt < nt) ? mt-1:nt-1 ) m = k+1..mt-1 n = k+1..nt-1 GEQRT(k) k = 0..((mt<nt) ? mt-1:nt-1 ) {[k,m,n]->[n] : k+1==n && k+1==m} {[k,m,n]->[k+1,m,n]: n>1+k && m>1+k} {[k,m,n]->[k,m+1,n]: m<mt-1} {[k,m,n]->[k+1,n] : m==k+1 && n>m} {[k]->[k,k+1] : mt >= (k+2)} {[k]->[k,n] : k < n < nt && k < nt-1} {[k,m]->[k,m,n] : k<nt-1 && k<n< nt {[k,m,n]->[n,m] : n==(k+1) && m>n} {[k,n]->[k,k+1,n]: k < mt-1} {[k,m]->[k,m+1]: m<mt-1} UNMQR(k,n) k = 0..(( mt < nt ) ? mt-1:nt-1) n = k+1..nt-1 TSQRT(k,m) k = 0..((mt < nt) ? mt-1:nt-1 ) m = k+1..mt-1
TSMQR(k,m,n) k = 0..((mt < nt) ? mt-1:nt-1 ) m = k+1..mt-1 n = k+1..nt-1 GEQRT(k) k = 0..((mt<nt) ? mt-1:nt-1 ) {[k,m,n]->[n] : k+1==n && k+1==m} anti-dep: {[k,m,n] -> [k‘] : n=m=k‘} {[k,m,n]->[k+1,m,n]: n>1+k && m>1+k} {[k,m,n]->[k,m+1,n]: m<mt-1} {[k,m,n]->[k+1,n] : m==k+1 && n>m} {[k]->[k,k+1] : mt >= (k+2)} {[k]->[k,n] : k < n < nt && k < nt-1} {[k,m]->[k,m,n] : k<nt-1 && k<n< nt {[k,m,n]->[n,m] : n==(k+1) && m>n} {[k,n]->[k,k+1,n]: k < mt-1} {[k,m]->[k,m+1]: m<mt-1} UNMQR(k,n) k = 0..(( mt < nt ) ? mt-1:nt-1) n = k+1..nt-1 TSQRT(k,m) k = 0..((mt < nt) ? mt-1:nt-1 ) m = k+1..mt-1
TSMQR(k,m,n) k = 0..((mt < nt) ? mt-1:nt-1 ) m = k+1..mt-1 n = k+1..nt-1 GEQRT(k) k = 0..((mt<nt) ? mt-1:nt-1 ) {[k,m,n]->[n] : k+1==n && k+1==m} anti-dep: {[k,m,n] -> [k‘] : n=m=k‘} {[k,m,n]->[k+1,m,n]: n>1+k && m>1+k} {[k,m,n]->[k,m+1,n]: m<mt-1} {[k,m,n]->[k+1,n] : m==k+1 && n>m} {[k]->[k,k+1] : mt >= (k+2)} {[k]->[k,n] : k < n < nt && k < nt-1} {[k,m]->[k,m,n] : k<nt-1 && k<n< nt {[k,m,n]->[n,m] : n==(k+1) && m>n} {[k,n]->[k,k+1,n]: k < mt-1} {[k,m]->[k,m+1]: m<mt-1} UNMQR(k,n) k = 0..(( mt < nt ) ? mt-1:nt-1) n = k+1..nt-1 TSQRT(k,m) k = 0..((mt < nt) ? mt-1:nt-1 ) m = k+1..mt-1
TSMQR(k,m,n) k = 0..((mt < nt) ? mt-1:nt-1 ) m = k+1..mt-1 n = k+1..nt-1 GEQRT(k) k = 0..((mt<nt) ? mt-1:nt-1 ) {[k,m,n]->[n] : k+1==n && k+1==m} {[k,m,n]->[k+1,m,n]: n>1+k && m>1+k} {[k,m,n]->[k,m+1,n]: m<mt-1} {[k,m,n]->[k+1,n] : m==k+1 && n>m} {[k]->[k,k+1] : mt >= (k+2)} {[k]->[k,n] : k < n < nt && k < nt-1} {[k,m]->[k,m,n] : k<nt-1 && k<n< nt {[k,m,n]->[n,m] : n==(k+1) && m>n} {[k,n]->[k,k+1,n]: k < mt-1} {[k,m]->[k,m+1]: m<mt-1} UNMQR(k,n) k = 0..(( mt < nt ) ? mt-1:nt-1) n = k+1..nt-1 TSQRT(k,m) k = 0..((mt < nt) ? mt-1:nt-1 ) m = k+1..mt-1
TSMQR(k,m,n) k = 0..((mt < nt) ? mt-1:nt-1 ) m = k+1..mt-1 n = k+1..nt-1 GEQRT(k) k = 0..((mt<nt) ? mt-1:nt-1 ) {[k,m,n]->[n] : k+1==n && k+1==m} {[k,m,n]->[k+1,m,n]: n>1+k && m>1+k} {[k,m,n]->[k,m+1,n]: m<mt-1} {[k,m,n]->[k+1,n] : m==k+1 && n>m} {[k]->[k,k+1] : mt >= (k+2)} {[k]->[k,n] : k < n < nt && k < nt-1} {[k,m]->[k,m,n] : k<nt-1 && k<n< nt {[k,m,n]->[n,m] : n==(k+1) && m>n} {[k,n]->[k,k+1,n]: k < mt-1} {[k,m]->[k,m+1]: m<mt-1} UNMQR(k,n) k = 0..(( mt < nt ) ? mt-1:nt-1) n = k+1..nt-1 TSQRT(k,m) k = 0..((mt < nt) ? mt-1:nt-1 ) m = k+1..mt-1
FinalizingAnti-deps Transitive Closure is undecidable!
Current/Future Work Can we address non-affine codes?
Example: Reduction Operation • Reduction: apply a user defined operator on each data and store the result in a single location. (Suppose the operator is associative and commutative)
Example: Reduction Operation • Reduction: apply a user defined operator on each data and store the result in a single location. (Suppose the operator is associative and commutative) for(s = 1; s < N/2; s = 2*s) for(i = 0; i < N-s; i+= 2*s) V[i] = op(V[i], V[i+s]) Issue: Non-affine loops lead to non-polyhedral array accessing
Example: Reduction Operation 0 reduce(l, p) : V(p) l = 1 .. depth+1 p = 0 .. (MT / (1<<l)) RW A <- (1 == l) ? V(2*p) : Areduce( l-1, 2*p ) -> ((depth+1) == l) ? V(0) -> (0 == (p%2))? Areduce(l+1, p/2) : Breduce(l+1, p/2) READB <- ((p*(1<<l) + (1<<(l-1))) > MT) ? V(0) <- (1 == l) ? V(2*p+1) <- (1 != l) ? Areduce( l-1, p*2+1 ) BODY operator(A, B); END 1 2 3 Current Solution: Hand-writing of the data dependency using the intermediate Data Flow representation
Handling Reduction for(k=0; k<NT; k++){ for (i = 1; i < NT/2; i*=2 ) { for (j = 0; j < NT-i; j+=2*i) { Task_Rdc: A[k][j] += A[k][j+i]; } } }
Handling Reduction for(k=0; k<NT; k++){ for (i = 1; i < NT/2; i*=2 ) { for (j = 0; j < NT-i; j+=2*i) { Task_Rdc: A[k][j] += A[k][j+i]; } } } Loop Canonicalization for (k=0; k<NT; k++) { for (ii = log(1); ii < log(NT/2); ii++) { i = 2**ii for (jj = 0; jj < (NT-i)/(2*i); jj++) { j = jj*2*i; Task_Rdc: A[k][j] += A[k][j+i]; } } }
Handling Reduction for(k=0; k<NT; k++){ for (i = 1; i < NT/2; i*=2 ) { for (j = 0; j < NT-i; j+=2*i) { Task_Rdc: A[k][j] += A[k][j+i]; } } } Loop Canonicalization for (k=0; k<NT; k++) { for (ii = log(1); ii < log(NT/2); ii++) { i = 2**ii for (jj = 0; jj < (NT-i)/(2*i); jj++) { j = jj*2*i; Task_Rdc: A[k][j] += A[k][j+i]; } } }
Handling Reduction for (k=0; k<NT; k++) { for (ii = log(1); ii < log(NT/2); ii++) { i = 2**ii for (jj = 0; jj < (NT-i)/(2*i); jj++) { j = jj*2*i; Task_Rdc: A[k][j] = A[k][j] + A[k][j+i]; } } }
Handling Reduction for (k=0; k<NT; k++) { for (ii = log(1); ii < log(NT/2); ii++) { i = 2**ii for (jj = 0; jj < (NT-i)/(2*i); jj++) { j = jj*2*i; Task_Rdc: A[k][j] = A[k][j] + A[k][j+i]; } } } For an edge to exist it must be: j = j'+i' => jj' = (jj * 2**ii) / (2**ii') - 1/2 OR j = j'=> jj' = (jj * 2**ii) / (2**ii')
Handling Reduction for (k=0; k<NT; k++) { for (ii = log(1); ii < log(NT/2); ii++) { i = 2**ii for (jj = 0; jj < (NT-i)/(2*i); jj++) { j = jj*2*i; Task_Rdc: A[k][j] = A[k][j] + A[k][j+i]; } } } For an edge to exist it must be: j = j'+i' => jj' = (jj * 2**ii) / (2**ii') - 1/2 OR j = j'=> jj' = (jj * 2**ii) / (2**ii') Hard to solve statically
Handling Reduction for (k=0; k<NT; k++) { for (ii = log(1); ii < log(NT/2); ii++) { i = 2**ii for (jj = 0; jj < (NT-i)/(2*i); jj++) { j = jj*2*i; Task_Rdc: A[k][j] = A[k][j] + A[k][j+i]; } } } For an edge to exist it must be: j = j'+i' => jj' = (jj * 2**ii) / (2**ii') - 1/2 OR j = j'=> jj' = (jj * 2**ii) / (2**ii') But a given (source) task has fixed {ii, jj}
Handling Reduction for (k=0; k<NT; k++) { for (ii = log(1); ii < log(NT/2); ii++) { i = 2**ii for (jj = 0; jj < (NT-i)/(2*i); jj++) { j = jj*2*i; Task_Rdc: A[k][j] = A[k][j] + A[k][j+i]; } } } constant jj' = (jj * 2**ii) / (2**ii') – 1/2 jj' = (jj * 2**ii) / (2**ii') But a given (source) task has fixed {ii, jj} constant
Handling Reduction for (k=0; k<NT; k++) { for (ii = log(1); ii < log(NT/2); ii++) { i = 2**ii for (jj = 0; jj < (NT-i)/(2*i); jj++) { j = jj*2*i; Task_Rdc: A[k][j] = A[k][j] + A[k][j+i]; } } } jj' = C / (2**ii') – 1/2 jj' = C / (2**ii') But a given (source) task has fixed {ii, jj}
Handling Reduction Finding a destination task means finding integers that satisfy either equation. Run-time upper bound for cost: log(NT/2) jj' = C / (2**ii') – 1/2 jj' = C / (2**ii')