
Parametric Tiling Revisited

Muthu Baskaran (1), Albert Hartono (1), Thomas Henretty (1), Sanket Tavarageri (1), J. Ramanujam (2), P. Sadayappan (1). (1) Ohio State University; (2) Louisiana State University.


Presentation Transcript


  1. Parametric Tiling Revisited
     Muthu Baskaran (1), Albert Hartono (1), Thomas Henretty (1), Sanket Tavarageri (1), J. Ramanujam (2), P. Sadayappan (1)
     (1) Ohio State University; (2) Louisiana State University

  2. Loop Tiling
     • A key loop transformation for:
       • Efficient coarse-grained parallel execution
       • Data locality optimization
     Original loops:
         for (i=1; i<=7; i++)
           for (j=1; j<=6; j++)
             S(i,j);
     Tiled loops:
         /* Inter-tile loops */
         for (it=1; it<=7; it+=Ti)
           for (jt=1; jt<=6; jt+=Tj)
             /* Intra-tile loops */
             for (i=it; i<=min(7,it+Ti-1); i++)
               for (j=jt; j<=min(6,jt+Tj-1); j++)
                 S(i,j);
     [Figure: the (i, j) iteration space before and after tiling]

  3. Parametric Tiling
     Original loops:
         for (i=1; i<=N; i++)
           for (j=1; j<=N; j++)
             S(i,j);
     Tiled loops (loop i tiled with tile size Ti, loop j tiled with tile size Tj):
         for (it=1; it<=N; it+=Ti)
           for (jt=1; jt<=N; jt+=Tj)
             for (i=it; i<=min(N,it+Ti-1); i++)
               for (j=jt; j<=min(N,jt+Tj-1); j++)
                 S(i,j);
     • Parametric tile sizes
       • Not fixed at compile time
       • Runtime parameters
     • Valuable for:
       • Fast iterative compilation
       • Automatic empirical tuning systems (e.g., ATLAS)
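As a rough illustration of the slide above, the following C sketch runs the original and the tiled nest with tile sizes supplied at run time and compares how many statement instances each executes. The function names `count_rect_original` and `count_rect_tiled` are hypothetical, and the counter simply stands in for `S(i,j)`.

```c
#include <assert.h>

static int min_int(int a, int b) { return a < b ? a : b; }

/* Number of statement instances in the original N x N nest. */
long count_rect_original(int N) {
    long count = 0;
    for (int i = 1; i <= N; i++)
        for (int j = 1; j <= N; j++)
            count++;                      /* stands in for S(i,j) */
    return count;
}

/* Same nest, tiled with tile sizes Ti and Tj that are ordinary
   run-time arguments rather than compile-time constants. */
long count_rect_tiled(int N, int Ti, int Tj) {
    assert(Ti > 0 && Tj > 0);
    long count = 0;
    for (int it = 1; it <= N; it += Ti)               /* inter-tile loops */
        for (int jt = 1; jt <= N; jt += Tj)
            for (int i = it; i <= min_int(N, it + Ti - 1); i++)   /* intra-tile loops */
                for (int j = jt; j <= min_int(N, jt + Tj - 1); j++)
                    count++;              /* stands in for S(i,j) */
    return count;
}
```

Because the tile sizes are plain arguments, an empirical tuner could call `count_rect_tiled` (or a real kernel of the same shape) with many `(Ti, Tj)` pairs without recompiling.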

  4. Why Revisit Parametric Tiling?
     • Existing tools for tiling transformation are inadequate
     • TLOG and HiTLOG
       • Handle only perfectly nested loops
       • Tile sizes can be runtime parameters
     • Pluto
       • Handles imperfectly nested loops
       • Tile sizes must be fixed at compile time
       • Addresses parallelism
     • Sequential PrimeTile
       • Handles imperfectly nested loops
       • Tile sizes can be runtime parameters
       • Does not address parallelism

  5. Goal
     • Develop a system that combines all the positive features of existing tools
       • Handle imperfectly nested loops
       • Allow tile sizes to be runtime parameters
       • Address parallelism
       • Support multi-level tiling

  6. Unaligned Tiles vs Aligned Tiles
     [Figure: two tiled iteration spaces over axes i and j, labelled "Previous work (PrimeTile)" and "This work"]

  7. Loop Generation
     • Representation of statement domains
       • Set of affine inequalities
       • Statement domain S, with:
         • $v_1, v_2, \ldots, v_n$ are loop variables ($v_1$ outermost and $v_n$ innermost)
         • $p_1, p_2, \ldots, p_k$ are program parameters
     • Bounds of $v_i$, $r \le i \le n$, $r \ge 1$:
       $$
       \max\bigl(f_1(v_1, \ldots, v_{r-1}, p_1, \ldots, p_k, c),\ \ldots,\ f_t(v_1, \ldots, v_{r-1}, p_1, \ldots, p_k, c)\bigr)
       \;\le\; v_i \;\le\;
       \min\bigl(g_1(v_1, \ldots, v_{r-1}, p_1, \ldots, p_k, c),\ \ldots,\ g_s(v_1, \ldots, v_{r-1}, p_1, \ldots, p_k, c)\bigr)
       $$
     • Bounds depend on outer loop variables and parameters (row echelon form)

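  8. Loop Generation (cont.)
     $$
     \left[\, B \mid P \mid C \,\right]
     \begin{pmatrix} v \\ p \\ 1 \end{pmatrix} \ge 0,
     \qquad v = (v_1, \ldots, v_n)^T,\quad p = (p_1, \ldots, p_k)^T
     $$
     where
     $$
     B = \begin{pmatrix}
     B_{11} & 0 & 0 & \cdots & 0 \\
     B_{21} & B_{22} & 0 & \cdots & 0 \\
     \vdots & \vdots & & \ddots & \vdots \\
     B_{n1} & B_{n2} & \cdots & & B_{nn}
     \end{pmatrix},\quad
     P = \begin{pmatrix}
     P_{11} & P_{12} & \cdots & P_{1k} \\
     P_{21} & P_{22} & \cdots & P_{2k} \\
     \vdots & \vdots & & \vdots \\
     P_{n1} & P_{n2} & \cdots & P_{nk}
     \end{pmatrix},\quad
     C = \begin{pmatrix} c_1 \\ c_2 \\ \vdots \\ c_n \end{pmatrix}
     $$
     Row echelon form: suitable for generating loop code to scan the iteration points represented by the system.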

  9. Parametric Sequential Tiling
     • Tiling transformation
       • Express each variable $v_j$ in terms of inter-tile (tile) coordinates $t_j$, intra-tile coordinates $u_j$ and tile sizes $s_j$
       • $v_j = s_j \cdot t_j + u_j$ and $0 \le u_j \le s_j - 1$
     • S':
       $$
       \left[\begin{array}{c|c|c|c|c}
       B{\cdot}s & B & P & 0 & C \\
       0 & I & 0 & 0 & 0 \\
       0 & -I & 0 & I & -1
       \end{array}\right]
       \begin{pmatrix} t \\ u \\ p \\ s \\ 1 \end{pmatrix} \ge 0
       $$
     • S' is equivalent to S
     • Not in row echelon form for t, but in row echelon form for u
     • I: identity matrix
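The decomposition $v_j = s_j \cdot t_j + u_j$ with $0 \le u_j \le s_j - 1$ can be sketched in C as below. This is only an illustration: `TileCoord`, `split_coord` and `join_coord` are hypothetical names, and the adjustment for negative `v` is needed because C integer division truncates toward zero instead of rounding down.

```c
#include <assert.h>

/* Tile coordinate t and intra-tile offset u of a loop index. */
typedef struct {
    int t;   /* inter-tile (tile) coordinate */
    int u;   /* intra-tile coordinate, 0 <= u <= s-1 */
} TileCoord;

/* Split v into (t, u) such that v == s*t + u and 0 <= u <= s-1. */
TileCoord split_coord(int v, int s) {
    assert(s > 0);
    TileCoord c;
    c.t = v / s;
    c.u = v % s;
    if (c.u < 0) {       /* C truncates toward zero; shift to floor semantics */
        c.u += s;
        c.t -= 1;
    }
    return c;
}

/* Recombine: v = s*t + u. */
int join_coord(TileCoord c, int s) {
    return s * c.t + c.u;
}
```

For example, with tile size 4 the index 13 lies in tile 3 at offset 1, and the index -3 lies in tile -1 at offset 1.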

  10. Parametric Sequential Tiling (cont.)
     • To derive a system in row echelon form for all variables
       • Create a system ST with only tile variables, program parameters and tile sizes (also parameters)
       • Relaxed projection to eliminate intra-tile variables $u_j$
       • In ST,
         $$
         B_{ij} \cdot u_j =
         \begin{cases}
         0 & \text{if } B_{ij} \le 0 \\
         B_{ij} \cdot (s_j - 1) & \text{if } B_{ij} > 0
         \end{cases}
         $$
       • All solutions to S' also satisfy ST
     • ST:
       $$
       \left[\begin{array}{c|c|c|c}
       B{\cdot}s & P & B^{+} & C'
       \end{array}\right]
       \begin{pmatrix} t \\ p \\ s \\ 1 \end{pmatrix} \ge 0
       $$
     • $B{\cdot}s$ has the same nonzero structure as $B$, hence ST is in row echelon form for t
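A minimal C sketch of the relaxed projection for a single affine constraint, under simplifying assumptions: the constraint is written as `sum_j b[j]*v[j] + c >= 0` with all parameter terms already folded into the constant `c`, and the helper names `relaxed_offset` and `tile_may_contain_points` are hypothetical. Each term `b[j]*u[j]` is replaced by its largest possible value, which is `b[j]*(s[j]-1)` when `b[j] > 0` and `0` otherwise, so every tile that contains a valid point still satisfies the relaxed constraint.

```c
#include <assert.h>

/* Largest value of sum_j b[j]*u[j] over 0 <= u[j] <= s[j]-1:
   positive coefficients contribute b[j]*(s[j]-1), the rest contribute 0. */
long relaxed_offset(const int *b, const int *s, int n) {
    long off = 0;
    for (int j = 0; j < n; j++) {
        assert(s[j] > 0);
        if (b[j] > 0)
            off += (long)b[j] * (s[j] - 1);
    }
    return off;
}

/* Relaxed constraint on tile coordinates t:
   sum_j b[j]*s[j]*t[j] + relaxed_offset + c >= 0.
   Returns 1 if the tile may contain points satisfying the original
   constraint, 0 if it certainly contains none. */
int tile_may_contain_points(const int *b, long c, const int *s,
                            const int *t, int n) {
    long lhs = c + relaxed_offset(b, s, n);
    for (int j = 0; j < n; j++)
        lhs += (long)b[j] * s[j] * t[j];
    return lhs >= 0;
}
```

For the lower bound `i - 5 >= 0` with tile size 4, tile 1 (covering 4..7) passes the test and tile 0 (covering 0..3) does not; for the upper bound `7 - i >= 0`, tile 1 passes and tile 2 (covering 8..11) does not.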

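  11. Parametric Sequential Tiling (cont.)
     • S': in row echelon form for u; used to generate intra-tile loops
       $$
       \left[\begin{array}{c|c|c|c|c}
       B{\cdot}s & B & P & 0 & C \\
       0 & I & 0 & 0 & 0 \\
       0 & -I & 0 & I & -1
       \end{array}\right]
       \begin{pmatrix} t \\ u \\ p \\ s \\ 1 \end{pmatrix} \ge 0
       $$
     • ST: in row echelon form for t; used to generate tile loops
       $$
       \left[\begin{array}{c|c|c|c}
       B{\cdot}s & P & B^{+} & C'
       \end{array}\right]
       \begin{pmatrix} t \\ p \\ s \\ 1 \end{pmatrix} \ge 0
       $$
     • ST | S': in row echelon form for t and u; used to generate tile loops and intra-tile loops
       $$
       \left[\begin{array}{c|c|c|c|c}
       B{\cdot}s & 0 & P & B^{+} & C' \\
       B{\cdot}s & B & P & 0 & C \\
       0 & I & 0 & 0 & 0 \\
       0 & -I & 0 & I & -1
       \end{array}\right]
       \begin{pmatrix} t \\ u \\ p \\ s \\ 1 \end{pmatrix} \ge 0
       $$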

  12. Parametric Multi-level Tiling
     • Approach similar to single-level tiling
     • To create a system in row echelon form
       • Multiple levels of relaxed projection
         • Eliminate intra-tile variables
         • Eliminate inner level tile variables
         • From the innermost level to the last but one outer level
     • Two-level tiling for illustration
       • Starting from the statement domain S, express loop variables $v_j$ as $v_j = s1_j \cdot t1_j + s2_j \cdot t2_j + u_j$, using:
         • Outer level tile variables: $t1_1, t1_2, \ldots, t1_n$
         • Inner level tile variables: $t2_1, t2_2, \ldots, t2_n$
         • Tile sizes at outer level: $s1_1, s1_2, \ldots, s1_n$
         • Tile sizes at inner level: $s2_1, s2_2, \ldots, s2_n$

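  13. Parametric Multi-level Tiling (cont.)
     • S': in row echelon form for u; used to generate intra-tile loops
       $$
       \left[\begin{array}{c|c|c|c|c|c}
       B{\cdot}s1 & B{\cdot}s2 & B & P & 0 & C \\
       0 & 0 & I & 0 & 0 & 0 \\
       0 & 0 & -I & 0 & I & -1
       \end{array}\right]
       \begin{pmatrix} t1 \\ t2 \\ u \\ p \\ s2 \\ 1 \end{pmatrix} \ge 0
       $$
     • S' -> S2T: intra-tile variables $u_j$ eliminated by relaxed projection
     • S2T: in row echelon form for t2; used to generate inner tile loops
       $$
       \left[\begin{array}{c|c|c|c|c}
       B{\cdot}s1 & B{\cdot}s2 & P & B^{+} & C'
       \end{array}\right]
       \begin{pmatrix} t1 \\ t2 \\ p \\ s2 \\ 1 \end{pmatrix} \ge 0
       $$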

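  14. Parametric Multi-level Tiling (cont.)
     • S2T -> S1T: inner-tile variables $t2_j$ eliminated
     • S1T: in row echelon form for t1; used to generate outer tile loops
       $$
       \left[\begin{array}{c|c|c|c|c}
       B{\cdot}s1 & P & 0 & B^{+}{\cdot}s2 & C''
       \end{array}\right]
       \begin{pmatrix} t1 \\ p \\ s2 \\ r1 \\ 1 \end{pmatrix} \ge 0
       $$
     • Note: r1 is a parameter (its defining expression on the slide is not preserved in this transcript)
     • S1T | S2T | S': in row echelon form for t1, t2 and u; used to generate tile loops and intra-tile loops
       $$
       \left[\begin{array}{c|c|c|c|c|c|c}
       B{\cdot}s1 & 0 & 0 & P & 0 & B^{+}{\cdot}s2 & C'' \\
       B{\cdot}s1 & B{\cdot}s2 & 0 & P & B^{+} & 0 & C' \\
       B{\cdot}s1 & B{\cdot}s2 & B & P & 0 & 0 & C \\
       0 & 0 & I & 0 & 0 & 0 & 0 \\
       0 & 0 & -I & 0 & I & 0 & -1
       \end{array}\right]
       \begin{pmatrix} t1 \\ t2 \\ u \\ p \\ s2 \\ r1 \\ 1 \end{pmatrix} \ge 0
       $$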

  15. Multi-level Tiling
         /* Original loops */
         for (i=M; i<=N; i++)
           for (j=-a*i+b; j<=c*i+d; j++)
             S(i,j);

         /* 1-level tiled loops */
         for (it=⌈(M-Ti+1)/Ti⌉; it<=⌊N/Ti⌋; it++)
           for (jt=⌈(-a*(it*Ti+Ti-1)+b-Tj+1)/Tj⌉; jt<=⌊(c*(it*Ti+Ti-1)+d)/Tj⌋; jt++)
             for (i=max(M, it*Ti); i<=min(N, it*Ti+Ti-1); i++)
               for (j=max(-a*i+b, jt*Tj); j<=min(c*i+d, jt*Tj+Tj-1); j++)
                 S(i,j);

         /* 2-level tiled loops */
         for (it2=⌈(M-T1i*⌈T2i/T1i⌉+1)/T2i⌉; it2<=⌊N/T2i⌋; it2++)
           for (jt2=⌈(-a*(it2*T2i+T1i*⌈T2i/T1i⌉-1)+b-T1j*⌈T2j/T1j⌉+1)/T2j⌉;
                jt2<=⌊(c*(it2*T2i+T1i*⌈T2i/T1i⌉-1)+d)/T2j⌋; jt2++)
             for (it1=max(0, ⌈(M-it2*T2i-T1i+1)/T1i⌉);
                  it1<=min(⌈T2i/T1i⌉-1, ⌊(N-it2*T2i)/T1i⌋); it1++)
               for (jt1=max(0, ⌈(-a*(it2*T2i+it1*T1i+T1i-1)+b-jt2*T2j-T1j+1)/T1j⌉);
                    jt1<=min(⌈T2j/T1j⌉-1, ⌊(c*(it2*T2i+it1*T1i+T1i-1)+d-jt2*T2j)/T1j⌋); jt1++)
                 for (i=max(M, it2*T2i+it1*T1i); i<=min(N, it2*T2i+it1*T1i+T1i-1); i++)
                   for (j=max(-a*i+b, jt2*T2j+jt1*T1j); j<=min(c*i+d, jt2*T2j+jt1*T1j+T1j-1); j++)
                     S(i,j);

  16. Parametric Tiling (Single Statement Domain)
         /* Original loops */
         for i=M, N
           for j=(a1*i+b1), (a2*i+b2)
             S(i,j)

         /* Inter-tile loops */
         for it=⌈(M-Ti+1)/Ti⌉, ⌊N/Ti⌋
           for jt=⌈(a1*it*Ti+b1-Tj+1)/Tj⌉, ⌊(a2*(it*Ti+Ti-1)+b2)/Tj⌋
             /* Intra-tile loops */
             for i=max(M, it*Ti), min(N, it*Ti+Ti-1)
               for j=max(a1*i+b1, jt*Tj), min(a2*i+b2, jt*Tj+Tj-1)
                 S(i,j)
     [Figure: statement domain in the (i, j) plane, bounded by i=M, i=N and the lines j=a1*i+b1 and j=a2*i+b2]
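The single-statement bounds above can be exercised with a small C sketch that counts statement instances in the original and the tiled nest. It assumes non-negative slopes `a1` and `a2`, which is the case the tile-loop bounds on the slide appear to be written for. The names `floord`, `ceild`, `count_trapezoid_original` and `count_trapezoid_tiled` are illustrative, not taken from the presentation.

```c
#include <assert.h>

/* Floor and ceiling of a/b for b > 0 (C division truncates toward zero). */
static int floord(int a, int b) { int q = a / b; return (a % b != 0 && a < 0) ? q - 1 : q; }
static int ceild(int a, int b)  { int q = a / b; return (a % b != 0 && a > 0) ? q + 1 : q; }
static int max_int(int a, int b) { return a > b ? a : b; }
static int min_int(int a, int b) { return a < b ? a : b; }

/* Original loops: i in [M, N], j in [a1*i+b1, a2*i+b2]. */
long count_trapezoid_original(int M, int N, int a1, int b1, int a2, int b2) {
    long count = 0;
    for (int i = M; i <= N; i++)
        for (int j = a1 * i + b1; j <= a2 * i + b2; j++)
            count++;                       /* stands in for S(i,j) */
    return count;
}

/* Tiled loops with run-time tile sizes Ti, Tj, following the slide's bounds. */
long count_trapezoid_tiled(int M, int N, int a1, int b1, int a2, int b2,
                           int Ti, int Tj) {
    assert(Ti > 0 && Tj > 0 && a1 >= 0 && a2 >= 0);
    long count = 0;
    /* Inter-tile loops: relaxed bounds, may include some empty tiles. */
    for (int it = ceild(M - Ti + 1, Ti); it <= floord(N, Ti); it++)
        for (int jt = ceild(a1 * it * Ti + b1 - Tj + 1, Tj);
             jt <= floord(a2 * (it * Ti + Ti - 1) + b2, Tj); jt++)
            /* Intra-tile loops: exact bounds clipped to the current tile. */
            for (int i = max_int(M, it * Ti); i <= min_int(N, it * Ti + Ti - 1); i++)
                for (int j = max_int(a1 * i + b1, jt * Tj);
                     j <= min_int(a2 * i + b2, jt * Tj + Tj - 1); j++)
                    count++;               /* stands in for S(i,j) */
    return count;
}
```

Empty tiles admitted by the relaxed inter-tile bounds cost only loop overhead: the intra-tile loops for such a tile have an empty range, so the counts of the two versions agree.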

  17. Parametric Tiling (Multi Statement Domains)
         /* Inter-tile loops */
         for it {
           for jt {
             /* Intra-tile loops */
             for i
               for j1
                 S1(i,j1)
               for j2
                 S2(i,j2)
               for j3
                 S3(i,j3)
           }
         }
     [Figure: statement domains S1, S2 and S3 in the (i, j) plane together with their convex hull]

  18. Wave-front Parallelism
     [Figure: tiled iteration space over axes i and j, and the transformed space over axes i' and j', with tile wavefronts k, k+1, k+2, k+3 and k+4 marked; the slide also shows a matrix equation relating (i, j) to (i', j') whose entries are not recoverable from this transcript]
     • After sequential tiling:
       • If no loop-carried dependences exist, then each tiling loop is directly parallelizable
       • If none of the tiling loops is parallel, then wave-front parallelization is always possible

  19. Parallel Non-parameterized Tiling
         /* Original loops */
         for (i=1; i<=N; i++)
           for (j=1; j<=N; j++)
             for (k=i; k<=N; k++)
               S(i,j,k);

         /* Sequential tiled loops (8x8x8 tile sizes) */
         for (it=⌈-6/8⌉; it<=⌊N/8⌋; it++)
           for (jt=⌈-6/8⌉; jt<=⌊N/8⌋; jt++)
             for (kt=⌈(it*8-7)/8⌉; kt<=⌊N/8⌋; kt++)
               // intra-tile loops i,j,k
     • Tiling (8x8x8 tile sizes)
     • Original loop constraints
     • Introduce new wavefront constraints (for loop kt)
     • Use Fourier-Motzkin elimination to derive new wavefront constraints (for loops w, it, jt)

  20. Parallel Non-parameterized Tiling (cont.)
         /* Parallel tiled loops */
         for (w=⌈-7/8⌉; w<=⌊3*N/8⌋; w++)  /* sequential */
           for (it=max(⌈-6/8⌉, ⌈(4*w-N)/4⌉); it<=min(⌊N/8⌋, ⌊(8*w+7)/16⌋); it++)  /* parallel */
             for (jt=max(⌈-6/8⌉, ⌈(8*w-8*it-N)/8⌉); jt<=min(⌊N/8⌋, ⌊(8*w-16*it+7)/8⌋); jt++)  /* parallel */
               for (kt=max(⌈(it*8-7)/8⌉, w-it-jt); kt<=min(⌊N/8⌋, w-it-jt); kt++)  /* one-trip-count */
                 // intra-tile loops i,j,k
     • This works when tile sizes are fixed
     • When tile sizes are parametric, Fourier-Motzkin elimination becomes intractable
       • The sign of a coefficient in the combined inequalities can be indeterminate, so it is impossible to determine whether the new inequality is a lower-bound or an upper-bound inequality

  21. Static Determination of Lowest and Highest Wavefront Numbers
     [Diagram: inputs to an ILP solver and the results derived from it]
     • Inputs to the ILP solver:
       • Original point loops (affine inequalities)
       • Global parameter values (affine inequalities)
     • Outputs:
       • Lexicographic minimum point, e.g., (1,1), giving the lowest wavefront number, e.g., wmin=⌊1/Ti⌋+⌊1/Tj⌋
       • Lexicographic maximum point, e.g., (200,2*N), giving the highest wavefront number, e.g., wmax=⌊200/Ti⌋+⌊(2*N)/Tj⌋
     • The outermost tiling loop enumerates the wavefront numbers from lowest (wmin) to highest (wmax)
     • The values of wmin and wmax can be determined at compile time using ILP solvers such as PIP/PipLib
     • Similarly, parametric bound values of each tiling loop variable (tjmin and tjmax for 1 ≤ j ≤ n) can also be computed using an ILP solver
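The two example formulas on this slide can be written as one small C helper. The sketch below assumes the lexicographic minimum and maximum points are already known (the slide obtains them from an ILP solver, which is not reproduced here); `wavefront_of_point` is a hypothetical name.

```c
#include <assert.h>

/* Floor of a/b for b > 0 (C division truncates toward zero). */
static int floord(int a, int b) { int q = a / b; return (a % b != 0 && a < 0) ? q - 1 : q; }

/* Wavefront number of the tile containing point (i, j):
   the sum of its tile coordinates, floor(i/Ti) + floor(j/Tj). */
int wavefront_of_point(int i, int j, int Ti, int Tj) {
    assert(Ti > 0 && Tj > 0);
    return floord(i, Ti) + floord(j, Tj);
}
```

With the slide's example points, `wmin` would be `wavefront_of_point(1, 1, Ti, Tj)` and `wmax` would be `wavefront_of_point(200, 2*N, Ti, Tj)`; both stay symbolic in the tile sizes until run time.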

  22. Parallel Parametric Tiling
         /* Parallel tiled loops */
         for (w=wmin; w<=wmax; w++)  /* sequential */
           for (it=lbit; it<=ubit; it++)  /* parallel */
             for (jt=lbjt; jt<=ubjt; jt++)  /* parallel */
               for (kt=max(lbkt, w-it-jt); kt<=min(ubkt, w-it-jt); kt++)  /* one-trip-count */
                 // intra-tile loops i,j,k
     Correct code, but it visits unnecessary empty iterations
     • Introduce an outermost wavefront loop
     • Utilize an ILP solver to derive wmin and wmax
     • Optimize the innermost iterator using the wavefront inequalities $w - t_1 - \ldots - t_{n-1} \le t_n \le w - t_1 - \ldots - t_{n-1}$
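A sequential C sketch of this loop structure, applied to the three-deep example nest used elsewhere in the deck (`i` and `j` from 1 to N, `k` from `i` to N). Several things here are assumptions made for illustration: the tile-loop bounds are the sequential ones from the later slide on RSFME, `wmin` and `wmax` are derived by hand from the corner points (1,1,1) and (N,N,N) rather than by an ILP solver, and the loops marked parallel are run sequentially because the sketch only checks that every statement instance is still executed exactly once.

```c
#include <assert.h>

static int floord(int a, int b) { int q = a / b; return (a % b != 0 && a < 0) ? q - 1 : q; }
static int ceild(int a, int b)  { int q = a / b; return (a % b != 0 && a > 0) ? q + 1 : q; }
static int max_int(int a, int b) { return a > b ? a : b; }
static int min_int(int a, int b) { return a < b ? a : b; }

/* Original nest: i in 1..N, j in 1..N, k in i..N. */
long count_points_original(int N) {
    long count = 0;
    for (int i = 1; i <= N; i++)
        for (int j = 1; j <= N; j++)
            for (int k = i; k <= N; k++)
                count++;                          /* stands in for S(i,j,k) */
    return count;
}

/* Wavefront-ordered tiles: w = it + jt + kt. */
long count_points_wavefront(int N, int Ti, int Tj, int Tk) {
    assert(N >= 1 && Ti > 0 && Tj > 0 && Tk > 0);
    int wmin = floord(1, Ti) + floord(1, Tj) + floord(1, Tk);  /* hand-derived */
    int wmax = floord(N, Ti) + floord(N, Tj) + floord(N, Tk);  /* hand-derived */
    long count = 0;
    for (int w = wmin; w <= wmax; w++)                                   /* sequential */
        for (int it = ceild(2 - Ti, Ti); it <= floord(N, Ti); it++)      /* parallel */
            for (int jt = ceild(2 - Tj, Tj); jt <= floord(N, Tj); jt++) {/* parallel */
                int kt = w - it - jt;                /* one-trip-count loop */
                if (kt < ceild(it * Ti - Tk + 1, Tk) || kt > floord(N, Tk))
                    continue;                        /* empty iteration */
                /* intra-tile loops i, j, k */
                for (int i = max_int(1, it * Ti); i <= min_int(N, it * Ti + Ti - 1); i++)
                    for (int j = max_int(1, jt * Tj); j <= min_int(N, jt * Tj + Tj - 1); j++)
                        for (int k = max_int(i, kt * Tk); k <= min_int(N, kt * Tk + Tk - 1); k++)
                            count++;                 /* stands in for S(i,j,k) */
            }
    return count;
}
```

Many `(w, it, jt)` combinations fall through the `continue`, which is the overhead of empty iterations that the slide points out.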

  23. Parallel Parametric Tiling (cont.)
         /* Parallel tiled loops */
         for (w=wmin; w<=wmax; w++)  /* sequential */
           for (it=max(lbit, w-jtmax-ktmax); it<=min(ubit, w-jtmin-ktmin); it++)  /* parallel */
             for (jt=max(lbjt, w-it-ktmax); jt<=min(ubjt, w-it-ktmin); jt++)  /* parallel */
               for (kt=max(lbkt, w-it-jt); kt<=min(ubkt, w-it-jt); kt++)  /* one-trip-count */
                 // intra-tile loops i,j,k
     Tighter loop bounds, but it still visits unnecessary empty iterations
     • Optimize using bounded wavefront inequalities
     • Utilize an ILP solver to derive the parametric bound values tjmin, tjmax for 1 ≤ j ≤ n
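To give a feel for the effect of the bounded wavefront inequalities, the C sketch below scans only the tile loops of the same example nest (`i`, `j` from 1 to N, `k` from `i` to N) and counts how many `(w, it, jt)` iterations run with and without the tightened bounds. The global bounds `itmin`, `itmax`, `jtmin`, `jtmax`, `ktmin`, `ktmax` are worked out by hand for this particular nest, standing in for the ILP solver mentioned on the slide; `TileStats` and `scan_tiles` are hypothetical names.

```c
#include <assert.h>

static int floord(int a, int b) { int q = a / b; return (a % b != 0 && a < 0) ? q - 1 : q; }
static int ceild(int a, int b)  { int q = a / b; return (a % b != 0 && a > 0) ? q + 1 : q; }
static int max_int(int a, int b) { return a > b ? a : b; }
static int min_int(int a, int b) { return a < b ? a : b; }

typedef struct {
    long outer_iters;  /* (w, it, jt) iterations executed */
    long tiles;        /* iterations whose single kt value lies within its bounds */
} TileStats;

/* tighten == 0: plain wavefront loops; tighten != 0: bounded wavefront inequalities. */
TileStats scan_tiles(int N, int Ti, int Tj, int Tk, int tighten) {
    assert(N >= 1 && Ti > 0 && Tj > 0 && Tk > 0);
    /* Hand-derived global bounds of the tile loop variables (assumption). */
    int itmin = floord(1, Ti), itmax = floord(N, Ti);
    int jtmin = floord(1, Tj), jtmax = floord(N, Tj);
    int ktmin = ceild(itmin * Ti - Tk + 1, Tk), ktmax = floord(N, Tk);
    TileStats st = {0, 0};
    for (int w = itmin + jtmin + ktmin; w <= itmax + jtmax + ktmax; w++) {
        int it_lo = itmin, it_hi = itmax;
        if (tighten) {
            it_lo = max_int(it_lo, w - jtmax - ktmax);
            it_hi = min_int(it_hi, w - jtmin - ktmin);
        }
        for (int it = it_lo; it <= it_hi; it++) {
            int jt_lo = jtmin, jt_hi = jtmax;
            if (tighten) {
                jt_lo = max_int(jt_lo, w - it - ktmax);
                jt_hi = min_int(jt_hi, w - it - ktmin);
            }
            for (int jt = jt_lo; jt <= jt_hi; jt++) {
                int kt = w - it - jt;            /* one-trip-count loop */
                st.outer_iters++;
                if (kt >= ceild(it * Ti - Tk + 1, Tk) && kt <= ktmax)
                    st.tiles++;
            }
        }
    }
    return st;
}
```

Both variants reach the same set of tiles; the tightened variant simply executes fewer `(w, it, jt)` iterations that end up with no valid `kt`.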

  24. Parallel Parametric Tiling (cont.)
         /* Original loops */
         for (i=1; i<=N; i++)
           for (j=1; j<=N; j++)
             for (k=i; k<=N; k++)
               S(i,j,k);

         /* Sequential tiled loops */
         for (it=⌈(1-Ti+1)/Ti⌉; it<=⌊N/Ti⌋; it++)
           for (jt=⌈(1-Tj+1)/Tj⌉; jt<=⌊N/Tj⌋; jt++)
             for (kt=⌈(it*Ti-Tk+1)/Tk⌉; kt<=⌊N/Tk⌋; kt++)
               // intra-tile loops i,j,k
     Very tight loop bounds, with negligible overhead of scanning empty tiles
     • Optimize using Relaxed Symbolic Fourier-Motzkin Elimination (RSFME)

  25. Parallel Parametric Tiling (cont.)
         /* Original loops */
         for (i=1; i<=N; i++)
           for (j=1; j<=N-i; j++)
             for (k=i; k<=N; k++)
               S(i,j,k);

         /* Sequential tiled loops */
         for (it=⌈(1-Ti+1)/Ti⌉; it<=⌊N/Ti⌋; it++)
           for (jt=⌈(1-Tj+1)/Tj⌉; jt<=⌊(N-it*Ti)/Tj⌋; jt++)
             for (kt=⌈(it*Ti-Tk+1)/Tk⌉; kt<=⌊N/Tk⌋; kt++)
               // intra-tile loops i,j,k
     • Ambiguous sign encountered:
         w-N/Tj-N/Tk <= it-it*Ti/Tj
     • Use itmin and itmax to resolve the sign ambiguity:
         (7a.1)  w-N/Tj-N/Tk+itmin*Ti/Tj <= it
                 (w*Tj*Tk-N*Tj-N*Tk+itmin*Ti*Tk)/(Tj*Tk) <= it
         (7a.2)  it*Ti/Tj <= itmax-w+N/Tj+N/Tk
                 it <= (itmax*Tj*Tk-w*Tj*Tk+N*Tj+N*Tk)/(Ti*Tk)
     • Resolving an ambiguous sign in RSFME: relaxation step
       • Replace the tile loop variables with their parametric bounded values (tjmin and tjmax)

  26. Experimental Evaluation
     • Main comparison:
       • P-PrimeTile (parallel parametric tiling)
       • Pluto (parallel non-parametric tiling)
     • Intel Xeon workstation:
       • Dual quad-core E5462 Xeon processors (8 cores total) with 32 KB L1 cache, 12 MB of L2 cache (6 MB shared per core pair), and 16 GB of DDR2 FBDIMM RAM, running Linux kernel version 2.6.25 (x86-64)
     • Compilers:
       • GCC version 4.2.4 (with "-O3 -fopenmp" optimization flags)
       • ICC version 10.1 (with "-fast -openmp" optimization flags)

  27. Benchmarks

  28. Performance of Generated Tiled Code (Compiled with GCC)
     [Figure: one plot each for 1D Jacobi, 2D FDTD, TriSolver, Cholesky and LU. X-axis: number of cores; Y-axis: seconds]

  29. Performance of Generated Tiled Code (Compiled with GCC) (cont.)
     [Figure: one plot each for Seidel, DSYRK and DTRMM. X-axis: number of cores; Y-axis: seconds]
     • Parametric tiled code efficiency is comparable to or better than that of fixed tiled code

  30. Performance of Generated Tiled Code (Compiled with ICC)
     [Figure: one plot each for 1D Jacobi, 2D FDTD, TriSolver, Cholesky and LU. X-axis: number of cores; Y-axis: seconds]

  31. Performance of Generated Tiled Code (Compiled with ICC) (cont.)
     [Figure: one plot each for Seidel, DSYRK and DTRMM. X-axis: number of cores; Y-axis: seconds]
     • Parametric tiled code efficiency is comparable to or better than that of fixed tiled code

  32. Efficiency of Code Generation
     [Figure: two plots of generation time (seconds) against levels of tiling]

  33. Efficiency of Code Generation (cont.)
     [Figure: two plots of generation time (seconds) against levels of tiling]
     • Fixed tiled code generation does not scale
     • Double benefit of P-PrimeTile: better scalability and parametric tiling

  34. Summary
     • Developed an efficient tiling transformation system that has the following features:
       • Handles imperfectly nested loops
       • Allows tile sizes to be runtime parameters
       • Addresses parallelism
       • Supports multi-level tiling
     • Illustrated the efficiency of the system over existing state-of-the-art systems
