
Compiler-directed Synthesis of Programmable Loop Accelerators


Presentation Transcript


  1. Compiler-directed Synthesis of Programmable Loop Accelerators Kevin Fan, Hyunchul Park, Scott Mahlke September 25, 2004 EDCEP Workshop

  2. Loop Accelerators • Hardware implementation of a critical loop nest • Hardwired state machine • Digital camera application – ~1000x speedup vs. a Pentium III • Multiple accelerators hooked up in a pipeline • Loop accelerator vs. customized processor • 1 block of code vs. multiple blocks • Trivial control flow vs. handling generic branches • Traditionally state-machine driven vs. instruction driven

  3. Programmable Loop Accelerators • Goals • Multifunction accelerators – Accelerator hardware can handle multiple loops (re-use) • Post-programmable – To a degree, allow changes to the application • Use compiler as architecture synthesis tool • But … • Don’t build a customized processor • Maintain ASIC-level efficiency

  4. NPA (Nonprogrammable Accelerator) Synthesis in PICO

  5. PICO Frontend • Goals • Exploit loop-level parallelism • Map loop to abstract hardware • Manage global memory BW • Steps • Tiling • Load/store elimination • Iteration mapping • Iteration scheduling • Virtual processor clustering

  Original loop:
      for i = 1 to ni
        for j = 1 to nj
          y[i] += w[j] * x[i+j]

  After tiling and iteration mapping onto virtual processors:
      for jt = 1 to 100 step 10
        for t = 0 to 502
          for p = 0 to 1
            (i,j) = function of (t,p)
            W[t][p] = (i>1) ? W[t-5][p] : w[jt+j]
            X[t][p] = (i>1 && j<bj) ? X[t-4][p+1] : x[i+jt+j]
            Y[t][p] += W[t][p] * X[t][p]
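
  A minimal C sketch of the tiling step alone (strip-mining the j loop so each tile's weights can stay local to the accelerator); the tile size TJ and the array bounds are illustrative assumptions, and this is not PICO's actual iteration mapping onto (t, p) coordinates:

      /* FIR kernel with the j loop strip-mined into tiles of TJ iterations.
         NI, NJ, and TJ are illustrative assumptions. */
      #define NI 100
      #define NJ 100
      #define TJ 10

      void fir_tiled(int y[NI + 1], const int w[NJ + 1], const int x[NI + NJ + 2])
      {
          for (int jt = 1; jt <= NJ; jt += TJ)            /* tile loop */
              for (int i = 1; i <= NI; i++)               /* outer iteration */
                  for (int j = jt; j < jt + TJ && j <= NJ; j++)
                      y[i] += w[j] * x[i + j];            /* same work as the original loop */
      }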

  6. PICO Backend • Resource allocation (II, operation graph) • Synthesize machine description for “fake” fully connected processor with allocated resources
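
  For intuition on the "resource allocation (II)" step, here is a small C sketch of the standard resource-constrained lower bound on the initiation interval (ResMII = max over resource classes of ceil(ops / units)); PICO's allocator effectively works in the other direction, picking unit counts so that a target II is feasible, and the op/unit counts below are illustrative assumptions:

      #include <stdio.h>

      static int ceil_div(int a, int b) { return (a + b - 1) / b; }

      /* Resource-constrained lower bound on II: each resource class needs
         ceil(#ops / #units) cycles per iteration; II must cover the worst one. */
      int res_mii(const int ops[], const int units[], int n_classes)
      {
          int ii = 1;
          for (int c = 0; c < n_classes; c++) {
              int need = ceil_div(ops[c], units[c]);
              if (need > ii)
                  ii = need;
          }
          return ii;
      }

      int main(void)
      {
          /* Illustrative: 2 MPY ops on 1 multiplier, 9 memory ops on 3 memory
             units, 30 ALU ops on 8 adders. */
          int ops[]   = { 2, 9, 30 };
          int units[] = { 1, 3, 8 };
          printf("ResMII = %d\n", res_mii(ops, units, 3));   /* prints ResMII = 4 */
          return 0;
      }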

  7. Reduced VLIW Processor after Modulo Scheduling

  8. Data/control-path Synthesis → NPA

  9. PICO Methodology – Why it Works • Systematic design methodology • 1. Parameterized meta-architecture – all NPAs have same general organization • 2. Performance/throughput is an input • 3. Abstract architecture – we know how to build compilers for this • 4. Mapping mechanism – determine architecture specifics from schedule for abstract architecture

  10. Direct Generalization of PICO? • Programmability would require full interconnect between elements • Back to the meta architecture! • Generalize connectivity to enable post-programmability • But stylize it

  11. Programmable Loop Accelerator – Design Strategy • Compile for partially defined architecture • Build long distance communication into schedule • Limit global communication bandwidth • Proposed meta-architecture • Multi-cluster VLIW • Explicit inter-cluster transfers (varying latency/BW) • Intra-cluster communication is complete • Hardware partially defined – expensive units

  12. Programmable Loop Accelerator Schema • [Schema figure: accelerator datapath built from FUs (including MEM units) with shift registers and complete intra-cluster communication, an inter-cluster register file, stream units and stream buffers connecting to SRAM/DRAM, and a control unit; several such accelerators form a pipeline of tiled or clustered accelerators running at a given II]

  13. Flow Diagram • [Design flow: assembly code and II → FU Alloc (determines # clusters, # expensive FUs) → Partition (FUs assigned to clusters, # cheap FUs, inter-cluster bandwidth) → Modulo Schedule (shift register depth, width, porting) → Loop Accelerator]

  14. Sobel Kernel

      for (i = 0; i < N1; i++) {
        for (j = 0; j < N2; j++) {
          int t00, t01, t02, t10, t12, t20, t21, t22;
          int e1, e2, e12, e22, e, tmp;
          t00 = x[i  ][j  ]; t01 = x[i  ][j+1]; t02 = x[i  ][j+2];
          t10 = x[i+1][j  ];                    t12 = x[i+1][j+2];
          t20 = x[i+2][j  ]; t21 = x[i+2][j+1]; t22 = x[i+2][j+2];
          e1 = ((t00 + t01) + (t01 + t02)) - ((t20 + t21) + (t21 + t22));
          e2 = ((t00 + t10) + (t10 + t20)) - ((t02 + t12) + (t12 + t22));
          e12 = e1*e1;
          e22 = e2*e2;
          e = e12 + e22;
          if (e > threshold) tmp = 1; else tmp = 0;
          edge[i][j] = tmp;
        }
      }

  15. FU Allocation • Determine number of clusters • Determine number of expensive FUs (MPY, DIV, memory) • Sobel with II=4: 41 ops → 3 clusters; 2 MPY ops → 1 multiplier; 9 memory ops → 3 memory units
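
  The slide's numbers are consistent with simple ceiling arithmetic; the C sketch below reproduces them under the assumption (taken from the 4 × II load bound on the next slide) that each cluster executes at most 4 × II operations and each expensive unit is shared across the II cycles of the schedule:

      #include <stdio.h>

      static int ceil_div(int a, int b) { return (a + b - 1) / b; }

      int main(void)
      {
          int ii = 4, total_ops = 41, mpy_ops = 2, mem_ops = 9;   /* Sobel, II=4 */

          int clusters  = ceil_div(total_ops, 4 * ii);  /* 41 / 16 -> 3 clusters     */
          int mults     = ceil_div(mpy_ops, ii);        /* 2 / 4   -> 1 multiplier   */
          int mem_units = ceil_div(mem_ops, ii);        /* 9 / 4   -> 3 memory units */

          printf("%d clusters, %d multiplier(s), %d memory unit(s)\n",
                 clusters, mults, mem_units);
          return 0;
      }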

  16. Partitioning • Multi-level approach consists of two phases • Coarsening • Refinement • Minimize inter-cluster communication • Load balance • Max of 4 × II operations per cluster • Take FU allocation into account • Restricted # of expensive units • # of cheap units (ADD, logic) determined from partition

  17. Coarsening • Group highly related operations together • Pair operations together at each step • Forces partitioner to consider several operations as a single unit • Coarsening Sobel subgraph into 2 groups: [figure: dataflow subgraph of load (L) and add (+) operations paired into two groups]
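
  A C sketch of one coarsening pass in the spirit of heavy-edge matching, a common multilevel-partitioning heuristic; the pairing rule and data layout are assumptions for illustration, not necessarily the authors' exact algorithm:

      #define MAX_OPS 64

      /* Pair each unmatched operation with its most heavily connected unmatched
         neighbor, so later partitioning treats the pair as a single unit.
         weight[u][v] is the (assumed) edge weight between operations u and v. */
      void coarsen_once(int n_ops, const int weight[MAX_OPS][MAX_OPS],
                        int group_of[MAX_OPS])
      {
          int matched[MAX_OPS] = { 0 };
          int n_groups = 0;

          for (int u = 0; u < n_ops; u++) {
              if (matched[u])
                  continue;
              int best = -1, best_w = 0;
              for (int v = 0; v < n_ops; v++)      /* heaviest unmatched neighbor */
                  if (!matched[v] && v != u && weight[u][v] > best_w) {
                      best = v;
                      best_w = weight[u][v];
                  }
              matched[u] = 1;
              group_of[u] = n_groups;
              if (best >= 0) {                     /* pair u with that neighbor */
                  matched[best] = 1;
                  group_of[best] = n_groups;
              }
              n_groups++;
          }
      }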

  18. Refinement • Move operations between clusters • Good moves: • Reduce inter-cluster communication • Improve load balance • Reduce hardware cost • Reduce number of expensive units to meet limit • Collect similar bitwidth operations together • [figure: candidate move of an operation between clusters of load (L) and add (+) operations]
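
  One way such a move could be scored, shown as a hedged C sketch: the gain is the reduction in inter-cluster edge weight, and a move is rejected if it would push the destination cluster past the 4 × II load bound. The cost model is an illustrative assumption, not the paper's exact heuristic:

      #define MAX_OPS 64

      /* Gain of moving operation `op` into cluster `dst`: edges to ops already in
         `dst` stop crossing clusters (+), edges to ops left behind start crossing (-). */
      int move_gain(int op, int dst, int n_ops,
                    const int weight[MAX_OPS][MAX_OPS],
                    const int cluster_of[MAX_OPS],
                    const int cluster_load[], int ii)
      {
          if (cluster_load[dst] + 1 > 4 * ii)      /* would break load balance */
              return -1000000;

          int gain = 0;
          for (int v = 0; v < n_ops; v++) {
              if (v == op || weight[op][v] == 0)
                  continue;
              if (cluster_of[v] == dst)
                  gain += weight[op][v];           /* edge becomes intra-cluster */
              else if (cluster_of[v] == cluster_of[op])
                  gain -= weight[op][v];           /* edge becomes inter-cluster */
          }
          return gain;
      }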

  19. Partitioning Example • From sobel, II=4 • Place MPYs together • Place each tree of ADD-LOAD-ADDs together • Cuts 6 edges

  20. Modulo Scheduling • Determines shift register width, depth, and number of read ports • Sobel II=4 [table: modulo schedule placing ADD and LD operations onto FU0–FU3 over cycles 0–3]
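
  A hedged C sketch of how the shift register sizing might fall out of the modulo schedule: a result must stay in its producer FU's shift register until its last consumer reads it, so the depth is the longest producer-to-consumer distance, and the number of distinct read depths gives an upper bound on the read ports. This reconstruction is an assumption, not the paper's exact rule:

      #define MAX_EDGES 128
      #define MAX_DEPTH 64

      /* prod_cycle[e] / cons_cycle[e]: schedule cycles of the producer and
         consumer of data edge e for one FU's results. */
      void size_shift_register(int n_edges,
                               const int prod_cycle[MAX_EDGES],
                               const int cons_cycle[MAX_EDGES],
                               int *depth, int *read_ports)
      {
          int seen_depth[MAX_DEPTH] = { 0 };
          *depth = 0;
          *read_ports = 0;

          for (int e = 0; e < n_edges; e++) {
              int d = cons_cycle[e] - prod_cycle[e];           /* lifetime in cycles */
              if (d > *depth)
                  *depth = d;
              if (d >= 0 && d < MAX_DEPTH && !seen_depth[d]) { /* distinct read stages */
                  seen_depth[d] = 1;
                  (*read_ports)++;
              }
          }
      }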

  21. Test Cases • Sobel and fsed kernels, II=4 designs • Each machine has 4 clusters with 4 FUs per cluster • [figure: per-cluster FU mixes of the sobel and fsed machines – memory (M), add/subtract, shift (<<), logic (&), multiply (*), and branch (B) units]

  22. Cross Compile Results • Computation is localized • sobel: 1.5 moves/cycle • fsed: 1 move/cycle • Cross compile • Can still achieve II=4 • More inter-cluster communication • May require more units • sobel on fsed machine: ~2 moves/cycle • fsed on sobel machine: ~3 moves/cycle

  23. Concluding Remarks • Programmable loop accelerator design strategy • Meta-architecture with stylized interconnect • Systematic compiler-directed design flow • Costs of programmability: • Interconnect, inter-cluster communication • Control – “micro-instructions” are necessary • Just scratching the surface of this work • For more, see the CCCP group webpage • http://cccp.eecs.umich.edu
