A Novel Algorithm Combining Temporal Partitioning and Sharing of Functional Units. João M. P. Cardoso. April 30, 2001. IEEE Symposium on Field-Programmable Custom Computing Machines, Rohnert Park, CA, USA. Faculty of Sciences and Technology, University of Algarve, Faro, Portugal.
Index • Introduction • Temporal Partitioning • Problem Definition • New vs Previous Approach • Algorithm Working Through an Example • Experimental Results • Related Work • Conclusions • Future Work
Introduction • “Virtual Hardware”: • Reuse of devices • Save silicon area • Give the view of “unlimited resources” • Enabled by dynamically reconfigurable FPGAs • Two concepts: • Context switching among functionalities • Allowing a large “function” to be executed • FPGA devices allowing virtualization: • off-chip configurations • on-chip configurations • Several research efforts…
Introduction • Size larger than the available reconfigware area? • Answers: • Temporal Partitioning • Sharing of Functional Units • Goal: combining the two... [Figure: example dataflow graph with inputs x, y, u, dx; shift (<< 1), add (+), and subtract (-) operations; outputs x_1, y_1, u_1]
Temporal Partitioning [Figure: the example dataflow graph split along the time axis into two temporal partitions. The first takes dx, u, x, y and produces x_1, y_1, and the intermediate value aux1; the second takes aux1, dx, u, y and produces u_1.]
Temporal Partitioning • Create temporal partitions to be executed by time-sharing the device • Netlist level (structural) • Difficulties when dealing with feedbacks • Loss of Information • Flat structure • Intricate for exploiting sharing of functional units • Behavioral level (functional) • Loops can be explicitly represented • Better design decisions • “A must” for compilers for reconfigurable computing
Problem Definition But what if we decrease the needed area by sharing functional units? • Simultaneous Temporal Partitioning and sharing of Functional Units THE PROBLEM: • Given a dataflow graph (representing a behavioral description), a library of components,... • Map the dataflow graph onto the available resources of the FPGA device: • Considering sharing of Functional Units • Considering Temporal Partitioning • Decreasing the overall execution latency
New vs Previous Approach • Previous: DFG/CDFG, Constraints, Component Library → Temporal Partitioning → High-Level Synthesis → Circuit-generation, Logic Synthesis • New: DFG/CDFG, Constraints, Component Library → Simultaneous Temporal Partitioning and High-Level Synthesis → Circuit-generation, Logic Synthesis
Algorithm Working Through an Example Suppose the following dataflow graph [Figure: dataflow graph with six nodes, numbered 0 to 5] • Consider: • Area(+) = 1 cell • Area(x) = 2 cells • Delay(+) = 1 control step (cs) • Delay(x) = 2 cs • Total area of the DFG: 8 cells • Available Area: 3 cells
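The area figures on this slide can be checked with a few lines of arithmetic. The sketch below assumes the graph uses only the two operator types listed; under that assumption, six nodes totalling 8 cells imply four adders and two multipliers (a + m = 6, a + 2m = 8). The names `OPS`, `area_no_sharing`, and `area_full_sharing` are illustrative and do not come from the slides.

```python
# Cell areas and the available area are taken from the slide.
AREA = {"+": 1, "x": 2}
AVAILABLE_AREA = 3

# Assumption: only "+" and "x" nodes occur, so 6 nodes and 8 cells
# imply 4 adders and 2 multipliers.
OPS = ["+", "+", "+", "+", "x", "x"]

# One functional unit per node (no sharing at all).
area_no_sharing = sum(AREA[op] for op in OPS)

# One functional unit per operator type (maximum sharing).
area_full_sharing = sum(AREA[kind] for kind in set(OPS))

print(area_no_sharing, area_no_sharing > AVAILABLE_AREA)
print(area_full_sharing, area_full_sharing <= AVAILABLE_AREA)
```

Without sharing, the graph does not fit in the device. With one adder and one multiplier it just fits, at the price of extra control steps.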
Algorithm Working Through an Example Calculate ASAP and ALAP values [Figure: dataflow graph with nodes 0 to 5]
Node  0  1  2  3  4  5
ASAP  0  0  1  0  2  3
ALAP  1  1  2  0  2  3
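A minimal sketch of the ASAP/ALAP computation follows. The slides do not list the edges or operator types of the example graph, so `OP` and `EDGES` below are a hypothetical reconstruction, chosen only because it reproduces the table above and the 8-cell total area. The function `asap_alap` is likewise illustrative, not the author's code.

```python
# Delays in control steps, from the slides.
DELAY = {"+": 1, "x": 2}

# ASSUMPTION: operator types and edges are inferred, not given on the slides.
OP = {0: "+", 1: "+", 2: "x", 3: "x", 4: "+", 5: "+"}
EDGES = [(0, 2), (1, 2), (3, 4), (4, 5)]  # (producer, consumer)


def asap_alap(op, edges):
    """Return (asap, alap, latency) start times for an acyclic dataflow graph."""
    nodes = sorted(op)
    preds = {n: [] for n in nodes}
    succs = {n: [] for n in nodes}
    for src, dst in edges:
        preds[dst].append(src)
        succs[src].append(dst)

    # Topological order: repeatedly take nodes whose predecessors are all placed.
    order, done = [], set()
    while len(order) < len(nodes):
        for n in nodes:
            if n not in done and all(p in done for p in preds[n]):
                order.append(n)
                done.add(n)

    # ASAP: start as soon as every predecessor has finished.
    asap = {}
    for n in order:
        asap[n] = max((asap[p] + DELAY[op[p]] for p in preds[n]), default=0)
    latency = max(asap[n] + DELAY[op[n]] for n in nodes)

    # ALAP: finish just before the earliest successor must start.
    alap = {}
    for n in reversed(order):
        must_finish = min((alap[s] for s in succs[n]), default=latency)
        alap[n] = must_finish - DELAY[op[n]]
    return asap, alap, latency


asap, alap, latency = asap_alap(OP, EDGES)
for n in sorted(OP):
    print(n, asap[n], alap[n])
```

With the assumed graph, the printed rows match the ASAP and ALAP values in the table, and the unconstrained latency is 4 cs.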
Algorithm Working Through an Example Identify the critical path (nodes with ASAP = ALAP: 3, 4, 5) [Figure: dataflow graph with nodes 0 to 5]
Node  0  1  2  3  4  5
ASAP  0  0  1  0  2  3
ALAP  1  1  2  0  2  3
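One common way to find the critical path is to pick the nodes with zero slack, that is, nodes whose ASAP and ALAP times coincide. The sketch below applies that test to the table values from the slide. The helper name `zero_slack_nodes` is illustrative, and the slides do not state that the algorithm identifies the path this way.

```python
# ASAP and ALAP start times copied from the slide's table.
ASAP = {0: 0, 1: 0, 2: 1, 3: 0, 4: 2, 5: 3}
ALAP = {0: 1, 1: 1, 2: 2, 3: 0, 4: 2, 5: 3}


def zero_slack_nodes(asap, alap):
    """Nodes whose ASAP and ALAP times coincide, ordered by start time."""
    critical = [n for n in asap if alap[n] - asap[n] == 0]
    return sorted(critical, key=lambda n: asap[n])


print(zero_slack_nodes(ASAP, ALAP))  # [3, 4, 5]
```

These are the three nodes that the following slides place one per temporal partition.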
Algorithm Working Through an Example Create an initial number of TPs: suppose 3 [Figure: the dataflow graph beside a table with columns Area and MAXCS and one empty row per temporal partition (1, 2, 3)]
Algorithm Working Through an Example Map each node of the critical path on each temporal partition • TP 1: node 3, MAXCS = 2 cs • TP 2: node 4, MAXCS = 1 cs • TP 3: node 5, MAXCS = 1 cs
Algorithm Working Through an Example Try to map nodes in each temporal partition (1) [Animation frames: starting from nodes 3, 4, 5, node 0 and then node 1 are mapped; MAXCS stays at 2 cs, 1 cs, 1 cs for TPs 1, 2, 3]
Algorithm Working Through an Example Try to map nodes in each temporal partition (2), (3) [Animation frames: nodes 0, 1, 3, 4, 5 are mapped and node 2 is still unmapped; MAXCS stays at 2 cs, 1 cs, 1 cs for TPs 1, 2, 3]
Algorithm Working Through an Example Relax: add 1 control step (cs) to MAXCS [Figure: nodes 0, 1, 3, 4, 5 mapped, node 2 unmapped; table before relaxation shows MAXCS of 2 cs, 1 cs, 1 cs for TPs 1, 2, 3]
Algorithm Working Through an Example Try to map nodes in each temporal partition (1), (2) [Animation frames: after the relaxation, node 2 is mapped at step (2), so all six nodes 0 to 5 are now mapped]
Algorithm Working Through an Example Merge Operation (1) • Before merging: TP 1 (MAXCS 2 cs), TP 2 (MAXCS 2 cs), TP 3 (MAXCS 1 cs) • After merging TPs 1 and 2: TP 1,2 (MAXCS 4 cs), TP 3 (MAXCS 1 cs)
Algorithm Working Through an Example Merge Operation (2) • After merging TP 3 as well: TP 1,2,3 (MAXCS 4 cs)
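The final merge leaves a single partition (1,2,3) with 4 cs. As a plausibility check, the sketch below runs a plain resource-constrained list scheduler with one adder and one multiplier (3 cells in total). This is a generic textbook technique, not the algorithm from the talk. The operator types and edges in `OP` and `EDGES` are assumptions inferred from the ASAP/ALAP table, and `list_schedule` is an illustrative name.

```python
# Delays and areas from the slides.
DELAY = {"+": 1, "x": 2}
AREA = {"+": 1, "x": 2}

# ASSUMPTION: operator types and edges are inferred, not given on the slides.
OP = {0: "+", 1: "+", 2: "x", 3: "x", 4: "+", 5: "+"}
EDGES = [(0, 2), (1, 2), (3, 4), (4, 5)]

# ALAP values from the slide's table, used as scheduling priority.
ALAP = {0: 1, 1: 1, 2: 2, 3: 0, 4: 2, 5: 3}


def list_schedule(op, edges, units, priority):
    """Resource-constrained list scheduling; returns {node: start control step}."""
    preds = {n: [s for s, d in edges if d == n] for n in op}
    start, finish = {}, {}
    # For each unit type, the control step at which each instance becomes free.
    free_at = {kind: [0] * count for kind, count in units.items()}
    cs = 0
    while len(start) < len(op):
        ready = [n for n in op if n not in start
                 and all(p in finish and finish[p] <= cs for p in preds[n])]
        # Most urgent node (smallest ALAP) gets a shared unit first.
        for n in sorted(ready, key=lambda n: (priority[n], n)):
            pool = free_at[op[n]]
            for i, free_time in enumerate(pool):
                if free_time <= cs:
                    start[n] = cs
                    finish[n] = cs + DELAY[op[n]]
                    pool[i] = finish[n]
                    break
        cs += 1
    return start


UNITS = {"+": 1, "x": 1}  # one shared adder, one shared multiplier
schedule = list_schedule(OP, EDGES, UNITS, ALAP)
latency = max(schedule[n] + DELAY[OP[n]] for n in OP)
area = sum(AREA[kind] * count for kind, count in UNITS.items())
print(schedule, latency, area)
```

Under these assumptions, the schedule fits in 3 cells and completes in 4 cs, which agrees with the MAXCS shown for the merged partition.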
Experimental Results Near-optimal w/o sharing vs sharing [Charts: EX1, SEHWA, HAL, EWF]
Experimental Results Near-optimal w/o sharing vs sharing [Charts: FIR, MAT4x4; data labels 72 and 37]
Experimental Results Performance vs No. of Temporal Partitions • Mult4x4, RMAX=10 (no sharing of adders)
Experimental Results Is the algorithm good for scheduling? • Comparison to some optimum results [Charts: EWF, SEHWA]
Related Work • List-Scheduling considering dynamic reconfiguration [Vasilko et al., FPL’96] • ASAP [GajjalaPurna et al., IEEE Trans. on Comp., 1999] • Minimize latency taking into account communication costs [Cardoso et al. VLSI’99]: • Enhanced Static-List Scheduling • Iterative approach (Simulated Annealing) • ILP formulation [SPARCs, DATE’98; RAW’98] • Enhanced Force-Directed List Scheduling [Pandey et al., SPIE’99] • And others [see the Related Work section of the paper]
Conclusions • Novel algorithm simultaneously doing temporal partitioning and sharing of functional units • Low complexity • Heuristic approach • Based on gradually enlarging time slots • Allows exploiting the duality between the number of temporal partitions and resource sharing • Close-to-optimum results with some examples • Results indicate that the algorithm also performs well when used for scheduling
Future Work • Enhancements to the algorithm: • consider functional units with pipelining • consider pipelining between execution and reconfiguration • Study how to take into account communication and reconfiguration costs • Test results with a reconfigurable computing system (commercial board)
Contact Author João M. P. Cardoso jmpc@acm.org http://w3.ualg.pt/~jmcardo THANK YOU!