
Reconfigurable Computing



  1. Reconfigurable Computing • Dr. Christophe Bobda • CSCE Department • University of Arkansas

  2. Chapter 4: High-Level Synthesis for Reconfigurable Devices

  3. Agenda • Modeling • Dataflow graphs • Sequencing graphs • Finite State Machine with Datapath • High-level synthesis and temporal partitioning • The list-scheduling-based approach • Exact methods: Integer Linear Programming • Network-flow-based temporal partitioning

  4. 1. Modeling • The success of a hardware platform depends on the ease of programming it • Microprocessors • DSPs • High-level descriptions increase the acceptance of a platform • Modeling is a key aspect in the design of a system • The models used must be powerful enough to • Capture all user needs • Describe all system parameters • and still be easy to understand and manipulate • Several powerful models exist (FSM, StateCharts, DFG, Petri nets, etc.) • In this course we focus on three models: dataflow graphs, sequencing graphs and the Finite State Machine with Datapath (FSMD)

  5. 1. Definitions • Given a node v_i ∈ V and its implementation as a rectangular module H_i in hardware: • l_i and h_i denote the length and the height of H_i • a_i = l_i × h_i denotes the area of H_i • The latency t_i of v_i is the time it takes to compute the function of v_i using the module H_i • For a given edge e_ij = (v_i, v_j), w_ij is the weight of e_ij, i.e. the width of the bus connecting the two components v_i and v_j • The latency t_ij of e_ij is the time needed to transmit data from v_i to v_j • Simplification: we will simply write v_i both for a node and for its hardware implementation H_i

  6. 1.1 Dataflow Graph • A means to describe a computing task in a streaming mode • Operators represent the nodes • The inputs of a node are its operands • A node's output can be used as input to other nodes • Data dependency is captured in the graph • Given a set of tasks T = {T_1, ..., T_k}, a dataflow graph (DFG) is a directed acyclic graph G = (V, E), where • V is the set of nodes representing the operators and • E is the set of edges: an edge e = (v_i, v_j) ∈ E is defined through the (data) dependency between task T_i and task T_j • Example: dataflow graph for the quadratic root computation using x = (−b + √(b² − 4ac)) / (2a) (see the sketch below)
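To make the dataflow semantics concrete, here is a minimal Python sketch (not part of the original slides): it encodes the quadratic-root DFG as a mapping node → (operator, operand nodes) and evaluates it in topological order; the node names and the operator set are hypothetical.

    import math
    from graphlib import TopologicalSorter  # Python 3.9+

    # Hypothetical encoding of x = (-b + sqrt(b^2 - 4ac)) / (2a) as a DFG.
    DFG = {
        "b2":   ("mul",  ["b", "b"]),       # b^2
        "ac":   ("mul",  ["a", "c"]),       # a*c
        "ac4":  ("mul",  ["four", "ac"]),   # 4ac
        "disc": ("sub",  ["b2", "ac4"]),    # b^2 - 4ac
        "root": ("sqrt", ["disc"]),
        "negb": ("neg",  ["b"]),
        "num":  ("add",  ["negb", "root"]), # -b + sqrt(...)
        "a2":   ("mul",  ["two", "a"]),     # 2a
        "x":    ("div",  ["num", "a2"]),
    }

    OPS = {"mul": lambda x, y: x * y, "sub": lambda x, y: x - y,
           "add": lambda x, y: x + y, "div": lambda x, y: x / y,
           "neg": lambda x: -x, "sqrt": math.sqrt}

    def evaluate(dfg, inputs):
        # fire each node in topological order, once its operands are ready
        deps = {n: set(p) - set(inputs) for n, (_, p) in dfg.items()}
        values = dict(inputs)
        for node in TopologicalSorter(deps).static_order():
            op, operands = dfg[node]
            values[node] = OPS[op](*(values[p] for p in operands))
        return values

    print(evaluate(DFG, {"a": 1.0, "b": -3.0, "c": 2.0,
                         "two": 2.0, "four": 4.0})["x"])  # 2.0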

  7. 1.2 Sequencing graph • A hierarchical dataflow graph with two different types of nodes: • The operation nodes, corresponding to normal "task nodes" in a dataflow graph • The link nodes or branching nodes, which point to another sequencing graph at a lower level of the hierarchy • Linking nodes evaluate conditional clauses • They are placed at the tail of alternative paths corresponding to possible branches • Loops can be modeled by using a branching node as the tail of two paths: • one for the exit from the loop • the other for the return to the body of the loop

  8. 1.2 Sequencing graph • Example: • According to the condition that node BR evaluates, one of the two sub-sequencing graphs (1 or 2) is activated • Loop description: the body of the loop is described in only one sub-sequencing graph; BR evaluates the loop exit condition

  9. 1.3 Finite State Machine with Datapath (FSMD) • An extension of the dataflow graph with a finite state machine • Formally, an FSMD is a 7-tuple F = ⟨S, I, O, V, f, h, s0⟩ where: • S is a set of states, • I is a set of inputs, • O is a set of outputs, • V is a set of variables, • f: S × I × V → S is a transition function that maps a tuple (state, input, variable values) to a state, • h: S → O × V is an action function that maps the current state to outputs and variables, • s0 ∈ S is an initial state • FSMD vs FSM • An FSMD operates on arbitrarily complex data types • Transitions may include arithmetic operations
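As an illustration (a sketch under simplifying assumptions, not the book's code), the 7-tuple maps directly onto Python: below, f and h are the transition and action functions of a hypothetical FSMD that sums the numbers 1..n; the input set I is omitted for brevity and the variables V live in a dict.

    class FSMD:
        """An FSMD reduced to its transition function f, action function h, and s0."""
        def __init__(self, f, h, s0):
            self.f, self.h, self.s0 = f, h, s0

        def run(self, variables):
            state = self.s0
            while state != "done":                     # "done" acts as a final state
                variables = self.h(state, variables)   # action: update variables
                state = self.f(state, variables)       # transition: may test variables
            return variables

    def h(state, v):
        v = dict(v)
        if state == "init":
            v["i"], v["sum"] = 1, 0
        elif state == "add":                # the loop body: an assignment state
            v["sum"] += v["i"]
            v["i"] += 1
        return v                            # condition states carry no action

    def f(state, v):
        if state == "init":
            return "loop"
        if state == "loop":                 # condition state C of the while loop
            return "add" if v["i"] <= v["n"] else "done"
        return "loop"                       # from the loop body back to C

    print(FSMD(f, h, "init").run({"n": 5})["sum"])  # 15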

  10. 1.3 Finite State Machine with Datapath (FSMD) • Modeling with FSMD: • A program is transformed into an FSMD by transforming the statements of the program into FSMD states • The statements are first classified into three categories: • assignment statements, • branch statements and • loop statements • For an assignment statement, • a single state is created that executes the assignment action • An arc is created connecting this state with the state corresponding to the next program statement

  11. 1.3 Finite State Machine with Datapath (FSMD) • For a loop statement, • a condition state C and a join state J, both with no action, are created • An arc, labeled with the loop condition and connecting the condition state C with the state corresponding to the first statement in the loop body, is added to the FSMD • Accordingly, another arc is added from C to the state corresponding to the first statement after the loop body • This arc is labeled with the complement of the loop condition • Finally, an edge is added from the state corresponding to the last statement in the loop to the join state, and • another edge is added from the join state back to the condition state

  12. 1.3 Finite State Machine with Datapath (FSMD) • For a branch statement, • a condition state C and a join state J, both with no action, are created • An arc is added from the condition state to the first statement of the first branch • This arc is labeled with the first branch condition • A second arc, labeled with the complement of the first condition ANDed with the second branch condition, is added from the condition state to the first statement of the second branch • This process is repeated until the last branch condition • The state corresponding to the last statement of each branch is then connected to the join state • The join state is finally connected to the state corresponding to the first statement after the branch

  13. 1.3 Finite State Machine with Datapath (FSMD)

  14. 2. High-Level synthesis • Next: compilation (synthesis) of the model onto a set of hardware components • Usually done in three different steps: • Allocation defines the number of resource types required by the design and, for each type, the number of instances • Binding maps each operation to an instance of a given resource • Scheduling sets the temporal assignment of resources to operators • Fundamental difference in reconfigurable computing: • Unlike in common high-level synthesis, only binding and scheduling matter • In RC, resources can be created as needed, within the device capacity

  15. 2. High-Level synthesis • Dataflow graph of the two example functions • Assumptions for a "fixed-resource" device: • A multiplication needs 100 basic resources (e.g. 100 NAND gates or 100 LUTs) • The adder and the subtractor need 50 LUTs each • Despite the availability of resources, two subtractors cannot be used in the first level • The adder cannot be used in the first step, • because it depends on a subtractor that must be executed first • Minimum execution time: 4 steps

  16. 2. High-Level synthesis • The same dataflow graph on a reconfigurable device • Assumptions: • A multiplication needs 100 basic resources (e.g. 100 NAND gates or 100 LUTs) • The adder and the subtractor need 50 LUTs each • Total available amount of resources: 200 LUTs • The two subtractors can be assigned in the first step • Minimum execution time: 3 steps

  17. 2. Temporal partitioning – Problem definition • Temporal partitioning: given a DFG G = (V, E) and a single reconfigurable device R, compute a temporal partitioning of G for R • A temporal partitioning can also be defined as an ordered partition P = {P1, ..., Pn} of G subject to the constraints imposed by R • With the ordering relation imposed on the partition, we reduce the solution space to only those partitions which can be scheduled on the device for execution • Therefore, cycles are not allowed in the dataflow graph; otherwise, the resulting partition may not be schedulable on the device (a validity check is sketched below) • [Figure: a cycle in the graph makes the resulting partition unschedulable]
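The "no backward edge" condition is easy to check mechanically; a minimal sketch, assuming part_of maps every node of the DFG to the index of the partition it was assigned to:

    def is_schedulable(edges, part_of):
        # an ordered partition is schedulable iff every edge stays inside its
        # partition or points to a later one (no backward edges)
        return all(part_of[u] <= part_of[v] for u, v in edges)

    print(is_schedulable([("v1", "v2"), ("v2", "v3")],
                         {"v1": 0, "v2": 0, "v3": 1}))  # True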

  18. 2. Temporal partitioning • Goal: • Computation and scheduling of a configuration graph • A configuration graph is a graph in which: • Nodes are partitions or bitstreams • Edges reflect the precedence constraints in a given DFG • Communication happens through inter-configuration registers • The configuration sequence is controlled by the host processor • Before reconfiguration, register values are saved; • after reconfiguration, the values are copied back into the registers • [Figure: a configuration graph with partitions P1–P5 connected through inter-configuration registers; the device's IO registers are mapped into the processor's address space over a bus]

  19. 2. Temporal partitioning • [Figure: three partitionings of a 10-node graph, with connectivity = 0.24, quality = 0.25 and quality = 0.45] • Objectives: • Minimize the number of interconnections: one of the most important goals, since it minimizes • the amount of exchanged data • the amount of memory needed to temporarily store the data • Minimize the number of produced blocks • Minimize the overall computation delay • Quality of the result: a means to measure how well an algorithm performs • Connectivity of a graph G = (V, E): con(G) = 2·|E| / (|V|² − |V|) • Quality of a partitioning P = {P1, ..., Pn}: the average connectivity over P • High quality means the algorithm performs well; • low quality means it performs poorly (see the sketch below)
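The two formulas translate directly into Python; in this sketch each partition is assumed to be given as a pair (|V_i|, |E_i|) counting only the nodes and edges inside it, and quality is the average connectivity, following the definition above:

    def connectivity(n_nodes, n_edges):
        # con(G) = 2*|E| / (|V|^2 - |V|)
        return 2 * n_edges / (n_nodes ** 2 - n_nodes)

    def quality(partitions):
        # average connectivity over P = {P1, ..., Pn}
        return sum(connectivity(v, e) for v, e in partitions) / len(partitions)

    print(quality([(4, 2), (4, 2), (2, 1)]))  # (0.33 + 0.33 + 1.0) / 3 ≈ 0.56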

  20. 2. Temporal partitioning • Scheduling: given a DFG and an architecture consisting of a set of processing elements, • compute the starting time of each node on a given resource • Temporal partitioning: given a DFG and a reconfigurable device, • the starting time of each node is the starting time of the partition to which it belongs • compute the starting time of each node on the device • Solution approaches: • List scheduling • Integer Linear Programming • Network flow • Spectral methods

  21. 2.1 Unconstrained Scheduling • ASAP (as soon as possible) • Defines the earliest starting time for each node in the DFG • Computes a minimal latency • ALAP (as late as possible) • Defines the latest starting time for each node in the DFG according to a given latency • The mobility of a node is the difference between its ALAP starting time and its ASAP starting time • Mobility 0 ⇒ the node is on a critical path

  22. 2.1 ASAP Example • Unconstrained scheduling with optimal latency L = 4 • [Figure: ASAP schedule of the example DFG over time steps 0–4]

  23. 2.1 ASAP Algorithm (a Python version follows below):

  ASAP(G(V,E), d) {
    FOREACH (v_i without predecessor)
      s(v_i) := 0;
    REPEAT {
      choose a node v_i whose predecessors are all planned;
      s(v_i) := max{j : (v_j, v_i) ∈ E} {s(v_j) + d_j};
    } UNTIL (all nodes v_i are planned);
    RETURN s;
  }
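The same algorithm in Python (a sketch): graphlib takes care of the "all predecessors planned" bookkeeping; preds maps each node to its predecessor list and d to its latency.

    from graphlib import TopologicalSorter  # Python 3.9+

    def asap(preds, d):
        s = {}
        for v in TopologicalSorter(preds).static_order():
            # nodes without predecessors start at 0; all others start as soon
            # as their slowest predecessor has finished
            s[v] = max((s[u] + d[u] for u in preds[v]), default=0)
        return s

    print(asap({"v1": [], "v2": [], "v3": ["v1", "v2"]},
               {"v1": 1, "v2": 2, "v3": 1}))
    # {'v1': 0, 'v2': 0, 'v3': 2}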

  24. 2.1 ALAP Example • Unconstrained scheduling with optimal latency L = 4 • [Figure: ALAP schedule of the example DFG over time steps 0–4]

  25. 2.1 Mobility • [Figure: the ASAP and ALAP schedules over time steps 0–4 with the mobility (0, 1 or 2) annotated at each node]

  26. 2.1 ALAP Algorithm:

  ALAP(G(V,E), d, L) {
    FOREACH (v_i without successor)
      s(v_i) := L − d_i;
    REPEAT {
      choose a node v_i whose successors are all planned;
      s(v_i) := min{j : (v_i, v_j) ∈ E} {s(v_j)} − d_i;
    } UNTIL (all nodes v_i are planned);
    RETURN s;
  }
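A matching Python sketch: passing the successor map to TopologicalSorter makes it visit a node only after all of its successors, which is exactly what the REPEAT loop requires; mobility then falls out by subtracting the ASAP schedule.

    from graphlib import TopologicalSorter

    def alap(succs, d, L):
        s = {}
        for v in TopologicalSorter(succs).static_order():
            # all successors of v are planned at this point; nodes without
            # successors start at L - d_i
            s[v] = min((s[w] for w in succs[v]), default=L) - d[v]
        return s

    def mobility(s_asap, s_alap):
        # mobility 0 identifies the nodes on a critical path
        return {v: s_alap[v] - s_asap[v] for v in s_asap}

    print(alap({"v1": ["v3"], "v2": ["v3"], "v3": []},
               {"v1": 1, "v2": 2, "v3": 1}, 4))
    # {'v3': 3, 'v1': 2, 'v2': 1}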

  27. 2.1 Constrained scheduling • Extended ASAP/ALAP • Compute ASAP or ALAP • Move tasks earlier (ALAP) or later (ASAP) until the resource constraints are fulfilled • List scheduling • A list L of ready-to-run tasks is created • Tasks are placed in L in decreasing priority order • At a given step, a free resource is assigned the task with the highest priority • Priority criteria can be: number of successors, mobility, connectivity, etc. (see the sketch below)
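A minimal resource-constrained list scheduler (a sketch with unit time steps, not the book's implementation): priority is any key function, e.g. the number of successors; rtype maps a node to its resource type and avail gives the number of instances per type.

    def list_schedule(preds, d, priority, avail, rtype):
        s = {}                              # node -> start time
        done, running = set(), {}           # finished nodes; node -> finish time
        remaining, t = set(preds), 0
        while remaining or running:
            done |= {v for v, f in running.items() if f <= t}
            running = {v: f for v, f in running.items() if f > t}
            # ready list: unplanned nodes whose predecessors have all finished,
            # ordered by decreasing priority
            ready = sorted((v for v in remaining
                            if all(u in done for u in preds[v])),
                           key=priority, reverse=True)
            used = {}
            for v in running:
                used[rtype[v]] = used.get(rtype[v], 0) + 1
            for v in ready:                 # assign free resources by priority
                if used.get(rtype[v], 0) < avail[rtype[v]]:
                    s[v], running[v] = t, t + d[v]
                    used[rtype[v]] = used.get(rtype[v], 0) + 1
                    remaining.discard(v)
            t += 1
        return s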

  28. 2.1 Extended ASAP, ALAP • Resources: 2 multipliers, 2 ALUs (+, −, <) • [Figure: the resulting schedule over time steps 0–4]

  29. 2.1 Constrained scheduling • Criterion: number of successors • Resources: 1 multiplier, 1 ALU (+, −, <) • [Figure: the example DFG with the successor count annotated at each node]

  30. 2.1 Constrained scheduling • [Figure: the resulting schedule over time steps 0–7 with one multiplier and one ALU]

  31. 2.1 Temporal partitioning vs constrained scheduling • List scheduling is used to keep an ordering between the nodes • List-scheduling (LS) partitioning (sketched in code below): • 1. Construct a list L of all nodes, ordered by priority • 2. Create a new empty partition Pact • 2.1 Remove a node from the list and place it in the partition • 2.2 If size(Pact) ≤ size(R) and T(Pact) ≤ T(R), go to 2.1; else go to 2.3 • 2.3 If the list is empty, stop; else go to 2
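The numbered steps translate into a short Python sketch; for brevity only the area constraint of step 2.2 is kept (the timing test T(Pact) ≤ T(R) is omitted), and node_list is assumed to be already ordered by priority in a precedence-compatible way.

    def ls_partition(node_list, size, device_size):
        partitions, current, used = [], [], 0
        for v in node_list:
            if used + size[v] > device_size:  # step 2.2 fails: close Pact, new one
                partitions.append(current)
                current, used = [], 0
            current.append(v)                 # step 2.1: place the node in Pact
            used += size[v]
        if current:                           # step 2.3: list empty, emit last Pact
            partitions.append(current)
        return partitions

    # with size(mult) = 100, size(add) = 20 and a 250-LUT device:
    print(ls_partition(["m1", "m2", "m3", "a1"],
                       {"m1": 100, "m2": 100, "m3": 100, "a1": 20}, 250))
    # [['m1', 'm2'], ['m3', 'a1']]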

  32. 2.1 Temporal partitioning vs constrained scheduling • Criterion: number of successors • size(FPGA) = 250, size(mult) = 100, size(add) = size(sub) = 20, size(comp) = 10 • [Figure: the example DFG with the successor count annotated at each node]

  33. 2.1 Temporal partitioning vs constrained scheduling • [Figure: a partitioning of the example DFG into P1, P2 and P3] • Connectivity: c(P1) = 1/6, c(P2) = 1/3, c(P3) = 2/6 • Quality: 0.83

  34. 2.1 Temporal partitioning vs constrained scheduling • [Figure: an alternative partitioning of the example DFG into P1, P2 and P3] • Connectivity: c(P1) = 2/10, c(P2) = 2/3, c(P3) = 2/3 • Quality: 1.2 • The connectivity is better than in the previous partitioning

  35. 2.1 Temporal partitioning vs constrained scheduling • ASAP-based partitioning: • Process the nodes in topological order • Assign a level number to each node • Schedule the nodes according to their level number • Drawback • "Levelization": assigning nodes to partitions based on their level number increases the data exchange between partitions • Advantages • Fast (linear run time) • Local optimization possible • [Figure: a DFG levelized into levels 0–3]

  36. 2.1 Temporal partitioning vs constrained scheduling • Local optimization by configuration switching: • If two consecutive partitions P1 and P2 share a common set of operators, then: • Implement the minimal set of operators needed for the two partitions • Use signal multiplexing to switch from one partition to the next • Drawback: more resources are needed to implement the signal switching • Advantages: • The reconfiguration time is reduced • The device's operation is not interrupted

  37. 2.1 Temporal partitioning vs constrained scheduling • [Figure: Configuration 1 (Mult, Add and Sub operators on inputs a–e) and Configuration 2 (operating on intermediate values f–j), communicating through an inter-configuration register]



  40. 2.1 Temporal partitioning vs constrained scheduling • Improved list-scheduling algorithm (sketched in code below): • 1. Generate the list of nodes node_list • 2. Build a first partition P1 • 3. while (!node_list.empty()) • 4. build a new partition P2 • 5. If union(P1, P2) fits on the device, then implement configuration switching with P1 and P2 • 6. else set P1 = P2 and go to 3 • 7. Exit
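A hedged Python transcription of these steps: build_partition is assumed to consume nodes from node_list and return the next partition, and fits_together to test whether the union of two partitions fits on the device; how a merged pair is actually implemented (shared operators plus multiplexers) is left abstract.

    def improved_ls(node_list, build_partition, fits_together):
        configs = []
        p1 = build_partition(node_list)
        while node_list:
            p2 = build_partition(node_list)
            if fits_together(p1, p2):
                # one bitstream implementing both partitions, switched via muxes
                configs.append(("switching", p1, p2))
                if not node_list:
                    return configs
                p1 = build_partition(node_list)
            else:
                configs.append(("plain", p1))  # P1 stays a configuration of its own
                p1 = p2
        configs.append(("plain", p1))
        return configs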

  41. 2.2 Temporal partitioning – ILP • With ILP (Integer Linear Programming), • the temporal partitioning constraints are formulated as equations • The equations are then solved using an ILP solver • The constraints usually considered are: • Uniqueness constraint • Temporal order constraint • Memory constraint • Resource constraint • Latency constraint • Notation: y_ik ∈ {0, 1} is a binary variable with y_ik = 1 iff task v_i is placed in partition P_k

  42. 2.2 Temporal partitioning – ILP • Unique assignment constraint: each task must be placed in exactly one partition: for all i, Σ_k y_ik = 1 • Precedence constraint: for each edge (v_i, v_j) in the graph, v_i must be placed either in the same partition as v_j or in an earlier one: Σ_k k·y_ik ≤ Σ_k k·y_jk • Resource constraint: the sum of the resources needed to implement the modules in one partition must not exceed the total amount of available resources • Device area constraint: for each partition P_k, Σ_i a_i·y_ik ≤ a(R) • Device terminal constraint: the wires crossing a partition boundary must not exceed the available device terminals (pins)
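The constraints can be written down almost verbatim with an off-the-shelf ILP solver. Below is a minimal sketch using the PuLP package on a hypothetical four-task example; the objective (minimizing the index of the last used partition, i.e. the number of configurations) is one plausible choice, not prescribed by the slides.

    import pulp

    tasks = ["v1", "v2", "v3", "v4"]                  # hypothetical task set
    area = {"v1": 100, "v2": 50, "v3": 50, "v4": 100}
    edges = [("v1", "v2"), ("v2", "v4"), ("v3", "v4")]
    K = range(3)                                      # at most 3 partitions
    device_area = 200

    prob = pulp.LpProblem("temporal_partitioning", pulp.LpMinimize)
    y = pulp.LpVariable.dicts("y", (tasks, K), cat="Binary")  # y[i][k] = 1 iff v_i in P_k

    for i in tasks:                                   # unique assignment
        prob += pulp.lpSum(y[i][k] for k in K) == 1
    for (i, j) in edges:                              # precedence
        prob += (pulp.lpSum(k * y[i][k] for k in K)
                 <= pulp.lpSum(k * y[j][k] for k in K))
    for k in K:                                       # device area
        prob += pulp.lpSum(area[i] * y[i][k] for i in tasks) <= device_area

    last = pulp.LpVariable("last", 0, len(K) - 1)     # index of last used partition
    for i in tasks:
        prob += pulp.lpSum(k * y[i][k] for k in K) <= last
    prob += last                                      # objective: few partitions
    prob.solve()
    print({i: next(k for k in K if y[i][k].value() == 1) for i in tasks})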

  43. 2.3 Temporal partitioning – Network-flow • Recursive bipartitioning: • The goal at each step is the generation of a unidirectional bipartition, • i.e. to compute a bipartition which minimizes the edge-cut size between the two partitions • Network-flow methods are used to compute a bipartition with minimal edge-cut size • Directly applying the min-cut max-flow theorem may lead to a non-unidirectional cut • Therefore, the original graph G is first transformed into a new graph G' in which each cut is unidirectional • [Figure: unidirectional recursive bipartitioning vs a bidirectional cut]

  44. 2.3 Temporal partitioning – Network-flow • Two-terminal net transformation (sketched in code below): • Replace an edge (v1, v2) with two edges: (v1, v2) with capacity 1 and (v2, v1) with infinite capacity • Multi-terminal net transformation: • For a multi-terminal net {v1, v2, ..., vn}, • introduce a dummy node v with no weight and a bridging edge (v1, v) with capacity 1 • Introduce the edges (v, v2), ..., (v, vn), each of which is assigned capacity 1 • Introduce the edges (v2, v1), ..., (vn, v1), each of which is assigned an infinite capacity • Having computed a min-cut in the transformed graph G', a min-cut can be derived in G: for each node of G' assigned to a partition, its counterpart in G is assigned to the corresponding partition
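Both transformations can be expressed as one small function that builds the capacity map of G' (a sketch; the result can be fed into any max-flow/min-cut routine, e.g. one from networkx). INF marks the edges a min-cut may never cross.

    INF = float("inf")

    def transform(two_terminal_edges, multi_terminal_nets):
        cap = {}                        # (u, v) -> capacity in G'
        for (v1, v2) in two_terminal_edges:
            cap[(v1, v2)] = 1           # forward edge with capacity 1
            cap[(v2, v1)] = INF         # reverse edge with infinite capacity
        for net_id, (v1, *sinks) in enumerate(multi_terminal_nets):
            v = ("dummy", net_id)       # dummy node with no weight
            cap[(v1, v)] = 1            # bridging edge (v1, v) with capacity 1
            for vi in sinks:
                cap[(v, vi)] = 1        # edges (v, v2), ..., (v, vn), capacity 1
                cap[(vi, v1)] = INF     # edges (v2, v1), ..., (vn, v1), infinite
        return cap

    cap = transform([("a", "b")], [("s", "t1", "t2")])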
