
Buffer-sizing for Precedence Graphs on Restricted Multiprocessor Architectures








  1. Buffer-sizing for Precedence Graphs on Restricted Multiprocessor Architectures Thomas Feng Yang Yang Mentors: Qi Zhu, Abhijit Davare

  2. Outline • Motivation • Previous Work • Preliminaries • Problem Statement • Investigative Approach • Summary and Conclusion • Compare with Prior Work • Our Contribution • Future Work

  3. Outline • Motivation • Previous Work • Preliminaries • Problem Statement • Investigative Approach • Summary and Conclusion • Future Work

  4. Parallel Heterogeneous Platforms (PHPs) • Advantages: • High computational capability • Challenges: • Exploiting the theoretically high performance in practice (From Abhijit Davare’s Quals Presentation)

  5. Project Goals: Deploying applications on these platforms • Dataflow Programming Model: • Infinite buffers • Blocking read, non-blocking write • Many scheduling and allocation techniques • Multiprocessor Platform: • Limited connectivity between processors • Limited, finite-depth FIFOs • Low-overhead reads and writes to FIFOs

  6. Deploying applications on PHPs • Computation Synthesis: • Task Allocation • Task Scheduling • Communication Synthesis: • Interconnection Synthesis • Buffer sizing (The part we are working on)

  7. Buffer Sizing • Architectures have bounded buffer resources. • If more communication buffer resources are utilized, processors may spend less time waiting to send/receive data. • Additional buffer resources may adversely affect communication overhead, achievable clock speed, or design closure.

  8. Example Flow (figure): a function model (JPEG-style pipeline: Pre-processing, Color Conv., DCT, ZigZag Scan, Quantization, RLE, Huffman) and an architecture model (ISP1, ISP2) are combined through Allocation + Scheduling, followed by Buffer Sizing.

  9. Outline • Motivation • Previous Work • Preliminaries • Problem Statement • Investigative Approach • Summary and Conclusion • Future Work

  10. Previous Work • Transformations from various statically-schedulable dataflow variants into precedence DAGs [1] • Survey on allocation and scheduling algorithms for precedence DAGs assuming infinite-length buffers [2] • Minimizing buffer requirements for uniprocessor architectures [3] • Minimizing multiprocessor buffer sizing for SDF applications under conservative (non-interleaving) conditions [4] [1] Software Synthesis and Code Generation for Signal Processing Systems. S. Bhattacharyya, R. Leupers, P. Marwedel - IEEE Transactions on Circuits and Systems, 2000. [2] Static Scheduling Algorithms for Allocating Directed Task Graphs to Multiprocessors. Y.K. Kwok, I. Ahmad - ACM Computing Surveys, 1999. [3] Minimizing Buffer Requirements of Synchronous Dataflow Graphs with Model Checking. M. Geilen, T. Basten, S. Stuijk - DAC 2005. [4] Data Memory Minimization for Synchronous Data Flow Graphs Emulated on DSP-FPGA Targets. M. Adé, R. Lauwereins, J.A. Peperstraete - DAC 1997.

  11. Outline • Motivation • Previous Work • Preliminaries • Problem Statement • Investigative Approach • Summary and Conclusion • Future Work

  12. Preliminaries • Precedence DAG • A precedence DAG is a common representation for the deployment of an application across multiple processors. • A precedence DAG can be generated from statically schedulable dataflow descriptions, such as synchronous dataflow or cyclo-static dataflow. This is suitable for most applications in the multimedia domain. [4][5] • Our synthesis process starts from a precedence DAG. [4] A Hierarchical Multiprocessor Scheduling System for DSP Applications. Jose Luis Pino, Edward A. Lee, Shuvra S. Bhattacharyya - 29th Asilomar Conference on Signals, Systems and Computers, 1995. [5] Dataflow Process Networks. Edward A. Lee, Thomas M. Parks - Proceedings of the IEEE, 1995.

  13. Preliminaries • Synthesis process: relationships among allocation, scheduling and buffer sizing • Allocation: assign each node in the precedence DAG to a particular processor in the architecture. • Scheduling: specify an execution sequence for the set of tasks on each processor. • Buffer sizing: assign sizes to inter-processor communication channels. • In our approach, allocation and scheduling are done assuming unbounded communication buffer sizes. Buffer sizing is then based on the result of allocation and scheduling.

  14. Preliminaries • Artificial deadlock is deadlock that results when the size of buffers between processors is reduced from infinity to some finite number. [6] In buffer sizing, we want to minimize the objective function while avoiding artificial deadlock. (“Deadlock” means artificial deadlock in the following slides.) [6] Requirements on the Execution of Kahn Process Networks. Marc Geilen and Twan Basten - Programming Languages and Systems: 12th European Symposium on Programming, ESOP 2003.

  15. Outline • Motivation • Previous Work • Preliminaries • Problem Statement • Investigative Approach • Summary and Conclusion • Future Work

  16. Problem Statement • Starting point: allocation and scheduling with unbounded communication buffer size. Bounding the communication buffer size may introduce artificial deadlock, which can be resolved by using communication buffers or internal buffers. • P1: Can we always find a legal scheduling with one-place communication buffers by using internal buffers, assuming we have a legal scheduling for unbounded communication buffers? • P2: Does using internal buffers increase the makespan or not? • P3: Minimize the total (or largest) communication buffer size (a similar problem: minimize the total (or largest) internal buffer size). • P4: Give an optimal internal buffer assignment.

  17. Assumptions • Interleaving communication: for inter-processor communication, when the write and read tasks are both active, they can communicate any amount of data through a one-place buffer.

  18. Outline • Motivation • Previous Work • Preliminaries • Problem Statement • Investigative Approach • Summary and Conclusion • Future Work

  19. Classification of Blocked Nodes • In a precedence DAG, we classify the nodes that can be blocked during execution into three kinds. • Read blocked node -- the node is blocked because it cannot read in enough tokens. • Write blocked node -- the node is blocked because it cannot finish writing all the produced tokens. • Scheduling blocked node -- the node cannot be fired because its previous node on the same processor has not finished execution.

  20. Our Observations of Deadlock • We proved that it is impossible to have deadlock with only scheduling blocked nodes and read blocked nodes. • We proved that if a precedence DAG has deadlock, then it must contain at least one pattern called a “write blocked cycle”, in which: - all the schedule edges are in the same direction; - there must be 1 or more write blocked nodes, whose incoming degree is 0 in the cycle; - there may be read blocked nodes, whose incoming degree is one or more in the cycle; - if the directed data edges from all the write blocked nodes are reversed, it becomes a directed cycle. Note: not every write blocked edge must be in a write blocked cycle.

  21. Write Blocked Cycle (figure): processors Pi and Pj hold chains of nodes ni, ni+1, …, ni+m and nj, nj+1, …, nj+n (m >= 1, n >= 1), connected by series of data/schedule edges in which the schedule edges point in the same direction as the schedule edge from ni to ni+1, while the data edges may point in either direction. The cycle blocks when the buffer space from Pi to Pj is smaller than the token count on the write edge from ni to nj+n.

  22. Examples of Write Blocked Cycle (figures): Fig. 1 and Fig. 2 show tasks a–f scheduled on processors P1, P2, P3 with Buffer[Pi][Pj] = 1 for i, j = 1, 2, 3; the marked edges carrying 2 tokens exceed the one-place buffers and become write blocked.

  23. How to Avoid Deadlock • There is no artificial deadlock in a precedence DAG if and only if there is no write blocked cycle in the graph. Proof: by contradiction. (Omitted here.) • We can resolve the write blocked cycles by using enough communication buffer or internal buffer.

  24. How to Avoid Deadlock (figure: node a on Pi writes Wa tokens over buffer Bij to node d on Pj; b follows a on Pi, c precedes d on Pj) • When Wa > Space(Bij), a is write blocked. The deadlock is solved by: • - increasing the communication buffer size to hold all the Wa tokens; • - or increasing the internal buffer size in Pi to hold all the Wa tokens, and then writing them to d after b and c; • - or reading all the tokens before c, and increasing the internal buffer size in Pj to hold all the Wa tokens.

  25. How to Avoid Deadlock (two figures: in the first, a and b on Pi both write over Bij to nodes on Pj; in the second, a writes over Bij from Pi to Pj while e writes over Bmn from Pm to Pn) • Wa + Wb > Space(Bij): a and b are write blocked. The deadlock is solved by increasing the buffer size to hold all the Wa + Wb tokens, or by increasing the internal buffer in Pi or Pj. • Wa > Space(Bij) and We > Space(Bmn): a and e are write blocked. The deadlock is solved by increasing the size of Bij to hold all the Wa tokens, or increasing the size of Bmn to hold all the We tokens, or using internal buffers.

  26. Problem Statement (revisiting P1) • P1: Can we always find a legal scheduling with one-place communication buffers by using internal buffers, assuming we have a legal scheduling for unbounded communication buffers? • P2: Does using internal buffers increase the makespan or not? • P3: Minimize the total (or largest) communication buffer size (a similar problem: minimize the total (or largest) internal buffer size). • P4: Give an optimal internal buffer assignment.

  27. Using Internal Buffers (figure: sequential order SS over nodes a, b, c, …, n) • We can always find a legal scheduling S1 with one-place communication buffers by using internal buffers, assuming we have a legal scheduling SU requiring unbounded communication buffers. Proof: Since the scheduling SU (a partial order) is legal, we can always find a sequential order SS to execute the nodes, which is also legal. We can get SS by arbitrarily assigning a total order conforming to the partial order. (to be continued)
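The total-order construction in this proof is just a topological sort of the precedence DAG. A minimal Python sketch, using an illustrative four-node DAG that is not from the slides:

```python
from graphlib import TopologicalSorter

# Illustrative precedence DAG: each node maps to its predecessors.
# Any topological order of this partial order is a legal sequential
# schedule SS, because every node runs only after all of its
# predecessors have finished.
precedence = {
    "a": [],
    "b": ["a"],
    "c": ["a"],
    "d": ["b", "c"],
}

SS = list(TopologicalSorter(precedence).static_order())
print(SS)  # one valid order, e.g. ['a', 'b', 'c', 'd']
```

Any total order conforming to the partial order works here, which is why the proof can pick one arbitrarily.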

  28. (figure: the first write blocked node x writes to nodes y1, …, yk) Let the communication buffer between every two processors be one place. If there is deadlock in SS, the first blocked node must obviously be a write blocked node. Simulate this execution. Let x be the first blocked node; it writes tokens to nodes y1, …, yk. The block can be eliminated by letting x write all the produced tokens to internal buffers: for every node yi on processor pi, write the tokens from x to the internal buffer of pi. The corresponding reading code is inserted right after the already executed code of pi. Therefore, the deadlock at x is solved, and the following execution is not affected by this solution. Repeat this process until all the write blocked nodes are eliminated. Consequently, a legal schedule is found by using internal buffers.

  29. Problem Statement (revisiting P2) • P1: Can we always find a legal scheduling with one-place communication buffers by using internal buffers, assuming we have a legal scheduling for unbounded communication buffers? • P2: Does using internal buffers increase the makespan or not? • P3: Minimize the total (or largest) communication buffer size (a similar problem: minimize the total (or largest) internal buffer size). • P4: Give an optimal internal buffer assignment.

  30. Makespan • Makespan is the maximum completion time over the set of processors. • Assumptions 1. Inter-processor communication takes place through bounded-depth FIFOs with blocking reads and writes 2. Unlimited internal buffer space is available on each processor • Conjecture For a task precedence graph, if insufficient FIFO depth leads to deadlock, reading and writing can be reordered in such a way that deadlock is eliminated and the makespan is not affected. • Counterexample The example is scheduled in such a way that multiple paths are relatively critical. Reordering the reads and writes to eliminate the deadlock increases the length of some of the relatively critical paths, extending the makespan, even if tx/rx time << computation time.

  31. (figure: example schedule of tasks a–i on five processors P1–P5, annotated with task durations and completion times up to 315) • Communication Model: Tx/Rx time: 5 units; latency: 0 units.

  32. In the example, edges a->d and c->g may be blocked due to insufficient FIFO depth. Without increasing the FIFO depth, there are 4 ways to resolve this: • Move a->d communication after b • Move a->d communication before c • Move c->g communication after d • Move c->g communication before f • Options 1 and 3 delay d and g by a large amount, and increase the makespan significantly • Options 2 and 4 extend the critical paths that end at h and i

  33. Problem Statement (revisiting P3) • P1: Can we always find a legal scheduling with one-place communication buffers by using internal buffers, assuming we have a legal scheduling for unbounded communication buffers? • P2: Does using internal buffers increase the makespan or not? • P3: Minimize the total (or largest) communication buffer size (a similar problem: minimize the total (or largest) internal buffer size). • P4: Give an optimal internal buffer assignment.

  34. NP-hard Problem • Formally, the problem DEADLOCK-FREE-MIN-BUFFER (DFMB) is defined as follows: given a precedence DAG D, find the minimal total buffer size such that there is no deadlock in D. • The problem DEADLOCK-FREE-MIN-BUFFER is NP-hard. Proof: We prove it by reducing the FEEDBACK ARC SET (FAS) problem, which is known to be NP-complete, step by step to the DFMB problem.

  35. The FEEDBACK ARC SET (FAS) problem is the following: given a directed graph G=(V, E) and a positive integer K, does there exist a subset B ⊆ E with |B| ≤ K, such that B contains at least one edge from every directed cycle in G? • This problem is known to be NP-complete [7] [7] Computers and Intractability. M. R. Garey and D. S. Johnson - W. H. Freeman and Co., NY, 1979.

  36. First, we prove the FAS problem can be reduced to the problem below: Problem B: Given a directed graph G=(V, E) with weight w(e) on every edge e ∈ E, find a subset B ⊆ E minimizing Σe∈B w(e), such that B contains at least one edge from every directed cycle in G. • Then we prove Problem B can be reduced to the DFMB problem, by showing that an arbitrary instance X of Problem B can be transformed into an instance X’ of the DFMB problem in polynomial time, and the result of X is obtained by solving X’.

  37. Correspondence between Instance X and Instance X’: • A vertex in G(V, E) maps to the corresponding nodes and schedule edge in D(N, E’). • An edge in G(V, E) maps to the corresponding data edge in D(N, E’).

  38. A directed cycle in G(V, E) maps to a write blocked cycle in D(N, E’), where E’ = DE ∪ SE.

  39. Solution to Instance X = min{ Σe∈B w(e) | B ⊆ E and B contains at least one edge from every directed cycle in G } = min{ Σe∈M w(e) | M ⊆ DE and M contains at least one edge from every write blocked cycle in D } = Solution to Instance X’

  40. Algorithms

  41. Minimizing Maximum FIFO Size (1) • Mathematical Model: • V = {v1, v2, …, vm}. The set of vertices. • P = {p1, p2, …, pl}. The set of processors. • M: V → P. Mapping from vertices to the processors they are scheduled on. • E = {e1, e2, …, en}. The set of edges. • S = {e | e ∊ E ∧ M(src(e)) = M(des(e))}. The set of schedule edges. • D = {e | e ∊ E ∧ M(src(e)) ≠ M(des(e))}. The set of data edges. • W: D → R+. The weight function. • F: P × P → R+. The function that returns the FIFO size. F(pi, pj) need not be equal to F(pj, pi).
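The partition of E into schedule edges S and data edges D falls directly out of the mapping M. A small Python sketch of the model; vertex names, processor names, and weights are illustrative, not from the slides:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Edge:
    src: str     # source vertex
    des: str     # destination vertex
    weight: int  # W(e): tokens carried (meaningful for data edges)

# M: mapping from vertices to processors (illustrative values).
M = {"v1": "p1", "v2": "p1", "v3": "p2"}

E = [Edge("v1", "v2", 0), Edge("v2", "v3", 4)]

# S: both endpoints on the same processor; D: endpoints on different ones.
S = [e for e in E if M[e.src] == M[e.des]]
D = [e for e in E if M[e.src] != M[e.des]]
```

Here v1 -> v2 stays on p1 and becomes a schedule edge, while v2 -> v3 crosses from p1 to p2 and becomes a data edge.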

  42. Minimizing Maximum FIFO Size (2) • Formalizing the problems: • Find an algorithm such that given a schedule <V, P, M, E, W>, find a valid F function, such that max{F(pi, pj)} is minimized (a.k.a. min max problem). • With interleaving communication. • Without interleaving communication. • Find an algorithm such that given a schedule <V, P, M, E, W>, find a valid F function, such that ∑{F(pi, pj)} is minimized (a.k.a. min total problem).

  43. Min Max Problem (1) (figure: DAG over vertices a, b, c, d, e) • Free vertices: the vertices with no incoming edges (a and c in the figure). • Free edges: the edges starting from free vertices (ab, ad, cd, ce in the figure). • Our algorithm always deals with free edges. When one free edge is resolved with the algorithm, some other edges may become free.
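The free vertices and free edges of the slide's example can be computed mechanically. A short sketch, with the edge list transcribed from the figure:

```python
# Edges of the example DAG: a->b, a->d, c->d, c->e.
edges = [("a", "b"), ("a", "d"), ("c", "d"), ("c", "e")]
vertices = {v for edge in edges for v in edge}

# Free vertices have no incoming edges; free edges start from them.
with_incoming = {dst for _, dst in edges}
free_vertices = vertices - with_incoming
free_edges = [e for e in edges if e[0] in free_vertices]

print(sorted(free_vertices))  # ['a', 'c']
```

Resolving a free edge removes it, which may empty another vertex's incoming set and make new edges free, matching the slide's note.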

  44. Dependency Graph (figure: a precedence DAG over a, b, c, d and the dependency graph derived from it) • Dependency graph: a graph constructed from the precedence DAG by making all the data edges bidirectional. • A data edge implies bidirectional dependency between the two vertices. A schedule edge is still unidirectional.
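Constructing the dependency graph is a one-pass transformation. A minimal sketch, where the function name and adjacency-set representation are my own:

```python
def dependency_graph(data_edges, schedule_edges):
    """Adjacency sets of the dependency graph: every data edge
    becomes bidirectional, schedule edges stay unidirectional."""
    adj = {}
    def add(u, v):
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set())
    for u, v in data_edges:
        add(u, v)
        add(v, u)  # data edges are traversable in both directions
    for u, v in schedule_edges:
        add(u, v)
    return adj

# Data edge a->d, schedule edge a->b (an illustrative fragment).
adj = dependency_graph([("a", "d")], [("a", "b")])
```

The data edge appears in both adjacency sets, while the schedule edge appears only in its source's set.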

  45. Min Max Problem (2) (figure: the four edge types #1–#4 shown on small graphs over a, b, c, d) • 4 types of free edges (priority #1 > #2 > #3 > #4): • #1: Free schedule edge whose source has no other outgoing edges. • #2: Free data edge between two free vertices (ignoring the incoming data edges to the second vertex). • #3: Free data edge that is not #2 and is not in a dependency cycle. • #4: Free data edge that is not #2 and is in a dependency cycle.

  46. Min Max Problem (3) (figure: examples of #1 and #2 edges) • #1: Just delete it, because a can finish immediately; b becomes free. • #2: Just delete it, because a and c can run simultaneously with interleaving communication.

  47. Min Max Problem (4) (figure: examples of #3 and #4 edges) • #3: Just delete it, because d will be ready later, and a just needs to wait. • #4: Resolve blocking before deleting the edge. Increase the FIFO size if no space is left; otherwise, use the existing space first.

  48. Choice of Free Edges (Min Max) (figure: example DAG over v0–v7 on processors P0, P1, P2 with edge weights 2 and 3) • If edges of type #1, #2 or #3 exist, remove them first. • If only edges of type #4 are left, choose one of them to resolve in a greedy manner: • Among the edges of type #4, always pick the edge e such that F(M(src(e)), M(des(e))) is minimal after e is resolved.
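The greedy step for type-#4 edges reduces to a min over candidate resolutions. A toy sketch, where the candidate list and post-resolution FIFO sizes are invented stand-ins for the real schedule state:

```python
# Each candidate pairs a type-#4 free edge with the FIFO size its
# processor pair would have after resolving that edge (made-up values).
candidates = [
    (("v0", "v4"), 3),
    (("v1", "v5"), 2),
    (("v3", "v7"), 5),
]

# Greedy rule: pick the edge whose resolution leaves the smallest FIFO
# on its (source processor, destination processor) pair.
best_edge, best_size = min(candidates, key=lambda c: c[1])
print(best_edge, best_size)  # ('v1', 'v5') 2
```

Because the objective is the maximum FIFO size over all processor pairs, growing the smallest candidate is the locally safest move, which the optimality proof on the next slide makes precise.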

  49. Min Max: Proof of Optimality • Induction. G is the complete precedence DAG. At step i, Gi is the sub-graph we have solved; G – Gi is the sub-graph with only the remaining edges; Fi is the F function at step i. • Base case: G0 is empty, so G0 is optimal. • Induction step: Assume Gk is optimal (max{Fk(M(src(e)), M(des(e))) | e ∊ Gk} is minimal). Prove Gk+1 is also optimal. Gk+1 is obtained either by removing an edge of type #1, #2 or #3 (in which case Gk+1 is obviously optimal), or by updating the FIFO for an edge of type #4. In the latter case, we always pick an edge ek+1 such that Fk+1(M(src(ek+1)), M(des(ek+1))) is minimal among such edges. Then max{Fk+1(M(src(e)), M(des(e))) | e ∊ Gk+1} = max( max{Fk(M(src(e)), M(des(e))) | e ∊ Gk}, Fk+1(M(src(ek+1)), M(des(ek+1))) ) is also minimal. So Gk+1 is optimal. • This proof does not work with min total.

  50. Linear Cycle Detection Algorithm (figure: data edge between a and d in the dependency graph over a, b, c, d) • To decide whether the data edge from a to d is in a cycle: • Without considering edge (a, d) in the dependency graph, can we find d by traversing the graph from a? • Without considering edge (d, a) in the dependency graph, can we find a by traversing the graph from d? • If either case is true, then return true; otherwise, return false.
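The two reachability checks on this slide translate directly into a DFS. A sketch, with the adjacency-set representation and function name being my own:

```python
def in_dependency_cycle(adj, a, d):
    """Return True iff the data edge between a and d lies on a cycle
    of the dependency graph, per the two traversals on the slide."""
    def reachable(src, dst, skipped_edge):
        seen, stack = set(), [src]
        while stack:
            u = stack.pop()
            if u == dst:
                return True
            if u in seen:
                continue
            seen.add(u)
            for v in adj.get(u, ()):
                if (u, v) != skipped_edge:
                    stack.append(v)
        return False
    # Ignore (a, d): can we still reach d from a?  And vice versa.
    return reachable(a, d, (a, d)) or reachable(d, a, (d, a))

# a--d is a bidirectional data edge; a->b->d provides an alternate path.
adj = {"a": {"d", "b"}, "b": {"d"}, "d": {"a"}}
print(in_dependency_cycle(adj, "a", "d"))  # True
```

Each traversal visits every edge at most once, which is what makes the detection linear in the size of the dependency graph.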
