Introduction to High-Level Synthesis: Optimizing System Architecture through HLS Flow
This document gives an overview of High-Level Synthesis (HLS), following the flow from code or algorithm to architecture. It describes how three stages that were classically performed in sequence (code transformation, binding, and allocation) are now performed together for better optimization. The presentation includes simple high-level synthesis examples covering scheduling, resource allocation, and the implementation of delay nodes. It also analyzes the effect of overlapped scheduling on throughput: with one multiplier and one adder, the overlapped schedule approaches 0.5 outputs per clock cycle, versus about 0.33 for the non-overlapped schedule.
Presentation Transcript
ECE 565: High-Level Synthesis, Introduction. Shantanu Dutt, ECE Dept., UIC
HLS Flow • Code/Algorithm → Architecture (interconnected functional units (FUs) and memory units (MUs), connected via muxes, demuxes, tristate buffers, buses, and dedicated interconnects) • Classically, these 3 stages were performed sequentially, but they are currently performed together (which leads to better optimization)
HLS Flow (contd) (Binding) Allocation: Simple counting of FUs after the above 2 stages
Simple HLS Examples (contd)
2) Mapping to h/w w/ constraints: use only 1 multiplier (X) and 1 adder (+)
a) Non-overlapped scheduling
[Figure: datapath with registers a, b, c, d, x, y, z (load signals lda, ldb, ldc, ldd, ldx, ldy, ldz), one multiplier (X), and one adder (+) with input muxes mux1 and mux2 (inputs I0, I1) and an output demux]
[Figure: schedule over cc's 1 to 6. Multiplier (X): c1(1), c1(2). Adder (+): c2(1), c3(1), c2(2), c3(2)]
Controller FSM (entered on Reset; states labeled cc 3i, cc 3(i+1), cc 3(i+2) in the figure), with the control signals asserted for each register transfer:
• ldx=1 [x ← a × b] (c1)
• mux1=0, mux2=0, demux=0, ldy=1 [y ← c+d] (c2)
• lda=1, ldb=1, ldc=1, ldd=1, mux1=1, mux2=1, demux=1, ldz=1 [z ← x+y] (c3)
Note: A register is loaded at the +ve/-ve edge (in a +ve/-ve edge-triggered system) of the cc after the one in which its load signal is asserted; e.g., with lda = 1, reg. "a" is loaded at that edge.
Note: Unspecified control signals have either an inactive value or, if such a concept doesn't exist for the control signal, the don't-care value.
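The register-loading note above can be modelled in a few lines. This is only a minimal sketch: the `Register` class and its method names are hypothetical and do not come from the slides. It shows that a value driven while the load signal is asserted becomes visible only after the next active clock edge.

```python
class Register:
    """Edge-triggered register with a load-enable input (hypothetical model)."""

    def __init__(self, value=0):
        self.value = value   # value visible to the datapath in the current cc
        self._data = value   # data input sampled at the next clock edge
        self._load = False   # load signal for the current cc

    def drive(self, data, load):
        """Set the data input and the load signal during the current cc."""
        self._data = data
        self._load = bool(load)

    def clock_edge(self):
        """Active clock edge: ends the current cc and starts the next one."""
        if self._load:
            self.value = self._data
        self._load = False   # an unspecified load signal defaults to inactive


reg_a = Register()
reg_a.drive(7, load=1)            # lda = 1 in this cc
value_during_cc = reg_a.value     # still the old value
reg_a.clock_edge()
value_after_edge = reg_a.value    # new value is visible from the next cc on
reg_a.drive(9, load=0)            # load signal not asserted
reg_a.clock_edge()
value_without_load = reg_a.value  # register keeps its contents
```

Modelling the load as a two-step drive/edge pair mirrors how the controller FSM asserts a load signal in one state and relies on the value in the following state.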
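Simple HLS Examples (contd)
2) Mapping to h/w w/ constraints: use only 1 multiplier (X) and 1 adder (+)
b) Overlapped scheduling
[Figure: datapath with registers a, b, c, d, x, y, z (load signals lda, ldb, ldc, ldd, ldx, ldy, ldz), one multiplier (X), and one adder (+) with input muxes mux1 and mux2 (inputs I0, I1) and an output demux]
[Figure: schedule over cc's 1 to 6. Multiplier (X): c1(1), c1(2). Adder (+): c2(1), c3(1), c2(2), c3(2)]
Controller FSM (entered on Reset; states labeled cc 3i and cc 3(i+1) in the figure), with the control signals asserted for each register transfer:
• ldc=1, ldd=1, mux1=0, mux2=0, demux=0, ldx=1, ldy=1 [y ← c+d, x ← a × b] (c1, c2)
• lda=1, ldb=1, mux1=1, mux2=1, demux=1, ldz=1 [z ← x+y] (c3)
Throughput comparison:
• For 4 iterations, the overlapped schedule takes 9 cc's versus 12 cc's for the non-overlapped schedule.
• Overlapped schedule: time for n iterations = 2n+1 cc's; throughput = n/(2n+1) ≈ 0.5 outputs/cc
• Non-overlapped schedule: time for n iterations = 3n cc's; throughput = n/3n ≈ 0.33 outputs/cc
• ~34% throughput improvement using an overlapped schedule when measured relative to the overlapped throughput, (0.5 − 0.33)/0.5; relative to the non-overlapped baseline, the improvement is ~50% (0.33 → 0.5 outputs/cc)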
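The cycle counts and throughput figures above can be checked numerically. The sketch below uses only the two formulas from the slide (2n+1 and 3n cc's); the function names are illustrative.

```python
from fractions import Fraction


def overlapped_cycles(n: int) -> int:
    """Clock cycles for n iterations of the overlapped schedule (2n + 1)."""
    return 2 * n + 1


def non_overlapped_cycles(n: int) -> int:
    """Clock cycles for n iterations of the non-overlapped schedule (3n)."""
    return 3 * n


def throughput(n: int, cycles: int) -> Fraction:
    """Outputs produced per clock cycle."""
    return Fraction(n, cycles)


def speedup(n: int) -> Fraction:
    """Ratio of overlapped to non-overlapped throughput for n iterations."""
    return throughput(n, overlapped_cycles(n)) / throughput(n, non_overlapped_cycles(n))


if __name__ == "__main__":
    for n in (4, 100, 10_000):
        print(
            n,
            overlapped_cycles(n),
            non_overlapped_cycles(n),
            float(throughput(n, overlapped_cycles(n))),
            float(throughput(n, non_overlapped_cycles(n))),
            float(speedup(n)),
        )
```

For large n the ratio 3n/(2n+1) tends to 1.5, which is the asymptotic gain of overlapping consecutive iterations on this datapath.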
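Simple HLS Examples (contd)
• Some DFG control operation nodes:
  Selector: two data inputs (in1, in2, labeled T and F), a condition input (T/F), and one output (out)
  Distributor: one data input (in), a condition input (T/F), and two outputs (out1, out2, labeled T and F)
• Conditional code: If (a > b) then c ← a-b; Else c ← b-a;
• Possible DFGs corresponding to the above conditional code: [figure not reproduced in the transcript]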
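Since the DFG figures are missing here, the following sketch shows two plausible ways to build the conditional code from the selector and distributor nodes. Both structures are assumptions rather than a reconstruction of the slide, and using `None` to mean "no token on this output" is a modelling choice.

```python
def selector(cond, in_true, in_false):
    """Selector node: forwards the T input if cond holds, else the F input."""
    return in_true if cond else in_false


def distributor(cond, value):
    """Distributor node: routes value to the T output or the F output.

    Returns (out_true, out_false); the unused output carries None.
    """
    return (value, None) if cond else (None, value)


def abs_diff_with_selector(a, b):
    """Compute both branches, then select one result."""
    cond = a > b
    return selector(cond, a - b, b - a)


def abs_diff_with_distributors(a, b):
    """Route the operands to one branch only, then merge the results."""
    cond = a > b
    a_t, a_f = distributor(cond, a)
    b_t, b_f = distributor(cond, b)
    # Only the branch that received operands produces a result.
    t_out = a_t - b_t if a_t is not None else None
    f_out = b_f - a_f if a_f is not None else None
    return selector(cond, t_out, f_out)
```

The selector-only version needs two subtractors (or two cycles on one subtractor) because both branches are evaluated; the distributor version activates only one branch per execution.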
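Simple HLS Examples (contd)
• Iterative code: while (a > b) a ← a-b;
[Figure: DFG and datapath for the loop. Registers a, b and r1 with load signals lda, ldb, ldr1; a mux (sel, with T and F inputs and a condition input marked "Initialized to F"); a distributor (dist) implemented as a demux, with one output going to "final a" (load signal ldfina). The comparison (>) and the subtraction (-) use an adder (+) fed with b' and cin = 1, since b' + 1 is the 2's complement of b, i.e., -b. The signal (s xor ovfl) goes to the FSM: 1 means -ve, 0 means +ve]
• Scheduling & binding: the operations are assigned to cc's c1 and c2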
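The comparison by subtraction used in this datapath can be sketched as follows: a - b is formed as a + b' + 1 (cin = 1), and the true sign of the difference is s xor ovfl. The 8-bit width, the function names, the extra zero check for the strict comparison, and the restriction to b > 0 are all assumptions of this sketch and are not taken from the slide.

```python
def sub_flags(a, b, width=8):
    """Compute a - b as a + b' + 1 on a width-bit adder.

    Returns (diff_bits, s, ovfl): the raw result bits, the sign bit of the
    result, and the signed-overflow flag.
    """
    mask = (1 << width) - 1
    a_u = a & mask
    b_inv = ~b & mask                      # b' (bitwise complement of b)
    diff = (a_u + b_inv + 1) & mask        # cin = 1 completes the 2's complement
    s = (diff >> (width - 1)) & 1
    sign_a = (a_u >> (width - 1)) & 1
    sign_b_inv = (b_inv >> (width - 1)) & 1
    # Overflow: both adder operands share a sign that the result does not have.
    ovfl = int(sign_a == sign_b_inv and s != sign_a)
    return diff, s, ovfl


def greater_than(a, b, width=8):
    """a > b iff the difference is not negative (s xor ovfl = 0) and not zero."""
    diff, s, ovfl = sub_flags(a, b, width)
    negative = s ^ ovfl
    return negative == 0 and diff != 0


def reduce_while_greater(a, b, width=8):
    """Behavior of: while (a > b) a <- a - b; (b > 0 assumed so the loop ends)."""
    if b <= 0:
        raise ValueError("this sketch assumes b > 0")
    while greater_than(a, b, width):
        a = a - b
    return a
```

For example, `reduce_while_greater(17, 5)` steps through 17, 12, 7 and stops at 2. Using s xor ovfl rather than the sign bit alone keeps the comparison correct when the subtraction overflows the register width.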
Delay Nodes in DFGs A delay node is generally implemented as a register; a delay node thus becomes a state variable.
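As a small illustration of a delay node acting as a state variable, consider a hypothetical DFG that computes y[n] = x[n] + y[n-1]. This example is not from the slides. The delay node that holds y[n-1] is modelled as a single variable that is updated once per iteration, just as a register would be updated at each clock edge.

```python
def run_accumulator(samples, initial_state=0):
    """Simulate y[n] = x[n] + y[n-1], with the delay node held in `state`."""
    state = initial_state      # register implementing the delay node
    outputs = []
    for x in samples:
        y = x + state          # combinational part of the DFG (one adder)
        outputs.append(y)
        state = y              # register loads y for use in the next iteration
    return outputs


running_sums = run_accumulator([1, 2, 3, 4])   # [1, 3, 6, 10]
```

The value of `state` is the only information carried from one iteration to the next, which is why the delay node becomes a state variable of the synthesized circuit.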
Delay Nodes in DFGs (contd) [Figure: transformation in the DFG, in which the delay node is replaced by a register, and the mapping to the architecture]
Detailed HLS Example (contd) [Figure: the synthesized architecture] Note: It is not clear how register allocation has been done; it is sub-optimal.
Detailed HLS Example: Register Allocation (contd)
• In the conflict graph (one per FU), there is an edge between 2 variable nodes if their lifetimes overlap (indicating that different registers need to be allocated to them)
• Graph coloring in general is NP-hard
• The above type of conflict graph is called an interval graph (each node corresponds to an interval on a 1-dimensional line, here a variable lifetime)
• For interval graphs, min. graph coloring can be solved optimally using the left-edge algorithm that we will see later for channel routing; it runs in linear time once the intervals are sorted by their left edges (O(n log n) including the sort)
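The left-edge idea can be sketched for register allocation as follows. The variable names, the example lifetimes, and the half-open interval convention [start, end) are assumptions made for illustration. The track-by-track loop is written for clarity rather than to achieve the linear-time bound.

```python
def left_edge(lifetimes):
    """Assign registers to variables whose lifetimes are intervals.

    lifetimes: dict mapping variable name -> (start, end), half-open [start, end).
    Returns a dict mapping variable name -> register index (0, 1, ...).
    """
    # Sort by left edge (start time); ties are broken by end time.
    remaining = sorted(lifetimes.items(), key=lambda kv: kv[1])
    assignment = {}
    reg = 0
    while remaining:
        last_end = None
        deferred = []
        for name, (start, end) in remaining:
            # Pack the next non-overlapping lifetime into the current register.
            if last_end is None or start >= last_end:
                assignment[name] = reg
                last_end = end
            else:
                deferred.append((name, (start, end)))
        remaining = deferred      # still sorted by left edge
        reg += 1
    return assignment


example = {"v1": (0, 3), "v2": (1, 4), "v3": (3, 6), "v4": (4, 7), "v5": (2, 5)}
allocation = left_edge(example)
num_registers = len(set(allocation.values()))
```

In this example three lifetimes (v1, v2, v5) are live at time 2, so at least three registers are needed, and the sketch uses exactly three: v1 and v3 share one register, v2 and v4 share another, and v5 gets its own.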