Many-core processors: the integrated approach to the computational and execution models. Lorenzo Verdoscia and Roberto Vaccaro Institute for High Performance Computing and Networking National Research Council – Italy lorenzo.verdoscia@na.icar.cnr.it.
Many-core processors: the integrated approach to the computational and execution models Lorenzo Verdoscia and Roberto Vaccaro Institute for High Performance Computing and Networking National Research Council – Italy lorenzo.verdoscia@na.icar.cnr.it L. Verdoscia & R. Vaccaro – Many-core processors: the integrated approach to the computational and execution models
The Landscape of Parallel Computing Research: A View From Berkeley
http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-183.html
What is D3AS
• From our architectural point of view, this new trend raises at least two questions:
  • how to exploit such spatial parallelism;
  • how to program such systems.
What is D3AS
• The first question leads us to seriously reconsider the dataflow paradigm, given the fine-grained nature of its operations.
• Instead of carrying out a set of operations in sequence, as a von Neumann processor does, a many-core dataflow processor could compute a function by first connecting and configuring a number of identical simple cores as a dataflow graph and then letting data flow asynchronously through them.
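The data-driven idea described above can be illustrated with a minimal Python sketch. The graph encoding (`GRAPH`), the function name `run_dataflow`, and the sample values are assumptions made for illustration; they are not part of D3AS. Each node fires as soon as both of its operands are available, with no program counter imposing an order.

```python
import operator

# Hypothetical graph encoding: node name -> (operation, left input, right input).
# An input is either the name of another node or the name of an external value.
GRAPH = {
    "t1": (operator.mul, "b", "c"),     # b * c
    "out": (operator.add, "a", "t1"),   # a + (b * c)
}


def run_dataflow(graph, inputs):
    """Fire every node whose two operands are available until all have fired."""
    tokens = dict(inputs)    # values currently available on links
    pending = dict(graph)    # nodes that have not fired yet
    while pending:
        ready = [name for name, (_, left, right) in pending.items()
                 if left in tokens and right in tokens]
        if not ready:
            raise ValueError("no node can fire: missing input or cyclic graph")
        for name in ready:   # on real hardware these could fire in parallel
            op, left, right = pending.pop(name)
            tokens[name] = op(tokens[left], tokens[right])
    return tokens


result = run_dataflow(GRAPH, {"a": 2, "b": 3, "c": 4})
print(result["out"])  # 14
```

The order in which nodes are listed in `GRAPH` does not matter; only data availability decides when a node fires.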
What is D3AS
• The second question leads us to seriously reconsider the functional programming style, given its intrinsic simplicity in writing parallel programs.
• Functional languages have three key properties that make them attractive for parallel programming:
  • they have powerful mechanisms for abstracting over both computation and coordination;
  • they eliminate unnecessary dependencies;
  • their high-level coordination achieves a largely architecture-independent style of parallelism.
Agenda
• The hHLDS model
• CHIARA language
• Dataflow graph generation and mapping
• D3AS general architecture
• Future work
D3AS (Demand Data Driven Architecture System) is a high-performance reconfigurable computing system demonstrator that exploits FPGA technology, where
• the computational model is functional
• the execution model is dataflow
and whose architecture is highly scalable, with nodes characterized by
• dynamic configurability
• transparent hardware reconfiguration
Design methodology:
• develop the right computational model alongside languages and hardware
The methodological approach
The homogeneous High Level Dataflow System (hHLDS) model
Let $A = \{a_1, \dots, a_n\}$ be the set of actors and $L = \{l_1, \dots, l_n\}$ be the set of links. A dataflow graph is a labelled directed graph $G = (N, E)$ where
• $N = A \cup L$ is the set of nodes
• $E \subseteq (A \times L) \cup (L \times A)$ is the set of edges
Firing rules in the classical model
• firing of an actor: a token on each input link and no token on each output link
• effect: consumes all input tokens and produces a token on its output link
The hHLDS model
Special actors in the classical model are characterized by heterogeneous I/O conditions.
[Figure: the special actors of the classical model: Merge, Switch, Gate, and Decider.]
Homogeneous High Level Dataflow System
Any actor has two input links and one output link, and consumes and produces only data tokens.
• firing of an actor: a token on each input link
• effect: consumes all input tokens and can produce a token on its output link
[Figure: hHLDS graphs for the expression a+b*c and for "if b≤c then a".]
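A minimal Python sketch of the hHLDS firing rule stated above: two input links, one output link, and firing as soon as each input link holds a token. The class name `Actor`, the helper `feed`, and in particular the behaviour of the comparison actor `leq` are assumptions made for illustration. The slide says only that an actor "can" produce a token, so the sketch models the no-token case with `None`.

```python
class Actor:
    """Two-input, one-output actor (hypothetical encoding of an hHLDS node)."""

    def __init__(self, op):
        self.op = op
        self.left = None     # token waiting on the left input link, if any
        self.right = None    # token waiting on the right input link, if any

    def can_fire(self):
        # hHLDS rule: one token on each input link; the output link is not checked.
        return self.left is not None and self.right is not None

    def fire(self):
        """Consume both input tokens and return the output token, or None."""
        if not self.can_fire():
            raise RuntimeError("firing rule not satisfied")
        left, right = self.left, self.right
        self.left = self.right = None    # input tokens are consumed
        return self.op(left, right)      # None models "no token produced"


def feed(actor, left, right):
    """Place one token on each input link, then fire the actor."""
    actor.left, actor.right = left, right
    return actor.fire()


mul = Actor(lambda x, y: x * y)
add = Actor(lambda x, y: x + y)
# Assumed semantics, for illustration only: forward the left operand when the
# comparison holds, produce no token otherwise.
leq = Actor(lambda x, y: x if x <= y else None)

a, b, c = 2, 3, 4
print(feed(add, a, feed(mul, b, c)))   # a + b*c
print(feed(leq, b, c))                 # comparison holds, a token is produced
print(feed(leq, c, b))                 # comparison fails, no token (None)
```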
The hHLDS model
Comparison between the two models
    input (a, c)
    b := 1;
    repeat
      if a > 1 then a := a \ 2
      else a := a * 5
      b := b * 3;
    until b = c;
    output (d)
CHIARA language
• dialect of Backus's FP
• tuple $(O, F, \mathcal{F}, :, D)$ where:
  • $O$ is a set of objects;
  • $F$ is a set of functions (or operators) from objects to objects;
  • $\mathcal{F}$ is a set of functional forms (functionals) from functions to functions;
  • $:$ is the application operation;
  • $D$ is a set of function definitions.
CHIARA language
CHIARA objects
• Atoms: integer, fixed-point and floating-point numbers, Boolean constants, characters and strings
• Sequences: denoted with angle brackets, e.g. < 1, 2, 3 >
• The empty sequence <> is the only object that is both an atom and a sequence
• Undefined: a special object (or UDF) called bottom (⊥), usually used to denote errors or exceptions
• Sequences are bottom-preserving: < 1, 2, < 3, 5 >, ⊥ > = ⊥
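The bottom-preserving property can be sketched in Python as follows. The names `BOTTOM` and `seq` are hypothetical, and Python tuples stand in for CHIARA sequences; this is an illustration of the rule, not the CHIARA implementation.

```python
class _Bottom:
    """Sentinel standing in for the undefined object (bottom)."""

    def __repr__(self):
        return "bottom"


BOTTOM = _Bottom()


def seq(*items):
    """Build a sequence; a bottom element makes the whole sequence bottom."""
    if any(item is BOTTOM for item in items):
        return BOTTOM
    return tuple(items)


print(seq(1, 2, 3))                   # (1, 2, 3)
print(seq(1, 2, seq(3, 5), BOTTOM))   # bottom
print(seq(1, seq(2, BOTTOM)))         # bottom propagates from a nested sequence
```

Because a nested `seq` call already collapses to `BOTTOM`, the property propagates outward through any depth of nesting.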
CHIARA language
CHIARA functions
Two kinds of operators can be applied to objects:
• Elementary: the commonly used binary operators and some new ones
• Combinator: operators that affect the structure of the objects to which they are applied (combine sequences, transpose sequences of sequences, etc.)
CHIARA language Elementary operators
Combinator operators
Functional forms
• CHIARA functional forms are used to define new functions from existing functions and combinators
• Functionals in CHIARA include the functional forms of Backus's FP and some new ones
CHIARA language
The assembly language
• a functionally complete subset of the elementary operators is the assembly language for a D3AS many-core processor
• more complex functions are obtained by applying the rule of metacomposition
• the dataflow graphs that are produced can be directly mapped onto the hardware and executed
New functions
The def construct permits the definition of new functions from existing functions, combinators, functional forms, and other already defined functions. For example:
• def max = (gt ° [1,2] --> 1;2)
• max:<5,6> = 6
[Figure: dataflow graph of max with inputs a and b.]
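To make the reading of the max definition above concrete, here is a hedged Python sketch of the FP-style building blocks it uses: selectors, construction `[...]`, composition `°`, and condition `p --> f; g`. All Python names (`select`, `construct`, `compose`, `condition`, `fp_max`) are hypothetical stand-ins, and tuples stand in for CHIARA sequences.

```python
def select(i):
    """Selector i: pick the i-th element (1-based) of a sequence."""
    return lambda xs: xs[i - 1]


def construct(*fs):
    """Construction [f1, ..., fn]: apply every function to the same argument."""
    return lambda x: tuple(f(x) for f in fs)


def compose(f, g):
    """Composition f ° g: apply g first, then f."""
    return lambda x: f(g(x))


def condition(p, f, g):
    """Condition (p --> f; g): apply f if p holds, otherwise g."""
    return lambda x: f(x) if p(x) else g(x)


def gt(pair):
    """Greater-than on a two-element sequence."""
    return pair[0] > pair[1]


# def max = (gt ° [1,2] --> 1;2)
fp_max = condition(compose(gt, construct(select(1), select(2))),
                   select(1), select(2))

print(fp_max((5, 6)))  # 6, matching max:<5,6> = 6
```

No variable names appear in the definition of `fp_max`: the function is built entirely by combining other functions, which is what lets a compiler turn it directly into a graph.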
Dataflow graph generation and mapping
Dataflow graph mapping
• communication between many-core processors is slower than communication within a many-core processor
• the mapping problem is NP-hard
Dataflow graph generation and mapping
Compilation process
The whole compilation process is composed of two steps:
• compilation, which produces the dataflow graph from CHIARA programs (function definitions plus expressions to be evaluated)
• mapping, which implements the produced dataflow graph on the D3AS prototype
Dataflow graph generation and mapping
Dataflow graph generation
• the CHIARA compiler, in conjunction with front-end tools, generates the Global Dataflow Graph Table (GDGT)
Global Dataflow Graph Table (GDGT)
Node#  Func  Apply  Constr  Insert  Left  Right  Out
             level  level   level   In    In
..     ...   .      .       .       ..    ..     ..
43     MUL   1      0       0       %1    %30    47
44     MUL   1      0       0       %2    %30    47
45     MUL   1      0       0       %3    %30    48
46     MUL   1      0       0       %4    %30    48
47     ADD   0      0       1       43    44     49
48     ADD   0      0       1       45    46     49
49     ADD   0      0       2       47    48     out
50     MUL   1      0       0       %1    %40    54
51     MUL   1      0       0       %2    %40    54
52     MUL   1      0       0       %3    %40    55
53     MUL   1      0       0       %4    %40    55
54     ADD   0      0       1       50    51     56
55     ADD   0      0       1       52    53     56
56     ADD   0      0       2       54    55     out
..     ...   .      .       .       ..    ..     ..
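As a rough illustration of how the Left In and Right In columns determine which nodes can run together, the Python sketch below groups rows 43 to 49 of the GDGT into parallel steps. It assumes that entries starting with `%` denote external input tokens and that a node can fire one step after its latest producer; the names `ROWS`, `step_of`, and `schedule` are hypothetical and not part of the CHIARA compiler.

```python
# node -> (func, left input, right input), copied from rows 43-49 of the table.
ROWS = {
    43: ("MUL", "%1", "%30"),
    44: ("MUL", "%2", "%30"),
    45: ("MUL", "%3", "%30"),
    46: ("MUL", "%4", "%30"),
    47: ("ADD", 43, 44),
    48: ("ADD", 45, 46),
    49: ("ADD", 47, 48),
}


def step_of(node, rows, memo):
    """Earliest step at which a node can fire; external inputs are at step 0."""
    if node not in rows:          # assumed: "%k" names an external input token
        return 0
    if node not in memo:
        _, left, right = rows[node]
        memo[node] = 1 + max(step_of(left, rows, memo),
                             step_of(right, rows, memo))
    return memo[node]


def schedule(rows):
    """Group nodes by the step at which they can fire."""
    memo = {}
    steps = {}
    for node in rows:
        steps.setdefault(step_of(node, rows, memo), []).append(node)
    return steps


print(schedule(ROWS))
# {1: [43, 44, 45, 46], 2: [47, 48], 3: [49]}
```

Under these assumptions the seven rows need three sequential steps rather than seven, since the four multiplications and the two first-level additions can each proceed in parallel.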
Visualization of Compiler Graph
Dataflow graph generation and mapping
The next step
The compiler extracts two tables from the GDGT:
• the Dataflow Graph Description (DGD) table, which contains, for each node, the binary operation and interconnection codes for the Graph Setter of a Processing Subsystem
• the Initial Input Value (IIV) table, which contains the binary information about the input program data tokens
Dataflow graph mapping
The presence of functionals:
• permits the adoption of strategies that try to cluster parallelism exploitation
• suggests handy ways to partition the dataflow graph into smaller, loosely connected graphs that can be run on the single platform-processors
D3AS general architecture
Reconfigurable Hardware System (RHS)
• Capable of mapping and executing, in a completely asynchronous manner, dataflow graphs created with the hHLDS model.
• Constituted by three subsystems:
  • Actor Realization Subsystem (ARS): creates a one-to-one correspondence between graph actors and Functional Units.
  • Token flow Realization Subsystem (TRS): implements graph edges.
  • Graph Mapping Subsystem (GMS): stores the RHS context information.
D3AS general architecture
■ ARS: constituted by N identical Multipurpose Functional Units (MPFUs).
■ TRS: constituted by 3 sets of N buffer registers and a crossbar switch interconnect.
■ GMS: constituted by a set of buffers and logic circuitry.
D3AS general architecture
Critical parameters in the RHS design:
• N_MPFU: the number of MPFUs constituting the ARS;
• C_MPFU: the logical and functional complexity of the MPFUs;
• INT_TRS: the type of interconnect for the TRS.
The number of MPFUs implementable on a VLSI device depends on:
• interconnect complexity;
• logical and functional complexity of the MPFU.
D3AS general architecture
RHS/D3AS fundamental building block: Many-core Dataflow Processor (MDP)
A many-core chip replicating the D3AS general architecture, with n MPFUs interconnected via a non-blocking crossbar switch network.
D3AS general architecture
Architecture with globally pure dataflow model
• N: number of graph actors
• n: number of MPFUs per MDP
The RHS is configured by interconnecting K = N/n MDPs with a second-level non-blocking crossbar switch interconnection network.
D3AS general architecture with hybrid dataflow model
N > n
The graph is partitioned into subgraphs, and the RHS is configured by interconnecting m = N/n MDPs with a second-level message-passing interconnection network. Dataflow graph edges between subgraphs mapped onto different MDPs are virtualized by messages routed through the network.
Communicating Dataflow Processes (CDP)
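The arithmetic behind the hybrid configuration can be sketched in Python. This is only an illustration: the block placement used here is a deliberately naive stand-in, not the authors' mapping strategy, and rounding N/n up when N is not a multiple of n is an assumption. The names `mdp_count`, `naive_partition`, and `cross_edges`, as well as the example graph, are hypothetical.

```python
import math


def mdp_count(num_actors, mpfus_per_mdp):
    """Number of MDPs needed for N actors with n MPFUs each (rounded up)."""
    return math.ceil(num_actors / mpfus_per_mdp)


def naive_partition(num_actors, mpfus_per_mdp):
    """Assign actors 0..N-1 to MDPs in consecutive blocks of n (illustrative)."""
    return {actor: actor // mpfus_per_mdp for actor in range(num_actors)}


def cross_edges(edges, placement):
    """Edges whose endpoints sit on different MDPs; these become messages."""
    return [(src, dst) for src, dst in edges if placement[src] != placement[dst]]


# A 7-actor reduction tree: four leaves, two inner nodes, one root.
EDGES = [(0, 4), (1, 4), (2, 5), (3, 5), (4, 6), (5, 6)]
placement = naive_partition(7, 4)

print(mdp_count(7, 4))                     # 2
print(len(cross_edges(EDGES, placement)))  # 4 edges must travel as messages
```

A better placement would put each inner node on the same MDP as its leaves, which is the kind of choice the mapping step has to make.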
D3AS general architecture demonstrator
GIDEL board
Some results
Matrix multiplication
• Given two matrices A(n,n) and B(n,n), their product is a matrix C(n,n) whose generic element is given by the following formula:
  $c_{ij} = \sum_{k=1}^{n} a_{ik} b_{kj}$, with i, j = 1…n
Some results
Matrix multiplication
• we used two values of n: n=32 and n=64
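For orientation, the sketch below writes out the standard definition of the matrix product (each element of C is the sum over k of a_ik times b_kj) and counts the scalar operations it implies for the two problem sizes. The function names `matmul` and `op_counts` are hypothetical; the counts follow from the definition alone and are not measurements from the experiment.

```python
def matmul(a, b):
    """Product of two n x n matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def op_counts(n):
    """(multiplications, additions) needed by the definition for n x n inputs."""
    return n ** 3, n * n * (n - 1)


print(matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19, 22], [43, 50]]
print(op_counts(32))  # (32768, 31744)
print(op_counts(64))  # (262144, 258048)
```

All n*n elements of C are independent of one another, which is the spatial parallelism a dataflow graph can expose.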
Some results
Matrix multiplication
• we compared the performance of a platform-processor with that of an IA-32 Pentium IV
• we measured performance in terms of CPI because our FPGA platform-processor executes an operation in 30 ns, against 0.5 ns for the Pentium
Some results
IA-32 Pentium IV vs D3AS
Some results
Zeroes of a function (f = x*x + 3x - 1.75)
Assembly code generated by compiling the C source code: 122 lines of sequential assembly code
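For reference, the zeros of this quadratic can be worked out in closed form. The derivation below uses only the coefficients given above and says nothing about the numerical method used in the experiment, which the slides do not state.

```latex
\begin{aligned}
x^2 + 3x - 1.75 &= 0 \\
x &= \frac{-3 \pm \sqrt{3^2 - 4 \cdot 1 \cdot (-1.75)}}{2}
   = \frac{-3 \pm \sqrt{16}}{2} \\
x_1 &= 0.5, \qquad x_2 = -3.5
\end{aligned}
```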
Some results
Zeroes of a function
Our compiler generates a GDGT with only 28 micro-instructions organized in 12 sequential steps.
Future work
• To evaluate which applications perform better on the architecture with the globally pure dataflow model and which with the hybrid dataflow model.
• How to generalize pipelining inside the MDP