Distributed Memory and Datastream-based Reconfigurable Computing

Örebro, Aug. 25-27, 2003, 11.00 – 13.00 hrs. Distributed Memory and Datastream-based Reconfigurable Computing. Reiner Hartenstein, Kaiserslautern University of Technology

Makimoto’s Wave (published in 1989): “Mainstream Silicon Application is switching every 10 Years”. “The Programmable System-on-a-Chip is the next wave”. [Figure: Semiconductor Revolutions, 1957 to 2007 in ten-year steps, alternating between standard (TTL; µproc., memory; reconfigurable) and custom (LSI, MSI; ASICs, accel’s); annotated with the vN machine paradigm and the anti machine paradigm]

Paradigm Shifts: What’s the next Wave? [Figure: Makimoto’s wave, 1957 to 2007, standard vs. custom, with Tredennick’s paradigm shifts and Hartenstein’s Curve; standard FPGAs; Coarse grain RAs?] • hardwired: algorithm fixed, resources fixed • procedural programming (vN machine paradigm): algorithm variable, resources fixed • structural programming (anti machine paradigm): algorithm variable, resources variable • 4th wave? No further wave!

Mainstream Markets [Figure: Makimoto’s wave, 1957 to 2007, standard vs. custom: TTL; LSI, MSI; µproc., memory; ASICs, accel’s; reconfigurable; annotated with the markets mainframes, PC, and data streams ... morphware here?] Technology issue and business model? Free riders (Trittbrettfahrer)

The Impact of Makimoto’s Paradigm Shifts (Dr. Makimoto: FPL 2000 keynote) [Figure: Makimoto’s wave, 1957 to 2007, standard vs. custom] • Personalization (CAD) before fabrication • Procedural personalization via RAM-based Machine Paradigm: Software Industry’s Secret of Success • Structural personalization: RAM-based, before run time: Repeat Success Story by new Machine Paradigm!

Reconfigurable Computing: a second programming domain • Migration of programming to the structural domain • The structural domain has become RAM-based • The opportunity to introduce the structural domain to programmers ... • ... to bridge the gap by clever abstraction mechanisms using a simple new machine paradigm

Ubiquitous embedded systems. Embedded System Engineering (ESE) requires: • Hardware (HW) / (E)Software (ESW) co-design • Configware (CW) / ESW co-design • HW / CW / ESW co-design. ESE becomes the main focus in system design: ESW becomes the main vehicle for product differentiation

Coarse grain vs. Fine grain. Reconfigurability: • fine grain (FPGAs, rGAs) • coarse grain (PACT AG, Munich) • multi grain (e. g. by slice bundling)

Makimoto’s 3rd Wave • Fine Grain Subsystems (FPGAs): 1st half of 3rd wave; universal (but less efficient) • Coarse Grain Subsystems: 2nd half of 3rd wave; domain-specific; much more flexible than 2nd half of 2nd wave

Principle of a Typical FPGA [Figure legend: FF = flipflop of the hidden RAM]

Routing Overhead in FPGAs [Figure legend: FF = part of the hidden RAM] • > 1000 transistors at each cross bar • > 40 transistors at each switching point • > 15 transistors at each tap • Routing Congestion [DeHon]: often 50% or less of CLBs used • most FPGA vendors’ gate count: 1 flipflop of configuration RAM = 4 gates

Reconfigurability Overhead [Figure: array of blocks labelled L and S; legend: area used by application; resources needed for reconfigurability, partly for configuration code storage; “hidden RAM” not shown]

Reconfigurability Overhead • Fine Grain morphware platforms: about 1 of 100 transistors serves the application; the rest serve reconfigurability • Coarse Grain platforms: if laid out well by structured VLSI design, area efficiency is almost that of hardwired designs

Why Coarse Grain instead of FPGA? [Figure: transistors / chip (1000 to 100 000 000 000, log scale) vs. year (1980 to 2010); curves labelled memory, microprocessor, FPGA physical, FPGA logical, FPGA routed, and supersystolic; annotated gaps of ~ 10 and ~ 10 000. Sources: Proc ISSCC, ICSPAT, DAC, DSPWorld] • reconfigurability overhead reduced by up to ~ 1000 • much faster loading • drastically smaller configuration memory • many more benefits

Throughput vs. Efficiency [Figure: MOPS / mW (0.001 to 1000, log scale) vs. feature size in µ (2, 1, 0.5, 0.25, 0.13, 0.1, 0.07); curves labelled hardwired, rDPAs (reconfigurable computing)*, DSP, FPGAs (reconfigurable logic), instruction set processors, and standard microprocessor; T. Claasen et al.: ISSCC 1999. Insets: 1 Bit CLB with area used by application vs. resources needed for reconfigurability; wiring by abutment: 32 Bit example] *) R. Hartenstein: ISIS 1997

Throughput vs. Flexibility: coarse grain goes far beyond bridging the gap [Figure: MOPS / mW (0.001 to 1000, log scale) vs. feature size in µ (2, 1, 0.5, 0.25, 0.13, 0.1, 0.07), with throughput and flexibility as the two directions; curves labelled hardwired, rDPAs (reconfigurable computing)* (coarse grain), DSP, FPGAs (reconfigurable logic), instruction set processors, and standard microprocessor (von Neumann); wiring by abutment: 32 Bit example; T. Claasen et al.: ISSCC 1999] *) R. Hartenstein: ISIS 1997

>> outline << • embedded System Design Crisis • the CS crisis • datastream-based Computing • the Anti Machine Paradigm • application-specific distributed memory • anti machine architectural resources • final remarks http://www.uni-kl.de

Embedded System Design Crisis [Figure: design cost and product life cycle vs. year]

What are the Challenges? (5) [ST microelectronics, MorphICs, Dataquest, eASIC] [Figure: growth factor vs. time (months 0, 10, 12, 18; then 2y, 3y, 4y, 5y, 10y, 30y)] • design complexity (1.4/year) • integration density (1.4/year) [Moore’s law] • embedded software [DTI* law] • communication bandwidth [Hansen’s law] • mask and NRE cost (1.25/year) • µprocessor integration density (1.2/year) • designer productivity (1.15/year) • memory bandwidth [Patterson’s law] (1.07/year) • battery capacity (1.03/year). New compilation techniques needed, supported by a new machine paradigm! *) Department of Trade and Industry, London
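The per-year factors on this slide compound quickly. A minimal sketch of that arithmetic follows, assuming the label/rate pairing as read from the slide text; the names GROWTH, compound and gap are illustrative and do not come from the talk.

```python
# Annual growth factors as quoted on the slide (pairing of label and
# rate is an assumption taken from the flattened slide text).
GROWTH = {
    "design complexity": 1.40,
    "mask and NRE cost": 1.25,
    "microprocessor integration density": 1.20,
    "designer productivity": 1.15,
    "memory bandwidth": 1.07,
    "battery capacity": 1.03,
}


def compound(rate: float, years: int) -> float:
    """Factor reached after `years` of growth at `rate` per year."""
    return rate ** years


def gap(fast: str, slow: str, years: int) -> float:
    """Ratio between two compounded curves after `years`."""
    return compound(GROWTH[fast], years) / compound(GROWTH[slow], years)


if __name__ == "__main__":
    for years in (5, 10):
        g = gap("design complexity", "designer productivity", years)
        print(f"after {years} years: complexity/productivity gap ~ {g:.1f}x")
```

At 1.4 versus 1.15 per year the ratio between the two curves grows by a factor of roughly 1.22 every year, which is the widening gap the slide points at.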

The microelectronics spare part problem [Hartenstein 2002] [Figure: IC market volume, demand / years of availability, and IC physical life expectancy / years vs. feature size in µ (2, 1, 0.5, 0.25, 0.13, 0.1, 0.07)] Key problem in many application areas: medical, aerospace, automotive, other transportation, military, industrial equipment controllers, et al.

The microelectronics spare part problem [Hartenstein 2002] [Figure: IC market volume, demand / years of availability, and IC physical life expectancy / years vs. feature size in µ (2, 1, 0.5, 0.25, 0.13, 0.1, 0.07)] • Demand: several decades of availability • e. g. car price: ~25% electronics • ICs do not survive storage time • Original fab line no longer exists

Mask & NRE cost [ST microelectronics]

Shannon’s Law • In a number of application areas throughput requirements are growing faster than Moore’s law • Fundamental flaws in software processor solutions • 32 soft ARM cores fit onto a contemporary FPGA • Data-stream-based distributed processing is the way to go

Foundries: Adoption Rate By Process [Nick Tredennick]

SoC System level Design: Embedded SW (ESW) [Slide overlay adds CW and ECW alongside SW and ESW] • ESE becomes the main focus in system design: ESW becomes the main vehicle for product differentiation • HW-(E)SW codesign onto highly programmable platforms (SoC) • new design automation from high level descriptions, SW synthesis included • (SoC) HW-(E)SW co-verification • formal verification for (E)SW

ITRS SoC design cost model [ITRS 2001] http://public.itrs.net/Files/2001ITRS/Design.pdf [Figure: design cost curve annotated with small block reuse, tall thin engineer, large block reuse, IC implementation tools, ES level methodology, intelligent testbench; mostly system level issues; RTL methodology only w. future improvements]

>> CS crisis << • embedded System Design Crisis • the CS crisis • datastream-based Computing • the Anti Machine Paradigm • application-specific distributed memory • anti machine architectural resources • final remarks http://www.uni-kl.de

„EDA industry shifts into CS mentality“ [Wojciech Maly] • patches instead of engineering • innovation stalled many years ago • netlist-based: do not care about efficiency, ... • ... do not care about transistor density • 85% of users hate their tools

Where are we heading? [Figure: growth factor vs. time (months 0, 10, 12, 18): integration density (1.4/year) [Moore’s law] vs. embedded software [DTI* law]] • CS is not prepared: heading toward disaster • 90% by 2010 • 10 times more programmers will write embedded applications than computer software by 2010. *) Department of Trade and Industry, London

Crusty Computing Sciences • more and more efforts yield only marginal improvements • areas fade away: dataflow machines dead, shrinking supercomputing conferences • 98.5% vN-only: this monopoly is the problem [David Padua, John Hennessy]

Dead Supercomputer Society [Gordon Bell, keynote at ISCA 2000] • ACRI • Alliant • American Supercomputer • Ametek • Applied Dynamics • Astronautics • BBN • CDC • Convex • Cray Computer • Cray Research • Culler-Harris • Culler Scientific • Cydrome • Dana/Ardent/Stellar/Stardent • DAPP • Denelcor • Elexsi • ETA Systems • Evans and Sutherland Computer • Floating Point Systems • Galaxy YH-1 • Goodyear Aerospace MPP • Gould NPL • Guiltech • ICL • Intel Scientific Computers • International Parallel Machines • Kendall Square Research • Key Computer Laboratories • MasPar • Meiko • Multiflow • Myrias • Numerix • Prisma • Tera • Thinking Machines • Saxpy • Scientific Computer Systems (SCS) • Soviet Supercomputers • Supertek • Supercomputer Systems • Suprenum • Vitesse Electronics

CS: young? dynamic? ... but the von Neumann Paradigm is still the dominant doctrine ... the vN Microprocessor is a Methuselah, the steam engine of the silicon age. After >10 technology generations ... • 1st 4004 • 2nd 8008 • 3rd 8086 • 4th 80286 • 5th 80386 • 6th 80486 • 7th P5 (Pentium) • 8th P6 (Pentium Pro / Pentium II) • 9th Pentium III • 10th .... • 11th • ....... ... still pushing the basic models from the times of mainframe dinosaurs. Microelectronics is ignored (except falling cost of computational effort). Computing sciences are ultra conservative (to avoid saying: senile) … A re-orientation is overdue.

MPU designs more complex • new kinds of concurrency are becoming important: chip-level multiprocessing + simultaneous multithreading • many bugs relate to concurrency issues • greatly complicates the verification process

MPU performance stalled • Bill Gates’ law: relative computation time needed doubles every 2 years • had been compensated by Moore’s law • Moore’s law will stall soon for MPUs

CS: Lacking Sense of Direction? • blinders: for ignoring the impact of RC • „we are o.k.!“ (no new direction)

Stealthy CS Crisis • severe software quality problems • progress in CS stalled by qualification problems in industry and academia • often hardware people needed to solve CS problems • communication barriers between disciplines

Crossing the Hardware / Software Chasm [Mike Butts]: What’s the problem? [Figure: µprocessor and accelerators] • It’s the gap between the procedural and the structural mind set • Traditional CS: programming is (control-)procedural, instruction-stream-based; sources: software • The typical programmer has problems understanding function evaluation without machine mechanisms, that is, by signals rippling through a network of transistors.

Crossing the Hardware / Software Chasm [Mike Butts]: What’s the problem? (2) [Figure: µprocessor and accelerators] • Brain usage: procedural-only; structural hemisphere missing • The brain hurts on paradigm shift? No, it can’t ...

Changing Models of Computing [Figure: a “von Neumann” host (instruction sequencer, data path, RAM, I / O) with accelerator(s). Hardware platform: hardwired accelerator(s) from CAD, software (procedural) downloaded into the host RAM; this is hardware/software co-design. Morphware platform: reconfigurable accelerator(s) with their own RAM, configware (structural) and software (procedural) both downloaded; this is configware/software co-design. Combined: hardware/configware/software co-design]

“Programming” Domains
Platform | Personalization (“Programs”) by | Programming Domain | Communication Paths Setup Time
Hardware | CAD | Space | Fabrication Time
Systolic Array | CAD | Time and Space | Fabrication Time
procedural (e.g. “von Neumann”) | Software | Time | Run Time
Morphware | Configware | Space | Compile Time
Embedded Morphware | Configware / Software Co-Compilation | Time and Space | Compile Time and Run Time

Terminology: Digital System Platforms clearly distinguished

There are more Levels of Parallelism, ignored by typical CS people & ignored by CS curricula: • Process level • Loop Level (data-stream-based, pipe nets, etc.) • Instruction Level (VLIW etc.) • RT Level (special architectures etc.) • Logic Level (FPGAs)

Complexity: System Level Design Challenge [ITRS 2001]: “abstraction levels must be raised above present-day RT-level from HW + (processor-dependent embedded) C code level; language infrastructures for complex models (SystemC etc.) must be leveraged by industry consensus on use-methodology and abstraction levels”

>> datastream-based computing << • embedded System Design Crisis • the CS crisis • datastream-based computing • the Anti Machine Paradigm • application-specific distributed memory • anti machine architectural resources • final remarks http://www.uni-kl.de

Computing in space and time [Figure: systolic arrays etc.; coefficients a11 ... a33, data streams x1, x2, x3 and y1, y2, y3 (initial values 0); computing in time vs. computing in space; placement; migration by re-timing and other transformations] This dichotomy is completely ignored by our CS curricula.
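Computing in space and time can be made concrete with a small simulation. The sketch below assumes the figure shows a matrix-vector product y = A·x on a linear systolic array; the function name systolic_matvec and the particular skewing of the streams are illustrative choices, not taken from the slide. Each processing element (space) keeps one y value, while the x stream advances by one element per clock step (time).

```python
def systolic_matvec(a, x):
    """Simulate y = A*x on a linear array with one PE per row of A.

    PE i accumulates y[i]. The x stream enters at PE 0 and moves one PE
    to the right per clock step, so PE i sees x[j] at step t = i + j.
    The coefficient a[i][j] is assumed to arrive at PE i at that same
    step (a skewed input stream).
    """
    rows, n = len(a), len(x)
    y = [0] * rows
    x_regs = [None] * rows            # x register of each PE, None = bubble
    for t in range(n + rows - 1):     # steps until the last x leaves the array
        # shift the x stream by one PE and feed the next item (or a bubble)
        x_regs = [x[t] if t < n else None] + x_regs[:-1]
        for i, xv in enumerate(x_regs):
            if xv is not None:
                j = t - i             # index of the x item now sitting in PE i
                y[i] += a[i][j] * xv
    return y


if __name__ == "__main__":
    A = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    print(systolic_matvec(A, [1, 2, 3]))
```

Re-timing in this picture amounts to changing at which step each coefficient meets each stream item; the result stays the same while the schedule and the placement change.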

General Stream-based Computing System: Kress DPSS [1995] [Figure: an expression tree (operators such as +, *, -, sh, xf) and DPU architectures enter the Mapper (simultaneous placement & routing by simulated annealing), which yields a free form pipe network on a heterogeneous Array of rDPUs (reconf. data path units) (space); the Scheduler yields the data streams (time)] The same mapper for both: reconfigurable, or hardwired.
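The role of simulated annealing in such a mapper can be sketched in a few lines. This is not the Kress DPSS mapper: it only places operators on a grid and uses the total Manhattan wire length as a stand-in for routing cost. The names wire_length and anneal_placement, the linear cooling schedule and all parameters are assumptions made for illustration.

```python
import math
import random


def wire_length(placement, edges):
    """Sum of Manhattan distances over all operator-to-operator edges."""
    return sum(abs(placement[u][0] - placement[v][0]) +
               abs(placement[u][1] - placement[v][1]) for u, v in edges)


def anneal_placement(nodes, edges, rows, cols, steps=4000, t0=2.0, seed=0):
    """Place each node on its own cell of a rows x cols grid."""
    if len(nodes) > rows * cols:
        raise ValueError("grid too small for the given nodes")
    rng = random.Random(seed)
    # one slot per grid cell; None marks an unused cell
    slots = list(nodes) + [None] * (rows * cols - len(nodes))
    rng.shuffle(slots)

    def placement():
        return {n: divmod(k, cols) for k, n in enumerate(slots) if n is not None}

    cost = wire_length(placement(), edges)
    best, best_cost = placement(), cost
    for step in range(steps):
        temp = t0 * (1.0 - step / steps) + 1e-9   # linear cooling (assumed)
        i, j = rng.randrange(len(slots)), rng.randrange(len(slots))
        slots[i], slots[j] = slots[j], slots[i]   # propose: swap two cells
        new_cost = wire_length(placement(), edges)
        if new_cost <= cost or rng.random() < math.exp((cost - new_cost) / temp):
            cost = new_cost                        # accept the move
            if cost < best_cost:
                best, best_cost = placement(), cost
        else:
            slots[i], slots[j] = slots[j], slots[i]  # reject: undo the swap
    return best, best_cost


if __name__ == "__main__":
    ops = ["x", "a", "mul", "y_in", "add"]
    nets = [("x", "mul"), ("a", "mul"), ("mul", "add"), ("y_in", "add")]
    print(anneal_placement(ops, nets, rows=2, cols=3))
```

Worse placements are accepted with a probability that shrinks as the temperature falls, which lets the search escape local minima early on and settle later.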

Flowware defines ... which data item at which time at which port. [Figure: input data streams and output data streams of a DPA, each drawn as a grid of time vs. port #, with x marking a data item and - an empty slot] Flowware history: • 1980: data streams (Kung, Leiserson) • 1995: super systolic rDPA (Kress) • 1996+: SCCC (LANL), SCORE, ASPRC, Bee (UCB), ... (tutorials and courses available on all this)
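The statement that flowware fixes which data item appears at which time at which port can be written down as a plain data structure. The sketch below is an assumption-laden illustration: StreamEvent, skewed_schedule and render are hypothetical names, and the diagonal skew is just one common systolic input pattern, not necessarily the one drawn on the slide.

```python
from typing import NamedTuple


class StreamEvent(NamedTuple):
    time: int   # clock step at which the item is presented
    port: int   # port number of the data path array
    item: str   # name of the data item


def skewed_schedule(rows):
    """Port p receives its k-th item at time p + k (diagonal wavefront)."""
    return sorted(StreamEvent(p + k, p, item)
                  for p, row in enumerate(rows)
                  for k, item in enumerate(row))


def render(events):
    """One text row per port, one column per time step, '-' for an idle slot."""
    ports = max(e.port for e in events) + 1
    steps = max(e.time for e in events) + 1
    grid = [["-"] * steps for _ in range(ports)]
    for e in events:
        grid[e.port][e.time] = e.item
    return [" ".join(row) for row in grid]


if __name__ == "__main__":
    schedule = skewed_schedule([["a", "b"], ["c", "d"]])
    for line in render(schedule):
        print(line)
```

The rendered grid has the same shape as the time vs. port # diagrams on the slide: each row is a port, each column a time step.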

Flowware: control-procedural vs. data-procedural • The structural domain is primarily data-stream-based ... mostly not yet modelled that way: most flowware is hidden by its indirect instruction-stream-based implementation • Flowware provides a (data-)procedural abstraction from the (data-stream-based) structural domain • Flowware converts „procedural vs. structural“ into „control-procedural vs. data-procedural“ ... • ... a Trojan horse to introduce the structural domain to the procedural mind set of programmers

Configware / Flowware Compilation [Figure: a high level source program is compiled via an intermediate form; the mapper produces configware for the rDPA (r. Data Path Array); the scheduler produces flowware for the address generators (data sequencers) of the memory banks (M, asM) that exchange data streams with the rDPA through a wrapper] Students should know that P & R is also a compilation technique.

>> the anti machine paradigm << • embedded System Design Crisis • the CS crisis • datastream-based Computing • the Anti Machine Paradigm • application-specific distributed memory • anti machine architectural resources • final remarks http://www.uni-kl.de
