
The von Neumann Syndrome



Presentation Transcript


  1. The von Neumann Syndrome. Reiner Hartenstein, TU Kaiserslautern. TU Delft, Sept 28, 2007 (v.2). http://hartenstein.de

  2. von Neumann Syndrome: this term was coined by “RAM” (C.V. Ramamoorthy, emeritus, UC Berkeley)

  3. The first reconfigurable computer was prototyped in 1884 by Herman Hollerith, a century before the introduction of the FPGA, and it was data-stream-based. 60 years later the instruction-stream-based von Neumann (vN) model took over.

  4. Outline • von Neumann overhead hits the memory wall • The manycore programming crisis • Reconfigurable Computing is the solution • We need a twin paradigm approach • Conclusions

  5. The spirit of the Mainframe Age • For decades we have trained programmers to think sequentially, breaking complex parallelism down into atomic instruction steps … • … ending up with code sizes of astronomic dimensions • Even in “hardware” courses (the unloved child of CS curricula) we often teach von Neumann machine design, deepening this tunnel view • 1951: hardware design goes von Neumann (microprogramming)

  6. von Neumann: an array of massive overhead phenomena … piling up to code sizes of astronomic dimensions

  7. von Neumann: an array of massive overhead phenomena piling up to code sizes of astronomic dimensions, plus the temptations of von Neumann style software engineering • [Dijkstra 1968]: the “go to” considered harmful • massive communication congestion, [R.H. 1975]: the universal bus considered harmful • [Backus 1978]: Can programming be liberated from the von Neumann style? • [Arvind et al. 1983]: A critique of multiprocessing the von Neumann style

  8. von Neumann overhead: just one example. [1989]: 94% of the computation load goes only into moving this window (image processing example)
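
To make the claim concrete, here is a minimal C sketch (my illustration, not code from the talk) of a 3 × 3 sliding-window filter: most of the executed instructions are loop control and address arithmetic for moving the window, while only the additions and one division do useful work.

    #include <stdint.h>

    #define W 640
    #define H 480

    /* 3x3 mean filter: for every output pixel the CPU re-executes
     * loop control and address arithmetic (index scaling, bounds
     * tests) just to "move the window". */
    void mean3x3(const uint8_t in[H][W], uint8_t out[H][W])
    {
        for (int y = 1; y < H - 1; y++) {          /* window row      */
            for (int x = 1; x < W - 1; x++) {      /* window column   */
                unsigned sum = 0;
                for (int dy = -1; dy <= 1; dy++)       /* 9 index     */
                    for (int dx = -1; dx <= 1; dx++)   /* computations */
                        sum += in[y + dy][x + dx];     /* 1 useful add */
                out[y][x] = (uint8_t)(sum / 9);
            }
        }
    }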

  9. The Memory Wall. [Figure: Dave Patterson’s “performance gap” plot, 1980 to 2005: µProcessor performance grows 60%/yr, DRAM only 7%/yr; the gap grows ~50%/yr; CPU clock-speed scaling ends in 2005.] CPU clock speed ≠ performance: by 2005 a processor’s silicon is mostly cache. Instruction-stream code of astronomic dimensions needs off-chip RAM, which fully hits the memory wall; better compare off-chip vs. fast on-chip memory.
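
A quick check of the slide’s numbers (my back-of-the-envelope sketch, not from the talk): with processors improving 60%/year and DRAM 7%/year, the ratio of the growth rates confirms a gap widening by roughly 50% per year.

    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        const double cpu  = 1.60;  /* processor: +60% per year */
        const double dram = 1.07;  /* DRAM:       +7% per year */
        /* ratio of growth rates = yearly growth of the gap */
        printf("gap grows %.0f%% per year\n", (cpu / dram - 1.0) * 100);
        /* after n years the gap has widened by (cpu/dram)^n */
        printf("after 10 years: x%.0f\n", pow(cpu / dram, 10));
        return 0;
    }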

  10. CPU clock speed ≠ performance: a processor’s silicon is mostly cache. [Figure: benchmarked computational density (SPECfp2000 / MHz / billion transistors), 1990 to 2005, for DEC Alpha, IBM, SUN, and HP processors; caches dominate. Source: BWRC, UC Berkeley, 2004; stolen from Bob Colwell.] Alpha: down by 100 in 6 years; IBM: down by 20 in 6 years.

  11. Outline • von Neumann overhead hits the memory wall • The manycore programming crisis • Reconfigurable Computing is the solution • We need a twin paradigm approach • Conclusions

  12. The Manycore future • we are embarking on a new computing age: the age of massive parallelism [Burton Smith] • everyone will have multiple parallel computers [B.S.] • even mobile devices will exploit multicore processors, also to extend battery life [B.S.] • multiple von Neumann CPUs on the same µprocessor chip lead to exploding (vN) instruction-stream overhead [R.H.]

  13. The instruction-stream-based parallel von Neumann approach. [Figure: a 4 × 4 array of CPUs, each watered like a flowerpot: the watering pot model [Hartenstein].] Each CPU has several von Neumann overhead phenomena of its own.

  14. Explosion of overhead by von Neumann parallelism. [Figure: 4 × 4 CPU array.] The overhead grows disproportionately, not proportionately, to the number of processors. [R.H. 2006]: MPI considered harmful.

  15. Rewriting Applications • more processors means rewriting applications • we need to map an application onto manycore configurations of different sizes • most applications are not readily mappable onto a regular array • mapping is much less problematic with Reconfigurable Computing. [Figure: a 4 × 4 CPU array next to a 4 × 4 rDPU array.]

  16. Disruptive Development: the Education Wall • The computer industry is probably going to be disrupted by some very fundamental changes [Iann Barron] • We must reinvent computing [Burton J. Smith] • A parallel [vN] programming model for manycore machines will not emerge for five to ten years [experts from Microsoft Corp.] • I don‘t agree: we have a model • Reconfigurable Computing: the technology is ready, the users are not • It‘s mainly an education problem

  17. Outline • von Neumann overhead hits the memory wall • The manycore programming crisis • Reconfigurable Computing is the solution • We need a twin paradigm approach • Conclusions

  18. The Reconfigurable Computing Paradox • FPGA technology is bad: reconfigurability overhead, wiring overhead, routing congestion, slow clock speed • yet migration to FPGAs brings up to 4 orders of magnitude speedup and tremendously slashes the electricity bill • The reason for this paradox? • There is something fundamentally wrong in using the von Neumann paradigm • The spirit of the Mainframe Age is collapsing under the von Neumann syndrome

  19. Beyond von Neumann parallelism. The instruction-stream-based von Neumann approach is the watering pot model [Hartenstein], with several von Neumann overhead phenomena per CPU. We need an approach like this instead: data-stream-based RC*. *) “RC” = Reconfigurable Computing

  20. von Neumann overhead vs. Reconfigurable Computing. [Figure: rDPA, a reconfigurable datapath array of coarse-grained rDPUs.] Instead of a program counter, the rDPA uses (reconfigurable) data counters*: no instruction fetch at run time. *) configured before run time

  21. von Neumann overhead vs. Reconfigurable Computing (coarse-grained reconfigurable). [Figure: 4 × 4 rDPA, configured before run time.] [1989]: ×17 speedup by the GAG* alone, ×15,000 total speedup from this migration project (image processing example). *) just by the reconfigurable address generator

  22. Reconfigurable Computing means … • for HPC, run time is more precious than compile time http://www.tnt-factory.de/videos_hamster_im_laufrad.htm • Reconfigurable Computing means moving overhead from run time to compile time** • Reconfigurable Computing replaces “looping” at run time* by configuration before run time *) e.g. complex address computation **) or loading time
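
A minimal software analogy (my sketch, under my own assumptions): the run-time “looping” that recomputes a complex address sequence can be replaced by a sequence fixed before run time, just as a hardware address generator is configured once before execution; here a precomputed table stands in for that configuration.

    #include <stddef.h>

    #define N 64

    /* run-time variant: complex index math re-executed every step */
    long sum_runtime(const long a[N][N])
    {
        long s = 0;
        for (size_t k = 0; k < N * N; k++)
            s += a[k % N][(k * 7) % N];
        return s;
    }

    /* "configured before run time": the same index sequence is
     * produced once, ahead of the loop, at "loading time"        */
    static size_t seq[N * N];

    void configure(void)
    {
        for (size_t k = 0; k < N * N; k++)
            seq[k] = (k % N) * N + (k * 7) % N;
    }

    long sum_configured(const long *a)   /* pass &a[0][0], flat view */
    {
        long s = 0;
        for (size_t k = 0; k < N * N; k++)
            s += a[seq[k]];              /* just stream the data */
        return s;
    }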

  23. Data meeting the Processing Unit (PU): explaining the RC advantage. We have 2 choices • by Software: routing the data to the PU by memory-cycle-hungry instruction streams through shared memory • by Configware: data-stream-based placement* of the execution locality; the (PU) pipe network is generated by configware compilation *) before run time
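
The two choices, caricatured in C (my sketch, not code from the talk): the software route sends every intermediate result on a round trip through shared memory, while the configware route passes values straight from producer to consumer, as a pipe network would.

    #define N 1024

    static int shared_mem[N];            /* the shared-memory detour */

    /* choice 1, "by Software": each stage is its own instruction-
     * stream loop; intermediates cost extra memory cycles          */
    void pipeline_via_memory(const int *in, int *out)
    {
        for (int i = 0; i < N; i++) shared_mem[i] = in[i] * 3;
        for (int i = 0; i < N; i++) out[i] = shared_mem[i] + 1;
    }

    /* choice 2, "by Configware": each datum flows producer to
     * consumer directly, like a hardware pipe                     */
    void pipeline_via_pipe(const int *in, int *out)
    {
        for (int i = 0; i < N; i++) {
            int t = in[i] * 3;   /* stage 1 feeds ...              */
            out[i] = t + 1;      /* ... stage 2, no memory cycle   */
        }
    }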

  24. What pipe network? A pipe network organized at compile time: the generalization* of the systolic array [R. Kress, 1995]. [Figure: rDPA with array ports receiving or sending data streams; routing depends on the connect fabric.] rDPA = rDPU array, i.e. coarse-grained; rDPU = reconfigurable datapath unit (no program counter). *) supporting non-linear pipes on free-form hetero arrays
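
For intuition, a classic systolic-style pipe modeled in C (my illustration; the rDPA generalizes this to non-linear pipes on heterogeneous arrays): a 4-tap FIR filter where each stage acts like one rDPU, holding a coefficient and performing one multiply-accumulate per “clock tick” as the data stream marches through.

    #define TAPS 4

    /* one rDPU-like stage: a coefficient and the sample it holds */
    typedef struct { int coeff; int x; } stage_t;

    /* push one input sample through the linear pipe and return
     * the filter output for this tick                            */
    int fir_step(stage_t pipe[TAPS], int sample)
    {
        int acc = 0;
        for (int s = TAPS - 1; s > 0; s--)
            pipe[s].x = pipe[s - 1].x;      /* data stream moves on */
        pipe[0].x = sample;
        for (int s = 0; s < TAPS; s++)
            acc += pipe[s].coeff * pipe[s].x;  /* one MAC per stage */
        return acc;
    }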

  25. ASM: Auto-Sequencing Memory. [Figure: rDPA surrounded by ASMs; each ASM = GAG + RAM + data counter.] Migration benefit by on-chip RAM: some RC chips have hundreds of on-chip RAM blocks, orders of magnitude faster than off-chip RAM, so that the drastic code size reduction by software-to-configware migration can beat the memory wall. Multiple on-chip RAM blocks are the enabling technology for ultra-fast anti machine solutions. The GAGs inside the ASMs generate the data streams. GAG = generic address generator; rDPA = rDPU array, i.e. coarse-grained; rDPU = reconfigurable datapath unit (no program counter)
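
How an ASM behaves, modeled in C (a hypothetical sketch; the real GAG is configured hardware, not software): a generic address generator steps its data counters through a 2-D scan pattern fixed before run time and streams words out of its RAM block, with no instruction fetch involved.

    #include <stdint.h>

    /* GAG configuration: a rectangular 2-D scan, fixed before run time */
    typedef struct {
        uint32_t base, row_stride;
        uint32_t width, height;
        uint32_t x, y;               /* the data counters */
    } gag_t;

    /* one step: emit the next address of the data stream */
    uint32_t gag_next(gag_t *g)
    {
        uint32_t addr = g->base + g->y * g->row_stride + g->x;
        if (++g->x == g->width) {    /* counters step by themselves: */
            g->x = 0;                /* no instruction stream needed */
            ++g->y;
        }
        return addr;
    }

    /* stream width*height words out of the on-chip RAM block */
    void asm_stream(const uint32_t *ram, gag_t *g, uint32_t *out)
    {
        for (uint32_t i = 0; i < g->width * g->height; i++)
            out[i] = ram[gag_next(g)];
    }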

  26. Coarse-grained Reconfigurable Array example: an SNN filter for image processing (mainly a pipe network), compiled by Nageldinger‘s KressArray Xplorer (Juergen Becker‘s CoDe-X inside). [Figure: array of 10 × 16 = 160 rDPUs, 32 bits wide, mesh-connected (exceptions: rout-through only), with 3 × 3 fast on-chip RAM (ASMs), backbus connect, some rDPUs not used.] Note: this comes close to the programmer‘s mind set (much closer than an FPGA): a kind of software perspective, but without instruction streams; data streams + pipelining instead.
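
For reference, the kind of kernel meant here, sketched in C (my reading of the symmetric nearest neighbour filter, not the KressArray mapping itself): for each of the four symmetric neighbour pairs in a 3 × 3 window, keep the pixel closer in value to the centre, then average the four picks.

    #include <stdint.h>
    #include <stdlib.h>

    #define W 640
    #define H 480

    /* symmetric nearest neighbour (SNN) filter, 3x3 window */
    void snn3x3(const uint8_t in[H][W], uint8_t out[H][W])
    {
        /* the four symmetric neighbour pairs of the window */
        static const int dy[4] = {-1, -1, -1,  0};
        static const int dx[4] = {-1,  0,  1, -1};

        for (int y = 1; y < H - 1; y++) {
            for (int x = 1; x < W - 1; x++) {
                int c = in[y][x], sum = 0;
                for (int p = 0; p < 4; p++) {
                    int a = in[y + dy[p]][x + dx[p]];   /* one side   */
                    int b = in[y - dy[p]][x - dx[p]];   /* its mirror */
                    /* keep whichever is closer to the centre value */
                    sum += (abs(a - c) <= abs(b - c)) ? a : b;
                }
                out[y][x] = (uint8_t)(sum / 4);
            }
        }
    }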

  27. Outline • von Neumann overhead hits the memory wall • The manycore programming crisis • Reconfigurable Computing is the solution • We need a twin paradigm approach • Conclusions

  28. Apropos compilation: Software / Configware Co-Compilation: the CoDe-X co-compiler [Juergen Becker, 1996]. [Figure: C language source → Analyzer / Profiler → Partitioner → SW compiler (“vN” machine paradigm) yielding SW code, and CW compiler (anti machine paradigm) yielding CW code and FW code.] But we need a dual paradigm approach: to run legacy software together with configware. Reconfigurable Computing: the technology is ready. The users are not?
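
Conceptually (a hypothetical sketch of such a flow, not CoDe-X‘s actual implementation): the partitioner sends hot, regular loop kernels found by the analyzer/profiler to the configware compiler, and everything else down the ordinary software path.

    /* schematic of a software/configware co-compilation decision
     * (illustrative only; names and threshold are invented)       */
    typedef enum { SW_PATH, CW_PATH } path_t;

    typedef struct {
        const char *name;
        double runtime_share;  /* measured by the analyzer/profiler */
        int is_loop_kernel;    /* regular loop, mappable to an rDPA */
    } region_t;

    path_t partition(const region_t *r)
    {
        if (r->is_loop_kernel && r->runtime_share > 0.10)
            return CW_PATH;    /* CW compiler, anti machine paradigm */
        return SW_PATH;        /* SW compiler, von Neumann paradigm  */
    }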

  29. The education wall: the main problem. Curricula from the mainframe age are (procedurally) one-sided and structurally disabled: non-von-Neumann accelerators are not really taught, and no common model is conveyed. The common model is ready, but the users are not. (This is not a lecture on brain regions.)

  30. We need a twin paradigm education: brain usage with both hemispheres. Each side needs its own common model: procedural and structural. (This is not a lecture on brain regions.)

  31. Teaching RC? RCeducation 2008: The 3rd International Workshop on Reconfigurable Computing Education, April 10, 2008, Montpellier, France. http://fpl.org/RCeducation/

  32. We need new courses • undergraduate lab courses with HW / CW / SW partitioning • new courses with extended scope on parallelism and algorithmic cleverness for HW / CW / SW co-design • “We urgently need a Mead-&-Conway-like text book” [R.H., Dagstuhl Seminar 03301, Germany, 2003]. 2007: here it is!

  33. Outline • von Neumann overhead hits the memory wall • The manycore programming crisis • Reconfigurable Computing is the solution • We need a twin paradigm approach • Conclusions

  34. Conclusions • We need to increase the population of HPC-competent people [B.S.] • We need to increase the population of RC-competent people [R.H.] • Data streaming is the key model of parallel computation, not vN • Von-Neumann-type instruction streams considered harmful [R.H.] • But we still need them for some small code sizes, old legacy software, etc. • The twin paradigm approach is inevitable, also in education [R.H.]

  35. An Open Question • Coarse-grained arrays: the technology is ready*, the users are not *) offered by startups (PACT Corp. and others) • Much closer to the programmer’s mind set: really much closer than FPGAs** **) “FPGAs? Do we need to learn hardware design?” • Which effect is delaying the break-through? Please reply.

  36. Thank you

  37. END


  39. Disruptive Development. The way the industry has grown up writing software (the languages we chose, the models of synchronization and orchestration) does not lead toward uncovering parallelism for allowing large-scale composition of big systems. [Iann Barron]

  40. Dual paradigm mind set: an old hat (mapping from the procedural to the structural domain). Software mind set, instruction-stream-based: flow chart → control instructions. Mapped into a hardware mind set: action box = flip-flop, decision box = (de)multiplexer. [Figure: a token bit evoking a chain of flip-flops (FF).] 1967: W. A. Clark: Macromodular Computer Systems; 1967 SJCC, AFIPS Conf. Proc. 1972: C. G. Bell et al.: The Description and Use of Register-Transfer Modules (RTMs); IEEE Trans. C-21/5, May 1972.
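
To illustrate the mapping (my sketch, not Clark‘s or Bell‘s notation): a one-bit token marches through flip-flops, one per action box, and a decision box becomes a demultiplexer steering the token into the then- or else-branch.

    #include <stdbool.h>

    /* flow-chart state, one flip-flop per action box: the set bit
     * is the "token" marking which box is currently active        */
    typedef struct { bool ff_a, ff_b, ff_c; } chart_t;

    /* one clock tick: action box A completes, then a decision box,
     * implemented as a demultiplexer, steers the token to B or C   */
    void tick(chart_t *s, bool cond)
    {
        bool token = s->ff_a;        /* token leaves action box A   */
        s->ff_a = false;
        s->ff_b = token && cond;     /* demux output 1: then-branch */
        s->ff_c = token && !cond;    /* demux output 0: else-branch */
    }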
