Signalling in the Heterogeneous Architecture Multiprocessor Paradigm

Presentation Transcript

  1. Signalling in the Heterogeneous Architecture Multiprocessor Paradigm
     Antonio Núñez, Victor Reyes, Tomás Bautista
     Keynote
     IUMA, Institute for Applied Microelectronics, ULPGC
     A. Nunez

  2. Index
     • MPSoC Architectures -> Hetero MPSoC
     • Communication Architectures -> Split Transport and Signalling Networks
     • Previous and Related work
     • Our SystemC Based Modelling Approach
     • Experiments
     • Conclusions

  4. Technological Forecasts
     • Moore's Law: the number of transistors per chip doubles every two years
     • ITRS: GALS NoC SoC MPSoC

  6. Processor to DRAM Performance Gap
     [Figure: performance (log scale, 1 to 1000) versus time (1980 to 2000). CPU ("Moore's Law") improves 60% per year, DRAM improves 7% per year; the processor-memory performance gap grows 50% per year.]
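The 50% figure follows from the two growth rates on the slide. A minimal sketch of that arithmetic, assuming both curves start from the same baseline (the function name is illustrative, not from the presentation):

```python
# Annual improvement rates quoted on the slide.
CPU_GROWTH = 0.60   # processor performance: +60% per year
DRAM_GROWTH = 0.07  # DRAM performance: +7% per year


def gap_after(years: int) -> float:
    """Ratio of CPU to DRAM performance after `years`, both starting at 1."""
    return ((1 + CPU_GROWTH) / (1 + DRAM_GROWTH)) ** years


# Relative growth of the gap in one year: 1.60 / 1.07 - 1, roughly 50%.
annual_gap_growth = gap_after(1) - 1

if __name__ == "__main__":
    print(f"gap grows {annual_gap_growth:.1%} per year")
    print(f"gap after 10 years: {gap_after(10):.1f}x")
```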

  7. Logic to Memory Area Gap

  8. Logic to Productivity Gap

  9. -> Platform based design -> Communication architectures

  10. Index
      • MPSoC Architectures -> Hetero MPSoC
      • Communication Architectures -> Split Transport and Signalling Networks
      • Previous and Related work
      • Our SystemC Based Modelling Approach
      • Experiments
      • Conclusions

  11. Processor Architecture Paradigms
      Cfr. Ungerer et al, Patterson et al, Tenhunen et al, Computer special issue
      • Processor/Memory/Switch
      • Processor-, Memory-, Communications-dominated systems
      • Communications architecture
      • Processor-Mono: Speed-up of a single-threaded application
        • Advanced superscalar
        • Trace Cache
        • Superspeculative
        • Multiscalar processors
      • Processor-Multi: Speed-up of multi-threaded applications
        • Simultaneous multithreading (SMT)
        • Chip multiprocessors (CMPs)
      • Memory, Processor-in-Memory, IRAM, others
      • Network on Chip
      • Homo
      • Hetero
      Slide annotations: Patt, Sohi…; Many..; Patterson; Mihal, Tenhunen, Goossens

  12. Monoprocessor: Superflow Processor
      • Fine granularity, data word
      • The Superflow processor speculates on
        • instruction flow: two-phase branch predictor combined with trace cache
        • register data flow: dependence prediction: predict the register value dependence between instructions
          • source operand value prediction
          • constant value prediction
          • value stride prediction: speculate on constant, incremental increases in operand values
          • dependence prediction predicts inter-instruction dependences
        • memory data flow: prediction of load values, of load addresses and alias prediction
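Value stride prediction can be illustrated with a small software model. This is a sketch only, assuming a single-entry predictor that guesses "last value plus last observed stride"; the class and function names are hypothetical and do not describe the Superflow hardware. Constant value prediction falls out as the special case of stride 0.

```python
class StridePredictor:
    """Predicts the next operand value as last value + last observed stride."""

    def __init__(self):
        self.last = None   # last value seen, None before the first update
        self.stride = 0    # difference between the two most recent values

    def predict(self):
        """Return the predicted next value, or None if nothing was seen yet."""
        return None if self.last is None else self.last + self.stride

    def update(self, actual):
        """Train the predictor with the value that actually occurred."""
        if self.last is not None:
            self.stride = actual - self.last
        self.last = actual


def hit_rate(values):
    """Fraction of values in the sequence that the predictor guessed exactly."""
    predictor = StridePredictor()
    hits = 0
    for value in values:
        if predictor.predict() == value:
            hits += 1
        predictor.update(value)
    return hits / len(values)


if __name__ == "__main__":
    print(hit_rate([100, 104, 108, 112, 116]))  # constant stride of 4
    print(hit_rate([7, 7, 7, 7]))               # constant value (stride 0)
```

The predictor needs two observations before a non-zero stride can be guessed, so the first values of an incrementing sequence always miss.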

  13. Com-arch in Superflow Processor

  14. Multiscalar Processors
      • A program is represented as a control flow graph (CFG), where basic blocks are nodes, and arcs represent flow of control.
      • A multiscalar processor walks through the CFG speculatively, taking task-sized steps, without pausing to inspect any of the instructions within a task.
      • The tasks are distributed to a number of parallel PEs within a processor.
      • Each PE fetches and executes instructions belonging to its assigned task.
      • The primary constraint: it must preserve the sequential program semantics.
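The speculative walk over the CFG can be sketched as follows. The graph, the "always take the first successor" predictor and the round-robin assignment to PEs are all assumptions made for illustration; a real multiscalar sequencer also has to detect mispredicted tasks and squash them to preserve sequential semantics, which this sketch omits.

```python
# Hypothetical CFG at task granularity: task -> possible successor tasks.
CFG = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": ["E"], "E": []}


def predict_next(task):
    """Toy predictor (assumption): always speculate on the first successor."""
    successors = CFG[task]
    return successors[0] if successors else None


def assign_tasks(entry, num_pes):
    """Walk the CFG in task-sized steps and hand each task to the next PE."""
    assignment = []
    task, pe = entry, 0
    while task is not None:
        assignment.append((pe, task))   # the PE fetches and executes this task
        task = predict_next(task)       # step without inspecting the task body
        pe = (pe + 1) % num_pes         # PEs are used in circular order
    return assignment


if __name__ == "__main__":
    for pe, task in assign_tasks("A", num_pes=4):
        print(f"PE {pe}: Task {task}")
```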

  15. Multiscalar mode of execution
      [Figure: CFG with nodes A to E; Task A on PE 0, Task B on PE 1, Task D on PE 2, Task E on PE 3; arrows labelled "Data values".]

  16. Com-arch in Multiscalar processor

  17. Multiscalar, Trace and Speculative Multithreaded Processors
      • Multiscalar: A program is statically partitioned into tasks which are marked by annotations of the CFG.
      • Trace Processor: Tasks are generated from traces of the trace cache.
      • Speculative multithreading: Tasks are otherwise dynamically constructed.
      • Common target: Increase single-thread program performance by dynamically exploiting thread-level speculation in addition to instruction-level parallelism.
      • A "thread" means a "HW thread"

  18. Multis: Additional utilization of more coarse-grained parallelism
      • CMPs: Chip multiprocessors or multiprocessor chips
        • integrate two or more complete processors on a single chip,
        • every functional unit of a processor is duplicated.
      • SMTs: Simultaneous multithreaded processors
        • store multiple contexts in different register sets on the chip,
        • the functional units are multiplexed between the threads,
        • instructions of different contexts are simultaneously executed.

  19. CMPs-Homo: Com-arch by shared global memory
      [Figure: four processors on a shared global memory, no caches. Legend: Primary Cache, Secondary Cache, Global Memory.]

  20. CMPs-Homo: Com-arch by shared primary cache
      [Figure: four processors sharing a primary cache, above a secondary cache and global memory.]

  21. CMPs-Homo: Com-arch by global memory, caches
      [Figure, two panels of four processors each, every processor with its own primary cache: "Shared caches and memory" and "Shared secondary cache".]

  22. Com-arch in Hydra: A Single-Chip Multiprocessor
      Centralized Bus Arbitration Mechanisms
      [Figure: a single chip with CPU 0 to CPU 3, each with a primary I-cache, a primary D-cache and a memory controller. Other labels: On-chip Secondary Cache, Rambus Memory Interface, DRAM Main Memory, Off-chip L3 Interface, Cache SRAM Array, I/O Bus Interface, DMA, I/O Device.]

  23. Engines Engines CMPs-Hetero: Communications Architecture • Architectures found in today’s heterogeneous processors for platform based design • E.gr. CPU cores, AMBA buses, internal/external shared memories AMBA Bus RISC Core Internal/ External Memory External I/O Shared Bus A. Nunez

  24. CMPs-Hetero: Communications Architecture, Arbiters

  25. Multithreaded Processors
      • Aim: Latency tolerance
      • What is the problem? Load access latencies measured on an Alpha Server 4100 SMP with four Alpha 21164 processors are:
        • 7 cycles for a primary cache miss which hits in the on-chip L2 cache of the 21164 processor,
        • 21 cycles for an L2 cache miss which hits in the L3 (board-level) cache,
        • 80 cycles for a miss that is served by the memory, and
        • 125 cycles for a dirty miss, i.e., a miss that has to be served from another processor's cache memory.
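To see why these latencies matter, one can weight them by how often each case occurs. The latencies below are the ones quoted on the slide; the mix of miss types is a purely hypothetical assumption chosen for illustration, not measured data.

```python
# Load latencies in cycles, as quoted on the slide (Alpha Server 4100).
LATENCY = {"l2_hit": 7, "l3_hit": 21, "memory": 80, "dirty_miss": 125}

# Hypothetical split of primary-cache misses (assumption, not measured data).
MISS_MIX = {"l2_hit": 0.70, "l3_hit": 0.20, "memory": 0.08, "dirty_miss": 0.02}


def mean_miss_latency(latency, mix):
    """Expected latency in cycles of one primary-cache miss under `mix`."""
    if abs(sum(mix.values()) - 1.0) > 1e-9:
        raise ValueError("miss fractions must sum to 1")
    return sum(latency[kind] * fraction for kind, fraction in mix.items())


if __name__ == "__main__":
    cycles = mean_miss_latency(LATENCY, MISS_MIX)
    print(f"mean latency per primary-cache miss: {cycles:.1f} cycles")
```

Under this assumed mix, the two rarest cases (memory and dirty misses) contribute about half of the expected stall time, which is the kind of latency a multithreaded processor tries to overlap with work from other threads.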

  26. Multithreading
      • Multithreading: the ability to pursue two or more threads of control in parallel within a processor pipeline.
      • Advantage: The latencies that arise in the computation of a single instruction stream are filled by computations of another thread.
      • Multithreaded processors are able to bridge latencies by switching to another thread of control, in contrast to chip multiprocessors.

  27. Approaches of Multithreaded Processors
      • Cycle-by-cycle interleaving
        • An instruction of another thread is fetched and fed into the execution pipeline at each processor cycle.
      • Block interleaving
        • The instructions of a thread are executed successively until an event occurs that may cause latency. This event induces a context switch.
      • Simultaneous multithreading (SMT)
        • Instructions are simultaneously issued from multiple threads to the FUs of a superscalar processor.
        • Combines wide-issue superscalar instruction issue with multithreading.
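The first two approaches differ only in when the pipeline switches threads. The sketch below is a simplified issue-order model under stated assumptions: each thread is a string of instructions, "M" marks an instruction that may cause latency (for example a cache miss), and actual latency durations and context-switch costs are not modelled.

```python
def cycle_by_cycle(num_threads, cycles):
    """Thread id issued in each cycle: a different thread every cycle."""
    return [cycle % num_threads for cycle in range(cycles)]


def block_interleaving(streams):
    """Issue order as (thread, instruction) pairs.

    A thread keeps issuing until it issues an 'M' instruction, which
    triggers a context switch to the next thread in circular order.
    """
    position = [0] * len(streams)
    remaining = sum(len(stream) for stream in streams)
    schedule = []
    current = 0
    while remaining:
        if position[current] >= len(streams[current]):
            current = (current + 1) % len(streams)   # thread finished
            continue
        instruction = streams[current][position[current]]
        position[current] += 1
        remaining -= 1
        schedule.append((current, instruction))
        if instruction == "M":                       # latency event
            current = (current + 1) % len(streams)   # context switch
    return schedule


if __name__ == "__main__":
    print(cycle_by_cycle(num_threads=3, cycles=6))
    print(block_interleaving(["xxMx", "xMx"]))
```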

  28. Multithreading versus Non-Multithreading Approaches
      [Figure: time (processor cycles) on the vertical axis, with context switches marked.]
      (a) single-threaded scalar
      (b) cycle-by-cycle interleaving multithreaded scalar
      (c) block interleaving multithreaded scalar

  29. Simultaneous Multithreading (SMT) and Chip Multiprocessors (CMP)
      [Figure: issue slots versus time (processor cycles).]
      (a) SMT
      (b) CMP

  30. Combining SMT and Multimedia
      • Start with a wide-issue superscalar general-purpose processor
      • Enhance by simultaneous multithreading
      • Enhance by multimedia unit(s)
      • Enhance by on-chip RAM memory for constants and local variables

  31. The SMT Multimedia Processor

  32. IPC of Maximum Processor Models

  33. Combining CMP-hetero and Multimedia
      • Start with a general-purpose processor
      • Enhance by hierarchical-bus com-arch
      • Enhance by hardware accelerators and coprocessors including multimedia unit(s)
      • Enhance by on-chip RAM memories for constants, local variables, frames…

  34. Real implementation example: Philips Eclipse architecture instance for video coding

  35. CMP or SMT?
      • The performance race between SMT and CMP is not yet decided.
      • CMP is easier to implement, but only SMT has the ability to hide latencies.
      • A functional partitioning is not easily reached within an SMT processor due to the centralized instruction issue.
        • A separation of the thread queues is a possible solution, although it does not remove the central instruction issue.
      • A combination of simultaneous multithreading with the CMP may be superior.
      • Research: combine SMT or CMP organization with the ability to create threads with compiler support or fully dynamically out of a single thread
        • thread-level speculation
        • close to multiscalar

  36. Processor-in-Memory
      • Technological trends have produced a large and growing gap between processor speed and DRAM access latency.
      • Today, it takes dozens of cycles for data to travel between the CPU and main memory.
      • CPU-centric design philosophy has led to very complex superscalar processors with deep pipelines.
      • Much of this complexity is devoted to hiding memory access latency.
      • Memory wall: the phenomenon that access times are increasingly limiting system performance.
      • Memory-centric design is envisioned for the future

  37. PIM or Intelligent RAM (IRAM)
      • PIM (processor-in-memory) or IRAM (intelligent RAM) approaches couple processor execution with large, high-bandwidth, on-chip DRAM banks.
      • PIM or IRAM merge processor and memory into a single chip.
      • Advantages:
        • The processor-DRAM gap in access speed will increase in the future. PIM provides higher bandwidth and lower latency for (on-chip) memory accesses.
        • DRAM can accommodate 30 to 50 times more data than the same chip area devoted to caches.
        • On-chip memory may be treated as main memory, in contrast to a cache, which is just a redundant memory copy.
        • PIM decreases energy consumption in the memory system due to the reduction of off-chip accesses.
      • VIRAM, CODE

  38. V-IRAM-2: 0.13 µm, Fast Logic, 1GHz
      16 GFLOPS(64b)/64 GOPS(16b)/128MB
      [Figure labels: 2-way Superscalar Processor, 8K I cache, 8K D cache, Vector Instruction Queue, Vector Registers, Load/Store, functional units (+, x, ÷) configured as 8 x 64 or 16 x 32 or 32 x 16, I/O, Serial I/O, Memory Crossbar Switch, memory banks (M) on 8 x 64 paths.]

  39. NoC Processor Architecture
      • Network-on-chip, specialized PEs, advanced interconnect technologies
      • Will use packet network architectures in 2010
      [Figure labels: DSP, PE Array, On-Chip Memory, PE, External Memory, Switch Node, External I/O, Packet Network Controller]

  40. NoC Mescal Communication Architecture General Paradigm
      • Mescal Communication Architecture is a general, coarse-grained on-chip interconnection scheme for various system components such as Processing Elements, memory and other communicating elements.
      [Figure labels: Processing Element, PE, switch, bridge, $, MEM]

  41. NoC Mescal Abstract System Architecture

  42. NoC Communication Architecture

  43. NoC: Example for a bus

  44. Index
      • MPSoC Architectures -> Hetero MPSoC
      • Communication Architectures -> Split Transport and Signalling Networks
      • Previous and Related work
      • Our SystemC Based Modelling Approach
      • Experiments
      • Conclusions

  45. Today's Communication Architecture Paradigms: Topology
      • Single and Shared Transport and Signalling Channel
        • p2p
        • Bus
        • Hierarchical bus
        • Switch
          • Crossbar
          • Multistage…
        • Ring
        • Trees
        • Network
          • Circuit switched
          • Packet switched without connection
          • Packet switched with connection..

  46. Today's Communication Architecture Paradigms: Topology
      • Split Transport and Signalling
        • Transport
          • Topology (bus, h-bus, switch, ring, network…)
        • Signalling (addresses and routing, services, synchronization)
          • Associated channel
            • Topology
          • Common channel
            • Topology…
      • Protocol layer stack: the software and process view of the generation of hardware signalling requires mapping onto actual interfaces
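One way to picture the split is as two independent channels: a signalling channel that carries only control information (addresses, request and acknowledge messages) and a transport channel that carries only payload. The sketch below is an illustrative model under that assumption; the class names, message kinds and FIFO behaviour are hypothetical and are not the authors' SystemC model.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Signal:
    """Control message carried on the signalling channel only."""
    src: int
    dst: int
    kind: str        # "REQ" or "ACK" in this sketch
    length: int = 0  # number of payload words announced by a REQ


class SplitInterconnect:
    """Toy interconnect with separate signalling and transport channels."""

    def __init__(self):
        self.signalling = deque()  # control messages
        self.transport = deque()   # payload words, no addressing information

    def send(self, src, dst, payload):
        """Announce the transfer on signalling, move the data on transport."""
        self.signalling.append(Signal(src, dst, "REQ", len(payload)))
        self.transport.extend(payload)

    def receive(self):
        """Consume one request, fetch its payload, and acknowledge it."""
        request = self.signalling.popleft()
        data = [self.transport.popleft() for _ in range(request.length)]
        self.signalling.append(Signal(request.dst, request.src, "ACK"))
        return request, data


if __name__ == "__main__":
    net = SplitInterconnect()
    net.send(src=0, dst=2, payload=[10, 20, 30])
    request, data = net.receive()
    print(request, data, net.signalling[0])
```

Because the two channels are separate objects, each could be given its own topology (for example a bus for signalling and a switch for transport), which is the design freedom the slide points at.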

  47. Today's Communications Architecture Paradigms: Bandwidth
      • Application Granularity
      • Transport Granularity
        • Fine grain
        • Medium grain
        • Coarse grain
        • Bus sizes, transfer sizes
      • Traffic Characterization
        • E.g. streaming, burstiness, interval requests, space-time distribution
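Burstiness can be quantified in several ways. The sketch below uses the peak-to-mean ratio of requests per time window, which is one common choice and an assumption here, since the slide does not name a metric. Request timestamps are integer cycle numbers smaller than `horizon`, and `horizon` is assumed to be a multiple of `window`.

```python
def burstiness(timestamps, window, horizon):
    """Peak-to-mean ratio of request counts per window of `window` cycles.

    A value of 1.0 means perfectly even traffic; larger values mean the
    requests are concentrated in a few windows.
    """
    counts = [0] * (horizon // window)
    for t in timestamps:
        counts[t // window] += 1
    mean = sum(counts) / len(counts)
    if mean == 0:
        raise ValueError("no requests in the trace")
    return max(counts) / mean


if __name__ == "__main__":
    streaming = list(range(0, 40, 2))   # one request every 2 cycles
    bursty = [0, 1, 2, 3, 4, 5, 6, 7]   # all requests in the first window
    print(burstiness(streaming, window=10, horizon=40))
    print(burstiness(bursty, window=10, horizon=40))
```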

  48. Today's Communications Architecture Paradigms: Protocols
      • Protocols
        • High level signalling primitives mapping
        • Communications to architecture mapping
        • Access policies mapping, priorities, static, dynamic
        • Traffic and flow control
          • Burstiness
          • Request Intervals
          • Concurrency
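The difference between static and dynamic access policies can be shown with two small bus arbiters. Both are generic textbook schemes written as a sketch, not the arbiters used in the presentation; the names are illustrative. `requests` is a list of booleans indexed by master id.

```python
def static_priority(requests):
    """Static policy: the lowest-numbered requesting master always wins."""
    for master, requesting in enumerate(requests):
        if requesting:
            return master
    return None


class RoundRobinArbiter:
    """Dynamic policy: priority rotates so the last winner becomes lowest."""

    def __init__(self, num_masters):
        self.num_masters = num_masters
        self.last = num_masters - 1   # so that master 0 is checked first

    def grant(self, requests):
        """Return the granted master id, or None if nobody is requesting."""
        for offset in range(1, self.num_masters + 1):
            master = (self.last + offset) % self.num_masters
            if requests[master]:
                self.last = master
                return master
        return None


if __name__ == "__main__":
    requests = [True, True, False]    # masters 0 and 1 request every cycle
    arbiter = RoundRobinArbiter(3)
    print([static_priority(requests) for _ in range(3)])
    print([arbiter.grant(requests) for _ in range(3)])
```

With both masters requesting continuously, the static policy starves master 1, while the round-robin policy alternates grants between the two.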

  49. Today's Communications Architecture Paradigms: Signalling
      • Addressing, routing info
      • Service info
      • Hand-shake and command sync strobes
      • High level signalling primitives mapping
      • Communications to architecture mapping
      • Access policies mapping, priorities, static, dynamic
      • Traffic and flow control
        • Burstiness
        • Request Intervals
        • Concurrency
        • Streaming ...

  50. Com-arch Modelling: Ptolemy-Mescal
      UC Berkeley Ptolemy I & II, Mescal, UCSD-Dey, PR-Vissers, Goossens, Lippen.., TIMA-Jerraya..
      • Components for channels:
        • Synchronous digital bus (shared or point-to-point)
        • ARM AMBA bus
        • IBM CoreConnect bus
        • Analog channel
      • Actors encapsulate the physical layer
      • Each actor has a common interface to make experimentation possible
      • The Ptolemy actor interface is at a higher level than the channel's actual electrical interface
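The idea of a common actor interface can be sketched with an abstract base class: the experiment code talks only to the interface, so channel models can be swapped freely. This is an illustrative Python sketch with hypothetical names; it is not the Ptolemy or Mescal API, and the two channel models are deliberately trivial.

```python
from abc import ABC, abstractmethod
from collections import deque


class ChannelActor(ABC):
    """Common interface hiding each channel's physical layer (hypothetical)."""

    @abstractmethod
    def put(self, src, dst, word):
        """Send one word from `src` to `dst`."""

    @abstractmethod
    def get(self, dst):
        """Return the next word for `dst`, or None if nothing is pending."""


class PointToPoint(ChannelActor):
    """Dedicated link: a single FIFO, addresses are ignored."""

    def __init__(self):
        self.fifo = deque()

    def put(self, src, dst, word):
        self.fifo.append(word)

    def get(self, dst):
        return self.fifo.popleft() if self.fifo else None


class SharedBus(ChannelActor):
    """Shared medium: words are queued per destination address."""

    def __init__(self):
        self.pending = {}

    def put(self, src, dst, word):
        self.pending.setdefault(dst, deque()).append(word)

    def get(self, dst):
        queue = self.pending.get(dst)
        return queue.popleft() if queue else None


def run_experiment(channel):
    """Same experiment for any channel model: send three words, read them."""
    for word in (1, 2, 3):
        channel.put(src=0, dst=1, word=word)
    return [channel.get(dst=1) for _ in range(3)]


if __name__ == "__main__":
    for channel in (PointToPoint(), SharedBus()):
        print(type(channel).__name__, run_experiment(channel))
```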