
Introduction to Many-Core Architectures

Introduction to Many-Core Architectures. Henk Corporaal, www.ics.ele.tue.nl/~heco. ASCI Winterschool on Embedded Systems, Soesterberg, March 2010.




Presentation Transcript


  1. Introduction to Many-Core Architectures Henk Corporaal www.ics.ele.tue.nl/~heco ASCI Winterschool on Embedded Systems Soesterberg, March 2010

  2. Intel Trends (K. Olukotun) [chart: single-thread trends flattening at roughly 3 GHz and 100 W; Core i7 marked] Henk Corporaal

  3. System-level integration (Chuck Moore, AMD at MICRO 2008) • Single-chip CPU Era: 1986–2004 • Extreme focus on single-threaded performance • Multi-issue, out-of-order execution plus moderate cache hierarchy • Chip Multiprocessor (CMP) Era: 2004–2010 • Early: Hasty integration of multiple cores into same chip/package • Mid-life: Address some of the HW scalability and interference issues • Current: Homogeneous CPUs plus moderate system-level functionality • System-level Integration Era: ~2010 onward • Integration of substantial system-level functionality • Heterogeneous processors and accelerators • Introspective control systems for managing on-chip resources & events Henk Corporaal

  4. Why many core? • Running into • Frequency wall • ILP wall • Memory wall • Energy wall • Chip area enabler: Moore's law goes well below 22 nm • What to do with all this area? • Multiple processors fit easily on a single die • Application demands • Cost effective (just connect existing processors or processor cores) • Low power: parallelism may allow lowering Vdd • Performance/Watt is the new metric !! Henk Corporaal

  5. Low power through parallelism • Sequential processor: switching capacitance C, frequency f, voltage V → P1 = f·C·V² • Parallel processor (two times the number of units): switching capacitance 2C, frequency f/2, voltage V′ < V → P2 = (f/2)·2C·V′² = f·C·V′² < P1 Henk Corporaal
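The power argument on this slide is easy to sanity-check with a few lines of arithmetic. The concrete frequency, capacitance and voltage values below are illustrative assumptions, not numbers from the slides:

```python
# Dynamic power: P = f * C * V^2 (frequency f, switching capacitance C, voltage V)
def dynamic_power(f, c, v):
    return f * c * v ** 2

f, c, v = 3e9, 1e-9, 1.2                 # assumed example values
p1 = dynamic_power(f, c, v)              # sequential processor

v_low = 0.9                              # parallel version can lower Vdd: V' < V
p2 = dynamic_power(f / 2, 2 * c, v_low)  # two units at half the frequency

print(p2 / p1)                           # ratio reduces to (V'/V)^2
```

The frequency halving and capacitance doubling cancel, so the entire saving comes from the lower supply voltage, which is exactly why "parallelism may allow lowering Vdd" matters.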

  6. How low Vdd can we go? • Subthreshold JPEG encoder • Vdd 0.4 – 1.2 Volt Henk Corporaal

  7. Computational efficiency: how many MOPS/Watt? Yifan He et al., DAC 2010 Henk Corporaal

  8. Computational efficiency: what do we need? [chart: power efficiency (MOPS/mW) vs. power (Watts), with 1, 10, 100 and 1000 MOPS/mW contours; IBM Cell, SODA (90 nm), SODA (65 nm), Imagine, VIRAM, Pentium M and TI C6X plotted against the requirements of 3G wireless, 4G wireless and Mobile HD Video] Woh et al., ISCA 2009 Henk Corporaal

  9. Intel's opinion: 48-core x86 Henk Corporaal

  10. Outline • Classifications of Parallel Architectures • Examples • Various (research) architectures • GPUs • Cell • Intel multi-cores • How much performance do you really get? Roofline model • Trends & Conclusions Henk Corporaal

  11. Classifications • Performance / parallelism driven: • 4-5 D • Flynn • Communication & Memory • Message passing / Shared memory • Shared memory issues: coherency, consistency, synchronization • Interconnect Henk Corporaal

  12. Flynn's Taxonomy • SISD (Single Instruction, Single Data) • Uniprocessors • SIMD (Single Instruction, Multiple Data) • Vector architectures also belong to this class • Multimedia extensions (MMX, SSE, VIS, AltiVec, …) • Examples: Illiac-IV, CM-2, MasPar MP-1/2, Xetal, IMAP, Imagine, GPUs, … • MISD (Multiple Instruction, Single Data) • Systolic arrays / stream based processing • MIMD (Multiple Instruction, Multiple Data) • Examples: Sun Enterprise 5000, Cray T3D/T3E, SGI Origin • Flexible • Most widely used Henk Corporaal

  13. Flynn's Taxonomy Henk Corporaal

  14. Enhance performance: 4 architecture methods • (Super)-pipelining • Powerful instructions • MD-technique • multiple data operands per operation • MO-technique • multiple operations per instruction • Multiple instruction issue • Single stream: Superscalar • Multiple streams • Single core, multiple threads: Simultaneous Multi-Threading • Multiple cores Henk Corporaal

  15. Architecture methods: Pipelined Execution of Instructions • Purpose of pipelining: • Reduce #gate_levels in critical path • Reduce CPI close to one (instead of a large number for the multicycle machine) • More efficient hardware • Problems • Hazards: pipeline stalls • Structural hazards: add more hardware • Control hazards, branch penalties: use branch prediction • Data hazards: bypassing required [diagram: simple 5-stage pipeline; four instructions flow through the stages in successive cycles] IF: Instruction Fetch DC: Instruction Decode RF: Register Fetch EX: Execute instruction WB: Write Result Register Henk Corporaal
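The effect of the hazards listed above on CPI can be sketched with a small calculation. The instruction mix, stall rates and penalties below are assumed for illustration only:

```python
# Effective CPI of a pipelined machine: the ideal CPI of 1 plus stall
# cycles caused by hazards (all numbers below are illustrative).
ideal_cpi = 1.0
branch_freq, mispredict_rate, branch_penalty = 0.2, 0.1, 2  # control hazards
load_freq, load_use_rate, load_use_stall = 0.3, 0.25, 1     # data hazards

cpi = (ideal_cpi
       + branch_freq * mispredict_rate * branch_penalty
       + load_freq * load_use_rate * load_use_stall)
print(cpi)   # 1 + 0.04 + 0.075 = 1.115
```

Even modest hazard rates push CPI above the ideal of one, which is why branch prediction and bypassing are worth their hardware cost.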

  16. Architecture methods: Pipelined Execution of Instructions • Superpipelining: • Split one or more of the critical pipeline stages • Superpipelining degree S: S(architecture) = Σ Op∈I_set f(Op) · lt(Op) where: f(Op) is the frequency of operation Op, lt(Op) is the latency of operation Op Henk Corporaal
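The superpipelining degree S is just the average operation latency weighted by how often each operation occurs in the instruction set. A small sketch, with an assumed operation mix:

```python
# Superpipelining degree S = sum over the instruction set of f(Op) * lt(Op).
# The frequencies and latencies below are an assumed example mix.
op_mix = {            # Op: (frequency f(Op), latency lt(Op) in cycles)
    "alu":    (0.50, 1),
    "load":   (0.25, 2),
    "store":  (0.10, 1),
    "branch": (0.15, 2),
}
S = sum(f * lt for f, lt in op_mix.values())
print(S)   # 0.5 + 0.5 + 0.1 + 0.3 = 1.4
```

An unpipelined machine with single-cycle operations has S = 1; splitting critical stages raises the latencies lt(Op) in cycles and hence S.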

  17. Architecture methods: Powerful Instructions (1) • MD-technique • Multiple data operands per operation • SIMD: Single Instruction Multiple Data Vector instruction: for (i = 0; i < 64; i++) c[i] = a[i] + 5*b[i]; or c = a + 5*b Assembly: set vl,64 ldv v1,0(r2) mulvi v2,v1,5 ldv v1,0(r1) addv v3,v1,v2 stv v3,0(r3) Henk Corporaal
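What the five vector instructions compute, element by element, can be written out as a pure-Python sketch of their semantics (the input vectors are arbitrary example data):

```python
# Semantics of the vector sequence for c = a + 5*b, vector length vl = 64.
vl = 64
a = list(range(vl))                      # example inputs (assumed)
b = [2 * i for i in range(vl)]

v1 = b[:vl]                              # ldv   v1, 0(r2)
v2 = [5 * x for x in v1]                 # mulvi v2, v1, 5
v1 = a[:vl]                              # ldv   v1, 0(r1)
v3 = [x + y for x, y in zip(v1, v2)]     # addv  v3, v1, v2
c = v3                                   # stv   v3, 0(r3)

assert c == [a[i] + 5 * b[i] for i in range(vl)]
```

One vector instruction replaces 64 scalar iterations, which is the dense encoding the next slide refers to.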

  18. Architecture methods: Powerful Instructions (1) • SIMD computing • All PEs (Processing Elements) execute same operation • Typical mesh or hypercube connectivity • Exploit data locality of e.g. image processing applications • Dense encoding (few instruction bits needed) [diagram: SIMD execution method; over time, PE1 … PEn all execute instruction 1, 2, 3, …, n in lockstep] Henk Corporaal

  19. Architecture methods: Powerful Instructions (1) • Sub-word parallelism • SIMD on restricted scale: • Used for Multi-media instructions • Examples • MMX, SSE, SUN-VIS, HP MAX-2, AMD-K7/Athlon 3Dnow, Trimedia II • Example: Σ i=1..4 |ai - bi| Henk Corporaal
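The example Σ i=1..4 |ai - bi| is a sum of absolute differences: the kind of kernel that sub-word SIMD instructions in the MMX/SSE family accelerate by processing the four packed sub-words in one instruction. A scalar sketch of what is computed (values are illustrative):

```python
# Sum of absolute differences over four sub-words: sum_{i=1..4} |a_i - b_i|.
# A sub-word SIMD instruction computes all four differences in parallel.
a = [10, 200, 30, 7]
b = [12, 180, 33, 9]
sad = sum(abs(x - y) for x, y in zip(a, b))
print(sad)   # 2 + 20 + 3 + 2 = 27
```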

  20. Architecture methods: Powerful Instructions (2) • MO-technique: multiple operations per instruction • Two options: • CISC (Complex Instruction Set Computer) • VLIW (Very Long Instruction Word) [VLIW instruction example, one field per FU: FU 1: sub r8,r5,3 | FU 2: and r1,r5,12 | FU 3: mul r6,r5,r2 | FU 4: ld r3,0(r5) | FU 5: bnez r5,13] Henk Corporaal

  21. VLIW architecture: central Register File [diagram: one central register file feeding exec units 1–9, grouped into issue slots 1–3] Q: How many ports does the register file need for n-issue? Henk Corporaal
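A rough answer to the slide's question: with two source operands and one result per operation, an n-issue machine with a single central register file needs 2n read ports and n write ports. A minimal sketch:

```python
# Register-file port count for an n-issue VLIW with one central register
# file, assuming each operation reads 2 operands and writes 1 result.
def rf_ports(n_issue, reads_per_op=2, writes_per_op=1):
    return n_issue * reads_per_op, n_issue * writes_per_op

reads, writes = rf_ports(8)   # e.g. an 8-issue machine
print(reads, writes)          # 16 read ports, 8 write ports
```

Port count grows linearly with issue width, and register-file area and energy grow roughly quadratically with port count, which is the usual argument for clustered register files in wide VLIWs.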

  22. Architecture methods: Multiple instruction issue (per cycle) • Who guarantees semantic correctness? • Can instructions be executed in parallel? • User: specifies multiple instruction streams • Multi-processor: MIMD (Multiple Instruction Multiple Data) • HW: Run-time detection of ready instructions • Superscalar • Compiler: Compile into dataflow representation • Dataflow processors Henk Corporaal

  23. Four-dimensional representation of the architecture design space <I, O, D, S> [figure: axes Instructions/cycle 'I', Operations/instruction 'O', Data/operation 'D' and Superpipelining Degree 'S'; CISC, RISC, Superscalar, Superpipelined, VLIW, Vector, SIMD, MIMD and Dataflow architectures are placed along these axes] Henk Corporaal

  24. Architecture design space: example values of <I, O, D, S> for different architectures. Mpar = I·O·D·S: you should exploit this amount of parallelism!

  Architecture   I    O    D    S    Mpar
  CISC           0.2  1.2  1.1  1    0.26
  RISC           1    1    1    1.2  1.2
  VLIW           1    10   1    1.2  12
  Superscalar    3    1    1    1.2  3.6
  SIMD           1    1    128  1.2  154
  MIMD           32   1    1    1.2  38
  GPU            32   2    8    24   12288
  Top500 Jaguar  ???

  Henk Corporaal
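The Mpar column follows directly from the four dimensions; reproducing it makes the scale of the GPU entry concrete:

```python
# Mpar = I * O * D * S, computed from the slide's example <I, O, D, S> values.
archs = {
    # name: (I, O, D, S)
    "CISC":        (0.2, 1.2, 1.1, 1),
    "RISC":        (1, 1, 1, 1.2),
    "VLIW":        (1, 10, 1, 1.2),
    "Superscalar": (3, 1, 1, 1.2),
    "SIMD":        (1, 1, 128, 1.2),
    "MIMD":        (32, 1, 1, 1.2),
    "GPU":         (32, 2, 8, 24),
}
mpar = {name: i * o * d * s for name, (i, o, d, s) in archs.items()}
print(mpar["GPU"])   # 12288: the parallelism needed to keep a GPU busy
```

Four orders of magnitude separate a RISC core from a GPU, which is why an application must expose massive parallelism before a GPU pays off.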

  25. Communication • Parallel Architecture extends traditional computer architecture with a communication network • abstractions (HW/SW interface) • organizational structure to realize abstraction efficiently [diagram: processing nodes connected by a communication network] Henk Corporaal

  26. Communication models: Shared Memory [diagram: processes P1 and P2 both read and write one shared memory] • Coherence problem • Memory consistency issue • Synchronization problem Henk Corporaal

  27. Communication models: Shared memory • Shared address space • Communication primitives: • load, store, atomic swap • Two varieties: • Physically shared => Symmetric Multi-Processors (SMP) • usually combined with local caching • Physically distributed => Distributed Shared Memory (DSM) Henk Corporaal

  28. SMP: Symmetric Multi-Processor • Memory: centralized with uniform access time (UMA) and bus interconnect, I/O • Examples: Sun Enterprise 6000, SGI Challenge, Intel [diagram: four processors, each with one or more cache levels, connected to main memory and the I/O system; the interconnect can be 1 bus, N busses, or any network] Henk Corporaal

  29. DSM: Distributed Shared Memory • Nonuniform access time (NUMA) and scalable interconnect (distributed memory) [diagram: processors, each with a cache and a local memory, connected by an interconnection network together with main memory and the I/O system] Henk Corporaal

  30. Shared Address Model Summary • Each processor can name every physical location in the machine • Each process can name all data it shares with other processes • Data transfer via load and store • Data size: byte, word, ... or cache blocks • Memory hierarchy model applies: • communication moves data to local proc. cache Henk Corporaal

  31. Three fundamental issues for shared memory multiprocessors • Coherence, about: Do I see the most recent data? • Consistency, about: When do I see a written value? • e.g. do different processors see writes at the same time (w.r.t. other memory accesses)? • Synchronization, about: How to synchronize processes? • how to protect access to shared data? Henk Corporaal
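The synchronization issue can be illustrated with a classic shared-counter example; Python threads stand in for the processors of a shared-memory machine:

```python
# Protecting shared data with a lock: without it, the read-modify-write
# on `counter` can race between threads sharing the same memory.
import threading

counter = 0
lock = threading.Lock()

def worker(n):
    global counter
    for _ in range(n):
        with lock:           # acquire/release around the critical section
            counter += 1

threads = [threading.Thread(target=worker, args=(10000,)) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(counter)               # 40000 with the lock held on every update
```

Coherence and consistency are the hardware's job; synchronization like this remains the programmer's, typically built on atomic primitives such as the atomic swap mentioned on slide 27.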

  32. Communication models: Message Passing [diagram: processes P1 and P2 exchanging messages through FIFOs via send and receive] • Communication primitives • e.g., send, receive library calls • standard MPI: Message Passing Interface • www.mpi-forum.org • Note that MP can be built on top of SM and vice versa! Henk Corporaal

  33. Message Passing Model • Explicit message send and receive operations • Send specifies local buffer + receiving process on remote computer • Receive specifies sending process on remote computer + local buffer to place data • Typically blocking communication, but may use DMA Message structure Header Data Trailer Henk Corporaal
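The send/receive pattern described above can be sketched with a FIFO between two threads; this is an MPI-style pattern modelled in plain Python, not the MPI API itself:

```python
# Message passing sketch: explicit send and blocking receive over a FIFO.
import queue, threading

channel = queue.Queue()       # the FIFO between sender and receiver

def sender():
    # send: package header + data from a local buffer
    channel.put({"header": "P1->P2", "data": [1, 2, 3]})

def receiver(result):
    msg = channel.get()       # receive: blocks until a message arrives
    result.append(sum(msg["data"]))

result = []
t_recv = threading.Thread(target=receiver, args=(result,))
t_send = threading.Thread(target=sender)
t_recv.start(); t_send.start()
t_send.join(); t_recv.join()
print(result[0])              # 6
```

Note the implicit synchronization: the receiver cannot proceed until the message exists, which is exactly the point made on slide 35.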

  34. Message passing communication [diagram: processors, each with cache, memory, DMA engine and network interface, connected by an interconnection network] Henk Corporaal

  35. Communication Models: Comparison • Shared-Memory: • Compatibility with well-understood language mechanisms • Ease of programming for complex or dynamic communication patterns • Shared-memory applications; sharing of large data structures • Efficient for small items • Supports hardware caching • Message Passing: • Simpler hardware • Explicit communication • Implicit synchronization (with any communication) Battle ongoing Henk Corporaal

  36. Interconnect • How to connect your cores? • Some options: • Connect everybody: • Single bus • Hierarchical bus • NoC • multi-hop via routers • any topology possible • easy 2D layout helps • Connect with e.g. neighbors only • e.g. using shift operation in SIMD • or using dual-ported memories to connect 2 cores Henk Corporaal

  37. Bus (shared) or Network (switched) [example diagram: NoC with 2x4 mesh routing network; each node is attached to a router R] • Network: • claimed to be more scalable • no bus arbitration • point-to-point connections • but router overhead Henk Corporaal

  38. Historical Perspective • Early machines were: • Collection of microprocessors. • Communication was performed using bi-directional queues between nearest neighbors. • Messages were forwarded by processors on path • “Store and forward” networking • There was a strong emphasis on topology in algorithms, in order to minimize the number of hops => minimize time Henk Corporaal

  39. Design Characteristics of a Network • Topology (how things are connected): • Crossbar, ring, 2-D and 3-D meshes or torus, hypercube, tree, butterfly, perfect shuffle, .... • Routing algorithm (path used): • Example in 2D torus: all east-west then all north-south (avoids deadlock) • Switching strategy: • Circuit switching: full path reserved for entire message, like the telephone. • Packet switching: message broken into separately-routed packets, like the post office. • Flow control and buffering (what if there is congestion): • Stall, store data temporarily in buffers • re-route data to other nodes • tell source node to temporarily halt, discard, etc. • QoS guarantees, Error handling, …., etc, etc. Henk Corporaal

  40. Switch / Network Topology • Topology determines: • Degree: number of links from a node • Diameter: max number of links crossed between nodes • Average distance: number of links to random destination • Bisection: minimum number of links that separate the network into two halves • Bisection bandwidth = link bandwidth * bisection Henk Corporaal

  41. Bisection Bandwidth • Bisection bandwidth: bandwidth across the smallest cut that divides the network into two equal halves • Bandwidth across the "narrowest" part of the network [diagram: example cuts through a network; one cut is not a bisection cut; for the cuts shown, bisection bw = link bw and bisection bw = sqrt(n) * link bw] • Bisection bandwidth is important for algorithms in which all processors need to communicate with all others Henk Corporaal

  42. Common Topologies (N = number of nodes, n = dimension)

  Type       Degree   Diameter         Ave Dist       Bisection
  1D mesh    2        N-1              N/3            1
  2D mesh    4        2(N^(1/2)-1)     2N^(1/2)/3     N^(1/2)
  3D mesh    6        3(N^(1/3)-1)     3N^(1/3)/3     N^(2/3)
  nD mesh    2n       n(N^(1/n)-1)     nN^(1/n)/3     N^((n-1)/n)
  Ring       2        N/2              N/4            2
  2D torus   4        N^(1/2)          N^(1/2)/2      2N^(1/2)
  Hypercube  log2 N   n = log2 N       n/2            N/2
  2D Tree    3        2 log2 N         ~2 log2 N      1
  Crossbar   N-1      1                1              N^2/2

  Henk Corporaal
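A few rows of the topology table can be spot-checked with concrete node counts; this sketch evaluates the degree, diameter and bisection formulas for three of the topologies:

```python
# Degree, diameter and bisection width for a few topologies,
# following the table's formulas (N = number of nodes).
import math

def ring(N):
    return 2, N // 2, 2

def mesh2d(N):                 # N must be a perfect square
    side = math.isqrt(N)
    return 4, 2 * (side - 1), side

def hypercube(N):              # N must be a power of two; n = log2(N)
    n = int(math.log2(N))
    return n, n, N // 2

print(mesh2d(64))              # (4, 14, 8)
print(hypercube(64))           # (6, 6, 32)
```

The trade-off is visible in the numbers: for 64 nodes the hypercube halves the diameter and quadruples the bisection relative to the 2D mesh, but its degree grows with log2 N instead of staying constant.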

  43. Topologies in Real High End Machines [figure: topologies used by real high-end machines, from older to newer systems] Henk Corporaal

  44. Network: Performance metrics • Network Bandwidth • Need high bandwidth in communication • How does it scale with number of nodes? • Communication Latency • Affects performance, since processor may have to wait • Affects ease of programming, since it requires more thought to overlap communication and computation • How can a mechanism help hide latency? • overlap message send with computation, • prefetch data, • switch to other task or thread Henk Corporaal

  45. Examples of many core / PE architectures • SIMD • Xetal (320 PEs), Imap (128 PEs), AnySP (Michigan Univ) • VLIW • Itanium, TRIPS / EDGE, ADRES • Multi-threaded • idea: hide long latencies • Denelcor HEP (1982), SUN Niagara (2005) • Multi-processor • RaW, PicoChip, Intel/AMD, GRID, Farms, … • Hybrid, like Imagine, GPUs, XC-Core • actually, most are hybrid !! Henk Corporaal

  46. IMAP from NEC • SIMD • 128 PEs • Supports indirect addressing • e.g. LD r1, (r2) • Each PE is a 5-issue VLIW Henk Corporaal

  47. TRIPS (Austin Univ / IBM): a statically mapped data flow architecture • R: register file • E: execution unit • D: Data cache • I: Instruction cache • G: global control Henk Corporaal

  48. Compiling for TRIPS • Form hyperblocks (use unrolling, predication, inlining to enlarge scope) • Spatially map the operations of each hyperblock • registers are accessed at hyperblock boundaries • Schedule hyperblocks Henk Corporaal

  49. Multithreaded Categories [diagram: issue slots over time (processor cycles) for Superscalar, Fine-Grained, Coarse-Grained, Multiprocessing and Simultaneous Multithreading; threads 1–5 fill slots, with idle slots left empty; Intel calls Simultaneous Multithreading 'Hyperthreading'] Henk Corporaal

  50. SUN Niagara processing element • 4 threads per processor • 4 copies of PC logic, Instr. buffer, Store buffer, Register file Henk Corporaal
