Multi-Core Architectures and Shared Resource Management Lecture 3: Interconnects

Multi-Core Architectures and Shared Resource Management Lecture 3: Interconnects Prof. Onur Mutlu http://www.ece.cmu.edu/~omutlu onur@cmu.edu Bogazici University June 10, 2013

Last Lecture • Wrap up Asymmetric Multi-Core Systems • Handling Private Data Locality • Asymmetry Everywhere • Resource Sharing vs. Partitioning • Cache Design for Multi-core Architectures • MLP-aware Cache Replacement • The Evicted-Address Filter Cache • Base-Delta-Immediate Compression • Linearly Compressed Pages • Utility Based Cache Partitioning • Fair Shared Cache Partitioning • Page Coloring Based Cache Partitioning

Agenda for Today • Interconnect design for multi-core systems • (Prefetcher design for multi-core systems) • (Data Parallelism and GPUs)

Readings for Lecture June 6 (Lecture 1.1) • Required – Symmetric and Asymmetric Multi-Core Systems • Mutlu et al., “Runahead Execution: An Alternative to Very Large Instruction Windows for Out-of-order Processors,” HPCA 2003, IEEE Micro 2003. • Suleman et al., “Accelerating Critical Section Execution with Asymmetric Multi-Core Architectures,” ASPLOS 2009, IEEE Micro 2010. • Suleman et al., “Data Marshaling for Multi-Core Architectures,” ISCA 2010, IEEE Micro 2011. • Joao et al., “Bottleneck Identification and Scheduling for Multithreaded Applications,” ASPLOS 2012. • Joao et al., “Utility-Based Acceleration of Multithreaded Applications on Asymmetric CMPs,” ISCA 2013. • Recommended • Amdahl, “Validity of the single processor approach to achieving large scale computing capabilities,” AFIPS 1967. • Olukotun et al., “The Case for a Single-Chip Multiprocessor,” ASPLOS 1996. • Mutlu et al., “Techniques for Efficient Processing in Runahead Execution Engines,” ISCA 2005, IEEE Micro 2006.

Videos for Lecture June 6 (Lecture 1.1) • Runahead Execution • http://www.youtube.com/watch?v=z8YpjqXQJIA&list=PL5PHm2jkkXmidJOd59REog9jDnPDTG6IJ&index=28 • Multiprocessors • Basics: http://www.youtube.com/watch?v=7ozCK_Mgxfk&list=PL5PHm2jkkXmidJOd59REog9jDnPDTG6IJ&index=31 • Correctness and Coherence: http://www.youtube.com/watch?v=U-VZKMgItDM&list=PL5PHm2jkkXmidJOd59REog9jDnPDTG6IJ&index=32 • Heterogeneous Multi-Core: http://www.youtube.com/watch?v=r6r2NJxj3kI&list=PL5PHm2jkkXmidJOd59REog9jDnPDTG6IJ&index=34

Readings for Lecture June 7 (Lecture 1.2) • Required – Caches in Multi-Core • Qureshi et al., “A Case for MLP-Aware Cache Replacement,” ISCA 2005. • Seshadri et al., “The Evicted-Address Filter: A Unified Mechanism to Address both Cache Pollution and Thrashing,” PACT 2012. • Pekhimenko et al., “Base-Delta-Immediate Compression: Practical Data Compression for On-Chip Caches,” PACT 2012. • Pekhimenko et al., “Linearly Compressed Pages: A Main Memory Compression Framework with Low Complexity and Low Latency,” SAFARI Technical Report 2013. • Recommended • Qureshi et al., “Utility-Based Cache Partitioning: A Low-Overhead, High-Performance, Runtime Mechanism to Partition Shared Caches,” MICRO 2006.

Videos for Lecture June 7 (Lecture 1.2) • Cache basics: • http://www.youtube.com/watch?v=TpMdBrM1hVc&list=PL5PHm2jkkXmidJOd59REog9jDnPDTG6IJ&index=23 • Advanced caches: • http://www.youtube.com/watch?v=TboaFbjTd-E&list=PL5PHm2jkkXmidJOd59REog9jDnPDTG6IJ&index=24

Readings for Lecture June 10 (Lecture 1.3) • Required – Interconnects in Multi-Core • Moscibroda and Mutlu, “A Case for Bufferless Routing in On-Chip Networks,” ISCA 2009. • Fallin et al., “CHIPPER: A Low-Complexity Bufferless Deflection Router,” HPCA 2011. • Fallin et al., “MinBD: Minimally-Buffered Deflection Routing for Energy-Efficient Interconnect,” NOCS 2012. • Das et al., “Application-Aware Prioritization Mechanisms for On-Chip Networks,” MICRO 2009. • Das et al., “Aergia: Exploiting Packet Latency Slack in On-Chip Networks,” ISCA 2010, IEEE Micro 2011. • Recommended • Grot et al., “Preemptive Virtual Clock: A Flexible, Efficient, and Cost-effective QOS Scheme for Networks-on-Chip,” MICRO 2009. • Grot et al., “Kilo-NOC: A Heterogeneous Network-on-Chip Architecture for Scalability and Service Guarantees,” ISCA 2011, IEEE Micro 2012.

More Readings for Lecture 1.3 • Studies of congestion and congestion control in on-chip vs. internet-like networks • George Nychis, Chris Fallin, Thomas Moscibroda, Onur Mutlu, and Srinivasan Seshan, "On-Chip Networks from a Networking Perspective: Congestion and Scalability in Many-core Interconnects," Proceedings of the 2012 ACM SIGCOMM Conference (SIGCOMM), Helsinki, Finland, August 2012. • George Nychis, Chris Fallin, Thomas Moscibroda, and Onur Mutlu, "Next Generation On-Chip Networks: What Kind of Congestion Control Do We Need?" Proceedings of the 9th ACM Workshop on Hot Topics in Networks (HOTNETS), Monterey, CA, October 2010.

Videos for Lecture June 10 (Lecture 1.3) • Interconnects • http://www.youtube.com/watch?v=6xEpbFVgnf8&list=PL5PHm2jkkXmidJOd59REog9jDnPDTG6IJ&index=33 • GPUs and SIMD processing • Vector/array processing basics: http://www.youtube.com/watch?v=f-XL4BNRoBA&list=PL5PHm2jkkXmidJOd59REog9jDnPDTG6IJ&index=15 • GPUs versus other execution models: http://www.youtube.com/watch?v=dl5TZ4-oao0&list=PL5PHm2jkkXmidJOd59REog9jDnPDTG6IJ&index=19 • GPUs in more detail: http://www.youtube.com/watch?v=vr5hbSkb1Eg&list=PL5PHm2jkkXmidJOd59REog9jDnPDTG6IJ&index=20

Readings for Prefetching • Prefetching • Srinath et al., “Feedback Directed Prefetching: Improving the Performance and Bandwidth-Efficiency of Hardware Prefetchers,” HPCA 2007. • Ebrahimi et al., “Coordinated Control of Multiple Prefetchers in Multi-Core Systems,” MICRO 2009. • Ebrahimi et al., “Techniques for Bandwidth-Efficient Prefetching of Linked Data Structures in Hybrid Prefetching Systems,” HPCA 2009. • Ebrahimi et al., “Prefetch-Aware Shared Resource Management for Multi-Core Systems,” ISCA 2011. • Lee et al., “Prefetch-Aware DRAM Controllers,” MICRO 2008. • Recommended • Lee et al., “Improving Memory Bank-Level Parallelism in the Presence of Prefetching,” MICRO 2009.

Videos for Prefetching • Prefetching • http://www.youtube.com/watch?v=IIkIwiNNl0c&list=PL5PHm2jkkXmidJOd59REog9jDnPDTG6IJ&index=29 • http://www.youtube.com/watch?v=yapQavK6LUk&list=PL5PHm2jkkXmidJOd59REog9jDnPDTG6IJ&index=30

Readings for GPUs and SIMD • GPUs and SIMD processing • Narasiman et al., “Improving GPU Performance via Large Warps and Two-Level Warp Scheduling,” MICRO 2011. • Jog et al., “OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance,” ASPLOS 2013. • Jog et al., “Orchestrated Scheduling and Prefetching for GPGPUs,” ISCA 2013. • Lindholm et al., “NVIDIA Tesla: A Unified Graphics and Computing Architecture,” IEEE Micro 2008. • Fung et al., “Dynamic Warp Formation and Scheduling for Efficient GPU Control Flow,” MICRO 2007.

Videos for GPUs and SIMD • GPUs and SIMD processing • Vector/array processing basics: http://www.youtube.com/watch?v=f-XL4BNRoBA&list=PL5PHm2jkkXmidJOd59REog9jDnPDTG6IJ&index=15 • GPUs versus other execution models: http://www.youtube.com/watch?v=dl5TZ4-oao0&list=PL5PHm2jkkXmidJOd59REog9jDnPDTG6IJ&index=19 • GPUs in more detail: http://www.youtube.com/watch?v=vr5hbSkb1Eg&list=PL5PHm2jkkXmidJOd59REog9jDnPDTG6IJ&index=20

Online Lectures and More Information • Online Computer Architecture Lectures • http://www.youtube.com/playlist?list=PL5PHm2jkkXmidJOd59REog9jDnPDTG6IJ • Online Computer Architecture Courses • Intro: http://www.ece.cmu.edu/~ece447/s13/doku.php • Advanced: http://www.ece.cmu.edu/~ece740/f11/doku.php • Advanced: http://www.ece.cmu.edu/~ece742/doku.php • Recent Research Papers • http://users.ece.cmu.edu/~omutlu/projects.htm • http://scholar.google.com/citations?user=7XyGUGkAAAAJ&hl=en

Interconnect Basics

Interconnect in a Multi-Core System Shared Storage

Where Is Interconnect Used? • To connect components • Many examples • Processors and processors • Processors and memories (banks) • Processors and caches (banks) • Caches and caches • I/O devices Interconnection network

Why Is It Important? • Affects the scalability of the system • How large of a system can you build? • How easily can you add more processors? • Affects performance and energy efficiency • How fast can processors, caches, and memory communicate? • How long are the latencies to memory? • How much energy is spent on communication?

Interconnection Network Basics • Topology • Specifies the way switches are wired • Affects routing, reliability, throughput, latency, building ease • Routing (algorithm) • How does a message get from source to destination • Static or adaptive • Buffering and Flow Control • What do we store within the network? • Entire packets, parts of packets, etc? • How do we throttle during oversubscription? • Tightly coupled with routing strategy

Topology • Bus (simplest) • Point-to-point connections (ideal and most costly) • Crossbar (less costly) • Ring • Tree • Omega • Hypercube • Mesh • Torus • Butterfly • …

Metrics to Evaluate Interconnect Topology • Cost • Latency (in hops, in nanoseconds) • Contention • Many others exist you should think about • Energy • Bandwidth • Overall system performance

Bus + Simple + Cost effective for a small number of nodes + Easy to implement coherence (snooping and serialization) - Not scalable to large number of nodes (limited bandwidth, electrical loading → reduced frequency) - High contention → fast saturation (figure: nodes 0-7 attached to a shared bus)

Point-to-Point • Every node connected to every other + Lowest contention + Potentially lowest latency + Ideal, if cost is not an issue - Highest cost: O(N) connections/ports per node, O(N^2) links - Not scalable - How to lay out on chip? (figure: nodes 0-7, fully connected)
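The O(N) ports and O(N^2) links above can be checked with a few lines of Python. This is only a sketch: the function names are illustrative, and it assumes one bidirectional link per pair of nodes.

```python
def ports_per_node(n: int) -> int:
    """Each node needs a port to every other node."""
    return n - 1


def point_to_point_links(n: int) -> int:
    """One bidirectional link per unordered pair of nodes."""
    return n * (n - 1) // 2


if __name__ == "__main__":
    for n in (4, 8, 16, 64):
        print(f"N={n:3d}  ports/node={ports_per_node(n):3d}  "
              f"links={point_to_point_links(n):5d}")
```

Going from 8 to 64 nodes multiplies the port count per node by 9 but the link count by 72, which is why the layout question on the slide becomes pressing so quickly.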

Crossbar • Every node connected to every other (non-blocking), except that only one node can be using the connection to a given destination at any given time • Enables concurrent sends to non-conflicting destinations • Good for small number of nodes + Low latency and high throughput - Expensive - Not scalable: O(N^2) cost - Difficult to arbitrate as N increases • Used in core-to-cache-bank networks in - IBM POWER5 - Sun Niagara I/II (figure: 8x8 crossbar, nodes 0-7 on each side)
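A toy model of one crossbar arbitration cycle illustrates "concurrent sends to non-conflicting destinations". The fixed lowest-source-first priority is an assumption made for brevity; it is not the arbitration scheme of POWER5 or Niagara, and the function name is hypothetical.

```python
def arbitrate(requests):
    """Grant at most one source per destination in a single cycle.

    requests: iterable of (src, dst) pairs.
    Returns (grants, stalled), where grants maps dst -> winning src and
    stalled lists the requests that must retry in a later cycle.
    Priority is fixed, lowest source id first (an assumption).
    """
    grants = {}
    stalled = []
    for src, dst in sorted(requests):
        if dst not in grants:
            grants[dst] = src           # output port is free: grant it
        else:
            stalled.append((src, dst))  # output already taken this cycle
    return grants, stalled


if __name__ == "__main__":
    grants, stalled = arbitrate([(1, 3), (0, 3), (2, 5)])
    print("granted:", grants)
    print("stalled:", stalled)
```

Sources 0 and 2 proceed in parallel because they target different outputs, while source 1 loses the conflict on output 3. A real design would rotate priority to avoid starving high-numbered sources.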

Another Crossbar Design (figure: 8x8 crossbar connecting nodes 0-7)

Sun UltraSPARC T2 Core-to-Cache Crossbar • High bandwidth interface between 8 cores and 8 L2 banks & NCU • 4-stage pipeline: req, arbitration, selection, transmission • 2-deep queue for each src/dest pair to hold data transfer request

Buffered Crossbar + Simpler arbitration/scheduling + Efficient support for variable-size packets - Requires N^2 buffers (figure: bufferless crossbar vs. buffered crossbar for nodes 0-3, showing network interfaces (NI), flow control, and one output arbiter per output)

Can We Get Lower Cost than A Crossbar? • Yet still have low contention? • Idea: Multistage networks

Multistage Logarithmic Networks • Idea: Indirect networks with multiple layers of switches between terminals/nodes • Cost: O(NlogN), Latency: O(logN) • Many variations (Omega, Butterfly, Benes, Banyan, …) • Omega Network: conflict
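To make the O(N log N) cost concrete, the sketch below compares a full crossbar with a multistage network. It assumes N is a power of two and that the network uses log2(N) stages of N/2 two-by-two switches (as in Omega and Butterfly networks), each counted as 4 crosspoints. The function names are illustrative.

```python
def log2_exact(n: int) -> int:
    """log2 of n, requiring n to be a power of two >= 2."""
    if n < 2 or n & (n - 1):
        raise ValueError("n must be a power of two >= 2")
    return n.bit_length() - 1


def crossbar_crosspoints(n: int) -> int:
    """An N x N crossbar needs one crosspoint per (input, output) pair."""
    return n * n


def multistage_cost(n: int):
    """Return (stages, 2x2 switches, crosspoints) for the assumed network."""
    stages = log2_exact(n)
    switches = stages * (n // 2)
    return stages, switches, 4 * switches


if __name__ == "__main__":
    for n in (8, 64, 1024):
        stages, switches, xpts = multistage_cost(n)
        print(f"N={n:5d}  crossbar={crossbar_crosspoints(n):8d}  "
              f"multistage: {stages} stages, {switches} switches, "
              f"{xpts} crosspoints")
```

The number of stages is also the hop count of every path, which is where the O(log N) latency comes from.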

Multistage Circuit Switched • More restrictions on feasible concurrent Tx-Rx pairs • But more scalable than crossbar in cost, e.g., O(N logN) for Butterfly (figure: 8x8 multistage network built from 2-by-2 crossbars)

Multistage Packet Switched • Packets “hop” from router to router, pending availability of the next-required switch and buffer (figure: 8x8 multistage network built from 2-by-2 routers)

Aside: Circuit vs. Packet Switching • Circuit switching sets up full path • Establish route then send data • (no one else can use those links) + Faster arbitration - Setting up and bringing down links takes time • Packet switching routes per packet • Route each packet individually (possibly via different paths) • If a link is free, any packet can use it - Potentially slower: must dynamically switch + No setup or bring-down time + More flexible, does not underutilize links

Switching vs. Topology • Circuit/packet switching choice independent of topology • It is a higher-level protocol on how a message gets sent to a destination • However, some topologies are more amenable to circuit vs. packet switching

Another Example: Delta Network • Single path from source to destination • Does not support all possible permutations • Proposed to replace costly crossbars as processor-memory interconnect • Janak H. Patel, “Processor-Memory Interconnections for Multiprocessors,” ISCA 1979. (figure: 8x8 Delta network)

Another Example: Omega Network • Single path from source to destination • All stages are the same • Used in NYU Ultracomputer • Gottlieb et al., “The NYU Ultracomputer-designing a MIMD, shared-memory parallel machine,” ISCA 1982.
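The single path and the resulting conflicts can be sketched in Python. The code assumes the standard textbook formulation, which the slides do not spell out: every stage is a perfect shuffle followed by a column of 2x2 switches, and each switch picks its upper or lower output from one bit of the destination address, most significant bit first (destination-tag routing). Function names are illustrative.

```python
def omega_path(src: int, dst: int, k: int):
    """Line occupied by a packet after each of the k stages (N = 2**k).

    Each stage: perfect shuffle (rotate the k-bit line number left by
    one), then the 2x2 switch sets the low bit to the next destination
    bit, most significant bit first.
    """
    n = 1 << k
    pos = src
    path = []
    for stage in range(k):
        pos = ((pos << 1) | (pos >> (k - 1))) & (n - 1)  # perfect shuffle
        bit = (dst >> (k - 1 - stage)) & 1               # routing tag bit
        pos = (pos & ~1) | bit                           # switch output
        path.append(pos)
    return path


def first_conflict(a, b, k: int):
    """First stage at which packets a and b need the same switch output.

    a and b are (src, dst) pairs. Returns None if the paths are disjoint.
    """
    for stage, (pa, pb) in enumerate(zip(omega_path(*a, k),
                                         omega_path(*b, k))):
        if pa == pb:
            return stage
    return None


if __name__ == "__main__":
    k = 3  # 8x8 network
    print(omega_path(5, 3, k))                 # last entry is the destination
    print(first_conflict((0, 0), (4, 1), k))   # conflict in the first stage
    print(first_conflict((0, 0), (4, 4), k))   # no conflict
```

Inputs 0 and 4 land on the same first-stage switch after the shuffle, so sending them to destinations 0 and 1 (both in the upper half) makes them compete for one output even though the destinations differ. This is the kind of blocking a crossbar avoids.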

Ring + Cheap: O(N) cost - High latency: O(N) - Not easy to scale - Bisection bandwidth remains constant Used in Intel Haswell, Intel Larrabee, IBM Cell, many commercial systems today

Unidirectional Ring • Simple topology and implementation • Reasonable performance if N and performance needs (bandwidth & latency) are still moderately low • O(N) cost • N/2 average hops; latency depends on utilization (figure: nodes 0, 1, 2, ..., N-2, N-1, each attached to a 2x2 router R on the ring)

Bidirectional Rings + Reduces latency + Improves scalability - Slightly more complex injection policy (need to select which ring to inject a packet into)
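A small sketch of the hop counts behind "reduces latency", together with one possible injection policy. The shortest-direction rule is an assumption for illustration: a real router might also weigh congestion on each ring. Names such as `choose_ring` are hypothetical.

```python
def unidirectional_hops(src: int, dst: int, n: int) -> int:
    """Hops when packets can only travel in increasing node order."""
    return (dst - src) % n


def choose_ring(src: int, dst: int, n: int):
    """Pick the ring with fewer hops; ties go clockwise (an assumption)."""
    cw = (dst - src) % n
    ccw = (src - dst) % n
    return ("cw", cw) if cw <= ccw else ("ccw", ccw)


def average_hops(n: int, hops) -> float:
    """Average of hops(src, dst, n) over all ordered pairs, src != dst."""
    pairs = [(s, d) for s in range(n) for d in range(n) if s != d]
    return sum(hops(s, d, n) for s, d in pairs) / len(pairs)


if __name__ == "__main__":
    n = 8
    print(choose_ring(0, 6, n))  # two hops counter-clockwise beats six
    print(average_hops(n, unidirectional_hops))
    print(average_hops(n, lambda s, d, m: choose_ring(s, d, m)[1]))
```

For 8 nodes the unidirectional average is exactly N/2 = 4 hops, while picking the shorter direction brings it to about 2.3 hops, at the price of the extra injection decision noted on the slide.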

Hierarchical Rings + More scalable + Lower latency - More complex

More on Hierarchical Rings • Chris Fallin, Xiangyao Yu, Kevin Chang, Rachata Ausavarungnirun, Greg Nazario, Reetuparna Das, and Onur Mutlu, "HiRD: A Low-Complexity, Energy-Efficient Hierarchical Ring Interconnect," SAFARI Technical Report, TR-SAFARI-2012-004, Carnegie Mellon University, December 2012. • Discusses the design and implementation of a mostly-bufferless hierarchical ring

Mesh • O(N) cost • Average latency: O(sqrt(N)) • Easy to lay out on-chip: regular and equal-length links • Path diversity: many ways to get from one node to another • Used in Tilera 100-core • And many on-chip network prototypes
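Hop count and path diversity in a 2D mesh can be made concrete with a short sketch. Nodes are (x, y) coordinates. The XY (dimension-order) route shown is one common deterministic choice and is an assumption here, since the slide does not name a routing algorithm. Function names are illustrative.

```python
from math import comb


def mesh_hops(src, dst) -> int:
    """Minimal hop count is the Manhattan distance."""
    return abs(dst[0] - src[0]) + abs(dst[1] - src[1])


def minimal_paths(src, dst) -> int:
    """Number of distinct minimal paths: choose where the X hops go."""
    dx = abs(dst[0] - src[0])
    dy = abs(dst[1] - src[1])
    return comb(dx + dy, dx)


def xy_route(src, dst):
    """Dimension-order route: finish all X hops, then all Y hops."""
    x, y = src
    path = [(x, y)]
    while x != dst[0]:
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path


if __name__ == "__main__":
    src, dst = (0, 0), (3, 2)
    print(mesh_hops(src, dst), "hops,",
          minimal_paths(src, dst), "minimal paths")
    print(xy_route(src, dst))
```

XY routing uses only one of the minimal paths. An adaptive router could pick among all of them to steer around congestion, which is the practical benefit of path diversity.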

Torus • Mesh is not symmetric on edges: performance very sensitive to placement of task on edge vs. middle • Torus avoids this problem + Higher path diversity (and bisection bandwidth) than mesh - Higher cost - Harder to lay out on-chip - Unequal link lengths
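The edge sensitivity of a mesh, and how the wraparound links of a torus remove it, shows up in a brute-force hop count. This sketch assumes a k x k network with nodes at (x, y) coordinates and minimal routing. The helper names are illustrative.

```python
def mesh_hops(src, dst) -> int:
    """Manhattan distance, no wraparound links."""
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])


def ring_dist(a: int, b: int, k: int) -> int:
    """Shortest distance along one dimension with a wraparound link."""
    d = abs(a - b)
    return min(d, k - d)


def torus_hops(src, dst, k: int) -> int:
    """Minimal hops in a k x k torus."""
    return ring_dist(src[0], dst[0], k) + ring_dist(src[1], dst[1], k)


def avg_hops_from(node, k: int, hops) -> float:
    """Average hops from node to every other node of a k x k network."""
    others = [(x, y) for x in range(k) for y in range(k) if (x, y) != node]
    return sum(hops(node, o) for o in others) / len(others)


if __name__ == "__main__":
    k = 4

    def torus(a, b):
        return torus_hops(a, b, k)

    print("mesh corner :", avg_hops_from((0, 0), k, mesh_hops))
    print("mesh middle :", avg_hops_from((1, 1), k, mesh_hops))
    print("torus corner:", avg_hops_from((0, 0), k, torus))
    print("torus middle:", avg_hops_from((1, 1), k, torus))
```

In the 4x4 mesh a corner node is on average 3.2 hops from the others versus about 2.1 for a middle node. In the torus every node sees the same average, so task placement no longer matters for distance.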

Torus, continued • Weave nodes to make inter-node latencies ~constant

Trees • Planar, hierarchical topology • Latency: O(logN) • Good for local traffic + Cheap: O(N) cost + Easy to lay out - Root can become a bottleneck • Fat trees avoid this problem (CM-5) (figure: fat tree)

CM-5 Fat Tree • Fat tree based on 4x2 switches • Randomized routing on the way up • Combining, multicast, reduction operators supported in hardware • Thinking Machines Corp., “The Connection Machine CM-5 Technical Summary,” Jan. 1992.

Hypercube • Latency: O(logN) • Radix: O(logN) • #links: O(NlogN) + Low latency - Hard to lay out in 2D/3D (figure: 4-dimensional hypercube with nodes labeled 0000 to 1111)
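The three asymptotic figures follow directly from the binary node labels: neighbors differ in exactly one bit, so the radix is log2(N), the hop count is the Hamming distance, and there are N·log2(N)/2 links. The sketch below also includes a dimension-order ("e-cube") route, which is an assumption for illustration rather than something the slide specifies.

```python
def hops(a: int, b: int) -> int:
    """Minimal hop count is the Hamming distance between node labels."""
    return bin(a ^ b).count("1")


def neighbors(node: int, dims: int):
    """Flip each of the dims address bits in turn."""
    return [node ^ (1 << d) for d in range(dims)]


def link_count(dims: int) -> int:
    """N nodes with dims links each; every link is shared by two nodes."""
    n = 1 << dims
    return n * dims // 2


def ecube_route(src: int, dst: int, dims: int):
    """Correct the differing bits from lowest to highest dimension."""
    path = [src]
    cur = src
    for d in range(dims):
        if ((cur ^ dst) >> d) & 1:
            cur ^= 1 << d
            path.append(cur)
    return path


if __name__ == "__main__":
    dims = 4  # 16 nodes, labels 0000 to 1111
    print(hops(0b0000, 0b1111))
    print([format(v, "04b") for v in neighbors(0b0000, dims)])
    print(link_count(dims))
    print([format(v, "04b") for v in ecube_route(0b0000, 0b1011, dims)])
```

The worst case is log2(N) hops (all bits differ), which matches the O(logN) latency on the slide. The radix growing with N is one reason the layout gets hard.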

Caltech Cosmic Cube • 64-node message passing machine • Seitz, “The Cosmic Cube,” CACM 1985.

Handling Contention • Two packets trying to use the same link at the same time • What do you do? • Buffer one • Drop one • Misroute one (deflection) • Tradeoffs?

Bufferless Deflection Routing • Key idea: Packets are never buffered in the network. When two packets contend for the same link, one is deflected. [1] • New traffic can be injected whenever there is a free output link. [1] Baran, “On Distributed Communication Networks,” RAND Tech. Report, 1962 / IEEE Trans. Comm., 1964.
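One cycle of output-port allocation in a deflection router can be sketched as follows. This is a simplified illustration, not the algorithm of any specific router: it assumes each packet has a single preferred output, that older packets win conflicts (oldest-first), and that a losing packet takes any leftover port. The names `allocate_outputs` and `PORTS` are hypothetical.

```python
PORTS = ("N", "E", "S", "W")


def allocate_outputs(packets, ports=PORTS):
    """Assign every incoming packet an output port without buffering.

    packets: list of (age, preferred_port) tuples.
    Returns {packet_index: (port, deflected)}.
    With at most as many packets as ports, every packet always leaves.
    """
    if len(packets) > len(ports):
        raise ValueError("more packets than output ports")
    free = list(ports)
    result = {}
    losers = []
    # Pass 1: oldest packets claim their preferred port first.
    order = sorted(range(len(packets)),
                   key=lambda i: packets[i][0], reverse=True)
    for idx in order:
        want = packets[idx][1]
        if want in free:
            free.remove(want)
            result[idx] = (want, False)
        else:
            losers.append(idx)
    # Pass 2: losers are deflected to whatever port is still free.
    for idx in losers:
        result[idx] = (free.pop(0), True)
    return result


if __name__ == "__main__":
    arriving = [(5, "E"), (9, "E"), (1, "N")]
    for idx, (port, deflected) in sorted(allocate_outputs(arriving).items()):
        note = "deflected" if deflected else "productive"
        print(f"packet {idx} (age {arriving[idx][0]}) -> {port} ({note})")
```

Packet 1 (age 9) wins port E, packet 2 gets N uncontested, and packet 0 is deflected to S. It moves away from its destination for now but never needs a buffer. Giving priority to the oldest packet is one way to ensure that some packet always makes progress.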
