
11. Multicore Processors






Presentation Transcript


  1. 11. Multicore Processors Dezső Sima Fall 2006 © D. Sima, 2006

  2. Overview 1 Overview of MCPs 2 Attaching L2 caches 3 Attaching L3 caches 4 Connecting memory and I/O 5 Case examples

  3. 1. Overview of MCPs (1) Figure 1.1: Processor power density trends Source: D. Yen: Chip Multithreading Processors Enable Reliable High Throughput Computing http://www.irps.org/05-43rd/IRPS_Keynote_Yen.pdf

  4. 1. Overview of MCPs (2) Figure 1.2: Single-stream performance vs. cost Source: Marr T.T. et al.: "Hyper-Threading Technology Architecture and Microarchitecture", Intel Technology Journal, Vol. 06, Issue 01, Feb. 14, 2002, pp. 4-16

  5. 1. Overview of MCPs (3) Figure 1.3: Dual/multi-core processors (1)

  6. 1. Overview of MCPs (4) Figure 1.4: Dual/multi-core processors (2)

  7. 1. Overview of MCPs (5) Macro architecture of dual/multi-core processors (MCPs) Layout of the cores Layout of the I/O and memory architecture Attaching of L2 caches Attaching of L3 caches (if available)

  8. 2. Attaching L2 caches 2.1 Main aspects of attaching L2 caches to MCPs (1) Attaching L2 caches to MCPs Allocation to the cores Use by instructions/data Integration of L2 caches to the proc. chip Inclusion policy Banking policy

  9. Allocation of L2 caches to the cores
     Private L2 cache for each core: UltraSPARC IV (2004), Smithfield (2005), Athlon 64 X2 (2005), Montecito (2006?)
     Shared L2 cache for all cores: POWER4 (2001), POWER5 (2005), UltraSPARC T1 (2005), Yonah (2006), Core Duo (2006); expected trend

  10. 2.1 Main aspects of attaching L2 caches to MCPs (2) Attaching L2 caches to MCPs Allocation to the cores Use by instructions/data Integration of L2 caches to the proc. chip Inclusion policy Banking policy

  11. Inclusion policy of L2 caches: exclusive L2 vs. inclusive L2. In an exclusive L2, lines replaced (victimized) in the L1 are written into the L2, and references to data held in the L2 initiate reloading of that cache line into the L1. The L2 usually operates as a write-back cache: only modified data that is replaced in the L2 is written back to memory; unmodified data that is replaced in the L2 is simply discarded.

  12. Figure 2.1: Implementation of exclusive L2 caches Source: Zheng, Y., Davis, B.T., Jordan, M.: "Performance evaluation of exclusive cache hierarchies", 2004 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 2004, pp. 89-96.
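To make the exclusive/inclusive distinction of the preceding two slides concrete, below is a small illustrative C model (my own sketch, not taken from the lecture; all names are invented). It uses two fully associative toy caches with FIFO replacement: the inclusive variant copies every fetched line into the L2, while the exclusive variant stores only L1 victims there and removes a line from the L2 when it is reloaded into the L1. Dirty-line write-back and back-invalidation are left out to keep the sketch short.

#include <stdbool.h>
#include <stdio.h>

#define L1_LINES 4
#define L2_LINES 8
#define INVALID  (-1L)

typedef struct { long tag[L2_LINES]; int size, next; } Cache;

static bool holds(Cache *c, long tag) {
    for (int i = 0; i < c->size; i++)
        if (c->tag[i] == tag) return true;
    return false;
}

static void drop(Cache *c, long tag) {           /* exclusive move: remove the line from the L2 */
    for (int i = 0; i < c->size; i++)
        if (c->tag[i] == tag) c->tag[i] = INVALID;
}

static long insert(Cache *c, long tag) {         /* returns the evicted tag, if any */
    for (int i = 0; i < c->size; i++)
        if (c->tag[i] == INVALID) { c->tag[i] = tag; return INVALID; }
    long victim = c->tag[c->next];
    c->tag[c->next] = tag;
    c->next = (c->next + 1) % c->size;           /* simple FIFO victim choice */
    return victim;
}

static void reference(Cache *l1, Cache *l2, long line, bool exclusive) {
    if (holds(l1, line)) return;                           /* L1 hit: nothing to do       */
    bool in_l2 = holds(l2, line);
    if (exclusive && in_l2) drop(l2, line);                /* move the line, do not copy  */
    long victim = insert(l1, line);                        /* fill the L1 from L2/memory  */
    if (!exclusive && !in_l2) insert(l2, line);            /* inclusive: copy into the L2 */
    if (exclusive && victim != INVALID) insert(l2, victim);/* exclusive: victimize to L2  */
    printf("line %2ld: L2 %s, L1 victim %ld\n", line, in_l2 ? "hit " : "miss", victim);
}

int main(void) {
    Cache l1 = { .size = L1_LINES }, l2 = { .size = L2_LINES };
    for (int i = 0; i < L2_LINES; i++) { l1.tag[i] = INVALID; l2.tag[i] = INVALID; }
    long stream[] = { 1, 2, 3, 4, 5, 1, 6, 2, 7 };
    for (size_t i = 0; i < sizeof stream / sizeof stream[0]; i++)
        reference(&l1, &l2, stream[i], true);    /* set to false for the inclusive policy */
    return 0;
}

Running the stream in exclusive mode shows line 1 being found in the L2 on its second reference and moved back into the L1 while an L1 victim takes its place in the L2, which is the reload-and-victimize behaviour the slide describes.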

  13. Inclusion policy of L2 caches
     Exclusive L2: Athlon 64 X2 (2005)
     Inclusive L2: most implementations; expected trend

  14. 2.1 Main aspects of attaching L2 caches to MCPs (3) Attaching L2 caches to MCPs Allocation to the cores Use by instructions/data Integration of L2 caches to the proc. chip Inclusion policy Banking policy

  15. Use by instructions/data
     Unified instr./data L2 cache(s): UltraSPARC IV (2004), UltraSPARC T1 (2005), POWER4 (2001), POWER5 (2005), Smithfield (2005), Yonah (2006), Core Duo (2006), Athlon 64 X2 (2005); expected trend
     Split instr./data L2 caches: Montecito (2006?)

  16. 2.1 Main aspects of attaching L2 caches to MCPs (4) Attaching L2 caches to MCPs Allocation to the cores Use by instructions/data Integration of L2 caches to the proc. chip Inclusion policy Banking policy

  17. Banking policy Single-banked implementation Multi-banked implementation

  18. 2.1 Main aspects of attaching L2 caches to MCPs (5) Attaching L2 caches to MCPs Allocation to the cores Use by instructions/data Integration of L2 caches to the proc. chip Inclusion policy Banking policy

  19. Integration to the processor chip
     On-chip L2 tags/contr., off-chip data: UltraSPARC IV (2004), UltraSPARC V (2005)
     Entire L2 on chip: POWER4 (2001), POWER5 (2005), Smithfield (2005), Presler (2005), Athlon 64 X2 (2005); expected trend

  20. 2.2 Examples of attaching L2 caches to MCPs (1): private L2 caches for each core [block diagrams]
     UltraSPARC IV (2004): unified instr./data L2, on-chip L2 tags/contr. with off-chip L2 data, memory controller and system interface on the Fire Plane bus
     Smithfield (2005) / Presler (2005): unified instr./data L2, entire L2 on chip, FSB to the system
     Athlon 64 X2 (2005): unified, exclusive L2, entire L2 on chip, System Request Queue and crossbar to the on-chip memory controller and the HT-bus
     Montecito (2006?): split L2 instruction/data caches and on-chip L3 per core, internal interconnection, system interface, FSB

  21. 2.2 Examples of attaching L2 caches to MCPs (2): shared L2 caches for all cores [block diagrams]
     Dual core / single-banked L2: Yonah Duo (2006), Core (2006); system interface to the FSB
     Dual core / multi-banked L2: POWER4 (2001), POWER5 (2005); crossbar to three L2 banks, Fabric Bus Controller, memory controller, GX bus
     Multi-core / multi-banked L2: UltraSPARC T1 (Niagara, 2005); 8 cores, crossbar, 4 L2 banks, on-chip memory controllers, system interface
     Mapping of addresses to the banks: in POWER4/POWER5 the 128-byte L2 cache lines are hashed across the three L2 modules, the hash being modulo-3 arithmetic applied to a large number of real address bits; in the UltraSPARC T1 the four L2 banks are interleaved at 64-byte blocks
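The two bank-selection rules quoted above are simple enough to write down. The sketch below is mine, not from the slides; the exact address fields used by POWER4/POWER5 and by the UltraSPARC T1 are not given, so the bit positions are assumptions chosen only to illustrate the arithmetic: modulo-3 hashing of the 128-byte line index for three L2 modules, and plain 64-byte interleaving across four banks.

#include <stdint.h>
#include <stdio.h>

/* POWER4/POWER5 style: 128-byte lines hashed over 3 L2 modules.
 * Taking the line index modulo 3 spreads addresses evenly even though
 * the bank count is not a power of two. */
static unsigned bank_modulo3(uint64_t paddr) {
    uint64_t line_index = paddr >> 7;          /* strip the 7 offset bits (128 B line) */
    return (unsigned)(line_index % 3);
}

/* UltraSPARC T1 style: 4 L2 banks interleaved at 64-byte granularity,
 * so the two address bits just above the 64-byte offset select the bank. */
static unsigned bank_interleave4(uint64_t paddr) {
    return (unsigned)((paddr >> 6) & 0x3);
}

int main(void) {
    for (uint64_t a = 0; a < 512; a += 64)
        printf("addr %4llu -> modulo-3 bank %u, interleaved bank %u\n",
               (unsigned long long)a, bank_modulo3(a), bank_interleave4(a));
    return 0;
}

The modulo-3 form spreads addresses evenly over a bank count that is not a power of two, at the price of a small modulo circuit in the hash path, whereas the power-of-two interleave needs only two address bits.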

  22. 3. Attaching L3 caches Macro architecture of dual/multi-core processors (MCPs) Layout of the cores Layout of the I/O and memory architecture Attaching of L2 caches Attaching of L3 caches (if available)

  23. 3.1 Main aspects of attaching L3 caches to MCPs (1) Attaching L3 caches to MCPs Allocation to the L2 cache(s) Use by instructions/data Integration of L3 caches to the proc. chip Inclusion policy Banking policy

  24. Allocation of L3 caches to the L2 caches
     Shared L3 cache for all L2s: POWER4 (2001), POWER5 (2005), UltraSPARC IV+ (2005)
     Private L3 cache for each L2: Montecito (2006?)

  25. 3.1 Main aspects of attaching L3 caches to MCPs (2) Attaching L3 caches to MCPs Allocation to the L2 cache(s) Use by instructions/data Integration of L3 caches to the proc. chip Inclusion policy Banking policy

  26. Inclusion policy of L3 caches: exclusive L3 vs. inclusive L3. In an exclusive L3, lines replaced (victimized) in the L2 are written into the L3, and references to data held in the L3 initiate reloading of that cache line into the L2. The L3 usually operates as a write-back cache: only modified data that is replaced in the L3 is written back to memory; unmodified data that is replaced in the L3 is simply discarded.
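A one-line capacity argument (my addition, not stated on the slides) shows why the exclusive arrangement is attractive when the L3 is not much larger than the L2 it backs: an inclusive L3 duplicates every L2-resident line, while an exclusive L3 holds only lines absent from the L2, so roughly

     $C_{\mathrm{eff}}^{\mathrm{inclusive}} \approx C_{L3}, \qquad C_{\mathrm{eff}}^{\mathrm{exclusive}} \approx C_{L2} + C_{L3}$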

  27. Inclusion policy of L3 caches
     Exclusive L3: POWER5 (2005), UltraSPARC IV+ (2005); expected trend
     Inclusive L3: POWER4 (2001), Montecito (2006?)

  28. 3.1 Main aspects of attaching L3 caches to MCPs (3) Attaching L3 caches to MCPs Allocation to the L2 cache(s) Use by instructions/data Integration of L3 caches to the proc. chip Inclusion policy Banking policy

  29. Use by instructions/data: unified instr./data cache(s) vs. split instr./data caches. All multicore processors unveiled so far implement their L3 caches as unified caches holding both instructions and data.

  30. 3.1 Main aspects of attaching L3 caches to MCPs (4) Attaching L3 caches to MCPs Allocation to the L2 cache(s) Use by instructions/data Integration of L3 caches to the proc. chip Inclusion policy Banking policy

  31. Banking policy Single-banked implementation Multi-banked implementation

  32. 3.1 Main aspects of attaching L3 caches to MCPs (5) Attaching L3 caches to MCPs Allocation to the L2 cache(s) Use by instructions/data Integration of L3 caches to the proc. chip Inclusion policy Banking policy

  33. Integration to the processor chip
     On-chip L3 tags/contr., off-chip data: UltraSPARC IV+ (2005), POWER4 (2001), POWER5 (2005)
     Entire L3 on chip: Montecito (2006?); expected trend

  34. 3.2 Examples of attaching L3 caches to MCPs (1): inclusive L3 caches [block diagrams]
     POWER4 (2001): shared L3 cache for all L2 cache banks, on-chip L3 tags/contr. with off-chip L3 data, Fabric Bus Controller, memory attached behind the L3
     Montecito (2006?): private L3 cache for each core's L2, split L2 I/D caches, entire L3 on chip, arbiter, system interface and FSB to the off-chip memory controller

  35. 3.2 Examples of attaching L3 caches to MCPs (2): exclusive L3 caches [block diagrams]
     POWER5 (2005): shared L3 cache for all L2 cache banks, on-chip L3 tags/contr. with off-chip L3 data, Fabric Bus Controller, on-chip memory controller
     UltraSPARC IV+ (2005): shared L3 cache (on-chip L3 tags/contr., off-chip L3 data), shared L2, interconnection network, system interface and memory controller on the Fire Plane bus

  36. 4. Connecting memory and I/O Macro architecture of dual/multi-core processors (MCPs) Layout of the cores Layout of the I/O and memory architecture Attaching of L2 caches Attaching of L3 caches (if available)

  37. 4.1 Overview Layout of the I/O and memory architecture in dual/multi-core processors Integration of the memory controller to the processor chip Connection policy of I/O and memory

  38. 4.2 Connection policy (1)
     Connection policy of I/O and memory
     Connecting both I/O and memory via the system bus: PA-8800 (2004), PA-8900 (2005), Smithfield (2005), Presler (2005), Yonah Duo (2006), Core (2006), Montecito (2006?)
     Dedicated connection of I/O and memory
       Symmetric connection of I/O and memory: POWER5 (2005), UltraSPARC IV (2004), UltraSPARC IV+ (2005), Athlon 64 X2 (2005)
       Asymmetric connection of I/O and memory: POWER4 (2001), UltraSPARC T1 (2005)

  39. 4.2 Connection policy (2): connecting both I/O and memory via the system bus [block diagrams]
     Smithfield/Presler (2005/2005), Yonah Duo/Core (2006/2006): cores and L2 behind a system bus interface to the FSB
     Montecito (2006): split L2 I/D caches and L3, system bus interface, FSB
     PA-8800 (2004), PA-8900 (2005): cores, L2 controller, system bus interface, FSB

  40. 4.2 Connection policy (3)
     Connection policy of I/O and memory
     Connecting both I/O and memory via the system bus: PA-8800 (2004), PA-8900 (2005), Smithfield (2005), Presler (2005), Yonah Duo (2006), Core (2006), Montecito (2006?)
     Dedicated connection of I/O and memory
       Symmetric connection of I/O and memory (connecting both I/O and memory via the internal interconnection network): POWER5 (2005), UltraSPARC IV (2004), UltraSPARC IV+ (2005), Athlon 64 X2 (2005)
       Asymmetric connection of I/O and memory (connecting I/O via the internal interconnection network and memory via the L2/L3 cache): POWER4 (2001), UltraSPARC T1 (2005)

  41. 4.2 Connection policy (4): asymmetric connection of I/O and memory [block diagrams]
     POWER4 (2001): memory attached via the L3 (on-chip L3 dir./contr., off-chip L3 data), I/O via the GX controller and GX bus, Fabric Bus Controller with chip-to-chip and memory-to-memory interconnects
     UltraSPARC T1 (2005): cores 0-7 connected by a crossbar to the L2 banks, each bank with its own memory controller to memory; I/O via the bus interface to the JBus

  42. 4.2 Connection policy (5)
     Connection policy of I/O and memory
     Connecting both I/O and memory via the system bus: PA-8800 (2004), PA-8900 (2005), Smithfield (2005), Presler (2005), Yonah Duo (2006), Core (2006), Montecito (2006?)
     Dedicated connection of I/O and memory
       Symmetric connection of I/O and memory (connecting both I/O and memory via the internal interconnection network): POWER5 (2005), UltraSPARC IV (2004), UltraSPARC IV+ (2005), Athlon 64 X2 (2005)
       Asymmetric connection of I/O and memory (connecting I/O via the internal interconnection network and memory via the L2/L3 cache): POWER4 (2001), UltraSPARC T1 (2005)

  43. 4.2 Connection policy (6): symmetric connection of I/O and memory (1) [block diagrams]
     POWER5 (2005): L2 banks and L3 behind the Fabric Bus Controller, on-chip memory controller to memory, I/O via the GX bus, chip-to-chip and memory-to-memory interconnects
     UltraSPARC IV (2004): per-core on-chip L2 tags/contr. with off-chip L2 data, interconnection network, system interface and memory controller both on the Fire Plane bus

  44. 4.2 Connection policy (7): symmetric connection of I/O and memory (2) [block diagrams]
     Athlon 64 X2 (2005): both cores feed the System Request Queue and crossbar, which connect symmetrically to the on-chip memory controller (memory) and to the HT-bus (I/O)
     UltraSPARC IV+ (2005): shared L2, on-chip L3 tags/contr. with off-chip L3 data, interconnection network to the system interface and the on-chip memory controller on the Fire Plane bus

  45. 4.3 Integration of the memory controller to the processor chip
     On-chip memory controller: POWER5 (2005), UltraSPARC IV (2004), UltraSPARC IV+ (2005), UltraSPARC T1 (2005), Athlon 64 X2 (2005); expected trend
     Off-chip memory controller: POWER4 (2001), PA-8800 (2004), PA-8900 (2005), Smithfield (2005), Presler (2005), Yonah Duo (2006), Core (2006), Montecito (2006?)

  46. 5. Case examples 5.1 Intel MCPs (1) Figure 5.1: The move to Intel multi-core Source: A. Loktu: Itanium 2 for Enterprise Computing http://h40132.www4.hp.com/upload/se/sv/Itanium2forenterprisecomputing.pps

  47. 5.1 Intel MCPs (2) Figure 5.2: Processor specifications of Intel’s Pentium D family (90 nm) Source: http://www.intel.com/products/processor/index.htm

  48. 5.1 Intel MCPs (3)
     ED: Execute Disable Bit. Malicious buffer overflow attacks pose a significant security threat. In a typical attack, a malicious worm floods a buffer with more code than it can hold, and the overflowing code is then executed, allowing the worm to propagate itself over the network to other computers. Combined with a supporting operating system, the Execute Disable Bit can help prevent certain classes of these attacks: it allows the processor to classify areas of memory as ones where application code can execute and ones where it cannot. When a malicious worm attempts to insert code into the buffer, the processor disables code execution from that area, preventing damage and worm propagation.
     VT: Virtualization Technology. A set of hardware enhancements to Intel’s server and client platforms that can improve the performance and robustness of traditional software-based virtualization solutions. Virtualization allows a platform to run multiple operating systems and applications in independent partitions, so that one computer system can function as multiple "virtual" systems.
     EIST: Enhanced Intel SpeedStep Technology. First delivered in Intel’s mobile and server platforms, it allows the system to dynamically adjust processor voltage and core frequency, which can result in decreased average power consumption and decreased average heat production.
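To make the Execute Disable mechanism above more concrete, here is a minimal sketch of how an operating system exposes it (my own example, assuming a Linux/POSIX system; it uses only the standard mmap/mprotect calls). Memory mapped without PROT_EXEC is marked non-executable in the page tables, so on an XD-capable processor an attempt to execute injected code from such a page raises a fault instead of running the payload.

#define _DEFAULT_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
    size_t len = 4096;

    /* A data buffer mapped readable and writable, but NOT executable:
     * on an XD/NX-capable processor the OS sets the execute-disable bit
     * in this page's page-table entry. */
    unsigned char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) { perror("mmap"); return EXIT_FAILURE; }

    /* 0xC3 is the x86 'ret' opcode - it stands in for an injected payload. */
    memset(buf, 0xC3, len);
    printf("payload written to a non-executable page at %p\n", (void *)buf);
    printf("jumping into it would be stopped with a fault (SIGSEGV)\n");

    /* Code that legitimately needs to run (e.g. a JIT) must request
     * execute permission explicitly. */
    if (mprotect(buf, len, PROT_READ | PROT_EXEC) == 0)
        printf("page re-marked executable via mprotect()\n");

    munmap(buf, len);
    return EXIT_SUCCESS;
}

The enforcement is done by the processor and the OS, not by this program: on older IA-32 parts without the XD bit, the same read/write mapping would in practice also be executable.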

  49. 5.1 Intel MCPs (4) Figure 5.3: Processor specifications of Intel’s Pentium D family (65 nm) Source: http://www.intel.com/products/processor/index.htm

  50. 5.1 Intel MCPs (5) Figure 5.4: Specifications of Intel’s Pentium Processor Extreme Edition models 840/955/965 Source: http://www.intel.com/products/processor/index.htm
