

  1. Multicore and Manycore Processors Overview and Trends Dezső Sima December 2008

  2. Overview 1. Overview 2. Homogeneous multicore processors 2.1 Conventional multicores 2.2 Manycore processors 3. Heterogeneous multicore processors 3.1 Master/slave architectures 3.2 Attached processor architectures 4. Outlook

  3. 1. Overview – inevitability of multicores

  4. 1. Overview – inevitability of multicores (1) Shrinking: ~0.7×/2 years. Figure: Evolution of Intel’s IC fab technology [1]

  5. 1. Overview – inevitability of multicores (2) New IC fab technology every two years (shrinking ~0.7×/2 years): • the same number of transistors fits on ½ the Si die area • on the same die area: 2× as many transistors. Result: doubling transistor counts ~every two years (on the chips), i.e. Moore’s rule (2nd formulation, from 1975).
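
The arithmetic behind this rule is easy to check; the C snippet below is a toy calculation based on the slide’s ~0.7× linear shrink per two-year generation (the five-generation horizon is only an illustrative assumption).

```c
#include <stdio.h>

/* Toy check of the slide's arithmetic: a ~0.7x linear shrink per technology
 * generation (every ~2 years) roughly halves the area of a fixed design,
 * i.e. roughly doubles the transistor budget of a fixed die area.           */
int main(void)
{
    const double shrink = 0.7;   /* linear shrink factor per generation       */
    double area   = 1.0;         /* relative area of a fixed design           */
    double budget = 1.0;         /* relative transistor count on a fixed die  */

    for (int gen = 0; gen <= 4; gen++) {          /* five generations ~ 10 yrs */
        printf("gen %d: area %.2f, transistor budget %.1fx\n", gen, area, budget);
        area   *= shrink * shrink;                /* 0.7^2 ~ 0.49 ~ 1/2        */
        budget /= shrink * shrink;                /* ... i.e. ~2x transistors  */
    }
    return 0;
}
```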

  6. 1. Overview – inevitability of multicores (3) Doubling transistor counts ~every two years. How to utilize the surplus transistors? Wider processors: pipelined (1st gen.), superscalar (2nd gen.)

  7. 1. Overview – inevitability of multicores (4) Available parallelism in general purpose apps: ~ 4-5 Figure: Parallelism available in applications [2]
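
The reason a width of ~4-5 is already near the useful limit can be illustrated with two hypothetical C fragments (not from the slides): the first loop’s iterations are independent and can fill a wide superscalar core, while the second forms a serial dependence chain that no additional issue width can speed up.

```c
/* Hypothetical fragments contrasting independent operations with a
 * dependence chain; typical code mixes both, which is why the measured
 * instruction-level parallelism saturates around 4-5.                     */

void add_independent(const double *a, const double *b, double *c, int n)
{
    /* Iterations are independent: a wide superscalar core can keep
     * several of these additions in flight per cycle.                     */
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

double add_chain(const double *a, int n)
{
    /* Each addition needs the previous result: a serial dependence chain,
     * so issuing more than one of them per cycle is impossible.           */
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s += a[i];
    return s;
}
```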

  8. 1. Overview – inevitability of multicores (5) Doubling transistor counts ~every two years. How to utilize the surplus transistors? • Wider processors: pipelined (1st gen.), superscalar (2nd gen.) • Core enhancements: branch prediction, speculative loads, ... • Cache enhancements: L2/L3 enhancements (size, associativity, ...)

  9. 1. Overview – inevitability of multicores (6) Increasing transistor counts bring diminishing returns in single-core performance, so the best use of the surplus transistors is multiple cores: multicore processors are inevitable, with core counts doubling ~every two years.

  10. 1. Overview – inevitability of multicores (7) Figure: The spread of Intel’s multicore processors [3]

  11. 1. Overview – inevitability of multicores (8) Figure 1.1: Main classes of multicore/manycore processors: homogeneous multicores, comprising conventional multicores (2 ≤ n ≤ 8 cores; desktops, servers, general-purpose computing; production stage) and manycore processors (>8 cores; prototypes/experimental systems), and heterogeneous multicores, comprising master/slave architectures and add-on architectures (MM/3D/HPC; near future).

  12. 2. Homogeneous multicores 2.1 Conventional multicores 2.2 Manycore processors

  13. 2. Homogeneous multicores Figure 2.1: Main classes of multicore/manycore processors (same classification as Figure 1.1); this chapter covers the homogeneous branch: conventional multicores (2 ≤ n ≤ 8 cores) and manycore processors (>8 cores).

  14. 2.1 Conventional multicores Multicore MP servers Intel’s multicore MP servers AMD’s multicore MP servers

  15. 2.1 Intel’s multicore MP servers (1) The evolution of Intel’s basic microarchitecture. Figure 2.1.1: Intel’s Tick-Tock development model [13]

  16. 2.1 Intel’s multicore MP servers (2) Intel’s Tick-Tock model for MP servers and the related processors (base cores: Pentium 4 Prescott 90 nm, Pentium 4 Irwindale; 3/2005: first 64-bit MP Xeons; 11/2005: first DC MP Xeon):
  • Xeon MP (Potomac): 1×1 C, 8 MB L3
  • Xeon MP (Cransfield): 1×1 C, 1 MB L2
  • Xeon 7000 (Paxville MP): 2×1 C, 1 MB L2/C
  • Xeon 7100 (Tulsa): 2×1 C, 1 MB L2/C, 16 MB L3
  • Xeon 7200 (Tigerton DC): 1×2 C, 4 MB L2/C
  • Xeon 7300 (Tigerton QC): 2×2 C, 4 MB L2/C
  • Xeon 7400 (Dunnington): 1×6 C, 3 MB L2/2C, 16 MB L3
  • Xeon 7xxx (Beckton, due 1Q/2009): 1×8 C, ¼ MB L2/C, 24 MB L3
  Figure 2.1.2: Overview of Intel’s Tick-Tock model and the related MP servers [24]

  17. 2.1 Intel’s multicore MP servers (3) System architecture before Potomac: four single-core Xeon MPs share the FSB of the preceding NBs, which typically connect to the I/O hub via HI 1.5 (266 MB/s). Figure 2.1.3: Evolution of Intel’s Xeon MP-based system architecture (until the appearance of Nehalem)

  18. 2.1 Intel’s multicore MP servers (4) Truland MP platform (3/2005), P4-based cores:
  • MP cores: Xeon MP (Potomac SC, 3/2005; first 64-bit MP server processor): 90 nm/675 mtrs, 1 MB L2, 8/4 MB L3, 667 MT/s, mPGA 604 • Xeon 7000 (Paxville MP DC, 11/2005): 90 nm/2×169 mtrs, 2×1 (2) MB L2, no L3, 800/667 MT/s, mPGA 604 • Xeon 7100 (Tulsa DC, 8/2006): 65 nm/1328 mtrs, 2×1 MB L2, 16/8/4 MB L3, 800/667 MT/s, mPGA 604
  • MP chipsets: 8500 (Twin Castle, 3/2005): 2×FSB 667 MT/s, 4×XMB (2×DDR2), 32 GB • 8501 (4/2006): 2×FSB 800 MT/s, 4×XMB (2×DDR2), 32 GB
  Figure 2.1.4: Intel’s Xeon-based MP server platforms

  19. 2.1 Intel’s multicore MP servers (5) Up to 2005: four single-core Xeon MPs (pre-Potomac) on the preceding NBs, typically with HI 1.5 (266 MB/s) to the I/O hub. Truland (2005): four Potomac (SC, the first x86-64 MP processor) or Paxville MP (DC) processors on two FSBs to the 8500/8501 NB (Twin Castle), which attaches four XMBs (External Memory Bridges) over serial links, each XMB driving 2×DDR2 channels, and provides 28 PCIe lanes + HI 1.5. Note: the 8500 also supports Cransfield (SC) and Tulsa (DC). Figure 2.1.5: Evolution of Intel’s Xeon MP-based system architecture (until the appearance of Nehalem)

  20. 2.1 Intel’s multicore MP servers (6) MP platforms: Truland (3/2005, P4-based cores) and Caneland (9/2007, Core2-based cores).
  • Truland cores (as on the previous slides): Xeon MP (Potomac SC, 3/2005), Xeon 7000 (Paxville MP DC, 11/2005), Xeon 7100 (Tulsa DC, 8/2006); chipsets 8500 (Twin Castle, 3/2005) and 8501 (4/2006): 2×FSB 800/667 MT/s, 4×XMB (2×DDR2), 32 GB
  • Caneland cores: Xeon 7200 (Tigerton DC, 9/2007) and Xeon 7300 (Tigerton QC, 9/2007): 65 nm/2×291 mtrs, 2×4 or 2×(4/3/2) MB L2, no L3, 1066 MT/s, mPGA 604 • Xeon 7400 (Dunnington 6C, 9/2008): 45 nm/1900 mtrs, 9/6 MB L2, 16/12/8 MB L3, 1066 MT/s, mPGA 604; chipset 7300 (Clarksboro, 9/2007): 4×FSB 1066 MT/s, 4×FBDIMM (DDR2) channels, 512 GB
  Figure 2.1.6: Intel’s Xeon-based MP server platforms

  21. 2.1 Intel’s multicore MP servers (7) As on the previous slide, extended with Caneland (2007): four Tigerton (DC/QC) or Dunnington (6C/QC/DC) processors, each on its own FSB to the 7300 NB (Clarksboro), which drives FB-DIMM (DDR2) memory and provides 28 PCIe lanes + ESI. Figure 2.1.7: Evolution of Intel’s Xeon MP-based system architecture (until the appearance of Nehalem)

  22. 2.1 Intel’s multicore MP servers (8) Nehalem’s key innovations concerning the system architecture (11/2008) Figure 2.1.8: Nehalem’s key innovations concerning the system architecture [22]

  23. 2.1 Intel’s multicore MP servers (9) Nehalem’s key innovations concerning the system architecture (11/2008) Figure 2.1.9: Nehalem’s key innovations concerning the system architecture [22]

  24. 2.1 Intel’s multicore MP servers (10) 11/2008: Nehalem. Four Beckton (8C) processors, fully interconnected by QPI links (QPI: QuickPath Interconnect), each with 4×FB-DIMM memory channels. Figure 2.1.10: Intel’s Nehalem based MP server architecture
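
The topological difference to the FSB-based platforms above can be sketched with a toy C model (illustrative only, not vendor code): in the 4-socket arrangement of Figure 2.1.10 every socket reaches every other socket over a dedicated QPI link, instead of all processors contending for a shared front-side bus.

```c
#include <stdio.h>

#define SOCKETS 4

int main(void)
{
    /* Toy adjacency matrix for the 4-socket topology of Figure 2.1.10:
     * qpi[i][j] == 1 means sockets i and j share a direct QPI link.      */
    int qpi[SOCKETS][SOCKETS] = {{0}};
    int links = 0;

    for (int i = 0; i < SOCKETS; i++)
        for (int j = i + 1; j < SOCKETS; j++) {
            qpi[i][j] = qpi[j][i] = 1;   /* full mesh: every pair linked   */
            links++;
        }

    /* 4 sockets -> 6 point-to-point links, every remote socket one hop
     * away; on a shared FSB all sockets arbitrate for one bus instead.    */
    printf("socket-to-socket QPI links: %d\n", links);
    return 0;
}
```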

  25. AMD’s multicore MP servers

  26. 2.1 AMD’s multicore MP servers (1) AMD Direct Connect Architecture (2003) Introduced in 2003 along with the x86-64 ISA extension • Integrated Memory Controller • Serial HyperTransport links (Intel: 2008 with Nehalem) Remark • 3 HT 1.0 links at introduction (K8), • 4 HT 3.0 links with K10 (Barcelona) Figure 2.1.11: AMD’s Direct Connect Architecture [14]

  27. 2.1 AMD’s multicore MP servers (2) Use of the available HyperTransport links [44]: • UPs: each link supports connections to I/O devices • DPs: two links support connections to I/O devices; any one of the three links may connect to another DP or MP processor • MPs: each link supports connections to I/O devices or to other DP or MP processors

  28. 2.1 AMD’s multicore MP servers (3) 2P and 4P Opteron servers: each AMD Opteron has its own registered DDR2 memory and connects to the other Opterons over HT links, while the remaining HT links attach the I/O (PCI-X, PCI, PCI Express). Figure 2.1.12: 2P and 4P server architectures based on AMD’s Direct Connect Architecture [15], [16]

  29. 2.1 AMD’s multicore MP servers (4) Figure 2.1.13: Block diagram of Barcelona (K10) vs K8 [17]

  30. 2.1 AMD’s multicore MP servers (5) Figure 2.1.14: Possible use of Barcelona’s four HT 3.0 links [39]

  31. 2.1 AMD’s multicore MP servers (6) Current platforms (2nd Gen. Socket F with the available chipsets) do not support HT 3.0 links [46]. Novel features of HT 3.0, such as • higher speed or • splitting a 16-bit HT link into two 8-bit links, can be utilized only with a new platform.
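
The appeal of these features is plain bandwidth arithmetic; the sketch below is hypothetical and assumes the HT 3.0 maximum transfer rate of 5.2 GT/s per bit lane (2.6 GHz, double data rate), a figure not quoted on the slide.

```c
#include <stdio.h>

/* Hypothetical bandwidth arithmetic; assumes the HT 3.0 maximum of
 * 2.6 GHz clock, double data rate -> 5.2 GT/s per bit lane.              */
int main(void)
{
    const double gtps = 5.2;                /* GT/s per lane (HT 3.0 max)  */
    double link16 = gtps * 16.0 / 8.0;      /* GB/s per direction, 16-bit  */
    double link8  = gtps *  8.0 / 8.0;      /* GB/s per direction,  8-bit  */

    printf("one 16-bit HT 3.0 link: %.1f GB/s per direction\n", link16);
    printf("split into 2 x 8-bit  : 2 x %.1f GB/s per direction\n", link8);
    /* Splitting keeps the aggregate bandwidth but doubles the number of
     * point-to-point partners a processor can reach directly.             */
    return 0;
}
```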

  32. 2.1 AMD’s multicore MP servers (7) Figure 2.1.15: AMD’s roadmap for server processors and platforms [19]

  33. 2.2 Manycore processors

  34. 2.2 Manycore processors Figure 2.2.1: Main classes of multicore/manycore processors (same classification as Figure 1.1); this section covers manycore processors (>8 cores; prototypes/experimental systems).

  35. 2.2 Manycore processors: Intel’s Larrabee, Intel’s Tiled processor

  36. 2.2 Intel’s Larrabee (1) Larrabee: part of Intel’s Tera-Scale Initiative. • Objectives: high-end graphics processing, HPC. Not a single product but a base architecture for a number of different products. • Brief history: project started ~2005; first unofficial public presentation 03/2006 (withdrawn); first brief public presentation 09/2007 (Otellini) [29]; first official public presentations in 2008 (e.g. at SIGGRAPH [27]); due in ~2009. • Performance (targeted): 2 TFLOPS

  37. 2.2 Intel’s Larrabee (2) Basic architecture • Cores: in-order, 4-way multithreaded x86 (IA) cores, augmented with a SIMD-16 capability • L2 cache: fully coherent • Ring bus: 1024 bits wide. Figure 2.2.2: Block diagram of the Larrabee [4]

  38. 2.2 Intel’s Larrabee (3) Main extensions relative to the original Pentium core: • 64-bit instructions • 4-way multithreading (with 4 register sets) • addition of a 16-wide (16×32-bit) VU • increased L1 caches (32 KB vs 8 KB) • access to its 256 KB local subset of a coherent L2 cache • a ring network to access the coherent L2 cache and to allow interprocessor communication. Figure 2.2.5: Larrabee vs the Pentium [11]

  39. 2.2 Intel’s Larrabee (4) Mask registers: the Vector Unit’s mask registers have one bit per lane, to control which lanes of a vector register or memory operand are read or written and which remain untouched. VU scatter-gather instructions: load a VU vector register from 16 non-contiguous data locations anywhere in the on-die L1 cache without penalty, or store a VU register similarly. Numeric conversions: 8-bit and 16-bit integer and 16-bit FP data can be read from or written into the L1 cache with conversion to 32-bit integers without penalty. The L1 D$ thus acts as an extension of the register file. Figure 2.2.3: Block diagram of the Vector Unit [5]
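
Since the slides themselves note that Larrabee’s ISA extension was unspecified at the time, the C sketch below only mimics the described behaviour in plain scalar code: a 16-lane gather under a per-lane write mask, with all names and widths chosen for illustration.

```c
#include <stdint.h>

#define LANES 16

/* Illustrative model of a masked 16-wide gather (not Larrabee's real ISA):
 * for every lane whose mask bit is set, load a 32-bit element from a
 * possibly non-contiguous location; masked-off lanes stay untouched.      */
void gather_masked(float dst[LANES], const float *base,
                   const int32_t idx[LANES], uint16_t mask)
{
    for (int lane = 0; lane < LANES; lane++)
        if (mask & (1u << lane))          /* one mask bit per lane         */
            dst[lane] = base[idx[lane]];
}
```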

  40. 2.2 Intel’s Larrabee (5) ALUs: • the ALUs execute integer, SP and DP FP instructions • multiply-add instructions are available. Figure 2.2.4: Layout of the 16-wide vector ALU [5]
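
The 16-wide multiply-add can be modelled the same way; this is an illustrative scalar stand-in for one vector instruction (again not actual Larrabee code), honouring the same per-lane mask as above.

```c
#include <stdint.h>

#define LANES 16

/* Illustrative scalar stand-in for one 16-wide fused multiply-add:
 * dst = a * b + c per lane, skipping lanes whose mask bit is cleared.     */
void vfmadd_masked(float dst[LANES], const float a[LANES],
                   const float b[LANES], const float c[LANES],
                   uint16_t mask)
{
    for (int lane = 0; lane < LANES; lane++)
        if (mask & (1u << lane))
            dst[lane] = a[lane] * b[lane] + c[lane];
}
```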

  41. 2.2 Intel’s Larrabee (6) CSI: Common System Interface (serial packet-based bus). Figure 2.2.6: System architecture of a Larrabee based 4-processor MP server [6]

  42. 2.2 Intel’s Larrabee (7) Programming of Larrabee [5] • Larrabee has x86 cores with an unspecified ISA extension.

  43. 2.2 Intel’s Larrabee (8) AES: Advanced Encryption Standard; AVX: Advanced Vector Extensions; FMA: FP fused multiply-add instructions supporting 256-bit/128-bit SIMD. Figure 2.2.7: Intel’s ISA extensions [11]

  44. 2.2 Intel’s Larrabee (9) Programming of Larrabee [5] • Larrabee has x86 cores with an unspecified ISA extension • the x86 cores allow Larrabee to be programmed like usual x86 processors, using enhanced C/C++ compilers from MS, Intel, GCC, etc. • this is a huge advantage compared to the competition (Nvidia, AMD/ATI).
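
A minimal sketch of what programming it “as a usual x86 processor” could look like, assuming nothing Larrabee-specific: ordinary C with an OpenMP pragma, as supported by the Intel, Microsoft and GCC toolchains the slide mentions (compile e.g. with gcc -fopenmp).

```c
#include <stdio.h>
#include <omp.h>

/* Ordinary C + OpenMP: the loop iterations are distributed over however
 * many x86 cores/threads the chip exposes; no GPU-style shader language. */
int main(void)
{
    enum { N = 1 << 20 };
    static float a[N], b[N], c[N];

    for (int i = 0; i < N; i++) { a[i] = (float)i; b[i] = 2.0f * i; }

    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];

    printf("c[42] = %.1f (up to %d threads)\n", c[42], omp_get_max_threads());
    return 0;
}
```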

  45. Intel’s Tiled processor

  46. 2.2 Intel’s Tiled processor (1) Tiled Processor • First implementation of Intel’s Tera-Scale Initiative (among more than 100 projects) • Aim: Tera-Scale research chip for exploring high-bandwidth interconnects, energy management and the programming of manycore processors • Milestones of the development: announced at IDF Fall 2006 (9/2006), details at ISSCC 2007 (2/2007), due in 2009/2010 • Remark: based on ideas of the Raw processor (MIT)

  47. 2.2 Intel’s Tiled processor (2) Figure 2.2.8: Basic structure of the Tiled Processor [7]

  48. 2.2 Intel’s Tiled processor (3) Per tile: 2 single-precision FP (multiply-add) engines; further figure annotations: SP FP cores (for debugging), VLIW microarchitecture? Figure 2.2.9: Block diagram of a tile [7], [9]
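
As a worked example of how such tiles add up to the “Tera-Scale” target: using the commonly reported ISSCC 2007 figures for the 80-tile research chip (80 tiles, 2 FPMAC units per tile, a multiply-add counted as 2 FLOPs, ~3.16 GHz clock; none of these numbers appear on this slide), the peak throughput comes out at roughly 1 TFLOPS.

```c
#include <stdio.h>

/* Peak-FLOPS arithmetic with the commonly reported figures for the
 * 80-tile research chip (assumed here, not taken from the slide).        */
int main(void)
{
    const double tiles = 80, fpmacs = 2, flops_per_mac = 2, ghz = 3.16;
    double tflops = tiles * fpmacs * flops_per_mac * ghz * 1e9 / 1e12;

    printf("peak ~ %.2f TFLOPS\n", tflops);   /* ~1.01 TFLOPS             */
    return 0;
}
```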

  49. 2.2 Intel’s Tiled processor (4) Figure 2.2.10: Die shot of the Tiled Processor [8]

  50. 2.2 Intel’s Tiled processor (5) Figure 2.2.13: Ring-based interconnect network topology [7]
