
Multi-core Programming



  1. Multi-core Programming Multi-core Hardware Architecture

  2. Topics • Dual-Core Architecture Notes • Motivation for Multicore • Architecture Features • SIMD • 64 Bit • Concurrency vs. Parallelism

  3. Dual Core Architecture Notes - History before 2006 • Architecture: instruction set definition and compatibility. Examples: IA-32, EPIC* (Itanium®), IXA* (xScale) • Microarchitecture: hardware implementation maintaining instruction set compatibility with the high-level architecture. Examples: P5, P6, Intel NetBurst®, Mobile, Intel® Core™ Micro-Architecture • Processors: productized implementations of a microarchitecture. Examples: Pentium®, Pentium® Pro, Pentium® II/III, Pentium® 4, Pentium® D, Xeon®, Pentium® M • * IXA – Intel Internet Exchange Architecture / EPIC – Explicitly Parallel Instruction Computing Intel® Core(TM) Micro-Architecture

  4. Dual Core Architecture Notes - Intel® Core™ Micro-Architecture • Intel objectives: • Deliver world class performance • Unsurpassed energy efficiency • Optimized for multi-core • Single foundation for desktop, mobile and server segments • (Diagram: Intel® Core™ µArch = NetBurst® µArch + mobile µArch + “new innovations”)

  5. Dual Core Architecture Notes - Motivation for Multicore • Chip frequency doubled every 18 to 24 months (chart: frequency in MHz, log scale, from the 4004 in the ’70s to the Pentium® 4 in the ’00s – “frequency doubled in two years”) • This trend defined processor design in the past • The trend has effectively tapered off since 2005 due to energy issues • Example: the Intel Pentium® processor Extreme Edition at 3.7 GHz

  6. Dual Core Architecture Notes - Motivation for Multicore • Small increases in processor speed result in large increases in power

  7. Dual Core Architecture Notes - Q: How Can Performance Increase Without Relying on Frequency? • Key equation: Performance = Frequency x Instructions per Cycle • Increase IPC (Instructions per Cycle) by • Adding additional execution cores – multi-core design • Improving selected micro-architectural features to get better execution efficiency for a single core

  8. Dual Core Architecture Notes - Multi-Core Technology Power Advantage • Rule of thumb (% relationships, in the same process technology): • Single core: Voltage = 1, Freq = 1, Area = 1, Power = 1, Perf = 1 • Dual core: Voltage = -15%, Freq = -15%, Area = 2, Power = 1, Perf = ~1.8 (depends on SW)

  9. Dual Core Architecture Notes - Improving Execution Efficiency by • Reduced pipeline length • Reduces the cost of pipeline flushes, e.g. making branch mispredictions cheaper • Increased dynamic execution parallelism • More execution units • Overall wider data paths • Optimized data access patterns • Recycle data in transit (store-forwarding, buffering) • Look-ahead data movements (prefetching) • Improved cache structure reducing the cost of cache misses and in particular the overhead of the snooping protocol (in multi-core/multi-socket systems)

  10. Intel® Wide Dynamic Execution • 14-stage efficient pipeline • Wider execution path • Advanced branch prediction • Macro-fusion • Roughly 15% of all instructions are conditional branches • Macro-fusion fuses a comparison and a jump to reduce the micro-ops running down the pipeline • Micro-fusion • Merges the load and operation micro-ops into one fused micro-op • Stack pointer tracker • ESP tracks the stack • This pointer allows pushes/pops to return the correct values • 64-Bit Support • Merom, Conroe, and Woodcrest support EM64T New Intel Microarchitecture

  11. (Block diagram of one core – doubled in a dual-core processor: Instruction Fetch and Pre-Decode → Instruction Queue → Decode (with uCode ROM) → Rename/Alloc → Schedulers → execution units (ALU+Branch, ALU+FMul, ALU+FAdd, each with MMX/SSE and FP Move; Load; Store) → Retirement Unit (Re-Order Buffer); per-core L1 D-Cache and D-TLB; 2MB/4MB shared L2 cache; up to 10.5 GB/s FSB)

  12. Dual Core Architecture Notes – AMD Multi-core

  13. Dual Core Architecture Notes – SUN Sparc Multi-core

  14. Dual Core Architecture Notes – Tilera Multi-core • Tilera – TILE64: • At a clock rate of only 1 GHz • 192 billion instructions per second • 27 Tbps of mesh interconnect • 50 Gbps of I/O bandwidth • 4 DDR2 memory controllers • 2 × 10 Gbps XAUI interfaces • 2 × 1 Gbps Ethernet controllers • Each core is a full-featured, general-purpose processor that includes: • L1 and L2 caches • Distributed L3 cache • iMesh network – provides extremely low-latency, high-bandwidth communications between the cores, memory and the I/O.

  15. Dual Core Architecture Notes - Intel® Core™ Micro-Architecture Outline • (Block diagram; arrows show instruction flow: icache with predecode and branch prediction unit → instruction queue → instruction decode (with MS) → register alias table / ALLOC → Re-Order Buffer → Reservation Station → integer, FP, SIMD (3x), load, store address and store data units → data cache unit with memory order buffer and page miss handler)

  16. Dual Core Architecture Notes - Intel® Wide Dynamic Execution • Short, 14-stage efficient pipeline • Reduced cost of pipeline flushes, e.g. due to branch mispredictions • Reduced execution latency • At least 4 µ-ops/cycle wide execution paths • Pre-Decode: 6 instructions/cycle sent to the Instruction Queue • Decode: 4 (5 with macro-fusion) instructions/cycle • Renaming: 4 µ-ops/cycle • Dispatch ports from Reservation Station to execution units: 6 • 3 execution ports shared for Integer, Floating Point and SIMD • 1 for Load • 1 for Store (address) • 1 for Store (data) • Retirement: 4 µ-ops/cycle

  17. Dual Core Architecture Notes - Execution Components • Renamer: • Architectural registers are renamed to a larger set of microarchitectural registers • Allocates µOps into the ROB • Reorder buffer (ROB): • Tracks the progress of a µOp from issue to completion/cancellation • Updates the architectural state in order • Manages ordering of exceptions • Reservation station (RS): • Holds µOps until dispatched for execution • Schedules and dispatches ready micro-ops to the available execution units in as close to a first-in-first-out (FIFO) order as possible • Ports and execution units: • Execute the instructions, pass the results and write back to the ROB • (Diagram: the RS feeds scheduler/dispatch ports to integer arithmetic, integer shift/rotate, FP add, FP div/mul and SIMD units, plus the Memory Order Buffer (MOB))

  18. Dual Core Architecture Notes - Intel® Advanced Digital Media Boost: Full 128-Bit SSE Instruction Execution • SSE – Streaming SIMD Extensions: adds 8 new 128-bit registers, each holding e.g. 4 × 32-bit (single precision) floating point values, and gives access to 70 new instructions that operate on these 128-bit registers, MMX registers, and sometimes even regular 32-bit registers • Single Cycle SSE • 8 single precision flops/cycle • 4 double precision flops/cycle • Wide operations • 128-bit packed Add • 128-bit packed Multiply • 128-bit packed Load • 128-bit packed Store • Support for Intel® 64 • In combination with Intel® Wide Dynamic Execution it is now possible to execute up to 3 SIMD instructions each cycle • (Diagram: an SSE op combines SOURCE lanes X4..X1 with Y4..Y1 into DEST X4opY4..X1opY1; the Core™ architecture completes all four lanes in clock cycle 1, where the previous generation needed two cycles)
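The 4-wide single-precision operation in the diagram can be written directly with the standard SSE compiler intrinsics; a minimal sketch (the function name `add4` is made up for illustration, not code from the slides):

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Packed single-precision add: one _mm_add_ps computes all four
 * x[i] + y[i] lane results, matching the X4opY4..X1opY1 diagram. */
static void add4(const float *x, const float *y, float *out) {
    __m128 vx = _mm_loadu_ps(x);  /* load 4 floats (unaligned) */
    __m128 vy = _mm_loadu_ps(y);
    _mm_storeu_ps(out, _mm_add_ps(vx, vy));
}
```

On a Core-microarchitecture machine this compiles to a single 128-bit `addps`, executed in one cycle as the slide describes.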

  19. Dual Core Architecture Notes - Intel® Advanced Smart Cache: Both Cores Sharing the Level 2 Cache • (Diagram: the CPU-0 and CPU-1 cores, each with L1I = 32KB, L1D = 32KB and an L0/L1 DTLB, share a unified L2 cache of 4 (8) MB connected to the FSB) • FSB – Front Side Bus • L1D – Level 1 Data • L1I – Level 1 Instruction • DTLB – Data Translation Lookaside Buffer

  20. Dual Core Architecture Notes - Shared L2 Cache: Difference to Traditional Architectures • (Diagram: in NetBurst™, Core 0 and Core 1 each have a private L1 and L2 cache and talk to memory through the MCH; in Core™, both cores’ L1 caches sit above a single shared L2 with common cache control)

  21. Dual Core Architecture Notes - Sharing a Cache Line on a Non-shared L2 Cache • (Diagram: CPU1 and CPU2 with private L2 caches; shipping a shared cache line from one L2 to the other across the Front Side Bus (FSB) costs ~150 cycles)

  22. Dual Core Architecture Notes - Sharing a Cache Line on Core™ • (Diagram: L2 is shared, so there is no need to ship the cache line – CPU1 and CPU2 both access the line in the shared L2)
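The cache-line traffic described in the last two slides is also why multi-threaded code often pads per-thread data so that independently updated variables do not land on the same line. A minimal sketch, assuming the 64-byte line size stated on the next slide (the struct and array names are made up):

```c
#define CACHE_LINE 64  /* line size of the Core microarchitecture caches */

/* Two counters updated by two different threads. Without the padding they
 * would share one 64-byte line, and every update would ping-pong the line
 * between the cores' caches (or, on a non-shared L2, across the FSB). */
struct padded_counter {
    long value;
    char pad[CACHE_LINE - sizeof(long)]; /* push the next counter onto its own line */
};

struct padded_counter counters[2];  /* counters[0] and counters[1] on separate lines */
```

Each element now occupies exactly one cache line, so the two threads never contend for the same line.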

  23. Dual Core Architecture Notes - Intel® Advanced Smart Cache: More Facts • 32KB Level 1 D-Cache (8-way, 64 byte line size) • 3 clock latency and 1 clock throughput to core • 32KB Level 1 I-Cache (8-way, 64 byte line size) • 16 aligned bytes fetched per cycle • Fast “cache-to-cache” transfer • Hit in other core’s L1 cache requires no memory access • 2MB or 4MB shared, unified Level 2 cache • 8-way for 2MB and 16-way for 4MB • Both cores have access to the full L2 cache • ~14 clock latency and 2 clock throughput to core • 256 entry DTLB • 128 entry ITLB (for 4K pages; 8 entries for 2M pages)

  24. Dual Core Architecture Notes - Intel® Smart Memory Access • Sophisticated data and instruction prefetch implementations to move data to the cache prior to being accessed by the application • Improved store-forwarding handling • Fewer cases where reading data just written incurs a penalty • Memory disambiguation predictor • Loads that are predicted NOT to forward from a preceding store are allowed to schedule as early as possible • Increases the performance of OOO memory pipelines due to increased speculation potential • Re-execution mechanism in case the prediction was incorrect

  25. Dual Core Architecture Notes - Intel® Smart Memory Access: Prefetching • L1D cache prefetching • Data Cache Unit Prefetcher (streaming prefetcher) • Recognizes ascending access patterns in recently loaded data • Prefetches the next line into the processor’s L1D cache • Instruction Based Stride Prefetcher • Prefetches based upon a load having a regular stride • Can prefetch forward or backward 2 Kbytes • L2 cache prefetching: Data Prefetch Logic (DPL) • Prefetches data to the 2nd level cache before the DCU requests the data • Maintains 2 tables for tracking loads • Upstream – 16 entries, downstream – 4 entries • Every load is either found in the DPL or generates a new entry • Upon recognition of the 2nd load of a “stream”, the DPL will prefetch the next load • Instruction Prefetcher • 16-byte aligned lookup through the ITLB into the instruction cache and instruction prefetch buffers
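The hardware prefetchers above can be complemented by explicit software prefetch hints; a sketch using the standard `_mm_prefetch` intrinsic (the 16-element lookahead distance here is an arbitrary illustration, not a tuned value):

```c
#include <xmmintrin.h>  /* _mm_prefetch, _MM_HINT_T0 */

/* Sum an array while hinting a later element into the cache hierarchy.
 * On a plain streaming access like this the DCU hardware prefetcher
 * usually does the job on its own; explicit prefetch mainly helps
 * irregular access patterns the stride prefetcher cannot recognize. */
static double sum_with_prefetch(const double *a, int n) {
    double s = 0.0;
    for (int i = 0; i < n; i++) {
        if (i + 16 < n)
            _mm_prefetch((const char *)&a[i + 16], _MM_HINT_T0);
        s += a[i];
    }
    return s;
}
```

The hint is advisory: it never changes the result, only (potentially) the latency of the later loads.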

  26. SIMD - Introducing SIMD: Single Instruction, Multiple Data • Scalar processing • Traditional mode • One operation produces one result • SIMD processing • With SSE / SSE2 • One operation produces multiple results • (Diagram: a scalar add computes one x + y; a SIMD add computes x3+y3, x2+y2, x1+y1, x0+y0 in a single operation)

  27. SIMD - x86 Register Sets: SSE Registers First Introduced in Pentium® 3 • IA-INT registers: fourteen 32-bit registers (eax … edi) • Scalar data & addresses • Direct access to the registers • MMX™ Technology / IA-FP registers: eight 80/64-bit registers (st0…st7 / mm0…mm7) • Hold data only • Stack access to FP0..FP7, direct access to MM0..MM7 • No MMX™ Technology / FP interoperability • SSE registers: eight 128-bit registers (xmm0…xmm7) • Hold data only: 4 × single FP numbers, 2 × double FP numbers, or 128-bit packed integers • Direct access to the registers • Usable simultaneously with FP / MMX™ Technology

  28. SIMD - Instruction Set Extensions • See http://cache-www.intel.com/cd/00/00/32/26/322663_322663.pdf for details on the SSE-4 instruction set to be added to the Intel Core™ architecture in a future 45 nm update

  29. SIMD - SSE and SSE-2 Data Types • SSE: 4 × floats • SSE-2: 2 × doubles, 16 × bytes, 8 × 16-bit shorts, 4 × 32-bit integers, 2 × 64-bit integers, 1 × 128-bit(!) integer

  30. SIMD - SSE Instruction Set Extensions • Introduced by Pentium® 3 in 1999; now frequently called SSE-1 • Only new data type supported: 4 × 32-bit (single precision) floating point data • Some 70 instructions • Arithmetic, compare and convert operations on SSE SP FP data • Packed and unpacked data load/store • Prefetch • Extension of MMX • Streaming store (store without using the cache in between) • … 2001 PTE Engineering Enabling Conference

  31. SIMD - Ways to SSE/SIMD Programming • Coding using SSE/SSE2/3/4 assembler instructions • Very tedious (manual scheduling) – discouraged: don’t do it! • E.g.: how do you exploit the benefit of having 16 instead of 8 SSE registers for Intel® 64 without maintaining two versions? • Intel® compiler’s C/C++ SIMD intrinsics • No need to take care of register allocation, scheduling, etc. • Intel® compiler’s C++ vector class library • Use this if you are heavy into C++ classes • Vectorizer of the Intel® C++ and Fortran Compilers • Recommended for most cases – easy and efficient • Use ready-to-go vectorized code from a library like the Intel® Math Kernel Library (MKL)
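To make the trade-off between the vectorizer and intrinsics concrete, here is the same scaling loop written both ways (a sketch; the function names are made up, and the intrinsics version assumes n is a multiple of 4):

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Vectorizer-friendly version: a plain loop with no aliasing (restrict)
 * and no cross-iteration dependences, which a vectorizing compiler can
 * turn into packed SSE instructions automatically. */
static void scale_auto(float *restrict dst, const float *restrict src,
                       float k, int n) {
    for (int i = 0; i < n; i++)
        dst[i] = k * src[i];
}

/* Hand-written intrinsics version of the same loop. The programmer takes
 * over lane handling; a real version would also handle a tail of n % 4
 * leftover elements. */
static void scale_sse(float *dst, const float *src, float k, int n) {
    __m128 vk = _mm_set1_ps(k);  /* broadcast k into all 4 lanes */
    for (int i = 0; i < n; i += 4)
        _mm_storeu_ps(&dst[i], _mm_mul_ps(vk, _mm_loadu_ps(&src[i])));
}
```

Both produce identical results; the first leaves scheduling and register allocation to the compiler, which is why the slide recommends the vectorizer for most cases.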

  32. SIMD - Compiler Based Vectorization: Processor Specific Optimization via Intel Compilers

  33. Intel® 64 = Extending IA-32 to 64 Bit • Extended memory addressability: 64-bit pointers and registers • Additional registers: 8 SSE & 8 general purpose • Double precision (64-bit) integer support • Added to the Intel Xeon™ and Pentium® 4 processors in 2004; today available in all mainstream Intel IA-32 processors – in particular in all processors based on the Intel® Core™ Architecture

  34. Intel® 64 - New Modes of Operation

  35. Intel® 64 - Registers: Extensions and Additions • (Diagram: 80-bit x87/MMX registers, 128-bit XMM registers, and 64/32-bit general purpose registers)

  36. Intel® 64 - Registers: Availability in Different Modes

  37. 64-bit Mode of Operation • Default data size is 32 bits • Override to 64 bits using the new REX prefix • All registers are 64-bit, 32-bit, 16-bit and 8-bit addressable • REX prefixes • A family of 16 prefixes, encoded 0x40-0x4F • Allow the use of general purpose registers as 64 bits • Allow the use of the new registers (r8-r15) • Instructions that set a 32-bit register automatically zero-extend the upper 32 bits

  38. Intel® 64 - REX Prefix • A new instruction-prefix byte used in 64-bit mode • Specifies the new GPRs and SSE registers • Specifies a 64-bit operand size • Specifies extended control registers (used by system software) • An instruction can have only one REX prefix which, if used, must immediately precede the opcode or the two-byte opcode escape prefix • The legacy instruction-size limit of 15 bytes still applies to instructions that contain a REX prefix

  39. 64-Bit - Physical and Linear Addressing • Linear addressing • The initial Intel® 64 implementation supports 48 bits of virtual addressing • Addresses are required to be in canonical form – bits 47 through 63 must all be 1 or all be 0 • Physical addressing • The initial NetBurst™ Intel® 64 implementation supported 36 bits; today all current processors support at least 40 bits • Entries in page tables are expanded for up to 52 bits of physical address

  40. Intel® 64 - Large Memory Considerations • Canonical addressing for 64-bit addresses • Although the architecture now allows calculating flat addresses to 64 bits, today’s processors limit virtual addressing to 48 bits • Canonical address definition: an address that has bits 63 through 47 set to either all ones or all zeros • Canonical addresses are a requirement • Address values that are not canonical will cause faults when put into locations expecting a valid address, such as segment registers
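The canonical-form rule can be expressed as a small check (a sketch; `is_canonical` is a made-up helper name, and the shift by 16 reflects the 48-bit virtual addressing described above):

```c
#include <stdint.h>
#include <stdbool.h>

/* An address is canonical when bits 63..47 are all copies of bit 47,
 * i.e. sign-extending the low 48 bits reproduces the full address.
 * (Relies on arithmetic right shift of signed values, which mainstream
 * compilers provide.) */
static bool is_canonical(uint64_t addr) {
    int64_t sext = ((int64_t)(addr << 16)) >> 16; /* sign-extend from bit 47 */
    return (uint64_t)sext == addr;
}
```

This splits the 64-bit space into a valid lower half (0x0000000000000000–0x00007FFFFFFFFFFF), a valid upper half (0xFFFF800000000000–0xFFFFFFFFFFFFFFFF), and a non-canonical hole in between.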

  41. Intel® 64 - Floating Point Computation • No changes to legacy x87 FP hardware • For Intel® 64, Intel compilers by default execute FP operations via SSE – even when no vectorization is done (scalar SSE) • unless IEEE conformance is enforced by -mp or -fp-model strict • x87 uses 80-bit FP registers, SSE-2 doubles are 64-bit: numerical differences are possible • Floating point interfaces use XMM registers instead of x87 • Function calling sequences pass arguments in registers

  42. Intel® 64 - ABI: Scalar Types: Linux*

  43. Intel® 64 - ABI: Register Call Conventions: Linux* • MBP = Must Be Preserved

  44. Intel® 64 – ABI: Parameter Passing Example: Linux*

typedef struct { int a, b; double d; } structparm;
structparm s;
int e, f, g, h, i, j, k;
long double ld;
double m, n;
extern void func(int e, int f, structparm s, int g, int h, long double ld, double m, double n, int i, int j, int k);
func(e, f, s, g, h, ld, m, n, i, j, k);

General purpose registers: %rdi: e • %rsi: f • %rdx: s.a, s.b • %rcx: g • %r8: h • %r9: i
Floating point registers: %xmm0: s.d • %xmm1: m • %xmm2: n
Stack frame offsets: 0: ld • 16: j • 24: k

  45. 64-Bit versus 32-Bit - Pros • Increased address space • No need for segmentation • 64-bit addressing support for data • Overhead of Physical Address Extension (PAE) addressing is eliminated • Extra 8 integer and SSE registers • Enable parameters to be passed through registers instead of through the stack • Faster execution of calls • Extra registers reduce spilling/filling • Reduced memory operations • Reduced number of instructions • More registers and efficient parameter passing

  46. 64-Bit versus 32-Bit - Cons • Bigger data footprint: pointers are now 64 bits instead of 32 bits • Bigger code footprint • New instruction prefix (REX) required to specify new registers and 64-bit operands • Immediates • Offsets greater than 8 bits • Need to save/restore more registers on function entry/exit • If the code manipulates 64-bit data heavily, a 25-30% increase in code size is possible • Increase in code and data size puts pressure on TLBs, caches and memory sub-systems • 4-level page walks with a 64-bit OS

  47. 64-bit Mode Optimization Guidelines • The majority of the techniques used for IA-32 optimization are applicable in 64-bit mode as well • The architecture is very similar except for the extended address space and registers • Issues like load/store buffer stalls, trace cache usage etc. are the same in both architectures • Intel® 64 specific optimizations • Use the 32-bit versions of instructions in 64-bit mode to reduce code size unless the 64-bit version is necessary to access 64-bit data or additional registers • Sign-extend to 64 bits instead of sign-extending to 32 bits, even when the destination will be used as a 32-bit value • Prefer 64-bit by 64-bit integer multiplies that produce 64-bit results over multiplies that produce 128-bit results • Use the 64-bit versions of add for 64-bit adds

  48. Intel Quad Core Technology of Today: Cache Structure • The L2 cache of today’s quad-core processors is not one cache shared by all 4 cores; instead there are two L2 caches, each shared by two cores • (Diagram: Core 0 and Core 1 share one 4MB L2 cache, Core 2 and Core 3 share another 4MB L2 cache, both connected through the bus interface to a 1066 MHz/1333 MHz FSB)
