Haswell

Thomas Shull, Bhargava Reddy Gopi Reddy, Raghavendra Pradyumna Pothukuchi

RISC-y
• Find each instruction, i.e. decode its length.
• Each x86 instruction (macro-op) is chopped into “µOps”.
• Some macro-op pairs can be treated as one instruction. Pack them together (macro-fusion).
• CMP <–> JUMP IF; ± <–> TEST
• Some µOps are packed into one µOp and are implicitly broken up later (micro-fusion).
• Example: ADD [EBX], EAX is carried as one µOp and later split into MOV ECX, [EBX] and ADD ECX, EAX

www.realworldtech.com/haswell-cpu/

Fetch and Decode
• Multi-cycle, power-hungry decode.
• µOps are cached.
• µOp cache: 32 sets, 8 ways, 6 µOps per line.
• A 32B window (18 µOps at maximum) is inserted at once.
• If a 32B window holds more than 18 µOps, it is not inserted.
• Delivers at most 4 µOps on a “full hit”.
• Double bandwidth (32B vs. 16B) on a hit. Why? AVX!
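The µOp cache numbers on this slide can be checked with a little arithmetic. The sketch below is only a back-of-the-envelope model: the function names are made up for illustration, and the only inputs are the figures quoted above (32 sets, 8 ways, 6 µOps per line, at most 18 µOps per 32B window).

```python
# Geometry as quoted on the slide; names are illustrative, not Intel's.
SETS, WAYS, UOPS_PER_LINE = 32, 8, 6
MAX_UOPS_PER_WINDOW = 18  # i.e. a 32B window may span at most 3 lines


def uop_cache_capacity():
    """Total number of µOps the cache can hold."""
    return SETS * WAYS * UOPS_PER_LINE


def lines_needed(uops_in_window):
    """Lines needed to cache one 32B window, or None if it is not cacheable."""
    if uops_in_window > MAX_UOPS_PER_WINDOW:
        return None  # too many µOps: the window is not inserted
    return -(-uops_in_window // UOPS_PER_LINE)  # ceiling division
```

With these figures the cache holds 1536 µOps in total, and the 18 µOp limit corresponds to three full lines for one window.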

Renaming and Oh Oh Oh!
• Renaming – map logical registers to physical registers (PRF) and allocate resources.
• ROB is a placeholder.
• Break the fused µOps into simpler ops.
www.realworldtech.com/haswell-cpu/

Scheduler
• 8 issue ports
• 1 writeback (WB) per port
• INT, FP, SIMD networks + MEM
• Higher penalty for forwarding data between networks.
• Register-register moves are folded by just changing the PRF map.
• Extra pipeline stage for dereferencing links
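Folding a register-register move into the rename map can be pictured with a toy rename table. This is a simplified sketch, not the actual Haswell renamer: the class and method names are hypothetical, and it leaves out the reference counting that real hardware needs before a shared physical register can be freed.

```python
class RenameTable:
    """Toy logical-to-physical register map (illustrative only)."""

    def __init__(self, logical_regs, num_physical):
        # Each logical register starts out with its own physical register.
        self.map = {reg: i for i, reg in enumerate(logical_regs)}
        self.free = list(range(len(logical_regs), num_physical))

    def rename_dest(self, reg):
        """A µOp writes `reg`: allocate a fresh physical register for it."""
        phys = self.free.pop(0)
        self.map[reg] = phys
        return phys

    def fold_move(self, dst, src):
        """MOV dst, src: no execution needed, dst just aliases src's register."""
        self.map[dst] = self.map[src]
        return self.map[dst]
```

A folded move consumes no free physical register and no issue port, which is the saving the slide refers to.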

Execution Units
[Figure: 60-entry unified scheduler feeding eight issue ports (Port 0 to Port 7). Unit labels in the figure: Int, Vector, FMA, Branch, Div, Mem, Store.]

Did we forget something? Branch Predictor!!
• More entries in BTB (fewer bits per entry!)
• Entries with fewer offset bits
• Use the space saved for global branch prediction
• 2-level global predictor? 1-bit entries?
• 14-17 cycle misprediction penalty.
• 56-entry µOp buffer for identifying small loops

Big Picture: 14 stage pipeline www.realworldtech.com/haswell-cpu/

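Memory Hierarchy – For Data
[Figure: data-side memory hierarchy. Recoverable labels:]
• Unified scheduler feeding Port 2 and Port 3 (64-bit AGU each), Port 4 (store data) and Port 7 (store AGU)
• Load buffer and store buffer
• L1 TLB (4-way): 4K pages – 64 entries, 2M/4M pages – 32 entries, 1G pages – 4 entries
• L2 TLB: 1024 entries, shared, 8-way
• 32 KB L1 D cache (8-way); data paths labelled 2x32B and 32B
• 256 KB L2 cache (8-way); data path labelled 64B
• L3/LLC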
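The cache and TLB sizes in this figure translate directly into set counts and address reach. The sketch below assumes 64-byte cache lines (a well-known property of this processor family, but an assumption here rather than something read off the slide) and uses illustrative function names.

```python
def cache_sets(size_bytes, ways, line_bytes=64):
    """Number of sets in a set-associative cache."""
    return size_bytes // (ways * line_bytes)


def set_index(addr, size_bytes, ways, line_bytes=64):
    """Set that a byte address maps to (simple modulo indexing)."""
    return (addr // line_bytes) % cache_sets(size_bytes, ways, line_bytes)


def tlb_reach(entries, page_bytes):
    """Bytes of address space covered when every TLB entry is in use."""
    return entries * page_bytes
```

Under these assumptions the 32 KB 8-way L1 D cache has 64 sets, the 256 KB 8-way L2 has 512 sets, and the 64 L1 TLB entries for 4K pages cover 256 KB.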

L3 (Also Last Level Cache)
• Banked structure, one bank per core
• Shared and fully inclusive
• Separate tag arrays: one for data requests, one for prefetches and coherency requests
• Point of coherence
• Separate frequency domain from the CPU: helps to run CPU, GPU and LLC at different speeds as necessary
[Figure: System Agent, Core0 to Core3 each with an L3 bank, and GPU]

The Ring
• Ring stops
• Core/L3 bank (Cachebox) can send/receive two packets on the ring each cycle: one in the up direction, one in the down direction
• GPU and System Agent can send only one per cycle
• Ring actually consists of 4 rings
[Figure: System Agent, Core0 to Core3 each with an L3 bank, and GPU]

Memory Controller
• 2 clock domains: DCLK (DDR command clock) and QCLK (DDR data clock)
• The requested 32B are returned first
• Maintains open-page information and the requests pending to each page
• Page hits are given priority -> increases bandwidth
• Reads are given priority
• Write Data Buffer holds pending writes
• Write merging can happen in the Write Data Buffer
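The two priority rules on this slide (page hits first, reads first) can be sketched as a sort key over a request queue. This is a hypothetical model, not Intel's scheduler: the `Request` fields are invented for illustration, and ranking page hits above reads is an assumption, since the slide does not say which rule wins when they disagree.

```python
from dataclasses import dataclass


@dataclass
class Request:
    bank: int
    row: int        # DRAM page (row) the request targets
    is_read: bool
    arrival: int    # arrival order, used as the tie-breaker


def pick_next(queue, open_rows):
    """Choose the next request; open_rows maps bank -> currently open row."""
    def priority(req):
        page_hit = open_rows.get(req.bank) == req.row
        # Tuples sort ascending and False < True, so negate the "good" flags.
        return (not page_hit, not req.is_read, req.arrival)

    return min(queue, key=priority)
```

A request to an already open page is picked ahead of older requests that would need a new page to be opened, which is how page-hit priority raises bandwidth.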

System Agent
• Contains:
• Memory Controller
• PCI Express Controller
• DMI Controller
• Display Engine
• Power Control Unit (PCU)
• I/O
[Figure: System Agent (Display Engine, PCIE, DMI, Memory Controller, PCU) next to Core0 to Core3 with their L3 banks and the GPU]

Multithreading
• Use atomic operations to control access to items used by multiple threads
• Obtain and release locks for critical sections
• Intel currently supports making the following operations atomic by prepending a “LOCK” prefix when the destination is a memory operand:
• ADD, ADC, AND, BTC, BTR, BTS, CMPXCHG, CMPXCHG8B, DEC, INC, NEG, NOT, OR, SBB, SUB, XOR, XADD, and XCHG
• Aligned MOV loads and stores are also atomic without the prefix

Transactional Memory
• Main idea: try to run critical sections without locks and monitor for conflicts
• Use read and write sets to log memory accesses in transactional sections
• If conflicts occur, abort and revert register state to the beginning of the transaction
• If successful, commit the changes to memory so they are visible to other threads
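The read-set/write-set idea can be modelled in a few lines. The sketch below is a software toy, not how the hardware tracks accesses: the names are invented, memory is a plain dictionary, and only remote writes are checked for conflicts (a fuller model would also treat remote reads of the write set as conflicts).

```python
class Transaction:
    """Toy transaction that buffers writes and logs what it touched."""

    def __init__(self, memory):
        self.memory = memory
        self.read_set = set()
        self.write_buffer = {}  # address -> speculative value

    def read(self, addr):
        self.read_set.add(addr)
        return self.write_buffer.get(addr, self.memory.get(addr, 0))

    def write(self, addr, value):
        self.write_buffer[addr] = value  # not yet visible to other threads

    def conflicts_with(self, remote_writes):
        touched = self.read_set | set(self.write_buffer)
        return bool(touched & set(remote_writes))

    def commit(self):
        self.memory.update(self.write_buffer)


def run_transaction(memory, body, remote_writes=()):
    """Run body(tx); commit and return True, or abort and return False."""
    tx = Transaction(memory)
    body(tx)
    if tx.conflicts_with(remote_writes):
        return False  # abort: buffered writes are simply discarded
    tx.commit()
    return True
```

On an abort nothing has reached `memory`, so the caller can retry or fall back to a lock.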

Restricted Transactional Memory
• Haswell is the first Intel mainstream processor to include Transactional Memory
• Added Transactional Synchronization Extensions (TSX)
• New instructions for Restricted Transactional Memory
• XBEGIN – indicates start of transaction
• XEND – indicates end of transaction
• XABORT – used for testing; aborts transaction
• XTEST – indicates whether executing in a transactional region
• XBEGIN must be given a pointer to code that runs upon an abort
• Requires code to be rewritten using transactional sections

Integrated Graphics
• Supports 3 simultaneous displays, HDMI
• Scalable architecture: different versions of the processor (GT1, GT2, GT3) offer different numbers of Execution Units (EUs), among other upgrades
• Multiple video encoding and decoding formats supported in hardware
• Supported encodings include MPEG4, MPEG2, SVC
• Supports OpenCL 1.1, OpenGL 4.0
Figure taken from “Technology Insight: Intel Next Generation Microarchitecture Code Name Haswell” presentation. Intel Developer Forum, San Francisco, 2012

Power Management
• Three voltage domains
• Allows the screen to be updated while the processor is turned off
• Voltage regulators are on chip
• Power gating
• New power-saving states
• S0ix idle states
• Recommends power levels and response times for vendors
• Uses 20x less power than the previous S0 state
Figure taken from “Intel Next Generation Microarchitecture Codename Haswell: New Processor Innovations” presentation. Intel Developer Forum, San Francisco, 2012

Recap:
• 14-stage pipeline
• 4 cores, SMT machine
• In-order issue, out-of-order execution, in-order commit.
• Wider data paths and an extra store AGU to provide more bandwidth in AVX2 computations
• LLC/Ring is the point of coherence and of distributed arbitration of requests.
• Intel TSX
• Added support for Restricted Transactional Memory
• Integrated graphics and improved power management
• Power efficiency is a huge emphasis

Resources
General Information
• Technology Insight: Intel Next Generation Microarchitecture Code Name Haswell. Presented at IDF 2012 by Tom Piazza, Hong Jiang, Per Hammarlund, Ronak Singhal
• Intel Next Generation Microarchitecture Codename Haswell: New Processor Innovations. Presented at IDF 2012 by Robert Chappell, Bret Toll, Ronak Singhal
• Kanter, David. Intel’s Haswell CPU Microarchitecture. November 13, 2012. www.realworldtech.com/haswell-cpu/
• Kanter, David. Analysis of Haswell’s Transactional Memory. February 15, 2012. www.realworldtech.com/haswell-tm/
• Shimpi, Anand Lal. Intel’s Haswell Architecture Analyzed: Building a New PC and a New Intel. October 5, 2012. www.anandtech.com/show/6355/intels-haswell-architecture
• Introducing Sandy Bridge. Presented at IDF 2010 by Bob Valentine.
• Sandy Bridge Spans Generations. Microprocessor Report. September 2010

Resources
Processor Core
• Fog, Agner. The microarchitecture of Intel, AMD and VIA CPUs: An optimization guide for assembly programmers and compiler makers. Copenhagen University College of Engineering
• Intel 64 and IA-32 Architectures Optimization Reference Manual. Order Number: 248966-026. April 2012
Transactional Memory
• Intel Transactional Synchronization Extensions. Presented at IDF 2012 by Ravi Rajwar, Martin Dixon
• Intel Architecture Instruction Set Extensions Programming Reference Manual. Order Number: 319433-012A. February 2012
• Gelas, J and Hamm, C. Making Sense of the Intel Haswell Transactional Synchronization eXtensions. September 15, 2012. www.anandtech.com/show/6290/making-sense-of-intel-haswell-transactional-synchronization-extensions

Extra Slides

Current Locking Strategies
acquire_lock(mutex)
release_lock(mutex)
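The acquire/release pattern on this slide looks like the following in a threaded program. This is a minimal sketch using Python's standard `threading` module; `LockedCounter` and `run_workers` are illustrative names, and the shared counter stands in for whatever data the critical section protects.

```python
import threading


class LockedCounter:
    """Shared counter whose update is a lock-protected critical section."""

    def __init__(self):
        self.value = 0
        self.mutex = threading.Lock()

    def increment(self):
        self.mutex.acquire()      # acquire_lock(mutex)
        try:
            self.value += 1       # critical section
        finally:
            self.mutex.release()  # release_lock(mutex)


def run_workers(num_threads, iterations):
    """Have several threads increment one counter; return the final value."""
    counter = LockedCounter()

    def work():
        for _ in range(iterations):
            counter.increment()

    threads = [threading.Thread(target=work) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.value
```

Every thread serializes on the same mutex even when no two updates would actually interfere, which is the cost that transactional execution tries to avoid.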

Scalability Issues
Figure taken from “Making Sense of the Intel Haswell Transactional Synchronization eXtensions”.
As core count increases, efficiency is drastically reduced!

Lock Elision
• Idea introduced by Ravi Rajwar and James R. Goodman in 2001
• Remove locks, run code as a transaction
• If there are conflicts, abort and rerun the code with locks intact
• On success, commit the transaction’s writes to memory
• To other threads the lock still appears available
• Reduces execution time if conflicts do not occur
• Guarantees correctness by using the transactional memory
• New instructions implement lock elision
• XACQUIRE: denotes start of lock elision section
• XRELEASE: denotes end of lock elision section
• These are added as prefixes to existing instructions
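The control flow of lock elision (try without the lock, fall back to the lock on a conflict) can be sketched in software. This is only a model of the idea, not of the XACQUIRE/XRELEASE hardware: `run_elided` is a hypothetical helper, the `conflict` callback stands in for the hardware's conflict detection, and a dictionary snapshot stands in for the checkpoint that an abort restores.

```python
import threading


def run_elided(lock, memory, body, conflict=lambda: False):
    """Run body(memory) speculatively; on conflict, redo it under the lock.

    Returns "elided" if the speculative attempt committed, else "locked".
    """
    snapshot = dict(memory)   # stands in for the hardware checkpoint
    body(memory)              # speculative attempt; the lock is never taken
    if not conflict():
        return "elided"       # commit: the writes simply stay in memory
    memory.clear()
    memory.update(snapshot)   # abort: roll back the speculative writes
    with lock:                # rerun with the lock intact
        body(memory)
    return "locked"
```

Because the fallback path is the original locked code, the caller sees the same result either way, and the change can live inside a lock library rather than in user code.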

Lock Elision
acquire_lock(mutex)
release_lock(mutex)
Changes can be made in library functions. The user does not have to adopt a new programming paradigm.

Performance Benefits
Intel says using TSX helps!
Figure taken from “Intel Transactional Synchronization Extensions” presentation. Intel Developer Forum, San Francisco, 2012
Software Transactional Memory has been researched, but the overhead in software negated the performance benefits.
