
Intel Core 2 Duo

Intel Core 2 Duo. CS 6354, by WeiKeng Qin, Jian Xiang, and Ren Xu, December 8, 2009. Covers the motivation (multi-core processors on the desktop and a new microarchitecture to replace NetBurst), the Core 2 Duo as a dual-core CPU, its ISA with SIMD extensions, the Intel Core microarchitecture, and the memory hierarchy.


Presentation Transcript


  1. Intel Core 2 Duo CS 6354 by WeiKeng Qin, Jian Xiang, & Ren Xu December 8, 2009

  2. Introduction • Motivation • Multi-core processors on our desks • A new microarchitecture to replace NetBurst • Intel Core 2 Duo • A dual-core CPU • ISA with SIMD extensions • Intel Core microarchitecture • Memory hierarchy system

  3. Instruction Set Architecture • Base: x86-64 • No VLIW (unlike Itanium) • SIMD extensions: MMX, SSE, SSE2, SSE3, SSSE3, SSE4.1 • Timeline, oldest to newest: • Pentium MMX, MMX, 1996: 8 new registers, packed data types, integer operations • Pentium III, SSE, 1999: 8 new registers, floating-point operations • Pentium 4, SSE2, 2001: double precision, 128-bit register support • Prescott, SSE3, 2004: DSP-oriented math, process management • Core 2, SSSE3, July 2006: e.g. permuting bytes in a word • Wolfdale, SSE4.1, Sep 2006

  4. Streaming SIMD Extensions (SSE) 4.1 • Introduced with the 45 nm processors • 47 instructions that improve the performance of media data manipulation • e.g. fast and efficient bit-width conversions • Example: convert single-byte values to word (16-bit) values

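  5. SSE2 Code • MOVQ XMM0, M64 (load 8 bytes into the low half of XMM0) • PXOR XMM1, XMM1 (zero XMM1) • PUNPCKLBW XMM0, XMM1 (interleave the low bytes of XMM0 with the zero bytes of XMM1)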

  6. SSE4.1 Code • PMOVZXBW XMM0, M64 • DEST[15:0] <-- ZeroExtend(SRC[7:0]); • DEST[31:16] <-- ZeroExtend(SRC[15:8]); • DEST[47:32] <-- ZeroExtend(SRC[23:16]); • DEST[63:48] <-- ZeroExtend(SRC[31:24]); • DEST[79:64] <-- ZeroExtend(SRC[39:32]); • DEST[95:80] <-- ZeroExtend(SRC[47:40]); • DEST[111:96] <-- ZeroExtend(SRC[55:48]); • DEST[127:112] <-- ZeroExtend(SRC[63:56]); • Benefits • Fewer instructions (3 → 1) • Better performance (~40% speedup per loop) • Reduced register pressure (2 → 1 registers)
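
The equivalence of the two sequences can be checked with a small Python model. This is only a sketch, not real SIMD code: registers are modeled as lists of bytes (index 0 is the least significant byte), and the helper names `punpcklbw`, `pmovzxbw`, `widen_sse2`, and `widen_sse41` are hypothetical, chosen to mirror the instructions on the slides.

```python
def punpcklbw(dest, src):
    """Interleave the low 8 bytes of two 16-byte registers: d0, s0, d1, s1, ..."""
    out = []
    for i in range(8):
        out.append(dest[i])
        out.append(src[i])
    return out


def bytes_to_words(reg):
    """View a 16-byte register as eight little-endian 16-bit words."""
    return [reg[2 * i] | (reg[2 * i + 1] << 8) for i in range(8)]


def widen_sse2(src8):
    """Model of the three-instruction SSE2 sequence."""
    xmm0 = list(src8) + [0] * 8      # 64-bit load into the low half of XMM0
    xmm1 = [0] * 16                  # PXOR XMM1, XMM1
    return bytes_to_words(punpcklbw(xmm0, xmm1))


def pmovzxbw(src8):
    """Zero-extend each of 8 source bytes to a 16-bit word."""
    return [b & 0xFF for b in src8]


def widen_sse41(src8):
    """Model of the single-instruction SSE4.1 version."""
    return pmovzxbw(src8)
```

For any 8 input bytes, `widen_sse2` and `widen_sse41` return the same eight words; the SSE4.1 version simply needs no zeroed scratch register.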

  7. Microarchitecture • The Cores • Single die (107 mm²) • Two identical cores (L1 cache: 64 KB × 2) • Shared 6 MB L2 cache • No Hyper-Threading, no L3 cache • Front-side bus retained • Larger L2 cache

  8. Microarchitecture • 14-stage pipeline • 4-wide decode • 4-wide retire • Macro-fusion • Enhanced ALUs • Deeper buffers

  9. Another View

  10. Decode Hardware • 128-bit fetch bandwidth • 18-entry instruction queue (IQ) • Complex decoder: produces 1-4 micro-ops • Microcode sequencer

  11. Macro-fusion • New micro-op: represents an instruction pair as a single micro-op • Enhanced ALUs: execute the new compare-and-jump (CMPJCC) micro-op in one clock
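
The pairing step can be illustrated with a short Python sketch. It assumes instructions are plain tuples such as `("cmp", "eax", "ebx")` and fuses a CMP or TEST with an immediately following conditional jump into one `cmpjcc` entry. The function name `macro_fuse` and the tuple encoding are hypothetical, and the real decoder applies further restrictions that are not modeled here.

```python
FUSIBLE_FIRST = {"cmp", "test"}


def is_conditional_jump(opcode):
    """Conditional jumps start with 'j' but exclude the unconditional 'jmp'."""
    return opcode.startswith("j") and opcode != "jmp"


def macro_fuse(instrs):
    """Replace each (cmp/test, jcc) pair with a single fused entry."""
    fused = []
    i = 0
    while i < len(instrs):
        cur = instrs[i]
        nxt = instrs[i + 1] if i + 1 < len(instrs) else None
        if cur[0] in FUSIBLE_FIRST and nxt is not None and is_conditional_jump(nxt[0]):
            # One micro-op now stands for the whole compare-and-branch pair.
            fused.append(("cmpjcc", cur, nxt))
            i += 2
        else:
            fused.append(cur)
            i += 1
    return fused
```

With five input instructions that contain one `cmp`/`jne` pair, the output has four entries, so the pair occupies a single slot in the later pipeline stages.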

  12. Out-of-Order Execution • 96-entry reorder buffer (ROB) • 32-entry reservation station

  13. Execution Units • 6 dispatch ports (1 load, 2 store, 3 universal) • 3 integer ALUs, 2 floating-point ALUs

  14. Branch Predictor • Loop detector: tracks the number of loop iterations for future reference • For every branch, the branch prediction unit (BPU) selects among: • bimodal predictor • global predictor • loop detector
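
A rough Python sketch of two of these components shows why a loop detector helps. It assumes a bimodal predictor built from 2-bit saturating counters and a loop detector that trusts a trip count once it has been seen twice in a row. The class names, table size, and the selection rule in `predict` are illustrative assumptions (the global predictor is omitted), not Intel's actual design.

```python
class BimodalPredictor:
    """Table of 2-bit saturating counters indexed by branch address."""

    def __init__(self, size=1024):
        self.size = size
        self.counters = [1] * size   # start at "weakly not taken"

    def predict(self, ip):
        return self.counters[ip % self.size] >= 2

    def update(self, ip, taken):
        i = ip % self.size
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


class LoopDetector:
    """Learns how many times a loop branch is taken before it exits."""

    def __init__(self):
        self.trip = {}        # ip -> taken count observed in the previous run
        self.count = {}       # ip -> taken count so far in the current run
        self.confident = {}   # ip -> True once the trip count has repeated

    def predict(self, ip):
        if not self.confident.get(ip, False):
            return None       # no opinion yet
        return self.count.get(ip, 0) < self.trip[ip]

    def update(self, ip, taken):
        if taken:
            self.count[ip] = self.count.get(ip, 0) + 1
        else:
            seen = self.count.get(ip, 0)
            self.confident[ip] = self.trip.get(ip) == seen
            self.trip[ip] = seen
            self.count[ip] = 0


def predict(ip, loop, bimodal):
    """Prefer the loop detector when it is confident, else fall back to bimodal."""
    guess = loop.predict(ip)
    return bimodal.predict(ip) if guess is None else guess
```

For a branch that is taken four times and then falls through, the bimodal counter keeps mispredicting the loop exit, while the loop detector predicts it correctly once the trip count has repeated.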

  15. Cache Organization • Private L1 DCache and ICache: 32 KB each per core, 8-way, 64 B line size, write-back (directory-based coherence) • Shared L2 cache: 8-way, 64 B line size (E8xxx) • Pros: potentially less bus traffic • Cons: longer access latency than a private L2 cache; potential conflicts between threads • FSB: 1333 MHz (E8xxx) • Memory disambiguation • Aggressive memory dependence speculation based on a hash table indexed by the load's EIP address • Watchdog mechanism
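
The L1 parameters above fix the cache geometry. As a worked example, this Python sketch derives the number of sets and the address split from the size, associativity, and line size; the helper names `cache_geometry` and `split_address` are hypothetical, and power-of-two sizes are assumed.

```python
def cache_geometry(size_bytes, ways, line_bytes):
    """Return (sets, offset_bits, index_bits) for a set-associative cache."""
    sets = size_bytes // (ways * line_bytes)
    offset_bits = line_bytes.bit_length() - 1   # log2 of a power of two
    index_bits = sets.bit_length() - 1
    return sets, offset_bits, index_bits


def split_address(addr, offset_bits, index_bits):
    """Split an address into (tag, set index, byte offset within the line)."""
    offset = addr & ((1 << offset_bits) - 1)
    index = (addr >> offset_bits) & ((1 << index_bits) - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, index, offset
```

For a 32 KB, 8-way cache with 64 B lines, `cache_geometry(32 * 1024, 8, 64)` gives 64 sets, 6 offset bits, and 6 index bits, so the low 12 bits of an address select the set and the byte within the line.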

  16. Prediction Implementation • History table indexed by instruction pointer • Each entry in the history array is a saturating counter • Once the counter saturates, disambiguation is enabled for this load (taking effect from the next iteration): the load is allowed to proceed even when older store addresses are unknown • When a particular load fails disambiguation, its counter is reset • Each time a particular load is correctly disambiguated, its counter is incremented
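
The counter rules on this slide can be sketched in a few lines of Python. The table size, the counter maximum, and the modulo hash are assumptions made for illustration, since the slides do not give them, and the class name `DisambiguationPredictor` is hypothetical.

```python
class DisambiguationPredictor:
    """Per-load saturating counters deciding whether a load may pass unknown stores."""

    def __init__(self, entries=256, max_count=15):
        self.entries = entries        # assumed table size
        self.max_count = max_count    # assumed saturation value
        self.counters = [0] * entries

    def _index(self, load_ip):
        # Assumed hash: low bits of the load's instruction pointer.
        return load_ip % self.entries

    def may_bypass(self, load_ip):
        """A load may go ahead of unknown store addresses only once saturated."""
        return self.counters[self._index(load_ip)] == self.max_count

    def update(self, load_ip, conflicted):
        """Reset on a failed disambiguation, otherwise count up to saturation."""
        i = self._index(load_ip)
        if conflicted:
            self.counters[i] = 0
        else:
            self.counters[i] = min(self.max_count, self.counters[i] + 1)
```

A load therefore has to behave well `max_count` times in a row before it is allowed to speculate, and a single conflict sends it back to the conservative path.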

  17. Predictor Lookup • When a load is sent from the RS, set its disambiguation bit • If it meets an older store with an unknown address, set "update" • If the prediction is "go", dispatch the load and set "done" • Otherwise the load is blocked • A store scans all previous loads in the load buffer; if a match is found, the "reset" bit is set • When the load commits, update the history • (Diagram stages: load dispatch, prediction, verification)

  18. Execute Disable Bit Support • Counterparts: AMD Enhanced Virus Protection; ARM eXecute Never • Helps prevent buffer overflow attacks • No software patches needed for buffer overflow attacks • Segregates memory into regions that store either code or data • With OS support, the processor disables code execution when malicious worms try to insert code into data buffers
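
The effect of the bit can be modeled as a per-page check at instruction fetch. This Python sketch assumes 4 KB pages and a dictionary page table; `Page`, `ExecuteDisableFault`, and `fetch_instruction` are hypothetical names used only for illustration, not an OS or hardware interface.

```python
from dataclasses import dataclass


class ExecuteDisableFault(Exception):
    """Raised when code is fetched from a page marked no-execute."""


@dataclass
class Page:
    writable: bool
    no_execute: bool


def fetch_instruction(page_table, addr, page_size=4096):
    """Allow the fetch only if the page holding addr is executable."""
    page = page_table[addr // page_size]
    if page.no_execute:
        raise ExecuteDisableFault(f"fetch from no-execute page at {addr:#x}")
    return addr
```

If the OS marks data pages (stack, heap, buffers) as no-execute, bytes injected through a buffer overflow cannot be run, because the fetch faults before any of them execute.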

  19. Instruction Pointer Based Prefetcher • L1 DCache: 2 IP prefetchers per core • L1 ICache: 1 traditional prefetcher • L2 cache: 2 IP prefetchers • Predicts which memory addresses will be used and delivers the data in time • Records every load's history using its instruction pointer • IP history array • Prefetch traffic-control parameters are fine-tuned for different platforms • Prefetch monitor
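
One common way to use a per-IP history array is stride detection, sketched below in Python. This is an assumption about the mechanism rather than something the slides spell out: each entry remembers the last address and stride seen for a load, and a prefetch is issued once the same non-zero stride repeats. The class name, table size, and `degree` parameter are hypothetical.

```python
class IPStridePrefetcher:
    """History array indexed by load IP; prefetches when a stride repeats."""

    def __init__(self, entries=256, degree=1):
        self.entries = entries   # assumed history array size
        self.degree = degree     # how many lines ahead to prefetch
        self.table = {}          # index -> (last address, last stride)

    def access(self, load_ip, addr):
        """Record one load and return the list of addresses to prefetch."""
        idx = load_ip % self.entries
        prefetches = []
        if idx in self.table:
            last_addr, last_stride = self.table[idx]
            stride = addr - last_addr
            if stride != 0 and stride == last_stride:
                prefetches = [addr + stride * k for k in range(1, self.degree + 1)]
            self.table[idx] = (addr, stride)
        else:
            self.table[idx] = (addr, 0)
        return prefetches
```

A load that walks an array in 64-byte steps triggers no prefetch on its first two accesses; from the third access on, each call returns the next address along the stride.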

  20. References • Intel's Next Generation Microarchitecture Unveiled, by David Kanter, Real World Technologies • Intel Core Microarchitecture Briefing, by Stephen Smith and Bob Valentine, Intel • Inside Intel Core Microarchitecture: Setting New Standards for Energy-Efficient Performance, Ofri Wechsler, Technology@Intel Magazine • Intel Core: A Next-Generation Microarchitecture, by Alan Zeichick, DevX • too many…

  21. Questions?
