
Intel Multimedia Extensions and Hyper-Threading

Intel Multimedia Extensions and Hyper-Threading. Michele Co CS451. Outline. Evolution of Intel multimedia extensions x87 (386) MMX (Pentium MMX, Pentium II) SSE (Pentium III) SSE2 (Pentium 4 – Willamette) SSE3 (Pentium 4 – Prescott) Hyper-Threading. X87 FPU.


Presentation Transcript


  1. Intel Multimedia Extensions and Hyper-Threading. Michele Co, CS451

  2. Outline
     • Evolution of Intel multimedia extensions
         • x87 (386)
         • MMX (Pentium MMX, Pentium II)
         • SSE (Pentium III)
         • SSE2 (Pentium 4 – Willamette)
         • SSE3 (Pentium 4 – Prescott)
     • Hyper-Threading

  3. x87 FPU • Eight 80-bit data registers (double extended-precision floating point) • Data registers treated as a stack • Control register – FP precision, rounding, … • Status register – FPU busy, top of stack (TOS), condition codes (CC), error, exception, … • Tag register – 2 bits per data register: valid, zero, special, empty • Last instruction pointer register • Last data (operand) pointer register • Opcode register
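
The stack behavior of the data registers can be sketched in a few lines of Python. This is a toy model, not a description of the hardware: the class name `X87Stack` and its methods are hypothetical, values are held as 64-bit Python floats rather than 80-bit extended precision, the tag is reduced to empty/valid, and stack faults are modeled as Python exceptions.

```python
class X87Stack:
    """Toy model of the x87 register stack (hypothetical helper, not a real API)."""

    def __init__(self):
        self.regs = [0.0] * 8       # physical data registers R0..R7
        self.empty = [True] * 8     # simplified tag: empty vs. valid only
        self.top = 0                # top-of-stack (TOS) field of the status register

    def _phys(self, i):
        # ST(i) names the physical register (TOS + i) mod 8
        return (self.top + i) % 8

    def fld(self, value):
        # A load decrements TOS, then writes the new ST(0)
        self.top = (self.top - 1) % 8
        if not self.empty[self.top]:
            raise OverflowError("stack overflow: target register is not empty")
        self.regs[self.top] = float(value)
        self.empty[self.top] = False

    def st(self, i=0):
        # Read ST(i); reading an empty register is a stack underflow
        p = self._phys(i)
        if self.empty[p]:
            raise LookupError("stack underflow: ST(%d) is empty" % i)
        return self.regs[p]

    def faddp(self):
        # ST(1) <- ST(1) + ST(0), then pop: mark old ST(0) empty and increment TOS
        result = self.st(1) + self.st(0)
        self.regs[self._phys(1)] = result
        self.empty[self.top] = True
        self.top = (self.top + 1) % 8


fpu = X87Stack()
fpu.fld(1.5)
fpu.fld(2.0)
fpu.faddp()
result = fpu.st(0)
```

The point of the sketch is that operands are named relative to TOS, so the same `ST(0)` refers to different physical registers as values are pushed and popped.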

  4. x87 FPU State

  5. x87 Data Types

  6. x87 Instructions • Data transfer (load, store, move) • Basic arithmetic • Comparison • Transcendental (trigonometric, log, exp) • Load constant • x87 FPU control

  7. MMX • SIMD execution • Eight 64-bit data registers (MM0–MM7) • Aliased to x87 FPU registers • Randomly accessible (addressed directly, not as a stack)

  8. SIMD Execution
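
As an illustration of SIMD execution on a 64-bit MMX-style register, the sketch below treats a Python integer as eight packed unsigned bytes and applies one add to all lanes at once. The function names mirror the MMX mnemonics PADDB (wraparound) and PADDUSB (unsigned saturation), but they are hypothetical Python helpers that emulate the lane semantics; they are not bindings to the instructions.

```python
def unpack_bytes(q):
    """Split a 64-bit value into eight unsigned 8-bit lanes (lane 0 = least significant)."""
    return [(q >> (8 * i)) & 0xFF for i in range(8)]


def pack_bytes(lanes):
    """Pack eight 8-bit lanes back into one 64-bit value."""
    q = 0
    for i, b in enumerate(lanes):
        q |= (b & 0xFF) << (8 * i)
    return q


def paddb(a, b):
    # Wraparound add: each lane is reduced modulo 256, no carry crosses lanes
    return pack_bytes([(x + y) & 0xFF
                       for x, y in zip(unpack_bytes(a), unpack_bytes(b))])


def paddusb(a, b):
    # Unsigned saturating add: each lane is clamped at 255
    return pack_bytes([min(x + y, 0xFF)
                       for x, y in zip(unpack_bytes(a), unpack_bytes(b))])


a = 0x10F0   # lanes: 0xF0, 0x10, then six zero lanes
b = 0x0120   # lanes: 0x20, 0x01, then six zero lanes
wrapped = paddb(a, b)       # lane 0 wraps to 0x10
saturated = paddusb(a, b)   # lane 0 clamps to 0xFF
```

Note that an ordinary 64-bit add of the same inputs would let the carry out of lane 0 spill into lane 1; the packed forms keep every lane independent.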

  9. MMX State

  10. MMX Registers

  11. MMX Data Types

  12. MMX Instructions • Data transfer • Arithmetic • Comparison • Conversion • Unpacking • Logical • Shift • Empty MMX state

  13. SSE • Pentium III • Eight 128-bit data registers (XMM0–XMM7) • Independent of x87 FPU and MMX registers • SSE instructions can be executed in parallel with MMX/x87 • MXCSR register – control and status for XMM operations (similar to the x87 control and status registers) • EFLAGS register – results of scalar compare ops (COMISS, UCOMISS) • 128-bit packed single-precision floating-point data type • Prefetching, cacheability, store ordering control instructions

  14. SSE State

  15. XMM Registers

  16. SSE Data Type

  17. SSE Instructions
      • Packed and scalar single-precision floating point
      • Logical
      • Conversion
      • 64-bit SIMD integer
      • MXCSR management
      • State management
      • Cacheability control, prefetch, memory ordering
          • SFENCE (store fence)
      • FXSAVE, FXRSTOR
          • Extension of the x87 fast save and restore of the x87 and MMX registers to also save and restore the XMM and MXCSR registers

  18. Packed Single-Precision FP Operation

  19. Scalar Single-Precision FP Operation
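
The difference between the packed and scalar forms can be sketched by modeling an XMM register as a list of four single-precision values, with index 0 as the lowest element. The names `addps` and `addss` follow the SSE mnemonics ADDPS and ADDSS, but they are hypothetical Python helpers; `f32` is an assumed helper that rounds a Python float to single precision.

```python
import struct


def f32(x):
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def addps(dst, src):
    # Packed: all four lanes are added independently
    return [f32(d + s) for d, s in zip(dst, src)]


def addss(dst, src):
    # Scalar: only the lowest lane is added; the upper three lanes
    # of the destination pass through unchanged
    return [f32(dst[0] + src[0])] + list(dst[1:])


x = [1.0, 2.0, 3.0, 4.0]
y = [0.5, 0.5, 0.5, 0.5]
packed = addps(x, y)   # every lane changes
scalar = addss(x, y)   # only lane 0 changes
```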

  20. Shuffle
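
A shuffle selects elements by index under control of an 8-bit immediate. The sketch below emulates the lane selection of the SSE SHUFPS instruction on four-element lists (index 0 is the lowest element); `shufps` here is a hypothetical Python helper, not an intrinsic.

```python
def shufps(dst, src, imm8):
    """Emulate SHUFPS lane selection on two 4-element lists."""
    # Four 2-bit selectors, lowest bits first
    sel = [(imm8 >> (2 * i)) & 0b11 for i in range(4)]
    # The two low result lanes come from dst, the two high lanes from src
    return [dst[sel[0]], dst[sel[1]], src[sel[2]], src[sel[3]]]


a = [10, 11, 12, 13]
b = [20, 21, 22, 23]
mixed = shufps(a, b, 0x1B)       # selectors 3, 2, 1, 0
reversed_a = shufps(a, a, 0x1B)  # same register twice: reverses the lanes
broadcast = shufps(a, a, 0x00)   # all selectors 0: replicate the lowest lane
```

Passing the same register as both operands is the usual way to get an arbitrary permutation or a broadcast of a single register.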

  21. Unpack and Interleave
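
Unpack-and-interleave takes half of each operand and alternates their elements. The sketch below emulates the lane movement of UNPCKLPS (low halves) and UNPCKHPS (high halves) on four-element lists; the function names are hypothetical Python helpers named after the mnemonics.

```python
def unpcklps(dst, src):
    # Interleave the two low lanes of each operand
    return [dst[0], src[0], dst[1], src[1]]


def unpckhps(dst, src):
    # Interleave the two high lanes of each operand
    return [dst[2], src[2], dst[3], src[3]]


a = ["a0", "a1", "a2", "a3"]
b = ["b0", "b1", "b2", "b3"]
low = unpcklps(a, b)
high = unpckhps(a, b)
```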

  22. SSE2 • Pentium 4 • More data types • More instructions to support new data types

  23. SSE2 State

  24. SSE2 Data Types

  25. SSE2 Instructions • Support for additional types • CLFLUSH (cache line flush) • LFENCE (load fence) • MFENCE (load + store fence)

  26. Packed Double-Precision FP Operations

  27. Scalar Double-Precision FP Operations

  28. SSE3
      • Pentium 4 (Prescott)
      • Support for Hyper-Threading
      • 13 new instructions
          • 10 SIMD support instructions
          • 1 x87 accelerating instruction (FISTTP, floating-point to integer conversion)
          • 2 thread synchronization instructions
              • MONITOR (monitor write-back stores)
              • MWAIT (wait for write-back store)
      • No new state

  29. Asymmetric Processing

  30. Horizontal Data Movement
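
Two of the SSE3 SIMD additions illustrate these ideas: ADDSUBPS applies different operations to alternating lanes (asymmetric processing), and HADDPS adds adjacent elements within a register (horizontal data movement). The sketch below emulates their lane semantics on four-element lists, with index 0 as the lowest element. The functions are hypothetical Python helpers, and rounding to single precision is omitted for brevity.

```python
def addsubps(dst, src):
    # Even lanes subtract, odd lanes add
    return [dst[0] - src[0], dst[1] + src[1],
            dst[2] - src[2], dst[3] + src[3]]


def haddps(dst, src):
    # Adjacent pairs are summed: dst pairs fill the low half,
    # src pairs fill the high half of the result
    return [dst[0] + dst[1], dst[2] + dst[3],
            src[0] + src[1], src[2] + src[3]]


x = [1.0, 2.0, 3.0, 4.0]
y = [10.0, 20.0, 30.0, 40.0]
asym = addsubps(x, y)
horiz = haddps(x, y)
```

The alternating subtract/add pattern is what a complex multiply needs for its real and imaginary parts, and the horizontal add is useful for reductions such as dot products.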

  31. Hyper-Threading

  32. Terminology
      • Process
          • Program associated with a context (state: registers, program counter, flags, etc.)
          • Consists of one or more threads
      • Thread
          • "Lightweight process" (less state)

  33. Hyper-Threading
      • Single physical processor appears as 2 logical processors
      • Thread-level parallelism (TLP)
          • Many applications have software threads that can be executed simultaneously
              • Online transaction processing
              • Web services
      • Latency can leave execution units idle
          • Cache misses
          • Branch mispredictions
          • Waiting for loads/stores

  34. Techniques for Minimizing the Effect of Long Latency
      • Chip multiprocessing (CMP)
          • 2 processors on a single die
          • Larger than a single-core chip, more expensive to manufacture
      • Time-slice or switch-on-event multithreading
          • Switch threads after a fixed time period or on long-latency events like cache misses
          • Doesn't take advantage of other sources of inefficient resource usage (branch mispredictions, instruction dependencies, etc.)
      • Simultaneous multithreading (SMT)
          • Multiple threads execute on a single processor without switching
          • Hyper-Threading is Intel's implementation
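
The trade-off between switch-on-event multithreading and SMT can be sketched with a deliberately crude cycle-count model. Everything here is an assumption made for illustration: each thread is a list of per-instruction latencies, a thread issues at most one instruction per cycle and must wait out that instruction's latency before issuing the next, the core has `width` issue slots, and a stall counts as "long" when at least `miss_threshold` cycles remain. The function `run` and the policy names are hypothetical and do not describe any real scheduler.

```python
def run(threads, policy, width=2, miss_threshold=10):
    """Return the cycle count needed to finish all threads under a toy model."""
    n = len(threads)
    pc = [0] * n          # index of the next instruction per thread
    ready_at = [0] * n    # cycle at which each thread may issue again
    active = 0            # currently selected thread (switch-on-event only)
    cycle = 0
    while any(pc[t] < len(threads[t]) for t in range(n)):
        runnable = [t for t in range(n)
                    if pc[t] < len(threads[t]) and ready_at[t] <= cycle]
        if policy == "smt":
            # Any ready threads may issue in the same cycle, up to the width
            issue = runnable[:width]
        elif policy == "switch_on_event":
            done = pc[active] >= len(threads[active])
            long_stall = ready_at[active] - cycle >= miss_threshold
            if done or long_stall:
                others = [t for t in runnable if t != active]
                if others:
                    active = others[0]
            # Only the selected thread may issue; short stalls are not hidden
            issue = [active] if active in runnable else []
        else:
            raise ValueError("unknown policy: %r" % policy)
        for t in issue:
            ready_at[t] = cycle + threads[t][pc[t]]
            pc[t] += 1
        cycle += 1
    return max([cycle] + ready_at)


# Two identical threads: short dependency stalls (3) and one long miss (20)
A = [1, 3, 20, 1, 3]
B = [1, 3, 20, 1, 3]
serial = run([A], "smt") + run([B], "smt")   # one thread after the other
soe = run([A, B], "switch_on_event")
smt = run([A, B], "smt")
```

Under these assumptions the switch-on-event run overlaps only the long miss, while the SMT run also fills the short stalls, so its cycle count falls below the switch-on-event count, which in turn falls below the serial count.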

  35. Intel Hyper-Threading Demo

  36. Resource Requirements for HT
      • Need to maintain 2 contexts
      • Replicated
          • Register renaming logic (register alias table, RAT)
          • Instruction pointer
          • ITLB
          • Return stack predictor
          • Various other architectural registers (GP, control, APIC, machine state)
      • Partitioned
          • Re-order buffers (ROBs)
          • Load/store buffers
          • Various queues, like the scheduling queues, uop queue, etc.
      • Shared
          • Caches: trace cache, L1, L2, L3, microcode ROM
          • Microarchitectural registers
          • Execution units
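
The idea behind a partitioned resource can be sketched as a buffer with a per-logical-processor entry limit, so that one stalled thread cannot occupy every entry. The class `PartitionedQueue` is hypothetical, and the even split of entries between logical processors is an assumption of this sketch rather than something the slide states.

```python
class PartitionedQueue:
    """Toy buffer whose entries are split evenly between logical processors."""

    def __init__(self, entries, logical_processors=2):
        # Assumption: each logical processor may hold at most an equal share
        self.limit = entries // logical_processors
        self.used = [0] * logical_processors

    def allocate(self, lp):
        # Refuse the request once this logical processor has used its share
        if self.used[lp] >= self.limit:
            return False
        self.used[lp] += 1
        return True

    def release(self, lp):
        if self.used[lp] == 0:
            raise ValueError("logical processor %d holds no entries" % lp)
        self.used[lp] -= 1


q = PartitionedQueue(entries=8)
granted = [q.allocate(0) for _ in range(5)]   # fifth request exceeds the share
other_ok = q.allocate(1)                      # the other thread is unaffected
```

A fully shared structure would trade this guarantee for better utilization when only one thread is busy.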

  37. Hyper-Threading Goals • Minimize die area cost for implementing • Ensure forward progress by at least one logical processor • Maintain single-threaded performance

  38. Frontend Changes
      • 2 PCs
      • Arbitration for shared resource access
          • Trace cache, microcode ROM, caches
          • One logical processor at a time per structure
      • Thread tags per trace cache entry
      • Microcode ROM – 2 microcode instruction pointers
      • Wider pipeline latches to hold state for 2 contexts
      • Branch prediction
          • Return address stack (RAS) and branch history buffer duplicated
          • Global history shared, but tagged with logical processor ID

  39. Trace Cache Hit

  40. Trace Cache Miss

  41. Hyper-threaded Execution

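  42. Execution Modes
      • Single-task (ST), Multi-task (MT)
          • ST0, ST1: only logical processor 0 or only logical processor 1 is active
      • HALT: transitions from MT to ST0 or ST1, depending on which logical processor executes it
      • An interrupt sent to the halted logical processor transitions back to MT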
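
The mode transitions can be sketched as a small state machine. The class `HTCore` is hypothetical. The sketch assumes that ST0 means only logical processor 0 is active (so a HALT on logical processor 0 leads to ST1), and it adds a "HALTED" label for the case where both logical processors have halted, which the slide does not cover.

```python
class HTCore:
    """Toy model of MT/ST0/ST1 execution modes for two logical processors."""

    def __init__(self):
        self.halted = [False, False]

    @property
    def mode(self):
        active = [lp for lp in (0, 1) if not self.halted[lp]]
        if len(active) == 2:
            return "MT"            # both logical processors running
        if active == [0]:
            return "ST0"           # only logical processor 0 active
        if active == [1]:
            return "ST1"           # only logical processor 1 active
        return "HALTED"            # assumption: both halted, not on the slide

    def halt(self, lp):
        # HALT executed by logical processor lp
        self.halted[lp] = True

    def interrupt(self, lp):
        # An interrupt wakes the halted logical processor
        self.halted[lp] = False


core = HTCore()
trace = [core.mode]
core.halt(0)
trace.append(core.mode)
core.interrupt(0)
trace.append(core.mode)
core.halt(1)
trace.append(core.mode)
```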

  43. HT Performance - OLTP

  44. HT Performance – Web Server
