
Presentation Transcript


  1. Itanium Processor Microarchitecture by Harsh Sharangpani and Ken Arora Presented by Teresa Watkins 4/16/02

  2. General Information • First implementation of the IA-64 instruction set architecture • Targets memory latency, memory address disambiguation, and control flow dependencies • 0.18-micron process, 800 MHz • EPIC design style shifts more responsibility to the compiler • Challenge: try to identify which improvements discussed in this class found their way into the Itanium.

  3. EPIC Conceptual View • Idea: the compiler has a larger instruction window than the hardware; communicate to the hardware more of the information gleaned at compile time.
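
As an aside on how that compile-time information actually reaches the hardware: IA-64 packs three 41-bit instructions plus a 5-bit template into each 128-bit bundle, and the template plus compiler-inserted stop bits mark which operations are independent. A minimal C sketch of that layout (field positions per the IA-64 encoding; the type and helper names are ours, for illustration only):

```c
#include <stdint.h>

typedef struct {
    uint64_t lo, hi;   /* 128 raw bundle bits */
} ia64_bundle;

static inline unsigned bundle_template(const ia64_bundle *b)
{
    return (unsigned)(b->lo & 0x1F);          /* template: bits 0-4 */
}

static inline uint64_t bundle_slot(const ia64_bundle *b, int slot)
{
    int shift = 5 + slot * 41;                /* slots start at 5, 46, 87 */
    uint64_t raw = (shift < 64)
        ? (b->lo >> shift) | (b->hi << (64 - shift))
        : (b->hi >> (shift - 64));
    return raw & ((1ULL << 41) - 1);          /* keep the 41 slot bits */
}
```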

  4. Hardware Pipeline • Six instructions wide and ten stages deep • Tries to minimize the latency of the most frequent operations • Hardware support for compile-time indeterminacies
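
For reference, the ten stages using the mnemonics from the paper: a three-stage front end decoupled from a seven-stage back end by the instruction buffer. The enum below is purely illustrative.

```c
enum itanium_stage {
    /* front end */
    IPG,   /* instruction pointer generation, prefetch */
    FET,   /* fetch and branch prediction */
    ROT,   /* rotate bundles toward the issue slots */
    /* back end */
    EXP,   /* expand: disperse up to 6 instructions onto 9 ports */
    REN,   /* rename: register stacking/rotation remap */
    WLD,   /* word-line decode for the register files */
    REG,   /* register read; deferred hazard evaluation */
    EXE,   /* execute; scoreboard stall-on-use applies here */
    DET,   /* exception detection, branch correction */
    WRB    /* write-back */
};
```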

  5. Front End • Software-initiated prefetch (requests filtered by the instruction cache) • Prefetch must be issued 12 cycles before the branch to hide latency • L2 -> streaming buffer -> instruction cache • Four-level branch-predictor hierarchy to prevent a 9-cycle pipeline stall • Decoupling buffer holds up to 8 bundles of code
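
A back-of-envelope check on that 12-cycle lead, assuming the core's peak issue width of six instructions per cycle (constant names are ours):

```c
/* A BRP hint must precede its branch by roughly 12 * 6 = 72
 * instruction slots to fully hide the fetch latency at peak issue
 * rate; less dense code needs proportionally less distance. */
enum {
    PREFETCH_LEAD_CYCLES = 12,
    ISSUE_WIDTH          = 6,
    MIN_HINT_DISTANCE    = PREFETCH_LEAD_CYCLES * ISSUE_WIDTH  /* 72 */
};
```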

  6. Branch Predictors • Compiler provides branch hint directives • explicit branch predict (BRP) instructions • hint specifiers on branch instructions • which provide • branch target addresses • static hints on branch direction • indicators for when to use dynamic predictors • Four types of predictors • Resteer 1: single-cycle predictor (4 BRP-programmed TARs) • Resteer 2: adaptive multi-way and return predictors (dynamic) • Resteer 3 & 4: branch address calculation and correction • Resteer 3 includes a “perfect-loop-exit predictor”
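
There is no portable C for BRP instructions, but GCC/Clang's __builtin_expect conveys the same kind of static direction hint from programmer to compiler; a small analogy sketch:

```c
#include <stddef.h>

/* GCC/Clang extension: tell the compiler the expected truth value. */
#define LIKELY(x)   __builtin_expect(!!(x), 1)
#define UNLIKELY(x) __builtin_expect(!!(x), 0)

long sum_positive(const long *v, size_t n)
{
    long s = 0;
    for (size_t i = 0; i < n; i++)
        if (LIKELY(v[i] > 0))   /* static hint: usually taken */
            s += v[i];
    return s;
}
```

On a conventional target the hint shapes block layout; on IA-64 the compiler can encode it directly in the branch's hint bits.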

  7. Instruction Delivery • Plentiful resources • four integer units • four multimedia units • two load/store units • three branch units • two extended-precision FP units • two single-precision FP units • SIMD allows up to 20 parallel operations per clock • Dispersal follows high-level semantics provided by the IA-64 ISA • Checks for: • independence (determined by stop bits) • oversubscription (determined by the 5-bit bundle template) • Template allows for simplified dispersal routing • Organized around 9 issue ports • two memory • two integer • two FP • three branch
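
Template-driven dispersal in miniature: because the template names the functional-unit type of each slot, port routing reduces to a table lookup instead of per-instruction decode. Only a few representative IA-64 templates are filled in below, and the type names are ours:

```c
typedef enum { U_M, U_I, U_F, U_B, U_L } unit_t;

typedef struct { unit_t slot[3]; } template_info;

/* Partial table; the real ISA defines the full 5-bit template
 * space, including stop-bit variants. */
static const template_info templates[] = {
    [0x00] = {{U_M, U_I, U_I}},   /* MII */
    [0x08] = {{U_M, U_M, U_I}},   /* MMI */
    [0x0C] = {{U_M, U_F, U_I}},   /* MFI */
    [0x10] = {{U_M, U_I, U_B}},   /* MIB */
};
```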

  8. Registers Two types of register renaming (virtual register addressing): • Register stacking • reduces function call and return overhead by stacking a new register frame on top of the old frame, avoiding an explicit save of the caller’s registers (not supported in the FP registers) • Register rotation • supports software pipelining by accessing the registers through an indirection based on the iteration count • If software allocates more virtual registers than are physically available (overflow), the Register Stack Engine takes control of the pipeline to spill register values to memory, and fills them back on underflow. No pipeline flushes required :)
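
A minimal model of register rotation as described above, assuming the IA-64 convention that r32-r127 form the rotating region and that a rotating register base (rrb) supplies the iteration-count indirection. This is a functional sketch, not the rename hardware:

```c
enum { ROT_BASE = 32, ROT_SIZE = 96 };   /* r32..r127 rotate */

/* The rrb advances each software-pipelined iteration, so the same
 * logical register name lands on a fresh physical register without
 * any copy instructions. */
static int rename_rotating(int logical_reg, int rrb)
{
    if (logical_reg < ROT_BASE)
        return logical_reg;               /* static registers */
    int off = (logical_reg - ROT_BASE + rrb) % ROT_SIZE;
    return ROT_BASE + off;                /* rotated physical name */
}
```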

  9. Register Files • Integer register file • 128 entries • 8 read ports • 6 write ports • post-increment performed by an idle ALU and spare write ports • FP register file • 128 entries • 8 read ports • 4 write ports, separated into odd and even banks • supports double extended-precision arithmetic • Predicate register file: 1-bit entries with 15 read and 11 write ports
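
The post-increment note is what makes a very common C idiom a single operation on this machine; a tiny example of the pattern it targets:

```c
#include <stdint.h>

/* The pointer bump in "*p++" is exactly the case post-increment
 * addressing covers: load and address update fuse into one
 * operation, the add absorbed by an otherwise idle ALU and spare
 * write port. */
int64_t sum8(const int64_t *p, int n)
{
    int64_t s = 0;
    while (n-- > 0)
        s += *p++;    /* one post-increment load per element */
    return s;
}
```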

  10. Execution Core • Non-blocking caches with a scoreboard-based stall-on-use control strategy: the pipeline stalls only when data is actually needed, not on other hazards • Deferred-stall strategy (hazard evaluation in the REG stage) allows more time for dependencies to resolve; the stall occurs in the EXE stage, where input latches snoop returning data values for the correct data using the existing register bypass hardware • Predication: turns a control dependency into a data dependency by executing both sides of a branch and squashing the incorrect instructions before they change machine state (speculative vs. architectural predicate register file; see the sketch below) • Executes up to three parallel branches per cycle, using priority encoding to determine the earliest taken branch
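
A worked example of the predication idea in C: both sides execute and the predicate selects the surviving value, so the control dependency becomes a data dependency. The second function only models what an IA-64 compiler would emit with a compare-to-predicate plus predicated instructions.

```c
/* Source form: a data-dependent branch. */
int abs_branchy(int x)
{
    if (x < 0)
        x = -x;
    return x;
}

/* If-converted form: compute a predicate, execute both sides
 * unconditionally, and let the predicate squash the wrong result.
 * No branch remains to mispredict. */
int abs_predicated(int x)
{
    int p   = (x < 0);      /* predicate compute */
    int neg = -x;           /* "taken" side always executes */
    return p ? neg : x;     /* select; control dependency is gone */
}
```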

  11. Data Speculation • Exception tokens • In the FP registers, deferred exceptions are noted by storing a NaTVal encoding in the NaN space; the integer registers instead carry an extra bit (NaT) as the exception token. Because this 65th bit does not fit in a 64-bit memory word, it is saved to the special UNaT register when a register is spilled and restored during fills. • ALAT structure • If a store writes to a memory location between the time a speculative (advanced) load reads that location and the time the value is consumed, the ALAT invalidates the speculative load value and recovery is initiated. ALAT checks can be issued in parallel with the consuming instruction.
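
A toy model of the ALAT behavior just described, with invented sizes and a simplistic indexing scheme purely for illustration:

```c
#include <stdbool.h>
#include <stdint.h>

#define ALAT_ENTRIES 32

typedef struct {
    bool      valid;
    uintptr_t addr;   /* address read by the advanced load */
    int       reg;    /* destination register tag */
} alat_entry;

static alat_entry alat[ALAT_ENTRIES];

/* Advanced load: record the address under the register tag. */
static void alat_advanced_load(int reg, uintptr_t addr)
{
    alat_entry *e = &alat[reg % ALAT_ENTRIES];
    e->valid = true; e->addr = addr; e->reg = reg;
}

/* Every store snoops the ALAT; a matching address kills the entry. */
static void alat_snoop_store(uintptr_t addr)
{
    for (int i = 0; i < ALAT_ENTRIES; i++)
        if (alat[i].valid && alat[i].addr == addr)
            alat[i].valid = false;    /* conflict: invalidate */
}

/* The check (chk.a/ld.c analogue): true means the speculative value
 * is still good; false means recovery re-executes the load. */
static bool alat_check(int reg)
{
    alat_entry *e = &alat[reg % ALAT_ENTRIES];
    return e->valid && e->reg == reg;
}
```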

  12. On-Chip Memory • First-level caches • separate data and instruction caches, 16 Kbytes each • 32-byte line size (6 instructions/cycle from the I-cache) • four-way set-associative • dual-ported • 2-cycle latency, fully pipelined • write-through • physically addressed and tagged • single-cycle, 64-entry, fully associative iTLB (backed by an on-chip hardware page walker) • iTLB and cache tags have an additional port to check addresses on a miss • Second-level cache • combined data and instructions • 96 Kbytes • six-way set-associative • 64-byte line size • two banks • four-state MESI for multiprocessor coherence • 4 double-precision operands per clock to the FP register file
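
The L1 figures pin down the address breakdown: 16 Kbytes / (4 ways × 32-byte lines) = 128 sets, i.e. a 5-bit line offset and a 7-bit set index. A small sketch (names are ours):

```c
#include <stdint.h>

enum {
    L1_SIZE     = 16 * 1024,
    L1_WAYS     = 4,
    L1_LINE     = 32,
    L1_SETS     = L1_SIZE / (L1_WAYS * L1_LINE),  /* 128 sets */
    OFFSET_BITS = 5                               /* log2(32) */
};

/* Extract the set index from a physical address. */
static inline uint64_t l1_set_index(uint64_t paddr)
{
    return (paddr >> OFFSET_BITS) & (L1_SETS - 1);
}
```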

  13. On-Chip Memory cont. • Third-level cache • 4 Mbytes • 64-byte line size • four-way set-associative • 128-bit bus running at core speed (12.8 Gbytes/s bandwidth) • MESI protocol • Optimal cache management • memory locality hints guide allocation and replacement strategies • bias hints optimize MESI latency
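
The bandwidth figure follows directly from the bus width and clock: 16 bytes per transfer at the 800 MHz core clock is 12.8 Gbytes/s. A one-line check:

```c
#include <stdio.h>

#define L3_BUS_BYTES 16U             /* 128-bit bus = 16 bytes */
#define CORE_CLK_HZ  800000000ULL    /* 800 MHz core clock */

int main(void)
{
    unsigned long long bw = L3_BUS_BYTES * CORE_CLK_HZ;
    printf("L3 bandwidth: %.1f GB/s\n", bw / 1e9);   /* 12.8 */
    return 0;
}
```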

  14. System Bus • 64-bit system bus with source-synchronous data transfer (2.1 Gbytes/s) • Multi-drop shared system bus uses the MESI coherence protocol • Four-way glueless multiprocessor support (4 processor nodes) • Multiple nodes connected through high-speed interconnects • Transaction-based bus protocol allows 56 pending transactions • “Defer mechanism” enables out-of-order data transfers and transaction completion
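
Where 2.1 Gbytes/s comes from, assuming the original Itanium bus's 266 MT/s data rate (source-synchronous transfer clocks data twice per 133 MHz bus clock): 8 bytes × ~266.7 million transfers/s ≈ 2.1 Gbytes/s.

```c
#include <stdio.h>

#define BUS_BYTES       8U            /* 64-bit data bus */
#define TRANSFERS_PER_S 266666667ULL  /* 2 transfers per 133 MHz clock */

int main(void)
{
    double bw = (double)BUS_BYTES * (double)TRANSFERS_PER_S;
    printf("system bus: %.2f GB/s\n", bw / 1e9);   /* ~2.13 */
    return 0;
}
```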

  15. Comparison to Previous Work • Non-blocking caches, as seen in “Lockup-Free Instruction Fetch/Prefetch Cache Organization” • Prefetch • decoupled prefetch based on branch hints, as seen in “A Scalable Front-End Architecture for Fast Instruction Delivery” • software-initiated prefetch, as seen in “Design and Evaluation of a Compiler Algorithm for Prefetching” • Memory locality hints for more efficient use of caches • Speculation • extra bit for deferred exception tokens • What else? Do you think they made a simple, scalable hardware implementation?
