
Design Tradeoffs in Instruction Window of Superscalar Processors






Presentation Transcript


    Slide 1: Design Tradeoffs in Instruction Window of Superscalar Processors

    Presented by: Chunming Gao. MS Project Proposal. Committee members: Dr. Soner Onder (Chair), Dr. Steven Carr, Dr. David Poplawski, Dr. Jianping Dong.

    Slide 2: Outline of the Presentation

    Part One: Introduction. Part Two: Background. Part Three: Instruction window organizations. Part Four: Work plan and preliminary results.

    Slide 3: Part One: Introduction

    Slide 4: Motivation

    1. Exploiting more instruction-level parallelism calls for a larger instruction window. 2. Pursuing a high clock speed limits the size of the instruction window.

    Slide 5: What We Will Study

    1. Central window design 2. Distributed window design 3. Dependence-based window design 4. Cluster-based window design 5. PEWs (parallel execution windows) 6. Direct wake-up based window design

    Slide 6: How We Define Performance

    1. IPC (instructions per cycle). 2. Clock cycle time. 3. The ratio of each design's IPC to that of a baseline processor.
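The three metrics can be combined: IPC divided by the clock cycle time gives instructions completed per unit of time, so a window design that raises IPC but lengthens the cycle may gain much less than its IPC ratio suggests. The sketch below uses made-up instruction counts, cycle counts, and cycle times purely for illustration; they are not results from this project, and the helper names are hypothetical.

```python
def ipc(instructions: int, cycles: int) -> float:
    """Instructions per cycle (IPC)."""
    return instructions / cycles


def relative_ipc(design_ipc: float, baseline_ipc: float) -> float:
    """Ratio of a design's IPC to the baseline processor's IPC."""
    return design_ipc / baseline_ipc


def instructions_per_ns(design_ipc: float, cycle_time_ns: float) -> float:
    """Throughput once the clock cycle time is also taken into account."""
    return design_ipc / cycle_time_ns


# Hypothetical numbers, chosen only to illustrate the tradeoff.
base = ipc(1_000_000, 800_000)   # baseline window, assumed 1.0 ns cycle
wide = ipc(1_000_000, 625_000)   # larger window, assumed 1.25 ns cycle

ipc_ratio = relative_ipc(wide, base)
throughput_ratio = instructions_per_ns(wide, 1.25) / instructions_per_ns(base, 1.0)
print(f"IPC ratio: {ipc_ratio:.2f}, throughput ratio: {throughput_ratio:.3f}")
```

With these assumed numbers, a 28% higher IPC shrinks to roughly a 2.4% throughput gain once the cycle time grows by 25%, which is the tension stated in the motivation slide.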

    Slide 7: Part Two: Background

    Slide 8: Superscalar Processor Stages

    [Figure: superscalar pipeline. Fetch -> instruction buffer -> Decode -> dispatch buffer -> Dispatch -> issuing buffer -> Execute -> completion buffer -> Complete -> store buffer -> Retire.]

    Slide 9: Bottlenecks of Superscalar Processors

    1. Structural hazards: multiple instructions require the same resource at the same time. 2. Control hazards: instructions following a branch cannot be executed until the branch is resolved. 3. Data hazards: an instruction depends on the result of a previous instruction.

    Slide 10: Data Dependencies

    1. True data dependencies: RAW (read after write)
      i: add r3 r2 r1;
      j: add r6 r3 r4;   (j reads r3, which i writes)
    2. False data dependencies:
      WAR (write after read)
      k: add r6 r3 r4;
      l: add r3 r7 r1;   (l writes r3, which k reads)
      WAW (write after write)
      m: add r3 r2 r1;
      n: add r3 r7 r1;   (m and n both write r3)
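The three dependence classes can be checked mechanically by comparing register names. This is a minimal sketch, assuming the slide's three-operand format with the destination first (`add r3 r2 r1` computes r2 + r1 into r3) and ignoring memory dependences; the helpers `parse` and `dependences` are hypothetical.

```python
from typing import List, NamedTuple, Tuple


class Inst(NamedTuple):
    op: str
    dest: str
    srcs: Tuple[str, ...]


def parse(text: str) -> Inst:
    """Parse 'add r3 r2 r1;' with the destination register first."""
    op, dest, *srcs = text.replace(";", "").split()
    return Inst(op, dest, tuple(srcs))


def dependences(first: Inst, second: Inst) -> List[str]:
    """Classify how the later instruction `second` depends on the earlier `first`."""
    kinds = []
    if first.dest in second.srcs:
        kinds.append("RAW")  # true dependence: second reads what first writes
    if second.dest in first.srcs:
        kinds.append("WAR")  # anti-dependence: second overwrites what first reads
    if second.dest == first.dest:
        kinds.append("WAW")  # output dependence: both write the same register
    return kinds


pairs = [("add r3 r2 r1", "add r6 r3 r4"),   # i, j
         ("add r6 r3 r4", "add r3 r7 r1"),   # k, l
         ("add r3 r2 r1", "add r3 r7 r1")]   # m, n
for a, b in pairs:
    print(a, "->", b, dependences(parse(a), parse(b)))
```

Only RAW reflects a real flow of data; WAR and WAW arise from reusing the same register name, which is why renaming can remove them.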

    Slide 11: Tomasulo's Algorithm

    A hardware algorithm for dynamically scheduling multiple instructions in a pipelined processor, so that they can execute out of order. It provides a general mechanism for register forwarding and data hazard detection, and it correctly handles RAW, WAW, and WAR data dependencies. It relies on two techniques: register renaming and shelving.

    Slide 12: Register Renaming

    Example:
      add r3 r2 r1;    # r2 + r1 -> r3
      div r6 r3 r4;    # r3 / r4 -> r6   (RAW on r3)
      sub r3 r7 r1;    # r7 - r1 -> r3   (WAR with div, WAW with add)
    Register renaming: r3 -> rr1, r6 -> rr2, r3 -> rr3
    New instruction sequence:
      add rr1 r2 r1;   # r2 + r1 -> rr1
      div rr2 rr1 r4;  # rr1 / r4 -> rr2  (RAW on rr1 remains)
      sub rr3 r7 r1;   # r7 - r1 -> rr3   (WAR and WAW removed)
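The renaming step can be sketched in a few lines. It assumes an unlimited supply of rename registers named rr1, rr2, ... (a real processor has a finite pool and stalls when it runs out). Sources are looked up in the current map before the destination receives a fresh name, which preserves the RAW dependence while removing the WAR and WAW ones. The function name `rename` is hypothetical.

```python
from typing import Dict, List, Tuple

Instr = Tuple[str, str, str, str]   # (op, dest, src1, src2), destination first


def rename(program: List[Instr]) -> List[Instr]:
    """Give every destination a fresh rename register rr1, rr2, ... (unbounded pool assumed)."""
    mapping: Dict[str, str] = {}    # architectural register -> latest rename register
    next_id = 1
    renamed = []
    for op, dest, src1, src2 in program:
        # Read sources through the current mapping first, so RAW links are kept.
        s1 = mapping.get(src1, src1)
        s2 = mapping.get(src2, src2)
        # Then allocate a new name for the destination, removing WAR and WAW conflicts.
        new_dest = f"rr{next_id}"
        next_id += 1
        mapping[dest] = new_dest
        renamed.append((op, new_dest, s1, s2))
    return renamed


program = [("add", "r3", "r2", "r1"),
           ("div", "r6", "r3", "r4"),
           ("sub", "r3", "r7", "r1")]
for inst in rename(program):
    print(" ".join(inst))
```

Running it on the slide's three instructions reproduces the renamed sequence shown above.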

    Slide 13: Shelving

    Reservation station: a buffer that holds decoded instructions while they wait to be issued for execution. Independent instructions are detected here, and true (RAW) data dependencies are resolved here. Possible reservation station entry fields: Op, Qj/Vj, VBj, Qk/Vk, VBk, Dest, Busy bit.
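The slide lists the entry fields without defining them. Following the usual Tomasulo convention, this sketch assumes that Qj/Qk hold the tag of the producing instruction while the valid bit (VBj/VBk) is clear, and Vj/Vk hold the operand value once it is set. The class `RSEntry`, its `capture` method, and the operand values are hypothetical.

```python
from dataclasses import dataclass
from typing import Union

Operand = Union[int, str]   # a value (int) or, while not yet valid, the producer's tag (str)


@dataclass
class RSEntry:
    op: str
    j: Operand       # Qj/Vj: tag while vbj is False, value once vbj is True
    vbj: bool
    k: Operand       # Qk/Vk: tag while vbk is False, value once vbk is True
    vbk: bool
    dest: str        # tag under which this entry's result will be broadcast
    busy: bool = True

    def capture(self, tag: str, value: int) -> None:
        """Snoop a broadcast result and grab it if a source is waiting on `tag`."""
        if not self.vbj and self.j == tag:
            self.j, self.vbj = value, True
        if not self.vbk and self.k == tag:
            self.k, self.vbk = value, True

    def ready(self) -> bool:
        """The entry may issue once both operand valid bits are set."""
        return self.busy and self.vbj and self.vbk


# div rr2 rr1 r4: waits for rr1, while r4 is already available (value 8 is made up).
entry = RSEntry(op="div", j="rr1", vbj=False, k=8, vbk=True, dest="rr2")
before = entry.ready()
entry.capture("rr1", 40)    # the add broadcasts its result under tag rr1
after = entry.ready()
print(before, after, entry.j)
```

The tag match in `capture` is how the RAW dependence on rr1 is resolved inside the reservation station rather than by stalling decode.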

    Slide 14: What the Instruction Window Is About

    Instruction decode feeds the instruction window. The window holds decoded instructions, fetches operands, wakes up instructions, and selects and issues instructions to the functional units (FUs).

    Slide 15: Instruction Window Design Space (1)

    1. The reservation stations may be organized in different ways: individual RSs (one per EU), group RSs (one shared by a group of EUs), or central RSs (shared by all EUs).

    Slide 16: Instruction Window Design Space (2)

    2. The operand-fetching scheme may vary. [Figure: two arrangements of reservation station, register file, and EU.] Scheme 1: direct check of the scoreboard bits. Scheme 2: check of the explicit status bits.

    Slide 17: Part Three: Instruction Window Organizations

    Slide 18: Central Window Design Structure

    1. One centralized reservation station holds all kinds of instructions after they are decoded. 2. It serves all the functional units. [Figure: decoded instructions enter the single reservation station, which sends ready instructions to the EUs.]

    Slide 19: Central Window Design Components

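    [Figure: decoded instructions (Rs1, Rs2, Rd) access the register file, whose entries hold an identifier (entry No.), DestReg, Value, and valid/latest bits. Reservation station entries hold OC (operation code), Os1/Is1 with Vs1, Os2/Is2 with Vs2, and Rd; ready entries (OC, Os1, Os2, Rd) are sent to the EUs. Each result returns with its Rd/identifier: the register file updates Rd and sets its V-bit, and the reservation station associatively updates Is1 and Is2 with their V-bits.]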
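The associative update in the central window can be sketched as a broadcast wakeup followed by selection. This is a simplified model under assumed parameters: a window of 4 entries, an issue width of 2, oldest-first selection, and a one-cycle latency so results are broadcast in the following cycle. The classes `Entry` and `CentralWindow` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Entry:
    name: str
    dest: str
    waiting: Set[str] = field(default_factory=set)   # source tags whose valid bit is still clear


class CentralWindow:
    """A single shared window that serves every EU (simplified model)."""

    def __init__(self, size: int, issue_width: int) -> None:
        self.size = size
        self.issue_width = issue_width
        self.entries: List[Entry] = []   # kept in program order

    def insert(self, entry: Entry) -> bool:
        """Place a decoded instruction in the window; False means dispatch must stall."""
        if len(self.entries) >= self.size:
            return False
        self.entries.append(entry)
        return True

    def wakeup(self, result_tags: Set[str]) -> None:
        """Associative update: every entry compares its sources with every broadcast tag."""
        for e in self.entries:
            e.waiting -= result_tags

    def select(self) -> List[Entry]:
        """Pick up to issue_width ready entries, oldest first, and remove them."""
        chosen = [e for e in self.entries if not e.waiting][: self.issue_width]
        chosen_ids = {id(e) for e in chosen}
        self.entries = [e for e in self.entries if id(e) not in chosen_ids]
        return chosen


window = CentralWindow(size=4, issue_width=2)
for e in [Entry("a", "p1"), Entry("b", "p2", {"p1"}),
          Entry("c", "p3"), Entry("d", "p4", {"p2"})]:
    window.insert(e)

cycles = []
while window.entries:
    issued = window.select()
    cycles.append([e.name for e in issued])
    window.wakeup({e.dest for e in issued})   # assumed one-cycle latency
print(cycles)
```

In hardware, the `wakeup` loop corresponds to tag comparators on every source of every entry for every result bus, so its cost grows with window size and issue width; this is where the extra ports, long wires, and possibly longer clock cycle come from.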

    Slide 20: Central Window Design Merits and Drawbacks

    Advantages: 1. A large register file is used, so more registers can be renamed. 2. A large reservation station is used, so more independent instructions can be detected. 3. With associative search, more parallelism can be exploited. Disadvantages: 1. More ports are required. 2. Long wires are required. 3. The clock cycle may become longer.

    Slide 21: Distributed Window Design Structure

    1. Two or more reservation stations hold decoded instructions. 2. They serve different functional units. [Figure: decoded instructions are split between Reservation Station 1 and Reservation Station 2, each sending ready instructions to its own EUs.]

    Slide 22: Distributed Window Design Structure

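    [Figure: decoded instructions (Rs1, Rs2, Rd) access the same kind of register file as in the central design (identifier/entry No., DestReg, Value, valid/latest bits) and are placed into Reservation Station 1 or Reservation Station 2, whose entries hold OC, Rs1, Rs2, and Rd. Results return with Rd/identifier to the register file, which updates Rd and sets its V-bit.]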

    Slide 23: Distributed Window Design Merits and Drawbacks

    Advantages: 1. The reservation stations are less complicated. 2. A shorter clock cycle may be achieved. Disadvantages: 1. Instructions are steered to the reservation stations randomly or in round-robin mode. 2. The load on the different reservation stations may be unbalanced. 3. More ports are still needed to check the availability of the operands.

    Slide 24: Dependence-based Window Design Structure

    1. The reservation stations are distributed. 2. Decoded instructions are steered into different FIFO queues according to their dependencies. [Figure: after rename and steering, instructions enter the dependence-based FIFOs, which issue to the EUs; the EUs update the register file.]

    Slide 25: Dependence-based Window Design Steering Algorithm

    For a decoded instruction I:
    1. If all of I's operands are ready, I is steered to a new (empty) FIFO.
    2. If one operand is not ready and the instruction producing it has no instruction behind it in its FIFO, I is placed in that FIFO; otherwise, I is steered to a new FIFO.
    3. If more than one operand is not ready, apply rule 2 to the first operand; if that is not suitable, apply it to the second operand.
    4. If all the FIFOs are full or no empty FIFO is available, dispatch stalls.
    A FIFO is freed after its last instruction has been issued.
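The steering rules can be sketched as a small model. This is a minimal illustration under several assumptions: register names are already renamed (the structure figure places rename before steering), so every destination tag is unique; an operand counts as not ready while its producer is still waiting in a FIFO; when rules 2 and 3 both fail, the instruction falls back to a new FIFO; and a result is treated as available as soon as its producer issues. The names `DependenceSteering`, `steer`, and `issue` are hypothetical.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, Dict, List, Optional, Tuple


@dataclass(eq=False)
class Inst:
    name: str
    dest: Optional[str]      # renamed destination tag (assumed unique)
    srcs: Tuple[str, ...]    # renamed source tags


class DependenceSteering:
    """Steer decoded instructions into FIFOs following rules 1-4 (simplified model)."""

    def __init__(self, num_fifos: int, depth: int) -> None:
        self.fifos: List[Deque[Inst]] = [deque() for _ in range(num_fifos)]
        self.depth = depth
        self.producer: Dict[str, Inst] = {}   # tag of a not-ready operand -> its producer

    def _behind(self, prod: Inst) -> Optional[int]:
        """FIFO whose tail is `prod` (nothing behind it) and that still has room."""
        for i, q in enumerate(self.fifos):
            if q and q[-1] is prod and len(q) < self.depth:
                return i
        return None

    def _empty(self) -> Optional[int]:
        return next((i for i, q in enumerate(self.fifos) if not q), None)

    def steer(self, inst: Inst) -> Optional[int]:
        """Return the chosen FIFO index, or None when dispatch must stall."""
        not_ready = [s for s in inst.srcs if s in self.producer]
        target = None
        for src in not_ready[:2]:          # rules 2 and 3: first operand, then second
            target = self._behind(self.producer[src])
            if target is not None:
                break
        if target is None:                 # rule 1, or rules 2/3 failed: new FIFO
            target = self._empty()
        if target is None:                 # rule 4: no usable FIFO, stall
            return None
        self.fifos[target].append(inst)
        if inst.dest is not None:
            self.producer[inst.dest] = inst
        return target

    def issue(self) -> List[str]:
        """Only FIFO heads are examined; a head issues when its operands are ready."""
        heads = [q for q in self.fifos if q and all(s not in self.producer for s in q[0].srcs)]
        issued = [q.popleft() for q in heads]
        for inst in issued:
            self.producer.pop(inst.dest, None)   # result treated as ready at issue
        return [inst.name for inst in issued]


s = DependenceSteering(num_fifos=2, depth=4)
a = Inst("a", "p1", ("r1", "r2"))   # operands ready
b = Inst("b", "p2", ("p1", "r3"))   # waits on a
c = Inst("c", "p3", ("r4",))        # independent
d = Inst("d", "p4", ("p1",))        # waits on a, but b is already behind a
placed = [s.steer(i) for i in (a, b, c, d)]
first_issue = s.issue()
retry_d = s.steer(d)
print(placed, first_issue, retry_d)
```

Here `d` needs a new FIFO because `b` already sits behind `a`; with no empty FIFO it stalls (rule 4) until issuing `c` frees FIFO 1. The model also shows the drawback listed next: every independent instruction consumes a whole FIFO.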

    Slide 26: Dependence-based Window Design Merits and Drawbacks

    Advantages: 1. Issuing windows are distributed. 2. Only the heads of the FIFOs are checked, so broadcast for wakeup is avoided. Disadvantage: Each independent instruction requires an additional FIFO; if no FIFO is available, dispatch stalls, which degrades overall performance.

    Slide 27: Cluster-based Window Design Structure

    1. It is based on the dependence-based window design. 2. The FIFOs are grouped into clusters, each using its own copy of the register file. [Figure: after rename and steering, instructions enter the dependence-based FIFOs of Cluster 1 or Cluster 2; each cluster has its own register file (Register File 1, Register File 2) and EUs.]

    Slide 28: Cluster-based Window Design Merits and Drawbacks (1)

    Advantages: 1. Issuing windows are distributed. 2. Only the heads of the FIFOs are checked, so broadcast for wakeup is avoided. 3. The number of ports on each register file can be reduced, and the register file copies are updated in parallel. 4. Local bypasses are used much more frequently than inter-cluster bypasses.

    Slide 29: Cluster-based Window Design Merits and Drawbacks (2)

    Disadvantages: 1. Each independent instruction requires an additional FIFO; if no FIFO is available, dispatch stalls, which degrades overall performance. 2. Inter-cluster bypasses decrease overall performance.
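The cost of inter-cluster bypasses can be illustrated by computing the dataflow-limited completion time of a dependence chain. This sketch assumes unit execution latency, a one-cycle penalty whenever a value crosses between clusters, and unlimited EUs; these numbers and the function `completion_time` are hypothetical, not taken from the project's simulator.

```python
from typing import Dict, List


def completion_time(order: List[str], deps: Dict[str, List[str]],
                    cluster: Dict[str, int], latency: int = 1,
                    remote_delay: int = 1) -> int:
    """Dataflow-limited finish time of a code sequence split across clusters."""
    finish: Dict[str, int] = {}
    for inst in order:   # program order is a valid topological order of the dependences
        start = 0
        for prod in deps.get(inst, []):
            arrival = finish[prod]
            if cluster[prod] != cluster[inst]:
                arrival += remote_delay   # value must cross an inter-cluster bypass
            start = max(start, arrival)
        finish[inst] = start + latency
    return max(finish.values())


order = ["a", "b", "c"]
deps = {"b": ["a"], "c": ["b"]}   # a -> b -> c dependence chain
same = completion_time(order, deps, {"a": 0, "b": 0, "c": 0})
split = completion_time(order, deps, {"a": 0, "b": 0, "c": 1})
print(same, split)
```

Keeping the whole chain in one cluster finishes in 3 cycles, while moving only `c` to the other cluster puts the bypass delay on the critical path (4 cycles). This is why steering dependent instructions into the same cluster, so that local bypasses dominate, matters.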

    Slide 30: Parallel Execution Windows (PEWs) Structure

    The instruction window is split into separate parallel execution windows (pews), each with its own reservation station and register file. The pews communicate with each other to obtain the register data they need. [Figure: a distributor feeds four pews, pew0 through pew3.]

    Slide 31: PEWs Merits and Drawbacks

    Advantages: 1. Issuing windows are distributed. 2. Local operand fetching and updating are efficient. Disadvantage: Passing results to remote pews adds extra clock cycles of delay.

    Slide 32: Direct Wakeup Window Design Structure

    [Figure: rename/steering feeds a reorder buffer and a Wakeup_input_queue; each instruction I carries Wait_rslt, Wait_lop, and wait_rop fields. Instructions with Cnt<>0 go to the wait_queues and those with Cnt=0 go to the ready_queues, which feed the EUs. Wakeup signals (via wait_lop & wait_rop, and wait_rslt) move instructions from not ready to ready.]
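The core idea of direct wakeup, keeping a count (Cnt) of outstanding operands and waking only the instructions that depend on a completing result, can be sketched as follows. This is a simplified model: it replaces the wait_queues with one dependent list per producer tag, assumes register tags are already renamed, and does not model the reorder buffer, the separate wait_lop/wait_rop/wait_rslt handling, or the extra pipeline stage. The class `DirectWakeup` and its methods are hypothetical.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, Dict, List, Optional, Tuple


@dataclass
class Inst:
    name: str
    dest: Optional[str]
    srcs: Tuple[str, ...]
    cnt: int = 0   # number of source operands still outstanding (the slide's Cnt)


class DirectWakeup:
    """Direct wakeup via per-producer dependent lists (illustrative model)."""

    def __init__(self) -> None:
        self.dependents: Dict[str, List[Inst]] = {}   # pending dest tag -> waiting consumers
        self.ready: Deque[Inst] = deque()              # plays the role of the ready queue

    def dispatch(self, inst: Inst) -> None:
        outstanding = [s for s in inst.srcs if s in self.dependents]
        inst.cnt = len(outstanding)
        for s in outstanding:
            self.dependents[s].append(inst)   # register only with the producer
        if inst.dest is not None:
            self.dependents[inst.dest] = []
        if inst.cnt == 0:
            self.ready.append(inst)           # Cnt = 0: straight to the ready queue

    def complete(self, inst: Inst) -> List[str]:
        """Wake up only the consumers of `inst`; return the names that became ready."""
        woken = []
        for consumer in self.dependents.pop(inst.dest, []):
            consumer.cnt -= 1
            if consumer.cnt == 0:
                self.ready.append(consumer)
                woken.append(consumer.name)
        return woken


w = DirectWakeup()
a = Inst("a", "p1", ("r1",))
b = Inst("b", "p2", ("p1", "r2"))
c = Inst("c", "p3", ("p1", "p2"))
for i in (a, b, c):
    w.dispatch(i)
initially_ready = [i.name for i in w.ready]
woken_by_a = w.complete(a)
woken_by_b = w.complete(b)
print(initially_ready, woken_by_a, woken_by_b)
```

When `a` completes, only `b` and `c` are visited; `c` stays waiting because its count is still 1 and becomes ready only after `b` completes. No tag is compared against unrelated entries, unlike the broadcast wakeup of the central window.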

    Slide 33: Direct Wakeup Window Design Merits and Drawbacks

    Advantages: 1. Broadcast is avoided: only the dependent instructions are woken up. 2. Stalls occur only when the resources are fully occupied, so resource utilization is high. Disadvantage: An extra stage is introduced to accommodate the more complex wakeup process, which increases the misprediction rollback penalty.

    Slide 34: Part Four: Work Plan and Preliminary Simulations

    Slide 35: Implementation Plan

    1. Study the already implemented designs: central window design; dependence-based design; direct wakeup based design. 2. Finish and verify the following designs: distributed window design; cluster-based window design; PEWs-based window design.

    Slide 36: Test Plan

    1. Test using integer and floating-point benchmarks. 2. Test using different architecture setups: vary the issue width, the window size, the register file size, and the number of functional units. 3. Write the report.

    Slide 37: Preliminary Results (1)

    [Chart: preliminary results for the central window, distributed window, dependence-based, cluster-based, and direct wakeup designs on the SPEC95 integer benchmarks 126.gcc, 129.compress, 130.li, 099.go, and 134.perl.]

    Slide 38: Preliminary Results (2)

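    [Chart: preliminary results for the central window, distributed window, dependence-based, cluster-based, and direct wakeup designs on the SPEC95 floating-point benchmarks 101.tomcatv, 102.swim, 103.su2cor, 104.hydro2d, and 107.mgrid.]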
