
Tomasulo’s Algorithm


dawn-price

Presentation Transcript


  1. Tomasulo’s Algorithm
     • An instruction goes through only three stages:
       • Issue: get the next instruction from the FIFO instruction queue. If a reservation station is free, transfer the instruction there along with its operand values, or with the names (tags) of the reservation stations that will produce them. If no reservation station is free, stall on the structural hazard.
       • Execute: start execution when all operands are available. Loads need only the effective address; stores also need the data to be stored. No instruction can start executing before all prior branches have been evaluated.
       • Write result: write the result on the common data bus (CDB), and from there into the registers and any waiting reservation stations, or into memory.

  2. Tomasulo’s Algorithm
     • Each reservation station has seven fields:
       • Op: operation to perform
       • Qj, Qk: tags of the reservation stations that will produce the operands (0 indicates the operand is ready)
       • Vj, Vk: operand values
       • A: immediate field, and later the effective address, of a load/store instruction
       • Busy: this reservation station and its functional unit are occupied
     • Each register in the register file has one field:
       • Qi: tag of the reservation station computing the result
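The fields above and the Issue stage can be sketched in Python. This is a minimal illustration under stated assumptions, not the hardware design from the slides: the names `ReservationStation`, `read_operand` and `issue` are hypothetical, operand values are kept as symbolic strings such as `Regs[F4]`, and `None` plays the role of the tag value 0.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class ReservationStation:
    """One reservation station with the seven fields listed on the slide."""
    name: str                    # tag broadcast on the CDB, e.g. "Mult1"
    busy: bool = False
    op: Optional[str] = None
    vj: Optional[str] = None     # operand values (symbolic strings here)
    vk: Optional[str] = None
    qj: Optional[str] = None     # producing station tags; None stands for 0
    qk: Optional[str] = None
    a: Optional[str] = None      # immediate / effective address (unused below)


def read_operand(reg: str, qi: Dict[str, str]) -> Tuple[Optional[str], Optional[str]]:
    """Return (value, tag): a tag if a station will produce reg, else its value."""
    tag = qi.get(reg)
    return (None, tag) if tag else (f"Regs[{reg}]", None)


def issue(rs: ReservationStation, op: str, dest: str,
          src_j: str, src_k: str, qi: Dict[str, str]) -> bool:
    """Issue an FP operation into rs; return False to signal a structural stall."""
    if rs.busy:
        return False
    rs.busy, rs.op = True, op
    rs.vj, rs.qj = read_operand(src_j, qi)
    rs.vk, rs.qk = read_operand(src_k, qi)
    qi[dest] = rs.name           # rename the destination register to this station
    return True


# Hypothetical state just before MUL.D F0, F2, F4 issues: both loads are pending.
qi = {"F2": "Load2", "F6": "Load1"}
mult1 = ReservationStation("Mult1")
issued = issue(mult1, "Mult", "F0", "F2", "F4", qi)
```

After the call, `mult1` waits on tag `Load2` for its first operand, already holds `Regs[F4]` as its second, and `qi["F0"]` names `Mult1` as the producer of F0.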

  3. Time = 1
     • First load is issued
     Instruction status (stages reached so far):
       L.D F6, 34(R2): Issue
       L.D F2, 45(R3): not issued
       MUL.D F0, F2, F4: not issued
       SUB.D F8, F2, F6: not issued
       DIV.D F10, F0, F6: not issued
       ADD.D F6, F8, F2: not issued
     Reservation stations (stations not listed are free):
       Load1: Busy = yes, Op = Load, Vj = Regs[R2], A = 34
     Register result status (Qi):
       F6 = Load1

  4. Time = 2
     • First load calculates address
     • Second load is issued
     Instruction status (stages reached so far):
       L.D F6, 34(R2): Issue, Execute
       L.D F2, 45(R3): Issue
       MUL.D F0, F2, F4: not issued
       SUB.D F8, F2, F6: not issued
       DIV.D F10, F0, F6: not issued
       ADD.D F6, F8, F2: not issued
     Reservation stations (stations not listed are free):
       Load1: Busy = yes, Op = Load, Vj = Regs[R2], A = 34 (Regs[R2] + 34 is being computed)
       Load2: Busy = yes, Op = Load, Vj = Regs[R3], A = 45
     Register result status (Qi):
       F2 = Load2, F6 = Load1

  5. Time = 3
     • First load reads from memory
     • Second load calculates address
     • Mult is issued
     Instruction status (stages reached so far):
       L.D F6, 34(R2): Issue, Execute
       L.D F2, 45(R3): Issue, Execute
       MUL.D F0, F2, F4: Issue
       SUB.D F8, F2, F6: not issued
       DIV.D F10, F0, F6: not issued
       ADD.D F6, F8, F2: not issued
     Reservation stations (stations not listed are free):
       Load1: Busy = yes, Op = Load, A = Regs[R2]+34
       Load2: Busy = yes, Op = Load, Vj = Regs[R3], A = 45 (Regs[R3] + 45 is being computed)
       Mult1: Busy = yes, Op = Mult, Qj = Load2, Vk = Regs[F4]
     Register result status (Qi):
       F0 = Mult1, F2 = Load2, F6 = Load1

  6. Time = 4
     • First load writes result
     • Second load reads from memory
     • Mult is stalled
     • Sub is issued
     Instruction status (stages reached so far):
       L.D F6, 34(R2): Issue, Execute, Write result
       L.D F2, 45(R3): Issue, Execute
       MUL.D F0, F2, F4: Issue
       SUB.D F8, F2, F6: Issue
       DIV.D F10, F0, F6: not issued
       ADD.D F6, F8, F2: not issued
     Reservation stations (stations not listed are free):
       Load1: Busy = yes, Op = Load, A = Regs[R2]+34
       Load2: Busy = yes, Op = Load, A = Regs[R3]+45
       Add1: Busy = yes, Op = Sub, Qj = Load2, Vk = Mem[34+Regs[R2]]
       Mult1: Busy = yes, Op = Mult, Qj = Load2, Vk = Regs[F4]
     Register result status (Qi):
       F0 = Mult1, F2 = Load2, F6 = Load1, F8 = Add1

  7. Time = 5
     • Second load writes result
     • Mult is stalled
     • Sub is stalled
     • Div is issued
     Instruction status (stages reached so far):
       L.D F6, 34(R2): Issue, Execute, Write result
       L.D F2, 45(R3): Issue, Execute, Write result
       MUL.D F0, F2, F4: Issue
       SUB.D F8, F2, F6: Issue
       DIV.D F10, F0, F6: Issue
       ADD.D F6, F8, F2: not issued
     Reservation stations (stations not listed are free):
       Load2: Busy = yes, Op = Load, A = Regs[R3]+45
       Add1: Busy = yes, Op = Sub, Vj = Mem[45+Regs[R3]] (replaces tag Load2), Vk = Mem[34+Regs[R2]]
       Mult1: Busy = yes, Op = Mult, Vj = Mem[45+Regs[R3]] (replaces tag Load2), Vk = Regs[F4]
       Mult2: Busy = yes, Op = Div, Qj = Mult1, Vk = Mem[34+Regs[R2]]
     Register result status (Qi):
       F0 = Mult1, F2 = Load2, F8 = Add1, F10 = Mult2

  8. Time = 6
     • Mult is executed (1 out of 10)
     • Sub is executed (1 out of 2)
     • Div is stalled
     • Add is issued
     Instruction status (stages reached so far):
       L.D F6, 34(R2): Issue, Execute, Write result
       L.D F2, 45(R3): Issue, Execute, Write result
       MUL.D F0, F2, F4: Issue, Execute (from cycle 6)
       SUB.D F8, F2, F6: Issue, Execute (from cycle 6)
       DIV.D F10, F0, F6: Issue
       ADD.D F6, F8, F2: Issue
     Reservation stations (stations not listed are free):
       Add1: Busy = yes, Op = Sub, Vj = Mem[45+Regs[R3]], Vk = Mem[34+Regs[R2]]
       Add2: Busy = yes, Op = Add, Qj = Add1, Vk = Mem[45+Regs[R3]]
       Mult1: Busy = yes, Op = Mult, Vj = Mem[45+Regs[R3]], Vk = Regs[F4]
       Mult2: Busy = yes, Op = Div, Qj = Mult1, Vk = Mem[34+Regs[R2]]
     Register result status (Qi):
       F0 = Mult1, F6 = Add2, F8 = Add1, F10 = Mult2

  9. Time = 7
     • Mult is executed (2 out of 10)
     • Sub is executed (2 out of 2)
     • Div is stalled
     • Add is stalled
     Instruction status (stages reached so far):
       L.D F6, 34(R2): Issue, Execute, Write result
       L.D F2, 45(R3): Issue, Execute, Write result
       MUL.D F0, F2, F4: Issue, Execute (from cycle 6)
       SUB.D F8, F2, F6: Issue, Execute (from cycle 6)
       DIV.D F10, F0, F6: Issue
       ADD.D F6, F8, F2: Issue
     Reservation stations (stations not listed are free):
       Add1: Busy = yes, Op = Sub, Vj = Mem[45+Regs[R3]], Vk = Mem[34+Regs[R2]]
       Add2: Busy = yes, Op = Add, Qj = Add1, Vk = Mem[45+Regs[R3]]
       Mult1: Busy = yes, Op = Mult, Vj = Mem[45+Regs[R3]], Vk = Regs[F4]
       Mult2: Busy = yes, Op = Div, Qj = Mult1, Vk = Mem[34+Regs[R2]]
     Register result status (Qi):
       F0 = Mult1, F6 = Add2, F8 = Add1, F10 = Mult2

  10. Time = 8
     • Mult is executed (3 out of 10)
     • Sub writes result: X = Mem[45+Regs[R3]] - Mem[34+Regs[R2]] (SUB.D F8, F2, F6 computes F2 - F6)
     • Div is stalled
     • Add is stalled
     Instruction status (stages reached so far):
       L.D F6, 34(R2): Issue, Execute, Write result
       L.D F2, 45(R3): Issue, Execute, Write result
       MUL.D F0, F2, F4: Issue, Execute (from cycle 6)
       SUB.D F8, F2, F6: Issue, Execute (from cycle 6), Write result
       DIV.D F10, F0, F6: Issue
       ADD.D F6, F8, F2: Issue
     Reservation stations (stations not listed are free):
       Add1: Busy = yes, Op = Sub, Vj = Mem[45+Regs[R3]], Vk = Mem[34+Regs[R2]]
       Add2: Busy = yes, Op = Add, Vj = X (replaces tag Add1), Vk = Mem[45+Regs[R3]]
       Mult1: Busy = yes, Op = Mult, Vj = Mem[45+Regs[R3]], Vk = Regs[F4]
       Mult2: Busy = yes, Op = Div, Qj = Mult1, Vk = Mem[34+Regs[R2]]
     Register result status (Qi):
       F0 = Mult1, F6 = Add2, F8 = Add1, F10 = Mult2

  11. Time = 9
     • Mult is executed (4 out of 10)
     • Div is stalled
     • Add is executed (1 out of 2)
     Instruction status (stages reached so far):
       L.D F6, 34(R2): Issue, Execute, Write result
       L.D F2, 45(R3): Issue, Execute, Write result
       MUL.D F0, F2, F4: Issue, Execute (from cycle 6)
       SUB.D F8, F2, F6: Issue, Execute, Write result
       DIV.D F10, F0, F6: Issue
       ADD.D F6, F8, F2: Issue, Execute
     Reservation stations (stations not listed are free):
       Add2: Busy = yes, Op = Add, Vj = X, Vk = Mem[45+Regs[R3]]
       Mult1: Busy = yes, Op = Mult, Vj = Mem[45+Regs[R3]], Vk = Regs[F4]
       Mult2: Busy = yes, Op = Div, Qj = Mult1, Vk = Mem[34+Regs[R2]]
     Register result status (Qi):
       F0 = Mult1, F6 = Add2, F10 = Mult2

  12. Time = 10
     • Mult is executed (5 out of 10)
     • Div is stalled
     • Add is executed (2 out of 2)
     Instruction status (stages reached so far):
       L.D F6, 34(R2): Issue, Execute, Write result
       L.D F2, 45(R3): Issue, Execute, Write result
       MUL.D F0, F2, F4: Issue, Execute (from cycle 6)
       SUB.D F8, F2, F6: Issue, Execute, Write result
       DIV.D F10, F0, F6: Issue
       ADD.D F6, F8, F2: Issue, Execute
     Reservation stations (stations not listed are free):
       Add2: Busy = yes, Op = Add, Vj = X, Vk = Mem[45+Regs[R3]]
       Mult1: Busy = yes, Op = Mult, Vj = Mem[45+Regs[R3]], Vk = Regs[F4]
       Mult2: Busy = yes, Op = Div, Qj = Mult1, Vk = Mem[34+Regs[R2]]
     Register result status (Qi):
       F0 = Mult1, F6 = Add2, F10 = Mult2

  13. Time = 11
     • Mult is executed (6 out of 10)
     • Div is stalled
     • Add writes result
     Instruction status (stages reached so far):
       L.D F6, 34(R2): Issue, Execute, Write result
       L.D F2, 45(R3): Issue, Execute, Write result
       MUL.D F0, F2, F4: Issue, Execute (from cycle 6)
       SUB.D F8, F2, F6: Issue, Execute, Write result
       DIV.D F10, F0, F6: Issue
       ADD.D F6, F8, F2: Issue, Execute, Write result
     Reservation stations (stations not listed are free):
       Add2: Busy = yes, Op = Add, Vj = X, Vk = Mem[45+Regs[R3]]
       Mult1: Busy = yes, Op = Mult, Vj = Mem[45+Regs[R3]], Vk = Regs[F4]
       Mult2: Busy = yes, Op = Div, Qj = Mult1, Vk = Mem[34+Regs[R2]]
     Register result status (Qi):
       F0 = Mult1, F6 = Add2, F10 = Mult2

  14. Time = 16
     • Mult writes result: Y = Mem[45+Regs[R3]] * Regs[F4]
     • Div is stalled
     Instruction status (stages reached so far):
       L.D F6, 34(R2): Issue, Execute, Write result
       L.D F2, 45(R3): Issue, Execute, Write result
       MUL.D F0, F2, F4: Issue, Execute, Write result
       SUB.D F8, F2, F6: Issue, Execute, Write result
       DIV.D F10, F0, F6: Issue
       ADD.D F6, F8, F2: Issue, Execute, Write result
     Reservation stations (stations not listed are free):
       Mult1: Busy = yes, Op = Mult, Vj = Mem[45+Regs[R3]], Vk = Regs[F4]
       Mult2: Busy = yes, Op = Div, Vj = Y (replaces tag Mult1), Vk = Mem[34+Regs[R2]]
     Register result status (Qi):
       F0 = Mult1, F10 = Mult2

  15. Time = 17
     • Div is executed (1 out of 40)
     Instruction status (stages reached so far):
       L.D F6, 34(R2): Issue, Execute, Write result
       L.D F2, 45(R3): Issue, Execute, Write result
       MUL.D F0, F2, F4: Issue, Execute, Write result
       SUB.D F8, F2, F6: Issue, Execute, Write result
       DIV.D F10, F0, F6: Issue, Execute (from cycle 17)
       ADD.D F6, F8, F2: Issue, Execute, Write result
     Reservation stations (stations not listed are free):
       Mult2: Busy = yes, Op = Div, Vj = Y, Vk = Mem[34+Regs[R2]]
     Register result status (Qi):
       F10 = Mult2

  16. Time = 57
     • Div writes result
     Instruction status (stages reached so far):
       L.D F6, 34(R2): Issue, Execute, Write result
       L.D F2, 45(R3): Issue, Execute, Write result
       MUL.D F0, F2, F4: Issue, Execute, Write result
       SUB.D F8, F2, F6: Issue, Execute, Write result
       DIV.D F10, F0, F6: Issue, Execute, Write result
       ADD.D F6, F8, F2: Issue, Execute, Write result
     Reservation stations (stations not listed are free):
       Mult2: Busy = yes, Op = Div, Vj = Y, Vk = Mem[34+Regs[R2]]
     Register result status (Qi):
       F10 = Mult2
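The write-result cycles in this walkthrough (4, 5, 16, 8, 57, 11) can be reproduced with a small timing model. This is a sketch under simplifying assumptions that are mine, not the slides': one instruction issues per cycle in program order, execution starts the cycle after the last operand is written, a load spends two execute cycles (address, then memory), and structural hazards and CDB contention are ignored. The names `LATENCY`, `PROGRAM` and `write_cycles` are hypothetical.

```python
# Execute-stage latencies in cycles, taken from the "(k out of N)" notes on the slides;
# the load value of 2 (address calculation + memory read) is read off cycles 2 to 4.
LATENCY = {"L.D": 2, "MUL.D": 10, "SUB.D": 2, "ADD.D": 2, "DIV.D": 40}

# (opcode, destination register, FP source registers) in program order.
PROGRAM = [
    ("L.D", "F6", []),
    ("L.D", "F2", []),
    ("MUL.D", "F0", ["F2", "F4"]),
    ("SUB.D", "F8", ["F2", "F6"]),
    ("DIV.D", "F10", ["F0", "F6"]),
    ("ADD.D", "F6", ["F8", "F2"]),
]


def write_cycles(program):
    """Return (op, dest, issue, exec_start, write) for each instruction."""
    ready_at = {}   # register -> cycle in which its most recent producer writes
    timeline = []
    for issue_cycle, (op, dest, srcs) in enumerate(program, start=1):
        # Operands are captured from the producer that is pending at issue time,
        # which is how renaming keeps DIV.D on the old F6 despite the later ADD.D.
        operands_ready = max((ready_at.get(r, 0) for r in srcs), default=0)
        start = max(issue_cycle, operands_ready) + 1
        write = start + LATENCY[op]
        ready_at[dest] = write
        timeline.append((op, dest, issue_cycle, start, write))
    return timeline


timeline = write_cycles(PROGRAM)
writes = [write for *_, write in timeline]
```

Under these assumptions `writes` matches the cycles in which each instruction writes its result in the slides, and the `start` values match the 6 and 17 recorded in the Execute column.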

  17. Tomasulo’s Alg. and Loop Unrolling
     • Consider a loop:
         LOOP: L.D    F0, 0(R1)
               MUL.D  F4, F0, F2
               S.D    F4, 0(R1)
               DADDUI R1, R1, #-8
               BNE    R1, R2, LOOP
     • We assume that the branch is always predicted as taken, and we issue instructions from two loop iterations
     • Assume none of the load/store or FP operations have completed

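  18. Instruction status (stages reached so far):
       L.D F0, 0(R1): Issue, Execute
       MUL.D F4, F0, F2: Issue
       S.D F4, 0(R1): Issue
       L.D F0, -8(R1): Issue, Execute
       MUL.D F4, F0, F2: Issue
       S.D F4, -8(R1): Issue
     Reservation stations (stations not listed are free):
       Load1: Busy = yes, Op = Load, A = Regs[R1]+0
       Load2: Busy = yes, Op = Load, A = Regs[R1]-8
       Mult1: Busy = yes, Op = Mult, Qj = Load1, Vk = Regs[F2]
       Mult2: Busy = yes, Op = Mult, Qj = Load2, Vk = Regs[F2]
       Store1: Busy = yes, Op = Store, Qk = Mult1, A = Regs[R1]+0
       Store2: Busy = yes, Op = Store, Qk = Mult2, A = Regs[R1]-8
     Register result status (Qi):
       F0 = Load2, F4 = Mult2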

  19. Dynamic Memory Disambiguation
     • The order of loads and stores that access the same memory location must be preserved
     • Since the conflict depends on the memory location, ordering can be checked only after the effective address has been calculated
     • Effective address calculation is performed in program order
     • The address of a load is compared against the A fields of all store buffers
     • The address of a store is compared against the A fields of all load and store buffers
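The two address checks can be sketched as follows. This is an illustrative model only: the function names are hypothetical, addresses are plain integers, and the lists stand for the A fields of buffers holding earlier, still-pending memory operations.

```python
from typing import Iterable


def load_may_proceed(load_addr: int, pending_store_addrs: Iterable[int]) -> bool:
    """A load must wait if any earlier pending store targets the same address."""
    return load_addr not in set(pending_store_addrs)


def store_may_proceed(store_addr: int,
                      pending_load_addrs: Iterable[int],
                      pending_store_addrs: Iterable[int]) -> bool:
    """A store must wait for earlier pending loads and stores to the same address."""
    conflicts = set(pending_load_addrs) | set(pending_store_addrs)
    return store_addr not in conflicts


# Hypothetical buffer contents: one earlier store to 0x100, one earlier load from 0x200.
load_ok = load_may_proceed(0x108, [0x100])            # different address
load_blocked = load_may_proceed(0x100, [0x100])       # read after write to 0x100
store_blocked = store_may_proceed(0x200, [0x200], [0x100])
```

Because effective addresses are computed in program order, every earlier memory operation already has its A field filled in when the check runs.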

  20. Dynamic Hardware Branch Prediction
     • Predict the outcome of a branch
     • Change the prediction after observing a few iterations
     • To achieve good effectiveness we must:
       • Have an accurate prediction technique
       • Have a low cost for misprediction

  21. Local Prediction: Branch Prediction Buffer
     • A table indexed by the low bits of the branch instruction address
     • Each entry contains a bit indicating whether the branch was recently taken or not
     • If the prediction turns out to be wrong, the bit is inverted
     [Figure: 4 bits of the branch address index a table of 1-bit entries]

  22. 1-bit Branch Prediction Buffer
     • Problem: even the simplest loop branch is mispredicted twice
               LD    R1, #5
         Loop: LD    R2, 0(R5)
               ADD   R2, R2, R4
               STORE R2, 0(R5)
               ADD   R5, R5, #4
               SUB   R1, R1, #1
               BNEZ  R1, Loop
     • First time: prediction = 0 but the branch is taken, so the prediction changes to 1 (miss)
     • Time 2, 3, 4: prediction = 1 and the branch is taken
     • Time 5: prediction = 1 but the branch is not taken, so the prediction changes to 0 (miss)
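The two misses per pass through the loop can be checked with a short simulation. The function name `one_bit_misses` is hypothetical; outcomes are booleans with `True` meaning taken, and the starting prediction of not taken follows the trace above.

```python
def one_bit_misses(outcomes, prediction=False):
    """Run a 1-bit predictor over branch outcomes; return (misses, final bit)."""
    misses = 0
    for taken in outcomes:
        if taken != prediction:
            misses += 1
            prediction = taken   # a wrong prediction inverts the bit
    return misses, prediction


# BNEZ R1, Loop with R1 = 5: taken four times, then not taken on loop exit.
loop_branch = [True, True, True, True, False]

first_misses, bit = one_bit_misses(loop_branch)        # first pass through the loop
second_misses, _ = one_bit_misses(loop_branch, bit)    # loop entered again later
```

Each pass leaves the bit at not taken, so a later execution of the same loop again misses on its first and last iterations.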

  23. 2-bit Branch Prediction Buffer
     • To amend this we use 2 bits: the predictor must miss twice before the prediction changes
     [Figure: state diagram with four states, 11 (predict taken), 10 (predict taken), 01 (predict not taken) and 00 (predict not taken), connected by edges labelled Taken and Not taken]

  24. 2-bit Branch Prediction Buffer
     • First time we encounter this loop:
               LD    R1, #5
         Loop: LD    R2, 0(R5)
               ADD   R2, R2, R4
               STORE R2, 0(R5)
               ADD   R5, R5, #4
               SUB   R1, R1, #1
               BNEZ  R1, Loop
     • First time: prediction = 00 (not taken) but the branch is taken, so the prediction changes to 01 (miss)
     • Time 2: prediction = 01 (not taken) but the branch is taken, so the prediction changes to 11 (miss)
     • Time 3, 4: prediction = 11 (taken) and the branch is taken
     • Time 5: prediction = 11 (taken) but the branch is not taken, so the prediction changes to 10 (miss)
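The trace above can be replayed with a table-driven state machine. Only the transitions 00 to 01, 01 to 11 and 11 to 10 are confirmed by the trace; the remaining entries of `NEXT_STATE` are an assumption that follows the common textbook 2-bit scheme, and the names `NEXT_STATE` and `two_bit_misses` are hypothetical.

```python
# NEXT_STATE[state][taken] -> new state. States 11 and 10 predict taken,
# states 01 and 00 predict not taken.
NEXT_STATE = {
    0b11: {True: 0b11, False: 0b10},
    0b10: {True: 0b11, False: 0b00},   # assumed, not shown in the trace
    0b01: {True: 0b11, False: 0b00},   # not-taken edge assumed
    0b00: {True: 0b01, False: 0b00},   # not-taken edge assumed
}


def two_bit_misses(outcomes, state=0b00):
    """Run the 2-bit predictor over branch outcomes; return (misses, final state)."""
    misses = 0
    for taken in outcomes:
        predicted_taken = state >= 0b10
        if predicted_taken != taken:
            misses += 1
        state = NEXT_STATE[state][taken]
    return misses, state


loop_branch = [True, True, True, True, False]

first_misses, state = two_bit_misses(loop_branch)          # cold start from 00
second_misses, _ = two_bit_misses(loop_branch, state)      # loop entered again
```

The first pass misses three times, as in the trace, and ends in state 10. A second pass then misses only on the loop exit, which is the improvement over the 1-bit scheme.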

  25. n-bit Branch Prediction Buffer
     • We can generalize this technique to n-bit prediction buffers
     • When the counter is ≥ 2^(n-1), the branch is predicted as taken
     • These predictors are not much more accurate than 2-bit predictors
     [Figure: state diagram for a 3-bit predictor; states 111, 110 and 100 predict taken, states 011, 001 and 000 predict not taken; edges labelled Taken and Not taken]
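The threshold rule can be sketched with an n-bit saturating counter. This is one common realization and an assumption on my part: the slide's diagram may use different transitions, and for n = 2 a saturating counter differs slightly from the 2-bit trace earlier in the deck (it moves from 01 to 10 on a taken branch, not to 11). The function names are hypothetical.

```python
def predict_taken(counter: int, n: int) -> bool:
    """Predict taken when the counter is in the upper half of its range."""
    return counter >= 2 ** (n - 1)


def update(counter: int, taken: bool, n: int) -> int:
    """Increment on taken, decrement on not taken, saturating at 0 and 2^n - 1."""
    if taken:
        return min(counter + 1, 2 ** n - 1)
    return max(counter - 1, 0)


# 3-bit example: the threshold is 4 (binary 100).
n = 3
counter = 0b011
before = predict_taken(counter, n)      # below the threshold
counter = update(counter, True, n)      # one taken branch moves it to 100
after = predict_taken(counter, n)
```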

  26. Correlating (Global) Branch Predictors
     • Assign two prediction bits to each branch: one used if the previous branch was not taken, the other if it was taken
           b1: if (d==0) d=1;
           b2: if (d==1)

           b1: BNEZ   R1, L1
               DADDUI R1, R0, #1
           L1: DSUBUI R3, R1, #1
           b2: BNEZ   R3, L2
               …….
           L2:
     • If b1 is taken (d ≠ 0), b2 is also taken unless d = 1
     • If b1 is not taken, b2 is not taken
     • Each entry is written as a pair such as 0/0: the first bit indicates what to do if the one previous branch was not taken, the second what to do if it was taken

  27. Correlating Branch Predictors
     • Assign two prediction bits to each branch: one used if the previous branch was not taken, the other if it was taken
           b1: BNEZ   R1, L1
               DADDUI R1, R0, #1
           L1: DSUBUI R3, R1, #1
           b2: BNEZ   R3, L2
               …….
           L2:
     • This is a (1,1) predictor: it uses the outcome of 1 previous branch to select a 1-bit prediction
     • In each pair X/Y, X is the prediction used if the previous branch was not taken and Y the prediction used if it was taken

     R1=?  b1 prediction  b1 action  New b1 prediction  b2 prediction  b2 action  New b2 prediction
     2     NT/NT          T (miss)   T/NT               NT/NT          T (miss)   NT/T
     0     T/NT           NT         T/NT               NT/T           NT         NT/T
     2     T/NT           T          T/NT               NT/T           T          NT/T
     0     T/NT           NT         T/NT               NT/T           NT         NT/T
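The (1,1) predictor can be simulated on the sequence R1 = 2, 0, 2, 0. This sketch makes two assumptions not stated in the transcript: the branch executed before the sequence starts was not taken, and every prediction bit starts at not taken. The name `simulate_1_1` is hypothetical; `True` means taken.

```python
def simulate_1_1(d_values):
    """Simulate a (1,1) correlating predictor for branches b1 and b2.

    table[branch][last] holds the 1-bit prediction used when the previously
    executed branch had outcome `last`.
    """
    table = {
        "b1": {False: False, True: False},
        "b2": {False: False, True: False},
    }
    last = False      # assumed outcome of the branch before the trace
    misses = 0
    for d in d_values:
        b1_taken = d != 0          # b1: BNEZ R1, L1 skips d = 1 when d != 0
        if not b1_taken:
            d = 1
        b2_taken = d != 1          # b2: BNEZ R3, L2 with R3 = d - 1
        for name, taken in (("b1", b1_taken), ("b2", b2_taken)):
            if table[name][last] != taken:
                misses += 1
                table[name][last] = taken   # 1-bit entry: flip on a miss
            last = taken
    return misses, table


misses, table = simulate_1_1([2, 0, 2, 0])
```

With these assumptions the predictor misses only twice, both in the first iteration, and settles at T/NT for b1 and NT/T for b2, where the first element of each pair is the entry for "previous branch not taken".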

  28. Correlating Branch Predictors (m,n)
     • Observe the behavior of the m previous branches and use an n-bit predictor
     • (1,1), entry of the form 0/0: one bit indicating what to do if the one previous branch was not taken, and one bit indicating what to do if it was taken
     • (m,1), entry of the form 0/0/0/…/0: one bit for each outcome pattern of the m previous branches, from all not taken to all taken
     • (m,n), entry of the form 0111/0011/0001/…/1110: n bits for each outcome pattern of the m previous branches, from all not taken to all taken

  29. Correlating Branch Predictors (m,n)
     • 2^m combinations, n bits each
     [Figure: 4 bits of the branch address select a row of n-bit entries; m bits indicating the outcome of the m previous branches select the entry within the row]

  30. Correlating Branch Predictors (m,n)
     • How many bits do we need for an (m,n) predictor?
     • There are 2^m combinations of n bits each; if the last t bits of the branch address select the prediction, the total is 2^m * n * 2^t bits
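The size formula is easy to evaluate directly. The function name `predictor_bits` is hypothetical, and the parameter values below are illustrative choices, not figures from the slides.

```python
def predictor_bits(m: int, n: int, t: int) -> int:
    """Total storage of an (m,n) predictor indexed by t branch-address bits."""
    return (2 ** m) * n * (2 ** t)


# A (2,2) predictor indexed by 4 address bits: 4 patterns * 2 bits * 16 rows.
small = predictor_bits(2, 2, 4)

# A plain 2-bit buffer is the (0,2) case; with 12 index bits it has 4096 entries.
plain = predictor_bits(0, 2, 12)
```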

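  31. Tournament Predictors
     • Combine one global and one local predictor with a selector
     [Figure: selector state diagram with two "Use predictor 1" states and two "Use predictor 2" states. Edge labels are pairs giving whether the first and second predictor were right, e.g. 1/0 means the first was right and the second was wrong. The outer "Use predictor 1" state stays put on 1/1, 0/0 and 1/0, the outer "Use predictor 2" state stays put on 1/1, 0/0 and 0/1, the inner states stay put on 0/0 and 1/1, and edges labelled 1/0 and 0/1 move between states.]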
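The selector can be sketched as a 2-bit saturating counter that moves only when exactly one predictor is right. The state encoding (0 and 1 select predictor 1, 2 and 3 select predictor 2) and the function names are assumptions made for illustration; the slide's figure gives only the edge labels.

```python
def choose(selector: int, prediction_1: bool, prediction_2: bool) -> bool:
    """Return the prediction of whichever predictor the selector currently trusts."""
    return prediction_1 if selector <= 1 else prediction_2


def update_selector(selector: int, p1_correct: bool, p2_correct: bool) -> int:
    """Move toward the predictor that was right when the other was wrong."""
    if p1_correct and not p2_correct:      # label 1/0
        return max(selector - 1, 0)
    if p2_correct and not p1_correct:      # label 0/1
        return min(selector + 1, 3)
    return selector                        # labels 0/0 and 1/1: no change


# Start fully trusting predictor 1; predictor 2 alone is right twice in a row.
selector = 0
selector = update_selector(selector, False, True)   # still uses predictor 1
still_first = choose(selector, True, False)
selector = update_selector(selector, False, True)   # now switches to predictor 2
now_second = choose(selector, True, False)
```

As with the 2-bit branch predictor, the selector has to see two outcomes favouring the other predictor before it switches.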
