
Chapter 3 Limitations on Instruction-Level Parallelism




Presentation Transcript


  1. Chapter 3 Limitations on Instruction-Level Parallelism Dr. Bernard Chen Ph.D. University of Central Arkansas Spring 2010

  2. Overcome Data Hazards with Dynamic Scheduling • If there is a data dependence, the hazard detection hardware stalls the pipeline • No new instructions are fetched or issued until the dependence is cleared • Dynamic scheduling: the hardware rearranges instruction execution to reduce stalls while maintaining data flow and exception behavior

  3. RAW • If two instructions are data dependent, they cannot execute simultaneously or be completely overlapped • If a data dependence causes a hazard in the pipeline, it is called a Read After Write (RAW) hazard: I: add r1,r2,r3 J: sub r4,r1,r3
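A minimal sketch of that RAW check, assuming instructions are modeled as a destination register plus source registers (the Instr tuple and is_raw helper are invented for illustration, not textbook machinery):

```python
# RAW-detection sketch: a later instruction J depends on an earlier
# instruction I if J reads a register that I writes.
from collections import namedtuple

Instr = namedtuple("Instr", ["op", "dest", "srcs"])

def is_raw(i, j):
    """True if J reads the register that I writes (Read After Write)."""
    return i.dest in j.srcs

I = Instr("add", "r1", ("r2", "r3"))
J = Instr("sub", "r4", ("r1", "r3"))
print(is_raw(I, J))  # True: J reads r1, which I writes
```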

  4. Overcome Data Hazards with Dynamic Scheduling • Key idea: Allow instructions behind a stall to proceed DIV F0 <- F2/F4 ADD F10 <- F0+F8 SUB F12 <- F8-F14

  5. Overcome Data Hazards with Dynamic Scheduling • Key idea: Allow instructions behind a stall to proceed DIV F0 <- F2/F4 SUB F12 <- F8-F14 ADD F10 <- F0+F8

  6. Overcome Data Hazards with Dynamic Scheduling • Key idea: Allow instructions behind a stall to proceed DIV F0 <- F2/F4 SUB F12 <- F8-F14 ADD F10 <- F0+F8 • This enables out-of-order execution and allows out-of-order completion (e.g., SUB finishes before ADD) • In a dynamically scheduled pipeline, all instructions still pass through the issue stage in order (in-order issue)
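A sketch of that reordering with invented latencies (single-issue, tracking only when each register value becomes available): SUB starts, and even completes, while ADD is still waiting on DIV.

```python
# Sketch of in-order issue with out-of-order execution/completion.
# Latencies are invented (DIV is slow); one instruction issues per cycle.
from collections import namedtuple

Instr = namedtuple("Instr", ["op", "dest", "srcs", "latency"])

program = [
    Instr("DIV", "F0",  ("F2", "F4"),  10),
    Instr("ADD", "F10", ("F0", "F8"),  2),   # RAW on F0: must wait for DIV
    Instr("SUB", "F12", ("F8", "F14"), 2),   # independent: may bypass ADD
]

ready_at = {}   # register -> cycle its value becomes available
for issue_cycle, ins in enumerate(program):  # in-order issue, one per cycle
    start = max([issue_cycle] + [ready_at.get(r, 0) for r in ins.srcs])
    ready_at[ins.dest] = start + ins.latency
    print(f"{ins.op}: issued cycle {issue_cycle}, executes at {start}, "
          f"done at {start + ins.latency}")
# SUB executes at cycle 2 and finishes at cycle 4, long before ADD
# even begins at cycle 10: out-of-order execution and completion.
```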

  7. Overcome Data Hazards with Dynamic Scheduling • It offers several advantages: • It simplifies the compiler • It allows code compiled for one pipeline to run efficiently on a different pipeline • It allows the processor to tolerate unpredictable delays, such as cache misses

  8. Overcome Data Hazards with Dynamic Scheduling • However, dynamic execution creates WAR and WAW hazards and makes exceptions harder to handle • Name dependence: two instructions use the same register or memory location, called a name, but there is no flow of data between the instructions associated with that name • There are two versions of name dependence

  9. WAR I: sub r4,r1,r3 J: add r1,r2,r3 K: mul r6,r1,r7 • InstrJ writes its operand (r1) before InstrI reads it • If this anti-dependence causes a hazard in the pipeline, it is called a Write After Read (WAR) hazard

  10. WAW I: sub r1,r4,r3 J: add r1,r2,r3 K: mul r6,r1,r7 • InstrJ writes its operand (r1) before InstrI writes it • If this output dependence causes a hazard in the pipeline, it is called a Write After Write (WAW) hazard
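Putting the three definitions together, a sketch that classifies the register dependences between an earlier instruction I and a later instruction J (registers only; memory names are ignored, and the hazards() helper is illustrative):

```python
# Classify register dependences between earlier instruction I and later J.
from collections import namedtuple

Instr = namedtuple("Instr", ["op", "dest", "srcs"])

def hazards(i, j):
    found = []
    if i.dest in j.srcs:
        found.append("RAW")   # J reads what I writes (true dependence)
    if j.dest in i.srcs:
        found.append("WAR")   # J writes what I reads (anti-dependence)
    if j.dest == i.dest:
        found.append("WAW")   # J writes what I writes (output dependence)
    return found

I = Instr("sub", "r4", ("r1", "r3"))
J = Instr("add", "r1", ("r2", "r3"))
print(hazards(I, J))  # ['WAR']: J writes r1, which I reads
```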

  11. Example DIV r0 <- r2 / r4 ADD r6 <- r0 + r8 SUB r8 <- r10 - r14 MUL r6 <- r10 * r7 OR r3 <- r5 or r9

  12. Example: RAW • ADD reads r0, which is written by DIV (true dependence)

  13. Example: WAR • SUB writes r8, which the earlier ADD reads (anti-dependence)

  14. Example: WAW • MUL writes r6, which the earlier ADD also writes (output dependence)
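As a cross-check, a self-contained sketch (the same rules as the classifier above, restated so it runs standalone) that scans every earlier/later pair of the slide-11 example prints exactly these three hazards:

```python
# Scan all ordered pairs of the example for RAW/WAR/WAW hazards.
from itertools import combinations
from collections import namedtuple

Instr = namedtuple("Instr", ["op", "dest", "srcs"])

prog = [
    Instr("DIV", "r0", ("r2", "r4")),
    Instr("ADD", "r6", ("r0", "r8")),
    Instr("SUB", "r8", ("r10", "r14")),
    Instr("MUL", "r6", ("r10", "r7")),
    Instr("OR",  "r3", ("r5", "r9")),
]

for i, j in combinations(prog, 2):          # i comes earlier than j
    if i.dest in j.srcs:
        print(f"RAW: {j.op} reads {i.dest} written by {i.op}")
    if j.dest in i.srcs:
        print(f"WAR: {j.op} writes {j.dest} read by {i.op}")
    if j.dest == i.dest:
        print(f"WAW: {j.op} writes {j.dest} also written by {i.op}")
```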

  15. For you to practice • DIV r0 <- r2 / r4 • ADD r6 <- r0 + r8 • ST r1 <- r6 • SUB r8 <- r10 - r14 • MUL r6 <- r10 * r8

  16. Overcome Data Hazards with Dynamic Scheduling • Instructions involved in a name dependence can execute simultaneously if the name used in the instructions is changed so that the instructions do not conflict • Register renaming resolves name dependences for registers • It can be done either by the compiler or by hardware
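A sketch of renaming in software terms, assuming an unbounded pool of physical registers (the rename() helper and the p0, p1, ... names are invented for illustration): every write allocates a fresh physical register and reads go through a mapping table, so WAR and WAW conflicts vanish while true RAW dependences survive.

```python
# Register-renaming sketch: every write gets a fresh physical register,
# so only true (RAW) dependences remain. Unbounded registers assumed.
from collections import namedtuple
from itertools import count

Instr = namedtuple("Instr", ["op", "dest", "srcs"])

def rename(program):
    fresh = (f"p{n}" for n in count())
    table = {}                                           # architectural -> physical
    out = []
    for ins in program:
        srcs = tuple(table.get(r, r) for r in ins.srcs)  # read current mapping
        table[ins.dest] = next(fresh)                    # new name per write
        out.append(Instr(ins.op, table[ins.dest], srcs))
    return out

prog = [
    Instr("DIV", "r0", ("r2", "r4")),
    Instr("ADD", "r6", ("r0", "r8")),
    Instr("SUB", "r8", ("r10", "r14")),   # WAR with ADD disappears
    Instr("MUL", "r6", ("r10", "r7")),    # WAW with ADD disappears
]
for ins in rename(prog):
    print(ins.op, ins.dest, "<-", ", ".join(ins.srcs))
```

On the slide-11 instructions, ADD still reads DIV's p0 (the RAW is kept), while SUB and MUL now write p2 and p3 instead of clashing with ADD.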

  17. Limits to ILP • Assumptions for an ideal/perfect machine to start: 1. Register renaming – infinite virtual registers, so all WAW and WAR register hazards are avoided 2. Branch prediction – perfect; no mispredictions 3. Caches – perfect

  18. Limits to ILP: HW model comparison (comparison table from the original slide not reproduced in this transcript)

  19. Performance beyond single-thread ILP • There can be much higher natural parallelism in some applications • Examples include online transaction processing systems, which have natural parallelism among the many queries and updates presented by independent requests, and data mining algorithms

  20. Thread-level parallelism (TLP) • Thread: a process with its own instructions and data • A thread may be part of a parallel program of multiple processes, or it may be an independent program • Each thread has all the state (instructions, data, PC, register state, and so on) necessary to allow it to execute • (Ch4 covers data-level parallelism: performing identical operations on lots of data)

  21. Thread-level parallelism (TLP) • TLP is explicitly represented by the use of multiple threads of execution that are inherently parallel • Goal: use multiple instruction streams to improve • the throughput of computers that run many programs • the execution time of multithreaded programs • TLP can be more cost-effective to exploit than ILP

  22. New Approach: Multithreaded Execution • Multithreading: multiple threads share the functional units of one processor via overlapping • The processor must duplicate the independent state of each thread, e.g., a separate copy of the register file, a separate PC, and, for running independent programs, a separate page table

  23. New Approach: Multithreaded Execution • When to switch? • Alternate between threads on each instruction (fine-grained) • When a thread is stalled, perhaps for a cache miss, another thread can be executed (coarse-grained)

  24. Fine-Grained Multithreading • Switches between threads on each instruction, causing the execution of multiple threads to be interleaved • Usually done in a round-robin fashion, skipping any stalled threads • The CPU must be able to switch threads every clock cycle
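A sketch of the round-robin policy under made-up stall patterns (the stalled predicates are invented for illustration): the scheduler tries threads in rotation each cycle, skipping any that are stalled, and wastes a slot only when every thread is stalled.

```python
# Fine-grained multithreading sketch: switch threads every cycle,
# round-robin, skipping stalled threads. Stall patterns are made up.
def fine_grained(stalled, cycles):
    """stalled[t] is a predicate: stalled[t](c) -> thread t cannot issue at cycle c."""
    n = len(stalled)
    start = 0
    for c in range(cycles):
        for off in range(n):                 # round-robin scan
            t = (start + off) % n
            if not stalled[t](c):
                print(f"cycle {c}: thread {t} issues")
                start = t + 1                # next cycle, resume after t
                break
        else:
            print(f"cycle {c}: idle slot (all threads stalled)")

threads = [
    lambda c: 2 <= c < 5,   # thread 0 stalls cycles 2-4 (e.g., cache miss)
    lambda c: c == 4,       # thread 1 stalls only at cycle 4
    lambda c: c % 4 == 0,   # thread 2 stalls every 4th cycle
]
fine_grained(threads, cycles=8)  # cycle 4 shows the single idle slot
```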

  25. Fine-Grained Multithreading • Advantage: it can hide both short and long stalls, since instructions from other threads execute when one thread stalls • Disadvantage: it slows down the execution of individual threads, since a thread ready to execute without stalls is still delayed by instructions from other threads

  26. Coarse-Grained Multithreading • Switches threads only on costly stalls, such as L2 cache misses • Advantages • Relieves the need for very fast thread switching • Doesn't slow down a thread, since instructions from other threads issue only when that thread encounters a costly stall

  27. Coarse-Grained Multithreading • Disadvantage: it is hard to overcome throughput losses from shorter stalls, due to pipeline start-up costs • Since the CPU issues instructions from one thread, when a stall occurs the pipeline must be emptied or frozen • The new thread must fill the pipeline before its instructions can complete • Because of this start-up overhead, coarse-grained multithreading is better at reducing the penalty of high-cost stalls, where pipeline refill time << stall time
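The "pipeline refill << stall time" condition can be made concrete with a back-of-the-envelope cost model (all numbers invented for illustration):

```python
# Illustrative break-even check for coarse-grained multithreading:
# switching hides `stall` cycles but costs `refill` cycles to restart
# the pipeline for the new thread. Numbers are made up.
def cycles_saved(stall, refill):
    return stall - refill      # positive => switching was worth it

print(cycles_saved(stall=100, refill=15))  # long L2 miss: saves 85 cycles
print(cycles_saved(stall=5,   refill=15))  # short stall: loses 10 cycles
```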

  28. Multithreaded Categories (diagram legend: Threads 1-5)

  29. Multithreaded Categories (diagram panels: Superscalar, Fine-Grained, and Coarse-Grained with a 2-clock-cycle switch; vertical axis is time in processor cycles, colors mark Threads 1-5 and idle slots)

  30. Multiprocessing (diagram panel: Threads 1 and 2 running on separate processors)
