
Exploiting Instruction-Level Parallelism with Software Approaches

Overview: basic compiler techniques (pipeline scheduling, loop unrolling); static branch prediction; static multiple issue (VLIW); advanced compiler support for exposing ILP (detecting loop-level parallelism, software pipelining).


Presentation Transcript


    1. Chapter 4 Exploiting Instruction-Level Parallelism with Software Approaches

    2. Overview
       - Basic compiler techniques: pipeline scheduling, loop unrolling
       - Static branch prediction
       - Static multiple issue: VLIW
       - Advanced compiler support for exposing ILP: detecting loop-level parallelism, software pipelining (symbolic loop unrolling), global code scheduling
       - Hardware support for exposing more parallelism: conditional or predicated instructions, compiler speculation with hardware support
       - Hardware vs. software speculation mechanisms
       - Intel IA-64 ISA

    3. Review of Multi-issue Taxonomy

    4. Quote about IA-64 Architecture
       "One of the surprises about IA-64 is that we hear no claims of high frequency, despite claims that an EPIC processor is less complex than a superscalar processor. It's hard to know why this is so, but one can speculate that the overall complexity involved in focusing on CPI, as IA-64 does, makes it hard to get high megahertz." - M. Hopkins, 2000

    5. Basic Pipeline Scheduling
       - Goal: keep the pipeline full
       - Find sequences of unrelated instructions that can be overlapped
       - Separate each dependent instruction from its source instruction by at least the latency of the source instruction
       - Compiler success depends on the amount of ILP available and the latencies of the functional units

    6. Assumptions for Examples
       - Standard 5-stage integer pipeline plus a floating-point pipeline
       - Branches have a delay of 1 cycle
       - Integer load latency of 1 cycle; ALU latency of 0
       - Functional units are fully pipelined or replicated, so there are no structural hazards
       - Latencies between dependent FP instructions: [table not included in the transcript]

    7. Loop Example
       Add a scalar to an array:
           for (i=1000; i>0; i=i-1)
               x[i] = x[i] + s;
       The iterations of the loop are parallel: there are no dependencies between iterations.

    8. Straightforward Conversion
       - R1 holds the address of the highest array element
       - F2 holds the scalar
       - R2 is pre-computed so that 8(R2) is the last element
           loop: L.D    F0,0(R1)      ;F0 = array element
                 ADD.D  F4,F0,F2      ;add scalar in F2
                 S.D    F4,0(R1)      ;store result
                 DADDUI R1,R1,#-8     ;decrement pointer (DW)
                 BNE    R1,R2,loop    ;branch if R1 != R2
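As a rough illustration of how the source loop on slide 7 maps onto the register usage on slide 8, here is a minimal C sketch. The function names `add_scalar_indexed` and `add_scalar_pointer` are hypothetical, and the sketch assumes the array is used 1-based (`x[0]` is allocated but never touched) and that a `double` occupies 8 bytes, as the `#-8` decrement implies.

```c
#include <stddef.h>

/* Index form of the slide-7 loop: x[1..n] += s, counting down. */
void add_scalar_indexed(double *x, size_t n, double s) {
    for (size_t i = n; i > 0; i = i - 1)
        x[i] = x[i] + s;
}

/* Pointer form mirroring the MIPS loop: p plays R1, stop plays R2.
 * Assumes x has n + 1 slots, with x[0] unused. */
void add_scalar_pointer(double *x, size_t n, double s) {
    if (n == 0)
        return;               /* the MIPS loop always runs at least once */
    double *p = &x[n];        /* R1: address of the highest element */
    double *stop = &x[0];     /* R2: 8(R2) is x[1], the last element processed */
    do {
        *p = *p + s;          /* L.D, ADD.D, S.D */
        p = p - 1;            /* DADDUI R1,R1,#-8 (one double per step) */
    } while (p != stop);      /* BNE R1,R2,loop */
}
```

Both functions should leave the array in the same state. Because no iteration reads a value written by another, the order in which elements are visited does not affect the result.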

    9. Program in MIPS Pipeline
                                      Clock cycle issued
           loop: L.D    F0,0(R1)       1
                 Stall                 2
                 ADD.D  F4,F0,F2       3
                 Stall                 4
                 Stall                 5
                 S.D    F4,0(R1)       6
                 DADDUI R1,R1,#-8      7
                 Stall                 8
                 BNE    R1,R2,loop     9
                 Stall                10
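The issue cycles on slide 9 can be reproduced with a very small in-order scheduling model. The sketch below is an illustration under stated assumptions, not the deck's own method. `struct instr`, `schedule`, and `slide9_loop` are hypothetical names. Each instruction is given at most one producer, and the stall counts (1 after the load, 2 before the store, 1 before the branch) are read off the stall pattern on slide 9, because the latency table from slide 6 is not in the transcript.

```c
#include <stddef.h>

/* Hypothetical record for one instruction in a straight-line loop body. */
struct instr {
    const char *text;  /* assembly text, for display only */
    int dep;           /* index of the producing instruction, or -1 if none */
    int latency;       /* stall cycles required between producer and this use */
};

/* Fill issue[] with the 1-based clock cycle at which each instruction issues
 * and return the cycles per iteration, including branch-delay cycles. */
int schedule(const struct instr *code, size_t n, int branch_delay, int *issue) {
    int cycle = 0;
    for (size_t i = 0; i < n; i++) {
        int earliest = cycle + 1;  /* in-order, one instruction per cycle */
        if (code[i].dep >= 0) {
            /* the consumer must trail its producer by latency + 1 cycles */
            int ready = issue[code[i].dep] + code[i].latency + 1;
            if (ready > earliest)
                earliest = ready;
        }
        issue[i] = earliest;
        cycle = earliest;
    }
    return cycle + branch_delay;
}

/* The slide-9 loop body; latencies are inferred from its stalls. */
static const struct instr slide9_loop[] = {
    {"L.D    F0,0(R1)",   -1, 0},
    {"ADD.D  F4,F0,F2",    0, 1},  /* load double -> FP ALU op: 1 stall */
    {"S.D    F4,0(R1)",    1, 2},  /* FP ALU op -> store double: 2 stalls */
    {"DADDUI R1,R1,#-8",  -1, 0},
    {"BNE    R1,R2,loop",  3, 1},  /* integer ALU -> branch: 1 stall */
};
```

Calling `schedule(slide9_loop, 5, 1, issue)` with a five-element `issue` array gives issue cycles 1, 3, 6, 7, and 9, and a total of 10 cycles per iteration, matching the slide. Moving an independent instruction into a stall slot is modeled by reordering the array.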
