
Lecture 11: Memory Data Flow Techniques




Presentation Transcript


  1. Lecture 11: Memory Data Flow Techniques
Load/store buffer design, memory-level parallelism, consistency model, memory disambiguation

  2. Load/Store Execution Steps
Load: LW R2, 0(R1)
• Generate virtual address; may wait on the base register
• Translate the virtual address into a physical address
• Read the data cache
Store: SW R2, 0(R1)
• Generate virtual address; may wait on the base register and the data register
• Translate the virtual address into a physical address
• Write the data cache
Unlike register accesses, memory addresses are not known prior to execution
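The generate/translate/access steps above can be sketched as a toy model. This is a minimal sketch, not a real processor: the page table contents, page size, cache, and addresses are made up for illustration.

```python
PAGE_SIZE = 4096
page_table = {0: 7}   # virtual page 0 -> physical frame 7 (illustrative)
dcache = {}           # physical address -> data

def translate(vaddr):
    """Translate a virtual address into a physical address."""
    vpn, offset = divmod(vaddr, PAGE_SIZE)
    return page_table[vpn] * PAGE_SIZE + offset

def execute_store(base, offset, data):
    """SW: generate the address, translate it, then write the data cache."""
    paddr = translate(base + offset)
    dcache[paddr] = data

def execute_load(base, offset):
    """LW: generate the address, translate it, then read the data cache."""
    paddr = translate(base + offset)
    return dcache.get(paddr)

execute_store(base=0, offset=8, data=42)
print(execute_load(base=0, offset=8))   # -> 42
```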

  3. Load/Store Buffer in Tomasulo
• Support memory-level parallelism
• Loads wait in the load buffer until their address is ready; memory reads are then processed
• Stores wait in the store buffer until their address and data are ready; memory writes wait further until the stores are committed
[Diagram: IM and Fetch Unit feed Decode/Rename; the RS, S-buf, and L-buf sit between the Regfile/Reorder Buffer and FU1, FU2, and DM]
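The store-buffer rule above (address and data ready, plus commit, before the write reaches memory) can be sketched as follows; the `Store` class and its fields are illustrative, not from any real design:

```python
class Store:
    """One store-buffer entry (illustrative sketch)."""
    def __init__(self):
        self.addr = None        # filled in after address generation
        self.data = None        # filled in when the data register is read
        self.committed = False  # set when the ROB commits the store

    def ready_to_write(self):
        # A memory write needs the address, the data, AND commit:
        # until commit the store is speculative and must not reach memory.
        return self.addr is not None and self.data is not None and self.committed

s = Store()
s.addr, s.data = 0x40, 5
assert not s.ready_to_write()   # still speculative: wait for commit
s.committed = True
assert s.ready_to_write()       # now the write may proceed to memory
```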

  4. Load/Store Unit with Centralized RS
• The centralized RS includes part of the load/store buffer in Tomasulo
• Loads and stores wait in the RS until they are ready
[Diagram: IM and Fetch Unit feed Decode/Rename; the centralized RS issues to the S-unit, L-unit, FU1, and FU2; the S-unit sends addr/data to a store buffer in front of the cache]

  5. Memory-level Parallelism
for (i=0;i<100;i++) A[i] = A[i]*2;
Loop: L.S F2, 0(R1)
      MULT F2, F2, F4
      SW F2, 0(R1)
      ADD R1, R1, 4
      BNE R1, R3, Loop
(F4 stores 2.0)
Overlapping the loads and stores of different iterations (LW1 LW2 LW3 ... SW1 SW2 SW3) is a significant improvement over sequential reads/writes

  6. Memory Consistency
• Memory contents must be the same as under sequential execution
• Must respect RAW, WAW, and WAR dependences
Practical implementations:
• Reads may proceed out-of-order
• Writes proceed to memory in program order
• Reads may bypass earlier writes only if their addresses are different

  7. Store Stages in Dynamic Execution
• Wait in the RS until the base address and store data are available (ready)
• Move to the store unit for address calculation and address translation
• Move to the store buffer (finished)
• Wait for ROB commit (completed)
• Write to the data cache (retired)
Stores always retire in order, to respect WAW and WAR dependences
[Diagram: the RS feeds the store unit and load unit; finished and completed store-buffer entries drain to the D-cache]
Source: Shen and Lipasti, page 197

  8. Load Bypassing and Memory Disambiguation
• To exploit memory parallelism, loads have to bypass older stores, but this may violate RAW dependences
• Dynamic memory disambiguation: dynamic detection of memory dependences by comparing the load address with every older store address
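The disambiguation check above is just an associative compare of the load address against all older store addresses; a minimal sketch (store-buffer layout and addresses are illustrative):

```python
def load_may_proceed(load_addr, store_buffer):
    """Dynamic memory disambiguation: compare the load address with every
    older store address; the load may bypass only if nothing matches."""
    return all(st_addr != load_addr for st_addr, _ in store_buffer)

# Older stores, oldest first, as (address, data) pairs.
older_stores = [(0x100, 1), (0x200, 2)]

assert load_may_proceed(0x300, older_stores)        # no match: may bypass
assert not load_may_proceed(0x200, older_stores)    # RAW hazard: must wait
```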

  9. Load Bypassing Implementation
Assume in-order execution of loads/stores:
1. address calculation
2. address translation
3. if no match, update the destination register
Associative search of the store buffer for a matching address
[Diagram: an in-order RS feeds the load unit and store unit; the store buffer's addr/data entries are searched associatively before the load reads the D-cache]

  10. Load Forwarding
• If a load address matches an older store address, the store's data can be forwarded
• If a match is found, forward the related data to the destination register (in the ROB)
• Multiple matches may exist; the last (youngest) one wins
[Diagram: as in load bypassing, but on a match the store-buffer data is sent to the dest. register instead of reading the D-cache]
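The "last one wins" rule above can be shown by scanning the store buffer oldest-first and letting later matches overwrite earlier ones (buffer contents are illustrative):

```python
def forward(load_addr, store_buffer):
    """Load forwarding: if one or more older stores match the load
    address, the youngest (last) matching store supplies the data."""
    data = None
    for st_addr, st_data in store_buffer:   # oldest first
        if st_addr == load_addr:
            data = st_data                  # a later match overwrites an earlier one
    return data                             # None means: no match, read the cache

buf = [(0x100, 1), (0x200, 2), (0x100, 3)]  # two stores to 0x100
assert forward(0x100, buf) == 3             # last match wins
assert forward(0x300, buf) is None          # no match: read the D-cache
```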

  11. In-order Issue Limitation
Any store in the RS may block all following loads
for (i=0;i<100;i++) A[i] = A[i]/2;
Loop: L.S F2, 0(R1)
      DIV F2, F2, F4
      SW F2, 0(R1)
      ADD R1, R1, 4
      BNE R1, R3, Loop
When is F2 of the SW available? When is the next L.S ready? Assume reasonable FU latency and pipeline length

  12. Speculative Load Execution
• Forwarding does not always work if some store addresses are unknown
• No match: predict the load has no RAW dependence on older stores
• Check again at completion; if a match is then found, flush the pipeline at commit
[Diagram: an out-of-order RS feeds the load unit and store unit; loads read the D-cache and enter a finished-load buffer, whose entries are matched against store addresses at completion]
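The predict-then-verify step above can be sketched as follows; this toy function (its name and arguments are illustrative) reads the cache speculatively and reports whether the no-RAW prediction turned out wrong:

```python
def speculative_load(load_addr, dcache, older_store_addrs_at_completion):
    """Predict the load has no RAW dependence on older stores and read the
    cache early; re-check at completion, and signal a flush if wrong."""
    value = dcache.get(load_addr)       # speculative read, before older
                                        # store addresses are all known
    mispredicted = load_addr in older_store_addrs_at_completion
    return value, mispredicted          # mispredicted -> flush at commit

dcache = {0x80: 9}
val, flush = speculative_load(0x80, dcache, older_store_addrs_at_completion=set())
assert val == 9 and not flush           # prediction correct: keep the value
val, flush = speculative_load(0x80, dcache, {0x80})
assert flush                            # an older store matched: flush the pipeline
```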

  13. Alpha 21264 Pipeline

  14. Alpha 21264 Load/Store Queues
32-entry load queue (LQ), 32-entry store queue (SQ)
[Diagram: int and fp issue queues feed two AddrALUs, two IntALUs, and two FPALUs; Int RF (80 entries, duplicated) and FP RF (72 entries); the AddrALUs access the D-TLB, LQ, SQ, and a dual-ported D-cache]

  15. Load Bypassing, Forwarding, and RAW Detection
• Load: wait if the LQ head is not completed, then move the LQ head
• Store: mark the SQ head as completed, then move the SQ head
• If a load matches an older SQ entry: forward the store data
• If a store matches a younger, already-performed load: mark a store-load trap to flush the pipeline (at commit)
[Diagram: the issue queues feed the LQ and SQ; both are checked against the D-cache access, and matches are resolved at ROB commit]

  16. Speculative Memory Disambiguation
• When a load is trapped at commit, set the stWait bit in a 1024-entry, 1-bit table, indexed by the load's PC
• When the load is fetched, it reads its stWait bit from the table
• If the bit is set, the load waits in the issue queue until older stores are issued
• The stWait table is cleared periodically
[Diagram: the fetch PC indexes the 1024 1-bit-entry stWait table; renamed instructions enter the int issue queue; load forwarding proceeds as before]
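The stWait mechanism above can be sketched as a small predictor table; the index function (a simple PC hash) is an assumption for illustration, not the 21264's actual indexing:

```python
TABLE_SIZE = 1024
stwait = [0] * TABLE_SIZE            # 1024 one-bit entries

def index(pc):
    # Hypothetical hash: drop the instruction-alignment bits, then wrap.
    return (pc >> 2) % TABLE_SIZE

def on_load_trap(pc):
    """A load was trapped at commit: set its stWait bit."""
    stwait[index(pc)] = 1

def must_wait(pc):
    """At fetch, the load reads its stWait bit from the table."""
    return stwait[index(pc)] == 1

pc = 0x4004
assert not must_wait(pc)             # first encounter: speculate freely
on_load_trap(pc)                     # mispredicted once -> trap at commit
assert must_wait(pc)                 # now it waits until older stores issue
stwait[:] = [0] * TABLE_SIZE         # the table is cleared periodically
```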

  17. Architectural Memory States
A memory request searches the hierarchy from top to bottom:
• Completed entries in the LQ/SQ
• Committed states: L1-Cache, L2-Cache, L3-Cache (optional), Memory, Disk, tape, etc.
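The top-to-bottom search above can be sketched as follows; the hierarchy is a list of levels, youngest data first, and the contents are made up for illustration:

```python
def read(addr, hierarchy):
    """Search the memory hierarchy from top to bottom; the first level
    that holds the address supplies the data."""
    for level in hierarchy:          # e.g. [SQ entries, L1, L2, memory]
        if addr in level:
            return level[addr]
    return None

sq  = {0x10: 1}                      # completed store-queue entries
l1  = {0x10: 0, 0x20: 2}             # committed state in the L1 cache
mem = {0x10: 0, 0x20: 0, 0x30: 3}    # committed state in memory

assert read(0x10, [sq, l1, mem]) == 1   # the SQ entry wins over the caches
assert read(0x30, [sq, l1, mem]) == 3   # falls through to memory
```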

  18. Summary of Superscalar Execution
• Instruction flow techniques: branch prediction, branch target prediction, and instruction prefetch
• Register data flow techniques: register renaming, instruction scheduling, in-order commit, mis-prediction recovery
• Memory data flow techniques: load/store units, memory consistency
Source: Shen & Lipasti
