Decoupled Architecture for Data Prefetching
This presentation explores the design and evaluation of a decoupled architecture aimed at improving data prefetching in computing systems. We address the critical processor-memory performance gap and investigate the potential benefits of utilizing a dedicated coprocessor for prefetching tasks. The results demonstrate that a Prefetching CoProcessor (PCP) can enhance performance, tolerate delays, and adapt to various prefetching schemes. Our findings highlight the importance of correctly integrating multiple algorithms to optimize memory performance while managing overhead effectively.
Presentation Transcript
Decoupled Architecture for Data Prefetching • Jichuan Chang <chang@cs.wisc.edu> • Kai Xu <xuk@cs.wisc.edu> • CS752
Outline • Motivation • Design and Evaluation • Results • Conclusions
Motivation • Processor-memory performance gap • Prefetching helps, but it has overhead. • Transistors are cheap; will a coprocessor help? • [Diagram: Main Processor, Prefetching CoProcessor, and Cache, with flows labeled Info Flow, Prefetch Requests, and Data, and an L1-L2 Internal Bus]
Why a dedicated coprocessor? • Simple • It simplifies the design of the main processor. • Powerful • It can (hopefully) exploit complex algorithms; • It absorbs the computation overhead (i.e. pattern computation, address computation). • Flexible • It can (hopefully) adapt to different situations; • It can implement different algorithms. • But are these claims true?
The Generic Design • [Diagram: Main Processor; Prefetching CoProcessor with an ALU and Tables (RPT, PPW, CT, History, …); Cache; Stream Buffer; flows labeled Info Flow, Prefetch Requests, and Data Bus] • What? When? Where?
Data Prefetching Techniques • Regular Access Prefetching • Tagged Next Block Lookahead [Smith 82] • Exploits sequential access patterns • Stride Prefetching [Baer & Chen 91] • Exploits stride access patterns • Dependency-based Prefetching [Roth, et al 98] • Discovers Linked-Data-Structure access patterns • Dead Block Correlation [Lai, et al 01] • History-based correlation prediction • Stream Buffer [Jouppi 90] • Reduces cache pollution
Simulation Settings • SimpleScalar v3.0 • Modified sim-outorder to implement • Information sharing between MP and PCP. • Modified cache module to implement • Prefetching schemes (between L1 and L2 cache), • Prefetch queue (len = 16); bus sharing/contention, • Stream buffer. • Memory Parameters • L1 Data Cache: 4KB, 32B line, 4-way associative; • L2 Cache: 64KB, 64B line, 4-way associative; • Stream buffer: 8 entries, fully associative, 1 cycle hit; • Hit latency (cycles): L1 = 1, L2 = 12, Mem = 70 (2*); • Pipelined bus: bus contention and latency are modeled.
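The memory parameters above fix the cache geometry, and working it out shows how small the simulated L1 is. The following is a minimal sketch, not part of the original simulator: the `CacheGeometry` class and its method names are hypothetical, and only the sizes, line lengths, and associativities come from the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CacheGeometry:
    """Hypothetical helper: derives set count and address bits from cache parameters."""
    size_bytes: int
    line_bytes: int
    ways: int

    @property
    def num_sets(self) -> int:
        # Total lines divided by lines per set.
        return self.size_bytes // (self.line_bytes * self.ways)

    @property
    def offset_bits(self) -> int:
        # Assumes line_bytes is a power of two.
        return self.line_bytes.bit_length() - 1

    @property
    def index_bits(self) -> int:
        # Assumes num_sets is a power of two.
        return self.num_sets.bit_length() - 1

    def set_index(self, addr: int) -> int:
        # Drop the byte offset, then keep the low index bits.
        return (addr >> self.offset_bits) % self.num_sets


# Parameters taken from the slide.
L1D = CacheGeometry(size_bytes=4 * 1024, line_bytes=32, ways=4)
L2 = CacheGeometry(size_bytes=64 * 1024, line_bytes=64, ways=4)

print(L1D.num_sets, L1D.offset_bits, L1D.index_bits)
print(L2.num_sets, L2.offset_bits, L2.index_bits)
```

With these numbers the L1 data cache has 4096 / (32 × 4) = 32 sets and the L2 has 65536 / (64 × 4) = 256 sets, so an L2 line covers two L1 lines.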
Benchmarks • From SPEC95 • gcc • compress • swim • tomcatv • Microbenchmarks • Matrix multiplication (128 × 128 double) • Binary tree (1M nodes, similar to treeadd)
Results (IPC)
Results (Miss Ratio)
L1-L2 Traffic Increase
Results (Delay Tolerance) • How many cycles of delay can the PCP tolerate? • More delay means • Less useful prefetches (data can’t get back before the demand references) • More pollution (due to outdated information) • Fewer prefetches (due to bus contention) • To avoid pollution, implement the prefetch queue as a circular buffer. • Overwrite outdated entries when the queue is full. • The major effect of a larger delay is then fewer prefetches. • Hard to model memory behavior in SimpleScalar • Predetermined latency, no wake-up, no MSHR.
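The circular-buffer prefetch queue described above can be sketched as follows. This is an illustrative sketch only, not the modified SimpleScalar code: the `PrefetchQueue` class, its method names, and the `overwritten` counter are assumptions. The slides give only the queue length (16) and the overwrite-when-full policy.

```python
class PrefetchQueue:
    """Fixed-size circular buffer of pending prefetch block addresses (hypothetical)."""

    def __init__(self, capacity: int = 16):
        self.slots = [None] * capacity
        self.head = 0          # index of the oldest pending request
        self.count = 0         # number of valid entries
        self.overwritten = 0   # how many stale requests were dropped

    def push(self, block_addr: int) -> None:
        cap = len(self.slots)
        tail = (self.head + self.count) % cap
        self.slots[tail] = block_addr
        if self.count == cap:
            # Queue was full: the write above replaced the oldest entry,
            # so the next-oldest entry becomes the new head.
            self.head = (self.head + 1) % cap
            self.overwritten += 1
        else:
            self.count += 1

    def pop(self):
        """Return the oldest pending request, or None if the queue is empty."""
        if self.count == 0:
            return None
        addr = self.slots[self.head]
        self.head = (self.head + 1) % len(self.slots)
        self.count -= 1
        return addr


# Small capacity so the overwrite behavior is easy to see.
q = PrefetchQueue(capacity=4)
for block in range(6):
    q.push(block)
drained = [q.pop() for _ in range(4)]
print(drained, q.overwritten)
```

Because stale requests are dropped instead of being issued late, a slow path to the bus shows up as fewer prefetches, not as extra cache pollution.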
Delay tolerance • Preliminary result • For almost all schemes on all benchmarks, the PCP can tolerate 8 cycles of delay.
Can we integrate different schemes? • Different applications need different schemes • Brute-force approach • Use both tagged and stride prefetching • Good speedup, but much more memory traffic. • Adapt prefetching policy dynamically? • Share the same hardware table • Using similar matching schemes • Hard to reconfigure/flush on context switches • Use separate tables • More hardware • Similar to a tournament predictor (just a thought)
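The tournament-predictor idea is only floated as a thought in the slides, so the following is a speculative sketch of what such a selector could look like, not the authors' design. The `TournamentChooser` class, the 2-bit saturating counter, and the training signal (which scheme would have covered the latest miss) are all assumptions made for illustration.

```python
class TournamentChooser:
    """Hypothetical saturating-counter selector between two prefetch schemes."""

    def __init__(self, bits: int = 2):
        self.max_value = (1 << bits) - 1
        # Start just above the midpoint: weakly prefer the stride scheme.
        self.counter = self.max_value // 2 + 1

    def select(self) -> str:
        # Upper half of the counter range favors stride, lower half favors tagged.
        return "stride" if self.counter > self.max_value // 2 else "tagged"

    def update(self, tagged_correct: bool, stride_correct: bool) -> None:
        # Train only when the two schemes disagree, as a tournament
        # branch predictor does.
        if stride_correct and not tagged_correct:
            self.counter = min(self.max_value, self.counter + 1)
        elif tagged_correct and not stride_correct:
            self.counter = max(0, self.counter - 1)


chooser = TournamentChooser()
first_choice = chooser.select()
# Two misses in a row that only the tagged scheme would have covered.
chooser.update(tagged_correct=True, stride_correct=False)
chooser.update(tagged_correct=True, stride_correct=False)
second_choice = chooser.select()
print(first_choice, second_choice)
```

Issuing only the selected scheme's prefetches is one way to avoid the extra memory traffic of the brute-force combination, at the cost of keeping both tables trained.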
Conclusions • PCP helps performance! (2-30% speedup) • PCP handles prefetching and can tolerate some delay. • Different schemes work for different applications • They require different information (from different places); • PCP should be placed close to the info source; • Not easy to integrate different schemes. • Limitations of our approach • PCP not fully utilized. • Relies on tables (caches/queues/buffers) • DBCP requires a large history table (7.6 M memory)! • Delay is critical to performance • It limits the complexity of prefetch schemes, • It also determines where to place the PCP.
Future Work • Evaluate more prefetching schemes • Dependency-based prefetching, etc. • PCP Running Ahead • Probably with the help of a trace cache; • To fully utilize the PCP; • Needs checkpoint/rollback mechanisms. • CoProcessor to Support Other Functionality • Branch prediction, power management. • PCP for Multiprocessors • Suitable for One-Block-Lookahead. • Needs changes to the cache coherence (CC) protocol.
Thank You! Questions?
Backup Slides Gauges
Tagged Prefetching
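The slide's figure is not in the transcript, so here is a minimal sketch of tagged next-block lookahead as it is commonly described: each block carries a tag bit that is clear while the block has only been prefetched, and a demand miss or the first demand reference to a prefetched block triggers a prefetch of the next block. The `TaggedPrefetcher` class is hypothetical, and it models an unbounded cache (no evictions) to keep the example short.

```python
class TaggedPrefetcher:
    """Hypothetical model of tagged next-block prefetching with no evictions."""

    def __init__(self, line_bytes: int = 32):
        self.line_bytes = line_bytes
        # block number -> True once the block has been demand-referenced,
        # False while it has only been prefetched.
        self.tag = {}

    def access(self, addr: int) -> list:
        """Process one demand reference; return the block numbers to prefetch."""
        block = addr // self.line_bytes
        if block not in self.tag:
            # Demand miss: fetch the block and look one block ahead.
            self.tag[block] = True
            return self._prefetch(block + 1)
        if not self.tag[block]:
            # First demand hit on a prefetched block: set the tag and look ahead.
            self.tag[block] = True
            return self._prefetch(block + 1)
        # Block was already referenced: no new prefetch.
        return []

    def _prefetch(self, block: int) -> list:
        if block in self.tag:
            return []          # already present, nothing to issue
        self.tag[block] = False
        return [block]


p = TaggedPrefetcher(line_bytes=32)
trace = [0, 4, 32, 36, 64]
issued = [p.access(a) for a in trace]
print(issued)
```

On a sequential walk the prefetcher stays one block ahead: each first touch of a new block issues exactly one prefetch, and repeated touches of the same block issue none.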
Stride Prefetching • Reference Prediction Table (RPT) • Organized like a cache, indexed by PC • Each entry: (data address, stride, state) • State machine
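The state-machine figure is missing from the transcript, so the following sketch fills it in with the four-state scheme usually attributed to Baer and Chen (initial, transient, steady, no-prediction). The class names, the direct-mapped 64-entry table, and the choice to issue a prefetch only in the steady state are assumptions of this sketch, not details taken from the slides.

```python
from dataclasses import dataclass

INITIAL, TRANSIENT, STEADY, NO_PRED = "initial", "transient", "steady", "no-pred"

# (current state, was the stride prediction correct?) -> next state
NEXT_STATE = {
    (INITIAL, True): STEADY,      (INITIAL, False): TRANSIENT,
    (TRANSIENT, True): STEADY,    (TRANSIENT, False): NO_PRED,
    (STEADY, True): STEADY,       (STEADY, False): INITIAL,
    (NO_PRED, True): TRANSIENT,   (NO_PRED, False): NO_PRED,
}


@dataclass
class RPTEntry:
    pc: int                # tag: address of the load/store instruction
    prev_addr: int         # last data address it referenced
    stride: int = 0
    state: str = INITIAL


class ReferencePredictionTable:
    """Hypothetical direct-mapped RPT, indexed by PC."""

    def __init__(self, entries: int = 64):
        self.table = [None] * entries

    def access(self, pc: int, addr: int):
        """Record one memory reference; return a prefetch address or None."""
        idx = pc % len(self.table)
        entry = self.table[idx]
        if entry is None or entry.pc != pc:
            # RPT miss: allocate a fresh entry, no prediction yet.
            self.table[idx] = RPTEntry(pc=pc, prev_addr=addr)
            return None
        correct = addr == entry.prev_addr + entry.stride
        # The stride is recomputed on a misprediction, except when leaving
        # the steady state (one irregular access should not destroy it).
        if not correct and entry.state != STEADY:
            entry.stride = addr - entry.prev_addr
        entry.state = NEXT_STATE[(entry.state, correct)]
        entry.prev_addr = addr
        # Assumption of this sketch: prefetch only once the stride is confirmed.
        return addr + entry.stride if entry.state == STEADY else None


rpt = ReferencePredictionTable()
load_pc = 0x400
predictions = [rpt.access(load_pc, a) for a in (1000, 1008, 1016, 1024)]
print(predictions)
```

In this trace the first access allocates the entry, the second learns the stride of 8, and the third and fourth confirm it, so prefetches are issued for the next element in the sequence from the third access onward.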
Dependency-based Prefetching • Potential Producer Window • Correlation Table • One Step Ahead • Jump Pointer Generation/Maintenance