
Access Region Locality for High-Bandwidth Processor Memory System Design

32nd Annual International Symposium on Microarchitecture. Sangyeun Cho (Samsung/U of Minnesota), Pen-Chung Yew (U of Minnesota), Gyungho Lee (Iowa State U).





Presentation Transcript


  1. 32nd Annual International Symposium on Microarchitecture • Access Region Locality for High-Bandwidth Processor Memory System Design • Sangyeun Cho (Samsung/U of Minnesota), Pen-Chung Yew (U of Minnesota), Gyungho Lee (Iowa State U)

  2. Big Picture Cho, Yew, and Lee

  3. On-Chip D-Cache Bandwidth Problem

  4. Wide-Issue Superscalar Processors • Current Generation: Alpha 21264, Intel’s Merced • Future Generation (IEEE Computer, Sept. ’97): Superspeculative Processors, Trace Processors

  5. Multi-Ported Data Cache • Replicated Cache: Alpha 21164 • Time-Division Multiplexed Cache: Alpha 21264 • Interleaved Cache: MIPS R10K

  6. Window Logic Complexity • Identified as the major source of hardware complexity (Palacharla et al., ISCA ’97) • More severe for the memory window: difficult to partition, and a thick network is needed to connect reservation stations (RSs) and load/store units (LSUs)

  7. Data Decoupling

  8. Data Decoupling: What is it? • A divide-and-conquer approach • Instruction stream is partitioned before entering the RS • Narrower networks • Fewer ports to each cache • Needs a mechanism for proper partitioning

  9. Data Decoupling: Operating Issues • Memory Stream Partitioning: hardware classification, compiler classification • Load Balancing: Are there enough instructions in different groups? Are they well interleaved?

  10. Access Region Locality & Access Region Prediction

  11. Access Region: Overview • An access region R is a pair R = (L, U), where L is the lower bound and U the upper bound on addresses. • For regions R = (A, B) and Q = (C, D), if (D < A) or (B < C), R and Q are said to be exclusive, or non-overlapping. • Locations in exclusive regions are independent.
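
The exclusivity test on slide 11 can be sketched in a few lines of Python. This is only an illustration: the `Region` type, the `exclusive` helper, and the example address ranges are assumptions made here, not something taken from the slides.

```python
from typing import NamedTuple


class Region(NamedTuple):
    lower: int  # L: lower bound on addresses in the region
    upper: int  # U: upper bound on addresses in the region


def exclusive(r: Region, q: Region) -> bool:
    """Return True when r = (A, B) and q = (C, D) satisfy D < A or B < C."""
    return q.upper < r.lower or r.upper < q.lower


# Hypothetical address ranges, chosen only for illustration.
data_region = Region(0x10000000, 0x1000FFFF)
stack_region = Region(0x7FFF0000, 0x7FFFFFFF)

print(exclusive(data_region, stack_region))  # the two ranges share no address
```

Two regions that share even a single boundary address are not exclusive under this test, because both comparisons are strict.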

  12. Access Region and Memory Instructions

  13. Partitioning Memory Space • One way of partitioning memory space into regions: Data Region / Heap Region / Stack Region • This work assumes this partitioning.

  14. Partitioning Memory Space, Cont’d • Many accesses are toward the Data and Stack regions. • Some programs don’t access the Heap region at all. [Chart omitted; values in %]

  15. Partitioning Memory Space, Cont’d • Accesses to the Data region are less bursty than accesses to the other regions. • Programs such as ijpeg have clustered region accesses. • Window Size = 32

  16. Partitioning Memory Space, Cont’d • With a large window, Stack accesses become less bursty. • Data and Stack regions have quite stable, constant demand. • Window Size = 64

  17. Partitioning Memory Space, Cont’d • Many instructions access a single region (~98%). • Multi-region-accessing instructions account for 0 ~ 9.6% of dynamic memory references. [Chart omitted; benchmarks: go, m88ksim, gcc, compress, li, ijpeg, perl, vortex, tomcatv, swim, su2cor, mgrid, Int.Avg, FP.Avg; data labels as extracted: 1.8%, 1.9%, 50.4%, 51.1%, 1.6%, 16.2%, 45.4%, 31.6%]

  18. Access Region Locality • “A memory reference instruction typically accesses a single region at run time” • Only about 2% of all static memory instructions access more than a single region. • “(Thus) the region it accesses is highly predictable” • Simple predictors with a small look-up table achieve high prediction accuracy.

  19. Predicting Regions: Unlimited Case • One predictor per memory instruction • Predictor types: 1-bit history saver (0: Data, 1: Stack); 2-bit saturating counter
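
The two predictor types on slide 19 can be sketched as follows. This is a minimal sketch under stated assumptions: the class names, the initial state of 0, the threshold of 2 for the saturating counter, and the `accuracy` helper are all choices made here for illustration and are not taken from the slides.

```python
DATA, STACK = 0, 1  # region encoding from the slide: 0 = Data, 1 = Stack


class OneBitPredictor:
    """Remembers the region of the most recent access."""

    def __init__(self) -> None:
        self.bit = DATA  # assumed initial state

    def predict(self) -> int:
        return self.bit

    def update(self, actual: int) -> None:
        self.bit = actual


class TwoBitPredictor:
    """2-bit saturating counter; values 2 and 3 predict Stack (assumed threshold)."""

    def __init__(self) -> None:
        self.counter = 0  # assumed initial state

    def predict(self) -> int:
        return STACK if self.counter >= 2 else DATA

    def update(self, actual: int) -> None:
        if actual == STACK:
            self.counter = min(3, self.counter + 1)
        else:
            self.counter = max(0, self.counter - 1)


def accuracy(predictor, trace) -> float:
    """Fraction of accesses in the trace whose region was predicted correctly."""
    hits = 0
    for actual in trace:
        hits += predictor.predict() == actual
        predictor.update(actual)
    return hits / len(trace)


# A made-up trace for one instruction that always touches the stack.
trace = [STACK, STACK, STACK, STACK]
print(accuracy(OneBitPredictor(), trace), accuracy(TwoBitPredictor(), trace))
```

In the unlimited case, one such predictor object would be kept per static memory instruction, for example in a dictionary keyed by the instruction's PC.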

  20. Predicting Regions: Adding Context • Run-time context: Caller’s ID (CID), held in the Link Register; Global Branch History (GBH); a hybrid of the two

  21. Predicting Regions: Utilizing Static Info. • Some instructions’ access regions are revealed through architecture and compiler conventions: • Use of the Stack Pointer ($SP) or Frame Pointer ($FP) suggests that the region is Stack. • Use of the Global Pointer ($GP) suggests that the region is non-Stack. • For others, assume non-Stack. • Directly exporting some high-level region information from compiler to processor may improve prediction accuracy.
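
The static rule on slide 21 amounts to a lookup on the base register of a load or store. The sketch below assumes the base register is available as a lowercase name string; the function name `static_region` and that string encoding are hypothetical.

```python
NON_STACK, STACK = 0, 1


def static_region(base_register: str) -> int:
    """Guess the access region from the base register alone."""
    if base_register in ("$sp", "$fp"):
        return STACK  # stack or frame pointer suggests a Stack access
    # $gp suggests non-Stack; every other register is assumed non-Stack too.
    return NON_STACK


for reg in ("$sp", "$fp", "$gp", "$t0"):
    print(reg, static_region(reg))
```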

  22. Region Pred. Result: Unlimited Case • 1-bit predictors do better than 2-bit predictors (not shown). • Hybrid context bits achieve the best prediction rate on average. [Chart omitted; series: Static, Simple 1-bit, w/ CID, w/ GBH, w/ Hybrid; benchmarks: go, m88ksim, gcc, compress, li, ijpeg, perl, vortex, tomcatv, swim, su2cor, mgrid, Int.Avg, FP.Avg]

  23. Predicting Regions: Limited-Size ARPT • The low n bits of the PC, XORed with the hybrid context bits, are used to index into the Access Region Prediction Table (ARPT): • Table entries are initialized to 0 • 1 denotes a stack access • Decoding information is exploited to save ARPT space
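
The indexing scheme on slide 23 can be sketched as a small table of 1-bit entries. The slides do not say how the hybrid context bits are formed, so the `context` argument below is simply passed in as an integer; the class layout and method names are likewise assumptions made for illustration.

```python
class ARPT:
    """Access Region Prediction Table with 2**index_bits one-bit entries."""

    def __init__(self, index_bits: int) -> None:
        self.mask = (1 << index_bits) - 1
        self.table = [0] * (1 << index_bits)  # entries initialized to 0

    def index(self, pc: int, context: int) -> int:
        # Low n bits of the PC XORed with the context bits.
        return (pc ^ context) & self.mask

    def predict(self, pc: int, context: int) -> int:
        return self.table[self.index(pc, context)]  # 1 denotes a stack access

    def update(self, pc: int, context: int, actual: int) -> None:
        self.table[self.index(pc, context)] = actual


arpt = ARPT(index_bits=4)
arpt.update(0x40, 0b0101, 1)      # record a stack access
print(arpt.predict(0x40, 0b0101))  # same PC and context hit the same entry
```

Because the table is finite, different (PC, context) pairs can map to the same entry; that aliasing is the cost of a limited-size table compared with the unlimited case.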

  24. Region Prediction Result: ARPT • Over 99.9% accuracy with a 4 KB or larger ARPT without compiler hints. • Compiler hints relieve pressure due to smaller sizes. [Chart omitted; series: 1 KB, 2 KB, 4 KB, 8 KB, Unlimited; benchmarks: go, m88ksim, gcc, compress, li, ijpeg, perl, vortex, tomcatv, swim, su2cor, mgrid, Int.Avg, FP.Avg]

  25. Dynamic Data Decoupling

  26. Dynamic Data Decoupling

  27. Dynamic Data Decoupling, Cont’d • Dynamically predict access regions to classify memory instructions: • Utilize the Access Region Prediction Table (ARPT). • Utilize any region information revealed through instruction decoding. • Dispatch the partitioned memory instructions into separate memory pipelines, connected to separate caches. • Dynamically verify the region prediction: • Let the TLB (i.e., page table) contain verification information so that a memory access is reissued on a misprediction.
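
The predict, dispatch, and verify flow on slide 27 can be sketched with a toy model. Everything concrete here is an assumption: the 4 KB page size, the dictionary standing in for the per-page region information in the TLB, and the `execute` function that returns the list of pipelines an access was issued to.

```python
NON_STACK, STACK = 0, 1
PAGE_BITS = 12  # assumed 4 KB pages


def execute(address: int, predicted: int, page_region: dict) -> list:
    """Issue an access to the predicted pipeline; reissue if the page disagrees."""
    actual = page_region[address >> PAGE_BITS]  # verification info per page
    issued = [predicted]
    if actual != predicted:
        issued.append(actual)  # misprediction: reissue to the correct pipeline
    return issued


# Hypothetical page numbers mapped to their regions.
page_region = {0x7FFFF: STACK, 0x10000: NON_STACK}

print(execute(0x7FFFF010, STACK, page_region))  # correct prediction, one issue
print(execute(0x10000020, STACK, page_region))  # mispredicted, issued twice
```

The sketch only models where an access ends up; it does not model timing, so the cost of a reissue is not represented.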

  28. Base Machine Model

  29. Overall Performance • Over the (2+0) configuration. [Chart omitted; benchmarks: go, m88ksim, gcc, compress, li, ijpeg, perl, vortex, tomcatv, swim, su2cor, mgrid, Int.Avg, FP.Avg]

  30. Conclusions • Access Region Locality says: • Memory instructions access few regions at run time. • Accessed regions are accurately predictable. • Access Region Locality leads to Access Region Prediction techniques. • Access Region Prediction allows Dynamic Data Decoupling, shown to achieve performance comparable to very wide data caches.

  31. Now Any Questions?

  32. Impact of LVC Size • 2KB and 4KB LVCs achieve high hit rates (~99.9%). • Set associativity is less important if the LVC is 2KB or more. • A small, simple LVC works well. [Chart omitted; LVC sizes: 0.5K, 1K, 2K, 4K]
