Access Region Locality for High-Bandwidth Processor Memory System Design

32nd Annual International Symposium on Microarchitecture
Sangyeun Cho (Samsung/U of Minnesota), Pen-Chung Yew (U of Minnesota), Gyungho Lee (Iowa State U)

Big Picture Cho, Yew, and Lee

On-Chip D-Cache Bandwidth Problem

Wide-Issue Superscalar Processors
• Current Generation
  • Alpha 21264
  • Intel’s Merced
• Future Generation (IEEE Computer, Sept. ‘97)
  • Superspeculative Processors
  • Trace Processors

Multi-Ported Data Cache
• Replicated Cache
  • Alpha 21164
• Time-Division Multiplexed Cache
  • Alpha 21264
• Interleaved Cache
  • MIPS R10K

Window Logic Complexity
• Pointed out as the major hardware complexity (Palacharla et al., ISCA ‘97)
• More severe for the memory window
  • Difficult to partition
  • Thick network needed to connect RSs and LSUs

Data Decoupling

Data Decoupling: What is it?
• A divide-and-conquer approach
• Instruction stream partitioned before entering RS
  • Narrower networks
  • Fewer ports to each cache
• Needs a mechanism for proper partitioning

Data Decoupling: Operating Issues
• Memory Stream Partitioning
  • Hardware classification
  • Compiler classification
• Load Balancing
  • Enough instructions in different groups?
  • Are they well interleaved?

Access Region Locality & Access Region Prediction

Access Region: Overview
• Access Region R
  • R = (L, U)
  • L: Lower Bound on Addr.
  • U: Upper Bound on Addr.
• For regions R = (A, B) and Q = (C, D): if (D < A) or (B < C), R and Q are said to be exclusive, or non-overlapping.
• Locations in exclusive regions are independent.
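
The exclusivity test above can be sketched in a few lines. This is only an illustration: it assumes the two regions are R = (A, B) and Q = (C, D) with inclusive bounds, and the `Region` type and `exclusive` function are names invented here, not taken from the talk.

```python
from typing import NamedTuple


class Region(NamedTuple):
    """An access region (L, U); bounds are assumed to be inclusive."""
    lower: int  # L: lower bound on address
    upper: int  # U: upper bound on address


def exclusive(r: Region, q: Region) -> bool:
    """True when r = (A, B) and q = (C, D) satisfy (D < A) or (B < C)."""
    return q.upper < r.lower or r.upper < q.lower


# Illustrative address ranges only.
r = Region(0x1000, 0x1FFF)
q = Region(0x2000, 0x2FFF)
s = Region(0x1800, 0x27FF)  # straddles both r and q
```

Here `exclusive(r, q)` holds, while `s` overlaps both, so accesses to `s` cannot be assumed independent of accesses to `r` or `q`.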

Access Region and Mem. Instructions

Partitioning Memory Space
• One way of partitioning memory space into regions:
  • Data Region / Heap Region / Stack Region
• This work assumes this partitioning.
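
As a rough sketch of this partitioning, an address can be mapped to one of the three regions by comparing it against region boundaries. All boundary values below are hypothetical; real boundaries depend on the OS, ABI, and linker, and the talk does not give them.

```python
# Hypothetical layout, chosen only for illustration.
DATA_BASE = 0x1000_0000    # assumed start of static data
HEAP_BASE = 0x1004_0000    # assumed end of static data / start of heap
STACK_LIMIT = 0x7000_0000  # assumed lowest address the stack may reach
STACK_TOP = 0x7FFF_FFFF    # assumed highest stack address


def region_of(addr: int) -> str:
    """Classify an address as 'data', 'heap', or 'stack'."""
    if DATA_BASE <= addr < HEAP_BASE:
        return "data"
    if HEAP_BASE <= addr < STACK_LIMIT:
        return "heap"
    if STACK_LIMIT <= addr <= STACK_TOP:
        return "stack"
    raise ValueError(f"address {addr:#x} is outside the assumed layout")
```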

Partitioning Memory Space, Cont’d
• Many accesses are toward Data and Stack regions.
• Some programs don’t access the Heap region at all.

Partitioning Memory Space, Cont’d
• Accesses to Data region are less bursty than others.
• Programs such as ijpeg have clustered region accesses.
• Window Size = 32

Partitioning Memory Space, Cont’d
• W/ a large window, Stack accesses become less bursty.
• Data and Stack regions have quite stable, constant demand.
• Window Size = 64

Partitioning Memory Space, Cont’d
• Many instructions access a single region (~98%).
• Multi-region-accessing instructions account for 0% to 9.6% of dynamic memory references.
[Chart: per-benchmark results (go, m88ksim, gcc, compress, li, ijpeg, perl, vortex, tomcatv, swim, su2cor, mgrid, Int.Avg, FP.Avg); data labels: 1.8%, 1.9%, 50.4%, 51.1%, 1.6%, 16.2%, 45.4%, 31.6%]

Access Region Locality
• “A memory reference instruction typically accesses a single region at run time”
  • Only about 2% of all static memory instructions access more than a single region.
• “(Thus) the region it accesses is highly predictable”
  • Simple predictors with a small look-up table achieve high prediction accuracy.
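
One way to measure this property on a memory trace is sketched below. The trace format, a non-empty list of `(pc, region)` pairs, and the function name are assumptions made for illustration; the sample trace is made up and is not data from the talk.

```python
from collections import defaultdict


def multi_region_stats(trace):
    """Return (static_fraction, dynamic_fraction) of multi-region instructions.

    static_fraction: share of distinct PCs that touched more than one region.
    dynamic_fraction: share of all references issued by those PCs.
    Assumes `trace` is a non-empty list of (pc, region) pairs.
    """
    regions_by_pc = defaultdict(set)
    refs_by_pc = defaultdict(int)
    for pc, region in trace:
        regions_by_pc[pc].add(region)
        refs_by_pc[pc] += 1
    multi = {pc for pc, regions in regions_by_pc.items() if len(regions) > 1}
    static_fraction = len(multi) / len(regions_by_pc)
    dynamic_fraction = sum(refs_by_pc[pc] for pc in multi) / len(trace)
    return static_fraction, dynamic_fraction


# Made-up trace: only the instruction at 0x408 touches two regions.
trace = [
    (0x400, "stack"), (0x404, "data"), (0x400, "stack"), (0x408, "heap"),
    (0x408, "data"), (0x404, "data"), (0x40C, "stack"), (0x400, "stack"),
]
static_fraction, dynamic_fraction = multi_region_stats(trace)
```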

Predicting Regions: Unlimited Case
• One predictor per memory instruction
• Predictor types:
  • 1-bit history saver (0: Data, 1: Stack)
  • 2-bit saturating counter
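
The two predictor types can be sketched as follows, with one predictor instance per static memory instruction (keyed by PC). The initial states, the 2-bit threshold, and the class and function names are assumptions for illustration; the slide only names the predictor types.

```python
class OneBitPredictor:
    """Remembers the last region seen (0: Data, 1: Stack)."""

    def __init__(self):
        self.bit = 0  # assumed initial state: Data

    def predict(self) -> int:
        return self.bit

    def update(self, actual: int) -> None:
        self.bit = actual


class TwoBitPredictor:
    """2-bit saturating counter; values 2 and 3 predict Stack (assumed)."""

    def __init__(self):
        self.counter = 0  # assumed initial state: strongly Data

    def predict(self) -> int:
        return 1 if self.counter >= 2 else 0

    def update(self, actual: int) -> None:
        if actual == 1:
            self.counter = min(3, self.counter + 1)
        else:
            self.counter = max(0, self.counter - 1)


def accuracy(trace, make_predictor) -> float:
    """Unlimited case: one predictor per PC over (pc, region_bit) pairs."""
    predictors = {}
    correct = 0
    for pc, actual in trace:
        if pc not in predictors:
            predictors[pc] = make_predictor()
        correct += predictors[pc].predict() == actual
        predictors[pc].update(actual)
    return correct / len(trace)


# Made-up trace: one stack-only and one data-only instruction.
trace = [(0x400, 1)] * 4 + [(0x404, 0)] * 4
```

On this toy trace the 1-bit predictor warms up after one stack access while the 2-bit counter needs two, which matches the direction of the reported result that 1-bit predictors do better when instructions rarely change regions.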

Predicting Regions: Adding Context
• Run-time context
  • Caller’s ID (CID): in Link Register
  • Global Branch History (GBH)
  • Hybrid of above

Predicting Regions: Utilizing Static Info.
• Some instructions’ access regions are revealed through architecture and compiler conventions:
  • Use of Stack Pointer ($SP) or Frame Pointer ($FP) suggests that the region is Stack.
  • Use of Global Pointer ($GP) suggests that the region is non-Stack.
  • For others, assume non-Stack.
• Directly exporting some high-level region information from compiler to processor may improve prediction accuracy.
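
The static rule above amounts to a lookup on the base register of a load or store. A minimal sketch follows; the function name, the register spelling, and the extra "revealed" flag are illustrative assumptions rather than part of the talk.

```python
def static_region_hint(base_reg: str):
    """Return (region, revealed) for a memory instruction's base register.

    `revealed` is True when the register convention itself indicates the
    region, and False when the non-Stack default is merely assumed.
    """
    reg = base_reg.lower()
    if reg in ("$sp", "$fp"):
        return ("stack", True)
    if reg == "$gp":
        return ("non-stack", True)
    return ("non-stack", False)
```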

Region Pred. Result: Unlimited Case
• 1-bit predictors do better than 2-bit predictors (not shown).
• Hybrid context bits achieve the best prediction rate on average.
[Chart: prediction rate per benchmark (go, m88ksim, gcc, compress, li, ijpeg, perl, vortex, tomcatv, swim, su2cor, mgrid, Int.Avg, FP.Avg); series: Static, Simple 1-bit, w/ GBH, w/ CID, w/ Hybrid]

Predicting Regions: Limited-Size ARPT
• The low n bits of the PC, XOR’ed with hybrid context bits, index into the Access Region Prediction Table (ARPT):
  • Table entries are initialized to 0’s.
  • 1 denotes a stack access.
• Decoding information is exploited to save ARPT space.
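
A limited-size table of this kind might look like the sketch below. It uses 1-bit entries and an XOR index as described; dropping the two low PC bits (assuming 4-byte aligned instructions) and the way context bits are supplied are assumptions added here, and the class name is illustrative.

```python
class ARPT:
    """Access Region Prediction Table with 1-bit entries (1: stack access)."""

    def __init__(self, index_bits: int):
        self.mask = (1 << index_bits) - 1
        self.table = [0] * (1 << index_bits)  # entries initialized to 0's

    def _index(self, pc: int, context: int) -> int:
        # Assumption: instructions are 4-byte aligned, so the two low PC
        # bits carry no information and are dropped before indexing.
        return ((pc >> 2) ^ context) & self.mask

    def predict(self, pc: int, context: int) -> int:
        return self.table[self._index(pc, context)]

    def update(self, pc: int, context: int, actual: int) -> None:
        self.table[self._index(pc, context)] = actual


arpt = ARPT(index_bits=4)
before = arpt.predict(0x400, 0b0101)         # untrained entry
arpt.update(0x400, 0b0101, 1)                # record a stack access
after = arpt.predict(0x400, 0b0101)
aliased = arpt.predict(0x440, 0b0101)        # same index as 0x400 here
other_context = arpt.predict(0x400, 0b0110)  # different context, different entry
```

With only 16 entries, the instructions at `0x400` and `0x440` share an entry, which illustrates the aliasing pressure that smaller tables face.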

Region Prediction Result: ARPT
• Over 99.9% accuracy w/ 4 KB or larger ARPT w/o compiler hints.
• Compiler hints relieve pressure due to smaller sizes.
[Chart: prediction accuracy per benchmark (go, m88ksim, gcc, compress, li, ijpeg, perl, vortex, tomcatv, swim, su2cor, mgrid, Int.Avg, FP.Avg); series: 1 KB, 2 KB, 4 KB, 8 KB, Unlimited]

Dynamic Data Decoupling

Dynamic Data Decoupling, Cont’d
• Dynamically predicting access regions to classify memory instructions:
  • Utilize Access Region Prediction Table (ARPT).
  • Utilize any region information revealed through instruction decoding.
• Dispatching partitioned memory instructions into separate memory pipelines, connected to separate caches.
• Dynamically verifying region prediction:
  • Let the TLB (i.e., page table) contain verification information such that a memory access is reissued on mispredictions.
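
The predict, dispatch, verify, and reissue flow can be sketched as a toy model. It is not the authors' design: the per-PC prediction dictionary stands in for the ARPT, `tlb_is_stack` stands in for the region information held in a TLB entry, and `STACK_LIMIT` is a hypothetical boundary.

```python
STACK_LIMIT = 0x7000_0000  # hypothetical: pages at or above this are stack pages


def tlb_is_stack(addr: int) -> bool:
    """Stand-in for a per-page region bit stored with the TLB entry."""
    return addr >= STACK_LIMIT


def issue(pc: int, addr: int, table: dict):
    """Dispatch one memory access; return (first_pipe, final_pipe, reissued).

    `table` maps PC -> last observed region bit (True: stack) and defaults
    to non-stack for instructions not seen before.
    """
    predicted_stack = table.get(pc, False)
    first_pipe = "stack" if predicted_stack else "data"
    actual_stack = tlb_is_stack(addr)           # verification at address translation
    reissued = predicted_stack != actual_stack  # wrong pipeline: reissue the access
    final_pipe = "stack" if actual_stack else "data"
    table[pc] = actual_stack                    # train the 1-bit predictor
    return first_pipe, final_pipe, reissued


table = {}
first = issue(0x400, 0x7FFF_0000, table)   # cold: mispredicted, then reissued
second = issue(0x400, 0x7FFF_0010, table)  # trained: goes to the stack pipeline
third = issue(0x404, 0x1000_0000, table)   # default non-stack prediction is correct
```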

Base Machine Model

Overall Performance
• Results are relative to the (2+0) configuration.
[Chart: per-benchmark results (go, m88ksim, gcc, compress, li, ijpeg, perl, vortex, tomcatv, swim, su2cor, mgrid, Int.Avg, FP.Avg)]

Conclusions
• Access Region Locality says:
  • Memory instructions access few regions at run time.
  • Accessed regions are accurately predictable.
• Access Region Locality leads to Access Region Prediction techniques.
• Access Region Prediction allows Dynamic Data Decoupling, shown to achieve performance comparable to very wide data caches.

Now Any Questions?

Impact of LVC Size
• 2KB and 4KB LVCs achieve high hit rates (~99.9%).
• Set associativity is less important if the LVC is 2KB or more.
• A small, simple LVC works well.
[Chart: LVC sizes 0.5K, 1K, 2K, 4K]
