Low-Cost Adaptive Data Prefetching
Luis M. Ramos, José Luis Briz, Pablo E. Ibáñez and Víctor Viñals
University of Zaragoza (Spain)
Euro-Par 2008 - Las Palmas de Gran Canaria - August 26-29th, 2008
Introduction
Hardware Data Prefetching
• Effective at hiding memory latency
• Recent successful proposals: GHB, SMS
• Simple mechanisms in commercial processors:
  • UltraSPARC-IIIcu & SPARC64 VI (sequential tagged)
  • Power4 & Power5 (sequential stream buffers)
  • Intel Core (sequential & stride)
Sequential Tagged prefetching (SEQT)
• Prefetches on a cache miss or on the first use of a prefetched block
• Highest speed-ups
• High pressure on memory and performance losses in hostile applications
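The SEQT trigger rule (prefetch on a miss or on the first use of a prefetched block) can be illustrated with a minimal sketch. It assumes a degree of 1, an unbounded cache modelled as a dictionary, and one tag bit per block; the class and method names are hypothetical and not taken from the slides.

```python
class TaggedSequentialPrefetcher:
    """Toy model of sequential tagged prefetching with degree 1."""

    def __init__(self):
        # block number -> tag bit (True = prefetched and not yet used).
        # Assumption: the cache never evicts, to keep the sketch short.
        self.cache = {}

    def access(self, block):
        """Simulate a demand access; return the list of blocks prefetched."""
        if block not in self.cache:
            self.cache[block] = False      # demand miss: fill the block
            trigger = True
        elif self.cache[block]:
            self.cache[block] = False      # first use: clear the tag bit
            trigger = True
        else:
            trigger = False                # ordinary hit: no prefetch

        issued = []
        if trigger and block + 1 not in self.cache:
            self.cache[block + 1] = True   # bring in the next block, tagged
            issued.append(block + 1)
        return issued
```

A sequential walk keeps the prefetcher one block ahead: the miss on block 0 fetches block 1, and the first use of block 1 fetches block 2, while repeated hits issue nothing.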
Introduction
Our aim:
• Use the simplest prefetcher (SEQT)
• Evaluate degree-distance policies and adaptive mechanisms
• Compare them with: Stride, GHB, P-DFCM, SMS
Outline
• Prefetching mechanisms
• Experimental framework and benchmarks
• Preliminary results
  • Performance
  • Pressure on memory
• Degree-distance policies
• Results
• Conclusions and future work
Prefetching mechanisms
Stride prefetching
• Addresses separated by a constant distance
• Table indexed by PC
• On-miss insertion [Ibáñez et al. 98]
SMS (Spatial Memory Streaming)
• Spatial access patterns
• Prefetches blocks inside a memory region
• Avoids useless blocks
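A minimal sketch of a PC-indexed stride table with on-miss insertion follows. The confirmation rule (predict only after the same non-zero stride is seen twice in a row) and the unbounded table are assumptions made for illustration; the slides do not specify them, and all names are hypothetical.

```python
class StridePrefetcher:
    """Toy PC-indexed stride table; entries are allocated only on a miss."""

    def __init__(self):
        # pc -> (last address, last stride); unbounded for simplicity.
        self.table = {}

    def access(self, pc, addr, miss):
        """Return the address to prefetch, or None if there is no prediction."""
        entry = self.table.get(pc)
        if entry is None:
            if miss:                       # on-miss insertion
                self.table[pc] = (addr, 0)
            return None

        last_addr, last_stride = entry
        stride = addr - last_addr
        self.table[pc] = (addr, stride)

        # Assumption: a stride is trusted once it repeats and is non-zero.
        if stride != 0 and stride == last_stride:
            return addr + stride
        return None
```

With a load at a fixed PC touching addresses 100, 164 and 228, the third access confirms the stride of 64 and yields a prefetch for 292. A load that never misses is never allocated an entry.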
Prefetching mechanisms
Correlating prefetchers
• Tables store memory program behaviour (addresses or deltas)
• Indexed by address or PC
GHB (Global History Buffer) PC/DC
• Focused on reducing table sizes
• 2 tables, several accesses to calculate deltas
P-DFCM
• Based on the DFCM value predictor
• 2 tables, delta stream used to predict the next delta
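The idea of using a delta stream to predict the next delta with two tables can be sketched as below. This is a generic differential context predictor, not the P-DFCM design itself: it assumes an order-2 history kept as a tuple rather than a hash, unbounded tables, and hypothetical names throughout.

```python
class DeltaContextPredictor:
    """Two-table delta predictor: PC -> recent deltas -> next delta."""

    def __init__(self, order=2):
        self.order = order
        self.first = {}    # pc -> (last address, tuple of recent deltas)
        self.second = {}   # tuple of recent deltas -> delta that followed

    def access(self, pc, addr):
        """Train on this access; return a predicted next address or None."""
        last_addr, history = self.first.get(pc, (None, ()))
        if last_addr is not None:
            delta = addr - last_addr
            if len(history) == self.order:
                self.second[history] = delta          # learn what followed
            history = (history + (delta,))[-self.order:]
        self.first[pc] = (addr, history)

        if len(history) < self.order:
            return None
        predicted = self.second.get(history)
        return None if predicted is None else addr + predicted
```

For an access stream whose deltas alternate between +1 and +3 (addresses 0, 1, 4, 5, 8, 9), the predictor starts returning the correct next address once both delta contexts have been seen, which a single-stride table cannot do.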
Experimental framework and benchmarks
• SimpleScalar 3.0, Alpha binaries
• Aggressive superscalar processor
• 3-level memory hierarchy (Itanium2)
• Spec2k, Simple Simpoints
• 200 M instruction warming
• Selection rule: ideal L2 speed-up > 2%
[Figure: memory hierarchy diagram; capacity labels: 4 MB, 256 KB, 16 KB]
Preliminary results: performance
[Figure: two panels, a) CINT and b) CFP]
Preliminary results: pressure
Preliminary results: breakdown per application
Degree-distance policies
Deg(4)
[Figure: timeline over blocks i to i+8; legend: prefetch, demand hit, demand miss, 1st use of a prefetch]
• Deg(x): on a miss and on a 1st use, prefetches x blocks
Degree-distance policies
Dist(4)
[Figure: timeline over blocks i to i+8; legend: prefetch, demand hit, demand miss, 1st use of a prefetch]
• Dist(x): on a miss and on a 1st use, prefetches the x-th block
Degree-distance policies
Deg-dist(4)
[Figure: timeline over blocks i to i+8; legend: prefetch, demand hit, demand miss, 1st use of a prefetch]
• Deg-dist(x): on a miss, prefetches x blocks; on a 1st use, prefetches the x-th block
Degree-distance policies
Deg(1-4)
[Figure: timeline over blocks i to i+8; legend: prefetch, demand hit, demand miss, 1st use of a prefetch]
• Deg(1-x): deg_miss = 1, deg_1st-use = x
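The four non-adaptive policies (Deg, Dist, Deg-dist and Deg(1-x)) differ only in which blocks they request for each trigger event. A minimal sketch, assuming block numbers are plain integers, that "the x-th block" means block + x, and that duplicate requests are filtered elsewhere; the function name and policy strings are hypothetical.

```python
def prefetch_candidates(policy, block, event, x):
    """Return the block numbers requested for one trigger event.

    policy: "deg", "dist", "deg-dist" or "deg(1-x)"
    event:  "miss" (demand miss) or "first_use" (1st use of a prefetch)
    """
    if event not in ("miss", "first_use"):
        raise ValueError(f"unknown event: {event}")

    run = [block + k for k in range(1, x + 1)]   # the next x blocks
    far = [block + x]                            # only the x-th block ahead

    if policy == "deg":
        return run
    if policy == "dist":
        return far
    if policy == "deg-dist":
        return run if event == "miss" else far
    if policy == "deg(1-x)":
        return [block + 1] if event == "miss" else run
    raise ValueError(f"unknown policy: {policy}")
```

For x = 4 and a trigger on block 10, Deg requests blocks 11 to 14 on either event, Dist requests only block 14, Deg-dist requests 11 to 14 on a miss but only 14 on a first use, and Deg(1-x) requests 11 on a miss and 11 to 14 on a first use.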
Degree-distance policies
Ad1(4)
[Figure: timeline over blocks i to i+7 with a "deg" column showing the current degree (0, then 1); legend: prefetch, demand hit, demand miss, 1st use of a prefetch; annotations: "100x deg--", "50x deg++"]
• Ad1(x): deg_miss = 1
• deg_1st-use = f(usefulness), in [0..x]
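The slide states only that the first-use degree is a function of usefulness bounded to [0..x]; it does not define that function. The sketch below is one possible reading, in which the epoch length, the two usefulness thresholds and the initial degree are all hypothetical parameters chosen for illustration.

```python
class AdaptiveDegree:
    """Adjust a prefetch degree within [0, max_degree] from usefulness."""

    def __init__(self, max_degree, epoch=64, low=0.25, high=0.75, initial=1):
        self.max_degree = max_degree
        self.epoch = epoch          # hypothetical: outcomes per adjustment
        self.low = low              # hypothetical: below this, deg--
        self.high = high            # hypothetical: above this, deg++
        self.degree = initial
        self.useful = 0
        self.total = 0

    def record(self, was_useful):
        """Record the fate of one prefetched block (used or not)."""
        self.total += 1
        self.useful += 1 if was_useful else 0
        if self.total < self.epoch:
            return
        ratio = self.useful / self.total
        if ratio >= self.high:
            self.degree = min(self.max_degree, self.degree + 1)
        elif ratio <= self.low:
            self.degree = max(0, self.degree - 1)
        self.useful = 0             # start a new epoch
        self.total = 0
```

Under this reading the miss degree stays fixed at 1, so prefetches keep being issued and measured even after the first-use degree has dropped to 0, which lets the degree recover when the access pattern becomes friendly again.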
Degree-distance policies
Ad2(4)
[Figure: timelines over blocks i-1 to i+4 and k-4 to k+1 with a "deg" column showing the current degree (2); legend: prefetch, demand hit, demand miss, 1st use of a prefetch; annotations: "100x deg--", "50x deg++"]
• Ad2(x): deg_miss = 1 (both directions)
• deg_1st-use = f(usefulness), in [0..x]
Degree-distance policies
Ad3(4)
[Figure: timeline over blocks i to i+7 with a "deg" column showing the current degree (1, then 2); legend: prefetch, demand hit, demand miss, 1st use of a prefetch; annotations: "100x deg--", "100x pollution deg--", "50x deg++", "50x late deg++"]
• Ad3(x): deg_miss = 1
• deg_1st-use = f(usefulness, timeliness, pollution), in [0..x]
Degree-distance policies
Ad4(4,32)
[Figure: timelines over blocks i-1 to i+4 and k-4 to k+1 with a "deg" column showing the current degree (2 and 1); legend: prefetch, demand hit, demand miss, 1st use of a prefetch; annotations: "100x deg--", "50x deg++"]
• Ad4(x,y): region in [0..y-1]
• deg_1st-use = f(usefulness, region), in [0..x]
Degree-distance policies
Ad5(4)
[Figure: timeline over blocks i to i+7 with a "deg" column showing the current degree (0, then 1); legend: prefetch, demand hit, demand miss, 1st use of a prefetch]
• Ad5(x) [Dahlgren-93]: deg = f(usefulness), in [0..x]
• Same degree on a miss and on a 1st use
• Mechanism needed when deg == 0
Results: performance
• SMS as reference
• Adaptive (Ad) policies have no losses
• INT: degree 4 or 8
• FP: degree 8 or 16
• Dist & Ad5 are the worst
• The rest are similar to Deg
• Among Ad policies:
  • INT: Ad4(8,32) (diff 1%)
  • FP: Ad3(8) (diff 1% - 5%)
• Ad4(8,32) & Ad2(8) are the best on average
[Figure: two panels, a) CINT and b) CFP]
Results: pressure
PAB
[Figure: Deg(4) Prefetch Engine, PAB (4 entries) and L2; block labels i and i+1 to i+5]
PAB as a filter
[Figure: Deg(4) Prefetch Engine, PAB (4 entries) and L2; block labels i and i+1 to i+5]
• L2 lookups reduction:
  • 2% for Deg-dist
  • 49% for SMS (but it remains the most demanding)
  • 25%-40% for the rest
• Performance unaffected
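One way a small buffer in front of L2 could cut lookups is by dropping requests for blocks that are already queued. The sketch below models that reading only: the FIFO organisation, the drop-oldest rule when the buffer is full, and all names are assumptions, since the slides show the PAB only as a 4-entry box between the prefetch engine and L2.

```python
from collections import deque


class PabFilter:
    """Toy 4-entry buffer that discards duplicate prefetch requests."""

    def __init__(self, entries=4):
        self.entries = entries
        self.queue = deque()     # pending block numbers, oldest first
        self.filtered = 0        # requests dropped as duplicates

    def push(self, block):
        """Queue a prefetch request; return False if it was filtered out."""
        if block in self.queue:
            self.filtered += 1
            return False
        if len(self.queue) == self.entries:
            self.queue.popleft() # assumption: the oldest entry is discarded
        self.queue.append(block)
        return True

    def pop(self):
        """Hand the oldest pending request to L2, or None if empty."""
        return self.queue.popleft() if self.queue else None
```

With Deg(4), a miss on block i queues i+1 to i+4, and the first use of i+1 then requests i+2 to i+5. Three of those four requests are already pending, so only i+5 needs a new entry and the duplicates never reach L2.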
Conclusions and future work
• Ways of tuning the aggressiveness of SEQT prefetchers
• Ad2(8) and Ad4(8,32) perform the best
  • Adaptive: vary the degree according to prefetch usefulness
  • Ad2 prefetches forward and backward
  • Ad4 adjusts the degree for each of the 32 memory regions
  • Both equal SMS in CINT and outperform it in CFP (60% fewer lookups in L2)
  • Ad2: 2 bits/line; Ad4: 2 bits + 64 B table; SMS: 33 KB
• PAB used to reduce the pressure on L2 (25%-40%)
• No losses and very low hardware cost
• Future work: use a realistic on-chip memory controller
Thank you