Enhancing Energy Efficiency in Snoop-based CMPs Through Partial Tag Comparison Techniques
This research explores the use of Partial Tag Comparison (PTC) to improve energy efficiency in snoop-based chip multiprocessors (CMPs). Conventional snooping broadcasts and processes entire tags, which is inefficient. The proposed approach applies PTC before snooping and reports a 2.9% improvement in performance, a 52% reduction in tag array power, and a 78.5% improvement in bandwidth utilization. The findings indicate that a small number of tag bits is enough to detect most misses early, which saves energy and improves performance in CMP architectures.
Presentation Transcript
Using Partial Tag Comparison in Low-Power Snoop-based Chip Multiprocessors
Ali Shafiee, Narges Shahidi, Amirali Baniasadi
Sharif University of Technology, University of Victoria
This Work: Improving Snoop Coherency
• Goal: Improving energy efficiency in snoop-based CMPs.
• Motivation: Broadcasting/processing the entire tag is inefficient.
• Our Solution: Using Partial Tag Comparison (PTC) prior to snoop.
• Key Results: Performance (2.9%), Tag array power (52%), Bandwidth utilization (78.5%)
Our Solution (PTC) vs. Conventional
[Figure: two CMP organizations side by side, each with per-core D$ connected through an interconnect to an upper level cache.]
• Conventional: Fast +, Power & Bandwidth −
• Our solution: Fast ++ (early miss detection), Power & Bandwidth Efficient +
Conventional Snooping
[Figure: four CPUs, each with a D$, connected by an address bus and a command bus to a snoop bus controller; steps 1 to 5 trace a snoop request.]
Redundant (miss): ~70%
Snoop Filters
• Goal: Eliminate redundant snoop requests.
• Examples: RegionScout (ISCA’05), CGCT (ISCA’05), SSP (ASPLOS’08)
• PTC: (1) Early miss detection using a subset of tag bits. (2) Once a miss is detected, the snoop is avoided.
How often is that possible?
How often is using n bits enough to detect a miss? 95+% of misses can be detected using 8 bits.
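Early miss detection rests on a one-way implication: if the low n bits of two tags differ, the full tags must differ as well. The following is a minimal Python sketch of that check, not the authors' hardware design. It assumes tags are plain integers and that the compared subset is the n least significant bits (the slides label filter entries "LSB"); the function names and the tag values are hypothetical.

```python
from typing import Sequence


def partial_tag(tag: int, n_bits: int = 8) -> int:
    """Return the n least significant bits of a tag."""
    return tag & ((1 << n_bits) - 1)


def is_definite_miss(request_tag: int, way_tags: Sequence[int], n_bits: int = 8) -> bool:
    """True when no way in the set can hold the requested tag.

    A partial mismatch proves a full mismatch, so this never reports a
    miss for a line that is actually present. A partial match proves
    nothing: a full tag comparison is still needed in that case.
    """
    wanted = partial_tag(request_tag, n_bits)
    return all(partial_tag(t, n_bits) != wanted for t in way_tags)


# Hypothetical 4-way set; the tag values are made up for illustration.
ways = [0x1A2B3, 0x0FF10, 0x7C4D2, 0x33301]

print(is_definite_miss(0x5AB77, ways))  # low byte 0x77 matches no way
print(is_definite_miss(0x99910, ways))  # low byte 0x10 aliases with 0x0FF10
```

The second request is a true miss that the 8-bit comparison cannot rule out. The slide's figure says fewer than 5% of misses fall into this case.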
PTC-Filter
[Figure: the D$ and the PTC-Filter are looked up using the tag LSBs. Filter hit: snoop the potential targets over the address bus. Filter miss: avoid the snoop and access the upper level.]
PTC-Filter
[Figure: four 4-way D$s (cores 0 to 3), each paired with a filter. A core's PTC-Filter stores the LSBs of the other cores' tags (Core1’s LSB, Core2’s LSB, Core3’s LSB); each entry holds 8 LSB bits plus D and V bits.]
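The filter organization on this slide can be sketched as a small data structure. This is an illustrative Python model under stated assumptions, not the authors' implementation: the class and method names are hypothetical, the "V" and "D" bits are assumed to mean valid and dirty, and the split of an address into set index and tag is taken as given.

```python
from dataclasses import dataclass

N_WAYS = 4      # 4-way D$, as on the slide
LSB_BITS = 8    # partial tag width, as on the slide


@dataclass
class FilterEntry:
    lsb: int = 0          # low LSB_BITS bits of the remote line's tag
    valid: bool = False   # assumed meaning of the slide's "V" bit
    dirty: bool = False   # assumed meaning of the slide's "D" bit


class PTCFilter:
    """Per-core mirror of the partial tags held by the other cores."""

    def __init__(self, remote_cores, n_sets):
        self.entries = {
            core: [[FilterEntry() for _ in range(N_WAYS)] for _ in range(n_sets)]
            for core in remote_cores
        }

    def record(self, core, set_index, way, tag, dirty=False):
        """Mirror a line that a remote core placed in the given way."""
        mask = (1 << LSB_BITS) - 1
        self.entries[core][set_index][way] = FilterEntry(tag & mask, True, dirty)

    def potential_targets(self, set_index, tag):
        """Remote cores that may hold the line; an empty list is a filter miss."""
        wanted = tag & ((1 << LSB_BITS) - 1)
        return [
            core
            for core, sets in self.entries.items()
            if any(e.valid and e.lsb == wanted for e in sets[set_index])
        ]


# Core 0's view of cores 1 to 3; 64 sets is a made-up size.
f = PTCFilter(remote_cores=[1, 2, 3], n_sets=64)
f.record(core=2, set_index=5, way=1, tag=0x4A7C)
print(f.potential_targets(5, 0x917C))  # shares low byte 0x7C with core 2's line
print(f.potential_targets(5, 0x9100))  # no partial match anywhere
```

An empty result corresponds to the "filter miss" path (skip the snoop and go to the upper level cache). A non-empty result names the only cores that need to be snooped.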
PTC: Filter Miss
[Figure: four CPUs with D$s, address bus, command bus and snoop bus controller; steps 1 to 3.]
PTC: Filter Hit
[Figure: same system; steps 1 to 6, with ✓ and ✗ marks on the D$s.]
Filter Maintenance
[Figure: Core 0 and Core i, each with a CPU, snoop controller, pending request table and PTC-Filter, connected by the command bus and the address bus; steps 1 to 6.]
• Request = A: miss.
• Place A: insert in Way 1 of core 0, in the position of tag F.
• Message: {Address=A, C=0, W=1, D=1}
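The maintenance step can be sketched as applying the broadcast message to every other core's filter. This Python sketch is only one plausible reading of the slide: the field names follow the message {Address, C, W, D}, the "D" flag is stored without interpretation, and the address layout (64-byte lines, 64 sets), the dictionary-based filter and the function name are all hypothetical.

```python
from typing import NamedTuple

LSB_BITS = 8      # partial tag width, as on the slides
SET_BITS = 6      # hypothetical: 64 sets
OFFSET_BITS = 6   # hypothetical: 64-byte lines


class FillMessage(NamedTuple):
    address: int  # "Address" field of the message
    core: int     # "C": core that placed the line
    way: int      # "W": way the line was placed in
    d: int        # "D" flag, carried through as-is (meaning assumed)


def apply_fill(filters, msg):
    """Update every other core's filter so it mirrors the new placement."""
    set_index = (msg.address >> OFFSET_BITS) & ((1 << SET_BITS) - 1)
    tag = msg.address >> (OFFSET_BITS + SET_BITS)
    lsb = tag & ((1 << LSB_BITS) - 1)
    for owner, table in filters.items():
        if owner != msg.core:
            # Overwrites the entry of the evicted line (tag F on the slide).
            table[(msg.core, set_index, msg.way)] = (lsb, msg.d)


# One filter per core, keyed by (remote core, set index, way).
filters = {core: {} for core in range(4)}
A = 0x12345678  # made-up address
apply_fill(filters, FillMessage(address=A, core=0, way=1, d=1))
print(filters[1])
```

Because the message carries the way number, each remote filter can overwrite exactly the entry that the eviction invalidated, with no separate invalidation message in this sketch.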
Methodology • SESC simulator 4-way CMP • SPLASH-2 benchmarks • CACTI 6.0
Performance Average: 2.9%
Bandwidth Average: 78.5%
Tag Power Average: 52%
Discussion
• Why do benchmarks show different performance improvement?
  • Different cache miss frequency
  • Different early miss detection frequency
  • Not all cache misses are on the critical path
• Filter overhead:
  • Timing: 1 cycle
  • Power: 78.5% of single tag array access
Summary
• PTC: Using a subset of tag bits to improve bandwidth/power efficiency.
• Results: Performance: 2.9%, Tag Power: 52%, Bandwidth: 78.5%
Global vs. Local Miss
[Figure: two CMPs, each with per-core D$s connected through an interconnect to an upper level cache, labeled "Global Miss" and "Local Miss"; caches are asked "Have B?" and answer YES or NO.]
• Local miss detection: better power/bandwidth profile
• Remote miss detection (source-based approach) vs. (destination-based filter)