Reconfigurable Caches and their Application to Media Processing


Presentation Transcript


  1. Reconfigurable Caches and their Application to Media Processing Parthasarathy (Partha) Ranganathan Dept. of Electrical and Computer Engineering Rice University Houston, Texas Sarita Adve Dept. of Computer Science University of Illinois at Urbana Champaign Urbana, Illinois Norman P. Jouppi Western Research Laboratory Compaq Computer Corporation Palo Alto, California

  2. Motivation (1 of 2) • Different workloads on general-purpose processors • Scientific/engineering, databases, media processing, … • Widely different characteristics • Challenge for future general-purpose systems • Use most transistors effectively for all workloads

  3. Motivation (2 of 2) • Challenge for future general-purpose systems • Use most transistors effectively for all workloads • 50% to 80% of processor transistors devoted to cache • Very effective for engineering and database workloads • BUT large caches often ineffective for media workloads • Streaming data and large working sets [ISCA 1999] • Can we reuse cache transistors for other useful work?

  4. Contributions • Reconfigurable Caches • Flexibility to reuse cache SRAM for other activities • Several applications possible • Simple organization and design changes • Small impact on cache access time

  5. Contributions • Reconfigurable Caches • Flexibility to reuse cache SRAM for other activities • Several applications possible • Simple organization and design changes • Small impact on cache access time • Application for media processing • e.g., instruction reuse – reuse memory for computation • 1.04X to 1.20X performance improvement

  6. Outline for Talk • Motivation • Reconfigurable caches • Key idea • Organization • Implementation and timing analysis • Application for media processing • Summary and future work

  7. Reconfigurable Caches: Key Idea [Figure: current use of on-chip SRAM as a single cache vs. proposed use split into Partition A (cache) and Partition B (lookup table)] Key idea: reuse cache transistors! • Dynamically divide SRAM into multiple partitions • Use partitions for other useful activities ⇒ Cache SRAM useful for both conventional and media workloads

  8. Reconfigurable Cache Uses • Number of different uses for reconfigurable caches • Optimizations using lookup tables to store patterns • Instruction reuse, value prediction, address prediction, … • Hardware and software prefetching • Caching of prefetched lines • Software-controlled memory • QoS guarantees, scratch memory area ⇒ Cache SRAM useful for both conventional and media workloads

  9. Key Challenges [Figure: current vs. proposed use of on-chip SRAM, split into Partition A (cache) and Partition B (lookup table)] • How to partition the SRAM? • How to address the different partitions as they change? • Minimize impact on cache access (clock cycle) time • Associativity-based partitioning

  10. Conventional Cache Organization [Figure: two-way set-associative cache; the address divides into tag, index, and block offset; the index selects a set, stored state/tag/data entries in Way 1 and Way 2 are compared against the address tag, and the compare/select logic drives data out and hit/miss]
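The tag/index/offset split shown on this slide can be sketched in a few lines. The 128 KB, 4-way geometry matches the L1 data cache on the system-parameters slide later in the talk; the 64-byte block size is an assumed value, not stated in the slides.

```python
# Assumed geometry: 128 KB, 4-way set-associative, 64-byte blocks.
CACHE_BYTES = 128 * 1024
WAYS = 4
BLOCK_BYTES = 64

SETS = CACHE_BYTES // (WAYS * BLOCK_BYTES)   # 512 sets
OFFSET_BITS = BLOCK_BYTES.bit_length() - 1   # 6 bits of block offset
INDEX_BITS = SETS.bit_length() - 1           # 9 bits of set index

def split_address(addr):
    """Decompose a physical address into (tag, index, block offset)."""
    offset = addr & (BLOCK_BYTES - 1)
    index = (addr >> OFFSET_BITS) & (SETS - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset
```

The index picks one set; the tag is then compared against the stored tag in every way of that set.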

  11. Associativity-Based Partitioning [Figure: the two-way cache of slide 10 with each way made a separate partition; each partition has its own tag/index/block address path, and a choose signal selects which partition's compare/select logic drives data out and hit/miss] Partition at granularity of “ways” Multiple data paths and additional state/logic
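A minimal sketch of way-granularity partitioning, assuming a 4-way cache split into two 2-way partitions; the way assignments and the trivial replacement policy are illustrative, not the paper's design.

```python
WAYS = 4
PARTITION_WAYS = {1: (0, 1), 2: (2, 3)}   # partition -> ways it owns (assumed split)

class PartitionedSet:
    """One cache set whose ways are divided between two partitions."""
    def __init__(self):
        self.tags = [None] * WAYS
        self.data = [None] * WAYS

    def lookup(self, partition, tag):
        # Probe only the ways owned by the requesting partition.
        for way in PARTITION_WAYS[partition]:
            if self.tags[way] == tag:
                return self.data[way]     # hit
        return None                       # miss

    def fill(self, partition, tag, value):
        ways = PARTITION_WAYS[partition]
        # Trivial policy: use an empty way if one exists, else evict the first way.
        victim = next((w for w in ways if self.tags[w] is None), ways[0])
        self.tags[victim], self.data[victim] = tag, value
```

Because each partition only ever touches its own ways, the same tag can live in both partitions without conflict, which is what lets one partition serve as a conventional cache while the other serves as a lookup table.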

  12. Reconfigurable Cache Organization • Associativity-based partitioning • Simple - small changes to conventional caches • But # and granularity of partitions depends on associativity • Alternate approach: Overlapped-wide-tag partitioning • More general, but slightly more complex • Details in paper

  13. Other Organizational Choices (1 of 2) [Figure: current vs. proposed use of on-chip SRAM, split into Partition A and Partition B] • Ensuring consistency of data at repartitioning • Cache scrubbing: flush data at repartitioning intervals • Lazy transitioning: augment state with partition information • Addressing of partitions - software (ISA) vs. hardware
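The cache-scrubbing option could look roughly like this: when ways change owners at a repartitioning interval, dirty lines in those ways are written back and the lines invalidated. The per-line dirty bit and the writeback callback are assumed bookkeeping, not details from the talk.

```python
def scrub_and_repartition(sets, old_owner, new_owner, writeback):
    """Cache scrubbing: at a repartitioning interval, write back dirty
    lines and invalidate every way whose owning partition changes.

    sets      - list of cache sets, each a list of per-way lines
                (None or {"tag", "data", "dirty"})
    old_owner - old_owner[w] is the partition that owned way w
    new_owner - new_owner[w] is the partition that will own way w
    writeback - callback(tag, data) that flushes one dirty line to memory
    """
    changed_ways = [w for w in range(len(old_owner)) if old_owner[w] != new_owner[w]]
    for cache_set in sets:
        for w in changed_ways:
            line = cache_set[w]
            if line is not None:
                if line["dirty"]:
                    writeback(line["tag"], line["data"])
                cache_set[w] = None        # invalidate the reassigned line
    return new_owner
```

Lazy transitioning avoids this up-front flush by tagging each line with its owning partition instead, at the cost of extra state per line.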

  14. Other Organizational Choices (2 of 2) [Figure: current vs. proposed use of on-chip SRAM, split into Partition A and Partition B] • Method of partitioning - hardware vs. software control • Frequency of partitioning - frequent vs. infrequent • Level of partitioning - L1, L2, or lower levels • Tradeoffs based on application requirements

  15. Outline for Talk • Motivation • Reconfigurable caches • Key idea • Organization • Implementation and timing analysis • Application for media processing • Summary and future work

  16. Conventional Cache Implementation • Tag and data arrays split into multiple sub-arrays to reduce/balance the length of word lines and bit lines [Figure: cache implementation with address decoders, tag and data arrays crossed by word lines and bit lines, column muxes, sense amps, comparators, mux drivers, and data/valid output drivers]

  17. Changes for Reconfigurable Cache [Figure: the implementation of slide 16 with the address, decoder, mux-driver, and output paths replicated 1:NP, one per partition] • Associate sub-arrays with partitions • Constraint on minimum number of sub-arrays • Additional multiplexors, drivers, and wiring

  18. Impact on Cache Access Time • Sub-array-based partitioning • Multiple simultaneous accesses to SRAM array • No additional data ports • Timing analysis methodology • CACTI analytical timing model for cache time (Compaq WRL) • Extended to model reconfigurable caches • Experiments varying cache sizes, partitions, technology, …

  19. Impact on Cache Access Time • Cache access time • Comparable to base (within 1-4%) for few partitions (2) • Higher for more partitions, especially with small caches • But still within 6% for large caches • Impact on clock frequency likely to be even lower

  20. Outline for Talk • Motivation • Reconfigurable caches • Application for media processing • Instruction reuse with media processing • Simulation results • Summary and future work

  21. Application for Media Processing • Instruction reuse/memoization [Sodani and Sohi, ISCA 1997] • Exploits value redundancy in programs • Store instruction operands and result in reuse buffer • If a later instruction and its operands match in the reuse buffer, skip execution and read the answer from the reuse buffer [Figure: reuse buffer mapped onto a cache partition] Few changes for implementation with reconfigurable caches
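The probe/insert flow of a reuse buffer can be sketched as below. Keying entries on (PC, opcode, operand values), the toy two-operation ALU, and the FIFO eviction are illustrative simplifications, not Sodani and Sohi's actual scheme.

```python
REUSE_ENTRIES = 1024   # capacity is an assumption, not from the talk

class ReuseBuffer:
    """Memoization table keyed by (pc, opcode, operand values)."""
    def __init__(self):
        self.table = {}

    def probe(self, pc, op, a, b):
        return self.table.get((pc, op, a, b))

    def insert(self, pc, op, a, b, result):
        if len(self.table) >= REUSE_ENTRIES:
            # Python dicts preserve insertion order, so this evicts the oldest entry.
            self.table.pop(next(iter(self.table)))
        self.table[(pc, op, a, b)] = result

def execute(rb, pc, op, a, b):
    """Execute one instruction, consulting the reuse buffer first."""
    cached = rb.probe(pc, op, a, b)
    if cached is not None:
        return cached, True                          # reuse hit: skip execution
    result = {"add": a + b, "mul": a * b}[op]        # toy ALU for illustration
    rb.insert(pc, op, a, b, result)
    return result, False                             # miss: executed and recorded
```

With a reconfigurable cache, the table storage comes from a cache partition rather than a dedicated SRAM, which is why the implementation needs few changes.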

  22. Simulation Methodology • Detailed simulation using RSIM (Rice) • User-level execution-driven simulator • Media processing benchmarks • JPEG image encoding/decoding • MPEG video encoding/decoding • GSM speech decoding and MPEG audio decoding • Speech recognition and synthesis

  23. System Parameters • Modern general-purpose processor with ILP + media extensions • 1 GHz, 8-way issue, OOO, VIS, prefetching • Multi-level memory hierarchy • 128 KB 4-way associative 2-cycle L1 data cache • 1 MB 4-way associative 20-cycle L2 cache • Simple reconfigurable cache organization • 2 partitions at L1 data cache • 64 KB data cache, 64 KB instruction reuse buffer • Partitioning at start of application in software

  24. Impact of Instruction Reuse • Performance improvements for all applications (1.04X to 1.20X) • Use memory to reduce the compute bottleneck • Greater potential with aggressive design [details in paper] [Bar chart: normalized execution time, base = 100 for each application; with instruction reuse: JPEG decode 92, MPEG decode 89, Speech synthesis 84]
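The chart's normalized execution times (base = 100 per application) convert to the quoted speedups by taking the ratio of base time to reduced time:

```python
# Normalized execution times with instruction reuse, read from the slide's chart.
times = {"JPEG decode": 92, "MPEG decode": 89, "Speech synthesis": 84}

# Speedup = base time / time with instruction reuse.
speedups = {app: round(100 / t, 2) for app, t in times.items()}
```

These ratios fall between 1.09X and 1.19X, consistent with the 1.04X to 1.20X range reported across all the benchmarks.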

  25. Summary • Goal: Use cache transistors effectively for all workloads • Reconfigurable Caches: Flexibility to reuse cache SRAM • Simple organization and design changes • Small impact on cache access time • Several applications possible • Instruction reuse - reuse memory for computation • 1.04X to 1.20X performance improvement • More aggressive reconfiguration currently under investigation

  26. More information available at • http://www.ece.rice.edu/~parthas • parthas@rice.edu
