
Recent Results at UCR with Configurable Cache and Hw/Sw Partitioning





Presentation Transcript


  1. Recent Results at UCR with Configurable Cache and Hw/Sw Partitioning
  Frank Vahid, Associate Professor
  Dept. of Computer Science and Engineering, University of California, Riverside
  Also with the Center for Embedded Computer Systems at UC Irvine
  http://www.cs.ucr.edu/~vahid

  2. Trend Toward Pre-Fabricated Platforms: ASSPs
  • ASSP: application-specific standard product
    • A domain-specific pre-fabricated IC, e.g., a digital camera IC
  • ASIC: application-specific IC
    • A unique IC design; the count ignores quantity of the same IC
  • ASSP revenue > ASIC revenue, and ASSP design starts > ASIC design starts
  • ASIC design starts are decreasing, due to the strong benefits of using pre-fabricated devices
  Source: Gartner/Dataquest, September 2001

  3. Will High-End ICs Still Be Made?
  • YES, but they are becoming out of reach of mainstream designers
    • The point is that mainstream designers likely won't be making them
  • Only very high-volume or very high-cost products
  • Platforms are one such product: high volume
    • They need to be highly configurable to adapt to different applications and constraints

  4. UCR Focus
  • Configurable Cache
  • Hardware/Software Partitioning

  5. UCR Focus
  • Configurable Cache
  • Hardware/Software Partitioning

  6. Configurable Cache: Why
  • ARM920T: caches consume half of total power (Segars '01)
  • M*CORE: unified cache consumes half of total power (Lee/Moyer/Arends '99)
  [Figure: a pre-fabricated platform (a pre-designed system-level architecture): one IC containing a uP with L1 caches, DSP, FPGA, JPEG decoder, and peripherals]

  7. Best Cache for Embedded Systems?
  • Not clear: huge variety among popular embedded processors
  • What's the best associativity, line size, and total size?

  8. Cache Associativity
  • Direct-mapped cache (1-way set associative)
    • Certain address bits "index" into the cache; the remaining "tag" bits are compared
  • Set-associative cache
    • Multiple "ways": fewer index bits, more tag bits, simultaneous comparisons
    • More expensive, but better hit rate
  [Figure: addresses A, B, C, and D mapping into a direct-mapped cache, where C and D conflict on the same index, versus a 2-way set-associative cache that holds both]
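The index/tag arithmetic behind this slide can be sketched as follows; the concrete parameters (8 Kbyte cache, 32-byte lines, 32-bit addresses) are illustrative assumptions, not taken from the slide:

```python
# Sketch: how associativity changes the tag/index split of an address.
# Cache size, line size, and address width below are illustrative.

def split_address(addr, cache_bytes=8192, line_bytes=32, ways=1):
    """Return (tag, index, offset) fields of addr for the given cache shape."""
    sets = cache_bytes // (line_bytes * ways)   # more ways -> fewer sets
    offset_bits = line_bytes.bit_length() - 1   # log2(line_bytes)
    index_bits = sets.bit_length() - 1          # log2(sets)
    offset = addr & (line_bytes - 1)
    index = (addr >> offset_bits) & (sets - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, index, offset

addr = 0x12345678
for ways in (1, 2, 4):
    tag, index, offset = split_address(addr, ways=ways)
    print(f"{ways}-way: {8192 // (32 * ways)} sets, index={index}, tag=0x{tag:x}")
```

Doubling the associativity halves the number of sets, so one index bit moves into the tag; this is exactly the bit-shuffling that way concatenation (later slides) makes configurable.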

  9. Cache Associativity
  • Reduces miss rate, thus improving performance
  • Impact on power and energy? (Energy = Power * Time)

  10. Associativity is Costly
  • Associativity improves hit rate, but at the cost of more power per access
  • Are the power savings from reduced misses outweighed by the increased power per hit?
  [Charts: energy-per-access breakdown for an 8 Kbyte, 4-way set-associative cache, and energy per access for an 8 Kbyte cache (considering dynamic power only)]
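The tradeoff this slide raises can be written as a one-line energy model; every number below is hypothetical, chosen only to show that either configuration can win depending on the miss rates:

```python
# Sketch of the energy tradeoff: a 4-way cache costs more per access
# (parallel tag compares and data reads) but misses less often.
# All figures below are hypothetical, for illustration only.

def cache_energy(accesses, miss_rate, e_access_nj, e_miss_nj):
    """Total dynamic energy (nJ): every access pays e_access; misses add e_miss."""
    return accesses * e_access_nj + accesses * miss_rate * e_miss_nj

accesses = 1_000_000
e_miss = 50.0    # hypothetical cost of servicing one miss off-chip (nJ)

# Direct-mapped: cheap per access, but a higher miss rate (hypothetical)
e_dm = cache_energy(accesses, miss_rate=0.05, e_access_nj=0.5, e_miss_nj=e_miss)
# 4-way: more expensive per access, but a lower miss rate (hypothetical)
e_4w = cache_energy(accesses, miss_rate=0.02, e_access_nj=1.2, e_miss_nj=e_miss)

print(f"direct-mapped: {e_dm / 1e6:.2f} mJ, 4-way: {e_4w / 1e6:.2f} mJ")
```

With these numbers the 4-way cache wins; shrink its miss-rate advantage and the direct-mapped cache wins instead, which is precisely the dilemma of the next two slides.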

  11. Associativity and Energy
  • The best-performing cache is not always the lowest-energy cache
  • Some configurations have significantly poorer energy

  12. Associativity Dilemma
  • Direct-mapped cache
    • Good hit rate on most examples, with low power per access
    • But poor hit rate on some examples, causing high power due to many misses
  • Four-way set-associative cache
    • Good hit rate on nearly all examples
    • But high power per access: overkill for most examples, thus wasting energy
  • Dilemma: design for the average case or the worst case?

  13. Associativity Dilemma
  • Obviously not a clear choice
  • Previous work
    • Albonesi proposed a configurable cache with way-shutdown ability to save dynamic power
    • Motorola's M*CORE did so also

  14. Our Solution: Way-Concatenatable Cache
  • Can be configured as 4, 2, or 1 way
  • Ways can be concatenated
  [Figure: in a concatenated configuration, an extra address bit selects the way]

  15. Configurable Cache Design: Way Concatenation (4, 2, or 1 way)
  • Small area and performance overhead
  [Figure: the address splits into tag (a31-a13), index (a12-a5), and line offset (a4-a0); a configuration circuit (registers reg0/reg1, signals c0-c3) gates a11 and a12 into the index, combining the four 6x64 data arrays; the tag comparison, column mux, sense amps, and data-output mux driver lie on the critical path]
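The addressing effect of way concatenation can be sketched in a few lines, following the slide's geometry of four 6x64 banks (64 sets each, index a10..a5, with a11 and a12 normally tag bits); any detail beyond what the slide shows is an assumption:

```python
# Sketch of way-concatenation addressing: the same four physical banks serve
# as a 4-, 2-, or 1-way cache. Concatenating ways moves a11 (and then a12)
# from the tag into the index, so fewer but larger "ways" are probed and
# fewer tag comparisons fire. Geometry per the slide; the rest is assumed.

def concat_lookup(addr, ways):
    """Return (set_index, tag_compares) for a 4-, 2-, or 1-way configuration."""
    base = (addr >> 5) & 0x3F        # a10..a5: 64 sets per bank
    a11 = (addr >> 11) & 1
    a12 = (addr >> 12) & 1
    if ways == 4:
        return base, 4               # all 4 banks probed, 4 tag comparisons
    if ways == 2:
        return (a11 << 6) | base, 2  # a11 joins the index, selects a bank pair
    return (a12 << 7) | (a11 << 6) | base, 1  # 1-way: a12 and a11 both index

addr = 0x18A0                        # a12=1, a11=1, a10..a5 = 5
for w in (4, 2, 1):
    print(f"{w}-way:", concat_lookup(addr, w))
```

The key point the circuit exploits is that no data is moved when reconfiguring: only the steering of a11/a12 (into index versus tag) changes, which is why the area and critical-path overhead stay small.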

  16. Way Concatenate Experiments
  • Experiment: Motorola PowerStone benchmark g3fax
  • Considering dynamic power only: L1 access energy, CPU stall energy, memory access energy
  • Way concatenate outperforms both 4-way and direct-mapped
  • Just as good as way shutdown

  17. Way Concatenate Experiments
  • Considered 23 programs (PowerStone, MediaBench, and SPEC2000)
  • Dynamic power only: L1 access energy, CPU stall energy, memory access energy
  • Way concatenate
    • Better than way shutdown (due to less performance penalty)
    • Saves energy over a conventional 4-way cache
    • Also avoids the big penalties of 1-way on some programs
  (Chart baseline: 100% = conventional 4-way cache)

  18. Way Concatenate Experiments
  • The best configuration varies
  • Need to tune the configuration to a given program

  19. Normalized Execution Times
  • Way shutdown suffers a performance penalty, as does direct-mapped
  • Way concatenate has almost no performance penalty
    • Though its critical path is 3% longer than a conventional 4-way cache's

  20. Way Shutdown for Static Power Savings
  • Albonesi and Motorola used logic to gate the clock
    • Reduced dynamic power, but not static (leakage) power
  • Way concatenate is clearly superior for reducing dynamic power
  • Shutting down ways is still useful to save static power
    • But we'll use another method (Agarwal's DRG-cache)
  [Figure: an SRAM cell with gated-Vdd control inserted between the cell and ground]

  21. Way Concatenate Plus Way Shutdown
  • We set static power = 30% of dynamic power
  • Way shutdown is now preferred in many examples
  • But way concatenate is still very helpful
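Why shutdown becomes preferable once leakage counts can be shown with a tiny model; the slide fixes static power at 30% of dynamic, and the remaining figures are hypothetical:

```python
# Sketch: once static (leakage) power matters, shutting ways down can beat
# concatenation. Per the slide, static energy is set to 30% of a conventional
# cache's dynamic energy; all other numbers are hypothetical.

def total_energy(dynamic_nj, active_way_fraction, static_base_nj):
    """Static energy scales with the fraction of ways still powered."""
    return dynamic_nj + static_base_nj * active_way_fraction

static_base = 0.3 * 1000.0  # 30% of a hypothetical 1000 nJ dynamic baseline

# Concatenated as 1-way: all 4 banks stay powered, so full leakage
concat = total_energy(dynamic_nj=600.0, active_way_fraction=1.0,
                      static_base_nj=static_base)
# Shut down to 1 way: only 1 of 4 banks leaks, but extra misses raise dynamic
shutdown = total_energy(dynamic_nj=700.0, active_way_fraction=0.25,
                        static_base_nj=static_base)
print(concat, shutdown)  # here shutdown wins despite its higher dynamic energy
```

With a smaller leakage fraction the comparison flips back, which is why the slide still finds way concatenation very helpful and why the two mechanisms are combined.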

  22. Configurable Line Size Too
  • The best line size also differs per example
  • Our cache can be configured for a line size of 16, 32, or 64 bytes
  • 64 bytes is usually best, but 16 is much better in a couple of cases
  (Chart baseline: 100% = conventional 4-way cache; "csb" = concatenate-plus-shutdown cache)

  23. Configurable Cache
  • A configurable cache with way concatenation, way shutdown, and variable line size can save a lot of energy
  • Well suited for configurable devices like Triscend's

  24. UCR Focus
  • Configurable Cache
  • Hardware/Software Partitioning

  25. Using On-Chip FPGA to Reduce Sw Energy
  • Hennessy/Patterson: "The best way to save power is to have less hardware" (pg. 392)
  • Actually, the best way is to have less ACTIVE hardware
  • Paradoxically, MORE hardware can actually REDUCE power, as long as overall activity is reduced
  • How?

  26. Using On-Chip FPGA to Reduce Sw Energy
  • Move critical sw loops to FPGA
    • The loop executes in 1/10th the time
  • Use this time to power down the system longer during the task period
  • Alternatively, slow down the microprocessor using voltage scaling
  [Figure: on the pre-fabricated platform (uP, DSP, FPGA, L1 cache, JPEG decoder, peripherals), a task-period timeline: the uP-only version is active then idle, while the uP+FPGA version finishes sooner and idles longer]
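The energy argument above reduces to simple arithmetic over one task period; every number in this sketch is hypothetical, chosen only to make the mechanism visible:

```python
# Sketch: finishing a task sooner leaves more of the fixed task period in a
# low-power idle state, so total energy drops even if the uP+FPGA pair draws
# more power while active. All numbers are hypothetical.

def task_energy(period_ms, active_ms, p_active_mw, p_idle_mw):
    """Energy (uJ) over one task period: active phase plus idle remainder."""
    idle_ms = period_ms - active_ms
    return active_ms * p_active_mw + idle_ms * p_idle_mw

period = 100.0  # the task repeats every 100 ms (hypothetical)

# Software only: the loop dominates, uP active 50 ms at 200 mW, idling at 10 mW
sw_only = task_energy(period, active_ms=50.0, p_active_mw=200.0, p_idle_mw=10.0)
# Critical loop on FPGA: active phase shrinks to 20 ms; assume uP+FPGA draw
# somewhat more (250 mW) while active
with_fpga = task_energy(period, active_ms=20.0, p_active_mw=250.0, p_idle_mw=10.0)

print(sw_only, with_fpga)  # with_fpga is lower despite the higher active power
```

This is the "more hardware, less energy" paradox of the previous slide in numbers: activity, not hardware, is what costs energy.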

  27. The 90-10 Rule (or 80-20 Rule)
  • Most software time is spent in a few small loops
    • e.g., the MediaBench and NetBench benchmarks
  • Known as the 90-10 rule: 10% of the code accounts for 90% of the execution time
  • Move those loops to FPGA

  28. Hardware/Software Partitioning Results
  • A speedup of 3.2 and energy savings of 34%, obtained with only 10,500 gates on average
  • Simulation based

  29. Analysis of Ideal Speedup
  • Each loop is 10x faster in hw (an average based on observations)
  • Notice the leveling off after the first couple of loops (due to the 90-10 rule)
  • Thus, most speedup comes from the first few loops
  • Good for us: a moderate amount of FPGA gives most of the speedup
  • How much FPGA?
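The leveling off follows directly from Amdahl's law; the per-loop time fractions in this sketch are hypothetical, with only the 10x hardware speedup taken from the slide:

```python
# Sketch of why ideal speedup levels off: each successive loop moved to
# hardware (at 10x, per the slide) covers a smaller slice of execution time.
# The per-loop time fractions below are a hypothetical 90-10-style profile.

def speedup(fractions_moved, hw_speedup=10.0):
    """Amdahl's law: code covering sum(fractions_moved) runs hw_speedup faster."""
    moved = sum(fractions_moved)
    return 1.0 / ((1.0 - moved) + moved / hw_speedup)

loop_fractions = [0.50, 0.20, 0.10, 0.05, 0.03]  # hottest loop first
for n in range(1, len(loop_fractions) + 1):
    print(f"top {n} loop(s) in hw: speedup {speedup(loop_fractions[:n]):.2f}x")
```

The first loop alone yields most of the gain, and each additional loop adds less, matching the slide's observation that a moderate amount of FPGA captures most of the speedup.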

  30. Speedup Gained with Relatively Few Gates
  • Manually created several partitioned versions of each benchmark
  • Most speedup gained with the first 20,000 gates: surprisingly few
  • Stitt, Grattan and Vahid, Field-Programmable Custom Computing Machines (FCCM), 2002
  • Stitt and Vahid, IEEE Design and Test, Dec. 2002
  • J. Villarreal, D. Suresh, G. Stitt, F. Vahid and W. Najjar, Design Automation of Embedded Systems, 2002 (to appear)

  31. Impact of Microprocessor/FPGA Clock Ratio
  • The previous data assumed equal clock frequencies
  • A faster microprocessor has a significant impact
  • Analyzed 1:1, 2:1, 3:1, 4:1, and 5:1 ratios
  • Planning additional such analyses: memory bandwidth, power ratios, and more
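The clock-ratio effect can be folded into the same Amdahl-style model; this is an illustrative sketch of the idea, not the slide's actual analysis, and the 80% loop fraction is a hypothetical value:

```python
# Sketch of the clock-ratio effect: if the uP clock is r times the FPGA
# clock, a loop that is 10x faster in hardware at 1:1 is only (10/r)x faster
# relative to that uP. Model and loop fraction are illustrative assumptions.

def speedup_with_ratio(f_moved, hw_speedup=10.0, clock_ratio=1.0):
    effective = hw_speedup / clock_ratio  # hw gain shrinks as the uP speeds up
    return 1.0 / ((1.0 - f_moved) + f_moved / effective)

for r in (1, 2, 3, 4, 5):  # the 1:1 through 5:1 ratios the slide analyzed
    print(f"{r}:1 clock ratio -> speedup {speedup_with_ratio(0.8, clock_ratio=r):.2f}x")
```

Even at 5:1 the partition still helps in this sketch, but the shrinking margin shows why the microprocessor/FPGA frequency ratio matters so much to the results.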

  32. Software Improvements Using On-Chip Configurable Logic, Verified Through Physical Measurement
  • Performed physical measurements on Triscend A7 and E5 devices
  • Similar results (even a bit better)
  [Photo: Triscend A7 development board with the A7 IC]

  33. Other Research Directions: Tiny Caches
  • Impact of tiny caches on instruction fetch power
    • Filter caches, dynamic loop cache, preloaded loop cache
  • Gordon-Ross, Cotterell, Vahid, Comp. Arch. Letters, 2002
  • Gordon-Ross, Vahid, ICCD 2002
  • Cotterell, Vahid, ISSS 2002 and ICCAD 2002
  • Gordon-Ross, Cotterell, Vahid, IEEE TECS, 2002
  [Figure: a mux feeds the processor's instruction fetch from either the loop cache or the L1 cache/I-mem]

  34. Other Research Directions: Platform-Based CAD
  • Use the physical platform to aid the search of the configuration space
    • Configure the cache and the hw/sw partition
    • Configure, execute, and measure
  • Goal: define the best cooperation between desktop CAD and the platform
  • NSF grant 2002-2005 (with N. Dutt at UC Irvine)

  35. Other Research Directions: Dynamic Hw/Sw Partitioning
  • My favorite
  • Add an on-chip component that:
    • Detects the most frequent sw loops
    • Decompiles a loop
    • Performs compiler optimizations
    • Synthesizes it to a netlist
    • Places and routes the netlist onto the FPGA
    • Updates the sw to call the FPGA
  • A self-improving IC
    • Can be invisible to the designer; appears as an efficient processor
  • Can also dynamically tune the cache configuration
  [Figure: processor with I$/D$, profiler, configurable logic, memories, and a DMA processor on one chip]
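The first step of that on-chip component, detecting the most frequent software loops, can be sketched in software: a loop's closing branch jumps backward to its head, so counting taken backward branches finds the hot loops. The trace format and thresholds here are assumptions for illustration, not the actual profiler design:

```python
# Sketch of frequent-loop detection: count taken backward branches, since a
# loop's closing branch targets an earlier address (the loop head). The
# (branch_pc, target_pc) trace format is an assumption for this sketch.

from collections import Counter

def find_hot_loops(branch_trace, top_n=3):
    """Return the top_n candidate loops as ((head, tail), count) pairs."""
    counts = Counter()
    for pc, target in branch_trace:
        if target < pc:                # backward branch => likely a loop
            counts[(target, pc)] += 1  # loop body spans [target, pc]
    return counts.most_common(top_n)

# Toy trace: a loop at 0x100-0x120 runs 1000 times, one at 0x400-0x440 runs
# 50 times, plus a forward branch that is correctly ignored
trace = [(0x120, 0x100)] * 1000 + [(0x440, 0x400)] * 50 + [(0x200, 0x300)] * 5
print(find_hot_loops(trace))
```

A hardware profiler would keep only a small table of such counters rather than a full trace, but the ranking principle (by 90-10, a few entries dominate) is the same.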

  36. Current Researchers Working in Embedded Systems at UCR
  • Prof. Frank Vahid: 5 Ph.D. students, 2 M.S.
  • Prof. Walid Najjar: 3 Ph.D. students, 1 M.S., working on hw/sw partitioning and on compiling C to FPGAs
  • Prof. Tom Payne: 1 Ph.D. student, working on compiling C to FPGAs
  • Prof. Jun Yang (new hire): working on low-power architectures (frequent value detection)
  • Prof. Harry Hsieh: 2 Ph.D. students, working on formal verification of system models
  • Prof. Sheldon Tan (new hire): 1 Ph.D. student, working on physical design and analog synthesis

  37. Conclusions
  • Highly configurable platforms have a bright future
    • Cost equations just don't justify ASIC production as much as before
    • Triscend parts are well situated; close collaboration desired
  • A configurable cache improves memory energy
    • Way concatenation is effective at reducing dynamic power
    • Way shutdown saves static power
    • A variable line size reduces traffic
    • All must be tuned to a particular program: tuning is CRUCIAL to low energy
  • Configurable logic improves software energy, without requiring excessive amounts of hardware
  • Many exciting avenues to investigate!
