Presentation Transcript

  1. TPUTCACHE: HIGH-FREQUENCY, MULTI-WAY CACHE FOR HIGH-THROUGHPUT FPGA APPLICATIONS Aaron Severance, University of British Columbia. Advised by Guy Lemieux

  2. Our Problem • We use overlays for data processing • Partially/fully fixed processing elements • Virtual CGRAs, soft vector processors • Memory: • Large register files/scratchpad in overlay • Low latency, local data • Trivial (large DMA): burst to/from DDR • Non-trivial?

  3. Scatter/Gather • Data-dependent store/load • vscatter adr_ptr, idx_vect, data_vect • for i in 1..N • adr_ptr[idx_vect[i]] <= data_vect[i] • Random narrow (32-bit) accesses • Waste bandwidth on DDR interfaces
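
A minimal C sketch of the scatter loop above; the names adr_ptr, idx_vect, and data_vect come from the slide, while the function signature and element types are assumptions:

    #include <stdint.h>
    #include <stddef.h>

    /* Scatter: data-dependent stores. Element i of data_vect is written
     * to adr_ptr at the position named by idx_vect[i]. Each store is a
     * random, narrow (32-bit) access. */
    void vscatter(uint32_t *adr_ptr, const size_t *idx_vect,
                  const uint32_t *data_vect, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            adr_ptr[idx_vect[i]] = data_vect[i];
    }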

  4. If Data Fits on the FPGA… • BRAMs with interconnect network • General network… • Not customized per application • Shared: all masters <-> all slaves • Memory mapped BRAM • Double-pump (2x clk) if possible • Banking/LVT/etc. for further ports
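
As a rough software model of the memory-mapped, banked BRAM idea: low word-address bits select the bank so consecutive words spread across banks. The bank count and 32-bit word size here are assumptions, not from the slides:

    #include <stdint.h>

    #define NUM_BANKS 4  /* assumed power-of-two bank count */

    /* Byte address -> (bank, index within bank), for 32-bit words. */
    static inline unsigned bank_of(uint32_t byte_addr)
    {
        return (byte_addr >> 2) & (NUM_BANKS - 1);
    }

    static inline uint32_t index_of(uint32_t byte_addr)
    {
        return (byte_addr >> 2) / NUM_BANKS;
    }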

  5. Example BRAM system

  6. But if data doesn’t fit… (oversimplified)

  7. So Let’s Use a Cache • But a throughput-focused cache • Low-latency data held in local memories • Amortize latency over multiple accesses • Focus on bandwidth

  8. Replace on-chip memory or augment the memory controller? • Data fits on-chip • Want BRAM-like speed and bandwidth • Low overhead compared to shared BRAM • Data doesn’t fit on-chip • Use ‘leftover’ BRAMs for performance

  9. TputCache Design Goals • Fmax near BRAM Fmax • Fully pipelined • Support multiple outstanding misses • Write coalescing • Associativity

  10. TputCache Architecture • Replay-based architecture • Misses are reinserted into the pipeline • Separate line fill/evict logic runs in the background • Token FIFO completes requests in order • No MSHRs for tracking misses • Fewer muxes (only a single replay-request mux) • 6-stage pipeline -> 6 outstanding misses • Good performance with a high hit rate • Common case fast
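
A toy, direct-mapped software model of the replay control flow (the real TputCache is a 6-stage, 4-way hardware pipeline with random replacement and a background evict/fill engine; here the fill is modeled as instantaneous and all names are illustrative):

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    #define LINES     256
    #define LINE_BITS 5            /* 32-byte lines */

    static uint32_t tag[LINES];
    static bool     valid[LINES];

    static bool lookup(uint32_t addr)
    {
        uint32_t line = addr >> LINE_BITS;
        return valid[line % LINES] && tag[line % LINES] == line / LINES;
    }

    static void fill(uint32_t addr)  /* background evict/fill, modeled as instant */
    {
        uint32_t line = addr >> LINE_BITS;
        valid[line % LINES] = true;
        tag[line % LINES]   = line / LINES;
    }

    /* A miss is reinserted into the pipeline (replayed) instead of being
     * tracked in an MSHR; it completes on a later pass once the fill lands. */
    static void issue(uint32_t addr)
    {
        while (!lookup(addr))        /* miss: replay the request */
            fill(addr);
        printf("hit 0x%08x\n", addr); /* retire in order via token FIFO */
    }

    int main(void)
    {
        issue(0x1000);
        issue(0x1004);               /* same line: hits on first pass */
        return 0;
    }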

  11. TputCache Architecture

  12. Cache Hit

  13. Cache Miss

  14. Evict/Fill Logic

  15. Area & Fmax Results • Reaches 253MHz compared to 270MHz BRAM fmax on Cyclone IV • 423MHz compared to 490MHz BRAM fmax on Stratix IV • Minor degredation with increasing size, associativity • 13% to 35% extra BRAM usage for tags, queues

  16. Benchmark Setup • TputCache: 128 kB, 4-way, 32-byte lines • MXP soft vector processor: 16 lanes, 128 kB scratchpad memory • Scatter/Gather memory unit: indexed loads/stores per lane • Double-pumping port adapters: TputCache runs at 2x the MXP frequency

  17. MXP Soft Vector Processor

  18. Histogram • Instantiate a number of Virtual Processors (VPs) mapped across lanes • Each VP histograms part of the image • Final pass to sum VP partial histograms
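
A C sketch of the scheme on this slide: each virtual processor builds a private partial histogram over its share of the pixels, avoiding read-modify-write conflicts, then a final pass sums the partials. The bin count, VP count, and striding of pixels across VPs are assumptions:

    #include <stdint.h>
    #include <stddef.h>
    #include <string.h>

    #define BINS 256
    #define VPS  16                          /* assumed: one VP per lane */

    void histogram(const uint8_t *img, size_t npix, uint32_t hist[BINS])
    {
        uint32_t partial[VPS][BINS];
        memset(partial, 0, sizeof partial);

        for (size_t vp = 0; vp < VPS; vp++)  /* in hardware: VPs run in parallel */
            for (size_t i = vp; i < npix; i += VPS)
                partial[vp][img[i]]++;

        memset(hist, 0, BINS * sizeof *hist);
        for (size_t b = 0; b < BINS; b++)    /* final pass: sum VP partials */
            for (size_t vp = 0; vp < VPS; vp++)
                hist[b] += partial[vp][b];
    }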

  19. Hough Transform • Convert an image to 2D Hough Space (angle, radius) • Each vector element calculates the radius for a given angle • Adds pixel value to counter
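
A C sketch of one accumulator update: each vector element handles one angle, computes the radius for that angle, and adds the pixel value to the (angle, radius) counter, a scatter-style access. The angle and radius ranges are assumptions:

    #include <math.h>
    #include <stdint.h>

    #define ANGLES 180
    #define MAX_R  512   /* assumed radius range */

    void hough_accumulate(int x, int y, uint8_t pix,
                          uint32_t acc[ANGLES][MAX_R])
    {
        const double PI = 3.14159265358979323846;
        for (int a = 0; a < ANGLES; a++) {   /* one angle per vector element */
            double theta = a * PI / ANGLES;
            int r = (int)lround(x * cos(theta) + y * sin(theta));
            if (r >= 0 && r < MAX_R)
                acc[a][r] += pix;            /* scatter-style counter update */
        }
    }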

  20. Motion Compensation • Load a block from the reference image and interpolate • The block is offset by a small amount from its location in the current image
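
A simplified C sketch: fetch a block from the reference frame at a small offset (dx, dy) from the current-block position, with half-pel horizontal averaging standing in for the real interpolation filter. The block size, stride, and two-tap filter are assumptions:

    #include <stdint.h>

    #define B 8  /* assumed block size */

    /* Caller must keep the (x+dx, y+dy) window inside the reference frame. */
    void mc_block(const uint8_t *ref, int stride,
                  int x, int y, int dx, int dy, uint8_t out[B][B])
    {
        for (int r = 0; r < B; r++)
            for (int c = 0; c < B; c++) {
                const uint8_t *p = ref + (y + dy + r) * stride + (x + dx + c);
                out[r][c] = (uint8_t)((p[0] + p[1] + 1) >> 1); /* half-pel avg */
            }
    }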

  21. Future Work • More ports needed for scalability • Share the evict/fill BRAM port with a 2nd request • Banking (sharing the same evict/fill logic) • Multiported BRAM designs • Write cache • Currently allocate-on-write • Track per-byte dirty state in the BRAMs’ 9th bits • Non-blocking behavior • Multiple token FIFOs (one per requestor)?

  22. FAQ • Coherency: envisioned as the only/last-level cache; coherency is future work • Replay loops/problems: mitigated by random replacement + associativity • Power: expected to be not great…

  23. Conclusions • TputCache: an alternative to shared BRAM • Low overhead (13% to 35% extra BRAM) • Nearly as high Fmax (253 MHz vs. 270 MHz on Cyclone IV) • More flexible than shared BRAM • Performance degrades gradually • Cache behavior instead of manual filling

  24. Questions? • Thank you