
TPUTCACHE: HIGH-FREQUENCY, MULTI-WAY CACHE FOR HIGH-THROUGHPUT FPGA APPLICATIONS


Presentation Transcript


  1. TPUTCACHE: HIGH-FREQUENCY, MULTI-WAY CACHE FOR HIGH-THROUGHPUT FPGA APPLICATIONS Aaron Severance, University of British Columbia. Advised by Guy Lemieux

  2. Our Problem • We use overlays for data processing • Partially/fully fixed processing elements • Virtual CGRAs, soft vector processors • Memory: • Large register files/scratchpad in overlay • Low latency, local data • Trivial (large DMA): burst to/from DDR • Non-trivial?

  3. Scatter/Gather • Data-dependent store/load • vscatter adr_ptr, idx_vect, data_vect • for i in 1..N • adr_ptr[idx_vect[i]] <= data_vect[i] • Random narrow (32-bit) accesses • Waste bandwidth on DDR interfaces
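
For reference, a minimal C model of the scatter semantics above (with the matching gather for comparison). The function names and types are illustrative, not the MXP API.

```c
#include <stdint.h>
#include <stddef.h>

/* Reference behavior of the scatter on the slide:
 * adr_ptr[idx_vect[i]] <= data_vect[i] for each element. */
static void scatter_ref(uint32_t *adr_ptr, const uint32_t *idx_vect,
                        const uint32_t *data_vect, size_t n)
{
    for (size_t i = 0; i < n; i++)
        adr_ptr[idx_vect[i]] = data_vect[i];   /* random narrow (32-bit) writes */
}

/* The matching gather: dst_vect[i] <= adr_ptr[idx_vect[i]]. */
static void gather_ref(uint32_t *dst_vect, const uint32_t *adr_ptr,
                       const uint32_t *idx_vect, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst_vect[i] = adr_ptr[idx_vect[i]];    /* random narrow (32-bit) reads */
}
```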

  4. If Data Fits on the FPGA… • BRAMs with interconnect network • General network… • Not customized per application • Shared: all masters <-> all slaves • Memory mapped BRAM • Double-pump (2x clk) if possible • Banking/LVT/etc. for further ports
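
As a rough illustration of the banking bullet above (not taken from the talk), one plausible way to split a byte address into a bank index and a word offset, assuming a power-of-two bank count and 32-bit words:

```c
#include <stdint.h>

#define NUM_BANKS  4u    /* assumed bank count (power of two) */
#define WORD_BYTES 4u    /* 32-bit data words                 */

typedef struct {
    uint32_t bank;       /* which BRAM bank services the access */
    uint32_t offset;     /* word offset within that bank        */
} bram_slot_t;

/* Low-order interleaving: consecutive words land in different banks,
 * so unit-stride accesses spread across all ports. */
static bram_slot_t map_address(uint32_t byte_addr)
{
    uint32_t word = byte_addr / WORD_BYTES;
    bram_slot_t s = { word % NUM_BANKS, word / NUM_BANKS };
    return s;
}
```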

  5. Example BRAM system

  6. But if data doesn’t fit… (oversimplified)

  7. So Let’s Use a Cache • But a throughput-focused cache • Low-latency data held in local memories • Amortize latency over multiple accesses • Focus on bandwidth

  8. Replace on-chip memory or augment memory controller? • Data fits on-chip • Want BRAM-like speed, bandwidth • Low overhead compared to shared BRAM • Data doesn’t fit on-chip • Use ‘leftover’ BRAMs for performance

  9. TputCache Design Goals • Fmax near BRAM Fmax • Fully pipelined • Support multiple outstanding misses • Write coalescing • Associativity

  10. TputCache Architecture • Replay-based architecture • Reinsert misses back into the pipeline • Separate line fill/evict logic in background • Token FIFO for completing requests in order • No MSHRs for tracking misses • Fewer muxes (only a single replay request mux) • 6-stage pipeline -> 6 outstanding misses • Good performance with high hit rate • Common case fast
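
A heavily simplified, software-only model of the replay idea, assuming a toy direct-mapped organization and a word-addressed backing store. The real TputCache is a 6-stage RTL pipeline with a token FIFO and a separate evict/fill unit; this sketch only shows the "miss = reinsert and retry until it hits" behavior, with all names and sizes illustrative.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define LINES      16                       /* toy cache: 16 lines        */
#define LINE_WORDS 8                        /* 8 x 32-bit words per line  */

typedef struct { bool valid; uint32_t tag; uint32_t data[LINE_WORDS]; } line_t;
static line_t   cache[LINES];
static uint32_t backing[1 << 12];           /* stand-in for external DDR  */

static bool lookup(uint32_t addr, uint32_t *out)   /* tag check + data read */
{
    uint32_t idx = (addr / LINE_WORDS) % LINES;
    uint32_t tag =  addr / (LINE_WORDS * LINES);
    if (cache[idx].valid && cache[idx].tag == tag) {
        *out = cache[idx].data[addr % LINE_WORDS];
        return true;
    }
    return false;
}

/* Evict/fill logic; done by a separate background unit in hardware. */
static void fill(uint32_t addr)
{
    uint32_t idx  = (addr / LINE_WORDS) % LINES;
    uint32_t base = (addr / LINE_WORDS) * LINE_WORDS;
    memcpy(cache[idx].data, &backing[base], sizeof cache[idx].data);
    cache[idx].tag   = addr / (LINE_WORDS * LINES);
    cache[idx].valid = true;
}

/* One read request; the token stands in for the token FIFO entry that
 * keeps completions in request order. */
static uint32_t read_request(uint32_t addr, uint32_t token)
{
    uint32_t data;
    int replays = 0;
    while (!lookup(addr, &data)) {          /* miss: replay the request     */
        fill(addr);                         /* (modeled inline here)        */
        replays++;
    }
    printf("token %u: word %u = %u after %d replay(s)\n", token, addr, data, replays);
    return data;
}

int main(void)
{
    for (uint32_t i = 0; i < (1u << 12); i++) backing[i] = 3 * i;
    read_request(100, 0);                   /* cold miss: one replay        */
    read_request(101, 1);                   /* same line: immediate hit     */
    return 0;
}
```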

  11. TputCache Architecture

  12. Cache Hit

  13. Cache Miss

  14. Evict/Fill Logic

  15. Area & Fmax Results • Reaches 253MHz compared to 270MHz BRAM fmax on Cyclone IV • 423MHz compared to 490MHz BRAM fmax on Stratix IV • Minor degredation with increasing size, associativity • 13% to 35% extra BRAM usage for tags, queues

  16. Benchmark Setup • TputCache • 128kB, 4-way, 32-byte lines • MXP soft vector processor • 16 lanes, 128kB scratchpad memory • Scatter/Gather memory unit • Indexed loads/stores per lane • Double-pumping port adapters • TputCache runs at 2x the frequency of MXP
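
As a rough sanity check on this configuration (and the tag/queue BRAM overhead quoted on the results slide above), a back-of-the-envelope geometry calculation. The 32-bit byte-address width is an assumption; the reported 13%-35% overhead also covers dirty bits, queues, and BRAM-block granularity, so the raw tag-store percentage computed below is only a lower bound.

```c
#include <stdio.h>

/* Geometry of a 128 kB, 4-way cache with 32-byte lines,
 * assuming 32-bit byte addresses (illustrative assumption). */
int main(void)
{
    const unsigned cache_bytes = 128 * 1024;
    const unsigned line_bytes  = 32;
    const unsigned ways        = 4;
    const unsigned addr_bits   = 32;                   /* assumed      */

    unsigned lines       = cache_bytes / line_bytes;   /* 4096 lines   */
    unsigned sets        = lines / ways;               /* 1024 sets    */
    unsigned offset_bits = 5;                          /* log2(32)     */
    unsigned index_bits  = 10;                         /* log2(1024)   */
    unsigned tag_bits    = addr_bits - index_bits - offset_bits;  /* 17 */

    unsigned tag_store_bits = lines * (tag_bits + 1);  /* +1 valid bit */
    printf("lines=%u sets=%u tag bits/line=%u tag store ~%u kbit (~%.0f%% of data)\n",
           lines, sets, tag_bits, tag_store_bits / 1024,
           100.0 * tag_store_bits / (cache_bytes * 8.0));
    return 0;
}
```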

  17. MXP Soft Vector Processor

  18. Histogram • Instantiate a number of Virtual Processors (VPs) mapped across lanes • Each VP histograms part of the image • Final pass to sum VP partial histograms
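
Written as plain sequential C, the same partitioning looks like the sketch below; NUM_VPS and BINS are illustrative values rather than the benchmark's actual configuration.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

#define NUM_VPS 64     /* illustrative virtual processor count */
#define BINS    256    /* 8-bit grayscale histogram             */

void histogram(const uint8_t *image, size_t pixels, uint32_t hist[BINS])
{
    static uint32_t partial[NUM_VPS][BINS];
    memset(partial, 0, sizeof partial);

    /* Phase 1: each VP histograms its own slice of the image. */
    for (int vp = 0; vp < NUM_VPS; vp++) {
        size_t begin = pixels *  vp      / NUM_VPS;
        size_t end   = pixels * (vp + 1) / NUM_VPS;
        for (size_t i = begin; i < end; i++)
            partial[vp][image[i]]++;     /* data-dependent (gather/scatter) access */
    }

    /* Phase 2: final pass sums the per-VP partial histograms. */
    for (int b = 0; b < BINS; b++) {
        uint32_t sum = 0;
        for (int vp = 0; vp < NUM_VPS; vp++)
            sum += partial[vp][b];
        hist[b] = sum;
    }
}
```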

  19. Hough Transform • Convert an image to 2D Hough Space (angle, radius) • Each vector element calculates the radius for a given angle • Adds pixel value to counter
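
A scalar C sketch of that accumulation: one (vector) element per angle computes the radius r = x*cos(t) + y*sin(t) and adds the pixel value to the (angle, radius) counter. The quantization and array sizes are illustrative choices, not the benchmark's parameters; accum is assumed zero-initialized by the caller.

```c
#include <stdint.h>
#include <math.h>        /* link with -lm */

#define ANGLES 180
#define RADII  512

void hough_accumulate(const uint8_t *img, int width, int height,
                      uint32_t accum[ANGLES][RADII])
{
    const double PI = 3.14159265358979323846;
    for (int y = 0; y < height; y++) {
        for (int x = 0; x < width; x++) {
            uint8_t pix = img[y * width + x];
            for (int a = 0; a < ANGLES; a++) {      /* one angle per element  */
                double theta = a * PI / ANGLES;
                long r = lround(x * cos(theta) + y * sin(theta));
                if (r >= 0 && r < RADII)
                    accum[a][r] += pix;             /* data-dependent update  */
            }
        }
    }
}
```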

  20. Motion Compensation • Load block from reference image, interpolate • Offset by a small amount from its location in the current image
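
A scalar C sketch of that interpolation step, using an illustrative 8x8 block and a simple half-pel bilinear average (the benchmark's actual filter may differ). The caller must keep the offset block inside the reference image.

```c
#include <stdint.h>

#define BLK 8    /* illustrative block size */

void motion_compensate(const uint8_t *ref, int stride,
                       int x, int y,        /* current block position      */
                       int dx, int dy,      /* small integer motion offset */
                       uint8_t out[BLK][BLK])
{
    for (int j = 0; j < BLK; j++) {
        for (int i = 0; i < BLK; i++) {
            const uint8_t *p = ref + (y + dy + j) * stride + (x + dx + i);
            /* Half-pel interpolation: average the 2x2 neighborhood. */
            out[j][i] = (uint8_t)((p[0] + p[1] + p[stride] + p[stride + 1] + 2) / 4);
        }
    }
}
```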

  21. Future Work • More ports needed for scalability • Share evict/fill BRAM port with a 2nd request • Banking (sharing the same evict/fill logic) • Multiported BRAM designs • Write cache • Currently allocate-on-write • Track dirty state of bytes in the BRAMs’ 9th bit • Non-blocking behavior • Multiple token FIFOs (one per requestor)?

  22. FAQ • Coherency • Envisioned as the only/last-level cache • Future work • Replay loops/problems • Mitigated by random replacement + associativity • Power is expected to be not great…

  23. Conclusions • TputCache: alternative to shared BRAM • Low overhead (13%-35% extra BRAM) • Nearly as high fmax (253MHz vs 270MHz) • More flexible than shared BRAM • Performance degrades gradually • Cache behavior instead of manual filling

  24. Questions? • Thank you
