
Indexing Stream Register Files

Nuwan Jayasena, 10/8/2002. Outline: motivation, architecture overview, usage examples, language and compiler issues, implementation issues.


Presentation Transcript


  1. Indexing Stream Register Files
     Nuwan Jayasena, 10/8/2002

  2. Indexing Stream Register Files
     • Motivation
     • Architecture overview
     • Usage examples
     • Language and compiler issues
     • Implementation issues

  3. Stream Memory Hierarchy
     • Roughly an order of magnitude increase in BW at each level
     • Maximize data reuse at each level
     • Focus on Stream Register File (SRF) for this talk
     [Figure labels: Memory Sys, Streams, Stream RF, Records, Local Registers, Local variables, Compute Units]

  4. SRF Data Reuse
     Temporal or producer-consumer locality
     • Current SRF only supports in-order reuse
       → Indexed access to SRF allows reordered reuse
     [Figure labels: In-order reuse, Reordered reuse, App-dependent reordering, Data-dependent reordering]

  5. SRF-Memory Stream Transfers
     • Types of stream transfers
       • Compulsory: application I/O
       • Capacity: due to SRF capacity pressure
       • Reordering: re-ordering of data already in SRF
     • SRF indexing…
       • Eliminates most reordering transfers
       • Reduces data replication in SRF
       • Eliminates some capacity transfers

  6. Architecture Overview
     • High-level view of SRF indexing implementation
     • Mostly to highlight capabilities and limitations of SRF indexing
     • More detailed view of hardware and mechanisms later

  7. Current Stream Processor Arch.
     • N “lanes”, each with an SRF bank and a compute cluster
     • Cross-lane communication via inter-cluster switch
     [Figure labels: SRF Bank 0, SRF Bank 1, SRF Bank N, Stream buffers, Cluster 0, Cluster 1, Cluster N, Inter-cluster switch]

  8. In-lane SRF Indexing
     • Each cluster can index into its own bank of the SRF
     • Address queue between cluster and SRF bank
     • Sequence of steps for indexed read:
       • Cluster places index in address queue
       • Bank read using index
       • Result placed in stream buffer
       • Cluster reads data from stream buffer
     (+) High bandwidth indexed accesses
     (+) Few changes to existing architecture
     (–) Only 1/N of data structure visible within each cluster
     [Figure labels: SRF Bank X, Cluster X]
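
The four-step indexed read on this slide can be sketched as a small software model. This is only an illustration under stated assumptions: the class `IndexedLane`, its method names, and the one-word-per-cycle bank are hypothetical and are not taken from the slides.

```python
from collections import deque


class IndexedLane:
    """Toy model of one lane: SRF bank, address queue, stream buffer (hypothetical names)."""

    def __init__(self, bank_words):
        self.bank = list(bank_words)   # this lane's slice of the SRF
        self.address_queue = deque()   # indices issued by the cluster
        self.stream_buffer = deque()   # words returned by the bank

    def issue_index(self, index):
        # Step 1: the cluster places an index in the address queue.
        self.address_queue.append(index)

    def bank_cycle(self):
        # Steps 2 and 3: the bank reads one queued index and places
        # the word in the stream buffer (assumed one word per call).
        if self.address_queue:
            index = self.address_queue.popleft()
            self.stream_buffer.append(self.bank[index])

    def read_data(self):
        # Step 4: the cluster reads the word from the stream buffer.
        return self.stream_buffer.popleft()


lane = IndexedLane([10, 11, 12, 13])
for i in (3, 0, 2):
    lane.issue_index(i)
for _ in range(3):
    lane.bank_cycle()
result = [lane.read_data() for _ in range(3)]
```

Because both queues are FIFOs, data returns in the order the indices were issued, which is what lets the cluster pair each read with its index without tags.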

  9. Cross-lane SRF Indexing
     • Any cluster can access any SRF location
     • Adds interconnect between clusters for address communication
     • Data return takes place over existing inter-cluster network
     [Figure labels: SRF Bank 0, SRF Bank 1, SRF Bank 7, SRF address switch, Inter-cluster switch, Cluster 0, Cluster 1, Cluster 7]

  10. Cross-lane SRF Indexing (Contd.)
     • Sequence of steps
       • Clusters place indices in their own index queues
       • Indices broadcast on address switch
       • Arbitrate to resolve bank conflicts
       • Access SRF banks and return data via inter-cluster network
       • Write data into requesting clusters’ stream buffers
       • Clusters read data from stream buffers
     (+) Entire data structure visible to all clusters
     (–) Low bandwidth (1 word/cycle/cluster peak)
     (–) Extra hardware for cross-lane index issue
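
The arbitration step can be illustrated with a toy scheduler. The slides do not specify the address-to-bank mapping or the arbitration policy, so the word-interleaved mapping (`address % num_banks`) and the fixed lowest-cluster-first priority below are assumptions made purely for the sketch.

```python
def arbitrate(requests, num_banks):
    """Schedule cross-lane reads, one grant per bank per cycle.

    requests: dict mapping cluster id -> global SRF word address.
    Returns a list with one entry per cycle: the clusters granted in that cycle.
    Assumed (not from the slides): bank = address % num_banks, lowest id wins.
    """
    pending = dict(requests)
    schedule = []
    while pending:
        granted = {}  # bank -> cluster granted in this cycle
        for cluster in sorted(pending):
            bank = pending[cluster] % num_banks
            if bank not in granted:
                granted[bank] = cluster
        for cluster in granted.values():
            del pending[cluster]
        schedule.append(sorted(granted.values()))
    return schedule


# Four clusters hitting four different banks finish in one cycle.
conflict_free = arbitrate({0: 0, 1: 1, 2: 2, 3: 3}, num_banks=4)
# Addresses 0, 4 and 8 all map to bank 0, so those reads serialize.
conflicting = arbitrate({0: 0, 1: 4, 2: 8, 3: 3}, num_banks=4)
```

The second case shows why the peak of 1 word/cycle/cluster is only a peak: bank conflicts stretch a batch of requests over several cycles.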

  11. Usage Examples
     • Application-specific uses
       • Efficient access to application data structures
     • System-level uses
       • Hide hardware limitations

  12. Multidimensional Data w/o SRF Indexing
     • 90º rotation (“corner-turn”) between accesses along different dimensions
     [Figure labels: Rotate, Memory, SRF, Clusters, Compute, Compute, time]

  13. Multidimensional Data w/ SRF Indexing
     • Accesses along 2nd dimension can typically use in-lane indexing
     • Eliminates data reordering through memory
       → Reduces reordering stream transfers to/from memory system
     [Figure labels: Memory, SRF, Clusters, Compute, Compute]
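
A minimal sketch of access along the second dimension, assuming a row-major block that sits entirely in one lane's bank. The layout and the helper `column_indices` are hypothetical; the slides do not say how multidimensional data is laid out in the SRF.

```python
def column_indices(col, rows, cols):
    """Word indices of one column of a row-major rows x cols block."""
    return [r * cols + col for r in range(rows)]


rows, cols = 3, 4
bank = list(range(rows * cols))  # row-major block held in one SRF bank

# First dimension: plain sequential stream access over one row.
row_0 = bank[0:cols]
# Second dimension: strided indices replace the corner-turn through memory.
column_1 = [bank[i] for i in column_indices(1, rows, cols)]
```

The index computation is a single multiply-add per element, so it can run in the cluster alongside the compute instead of costing a round trip to memory.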

  14. Regular Grid Stencils w/o SRF Indexing
     • Each row is a different stream, all streams consumed at same rate
     • Values from adjacent columns communicated among neighbor lanes
     • 3 streams for 2D grid with 1-wide stencil
     • Many streams for higher dimension grids and/or wider stencils
     • Number of streams currently limited by hardware resources

  15. Regular Grid Stencils w/ SRF Indexing
     • Primary stream consumed sequentially
     • Accesses within vertical planes use in-lane indexing
     • Values from adjacent vertical planes communicated among neighbor lanes (same as unindexed case)
     • Reduces number of streams needed
     • May reduce reordering and/or redundant transfers

  16. Arbitrary Stencils w/o SRF Indexing
     • Repeated accesses to the same node lead to data replication in SRF
     [Figure labels: Lookup, Memory, SRF, Clusters, Index Gen, Compute]

  17. Arbitrary Stencils w/ SRF Indexing
     • Cross-lane indexing supports arbitrary access patterns
     • Eliminates data replication in SRF
     • May reduce capacity stream transfers
     • Increases strip size
     • Reduces redundant transfers from memory system
     [Figure labels: Memory, SRF, Clusters, Compute]
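
The replication argument can be made concrete with a small count. The graph, the node values, and the record size `RECORD_WORDS` below are made-up illustration data, not figures from the talk.

```python
RECORD_WORDS = 4  # hypothetical size of one node record, in words

# Hypothetical irregular stencil: node -> list of neighbor nodes.
neighbors = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
node_value = {0: 5.0, 1: 7.0, 2: 1.0, 3: 4.0}

references = sum(len(adj) for adj in neighbors.values())

# Without indexing: a memory gather puts one record copy in the SRF per reference.
replicated_words = references * RECORD_WORDS
# With indexing: one record per node, plus one index word per reference.
indexed_words = len(neighbors) * RECORD_WORDS + references

# The kernel reads neighbors through indices into the single copy.
table = [node_value[n] for n in sorted(node_value)]
sums = [sum(table[n] for n in neighbors[node]) for node in sorted(neighbors)]
```

The saving grows with the record size and with the average number of references per node; for one-word records the index list would cancel it out.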

  18. Sub-stream Extraction w/o SRF Indexing
     • Splitting records requires a pass through memory or passing useless data through clusters
     • Same for selecting a subset of records
     [Figure labels: Extract, Memory, SRF, Clusters, Compute, Compute]

  19. Sub-stream Extraction w/ SRF Indexing
     • In-lane indexing to select words from records
     • Selecting subset of records may require cross-lane indexing to preserve ordering
     [Figure labels: Memory, SRF, Clusters, Compute, Compute]
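
Selecting one word from each record reduces to a simple index computation. The sketch below assumes records are stored contiguously in a lane's bank; the three-word record layout and the helper `field_indices` are hypothetical.

```python
RECORD_WORDS = 3  # hypothetical record: (x, y, z)


def field_indices(field_offset, num_records, record_words=RECORD_WORDS):
    """Word indices of one field across consecutive records."""
    return [r * record_words + field_offset for r in range(num_records)]


# Three records stored back to back: (1, 2, 3), (4, 5, 6), (7, 8, 9).
srf_bank = [1, 2, 3, 4, 5, 6, 7, 8, 9]

# Extract only the y field; x and z never pass through the cluster.
ys = [srf_bank[i] for i in field_indices(1, num_records=3)]
```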

  20. Virtual Streams
     • Current SRF has hard limit on number of streams used by a kernel
       • Imposed by hardware constraints
       • Exceeding limit requires merging streams, splitting kernels or other workarounds
     • Indexing into SRF provides a mechanism to access any number of sequences
       • Essentially multiplexes multiple logical streams onto one hardware stream
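
One way to picture the multiplexing is a per-stream cursor kept in software, with every read going through the same indexed stream. The class `VirtualStreams` and its interface are assumptions for illustration; the slides do not describe how the cursors would be maintained.

```python
class VirtualStreams:
    """Several logical sequential streams sharing one indexed SRF region (hypothetical)."""

    def __init__(self, srf, bases):
        self.srf = srf
        # Logical stream name -> index of its next unread word.
        self.next_index = dict(bases)

    def read(self, name):
        # A sequential read becomes an indexed read plus a cursor increment.
        index = self.next_index[name]
        self.next_index[name] = index + 1
        return self.srf[index]


# Two logical streams packed into one region: "a" at word 0, "b" at word 3.
srf = [1, 2, 3, 10, 20, 30]
streams = VirtualStreams(srf, {"a": 0, "b": 3})
interleaved = [streams.read("a"), streams.read("b"),
               streams.read("a"), streams.read("b")]
```

The cursors live in ordinary kernel state, so the number of logical streams is bounded by registers and SRF space rather than by the hardware stream count.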

  21. Other Uses
     • Space allocation for variable length streams
       • Current SRF requires space allocation for worst case stream size for variable length streams
       • Indexing can be used to allocate for common case and gracefully degrade if overflows
     • Spill local variables from kernels
       • Reduce register pressure for large kernels
     • Etc.

  22. Summary of Benefits
     • Reduce memory system bandwidth demands
       • Most reordering transfers and some capacity transfers
     • Reduce SRF capacity pressure by eliminating replication
       • Increases strip sizes
     • Collapsing/eliminating index generation and/or reordering steps at stream level potentially shortens software pipeline length
       • Increases strip size
     • Flexible stream control
       • More streams per kernel than hardware supports
       • Efficient SRF allocation for variable length streams

  23. Language & Compiler Issues
     • System-level issues should clearly be handled by compiler/scheduler
       • Virtual streams
       • SRF allocation for variable length streams
       • Register spilling etc.
     • How much of the application-level uses can be inferred by compiler?
       • Substream extraction, regular stencils etc. can be inferred w/o programmer help?
       • Multi-dimensional data structures, irregular stencils etc. need programmer help?
       • If so, what should the API be?

  24. Implementation Issues
     • Hiding indexed SRF access latency
     • Merging scratchpad and SRF
     • SRF access arbitration
     • Memory array implementation

  25. Hiding SRF Access Delay
     • Kernels are statically scheduled
     • SRF access by streams is dynamically arbitrated
     • Allows optimal run-time allocation of SRF BW to cluster and memory streams
     • Address generation for sequential streams can run arbitrarily ahead to hide arbitration delay
     • Indexed accesses are treated much like another stream for arbitration purposes
     • In order to hide arbitration and access delay for reads, SRF indices must be issued early and data read a few cycles later
     • Breaks indexed accesses into two distinct ops at machine level

  26. Hiding SRF Access Delay (Contd.)
     • Split read operation example:

       User pseudocode:

           Kernel XYZ(…, idx_istream<int> S1, …) {
               int a, b, R, S;
               loop(…) {
                   Independent_ops;
                   a = addr_compute1();
                   S1[a] >> R;          // index issue and data read fused
                   b = addr_compute2();
                   S1[b] >> S;
                   Use(R, S);
               }
           }

       Post-compile pseudocode:

           loop(…) {
               a = addr_compute1();
               b = addr_compute2();
               S1.index(a);             // issue both indices early
               S1.index(b);
               Independent_ops;         // overlaps with SRF access latency
               S1 >> R;                 // data returns in issue order
               S1 >> S;
               Use(R, S);
           }

     • Address/data separation is not critical for writes
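
The benefit of splitting the read can be illustrated with a toy cycle count. The model below is an assumption-laden sketch: one op per cycle, a fixed hypothetical latency of 3 cycles from index issue to data availability, and a "stall" counter standing in for the slack that a static scheduler would otherwise have to fill.

```python
def stall_cycles(schedule, latency):
    """Count cycles spent waiting for indexed data.

    schedule: sequence of "issue", "pop" or "work", one op per cycle.
    latency: assumed cycles between issuing an index and data being ready.
    """
    cycle = 0
    ready_times = []  # FIFO of cycles at which issued reads become ready
    stalls = 0
    for op in schedule:
        if op == "issue":
            ready_times.append(cycle + latency)
        elif op == "pop":
            ready = ready_times.pop(0)
            if ready > cycle:
                stalls += ready - cycle  # wait until the data arrives
                cycle = ready
        cycle += 1
    return stalls


# User order: independent work first, then each index followed by its read.
fused = ["work", "work", "work", "issue", "pop", "issue", "pop"]
# Post-compile order: indices first, independent work hides the latency.
split = ["issue", "issue", "work", "work", "work", "pop", "pop"]

fused_stalls = stall_cycles(fused, latency=3)
split_stalls = stall_cycles(split, latency=3)
```

With these made-up numbers the split schedule needs no waiting at all, because three cycles of independent work sit between each index issue and its data read.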

  27. Merging Scratchpad w/ Indexable SRF
     • Data structures in SRF are typically read-only or write-only
     • Scratchpad needs to support read/write data
     • Pending writes are matched against new reads, and multiple writes to the same location are collapsed
     • Special high-priority reads that preempt all other SRF accesses and complete within a fixed latency
     • Reads are performed immediately after matching with pending writes (if no match found) to avoid ordering problems
     • Must sustain at least the current scratchpad bandwidth: one read and one write every cycle
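
The write-matching and write-collapsing behavior can be sketched with a small buffer model. The class `PendingWrites`, the use of a dictionary keyed by address, and the explicit `drain` step are illustrative assumptions; the slides do not describe the actual hardware structure.

```python
class PendingWrites:
    """Toy write buffer in front of an SRF bank (hypothetical structure)."""

    def __init__(self, srf):
        self.srf = srf
        self.pending = {}  # address -> most recent value not yet written

    def write(self, address, value):
        # A later write to the same address replaces the earlier one (collapse).
        self.pending[address] = value

    def read(self, address):
        # Match the read against pending writes first.
        if address in self.pending:
            return self.pending[address]
        # No match: read the array immediately.
        return self.srf[address]

    def drain(self):
        # Commit pending writes to the array; returns how many were committed.
        count = len(self.pending)
        for address, value in self.pending.items():
            self.srf[address] = value
        self.pending.clear()
        return count


srf = [0, 0, 0, 0]
buffer = PendingWrites(srf)
buffer.write(2, 5)
buffer.write(2, 9)           # collapses with the earlier write to address 2
forwarded = buffer.read(2)   # served from the pending write
from_array = buffer.read(1)  # no match, served by the array
committed = buffer.drain()
```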

  28. SRF Memory Array Implementation
     • SSS SRF:
       • 64K words total → 4K words per cluster
       • Non-indexable bank can be implemented as a single 512x512 bit macro
       • Indexing requires some form of banking to sustain a few words/cycle of bandwidth for scratchpad + SRF accesses

  29. SRF Memory Array Implementation (Contd.)
     • Non-indexed SRF bank
       • 512x512 macro
       • 4x4 array of blocks assuming 128x128 blocks
       • 2:1 column decode to sustain 4 words/cycle peak BW
     • Key is to support word granularity indexed access w/o losing implementation and power efficiency for wide sequential reads
     [Figure labels: SRAM Array, Row Dec., Col. Dec., Rd/Wr Circuits]
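
The array figures quoted on the two slides above can be cross-checked with a little arithmetic. Two values below are inferences, not statements from the slides: the lane count follows from dividing total words by words per cluster, and the 64-bit word width follows from assuming the 512x512 bit macro holds exactly one 4K-word bank.

```python
KILO = 1024

total_words = 64 * KILO      # from the slides: 64K words total
words_per_bank = 4 * KILO    # from the slides: 4K words per cluster

# Inferred: number of lanes implied by the two figures above.
lanes = total_words // words_per_bank

macro_rows, macro_cols = 512, 512
macro_bits = macro_rows * macro_cols

# Inferred: word width if one macro holds exactly one bank.
word_bits = macro_bits // words_per_bank

# 128x128 blocks tile the macro as a square array.
blocks_per_side = macro_rows // 128

# 2:1 column decode selects half of the 512 columns per access.
bits_per_access = macro_cols // 2
words_per_access = bits_per_access // word_bits
```

Under these assumptions the numbers are mutually consistent: a 4x4 array of blocks, and 256 bits per access giving the quoted 4 words/cycle peak.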

  30. SRF Memory Array Implementation (Contd.)
     • Option 1: Multiple narrow columns
       • One word per cycle per bank
       • All accesses are one word wide
       • Best BW utilization for mixed indexed and stream accesses
       • High area overhead due to replicated row decoders
       • No replication of column decoders and rd/wr circuits
       • Power in SRAM array(s) comparable to non-banked memory
     [Figure labels: Row Dec. (x4), Col. (x4), Rd/Wr Circuits]

  31. SRF Memory Array Implementation (Contd.)
     • Option 2: Multiple banks along rows of blocks
       • Leverage hierarchical bitlines with additional muxing
       • With appropriate data interleaving, mux area fairly small
       • Low area overhead
       • Low power for wide accesses only
       • BW utilization may be suboptimal for mixed stream and indexed accesses
     [Figure labels: Row Mux (x4), Rd/Wr Circuits]
