Indexing Stream Register Files. Nuwan Jayasena 10/8/2002. Indexing Stream Register Files. Motivation Architecture overview Usage examples Language and compiler issues Implementation issues. Stream Memory Hierarchy. Memory Sys. Roughly order of magnitude increase in BW at each level
Indexing Stream Register Files Nuwan Jayasena 10/8/2002
Indexing Stream Register Files
• Motivation
• Architecture overview
• Usage examples
• Language and compiler issues
• Implementation issues
NSJ

Stream Memory Hierarchy
• Roughly an order of magnitude increase in bandwidth (BW) at each level
• Maximize data reuse at each level
• Focus on the Stream Register File (SRF) for this talk
[Figure: hierarchy levels Memory Sys, Stream RF, Local Registers, Compute Units; data labels Streams, Records, Local variables]

SRF Data Reuse
• Temporal or producer-consumer locality
• Current SRF only supports in-order reuse
• Indexed access to SRF allows reordered reuse
[Figure: in-order reuse vs. reordered reuse; app-dependent reordering, data-dependent reordering]

SRF-Memory Stream Transfers
• Types of stream transfers
  • Compulsory: application I/O
  • Capacity: due to SRF capacity pressure
  • Reordering: reordering of data already in SRF
• SRF indexing…
  • Eliminates most reordering transfers
  • Reduces data replication in SRF
  • Eliminates some capacity transfers

Architecture Overview
• High-level view of SRF indexing implementation
  • Mostly to highlight capabilities and limitations of SRF indexing
• More detailed view of hardware and mechanisms later

Current Stream Processor Arch.
• N “lanes”, each with an SRF bank and a compute cluster
• Cross-lane communication via inter-cluster switch
[Figure: SRF Bank 0, SRF Bank 1, … SRF Bank N; stream buffers; Cluster 0, Cluster 1, … Cluster N; inter-cluster switch]

In-lane SRF Indexing
• Each cluster can index into its own bank of the SRF
• Address queue between cluster and SRF bank
• Sequence of steps for indexed read:
  • Cluster places index in address queue
  • Bank read using index
  • Result placed in stream buffer
  • Cluster reads data from stream buffer
(+) High bandwidth indexed accesses
(+) Few changes to existing architecture
(–) Only 1/N of data structure visible within each cluster
[Figure: SRF Bank X, Cluster X]

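The four-step indexed read above can be sketched as a small software model. This is only an illustration, not the hardware design: the class name `Lane`, its method names, and the one-access-per-cycle bank are assumptions made for the example.

```python
from collections import deque

class Lane:
    """Hypothetical model of one lane: an SRF bank plus its two queues."""

    def __init__(self, bank_words):
        self.bank = list(bank_words)    # this lane's 1/N slice of the SRF
        self.address_queue = deque()    # cluster -> bank: pending indices
        self.stream_buffer = deque()    # bank -> cluster: returned words

    def issue_index(self, index):
        # Step 1: the cluster places an index in the address queue.
        self.address_queue.append(index)

    def bank_cycle(self):
        # Steps 2-3: the bank services one queued index (assumed one per
        # cycle) and places the word it read in the stream buffer.
        if self.address_queue:
            index = self.address_queue.popleft()
            self.stream_buffer.append(self.bank[index])

    def read_data(self):
        # Step 4: the cluster reads the word from the stream buffer.
        return self.stream_buffer.popleft()

lane = Lane([10, 11, 12, 13])
for i in (3, 0, 2):
    lane.issue_index(i)
for _ in range(3):
    lane.bank_cycle()
result = [lane.read_data() for _ in range(3)]   # [13, 10, 12]
```

Data comes back in the order the indices were issued, so the cluster sees a reordered view of its own bank without a trip through memory.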
Cross-lane SRF Indexing
• Any cluster can access any SRF location
• Adds interconnect between clusters for address communication
• Data return takes place over existing inter-cluster network
[Figure: SRF Bank 0, SRF Bank 1, … SRF Bank 7; SRF address switch; inter-cluster switch; Cluster 0, Cluster 1, … Cluster 7]

Cross-lane SRF Indexing (Contd.)
• Sequence of steps
  • Clusters place indices in their own index queues
  • Indices broadcast on address switch
  • Arbitrate to resolve bank conflicts
  • Access SRF banks and return data via inter-cluster network
  • Write data into requesting clusters’ stream buffers
  • Clusters read data from stream buffers
(+) Entire data structure visible to all clusters
(–) Low bandwidth (1 word/cycle/cluster peak)
(–) Extra hardware for cross-lane index issue

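A rough sketch of the cross-lane sequence, to show how bank conflicts cost cycles. Several details are assumptions made for the example and are not stated in the slides: word-interleaved addressing (`bank = addr % N`), fixed-priority arbitration in which the lowest cluster id wins, and one access per bank per cycle. The function name `cross_lane_read` is hypothetical.

```python
def cross_lane_read(banks, requests):
    """banks[b][k] holds global address k * N + b (assumed interleaving).
    requests maps cluster id -> list of global addresses in issue order.
    Returns (words delivered per cluster, cycles taken)."""
    n = len(banks)
    pending = {c: list(addrs) for c, addrs in requests.items()}
    buffers = {c: [] for c in requests}      # per-cluster stream buffers
    cycles = 0
    while any(pending.values()):
        cycles += 1
        busy = set()                         # banks already granted this cycle
        for c in sorted(pending):            # assumed fixed-priority arbiter
            if not pending[c]:
                continue
            addr = pending[c][0]             # at most one index per cluster per cycle
            bank = addr % n
            if bank in busy:
                continue                     # lost arbitration, retry next cycle
            busy.add(bank)
            buffers[c].append(banks[bank][addr // n])
            pending[c].pop(0)
    return buffers, cycles

# 4 banks, 8 words; the word at global address a holds 100 + a.
banks = [[100 + k * 4 + b for k in range(2)] for b in range(4)]
data, cycles = cross_lane_read(banks, {0: [5, 2], 1: [1, 6]})
```

Here clusters 0 and 1 both target bank 1 in the first cycle, so cluster 1 stalls once and the four reads take three cycles instead of two.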
Usage Examples
• Application-specific uses
  • Efficient access to application data structures
• System-level uses
  • Hide hardware limitations

Multidimensional Data w/o SRF Indexing
• 90° rotation (“corner-turn”) between accesses along different dimensions
[Figure: timeline (axis: time) with rows Memory, SRF, Clusters; steps Compute, Rotate, Compute]

Multidimensional Data w/ SRF Indexing
• Accesses along 2nd dimension can typically use in-lane indexing
• Eliminates data reordering through memory, reducing reordering stream transfers to/from the memory system
[Figure: rows Memory, SRF, Clusters; steps Compute, Compute]

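One way to see why the second dimension can stay in-lane is the following sketch. It assumes a row-major array striped word by word across the lanes, with a row length that is a multiple of the lane count; neither the layout nor the function name `column_indices` comes from the slides.

```python
def column_indices(num_rows, num_cols, num_lanes, col):
    """Return (lane, local SRF indices) for walking down one column.

    Assumes element (r, c) sits at global address r * num_cols + c and that
    global address a lives in lane a % num_lanes at local index a // num_lanes.
    """
    assert num_cols % num_lanes == 0, "sketch only covers aligned row lengths"
    lane = col % num_lanes                  # the same lane for every row
    local = [(r * num_cols + col) // num_lanes for r in range(num_rows)]
    return lane, local

lane, local = column_indices(num_rows=4, num_cols=8, num_lanes=4, col=5)
# lane == 1, local == [1, 3, 5, 7]
```

Under these assumptions every element of column 5 lands in lane 1, so that cluster can walk the column with in-lane indices and no corner-turn through memory is needed.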
Regular Grid Stencils w/o SRF Indexing
• Each row is a different stream, all streams consumed at same rate
• Values from adjacent columns communicated among neighbor lanes
• 3 streams for 2D grid with 1-wide stencil
• Many streams for higher dimension grids and/or wider stencils
• Number of streams currently limited by hardware resources

Regular Grid Stencils w/ SRF Indexing
• Primary stream consumed sequentially
• Accesses within vertical planes use in-lane indexing
• Values from adjacent vertical planes communicated among neighbor lanes (same as unindexed case)
• Reduces number of streams needed
• May reduce reordering and/or redundant transfers

Arbitrary Stencils w/o SRF Indexing
• Repeated accesses to the same node lead to data replication in SRF
[Figure: rows Memory, SRF, Clusters; steps Index Gen, Lookup, Compute]

Arbitrary Stencils w/ SRF Indexing
• Cross-lane indexing supports arbitrary access pattern
• Eliminates data replication in SRF
• May reduce capacity stream transfers
  • Increases strip size
  • Reduces redundant transfers from memory system
[Figure: rows Memory, SRF, Clusters; step Compute]

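The replication saving can be illustrated by counting SRF words for a small irregular mesh. This is a back-of-the-envelope sketch: it counts one word per node value, ignores the space taken by the index lists themselves, and the function name `srf_words_needed` is made up for the example.

```python
def srf_words_needed(neighbors):
    """neighbors maps each node to the list of nodes its stencil reads.

    Returns (replicated, indexed): node values held in the SRF when every
    neighbor reference gets its own gathered copy, versus when each node
    is stored once and reached by index.
    """
    replicated = sum(len(ns) for ns in neighbors.values())
    nodes = set(neighbors) | {n for ns in neighbors.values() for n in ns}
    indexed = len(nodes)
    return replicated, indexed

mesh = {0: [1, 2, 3], 1: [0, 2], 2: [0, 1, 3], 3: [0, 2]}
replicated, indexed = srf_words_needed(mesh)   # (10, 4)
```

With the same SRF capacity, the indexed layout leaves room for a larger strip of nodes, which is the strip-size benefit listed above.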
Sub-stream Extraction w/o SRF Indexing
• Splitting records requires a pass through memory or passing useless data through clusters
• Same for selecting subset of records
[Figure: rows Memory, SRF, Clusters; steps Compute, Extract, Compute]

Sub-stream Extraction w/ SRF Indexing
• In-lane indexing to select words from records
• Selecting subset of records may require cross-lane indexing to preserve ordering
[Figure: rows Memory, SRF, Clusters; steps Compute, Compute]

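Selecting one word from each record amounts to a strided index sequence within a lane. A minimal sketch, assuming records are stored contiguously in the bank with a fixed number of words each (the function name `extract_field` is hypothetical):

```python
def extract_field(bank, record_words, field):
    """Read word `field` of every record from one lane's SRF bank."""
    assert len(bank) % record_words == 0, "bank must hold whole records"
    assert 0 <= field < record_words
    # Index of the wanted word in record i is i * record_words + field.
    return [bank[base + field] for base in range(0, len(bank), record_words)]

# Three records of (x, y, z); pull out the y components.
bank = [1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = extract_field(bank, record_words=3, field=1)   # [2, 5, 8]
```

The unwanted words never leave the SRF, so they cost neither memory bandwidth nor cluster cycles.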
Virtual Streams
• Current SRF has hard limit on number of streams used by a kernel
  • Imposed by hardware constraints
  • Exceeding limit requires merging streams, splitting kernels or other workarounds
• Indexing into SRF provides a mechanism to access any number of sequences
  • Essentially multiplex multiple logical streams onto one hardware stream

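The multiplexing idea can be sketched with a base address and a read position per logical stream, all served through one indexed access path. The class `VirtualStreams` and its interface are assumptions for illustration, not an API from the talk.

```python
class VirtualStreams:
    """Hypothetical model: many logical streams over one indexed SRF stream."""

    def __init__(self, srf, bases_and_lengths):
        self.srf = srf                                   # the one indexed stream
        self.base = [b for b, _ in bases_and_lengths]    # start of each logical stream
        self.length = [n for _, n in bases_and_lengths]
        self.pos = [0] * len(bases_and_lengths)          # per-stream read position

    def read(self, stream_id):
        # Sequential read on a logical stream becomes an indexed SRF access.
        if self.pos[stream_id] >= self.length[stream_id]:
            raise IndexError("logical stream exhausted")
        word = self.srf[self.base[stream_id] + self.pos[stream_id]]
        self.pos[stream_id] += 1
        return word

srf = list(range(100, 112))
vs = VirtualStreams(srf, [(0, 4), (4, 4), (8, 4)])
reads = [vs.read(2), vs.read(0), vs.read(2), vs.read(1)]   # [108, 100, 109, 104]
```

The number of logical streams is bounded only by the bookkeeping state, not by the number of hardware stream buffers.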
Other Uses
• Space allocation for variable length streams
  • Current SRF requires space allocation for worst case stream size for variable length streams
  • Indexing can be used to allocate for the common case and degrade gracefully on overflow
• Spill local variables from kernels
  • Reduce register pressure for large kernels
• Etc.

Summary of Benefits
• Reduce memory system bandwidth demands
  • Most reordering transfers and some capacity transfers
• Reduce SRF capacity pressure by eliminating replication
  • Increases strip sizes
• Collapsing/eliminating index generation and/or reordering steps at stream level potentially shortens software pipeline length
  • Increases strip size
• Flexible stream control
  • More streams per kernel than hardware supports
  • Efficient SRF allocation for variable length streams

Language & Compiler Issues
• System-level issues should clearly be handled by compiler/scheduler
  • Virtual streams
  • SRF allocation for variable length streams
  • Register spilling etc.
• How many of the application-level uses can be inferred by the compiler?
  • Substream extraction, regular stencils etc. can be inferred w/o programmer help?
  • Multi-dimensional data structures, irregular stencils etc. need programmer help?
    • If so, what should the API be?

Implementation Issues
• Hiding indexed SRF access latency
• Merging scratchpad and SRF
• SRF access arbitration
• Memory array implementation

Hiding SRF Access Delay
• Kernels are statically scheduled
• SRF access by streams is dynamically arbitrated
  • Allows optimal run-time allocation of SRF BW to cluster and memory streams
  • Address generation for sequential streams can run arbitrarily ahead to hide arbitration delay
• Indexed accesses are treated much like another stream for arbitration purposes
• In order to hide arbitration and access delay for reads, SRF indices must be issued early and data read a few cycles later
  • Breaks indexed accesses into two distinct ops at machine level

Hiding SRF Access Delay (Contd.)
• Split read operation example:

User pseudocode:
    Kernel XYZ(…, idx_istream<int> S1, …) {
        int a, b, R, S;
        loop(…) {
            Independent_ops;
            a = addr_compute1();
            S1[a] >> R;
            b = addr_compute2();
            S1[b] >> S;
            Use(R, S);
        }
    }

Post-compile pseudocode:
    loop(…) {
        a = addr_compute1();
        b = addr_compute2();
        S1.index(a);        // issue both indices early
        S1.index(b);
        Independent_ops;    // overlaps with arbitration and access delay
        S1 >> R;            // data read a few cycles later
        S1 >> S;
        Use(R, S);
    }

• Address/data separation is not critical for writes

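The effect of splitting the read into an index op and a data op can be shown with a toy cycle count. The class `IndexedStream`, the fixed `latency` parameter, and the stall accounting are all assumptions for this sketch; the real delay depends on arbitration and is not fixed.

```python
from collections import deque

class IndexedStream:
    """Hypothetical model of a read-only indexed stream with access delay."""

    def __init__(self, data, latency):
        self.data = data
        self.latency = latency      # assumed fixed arbitration + access delay
        self.in_flight = deque()    # (cycle when ready, value), in issue order
        self.cycle = 0

    def tick(self, n=1):
        # Advance time, e.g. while independent ops execute.
        self.cycle += n

    def index(self, addr):
        # Index op: start the access. The value is captured here, which is
        # only valid because the stream is modeled as read-only.
        self.in_flight.append((self.cycle + self.latency, self.data[addr]))

    def read(self):
        # Data op: returns (value, cycles stalled waiting for the data).
        ready, value = self.in_flight.popleft()
        stall = max(0, ready - self.cycle)
        self.cycle += stall
        return value, stall

# Split schedule: issue indices, do 3 cycles of independent work, then read.
s = IndexedStream([5, 6, 7, 8], latency=3)
s.index(1)
s.index(3)
s.tick(3)
split = [s.read(), s.read()]        # [(6, 0), (8, 0)]

# Unsplit schedule: read immediately after issuing the index.
n = IndexedStream([5, 6, 7, 8], latency=3)
n.index(1)
naive = n.read()                    # (6, 3)
```

In the split schedule the independent work covers the whole delay, so neither read stalls; in the unsplit schedule the kernel waits out the full latency.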
Merging Scratchpad w/ Indexable SRF
• Data structures in SRF are typically read-only or write-only
• Scratchpad needs to support read/write data
  • Pending writes are matched against new reads, and multiple writes to the same location are collapsed
  • Special high-priority reads that preempt all other SRF accesses and complete within a fixed latency
  • Reads are performed immediately after matching with pending writes (if no match found) to avoid ordering problems
• Must sustain at least the current scratchpad bandwidth: one read and one write every cycle

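The read/write matching described above behaves like a small write buffer in front of the SRF. A minimal sketch, assuming a dictionary keyed by address stands in for the matching logic; the class `WriteBuffer` and its method names are hypothetical, and priorities and timing are not modeled.

```python
class WriteBuffer:
    """Hypothetical model of pending scratchpad writes in front of the SRF."""

    def __init__(self, srf):
        self.srf = srf
        self.pending = {}    # address -> latest value not yet written back

    def write(self, addr, value):
        # A later write to the same address replaces (collapses) the earlier one.
        self.pending[addr] = value

    def read(self, addr):
        # Match the read against pending writes first; on a miss, read the SRF.
        if addr in self.pending:
            return self.pending[addr]
        return self.srf[addr]

    def drain_one(self):
        # Retire the oldest pending write to the SRF array.
        if self.pending:
            addr = next(iter(self.pending))
            self.srf[addr] = self.pending.pop(addr)

srf = [0, 0, 0, 0]
wb = WriteBuffer(srf)
wb.write(2, 7)
wb.write(2, 9)               # collapses with the earlier write to address 2
forwarded = wb.read(2)       # 9, served from the pending write
untouched = wb.read(1)       # 0, no match, served from the SRF
wb.drain_one()               # srf[2] is now 9
```

Because reads always see the newest pending value, the kernel observes read-after-write ordering even though the array itself is updated later.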
SRF Memory Array Implementation
• SSS SRF:
  • 64K words total, 4K words per cluster
• Non-indexable bank can be implemented as a single 512x512 bit macro
• Indexing requires some form of banking to sustain a few words/cycle bandwidth for scratchpad + SRF accesses

SRF Memory Array Implementation (Contd.)
• Non-indexed SRF bank
  • 512x512 macro
  • 4x4 array of blocks assuming 128x128 blocks
  • 2:1 column decode to sustain 4 words/cycle peak BW
• Key is to support word granularity indexed access w/o losing implementation and power efficiency for wide sequential reads
[Figure: SRAM Array, Row Dec., Col. Dec., Rd/Wr Circuits]

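The array numbers on these two slides can be cross-checked with a little arithmetic. The word width is not stated in the slides; 64 bits is assumed here because it is the value that makes the quoted figures agree with each other.

```python
WORD_BITS = 64                     # assumption: not stated on the slides

total_words = 64 * 1024            # 64K words in the whole SRF
words_per_bank = 4 * 1024          # 4K words per cluster
macro_rows, macro_cols = 512, 512  # single-macro, non-indexed bank
block_size = 128                   # 128x128 blocks
column_decode = 2                  # 2:1 column decode

banks = total_words // words_per_bank                       # 16
bank_bits = words_per_bank * WORD_BITS                      # 262144
macro_bits = macro_rows * macro_cols                        # 262144
blocks_per_side = macro_rows // block_size                  # 4, i.e. a 4x4 array
words_per_access = macro_cols // column_decode // WORD_BITS # 4
```

Under the 64-bit assumption a 4K-word bank fills the 512x512 macro exactly, and a 2:1 column decode of a 512-bit row yields the quoted 4 words per cycle.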
SRF Memory Array Implementation (Contd.)
• Option 1: Multiple narrow columns
  • One word per cycle per bank
  • All accesses are one word wide
  • Best BW utilization for mixed indexed and stream accesses
  • High area overhead due to replicated row decoders
  • No replication of column decoders and rd/wr circuits
  • Power in SRAM array(s) comparable to non-banked memory
[Figure: four narrow columns, each with its own Row Dec. and Col. block, above shared Rd/Wr Circuits]

SRF Memory Array Implementation (Contd.)
• Option 2: Multiple banks along rows of blocks
  • Leverage hierarchical bitlines with additional muxing
  • With appropriate data interleaving, mux area fairly small
  • Low area overhead
  • Low power for wide accesses only
  • BW utilization may be suboptimal for mixed stream and indexed accesses
[Figure: four Row Mux blocks above Rd/Wr Circuits]