Indexing Stream Register Files

Nuwan Jayasena, 10/8/2002

Indexing Stream Register Files
• Motivation
• Architecture overview
• Usage examples
• Language and compiler issues
• Implementation issues
NSJ

Stream Memory Hierarchy
• Roughly an order-of-magnitude increase in BW at each level
• Maximize data reuse at each level
• Focus on Stream Register File (SRF) for this talk
[Figure: Memory Sys, Streams, Stream RF, Records, Local Registers, Local variables, Compute Units]

SRF Data Reuse
Temporal or producer-consumer locality
• Current SRF only supports in-order reuse
→ Indexed access to SRF allows reordered reuse
[Figure: In-order reuse; Reordered reuse (App-dependent reordering, Data-dependent reordering)]

SRF-Memory Stream Transfers
• Types of stream transfers
  • Compulsory: application I/O
  • Capacity: due to SRF capacity pressure
  • Reordering: re-ordering of data already in SRF
• SRF indexing…
  • Eliminates most reordering transfers
  • Reduces data replication in SRF
  • Eliminates some capacity transfers

Architecture Overview
• High-level view of SRF indexing implementation
• Mostly to highlight capabilities and limitations of SRF indexing
• More detailed view of hardware and mechanisms later

Current Stream Processor Arch.
• N “lanes”, each with an SRF bank and a compute cluster
• Cross-lane communication via inter-cluster switch
[Figure: SRF Bank 0, SRF Bank 1, SRF Bank N; Stream buffers; Cluster 0, Cluster 1, Cluster N; Inter-cluster switch]

In-lane SRF Indexing
• Each cluster can index into its own bank of the SRF
• Address queue between cluster and SRF bank
• Sequence of steps for indexed read:
  • Cluster places index in address queue
  • Bank read using index
  • Result placed in stream buffer
  • Cluster reads data from stream buffer
(+) High bandwidth indexed accesses
(+) Few changes to existing architecture
(–) Only 1/N of data structure visible within each cluster
[Figure: SRF Bank X, Cluster X]
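The four-step indexed read above can be sketched as a small software model. This is only an illustration: the `Lane` class, its method names, and the one-access-per-call bank behaviour are assumptions made for the sketch, not part of the slides.

```python
from collections import deque

class Lane:
    """Hypothetical model of one lane: an SRF bank plus its two queues."""

    def __init__(self, bank_words):
        self.bank = list(bank_words)   # this lane's SRF bank contents
        self.addr_queue = deque()      # indices travelling cluster -> bank
        self.stream_buffer = deque()   # data travelling bank -> cluster

    def issue_index(self, idx):
        # Step 1: cluster places an index in the address queue.
        self.addr_queue.append(idx)

    def bank_cycle(self):
        # Steps 2-3: bank reads one queued index, result goes to the stream buffer.
        if self.addr_queue:
            idx = self.addr_queue.popleft()
            self.stream_buffer.append(self.bank[idx])

    def read_data(self):
        # Step 4: cluster reads data from the stream buffer, in issue order.
        return self.stream_buffer.popleft()

lane = Lane([10, 11, 12, 13])
for i in (3, 0, 2):
    lane.issue_index(i)
for _ in range(3):
    lane.bank_cycle()
result = [lane.read_data() for _ in range(3)]
```

Data comes back in the order the indices were issued, so the cluster never has to tag individual requests.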

Cross-lane SRF Indexing
• Any cluster can access any SRF location
• Adds interconnect between clusters for address communication
• Data return takes place over existing inter-cluster network
[Figure: SRF Bank 0, SRF Bank 1, SRF Bank 7; SRF address switch; Inter-cluster switch; Cluster 0, Cluster 1, Cluster 7]

Cross-lane SRF Indexing (Contd.)
• Sequence of steps
  • Clusters place indices in their own index queues
  • Indices broadcast on address switch
  • Arbitrate to resolve bank conflicts
  • Access SRF banks and return data via inter-cluster network
  • Write data into requesting clusters’ stream buffers
  • Clusters read data from stream buffers
(+) Entire data structure visible to all clusters
(–) Low bandwidth (1 word/cycle/cluster peak)
(–) Extra hardware for cross-lane index issue
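A rough model of the cross-lane sequence makes the cost of bank conflicts visible. The slides do not specify the address mapping or the arbitration policy, so this sketch assumes word-interleaved banks (global index g lives in bank g mod N at offset g div N) and a fixed-priority arbiter in which the lowest-numbered cluster wins. The function name `cross_lane_read` is hypothetical.

```python
from collections import deque

N = 4  # number of lanes in this example (assumption)

def cross_lane_read(banks, requests):
    """banks[b][offset] holds SRF words; requests[c] lists global indices for cluster c.

    Returns (per-cluster stream buffer contents, cycles taken).
    """
    n = len(banks)
    index_queues = [deque(r) for r in requests]
    stream_buffers = [[] for _ in range(n)]
    cycles = 0
    while any(index_queues):
        cycles += 1
        granted = set()                # banks already claimed this cycle
        for c in range(n):             # assumed fixed priority: lowest cluster id first
            if not index_queues[c]:
                continue
            g = index_queues[c][0]     # only the head index competes: 1 word/cycle/cluster
            bank, offset = g % n, g // n
            if bank in granted:
                continue               # bank conflict: this cluster retries next cycle
            granted.add(bank)
            index_queues[c].popleft()
            stream_buffers[c].append(banks[bank][offset])
    return stream_buffers, cycles

# Word with global index g stores the value 100 + g.
banks = [[100 + (o * N + b) for o in range(4)] for b in range(N)]
data, cycles = cross_lane_read(banks, [[0, 5], [4, 5], [2], []])
```

Clusters 0 and 1 both start with an index that maps to bank 0, so cluster 1 loses arbitration in the first cycle and the whole batch takes three cycles instead of two.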

Usage Examples
• Application-specific uses
  • Efficient access to application data structures
• System-level uses
  • Hide hardware limitations

Multidimensional Data w/o SRF Indexing
• 90° rotation (“corner-turn”) between accesses along different dimensions
[Figure: Memory, SRF and Clusters over time: Compute, Rotate, Compute]

Multidimensional Data w/ SRF Indexing
• Accesses along 2nd dimension can typically use in-lane indexing
• Eliminates data reordering through memory → reduces reordering stream transfers to/from memory system
[Figure: Memory, SRF, Clusters: Compute, Compute]
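To illustrate access along the second dimension without a corner-turn, the sketch below assumes a lane holds a small tile in row-major order and walks it column by column using computed indices. The tile layout and the helper name `column_order_indices` are assumptions for illustration only.

```python
def column_order_indices(rows, cols):
    """Indices that visit a row-major rows x cols tile one column at a time."""
    return [r * cols + c for c in range(cols) for r in range(rows)]

rows, cols = 3, 4
tile = list(range(rows * cols))          # row-major tile: word i holds the value i
idx = column_order_indices(rows, cols)
column_major = [tile[i] for i in idx]    # indexed reads replace the rotate pass
```

The data stays where it was first written; only the index sequence changes between the row-wise and column-wise passes.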

Regular Grid Stencils w/o SRF Indexing
• Each row is a different stream, all streams consumed at same rate
• Values from adjacent columns communicated among neighbor lanes
• 3 streams for 2D grid with 1-wide stencil
• Many streams for higher dimension grids and/or wider stencils
• Number of streams currently limited by hardware resources

Regular Grid Stencils w/ SRF Indexing
• Primary stream consumed sequentially
• Accesses within vertical planes use in-lane indexing
• Values from adjacent vertical planes communicated among neighbor lanes (same as unindexed case)
• Reduces number of streams needed
• May reduce reordering and/or redundant transfers

Arbitrary Stencils w/o SRF Indexing
• Repeated accesses to the same node lead to data replication in SRF
[Figure: Memory, SRF, Clusters: Index Gen, Lookup, Compute]

Arbitrary Stencils w/ SRF Indexing
• Cross-lane indexing supports arbitrary access patterns
• Eliminates data replication in SRF
• May reduce capacity stream transfers
• Increases strip size
• Reduces redundant transfers from memory system
[Figure: Memory, SRF, Clusters: Compute]
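The replication saving can be estimated with a little arithmetic. The sketch below is a hypothetical example: the neighbor lists and the 4-word node size are made up, and the indexed case ignores the space the indices themselves occupy.

```python
def srf_words(neighbor_lists, words_per_node):
    """Return (words with replicated gather, words with one copy per node)."""
    references = sum(len(lst) for lst in neighbor_lists)
    unique_nodes = len({n for lst in neighbor_lists for n in lst})
    return references * words_per_node, unique_nodes * words_per_node

# Three stencil applications, each touching three of four nodes.
neighbor_lists = [[0, 1, 2], [1, 2, 3], [2, 3, 0]]
replicated, indexed = srf_words(neighbor_lists, words_per_node=4)
```

In this toy case the gathered stream needs 36 words while the indexed layout needs 16, which is the headroom that allows larger strips.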

Sub-stream Extraction w/o SRF Indexing
• Splitting records requires a pass through memory or passing useless data through clusters
• Same for selecting subset of records
[Figure: Memory, SRF, Clusters: Compute, Extract, Compute]

Sub-stream Extraction w/ SRF Indexing
• In-lane indexing to select words from records
• Selecting subset of records may require cross-lane indexing to preserve ordering
[Figure: Memory, SRF, Clusters: Compute, Compute]
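Selecting words from records reduces to simple index arithmetic. The sketch below assumes fixed-size records stored contiguously in one lane; `field_indices` and the 4-word record layout are illustrative assumptions.

```python
def field_indices(num_records, record_words, fields):
    """In-lane indices that pick the given word offsets out of every record."""
    return [r * record_words + f for r in range(num_records) for f in fields]

srf_bank = list(range(12))                  # 3 records of 4 words each
idx = field_indices(num_records=3, record_words=4, fields=(0, 2))
sub_stream = [srf_bank[i] for i in idx]     # only words 0 and 2 of each record
```

The unused words never leave the SRF, so no extraction pass through memory or the clusters is needed.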

Virtual Streams
• Current SRF has hard limit on number of streams used by a kernel
  • Imposed by hardware constraints
  • Exceeding limit requires merging streams, splitting kernels or other workarounds
• Indexing into SRF provides a mechanism to access any number of sequences
  • Essentially multiplex multiple logical streams onto one hardware stream
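One way to picture the multiplexing is a per-stream base address and cursor kept in software, with every access going through a single indexed region. The `VirtualStreams` class and its fields are hypothetical names for this sketch, not an interface described in the slides.

```python
class VirtualStreams:
    """Hypothetical model: many logical streams over one indexed SRF region."""

    def __init__(self, region, extents):
        self.region = region                 # the single indexed hardware stream
        self.extents = extents               # (base, length) per logical stream
        self.cursors = [0] * len(extents)    # next element of each logical stream

    def read(self, stream_id):
        base, length = self.extents[stream_id]
        pos = self.cursors[stream_id]
        if pos >= length:
            raise IndexError("logical stream exhausted")
        self.cursors[stream_id] = pos + 1
        return self.region[base + pos]       # sequential read expressed as an index

region = list(range(100, 112))
vs = VirtualStreams(region, extents=[(0, 4), (4, 4), (8, 4)])
reads = [vs.read(2), vs.read(0), vs.read(2), vs.read(1)]
```

Each logical stream still appears sequential to the kernel, while the hardware sees one stream of indices.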

Other Uses
• Space allocation for variable length streams
  • Current SRF requires space allocation for worst case stream size for variable length streams
  • Indexing can be used to allocate for common case and gracefully degrade on overflow
• Spill local variables from kernels
  • Reduce register pressure for large kernels
• Etc.

Summary of Benefits
• Reduce memory system bandwidth demands
  • Most reordering transfers and some capacity transfers
• Reduce SRF capacity pressure by eliminating replication
  • Increases strip sizes
• Collapsing/eliminating index generation and/or reordering steps at stream level potentially shortens software pipeline length
  • Increases strip size
• Flexible stream control
  • More streams per kernel than hardware supports
  • Efficient SRF allocation for variable length streams

Language & Compiler Issues
• System-level issues should clearly be handled by compiler/scheduler
  • Virtual streams
  • SRF allocation for variable length streams
  • Register spilling etc.
• How many of the application-level uses can be inferred by the compiler?
  • Substream extraction, regular stencils etc. can be inferred w/o programmer help?
  • Multi-dimensional data structures, irregular stencils etc. need programmer help?
  • If so, what should the API be?

Implementation Issues
• Hiding indexed SRF access latency
• Merging scratchpad and SRF
• SRF access arbitration
• Memory array implementation

Hiding SRF Access Delay
• Kernels are statically scheduled
• SRF access by streams is dynamically arbitrated
  • Allows optimal run-time allocation of SRF BW to cluster and memory streams
  • Address generation for sequential streams can run arbitrarily ahead to hide arbitration delay
• Indexed accesses are treated much like another stream for arbitration purposes
• In order to hide arbitration and access delay for reads, SRF indices must be issued early and data read a few cycles later
  • Breaks indexed accesses into two distinct ops at machine level

Hiding SRF Access Delay (Contd.)
• Split read operation example:

User pseudocode:

    Kernel XYZ(…, idx_istream<int> S1, …) {
      int a, b, R, S;
      loop(…) {
        Independent_ops;
        a = addr_compute1();
        S1[a] >> R;
        b = addr_compute2();
        S1[b] >> S;
        Use(R, S);
      }
    }

Post-compile pseudocode:

    loop(…) {
      a = addr_compute1();
      b = addr_compute2();
      S1.index(a);        // issue both indices early
      S1.index(b);
      Independent_ops;    // overlaps with SRF arbitration and access
      S1 >> R;            // data read a few cycles later
      S1 >> S;
      Use(R, S);
    }

• Address/data separation is not critical for writes
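The benefit of splitting a read into an index issue and a later data read can be seen with a toy cycle count. The model below is an assumption-laden sketch: one operation issues per cycle, every indexed access has a fixed latency of 4 cycles, and `Independent_ops` is taken to be four single-cycle operations. None of these numbers come from the slides.

```python
def run(schedule, latency):
    """Count cycles for a single-issue schedule of ("index"|"read"|"op", tag) tuples."""
    cycle = 0
    ready_at = {}                                # tag -> cycle when its data is available
    for op, tag in schedule:
        if op == "index":
            ready_at[tag] = cycle + latency      # index issued now, data arrives later
        elif op == "read":
            cycle = max(cycle, ready_at[tag])    # stall until the data has arrived
        cycle += 1
    return cycle

OP = ("op", None)

# User order: each index is immediately followed by its read.
fused = [OP] * 4 + [("index", "a"), ("read", "a"),
                    ("index", "b"), ("read", "b"), OP]

# Post-compile order: indices first, independent work in between, reads last.
split = [("index", "a"), ("index", "b")] + [OP] * 4 + [("read", "a"), ("read", "b"), OP]

fused_cycles = run(fused, latency=4)
split_cycles = run(split, latency=4)
```

Under these assumptions the fused schedule stalls on both reads, while the split schedule hides the whole access latency behind the independent operations.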

Merging Scratchpad w/ Indexable SRF
• Data structures in SRF are typically read-only or write-only
• Scratchpad needs to support read/write data
• Pending writes are matched against new reads, and multiple writes to the same location are collapsed
• Special high-priority reads that preempt all other SRF accesses and complete within a fixed latency
• Reads are performed immediately after matching with pending writes (if no match found) to avoid ordering problems
• Must sustain at least the current scratchpad bandwidth – one read and one write every cycle
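The write-matching behaviour described above can be sketched with a small pending-write table. The `WriteMatchingBank` class and the choice of a dictionary keyed by address are assumptions for illustration; the slides do not describe the actual hardware structure.

```python
class WriteMatchingBank:
    """Hypothetical model of a bank with pending-write matching."""

    def __init__(self, size):
        self.array = [0] * size
        self.pending = {}            # addr -> latest value; same-address writes collapse

    def write(self, addr, value):
        # A later write to the same address overwrites the pending entry.
        self.pending[addr] = value

    def read(self, addr):
        # Match the read against pending writes first to preserve ordering.
        if addr in self.pending:
            return self.pending[addr]
        # No match: the read goes to the array immediately.
        return self.array[addr]

    def drain_one(self):
        # Retire one pending write into the array when the bank has a free slot.
        if self.pending:
            addr = next(iter(self.pending))
            self.array[addr] = self.pending.pop(addr)

bank = WriteMatchingBank(8)
bank.write(3, 7)
bank.write(3, 9)                     # collapses with the earlier write to address 3
pending_count = len(bank.pending)
forwarded = bank.read(3)             # served from the pending write
untouched = bank.read(2)             # no match, read from the array
bank.drain_one()
after_drain = bank.read(3)
```

A read always sees the most recent write to its address, whether or not that write has reached the array yet.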

SRF Memory Array Implementation
• SSS SRF:
  • 64K words total → 4K words per cluster
  • Non-indexable bank can be implemented as a single 512x512 bit macro
• Indexing requires some form of banking to sustain a few words/cycle bandwidth for scratchpad + SRF accesses

SRF Memory Array Implementation (Contd.)
• Non-indexed SRF bank
  • 512x512 macro
  • 4x4 array of blocks assuming 128x128 blocks
  • 2:1 column decode to sustain 4 words/cycle peak BW
• Key is to support word granularity indexed access w/o losing implementation and power efficiency for wide sequential reads
[Figure: SRAM Array, Row Dec., Col. Dec., Rd/Wr Circuits]
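The bank dimensions quoted above can be cross-checked with a few lines of arithmetic. The slides do not state the word width; the sketch assumes 64-bit words, which is the width under which the 4K-word bank, the 512x512 macro and the 4 words/cycle figure all agree. The cluster count of 16 is likewise inferred from 64K / 4K rather than stated.

```python
TOTAL_WORDS = 64 * 1024
WORDS_PER_CLUSTER = 4 * 1024
WORD_BITS = 64                     # assumption: not stated in the slides

clusters = TOTAL_WORDS // WORDS_PER_CLUSTER        # inferred number of clusters
bank_bits = WORDS_PER_CLUSTER * WORD_BITS          # capacity of one SRF bank in bits
macro_bits = 512 * 512                             # single non-indexed macro
blocks_per_side = 512 // 128                       # 128x128 blocks tiling the macro
bits_per_access = 512 // 2                         # 2:1 column decode of a 512-bit row
words_per_cycle = bits_per_access // WORD_BITS     # peak sequential bandwidth
```

With a 32-bit word the same bank would fill only half of a 512x512 macro, so the 64-bit assumption is the one that makes the slide's numbers line up.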

SRF Memory Array Implementation (Contd.)
• Option 1: Multiple narrow columns
  • One word per cycle per bank
  • All accesses are one word wide
  • Best BW utilization for mixed indexed and stream accesses
  • High area overhead due to replicated row decoders
  • No replication of column decoders and rd/wr circuits
  • Power in SRAM array(s) comparable to non-banked memory
[Figure: Row Dec. (x4), Col. (x4), Rd/Wr Circuits]

SRF Memory Array Implementation (Contd.)
• Option 2: Multiple banks along rows of blocks
  • Leverage hierarchical bitlines with additional muxing
  • With appropriate data interleaving, mux area fairly small
  • Low area overhead
  • Low power for wide accesses only
  • BW utilization may be suboptimal for mixed stream and indexed accesses
[Figure: Row Mux (x4), Rd/Wr Circuits]
