
Stream Register Files with Indexed Access


Presentation Transcript


  1. Stream Register Files with Indexed Access Nuwan Jayasena Mattan Erez Jung Ho Ahn William J. Dally

  2. Scaling Trends • ILP increasingly harder and more expensive to extract • Graphics processors exploit data parallelism [Chart: performance scaling of CPUs versus graphics processors (NV10 through NV35); CPU data courtesy of Francois Labonte, Stanford University]

  3. Renewed Interest in Data Parallelism • Data parallel application classes • Media, signal, network processing, scientific simulations, encryption etc. • High-end vector machines • Have always been data parallel • Academic research • Stanford Imagine, Berkeley V-IRAM, programming GPUs etc. • “Main-stream” industry • Sony Emotion Engine, Tarantula etc.

  4. Storage Hierarchy • Bandwidth taper • Only supports sequential streams/vectors • But many data parallel apps with • Data reorderings • Irregular data structures • Conditional accesses [Figure: bandwidth taper from DRAM to cache to stream/vector storage to the compute units]

  5. Sequential Streams/Vectors Inefficient • Evaluate arbitrary order access to streams [Figure: matrices a, b, and c move between memory/cache, stream/vector storage, and the compute units over time; data produced in row-major order must be reordered before it can be consumed in column-major order]

  6. Outline • Stream processing overview • Applications • Implementation • Results • Conclusion

  7. Stream Programming • Streams of records passing through compute kernels • Parallelism • Across stream elements • Across kernels • Locality • Within kernels • Between kernels [Figure: input streams in1 and in2 flow through a chain of FFT_stage kernels to an output stream]
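
To make the model on slide 7 concrete, below is a minimal, self-contained C++ sketch of the stream style of programming: data is organized as streams of records, each kernel consumes whole input streams and produces whole output streams, and intermediate streams pass directly from one kernel to the next. This is an illustrative sketch only, not the Imagine StreamC/KernelC API; the record type, the placeholder per-element work, and the kernel name fft_stage are invented for the example.

    #include <complex>
    #include <vector>

    // A "stream" is an ordered sequence of records; a kernel maps whole
    // input streams to whole output streams.
    using Record = std::complex<float>;
    using Stream = std::vector<Record>;

    // One pipeline stage. Each element is processed independently, which is
    // the data parallelism across stream elements; chaining stages exposes
    // parallelism and producer-consumer locality across kernels.
    Stream fft_stage(const Stream& in) {
        Stream out;
        out.reserve(in.size());
        for (const Record& r : in) {
            out.push_back(r * Record(0.0f, 1.0f));  // placeholder per-element work
        }
        return out;
    }

    int main() {
        Stream in1(1024, Record(1.0f, 0.0f));
        // Intermediate streams pass directly from kernel to kernel instead of
        // taking a round trip through memory.
        Stream s1 = fft_stage(in1);
        Stream s2 = fft_stage(s1);
        Stream out = fft_stage(s2);
        return out.empty() ? 1 : 0;
    }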

  8. Bandwidth Hierarchy • Stream programming is well matched to the bandwidth hierarchy [Figure: a sequence of FFT_stage kernels executes on the compute units over time; intermediate streams stay in the stream register file (SRF), and only the initial and final streams touch memory]

  9. Stream Processors • Several lanes • Execute in SIMD • Operate on records • Inter-cluster network [Figure: the memory system connects through a memory switch to SRF banks 0 through (N-1); each lane pairs an SRF bank with a compute cluster, and the clusters communicate over an inter-cluster network]

  10. Outline • Stream processing overview • Applications • Implementation • Results • Conclusion

  11. Stream-Level Data Reuse • Sequential streams only capture in-order reuse • Arbitrary access patterns in SRF capture more of the available temporal locality [Figure: taxonomy of stream data reuse: sequential (in-order) reuse, e.g. linear streams; and non-sequential reuse, divided into reordered reuse, e.g. 2-D and 3-D accesses and multi-grid, and intra-stream reuse, e.g. irregular neighborhoods and table lookups]

  12. Reordered Reuse • Indexed SRF access eliminates reordering through memory [Figure: 2D FFT data flow over time between memory/cache, the stream register file (SRF), and the compute clusters; matrix a passes through a 1D FFT to produce b, b is reordered, and a second 1D FFT produces c]

  13. Reordered Reuse • Indexed SRF access eliminates reordering through memory [Figure: the same 2D FFT data flow, second animation step, highlighting the reorder of b between the two 1D FFT passes]
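
The benefit shown on slides 12 and 13 is essentially index arithmetic: instead of writing the intermediate matrix b back to memory in row-major order and reloading it transposed, the second FFT pass computes a column-major index into the same SRF-resident data. A hedged C++ sketch of that access pattern follows; the plain array stands in for an SRF-resident stream, and the 64x64 size is taken from the 2D FFT benchmark, but nothing here is the machine's actual interface.

    #include <cstddef>
    #include <vector>

    // Intermediate matrix b, produced row by row by the first 1D FFT pass and
    // held in an SRF-resident buffer (modeled here as a plain array).
    constexpr std::size_t N = 64;

    // With sequential streams the column-major pass requires a reorder through
    // memory; with indexed access the second pass simply computes the index.
    float read_col_major(const std::vector<float>& b, std::size_t col, std::size_t i) {
        // Element i of column 'col' sits at row-major offset i*N + col.
        return b[i * N + col];
    }

    int main() {
        std::vector<float> b(N * N, 1.0f);
        float acc = 0.0f;
        for (std::size_t col = 0; col < N; ++col)
            for (std::size_t i = 0; i < N; ++i)
                acc += read_col_major(b, col, i);  // column-major pass, no reorder copy
        return acc > 0.0f ? 0 : 1;
    }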

  14. Intra-stream Reuse • Indexed SRF access eliminates • Replication in SRF • Redundant memory transfers [Figure: records A through H flow from memory/cache into the SRF and on to the compute clusters over time; records that are referenced more than once (e.g. B and D) appear repeatedly in the sequential stream]

  15. Intra-stream Reuse • Indexed SRF access eliminates • Replication in SRF • Redundant memory transfers [Figure: the same example, second animation step, highlighting the Replicate step that builds the duplicated stream before the Compute step]
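
Slides 14 and 15 describe the same idea for shared records such as table entries or irregular-neighborhood data. The hedged C++ sketch below contrasts the two styles: with sequential streams the gather must be materialized as a replicated stream before the kernel runs, while indexed SRF access lets the kernel perform the lookup itself so each shared record is held only once. The arrays stand in for SRF-resident streams; the function names are invented.

    #include <cstddef>
    #include <vector>

    // Placeholder per-element computation.
    float kernel_body(float x) { return x * 2.0f; }

    // Sequential-stream style: a replicated stream is gathered first, costing
    // SRF space and memory traffic for every repeated index.
    std::vector<float> with_replication(const std::vector<float>& table,
                                        const std::vector<std::size_t>& indices) {
        std::vector<float> replicated;
        replicated.reserve(indices.size());
        for (std::size_t i : indices) replicated.push_back(table[i]);  // gather/replicate
        std::vector<float> out;
        out.reserve(replicated.size());
        for (float v : replicated) out.push_back(kernel_body(v));
        return out;
    }

    // Indexed-SRF style: the table is loaded once and the kernel indexes it
    // directly, so repeated indices cause no replication or extra transfers.
    std::vector<float> with_indexed_access(const std::vector<float>& table,
                                           const std::vector<std::size_t>& indices) {
        std::vector<float> out;
        out.reserve(indices.size());
        for (std::size_t i : indices) out.push_back(kernel_body(table[i]));
        return out;
    }

    int main() {
        std::vector<float> table{1.0f, 2.0f, 3.0f, 4.0f};
        std::vector<std::size_t> idx{0, 2, 2, 1, 3, 2};
        return with_replication(table, idx) == with_indexed_access(table, idx) ? 0 : 1;
    }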

  16. Conditional Accesses • Fine-grain conditional accesses • Expensive in SIMD architectures • Translate to conditional address computation
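
The bullet about translating conditionals into address computation is the key SIMD-friendly trick: a data-dependent decision about whether to append a record is turned into arithmetic on the write address, so every lane executes the same instruction sequence and only the pointer update depends on the predicate. A small illustrative C++ sketch (a scalar loop standing in for the SIMD lanes; names are invented):

    #include <cstddef>
    #include <vector>

    // Compact 'in' into 'out', keeping only positive values, without branching
    // around the store itself: the predicate only feeds the address update.
    std::size_t conditional_append(const std::vector<int>& in, std::vector<int>& out) {
        out.assign(in.size(), 0);          // worst-case space, reserved up front
        std::size_t write_ptr = 0;
        for (int v : in) {
            bool keep = (v > 0);           // per-element predicate
            out[write_ptr] = v;            // every "lane" performs the store...
            write_ptr += keep ? 1 : 0;     // ...but only kept records advance the address
        }
        out.resize(write_ptr);
        return write_ptr;                  // number of records actually appended
    }

    int main() {
        std::vector<int> in{3, -1, 4, 0, 5};
        std::vector<int> out;
        return conditional_append(in, out) == 3 ? 0 : 1;
    }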

  17. Outline • Stream processing overview • Applications • Implementation • Results • Conclusion

  18. Base Architecture • Each SRF bank accesses block of b contiguous words [Figure: SRF banks 0 through (N-1) each deliver b*W bits per access to compute clusters 0 through (N-1), which are joined by the inter-cluster network]
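
For the base, sequential-only organization, the slide implies simple block interleaving: consecutive groups of b words of a stream map to consecutive lanes, so the location of every word is statically known and no per-element address needs to be sent. The sketch below illustrates one plausible mapping; the values of N and b and the exact interleaving are assumptions for illustration, not the paper's configuration.

    #include <cstddef>
    #include <cstdio>

    // Illustrative parameters only: N lanes, b words moved per SRF block access.
    constexpr std::size_t N = 8;
    constexpr std::size_t b = 4;

    struct SrfLocation {
        std::size_t lane;    // which SRF bank / compute cluster
        std::size_t offset;  // word offset within that bank
    };

    // Sequential streams: word i of a stream lands at a fixed, statically known
    // place, so no per-element address ever has to be sent to the SRF.
    SrfLocation sequential_placement(std::size_t word_index) {
        std::size_t block  = word_index / b;   // which b-word block
        std::size_t within = word_index % b;   // position inside the block
        return {block % N, (block / N) * b + within};
    }

    int main() {
        for (std::size_t i = 0; i < 12; ++i) {
            SrfLocation loc = sequential_placement(i);
            std::printf("word %zu -> lane %zu, offset %zu\n", i, loc.lane, loc.offset);
        }
        return 0;
    }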

  19. Indexed SRF Architecture • Address path from clusters • Lower indexed access bandwidth [Figure: the same organization as the base architecture, with address FIFOs added between each compute cluster and its SRF bank]

  20. Base SRF Bank • Several SRAM sub-arrays • Each access is to one sub-array [Figure: an SRF bank built from sub-arrays 0 through 3 with local word-line drivers, feeding the compute cluster]

  21. Indexed SRF Bank • Extra 8:1 mux at sub-array output • Allows 4x 1-word accesses [Figure: each of sub-arrays 0 through 3 gets its own pre-decode and row decoder, with a mux on the output path to the compute cluster]

  22. Cross-lane Indexed SRF • Address switch added • Inter-cluster network used for cross-lane SRF data [Figure: an SRF address network routes addresses from the clusters' address FIFOs to the SRF banks, while the inter-cluster network carries the returning cross-lane data]
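
Comparing the in-lane and cross-lane organizations: an in-lane indexed access only needs an offset within the local bank, while a cross-lane access also names a lane, sending the address over the SRF address network and returning the data over the inter-cluster network. The C++ sketch below shows one way such an address could be decomposed; the word-striped layout and the bank size are assumptions made for the example, not the paper's actual mapping.

    #include <cstddef>

    // Illustrative machine parameters, not the paper's configuration.
    constexpr std::size_t kLanes     = 8;     // SRF banks / compute clusters
    constexpr std::size_t kBankWords = 1024;  // words per SRF bank

    struct SrfAddress {
        std::size_t lane;        // which SRF bank holds the word
        std::size_t offset;      // word offset within that bank
        bool        cross_lane;  // must the request leave the issuing lane?
    };

    // Decompose a stream-global word index, assuming the stream is striped
    // word by word across lanes. In-lane indexing requires lane == issuing
    // cluster; cross-lane indexing allows any lane, at the cost of a trip over
    // the SRF address network and the inter-cluster network.
    SrfAddress decode(std::size_t word_index, std::size_t issuing_cluster) {
        SrfAddress a;
        a.lane       = word_index % kLanes;
        a.offset     = (word_index / kLanes) % kBankWords;
        a.cross_lane = (a.lane != issuing_cluster);
        return a;
    }

    int main() {
        // Cluster 2 asking for word 11 lands in lane 3: a cross-lane access.
        SrfAddress a = decode(11, 2);
        return (a.lane == 3 && a.cross_lane) ? 0 : 1;
    }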

  23. Overhead - Area • In-lane indexing overheads • 11% over sequential SRF • Per-sub-array independent addressing overheads • Cross-lane indexing overheads • 22% over sequential SRF • Address switch • 1.5% to 3% increase in die area (Imagine processor)

  24. Overhead - Energy • 0.1nJ (0.13µm) per indexed SRF access • ~4x sequential SRF access • > order of magnitude lower than DRAM access • 0.25nJ per cache access • Each indexed access replaces many SRF and DRAM/cache accesses

  25. Outline • Stream processing overview • Applications • Implementation • Results • Conclusion

  26. Benchmarks • 64x64 2D FFT • 2D accesses • Rijndael (AES) • Table lookups • Merge-sort • Fine-grain conditionals • 5x5 convolution filter • Regular neighborhood • Irregular graph • Irregular neighborhood access • Parameterized (IG_SML/DMS/DCS/SCL): Sparse/Dense graph, Memory/Compute-limited, Short/Long strips

  27. Machine Organizations [Figure: three configurations compared: Base (sequential SRF), Base + Cache, and Indexed SRF; each has DRAM, a memory switch, SRF banks, compute clusters, and an inter-cluster network; the cached configuration adds a cache, and the indexed configuration adds an SRF address network]

  28. Machine Parameters

  29. Off-chip Memory Bandwidth

  30. Off-chip Memory Bandwidth

  31. Execution Time

  32. Outline • Stream processing overview • Applications • Implementation • Results • Conclusion

  33. Conclusions • Data parallelism increasingly important • Current data parallel architectures inefficient for some application classes • Irregular accesses • Indexed SRF accesses • Reduce memory traffic • Reduce SRF data replication • Efficiently support complex/conditional stream accesses • Performance improvements • 3% to 410% for target application classes • Low implementation overhead • 1.5% to 3% die area

  34. Backups

  35. Indexed Access Instruction Overhead • Excludes address issue instructions

  36. Kernel C API: while(!eos(in)) { in >> a; LUT[a] >> b; c = foo(a, b); out << c; } • The indexed read is 2 separate instructions • Address issue: LUT.index << a; • Data read: LUT >> b; • Independent instructions can be scheduled between the two • Address-data separation • May require loop unrolling, software pipelining etc.
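
To show why the address-data separation can force loop restructuring, here is a hedged C++ model of the same lookup loop, software-pipelined by one iteration: the address for element i+1 is issued while the data for element i is consumed, so independent work sits between the two halves of each indexed access. The foo computation and the data types are placeholders, not the paper's kernels, and the plain C++ array lookup only models the latency-hiding structure, not the real SRF timing.

    #include <cstddef>
    #include <vector>

    // Placeholder for the kernel's per-element computation foo(a, b).
    static int foo(int a, int b) { return a + b; }

    // Software-pipelined lookup loop: the lookup for element i+1 is "issued"
    // while the result for element i is being used, mirroring the split into
    // an address-issue instruction and a later data-read instruction.
    std::vector<int> pipelined_lookup(const std::vector<int>& in,
                                      const std::vector<int>& lut) {
        std::vector<int> out;
        out.reserve(in.size());
        if (in.empty()) return out;

        int pending = lut[in[0]];                // address issued for element 0
        for (std::size_t i = 0; i + 1 < in.size(); ++i) {
            int next = lut[in[i + 1]];           // address issue for element i+1
            out.push_back(foo(in[i], pending));  // data read and use for element i
            pending = next;
        }
        out.push_back(foo(in.back(), pending));  // drain the last in-flight access
        return out;
    }

    int main() {
        std::vector<int> lut{10, 20, 30, 40};
        std::vector<int> in{3, 0, 2, 1};
        return pipelined_lookup(in, lut).size() == in.size() ? 0 : 1;
    }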

  37. Sensitivity to SRF Access Latency (1)

  38. Sensitivity to SRF Access Latency (2)

  39. Why Graphics Hardware? • Pentium 4 SSE theoretical*: 3GHz * 4 wide * .5 inst / cycle = 6 GFLOPS • GeForce FX 5900 (NV35) fragment shader observed (MULR R0, R0, R0): 20 GFLOPS, equivalent to a 10 GHz P4 • And getting faster: 3x improvement over NV30 (6 months) [Chart: GFLOPS of NV30, NV35, and Pentium 4] (Slide from Ian Buck, Stanford University; *from Intel P4 Optimization Manual)

  40. NVIDIA Graphics growth (225%/yr): essentially Moore’s Law cubed • 1: Dual textured • 2: Programmable [Chart annotations 1 and 2 mark the dual-textured and programmable generations] (Slide from Pat Hanrahan, Kurt Akeley)

  41. NVIDIA Historicals (Slide from Pat Hanrahan, Kurt Akeley)

  42. Base Architecture • Stream buffers match SRF bandwidth to compute needs [Figure: SRF banks 0 through 7 supply 128b accesses to stream buffers, which feed 32b ports on compute clusters 0 through 7, joined by the inter-cluster network]

  43. Indexed SRF Architecture • Address path from clusters • Lower indexed access bandwidth [Figure: the same organization as slide 42, with address FIFOs added between the compute clusters and the SRF banks]

  44. Base SRF Bank • Several SRAM sub-arrays [Figure: an SRF bank built from sub-arrays 0 through 3 with local word-line drivers feeding the compute cluster; 256 and 128 label the datapath widths]

  45. Indexed SRF Bank • Extra 8:1 mux at sub-array output • Allows 4x 1-word accesses [Figure: each of sub-arrays 0 through 3 gets its own pre-decode and row decoder, with an 8:1 mux on the output path to the compute cluster]

  46. Cross-lane Indexed SRF • Address switch added • Inter-cluster network used for cross-lane SRF data [Figure: an SRF address network routes addresses from the clusters' address FIFOs to SRF banks 0 through 7; cross-lane data returns over the inter-cluster network through the 32b stream buffers]
