
SPACERAM: An embedded DRAM architecture for large-scale spatial lattice computations




Presentation Transcript


  1. SPACERAM: An embedded DRAM architecture for large-scale spatial lattice computations. Norman Margolus, MIT/BU

  2. Overview • The Task: large-scale brute-force computations with a crystalline-lattice structure • The Opportunity: exploit row-at-a-time access in DRAM to achieve speed and size • The Challenge: dealing efficiently with memory granularity, communication, and processing • The Trick: data movement by layout and addressing • 4 Mbits/DRAM, 256-bit I/O, 20 modules @ 200 MHz → 1 Tbit/sec, 10 MB, 130 mm² (.18 µm), and 8 W (memory)

  3. 1. Crystalline lattice computations 2. Mapping a lattice into hardware

  4. Crystalline lattice computations • Large scale, with spatial regularity, as in finite-difference computations • 1 Tbit/sec ≈ 10¹⁰ 32-bit multiply-accumulates / sec (sustained) • Many image processing and rendering algorithms have spatial regularity.

  5. Symbolic lattice computations • Brute-force physical simulations (complex) • Bit-mapped 3D games and virtual reality • Logic simulation (physical → wavefront) • Pyramid, multigrid and multiresolution computations

  6. Site update rate / processor chip

  7. Site update rate / processor chip

  8. 1. Crystalline lattice computations 2. Mapping a lattice into hardware

  9. Mapping a lattice into hardware • Divide the lattice up evenly among a mesh array of chips • Each chip handles an equal-sized sector of the emulated lattice

  10. Mapping a lattice into hardware

  11. Mapping a lattice into hardware • Use blocks of DRAM to hold the lattice data (Tbits/sec) • Use virtual processors for large computations (depth-first, wavefront, skew) • Main challenge: mapping lattice computations onto granular memory

  12. Lattice computation model • 1D example: the rows are bit-fields and the columns are sites • Shift bit-fields periodically and uniformly (shifts may be large) • Process sites independently and identically (SIMD) [figure: bit-fields A and B before and after a shift]
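The 1D model on this slide can be sketched in a few lines of Python (a hypothetical illustration; the bit-field contents, shift amounts, and per-site rule below are our own example, not from the talk):

```python
# Sketch of the 1D lattice computation model: shift each bit-field
# periodically and uniformly, then update every site independently.
def lattice_step(A, B, shift_a, shift_b, rule):
    n = len(A)
    # 1. Periodic, uniform shift of each bit-field (may be large).
    A = [A[(i - shift_a) % n] for i in range(n)]
    B = [B[(i - shift_b) % n] for i in range(n)]
    # 2. Process sites independently and identically (SIMD).
    #    rule maps the (a, b) bits at one site to new (a, b) bits.
    return [list(t) for t in zip(*(rule(a, b) for a, b in zip(A, B)))]

# Example rule (our choice): swap the two bits at each site.
swap = lambda a, b: (b, a)
A2, B2 = lattice_step([1, 0, 0, 0], [0, 0, 1, 1], 1, 0, swap)
```

The point of the model is that all data movement is a uniform shift and all computation is site-local, which is what makes the memory-layout tricks on the following slides possible.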

  13. Mapping the lattice into memory [figure: the bit-fields A and B laid out in memory words]

  14. Mapping the lattice into memory • CAM-8 approach: group adjacent sites into memory words • Shifts can be long, but re-aligning is messy → chaining • Can process word-wide (CAM-8 did site-at-a-time)

  15. Mapping the lattice into memory • CAM-8: adjacent bits of a bit-field are stored together [figure: a bit-field as the sum of words of adjacent bits]

  16. Mapping the lattice into memory • SPACERAM: bits are stored as evenly spread skip samples [figure: a bit-field as the sum of words of evenly spaced samples]
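The two layouts compared on slides 15 and 16 can be sketched as follows (an illustrative reconstruction; the function names and the example sizes are our own):

```python
# An n-bit bit-field stored as g words of w bits each, two ways.
def adjacent_layout(field, w):
    # CAM-8 style: word k holds w adjacent lattice sites.
    return [field[k*w:(k+1)*w] for k in range(len(field) // w)]

def skip_sample_layout(field, w):
    # SPACERAM style: word j holds w evenly spaced sites, g apart.
    g = len(field) // w
    return [[field[j + k*g] for k in range(w)] for j in range(g)]

field = list(range(8))
adj  = adjacent_layout(field, 4)     # words of adjacent sites
skip = skip_sample_layout(field, 4)  # words of evenly spread samples
```

With the skip-sample layout, a uniform lattice shift maps every word onto some other whole word (plus a rotation), which is what the next slides exploit.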

  17. Processing skip-samples • Process sets of evenly spaced lattice sites (site groups) • Data movement is done by reading shifted data [figure: site groups A[0]–A[3] and B[0]–B[3]]

  18.–28. Processing skip-samples [animation frames: the site groups A[0]–A[3] and B[0]–B[3] are processed one word at a time; the shifted A data is fetched by addressing a different word for each group, illustrating data movement by addressing alone]

  29. • Assume the hardware word size is 64 bits • Assume we can address any word • For each site group: address the next word we need; rotate if necessary • Each bit-field resides in a different module [diagram: DRAM module — DRAM block (64-bit words) → address → 64 → barrel rotator (shift amount) → 64 → to processor]
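The addressing scheme on this slide can be sketched as follows (our reconstruction in Python; decomposing the shift s into a word offset r and a rotation amount q is an assumption consistent with the skip-sample layout):

```python
# Reading a shifted site group with one word fetch plus a barrel rotation.
def skip_sample_layout(field, w):
    g = len(field) // w
    return [[field[j + k*g] for k in range(w)] for j in range(g)]

def read_shifted_group(mem, j, s, w):
    g = len(mem)
    q, r = divmod(s, g)                  # shift = q whole sample strides + r words
    src = (j - r) % g                    # which word to address
    rot = (q + (1 if j < r else 0)) % w  # barrel-rotator shift amount
    word = mem[src]
    return [word[(k - rot) % w] for k in range(w)]

# Check against a direct computation of the shifted site group.
n, w = 16, 4
field = list(range(n))
mem = skip_sample_layout(field, w)
s, j = 5, 0
direct = [field[(j + k*(n//w) - s) % n] for k in range(w)]
via_hw = read_shifted_group(mem, j, s, w)
```

No data is ever copied to realign it: the layout guarantees that every shifted site group is some stored word, read at a different address and rotated.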

  30. Problem: granularity • A DRAM block has structure: e.g., all words in a row must be used before switching rows • Solution: grouping [figure: samples grouped so that all the words of a row are consumed together]

  31. • 3 levels of granularity: 2K-bit row, 256-bit word, 64-bit word • All handled by data layout, addressing, and rotation within the 64-bit word [diagram: DRAM module — DRAM block (2K × 8 × 256) → row addr, col addr → 256 → sub-addr (64×4:64) → 64 → barrel rotator (shift amount) → 64 → to processor]
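The three granularity levels named on this slide suggest an address decomposition along these lines (a hypothetical sketch; the field widths follow the slide's 2K-bit row / 256-bit word / 64-bit sub-word figures, but the exact ordering of the fields is our assumption):

```python
# Decompose a bit address into row, column word, sub-word, and rotation.
ROW_BITS, WORD_BITS, SUBWORD_BITS = 2048, 256, 64

def decompose(bit_addr):
    row, rest = divmod(bit_addr, ROW_BITS)      # which 2K-bit DRAM row
    col, rest = divmod(rest, WORD_BITS)         # which 256-bit word in the row
    sub, rot  = divmod(rest, SUBWORD_BITS)      # which 64-bit sub-word, plus
    return row, col, sub, rot                   # the barrel-rotator amount
```

Each level of the decomposition corresponds to one stage of the datapath: row/column addressing into the DRAM block, sub-word selection, then rotation.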

  32. Gluing chips together • All chips operate in lockstep • Shift is periodic within each sector • To glue sectors, bits that stick out replace corresponding bits in the next sector

  33. • Any data that wraps around is transmitted over the mesh • The corresponding bit on the corresponding wire is substituted for it • This is done sequentially in each of the three directions [diagram: the datapath of slide 31, with wrap info exchanged over the xyz mesh and substituted into the 64-bit word before it reaches the processor]
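The gluing step described on slides 32 and 33 can be sketched in 1D (our simplification: one sector per chip, a single bit-field, and a shift s no larger than a sector; the real chip performs the substitution sequentially in each of the three mesh directions):

```python
# Sector gluing: a periodic shift within each sector, followed by
# substituting the wrapped bits with the neighboring sector's wrapped bits,
# turns per-sector shifts into one global shift across the whole lattice.
def glue_shift(sectors, s):
    # 1. Shift periodically within each sector (all chips in lockstep).
    shifted = [[sec[(i - s) % len(sec)] for i in range(len(sec))]
               for sec in sectors]
    # 2. The first s sites of each sector hold wrapped-around data; replace
    #    them with the bits that wrapped in the previous sector (sent over
    #    the mesh), bit for bit on corresponding wires.
    out = []
    for c, sec in enumerate(shifted):
        prev = shifted[(c - 1) % len(shifted)]
        out.append(prev[:s] + sec[s:])
    return out
```

Concatenating the glued sectors reproduces a single periodic shift of the whole emulated lattice, which is why the sectors can be scaled up just by adding chips.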

  34.–35. Partitioning a 2D sector [figures]

  36. The SPACERAM chip • .18 µm CMOS DRAM process (20 W, 215 mm²) • 10% of memory bandwidth → control • Memory hierarchy (external memory) • Single DRAM address space

  37. The SPACERAM chip [diagram: DRAM modules #00–#19 feeding processing elements PE0–PE63]

  38. The SPACERAM chip [diagram: module #19, 2K rows × 2K cols, with 32-bit and 2K-bit paths fanning out to PE0, PE1, …, PE63]

  39. SPACERAM: symbolic PE • Computation by LUT • A permuter lets any DRAM wire play any role • A MUX lets the LUT output be used conditionally • The LUT is really a MUX, with its table data bussed
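The symbolic PE on this slide can be modeled in a few lines (our reconstruction; the input width, wire indices, and XOR table in the example are our own assumptions):

```python
# Model of a LUT-based processing element: a permuter selects the LUT
# inputs from the DRAM wires, the LUT (really a MUX over bussed table
# data) computes the new bit, and a MUX applies it conditionally.
def pe_update(wires, perm, lut, cond_wire, out_wire):
    inputs = [wires[p] for p in perm]             # permuter: any wire, any role
    index = sum(b << i for i, b in enumerate(inputs))
    new_bit = lut[index]                          # LUT lookup == MUX select
    # Conditional update: pass the old bit through unless the condition
    # wire is set.
    return new_bit if wires[cond_wire] else wires[out_wire]

# Example: a 2-input XOR table, indexed with the first input as the LSB.
xor_lut = [0, 1, 1, 0]
```

Because the table data is bussed, the same hardware MUX serves as an arbitrary LUT: changing the table changes the per-site rule without changing the datapath.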

  40. SPACERAM: symbolic PE

  41. Conclusions • Addressing is powerful • Simple hardware provides a general mechanism for lattice data movement • We can exploit the parallelism of DRAM access without extra overhead • .18 µm CMOS, 10 Mbytes DRAM, 1 Tbit/sec to memory, 215 mm², 20 W • http://www.im.lcs.mit.edu/nhm/isca.pdf
