M-Machine and Grids Parallel Computer Architectures

Presentation Transcript

  1. M-Machine and Grids: Parallel Computer Architectures Navendu Jain

  2. Readings • The M-Machine multicomputer, Fillo et al., MICRO 1995 • Exploiting fine-grain thread level parallelism on the MIT multi-ALU processor, Keckler et al., MICRO 1998 • A design space evaluation of grid processor architectures, Nagarajan et al., MICRO 2001

  3. Outline • The M-Machine Multicomputer • Thread Level Parallelism on M-Machine • Grid Processor Architectures • Review and Discussion

  4. The M-Machine Multicomputer

  5. Design Motivation • Achieve higher throughput of memory resources • Increase chip area devoted to processors • Arithmetic to bandwidth ratio of 12 operations/word • Minimize global communication (local sync.) • Faster execution of fixed size problems • Easier programmability of parallel computers • Incremental approach

  6. Architecture • A bi-directional 3-D mesh network of multi-threaded processing nodes • Each chip comprises a multi-ALU processor (MAP) and 128KB of on-chip synchronous DRAM • A user-accessible message passing system (SEND) • Single global virtual address space • Target CLK 100 MHz (control logic 40MHz)

  7. Multi-ALU processor (MAP) • A MAP chip comprises: • Three 64-bit 3-issue clusters • 2-way interleaved on-chip cache • A memory switch • A cluster switch • External memory interface • On-chip network interfaces and routers

  8. A MAP Cluster • 64-bit three-issue pipelined processor • 2 Integer ALUs • 1 Floating-point ALU • Register files • 4KB Instruction cache • A MAP instruction holds 1, 2 or 3 operations

  9. MAP Chip Die (18 mm per side, 5M transistors)

  10. Exploiting Parallelism on M-Machine

  11. Threads • Exploit ILP both within and across the clusters • Horizontal Threads (H-Threads) • Instruction-level parallelism • Each executes on a single MAP cluster • 3-wide instruction stream • Communication/synchronization through messages/registers/memory • Max. 6 H-Threads can be interleaved dynamically on a cycle-by-cycle basis

  12. Threads (contd.) • Vertical Threads (V-Threads) • Thread-level parallelism (a standard process) • Contains up to 4 H-Threads (one per cluster) • Flexibility of scheduling (compiler/run-time) • Communication/synchronization through registers • At most 6 resident V-Threads • 4 user slots, 1 event slot, 1 exception slot
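The V-Thread/H-Thread hierarchy on the two slides above can be sketched as a toy data model. This is purely illustrative software, not the hardware mechanism; the class and method names (`VThread`, `fork_h_thread`) are made up for the sketch, with `fork_h_thread` loosely standing in for the `hfork` operation mentioned later.

```python
# Toy model of the M-Machine thread hierarchy: up to 6 resident
# V-Threads (4 user slots, 1 event slot, 1 exception slot), each
# holding at most 4 H-Threads, one per MAP cluster.

MAX_V_THREADS = 6          # resident V-Thread slots
CLUSTERS = 4               # one H-Thread per cluster

class VThread:
    def __init__(self, slot_role):
        self.slot_role = slot_role           # "user", "event", or "exception"
        self.h_threads = [None] * CLUSTERS   # at most one H-Thread per cluster

    def fork_h_thread(self, cluster, name):
        """Place an H-Thread on a free cluster (loosely, an hfork)."""
        if self.h_threads[cluster] is not None:
            raise RuntimeError(f"cluster {cluster} busy")
        self.h_threads[cluster] = name

slots = ["user"] * 4 + ["event", "exception"]
resident = [VThread(role) for role in slots]

# One V-Thread spreads two H-Threads across clusters 0 and 1.
resident[0].fork_h_thread(0, "loop_body")
resident[0].fork_h_thread(1, "prefetch")
```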

  13. Concurrency Model Three Levels of Parallelism • Instruction Level Parallelism (~1 instruction) • VLIW, Superscalar processors • Issues: Control flow, data dependency, scalability • Thread Level Parallelism (~1000 instructions) • Chip multiprocessors • Issues: Limited coarse TLP, inner cores non-optimal • Fine-grain Parallelism (~50 – 1000 instructions)

  14. Mapping Program Architecture Granularity

  15. Fine-grain overheads • Thread creation (11 cycles – hfork) • Communication • Register-to-register reads/writes • Message passing/on-chip cache • Synchronization • Blocking on a register (full/empty bit) • Barrier (cbar instruction) • Memory (sync bit)
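The full/empty-bit synchronization in the slide above can be mimicked in software: a consumer's read blocks until a producer marks the word "full", and reading empties it again. This is only a software analogy built on `threading.Condition`, not the hardware register mechanism.

```python
# Software analogy of full/empty-bit synchronization: a read on an
# "empty" word blocks; a write fills the word and wakes the reader;
# reading consumes the value and empties the word again.
import threading

class SyncWord:
    def __init__(self):
        self._full = False                 # the full/empty bit
        self._value = None
        self._cv = threading.Condition()

    def write(self, value):
        with self._cv:
            self._value = value
            self._full = True              # mark the word "full"
            self._cv.notify_all()

    def read(self):
        with self._cv:
            self._cv.wait_for(lambda: self._full)  # block while empty
            self._full = False             # consuming empties the word
            return self._value

word = SyncWord()
result = []
consumer = threading.Thread(target=lambda: result.append(word.read()))
consumer.start()        # consumer blocks: the word starts empty
word.write(42)          # producer fills the word, waking the consumer
consumer.join()
```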

  16. Grid Processor Architecture

  17. Design Motivation • Continued scaling of the clock rate • Scalability of the processing core • Higher ILP - Instruction throughput (IPC) • Mitigate global wire and delay overheads • Closer coupling of Architecture and compiler

  18. Architecture • An inter-connected 2-D network of ALU arrays • Each node has an instruction buffer (IB) and an execution unit • A single control thread maps instructions to nodes • Block-Atomic Execution Model • Maps blocks of statically scheduled instructions • Dynamic execution in data-flow order • Forwards temporary values to the consumer ALUs • Critical path scheduled along the shortest physical path
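The block-atomic, data-flow-order execution sketched in the slide above can be illustrated with a tiny interpreter: each instruction in a mapped block fires once all of its operands have arrived, then forwards its result point-to-point to its consumers rather than through a shared register file. The four-instruction block below is invented for illustration; it is not GPA code.

```python
# Minimal data-flow execution of a mapped instruction block:
# an instruction fires when all operands have arrived, then
# forwards its result directly to its consumer instructions.

block = {
    #  name: (op,       inputs,       consumers)
    "i0": ("const2", [],           ["i2"]),
    "i1": ("const3", [],           ["i2"]),
    "i2": ("add",    ["i0", "i1"], ["i3"]),
    "i3": ("double", ["i2"],       []),
}
ops = {"const2": lambda: 2, "const3": lambda: 3,
       "add": lambda a, b: a + b, "double": lambda a: 2 * a}

arrived = {name: {} for name in block}   # operands received so far
results = {}
ready = [n for n, (_, ins, _) in block.items() if not ins]

while ready:                             # execute in data-flow order
    name = ready.pop()
    op, inputs, consumers = block[name]
    args = [arrived[name][src] for src in inputs]
    results[name] = ops[op](*args)
    for dest in consumers:               # forward result to consumers
        arrived[dest][name] = results[name]
        if len(arrived[dest]) == len(block[dest][1]):
            ready.append(dest)           # all operands present: fire
```

Here `i2` computes 2 + 3 = 5 only after both constants arrive, and `i3` doubles that to 10; no instruction ever reads a shared register inside the block.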

  19. GPA Architecture

  20. Example: Block-Atomic Mapping

  21. Implementation • Instruction fetch and map • predicated hyper-block, move instructions • Execution - control logic • Operand routing – max 3 dest., split instructions • Hyper-block control • Predication (execute-all approach), cmove instructions • Block-commit • Block-stitching
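The "execute-all" predication with cmove instructions in the slide above can be shown with a toy example: both arms of a branch are computed unconditionally, and a conditional move commits only the arm whose predicate holds. The functions below are made-up illustrations, not GPA instructions.

```python
# Execute-all predication: compute both arms of the branch,
# then let a conditional move (cmove) select one result.

def cmove(pred, if_true, if_false):
    """Conditional move: selects between two already-computed values."""
    return if_true if pred else if_false

def abs_predicated(x):
    pos = x                              # both arms execute...
    neg = -x
    return cmove(x >= 0, pos, neg)       # ...then cmove commits one
```

This trades wasted work on the untaken arm for the removal of a control-flow branch, which keeps the block's instructions statically schedulable.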

  22. Review and Discussion

  23. Key Ideas: Convergence • Microprocessor as a number of superscalar processors; comm./sync. via registers – low overheads • Exploiting ILP – TLP granularities • Dependencies mapped to a grid of ALUs • Replication reduces design/verification effort • Point-to-point communication • Exposing architecture partitioning and the flow of operations to the compiler • Avoids wire/routing delays and the memory wall problem

  24. Ideas: Divergence M-Machine • On-chip cache and register-based mechanisms [delays] • Broadcasting and point-to-point communication GPA • Register set as a grid: chaining [scalability] • Point-to-point communication TERA • Fine-grain threads – memory comm./sync. (full/empty bits) • No support for single-threaded code

  25. Drawbacks (Unresolved Issues) M-Machine • Scalability • Clock speeds • Memory synchronization (use hfork) Grid Processor Arch. • Data caches far from ALUs • Delays between dependent operations due to network routers and wires • Complex frame management and block-stitching • Explicit compiler dependence

  26. Challenges/Future Directions • Architectural support to extract TLP • Parallelizing compiler technology • How many cores/threads • No. of threads – memory latency, wire delays [Flynn] • Inter-thread communication • Height of Grid == 8 (IPC 5-6) [GPA, Peter] • Optimization - f(comm., delays, memory costs)

  27. Challenges (contd.) • On-the-fly data-dependence detection (RAW/WAR) • TLP/ILP balance – M-Machine Multicomputer

  28. Thanks