
Message Passing On Tightly-Interconnected Multi-Core Processors


Presentation Transcript


  1. Message Passing On Tightly-Interconnected Multi-Core Processors James Psota and Anant Agarwal MIT CSAIL

  2. Technology Scaling Enables Multi-Cores • Multi-cores offer a novel environment for parallel computing [figure: a cluster vs. a multi-core chip]

  3. Traditional Communication On Multi-Processors Interconnects • Ethernet TCP/IP • Myrinet • Scalable Coherent Interconnect (SCI) Shared Memory • Shared caches or memory • Remote DMA (RDMA) [images: Beowulf Cluster, AMD Dual-Core Opteron]

  4. On-Chip Networks Enable Fast Communication • Some multi-cores offer… • tightly integrated on-chip networks • direct access to hardware resources (no OS layers) • fast interrupts MIT Raw Processor used for experimentation and validation

  5. Parallel Programming is Hard • Must orchestrate computation and communication • Extra resources present both opportunity and challenge • Trivial to deadlock • Constraints on message sizes • No operating system support
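
For example, the following head-to-head exchange (standard MPI calls, illustrative buffer sizes) deadlocks when sends block and the network provides minimal buffering: each rank blocks in its send and never reaches its receive.

    /* Both ranks send first; with blocking sends and minimal network
     * buffering, neither send completes and the program deadlocks.   */
    #include <mpi.h>

    int main(int argc, char **argv) {
        int rank;
        double out[1024] = {0}, in[1024];
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        int peer = 1 - rank;                 /* assumes exactly two ranks */
        MPI_Send(out, 1024, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD);
        MPI_Recv(in, 1024, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        MPI_Finalize();
        return 0;
    }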

  6. rMPI’s Approach Goals • robust, deadlock-free, scalable programming interface • easy to program through high-level routines Challenge • exploit hardware resources for efficient communication • don’t sacrifice performance

  7. Outline • Introduction • Background • Design • Results • Related Work

  8. The Raw Multi-Core Processor • 16 identical tiles • processing core • network routers • 4 register-mapped on-chip networks • Direct access to hardware resources • Hardware fabricated in ASIC process Raw Processor

  9. Raw’s General Dynamic Network • Handles run-time events • interrupts, dynamic messages • Network guarantees atomic, in-order messages • Dimension-ordered wormhole routed • Maximum message length: 31 words • Blocking sends/receives • Minimal network buffering
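
To make the network model concrete, a hypothetical C-level sketch of sending one short GDN message; gdn_send_header and gdn_send_word are illustrative names, not Raw's actual interface.

    /* Hypothetical sketch only: the real interface is register-mapped,
     * so each word is written directly to a network register.          */
    void send_small(int dest_tile, const int *words, int len) {
        /* len must fit within one GDN message: at most 31 words */
        gdn_send_header(dest_tile, len);   /* route + length information         */
        for (int i = 0; i < len; i++)
            gdn_send_word(words[i]);       /* blocks if the network FIFO is full */
    }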

  10. MPI: Portable Message Passing API • Gives programmers high-level abstractions for parallel programming • send/receive, scatter/gather, reductions, etc. • MPI is a standard, not an implementation • many implementations for many HW platforms • over 200 API functions • MPI applications portable across MPI-compliant systems • Can impose high overhead

  11. MPI Semantics: Cooperative Communication • Data exchanged cooperatively via explicit send and receive • Receiving process’s memory only modified with its explicit participation • Combines communication and synchronization [figure: processes 0 and 1 each have a private address space and exchange data over a communication channel; process 0 issues send(dest=1, tag=42) and send(dest=1, tag=17), process 1 issues recv(src=0, tag=42) and recv(src=0, tag=17); arriving data raises an interrupt and is held in a temporary buffer until the matching receive]
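
For concreteness, a minimal C program using standard MPI calls that mirrors the figure: process 0 sends two tagged messages, and process 1 receives them by matching tag (the data values are illustrative).

    /* Cooperative communication: the receiver's memory (x, y) is only
     * modified by its own explicit receive calls.                      */
    #include <mpi.h>

    int main(int argc, char **argv) {
        int rank, a = 17, b = 42, x, y;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0) {
            MPI_Send(&b, 1, MPI_INT, 1, 42, MPI_COMM_WORLD);   /* tag 42 */
            MPI_Send(&a, 1, MPI_INT, 1, 17, MPI_COMM_WORLD);   /* tag 17 */
        } else if (rank == 1) {
            MPI_Recv(&x, 1, MPI_INT, 0, 42, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Recv(&y, 1, MPI_INT, 0, 17, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }
        MPI_Finalize();
        return 0;
    }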

  12. Outline • Introduction • Background • Design • Results • Related Work

  13. rMPI System Architecture

  14. High-Level MPI Layer • Argument checking (MPI semantics) • Buffer prep • Calls appropriate low level functions • LAM/MPI partially ported

  15. Collective Communications Layer • Algorithms for collective operations • Broadcast • Scatter/Gather • Reduce • Invokes low level functions
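
For illustration, the kind of collective operations this layer implements, expressed with standard MPI calls (the local computation is a placeholder).

    /* Broadcast a value from the root, then reduce partial results back. */
    #include <mpi.h>

    int main(int argc, char **argv) {
        int rank, n = 0, local, sum = 0;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0) n = 1000;                        /* root chooses the value   */
        MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);   /* broadcast n to all ranks */
        local = rank * n;                               /* placeholder local work   */
        MPI_Reduce(&local, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
        MPI_Finalize();
        return 0;
    }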

  16. Point-to-Point Layer • Low-level send/receive routines • Highly optimized interrupt-driven receive design • Packetization and reassembly
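
A hypothetical sketch of that interrupt-driven flow; every function and structure name here is illustrative, not rMPI's actual internals. Arriving packets are buffered by the interrupt handler, and the blocking receive drains and reassembles them.

    /* Illustrative sketch only.
     * Interrupt handler: buffer every arriving GDN packet by source.     */
    void gdn_interrupt_handler(void) {
        while (gdn_packet_available())          /* hypothetical intrinsic  */
            enqueue_packet(gdn_read_packet());  /* per-source pending list */
    }

    /* Blocking MPI-level receive: wait until all packets of the matching
     * message have been buffered, then reassemble into the user buffer.  */
    void rmpi_recv(int src, int tag, void *buf, int len) {
        while (!message_complete(src, tag))
            wait_for_interrupt();
        reassemble(src, tag, buf, len);
    }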

  17. Outline • Introduction • Background • Design • Results • Related Work

  18. rMPI Evaluation • How much overhead does high-level interface impose? • compare against hand-coded GDN • Does it scale? • with problem size and number of processors? • compare against hand-coded GDN • compare against commercial MPI implementation on cluster

  19. End-to-End Latency Overhead vs. Hand-Coded (1) • Experiment measures latency for: • sender: load message from memory • sender: break up and send message • receiver: receive message • receiver: store message to memory

  20. End-to-End Latency Overhead vs. Hand-Coded (2) [chart: overhead by message size; 1 word: 481%, 1000 words: 33%; annotations: packet management complexity, overflows cache]

  21. Performance Scaling: Jacobi [charts: speedup for 16x16 and 2048x2048 input matrices]

  22. Performance Scaling: Jacobi, 16 processors [charts: annotations mark the sequential version and the point of cache capacity overflow]

  23. Overhead: Jacobi, rMPI vs. Hand-Coded [chart: annotations: many small messages, memory access synchronization; 16 tiles: 5% overhead]

  24. Matrix Multiplication: rMPI vs. LAM/MPI [chart annotation: many smaller messages; smaller message length has less effect on LAM]

  25. Trapezoidal Integration: rMPI vs. LAM/MPI

  26. Pi Estimation: rMPI vs. LAM/MPI

  27. Related Work • Low-latency communication networks • iWarp, Alewife, INMOS • Multi-core processors • VIRAM, Wavescalar, TRIPS, POWER 4, Pentium D • Alternatives to programming Raw • scalar operand network, CFlow, rawcc • MPI implementations • OpenMPI, LAM/MPI, MPICH

  28. Summary • rMPI provides an easy yet powerful programming model for multi-cores • Scales better than a commercial MPI implementation • Low overhead relative to hand-coded applications

  29. Thanks! For more information, see Master’s Thesis: http://cag.lcs.mit.edu/~jim/publications/ms.pdf

  30. rMPI messages broken into packets • GDN messages have a max length of 31 words • Receiver buffers and demultiplexes packets from different sources • Messages received upon interrupt and buffered until the user-level receive [figure: rMPI packet format for a 65-payload-word MPI message; two rMPI sender processes and an rMPI receiver process]
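
A hypothetical sketch of the sender-side packetization this slide describes; the helper names and the header size are illustrative, not rMPI's actual packet format.

    /* Illustrative only: split an MPI message into GDN packets of at most
     * 31 words, reserving a few words per packet for an rMPI header.      */
    #define GDN_MAX_WORDS  31
    #define HDR_WORDS       2                 /* illustrative header size  */
    #define PAYLOAD_WORDS  (GDN_MAX_WORDS - HDR_WORDS)

    void rmpi_send(int dest, int tag, const int *msg, int total_words) {
        int sent = 0, seq = 0;
        while (sent < total_words) {
            int chunk = total_words - sent;
            if (chunk > PAYLOAD_WORDS) chunk = PAYLOAD_WORDS;
            send_packet_header(dest, tag, seq++, chunk);  /* hypothetical helper */
            for (int i = 0; i < chunk; i++)
                gdn_send_word(msg[sent + i]);             /* hypothetical helper */
            sent += chunk;
        }
    }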

  31. rMPI: enabling MPI programs on Raw rMPI… • is compatible with current MPI software • gives programmers already familiar with MPI an easy interface to program Raw • gives programmers fine-grained control over their programs when automatic parallelization tools are not adequate • gives users a robust, deadlock-free, and high-performance programming model with which to program Raw ► easily write programs on Raw without overly sacrificing performance

  32. Packet boundary bookkeeping • Receiver must handle packet interleaving across multiple interrupt handler invocations

  33. Receive-side packet management • Global data structures accessed by interrupt handler and MPI Receive threads • Data structure design minimizes pointer chasing for fast lookups • No memcpy for receive-before-send case
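
An illustrative C sketch of the kind of per-source bookkeeping described here; the struct layout and names are assumptions, not rMPI's actual data structures.

    /* Illustrative only: per-source receive state kept in a flat array so
     * the interrupt handler can find a message without pointer chasing.   */
    #define MAX_TILES 16                 /* Raw has 16 tiles */

    typedef struct {
        int  tag;                        /* tag of the in-flight message      */
        int  expected_words;             /* total payload words expected      */
        int  received_words;             /* payload words accumulated so far  */
        int *buffer;                     /* user buffer if the receive was    */
                                         /* posted first (no memcpy needed),  */
                                         /* otherwise temporary storage       */
        int  recv_posted;                /* nonzero: write directly to user   */
    } recv_state_t;

    static recv_state_t recv_state[MAX_TILES];   /* indexed by sender tile */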

  34. User-thread control-flow graph (CFG) for receiving

  35. Interrupt handler control-flow graph (CFG) • logic supports MPI semantics and packet construction

  36. Future work: improving performance • Comparison of rMPI to a standard cluster running an off-the-shelf MPI library • Improve system performance • further minimize MPI overhead • spatially-aware collective communication algorithms • further Raw-specific optimizations • Investigate new APIs better suited for tiled processor architectures (TPAs)

  37. Future work: HW extensions • Simple hardware tweaks may significantly improve performance • larger input/output FIFOs • simple switch logic/demultiplexing to handle packetization could drastically simplify software logic • larger header words (64 bit?) would allow for much larger (atomic) packets • (also, current header only scales to 32 x 32 tile fabrics)

  38. Conclusions • MPI standard was designed for “standard” parallel machines, not for tiled architectures • MPI may no longer make sense for tiled designs • Simple hardware could significantly reduce packet management overhead → increase rMPI performance
