
Synergistic Execution of Stream Programs on Multicores with Accelerators

Abhishek Udupa et al., Indian Institute of Science. Abstract: orchestrating the execution of a stream program on a multicore platform with an accelerator [GPUs, CellBE]



Presentation Transcript


  1. Synergistic Execution of Stream Programs on Multicores with Accelerators • Abhishek Udupa et al. • Indian Institute of Science

  2. Abstract • Orchestrating the execution of a stream program on a multicore platform with an accelerator [GPUs, CellBE] • Formulates the partitioning of work between the CPU cores and the GPU as an ILP that considers • the latencies of data transfers and • the required data layout transformations • Also proposes a heuristic partitioning algorithm • Speedup of 50.96X over single-threaded CPU execution

  3. Challenges • The CPU cores and the GPU operate on separate address spaces • This requires explicit DMA from the CPU to transfer data into or out of the GPU address space • The communication buffers between StreamIt filters need to be laid out in a specific fashion • Accesses need to be coalesced for the GPU • But the coalesced layout causes cache misses on the CPU • The work partitioning between the CPU and the GPU is complicated by • the DMA and buffer transformation latencies • the fact that filters have non-identical execution times on the two devices

  4. Organization of the NVIDIA GeForce 8800 series of GPUs [Figure panels: architecture of an individual SM; architecture of the GeForce 8800 GPU]

  5. CUDA Memory Model • All threads of up to 8 thread blocks can be assigned to one SM • A group of thread blocks forms a grid • A kernel call dispatched to the GPU through the CUDA runtime consists of exactly one grid

  6. Buffer Layout Consideration

  7. A Motivating Example • Assume the steady-state multiplicity is one for each actor • B is a stateful actor, which runs on the CPU • Shuffle and deshuffle costs are zero • Original stream graph (pipeline A → B → C → D → E), execution times: A (CPU: 10, GPU: 20), B (CPU: 20), C (CPU: 80, GPU: 20), D (CPU: 15, GPU: 10), E (CPU: 10, GPU: 25) • Edge weights (data transfer costs): A→B 20, B→C 10, C→D 10, D→E 60

  8. Naïve Partitioning • Naively map filter B to the CPU and execute all the other filters on the GPU • CPU Load = 20 • GPU Load = 75 • DMA Load = 30 • MII (minimum initiation interval) = 75 • Naïve mapping: A (GPU: 20), B (CPU: 20), C (GPU: 20), D (GPU: 10), E (GPU: 25)

  9. Greedy Partitioning • Greedily move each actor to the CPU or the GPU, whichever is most beneficial for its execution • CPU Load = 40 • GPU Load = 35 • DMA Load = 70 • MII = 70 • Greedy mapping: A (CPU: 10), B (CPU: 20), C (GPU: 20), D (GPU: 10), E (CPU: 10)

  10. Optimal Partitioning • CPU Load = 45 • GPU Load = 40 • DMA Load = 40 • MII = 45 • Optimal mapping: A (GPU: 20), B (CPU: 20), C (GPU: 20), D (CPU: 15), E (CPU: 10)
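
The load figures on the three partitioning slides can be checked with a small script. This is only a sketch under stated assumptions: the graph is taken to be the pipeline A → B → C → D → E, D's GPU time is taken as 10 (the value used on the partitioning slides), the MII is modelled as the largest of the CPU, GPU and DMA loads, and the exhaustive search merely stands in for the ILP formulation, which the slides do not spell out. The names `loads`, `mii` and `best_partition` are hypothetical.

```python
from itertools import product

# Execution times and edge weights read off the example slides.
CPU = {"A": 10, "B": 20, "C": 80, "D": 15, "E": 10}
GPU = {"A": 20, "C": 20, "D": 10, "E": 25}  # B is stateful, so CPU only
EDGES = [("A", "B", 20), ("B", "C", 10), ("C", "D", 10), ("D", "E", 60)]

def loads(mapping):
    """Return (CPU load, GPU load, DMA load) for a filter -> device mapping."""
    cpu = sum(CPU[f] for f, dev in mapping.items() if dev == "CPU")
    gpu = sum(GPU[f] for f, dev in mapping.items() if dev == "GPU")
    # An edge costs DMA time only when its endpoints sit on different devices.
    dma = sum(w for u, v, w in EDGES if mapping[u] != mapping[v])
    return cpu, gpu, dma

def mii(mapping):
    """Assumed model: the busiest resource bounds the initiation interval."""
    return max(loads(mapping))

def best_partition():
    """Brute-force search over all mappings with B pinned to the CPU."""
    free = [f for f in CPU if f in GPU]
    candidates = []
    for devs in product(("CPU", "GPU"), repeat=len(free)):
        mapping = dict(zip(free, devs), B="CPU")
        candidates.append((mii(mapping), mapping))
    return min(candidates, key=lambda c: c[0])

naive = {"A": "GPU", "B": "CPU", "C": "GPU", "D": "GPU", "E": "GPU"}
greedy = {"A": "CPU", "B": "CPU", "C": "GPU", "D": "GPU", "E": "CPU"}
optimal = {"A": "GPU", "B": "CPU", "C": "GPU", "D": "CPU", "E": "CPU"}
```

Under this model the naïve and optimal mappings reproduce the slide figures exactly. For the greedy mapping the listed costs give a GPU load of 30 rather than the 35 shown; the MII of 70 is set by the DMA load and is the same either way.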

  11. Software Pipelined Kernel

  12. Compilation Process

  13. Overview of the Proposed Method • To obtain performance, increase the multiplicities of the steady state • All filters that execute on the CPU are assumed to execute 128 times on each invocation • This reduces complexity • 128 is a common factor of the GPU thread counts, i.e. 128, 256, 384, 512 • Identify the number of instances of each actor
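
As background for the steady-state multiplicities mentioned above, the following sketch solves the usual synchronous-dataflow balance equations (producer firings × push rate = consumer firings × pop rate) and then scales the result. The filter names and rates are made up for illustration, `steady_state` is a hypothetical helper, and the scale factor of 128 is borrowed from the slide; the exact scaling procedure used by the authors is not shown in this transcript. Requires Python 3.9+ for multi-argument `gcd` and `lcm`.

```python
from fractions import Fraction
from math import gcd, lcm

def steady_state(edges, scale=1):
    """edges: list of (producer, consumer, push_rate, pop_rate).

    Returns the smallest integer firing counts satisfying the balance
    equations, multiplied by `scale`.
    """
    rate = {edges[0][0]: Fraction(1)}
    pending = list(edges)
    while pending:
        progressed = False
        for e in list(pending):
            u, v, push, pop = e
            if u in rate and v not in rate:
                rate[v] = rate[u] * push / pop
            elif v in rate and u not in rate:
                rate[u] = rate[v] * pop / push
            elif u in rate and v in rate:
                if rate[u] * push != rate[v] * pop:
                    raise ValueError("inconsistent rates")
            else:
                continue  # neither endpoint reached yet, retry later
            pending.remove(e)
            progressed = True
        if not progressed:
            raise ValueError("graph is not connected")
    # Clear denominators, then reduce to the smallest integer solution.
    denom = lcm(*(r.denominator for r in rate.values()))
    ints = {f: int(r * denom) for f, r in rate.items()}
    g = gcd(*ints.values())
    return {f: n // g * scale for f, n in ints.items()}

# Hypothetical pipeline: A pushes 2 / B pops 3, B pushes 1 / C pops 2.
example = [("A", "B", 2, 3), ("B", "C", 1, 2)]
base = steady_state(example)
scaled = steady_state(example, scale=128)
```

Here `base` gives the minimal schedule and `scaled` multiplies every firing count by 128, so each filter's count remains a multiple of the thread-count granularity.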

  14. Partitioning: Two Steps • Task Partitioning [ILP or Heuristic Algorithm] • Partition the stream graph into two sets, one for the GPU and one for the CPU cores • A filter (all its instances) executes either on the CPU cores or on the GPU [Reduced complexity] • Instance Partitioning [ILP] • Partition the instances of each filter across the CPU cores or across the SMs of the GPU • To obtain performance, increase the multiplicities of the steady state
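
To give a feel for the instance-partitioning step, the sketch below spreads filter instances over a set of processors (CPU cores or SMs). It is not the ILP from the slides: a greedy longest-processing-time rule is swapped in purely for illustration, communication costs are ignored, and the function name, input format and numbers are all hypothetical.

```python
import heapq

def partition_instances(instances, n_procs):
    """instances: {filter: (instance_count, time_per_instance)}.

    Greedy LPT stand-in for the ILP: place the longest remaining instance
    on the currently least-loaded processor.
    Returns (assignment, makespan).
    """
    heap = [(0, p) for p in range(n_procs)]  # (current load, processor id)
    heapq.heapify(heap)
    assignment = {p: [] for p in range(n_procs)}
    items = [(t, f, i)
             for f, (count, t) in instances.items()
             for i in range(count)]
    for t, f, i in sorted(items, reverse=True):
        load, p = heapq.heappop(heap)
        assignment[p].append((f, i))
        heapq.heappush(heap, (load + t, p))
    return assignment, max(load for load, _ in heap)

# Hypothetical workload: 2 instances of X (5 units each), 3 of Y (2 units each).
assignment, makespan = partition_instances({"X": (2, 5), "Y": (3, 2)}, n_procs=2)
```

The makespan returned here plays the role of the per-device load that bounds the initiation interval; an ILP can find better splits than this greedy rule on harder inputs.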

  15. DMA Transfers and Shuffle and Deshuffle Operations • Whenever data is transferred from the CPU to the GPU • a DMA transfer from the CPU to the GPU takes place and • a shuffle operation is then performed • For GPU-to-CPU transfers • a deshuffle is performed on the GPU • then the DMA transfer takes place
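
The slides do not define the shuffle precisely, so the following is a sketch under an assumed layout: on the CPU each filter instance's items are contiguous (cache friendly), while on the GPU item `k` of thread `t` sits at index `k * n_threads + t`, so neighbouring threads touch neighbouring addresses (coalesced). Under that assumption the shuffle is a transpose of the buffer and the deshuffle is its inverse. Function names are hypothetical, and plain Python lists stand in for device buffers.

```python
def shuffle(buf, n_threads, items_per_thread):
    """CPU layout (per-thread contiguous) -> assumed GPU layout (interleaved)."""
    assert len(buf) == n_threads * items_per_thread
    out = [None] * len(buf)
    for t in range(n_threads):
        for k in range(items_per_thread):
            out[k * n_threads + t] = buf[t * items_per_thread + k]
    return out

def deshuffle(buf, n_threads, items_per_thread):
    """Assumed GPU layout (interleaved) -> CPU layout (per-thread contiguous)."""
    assert len(buf) == n_threads * items_per_thread
    out = [None] * len(buf)
    for t in range(n_threads):
        for k in range(items_per_thread):
            out[t * items_per_thread + k] = buf[k * n_threads + t]
    return out

# Three threads with two items each: thread 0 owns [0, 1], thread 1 owns [2, 3], ...
cpu_buf = list(range(6))
gpu_buf = shuffle(cpu_buf, n_threads=3, items_per_thread=2)
round_trip = deshuffle(gpu_buf, n_threads=3, items_per_thread=2)
```

After the shuffle, the first items of all three threads are adjacent, followed by their second items, which is the access pattern coalescing needs.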

  16. Orchestrate the Execution • Orchestrate the execution [simple modulo scheduling] • Filters • DMA transfers and • Shuffle and deshuffle operations • The shuffle and deshuffle operations are always assigned to the GPU

  17. Stage Assignment [Figure: fission and processor assignment of a stream graph with filters A, S, B1, B2, C, J, D and the DMA transfers between them (Proc 1 = 32, Proc 2 = 32), followed by its assignment to pipeline stages 0 to 4]

  18. Heuristic Algorithm • Intuitively, the nodes assigned to the CPU should be those that are most beneficial to execute on the CPU • Defining • The intuition is • The nodes with the highest values are assigned to the CPU • Some of their neighbouring nodes are also assigned to the CPU • taking the DMA, shuffle and deshuffle costs into account

  19. Performance of Heuristic Partitioning

  20. Performance of the ILP vs. Heuristic Partitioner

  21. Comparison of Synergistic Execution with Other Schemes

  22. Questions?
