
Enabling Efficient Intra-Warp Communication for Fourier Transforms in a Many-Core Architecture

This study explores the use of the shuffle mechanism to enable efficient data communication in Fourier transforms on many-core architectures. It compares shuffle with the traditional shared-memory approach, highlighting shuffle's cheaper data movement and higher performance. The study evaluates shuffle using matrix transpose, a data-communication step in the FFT, and presents a Shuffle Transpose Algorithm. It analyzes the bottleneck of intra-thread data movement and discusses general optimization strategies.



Presentation Transcript


  1. Enabling Efficient Intra-Warp Communication for Fourier Transforms in a Many-Core Architecture
     Student: Carlo C. del Mundo*, Virginia Tech (undergrad)
     Advisor: Dr. Wu-chun Feng*§, Virginia Tech
     * Department of Electrical and Computer Engineering, § Department of Computer Science, Virginia Tech

  2. Forecast: Hardware-Software Co-Design
     • Software: the transpose
     • Hardware: the NVIDIA Kepler K20c and its shuffle mechanism

  3–8. Q: What is shuffle?
     • Cheaper data movement
     • Faster than shared memory
     • Only in NVIDIA Tesla Kepler GPUs
     • Limited to a warp
     >>> Idea: reduce data communication between threads <<<
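
  For readers new to the intrinsic, here is a minimal sketch of a warp shuffle (ours, not from the slides): each lane reads another lane's register directly, with no shared-memory staging and no barrier. Kepler-era code used __shfl(); current CUDA requires the _sync variant with an explicit participation mask.

     // Minimal sketch: every lane fetches its right neighbor's register value.
     // Assumes a full warp of 32 active lanes (hence the 0xffffffff mask).
     __global__ void rotate_within_warp(float *data)
     {
         int lane = threadIdx.x & 31;      // lane ID within the warp
         float v = data[threadIdx.x];      // one value per lane, held in a register

         // Register-to-register exchange: one instruction, no shared memory,
         // no __syncthreads().
         float from_right = __shfl_sync(0xffffffffu, v, (lane + 1) & 31);

         data[threadIdx.x] = from_right;
     }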

  9–13. Q: What are you solving?
     • Enable efficient data communication
     • Shared memory (the "old" way)
     • Shuffle (the "new" way)

  14–16. Approach
     • Evaluate shuffle using matrix transpose
     • Matrix transpose is a data-communication step in the FFT
     • Devised a Shuffle Transpose Algorithm
     • Consists of a horizontal stage (inter-thread shuffles) and a vertical stage (intra-thread register movement); a sketch of the idea follows below
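
  To make the two stages concrete, the following is a sketch of a 4x4 register transpose built from diagonal shuffles plus per-thread register rotations. It is a reconstruction of the general technique under stated assumptions (each group of 4 lanes holds one matrix row in registers); the authors' exact staging and lane arithmetic may differ.

     // Sketch, not the authors' code. Assumes lanes 4i..4i+3 hold rows 0..3
     // of a 4x4 tile, with r[k] = A[row][k]; on return, r[k] = A[k][col].
     __device__ void transpose4x4(float r[4])
     {
         int tid = threadIdx.x & 3;        // position within the 4-lane group
         float s[4], recv[4];

         // Vertical pre-rotation (intra-thread): slot k now holds column (tid+k)%4.
         // Note the tid-dependent index: the kind of access analyzed below.
         for (int k = 0; k < 4; ++k)
             s[k] = r[(tid + k) % 4];

         // Horizontal stage (inter-thread shuffles): the register slot k is
         // uniform across lanes, so each exchange is one shuffle along a tile
         // diagonal. width=4 makes the lane index relative to the 4-lane group.
         for (int k = 0; k < 4; ++k)
             recv[k] = __shfl_sync(0xffffffffu, s[k], (tid + 4 - k) & 3, 4);

         // Vertical stage (intra-thread): undo the rotation; r[k] = A[k][tid].
         for (int k = 0; k < 4; ++k)
             r[k] = recv[(tid + 4 - k) & 3];
     }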

  17–21. Analysis
     • Bottleneck: intra-thread data movement
     [Figure: Stage 2 (Vertical): each thread t0–t3 permutes values within its own slice of the register file]

     Code 1 (NAIVE):
         for (int k = 0; k < 4; ++k)
             dst_registers[k] = src_registers[(4 - tid + k) % 4];

     • 15x: the naive vertical stage pays a 15x penalty (explained on the next slides)
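
  The rule behind that 15x (our illustration; the compile check is standard nvcc tooling, not from the slides): the register file is not indexable at run time, so an array whose indices the compiler cannot resolve at compile time is demoted to CUDA local memory, which resides in off-chip device memory.

     // Illustrative only: why compile-time indexing matters.
     __device__ float index_example(int tid)
     {
         float src_registers[4] = {0.f, 1.f, 2.f, 3.f};
         float a = src_registers[2];        // compile-time index: stays in registers
         float b = src_registers[tid & 3];  // run-time index: array demoted to local memory
         return a + b;
         // Verify with: nvcc -Xptxas -v kernel.cu
         // A nonzero stack frame or nonzero "spill stores"/"spill loads" in the
         // ptxas report indicates local-memory placement.
     }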

  22–25. Analysis: General strategies
     • Registers are fast.
     • CUDA local memory is slow.
     • The compiler is forced to place data into CUDA local memory if array indices CANNOT be determined at compile time.

     Code 1 (NAIVE), 15x:
         for (int k = 0; k < 4; ++k)
             dst_registers[k] = src_registers[(4 - tid + k) % 4];

     Code 2 (DIV), 6%, warp divergence (each branch is taken by a different thread):
         int tmp = src_registers[0];
         if (tid == 1) {
             src_registers[0] = src_registers[3];
             src_registers[3] = src_registers[2];
             src_registers[2] = src_registers[1];
             src_registers[1] = tmp;
         } else if (tid == 2) {
             src_registers[0] = src_registers[2];
             src_registers[2] = tmp;
             tmp = src_registers[1];
             src_registers[1] = src_registers[3];
             src_registers[3] = tmp;
         } else if (tid == 3) {
             src_registers[0] = src_registers[1];
             src_registers[1] = src_registers[2];
             src_registers[2] = src_registers[3];
             src_registers[3] = tmp;
         }

     Code 3 (SELP OOP), 44%, branch-free out-of-place selects:
         dst_registers[0] = (tid == 0) ? src_registers[0] : dst_registers[0];
         dst_registers[1] = (tid == 0) ? src_registers[1] : dst_registers[1];
         dst_registers[2] = (tid == 0) ? src_registers[2] : dst_registers[2];
         dst_registers[3] = (tid == 0) ? src_registers[3] : dst_registers[3];

         dst_registers[0] = (tid == 1) ? src_registers[3] : dst_registers[0];
         dst_registers[3] = (tid == 1) ? src_registers[2] : dst_registers[3];
         dst_registers[2] = (tid == 1) ? src_registers[1] : dst_registers[2];
         dst_registers[1] = (tid == 1) ? src_registers[0] : dst_registers[1];

         dst_registers[0] = (tid == 2) ? src_registers[2] : dst_registers[0];
         dst_registers[2] = (tid == 2) ? src_registers[0] : dst_registers[2];
         dst_registers[1] = (tid == 2) ? src_registers[3] : dst_registers[1];
         dst_registers[3] = (tid == 2) ? src_registers[1] : dst_registers[3];

         dst_registers[0] = (tid == 3) ? src_registers[1] : dst_registers[0];
         dst_registers[1] = (tid == 3) ? src_registers[2] : dst_registers[1];
         dst_registers[2] = (tid == 3) ? src_registers[3] : dst_registers[2];
         dst_registers[3] = (tid == 3) ? src_registers[0] : dst_registers[3];
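
  A note on the names (our reading; the slides do not expand them): SELP presumably refers to the PTX selp (select-with-predicate) instruction that the compiler can emit for such ternary expressions, and OOP to the out-of-place writes into dst_registers. Every lane executes the same straight-line sequence of selects, so there is no branch to diverge on.

     // Branch-free conditional move: with both arms side-effect free, the
     // compiler can lower the ternary to a predicated select (PTX selp)
     // instead of a branch, keeping all 32 lanes of the warp in lockstep.
     __device__ float select_sketch(int tid, float a, float b)
     {
         return (tid == 0) ? a : b;   // a conditional register move, not a jump
     }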

  26–27. Results
     [Performance figures shown on the slides; not recoverable from the transcript]

  28–29. Conclusion
     • Overall performance
       • Max. speedup (Amdahl's Law): 1.19-fold
       • Achieved speedup: 1.17-fold
     • Surprise result
       • Goal: accelerate the communication (the "gray bar")
       • Result: the computation (the "black bar") was accelerated as well
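
  A back-of-the-envelope check on the 1.19-fold bound (our arithmetic; the fraction is inferred, not stated on the slides): if the communication step takes a fraction f of total runtime and shuffle could make it free, Amdahl's Law gives

     S_max = 1 / (1 - f) = 1.19  =>  f ≈ 0.16

  so the transpose would account for roughly 16% of the 256-pt FFT's runtime, and the achieved 1.17-fold speedup recovers nearly all of it.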

  30. Thank You!
     • Enabling Efficient Intra-Warp Communication for Fourier Transforms in a Many-Core Architecture
     • Student: Carlo del Mundo, Virginia Tech (undergrad)
     • Overall performance
       • Theoretical speedup: 1.19-fold
       • Achieved speedup: 1.17-fold

     Code 1 (NAIVE):
         for (int k = 0; k < 4; ++k)
             dst_registers[k] = src_registers[(4 - tid + k) % 4];

  31. Appendix

  32. Motivation
     • Goal: accelerate an application by exploiting hardware-specific mechanisms (i.e., the hardware-software co-design process)
     • Case study
       • Application: matrix transpose as part of a 256-pt FFT
       • Architecture: NVIDIA Kepler K20c
       • Approach: use shuffle to accelerate communication
     • Results
       • Max. theoretical speedup: 1.19-fold
       • Achieved speedup: 1.17-fold

  33. Background: The New and Old
     • Shuffle
       • Idea: communicate data within a warp without shared memory
       • Pros: faster (a single instruction performs the load and store); eliminating shared memory enables higher thread occupancy
       • Cons: poorly understood; only available in Kepler GPUs; limited to the 32 threads of a warp
     • Shared memory
       • Idea: scratchpad memory for communicating data
       • Pros: easy to program; scales to a whole thread block (up to 1,024 threads on Kepler)
       • Cons: prone to bank conflicts
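
  For contrast with the shuffle sketch earlier, here is a minimal version of the "old" way (our illustration, not the authors' code): the same neighbor exchange staged through shared memory, which costs a store, a block-wide barrier, and a load.

     // The shared-memory route for the same exchange: store, synchronize, load.
     __global__ void rotate_via_shared(float *data)
     {
         __shared__ float buf[256];               // assumes blockDim.x <= 256
         int t = threadIdx.x;

         buf[t] = data[t];                        // store into the scratchpad
         __syncthreads();                         // whole-block barrier
         data[t] = buf[(t + 1) % blockDim.x];     // load a neighbor's value

         // Shuffle collapses the store/barrier/load into one instruction, at
         // the cost of being limited to the 32 lanes of a warp.
     }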
