This presentation by Charlene DiMeglio explores the use of GPUs for general-purpose high-performance computing, focusing on the need to maximize their untapped potential beyond graphics. It discusses various challenges, including high costs, memory management, and latency issues, and proposes solutions such as improved CUDA drivers and efficient scheduling techniques. By comparing performance metrics between GPUs and supercomputers, the presentation highlights how strategic utilization of GPGPUs can enhance speed and efficiency, making their use more practical for computational tasks.
Utilization of GPUs for General Computing Presenter: Charlene DiMeglio Paper: Aspects of GPU for General Purpose High Performance Computing, Suda, Reiji, et al.
Overview • Problem: • We want to use the GPU for things other than graphics, but the costs can be high • Solution: • Improve the CUDA drivers • Results: • Compared to a node of a supercomputer, the GPU is worth it • Conclusion: • These improvements make using GPGPUs more feasible
Problem: Need for Computation Power • Why GPUs? • GPUs are not being fully realized as a resource, often sitting idle when not being used for graphics • Better performance for less power compared to CPUs • What's the issue? Cost. • Efficient scheduling – timing data loads with their uses • Memory management – using the small amount of available memory effectively • Loads and stores – waiting for memory transfers, which take hundreds of cycles
Solutions • Brook+ by AMD, Larrabee by Intel • CUDA by NVIDIA • Greatest technological maturity at the time • The paper investigates the existing technology and suggests improvements (figure: Tesla GPU architecture – 30 multiprocessors, 8 streaming processors each, 16 KB shared memory per multiprocessor)
NVIDIA's Tesla C1060 GPU vs. Hitachi HA8000-tc/RS425 (T2K) Supercomputer • T2K – the fastest supercomputer in Japan at the time
Issues to Overcome • High SIMD vector length • Small main memory size • High register spill cost • No L2 cache, but rather read-only texture caches
Methods to Hide Latency • A CUDA compiler option limits the number of registers used per thread • 1 warp = a group of 32 threads in a block executing in lockstep (SIMD) • Maximizes the number of warps that can run at a time • Could cause spills • Variable-sized multi-round data transfer scheduling over PCI Express • PCI Express allows data transfers and GPU and CPU computation to occur in parallel • Allows for a constant flow of information • Reduces relative overhead to O(log x / x), compared with uniform scheduling's O(1/√x) (see the sketch below)
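A minimal CUDA sketch, not taken from the paper, of the two ideas on this slide: the kernel's __launch_bounds__ qualifier caps register pressure per thread the way nvcc's --maxrregcount option does globally, and two streams with geometrically growing transfer rounds let each round's PCI Express copy overlap the previous round's computation. The scale kernel, the chunk sizes, and all names are hypothetical.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical per-round computation. __launch_bounds__(256, 4) asks the
// compiler to fit at least 4 blocks of 256 threads per multiprocessor,
// which caps registers per thread (nvcc's --maxrregcount does this globally).
__global__ void __launch_bounds__(256, 4) scale(float* data, int n, float f) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= f;
}

int main() {
    const int total = 1 << 20;
    float *h_data, *d_data;
    cudaMallocHost((void**)&h_data, total * sizeof(float)); // pinned: required for async copies
    cudaMalloc((void**)&d_data, total * sizeof(float));
    for (int i = 0; i < total; ++i) h_data[i] = 1.0f;

    cudaStream_t streams[2];
    cudaStreamCreate(&streams[0]);
    cudaStreamCreate(&streams[1]);

    // Variable-sized rounds: start small so computation begins early, then
    // double each round, so only O(log x) rounds pay the fixed PCIe latency.
    int offset = 0, chunk = 4096, round = 0;
    while (offset < total) {
        int n = (total - offset < chunk) ? total - offset : chunk;
        cudaStream_t s = streams[round % 2]; // alternate streams so round k+1's
                                             // copy overlaps round k's kernel
        cudaMemcpyAsync(d_data + offset, h_data + offset, n * sizeof(float),
                        cudaMemcpyHostToDevice, s);
        scale<<<(n + 255) / 256, 256, 0, s>>>(d_data + offset, n, 2.0f);
        offset += n;
        chunk *= 2;
        ++round;
    }
    cudaDeviceSynchronize();
    cudaMemcpy(h_data, d_data, total * sizeof(float), cudaMemcpyDeviceToHost);
    printf("h_data[%d] = %.1f\n", total - 1, h_data[total - 1]); // expect 2.0

    cudaStreamDestroy(streams[0]);
    cudaStreamDestroy(streams[1]);
    cudaFreeHost(h_data);
    cudaFree(d_data);
    return 0;
}
```

Starting with small rounds gets the GPU computing early; doubling the round size keeps the number of rounds, and hence the fixed per-round latency paid, logarithmic in the total data size.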
Methods to Hide Latency • If the computation time between communications > the communication latency, it is worth sending the data over to the GPU • Increasing bandwidth and message size makes the constant term in the overhead latency less significant • Efficient use of registers to prevent spills • Deciding where work should run, GPU vs. CPU (work sharing) • Minimizing divergent warps using the atomic operations found in CUDA (see the sketch below) • Divergent warps occur when the threads of a warp must execute both sides of a branch
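A hedged illustration of that last point, again not code from the paper: a stream-compaction kernel in which each qualifying thread claims an output slot with atomicAdd, so the divergent region shrinks to one atomic and one store instead of a longer per-thread branch. The kernel, data, and threshold are made up for the example.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Compact all values above a threshold into a dense output array.
// Without the atomic, each thread would need a divergent path to find its
// output position; here the divergent region is one atomicAdd and a store.
__global__ void compact(const float* in, float* out, int* count,
                        int n, float threshold) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && in[i] > threshold) {
        int slot = atomicAdd(count, 1); // claim a unique output slot
        out[slot] = in[i];
    }
}

int main() {
    const int n = 1024;
    float h_in[n];
    for (int i = 0; i < n; ++i) h_in[i] = (float)(i % 10);

    float *d_in, *d_out;
    int* d_count;
    cudaMalloc((void**)&d_in, n * sizeof(float));
    cudaMalloc((void**)&d_out, n * sizeof(float));
    cudaMalloc((void**)&d_count, sizeof(int));
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(d_count, 0, sizeof(int));

    compact<<<(n + 255) / 256, 256>>>(d_in, d_out, d_count, n, 5.0f);

    int h_count;
    cudaMemcpy(&h_count, d_count, sizeof(int), cudaMemcpyDeviceToHost);
    printf("kept %d of %d elements\n", h_count, n); // 408 values exceed 5 here

    cudaFree(d_in); cudaFree(d_out); cudaFree(d_count);
    return 0;
}
```

Note that the output order is arbitrary, since slots are claimed in whatever order the atomics resolve; the technique trades ordering for a shorter divergent path.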
Results • Variable-sized multi-round data transfer scheduling (figure: results as a function of the number of rounds)
Results • Use of atomic instructions in CUDA to minimize latency
Conclusion • CUDA gives programmers the ability to harness the power of the GPU for general-purpose uses. • The improvements presented make this option more feasible. • Strategic use of GPGPUs as a resource will improve speed and efficiency. • However, the presented material is mainly theoretical, without much strong data to back it up. • It offers more suggestions than implementations, aiming to promote GPGPU use.