A GPGPU transparent virtualization component for high performance computing clouds

Presentation Transcript


  1. http://lmncp.uniparthenope.it http://dsa.uniparthenope.it A GPGPU transparent virtualization component for high performance computing clouds G. Giunta, Raffaele Montella, G. Agrillo, G. Coviello University of Napoli Parthenope, Department of Applied Science {giunta,montella,agrillo,coviello}@uniparthenope.it

  2. uniParthenope • One of the five Universities in Napoli (Italy) • 20K students • 5 faculties • Science and Technologies • Engineering • Economics • Law • Sports & Health http://www.uniparthenope.it

  3. Summary • Introduction • System Architecture and Design • Performance Evaluation • Conclusions and Developments (gVirtuS, the GPGPU virtualization service)

  4. Introduction & Contextualization • High Performance Computing: • Stack of technologies enabling software that demands high performance computing resources • Grid computing: • Stack of technologies enabling resource sharing and aggregation • Manycore: • The “enforcement” of Moore’s law • GPGPUs: • Efficient and cost-effective high performance computing using manycore graphics processing units • Virtualization: • Hardware and software resource abstraction • One of the manycore CPUs’ killer applications • Cloud computing: • Stack of technologies enabling hosting on virtualized resources • On-demand resource virtualization • Pay as you go

  5. High Performance Cloud Computing • Hardware: • High performance computing cluster • Multicore / multiprocessor computing nodes • GPGPUs • Software: • Linux • Virtualization hypervisor • Private cloud management software • + Special ingredients…

  6. gVirtuS • GPU Virtualization Service • Based on the nVIDIA/CUDA APIs • Hypervisor independent • Uses a front-end (FE) / back-end (BE) approach • FE/BE communicator independent. The key properties of the proposed system are: 1. Enabling CUDA kernel execution in a virtualized environment 2. With overall performance not too far from that of un-virtualized machines

  7. System Architecture and Design • The CUDA device is under the control of the hypervisor • An interface is provided between the guest and the host machine • Any GPU access is routed via the FE/BE • The management component controls invocation and data movement

  8. The Communicator • Provides high performance communication between virtual machines and their hosts. • The choice of hypervisor deeply affects the efficiency of this communication.
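
The slides describe the communicator only at the block-diagram level. Below is a minimal C++ sketch of what a pluggable transport abstraction could look like, so that the FE/BE logic stays independent of the hypervisor in use; class and method names are illustrative, not the actual gVirtuS API.

    #include <cstddef>

    // Illustrative transport abstraction: each concrete communicator
    // (TCP, AF_UNIX, VMCI, vmSocket) would implement this interface,
    // so the FE/BE code never depends on the hypervisor in use.
    class Communicator {
    public:
        virtual ~Communicator() {}
        virtual void Connect() = 0;                              // wait for the peer
        virtual std::size_t Read(char *buf, std::size_t n) = 0;  // guest <-> host bytes
        virtual std::size_t Write(const char *buf, std::size_t n) = 0;
        virtual void Close() = 0;
    };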

  9. How gVirtuS works • CUDA library: • deals directly with the hardware accelerator • interacts with a GPU virtualization front end • The front end: • packs the library function invocation • sends it to the back end • The back end: • deals with the hardware using the CUDA driver • unpacks the library function invocation • maps memory pointers • executes the CUDA operation • retrieves the results • sends them to the front end using the communicator • The front end: • interacts with the CUDA library, terminating the GPU operation • provides the results to the calling program. • This design is: • hypervisor independent • communicator independent • accelerator independent • The same approach could be followed to implement different kinds of virtualization.
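
As an illustration of the packing step above, here is a minimal sketch of a guest-side stub for cudaMalloc. It is a fragment of a hypothetical shim library: GetCommunicator() and the wire format are assumptions, not the real gVirtuS marshaling.

    #include <cstddef>

    typedef int cudaError_t;  // supplied by the shim: the guest has no
                              // real CUDA stack of its own

    struct Communicator {     // transport abstraction as sketched earlier
        virtual std::size_t Read(char *buf, std::size_t n) = 0;
        virtual std::size_t Write(const char *buf, std::size_t n) = 0;
    };

    Communicator *GetCommunicator();  // hypothetical FE-side singleton

    extern "C" cudaError_t cudaMalloc(void **devPtr, std::size_t size) {
        Communicator *c = GetCommunicator();
        // 1. Pack the routine name and the input arguments.
        const char name[] = "cudaMalloc";
        c->Write(name, sizeof(name));
        c->Write(reinterpret_cast<const char *>(&size), sizeof(size));
        // 2. The BE runs the real cudaMalloc on the host GPU and replies
        //    with the exit code and a device pointer that the guest stores
        //    but never dereferences.
        cudaError_t result;
        c->Read(reinterpret_cast<char *>(&result), sizeof(result));
        c->Read(reinterpret_cast<char *>(devPtr), sizeof(void *));
        return result;
    }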

  10. Choices and Motivations • We focused on the VMware and KVM hypervisors. • vmSocket is the component we designed to obtain a high performance communicator. • vmSocket exposes Unix sockets on virtual machine instances thanks to a QEMU device connected to the virtual PCI bus.

  11. vmSocket: virtual PCI device • Programming interface: • Unix Socket • Communication between guest and host: • Virtual PCI interface • QEMU has been modified • GPU based high performance computing applications usually require massive data transfer between host (CPU) memory and device (GPU) memory… • FE/BE interaction efficiency: • there is no mapping between guest memory and device memory • the memory device pointers are never de-referenced on the host side • CUDA kernels are executed on the BE where the pointers are fully consistent.
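
Since vmSocket appears in the guest as an ordinary Unix socket, FE code can use the standard sockets API unchanged. A hedged example follows; the path /dev/vmsocket is a placeholder, not the actual endpoint name.

    /* The guest opens a plain AF_UNIX socket whose endpoint is backed by
       the vmSocket virtual PCI device. */
    #include <cstdio>
    #include <cstring>
    #include <sys/socket.h>
    #include <sys/un.h>
    #include <unistd.h>

    int main() {
        int fd = socket(AF_UNIX, SOCK_STREAM, 0);
        if (fd < 0) { perror("socket"); return 1; }

        sockaddr_un addr;
        std::memset(&addr, 0, sizeof(addr));
        addr.sun_family = AF_UNIX;
        std::strncpy(addr.sun_path, "/dev/vmsocket",   /* placeholder path */
                     sizeof(addr.sun_path) - 1);

        if (connect(fd, reinterpret_cast<sockaddr *>(&addr), sizeof(addr)) < 0) {
            perror("connect");
            return 1;
        }
        /* From here on, read()/write() move bytes between guest and host. */
        close(fd);
        return 0;
    }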

  12. Performance Evaluation • CUDA workstation: Genesis GE-i940 Tesla • i7-940 2.93 GHz (133 MHz FSB), quad-core hyper-threaded CPU with 8 MB cache, and 12 GB RAM • 1 nVIDIA Quadro FX 5800 video card with 4 GB RAM • 2 nVIDIA Tesla C1060 cards with 4 GB RAM each • The testing system: • Fedora 12 Linux • nVIDIA CUDA driver and SDK/Toolkit version 2.3 • VMware vs. KVM/QEMU (using different communicators)

  13. …from the CUDA SDK… • ScalarProd computes k scalar products of two real vectors of length m. Note that each product is computed by a CUDA thread on the GPU, so no synchronization is required. • MatrixMul computes a matrix multiplication. The matrices are m×n and n×p, respectively. It partitions the input matrices into blocks and associates a CUDA thread with each block. As in the previous case, there is no need for synchronization. • Histogram returns the histogram of a set of m uniformly distributed real random numbers in 64 bins. The set is distributed among the CUDA threads, each computing a local histogram. The final result is obtained through synchronization and reduction techniques.
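
For context, here is a minimal CUDA kernel in the spirit of the ScalarProd pattern just described (one thread per product, hence no synchronization); it is an illustrative sketch, not the SDK source.

    // One CUDA thread computes one of the k scalar products, so no
    // synchronization between threads is needed.
    __global__ void scalarProd(float *out, const float *a, const float *b,
                               int k, int m) {
        int p = blockIdx.x * blockDim.x + threadIdx.x;  // which product
        if (p < k) {
            float sum = 0.0f;
            const float *va = a + p * m;  // p-th pair of length-m vectors
            const float *vb = b + p * m;
            for (int i = 0; i < m; ++i)
                sum += va[i] * vb[i];
            out[p] = sum;
        }
    }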

  14. Test cases • Host/cpu: CPU without virtualization (no gVirtuS) • Host/gpu: GPU without virtualization (no gVirtuS) • Host/afunix: GPU without virtualization (with gVirtuS); measures the impact of the gVirtuS stack • Host/tcp: GPU without virtualization (with gVirtuS); measures the impact of the communication stack • */cpu: CPU in a virtualized environment (no gVirtuS) • */tcp: GPU in a virtualized environment (with gVirtuS) • Vmware/vmci: GPU in a VMware virtual machine with gVirtuS using the VMCI based communicator • KVM/vmSocket: GPU in a KVM/QEMU virtual machine with gVirtuS using the vmSocket based communicator

  15. ScalarProd

  16. MatrixMul

  17. Histogram

  18. About the Results • Virtualization does not heavily affect computing performance • gVirtuS-kvm/vmsocket gives the best efficiency, with the least impact with respect to the raw host/gpu setup • The TCP based communicator could be used in a production scenario: • The problem size and the computing speed-up justify the poorer communication performance

  19. HPCC: High Performance Cloud Computing • Intel based 12-node computing cluster • Each node: • quad core 64-bit CPU / 4 GB of RAM • nVIDIA GeForce 9400 GT video card with 16 CUDA cores and 1 GB of memory • Software stack: • Fedora 12 • Eucalyptus • KVM/QEMU • gVirtuS

  20. HPCC Performance Evaluation • Ad hoc benchmark: matrix multiplication • Classic distributed memory parallel approach • The first matrix is distributed by rows, the second one by columns • Each process performs a local matrix multiplication • MPICH2 as the message passing interface among processes • Each process uses the CUDA library to perform its local matrix multiplication (a skeleton follows)
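
A hedged skeleton of that benchmark structure: rows of the first matrix are scattered across MPI ranks and each rank multiplies its local block. For brevity this sketch broadcasts the second matrix whole rather than scattering it by columns, leaves the matrices zeroed, and uses a CPU stand-in for the CUDA-backed local multiply; under gVirtuS the CUDA calls inside localMatrixMul would be forwarded transparently from the VM to the host GPU.

    #include <mpi.h>
    #include <vector>

    // Stand-in for the CUDA-backed local multiply: the real benchmark would
    // launch a kernel (or cuBLAS) here through the virtualized CUDA library.
    void localMatrixMul(const float *A, const float *B, float *C,
                        int rows, int n, int cols) {
        for (int i = 0; i < rows; ++i)
            for (int j = 0; j < cols; ++j) {
                float s = 0.0f;
                for (int k = 0; k < n; ++k)
                    s += A[i * n + k] * B[k * cols + j];
                C[i * cols + j] = s;
            }
    }

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        const int m = 1024, n = 1024, p = 1024;  // example sizes; m % size == 0
        const int myRows = m / size;             // rows of A per rank
        std::vector<float> A(m * n), B(n * p);   // data left zeroed for brevity
        std::vector<float> myA(myRows * n), myC(myRows * p);

        // First matrix distributed by rows; second broadcast whole here.
        MPI_Scatter(A.data(), myRows * n, MPI_FLOAT,
                    myA.data(), myRows * n, MPI_FLOAT, 0, MPI_COMM_WORLD);
        MPI_Bcast(B.data(), n * p, MPI_FLOAT, 0, MPI_COMM_WORLD);

        // Local block multiply: this is where each process calls CUDA.
        localMatrixMul(myA.data(), B.data(), myC.data(), myRows, n, p);

        std::vector<float> C(rank == 0 ? m * p : 0);
        MPI_Gather(myC.data(), myRows * p, MPI_FLOAT,
                   C.data(), myRows * p, MPI_FLOAT, 0, MPI_COMM_WORLD);
        MPI_Finalize();
        return 0;
    }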

  21. MatrixMul MPI/CUDA/gVirtuS

  22. Future Directions • Enable shared memory communication between host and guest machines to improve host-to-device and device-to-host memory copies. • Implement OpenGL interoperability to integrate gVirtuS and VMGL for 3D graphics virtualization. • Integrate MPICH2 with vmSocket to implement a high performance message passing standard interface.

  23. Conclusions • The gVirtuS GPU virtualization and sharing system enables thin Linux based virtual machines to be accelerated by the computing power of nVIDIA GPUs. • The gVirtuS stack permits virtual machines to be accelerated with a small impact on overall performance with respect to a pure host/gpu setup. • gVirtuS can be easily extended to other CUDA enabled devices. • This approach relies on highly proprietary, closed-source nVIDIA products. Download, try & contribute! http://osl.uniparthenope.it/projects/gvirtus/

  24. gVirtuS implementation (1/2) • The BE runs on the host device • The FE runs on the virtual machine • gVirtuS is implemented in C++ • The BE and FE run as daemons
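
The slide does not include code; a minimal sketch of what the BE daemon's dispatch loop could look like follows, assuming the Communicator abstraction and wire format from the earlier sketches. All names are illustrative.

    #include <cstddef>
    #include <cstring>

    struct Communicator {  // transport abstraction as sketched earlier
        virtual void Connect() = 0;
        virtual std::size_t Read(char *buf, std::size_t n) = 0;
        virtual std::size_t Write(const char *buf, std::size_t n) = 0;
    };

    // Hypothetical BE dispatch loop: receive a routine name, run the real
    // CUDA call on the host GPU, send the results back to the FE.
    void Serve(Communicator *c) {
        c->Connect();                     // wait for a FE on a guest
        char name[64];
        while (c->Read(name, sizeof(name)) > 0) {
            if (std::strcmp(name, "cudaMalloc") == 0) {
                // 1. unpack the marshaled arguments,
                // 2. call the real cudaMalloc on the host,
                // 3. write the exit code and device pointer back to the FE.
            }
            // ... one handler per supported CUDA routine ...
        }
    }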

  25. gVirtuS implementation (2/2) • The FE class diagram
