
Kokkos : The Tutorial alpha+1 version


Presentation Transcript


  1. Kokkos: The Tutorial alpha+1 version • The Kokkos Team: • Carter Edwards • Christian Trott • Dan Sunderland

  2. Introduction
     • What this tutorial is:
        • Introduction to Kokkos' main API features
        • List of example codes (valid Kokkos programs)
        • Incrementally increasing complexity
     • What this tutorial is NOT:
        • Introduction to parallel programming
        • Presentation of Kokkos features
        • Performance comparison of Kokkos with other approaches
     • What you should know:
        • C++ (a bit of experience with templates helps)
        • General parallel programming concepts
     • Where the code can be found:
        • Trilinos/packages/kokkos/example/tutorial
     • Compilation:
        • make all CUDA=yes/no -j 8

  3. A Note on Devices
     • Use of Kokkos in applications has informed interface changes
     • Most Kokkos changes are already reflected in tutorial material
     • Not yet: Split Device into ExecutionSpace and MemorySpace
     • For this tutorial a Device fulfills a dual role: it is either a MemorySpace or an ExecutionSpace
     • Kokkos::Cuda is used as a MemorySpace (GPU memory):

        Kokkos::View<double*, Kokkos::Cuda>

     • Device is used as an ExecutionSpace:

        template<class Device>
        struct functor {
          typedef Device device_type;
        };

  4. A Note on C++11
     • Lambda interface requires C++11
     • It is not currently supported on GPUs
        • expected for NVIDIA in March 2015
        • early access for NVIDIA probably fall 2014
        • not sure about AMD
     • Lambda interface does not support all features
        • use it for the simple cases
        • currently always dispatches to the default Device type
        • reductions only on POD types, with += and default initialization
        • parallel_scan operation not supported
        • shared memory for teams (scratch-pad) not supported
        • not obvious which limitations will remain in the future, but some will

  5. 01_HelloWorld
     • Kokkos Devices need to be initialized (start up reference counting, reserve GPU etc.)
     • Kokkos::initialize() does that for the DefaultDeviceType, which depends on your configuration (e.g., whether Cuda or OpenMP is enabled)
     • parallel_for is used to dispatch work to threads or a GPU
     • By default parallel_for dispatches work to DefaultDeviceType

     Lambda interface (C++11):

        #include <Kokkos_Core.hpp>
        #include <cstdio>

        int main() {
          // Initialize DefaultDeviceType
          // and potentially its host_mirror_device_type
          Kokkos::initialize();

          // Run lambda with 15 iterations in parallel on
          // DefaultDeviceType. Take in values in the
          // enclosing scope by copy [=].
          Kokkos::parallel_for(15, [=] (const int& i) {
            printf("Hello World %i\n",i);
          });

          // Finalize DefaultDeviceType
          // and potentially its host_mirror_device_type
          Kokkos::finalize();
        }

     Functor interface (C++98):

        #include <Kokkos_Core.hpp>
        #include <cstdio>

        // A minimal functor with just an operator().
        // That operator will be called in parallel.
        struct hello_world {
          KOKKOS_INLINE_FUNCTION
          void operator()(const int& i) const {
            printf("Hello World %i\n",i);
          }
        };

        int main() {
          // Initialize DefaultDeviceType
          // and potentially its host_mirror_device_type
          Kokkos::initialize();

          // Run functor with 15 iterations in parallel
          // on DefaultDeviceType.
          Kokkos::parallel_for(15, hello_world());

          // Finalize DefaultDeviceType
          // and potentially its host_mirror_device_type
          Kokkos::finalize();
        }
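     Both interfaces hand the dispatch a callable that receives the iteration index. The following is a minimal sketch in plain C++ (no Kokkos needed) of why a functor and a lambda are interchangeable at the call site. The names serial_for, record_index, run_functor and run_lambda are hypothetical, and the loop is a serial stand-in that says nothing about how Kokkos actually schedules work.

```cpp
#include <cassert>
#include <vector>

// Hypothetical serial stand-in for a parallel dispatch:
// calls f(i) once for every i in [0, n).
template <class Functor>
void serial_for(int n, const Functor& f) {
  for (int i = 0; i < n; ++i) f(i);
}

// Functor interface: a struct with a const operator().
struct record_index {
  int* out;  // raw pointer, so copies of the functor share the output
  explicit record_index(int* out_) : out(out_) {}
  void operator()(const int& i) const { out[i] = i; }
};

// Run the functor version and return what it wrote.
std::vector<int> run_functor(int n) {
  std::vector<int> v(n, -1);
  serial_for(n, record_index(v.data()));
  return v;
}

// Run the lambda version; [=] copies the pointer, not the data.
std::vector<int> run_lambda(int n) {
  std::vector<int> v(n, -1);
  int* out = v.data();
  serial_for(n, [=](const int& i) { out[i] = i; });
  return v;
}
```

     The lambda is in effect an unnamed functor: its captures play the role of the functor's data members.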

  6. 02_SimpleReduce
     • Kokkos parallel_reduce offers deterministic reductions (same order of operations each time)
     • By default the reduction sets the initial value to zero (default constructor) and uses += to combine values, but the functor interface can be used to define specialized init and join functions

     Functor interface (C++98):

        #include <Kokkos_Core.hpp>
        #include <cstdio>

        struct squaresum {
          // For reductions operator() has a different
          // interface than for parallel_for.
          // The lsum parameter must be passed by reference.
          // By default lsum is initialized with int() and
          // combined with +=
          KOKKOS_INLINE_FUNCTION
          void operator() (int i, int &lsum) const {
            lsum += i*i;
          }
        };

        int main() {
          Kokkos::initialize();
          int sum = 0;
          // sum can be anything which defines += and
          // a default constructor.
          // sum has to have the same type as the
          // second argument of operator() of the functor
          Kokkos::parallel_reduce(10,squaresum(),sum);
          printf("Sum of first %i square numbers %i\n",9,sum);
          Kokkos::finalize();
        }

     Lambda interface (C++11):

        #include <Kokkos_Core.hpp>
        #include <cstdio>

        int main() {
          Kokkos::initialize();
          int sum = 0;
          // sum can be anything which defines += and
          // a default constructor.
          // sum has to have the same type as the second
          // argument of operator() of the functor.
          // By default lsum is initialized with the default
          // constructor and combined with +=
          Kokkos::parallel_reduce(10, [=] (int i, int& lsum) {
            lsum += i*i;
          }, sum);
          printf("Sum of first %i square numbers %i\n",9,sum);
          Kokkos::finalize();
        }
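     The default init and join rules can be illustrated without Kokkos. The sketch below assumes one possible scheme: split the index range into a fixed number of chunks, give each chunk its own partial result initialized with T(), and join the partials with += in chunk order. Fixing the partition and the join order is what makes the result reproducible here. The names chunked_reduce and sum_of_squares are hypothetical, and this is not a description of Kokkos' actual implementation.

```cpp
#include <cassert>
#include <vector>

// Hypothetical chunked reduction: each chunk owns a partial result
// (the "lsum" of the slide), initialized with T() and updated through
// f(i, lsum). Partials are joined with += in chunk order.
template <class T, class Functor>
T chunked_reduce(int n, int nchunks, const Functor& f) {
  assert(nchunks > 0);
  std::vector<T> partial(nchunks, T());  // default init: T()
  for (int c = 0; c < nchunks; ++c) {
    // Chunk c owns the index range [begin, end).
    const int begin = static_cast<int>(static_cast<long long>(n) * c / nchunks);
    const int end = static_cast<int>(static_cast<long long>(n) * (c + 1) / nchunks);
    for (int i = begin; i < end; ++i) f(i, partial[c]);
  }
  T result = T();
  for (int c = 0; c < nchunks; ++c) result += partial[c];  // fixed join order
  return result;
}

// Same functor body as on the slide, minus the Kokkos macro.
struct squaresum {
  void operator()(int i, int& lsum) const { lsum += i * i; }
};

int sum_of_squares(int n, int nchunks) {
  return chunked_reduce<int>(n, nchunks, squaresum());
}
```

     For integers the chunk count does not change the result; for floating-point types it can, which is why a fixed order of operations matters.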

  7. 03_SimpleViews
     • Kokkos::View: Multi-dimensional array (up to 8 dimensions)
     • Default layout (row- or column-major) depends on Device
     • Hooks for current & next-gen memory architecture features

        #include <Kokkos_Core.hpp>
        #include <cstdio>

        // A simple 2D array (rank==2) with one compile-time dimension.
        // It is using DefaultDeviceType as its memory space and the default
        // layout associated with it (typically LayoutLeft or LayoutRight).
        // The view does not use any special access traits.
        // By default a view using this type will be reference counted.
        typedef Kokkos::View<double*[3]> view_type;

        int main() {
          Kokkos::initialize();

          // Allocate a view with the runtime dimension set to 10 and a label "A".
          // The label is used in debug output and error messages.
          view_type a("A",10);

          // The view a is passed on via copy to the parallel dispatch, which is
          // important if the execution space cannot access the default HostSpace
          // directly (or if it is slow), as e.g. on GPUs.
          // Note: the underlying allocation is not moved; only metadata such as
          // pointers and shape information is copied.
          Kokkos::parallel_for(10,[=](int i){
            // Read and write access to data comes via operator()
            a(i,0) = 1.0*i;
            a(i,1) = 1.0*i*i;
            a(i,2) = 1.0*i*i*i;
          });

          double sum = 0;
          Kokkos::parallel_reduce(10,[=](int i, double& lsum) {
            lsum += a(i,0)*a(i,1)/(a(i,2)+0.1);
          },sum);
          printf("Result %lf\n",sum);

          Kokkos::finalize();
        }
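     The statement that copying a view copies only metadata while the allocation is shared and reference counted can be mimicked in plain C++. The class tiny_view below is a hypothetical stand-in built on std::shared_ptr, with one runtime and one compile-time dimension. It is a sketch of the semantics only and makes no claim about how Kokkos::View is implemented.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical 2D array handle, loosely modeled on View<double*[3]>.
// Copying the handle copies the pointer and the shape only; the
// allocation itself is shared and reference counted.
class tiny_view {
 public:
  enum { N1 = 3 };  // compile-time dimension

  explicit tiny_view(int n0)
      : n0_(n0),
        data_(std::make_shared<std::vector<double> >(
            static_cast<std::size_t>(n0) * N1, 0.0)) {}

  // Element access is const, like on a captured-by-copy view:
  // the handle is constant, the data it points to is not.
  double& operator()(int i, int j) const {
    return (*data_)[static_cast<std::size_t>(i) * N1 + j];
  }

  int dimension_0() const { return n0_; }
  long use_count() const { return data_.use_count(); }

 private:
  int n0_;
  std::shared_ptr<std::vector<double> > data_;
};
```

     Writing through a copy of the handle is visible through the original, which is what lets a by-value lambda capture fill an array owned by the enclosing scope.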

  8. 04_SimpleMemorySpaces
     • Views live in a MemorySpace (abstraction for possibly manually managed memory hierarchies)
     • Deep copies between MemorySpaces are always explicit ("expensive things are always explicit")

        #include <Kokkos_Core.hpp>
        #include <cstdio>

        typedef Kokkos::View<double*[3]> view_type;
        // HostMirror is a view with the same layout / padding as its parent type
        // but in the host memory space. This memory space can be the same as the
        // device memory space, for example when running on CPUs.
        typedef view_type::HostMirror host_view_type;

        struct squaresum {
          view_type a;
          squaresum(view_type a_):a(a_) {}

          KOKKOS_INLINE_FUNCTION
          void operator() (int i, int &lsum) const {
            lsum += a(i,0)-a(i,1)+a(i,2);
          }
        };

        int main() {
          Kokkos::initialize();

          view_type a("A",10);
          // Create an allocation with the same dimensions as a in the host memory
          // space. If the memory space of view_type and its HostMirror are the
          // same, the mirror view won't allocate, and both views will have the
          // same pointer. In that case, deep copies do nothing.
          host_view_type h_a = Kokkos::create_mirror_view(a);

          for(int i = 0; i < 10; i++) {
            for(int j = 0; j < 3; j++) {
              h_a(i,j) = i*10 + j;
            }
          }

          // Transfer data from h_a to a. This does nothing if both views
          // reference the same data.
          Kokkos::deep_copy(a,h_a);

          int sum = 0;
          Kokkos::parallel_reduce(10,squaresum(a),sum);
          printf("Result is %i\n",sum);

          Kokkos::finalize();
        }

  9. 05_SimpleAtomics
     • Atomics make updating a single memory location (<= 64 bits) thread-safe
     • Kokkos provides: fetch-and-add, fetch-bitwise-or, fetch-bitwise-and, fetch-exchange, fetch-compare-exchange (more can be implemented if needed)
     • Performance of atomics depends on hardware & how many atomic operations hit the same address at the same time
     • If the atomic density is too large, explore different algorithms

        #include <Kokkos_Core.hpp>
        #include <cstdio>
        #include <cstdlib>
        #include <cmath>

        // Define View types used in the code
        typedef Kokkos::View<int*> view_type;
        typedef Kokkos::View<int> count_type;

        // A functor to find prime numbers. Append all
        // primes in 'data_' to the end of the 'result_'
        // array. 'count_' is the index of the first open
        // spot in 'result_'.
        struct findprimes {
          view_type data_;
          view_type result_;
          count_type count_;

          // The functor's constructor.
          findprimes (view_type data, view_type result, count_type count) :
            data_ (data), result_ (result), count_ (count) {}

          // operator() to be called in parallel_for.
          KOKKOS_INLINE_FUNCTION
          void operator() (int i) const {
            // Is data_(i) a prime number?
            const int number = data_(i);
            const int upper_bound = sqrt(1.0*number)+1;
            bool is_prime = !(number%2 == 0);
            int k = 3;
            while(k<upper_bound && is_prime) {
              is_prime = !(number%k == 0);
              k+=2;
            }

            if(is_prime) {
              // 'number' is a prime, so append it to the
              // result_ array. Find & increment the position
              // of the last entry by using a fetch-and-add
              // atomic operation.
              int idx = Kokkos::atomic_fetch_add(&count_(),1);
              result_(idx) = number;
            }
          }
        };

  10. main() for simple atomics example

        typedef view_type::HostMirror host_view_type;
        typedef count_type::HostMirror host_count_type;

        int main() {
          Kokkos::initialize();

          srand(61391);
          int nnumbers = 100000;

          view_type data("RND",nnumbers);
          view_type result("Prime",nnumbers);
          count_type count("Count");

          host_view_type h_data = Kokkos::create_mirror_view(data);
          host_view_type h_result = Kokkos::create_mirror_view(result);
          host_count_type h_count = Kokkos::create_mirror_view(count);

          for(int i = 0; i < data.dimension_0(); i++)
            h_data(i) = rand()%100000;

          Kokkos::deep_copy(data,h_data);

          int sum = 0;
          Kokkos::parallel_for(data.dimension_0(),findprimes(data,result,count));
          Kokkos::deep_copy(h_count,count);

          printf("Found %i prime numbers in %i random numbers\n",h_count(),nnumbers);
          Kokkos::finalize();
        }
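     The append pattern used by the functor (reserve an output slot with fetch-and-add, then write into it) can be sketched with std::atomic from the C++ standard library. The function append_matches and its predicate parameter are hypothetical; the predicate stands in for the primality test so the sketch can focus on the slot reservation.

```cpp
#include <atomic>
#include <cassert>
#include <vector>

// Append every value in data[begin, end) that satisfies pred to result.
// The output slot is reserved with an atomic fetch-and-add on count, so
// the function can be called concurrently from several threads on
// disjoint ranges, provided result has room for all matches.
template <class Pred>
void append_matches(const std::vector<int>& data, int begin, int end,
                    std::vector<int>& result, std::atomic<int>& count,
                    Pred pred) {
  for (int i = begin; i < end; ++i) {
    if (pred(data[i])) {
      // fetch_add returns the value before the increment: our private slot.
      const int idx = count.fetch_add(1);
      result[idx] = data[i];
    }
  }
}
```

     With concurrent callers the order of entries in result is not defined; only the final count and the set of appended values are.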

  11. Advanced Views: 01_data_layouts
     • Data Layouts determine the mapping between indices and memory addresses
     • Each ExecutionSpace has a default Layout optimized for parallel execution on the first index
     • Data Layouts can be set via a template parameter in Views
     • Kokkos currently provides: LayoutLeft (column-major), LayoutRight (row-major), LayoutStride ([almost] arbitrary strides for each dimension), LayoutTile (like in the MAGMA library)
     • Custom Layouts can be added with <= 200 lines of code
     • Choosing the wrong layout can reduce performance by 2-10x

        #include <Kokkos_Core.hpp>
        #include <impl/Kokkos_Timer.hpp>
        #include <cstdio>

        typedef Kokkos::View<double**, Kokkos::LayoutLeft> left_type;
        typedef Kokkos::View<double**, Kokkos::LayoutRight> right_type;
        typedef Kokkos::View<double*> view_type;

        template<class ViewType>
        struct init_view {
          ViewType a;
          init_view(ViewType a_):a(a_) {};

          KOKKOS_INLINE_FUNCTION
          void operator() (int i) const {
            for(int j = 0; j < a.dimension_1(); j++)
              a(i,j) = 1.0*a.dimension_0()*i + 1.0*j;
          }
        };

        template<class ViewType1, class ViewType2>
        struct contraction {
          view_type a;
          typename ViewType1::const_type v1;
          typename ViewType2::const_type v2;
          contraction(view_type a_, ViewType1 v1_, ViewType2 v2_):
            a(a_),v1(v1_),v2(v2_) {}

          KOKKOS_INLINE_FUNCTION
          void operator() (int i) const {
            for(int j = 0; j < v1.dimension_1(); j++)
              a(i) = v1(i,j)*v2(j,i);
          }
        };

  12. Advanced Views: 01_data_layouts (continued)

        struct dot {
          view_type a;
          dot(view_type a_):a(a_) {};

          KOKKOS_INLINE_FUNCTION
          void operator() (int i, double &lsum) const {
            lsum += a(i)*a(i);
          }
        };

        int main(int narg, char* arg[]) {
          Kokkos::initialize(narg,arg);

          int size = 10000;
          view_type a("A",size);
          left_type l("L",size,10000);
          right_type r("R",size,10000);

          Kokkos::parallel_for(size,init_view<left_type>(l));
          Kokkos::parallel_for(size,init_view<right_type>(r));
          Kokkos::fence();

          Kokkos::Impl::Timer time1;
          Kokkos::parallel_for(size,contraction<left_type,right_type>(a,l,r));
          Kokkos::fence();
          double sec1 = time1.seconds();

          double sum1 = 0;
          Kokkos::parallel_reduce(size,dot(a),sum1);
          Kokkos::fence();

          Kokkos::Impl::Timer time2;
          Kokkos::parallel_for(size,contraction<right_type,left_type>(a,r,l));
          Kokkos::fence();
          double sec2 = time2.seconds();

          double sum2 = 0;
          Kokkos::parallel_reduce(size,dot(a),sum2);

          printf("Result Left/Right %lf Right/Left %lf (equal result: %i)\n",
                 sec1,sec2,sum2==sum1);
          Kokkos::finalize();
        }

     Output:

        [crtrott@perseus 01_data_layouts]$ ./data_layouts.host --threads 16 --numa 2
        Result Left/Right 0.058223 Right/Left 0.024368 (equal result: 1)
        [crtrott@perseus 01_data_layouts]$ ./data_layouts.cuda
        Result Left/Right 0.015542 Right/Left 0.104692 (equal result: 1)
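     The index-to-address mapping behind the two layouts can be written out directly. The sketch below assumes a dense n0 x n1 array without padding (a real implementation may pad dimensions); offset_left and offset_right are hypothetical helper names.

```cpp
#include <cassert>
#include <cstddef>

// Offset of element (i, j) in a dense n0 x n1 array, no padding assumed.

// LayoutLeft (column-major): the left index i has stride 1.
std::size_t offset_left(std::size_t i, std::size_t j,
                        std::size_t n0, std::size_t /*n1*/) {
  return i + j * n0;
}

// LayoutRight (row-major): the right index j has stride 1.
std::size_t offset_right(std::size_t i, std::size_t j,
                         std::size_t /*n0*/, std::size_t n1) {
  return i * n1 + j;
}
```

     In the contraction functor each work item i loops over j. With v1 in LayoutRight and v2 in LayoutLeft, one work item walks contiguous memory in both v1(i,j) and v2(j,i), which suits CPU caches. With the layouts swapped, neighbouring work items touch neighbouring addresses for a fixed j, which suits coalesced access on GPUs. This is consistent with the timings printed above.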

  13. Advanced Views: 02_memory_traits
     • Memory Traits are used to specify usage patterns of Views
     • Views with different traits (which are equal otherwise) can usually be assigned to each other
     • Examples of MemoryTraits: MemoryManaged, MemoryUnmanaged, MemoryRandomAccess
     • Choosing the correct traits can have significant performance impact if special hardware exists to support a usage pattern (e.g., texture cache for random access on GPUs)

        #include <Kokkos_Core.hpp>
        #include <impl/Kokkos_Timer.hpp>
        #include <cstdio>

        typedef Kokkos::View<double*> view_type;
        // We expect to access these data "randomly" (noncontiguously).
        typedef Kokkos::View<const double*, Kokkos::MemoryRandomAccess> view_type_rnd;
        typedef Kokkos::View<int**> idx_type;
        typedef idx_type::HostMirror idx_type_host;

        // Template the functor on the View type to show the performance
        // difference with MemoryRandomAccess.
        template<class DestType, class SrcType>
        struct localsum {
          idx_type::const_type idx;
          DestType dest;
          SrcType src;

          localsum (idx_type idx_, DestType dest_, SrcType src_) :
            idx (idx_), dest (dest_), src (src_) {}

          KOKKOS_INLINE_FUNCTION
          void operator() (int i) const {
            double tmp = 0.0;
            for(int j = 0; j < idx.dimension_1(); j++) {
              // Indirect (hence probably noncontiguous) access
              const double val = src(idx(i,j));
              tmp += val*val + 0.5*(idx.dimension_0()*val - idx.dimension_1()*val);
            }
            dest(i) = tmp;
          }
        };

  14. Advanced Views: 02_memory_traits (continued)

        int main(int narg, char* arg[]) {
          Kokkos::initialize(narg,arg);

          int size = 1000000;
          idx_type idx("Idx",size,64);
          idx_type_host h_idx = Kokkos::create_mirror_view(idx);

          view_type dest("Dest",size);
          view_type src("Src",size);

          srand(134231);
          for(int i=0; i<size; i++) {
            for(int j=0; j<h_idx.dimension_1(); j++) {
              h_idx(i,j) = (size + i + (rand()%500 - 250))%size;
            }
          }
          Kokkos::deep_copy(idx,h_idx);

          Kokkos::parallel_for(size,
            localsum<view_type,view_type_rnd>(idx,dest,src));
          Kokkos::fence();

          // Invoke kernel with views using the RandomAccess trait
          Kokkos::Impl::Timer time1;
          Kokkos::parallel_for(size,
            localsum<view_type,view_type_rnd>(idx,dest,src));
          Kokkos::fence();
          double sec1 = time1.seconds();

          // Invoke kernel with views without the RandomAccess trait
          Kokkos::Impl::Timer time2;
          Kokkos::parallel_for(size,
            localsum<view_type,view_type>(idx,dest,src));
          Kokkos::fence();
          double sec2 = time2.seconds();

          printf("Time with Trait RandomAccess: %lf with Plain: %lf \n",sec1,sec2);
          Kokkos::finalize();
        }

     Output:

        [crtrott@perseus 02_memory_traits]$ ./memory_traits.host --threads 16 --numa 2
        Time with Trait RandomAccess: 0.004979 with Plain: 0.004999
        [crtrott@perseus 02_memory_traits]$ ./memory_traits.cuda
        Time with Trait RandomAccess: 0.004043 with Plain: 0.009060

  15. Advanced Views: 04_DualViews
     • DualViews manage data transfer between host and device
     • You mark a View as modified on host or device; you ask for synchronization (conditional, if marked)
     • DualView has the same template arguments as View
     • To access the View on a specific MemorySpace, you must extract it

        #include <Kokkos_Core.hpp>
        #include <Kokkos_DualView.hpp>
        #include <impl/Kokkos_Timer.hpp>
        #include <cstdio>
        #include <cstdlib>

        typedef Kokkos::DualView<double*> view_type;
        typedef Kokkos::DualView<int**> idx_type;

        template<class Device>
        struct localsum {
          // Define the functor's execution space
          // (overrides the DefaultDeviceType)
          typedef Device device_type;

          // Get view types on the particular Device
          // for which the functor is instantiated
          Kokkos::View<idx_type::const_data_type,
                       idx_type::array_layout, Device> idx;
          Kokkos::View<view_type::array_type,
                       view_type::array_layout, Device> dest;
          Kokkos::View<view_type::const_data_type,
                       view_type::array_layout, Device,
                       Kokkos::MemoryRandomAccess> src;

          // Constructor
          localsum (idx_type dv_idx, view_type dv_dest, view_type dv_src)
          {
            // Extract view on correct Device from DualView
            idx = dv_idx.view<Device>();
            dest = dv_dest.template view<Device>();
            src = dv_src.template view<Device>();

            // Synchronize DualView on correct Device
            dv_idx.sync<Device>();
            dv_dest.template sync<Device>();
            dv_src.template sync<Device>();

            // Mark dest as modified on Device
            dv_dest.template modify<Device>();
          }

          KOKKOS_INLINE_FUNCTION
          void operator() (int i) const {
            double tmp = 0.0;
            for(int j = 0; j < idx.dimension_1(); j++) {
              const double val = src(idx(i,j));
              tmp += val*val + 0.5*(idx.dimension_0()*val - idx.dimension_1()*val);
            }
            dest(i) += tmp;
          }
        };

  16. Advanced Views: 04_DualViews (continued)

        int main(int narg, char* arg[]) {
          Kokkos::initialize(narg,arg);

          srand(134231);
          int size = 1000000;

          // Create DualViews. This will allocate on both
          // the device and its host_mirror_device
          idx_type idx("Idx",size,64);
          view_type dest("Dest",size);
          view_type src("Src",size);

          // Get a reference to the host view of idx
          // directly (equivalent to
          // idx.view<idx_type::host_mirror_device_type>() )
          idx_type::t_host h_idx = idx.h_view;
          for(int i=0; i<size; i++) {
            for(int j=0; j<h_idx.dimension_1(); j++)
              h_idx(i,j) = (size + i + (rand()%500 - 250))%size;
          }

          // Mark idx as modified on the host_mirror_device_type
          // so that a sync to the device will actually move data.
          // The sync happens in the constructor of the functor.
          idx.modify<idx_type::host_mirror_device_type>();

          // Run on the device.
          // This will cause a sync of idx to the device since
          // it is marked as modified on the host.
          Kokkos::Impl::Timer timer;
          Kokkos::parallel_for(size,
            localsum<view_type::device_type>(idx,dest,src));
          Kokkos::fence();
          double sec1_dev = timer.seconds();

          timer.reset();
          Kokkos::parallel_for(size,
            localsum<view_type::device_type>(idx,dest,src));
          Kokkos::fence();
          double sec2_dev = timer.seconds();

          // Run on the host (could be the same as the device).
          // This will cause a sync of dest back to the host.
          // Note that if the Device is CUDA, the data layout
          // will not be optimal on the host, so performance is
          // lower than what it would be for a pure host compilation.
          timer.reset();
          Kokkos::parallel_for(size,
            localsum<view_type::host_mirror_device_type>(idx,dest,src));
          Kokkos::fence();
          double sec1_host = timer.seconds();

          timer.reset();
          Kokkos::parallel_for(size,
            localsum<view_type::host_mirror_device_type>(idx,dest,src));
          Kokkos::fence();
          double sec2_host = timer.seconds();

          printf("Device Time with Sync: %lf without Sync: %lf \n",sec1_dev,sec2_dev);
          printf("Host Time with Sync: %lf without Sync: %lf \n",sec1_host,sec2_host);

          Kokkos::finalize();
        }
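     The mark-then-sync protocol can be sketched in plain C++. The struct tiny_dual below is hypothetical: both copies live in ordinary host memory, "device" is only a label, and the counter-based bookkeeping is an assumed scheme rather than the actual DualView implementation. It shows the property the example relies on, namely that a sync copies data only when the other side was marked as modified.

```cpp
#include <cassert>
#include <vector>

// Hypothetical host/device pair with modify/sync bookkeeping.
struct tiny_dual {
  std::vector<double> h_view, d_view;
  int modified_host, modified_device;  // the larger counter marks the newer side
  int copies;                          // number of transfers actually performed

  explicit tiny_dual(int n)
      : h_view(n, 0.0), d_view(n, 0.0),
        modified_host(0), modified_device(0), copies(0) {}

  int newest() const {
    return modified_host > modified_device ? modified_host : modified_device;
  }

  // Mark one side as holding the most recent data.
  void modify_host() { modified_host = newest() + 1; }
  void modify_device() { modified_device = newest() + 1; }

  // Copy host -> device only if the host side is newer.
  void sync_device() {
    if (modified_host > modified_device) {
      d_view = h_view;
      ++copies;
      modified_host = modified_device = 0;
    }
  }

  // Copy device -> host only if the device side is newer.
  void sync_host() {
    if (modified_device > modified_host) {
      h_view = d_view;
      ++copies;
      modified_host = modified_device = 0;
    }
  }
};
```

     A second sync without an intervening modify is a no-op, which corresponds to the "with Sync" versus "without Sync" timings that the example prints.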

  17. Advanced Views: 05 NVIDIA UVM
     • NVIDIA provides Unified Virtual Memory on high-end Kepler: the runtime manages data transfer
     • Makes coding easier: pretend there is only one MemorySpace
     • But: can come with significant performance penalties if complete allocations are moved frequently

        #include <Kokkos_Core.hpp>
        #include <Kokkos_DualView.hpp>
        #include <impl/Kokkos_Timer.hpp>
        #include <cstdio>
        #include <cstdlib>

        typedef Kokkos::View<double*> view_type;
        typedef Kokkos::View<int**> idx_type;

        template<class Device>
        struct localsum {
          // Define the execution space for the functor
          // (overrides the DefaultDeviceType)
          typedef Device device_type;

          // Use the same ViewType no matter where the
          // functor is executed
          idx_type::const_type idx;
          view_type dest;
          Kokkos::View<view_type::const_data_type,
                       view_type::array_layout,
                       view_type::device_type,
                       Kokkos::MemoryRandomAccess> src;

          localsum(idx_type idx_, view_type dest_, view_type src_):
            idx(idx_),dest(dest_),src(src_) { }

          KOKKOS_INLINE_FUNCTION
          void operator() (int i) const {
            double tmp = 0.0;
            for(int j = 0; j < idx.dimension_1(); j++) {
              const double val = src(idx(i,j));
              tmp += val*val + 0.5*(idx.dimension_0()*val - idx.dimension_1()*val);
            }
            dest(i) += tmp;
          }
        };

  18. Advanced Views: 05 NVIDIA UVM (continued)

        int main(int narg, char* arg[]) {
          Kokkos::initialize(narg,arg);

          int size = 1000000;

          // Create Views
          idx_type idx("Idx",size,64);
          view_type dest("Dest",size);
          view_type src("Src",size);

          srand(134231);

          // When using UVM, Cuda views can be accessed on the
          // host directly
          for(int i=0; i<size; i++) {
            for(int j=0; j<idx.dimension_1(); j++)
              idx(i,j) = (size + i + (rand()%500 - 250))%size;
          }
          Kokkos::fence();

          // Run on the device.
          // This will cause a sync of idx to the device since
          // it was modified on the host.
          Kokkos::Impl::Timer timer;
          Kokkos::parallel_for(size,
            localsum<view_type::device_type>(idx,dest,src));
          Kokkos::fence();
          double sec1_dev = timer.seconds();

          // No data transfer will happen now, since nothing is
          // accessed on the host
          timer.reset();
          Kokkos::parallel_for(size,
            localsum<view_type::device_type>(idx,dest,src));
          Kokkos::fence();
          double sec2_dev = timer.seconds();

          // Run on the host.
          // This will cause a sync back to the host of
          // dest, which was changed on the device.
          // Compare the runtime here with the dual_view example:
          // dest will be copied back in 4k blocks
          // when they are accessed the first time during the
          // parallel_for. Due to the latency of a memcpy
          // this gives lower effective bandwidth than doing
          // a manual copy via DualViews.
          timer.reset();
          Kokkos::parallel_for(size,
            localsum<view_type::device_type::host_mirror_device_type>(idx,dest,src));
          Kokkos::fence();
          double sec1_host = timer.seconds();

          // No data transfers will happen now
          timer.reset();
          Kokkos::parallel_for(size,
            localsum<view_type::device_type::host_mirror_device_type>(idx,dest,src));
          Kokkos::fence();
          double sec2_host = timer.seconds();

          printf("Device Time with Sync: %lf without Sync: %lf \n",sec1_dev,sec2_dev);
          printf("Host Time with Sync: %lf without Sync: %lf \n",sec1_host,sec2_host);

          Kokkos::finalize();
        }

  19. Advanced Views: DualView and UVM results

        [crtrott@perseus 04_dualviews]$ make CUDA=yes CUDA_UVM=no -j 8 CUDA_PATH=/home/crtrott/lib/cuda all HWLOC=yes OMP=no
        [crtrott@perseus 05_NVIDIA_UVM]$ make CUDA=yes CUDA_UVM=yes -j 8 CUDA_PATH=/home/crtrott/lib/cuda all HWLOC=yes OMP=no
        [crtrott@perseus Advanced_Views]$ 04_dualviews/dual_view.cuda --threads 16 --numa 2
        Device Time with Sync: 0.074286 without Sync: 0.004056
        Host Time with Sync: 0.038507 without Sync: 0.035801
        [crtrott@perseus Advanced_Views]$ 05_NVIDIA_UVM/uvm_example.cuda --threads 16 --numa 2
        Device Time with Sync: 0.368231 without Sync: 0.358703
        Host Time with Sync: 0.015760 without Sync: 0.015575
        [crtrott@perseus Advanced_Views]$ export CUDA_VISIBLE_DEVICES=0
        [crtrott@perseus Advanced_Views]$ 05_NVIDIA_UVM/uvm_example.cuda --threads 16 --numa 2
        Device Time with Sync: 0.068831 without Sync: 0.004065
        Host Time with Sync: 0.990998 without Sync: 0.016688

     • Running with UVM on multi-GPU machines can cause a fallback to the zero-copy mechanism: all allocations live on the host and are accessed via the PCIe bus
     • Use CUDA_VISIBLE_DEVICES=k to prevent this
     • When looping through a UVM allocation on the host, data is copied back to the host in 4k blocks. PCIe latency restricts effective bandwidth to 0.5 GB/s as opposed to 8 GB/s

  20. Hierarchical Parallelism: 01 ThreadTeams
     • Kokkos supports the notion of a "League of Thread Teams"
     • Useful when fine-grained parallelism is exposed: need to sync or share data with a subset of threads
     • On CPUs the best team size is often 1; on Intel Xeon Phi and GPUs, team sizes of 4 and 256, respectively
     • The number of teams is not bound by hardware resources: as in CUDA/OpenCL, use the number the algorithm calls for

     Lambda version:

        #include <Kokkos_Core.hpp>
        #include <cstdio>

        typedef Kokkos::Impl::DefaultDeviceType device_type;

        int main(int narg, char* args[]) {
          Kokkos::initialize(narg,args);

          int sum = 0;
          Kokkos::parallel_reduce(
            Kokkos::ParallelWorkRequest(12, device_type::team_max()),
            [=](device_type dev, int& lsum) {
              lsum += 1;
              printf("Hello World: %i %i // %i %i\n",
                     dev.league_rank(),dev.team_rank(),
                     dev.league_size(),dev.team_size());
            },sum);
          printf("Result %i\n",sum);

          Kokkos::finalize();
        }

     Functor version:

        #include <Kokkos_Core.hpp>
        #include <cstdio>

        typedef Kokkos::Impl::DefaultDeviceType device_type;

        struct hello_world {
          KOKKOS_INLINE_FUNCTION
          void operator() (device_type dev, int& sum) const {
            sum += 1;
            printf("Hello World: %i %i // %i %i\n",
                   dev.league_rank(),dev.team_rank(),
                   dev.league_size(),dev.team_size());
          }
        };

        int main(int narg, char* args[]) {
          Kokkos::initialize(narg,args);

          int sum = 0;
          Kokkos::parallel_reduce(
            Kokkos::ParallelWorkRequest(12, device_type::team_max()),
            hello_world(),sum);
          printf("Result %i\n",sum);

          Kokkos::finalize();
        }
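     The two-level index space of a league of teams can be enumerated serially. The sketch below is a plain C++ stand-in with hypothetical names (for_each_team_thread, count_one, count_threads), under the assumption that the body runs exactly once per (league_rank, team_rank) pair. Under that assumption, a reduction that adds 1 per call yields league_size times team_size.

```cpp
#include <cassert>

// Hypothetical serial enumeration of a league of teams: the body runs
// once for every (league_rank, team_rank) pair and may update the sum.
template <class Body>
int for_each_team_thread(int league_size, int team_size, const Body& body) {
  int sum = 0;
  for (int league_rank = 0; league_rank < league_size; ++league_rank) {
    for (int team_rank = 0; team_rank < team_size; ++team_rank) {
      body(league_rank, team_rank, sum);
    }
  }
  return sum;
}

// Counterpart of the slide's functor: contribute 1 per thread.
struct count_one {
  void operator()(int /*league_rank*/, int /*team_rank*/, int& lsum) const {
    lsum += 1;
  }
};

int count_threads(int league_size, int team_size) {
  return for_each_team_thread(league_size, team_size, count_one());
}
```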

  21. Hierarchical Parallelism: 02 Shared Memory
     • Kokkos supports ScratchPads for Teams
     • On CPUs, a ScratchPad is just a small team-private allocation which hopefully lives in L1 cache

        #include <Kokkos_Core.hpp>
        #include <Kokkos_DualView.hpp>
        #include <impl/Kokkos_Timer.hpp>
        #include <cstdio>
        #include <cstdlib>

        typedef Kokkos::Impl::DefaultDeviceType Device;
        typedef Device::host_mirror_device_type Host;

        #define TS 16

        struct find_2_tuples {
          int chunk_size;
          Kokkos::View<const int*> data;
          Kokkos::View<int**> histogram;

          find_2_tuples(int chunk_size_, Kokkos::DualView<int*> data_,
                        Kokkos::DualView<int**> histogram_):
            chunk_size(chunk_size_),data(data_.d_view),
            histogram(histogram_.d_view) {
            data_.sync<Device>();
            histogram_.sync<Device>();
            histogram_.modify<Device>();
          }

          KOKKOS_INLINE_FUNCTION
          void operator() (Device dev) const {
            // If Device is 1st arg, use scratchpad mem
            Kokkos::View<int**,Kokkos::MemoryUnmanaged> l_histogram(dev,TS,TS);
            Kokkos::View<int*,Kokkos::MemoryUnmanaged> l_data(dev,chunk_size+1);

            const int i = dev.league_rank() * chunk_size;
            for(int j = dev.team_rank(); j<chunk_size+1; j+=dev.team_size())
              l_data(j) = data(i+j);

            for(int k = dev.team_rank(); k < TS; k+=dev.team_size())
              for(int l = 0; l < TS; l++)
                l_histogram(k,l) = 0;
            dev.team_barrier();

            for(int j = 0; j<chunk_size; j++) {
              for(int k = dev.team_rank(); k < TS; k+=dev.team_size())
                for(int l = 0; l < TS; l++) {
                  if((l_data(j) == k) && (l_data(j+1)==l))
                    l_histogram(k,l)++;
                }
            }

            for(int k = dev.team_rank(); k < TS; k+=dev.team_size())
              for(int l = 0; l < TS; l++) {
                Kokkos::atomic_fetch_add(&histogram(k,l), l_histogram(k,l));
              }
            dev.team_barrier();
          }

          size_t shmem_size() const {
            return sizeof(int)*(chunk_size+2 + TS*TS);
          }
        };

  22. main() for hierarchical parallelism example

        int main(int narg, char* args[]) {
          Kokkos::initialize(narg,args);

          int chunk_size = 1024;
          int nchunks = 100000; //1024*1024;
          Kokkos::DualView<int*> data("data",nchunks*chunk_size+1);

          srand(1231093);
          for(int i = 0; i < data.dimension_0(); i++) {
            data.h_view(i) = rand()%TS;
          }
          data.modify<Host>();
          data.sync<Device>();

          Kokkos::DualView<int**> histogram("histogram",TS,TS);

          Kokkos::Impl::Timer timer;
          Kokkos::parallel_for(
            Kokkos::ParallelWorkRequest(nchunks,
              (TS < Device::team_max()) ? TS : Device::team_max()),
            find_2_tuples(chunk_size,data,histogram));
          Kokkos::fence();
          double time = timer.seconds();

          histogram.sync<Host>();

          printf("Time: %lf \n\n",time);
          Kokkos::finalize();
        }
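     A serial reference makes it easier to see what the team kernel computes: histogram(k,l) counts the positions j at which data(j) equals k and data(j+1) equals l. Because chunk i loads chunk_size+1 entries and tests chunk_size pairs, the chunks together cover every adjacent pair exactly once. The function count_2_tuples below is a hypothetical plain C++ helper, not part of the tutorial code, that could be used to check the result on the host.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Serial reference for the 2-tuple histogram. The result is stored
// row-major: entry k * ts + l counts positions j with data[j] == k and
// data[j + 1] == l. All values in data must lie in [0, ts).
std::vector<int> count_2_tuples(const std::vector<int>& data, int ts) {
  std::vector<int> histogram(static_cast<std::size_t>(ts) * ts, 0);
  for (std::size_t j = 0; j + 1 < data.size(); ++j) {
    assert(data[j] >= 0 && data[j] < ts);
    assert(data[j + 1] >= 0 && data[j + 1] < ts);
    histogram[static_cast<std::size_t>(data[j]) * ts + data[j + 1]] += 1;
  }
  return histogram;
}
```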

  23. Wrap Up
     • Features not presented here:
        • Getting a subview of a View
        • ParallelScan & TeamScan
        • Linear Algebra subpackage
        • Kokkos::UnorderedMap (thread-scalable hash table)
     • To learn more, see:
        • More complex Kokkos examples
        • Mantevo MiniApps (e.g., MiniFE)
        • LAMMPS (molecular dynamics code)

  24. Questions and further discussion: crtrott@sandia.gov
