Kokkos: The Tutorial alpha+1 version

Kokkos: The Tutorial alpha+1 version • The Kokkos Team: • Carter Edwards • Christian Trott • Dan Sunderland

Introduction
• What this tutorial is:
  • Introduction to Kokkos’ main API features
  • List of example codes (valid Kokkos programs)
  • Incrementally increasing complexity
• What this tutorial is NOT:
  • Introduction to parallel programming
  • Presentation of Kokkos features
  • Performance comparison of Kokkos with other approaches
• What you should know:
  • C++ (a bit of experience with templates helps)
  • General parallel programming concepts
• Where the code can be found:
  • Trilinos/packages/kokkos/example/tutorial
• Compilation:
  • make all CUDA=yes/no -j 8

A Note on Devices
• Use of Kokkos in applications has informed interface changes
• Most Kokkos changes are already reflected in tutorial material
• Not yet: Split Device into ExecutionSpace and MemorySpace
• For this tutorial a Device fulfills a dual role: it is either a MemorySpace or an ExecutionSpace
• Kokkos::Cuda is used as a MemorySpace (GPU memory):
    Kokkos::View<double*, Kokkos::Cuda>
• Device is used as an ExecutionSpace:
    template<class Device>
    struct functor {
      typedef Device device_type;
    };

A Note on C++11
• Lambda interface requires C++11
• It is not currently supported on GPUs
  • is expected for NVIDIA in March 2015
  • early access for NVIDIA probably fall 2014
  • not sure about AMD
• Lambda interface does not support all features
  • use for the simple cases
  • currently always dispatches to the default Device type
  • reductions only on POD with += and default initialization
  • parallel_scan operation not supported
  • shared memory for teams (scratch-pad) not supported
  • not obvious which limitations will stay in the future, but some will

01_HelloWorld
• Kokkos Devices need to be initialized (start up reference counting, reserve GPU, etc.)
• Kokkos::initialize() does that for the DefaultDeviceType, which depends on your configuration (e.g., whether Cuda or OpenMP is enabled)
• parallel_for is used to dispatch work to threads or a GPU
• By default parallel_for dispatches work to DefaultDeviceType

Lambda interface (C++11):

    #include <Kokkos_Core.hpp>
    #include <cstdio>

    int main() {
      // Initialize DefaultDeviceType
      // and potentially its host_mirror_device_type
      Kokkos::initialize();

      // Run lambda with 15 iterations in parallel on
      // DefaultDeviceType. Take in values in the
      // enclosing scope by copy [=].
      Kokkos::parallel_for(15, [=] (const int& i) {
        printf("Hello World %i\n", i);
      });

      // Finalize DefaultDeviceType
      // and potentially its host_mirror_device_type
      Kokkos::finalize();
    }

Functor interface (C++98):

    #include <Kokkos_Core.hpp>
    #include <cstdio>

    // A minimal functor with just an operator().
    // That operator will be called in parallel.
    struct hello_world {
      KOKKOS_INLINE_FUNCTION
      void operator() (const int& i) const {
        printf("Hello World %i\n", i);
      }
    };

    int main() {
      // Initialize DefaultDeviceType
      // and potentially its host_mirror_device_type
      Kokkos::initialize();

      // Run functor with 15 iterations in parallel
      // on DefaultDeviceType.
      Kokkos::parallel_for(15, hello_world());

      // Finalize DefaultDeviceType
      // and potentially its host_mirror_device_type
      Kokkos::finalize();
    }

02_SimpleReduce
• Kokkos parallel_reduce offers deterministic reductions (same order of operations each time)
• By default the reduction sets the initial value to zero (default constructor) and uses += to combine values, but the functor interface can be used to define specialized init and join functions

Functor interface (C++98):

    #include <Kokkos_Core.hpp>
    #include <cstdio>

    struct squaresum {
      // For reductions operator() has a different
      // interface than for parallel_for.
      // The lsum parameter must be passed by reference.
      // By default lsum is initialized with int() and
      // combined with +=
      KOKKOS_INLINE_FUNCTION
      void operator() (int i, int& lsum) const {
        lsum += i*i;
      }
    };

    int main() {
      Kokkos::initialize();
      int sum = 0;
      // sum can be anything which defines += and
      // a default constructor.
      // sum has to have the same type as the
      // second argument of operator() of the functor
      Kokkos::parallel_reduce(10, squaresum(), sum);
      printf("Sum of first %i square numbers %i\n", 9, sum);
      Kokkos::finalize();
    }

Lambda interface (C++11):

    #include <Kokkos_Core.hpp>
    #include <cstdio>

    int main() {
      Kokkos::initialize();
      int sum = 0;
      // sum can be anything which defines += and
      // a default constructor.
      // sum has to have the same type as the second
      // argument of operator() of the functor.
      // By default lsum is initialized with the default
      // constructor and combined with +=
      Kokkos::parallel_reduce(10, [=] (int i, int& lsum) {
        lsum += i*i;
      }, sum);
      printf("Sum of first %i square numbers %i\n", 9, sum);
      Kokkos::finalize();
    }
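To make the roles of init and join more concrete, here is a minimal sketch in plain standard C++ (no Kokkos). It emulates a reduction serially: the index range is split into chunks, each chunk accumulates into its own partial result, and the partials are combined in a fixed order. The names SquareSumReducer, MaxRemainderReducer, and chunked_reduce are hypothetical and exist only for this illustration; they are not part of the Kokkos API, and the real dispatch is parallel.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical reducer: the default behaviour (start at zero, combine with +=)
// written out as explicit init and join hooks.
struct SquareSumReducer {
  typedef long value_type;
  void init(value_type& v) const { v = 0; }
  void operator()(int i, value_type& lsum) const {
    lsum += static_cast<value_type>(i) * i;
  }
  void join(value_type& dst, const value_type& src) const { dst += src; }
};

// Hypothetical reducer with specialized hooks: maximum of (7*i) % 10.
// Zero is not a safe starting value for a max, and += is not the right
// way to combine partial maxima, so both hooks differ from the default.
struct MaxRemainderReducer {
  typedef int value_type;
  void init(value_type& v) const { v = std::numeric_limits<int>::min(); }
  void operator()(int i, value_type& lmax) const {
    const int x = (7 * i) % 10;
    if (x > lmax) lmax = x;
  }
  void join(value_type& dst, const value_type& src) const {
    if (src > dst) dst = src;
  }
};

// Serial emulation of a chunked reduction over [0, n).
template <class Reducer>
typename Reducer::value_type chunked_reduce(int n, int num_chunks,
                                            const Reducer& r) {
  assert(n >= 0 && num_chunks > 0);
  std::vector<typename Reducer::value_type> partial(num_chunks);
  for (int c = 0; c < num_chunks; ++c) {
    r.init(partial[c]);  // every partial starts from the identity
    const int begin = static_cast<int>(static_cast<long>(n) * c / num_chunks);
    const int end = static_cast<int>(static_cast<long>(n) * (c + 1) / num_chunks);
    for (int i = begin; i < end; ++i) r(i, partial[c]);
  }
  typename Reducer::value_type result;
  r.init(result);
  // Joining in chunk order gives the same order of operations on every run.
  for (int c = 0; c < num_chunks; ++c) r.join(result, partial[c]);
  return result;
}
```

For example, chunked_reduce(10, 4, SquareSumReducer()) adds up the squares of 0 through 9, the same quantity the slide's squaresum functor computes.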

03_SimpleViews
• Kokkos::View: Multi-dimensional array (up to 8 dimensions)
• Default layout (row- or column-major) depends on Device
• Hooks for current & next-gen memory architecture features

    #include <Kokkos_Core.hpp>
    #include <cstdio>

    // A simple 2D array (rank==2) with one compile-time dimension.
    // It uses DefaultDeviceType as its memory space and the default layout
    // associated with it (typically LayoutLeft or LayoutRight).
    // The view does not use any special access traits.
    // By default a view using this type will be reference counted.
    typedef Kokkos::View<double*[3]> view_type;

    int main() {
      Kokkos::initialize();

      // Allocate a view with the runtime dimension set to 10 and a label "A".
      // The label is used in debug output and error messages.
      view_type a("A", 10);

      // The view a is passed on via copy to the parallel dispatch, which is
      // important if the execution space cannot access the default HostSpace
      // directly (or if it is slow), as e.g. on GPUs.
      // Note: the underlying allocation is not moved; only metadata such as
      // pointers and shape information is copied.
      Kokkos::parallel_for(10, [=] (int i) {
        // Read and write access to data comes via operator()
        a(i,0) = 1.0*i;
        a(i,1) = 1.0*i*i;
        a(i,2) = 1.0*i*i*i;
      });

      double sum = 0;
      Kokkos::parallel_reduce(10, [=] (int i, double& lsum) {
        lsum += a(i,0)*a(i,1)/(a(i,2)+0.1);
      }, sum);
      printf("Result %lf\n", sum);

      Kokkos::finalize();
    }

04_SimpleMemorySpaces
• Views live in a MemorySpace (abstraction for possibly manually managed memory hierarchies)
• Deep copies between MemorySpaces are always explicit (“expensive things are always explicit”)

    #include <Kokkos_Core.hpp>
    #include <cstdio>

    typedef Kokkos::View<double*[3]> view_type;
    // HostMirror is a view with the same layout / padding as its parent type
    // but in the host memory space. This memory space can be the same as the
    // device memory space, for example when running on CPUs.
    typedef view_type::HostMirror host_view_type;

    struct squaresum {
      view_type a;
      squaresum(view_type a_) : a(a_) {}

      KOKKOS_INLINE_FUNCTION
      void operator() (int i, int& lsum) const {
        lsum += a(i,0) - a(i,1) + a(i,2);
      }
    };

    int main() {
      Kokkos::initialize();
      view_type a("A", 10);

      // Create an allocation with the same dimensions as a in the host memory
      // space. If the memory space of view_type and its HostMirror are the
      // same, the mirror view won't allocate, and both views will have the
      // same pointer. In that case, deep copies do nothing.
      host_view_type h_a = Kokkos::create_mirror_view(a);

      for(int i = 0; i < 10; i++) {
        for(int j = 0; j < 3; j++) {
          h_a(i,j) = i*10 + j;
        }
      }

      // Transfer data from h_a to a. This does nothing if both views
      // reference the same data.
      Kokkos::deep_copy(a, h_a);

      int sum = 0;
      Kokkos::parallel_reduce(10, squaresum(a), sum);
      printf("Result is %i\n", sum);

      Kokkos::finalize();
    }

05_SimpleAtomics
• Atomics make updating a single memory location (<= 64 bits) thread-safe
• Kokkos provides: fetch-and-add, fetch-bitwise-or, fetch-bitwise-and, fetch-exchange, fetch-compare-exchange (more can be implemented if needed)
• Performance of atomics depends on hardware & how many atomic operations hit the same address at the same time
• If the atomic density is too large, explore different algorithms

    #include <Kokkos_Core.hpp>
    #include <cstdio>
    #include <cstdlib>
    #include <cmath>

    // Define View types used in the code
    typedef Kokkos::View<int*> view_type;
    typedef Kokkos::View<int> count_type;

    // A functor to find prime numbers. Append all
    // primes in 'data_' to the end of the 'result_'
    // array. 'count_' is the index of the first open
    // spot in 'result_'.
    struct findprimes {
      view_type data_;
      view_type result_;
      count_type count_;

      // The functor's constructor.
      findprimes (view_type data, view_type result, count_type count) :
        data_ (data), result_ (result), count_ (count) {}

      // operator() to be called in parallel_for.
      KOKKOS_INLINE_FUNCTION
      void operator() (int i) const {
        // Is data_(i) a prime number?
        const int number = data_(i);
        const int upper_bound = sqrt(1.0*number)+1;
        bool is_prime = !(number%2 == 0);
        int k = 3;
        while(k < upper_bound && is_prime) {
          is_prime = !(number%k == 0);
          k += 2;
        }
        if(is_prime) {
          // 'number' is a prime, so append it to the
          // result_ array. Find & increment the position
          // of the last entry by using a fetch-and-add
          // atomic operation.
          int idx = Kokkos::atomic_fetch_add(&count_(), 1);
          result_(idx) = number;
        }
      }
    };

main() for simple atomics example

    typedef view_type::HostMirror host_view_type;
    typedef count_type::HostMirror host_count_type;

    int main() {
      Kokkos::initialize();
      srand(61391);
      int nnumbers = 100000;

      view_type data("RND", nnumbers);
      view_type result("Prime", nnumbers);
      count_type count("Count");

      host_view_type h_data = Kokkos::create_mirror_view(data);
      host_view_type h_result = Kokkos::create_mirror_view(result);
      host_count_type h_count = Kokkos::create_mirror_view(count);

      for(int i = 0; i < data.dimension_0(); i++)
        h_data(i) = rand()%100000;

      Kokkos::deep_copy(data, h_data);

      int sum = 0;
      Kokkos::parallel_for(data.dimension_0(), findprimes(data, result, count));
      Kokkos::deep_copy(h_count, count);

      printf("Found %i prime numbers in %i random numbers\n", h_count(), nnumbers);
      Kokkos::finalize();
    }
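The append-by-fetch-and-add pattern used above can also be sketched in standard C++ with std::atomic, which may help when reasoning about it outside of Kokkos. This is only an illustration: is_prime, append_primes, and collect_primes are hypothetical helper names, the primality test is a plain trial division of my own rather than the one on the slide, and the two ranges are processed one after the other here, although the counter is what would keep concurrent callers from claiming the same slot.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <vector>

// Simple trial-division primality test (hypothetical helper for this sketch).
bool is_prime(int n) {
  if (n < 2) return false;
  for (int k = 2; static_cast<long>(k) * k <= n; ++k)
    if (n % k == 0) return false;
  return true;
}

// Append every prime in data[begin, end) to result. fetch_add returns the
// old counter value and increments it in one indivisible step, so each
// prime gets a slot that no other caller can receive.
void append_primes(const std::vector<int>& data, std::size_t begin,
                   std::size_t end, std::vector<int>& result,
                   std::atomic<int>& count) {
  assert(begin <= end && end <= data.size());
  assert(result.size() >= data.size());
  for (std::size_t i = begin; i < end; ++i) {
    if (is_prime(data[i])) {
      const int idx = count.fetch_add(1);
      result[idx] = data[i];
    }
  }
}

// Process the input as two independent ranges sharing one counter.
std::vector<int> collect_primes(const std::vector<int>& data) {
  std::vector<int> result(data.size());
  std::atomic<int> count(0);
  const std::size_t half = data.size() / 2;
  append_primes(data, 0, half, result, count);
  append_primes(data, half, data.size(), result, count);
  result.resize(count.load());
  return result;
}
```

If the two ranges ran on different threads, the order of entries in the result would depend on timing, but the number of entries and their values would not. That is the same guarantee the Kokkos example relies on.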

Advanced Views: 01_data_layouts
• Data Layouts determine the mapping between indices and memory addresses
• Each ExecutionSpace has a default Layout optimized for parallel execution on the first index
• Data Layouts can be set via a template parameter in Views
• Kokkos currently provides: LayoutLeft (column-major), LayoutRight (row-major), LayoutStride ([almost] arbitrary strides for each dimension), LayoutTile (like in the MAGMA library)
• Custom Layouts can be added with <= 200 lines of code
• Choosing the wrong layout can reduce performance by 2-10x

    #include <Kokkos_Core.hpp>
    #include <impl/Kokkos_Timer.hpp>
    #include <cstdio>

    typedef Kokkos::View<double**, Kokkos::LayoutLeft> left_type;
    typedef Kokkos::View<double**, Kokkos::LayoutRight> right_type;
    typedef Kokkos::View<double*> view_type;

    template<class ViewType>
    struct init_view {
      ViewType a;
      init_view(ViewType a_) : a(a_) {}

      KOKKOS_INLINE_FUNCTION
      void operator() (int i) const {
        for(int j = 0; j < a.dimension_1(); j++)
          a(i,j) = 1.0*a.dimension_0()*i + 1.0*j;
      }
    };

    template<class ViewType1, class ViewType2>
    struct contraction {
      view_type a;
      typename ViewType1::const_type v1;
      typename ViewType2::const_type v2;
      contraction(view_type a_, ViewType1 v1_, ViewType2 v2_) :
        a(a_), v1(v1_), v2(v2_) {}

      KOKKOS_INLINE_FUNCTION
      void operator() (int i) const {
        for(int j = 0; j < v1.dimension_1(); j++)
          a(i) = v1(i,j)*v2(j,i);
      }
    };

    struct dot {
      view_type a;
      dot(view_type a_) : a(a_) {}

      KOKKOS_INLINE_FUNCTION
      void operator() (int i, double& lsum) const {
        lsum += a(i)*a(i);
      }
    };

    int main(int narg, char* arg[]) {
      Kokkos::initialize(narg, arg);
      int size = 10000;
      view_type a("A", size);
      left_type l("L", size, 10000);
      right_type r("R", size, 10000);

      Kokkos::parallel_for(size, init_view<left_type>(l));
      Kokkos::parallel_for(size, init_view<right_type>(r));
      Kokkos::fence();

      Kokkos::Impl::Timer time1;
      Kokkos::parallel_for(size, contraction<left_type,right_type>(a, l, r));
      Kokkos::fence();
      double sec1 = time1.seconds();

      double sum1 = 0;
      Kokkos::parallel_reduce(size, dot(a), sum1);
      Kokkos::fence();

      Kokkos::Impl::Timer time2;
      Kokkos::parallel_for(size, contraction<right_type,left_type>(a, r, l));
      Kokkos::fence();
      double sec2 = time2.seconds();

      double sum2 = 0;
      Kokkos::parallel_reduce(size, dot(a), sum2);

      printf("Result Left/Right %lf Right/Left %lf (equal result: %i)\n",
             sec1, sec2, sum2==sum1);
      Kokkos::finalize();
    }

    [crtrott@perseus 01_data_layouts]$ ./data_layouts.host --threads 16 --numa 2
    Result Left/Right 0.058223 Right/Left 0.024368 (equal result: 1)
    [crtrott@perseus 01_data_layouts]$ ./data_layouts.cuda
    Result Left/Right 0.015542 Right/Left 0.104692 (equal result: 1)
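Since a layout is a mapping from indices to memory offsets, the two basic rank-2 layouts can be written down as small functions. The sketch below is plain C++ and assumes a dense allocation without padding; offset_left and offset_right are hypothetical names for this illustration, and a real View may pad or stride differently.

```cpp
#include <cassert>
#include <cstddef>

// Column-major (LayoutLeft-style) offset of element (i, j) in an n0 x n1
// array: consecutive values of the first index i are adjacent in memory.
std::size_t offset_left(std::size_t i, std::size_t j,
                        std::size_t n0, std::size_t n1) {
  assert(i < n0 && j < n1);
  return i + j * n0;
}

// Row-major (LayoutRight-style) offset of element (i, j): consecutive
// values of the last index j are adjacent in memory.
std::size_t offset_right(std::size_t i, std::size_t j,
                         std::size_t n0, std::size_t n1) {
  assert(i < n0 && j < n1);
  return i * n1 + j;
}
```

With these formulas the timings above become plausible. In contraction<left_type,right_type> both accesses, v1(i,j) and v2(j,i), are contiguous in the parallel index i and strided in the inner loop index j. That is likely what favours the GPU, where neighbouring threads handle neighbouring i, while a CPU thread walking through j touches a new cache line on almost every step. Swapping the layouts reverses the picture.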

Advanced Views: 02_memory_traits
• Memory Traits are used to specify usage patterns of Views
• Views with different traits (which are equal otherwise) can usually be assigned to each other
• Examples of MemoryTraits: MemoryManaged, MemoryUnmanaged, MemoryRandomAccess
• Choosing the correct traits can have significant performance impact if special hardware exists to support a usage pattern (e.g., texture cache for random access on GPUs)

    #include <Kokkos_Core.hpp>
    #include <impl/Kokkos_Timer.hpp>
    #include <cstdio>

    typedef Kokkos::View<double*> view_type;
    // We expect to access these data "randomly" (noncontiguously).
    typedef Kokkos::View<const double*, Kokkos::MemoryRandomAccess> view_type_rnd;
    typedef Kokkos::View<int**> idx_type;
    typedef idx_type::HostMirror idx_type_host;

    // Template the Functor on the View type to show the performance
    // difference with MemoryRandomAccess.
    template<class DestType, class SrcType>
    struct localsum {
      idx_type::const_type idx;
      DestType dest;
      SrcType src;

      localsum (idx_type idx_, DestType dest_, SrcType src_) :
        idx (idx_), dest (dest_), src (src_) {}

      KOKKOS_INLINE_FUNCTION
      void operator() (int i) const {
        double tmp = 0.0;
        for(int j = 0; j < idx.dimension_1(); j++) {
          // Indirect (hence probably noncontiguous) access
          const double val = src(idx(i,j));
          tmp += val*val + 0.5*(idx.dimension_0()*val - idx.dimension_1()*val);
        }
        dest(i) = tmp;
      }
    };

    int main(int narg, char* arg[]) {
      Kokkos::initialize(narg, arg);
      int size = 1000000;

      idx_type idx("Idx", size, 64);
      idx_type_host h_idx = Kokkos::create_mirror_view(idx);
      view_type dest("Dest", size);
      view_type src("Src", size);

      srand(134231);
      for(int i = 0; i < size; i++) {
        for(int j = 0; j < h_idx.dimension_1(); j++) {
          h_idx(i,j) = (size + i + (rand()%500 - 250))%size;
        }
      }
      Kokkos::deep_copy(idx, h_idx);

      Kokkos::parallel_for(size,
        localsum<view_type,view_type_rnd>(idx, dest, src));
      Kokkos::fence();

      // Invoke kernel with views using the
      // RandomAccess trait
      Kokkos::Impl::Timer time1;
      Kokkos::parallel_for(size,
        localsum<view_type,view_type_rnd>(idx, dest, src));
      Kokkos::fence();
      double sec1 = time1.seconds();

      // Invoke kernel with views without
      // the RandomAccess trait
      Kokkos::Impl::Timer time2;
      Kokkos::parallel_for(size,
        localsum<view_type,view_type>(idx, dest, src));
      Kokkos::fence();
      double sec2 = time2.seconds();

      printf("Time with Trait RandomAccess: %lf with Plain: %lf \n", sec1, sec2);
      Kokkos::finalize();
    }

    [crtrott@perseus 02_memory_traits]$ ./memory_traits.host --threads 16 --numa 2
    Time with Trait RandomAccess: 0.004979 with Plain: 0.004999
    [crtrott@perseus 02_memory_traits]$ ./memory_traits.cuda
    Time with Trait RandomAccess: 0.004043 with Plain: 0.009060

Advanced Views: 04_DualViews
• DualViews manage data transfer between host and device
• You mark a View as modified on host or device; you ask for synchronization (conditional, if marked)
• DualView has the same template arguments as View
• To access the View in a specific MemorySpace, you must extract it

    #include <Kokkos_Core.hpp>
    #include <Kokkos_DualView.hpp>
    #include <impl/Kokkos_Timer.hpp>
    #include <cstdio>
    #include <cstdlib>

    typedef Kokkos::DualView<double*> view_type;
    typedef Kokkos::DualView<int**> idx_type;

    template<class Device>
    struct localsum {
      // Define the functor's execution space
      // (overrides the DefaultDeviceType)
      typedef Device device_type;

      // Get view types on the particular Device
      // for which the functor is instantiated
      Kokkos::View<idx_type::const_data_type,
                   idx_type::array_layout, Device> idx;
      Kokkos::View<view_type::array_type,
                   view_type::array_layout, Device> dest;
      Kokkos::View<view_type::const_data_type,
                   view_type::array_layout, Device,
                   Kokkos::MemoryRandomAccess> src;

      // Constructor
      localsum (idx_type dv_idx, view_type dv_dest, view_type dv_src)
      {
        // Extract view on correct Device from DualView
        idx = dv_idx.view<Device>();
        dest = dv_dest.template view<Device>();
        src = dv_src.template view<Device>();

        // Synchronize DualView on correct Device
        dv_idx.sync<Device>();
        dv_dest.template sync<Device>();
        dv_src.template sync<Device>();

        // Mark dest as modified on Device
        dv_dest.template modify<Device>();
      }

      KOKKOS_INLINE_FUNCTION
      void operator() (int i) const {
        double tmp = 0.0;
        for(int j = 0; j < idx.dimension_1(); j++) {
          const double val = src(idx(i,j));
          tmp += val*val + 0.5*(idx.dimension_0()*val - idx.dimension_1()*val);
        }
        dest(i) += tmp;
      }
    };

    int main(int narg, char* arg[]) {
      Kokkos::initialize(narg, arg);
      srand(134231);
      int size = 1000000;

      // Create DualViews. This will allocate on both
      // the device and its host_mirror_device
      idx_type idx("Idx", size, 64);
      view_type dest("Dest", size);
      view_type src("Src", size);

      // Get a reference to the host view of idx
      // directly (equivalent to
      // idx.view<idx_type::host_mirror_device_type>() )
      idx_type::t_host h_idx = idx.h_view;
      for(int i = 0; i < size; i++) {
        for(int j = 0; j < h_idx.dimension_1(); j++)
          h_idx(i,j) = (size + i + (rand()%500 - 250))%size;
      }

      // Mark idx as modified on the host_mirror_device_type
      // so that a sync to the device will actually move
      // data.
      // The sync happens in the constructor of the functor
      idx.modify<idx_type::host_mirror_device_type>();

      // Run on the device
      // This will cause a sync of idx to the device since
      // it's marked as modified on the host
      Kokkos::Impl::Timer timer;
      Kokkos::parallel_for(size,
        localsum<view_type::device_type>(idx, dest, src));
      Kokkos::fence();
      double sec1_dev = timer.seconds();

      timer.reset();
      Kokkos::parallel_for(size,
        localsum<view_type::device_type>(idx, dest, src));
      Kokkos::fence();
      double sec2_dev = timer.seconds();

      // Run on the host (could be the same as device)
      // This will cause a sync back to the host of dest
      // Note that if the Device is CUDA: the data layout
      // will not be optimal on host, so performance is
      // lower than what it would be for a pure host
      // compilation
      timer.reset();
      Kokkos::parallel_for(size,
        localsum<view_type::host_mirror_device_type>(idx, dest, src));
      Kokkos::fence();
      double sec1_host = timer.seconds();

      timer.reset();
      Kokkos::parallel_for(size,
        localsum<view_type::host_mirror_device_type>(idx, dest, src));
      Kokkos::fence();
      double sec2_host = timer.seconds();

      printf("Device Time with Sync: %lf without Sync: %lf \n", sec1_dev, sec2_dev);
      printf("Host Time with Sync: %lf without Sync: %lf \n", sec1_host, sec2_host);
      Kokkos::finalize();
    }
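The "mark as modified, then sync only if needed" bookkeeping can be illustrated with a toy class in plain C++. This is a sketch of the idea only, not the Kokkos implementation: ToyDualView and its members are hypothetical, both copies live in ordinary host memory, and the transfer counter exists purely so the effect of a conditional sync can be observed.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a dual view: a "host" copy, a "device" copy,
// and one modification counter per side.
struct ToyDualView {
  std::vector<double> host;
  std::vector<double> device;
  unsigned modified_host;
  unsigned modified_device;
  int transfers;  // number of copies that actually happened

  explicit ToyDualView(std::size_t n)
      : host(n, 0.0), device(n, 0.0),
        modified_host(0), modified_device(0), transfers(0) {}

  // Record that one side now holds the most recent data.
  void modify_host() {
    modified_host = std::max(modified_host, modified_device) + 1;
  }
  void modify_device() {
    modified_device = std::max(modified_host, modified_device) + 1;
  }

  // Copy only if the other side is newer; otherwise do nothing.
  void sync_device() {
    if (modified_host > modified_device) {
      device = host;
      ++transfers;
      modified_host = modified_device = 0;
    }
  }
  void sync_host() {
    if (modified_device > modified_host) {
      host = device;
      ++transfers;
      modified_host = modified_device = 0;
    }
  }
};
```

In this picture, the first device launch in the example pays for a transfer because idx was marked as modified on the host, while the second launch finds nothing to copy. That matches the "with Sync" and "without Sync" columns the program prints.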

Advanced Views: 05 NVIDIA UVM
• NVIDIA provides Unified Virtual Memory on high-end Kepler: the runtime manages data transfer
• Makes coding easier: pretend there is only one MemorySpace
• But: can come with significant performance penalties if complete allocations are moved frequently

    #include <Kokkos_Core.hpp>
    #include <Kokkos_DualView.hpp>
    #include <impl/Kokkos_Timer.hpp>
    #include <cstdio>
    #include <cstdlib>

    typedef Kokkos::View<double*> view_type;
    typedef Kokkos::View<int**> idx_type;

    template<class Device>
    struct localsum {
      // Define the execution space for the functor
      // (overrides the DefaultDeviceType)
      typedef Device device_type;

      // Use the same ViewType no matter where the
      // functor is executed
      idx_type::const_type idx;
      view_type dest;
      Kokkos::View<view_type::const_data_type,
                   view_type::array_layout,
                   view_type::device_type,
                   Kokkos::MemoryRandomAccess> src;

      localsum(idx_type idx_, view_type dest_, view_type src_) :
        idx(idx_), dest(dest_), src(src_) {
      }

      KOKKOS_INLINE_FUNCTION
      void operator() (int i) const {
        double tmp = 0.0;
        for(int j = 0; j < idx.dimension_1(); j++) {
          const double val = src(idx(i,j));
          tmp += val*val + 0.5*(idx.dimension_0()*val - idx.dimension_1()*val);
        }
        dest(i) += tmp;
      }
    };

    int main(int narg, char* arg[]) {
      Kokkos::initialize(narg, arg);
      int size = 1000000;

      // Create Views
      idx_type idx("Idx", size, 64);
      view_type dest("Dest", size);
      view_type src("Src", size);

      srand(134231);
      // When using UVM, Cuda views can be accessed on the
      // Host directly
      for(int i = 0; i < size; i++) {
        for(int j = 0; j < idx.dimension_1(); j++)
          idx(i,j) = (size + i + (rand()%500 - 250))%size;
      }
      Kokkos::fence();

      // Run on the device
      // This will cause a sync of idx to the device since
      // it was modified on the host
      Kokkos::Impl::Timer timer;
      Kokkos::parallel_for(size,
        localsum<view_type::device_type>(idx, dest, src));
      Kokkos::fence();
      double sec1_dev = timer.seconds();

      // No data transfer will happen now, since nothing is
      // accessed on the host
      timer.reset();
      Kokkos::parallel_for(size,
        localsum<view_type::device_type>(idx, dest, src));
      Kokkos::fence();
      double sec2_dev = timer.seconds();

      // Run on the host
      // This will cause a sync back to the host of
      // dest which was changed on the device
      // Compare runtime here with the dual_view example:
      // dest will be copied back in 4k blocks
      // when they are accessed the first time during the
      // parallel_for. Due to the latency of a memcpy
      // this gives lower effective bandwidth than doing
      // a manual copy via dual views
      timer.reset();
      Kokkos::parallel_for(size,
        localsum<view_type::device_type::host_mirror_device_type>(idx, dest, src));
      Kokkos::fence();
      double sec1_host = timer.seconds();

      // No data transfers will happen now
      timer.reset();
      Kokkos::parallel_for(size,
        localsum<view_type::device_type::host_mirror_device_type>(idx, dest, src));
      Kokkos::fence();
      double sec2_host = timer.seconds();

      printf("Device Time with Sync: %lf without Sync: %lf \n", sec1_dev, sec2_dev);
      printf("Host Time with Sync: %lf without Sync: %lf \n", sec1_host, sec2_host);
      Kokkos::finalize();
    }

    [crtrott@perseus 04_dualviews]$ make CUDA=yes CUDA_UVM=no -j 8 CUDA_PATH=/home/crtrott/lib/cuda all HWLOC=yes OMP=no
    [crtrott@perseus 05_NVIDIA_UVM]$ make CUDA=yes CUDA_UVM=yes -j 8 CUDA_PATH=/home/crtrott/lib/cuda all HWLOC=yes OMP=no
    [crtrott@perseus Advanced_Views]$ 04_dualviews/dual_view.cuda --threads 16 --numa 2
    Device Time with Sync: 0.074286 without Sync: 0.004056
    Host Time with Sync: 0.038507 without Sync: 0.035801
    [crtrott@perseus Advanced_Views]$ 05_NVIDIA_UVM/uvm_example.cuda --threads 16 --numa 2
    Device Time with Sync: 0.368231 without Sync: 0.358703
    Host Time with Sync: 0.015760 without Sync: 0.015575
    [crtrott@perseus Advanced_Views]$ export CUDA_VISIBLE_DEVICES=0
    [crtrott@perseus Advanced_Views]$ 05_NVIDIA_UVM/uvm_example.cuda --threads 16 --numa 2
    Device Time with Sync: 0.068831 without Sync: 0.004065
    Host Time with Sync: 0.990998 without Sync: 0.016688

Running with UVM on multi-GPU machines can cause fallback to a zero-copy mechanism: all allocations live on the host and are accessed via the PCIe bus.
Use CUDA_VISIBLE_DEVICES=k to prevent this.
When looping through a UVM allocation on the host, data is copied back to the host in 4k blocks. PCIe latency restricts effective bandwidth to 0.5 GB/s, as opposed to 8 GB/s.

Hierarchical Parallelism: 01 ThreadTeams
• Kokkos supports the notion of a “League of Thread Teams”
• Useful when fine-grained parallelism is exposed: need to sync or share data with a subset of threads
• On CPUs: often the best team size is 1; on Intel Xeon Phi and GPUs: team sizes of 4 and 256, respectively
• The number of teams is not bound by hardware resources: as in CUDA/OpenCL, use the number the algorithm calls for

Lambda interface (C++11):

    #include <Kokkos_Core.hpp>
    #include <cstdio>

    typedef Kokkos::Impl::DefaultDeviceType device_type;

    int main(int narg, char* args[]) {
      Kokkos::initialize(narg, args);
      int sum = 0;
      Kokkos::parallel_reduce(
        Kokkos::ParallelWorkRequest(12, device_type::team_max()),
        [=] (device_type dev, int& lsum) {
          lsum += 1;
          printf("Hello World: %i %i // %i %i\n",
                 dev.league_rank(), dev.team_rank(),
                 dev.league_size(), dev.team_size());
        }, sum);
      printf("Result %i\n", sum);
      Kokkos::finalize();
    }

Functor interface (C++98):

    #include <Kokkos_Core.hpp>
    #include <cstdio>

    typedef Kokkos::Impl::DefaultDeviceType device_type;

    struct hello_world {
      KOKKOS_INLINE_FUNCTION
      void operator() (device_type dev, int& sum) const {
        sum += 1;
        printf("Hello World: %i %i // %i %i\n",
               dev.league_rank(), dev.team_rank(),
               dev.league_size(), dev.team_size());
      }
    };

    int main(int narg, char* args[]) {
      Kokkos::initialize(narg, args);
      int sum = 0;
      Kokkos::parallel_reduce(
        Kokkos::ParallelWorkRequest(12, device_type::team_max()),
        hello_world(), sum);
      printf("Result %i\n", sum);
      Kokkos::finalize();
    }
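The league/team index space can be pictured as two nested loops. The following serial emulation in plain C++ is only a mental model, under the assumption that the body is invoked once per thread of every team, which is what the lsum += 1 in the example counts. TeamMember, emulate_team_reduce, and CountOne are hypothetical names, and real teams run concurrently rather than in loop order.

```cpp
#include <cassert>

// Hypothetical handle exposing the four queries used in the example.
struct TeamMember {
  int league_rank_, team_rank_, league_size_, team_size_;
  int league_rank() const { return league_rank_; }
  int team_rank() const { return team_rank_; }
  int league_size() const { return league_size_; }
  int team_size() const { return team_size_; }
};

// Serial emulation: the outer loop walks over teams in the league, the
// inner loop over threads within a team. The body sees one handle per
// (league_rank, team_rank) pair and accumulates into sum.
template <class Body>
int emulate_team_reduce(int league_size, int team_size, const Body& body) {
  assert(league_size >= 0 && team_size > 0);
  int sum = 0;
  for (int league = 0; league < league_size; ++league) {
    for (int rank = 0; rank < team_size; ++rank) {
      const TeamMember member = {league, rank, league_size, team_size};
      body(member, sum);
    }
  }
  return sum;
}

// Same contribution as the example: every invocation adds one.
struct CountOne {
  void operator()(const TeamMember&, int& lsum) const { lsum += 1; }
};
```

Under that assumption, a league of 12 teams with 4 threads each yields a sum of 48, so the "Result" line of the example reports the league size times the team size chosen at run time.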

Hierarchical Parallelism: 02 Shared Memory
• Kokkos supports ScratchPads for Teams
• On CPUs, ScratchPad is just a small team-private allocation which hopefully lives in L1 cache

    #include <Kokkos_Core.hpp>
    #include <Kokkos_DualView.hpp>
    #include <impl/Kokkos_Timer.hpp>
    #include <cstdio>
    #include <cstdlib>

    typedef Kokkos::Impl::DefaultDeviceType Device;
    typedef Device::host_mirror_device_type Host;

    #define TS 16

    struct find_2_tuples {
      int chunk_size;
      Kokkos::View<const int*> data;
      Kokkos::View<int**> histogram;

      find_2_tuples(int chunk_size_, Kokkos::DualView<int*> data_,
                    Kokkos::DualView<int**> histogram_) :
        chunk_size(chunk_size_), data(data_.d_view),
        histogram(histogram_.d_view) {
        data_.sync<Device>();
        histogram_.sync<Device>();
        histogram_.modify<Device>();
      }

      KOKKOS_INLINE_FUNCTION
      void operator() (Device dev) const {
        // If Device is 1st arg, use scratchpad mem
        Kokkos::View<int**,Kokkos::MemoryUnmanaged> l_histogram(dev, TS, TS);
        Kokkos::View<int*,Kokkos::MemoryUnmanaged> l_data(dev, chunk_size+1);

        const int i = dev.league_rank() * chunk_size;
        for(int j = dev.team_rank(); j < chunk_size+1; j += dev.team_size())
          l_data(j) = data(i+j);

        for(int k = dev.team_rank(); k < TS; k += dev.team_size())
          for(int l = 0; l < TS; l++)
            l_histogram(k,l) = 0;
        dev.team_barrier();

        for(int j = 0; j < chunk_size; j++) {
          for(int k = dev.team_rank(); k < TS; k += dev.team_size())
            for(int l = 0; l < TS; l++) {
              if((l_data(j) == k) && (l_data(j+1) == l))
                l_histogram(k,l)++;
            }
        }

        for(int k = dev.team_rank(); k < TS; k += dev.team_size())
          for(int l = 0; l < TS; l++) {
            Kokkos::atomic_fetch_add(&histogram(k,l), l_histogram(k,l));
          }
        dev.team_barrier();
      }

      size_t shmem_size() const {
        return sizeof(int)*(chunk_size+2 + TS*TS);
      }
    };

main() for hierarchical parallelism example

    int main(int narg, char* args[]) {
      Kokkos::initialize(narg, args);

      int chunk_size = 1024;
      int nchunks = 100000; //1024*1024;
      Kokkos::DualView<int*> data("data", nchunks*chunk_size+1);

      srand(1231093);
      for(int i = 0; i < data.dimension_0(); i++) {
        data.h_view(i) = rand()%TS;
      }
      data.modify<Host>();
      data.sync<Device>();

      Kokkos::DualView<int**> histogram("histogram", TS, TS);

      Kokkos::Impl::Timer timer;
      Kokkos::parallel_for(
        Kokkos::ParallelWorkRequest(nchunks,
          (TS < Device::team_max()) ? TS : Device::team_max()),
        find_2_tuples(chunk_size, data, histogram));
      Kokkos::fence();
      double time = timer.seconds();

      histogram.sync<Host>();
      printf("Time: %lf \n\n", time);
      Kokkos::finalize();
    }
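When experimenting with the team-based kernel, it can help to have a serial reference for what the histogram should contain. The sketch below, in plain C++, counts every adjacent pair (data[p], data[p+1]) directly. count_2_tuples is a hypothetical helper, and the flat row-major table it returns is my own choice for the illustration, not the layout of the View in the example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Count adjacent pairs in data whose values lie in [0, ts).
// Entry k*ts + l of the returned table is the number of positions p
// with data[p] == k and data[p+1] == l.
std::vector<int> count_2_tuples(const std::vector<int>& data, int ts) {
  assert(ts > 0);
  std::vector<int> histogram(static_cast<std::size_t>(ts) * ts, 0);
  for (std::size_t p = 0; p + 1 < data.size(); ++p) {
    const int k = data[p];
    const int l = data[p + 1];
    assert(0 <= k && k < ts && 0 <= l && l < ts);
    ++histogram[static_cast<std::size_t>(k) * ts + l];
  }
  return histogram;
}
```

The kernel in the example covers the same pairs: each team loads chunk_size+1 values, so the pair that straddles two chunks is counted by the team that owns the left element. The entries of the final histogram should therefore add up to the number of data values minus one.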

Wrap Up
• Features not presented here:
  • Getting a subview of a View
  • ParallelScan & TeamScan
  • Linear Algebra subpackage
  • Kokkos::UnorderedMap (thread-scalable hash table)
• To learn more, see:
  • More complex Kokkos examples
  • Mantevo MiniApps (e.g., MiniFE)
  • LAMMPS (molecular dynamics code)

Questions and further discussion: crtrott@sandia.gov
