
Uintah Runtime Systems Tutorial






Presentation Transcript


  1. www.uintah.utah.edu Uintah Runtime Systems Tutorial • Outline of Runtime System • Adaptive Mesh Refinement and Load Balancing • Executing tasks • Heterogeneous architectures with GPUs • Heterogeneous architectures with Intel MIC. Alan Humphrey, Justin Luitjens*, Dav de St. Germain and Martin Berzins (*NVIDIA, here as an ex-Uintah researcher/developer). Funding: DOE ASCI (97-10), NSF, DOE NETL+NNSA, ARL, INCITE, XSEDE

  2. Uintah Runtime System: How Uintah Runs Tasks • Memory Manager: Uintah Data Warehouse (DW) • Variable dictionary (hashed map from (Variable Name, Patch ID, Material ID) keys to memory) • Provides interfaces for tasks to • Allocate variables • Put variables into DW • Get variables from DW • Automatic scrubbing (garbage collection) • Checkpointing & restart (data archiver) • Task Manager: Uintah schedulers • Decide when and where to run tasks • Decide when to process MPI
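The variable dictionary above can be pictured as a hash map keyed on the (variable name, patch ID, material ID) triple. The following is a minimal sketch of that idea, not the real Uintah classes; all names here are illustrative.

```cpp
#include <cassert>
#include <string>
#include <tuple>
#include <unordered_map>
#include <utility>
#include <vector>

// Hash for the (variable name, patch ID, material ID) key triple.
struct KeyHash {
  std::size_t operator()(const std::tuple<std::string, int, int>& k) const {
    return std::hash<std::string>()(std::get<0>(k)) ^
           (std::hash<int>()(std::get<1>(k)) * 31) ^
           (std::hash<int>()(std::get<2>(k)) * 131);
  }
};

// Hypothetical mini data warehouse: put/get variables and scrub
// (garbage-collect) them once no remaining task needs them.
class DataWarehouse {
  std::unordered_map<std::tuple<std::string, int, int>,
                     std::vector<double>, KeyHash> vars_;
 public:
  void put(const std::string& name, int patch, int matl,
           std::vector<double> data) {
    vars_[std::make_tuple(name, patch, matl)] = std::move(data);
  }
  const std::vector<double>& get(const std::string& name, int patch,
                                 int matl) const {
    return vars_.at(std::make_tuple(name, patch, matl));
  }
  void scrub(const std::string& name, int patch, int matl) {
    vars_.erase(std::make_tuple(name, patch, matl));
  }
};
```

Tasks would declare what they compute and require, so a `get` only happens after the matching `put`.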

  3. UINTAH ARCHITECTURE [diagram: an XML problem specification feeds task-graph compilation; the runtime system executes each timestep (calculate residuals, solve equations); parallel I/O through Visus PIDX and VisIt]

  4. Adaptive Mesh Refinement

  5. Berger-Rigoutsos Communications Costs • Every processor has to pass its part of the histogram to every other so that the partitioning decision can be made • This is done using the MPI_Allgather function; the cost of an allgather on P cores is log(P) • Version 1: (2B-1) log(P) messages • Version 2: only certain cores take part [Gunney et al.]: 2P - log(P) - 2 messages • Alternative: use Berger-Rigoutsos locally on each processor

  6. AMR Regridding Algorithm • Originally implemented the Berger-Rigoutsos algorithm • Version 1: (2B-1) log(P) messages • Version 2: only certain cores take part, 2P - log(P) - 2 messages • Tiled Algorithm (scalable): tiles that contain flags become patches • Simple and easy to parallelize • Semi-regular patch sets that can be exploited, e.g. for neighbor finding • Each core processes a subset of refinement flags in parallel, which helps produce the global patch set [analysis to appear in Concurrency]
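The core of the tiled idea ("tiles that contain flags become patches") fits in a few lines. This is a hedged 2D sketch with illustrative names, not the Uintah regridder; each core would run it over only its own subset of flags.

```cpp
#include <cassert>
#include <set>
#include <utility>
#include <vector>

// Tiled regridding sketch: divide the index space into fixed-size tiles and
// turn every tile containing at least one refinement flag into a patch.
// flags: cell indices (i, j) marked for refinement.
std::set<std::pair<int, int>> tiledRegrid(
    const std::vector<std::pair<int, int>>& flags, int tileSize) {
  std::set<std::pair<int, int>> patches;  // tile indices become patches
  for (const auto& f : flags) {
    patches.insert({f.first / tileSize, f.second / tileSize});
  }
  return patches;
}
```

Because every patch is a tile of the same regular lattice, neighbor finding reduces to index arithmetic, which is the "semi-regular patch sets" advantage named above.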

  7. EXAMPLE: MESH REFINEMENT AROUND A CIRCULAR FRONT Global Berger-Rigoutsos, Local Berger-Rigoutsos, Tiled Algorithm. More patches are generated by the tiled algorithm, but there is no need to split them to load balance

  8. Theoretical Models of Refinement Algorithms C = number of mesh cells in domain, F = number of refinement flags in domain, B = number of mesh patches in the domain, Bc = number of coarse mesh patches in the domain, P = number of processing cores, M = number of messages. T(GBRv1) = c1 (F/P) log(B) + c2 M(GBRv1); T(LBR) = c4 (F/P) log(B/Bc); T(Tiled) = c5 C/P; T(Split) = c6 B/P + c7 log(P), the time to split large patches
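The models above can be evaluated directly to see why the tiled algorithm strong-scales while GBRv1 does not: the tiled cost is pure C/P, whereas GBRv1 carries a message term that grows with P. A small sketch, with the constants c1, c2, c5 chosen arbitrarily for illustration:

```cpp
#include <cassert>
#include <cmath>

// T(GBRv1) = c1 (F/P) log(B) + c2 M : the message count M grows with P,
// so this term eventually dominates and strong scaling stalls.
double tGBRv1(double F, double P, double B, double M,
              double c1 = 1.0, double c2 = 1.0) {
  return c1 * (F / P) * std::log2(B) + c2 * M;
}

// T(Tiled) = c5 C/P : halves every time the core count doubles.
double tTiled(double C, double P, double c5 = 1.0) {
  return c5 * C / P;
}
```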

  9. Strong Scaling of AMR Algorithms GBR is Global Berger-Rigoutsos; with the problem size fixed as the number of cores increases, execution time should decrease. GBR shows no strong scaling; the tiled algorithm strong-scales. T(GBRv1) = c1 (Flags/Proc) log(npatches) + c2 M(GBRv1); T(Tiled) = c5 Nmeshcells/Proc; T(LBR) = GBR on a compute node. Dots are data and lines are models

  10. Performance Improvements • Improve algorithmic complexity: identify & eliminate O(Processors) and O(Patches) costs via Tau, hand profiling, and memory usage analysis • Improve task-graph execution: out-of-order execution of the task graph • Improve load balance: cost estimation algorithms based on data assimilation; load-balancing algorithms based on patches and a new fast space-filling curve algorithm • Hybrid model using message passing between nodes and asynchronous threaded execution of tasks on cores [figure: partial ICE task graph]

  11. Load Balancing Weight Estimation • Algorithmic cost models based on the discretization method: vary according to simulation + machine; require accurate information from the user • Time series analysis: used to forecast the execution time on each patch; automatically adjusts to the simulation and architecture with no user interaction • Need to assign the same workload to each processor • Simple exponential smoothing: E(r,t+1) = α O(r,t) + (1-α) E(r,t) = α (O(r,t) - E(r,t)) + E(r,t), where E(r,t) is the estimated time, O(r,t) the observed time, α the decay rate, and O(r,t) - E(r,t) the error in the last prediction
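The smoothing update above is a one-liner in code. A minimal sketch (the surrounding load balancer and the choice of α are assumed):

```cpp
#include <cassert>
#include <cmath>

// Simple exponential smoothing for per-patch cost forecasting:
// E_{t+1} = alpha * O_t + (1 - alpha) * E_t,
// written here in the equivalent error-correction form
// E_{t+1} = E_t + alpha * (O_t - E_t).
double updateEstimate(double estimated, double observed, double alpha) {
  return estimated + alpha * (observed - estimated);
}
```

Repeated updates against a steady workload converge to the observed cost, which is why no user-supplied model is needed.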

  12. Comparison between the Forecast Cost Model (FCM) & the Algorithmic Cost Model: particles + fluid code, FULL SIMULATION

  13. Runtime System

  14. Thread/MPI Scheduler (De-centralized) • One MPI process per multicore node • All threads directly pull tasks from task queues, execute tasks, and process MPI sends/receives • Tasks for one patch may run on different cores • One data warehouse and task queue per multicore node • Lock-free data warehouse enables all cores to access memory quickly [diagram: network, task graph, task queues (ready tasks, new tasks), shared data warehouse (variables directory); each core runs tasks and checks queues, with PUT/GET against the data warehouse and MPI sends/receives of completed tasks]

  15. Uintah Runtime System Data Management [diagram: per-node shared objects (task graph, internal task queue, external task queue, comm records, data warehouse) and an execution layer running on multiple cores; the cycle is: select task & post MPI receives (MPI_Irecv), check comm records & find ready tasks (MPI_Test), select task & execute with get/put against the one-per-node data warehouse, then post task MPI sends (MPI_Isend)]

  16. New Hybrid Model Memory Savings: Ghost Cells (example on Kraken, 12 cores/node, 98K cores; local patch plus ghost cells) MPI: raw data 49152 doubles, MPI buffer 28416 doubles, total ~75K doubles. Thread/MPI: raw data 31360 doubles, MPI buffer 10624 doubles, total ~40K doubles. Only 11% of the memory is needed

  17. Memory Savings • Global meta-data copies: 60 bytes or 7.5 doubles per patch; one copy per node instead of one copy per core • MPI library buffer overhead • Results: AMRICE simulation of the transport of two fluids with a prescribed initial velocity of Mach 2; 435 million cells, strong scaling runs on Kraken

  18. Task Prioritization Algorithms [diagram: sub-domain showing executed vs not-yet-executed tasks and MPI sends] Prioritize tasks with external communications over purely internal ones
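The rule on this slide, run externally-communicating tasks first so their MPI sends are posted as early as possible, amounts to an ordering on the ready queue. A hedged sketch with illustrative fields:

```cpp
#include <queue>
#include <vector>

// Illustrative task record: only the fields needed for the ordering.
struct Task {
  int id;
  bool hasExternalComm;  // results must be sent to another node
};

// priority_queue pops its "largest" element, so make tasks with external
// communication compare as larger than purely internal ones.
struct ByExternalFirst {
  bool operator()(const Task& a, const Task& b) const {
    return a.hasExternalComm < b.hasExternalComm;
  }
};

using TaskQueue =
    std::priority_queue<Task, std::vector<Task>, ByExternalFirst>;
```

Running boundary tasks first overlaps their communication with the remaining interior work.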

  19. Granularity Effect • Decrease patch size: (+) increases queue length, (+) more overlap, lower task wait time, (+) more patches, better load balance, (-) more MPI messages, (-) more regridding overhead • Other factors: problem size, implied task-level parallelism, interconnect bandwidth and latency, CPU cache size • Solution: self-tuning?

  20. Lock-Free Data Structures Global scalability depends on the details of the nodal runtime system. The change from Jaguar to Titan (more, faster cores and faster communications) broke our runtime system, which previously worked fine with locks. • Using the atomic instruction set • Variable reference counting: fetch_and_add, fetch_and_sub, compare_and_swap; allows simultaneous reads and writes • Data warehouse: redesigned variable container; update via compare_and_swap, reduce via test_and_set
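The variable reference counting named above maps directly onto C++ atomics, where `fetch_add`/`fetch_sub` are the fetch_and_add/fetch_and_sub instructions. A minimal sketch (the class name and the scrub hook are illustrative, not Uintah's):

```cpp
#include <atomic>

// Lock-free reference counting: readers and writers adjust the count with
// atomic fetch_add/fetch_sub instead of taking a lock; the variable can be
// scrubbed once the count reaches zero.
class RefCounted {
  std::atomic<int> refs_{0};
 public:
  void addRef() { refs_.fetch_add(1, std::memory_order_relaxed); }
  // Returns true when the caller released the last reference,
  // i.e. the variable is now eligible for scrubbing.
  bool release() {
    return refs_.fetch_sub(1, std::memory_order_acq_rel) == 1;
  }
  int count() const { return refs_.load(); }
};
```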

  21. Scalability Improvement • Original dynamic MPI-only scheduler: out of memory on 98K cores, severe load imbalance • De-centralized MPI/Thread hybrid scheduler (with lock-free data warehouse) and new threaded runtime system: achieved much better CPU scalability

  22. On-demand Data Warehouse • Directory-based hash map • Variable versioning for out-of-order execution [diagram: variable versions v, v1, v2, v3 under in-order vs out-of-order execution]

  23. Schedule Global Sync Task • Synchronization task: updates a global variable (e.g. MPI_Allreduce) or calls a third-party library (e.g. PETSc) • Out-of-order issues: deadlock and load imbalance (e.g. when sync tasks R1 and R2 run in different orders on different nodes) • Solution, task phases: one global task per phase, the global task runs last, out-of-order execution allowed within a phase

  24. Dynamic Task Graph Execution • 1) Static: predetermined order; tasks are synchronized; higher waiting times • 2) Dynamic: execute when ready; tasks are asynchronous; lower waiting times [diagram: task dependency vs execution order]

  25. Multi-threaded Task Graph Execution • 1) Static: predetermined order; tasks are synchronized; higher waiting times • 2) Dynamic: execute when ready; tasks are asynchronous; lower waiting times • 3) Dynamic multi-threaded: task-level parallelism; memory saving; decreases communication and load imbalance; multicore friendly; supports GPU tasks (future) [diagram: task dependency vs execution order]

  26. Weak Scaling AMR+MPM+ICE (M = Mira, T = Titan, S = Stampede) Only 2 patches per core; times include packing, unpacking and data warehouse; only 8 interior patches from 32

  27. Scalability is at least partially achieved by not executing tasks in order, e.g. in AMR fluid-structure interaction. The straight line represents the given order of tasks; a green X shows when a task is actually executed. Above the line means late execution, below the line means early execution. There are more "late" tasks than "early" ones, e.g. tasks issued in the order 1 2 3 4 5 may execute as 1 4 2 3 5

  28. NVIDIA GPU

  29. Unified Heterogeneous Scheduler & Runtime No MPI inside the node, lock-free DW; cores and GPUs pull work [diagram, per node: the task graph feeds CPU task queues (internal ready tasks, MPI-data-ready tasks) and GPU task queues (GPU-enabled tasks); CPU threads run CPU tasks with PUT/GET against the data warehouse (variables directory); GPU tasks run kernels with PUT/GET against the GPU data warehouse via H2D/D2H streams and stream events; MPI sends/recvs of completed tasks go over the network]

  30. NVIDIA Fermi Host memory to device memory is max 8 GB/second, BUT device memory to cores is 144 GB/second, with less than half of this being achieved on memory-bound applications (BLAS level 1, 2; ~40 GFlops)

  31. Hiding PCIe Latency • NVIDIA CUDA asynchronous API • Asynchronous functions provide: memcopies asynchronous with the CPU; concurrent execution of a kernel and a memcopy • Stream: a sequence of operations that execute in order on the GPU • Operations from different streams can be interleaved • Requires page-locked memory [diagram: normal vs overlapped data transfer and kernel execution]

  32. GPU Task Management With Uintah's knowledge of the task graph, task data can be automatically transferred asynchronously to the device before a GPU task executes. 1) Pin existing host memory with cudaHostRegister() (page-locked buffer) 2) cudaMemcpyAsync(H2D): hostRequires to devRequires 3) Call-back executed, kernel runs: computation on devComputes 4) Component requests the D2H copy; cudaMemcpyAsync(D2H): devComputes to hostComputes 5) Result back on host 6) Free pinned host memory. All device memory allocations and asynchronous transfers are handled automatically; multiple devices on-node are supported; all device data is made available to component code via a convenient interface

  33. DESIGNING FOR EXASCALE • Clear trend towards accelerators, e.g. GPU but also Intel MIC (new NSF "Stampede", 10-15 PF) • Balance factor = flops/bandwidth: high GPU performance is "ok" for stencil-based codes; 2x over multicore CPU estimated and achieved for ICE; similar results by others • CPU/GPU performance is increasing much faster than network/memory bandwidth • GPU performance of the ray-tracing radiation method is 100x CPU • Overlapping and hiding communications is essential

  34. Performance and Scaling Comparisons • Uintah strong scaling results when using: multi-threaded MPI; multi-threaded MPI w/ GPU • GPU-enabled RMCRT • All-to-all communication • 100 rays per cell, 128³ cells • Need to port multi-level CPU tasks to GPU • Cray XK-7, DOE Titan; Thread/MPI: N = 16 AMD Opteron CPU cores; Thread/MPI+GPU: N = 16 AMD Opteron CPU cores + 1 NVIDIA K20 GPU

  35. NVIDIA AmgX Linear Solvers on GPUs • Fast, scalable iterative GPU linear solvers for packages • Flexible toolkit provides a GPU-accelerated Ax = b solver • Simple API for multiple application domains • Multiple GPUs (maybe thousands) with scaling • Key features: Ruge-Stüben algebraic multigrid; Krylov methods: CG, GMRES, BiCGStab; smoothers and solvers: block-Jacobi, Gauss-Seidel, incomplete LU; flexible composition system; MPI support; OpenMP support; flexible and high-level C API; free for non-commercial use • Utah access via the Utah CUDA Center of Excellence

  36. Intel Xeon Phi

  37. Xeon Phi Execution Models

  38. Uintah on Stampede: Host-only Model Incompressible turbulent flow • Intel MPI issues beyond 2048 cores (seg faults) • MVAPICH2 required for larger core counts • Using hypre with a conjugate gradient solver, preconditioned with geometric multigrid, Red-Black Gauss-Seidel relaxation on each patch

  39. Uintah on Stampede: Native Model • Compile with -mmic (cross compiling) • Need to build all the third-party libraries Uintah requires: libxml2, zlib • Ran Uintah natively on the Xeon Phi within 1 day • Single Xeon Phi card

  40. Uintah on Stampede: Offload Model • Use compiler directives (#pragma) • Offload target: #pragma offload target(mic:0) • OpenMP: #pragma omp parallel • Find copy-in/out variables from the task graph • Functions called on the MIC must be declared with __attribute__((target(mic))) • Hard for Uintah to use offload mode: highly templated C++ methods must be rewritten as simple C/C++ so they can be called on the Xeon Phi • Less effort than a GPU port, but still significant work for code as complex as Uintah

  41. Uintah on Stampede: Symmetric Model • Xeon Phi directly calls MPI • Use Pthreads on both the host CPU and the Xeon Phi: 1 MPI process on the host with 16 threads, 1 MPI process on the MIC with up to 120 threads • Currently only Intel MPI is supported: mpiexec.hydra -n 8 ./sus -nthreads 16 : -n 8 ./sus.mic -nthreads 120 • Challenges: different floating-point accuracy on host and co-processor; result consistency; MPI message mismatch from control-related FP operations

  42. Example: Symmetric Model FP Issue p = 0.421874999999999944488848768742172978818416595458984375, c = 0.0026041666666666665221063770019327421323396265506744384765625, b = p/c. Host-only model: rank 0 (CPU) and rank 1 (CPU) both get b = 162; MPI OK. Symmetric model: rank 0 (CPU) gets b = 162 while rank 1 (MIC) gets b = 161.99999999999999; MPI size mismatch. Control-related FP operations must use a consistent accuracy model
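The numbers from the slide show why this bites: b lands within a few ulps of 162, so devices with different FP accuracy (e.g. host SSE vs co-processor fused multiply-add paths) can round the last bit differently, and any message size derived from b then mismatches across ranks. A sketch of the computation (the assertion only checks closeness to 162, since the exact last bit is the whole point of the hazard):

```cpp
#include <cmath>

// The slide's control-related FP computation: both constants are exact
// decimal expansions of doubles, and b = p / c sits a hair from 162.
double computeB() {
  double p =
      0.421874999999999944488848768742172978818416595458984375;
  double c =
      0.0026041666666666665221063770019327421323396265506744384765625;
  return p / c;  // truncating or comparing this across devices is unsafe
}
```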

  43. Scaling Results on Xeon Phi • Multiple MIC cards (symmetric model) • Xeon Phi card: 60 threads per MPI process, 2 MPI processes • Host CPU: 16 threads per MPI process, 1 MPI process • Issue: load imbalance; tasks must be profiled differently on the host and the Xeon Phi

  44. Resilience • Need interfaces at system level to help us consider: • Core failure: reroute tasks • Comms failure: reroute messages • Node failure: need to replicate patches; use an AMR-type approach in which a coarse copy of each patch lives on another node. In 3D a half-resolution copy costs (1/2)³ = 12.5% overhead. Suggested by Qingyu Meng, Mike Heroux and others. • Will explore this from fall 2014 onwards

  45. Summary • DAG abstraction important for achieving scaling • Layered approach very important for not needing to change applications code • Scalability still requires much engineering of the runtime system • General approach very powerful indeed, with obvious applicability to new architectures • DSL approach very important for the future • Scalability still a challenge even with the DAG approach, which works amazingly well even for fluid-structure calculations • GPU development ongoing • The approach used here shows promise for very large core and GPU counts, but using these architectures is an exciting challenge
