Runtime System and Scheduling Support for High-End CPU-GPU Architectures Vignesh Ravi

Runtime System and Scheduling Support for High-End CPU-GPU Architectures Vignesh Ravi Dept. of Computer Science and Engineering Advisor: Gagan Agrawal

The Death of Single-core CPU Scaling • Until 2004: The Landscape of Computing – Moore’s Law • Double the # of transistors • Simply increase clock frequency • Of course, consume more power! • Significantly improved efficiency • Follows Moore’s law • Since 2005, Now and Future… • The free lunch is over! • Single-core clock frequency reaches a plateau • End of Moore’s law … • Alternate processor design required • The rise of multi-core, many-core architectures … • Parallel programming … [Figure: Transistors, Clock Speed, Power Efficiency]

Rise of Multi-core, Many-core … • Multi-core CPUs • Executive-like: more room for control logic • 2 – 12 cores • Clock speed: ~1.8 GHz – 3.3 GHz • Many-core GPUs • Massive arithmetic, least control • Specialized co-processing • In the range of 512 cores • Clock speed: ~1.2 GHz [Figure: GFLOPS]

Rise of Heterogeneous Architectures • Today’s Computing Platforms are Heterogeneous! • New Challenges are Emerging … • Today’s High Performance Computing • Multi-core CPUs, Many-core GPUs are mainstream • Many-core GPUs offer • Excellent “price-performance” & “performance-per-watt” • Financial modeling, Gas and Oil exploration, Medical … • Flavors of Heterogeneous computing • Multi-core CPUs + GPUs connected over PCI-E • Accelerated Processing Units (APU), AMD Fusion • Intel MIC, Sandy Bridge, Nvidia Denver … • Heterogeneous Architectures are pervasive • Supercomputers & Clusters, Clouds, Desktops, Notebooks, Tablets, Mobiles …

New Challenges • Question 1: How to benefit from CPU and GPU simultaneously? CPU/GPU Work Distribution Module • Question 2: Improve utilization of GPUs? Enable Sharing of GPU across diff. apps. • Question 3: Job Scheduling for hetero. clusters? Revisit Job scheduling for CPU-GPU clusters • Question 4: Mechanisms to debug and profile GPU programs? Tools development for GPUs [Diagram: Application(s) on a CPU + GPU Heterogeneous Architecture, with Concurrency control/Synchronization between CPU/GPU]

My Thesis Focus [Diagram with a “Primary Focus” marker over the following components: Tools development for GPUs • Revisit Job scheduling for CPU-GPU clusters • Enable Sharing of GPU across diff. apps. • Concurrency control/Synchronization between CPU/GPU • CPU/GPU Work Distribution Module; all shown between the Application(s) and the CPU + GPU Heterogeneous Architecture]

Thesis Contributions • Runtime Systems and Dynamic Work Distribution for Heterogeneous Systems • Compiler and Runtime Support for Enabling Generalized Reductions on Heterogeneous Systems (ICS 2010) • A Dynamic Scheduling Framework for Emerging Heterogeneous Systems (HiPC 2011) • Job Scheduling for Heterogeneous Clusters • Scheduling Concurrent Applications on a Cluster of CPU-GPU Nodes (CCGrid 2012) • Value-Based Scheduling Framework for Modern Heterogeneous Clusters (Under Submission) • Support for GPU Sharing across Multiple Applications • Supporting GPU Sharing with a Transparent Runtime Consolidation Framework (HPDC 2011)

Today’s Talk • Runtime Systems and Dynamic Work Distribution for Heterogeneous Systems (Pre-Candidacy Work) • Compiler and Runtime Support for Enabling Generalized Reductions on Heterogeneous Systems (ICS 2010) • A Dynamic Scheduling Framework for Emerging Heterogeneous Systems (HiPC 2011) • Job Scheduling for Heterogeneous Clusters (Post-Candidacy Work) • Scheduling Concurrent Applications on a Cluster of CPU-GPU Nodes (CCGrid 2012) • Value-Based Scheduling Framework for Modern Heterogeneous Clusters (Under Submission for SC 2012) • Support for GPU Sharing across Multiple Applications (Pre-Candidacy Work) • Supporting GPU Sharing with a Transparent Runtime Consolidation Framework (HPDC 2011)

Outline of Presentation • Recap of Pre-Candidacy work • Runtime system and Work Distribution • GPU Sharing Through Runtime Consolidation Framework • Post-Candidacy work • Concurrent Job Scheduling to Improve Global Throughput • Value-based Job Scheduling • Future Work • Thesis Conclusions

Motivation • In HPC, demand for computing is ever increasing • CPU+GPU platforms expose huge raw processing power • Top 6 Supercomputers • Heterogeneous - utilization is under ~50% • Homogeneous - utilization is about 80% • Application development for multi-core CPU and GPU is still independent • “No established mechanism” to exploit aggregate power • Can computations benefit from simultaneously utilizing CPU and GPU?

Runtime System and Work Distribution for CPU-GPU Architectures • Focus on specific classes of computation patterns • Generalized Reduction Structure • Structured Grid Computations • Improve application developer productivity • Facilitate High-Level API support • Hide parallelization difficulties through runtime support • Improve efficiency • Dynamic work distribution between CPU & GPU • Show significant performance improvements • Up to 63% for generalized reduction structures • Up to 75% for structured grid computations

Motivation: Sharing a GPU is necessary, but how? • Emergence of Cloud – “Pay-as-you-go” model • Cluster instances, High-speed interconnects for HPC users • Amazon, Nimbix, SoftLayer - GPU instances • Sharing is the basis of cloud, GPU no exception • Multiple virtual machines may share a physical node • Modern GPUs are more expensive than multi-core CPUs • Fermi cards with 6 GB memory, $4,000 • Need better resource utilization • Modern GPUs expose a high degree of parallelism • Applications may not utilize full potential

GPU Sharing Through Runtime Consolidation Framework • Software Framework to enable GPU Sharing • Extended Open Source Call Interception Tool, gVirtuS • GPU sharing through kernel consolidation & virtual context • Basic GPU-Sharing Mechanisms • Time- and Space-Sharing • Solutions to GPU Kernel Consolidation Problem • Affinity score, to predict benefit upon consolidation • Kernel Molding policies, to handle high resource contention • Overall scheduling algorithm for multiple GPUs • Show significant global throughput improvements • Up to 50% improvement using advanced sharing policies

Motivation • Revisit Scheduling problems for CPU-GPU clusters • Exploit portability offered by models like OpenCL • Automatic mapping of jobs to resources • Desirable advanced scheduling considerations • Software Stack to program CPU-GPU arch. has evolved • Combination of (Pthreads/OpenMP…) + (CUDA/Stream) • Now, OpenCL is becoming more popular • OpenCL, a device-agnostic platform • Offers great flexibility with portable solutions • Write kernel once, execute on any device • Supercomputers and Cloud environments are typically “Shared” • Accelerate a set of applications as opposed to a single application • “Job Scheduler” is a critical component of software stack • Today’s schedulers (like TORQUE) for hetero. clusters: • DO NOT exploit the portability offered by OpenCL • User-guided mapping of jobs to hetero. resources • Do not consider desirable & advanced scheduling possibilities

Problem Formulations Problem Goal: • Accelerate a set of applications on CPU-GPU cluster • Each node has two resources: A Multi-core CPU and a GPU • Map applications to resources to: • Maximize overall system throughput • Minimize application latency Scheduling Formulations: 1) Single-Node, Single-Resource Allocation & Scheduling 2) Multi-Node, Multi-Resource Allocation & Scheduling

Scheduling Formulations • Single-Node, Single-Resource Allocation & Scheduling • Allocates a multi-core CPU or a GPU from a node in cluster • Benchmarks like Rodinia (UV) & Parboil (UIUC) contain 1-node apps. • Limited mechanisms to exploit CPU+GPU simultaneously • Exploit the portability offered by OpenCL prog. model • Multi-Node, Multi-Resource Allocation & Scheduling • In addition, allows CPU+GPU allocation • Desirable in future to allow flexibility in acceleration of applications • In addition, allows multiple node allocation per job • MATE-CG [IPDPS’12], a framework for Map-Reduce class of apps., allows such implementations

Challenges and Solution Approach Decision Making Challenges: • Allocate/Map to CPU-only, GPU-only, or CPU+GPU? • Wait for optimal resource (involves queuing delay) • Assign to non-optimal resource (involves penalty) • Always allocating CPU+GPU may affect global throughput • Should consider other possibilities like CPU-only or GPU-only • Always allocate requested # of nodes? • May increase wait time, can consider allocation of fewer nodes Solution Approach: • Take different levels of user inputs (relative speedups, execution times…) • Design scheduling schemes for each scheduling formulation

Scheduling Schemes for First Formulation Two Input Categories & Three Schemes: • Categories are based on the amount of input expected from the user • Category 1: Relative Multi-core (MP) and GPU (GP) performance as input • Scheme1: Relative Speedup based w/ Aggressive Option (RSA) • Scheme2: Relative Speedup based w/ Conservative Option (RSC) • Category 2: Additionally, sequential CPU exec. Time (SQ) • Scheme3: Adaptive Shortest Job First (ASJF)

Relative-Speedup Aggressive (RSA) or Conservative (RSC) • Takes multi-core and GPU speedup as input: N jobs, MP[n], GP[n] • Create CPU/GPU job queues (CJQ, GJQ) • Enqueue jobs in queues based on (GP − MP): map jobs to optimal resource queue • Sort CJQ and GJQ in descending order • R = GetNextResourceAvailable() • If R is a GPU and GJQ is not empty: assign GJQ top to R • If R is a GPU and GJQ is empty: • Aggressive: assign CJQ bottom to R (minimizes penalty) • Conservative: wait for CPU
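
The RSA/RSC decision above can be sketched in Python. This is a minimal illustration and not the thesis implementation: the dictionary keys `name`, `mp`, `gp` and the function names are hypothetical, the sort key (speedup difference) is an assumption, and the CPU branch is assumed to mirror the GPU branch shown on the slide.

```python
def build_queues(jobs):
    """Split jobs into a CPU queue (CJQ) and a GPU queue (GJQ).

    jobs: list of dicts with hypothetical keys 'name', 'mp' (multi-core
    speedup) and 'gp' (GPU speedup). Each queue is sorted in descending
    order of the job's advantage on its optimal resource (an assumed key),
    so the bottom of a queue is the job that gives up the least when moved.
    """
    cjq = [j for j in jobs if j["mp"] >= j["gp"]]
    gjq = [j for j in jobs if j["gp"] > j["mp"]]
    cjq.sort(key=lambda j: j["mp"] - j["gp"], reverse=True)
    gjq.sort(key=lambda j: j["gp"] - j["mp"], reverse=True)
    return cjq, gjq


def next_job(resource, cjq, gjq, aggressive):
    """Pick a job for a free resource ('CPU' or 'GPU').

    Returns None when the resource should stay idle (conservative option,
    or nothing left to run).
    """
    own, other = (gjq, cjq) if resource == "GPU" else (cjq, gjq)
    if own:
        return own.pop(0)      # top of the resource's own queue
    if aggressive and other:
        return other.pop()     # bottom of the other queue (RSA)
    return None                # RSC: wait for the optimal resource


jobs = [
    {"name": "a", "mp": 8, "gp": 2},
    {"name": "b", "mp": 4, "gp": 3},
    {"name": "c", "mp": 2, "gp": 10},
]
cjq, gjq = build_queues(jobs)
first = next_job("GPU", cjq, gjq, aggressive=True)
print(first["name"])
```

With `aggressive=False` the second call for a free GPU returns `None`, which corresponds to the conservative option of leaving the GPU idle until a GPU job arrives.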

Adaptive Shortest Job First (ASJF) • Input: N jobs, MP[n], GP[n], SQ[n] • Create CJQ, GJQ • Enqueue jobs in queues based on (GP − MP) • Sort CJQ and GJQ in ascending order of exec. time (minimizes latency for short jobs) • R = GetNextResourceAvailable() • If R is a GPU and GJQ is not empty: assign GJQ top to R • If R is a GPU and GJQ is empty (automatic switch between aggressive and conservative option): • T1 = GetMinWaitTimeForNextCPU() • T2k = GetJobWithMinPenOnGPU(CJQ) • If T1 > T2k: assign CJQk to R • Otherwise: wait for CPU to become free or for GPU jobs
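
A small Python sketch of the ASJF switch for an idle GPU follows. It assumes that a job's execution time on a resource is SQ divided by the relative speedup, and that T2k is the extra time job k would take on the GPU compared with the CPU; both readings, as well as all names, are assumptions made for illustration.

```python
def run_time(job, resource):
    """Estimated execution time: sequential time divided by the speedup."""
    speedup = job["gp"] if resource == "GPU" else job["mp"]
    return job["sq"] / speedup


def gpu_penalty(job):
    """Extra time a CPU-optimal job pays when it runs on the GPU."""
    return run_time(job, "GPU") - run_time(job, "CPU")


def asjf_pick_for_gpu(cjq, gjq, min_cpu_wait):
    """Choose a job for a free GPU, or None to leave the GPU idle.

    min_cpu_wait plays the role of T1 (time until the next CPU is free).
    """
    if gjq:
        job = min(gjq, key=lambda j: run_time(j, "GPU"))  # shortest job first
        gjq.remove(job)
        return job
    if not cjq:
        return None
    candidate = min(cjq, key=gpu_penalty)                 # job k with minimum penalty
    if min_cpu_wait > gpu_penalty(candidate):             # T1 > T2k: act aggressively
        cjq.remove(candidate)
        return candidate
    return None                                           # otherwise act conservatively


cjq = [
    {"name": "x", "sq": 100.0, "mp": 10.0, "gp": 5.0},
    {"name": "y", "sq": 100.0, "mp": 4.0, "gp": 2.0},
]
picked = asjf_pick_for_gpu(cjq, [], min_cpu_wait=15.0)
print(picked["name"] if picked else "GPU stays idle")
```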

Scheduling Scheme for Second Formulation Solution Approach: • Flexibly schedule on CPU-only, GPU-only, or CPU+GPU • Molding the # of nodes requested by job • Consider allocating ½ or ¼th of requested nodes • Inputs from User: • Execution times of CPU-only, GPU-only, CPU+GPU • Execution times of jobs with n, n/2, n/4 nodes • Such app. Information can also be obtained from profiles

Flexible Moldable Scheduling Scheme (FMS) • Input: N jobs, exec. times… • Group jobs with # of nodes as the index (minimizes resource fragmentation) • Sort each group based on exec. time of CPU+GPU version (helps co-locate CPU and GPU jobs on the same node) • Pick a pair of jobs to schedule in order of sorting (gives global view to co-locate on same node) • Find the fastest completion option from T(i,n,C), T(i,n,G), T(i,n,CG) for each job • If the choice is C for one job & G for the other: co-locate jobs on same set of nodes • If the choice is the same resource for both jobs, (C,C), (G,G), or (CG,CG): are 2N nodes available? • Yes: schedule pair of jobs in parallel on 2N nodes • No: consider molding by res. type if CG; consider molding # of nodes for the next job
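
The pairing decision in FMS can be illustrated with a short Python sketch. It covers only the choice among co-location, parallel execution on 2N nodes, and molding; the plan labels, function names, and the treatment of mixed choices such as (C, CG) as conflicting are assumptions, and the molding step itself is not modeled.

```python
def fastest_option(times):
    """times: dict mapping 'C', 'G', 'CG' to the job's exec. time on n nodes."""
    return min(times, key=times.get)


def plan_pair(times_a, times_b, n, free_nodes):
    """Decide how to place a pair of jobs that each request n nodes.

    Returns a (decision, option_a, option_b, nodes_used) tuple, where decision
    is one of the hypothetical labels 'co-locate', 'parallel', or 'mold'.
    """
    a, b = fastest_option(times_a), fastest_option(times_b)
    if {a, b} == {"C", "G"}:
        # One job prefers the CPU and the other the GPU: share the same n nodes.
        return ("co-locate", a, b, n)
    if free_nodes >= 2 * n:
        # Both want an overlapping resource, but enough nodes are free.
        return ("parallel", a, b, 2 * n)
    # Not enough nodes: a scheduler would now consider molding the resource
    # type (if CG) or the number of nodes; that step is left out here.
    return ("mold", a, b, free_nodes)


job_a = {"C": 10.0, "G": 20.0, "CG": 15.0}
job_b = {"C": 30.0, "G": 5.0, "CG": 8.0}
print(plan_pair(job_a, job_b, n=4, free_nodes=4))
```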

Cluster Hardware Setup • Cluster of 16 CPU-GPU nodes • Each CPU is an 8-core Intel Xeon E5520 (2.27 GHz) • Each GPU is an Nvidia Tesla C2050 (1.15 GHz) • CPU Main Memory – 48 GB • GPU Device Memory – 3 GB • Machines are connected through InfiniBand

Benchmarks • Single-Node Jobs • We use 10 benchmarks • Scientific, Financial, Datamining, Image Processing applications • Run each benchmark with 3 different exec. configurations • Overall, a pool of 30 jobs • Multi-Node Jobs • We use 3 applications • Gridding kernel, Expectation-Maximization, PageRank • Applications run with 2 different datasets and on 3 different node numbers • Overall, a pool of 18 jobs

Baselines & Metrics • Baseline for Single-Node Jobs • Blind Round Robin (BRR) • Manual Optimal (Exhaustive search, Upper Bound) • Baseline for Multi-Node Jobs • TORQUE, a widely used resource manager for hetero. clusters • Minimum Completion Time (MCT) [Maheswaran et al., HCW’99] • Metrics • Completion Time (Comp. Time) • Application Latency: • Non-optimal Assignment (Ave. NOA. Lat) • Queuing Delay (Ave. QD Lat.) • Maximum Idle Time (Max. Idle Time)

Single-Node Job Results • Uniform CPU-GPU Job Mix • 24 jobs on 2 nodes • Proposed schemes: 108% better than BRR, within 12% of Manual Optimal • Tradeoff between non-optimal penalty vs wait-time for resource • BRR has the highest latency • RSA, non-optimal penalty • RSC, high queue delay • ASJF as good as Manual Optimal • CPU-biased Job Mix • BRR, very high idle times • RSC, can be very high too • RSA has the best utilization among proposed schemes [Figures: one plot for each of the 4 different metrics]

Multi-Node Job Results • Varying Job Execution Lengths: Short Job (SJ), Long Job (LJ) • 32 jobs on 16 nodes • FMS, 42% better than best of Torque or MCT • Each type of molding gives reasonable improvement • Our schemes utilize the resources better → high throughput • Intelligent in deciding whether to wait for a res. or mold the job for a smaller res. • Varying Resource Request Size: Small Request (SJ), Large Request (LJ) • FMS, 32% better than best of Torque or MCT • Benefit from ResType Molding is better than NumNodes Molding

Summary • Revisit scheduling problems on CPU-GPU clusters • Goal to improve aggregate throughput • Single-node, single-resource scheduling problem • Multi-node, multi-resource scheduling problem • Developed novel scheduling schemes • Exploit portability offered by OpenCL • Automatic mapping of jobs to hetero. resources • RSA, RSC, and ASJF for single-node jobs • Flexible Molding Scheduling (FMS) for multi-node jobs • Significant improvement over state-of-the-art

Motivation • Revisit Scheduling problems for CPU-GPU clusters • Exploit portability offered by models like OpenCL • Automatic mapping of jobs to resources • Market-based scheduling considerations • Schemes to enable automatic sharing of resources • Previously, goal to improve overall global throughput & latency • Other desirable goals for supercomputer and cloud environments • Market-based scheduling goals (providers’ profit and user-satisfaction) • E.g., MOAB (with SLAs) for supercomputers and large clusters • E.g., Amazon classifies users as Free, Spot, On-Demand, Reserved • Each user has different levels of importance and satisfaction • Supercomputers and clouds engage massively parallel resources • Multi-core CPUs with 16 cores, GPUs with 512 cores • Recent announcements of MIC (about 50-60 cores) in Stampede • Efficient resource utilization is important • Today’s schedulers (like TORQUE) for hetero. clusters: • No notion of market-based scheduling • User-guided mapping of jobs to hetero. resources • Lack ability/schemes to share massively parallel resources

Value Function • Each job is attached with a value function • Linear-Decay Value Function [Irwin et al., HPDC’04] • Maximum Value → Importance/priority • Decay Rate → Urgency • Value function with different shapes • Can represent different SLAs, e.g., step function • Yield is obtained after job completion, defined as Yield = maxValue – decay * delay • Delay can be a sum of any of four components • Queuing, non-optimal penalty, sharing 1-core penalty, sharing CPU/GPU penalty • Yield represents both “Providers’ profit” as well as “User-satisfaction” • We believe that the value function provides a rich, yet simple, formulation for market-based scheduling
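
As a worked illustration of the yield definition, the following Python sketch evaluates the linear-decay form Yield = maxValue – decay * delay, plus one possible step-shaped variant. The function and parameter names are hypothetical, and the step function's `deadline` and `late_value` parameters are assumptions about how a step-shaped SLA could be expressed.

```python
def total_delay(queuing=0.0, non_optimal=0.0, core_sharing=0.0, resource_sharing=0.0):
    """Delay as the sum of the four components named on the slide."""
    return queuing + non_optimal + core_sharing + resource_sharing


def linear_yield(max_value, decay, delay):
    """Linear-decay value function: Yield = maxValue - decay * delay."""
    return max_value - decay * delay


def step_yield(max_value, deadline, delay, late_value=0.0):
    """A step-shaped value function (assumed form): full value up to a deadline."""
    return max_value if delay <= deadline else late_value


delay = total_delay(queuing=10.0, non_optimal=5.0)
print(linear_yield(max_value=100.0, decay=2.0, delay=delay))  # 100 - 2 * 15 = 70.0
```

Note that the linear form can go negative for long delays, which is what the later sharing heuristic looks for in the pending queues.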

Scheduling Problem Formulation • Given hetero. cluster with each node containing: • 1 multi-core CPU and 1 GPU • Schedule a set of jobs on the cluster • To maximize the aggregate yield • Allocates a multi-core CPU or a GPU from a node in cluster • Does not allocate both multi-core CPU and GPU to a job (consideration for future work) • Does not allocate multiple nodes to a job (consideration for future work) • Exploit the portability offered by OpenCL prog. model • Flexibly map the job on to either CPU or GPU • Allow sharing of multi-core CPU or GPU • Up to two jobs per resource • Limited to space-sharing

Overall Scheduling Approach • Jobs arrive in batches • Initial Mapping and Ordering: push each job into its optimal resource queue (CPU Queue or GPU Queue) and sort jobs to improve yield • When both job queues are non-empty: execute on CPU, execute on GPU • When a resource (CPU) is free, but its job (CPU) queue is empty, the resource will be idle • Propose various schemes for dynamic re-mapping

Heuristics for Different Stages • Initial mapping & Ordering of queues • Initial assignment of jobs to queue: Based on optimal walltime • Sorting of jobs in the queue: Adapt Reward [Earlier Work: HPDC’04] to our formulation • Dynamic Re-mapping of jobs to Non-optimal Resource • Uncoordinated Schemes (Three new heuristics) • Last Optimal Reward (LOR) • First Non-Optimal Reward (FNOR) • Last Non-optimal Reward Penalty (LNORP) • Coordinated Schemes (One new heuristic) • Coordinated Least Penalty (CORLP) • Sharing jobs on a single type of resource (One New heuristic) • Scalability-Decay Factor, Top K fraction [K is tunable]

Sorting Jobs in the Queues • Reward heuristic is based on two market-based terms • Present (Discounted Gain) Value • Opportunity Cost • Present Value (PV) • Value gain after time ‘t’, after discounting risk of running the job • Receiving $1,000 now is worth more than $1,000 five years from now • Shorter the job, lower the risk • \( \mathrm{PV}_i / \mathrm{OptimalWT}_i = \mathrm{yield}_i / (1 + \mathrm{dis\_rate} \cdot \mathrm{OptimalWT}_i) \) • Opportunity Cost (Cost) • Degradation cost of an alternative to pursue a certain action • Prefer high decay jobs over low decay jobs • In our case, cost of choosing a job ‘i’ over a job ‘j’ • \( \mathrm{Cost}_i / \mathrm{OptimalWT}_i = \sum_{j=0}^{n} \mathrm{decay}_j - \mathrm{decay}_i \) • Reward • \( \mathrm{Reward}_i = (\mathrm{PV}_i - \mathrm{Cost}_i) / \mathrm{OptimalWT}_i \) • Choose the job with highest reward to schedule on the corresponding resource
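
The reward computation can be sketched in Python by taking the three expressions on this slide at face value: the reward is the per-walltime present value minus the per-walltime opportunity cost. The list-based inputs and function names are illustrative assumptions, not the framework's actual interface.

```python
def reward(i, yields, decays, optimal_wts, dis_rate):
    """Reward of job i among the jobs currently in one queue.

    Follows the slide's expressions literally:
      PV_i / OptimalWT_i   = yield_i / (1 + dis_rate * OptimalWT_i)
      Cost_i / OptimalWT_i = sum_j(decay_j) - decay_i
      Reward_i             = (PV_i - Cost_i) / OptimalWT_i
    """
    wt = optimal_wts[i]
    pv_per_wt = yields[i] / (1.0 + dis_rate * wt)
    cost_per_wt = sum(decays) - decays[i]
    return pv_per_wt - cost_per_wt


def order_by_reward(yields, decays, optimal_wts, dis_rate):
    """Job indices sorted from highest to lowest reward."""
    indices = range(len(yields))
    return sorted(
        indices,
        key=lambda i: reward(i, yields, decays, optimal_wts, dis_rate),
        reverse=True,
    )


yields = [100.0, 50.0]
decays = [2.0, 1.0]
optimal_wts = [10.0, 5.0]
print(order_by_reward(yields, decays, optimal_wts, dis_rate=0.1))
```

In this example the high-yield, high-decay job sorts first, in line with the slide's preference for high-decay jobs.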

Dynamic Remapping – Uncoordinated Schemes • Only when the resource is idle, and job queue is empty • Idle resources reduce utilization, hence overall yield (considering waiting jobs in other queue) • Dynamically assign a job to non-optimal resource from optimal queue for that job • Three schemes based on two key aspects • Which job will have best reward on non-optimal resource? • Which job will suffer least reward penalty? • 1. Last Optimal Reward (LOR) • Exploits “Reward score” computed on each queue for each job • Simply chooses job with least reward from the optimal resource queue • That job already has the least reward on its optimal resource, so moving it carries the least risk • O(N) to seek the last job in the queue

Dynamic Remapping – Uncoordinated Schemes • 2. First Non-Optimal Reward (FNOR) • Compute the reward the job could produce on the non-optimal resource • Explicitly considers non-optimal penalty • Job with highest reward on non-optimal resource • O(N log(N)) to sort the newly computed reward • \( \mathrm{Suff\_factor}_i = \mathrm{NonOptimalWT}_i / \mathrm{OptimalWT}_i \) • \( \mathrm{NonOptimalReward}_i = \mathrm{OptimalReward}_i / \mathrm{Suff\_factor}_i \) • 3. Last Non-Optimal Reward Penalty (LNORP) • FNOR fails to consider reward degradation • LNORP computes reward degradation on non-optimal resource • Moves the job with least reward degradation • \( \mathrm{NonOptimalRewardPenalty}_i = \mathrm{OptimalReward}_i - \mathrm{NonOptimalReward}_i \)
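
The three uncoordinated heuristics (LOR, FNOR, LNORP) differ only in the score used to pick a job from the non-empty queue. The Python sketch below applies the suffering-factor formulas from this slide; the dictionary keys and function names are hypothetical, and a simple linear scan is used instead of the sort mentioned on the slide.

```python
def suff_factor(job):
    """Suff_factor = NonOptimalWT / OptimalWT."""
    return job["non_opt_wt"] / job["opt_wt"]


def non_opt_reward(job):
    """NonOptimalReward = OptimalReward / Suff_factor."""
    return job["opt_reward"] / suff_factor(job)


def reward_penalty(job):
    """NonOptimalRewardPenalty = OptimalReward - NonOptimalReward."""
    return job["opt_reward"] - non_opt_reward(job)


def pick_lor(queue):
    """Last Optimal Reward: job with the least reward on its optimal resource."""
    return min(queue, key=lambda j: j["opt_reward"])


def pick_fnor(queue):
    """First Non-Optimal Reward: job with the highest reward on the other resource."""
    return max(queue, key=non_opt_reward)


def pick_lnorp(queue):
    """Last Non-Optimal Reward Penalty: job whose reward degrades the least."""
    return min(queue, key=reward_penalty)


queue = [
    {"name": "A", "opt_reward": 10.0, "opt_wt": 10.0, "non_opt_wt": 20.0},
    {"name": "B", "opt_reward": 4.0, "opt_wt": 10.0, "non_opt_wt": 40.0},
    {"name": "C", "opt_reward": 6.0, "opt_wt": 10.0, "non_opt_wt": 15.0},
]
for pick in (pick_lor, pick_fnor, pick_lnorp):
    print(pick.__name__, pick(queue)["name"])
```

The example queue is chosen so that each heuristic selects a different job, which shows how the choice of score changes the remapping decision.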

Dynamic Remapping – Coordinated Scheme • Even when resource is not idle, and job queue is non-empty • May be necessary to move job from one queue to another due to imbalance • Better global view of both the queues • Factors affecting imbalance: • Decay rates of jobs across queues • Execution lengths (or queuing delays) of jobs across queues • For coordination across queues: • Determine when coordination is required • If coordination is required, heuristic for “which” job to move • Detecting when coordination is required • Total Queuing-Delay Decay-Rate Product (TQDP), for each queue ‘i’: \( \mathrm{TQDP}_i = \sum_{j=0}^{n} \mathrm{Queuing\_delay}_j \cdot \mathrm{decay}_j \) • Heuristic for picking a job to move • Move the job with least non-optimal penalty • Coordinated Least Penalty (CORLP)
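
A possible reading of the coordinated scheme is sketched below in Python. The TQDP sum follows the slide; everything else is an assumption made for illustration: a job's queuing delay is taken as the total optimal walltime of the jobs ahead of it, imbalance is detected with a hypothetical `imbalance_ratio` threshold, and the non-optimal penalty is taken as the walltime increase on the other resource.

```python
def tqdp(queue):
    """Total Queuing-Delay Decay-Rate Product of one queue.

    Assumes each job waits for the optimal walltime of all jobs ahead of it.
    """
    total, elapsed = 0.0, 0.0
    for job in queue:
        total += elapsed * job["decay"]
        elapsed += job["opt_wt"]
    return total


def corlp_move(cpu_q, gpu_q, imbalance_ratio=2.0):
    """Move one job from the heavier queue to the lighter one if imbalanced.

    Returns the moved job, or None when no coordination is needed.
    imbalance_ratio is a hypothetical tuning knob, not from the slides.
    """
    t_cpu, t_gpu = tqdp(cpu_q), tqdp(gpu_q)
    if t_cpu >= t_gpu:
        heavy, light, t_heavy, t_light = cpu_q, gpu_q, t_cpu, t_gpu
    else:
        heavy, light, t_heavy, t_light = gpu_q, cpu_q, t_gpu, t_cpu
    if not heavy or t_heavy <= imbalance_ratio * t_light:
        return None
    # Coordinated Least Penalty: move the job that loses the least.
    job = min(heavy, key=lambda j: j["non_opt_wt"] - j["opt_wt"])
    heavy.remove(job)
    light.append(job)  # a real scheduler would re-sort the queue afterwards
    return job


cpu_q = [
    {"name": "a", "decay": 1.0, "opt_wt": 10.0, "non_opt_wt": 30.0},
    {"name": "b", "decay": 2.0, "opt_wt": 10.0, "non_opt_wt": 12.0},
    {"name": "c", "decay": 3.0, "opt_wt": 10.0, "non_opt_wt": 25.0},
]
gpu_q = [{"name": "d", "decay": 1.0, "opt_wt": 5.0, "non_opt_wt": 10.0}]
moved = corlp_move(cpu_q, gpu_q)
print(moved["name"] if moved else "queues are balanced")
```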

Heuristic for Sharing • Allow up to two jobs to space-share a resource • E.g., on a multi-core CPU with 8 cores, 2 jobs each use 4 cores • Penalties from time-sharing can be high due to more resource contention • Factors affecting sharing • Jobs will use half the resources, will incur a slowdown • On the other hand, more resources may be available • Jobs/applications • Can be categorized as low, medium, high scaling (based on models/profiling) • Some jobs are less urgent than others • “When” to enable sharing? • Large fraction of jobs in pending queues with negative yield • “Who” are the candidates to share? (Scalability-DecayRate factor) • Jobs grouped in the order of low to high scalability • Within each group, jobs are ordered by decay rate • Pick top K fraction of jobs, ‘K’ is tunable (low scalability, low decay)
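
The "when" and "who" questions of the sharing heuristic can be sketched in Python as follows. The scaling categories and the top-K selection come from the slide; the `negative_fraction` threshold, the `projected_yield` field, and the function names are hypothetical placeholders.

```python
SCALING_ORDER = {"low": 0, "medium": 1, "high": 2}


def should_share(pending, negative_fraction=0.5):
    """'When': enable sharing if a large fraction of pending jobs has negative yield.

    negative_fraction is an assumed threshold; the slides do not give a value.
    """
    if not pending:
        return False
    negative = sum(1 for job in pending if job["projected_yield"] < 0)
    return negative / len(pending) >= negative_fraction


def sharing_candidates(pending, k):
    """'Who': top K fraction after ordering by (scalability, decay rate).

    Low-scalability, low-decay jobs come first because they lose the least
    from running on half a resource and are the least urgent.
    """
    ordered = sorted(pending, key=lambda j: (SCALING_ORDER[j["scaling"]], j["decay"]))
    return ordered[: int(len(ordered) * k)]


pending = [
    {"name": "p", "scaling": "high", "decay": 1.0, "projected_yield": -5.0},
    {"name": "q", "scaling": "low", "decay": 3.0, "projected_yield": -2.0},
    {"name": "r", "scaling": "low", "decay": 1.0, "projected_yield": 4.0},
    {"name": "s", "scaling": "medium", "decay": 2.0, "projected_yield": -1.0},
]
if should_share(pending):
    print([job["name"] for job in sharing_candidates(pending, k=0.5)])
```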

High-Level Scheduler Framework Design [Diagram: • Master Node: Cluster Level Scheduler with Submission Queue, Scheduling Schemes & Policies, Pending Queues, Execution Queues, Finished Queues, and a TCP Communicator • Compute Nodes (labeled “Register” toward the Master Node), each with a TCP Communicator and a Node Level Scheduler • Node Level Scheduler runs CPU Jobs Exec. Thread(s) on the Multi-core CPU and GPU Jobs Exec. Thread(s) on the GPU through the GPU Consolidation Framework]

GPU Sharing Framework [Diagram: • Front End: CUDA App1 and CUDA App2, each with an Interception Library; workloads arrive from the Frontend • Front End – Back End Communication Channel • Back End: BackEnd Server queues workloads to the Dispatcher • GPU Consolidation Framework: Dispatcher queues workloads to the Virtual Contexts; Ready Queue; Workload Consolidators • CUDA Runtime and CUDA Driver • GPU1 … GPUn]

Cluster Hardware Setup • Cluster of 16 CPU-GPU nodes • Each CPU is an 8-core Intel Xeon E5520 (2.27 GHz), Main memory 48 GB • Each GPU is an Nvidia Tesla C2050 (1.15 GHz), Device memory 3 GB • Benchmarks • We use 10 benchmarks • Scientific, Financial, Datamining, Image Processing applications • Run each benchmark with 3 different exec. configurations • Overall, a pool of 30 jobs • Baselines • TORQUE, a widely used resource manager for hetero. clusters • Minimum Completion Time (MCT) [Maheswaran et al., HCW’99] • Metrics • Completion Time • Application Latency • Average Yield

Comparison with Torque-based Metrics • Baselines and our schemes use two different sets of metrics • See how our schemes perform with Torque-based metrics • In all cases, we run 256 jobs on a 16-node cluster • Efficient use of resources (no idle time) • Idle time outweighs non-optimal penalty • Worse with biased-mix (BM) • Our schemes may prefer short jobs, reducing latency • Also minimizes non-optimal penalty • Also reduces queuing delay [Figures annotated: 20% better, 22% better, 10% better]

Results with Average Yield Metric • Varying CPU-GPU Job Mix: up to 8.8x better • Biased cases: very high improvement • More room for idle times and dynamic mapping • 2.3x better even for uniform mix • Torque, no notion of value • Our schemes order the jobs for yield • Eliminates the idle time for resources • Impact of Value Decay Functions (25% CPU Jobs, 75% GPU Jobs): up to 6.9x better • Adaptability of the proposed schemes to different shapes of value functions: up to 3.8x better • Step decay is more coarse-grained, hence improvement is better

Results with Average Yield Metric • Impact of Varying Load: up to 8.2x better • As load increases, yield from baselines decreases linearly • Proposed schemes achieve initially increased yield and then sustained yield • As they try to maximize the yield • Coordinated vs Uncoordinated Schemes: up to 78% better • Why do we need coordination? • Imbalance in decay rate or queuing delays across queues • As the imbalance increases, improvement from CORLP increases

Yield Improvements from Sharing • Effect of Sharing K Fraction (fraction of jobs to share) • Benefit from freeing a resource is always offset by the slowdown incurred by sharing jobs • Benefit increases up to a point, then decreases (K=0.5 in this case) • Emphasizes careful selection of K Fraction • Up to 23% improvement due to sharing

Overhead of Sharing a CPU Core • A CPU core is shared between a CPU job and a GPU job scheduled on the same node • Overhead is within 10% • Variation depends on the amount or frequency of data transfer/communication between CPU and GPU

