Scheduling on Heterogeneous Multicore Processors Using Architectural Signatures

Daniel Shelepov and Alexandra Fedorova
School of Computing Science, Simon Fraser University, Vancouver, Canada

Architectural Signatures in a Nutshell
Task: schedule jobs appropriately across a variety of available cores.
Caveats: the scheduler doesn’t know job behaviour a priori, and the approach must scale to potentially hundreds of cores.
Our approach: analyze job performance offline, describe the findings in the job’s architectural signature, and let the scheduler use signatures to make intelligent core assignment decisions.

Talk Outline
Background
Methodology
Results
Summary and Future Work

Background: Heterogeneous CPUs
Heterogeneous CPUs contain several types of cores.
Simple vs. complex cores differ in cache size, issue width, presence of advanced features, and power consumption.
Cores may also be specialized (example: many FPUs).
All cores expose a common ISA.
A chip may contain 100s or 1000s of cores (“manycore”).
Bottom line: better efficiency = saved power.
[Figure: now, homogeneous multicore CPUs; in the future, heterogeneous multi- and manycore CPUs that mix complex, simple, and specialized cores]

Background: Heterogeneous Scheduling
The scheduler needs to be aware of the underlying core features and of job performance on the various cores.
Otherwise, no informed scheduling decision can be made, and there is no benefit from heterogeneity.
[Figure: a job reaching a scheduler marked with a question mark]

Architectural Signature Approach
A signature is provided along with the job binary.
Signatures are constructed offline, are microarchitecture-independent, and provide guidance for selecting appropriate cores.
[Figure: jobs reaching a scheduler marked with a check mark]

Constructing Signatures
PREDICTION MODEL: Create a model for generating meaningful performance-predicting metrics from collected profiling data.
OFFLINE PROFILING: Collect microarchitecture-independent profiling data. Examples: instruction mix, memory access patterns.
OFFLINE ANALYSIS: Generate performance-predicting metrics that a scheduler is able to use. Examples: optimal cache size, inherent ILP, clock speed sensitivity.
SCHEDULING: Interpret performance-predicting metrics and schedule.

Case Study: Clock Speed Sensitivity
Frequency changes affect different jobs differently. Clock speed sensitivity is a way to capture these differences.
[Figure: completion time at different clock speeds]

Offline Profiling
We use MICA, a custom toolkit for Pin by Hoste and Eeckhout [2] (http://trappist.elis.ugent.be/~kehoste/MICA/).
MICA gathers a variety of microarchitecture-independent metrics.
For clock speed sensitivity, we want reuse distance data.

Offline Analysis
Reuse distances are used to estimate abstract L2 cache miss rates.
L2 cache miss rates are used to estimate clock speed elasticity, a metric that puts a number on sensitivity. This requires a prediction model for elasticity as a function of cache miss rate (see next slide).
Elasticity values are placed into the architectural signature.
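
The first step, reuse distances to miss rate, can be sketched as follows. This is a minimal illustration that assumes a fully associative LRU cache, where an access hits only if its reuse distance is smaller than the number of cache lines. The histogram layout, the function name, and all numbers are hypothetical and do not reflect MICA's actual output format.

```python
import math

def estimate_miss_rate(reuse_histogram, cache_lines):
    """Estimate the miss rate of a fully associative LRU cache (an assumption).

    reuse_histogram maps a reuse distance (number of distinct cache lines
    touched between two accesses to the same line) to the number of accesses
    observed at that distance. math.inf marks first-time (cold) accesses.
    """
    total = sum(reuse_histogram.values())
    if total == 0:
        raise ValueError("empty reuse-distance histogram")
    # Under LRU, an access misses when at least cache_lines distinct lines
    # were touched since the previous access to the same line.
    misses = sum(count for distance, count in reuse_histogram.items()
                 if distance >= cache_lines)
    return misses / total

# Hypothetical profile with 1000 accesses in total.
histogram = {4: 600, 1024: 250, 65536: 100, math.inf: 50}
# Hypothetical 1 MB L2 cache with 64-byte lines: 16384 lines.
l2_lines = (1024 * 1024) // 64
miss_rate = estimate_miss_rate(histogram, l2_lines)
print(miss_rate)
```

Because the histogram is independent of any particular cache, the same profile can be evaluated again for a different cache size without re-profiling the job.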

Prediction Model
The graph shows a mapping of SPEC CPU benchmarks by estimated L2 miss rate and clock speed elasticity, ranging from more sensitive to less sensitive.
We build a linear model and then use it to predict elasticity during offline analysis.
Constructed once, the model can be used for all future analysis, unless a better model is proposed.
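
A linear model of this kind can be fitted with ordinary least squares. The sketch below is only an illustration: the training points are invented for the example and are not the SPEC CPU measurements from the graph, and the function names are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two (x, y) pairs of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def predict_elasticity(miss_rate, slope, intercept):
    """Apply the fitted model to an estimated L2 miss rate."""
    return slope * miss_rate + intercept

# Invented training data: (estimated L2 miss rate, measured elasticity).
miss_rates = [0.0, 0.1, 0.2, 0.3]
elasticities = [-1.0, -0.8, -0.6, -0.4]

slope, intercept = fit_line(miss_rates, elasticities)
predicted = predict_elasticity(0.15, slope, intercept)
print(slope, intercept, predicted)
```

In this invented data set, a higher miss rate goes with an elasticity closer to zero, that is, with a job that is less sensitive to clock speed.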

Scheduling
Recall: the architectural signature contains elasticity values.
Elasticity is straightforward to interpret.
Using elasticity, the scheduler categorizes jobs as highly sensitive, moderately sensitive, or insensitive.
Finally, we’re ready to schedule.
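
The categorization step could look like the sketch below. The thresholds, the job and core names, and the greedy pairing of the most sensitive jobs with the fastest cores are all assumptions made for illustration. They are not taken from the prototype.

```python
def categorize(elasticity, high=-0.7, moderate=-0.4):
    """Map an elasticity value to a sensitivity category.

    The thresholds are hypothetical. More negative means more sensitive.
    """
    if elasticity <= high:
        return "highly sensitive"
    if elasticity <= moderate:
        return "moderately sensitive"
    return "insensitive"

def assign_jobs(job_elasticities, core_speeds_ghz):
    """Pair the most sensitive jobs with the fastest cores, one job per core."""
    # Most negative elasticity (most sensitive) first.
    jobs = sorted(job_elasticities, key=job_elasticities.get)
    # Fastest cores first.
    cores = sorted(core_speeds_ghz, key=core_speeds_ghz.get, reverse=True)
    return dict(zip(jobs, cores))

# Hypothetical jobs and a machine with two 2GHz and two 3GHz cores.
jobs = {"a": -0.95, "b": -0.2, "c": -0.55, "d": -0.8}
cores = {"cpu0": 2.0, "cpu1": 3.0, "cpu2": 2.0, "cpu3": 3.0}

assignment = assign_jobs(jobs, cores)
for job, core in assignment.items():
    print(job, categorize(jobs[job]), core, cores[core])
```

Here the two highly sensitive jobs end up on the 3GHz cores, while the moderately sensitive and insensitive jobs share the 2GHz cores.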

Clock Speed Sensitivity Data Flow
MICA reuse distance data -> abstract L2 cache miss rates -> clock speed elasticity values -> clock speed sensitivity category

Evaluating Clock Speed Sensitive Scheduling
Completion times with our clock speed aware prototype, normalized to completion times with the default Linux 2.6.18 scheduler.
[Figure, three panels]
Highly heterogeneous workload: two 2GHz cores, two 3GHz cores.
Balanced workload: one each of 2GHz, 2.33GHz, 2.67GHz, and 3GHz cores.
Uniform workload: two 2GHz cores, two 3GHz cores.

Summary
A framework for developing microarchitecture-independent architectural signatures to assist heterogeneity-aware scheduling.
Proof of concept: clock speed aware scheduling.
Results: tangible benefits even on mildly heterogeneous platforms, with up to 4% average throughput increase on a multicore system with 2GHz and 3GHz cores.

Future Work
Extend our framework to include other core characteristics (cache size, issue width, ...).
Develop and analyze a heterogeneity-aware scheduler in a real operating system (Sun Solaris).
Compare that scheduler with other heterogeneity-aware schedulers.

References
[1] M. Becchi and P. Crowley. Dynamic Thread Assignment on Heterogeneous Multiprocessor Architectures. In Proceedings of the Conference on Computing Frontiers, 2006.
[2] K. Hoste and L. Eeckhout. Microarchitecture-Independent Workload Characterization. IEEE Micro Hot Tutorials, 27(3):63-72, 2007.
[3] R. Kumar, D. M. Tullsen, P. Ranganathan, N. Jouppi, and K. Farkas. Single-ISA Heterogeneous Multicore Architectures for Multithreaded Workload Performance. In Proceedings of the 31st Annual International Symposium on Computer Architecture, 2004.

Appendix A: Existing Approaches
Algorithms by Becchi [1] and Kumar [3] rely on performance monitoring to determine the optimal assignment.
Potential drawbacks: they don’t scale well to many types of cores and have limited applicability to short-lived threads.
[Figure: several jobs reaching a scheduler marked with a check mark]

Appendix B: Input Sets and Performance
Varying input sets can drastically affect performance (for example, ref vs. test input in SPEC CPU2000).
One architectural signature can account for at most one input.
This is a difficult problem that we are not currently tackling.
There are smart ways to create parameterized approximations that account for data input size: Y. Zhong, S. G. Dropsho and C. Ding. Miss rate prediction across all program inputs. In Proceedings of Parallel Architectures and Compilation Techniques, 2003.

Appendix C: Elasticity
We need two measurements of completion time at two different frequencies.
From these we calculate the clock speed elasticity of completion time, where E = elasticity, T = completion time, and F = clock speed.
The larger the magnitude, the more sensitive the completion time is to clock speed.
In this case, -1.0 is considered very elastic (sensitive), because it means that an increase in frequency by a factor of X will decrease the completion time by the same factor.
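
One standard definition that matches this description is the arc (midpoint) elasticity. It is given here as an assumption and is not necessarily the exact formula used in the talk. The subscripts 1 and 2, which denote the two measurements, are introduced for illustration.

```latex
E = \frac{(T_2 - T_1)\,/\,\tfrac{1}{2}(T_1 + T_2)}{(F_2 - F_1)\,/\,\tfrac{1}{2}(F_1 + F_2)}
% Example with perfect scaling: F_1 = 2 GHz, F_2 = 3 GHz, T_1 = 300 s, T_2 = 200 s
% E = (-100 / 250) / (1 / 2.5) = -0.4 / 0.4 = -1.0
```

Under this definition, a job whose completion time is inversely proportional to clock speed always yields E = -1.0, while a job whose completion time does not change with frequency yields E = 0.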

Appendix D: Different Cache Sizes
L2 miss rates (and elasticity) depend heavily on cache size, so cache size has to be taken into account.
Solution: calculate miss rates and elasticity for common cache configurations; the scheduler picks the appropriate one.
This is a reasonable approach, because cache size aware scheduling takes precedence over clock speed aware scheduling.
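
One way a scheduler might pick the appropriate entry is sketched below. The signature layout (a mapping from L2 size in KB to elasticity), the selection rule, and all values are hypothetical. The talk does not specify how signatures are encoded.

```python
def elasticity_for_core(signature, core_cache_kb):
    """Return the elasticity profiled for the cache size closest to the core's.

    Picks the largest profiled cache size that does not exceed the core's L2
    size, falling back to the smallest profiled size if none fits.
    """
    if not signature:
        raise ValueError("signature has no cache configurations")
    sizes = sorted(signature)
    fitting = [size for size in sizes if size <= core_cache_kb]
    chosen = fitting[-1] if fitting else sizes[0]
    return signature[chosen]

# Hypothetical signature: L2 size in KB -> clock speed elasticity.
signature = {512: -0.45, 1024: -0.6, 2048: -0.8, 4096: -0.9}

print(elasticity_for_core(signature, 2048))
print(elasticity_for_core(signature, 3072))
print(elasticity_for_core(signature, 256))
```

Rounding down to a smaller profiled cache is a conservative choice: it never credits the job with more cache than the core actually has.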
