
Active Sampling for Accelerated Learning of Performance Models






Presentation Transcript


  1. Active Sampling for Accelerated Learning of Performance Models Piyush Shivam, Shivnath Babu, Jeff Chase Duke University

  2. Networked Computing Utility
  • A network of clusters or grid sites. Each site is a pool of heterogeneous resources (e.g., CPU, memory, storage, network) managed as a shared utility.
  • Jobs are task/data workflows.
  • Challenge: choose the "best" resource mapping/schedule for the job mix, an instance of "utility resource planning".
  • Solution under construction: NIMO.
  [Diagram: a task workflow submitted to a task scheduler, which maps tasks onto sites A, B, and C (clusters C1, C2, C3)]

  3. Subproblem: Predict Job Completion Time

  4. Premises (Limitations)
  • Important batch applications are run repeatedly.
  • Most resources are consumed by applications we have seen in the past.
  • Behavior is predictable across data sets, given some attributes associated with the data set.
  • Stable behavior per unit of data processed (D); D is predictable from data set attributes.
  • Behavior depends only on resource attributes (e.g., CPU type and clock, seek time, spindle count).
  • The utility controls the resources assigned to each job; virtualization enables precise control.
  • Your mileage may vary.

  5. NIMO: NonInvasive Modeling for Optimization
  • NIMO learns end-to-end performance models.
  • Models predict performance as a function of (a) the application profile, (b) the data set profile, and (c) the resource profile of a candidate resource assignment.
  • NIMO is active: it collects training data for learning models by conducting proactive experiments on a "workbench".
  • NIMO is noninvasive.
  [Diagram: app/data profiles and candidate resource profiles ("what if...") feed a model that outputs the target performance]

  6. The Big Picture
  [Diagram: jobs and benchmarks flow through a resource profiler and an application profiler into a training-set database used for active learning; a scheduler maps jobs onto sites A, B, and C; pervasive instrumentation correlates metrics with job logs]

  7. Generic End-to-End Model
  • T = D * (Oa + Os), where T is total completion time and D is total data.
  • Oa (compute occupancy): compute phases, compute resource busy.
  • Os (stall occupancy) = Od (storage occupancy) + On (network occupancy): stall phases, compute resource stalled on I/O.
  • Occupancy: average time consumed per unit of data; directly observable.
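The generic end-to-end model on this slide can be sketched in a few lines. This is an illustrative reconstruction, not NIMO's code; the function name and the example numbers are made up.

```python
# Sketch of the generic end-to-end model: completion time
# T = D * (Oa + Os), with stall occupancy Os = Od + On.
# All names and numbers here are illustrative, not from NIMO.

def predict_completion_time(D, Oa, Od, On):
    """Predict job completion time from per-unit-of-data occupancies.

    D  -- total data processed (e.g., MB)
    Oa -- compute occupancy (seconds per unit of data, CPU busy)
    Od -- storage occupancy (seconds per unit of data, stalled on disk)
    On -- network occupancy (seconds per unit of data, stalled on network)
    """
    Os = Od + On          # stall occupancy: compute resource stalled on I/O
    return D * (Oa + Os)  # T = D * (Oa + Os)

# Example: a 1000 MB job at 0.02 s/MB compute, 0.005 s/MB disk,
# and 0.003 s/MB network stall
t = predict_completion_time(1000, 0.02, 0.005, 0.003)
print(t)  # ~28 seconds
```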

  8. Statistical Learning
  • Independent variables: data profile and resource profile. Dependent variables: the predicted occupancies.
  • Complexity (e.g., latency hiding, concurrency, arm contention) is captured implicitly in the training data rather than in the structure of the model.
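As a toy illustration of this statistical-learning step, the sketch below fits a compute occupancy as a linear function of a single resource attribute (inverse CPU speed) by ordinary least squares. The data points and the linear form are made up for illustration; NIMO's actual learning procedure is richer.

```python
# Toy illustration: learn occupancy as a function of a resource
# attribute from training samples. Data and model form are invented.

def least_squares(xs, ys):
    """Closed-form 1-D ordinary least squares: returns (slope, intercept)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

# Compute occupancy (s/MB) observed at CPU speeds 1, 2, 4 GHz; use x = 1/speed
inv_speed = [1.0, 0.5, 0.25]
occupancy = [0.040, 0.020, 0.010]
slope, intercept = least_squares(inv_speed, occupancy)

# Predict the occupancy on an unseen 3 GHz machine
print(slope * (1 / 3) + intercept)
```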

  9. Sampling Challenges
  • Full system operating range: samples must cover the space of candidate resource assignments.
  • Cost of sample acquisition: acquiring a sample has a non-negligible cost, e.g., the time to acquire it, or the opportunity cost for the application.
  • Curse of dimensionality: too many parameters! E.g., 10 dimensions X 10 values per dimension at 5 minutes per sample => 951 years for 1% of the samples!
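The "951 years" figure on this slide follows directly from the stated parameters; the quick check below reproduces the arithmetic (illustrative only).

```python
# Reproduce the cost arithmetic from the slide:
# 10 dimensions with 10 values each, 5 minutes per sample.
dimensions = 10
values_per_dim = 10
minutes_per_sample = 5

full_space = values_per_dim ** dimensions        # 10^10 candidate assignments
one_percent = full_space // 100                  # sampling just 1% of them
total_minutes = one_percent * minutes_per_sample
years = total_minutes / (60 * 24 * 365)          # minutes in a (non-leap) year
print(round(years))  # 951 years, as the slide claims
```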

  10. Active Learning in NIMO
  • How do we learn accurate models quickly?
  • Passive sampling might not expose the full system operating range.
  • Active sampling using "design of experiments" collects the most relevant training data.
  • Automatic and quick.
  [Graph: model accuracy vs. number of training samples; active sampling approaches 100% accuracy with far fewer samples than passive sampling]

  11. Sample Carefully
  [Graph: model accuracy vs. number of training samples; active sampling with acceleration outperforms active sampling without acceleration, which in turn outperforms passive sampling]

  12. Active Sampling Challenges
  • How to expose the main factors and interactions in the shortest time?
  • Which dimensions/attributes to perturb?
  • What values to choose for the attributes?
  • Where to conduct the experiment: on a separate system (a "workbench") or "live"?

  13. Planning "Active" Experiments
  1. Choose a predictor function to refine: focus on the most significant/relevant predictors, or the least accurate. (Example: a CPU-intensive app needs an accurate compute-time predictor.)
  2. Choose an attribute (if any) to add to the predictor. (Example: CPU speed.)
  3. Choose the values of the attributes.
  4. Conduct the experiment.
  5. Compute the current prediction error; go to step 1.
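The planning loop above can be sketched as a self-contained toy. Here the predictors, the error model, and the "experiment" (modeled as simply halving the chosen predictor's error) are all illustrative stand-ins, not NIMO's implementation; the dynamic maximum-error selection rule is the one described on the next slide.

```python
# Minimal sketch of the experiment-planning loop. Everything here is
# a hypothetical stand-in: a real experiment would perturb attributes
# and retrain, not just halve an error value.

def plan_experiments(errors, threshold=0.05, max_experiments=100):
    """Iteratively refine the least accurate predictor.

    errors -- dict mapping predictor name -> current prediction error
    Returns the list of predictors refined, in order.
    """
    history = []
    for _ in range(max_experiments):
        # Step 1: choose the predictor with maximum current error
        name = max(errors, key=errors.get)
        if errors[name] <= threshold:
            break  # all predictors are accurate enough; stop sampling
        # Steps 2-4 (choose attribute, choose values, run experiment)
        # are abstracted into a single error-reduction step here.
        errors[name] /= 2
        history.append(name)
    return history

refined = plan_experiments({"compute": 0.4, "storage": 0.1, "network": 0.2})
print(refined)  # the compute predictor gets refined first and most often
```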

  14. Choosing the Next Predictor
  • Learn the most significant/relevant predictors first.
  • Static vs. dynamic ordering:
  • Static: define a total order, e.g., a priori or by pre-estimates of influence (Plackett-Burman). Cycle through the order: round-robin vs. an improvement threshold.
  • Dynamic: choose the predictor with the maximum current error.

  15. Choosing New Attributes
  • Include the most significant/relevant attributes; choose attributes that expose main factors and interactions.
  • Add an attribute when the error reduction from further training with the current set falls below a threshold.
  • Choose the attribute with the maximum potential improvement in accuracy.
  • Establish a total order using a pre-estimate of relevance (Plackett-Burman).

  16. Choosing New Values
  • Select a new value sample to train the selected predictor function with the chosen set of attributes.
  • A range of approaches balances coverage vs. interactions: binary search/bracketing, and Plackett-Burman (PB) designs to identify interactions.
  • La-Ib designs: a = number of levels for a value, b = degree of interactions.
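One way to read the binary search/bracketing option is recursive midpoint sampling of an attribute's value range, so that coverage refines progressively. The sketch below is a hypothetical illustration of that idea, not NIMO's value-selection code.

```python
# Illustrative sketch of value selection by recursive bracketing:
# sample the midpoint of the range, then recurse into both halves.
# Function name and the CPU-speed example are invented.

def bracket_values(lo, hi, depth=3):
    """Return sample points chosen by recursive midpoint bracketing."""
    if depth == 0:
        return []
    mid = (lo + hi) / 2
    return ([mid]
            + bracket_values(lo, mid, depth - 1)
            + bracket_values(mid, hi, depth - 1))

# Pick CPU-speed samples (GHz) between 1.0 and 3.0, three levels deep
samples = sorted(bracket_values(1.0, 3.0))
print(samples)  # [1.25, 1.5, 1.75, 2.0, 2.25, 2.5, 2.75]
```

Each extra level of depth doubles the number of new sample points, so coverage can be refined incrementally as the sampling budget allows.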

  17. Experimental Results
  • Biomedical applications: BLAST, fMRI, NAMD, CardioWave.
  • Resources: 5 CPU speeds X 6 network latencies X 5 memory sizes = 150 resource assignments.
  • Goal: learn an execution-time model with the fewest training assignments.
  • A separate test set evaluates the accuracy of the current model.

  18. BLAST Application
  • Total time for all 150 assignments: 130 hrs.
  • Active sampling: 5 hrs (2% of the sample space).
  • With an incorrect order of predictor refinement: 12 hrs (10% of the sample space).

  19. BLAST Application
  • Total time for all 150 assignments: 130 hrs.
  • Active sampling: 5 hrs (2% of the sample space).
  • With an incorrect order of attribute refinement: 12 hrs (10% of the sample space).

  20. Summary/Conclusions
  • Current SLT: given the right data, learn the right model.
  • Use active sampling to acquire the right data.
  • Ongoing experiments demonstrate the importance and potential of guided active sampling: 2% of the sample space yields >= 90% model accuracy.
  • Upcoming VLDB paper...
