
Using Application-Domain Knowledge in the Runtime Support of Multi-Experiment Computational Studies
Siu Yau Dissertation Defense, Dec 08


Presentation Transcript


  1. Using Application-Domain Knowledge in the Runtime Support of Multi-Experiment Computational Studies Siu Yau Dissertation Defense, Dec 08

  2. Multi-Experiment Study (MES) • Simulation software rarely runs in isolation • Multi-Experiment Computational Study: multiple executions of a simulation experiment • Goal: identify interesting regions in the input space of the simulation code • Examples in engineering, science, medicine, finance • Interested in the aggregate result, not in individual experiments

  3. MES Challenges • Systematically cover input space • Refinement + high dimensionality → large number of experiments (100s or 1000s) and/or user interaction • Accurate individual experiments • Spatial + temporal refinement → long-running individual experiments (days or weeks per experiment) • Subjective goal • Requires study-level user guidance

  4. MES on Parallel Architectures • Parallel architecture maps well to MES • Dedicated, local access to small- to medium-sized parallel computers • Interactive MES → user-directed coverage of exploration space • Massively-parallel systems • Multiple concurrent parallel experiments exploit power of massively parallel systems • Traditional systems lack high-level view

  5. Thesis Statement To meet the interactive and computational requirements of Multi-Experiment Studies, a parallel run-time system must view an entire study as a single entity, and use application-level knowledge that is made available from the study context to inform its scheduling and resource allocation decisions.

  6. Outline • MES Formulation, motivating examples • Defibrillator Design, Helium Model Validation • Related Work • Research Methodology • Research Test bed: SimX • Optimization techniques • Sampling, Result reuse, Resource allocation • Contributions

  7. MES Formulation • Simulation Code: maps input to result • Design Space: Space of possible inputs to simulation code • Evaluation Code: maps result to performance metric • Performance Space: Space of outputs of evaluation code • Goal: Find Region of Interest in Design & Performance Space
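
A minimal sketch of this formulation (illustrative types and toy functions, not SimX's actual API): the simulation code maps a design-space point to a result, the evaluation code maps that result into performance space, and one experiment composes the two.

    #include <cmath>
    #include <cstdio>
    #include <vector>

    // Illustrative types: a design-space point, a raw simulation result,
    // and a performance-space point (one value per objective).
    using DesignPoint = std::vector<double>;
    using SimResult   = std::vector<double>;
    using Performance = std::vector<double>;

    // Simulation code: maps a design-space input to a result (toy stand-in).
    SimResult runSimulation(const DesignPoint& x) {
        return { std::sin(x[0]) + x[1], std::cos(x[0]) * x[1] };
    }

    // Evaluation code: maps a result to performance metrics (toy objectives).
    Performance evaluate(const SimResult& r) {
        return { r[0] * r[0], (r[1] - 1.0) * (r[1] - 1.0) };
    }

    // One experiment of the study = simulation followed by evaluation.
    Performance runExperiment(const DesignPoint& x) {
        return evaluate(runSimulation(x));
    }

    int main() {
        Performance p = runExperiment({0.5, 2.0});
        std::printf("objectives: %f %f\n", p[0], p[1]);
    }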

  8. Example: Defibrillator Design • Help design implantable defibrillators • Simulation Code: • Electrode placements + shock voltage → torso potential • Evaluation Code: • Torso potential + activation/damage thresholds → % activated & damaged heart tissues • Goal: placement + voltage combination to maximize activation, minimize damage

  9. Example: Gas Model Validation • Validate gas-mixing model • Simulation Code: • Prandtl number + gas inlet velocity → helium plume motion • Evaluation Code: • Helium plume motion → velocity profile deviation from real-life data • Goal: Find Prandtl number + inlet velocity to minimize deviation

  10. Example: Pareto Optimization • Set of inputs that cannot be improved in all objectives [figure: experiment outcomes plotted on activation vs. damage axes]

  11. Example: Pareto Optimization • Set of inputs that cannot be improved in all objectives
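
Concretely, the frontier can be computed by discarding every dominated outcome; a brute-force sketch with hypothetical numbers (activation maximized, damage minimized):

    #include <cstdio>
    #include <vector>

    // One evaluated experiment: % tissue activated (maximize) and % damaged (minimize).
    struct Outcome { double activation, damage; };

    // a dominates b if it is at least as good in both objectives and strictly better in one.
    bool dominates(const Outcome& a, const Outcome& b) {
        return a.activation >= b.activation && a.damage <= b.damage &&
               (a.activation > b.activation || a.damage < b.damage);
    }

    // Brute-force Pareto frontier: keep every outcome that no other outcome dominates.
    std::vector<Outcome> paretoFrontier(const std::vector<Outcome>& all) {
        std::vector<Outcome> frontier;
        for (const Outcome& p : all) {
            bool dominated = false;
            for (const Outcome& q : all)
                if (dominates(q, p)) { dominated = true; break; }
            if (!dominated) frontier.push_back(p);
        }
        return frontier;
    }

    int main() {
        std::vector<Outcome> results = {{0.90, 0.10}, {0.95, 0.30}, {0.85, 0.05}, {0.92, 0.35}};
        for (const Outcome& o : paretoFrontier(results))
            std::printf("activation=%.2f damage=%.2f\n", o.activation, o.damage);
    }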

  12. Challenge: Defibrillator Design • Interactive exploration of Pareto frontier • Change setup (voltage, back electrode, etc.) → new study • Interactive exploration of “study space” • One user action → one aggregate result • Need study-level view, interactive rate

  13. Challenge: Model Validation • Multiple executions of long-running code • 6x6 grid = 36 experiments • ~3000 timesteps per experiment @ 8 seconds per timestep • ~6.5 hours per experiment → ~10 days per study • Schedule and allocate resources as a single entity: how to distribute parallel resources?

  14. Related Work: Grid Schedulers • Grid Schedulers • Condor, Globus • Each experiment treated as a “black box” • Application-aware grid infrastructures: • Nimrod/O and Virtual Instrument • Take advantage of application knowledge – but in an ad-hoc fashion • No consistent set of APIs reusable across different MESs

  15. Related Work: Parallel Steering • Grid-based Steering • Grid-based: RealityGrid, WEDS • Steer execution of inter-dependent tasks • Different focus: Grid vs. cluster • Parallel Steering Systems • Falcon, CUMULVS, CSE • Steer single executions (not collections) on parallel machines

  16. Methodology • Four example MESs, varying properties

  17. Methodology (cont’d) • Identify application-aware system policies • Scheduling, Resource allocation, User interface, Storage support • Construct research test bed (SimX) • API to import application-knowledge • Implemented on parallel clusters • Conduct example MESs • Implement techniques, measure effect of application-aware system policies

  18. Test bed: SimX • Parallel System for Interactive Multi-Experiment Studies (SIMECS) • Support MESs on parallel clusters • Functionality-based components • UI, Sampler, Task Queue, Resource Allocator, Simulation container, SISOL • Each component with specific API • Adapt API to the needs of the MES

  19. Test bed: SimX [architecture diagram: Front-end Manager Process with the User Interface (visualisation & interaction), Sampler, Task Queue, and Resource Allocator; Worker Process Pool with Simulation Containers running the simulation code and evaluation code through the FUEL Interface and the SISOL API; SISOL Server Pool with a Dir Server and Data Servers]

  20. Optimization techniques • Reduce number of experiments needed: • Automatic sampling • Study-level user steering • Study-level result reuse • Reduce run time of individual experiments: • Reuse results from another experiment: checkpoints, internal states • Improve resource utilization rate • Minimize parallelization overhead & maximize reuse potential • Preemption: claim idle resources

  21. Active Sampling • If the MES is an optimization study (i.e., the region of interest is defined by optimizing a function) • Incorporate search algorithm in scheduler • Pareto optimizations: Active Sampling • Cover design space from coarse to fine grid • Use aggregate results from coarse level to identify promising regions • Reduce number of experiments needed

  22. Active Sampler (cont’d) [figure: initial grid, first refinement, and 2nd refinement of the design space, with 1st-, 2nd-, and 3rd-level results]
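
A rough sketch of the coarse-to-fine loop (my own simplification with toy objectives, not the dissertation's sampler): evaluate one level, keep the non-dominated samples, and spawn the next level only around those samples at half the spacing.

    #include <cmath>
    #include <cstdio>
    #include <initializer_list>
    #include <vector>

    // Toy stand-in for one experiment's two objectives (both minimized).
    struct Sample { double x, y, f1, f2; };
    Sample evaluate(double x, double y) {
        return { x, y, std::pow(x - 0.3, 2) + y, std::pow(y - 0.7, 2) + x };
    }

    bool dominates(const Sample& a, const Sample& b) {
        return a.f1 <= b.f1 && a.f2 <= b.f2 && (a.f1 < b.f1 || a.f2 < b.f2);
    }

    int main() {
        double h = 0.25;                               // coarse grid spacing
        std::vector<Sample> level;
        for (double x = 0.0; x <= 1.0; x += h)
            for (double y = 0.0; y <= 1.0; y += h)
                level.push_back(evaluate(x, y));

        for (int refinement = 0; refinement < 3; ++refinement) {
            std::vector<Sample> next;
            for (const Sample& s : level) {
                bool dominated = false;
                for (const Sample& t : level)
                    if (dominates(t, s)) { dominated = true; break; }
                if (dominated) continue;               // prune uninteresting regions
                for (double dx : {-h / 2, h / 2})      // refine only around promising samples
                    for (double dy : {-h / 2, h / 2})
                        next.push_back(evaluate(s.x + dx, s.y + dy));
            }
            std::printf("refinement %d: %zu new experiments\n", refinement + 1, next.size());
            level = next;
            h /= 2;
        }
    }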

  23. Support for Sampling • SimX Sampler API: • void setStudy(StudySpec) • void registerResult(experiment, performance) • experiment getNextPointToRun() • Pluggable samplers: Naïve (Sweep) Sampler, Random Sampler, Active (Pareto) Sampler, Custom Sampler [architecture diagram as on slide 19]
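
A sketch of a sampler written against those three calls (types simplified to pairs and vectors; the real StudySpec, experiment, and performance types are richer): here a plain sweep sampler, which an adaptive sampler would extend inside registerResult().

    #include <deque>
    #include <utility>
    #include <vector>

    // Simplified stand-ins for SimX's types.
    using Experiment  = std::pair<double, double>;     // a design-space point
    using Performance = std::vector<double>;
    struct StudySpec { double lo, hi; int samplesPerDim; };

    // setStudy() receives the study description, getNextPointToRun() hands out the
    // next experiment, registerResult() feeds results back so an adaptive sampler
    // could refine (this one simply sweeps a fixed grid).
    class SweepSampler {
    public:
        void setStudy(const StudySpec& spec) {
            double step = (spec.hi - spec.lo) / (spec.samplesPerDim - 1);
            for (int i = 0; i < spec.samplesPerDim; ++i)
                for (int j = 0; j < spec.samplesPerDim; ++j)
                    pending_.push_back({spec.lo + i * step, spec.lo + j * step});
        }
        Experiment getNextPointToRun() {
            Experiment e = pending_.front();
            pending_.pop_front();
            return e;
        }
        void registerResult(const Experiment& e, const Performance& p) {
            done_.emplace_back(e, p);                  // an adaptive sampler would refine here
        }
        bool empty() const { return pending_.empty(); }
    private:
        std::deque<Experiment> pending_;
        std::vector<std::pair<Experiment, Performance>> done_;
    };

    int main() {
        SweepSampler s;
        s.setStudy({0.0, 1.0, 3});
        while (!s.empty()) {
            Experiment e = s.getNextPointToRun();
            s.registerResult(e, {e.first + e.second}); // dummy performance value
        }
    }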

  24. Evaluation: Active Sampling • Helium validation study • Resolve Pareto frontier on 6x6 grid • Reduce no. of experiments from 36 to 24 • Defibrillator study • Resolve Pareto frontier on 256x256 grid • Reduce no. of experiments from 65K to 7.3K • Non-perfect scaling due to dependencies • At 128 workers: Active sampling: 349 secs; Grid sampling: 900 secs

  25. Result reuse • MES: many similar runs of simulation code • Share information between experiments • Speed up experiments that reuse information • Only need to calculate deltas • Many types: depends on information used • Varying degrees of generality • Reduce individual experiment run time • Except study-level reuse

  26. Result reuse types

  27. Intermediate Result Reuse • Defibrillator simulation code solves 3 systems, linearly combines solutions • Same system needed by different experiments • Cache the solutions [figure: experiments solve Aa x = ba, Ab x = bb, Ac x = bc, Ad x = bd; the solutions Ab⁻¹bb and Ac⁻¹bc are stored for reuse]
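
In the abstract, the caching idea looks like this (hypothetical names, not the defibrillator solver itself): solutions are keyed by the system they solve, so only the first experiment that needs a given system pays for the solve.

    #include <cstdio>
    #include <map>
    #include <string>
    #include <vector>

    using Vector = std::vector<double>;

    // Hypothetical stand-in for an expensive linear solve A_k x = b_k.
    Vector solveSystem(const std::string& systemId) {
        std::printf("solving system %s (expensive)\n", systemId.c_str());
        return {1.0, 2.0, 3.0};                        // placeholder solution
    }

    // Cache solutions keyed by system identity; experiments that need the same
    // system reuse the stored solution and only combine it with their own terms.
    class SolutionCache {
    public:
        const Vector& get(const std::string& systemId) {
            auto it = cache_.find(systemId);
            if (it == cache_.end())
                it = cache_.emplace(systemId, solveSystem(systemId)).first;
            return it->second;
        }
    private:
        std::map<std::string, Vector> cache_;
    };

    int main() {
        SolutionCache cache;
        const Vector& xb  = cache.get("Ab");           // solved once
        const Vector& xc  = cache.get("Ac");           // solved once
        const Vector& xb2 = cache.get("Ab");           // reused, no solve
        std::printf("solution sizes: %zu %zu %zu\n", xb.size(), xc.size(), xb2.size());
    }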

  28. Support for Result Reuse • SISOL API: • object StartRead(objSet, coord) • void EndRead(object) • object StartWrite(objSet, coord) • void EndWrite(objSet, object) [architecture diagram as on slide 19; cached solutions Aa⁻¹ba and Ab⁻¹bb shown on SISOL Data Servers]
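
A minimal in-memory stand-in for those four calls (simplified types; the real SISOL store is distributed across Data Servers): one experiment publishes an object at a design-space coordinate, another reads it back.

    #include <cstdio>
    #include <map>
    #include <string>
    #include <utility>
    #include <vector>

    using Coord  = std::pair<double, double>;          // design-space coordinate
    using Object = std::vector<double>;                // shared object payload

    class Sisol {
    public:
        // StartWrite/EndWrite bracket publication of an object at a coordinate.
        Object* StartWrite(const std::string& objSet, const Coord& c) {
            return &store_[objSet][c];
        }
        void EndWrite(const std::string& objSet, Object* obj) { (void)objSet; (void)obj; }

        // StartRead/EndRead bracket consumption; returns nullptr if nothing is stored.
        const Object* StartRead(const std::string& objSet, const Coord& c) {
            auto setIt = store_.find(objSet);
            if (setIt == store_.end()) return nullptr;
            auto objIt = setIt->second.find(c);
            return objIt == setIt->second.end() ? nullptr : &objIt->second;
        }
        void EndRead(const Object* obj) { (void)obj; }

    private:
        std::map<std::string, std::map<Coord, Object>> store_;
    };

    int main() {
        Sisol sisol;
        Object* w = sisol.StartWrite("solutions", {1.0, 2.0});
        *w = {0.1, 0.2, 0.3};                          // e.g. a cached solve result
        sisol.EndWrite("solutions", w);

        const Object* r = sisol.StartRead("solutions", {1.0, 2.0});
        if (r) std::printf("reused object with %zu entries\n", r->size());
        sisol.EndRead(r);
    }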

  29. Checkpoint Result Reuse • Helium code terminates when KE stabilizes • Start from another experiment’s checkpoint → stabilizes faster • Checkpoint must come from an experiment with the same inlet velocity
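
A small sketch of that compatibility rule (hypothetical structures; a real checkpoint stores the full solver state): restart from the most advanced checkpoint with a matching inlet velocity, otherwise run from scratch.

    #include <cstdio>
    #include <vector>

    struct Checkpoint { double inletVelocity; int timestep; };
    struct Experiment { double prandtl, inletVelocity; };

    // Reuse is only valid when the inlet velocity matches, since the plume
    // history depends on it. (A real implementation would match velocities by
    // discrete class rather than exact floating-point equality.)
    const Checkpoint* pickCheckpoint(const Experiment& e, const std::vector<Checkpoint>& cps) {
        const Checkpoint* best = nullptr;
        for (const Checkpoint& c : cps)
            if (c.inletVelocity == e.inletVelocity && (!best || c.timestep > best->timestep))
                best = &c;
        return best;
    }

    int main() {
        std::vector<Checkpoint> cps = {{1.0, 1200}, {1.5, 1641}, {1.0, 900}};
        Experiment e{0.7, 1.0};
        const Checkpoint* c = pickCheckpoint(e, cps);
        if (c) std::printf("restart from timestep %d\n", c->timestep);
        else   std::printf("run from scratch\n");
    }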

  30. Study-level Result Reuse • Interactive study: two similar studies • Use Pareto frontier from first study as a guide for next study

  31. Evaluation: Result Reuse • Checkpoint reuse in Helium Model Study: • No reuse: 3000 timesteps; with reuse: 1641 • 18 experiments out of 24 able to reuse • 28% improvement overall • Defibrillator study • No reuse: 7.3K experiments @ 2 secs each = 349 secs total on 128 procs • With reuse: 6.5K experiments @ 1.5 secs = 123 secs total on 128 procs • 35% improvement overall

  32. Resource Allocation • MES made up of parallel simulation codes • How to divide cluster among experiments? • Parallelization overhead → fewer processes per experiment • Active sampling + reuse → some experiments more important; more processes for those experiments • Adapt allocation policy to MES: • Use application knowledge to decide which experiments are prioritized

  33. Resource Allocation • Batching strategy: select a subset (batch), assign it high priority, and run it concurrently • Considerations for batching policies • Scaling behavior: maximize batch size • Sampling policy: prioritize “useful” samples • Reuse potential: prioritize experiments with reuse • Preemption strategy: • Claim unused processing elements and assign them to experiments in progress
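
A simplified sketch of batching plus preemption (my own illustration; it ignores the priority and reuse considerations listed above): split the workers evenly over a batch, then hand a finished experiment's workers to the smallest running group.

    #include <cstdio>
    #include <vector>

    // Even split of W worker processes over a batch of N experiments.
    std::vector<int> allocate(int workers, int experiments) {
        std::vector<int> groups(experiments, workers / experiments);
        for (int i = 0; i < workers % experiments; ++i) ++groups[i];
        return groups;
    }

    // Preemption: reassign a finished experiment's workers, one at a time,
    // to the running experiment with the fewest workers.
    void preempt(std::vector<int>& groups, int finished) {
        int freed = groups[finished];
        groups[finished] = 0;
        while (freed-- > 0) {
            int smallest = -1;
            for (int i = 0; i < (int)groups.size(); ++i)
                if (groups[i] > 0 && (smallest < 0 || groups[i] < groups[smallest]))
                    smallest = i;
            if (smallest < 0) break;                   // nothing left running
            ++groups[smallest];
        }
    }

    int main() {
        std::vector<int> groups = allocate(128, 6);
        preempt(groups, 0);                            // experiment 0 finished early
        for (int g : groups) std::printf("%d ", g);
        std::printf("\n");
    }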

  34. Resource Allocation: Batching • Batch for Active Sampling • Identify independent experiments in sampler • Maximize parallelism while allowing active sampling [figure: design space (Prandtl number vs. inlet velocity); the first batch resolves the 1st Pareto-optimal points, the second batch the 1st & 2nd, the 3rd batch the 1st to 3rd, the 4th batch the Pareto frontier]

  35. Resource Allocation: Batching • Active Sampling batching [figure: the 1st to 4th batches]

  36. Resource Allocation: Batching • Batch for reuse class • Sub-divide each batch into 2 smaller batches: • 1st sub-batch: first experiment in each reuse class; no two belong to the same reuse class • No two concurrent from-scratch experiments can reuse each other’s checkpoints (maximizes reuse potential) • Experiments in the same batch have comparable run times (reduces holes) [figure: design space (Prandtl number vs. inlet velocity) with reuse classes]
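
A sketch of that sub-batching rule (hypothetical types; here the reuse class is simply the inlet velocity): the first sub-batch gets one from-scratch experiment per reuse class, the rest go to the second sub-batch and can restart from the first sub-batch's checkpoints.

    #include <cstdio>
    #include <map>
    #include <vector>

    // Hypothetical experiment: checkpoints are reusable only within an inlet-velocity class.
    struct Exp { double prandtl, inletVelocity; };

    // Split one batch into two sub-batches: the first holds exactly one
    // from-scratch experiment per reuse class, the second holds the remainder.
    void subBatch(const std::vector<Exp>& batch, std::vector<Exp>& first, std::vector<Exp>& rest) {
        std::map<double, bool> seen;
        for (const Exp& e : batch) {
            if (!seen[e.inletVelocity]) { seen[e.inletVelocity] = true; first.push_back(e); }
            else rest.push_back(e);
        }
    }

    int main() {
        std::vector<Exp> batch = {{0.5, 1.0}, {0.7, 1.0}, {0.5, 1.5}, {0.9, 1.5}};
        std::vector<Exp> first, rest;
        subBatch(batch, first, rest);
        std::printf("from-scratch: %zu, reusing: %zu\n", first.size(), rest.size());
    }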

  37. Resource Allocation: Batching • Batching for reuse classes [figure: the 1st to 6th batches]

  38. Resource Allocation: Preemption • With preemption [figure: the 1st to 6th batches rescheduled with preemption]

  39. Support for Resource Allocation • Task Queue API: • TaskQueue::AddTask(Experiment) • TaskQueue::CreateBatch(set<Experiment>&) • TaskQueue::GetIdealGroupSize() • TaskQueue::AssignNextTask(GroupID) • Reconfigure(const int* assignment) [architecture diagram as on slide 19]
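
A toy illustration of how those calls fit together (simplified signatures and types, not the real SimX Task Queue; Reconfigure is omitted): experiments are queued, a batch is prioritized, and idle worker groups pull tasks from that batch.

    #include <cstdio>
    #include <deque>
    #include <set>

    using Experiment = int;
    using GroupID    = int;

    class TaskQueue {
    public:
        void AddTask(Experiment e) { pending_.push_back(e); }
        // Mark a subset of pending experiments as the current high-priority batch.
        void CreateBatch(const std::set<Experiment>& batch) { batch_ = batch; }
        // How many workers one experiment should get (fixed here for simplicity).
        int GetIdealGroupSize() const { return 4; }
        // Hand the next batched experiment to an idle worker group; -1 if none left.
        Experiment AssignNextTask(GroupID g) {
            for (auto it = pending_.begin(); it != pending_.end(); ++it)
                if (batch_.count(*it)) {
                    Experiment e = *it;
                    pending_.erase(it);
                    std::printf("group %d runs experiment %d\n", g, e);
                    return e;
                }
            return -1;
        }
    private:
        std::deque<Experiment> pending_;
        std::set<Experiment> batch_;
    };

    int main() {
        TaskQueue q;
        for (Experiment e = 0; e < 6; ++e) q.AddTask(e);
        q.CreateBatch({0, 2, 4});                      // prioritized batch
        int groups = 128 / q.GetIdealGroupSize();      // number of worker groups
        for (GroupID g = 0; g < groups; ++g)
            if (q.AssignNextTask(g) == -1) break;      // batch exhausted; idle groups could be preempted
    }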

  40. Evaluation: Resource Allocation

  41. Contributions • Demonstrate the need to consider the entire end-to-end system • Identify system policies that can benefit from application-level knowledge • Scheduling (Sampling): for optimization MESs • User steering: for MESs with subjective goals and MESs with high design-space dimensionality • Result reuse: for MESs made up of similar executions of simulation code • Resource allocation: for MESs made up of parallel simulation codes

  42. Contributions • Demonstrate with prototype system • API to import relevant application-knowledge • Quantify the benefits of application-aware techniques • Sampling: orders of magnitude improvement in bridge design and defibrillator study; 33% improvement in Helium model validation study • User steering: enable interactivity in animation design study and defibrillator study • Result reuse: multi-fold improvement in bridge design, defibrillator, and helium model validation studies • Application-aware resource allocation: multi-fold improvement in Helium model validation study
