Explicit Control in a Batch-aware Distributed File System - PowerPoint PPT Presentation

Presentation Transcript

  1. Explicit Control in a Batch-aware Distributed File System John Bent Douglas Thain Andrea Arpaci-Dusseau Remzi Arpaci-Dusseau Miron Livny University of Wisconsin, Madison

  2. Grid computing Physicists invent distributed computing! Astronomers develop virtual supercomputers!

  3. Grid computing [Diagram: compute clusters connected over the Internet to home storage] If it looks like a duck . . .

  4. Are existing distributed file systems adequate for batch computing workloads? • NO. Internal decisions are inappropriate • Caching, consistency, replication • A solution: Batch-Aware Distributed File System (BAD-FS) • Combines knowledge with external storage control • Detailed information about the workload is known • Storage layer allows external control • External scheduler makes informed storage decisions • Combining information and control results in • Improved performance • More robust failure handling • Simplified implementation

  5. Outline • Introduction • Batch computing • Systems • Workloads • Environment • Why not DFS? • Our answer: BAD-FS • Design • Experimental evaluation • Conclusion

  6. Batch computing • Not interactive computing • Job description languages • Users submit • System itself executes • Many different batch systems • Condor • LSF • PBS • Sun Grid Engine

  7. Batch computing [Diagram: a scheduler at the home storage site dispatches jobs 1-4 over the Internet to compute nodes, each running a CPU manager with a job queue]

  8. “Pipeline and Batch Sharing in Grid Workloads,” Douglas Thain, John Bent, Andrea Arpaci-Dusseau, Remzi Arpaci-Dusseau, Miron Livny. HPDC 12, 2003. Batch workloads • General properties • Large number of processes • Process and data dependencies • I/O intensive • Different types of I/O • Endpoint • Batch • Pipeline • Our focus: Scientific workloads • More generally applicable • Many others use batch computing • video production, data mining, electronic design, financial services, graphic rendering

  9. Batch workloads [Diagram: a workload as a DAG of pipelines: jobs within a pipeline linked by pipeline data, batch datasets shared across pipelines, and endpoint data entering and leaving at the top and bottom]

  10. Cluster-to-cluster (c2c) • Not quite p2p • More organized • Less hostile • More homogeneity • Correlated failures • Each cluster is autonomous • Run and managed by different entities • An obvious bottleneck is the wide-area link [Diagram: home store connected over the Internet to multiple clusters] How do we manage the flow of data into, within, and out of these clusters?

  11. Why not DFS? [Diagram: home store serving clusters over the Internet] • Distributed file system would be ideal • Easy to use • Uniform name space • Designed for wide-area networks • But . . . • Not practical • Embedded decisions are wrong

  12. DFS’s make bad decisions • Caching • Must guess what and how to cache • Consistency • Output: Must guess when to commit • Input: Needs mechanism to invalidate cache • Replication • Must guess what to replicate

  13. BAD-FS makes good decisions • Removes the guesswork • Scheduler has detailed workload knowledge • Storage layer allows external control • Scheduler makes informed storage decisions • Retains simplicity and elegance of DFS • Practical and deployable

  14. Outline • Introduction • Batch computing • Systems • Workloads • Environment • Why not DFS? • Our answer: BAD-FS • Design • Experimental evaluation • Conclusion

  15. Practical and deployable • User-level; requires no privilege • Packaged as a modified batch system • A new batch system which includes BAD-FS • General; will work on all batch systems • Tested thus far on multiple batch systems [Diagram: SGE compute nodes, each paired with a BAD-FS server, connected over the Internet to the home store]

  16. Contributions of BAD-FS • 1) Storage managers • 2) Batch-Aware Distributed File System • 3) Expanded job description language • 4) BAD-FS scheduler [Diagram: compute nodes run CPU managers alongside storage managers exporting BAD-FS; the BAD-FS scheduler at the home storage drives the job queue]

  17. BAD-FS knowledge • Remote cluster knowledge • Storage availability • Failure rates • Workload knowledge • Data type (batch, pipeline, or endpoint) • Data quantity • Job dependencies

  18. Control through volumes • Guaranteed storage allocations • Containers for job I/O • Scheduler • Creates volumes to cache input data • Subsequent jobs can reuse this data • Creates volumes to buffer output data • Destroys pipeline, copies endpoint • Configures workload to access containers
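The volume lifecycle the bullets describe can be sketched as a toy storage manager (a sketch only; the `Cluster` class and its method names are illustrative, not BAD-FS's actual interface):

```python
class Cluster:
    """Toy model of a remote cluster's storage manager (illustrative only)."""

    def __init__(self, capacity_mb):
        self.free_mb = capacity_mb
        self.volumes = {}

    def create_volume(self, name, size_mb):
        # Guaranteed allocation: refuse over-allocation up front
        # instead of thrashing or failing writes later.
        if size_mb > self.free_mb:
            raise MemoryError("over-allocation refused")
        self.free_mb -= size_mb
        self.volumes[name] = size_mb

    def destroy_volume(self, name):
        self.free_mb += self.volumes.pop(name)


c = Cluster(capacity_mb=1000)
c.create_volume("batch-cache", 500)    # cache input data; later jobs reuse it
c.create_volume("pipe-scratch", 200)   # buffer pipeline output locally
# ... jobs run against the volumes; endpoint output is copied home ...
c.destroy_volume("pipe-scratch")       # pipeline data is then discarded
```

The point of the guaranteed allocation is that the scheduler, not the cache replacement policy, decides what occupies remote storage.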

  19. Knowledge plus control • Enhanced performance • I/O scoping • Capacity-aware scheduling • Improved failure handling • Cost-benefit replication • Simplified implementation • No cache consistency protocol

  20. I/O scoping • Technique to minimize wide-area traffic • Allocate storage to cache batch data • Allocate storage for pipeline and endpoint data • Extract endpoint [Diagram: AMANDA jobs on compute nodes: 200 MB pipeline, 500 MB batch, 5 MB endpoint, with the BAD-FS scheduler across the Internet] Steady-state: only 5 of 705 MB traverse the wide-area.
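The savings from I/O scoping follow from simple arithmetic on the AMANDA per-job figures above (a back-of-the-envelope sketch; the per-job sizes come from the slide, the cost model is a simplification):

```python
# Per-job I/O volumes for AMANDA (MB), from the slide above.
PIPELINE_MB = 200   # intermediate data, consumed only within a pipeline
BATCH_MB = 500      # shared input dataset, identical for every job
ENDPOINT_MB = 5     # final output that must return to home storage


def wide_area_traffic(jobs, scoped):
    """Total MB crossing the wide-area link to run `jobs` jobs."""
    if scoped:
        # BAD-FS: batch data is fetched once and cached at the cluster,
        # pipeline data stays on local volumes; only endpoint output
        # (plus the one-time batch fetch) crosses the wide area.
        return BATCH_MB + jobs * ENDPOINT_MB
    # Naive remote I/O: every job pulls its batch input and pushes
    # its pipeline and endpoint data back over the wide area.
    return jobs * (BATCH_MB + PIPELINE_MB + ENDPOINT_MB)


print(wide_area_traffic(1, scoped=False))  # 705 MB per job without scoping
```

In steady state each additional scoped job adds only its 5 MB endpoint output, matching the "5 of 705 MB" figure on the slide.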

  21. Capacity-aware scheduling • Technique to avoid over-allocations • Scheduler runs only as many jobs as fit
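A minimal sketch of the admission test (hypothetical structure; it treats each job's volumes independently, whereas BAD-FS also accounts for shared batch data and traversal order, as discussed later):

```python
def schedulable(ready_jobs, free_mb):
    """Greedily admit ready jobs whose volume allocations fit in the
    cluster's free storage; defer the rest rather than risk cache
    thrashing or write failures from over-allocation."""
    admitted, deferred = [], []
    for job in ready_jobs:
        need = job["batch_mb"] + job["pipeline_mb"]  # storage the job needs
        if need <= free_mb:
            free_mb -= need
            admitted.append(job)
        else:
            deferred.append(job)
    return admitted, deferred


jobs = [{"name": "a", "batch_mb": 500, "pipeline_mb": 200},
        {"name": "b", "batch_mb": 500, "pipeline_mb": 200}]
run_now, wait = schedulable(jobs, free_mb=1000)  # only "a" fits
```

Here job "b" is deferred because admitting it would over-commit the 1000 MB of free storage.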

  22. Capacity-aware scheduling [Diagram: the batch workload DAG from slide 9, partitioned so that only as many pipelines and batch datasets as fit in available storage are active at once]

  23. Capacity-aware scheduling • 64 batch-intensive synthetic pipelines • Vary size of batch data • 16 compute nodes

  24. Improved failure handling • Scheduler understands data semantics • Data is not just a collection of bytes • Losing data is not catastrophic • Output can be regenerated by rerunning jobs • Cost-benefit replication • Replicates only data whose replication cost is cheaper than cost to rerun the job • Results in paper
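The cost-benefit test on this slide can be sketched as follows (an illustrative simplification; the paper's actual model and parameter choices may differ):

```python
def should_replicate(size_mb, bandwidth_mbps, rerun_seconds, failure_prob):
    """Replicate pipeline output only when copying it is cheaper than
    the expected cost of regenerating it by rerunning the job."""
    replication_cost = size_mb * 8 / bandwidth_mbps   # seconds to copy
    expected_rerun_cost = failure_prob * rerun_seconds
    return replication_cost < expected_rerun_cost


# 100 MB output, 100 Mbps link, hour-long job, flaky node: replicate.
# Huge output, slow link, cheap job, reliable node: just rerun on loss.
```

Because lost output is regenerable, replication is an optimization rather than a correctness requirement, which is what lets the scheduler skip it when it does not pay.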

  25. Simplified implementation • Data dependencies known • Scheduler ensures proper ordering • No need for cache consistency protocol in cooperative cache
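The ordering argument can be made concrete with a topological sort over the declared dependencies (a sketch; the job names and dependency table are invented for illustration):

```python
from graphlib import TopologicalSorter

# Job dependencies from a hypothetical workload description:
# each job maps to the set of jobs whose output it consumes.
deps = {
    "extract":  set(),
    "simulate": {"extract"},
    "analyze":  {"simulate"},
}

# Because the scheduler releases a consumer only after its producer has
# finished writing, a cached copy of a file can never be read while
# stale, so the cooperative cache needs no consistency protocol.
order = list(TopologicalSorter(deps).static_order())
```

Consistency is enforced by job ordering at the scheduler, not by invalidation traffic among the caches.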

  26. Real workloads • AMANDA • Astrophysics study of cosmic events such as gamma-ray bursts • BLAST • Biology search for proteins within a genome • CMS • Physics simulation of large particle colliders • HF • Chemistry study of non-relativistic interactions between atomic nuclei and electrons • IBIS • Ecology global-scale simulation of earth’s climate used to study effects of human activity (e.g. global warming)

  27. Real workload experience • Setup • 16 jobs • 16 compute nodes • Emulated wide-area • Configurations • Remote I/O • AFS-like with /tmp • BAD-FS • Result is an order of magnitude improvement

  28. BAD Conclusions • Existing DFS’s insufficient • Schedulers have workload knowledge • Schedulers need storage control • Caching • Consistency • Replication • Combining this control with knowledge • Enhanced performance • Improved failure handling • Simplified implementation

  29. For more information • http://www.cs.wisc.edu/adsl • http://www.cs.wisc.edu/condor • Questions?

  30. Why not BAD-scheduler and traditional DFS? • Cooperative caching • Data sharing • Traditional DFS • assumes sharing is the exception • provisions for arbitrary, unplanned sharing • In batch workloads, sharing is the rule • Sharing behavior is completely known • Data committal • Traditional DFS must guess when to commit • AFS uses close, NFS uses 30 seconds • Batch workloads precisely define when

  31. Is capacity-aware scheduling important in the real world? • Heterogeneity of remote resources • Shared disk • Workloads are changing; some are very, very large.

  32. Capacity-aware scheduling • Goal • Avoid overallocations • Cache thrashing • Write failures • Method • Breadth-first • Depth-first • Idleness
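One way to read the method bullets above (a sketch under the assumption that the choice hinges on whether the shared batch data fits alongside per-pipeline scratch space; the paper gives the actual policy):

```python
def plan(batch_mb, pipeline_mb, n_pipelines, capacity_mb):
    """Choose a traversal order that avoids over-allocation.
    Breadth-first maximizes batch-data reuse; depth-first limits how
    many pipelines hold scratch space at once; idling beats thrashing."""
    if batch_mb + n_pipelines * pipeline_mb <= capacity_mb:
        return "breadth-first"   # everything fits; advance all pipelines in step
    if batch_mb + pipeline_mb <= capacity_mb:
        return "depth-first"     # run fewer pipelines to completion at a time
    return "idle"                # wait for storage rather than over-allocate
```

With 500 MB of batch data, 200 MB of pipeline data, and 4 pipelines, 2000 MB of storage permits breadth-first execution, while 1000 MB forces depth-first.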

  33. Capacity-aware scheduling evaluation • Workload • 64 synthetic pipelines • Varied pipe size • Environment • 16 compute nodes • Configurations • Breadth-first • Depth-first • BAD-FS • Failures directly correlate with workload throughput.