270 likes | 286 Vues
Explore the challenges and benefits of comparing resource sharing frameworks like Hadoop, MPI, and Mesos. Learn about granularity, interference, and allocation strategies to optimize resource utilization and avoid conflicts. Discover the importance of fine-grained sharing for improved performance and scalability. Assess the design elements of these frameworks to meet high resource utilization, diverse framework support, and reliability requirements. Dive into Mesos' architecture and Omega's scalable scheduling benefits in cluster environments.
E N D
Why static is bad! Today: static partitioning Want dynamic sharing Hadoop Pregel Shared cluster MPI
Comparing Sharing Frameworks: choice • Choice of resources • Can a framework pick between all resources? • A predefined subset? • Or a random chosen subset? • Why important? • Policies may need to be global --localization • If you can preempt you can get your preference
Comparing Sharing Frameworks: Interference • Can frameworks tray to use the same machines? • Can a framework pick between all resources? • How to avoid this? • Offer resources to machines one at a time • Statically partition • Offer in parallel and arbitrate when conflict arises.
Comparing Sharing Frameworks: Granularity • Allocation Granularity • MPI tasks: gang-schedule, job can’t run until all slots are acquired. • Hadoop: elastic, job can start running when it allocates a few slots • Why important? • If gang-scheduling, then the framework will hoard until it gets all the slots it needs. • The cluster may or may not be underutilized. • Cluster-wide behaviors
Other Benefits of Mesos • Run multiple instances of the same framework • Isolate production and experimental jobs • Run multiple versions of the framework concurrently • Build specialized frameworks targeting particular problem domains • Better performance than general-purpose abstractions
Goals • High utilization of resources • Support diverse frameworks (current & future) • Scalability to 10,000’s of nodes • Reliability in face of failures Resulting design: Small microkernel-like core that pushes scheduling logic to frameworks
Design Elements • Fine-grained sharing: • Allocation at the level of tasks within a job • Improves utilization, latency, and data locality • Resource offers: • Simple, scalable application-controlled scheduling mechanism
Element 1: Fine-Grained Sharing Coarse-Grained Sharing (HPC): Fine-Grained Sharing (Mesos): Fw. 3 Fw. 3 Fw. 2 Fw. 1 Framework 1 Fw. 1 Fw. 2 Fw. 2 Fw. 1 Fw. 3 Fw. 2 Fw. 1 Framework 2 Fw. 3 Fw. 3 Fw. 2 Fw. 2 Fw. 2 Fw. 3 Framework 3 Fw. 1 Fw. 3 Fw. 1 Fw. 2 Storage System (e.g. HDFS) Storage System (e.g. HDFS) + Improved utilization, responsiveness, data locality
Element 2: Resource Offers • Option: Global scheduler • Frameworks express needs in a specification language, global scheduler matches them to resources + Can make optimal decisions • – Complex: language must support all framework needs – Difficult to scale and to make robust – Future frameworks may have unanticipated needs
Element 2: Resource Offers • Mesos: Resource offers • Offer available resources to frameworks, let them pick which resources to use and which tasks to launch • Keeps Mesos simple, lets it support future frameworks • Decentralized decisions might not be optimal
Mesos Architecture MPI job Hadoop job MPI scheduler Hadoop scheduler Pick framework to offer resources to Mesos master Allocation module Resource offer Mesos slave Mesos slave MPI executor MPI executor task task
Mesos Architecture MPI job Hadoop job MPI scheduler Hadoop scheduler Resource offer = list of (node, availableResources) E.g. { (node1, <2 CPUs, 4 GB>), (node2, <3 CPUs, 2 GB>) } Pick framework to offer resources to Mesos master Allocation module Resource offer Mesos slave Mesos slave MPI executor MPI executor task task
Mesos Architecture MPI job Hadoop job Framework-specific scheduling MPI scheduler Hadoop scheduler task Pick framework to offer resources to Mesos master Allocation module Resource offer Launches and isolates executors Mesos slave Mesos slave MPI executor Hadoop executor MPI executor task task
Drawbacks • Poor fairness • Jobs with long tasks can dominate • There is NO preemption!! • Sticky slots • Jobs with higher priority can dominate a set of preferred slots • Mesos uses lottery scheduling, probability of being offered a slot is proportional to the frameworks priority • Head of line blocking • Mesos offers resources one framework at a time • Prevents frameworks from trying to use the same slots • Based on assumptions: scheduling decisions are quick, • Mesos revokes offers if a schedules takes too long • Essentially leads to a queue
Omega • Scales • Central layer only does optimistic conflict resolution • No head of Line blocking • Allows for flexible and evolvable scheduling • Framework can implement any arbitrary form of scheduling • Each framework has global view • Frameworks can preempt each other
Comparing Sharing Frameworks • Choice of resources • Interference • Allocation Granularity • Cluster-wide behaviors