UC Berkeley
Cloud Computing: Past, Present, and Future
Professor Anthony D. Joseph*, UC Berkeley, Reliable Adaptive Distributed Systems Lab
RWTH Aachen, 22 March 2010
http://abovetheclouds.cs.berkeley.edu/
*Director, Intel Research Berkeley


Presentation Transcript


  1. UC Berkeley Cloud Computing: Past, Present, and Future Professor Anthony D. Joseph*, UC Berkeley, Reliable Adaptive Distributed Systems Lab RWTH Aachen 22 March 2010 http://abovetheclouds.cs.berkeley.edu/ *Director, Intel Research Berkeley

  2. RAD Lab 5-year Mission Enable 1 person to develop, deploy, operate next-generation Internet application • Key enabling technology: Statistical machine learning • debugging, monitoring, power management, auto-configuration, performance prediction, ... • Highly interdisciplinary faculty & students • PIs: Patterson/Fox/Katz (systems/networks), Jordan (machine learning), Stoica (networks & P2P), Joseph (security), Shenker (networks), Franklin (DB) • 2 postdocs, ~30 PhD students, ~6 undergrads • Grad/undergrad teaching integrated with research

  3. Course Timeline • Friday • 10:00-12:00 History of Cloud Computing: Time-sharing, virtual machines, datacenter architectures, utility computing • 12:00-13:30 Lunch • 13:30-15:00 Modern Cloud Computing: economics, elasticity, failures • 15:00-15:30 Break • 15:30-17:00 Cloud Computing Infrastructure: networking, storage, computation models • Monday • 10:00-12:00 Cloud Computing research topics: scheduling, multiple datacenters, testbeds

  4. Nexus: A common substrate for cluster computing Joint work with Benjamin Hindman, Andy Konwinski, Matei Zaharia, Ali Ghodsi, Scott Shenker, and Ion Stoica

  5. Recall: Hadoop on HDFS [architecture diagram: a namenode running the namenode daemon, a job submission node running the jobtracker, and slave nodes each running a tasktracker and a datanode daemon on top of the local Linux file system] Adapted from slides by Jimmy Lin, Christophe Bisciglia, Aaron Kimball, & Sierra Michels-Slettvet, Google Distributed Computing Seminar, 2007 (licensed under the Creative Commons Attribution 3.0 License)

  6. Problem • Rapid innovation in cluster computing frameworks • No single framework optimal for all applications • Energy efficiency means maximizing cluster utilization • Want to run multiple frameworks in a single cluster

  7. What do we want to run in the cluster? Pregel, Apache Hama, Dryad, Pig

  8. Why share the cluster between frameworks? • Better utilization and efficiency (e.g., take advantage of diurnal patterns) • Better data sharing across frameworks and applications

  9. Solution Nexus is an “operating system” for the cluster over which diverse frameworks can run • Nexus multiplexes resources between frameworks • Frameworks control job execution

  10. Goals • Scalable • Robust (i.e., simple enough to harden) • Flexible enough for a variety of different cluster frameworks • Extensible enough to encourage innovative future frameworks

  11. Question 1: Granularity of Sharing Option: Coarse-grained sharing • Give framework a (slice of) machine for its entire duration • Data locality compromised if machine held for long time • Hard to account for new frameworks and changing demands -> hurts utilization and interactivity [diagram: the cluster statically partitioned into separate slices for Hadoop 1, Hadoop 2, and Hadoop 3]

  12. Question 1: Granularity of Sharing Nexus: Fine-grained sharing • Support frameworks that use smaller tasks (in time and space) by multiplexing them across all available resources • Frameworks can take turns accessing data on each node • Can resize frameworks' shares to get utilization & interactivity [diagram: tasks from Hadoop 1, Hadoop 2, and Hadoop 3 interleaved across every node in the cluster]

  13. Question 2: Resource Allocation Option: Global scheduler • Frameworks express needs in a specification language, and a global scheduler matches resources to frameworks • Requires encoding a framework's semantics in the language, which is complex and can lead to ambiguities • Restricts frameworks whose needs the specification language did not anticipate • Takeaway: designing a general-purpose global scheduler is hard

  14. Question 2: Resource Allocation Nexus: Resource offers • Offer free resources to frameworks, let frameworks pick which resources best suit their needs • Keeps Nexus simple and allows us to support future jobs • Distributed decisions might not be optimal

  15. Outline • Nexus Architecture • Resource Allocation • Multi-Resource Fairness • Implementation • Results

  16. Nexus Architecture

  17. Overview [architecture diagram: Hadoop v19, Hadoop v20, and MPI jobs each register a framework scheduler (Hadoop v19 scheduler, Hadoop v20 scheduler, MPI scheduler) with the Nexus master; the Nexus master manages Nexus slaves, each of which runs framework executors (Hadoop v19 executor, Hadoop v20 executor, MPI executor) that execute the frameworks' tasks]

  18. Resource Offers [diagram: the Nexus master picks a framework to offer to and sends it a resource offer; the MPI scheduler and Hadoop scheduler are registered for their jobs, and MPI executors are already running tasks on the Nexus slaves]

  19. Resource Offers offer = list of {machine, free_resources} Example: [ {node 1, <2 CPUs, 4 GB>}, {node 2, <2 CPUs, 4 GB>} ] [diagram: as before, the Nexus master picks a framework to offer to and sends it the resource offer]
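A rough way to picture this offer structure (illustrative Python only; the field names and types here are assumptions, not the actual Nexus API):

```python
# Illustrative model of a resource offer: a list of {machine, free_resources}
# entries mirroring the example on the slide. Field names are assumed.
offer = [
    {"machine": "node1", "free_resources": {"cpus": 2, "ram_gb": 4}},
    {"machine": "node2", "free_resources": {"cpus": 2, "ram_gb": 4}},
]

# A framework scheduler inspects the offer and accepts only what suits it,
# e.g. one CPU and 2 GB on node1, implicitly declining the rest.
accepted = [{"machine": "node1", "resources": {"cpus": 1, "ram_gb": 2}}]
print(accepted)
```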

  20. Resource Offers [diagram: the framework's scheduler performs framework-specific scheduling and replies with a task to launch; the Nexus master launches & isolates executors on the Nexus slaves (here a Hadoop executor is started alongside the MPI executors)]

  21. Resource Offer Details • Min and max task sizes to control fragmentation • Filters let framework restrict offers sent to it • By machine list • By quantity of resources • Timeouts can be added to filters • Frameworks can signal when to destroy filters, or when they want more offers
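One plausible reading of these filters, sketched below: a filter withholds offers that are not on the framework's machine list or fall below a minimum resource quantity, until an optional timeout expires. The class and method names are invented for illustration and are not the Nexus API.

```python
import time

class OfferFilter:
    """Illustrative offer filter: restrict offers by machine list or by a
    minimum quantity of resources, with an optional timeout (assumed semantics)."""

    def __init__(self, machines=None, min_resources=None, timeout=None):
        self.machines = set(machines) if machines else None
        self.min_resources = min_resources or {}
        self.expires_at = time.time() + timeout if timeout else None

    def expired(self):
        return self.expires_at is not None and time.time() > self.expires_at

    def suppresses(self, machine, free):
        """True if an offer of `free` resources on `machine` should be withheld."""
        if self.expired():
            return False  # expired filters no longer restrict anything
        if self.machines is not None and machine not in self.machines:
            return True   # framework only wants offers on listed machines
        # Suppress offers smaller than the requested minimum quantities.
        return any(free.get(r, 0) < need for r, need in self.min_resources.items())

# Example: only accept offers with at least 4 CPUs for the next 60 seconds.
f = OfferFilter(min_resources={"cpus": 4}, timeout=60)
print(f.suppresses("node1", {"cpus": 2, "ram_gb": 8}))  # True: too few CPUs
```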

  22. Using Offers for Data Locality We found that a simple policy called delay scheduling can give very high locality: • Framework waits for offers on nodes that have its data • If waited longer than a certain delay, starts launching non-local tasks
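The delay-scheduling idea above fits in a few lines; the sketch below is a simplification with invented names (Job, respond_to_offer, LOCALITY_DELAY), not the actual Nexus or Hadoop code.

```python
import time

LOCALITY_DELAY = 5.0  # seconds to hold out for a data-local offer (assumed value)

class Job:
    def __init__(self, preferred_nodes):
        self.preferred_nodes = set(preferred_nodes)  # nodes holding this job's data
        self.waiting_since = time.time()

    def respond_to_offer(self, offer):
        """Return a node to launch a task on, or None to decline.

        `offer` is assumed to be a list of (node, free_resources) pairs.
        """
        local = [node for node, _ in offer if node in self.preferred_nodes]
        if local:
            self.waiting_since = time.time()
            return local[0]                          # data-local launch
        if time.time() - self.waiting_since > LOCALITY_DELAY:
            return offer[0][0] if offer else None    # waited long enough: go non-local
        return None                                  # decline and keep waiting

# Example: a job whose data lives on node1/node2 initially declines node3.
job = Job(preferred_nodes=["node1", "node2"])
print(job.respond_to_offer([("node3", {"cpus": 2, "ram_gb": 4})]))  # None at first
```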

  23. Framework Isolation • Isolation mechanism is pluggable due to the inherent performance/isolation tradeoff • Current implementation supports Solaris projects and Linux containers • Both isolate CPU, memory and network bandwidth • Linux developers working on disk I/O isolation • Other options: VMs, Solaris zones, policing

  24. Resource Allocation

  25. Allocation Policies • Nexus picks framework to offer resources to, and hence controls how many resources each framework can get (but not which) • Allocation policies are pluggable to suit organization needs, through allocation modules
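As a rough sketch of what a pluggable allocation module could look like (the interface and names below are assumptions for illustration, not the real Nexus plugin API): Nexus asks the module which framework should receive the next offer of free resources.

```python
from abc import ABC, abstractmethod

class AllocationModule(ABC):
    """Assumed plugin interface: decide who gets the next resource offer."""

    @abstractmethod
    def framework_to_offer(self, frameworks, free_resources):
        """Return the framework that should receive the next resource offer."""

class FairShareAllocator(AllocationModule):
    """Example policy: offer to the framework currently holding the smallest share."""

    def framework_to_offer(self, frameworks, free_resources):
        return min(frameworks, key=lambda f: f["share"])

# Example: three registered frameworks with their current cluster shares.
frameworks = [{"name": "hadoop", "share": 0.5},
              {"name": "mpi", "share": 0.2},
              {"name": "spark", "share": 0.3}]
print(FairShareAllocator().framework_to_offer(frameworks, {"cpus": 4})["name"])  # mpi
```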

  26. Example: Hierarchical Fairshare Policy [diagram: a hierarchical share tree for Facebook.com: the cluster share is divided between the Ads and Spam groups; each group's share is divided among its users (User 1, User 2); each user's share is divided among its jobs (Job 1 through Job 4); bar charts show the shares at the current time]

  27. Revocation Killing tasks to make room for other users Not the normal case, because fine-grained tasks enable quick reallocation of resources Sometimes necessary: • Long-running tasks that never relinquish resources • A buggy job running forever • A greedy user who decides to make his tasks long

  28. Revocation Mechanism Allocation policy defines a safe share for each user • Users will get at least their safe share within a specified time Revoke only if a user is below its safe share and is interested in offers • Revoke tasks from the users farthest above their safe share • A framework is warned before its task is killed
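A minimal sketch of this mechanism, under the stated policy (revoke only on behalf of a user below its safe share, take tasks from the user farthest above its own safe share, and warn the framework first); the User class and callbacks are invented for illustration.

```python
class User:
    def __init__(self, name, safe_share, current_share, tasks):
        self.name = name
        self.safe_share = safe_share        # guaranteed fraction of the cluster
        self.current_share = current_share  # fraction currently held
        self.tasks = list(tasks)            # running tasks, revoked newest-first here

def revoke_for(needy, users, warn, kill):
    """Free resources for `needy`, who is below its safe share and wants offers."""
    if needy.current_share >= needy.safe_share:
        return  # only revoke on behalf of users below their safe share
    over = [u for u in users if u.current_share > u.safe_share and u.tasks]
    if not over:
        return
    # Pick the user farthest above its own safe share and revoke one task.
    victim = max(over, key=lambda u: u.current_share - u.safe_share)
    task = victim.tasks.pop()
    warn(victim, task)   # framework is warned before its task is killed
    kill(victim, task)

# Example with stub callbacks:
users = [User("ads", 0.4, 0.7, ["t1", "t2"]), User("spam", 0.4, 0.1, [])]
revoke_for(users[1], users,
           warn=lambda u, t: print("warning", u.name, "about", t),
           kill=lambda u, t: print("killing", t, "of", u.name))
```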

  29. How Do We Run MPI? Users are always told their safe share • Avoid revocation by staying below it Giving each user a small safe share may not be enough if jobs need many machines Can run a traditional grid or HPC scheduler as a user with a larger safe share of the cluster, and have MPI jobs queue up on it • E.g., Torque gets 40% of the cluster

  30. Example: Torque on Nexus [diagram: the Facebook.com cluster share is split among Torque (safe share = 40%), the Ads group, and the Spam group; MPI jobs queue up on Torque, while the Ads and Spam users run their own jobs]

  31. Multi-Resource Fairness

  32. What is Fair? • Goal: define a fair allocation of resources in the cluster between multiple users • Example: suppose we have: • 30 CPUs and 30 GB RAM • Two users with equal shares • User 1 needs <1 CPU, 1 GB RAM> per task • User 2 needs <1 CPU, 3 GB RAM> per task • What is a fair allocation?

  33. Definition 1: Asset Fairness • Idea: give weights to resources (e.g., 1 CPU = 1 GB) and equalize the value of the resources given to each user • Algorithm: when resources are free, offer them to whoever has the least value • Result: • U1: 12 tasks: 12 CPUs, 12 GB ($24) • U2: 6 tasks: 6 CPUs, 18 GB ($24) • Problem: User 1 has < 50% of both CPUs and RAM [chart: each user's resulting share of CPU and RAM]
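For completeness, the arithmetic behind the 12-task / 6-task split, under the slide's weighting that 1 CPU and 1 GB each count as $1 (the variables x and y are introduced here only for the derivation):

```latex
\begin{align*}
&\text{Let } x = \text{tasks of user 1}, \quad y = \text{tasks of user 2}.\\
&\text{Value received: } V_1 = x(1+1) = 2x, \qquad V_2 = y(1+3) = 4y.\\
&\text{Equal value: } 2x = 4y \;\Rightarrow\; x = 2y.\\
&\text{CPU: } x + y \le 30 \;\Rightarrow\; 3y \le 30, \qquad
 \text{RAM: } x + 3y \le 30 \;\Rightarrow\; 5y \le 30.\\
&\text{RAM binds first: } y = 6,\; x = 12 \;\Rightarrow\; V_1 = V_2 = \$24.
\end{align*}
```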

  34. Lessons from Definition 1 • “You shouldn’t do worse than if you ran a smaller, private cluster equal in size to your share” • Thus, given N users, each user should get ≥ 1/N of his dominant resource (i.e., the resource that he consumes most of)

  35. Def. 2: Dominant Resource Fairness • Idea: give every user an equal share of her dominant resource (i.e., the resource she consumes most of) • Algorithm: when resources are free, offer them to the user with the smallest dominant share (i.e., the fractional share of her dominant resource) • Result: • U1: 15 tasks: 15 CPUs, 15 GB • U2: 5 tasks: 5 CPUs, 15 GB [chart: each user's resulting share of CPU and RAM under DRF]
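A minimal sketch of this allocation rule, run on the slide's example (30 CPUs, 30 GB; user 1 needs <1 CPU, 1 GB> per task, user 2 needs <1 CPU, 3 GB>). The greedy one-task-at-a-time loop and the names below are illustrative simplifications, not the Nexus allocation module; they do reproduce the 15-task / 5-task result.

```python
# Dominant Resource Fairness sketch on the slide's example.
TOTAL = {"cpu": 30.0, "ram": 30.0}

users = {
    "user1": {"demand": {"cpu": 1.0, "ram": 1.0}, "alloc": {"cpu": 0.0, "ram": 0.0}},
    "user2": {"demand": {"cpu": 1.0, "ram": 3.0}, "alloc": {"cpu": 0.0, "ram": 0.0}},
}

def dominant_share(user):
    # Fraction of the cluster held of the resource the user has most of.
    return max(user["alloc"][r] / TOTAL[r] for r in TOTAL)

def fits(user, free):
    return all(user["demand"][r] <= free[r] for r in TOTAL)

free = dict(TOTAL)
while True:
    # Offer the next task's resources to the user with the smallest dominant share.
    candidates = [u for u in users.values() if fits(u, free)]
    if not candidates:
        break
    user = min(candidates, key=dominant_share)
    for r in TOTAL:
        user["alloc"][r] += user["demand"][r]
        free[r] -= user["demand"][r]

for name, u in users.items():
    tasks = int(u["alloc"]["cpu"] / u["demand"]["cpu"])
    print(name, tasks, "tasks:", u["alloc"])
# Expected: user1 15 tasks (15 CPUs, 15 GB), user2 5 tasks (5 CPUs, 15 GB)
```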

  36. Fairness Properties

  37. Implementation

  38. Implementation Stats • 7000 lines of C++ • APIs in C, C++, Java, Python, Ruby • Executor isolation using Linux containers and Solaris projects

  39. Frameworks Ported frameworks: • Hadoop (900-line patch) • MPI (160-line wrapper scripts) New frameworks: • Spark, Scala framework for iterative jobs (1300 lines) • Apache+haproxy, elastic web server farm (200 lines)

  40. Results

  41. Overhead Less than 4% seen in practice

  42. Dynamic Resource Sharing

  43. Multiple Hadoops Experiment [diagram: the coarse-grained baseline, with the cluster statically partitioned among Hadoop 1, Hadoop 2, and Hadoop 3]

  44. Multiple Hadoops Experiment [diagram: fine-grained sharing under Nexus, with tasks from Hadoop 1, Hadoop 2, and Hadoop 3 interleaved across the cluster's nodes]

  45. Results with 16 Hadoops

  46. Web Server Farm Framework

  47. Web Framework Experiment [diagram: httperf generates HTTP requests; a load-gen framework and a web-farm scheduler (haproxy) perform load calculation and exchange resource offers and status updates with the Nexus master; each Nexus slave runs a load-gen executor and a web executor (Apache), each with its tasks]

  48. Web Framework Results

  49. Future Work • Experiment with parallel programming models • Further explore low-latency services on Nexus (web applications, etc.) • Shared services (e.g., BigTable, GFS) • Deploy to users and open source

  50. Cloud Computing Testbeds
