  1. Scheduling & Resource Management in Distributed Systems Rajesh Rajamani, raj@cs.wisc.edu http://www.cs.wisc.edu/condor May 2001

  2. Outline
  • High-throughput computing and Condor
  • Resource management in distributed systems
  • Matchmaking
  • Current research/Misc.

  3. Power of Computing Environments
  • Power = Work / Time
  • High Performance Computing
    • Fixed amount of work; how much time?
    • Traditional performance metrics: FLOPS, MIPS
    • Response time/latency oriented
  • High Throughput Computing
    • Fixed amount of time; how much work?
    • Application-specific performance metrics
    • Throughput oriented

  4. In other words …
  • HPC - enormous amounts of computing power over relatively short periods of time
    (+) Good for applications under sharp time constraints
  • HTC - large amounts of computing power for lengthy periods
    (+) What if you want to simulate 1000 applications on your latest DSP chip design over the next 3 months? (see the back-of-envelope sketch below)
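
  A back-of-envelope illustration of the HTC mindset (the 5 CPU-hours per simulation is a made-up figure, not from the talk):

    1000 simulations x 5 CPU-hours each   =  5,000 CPU-hours of work
    3 months of wall-clock time           ~  2,200 hours
    average CPUs needed                   =  5,000 / 2,200  ~  2-3

  So the idle cycles of even a modest pool of workstations are enough; what matters is sustained throughput over the whole period, not peak speed on any single day.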

  5. The Condor Project
  • Goal - to develop, implement, deploy, and evaluate mechanisms and policies that support High Throughput Computing (HTC) on large collections of distributively owned computing resources

  6. More about Condor
  • Started in the late 80s
  • Principal Investigator - Prof. Miron Livny
  • Latest version 6.3.0 released
  • Supports 14 different platforms (OS + Arch), including Linux, Solaris and WinNT
  • Currently employs over 20 students and 5 staff
  • We write code, debug, port, publish papers and YES, we also provide support!!!

  7. Distributed ownership of resources
  • Underutilized - 70% of CPU cycles in a cluster go to waste
  • Fragmented - resources owned by different people
  • Use these resources to provide HTC, BUT without impacting the QoS available to the owner
  • Achieved by allowing the resource owner to set an access policy using control expressions

  8. Access policy
  The policy can depend on (a sample policy expression follows below):
  • The current state of the resource (e.g., keyboard idle for 15 minutes, or load average less than 0.2)
  • Characteristics of the request (e.g., run only jobs of research associates)
  • The time of day/night that jobs can be run
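
  In Condor such a policy is written as an expression over classad attributes; here is a minimal sketch of an owner's START expression (the thresholds are illustrative, and ResearchGrp is assumed to be defined as in the workstation classad shown later in the talk):

    // Illustrative owner policy - run jobs only when the owner is away,
    // the machine is lightly loaded, and the customer is a research associate.
    START = (KeyboardIdle > 15 * 60)              // keyboard idle for 15 minutes
            && (LoadAvg < 0.2)                    // load average below 0.2
            && member(other.Owner, ResearchGrp)   // only research associates' jobs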

  9. What happens when you submit a job
    1. The user submits a job (described in a submit file; see the sketch below).
    2. The submitting machine sends the classad of the job to the Central Manager; available resources announce their properties to it periodically.
    3. The Matchmaker notifies both parties of a match.
    4. The parties negotiate with each other directly.
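
  For step 1, the job is typically described in a submit file handed to condor_submit. A minimal sketch, reusing the run_sim job from the classad examples later in the talk (file names and requirements are illustrative):

    # sketch of a submit description for the run_sim job (illustrative)
    executable   = run_sim
    requirements = (OpSys == "Solaris251") && (Memory >= 31)
    rank         = Kflops
    log          = run_sim.log
    queue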

  10. Important Mechanisms

  11. Condor Architecture
  • Manager
    • Collector: database of resources
    • Negotiator: matchmaker
    • Accountant: priority maintenance
  • Startds (represent owners of resources)
    • Implement the owner's access control policy
  • Schedds (represent customers of the system)
    • Maintain persistent queues of resource requests

  12. Condor Architecture, cont.

  13. Power of Condor
  • Solved the NUG30 quadratic assignment problem (posed in 1968) over a period of 6.9 days, delivering over 96,000 CPU-hours by commandeering an average of 650 machines!!!
  • Compare this with the RSA-155 problem, posed in 1977 and solved in the late 90s using 300 computers over a period of 7 months. With the same amount of resources as was used for NUG30, it could have been done in 2 weeks!!!
  • "It (Chorus production) was done in parallel on machines in the computer center running XXX, and on the office machines under Condor. The latter did about 90% of the work!" - Helge MEINHARD (EP division, CERN)

  14. Resource management using Matchmaking
  • Opportunistic resource exploitation
    • Resource availability is unpredictable
    • Exploit resources as soon as they become available
    • Matchmaking is performed continuously
  • Contrast this with a centralized scheduler, which would have to deal with:
    • Heterogeneity of resources
    • Distributed ownership - widely varying allocation policies
    • The dynamic nature of the cluster

  15. Classified Advertisements
  • A simple language used by resource providers and customers to express their properties and requirements to the Collector
  • Uses a semi-structured data model, so no specific schema is required by the matchmaker, allowing it to work naturally in a heterogeneous environment (illustrated by the two ads below)
  • The language folds the query language into the data model: constraints may be expressed as attributes of the classad
  • Advertisers should conform to an advertising protocol
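
  For example (both ads below are hypothetical, made up to illustrate the schema-free model): a compute node and a software license server can advertise completely different attribute sets, and the matchmaker handles either without a predeclared schema:

    [ Type = "Machine";  Arch = "INTEL";  Kflops = 21893;  Memory = 64 ]

    [ Type = "License";  App = "sim_tool";  FreeLicenses = 3;   // hypothetical license-server ad
      Constraint = other.Cmd == "run_sim" ]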

  16. Matchmaking with Classads
  • Four steps to managing resources:
    • Parties requiring matchmaking advertise their characteristics, preferences, constraints, etc.
    • Advertisements are matched by a Matchmaker
    • Matched entities are notified
    • Matched entities establish an allocation through a claiming process, which could include authentication, constraint verification, negotiation of terms, etc.
  • The method is symmetric

  17. Classad example

  Sample classad of a workstation:
  [
    Type       = "Machine";
    OpSys      = "Linux";
    Arch       = "INTEL";
    Memory     = 256;      // MB
    Constraint = true;
  ]

  Sample classad of a job:
  [
    Type       = "Job";
    Owner      = "run_sim";
    Constraint = other.Type == "Machine" && Arch == "INTEL" &&
                 OpSys == "Solaris251" && other.Memory >= Memory;
  ]

  18. Example classad (workstation)
  [
    Type     = "Machine";
    Activity = "Idle";
    Name     = "crow.cs.wisc.edu";
    Arch     = "INTEL";
    OpSys    = "Solaris251";
    Kflops   = 21893;
    Memory   = 64;
    Disk     = 323496;     // KB
    DayTime  = 36107;

  19. Example classad (contd.)
    ResearchGrp = { "miron", "thain", "john" };
    Untrusted   = { "bgates", "lalooyadav", "thief" };
    Rank        = member(other.Owner, ResearchGrp) * 10;
    Constraint  = !member(other.Owner, Untrusted) && Rank >= 10 ? true : false;
                  // to prevent malicious users
  ]

  20. Example classad (submitted job)
  [
    Type       = "Job";
    QDate      = 886799469;
    Owner      = "raman";
    Cmd        = run_sim;
    Iwd        = /usr/raman/sim2;
    Memory     = 31;
    Rank       = Kflops / 1e3 + other.Memory / 32;
    Constraint = other.Type == "Machine" && OpSys == "Solaris251" &&
                 Disk >= 10000 && other.Memory >= self.Memory;
  ]

  21. Matchmaking
  • Evaluates expressions in an environment that allows each classad to access attributes of the other, e.g. other.Memory >= self.Memory
  • A reference to a non-existent attribute evaluates to undefined
  • Pairs of ads are considered incompatible unless their Constraint expressions both evaluate to true
  • Rank is then used to choose among compatible matches (see the worked example below)
  • Both parties are notified about the match; the matchmaker could generate and hand off a session key for authentication and security
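
  A worked illustration, evaluating the example job ad (slide 20) against the example workstation ad (slides 18-19) under these rules (unqualified names missing from one ad are resolved in the other; the arithmetic is spelled out for clarity):

    // Job's Constraint and Rank, evaluated against the machine ad:
    //   other.Type == "Machine"       -> true
    //   OpSys == "Solaris251"         -> true   (resolved in the machine ad)
    //   Disk >= 10000                 -> 323496 >= 10000 -> true
    //   other.Memory >= self.Memory   -> 64 >= 31 -> true
    //   Rank = Kflops/1e3 + other.Memory/32 = 21893/1e3 + 64/32 = 23.9
    //
    // Machine's Constraint, evaluated against the job ad:
    //   member("raman", Untrusted)    -> false, so !member(...) -> true
    //   Rank = member("raman", ResearchGrp) * 10 -> 0 (owner not in ResearchGrp),
    //   so Rank >= 10 -> false
    //   => the machine's Constraint fails; this particular pair would not match
    //      unless the job's owner were in ResearchGrp (Rank would then be 10).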

  22. Separation of Matching and Claiming
  • Weak consistency requirements - claiming allows the provider and customer to verify their constraints with respect to their current state
  • The claiming protocol could use cryptographic techniques (authentication)
  • The principals involved in a match are themselves responsible for establishing, maintaining and servicing the match

  23. Work outside the Condor kernel - new challenges
  • Multilateral matchmaking - Gangmatching
  • I/O regulation and disk allocation - Kangaroo
  • User interfaces - ClassadView
  • Grid applications - Globus
  • Security

  24. Summary
  • Matchmaking provides a scalable and robust resource management solution for HTC environments
  • Classads are used by both workstations and jobs
  • The Matchmaker forms the match and informs the parties, who in turn invoke the claiming protocol
  • The parties are responsible for establishing, maintaining and servicing a match
  • Questions?

  25. Gangmatch request
  [
    Type  = "Job";
    Owner = "raj";
    Cmd   = run_sim;
    Ports =
    {
      [ Label = "cpu";
        ImageSize = 28 M
        // Rank and constraints
      ],
      [ Label = "License";
        Host = cpu.Name      // refers to the machine matched to the "cpu" port
        // Rank and constraints
      ]
    }
  ]
