High-Throughput Computing on Commodity Systems.

The Good News: Raw computing power is everywhere - on desktops, shelves, racks, and in your pockets. It is: • Cheap • Plentiful • Mass-Produced

The Bad News: GFLOP per year ≠ (GFLOP per second) × (~30,000,000 seconds/year)

A variation on a chestnut: What is a benchmark?

Answer: The throughput which your system is guaranteed never to exceed!

Why? • A community of commodity computers can be difficult to manage: • Dynamic : State and availability change over time • Evolving : New hardware and software are continuously acquired and installed • Heterogeneous : Hardware and software • Distributed ownership : Each machine has a different owner with different requirements and preferences.

Why? • Even traditionally “static” systems (such as professionally managed clusters) suffer the same problems when viewed at a yearly scale: • Power failures • Hardware failures • Software upgrades • Load imbalance • Network imbalance

How do we measure computer performance? • High-Performance Computing: • Achieve max GFLOP per second under ideal circumstances. • High-Throughput Computing: • Achieve max GFLOP per month or year under whatever conditions prevail.
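The distinction can be made concrete with a little arithmetic. The sketch below is illustrative only: the peak rates and availability fractions are made-up numbers, not measurements from any Condor pool, and the function name is hypothetical.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # 31,536,000; the slides round this to 30,000,000


def delivered_gflop_per_year(peak_gflops, availability):
    """Work delivered in a year when the machine is usable only part of the time.

    peak_gflops  -- benchmark rate under ideal conditions (GFLOP per second)
    availability -- fraction of the year the machine is actually doing your work
    """
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return peak_gflops * SECONDS_PER_YEAR * availability


# Hypothetical machines: a fast one that is rarely available to you,
# and a slower one that is available most of the time.
fast_but_busy = delivered_gflop_per_year(peak_gflops=10.0, availability=0.05)
slow_but_idle = delivered_gflop_per_year(peak_gflops=1.0, availability=0.80)

print(f"fast but busy: {fast_but_busy:.3e} GFLOP/year")
print(f"slow but idle: {slow_but_idle:.3e} GFLOP/year")
```

With these assumed numbers the slower machine delivers more work over the year, which is the sense in which a per-second benchmark says little about yearly throughput.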

High-Throughput Computing • Focuses on maximizing… • simulations run before the paper deadline… • crystal lattices per week… • reconstructions per week… • video frames rendered per year… • …without “babysitting” from the user. • Cannot depend on “ideal” circumstances.

High-Throughput Computing • Is achieved by: • Expanding the CPUs available. • Silently adapting to inevitable changes. • Robust software • Is only marginally affected by: • MB, MHz, MIPS, FLOPS… • Robust hardware

Solution: Condor • Condor is software for creating a high-throughput computing environment on a community of workstations, ranging from commodity PCs to supercomputers.

Who are we?

The Condor Project (Established ’85) Distributed systems CS research performed by a team that faces • software engineering challenges in a UNIX/Linux/NT environment, • active interaction with users and collaborators, • daily maintenance and support challenges of a distributed production environment, • and the education and training of students. Funding - NSF, NASA, DoE, DoD, IBM, Intel, Microsoft, and the UW Graduate School.

Users and collaborators • Scientists - Biochemistry, high energy physics, computer sciences, genetics, … • Engineers - Hardware design, software building and testing, animation, ... • Educators - Hardware design tools, distributed systems, networking, ...

National Grid Efforts • National Technology Grid - NCSA Alliance (NSF-PACI) • Information Power Grid - IPG (NASA) • Particle Physics Data Grid - PPDG (DoE) • Grid Physics Network - GriPhyN (NSF-ITR)

Condor CPUs on the UW Campus

Some Numbers:
UW-CS Pool 6/98-6/00         4,000,000 hours   ~450 years
“Real” Users                 1,700,000 hours   ~260 years
  CS-Optimization              610,000 hours
  CS-Architecture              350,000 hours
  Physics                      245,000 hours
  Statistics                    80,000 hours
  Engine Research Center        38,000 hours
  Math                          90,000 hours
  Civil Engineering             27,000 hours
  Business                         970 hours
“External” Users               165,000 hours   ~19 years
  MIT                           76,000 hours
  Cornell                       38,000 hours
  UCSD                          38,000 hours
  CalTech                       18,000 hours

Start slow, but think BIG

Start slow, but think big! • Personal Condor: 1 machine on your desktop • Condor Pool: 100 machines in your department • Condor-G: 1000 machines in the Grid

Start slow, but think big! • Personal Condor: • Manage just your machine with Condor. Fault tolerance, policy control, logging. Sleep soundly at night. • Condor Pool: • Take advantage of your friends and colleagues: share cycles, gain ~100x throughput. • Condor-G: • Jobs from your pool migrate to other computational facilities around the world. Gain ~1000x throughput. (Record-breaking results!)

Key Condor User Services • Local control - jobs are stored and managed locally by a personal scheduler. • Priority scheduling - execution order is controlled by a priority ranking assigned by the user. • Job preemption - re-linked jobs can be checkpointed, suspended, held, and resumed. • Local execution environment preserved - re-linked jobs can have their I/O redirected to the submission site.

More Condor User Services • Powerful and flexible means for selecting the execution site (requirements and preferences) • Logging of job activities • Management of large numbers (10K) of jobs per user • Support for jobs with dependencies - DAGMan (Directed Acyclic Graph Manager) • Support for dynamic master-worker (MW) applications (PVM and File)
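To illustrate what "jobs with dependencies" means, here is a minimal Python sketch that orders a small directed acyclic graph of jobs so that every job runs after its parents. It is not DAGMan's input format or its algorithm; the job names and the helper function are hypothetical, and the ordering relies on the standard-library graphlib module.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical workflow: each job maps to the set of jobs it depends on.
dependencies = {
    "prepare": set(),
    "simulate_a": {"prepare"},
    "simulate_b": {"prepare"},
    "collect": {"simulate_a", "simulate_b"},
}


def run_in_dependency_order(deps):
    """Return jobs in an order where every job follows all of its parents."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError if the graph is not acyclic
    order = []
    while sorter.is_active():
        ready = sorter.get_ready()   # jobs whose parents have all finished
        for job in sorted(ready):    # sorted only to make the order stable
            order.append(job)        # a real manager would submit the job here
            sorter.done(job)
    return order


order = run_in_dependency_order(dependencies)
print(order)
```

Jobs returned together by get_ready() have no dependencies on each other, so a real workflow manager could submit them concurrently.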

How does it work?

Basic HTC Mechanisms • Matchmaking - enables requests for services and offers to provide services to find each other (ClassAds). • Fault tolerance - checkpointing enables preemptive-resume scheduling (go ahead and use a resource as long as it is available!). • Remote execution - enables transparent access to resources from any machine in the world. • Asynchronicity - enables management of dynamic (opportunistic) resources.

Every Community needs a Matchmaker!

Why? Because ... someone has to bring together community members who have requests for goods and services with members who offer them. • Both sides are looking for each other • Both sides have constraints • Both sides have preferences

ClassAd - Properties
Type = "Machine";
Activity = "Idle";
KbdIdle = '00:22:31';
Disk = 2.1G; // 2.1 Gigs
Memory = 64M; // 64 Megs
State = "Unclaimed";
LoadAverage = 0.042969;
Arch = "INTEL";
OpSys = "SOLARIS251";

ClassAd - Policy
RsrchGrp = { "raman", "miron", "solomon" };
Friends = { "dilbert", "wally" };
Untrusted = { "rival", "riffraff", "TPHB" };
Tier = member(RsrchGrp, other.Owner) ? 2 : ( member(Friends, other.Owner) ? 1 : 0 );
Requirements = !member(Untrusted, other.Owner) &&
  ( Tier == 2 ? True
  : Tier == 1 ? ( LoadAvg < 0.3 && KbdIdle > '00:15' )
  : ( DayTime() < '08:00' || DayTime() > '18:00' ) );
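The bilateral idea behind these ads (both sides have constraints, and both sides have preferences) can be sketched in Python. This is not the ClassAd language or Condor's matchmaker: ads are plain dictionaries, requirements and rank are ordinary functions, and every name below is a hypothetical stand-in chosen for illustration.

```python
def matches(ad_a, ad_b):
    """Two ads match only if each side's requirements accept the other side."""
    return ad_a["requirements"](ad_a, ad_b) and ad_b["requirements"](ad_b, ad_a)


def best_match(request, offers):
    """Among offers that match bilaterally, return the one the request ranks highest."""
    candidates = [offer for offer in offers if matches(request, offer)]
    if not candidates:
        return None
    return max(candidates, key=lambda offer: request["rank"](request, offer))


# Hypothetical machine ads (offers). Attribute names loosely follow the slides.
machines = [
    {"name": "m1", "memory_mb": 64, "load_avg": 0.04,
     "untrusted": {"rival"},
     "requirements": lambda my, other: other["owner"] not in my["untrusted"]},
    {"name": "m2", "memory_mb": 256, "load_avg": 0.90,
     "untrusted": set(),
     "requirements": lambda my, other: my["load_avg"] < 0.3},
    {"name": "m3", "memory_mb": 128, "load_avg": 0.10,
     "untrusted": {"rival"},
     "requirements": lambda my, other: other["owner"] not in my["untrusted"]},
]

# Hypothetical job ad (request): needs at least 32 MB and prefers more memory.
job = {"owner": "raman",
       "requirements": lambda my, other: other["memory_mb"] >= 32,
       "rank": lambda my, other: other["memory_mb"]}

# Same job submitted by an owner that two of the machines refuse to serve.
rival_job = dict(job, owner="rival")

print(best_match(job, machines)["name"])
print(best_match(rival_job, machines))
```

The machine with the most memory (m2) is never chosen here because its own requirements reject any job while its load is high; a match needs agreement from both sides, not just the requester's preference.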

Advantages of Matchmaking • Hybrid (Centralized+Distributed) resource allocation algorithm • End-to-end verification • Bilateral specialization • Weak consistency requirements • Authentication • Fault tolerance • Incremental system evolution

Fault-Tolerance • Condor can checkpoint a program by writing its image to disk. • If a machine should fail, the program may resume from the last checkpoint. • If a job must vacate a machine, it may resume from where it left off.
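The resume-from-checkpoint behavior can be illustrated with a small Python sketch. Note the swap in technique: Condor checkpoints the image of the running program, whereas this example saves application-level state explicitly with pickle. The function names, the file name, and the stop_after parameter (which simulates being vacated) are all hypothetical.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write to a temp file, then rename, so a crash never leaves a partial checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as tmp:
        pickle.dump(state, tmp)
    os.replace(tmp_path, path)  # atomic rename within the same directory


def load_checkpoint(path, default):
    """Return the last saved state, or default if no checkpoint exists yet."""
    try:
        with open(path, "rb") as f:
            return pickle.load(f)
    except FileNotFoundError:
        return default


def sum_of_squares(n, path, stop_after=None):
    """Sum i*i for i in range(n), checkpointing after every step.

    stop_after simulates eviction: give up (return None) after that many steps.
    """
    state = load_checkpoint(path, {"next_i": 0, "total": 0})
    steps = 0
    while state["next_i"] < n:
        if stop_after is not None and steps >= stop_after:
            return None  # vacated; progress so far is on disk
        state["total"] += state["next_i"] ** 2
        state["next_i"] += 1
        save_checkpoint(path, state)
        steps += 1
    return state["total"]


checkpoint_file = os.path.join(tempfile.mkdtemp(), "job.ckpt")
first_run = sum_of_squares(10, checkpoint_file, stop_after=4)   # interrupted
second_run = sum_of_squares(10, checkpoint_file)                # resumes at i = 4
print(first_run, second_run)
```

The second call does not repeat the first four steps; it picks up from the state left on disk and finishes the computation.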

Remote Execution • Condor might run your jobs on machines spread around the world – not all of them will have your files. • Condor provides an adapter – a library – which converts your job’s I/O operations into remote I/O back to your home machine. • No matter where your job runs, it sees the same environment.

Asynchronicity • A fact of life in a system of 1000s of machines. • Power on/off • Lunch breaks • Jobs start and finish • Condor never depends on a fixed configuration; it works with whatever is available.

Does it work?

An example - NUG28 We are pleased to announce the exact solution of the nug28 quadratic assignment problem (QAP). This problem was derived from the well-known nug30 problem using the distance matrix from a 4 by 7 grid, and the flow matrix from nug30 with the last 2 facilities deleted. This is to our knowledge the largest instance from the nugxx series ever provably solved to optimality. The problem was solved using the branch-and-bound algorithm described in the paper "Solving quadratic assignment problems using convex quadratic programming relaxations," N.W. Brixius and K.M. Anstreicher. The computation was performed on a pool of workstations using the Condor high-throughput computing system in a total wall time of approximately 4 days, 8 hours. During this time the number of active worker machines averaged approximately 200. Machines from UW, UNM, and INFN all participated in the computation.

NUG30 Personal Condor …
For the run we will be flocking to
-- the main Condor pool at Wisconsin (600 processors)
-- the Condor pool at Georgia Tech (190 Linux boxes)
-- the Condor pool at UNM (40 processors)
-- the Condor pool at Columbia (16 processors)
-- the Condor pool at Northwestern (12 processors)
-- the Condor pool at NCSA (65 processors)
-- the Condor pool at INFN (200 processors)
We will be using glide_in to access the Origin 2000 (through LSF) at NCSA.
We will use "hobble_in" to access the Chiba City Linux cluster and Origin 2000 here at Argonne.

It works!!!
Date: Thu, 8 Jun 2000 22:41:00 -0500 (CDT)
From: Jeff Linderoth <linderot@mcs.anl.gov>
To: Miron Livny <miron@cs.wisc.edu>
Subject: Re: Priority
This has been a great day for metacomputing! Everything is going wonderfully. We've had over 900 machines (currently around 890), and all the pieces are working great…
Date: Fri, 9 Jun 2000 11:41:11 -0500 (CDT)
From: Jeff Linderoth <linderot@mcs.anl.gov>
Still rolling along. Over three billion nodes in about 1 day!

Up to a Point …
Date: Fri, 9 Jun 2000 14:35:11 -0500 (CDT)
From: Jeff Linderoth <linderot@mcs.anl.gov>
Hi Gang, The glory days of metacomputing are over. Our job just crashed. I watched it happen right before my very eyes. It was what I was afraid of -- they just shut down denali, and losing all of those machines at once caused other connections to time out -- and the snowball effect had bad repercussions for the Schedd.

Back in Business
Date: Fri, 9 Jun 2000 18:55:59 -0500 (CDT)
From: Jeff Linderoth <linderot@mcs.anl.gov>
Hi Gang, We are back up and running. And, yes, it took me all afternoon to get it going again. There was a (brand new) bug in the QAP "read checkpoint" information that was making the master coredump. (Only with optimization level -O4). I was nearly reduced to tears, but with some supportive words from Jean-Pierre, I made it through.

The First 600K seconds …

We made it!!!
Sender: goux@dantec.ece.nwu.edu
Subject: Re: Let the festivities begin.
Hi dear Condor Team, you all have been amazing. NUG30 required 10.9 years of Condor Time. In just seven days! More stats tomorrow!!! We are off celebrating! condor rules! cheers, JP.
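A back-of-the-envelope check of what those two figures imply, written as a short Python calculation. It assumes "seven days" means exactly 7 days of wall time and uses a 365-day year; both are simplifications, so the result is only a rough average.

```python
cpu_years = 10.9        # "10.9 years of Condor Time" from the message above
wall_days = 7.0         # "seven days", taken as exactly 7 days (assumption)
days_per_year = 365.0   # assumption; leap years ignored

# Average number of machines that must have been working at any given moment.
average_workers = cpu_years * days_per_year / wall_days
print(f"average concurrent workers: about {average_workers:.0f}")
```

Under these assumptions the run averaged several hundred machines working at once, which is compatible with the peak of over 900 machines reported earlier in the run.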

Condor: High Throughput Computing. Do not be picky, be agile!!!
