Presentation Transcript


  1. The Computational Grid: Aggregating Performance and Enhanced Capability from Federated Resources Rich Wolski University of California, Santa Barbara

  2. The Goal • To provide a seamless, ubiquitous, and high-performance computing environment using a heterogeneous collection of networked computers. • But there won’t be one, big, uniform system • Resources must be able to come and go dynamically • The base system software supported by each resource must remain inviolate • Multiple languages and programming paradigms must be supported • The environment must be secure • Programs must run fast • For distributed computing…The Holy Grail++

  3. For Example: Rich’s Computational World • [Map of collaborating sites: umich.edu, wisc.edu, ameslab.gov, osc.edu, harvard.edu, anl.gov, ncsa.edu, wellesley.edu, ksu.edu, uiuc.edu, lbl.gov, indiana.edu, virginia.edu, ncni.net, utk.edu, ucsb.edu, titech.jp, isi.edu, vu.nl, csun.edu, caltech.edu, utexas.edu, ucsd.edu, npaci.edu, rice.edu]

  4. Zoom In • [Diagram: UCSB desktops connected across the Internet to SDSC resources, including an IBM SP, HPSS, Sun servers, and a Cray T3E]

  5. The Landscape • Heterogeneous • Processors: X86, SPARC, RS6000, Alpha, MIPS, PowerPC, Cray • Networks: GigE, Myrinet, 100baseT, ATM • OS: Linux, Solaris, AIX, Unicos, OSX, NT, Windows • Dynamically changing • Completely dedicated access is impossible => contention • Failures, upgrades, reconfigurations, etc. • Federated • Local administrative policies take precedence • Performance?

  6. The Computational Grid • Vision: Application programs “plug” into the system to draw computational “power” from a dynamically changing pool of resources. • Electrical Power Grid analogy • Power generation facilities == computers, networks, storage devices, palm tops, databases, libraries, etc. • Household appliances == application programs • Scale to national and international levels • Grid users (both power producers and application consumers) can join and leave the Grid at will.

  7. The Shape of Things to Come? • Grid Research Adventures • Infrastructure • Grid Programming • State of the Grid Art • What do Grids look like today? • Interesting developments, trends, and prognostications of the Grid future

  8. Fundamental Questions • How do we build it? • software infrastructures • policies • maintenance, support, accounting, etc. • How do we program it? • concurrency, synchronization • heterogeneity • dynamism • How do we use it for performance? • metrics • models

  9. General Approach • Combine results from the distributed operating systems, parallel computing, and internet computing research domains • Remote procedure call / remote invocation • Public/private key encryption • Domain decomposition • Location-independent naming • Engineering strategy: implement the Grid software infrastructure as middleware • Allows resource owners to maintain ultimate control locally over the resources they commit to the Grid • Permits new resources to be incorporated easily • Aids in developing a user community

  10. Middleware Research Efforts • Globus (I. Foster and C. Kesselman) • Collection of independent remote execution and naming services • Legion (A. Grimshaw) • Distributed object-oriented programming • NetSolve (J. Dongarra) • Multi-language brokered RPC • Condor (M. Livny) • Idle cycle harvesting • NINF (S. Matsuoka) • Java-based brokered RPC
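As a rough illustration of the brokered-RPC pattern that NetSolve and NINF share: the client first asks an agent (broker) which server currently offers a routine, then issues the call to that server directly. The sketch below uses Python's xmlrpc for transport; the agent URL, `lookup` method, and routine name are illustrative stand-ins, not either system's real API.

```python
# Minimal brokered-RPC sketch (illustrative; not the NetSolve/NINF API).
# The agent maps routine names to whichever server is currently best able
# to run them, so the client never hard-codes a server address.
import xmlrpc.client

AGENT_URL = "http://agent.example.org:8000"   # hypothetical broker address

def brokered_call(routine, *args):
    # Step 1: ask the broker which server should handle this routine.
    agent = xmlrpc.client.ServerProxy(AGENT_URL)
    server_url = agent.lookup(routine)        # e.g. "http://node7.example.org:9000"
    # Step 2: issue the RPC against the server the broker chose.
    server = xmlrpc.client.ServerProxy(server_url)
    return getattr(server, routine)(*args)

# Usage: result = brokered_call("solve_linear_system", matrix, rhs)
```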

  11. Commonalities • Runtime systems • All current infrastructures are implemented as a set of run-time services • Resource is an abstract notion • Anything with an API is a resource: operating systems, libraries, databases, hardware devices • Support for multiple programming languages • legacy codes • performance

  12. Infrastructure Concerns • Leverage emerging distributed technologies • Buy it rather than build it • Network infrastructure • Web services • Complexity • Performance • Installation, configuration, fault-diagnosis • Mean time to reconfiguration is probably measured in minutes • Bringing the Grid “down” is not an option • Who operates it?

  13. NPACI • National Partnership for Advanced Computational Infrastructure • high-performance computing for the scientific research community • Goal: Build a production-quality Grid • Leverage emerging standards • Harden and deploy mature Grid technologies • Packaging, configuration, deployment, diagnostics, accounting • Deliver the Grid to scientists

  14. PACI-sized Questions • If the national infrastructure is managed as a Grid... • What resources are attached to it? • X86 is certainly plentiful • Earth Simulator is certainly expensive • Multithreading is certainly attractive • What is the right blend? • How are they managed? • How long will you wait for your job to get through the queue? • Accounting • What are the units of Grid allocation?

  15. Grid Programming • Two models • Manual: the application is explicitly coded to be a Grid application • Automatic: Grid software “Gridifies” a parallel or sequential program • Start with the simpler approach: build programs that can adapt to changing Grid conditions • What are the current Grid conditions? • Need a way to assess the available performance • For example: • What is the speed of your ethernet? (see the probe sketch below)
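A minimal sketch of assessing one such condition yourself: time a TCP transfer to a peer and compute the delivered bandwidth. This assumes something like `nc -l 5001 > /dev/null` is already running on the far host; the port and transfer size are arbitrary choices.

```python
# Crude TCP throughput probe: push N bytes to a listening peer and time it.
import socket, time

def probe_mbps(host, port=5001, nbytes=8 * 1024 * 1024):
    chunk = b"\0" * 65536
    sent = 0
    start = time.time()
    with socket.create_connection((host, port)) as s:
        while sent < nbytes:
            s.sendall(chunk)
            sent += len(chunk)
    elapsed = time.time() - start
    return (sent * 8) / (elapsed * 1e6)   # megabits per second

# Repeated runs at different times of day give different answers,
# which is exactly the point of the next slide.
```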

  16. Ethernet Doesn’t Have a Speed -- it Has Many • [Plot: measured TCP/IP throughput (Mb/s) over time; the bandwidth actually delivered varies widely rather than sitting at a single nominal rate]

  17. More Importantly • It is not what the speed was, but what the speed will be that matters • Performance prediction • Analytical models remain elusive • Statistical models are difficult • Whatever models are used, the prediction itself needs to be fast
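As a sketch of how cheap "fast" can be: exponential smoothing updates a forecast in constant time per measurement, with no model fitting at all. The gain value below is illustrative, not tuned.

```python
# O(1)-per-update forecaster: exponential smoothing.
# Each new measurement nudges the running estimate, so the prediction
# is produced as fast as the measurement stream arrives.
class ExpSmoother:
    def __init__(self, gain=0.3):
        self.gain = gain        # how strongly new data overrides history
        self.estimate = None

    def update(self, measurement):
        if self.estimate is None:
            self.estimate = measurement
        else:
            self.estimate += self.gain * (measurement - self.estimate)
        return self.estimate    # forecast for the next measurement
```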

  18. The Network Weather Service • On-line Grid system that • monitors the performance that is available from distributed resources • forecasts future performance levels using fast statistical techniques • delivers forecasts dynamically, “on the fly” • Uses adaptive, non-parametric time series analysis models to make short-term predictions • Records and reports forecasting error with each prediction stream • Runs as any user (no privileged access required) • Scalable and end-to-end
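A condensed sketch of the NWS idea: run several cheap predictors side by side on each measurement stream, track the error each has accumulated so far, and report the forecast of whichever predictor is currently most accurate, together with that error. The three predictors and the 21-sample median window below are illustrative choices, not the full NWS battery.

```python
# NWS-style adaptive forecasting, condensed: several cheap predictors
# compete, and the one with the lowest cumulative error wins each step.
class AdaptiveForecaster:
    def __init__(self):
        self.history = []
        self.predictors = {
            "last":   lambda h: h[-1],
            "mean":   lambda h: sum(h) / len(h),
            "median": lambda h: sorted(h[-21:])[len(h[-21:]) // 2],
        }
        self.abs_err = {name: 0.0 for name in self.predictors}
        self.n = 0

    def update(self, measurement):
        # Score every predictor against the value that just arrived...
        if self.history:
            for name, p in self.predictors.items():
                self.abs_err[name] += abs(p(self.history) - measurement)
        self.history.append(measurement)
        self.n += 1
        # ...then forecast with whichever has been most accurate so far.
        best = min(self.predictors, key=lambda name: self.abs_err[name])
        forecast = self.predictors[best](self.history)
        mae = self.abs_err[best] / max(self.n - 1, 1)
        return forecast, mae    # prediction plus its reported error
```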

  19. NWS Predictions and Errors • [Plot: red = NWS prediction, black = measured data; MSE = 73.3, FED = 8.5 Mb/s, MAE = 5.8 Mb/s]
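For reference, the MSE and MAE figures quoted on this slide and the next can be computed directly from the paired prediction/measurement series; a minimal sketch:

```python
# Error measures over paired predictions and measurements.
def mse(pred, data):
    return sum((p - d) ** 2 for p, d in zip(pred, data)) / len(data)

def mae(pred, data):
    return sum(abs(p - d) for p, d in zip(pred, data)) / len(data)
```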

  20. Clusters Too • [Plot: NWS predictions vs. measurements; MSE = 4089, FED = 63 Mb/s, MAE = 56 Mb/s]

  21. Many Challenges, No Waiting • On-line predictions • Need them better, faster, cheaper, and more accurate • Adaptive programming • Even if predictions are “there,” they will have errors • Performance fluctuates at machine speeds, not human speeds • Which resource to use? When? (see the sketch below) • Can programmers really manage a fluctuating abstract machine?
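One hedged answer to "which resource, when": discount each forecast by its own reported error and send work to the resource whose pessimistic estimate is best. A sketch, assuming (name, forecast, error) tuples such as those a forecaster like the one above might return:

```python
# Conservative resource selection: discount each forecast by its reported
# error and pick the resource with the best pessimistic estimate.
def pick_resource(candidates):
    """candidates: iterable of (name, forecast_mbps, mae_mbps) tuples."""
    return max(candidates, key=lambda c: c[1] - c[2])

# e.g. pick_resource([("ucsb", 80.0, 6.0), ("sdsc", 90.0, 25.0)])
# -> ("ucsb", 80.0, 6.0): the steadier link beats the nominally faster one.
```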

  22. GrADS • Grid Application Development Software (GrADS) Project (K. Kennedy, PI) • Investigates Grid programmability • Soup-to-nuts integrated approach • Compilers, debuggers, libraries, etc. • Automatic resource control strategies • Selection and scheduling • Resource economies (stability) • Performance prediction and monitoring • Applications and resources • Effective Grid simulation • Builds upon middleware successes • Tested with “real” applications

  23. Four Observations • The performance of the Grid middleware and services matters • Grid fabric must scale even if the individual applications do not • Adaptivity is critical • So far, only short-term performance predictions are possible • Both application and system must adapt on same time scale • Extracting performance is really really hard • Things happen at machine speeds • Complexity is a killer • We need more compilation technology

  24. Grid Compilers • Adaptive compilation • Compiler and program preparation environment needs to manage complexity • The “machine” for which the compiler is optimizing is changing dynamically • Challenges • Performance of the compiler is important • Legacy codes • Security? • GrADS has broken ground, but there is much more to do

  25. Grid Research Challenges • Four foci characterize Grid “problems” • Heterogeneity • Dynamism • Federalism • Performance • Just building the infrastructure makes research questions out of previously solved problems • Installation • Configuration • Accounting • Grid programming is extremely complex • New programming technologies

  26. Okay, so where are we now?

  27. Rational Exuberance • Grid deployments are proliferating: DISCOM, SinRG, APGrid, IPG, …

  28. For Example -- TeraGrid • Joint effort between • San Diego Supercomputer Center (SDSC) • National Center for Supercomputing Applications (NCSA) • Argonne National Laboratory (ANL) • Center for Advanced Computing Research (CACR) • Stats • 13.6 Teraflops (peak) • 600 Terabytes of on-line storage • 40 Gb/s full connectivity, cross-country, between sites • Software infrastructure is primarily Globus-based • Funded by NSF last year

  29. Non-trivial Endeavor • [Diagram of the four TeraGrid sites and their roles (Caltech: data collection and analysis applications; ANL: visualization; SDSC: data-oriented computing; NCSA: compute-intensive), with resources including the 574p IA-32 Chiba City cluster, 256p HP X-Class, 128p HP V2500, 128p Origin, 92p IA-32, 1024p IA-32, 320p IA-64, 1176p IBM SP Blue Horizon, 1500p Origin, Sun E10K, HR display & VR facilities, HPSS and UniTree archives, and Myrinet interconnects]

  30. It’s Big, but there is Room to Grow • Baseline infrastructure • IA-64 processors running Linux • Gigabit Ethernet • Myrinet • The Phone Company • Designed to be heterogeneous and extensible • Sites have “plugged” their resources in • IBM Blue Horizon • SGI Origin • Sun Enterprise • Convex X and V Class • CAVEs, ImmersaDesks, etc.

  31. Middleware Status • Several research and commercial infrastructures have reached maturity • Research: Globus, Legion, NetSolve, Condor, NINF, PUNCH • Commercial: Globus, Avaki, Grid Engine • By far, the most prevalent Grid infrastructure deployed today is Globus

  32. Globus on One Slide • Grid protocols for resource access, sharing, and discovery • Grid Security Infrastructure (GSI) • Grid Resource Allocation Manager (GRAM) • MetaDirectory Service (MDS) • Reference implementation of protocols in toolkit form
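For flavor, a sketch of a GT2-era submission path: GSI authenticates the user, an RSL string describes the job, and GRAM runs it on the remote resource via its gatekeeper. The contact string below is a hypothetical placeholder; `globusrun` is the toolkit's command-line client, invoked here from Python.

```python
# Sketch of a GT2-style GRAM submission via the globusrun client.
# GSI handles authentication; the RSL string tells GRAM what to run.
import subprocess

contact = "gatekeeper.example.org/jobmanager-pbs"   # hypothetical gatekeeper
rsl = "&(executable=/bin/hostname)(count=4)"        # RSL: what to run, how wide

subprocess.run(["globusrun", "-o", "-r", contact, rsl], check=True)
# In practice MDS would be queried first to discover contact strings
# like the one hard-coded above.
```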

  33. Increasing Research Leverage • Grid research software artifacts turn out to be valuable • Much of the extant work is empirical and engineering focused • Robustness concerns mean that the prototype systems need to “work” • Heterogeneity implies the need for portability • Open source impetus • Need to go from research prototypes to nationally available software infrastructure • Download, install, run

  34. Packaging Efforts • NSF Middleware Initiative (NMI) • USC/ISI, SDSC, U. Wisc., ANL, NCSA, I2 • Identifies maturing Grid services and tools • Provides support for configuration tools, testing, packaging • Implements a release schedule and coordination • R1 out 8/02 • Globus, Condor-G, NWS, KX509/KCA • Release every 3 months • Many more packages slated • The NPACkage • Use NMI technology for PACI infrastructure

  35. State of the Art • Dozens of Grid deployments underway • Linux cluster technology is the primary COTS computing platform • Heterogeneity is built in from the start • Networks • Extant systems • Special-purpose devices • Globus is the leading middleware • Grid services and software tools are reaching maturity, and mechanisms are in place to maximize leverage

  36. What’s next?

  37. Grid Standards • Interoperability is an issue • Technology drift is starting to become a problem • The protocol zoo is open for business • The Global Grid Forum (GGF) • Modeled after the IETF (e.g., working groups) • Organized at a much earlier stage of development (relatively speaking) • Meetings every 4 months • Truly an international organization

  38. Webification • Open Grid Services Architecture (OGSA) • “The Physiology of the Grid,” I. Foster, C. Kesselman, J. Nick, S. Tuecke • Based on W3C standards (XML, WSDL, WSIL, UDDI, etc.) • Incorporates web service support for interface publication, multiple protocol bindings, and local/remote transparency • Directly interoperable with Internet-targeted “hosting environments” • J2EE, .NET • The vendors are excited

  39. Grid@Home • Entropia (www.entropia.com) • Commercial enterprise • Peer-2-Peer approach • Napster for compute cycles (without the lawsuits) • Microsoft PC-based instead of Linux/Unix-based • More compute leverage -- a lot more • Far more configuration support, deployment support, and fault management built into the system • Proprietary technology • Deployed at NPACI on 250+ hosts

  40. Thanks and Credit • Organizations: NPACI, SDSC, NCSA, The Globus Project (ISI/USC), The Legion Project (UVa), UTK, LBL • Support: NSF, NASA, DARPA, USPTO, DOE

  41. More Information • Entropia: http://www.entropia.com • Globus: http://www.globus.org • GrADS: http://hipersoft.cs.rice.edu/grads • NMI: http://www.nsf-middleware.org • NPACI: http://www.npaci.edu • NWS: http://nws.cs.ucsb.edu • TeraGrid: http://www.teragrid.org • Rich Wolski: http://www.cs.ucsb.edu/~rich
