Introduction to Grid Computing

Ann Chervenak and Ewa Deelman, USC Information Sciences Institute

Outline • Motivation • Definition and characteristics of Grids • Example Grid applications • Grid Architecture • How a Grid Is Assembled • Overview of the Globus Toolkit • Security Tools • Monitoring and Discovery System • Computing/Execution Tools • Data Tools • A more detailed example: The Earth System Grid

Motivation: Supporting Scientific Applications • Computation intensive • Large-scale simulation and analysis (climate modeling, galaxy formation, gravity waves, event simulation) • Engineering (parameter studies, linked models) • Data intensive • Experimental data analysis (high energy physics) • Image & sensor analysis (astronomy, climate) • Distributed collaboration • Online instrumentation (microscopes, x-ray) • Remote visualization (climate studies, biology) • Engineering (large-scale structural testing) • Large, complex scientific problems • Require people in several organizations to collaborate • Share computing resources, data, instruments

The Grid Problem • Flexible, secure, coordinated resource sharing among dynamic collections of individuals, institutions, and resources (From “The Anatomy of the Grid: Enabling Scalable Virtual Organizations”) • Enable communities (“Virtual Organizations”) to share geographically distributed resources as they pursue common goals • Assuming the absence of… • central location • central control • omniscience • existing trust relationships

An Old Idea … • “The time-sharing computer system can unite a group of investigators …. one can conceive of such a facility as an … intellectual public utility.” • Fernando Corbato and Robert Fano, 1966 • “We will perhaps see the spread of ‘computer utilities’, which, like present electric and telephone utilities, will service individual homes and offices across the country.” • Len Kleinrock, 1967

A Few Grid Application Examples

Earth System Grid objectives • To support the infrastructural needs of the national and international climate community, ESG is providing crucial technology to securely access, monitor, catalog, transport, and distribute data in today’s Grid computing environment. • [Figure labels: HPC hardware running climate models; ESG Portal; ESG Sites] Slide courtesy of Dave Bernholdt, ORNL

ESG Facts and Figures • [Figure labels: IPCC Downloads (10/12/06); Nov 2004 – Oct 2006; Worldwide ESG user base] Slide courtesy of Dave Bernholdt, ORNL

NSF’s TeraGrid* • TeraGrid DEEP: Integrating NSF’s most powerful computers (60+ TF) • 2+ PB Online Data Storage • National data visualization facilities • World’s most powerful network (national footprint) • TeraGrid WIDE Science Gateways: Engaging Scientific Communities • 90+ Community Data Collections • Growing set of community partnerships spanning the science community • Leveraging NSF ITR, NIH, DOE and other science community projects • Engaging peer Grid projects such as Open Science Grid in the U.S. as well as peer Grids in Europe and Asia-Pacific • Base TeraGrid Cyberinfrastructure: Persistent, Reliable, National • Coordinated distributed computing and information environment • Coherent User Outreach, Training, and Support • Common, open infrastructure services • Sites: UC/ANL, PSC, PU, NCSA, IU, ORNL, UCSD, UT • A National Science Foundation Investment in Cyberinfrastructure: $100M 3-year construction (2001-2004); $150M 5-year operation & enhancement (2005-2009) * Slide courtesy of Ray Bair, Argonne National Laboratory

Data Grids for High Energy Physics (image courtesy Harvey Newman, Caltech)
[Figure: tiered data distribution for a high energy physics experiment. Labels recovered from the diagram, top to bottom:]
• Online System: ~PBytes/sec in; ~100 MBytes/sec to the Offline Processor Farm (~20 TIPS)
• There is a “bunch crossing” every 25 nsecs. There are 100 “triggers” per second. Each triggered event is ~1 MByte in size.
• Tier 0: CERN Computer Centre, fed at ~100 MBytes/sec
• Tier 0 to Tier 1: ~622 Mbits/sec or Air Freight (deprecated)
• Tier 1: France Regional Centre, Germany Regional Centre, Italy Regional Centre, FermiLab (~4 TIPS)
• Tier 1 to Tier 2: ~622 Mbits/sec
• Tier 2: Tier2 Centres (~1 TIPS each), including Caltech (~1 TIPS)
• HPSS storage systems (five instances in the diagram)
• Tier 2 to Institutes: ~622 Mbits/sec
• Institutes (~0.25 TIPS) with a physics data cache; ~1 MBytes/sec to workstations
• Tier 4: Physicist workstations (Pentium II 300 MHz)
• 1 TIPS is approximately 25,000 SpecInt95 equivalents
• Physicists work on analysis “channels”. Each institute will have ~10 physicists working on one or more channels; data for these channels should be cached by the institute server.
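The approximate figures on this slide can be cross-checked with a little arithmetic. The sketch below is illustrative only: it takes the slide's rounded numbers at face value, assumes decimal units (1 MByte = 8 Mbits, 1 TByte = 10^6 MBytes), and all constant and function names are invented for this example.

```python
# Back-of-the-envelope check of the slide's approximate figures.
# Assumption: decimal units (1 MByte = 8 Mbits, 1 TByte = 1e6 MBytes).

TRIGGERS_PER_SEC = 100        # "100 triggers per second"
EVENT_SIZE_MBYTES = 1.0       # "each triggered event is ~1 MByte"
LINK_MBITS_PER_SEC = 622.0    # "~622 Mbits/sec" links between tiers
SPECINT95_PER_TIPS = 25_000   # "1 TIPS is approximately 25,000 SpecInt95"
SECONDS_PER_DAY = 24 * 60 * 60


def event_rate_mbytes_per_sec() -> float:
    """Data rate produced by triggered events."""
    return TRIGGERS_PER_SEC * EVENT_SIZE_MBYTES


def daily_volume_tbytes() -> float:
    """Volume accumulated in one day of continuous running."""
    return event_rate_mbytes_per_sec() * SECONDS_PER_DAY / 1e6


def link_mbytes_per_sec() -> float:
    """Capacity of one ~622 Mbits/sec link, expressed in MBytes/sec."""
    return LINK_MBITS_PER_SEC / 8


def tips_to_specint95(tips: float) -> float:
    """Convert the slide's TIPS figures to SpecInt95 equivalents."""
    return tips * SPECINT95_PER_TIPS
```

Under these assumptions the trigger rate reproduces the ~100 MBytes/sec figure, which adds up to roughly 8.64 TBytes per day of continuous running. A single ~622 Mbits/sec link carries about 77.75 MBytes/sec, less than the full event stream.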

Elements of a Grid • Resource sharing • Computers, storage systems, sensors, networks,… • This sharing is always conditional: issues of trust, policy, negotiation, payment, etc. • Coordinated problem solving • Distributed data analysis, computation, simulation, collaboration, … • Dynamic, multi-institutional virtual organizations • Community overlays on classic organizational structures • May be large or small, static or dynamic

Two Rules or Principles of the Grid • Can’t rely on homogeneity of resources • In practice, resources in a large, distributed environment will be heterogeneous • STRATEGY - Plan for diverse systems and use mechanisms to manage heterogeneity • Can’t rely on trust among participants • Sites will not be willing to share their resources if they cannot trust clients from other sites • STRATEGY - Provide a security model that can express complicated social networks • STRATEGY - Use full disclosure when making requests (who is requesting, authorizing, and authenticating the request) and give service owners tools to enforce local policies.
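The "full disclosure" strategy can be pictured as a request that names everyone involved, checked against policies that each service owner writes locally. The following is a minimal sketch under that reading, not a description of any particular toolkit: the `Request` fields, `make_site_policy`, and `admit` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Set


@dataclass(frozen=True)
class Request:
    """A request that discloses who is involved (all fields hypothetical)."""
    requester: str       # who is asking
    authorizer: str      # who granted permission, e.g. a virtual organization
    authenticator: str   # who vouched for the requester's identity
    action: str          # what the requester wants to do


# A local policy is any function supplied by the service owner.
Policy = Callable[[Request], bool]


def make_site_policy(trusted_authenticators: Set[str],
                     allowed_authorizers: Set[str]) -> Policy:
    """Build a policy that accepts only known authenticators and authorizers."""
    def policy(req: Request) -> bool:
        return (req.authenticator in trusted_authenticators
                and req.authorizer in allowed_authorizers)
    return policy


def admit(req: Request, policies: List[Policy]) -> bool:
    """Admit the request only if every local policy accepts it."""
    return all(policy(req) for policy in policies)
```

Because the request carries the authorizer and authenticator alongside the requester, each site can decide for itself which of those parties it trusts, which is the point of giving owners tools to enforce local policies.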

Grid Infrastructure • Provides distributed management • Of physical resources • Of software services • Of communities and their policies • Unified treatment • Build on Web Services framework • Use Web Services Resource Framework (WS-RF), Web Services Notification (WS-Notification), etc. to represent and access state associated with a service • Common management abstractions & interfaces

Elements of the End-to-End Problem Include … • Massively parallel petascale simulation • High-performance parallel I/O • Remote visualization • High-speed reliable data movement • Terascale local analysis • Data access and analysis by external users • Troubleshooting problems in end-to-end system • Security • Orchestration of these various activities Slide Courtesy of Ian Foster

Layered Grid Architecture

Layered Grid Architecture (By Analogy to Internet Architecture)
• Application (Internet Protocol Architecture: Application)
• Collective: “Coordinating multiple resources”: ubiquitous infrastructure services, app-specific distributed services
• Resource: “Sharing single resources”: negotiating access, controlling use
• Connectivity: “Talking to things”: communication (Internet protocols) & security (Internet Protocol Architecture: Transport, Internet)
• Fabric: “Controlling things locally”: access to, & control of, resources (Internet Protocol Architecture: Link)

Protocols, Services, and APIs Occur at Each Level
• Applications
• Languages/Frameworks
• Collective Service APIs and SDKs; Collective Service Protocols; Collective Services
• Resource APIs and SDKs; Resource Service Protocols; Resource Services
• Connectivity APIs; Connectivity Protocols
• Local Access APIs and Protocols
• Fabric Layer

Important Points • Built on Internet protocols & services • Communication, routing, name resolution, etc. • “Layering” here is conceptual, does not imply constraints on who can call what • Protocols/services/APIs/SDKs will, ideally, be largely self-contained • Some things are fundamental: e.g., communication and security • But, advantageous for higher-level functions to use common lower-level functions

The Hourglass Model • Focus on architecture issues • Propose set of core services as basic infrastructure • Use to construct high-level, domain-specific solutions • Design principles • Keep participation cost low • Enable local control • Support for adaptation • “IP hourglass” model • [Figure labels, top to bottom: Applications; Diverse global services; Core services; Local OS]

Connectivity Layer Protocols & Services • Communication protocols • Internet protocols: IP, DNS, routing, etc. • Security protocols and infrastructure • Uniform authentication, authorization, and message protection mechanisms in multi-institutional setting • Single sign-on, delegation, identity mapping • E.g., Public key technology, SSL, X.509, GSS-API • Supporting infrastructure: Certificate Authorities, certificate & key management, … GSI: www.gridforum.org/security
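Identity mapping, one of the security functions listed above, can be thought of as a lookup from a global identity to a local account. The sketch below assumes the global identity is an X.509 subject name and that each site keeps its own table. The subject names, account names, and function names are all made up for illustration, and real deployments also verify the certificate chain before any lookup.

```python
from typing import Dict, Optional

# Hypothetical site-local table: certificate subject name -> local account.
IDENTITY_MAP: Dict[str, str] = {
    "/O=ExampleGrid/OU=Site A/CN=Alice Example": "alice",
    "/O=ExampleGrid/OU=Site B/CN=Bob Example": "bexample",
}


def normalize(subject: str) -> str:
    """Collapse runs of whitespace so trivially different spellings match."""
    return " ".join(subject.split())


def map_identity(subject: str,
                 identity_map: Dict[str, str] = IDENTITY_MAP) -> Optional[str]:
    """Return the local account for an authenticated subject, or None."""
    return identity_map.get(normalize(subject))
```

A subject with no entry maps to `None`, so an unknown user is refused by default rather than given a shared account.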

Resource Layer Protocols & Services • Job submission and management tools • Remote allocation, advance reservation, control of compute resources • Data Transport Tools • High-performance data access & transport • Information Provider • Collects information about the current state of a resource, makes available to higher-level service
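An information provider, as described in the last bullet, samples the state of one resource and hands it to a higher-level service. A minimal sketch using only the Python standard library might look like this. The `ResourceState` fields and the `collect_state` and `publish` names are assumptions for this example, and a real provider would report far more (load, queue length, free storage).

```python
import os
import platform
import time
from dataclasses import asdict, dataclass


@dataclass
class ResourceState:
    """Snapshot of one resource (fields chosen for illustration)."""
    hostname: str
    cpu_count: int
    os_name: str
    timestamp: float


def collect_state() -> ResourceState:
    """Sample the current state of the local machine."""
    return ResourceState(
        hostname=platform.node(),
        cpu_count=os.cpu_count() or 1,  # os.cpu_count() may return None
        os_name=platform.system(),
        timestamp=time.time(),
    )


def publish(state: ResourceState) -> dict:
    """Convert the snapshot to a plain record a higher-level service can store."""
    return asdict(state)
```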

Collective Layer Protocols & Services • Information Services • Aggregate and publish information about resource characteristics • Monitor current status of resources • Resource brokers • Resource discovery and allocation • Metadata and Replica Catalogs • Data Management Services (e.g., replication) • Co-reservation and co-allocation services • Workflow management services
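Resource discovery and allocation by a broker can be sketched as a filter over aggregated resource records followed by a choice among the matches. The example below is a simplified illustration: the `ResourceRecord` fields and the "most free CPUs wins" rule are assumptions, not how any specific broker works.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ResourceRecord:
    """One entry in an aggregated information service (hypothetical fields)."""
    name: str
    free_cpus: int
    online: bool


def discover(records: List[ResourceRecord], cpus_needed: int) -> List[ResourceRecord]:
    """Discovery: keep resources that are online and large enough."""
    return [r for r in records if r.online and r.free_cpus >= cpus_needed]


def allocate(records: List[ResourceRecord], cpus_needed: int) -> Optional[ResourceRecord]:
    """Allocation: pick the candidate with the most free CPUs and reserve them."""
    candidates = discover(records, cpus_needed)
    if not candidates:
        return None
    best = max(candidates, key=lambda r: r.free_cpus)
    best.free_cpus -= cpus_needed  # record the reservation
    return best
```

For example, with resources offering 4, 16 (offline), and 8 free CPUs, a request for 6 CPUs is placed on the 8-CPU resource, and an identical second request finds no candidate.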

Example: High-Throughput Computing System
• App: High Throughput Computing System
• Collective (App): Dynamic checkpoint, job management, failover, staging
• Collective (Generic): Brokering, certificate authorities
• Resource: Access to data, access to computers, access to network performance data
• Connect: Communication, service discovery (DNS), authentication, authorization, delegation
• Fabric: Storage systems, schedulers
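The failover function in the application-specific collective layer can be illustrated with a small retry loop: try a job on one resource and move to the next if that resource is unavailable. This is a sketch under simplifying assumptions (jobs are plain Python callables, failures are signaled by a hypothetical `ResourceUnavailable` exception) and it leaves out checkpointing and staging.

```python
from typing import Callable, Iterable, Optional, Tuple


class ResourceUnavailable(Exception):
    """Raised by a job when the chosen resource cannot run it (hypothetical)."""


# A job takes the name of a resource and returns its result as a string.
Job = Callable[[str], str]


def run_with_failover(job: Job, resources: Iterable[str]) -> Tuple[str, str]:
    """Try each resource in order; return (resource, result) for the first success."""
    last_error: Optional[Exception] = None
    for resource in resources:
        try:
            return resource, job(resource)
        except ResourceUnavailable as err:
            last_error = err  # remember the failure and try the next resource
    raise RuntimeError("job failed on every resource") from last_error
```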

Example: Grid Services for Data-Intensive Applications
• App: Discipline-Specific Data Grid Application
• Collective (App): Coherency control, replica selection, task management, data placement services, …
• Collective (Generic): Replica catalog, replica management, co-allocation, certificate authorities, metadata catalogs, …
• Resource: Access to data, access to computers, access to network performance data, …
• Connect: Communication, service discovery (DNS), authentication, authorization, delegation
• Fabric: Storage systems, clusters, networks, network caches, …
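Replica selection ties together two services named on this slide: a replica catalog, which maps a logical file name to the sites holding copies, and network performance data, which helps choose among them. The sketch below assumes the simplest possible rule (pick the site with the highest measured bandwidth). The catalog layout, site names, and `select_replica` function are invented for illustration.

```python
from typing import Dict, List

# Hypothetical replica catalog: logical file name -> sites holding a copy.
ReplicaCatalog = Dict[str, List[str]]


def select_replica(logical_name: str,
                   catalog: ReplicaCatalog,
                   bandwidth_mbps: Dict[str, float]) -> str:
    """Return the site with the best measured bandwidth for this file.

    Sites with no measurement are treated as having zero bandwidth.
    """
    replicas = catalog.get(logical_name, [])
    if not replicas:
        raise KeyError(f"no replicas registered for {logical_name!r}")
    return max(replicas, key=lambda site: bandwidth_mbps.get(site, 0.0))
```

A production service would weigh more than bandwidth (storage load, access cost, freshness of the measurement), but the division of labor is the same: the catalog answers "where are the copies", and performance data answers "which copy is best right now".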
