Grid and e-Science Technologies. Simon Cox Technical Director Southampton Regional e-Science Centre. Summary. The Grid problem : Resource sharing & coordinated problem solving in dynamic, multi-institutional virtual organizations
Grid and e-Science Technologies • Simon Cox, Technical Director, Southampton Regional e-Science Centre
Summary • The Grid problem: Resource sharing & coordinated problem solving in dynamic, multi-institutional virtual organizations • Grid architecture: Protocol, service definition for interoperability & resource sharing • Grid Middleware • Globus Toolkit a source of protocol and API definitions—and reference implementations • Open Grid Services Architecture represents next step in evolution • Condor High throughput Computing • Web Services & W3C leveraging e-business • e-Science Projects applying Grid concepts to applications
The Grid Problem “Flexible, secure, coordinated resource sharing among dynamic collections of individuals, institutions, and resources” - “The Anatomy of the Grid: Enabling Scalable Virtual Organizations” by Foster, Kesselman and Tuecke • Enable communities (“virtual organizations”) to share geographically distributed resources as they pursue common goals - assuming the absence of … • central location • central control • omniscience • existing trust
Why Grids? (1) e-Science • A biochemist exploits 10,000 computers to screen 100,000 compounds in an hour • 1,000 physicists worldwide pool resources for peta-op analyses of petabytes of data • Civil engineers collaborate to design, execute, & analyze shake table experiments • Climate scientists visualize, annotate, & analyze terabyte simulation datasets • An emergency response team couples real time data, weather model, population data
Grid Communities & Applications: Data Grids for High Energy Physics [Figure: tiered data-distribution hierarchy for a high energy physics experiment] • Detector: there is a “bunch crossing” every 25 nsecs; there are 100 “triggers” per second; each triggered event is ~1 MByte in size • Online System: ~PBytes/sec in, ~100 MBytes/sec out • Offline Processor Farm: ~20 TIPS, ~100 MBytes/sec • Tier 0: CERN Computer Centre; ~622 Mbits/sec or Air Freight (deprecated) • Tier 1: FermiLab (~4 TIPS), France Regional Centre, Germany Regional Centre, Italy Regional Centre; ~622 Mbits/sec • Tier 2: Tier2 Centres at ~1 TIPS each (including Caltech); ~622 Mbits/sec • HPSS storage attached at several centres • Institutes: ~0.25 TIPS each; physics data cache, ~1 MBytes/sec • Tier 4: physicist workstations (Pentium II 300 MHz) • 1 TIPS is approximately 25,000 SpecInt95 equivalents • Physicists work on analysis “channels”. Each institute will have ~10 physicists working on one or more channels; data for these channels should be cached by the institute server www.griphyn.org www.ppdg.net www.eu-datagrid.org
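The rates quoted on this slide can be cross-checked with a little arithmetic. The sketch below uses only figures from the slide (a bunch crossing every 25 nsecs, 100 triggers per second, ~1 MByte per event, ~622 Mbits/sec links). The decimal unit conversions and all variable names are assumptions made for illustration.

```python
# Back-of-envelope check of the data rates quoted on the slide.
# Assumption: decimal units (1 MByte = 10**6 bytes, 1 Mbit = 10**6 bits).

BUNCH_CROSSING_INTERVAL_S = 25e-9   # one bunch crossing every 25 nsecs
TRIGGERS_PER_SECOND = 100           # events kept by the trigger
EVENT_SIZE_MBYTES = 1.0             # ~1 MByte per triggered event
LINK_MBITS_PER_SECOND = 622.0       # quoted tier-to-tier link speed

# How many crossings happen per second, and how many are discarded per kept event.
crossings_per_second = 1.0 / BUNCH_CROSSING_INTERVAL_S
trigger_reduction = crossings_per_second / TRIGGERS_PER_SECOND

# Data rate leaving the trigger, and the capacity of one link in the same units.
triggered_mbytes_per_s = TRIGGERS_PER_SECOND * EVENT_SIZE_MBYTES
link_mbytes_per_s = LINK_MBITS_PER_SECOND / 8.0

print(f"bunch crossings per second : {crossings_per_second:.3g}")
print(f"crossings per kept event   : {trigger_reduction:.3g}")
print(f"triggered data rate        : {triggered_mbytes_per_s:.0f} MBytes/sec")
print(f"one 622 Mbits/sec link     : {link_mbytes_per_s:.2f} MBytes/sec")
```

Under these assumptions, 100 triggers per second at ~1 MByte each reproduces the ~100 MBytes/sec figure on the slide, while a single ~622 Mbits/sec link carries roughly 78 MBytes/sec.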
Network for Earthquake Engineering Simulation • NEESgrid: national infrastructure to couple earthquake engineers with experimental facilities, databases, computers, & each other • On-demand access to experiments, data streams, computing, archives, collaboration NEESgrid: Argonne, Michigan, NCSA, UIUC, USC
Online Access to Scientific Instruments [Figure: data flow from the Advanced Photon Source] • Advanced Photon Source • wide-area dissemination • desktop & VR clients with shared controls • real-time collection • archival storage • tomographic reconstruction DOE X-ray grand challenge: ANL, USC/ISI, NIST, U.Chicago
Why Grids? (2) e-Business • Engineers at a multinational company collaborate on the design of a new product • A multidisciplinary analysis in aerospace couples code and data in four companies • An insurance company mines data from partner hospitals for fraud detection • An application service provider offloads excess load to a compute cycle provider • An enterprise configures internal & external resources to support e-Business workload
Grids: Why Now? • Moore’s law => highly functional end-systems • Ubiquitous Internet => universal connectivity • Network exponentials produce dramatic changes in geometry and geography • 9-month doubling: double Moore’s law! • 1986-2001: x340,000; 2001-2010: x4000? • New modes of working and problem solving emphasize teamwork, computation • New business models and technologies facilitate outsourcing
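The growth factors on this slide can be related to the quoted 9-month doubling time with a short calculation. The sketch below is illustrative only: the function names are hypothetical, and the 18-month doubling used for Moore’s law is an assumption inferred from the slide’s phrase “double Moore’s law”.

```python
import math

def implied_doubling_time_months(growth_factor: float, years: float) -> float:
    """Doubling time (in months) that produces `growth_factor` over `years`."""
    doublings = math.log2(growth_factor)
    return years * 12.0 / doublings

def growth_over(years: float, doubling_months: float) -> float:
    """Total growth factor after `years` at a fixed doubling time."""
    return 2.0 ** (years * 12.0 / doubling_months)

# Growth factors quoted on the slide.
print(round(implied_doubling_time_months(340_000, 15), 1))  # 1986-2001 -> 9.8
print(round(implied_doubling_time_months(4_000, 9), 1))     # 2001-2010 -> 9.0

# Assumed 18-month doubling (Moore's law) over the same 15 years, for contrast.
print(growth_over(15, 18))                                   # -> 1024.0
```

Both quoted factors correspond to a doubling time of roughly 9 to 10 months, whereas the assumed 18-month doubling yields only about a thousandfold increase over 15 years.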
Elements of the Problem • Resource sharing • Computers, storage, sensors, networks, … • Heterogeneity of device, mechanism, policy • Sharing conditional: negotiation, payment, … • Coordinated problem solving • Integration of distributed resources • Compound quality of service requirements • Dynamic, multi-institutional virtual organisations • Dynamic overlays on classic org structures • Map to underlying control mechanisms http://www.globus.org/research/papers/anatomy.pdf
The Grid World: Current Status • Dozens of major Grid projects in scientific & technical computing/research & education • Deployment, application, technology • Some consensus on key concepts and technologies • Open source Globus Toolkit™ a de facto standard for major protocols & services • Far from complete or perfect, but out there, evolving rapidly, and large tool/user base • Global Grid Forum a significant force • Industrial interest emerging rapidly http://www.gridforum.org
Grid Middleware (coordinate and authenticate use of grid services) • Globus (and GGF grid-computing protocols) • Security Infrastructure (GSI) • Resource Allocation & Management (GRAM) • Resource Information Service (GRIS) • Index Information Service (GIIS) • GridFTP • Metadirectory service (MDS 2.0+) coupled to LDAP server • Condor (distributed high-throughput computing system) • Condor-G allows us to dispatch jobs to our Globus system • Active collaboration with the Condor development team at University of Wisconsin (Miron Livny)
The Globus Project: Making Grid computing a reality • Close collaboration with real Grid projects in science and industry • Development and promotion of standard Grid protocols to enable interoperability and shared infrastructure • Development and promotion of standard Grid software APIs and SDKs to enable portability and code sharing • The Globus Toolkit: Open source, reference software base for building grid infrastructure and applications • Global Grid Forum: Development of standard protocols and APIs for Grid computing http://www.gridforum.org http://www.globus.org
Four Key Protocols • The Globus Toolkit centers around four key protocols • Connectivity layer: • Security: Grid Security Infrastructure (GSI) • Resource layer: • Resource Management: Grid Resource Allocation & Management (GRAM) • Information Services: Grid Resource Information Protocol (GRIP) • Data Transfer: Grid File Transfer Protocol (GridFTP)
The Globus Toolkit in One Slide • Grid protocols (GSI, GRAM, …) enable resource sharing within virtual orgs; toolkit provides reference implementation ( = Globus Toolkit services) • Protocols (and APIs) enable other tools and services for membership, discovery, data mgmt, workflow, … [Figure: Globus Toolkit components and request flow] • GSI (Grid Security Infrastructure): user authenticates & creates proxy credential; other GSI-authenticated remote service requests • GRAM (Grid Resource Allocation & Management): reliable remote invocation; Gatekeeper (factory) creates user processes (#1, #2) with proxies (Proxy, Proxy #2) • MDS-2 (Meta Directory Service): soft state registration; enquiry; processes register with Reporter (registry + discovery); GIIS: Grid Information Index Server (discovery) • Other service (e.g. GridFTP)
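The request flow named on this slide (authenticate and create a proxy credential, ask the gatekeeper to create a process, register the process, then enquire) can be mimicked with a toy model. This is a minimal sketch and not the Globus Toolkit API: every class, method and subject string below is hypothetical, no real cryptography is involved, and the gatekeeper registers the process on its behalf as a simplification.

```python
import itertools
from dataclasses import dataclass, field
from typing import Dict, Iterable, List

@dataclass(frozen=True)
class ProxyCredential:
    """Stand-in for a short-lived proxy credential (no real cryptography)."""
    subject: str
    lifetime_s: int

@dataclass
class Reporter:
    """Stand-in for the registry that created processes are registered with."""
    entries: Dict[int, str] = field(default_factory=dict)

    def register(self, process_id: int, owner: str) -> None:
        self.entries[process_id] = owner

    def enquire(self, owner: str) -> List[int]:
        """Return the ids of all processes registered for `owner`."""
        return sorted(pid for pid, who in self.entries.items() if who == owner)

class Gatekeeper:
    """Stand-in for a gatekeeper: checks the credential, then creates a process."""

    def __init__(self, authorized_subjects: Iterable[str], reporter: Reporter) -> None:
        self._authorized = set(authorized_subjects)
        self._reporter = reporter
        self._ids = itertools.count(1)

    def create_process(self, proxy: ProxyCredential) -> int:
        # Reject expired credentials and unknown subjects.
        if proxy.lifetime_s <= 0 or proxy.subject not in self._authorized:
            raise PermissionError(f"request from {proxy.subject!r} rejected")
        process_id = next(self._ids)
        self._reporter.register(process_id, proxy.subject)
        return process_id

reporter = Reporter()
gatekeeper = Gatekeeper({"/O=Grid/CN=Alice"}, reporter)
proxy = ProxyCredential("/O=Grid/CN=Alice", lifetime_s=12 * 3600)
first = gatekeeper.create_process(proxy)
second = gatekeeper.create_process(proxy)
print(reporter.enquire("/O=Grid/CN=Alice"))  # -> [1, 2]
```

The point of the sketch is the separation of roles: the credential is created once and reused for several requests, process creation goes through a single authenticated entry point, and discovery is answered from the registry, not from the processes themselves.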
Globus Toolkit: Evaluation (+) • Good technical solutions for key problems, e.g. • Authentication and authorization • Resource discovery and monitoring • Reliable remote service invocation • High-performance remote data access • This + good engineering is enabling progress • Good quality reference implementation, multi-language support, interfaces to many systems, large user base, industrial support • Growing community code base built on tools
Globus Toolkit: Evaluation (-) • Protocol deficiencies, e.g. • Heterogeneous basis: HTTP, LDAP, FTP • No standard means of invocation, notification, error propagation, authorization, termination, … • Significant missing functionality, e.g. • Databases, sensors, instruments, workflow, … • Virtualization of end systems (hosting envs.) • Little work on total system properties, e.g. • Dependability, end-to-end QoS, … • Reasoning about system properties
What is Condor? • Condor converts collections of distributively owned workstations and dedicated clusters into a distributed high-throughput computing facility. • Condor uses ClassAd Matchmaking to make sure that everyone is happy. • Features • Unix and NT • Operational since 1986 • Manages more than 1300 CPUs at UW-Madison • Software available free on the web • More than 150 Condor installations worldwide in academia and industry • Non-dedicated resources • Job checkpoint and migration
What is High-Throughput Computing? • High-performance: CPU cycles/second under ideal circumstances. • “How fast can I run simulation X on this machine?” • High-throughput: CPU cycles/day (week, month, year?) under non-ideal circumstances. • “How many times can I run simulation X in the next month using all available machines?”
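The difference between the two questions can be made concrete with a small calculation. The sketch below is an illustration under stated assumptions: the function name, the 2-hour run time, the machine counts and the 50% availability are all hypothetical, and a run that does not fit in the remaining usable time is simply not counted.

```python
def runs_per_period(machines: int, run_hours: float,
                    period_days: int, availability: float) -> int:
    """Completed runs over the period.

    `availability` is the fraction of wall-clock time each machine can be
    used for our jobs (1.0 = dedicated, 0.5 = free half of the time).
    """
    usable_hours = period_days * 24.0 * availability
    runs_per_machine = int(usable_hours // run_hours)
    return machines * runs_per_machine

# High-performance view: one dedicated machine, ideal circumstances.
dedicated = runs_per_period(machines=1, run_hours=2.0,
                            period_days=30, availability=1.0)

# High-throughput view: many shared desktops, each usable half of the time.
pooled = runs_per_period(machines=100, run_hours=2.0,
                         period_days=30, availability=0.5)

print(dedicated)  # -> 360
print(pooled)     # -> 18000
```

Under these assumed numbers, the pool of part-time machines completes 50 times as many runs in a month as the single dedicated machine, even though no individual run finishes any faster.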
Some HTC Challenges • Condor does whatever it takes to run your jobs, even if some machines… • Crash (or are disconnected) • Run out of disk space • Don’t have your software installed • Are frequently needed by others • Are far away & managed by someone else
What is ClassAd Matchmaking? • Condor uses ClassAd Matchmaking to make sure that work gets done within the constraints of both users and owners. • Users (jobs) have constraints: • “I need an Alpha with 256 MB RAM” • Owners (machines) have constraints: • “Only run jobs when I am away from my desk and never run jobs owned by Bob.”
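Two-sided matching of this kind can be sketched in a few lines. The code below is a toy stand-in, not Condor’s ClassAd language or matchmaker: Python functions replace ClassAd expressions, the attribute names (`arch`, `memory_mb`, `owner_away`, `owner`) are hypothetical, and the greedy one-job-per-machine pairing is a simplification chosen for brevity.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Attributes = Dict[str, object]
# A requirement sees the ad's own attributes first, then the other party's.
Requirement = Callable[[Attributes, Attributes], bool]

@dataclass
class Ad:
    name: str
    attributes: Attributes
    requirements: Requirement

def matches(job: Ad, machine: Ad) -> bool:
    """A match needs BOTH sides' requirements to hold."""
    return (job.requirements(job.attributes, machine.attributes)
            and machine.requirements(machine.attributes, job.attributes))

def matchmake(jobs: List[Ad], machines: List[Ad]) -> List[Tuple[str, str]]:
    """Greedily pair each job with the first free machine that matches."""
    free = list(machines)
    pairs = []
    for job in jobs:
        for machine in free:
            if matches(job, machine):
                pairs.append((job.name, machine.name))
                free.remove(machine)
                break
    return pairs

# "I need an Alpha with 256 MB RAM"
def needs_alpha(me: Attributes, other: Attributes) -> bool:
    return other["arch"] == "ALPHA" and other["memory_mb"] >= 256

# "Only run jobs when I am away from my desk and never run jobs owned by Bob."
def owner_policy(me: Attributes, other: Attributes) -> bool:
    return bool(me["owner_away"]) and other["owner"] != "Bob"

jobs = [
    Ad("bob_sim", {"owner": "Bob"}, needs_alpha),
    Ad("alice_sim", {"owner": "Alice"}, needs_alpha),
]
machines = [
    Ad("alpha01", {"arch": "ALPHA", "memory_mb": 256, "owner_away": True}, owner_policy),
    Ad("intel01", {"arch": "INTEL", "memory_mb": 512, "owner_away": True}, owner_policy),
]

pairs = matchmake(jobs, machines)
print(pairs)  # -> [('alice_sim', 'alpha01')]
```

Bob’s job stays unmatched for two different reasons: the Alpha’s owner refuses it, and the Intel machine fails the job’s own requirement. That both sides can veto a pairing is the property the slide describes.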
Mathematicians Solve NUG30 • Looking for the solution to the NUG30 quadratic assignment problem • An informal collaboration of mathematicians and computer scientists • Condor-G delivered 3.46E8 CPU seconds in 7 days (peak 1009 processors) in U.S. and Italy (8 sites) • 14,5,28,24,1,3,16,15, • 10,9,21,2,4,29,25,22, • 13,26,17,30,6,20,19, • 8,18,7,27,12,11,23 MetaNEOS: Argonne, Iowa, Northwestern, Wisconsin
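The figures on this slide allow two quick sanity checks: the average number of processors in use over the run, and whether the listed solution is a valid assignment of 30 distinct items. The sketch below uses only the numbers on the slide; the variable names are illustrative, and treating the 7 days as exactly 7 x 24 hours of wall-clock time is an assumption.

```python
CPU_SECONDS = 3.46e8        # delivered by Condor-G
WALL_DAYS = 7               # assumed to mean exactly 7 x 24 hours
PEAK_PROCESSORS = 1009

wall_seconds = WALL_DAYS * 24 * 60 * 60
average_processors = CPU_SECONDS / wall_seconds
average_over_peak = average_processors / PEAK_PROCESSORS

# The solution as listed on the slide.
SOLUTION = [
    14, 5, 28, 24, 1, 3, 16, 15,
    10, 9, 21, 2, 4, 29, 25, 22,
    13, 26, 17, 30, 6, 20, 19,
    8, 18, 7, 27, 12, 11, 23,
]
# A valid assignment uses each of 1..30 exactly once.
is_permutation = sorted(SOLUTION) == list(range(1, 31))

print(f"average processors busy: {average_processors:.0f}")
print(f"average / peak         : {average_over_peak:.2f}")
print(f"valid permutation of 1..30: {is_permutation}")
```

Under that assumption the run kept about 572 processors busy on average, a little over half of the 1009-processor peak, and the listed solution is indeed a permutation of 1 to 30.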
What Is Condor-G? • Enhanced version of Condor that provides robust job management for Globus Toolkit • Robust replacement for globusrun • Provides extensive fault-tolerance • Brings Condor’s job management features to Globus jobs • Two Parts • Globus Universe • GlideIn • Excellent example of applying the general purpose Globus Toolkit to solve a particular problem (i.e. high-throughput computing) on the Grid
Why Use Condor-G • Condor • Designed to run jobs within a single administrative domain • Globus Toolkit • Designed to run jobs across many administrative domains • Condor-G • Combine the strengths of both
Web Services • Increasingly popular standards-based framework for accessing network applications • W3C standardization; Microsoft, IBM, Sun, others • XML and XML Schema • Representing data in a portable format • WSDL: Web Services Description Language • Interface Definition Language for Web services • SOAP: Simple Object Access Protocol • XML-based RPC protocol; common WSDL target • WS-Inspection • Conventions for locating service descriptions • UDDI: Universal Description, Discovery, & Integration • Directory for Web services
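To show what “XML-based RPC protocol” means in practice, the sketch below builds and parses a SOAP 1.1 request using only Python’s standard library. The envelope namespace is the standard SOAP 1.1 one; the service namespace `urn:example:grid-job-service`, the `submitJob` operation and its parameters are hypothetical and do not correspond to any real service.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Tuple

SOAP_ENV_NS = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope
SERVICE_NS = "urn:example:grid-job-service"                 # hypothetical service

def build_soap_request(operation: str, params: Dict[str, object]) -> bytes:
    """Wrap an RPC-style call in a SOAP Envelope/Body and serialize it."""
    ET.register_namespace("soapenv", SOAP_ENV_NS)
    envelope = ET.Element(f"{{{SOAP_ENV_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_ENV_NS}}}Body")
    call = ET.SubElement(body, f"{{{SERVICE_NS}}}{operation}")
    for name, value in params.items():
        child = ET.SubElement(call, f"{{{SERVICE_NS}}}{name}")
        child.text = str(value)
    return ET.tostring(envelope, encoding="utf-8")

def parse_soap_request(payload: bytes) -> Tuple[str, Dict[str, str]]:
    """Recover the operation name and parameters from a serialized request."""
    root = ET.fromstring(payload)
    body = root.find(f"{{{SOAP_ENV_NS}}}Body")
    if body is None or len(body) == 0:
        raise ValueError("payload has no SOAP Body with a call element")
    call = body[0]
    operation = call.tag.split("}", 1)[1]          # strip the "{namespace}" prefix
    params = {child.tag.split("}", 1)[1]: (child.text or "") for child in call}
    return operation, params

request = build_soap_request("submitJob", {"executable": "/bin/hostname", "count": 2})
print(request.decode("utf-8"))
print(parse_soap_request(request))
```

Note that every parameter comes back as a string: the typing information that a WSDL description and XML Schema would supply is deliberately left out of this sketch.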
“New” Globus: Open Grid Services Architecture (OGSA) • Service orientation to virtualize resources • From Web services: • Standard interface definition mechanisms: multiple protocol bindings, multiple implementations, local/remote transparency • Building on Globus Toolkit: • Grid service: semantics for service interactions • Management of transient instances (& state) • Factory, Registry, Discovery, other services • Reliable and secure transport • Multiple hosting targets: J2EE, .NET, “C”, … http://www.globus.org/research/papers/ogsa.pdf http://www.globus.org/research/papers/gsspec.pdf
OGSA Service Model • System comprises (typically a few) persistent services & (potentially many) transient services • All services adhere to specified Grid service interfaces and behaviours • Reliable invocation, lifetime management, discovery, authorization, notification, upgradeability, concurrency, manageability • Interfaces for managing Grid service instances • Factory, registry, discovery, lifetime, etc. => Reliable, secure management of distributed state
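The interplay of factory, registry and lifetime management can be illustrated with a toy model. This is a minimal sketch, not the OGSA interfaces: all class and method names are hypothetical, handles are simple counters, and time is passed in explicitly (no real clock) so the behaviour is deterministic.

```python
import itertools
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class ServiceInstance:
    """A transient service that expires unless its lifetime is extended."""
    handle: str
    termination_time: float

    def extend_lifetime(self, now: float, seconds: float) -> None:
        # Never shorten an existing lifetime, only push it further out.
        self.termination_time = max(self.termination_time, now + seconds)

@dataclass
class Registry:
    """Persistent service that answers discovery queries for live instances."""
    instances: Dict[str, ServiceInstance] = field(default_factory=dict)

    def register(self, instance: ServiceInstance) -> None:
        self.instances[instance.handle] = instance

    def discover(self, now: float) -> List[str]:
        """Drop expired instances, then return the handles still alive."""
        self.instances = {h: s for h, s in self.instances.items()
                          if s.termination_time > now}
        return sorted(self.instances)

class Factory:
    """Persistent service that creates transient instances and registers them."""

    def __init__(self, registry: Registry, prefix: str = "svc") -> None:
        self._registry = registry
        self._prefix = prefix
        self._counter = itertools.count(1)

    def create(self, now: float, lifetime_s: float) -> ServiceInstance:
        handle = f"{self._prefix}-{next(self._counter)}"
        instance = ServiceInstance(handle, now + lifetime_s)
        self._registry.register(instance)
        return instance

registry = Registry()
factory = Factory(registry)
long_job = factory.create(now=0.0, lifetime_s=60.0)
short_job = factory.create(now=0.0, lifetime_s=10.0)

live_at_5 = registry.discover(now=5.0)             # both still alive
short_job.extend_lifetime(now=5.0, seconds=20.0)   # now expires at t=25
live_at_20 = registry.discover(now=20.0)           # both still alive
live_at_30 = registry.discover(now=30.0)           # short_job has expired
print(live_at_5, live_at_20, live_at_30)
```

Expiry by default means that an instance abandoned by its client disappears from the registry without any explicit cleanup call, which is one way to read the slide’s “reliable, secure management of distributed state”.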
Using OGSA to Construct Grid Environments [Figure: three configurations assembled from Registry, Factory, Mapper and Service components, with E2E and H2R labels: (a) Simple Hosting Environment, (b) Virtual Hosting Environment, (c) Compound Services] In each case, Registry handle is effectively the unique name for the virtual organization.
Evolution of Globus • Initial exploration (1996-1999; Globus 1.0) • Extensive application experiments; core protocols • Data Grids (1999-??; Globus 2.0+) • Large-scale data management and analysis • Open Grid Services Architecture (2001-??, Globus 3.0) • Integration with Web services, hosting environments, resource virtualization • Databases, higher-level services • Radically scalable systems (2003-??) • Sensors, wireless, ubiquitous computing
Summary • The Grid problem: Resource sharing & coordinated problem solving in dynamic, multi-institutional virtual organizations • Grid architecture: Protocol, service definition for interoperability & resource sharing • Grid Middleware • Globus Toolkit a source of protocol and API definitions—and reference implementations • Open Grid Services Architecture represents next step in evolution • Condor High throughput Computing • Web Services & W3C leveraging e-business • e-Science Projects applying Grid concepts to applications