Condor-G

FATIH UNIVERSITY Computer Engineering Condor-G A Computation Management Agent for Multi-Institutional Grids Helton MALAMBANE

Outline
• INTRODUCTION
• Large-scale sharing of computational resources
• Grid Protocols
• Computation management (Condor-G Core)
• GlideIn mechanism

1. INTRODUCTION
Grid user requirements
• They want to be able to discover, acquire, and reliably manage computational resources dynamically, in the course of their everyday activities.
• They do not want to be bothered with the location of these resources, the mechanisms that are required to use them, with keeping track of the status of computational tasks operating on these resources, or with reacting to failure.
• They do care about how long their tasks are likely to run and how much these tasks will cost.

Solution: Condor-G
Condor-G leverages software from Globus and Condor. It “allows the user to control multi-domain resources as if they all belong to one personal domain”.
Globus Toolkit: inter-domain resource management protocols.
Condor: intra-domain resource management methods.

2. Large-scale sharing of computational resources
How to build and manage a multi-site computation that uses resources that belong to different sites?
DIFFICULTIES:
• Different sites may feature different authentication and authorization mechanisms, schedulers, hardware architectures, operating systems, file systems, etc.
• The user has little knowledge of the characteristics of resources at remote sites, and no easy means of obtaining this information.
• Due to the distributed nature of the multi-site computing environment, computers, networks, and subcomputations can fail in various ways.
• Keeping track of the status of different elements of a computation involves tedious bookkeeping, especially in the event of failure and dependencies among subcomputations.

2. Large-scale sharing of computational resources
How to build and manage a multi-site computation that uses resources that belong to different sites?
APPROACH:
• Remote resource access issues are addressed by requiring that remote resources speak standard protocols for resource discovery and management.
• Computation management issues are addressed via the introduction of a robust, multi-functional user computation management agent responsible for resource discovery, job submission, job management, and error recovery. (From Condor)
• Remote execution environment issues are addressed via the use of mobile sandboxing technology that allows a user to create a tailored execution environment on a remote node.

3. Grid Protocols - Outline
Protocols used in the Condor-G system:
3.1. GSI (Grid Security Infrastructure)
3.2. GRAM (Grid Resource Allocation and Management)
3.3. MDS-2 (Monitoring and Discovery System)
3.4. GASS (Global Access to Secondary Storage)

3.1. GSI
The Globus Toolkit’s Grid Security Infrastructure
• Makes it possible to authenticate a user just once.
• Uses Public Key Infrastructure (PKI).
• GSI employs the user’s private key to create a proxy credential, which serves as a new private-public key pair that allows a proxy (such as the Condor-G agent) to make remote requests on behalf of the user.

3.2. GRAM protocol
The Grid Resource Allocation and Management protocol
• Supports remote submission, monitoring, and control of a computational request to a remote computational resource, e.g. “run program P”.
• Uses GSI for authentication and authorization.
• Uses a two-phase commit (request sequences and a commit command).
• Logs details of all active jobs (useful for crash recovery).
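
The two-phase commit idea on this slide can be sketched with a toy in-memory model. This is not the GRAM wire protocol: `ToyJobServer`, its methods, and the sequence-number scheme are hypothetical names chosen for illustration. The sketch shows why a submit request that is retried after a lost reply does not create a second job.

```python
class ToyJobServer:
    """Toy stand-in for a remote job service (hypothetical, not the GRAM API)."""

    def __init__(self):
        self._next_id = 1
        self._by_seq = {}     # request sequence number -> job id
        self.requests = {}    # job id -> request text (a log of known jobs)
        self.running = set()  # job ids that have been committed

    def prepare(self, seq, request):
        # Phase 1: record the request. A repeated sequence number returns
        # the same job id, so a client retry does not create a second job.
        if seq not in self._by_seq:
            self._by_seq[seq] = self._next_id
            self.requests[self._next_id] = request
            self._next_id += 1
        return self._by_seq[seq]

    def commit(self, job_id):
        # Phase 2: start the job once the client confirms it holds the id.
        # Adding to a set makes a repeated commit harmless.
        self.running.add(job_id)


def submit(server, seq, request, lose_first_reply=False):
    job_id = server.prepare(seq, request)
    if lose_first_reply:
        # Simulate a lost reply: the client retries with the same sequence number.
        job_id = server.prepare(seq, request)
    server.commit(job_id)
    return job_id


server = ToyJobServer()
first = submit(server, seq=1, request="run program P", lose_first_reply=True)
second = submit(server, seq=2, request="run program Q")
print(first, second, sorted(server.running))
```

The retried request for "run program P" maps to a single job id, which is the property that exactly-once execution depends on.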

3.3. MDS protocols
Monitoring and Discovery System
• Allows discovering and disseminating information about the structure and state of Grid resources.
• Uses GSI for access control.
The idea:
• A resource uses the Grid Resource Registration Protocol (GRRP) to notify other entities that it is part of the Grid.
• Those entities can then use the Grid Resource Information Protocol (GRIP) to obtain information about resource status.
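
The register-then-query idea can be sketched as a small in-memory directory. `ToyDirectory`, its method names, the host names, and the status fields are all hypothetical; the real GRRP and GRIP message formats are not modeled here.

```python
class ToyDirectory:
    """In-memory stand-in for a Grid information directory (hypothetical)."""

    def __init__(self):
        self.entries = {}

    def register(self, resource, **status):
        # Registration step (the role GRRP plays): the resource announces
        # that it is part of the Grid.
        self.entries[resource] = dict(status)

    def lookup(self, resource):
        # Information step (the role GRIP plays): another entity asks for
        # the resource's status. Unknown resources yield None.
        return self.entries.get(resource)

    def discover(self, predicate):
        # Discovery: names of all registered resources whose status matches.
        return sorted(name for name, status in self.entries.items()
                      if predicate(status))


directory = ToyDirectory()
directory.register("cluster-a.example.org", scheduler="pbs", free_cpus=16)
directory.register("cluster-b.example.org", scheduler="lsf", free_cpus=0)

idle = directory.discover(lambda status: status["free_cpus"] > 0)
print(idle)
print(directory.lookup("cluster-b.example.org"))
```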

3.4. GASS service
The Globus Toolkit’s Global Access to Secondary Storage
• Provides mechanisms for transferring data to and from a remote HTTP, FTP, or GASS server.
• In the current context, we use these mechanisms to stage executables and input files to a remote computer.
• GSI mechanisms are used for authentication.

4. Computation management
The Condor-G agent:
• 4.1. User interface
• 4.2. Supporting remote execution
• 4.3. Credential management
• 4.4. Resource discovery and scheduling

4.1. User interface
The Condor-G agent allows the user to treat the Grid as an entirely local resource, with an API and command line tools that allow the user to perform the following job management operations:
• Submit jobs, indicating an executable name, input/output files and arguments;
• Query a job’s status, or cancel the job;
• Be informed of job termination or problems, via callbacks or asynchronous mechanisms such as email;
• Obtain access to detailed logs, providing a complete history of their jobs’ execution.

4.1. User interface
• The innovation in Condor-G is that these capabilities are provided by a personal desktop agent and supported in a Grid environment, while guaranteeing fault tolerance and exactly-once execution semantics.
• This provides the user with a familiar and reliable single access point to all the resources he/she is authorized to use.

4.2. Supporting remote execution
Job Submission Process
• The user submits jobs to the scheduler.
• The scheduler creates a GridManager daemon.
• For each job, the GridManager creates a JobManager using two-phase commit GRAM.
• GASS is used to transfer job executables and input files, and to return output.
• The JobManager submits the jobs to the local scheduling system.
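
The order of the steps above can be sketched as a function that returns a trace of who does what. The function name, the job dictionary layout, and the file names are hypothetical; nothing is actually submitted or transferred.

```python
def submit_to_grid(jobs):
    """Return an ordered trace of the submission steps for the given jobs."""
    trace = [
        "user -> scheduler: %d job(s)" % len(jobs),
        "scheduler: create GridManager",
    ]
    for job in jobs:
        name = job["executable"]
        # One JobManager per job, created through the two-phase commit.
        trace.append("GridManager: create JobManager for %s (two-phase commit)" % name)
        # The executable and every input file are staged to the remote site.
        for path in [name] + job.get("inputs", []):
            trace.append("GASS: stage %s" % path)
        trace.append("JobManager: submit %s to local scheduler" % name)
    return trace


trace = submit_to_grid([{"executable": "sim", "inputs": ["params.dat"]}])
for step in trace:
    print(step)
```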

4.2. Supporting remote execution
Crash Tolerance
Condor-G is built to tolerate four types of failure:
1. Crash of the Globus JobManager:
• The GridManager then probes the GateKeeper.
• If the GateKeeper responds, then a new JobManager is started.

4.2. Supporting remote execution
Crash Tolerance
Condor-G is built to tolerate four types of failure:
2 & 3. Resource management machine or network failure:
• The GridManager waits until the connection is re-established.
• Then it reconnects to the JobManager.

4.2. Supporting remote execution
Crash Tolerance
Condor-G is built to tolerate four types of failure:
4. Job submission machine failure:
• The GridManager gives the JobManager its new IP address and port.
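
The four failure cases can be collected into one decision function. This is a sketch under stated assumptions: the function and its three boolean inputs are hypothetical, and it assumes that a GateKeeper that does not answer means the resource management machine or the network is down.

```python
def recovery_action(jobmanager_responds, gatekeeper_responds, agent_restarted):
    """Pick the GridManager's recovery step for one job (toy decision table)."""
    if agent_restarted:
        # Case 4: the submission machine came back with a new address.
        return "send new IP and port to JobManager"
    if jobmanager_responds:
        return "no action"
    if gatekeeper_responds:
        # Case 1: only the JobManager crashed.
        return "start new JobManager"
    # Cases 2 and 3: remote machine or network failure.
    return "wait for connection, then reconnect to JobManager"


print(recovery_action(False, True, False))
print(recovery_action(False, False, False))
print(recovery_action(True, True, True))
```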

4.3. Credential Management
• A GSI proxy credential is used to authenticate with resources.
• Because proxy credentials expire, the agent periodically checks the user’s credentials.
• When credentials expire, the jobs are put on hold and the user is notified.
• Problem: long tasks will require frequent proxy updates.

4.3. Credential Management
Solution: the MyProxy system (long-lived proxy credentials)
A long-lived proxy credential is stored on a MyProxy server. Remote services acting on behalf of the user can then obtain short-lived proxies (e.g. 12 hours) from the server.
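
The periodic credential check, with and without a credential server, can be sketched as follows. Times are plain numbers of seconds, and `check_proxy` and `toy_myproxy` are hypothetical names; no real MyProxy client call is used.

```python
TWELVE_HOURS = 12 * 3600  # short-lived proxy lifetime from the example above


def check_proxy(now, expires_at, jobs, fetch_short_proxy=None):
    """One periodic check. Returns (expires_at, messages_for_user)."""
    if now < expires_at:
        return expires_at, []              # proxy still valid
    if fetch_short_proxy is not None:
        return fetch_short_proxy(now), []  # renewed from the credential server
    for job in jobs:
        job["state"] = "held"              # no valid proxy: hold the jobs
    return expires_at, ["proxy expired: %d job(s) on hold" % len(jobs)]


def toy_myproxy(now):
    # Hypothetical stand-in for asking a MyProxy-style server for a new proxy.
    return now + TWELVE_HOURS


jobs = [{"id": 1, "state": "running"}]
renewed = check_proxy(now=100, expires_at=50, jobs=jobs, fetch_short_proxy=toy_myproxy)
held = check_proxy(now=100, expires_at=50, jobs=jobs)
print(renewed)
print(held)
print(jobs)
```

Without a credential server the user has to be notified and act; with one, the agent can renew on its own and the jobs keep running.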

4.4. Resource discovery and scheduling
• The simple approach: a user-supplied list of GRAM servers.
• The resource broker: gathers information about available GRAM servers using the Monitoring and Discovery System (MDS). The user can then choose from the list of available servers.
For high-throughput computations, “flooding” is applied: candidate resources are flooded with requests to execute jobs.

5. GlideIn mechanism
What happens when a job executes on a remote platform where required files are not available and local policy may not permit access to local file systems?
Solution: Sandboxing

5. GlideIn mechanism
The idea:
• Start a daemon on the remote computer that learns about the available settings and resources.
• Run each user task in a “sandbox”, where system calls are redirected back to the user’s local (submitting) system.
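
The redirection idea can be sketched with a toy I/O layer. This assumes that the system a task's calls are redirected to is the user's submitting machine; `SubmitMachine`, `Sandbox`, and `user_task` are hypothetical names, and real system-call trapping is not modeled.

```python
class SubmitMachine:
    """Stands in for the user's own machine, where the files actually live."""

    def __init__(self, files):
        self.files = dict(files)

    def read(self, path):
        return self.files[path]

    def write(self, path, data):
        self.files[path] = data


class Sandbox:
    """I/O layer handed to the task; every call is forwarded to the submit machine."""

    def __init__(self, home):
        self.home = home
        self.forwarded = []  # record of redirected calls

    def read(self, path):
        self.forwarded.append(("read", path))
        return self.home.read(path)

    def write(self, path, data):
        self.forwarded.append(("write", path))
        self.home.write(path, data)


def user_task(io):
    # The task never touches the remote node's own file system.
    io.write("output.txt", io.read("input.txt").upper())


home = SubmitMachine({"input.txt": "hello grid"})
sandbox = Sandbox(home)
user_task(sandbox)
print(home.files["output.txt"])
print(sandbox.forwarded)
```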

FATIH UNIVERSITY Computer Engineering THANKS QUESTIONS? Helton MALAMBANE
