Grid Job, Information and Data Management for the Run II Experiments at FNAL


Presentation Transcript


  1. Grid Job, Information and Data Management for the Run II Experiments at FNAL Igor Terekhov et al., FNAL/CD/CCF, D0, CDF, and the Condor team

  2. Plan of Attack • Brief History, D0 and CDF computing • Grid Jobs and Information Management • Architecture • Job management • Information management • JIM project status and plans • Globally Distributed data handling in SAM and beyond • Summary Igor Terekhov, FNAL

  3. History • Run II CDF and D0, the two largest currently running collider experiments • Each experiment is to accumulate ~1 PB of raw, reconstructed, and analyzed data by 2007; the aim is to get the Higgs jointly. • Real data acquisition – 5 /wk, 25 MB/s, 1 TB/day, plus Monte Carlo Igor Terekhov, FNAL

  4. Globally Distributed Computing • D0 – 78 institutions, 18 countries. CDF – 60 institutions, 12 countries. • Many institutions have computing (including storage) resources – dozens for each of D0 and CDF • Some of these are actually shared, regionally or experiment-wide • Sharing is good • A way for an institution to contribute to the collaboration while keeping its resources local • The recent Grid trend (and its funding) encourages it Igor Terekhov, FNAL

  5. Goals of Globally Distributed Computing in Run II • To distribute data to processing centers – SAM is a way, see a later slide • To benefit from the pool of distributed resources – maximize job turnaround while keeping a single interface • To facilitate and automate decision making on job/data placement • Submit to the cyberspace, let the system choose the best resource • To provide an aggregate view of the system and its activities and keep track of what's happening • To maintain security • Finally, to learn and prepare for LHC computing Igor Terekhov, FNAL

  6. SAM Highlights • SAM is Sequential data Access via Meta-data. http://{d0,cdf}db.fnal.gov/sam • Presented numerous times at previous CHEP conferences • Core features: meta-data cataloguing, global data replication and routing, co-allocation of compute and data resources • Global data distribution: • MC import from remote sites • Off-site analysis centers • Off-site reconstruction (D0) Igor Terekhov, FNAL

  7. Now that the Data's Distributed: JIM • Grid Jobs and Information Management • Owes its existence to D0 Grid funding – PPDG (an FNAL team), UK GridPP (Rod Walker, ICL) • Very young – started in 2001 • Actively explore, adopt, enhance, and develop new Grid technologies • Collaborate with the Condor team from the University of Wisconsin on job management • JIM together with SAM is also called the SAMGrid Igor Terekhov, FNAL

  8. Igor Terekhov, FNAL

  9. Job Management Strategies • We distinguish grid-level (global) job scheduling (selection of a cluster to run on) from local scheduling (distribution of the job within the cluster) • We distinguish structured jobs from unstructured ones • Structured jobs have their details known to the Grid middleware • Unstructured jobs are mapped as a whole onto a cluster • In the first phase, we want reasonably intelligent scheduling and reliable execution of unstructured, data-intensive jobs Igor Terekhov, FNAL

  10. Job Management Highlights • We seek to provide automated resource selection (brokering) at the global level, with final scheduling done locally (environments like the CDF CAF, see Frank's talk) • Focus on data-intensive jobs: • Execution time is composed of: • Time to retrieve any missing input data • Time to process the data • Time to store output data • To leading order, we rank sites by the amount of data cached at the site (minimizing missing input data) – see the sketch after this slide • The scheduler is interfaced with the data handling system Igor Terekhov, FNAL
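A minimal Python sketch of this leading-order ranking. All site and job attributes (input_bytes, cached_input_bytes, the bandwidth and rate figures) are invented for illustration; JIM itself performs the equivalent ranking inside the Condor matchmaking framework by querying the SAM data handling system.

    def estimate_turnaround(job, site):
        """Leading-order estimate: fetch missing input + process + store output."""
        missing = max(job["input_bytes"] - site["cached_input_bytes"], 0)
        t_fetch = missing / site["wan_Bps"]                # retrieve missing input data
        t_process = job["input_bytes"] / site["cpu_Bps"]   # process the data
        t_store = job["output_bytes"] / site["store_Bps"]  # store output data
        return t_fetch + t_process + t_store

    def rank_sites(job, sites):
        """Prefer the site with the most input data already cached,
        i.e. the smallest estimated turnaround."""
        return sorted(sites, key=lambda s: estimate_turnaround(job, s))

    job = {"input_bytes": 500e9, "output_bytes": 50e9}
    sites = [
        {"name": "fnal-d0", "cached_input_bytes": 450e9,
         "wan_Bps": 10e6, "cpu_Bps": 50e6, "store_Bps": 30e6},
        {"name": "gridpp-icl", "cached_input_bytes": 100e9,
         "wan_Bps": 10e6, "cpu_Bps": 80e6, "store_Bps": 30e6},
    ]
    for s in rank_sites(job, sites):
        print(s["name"], round(estimate_turnaround(job, s)), "s")

With these numbers the site holding most of the input in its cache wins even though the other site has faster CPUs, which is the "leading order" point of the slide.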

  11. Job Management – Distinct JIM Features • Decision making is based on both: • Information that exists irrespective of jobs (resource description) • Functions of (job, resource) • Decision making is interfaced with the data handling middleware rather than with individual SEs or the RC alone: this allows data handling considerations to be incorporated • Decision making is done entirely within the Condor framework (no resource broker of our own) – a strong promotion of standards and interoperability Igor Terekhov, FNAL

  12. Architecture diagram: user interfaces and submission clients feed the job management layer – a match making service (broker), a queuing system, and an information collector – which dispatches the job to execution sites #1…#n; each site hosts computing elements, storage elements, and grid sensors, all interfaced to the data handling system. Igor Terekhov, FNAL

  13. Condor Framework and Enhancements We Drove • Initial Condor-G: • A personal Grid agent helping a user run a job on a cluster of his/her choice • JIM: a true grid service for accepting and placing jobs from all users • Added the MMS for Grid job brokering • JIM: from a 2-tier to a 3-tier architecture • Decouple the queuing/spooling/scheduling machine from the user machine • Security delegation, proper std* spooling, etc. • Will move into standard Condor Igor Terekhov, FNAL

  14. Condor Framework and Enhancements We Drove • Classic Matchmaking Service (MMS): • Clusters advertise their availability; jobs are matched with clusters • The cluster (resource) description exists irrespective of jobs • JIM: ranking expressions contain functions that are evaluated at run-time • Helps rank a job by a function(job, resource) • Now: query participating sites for the data cached. Future: estimates of when data for the job can arrive, etc. • This feature is now in standard Condor-G – see the sketch after this slide Igor Terekhov, FNAL
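A rough sketch, again in Python with made-up names, of the distinction the slide draws: resource advertisements exist independently of any job, while the ranking expression is a function of (job, resource) evaluated only at match time. In JIM the run-time part is a query to the data handling system; here it is stubbed from the advertisement itself.

    def cached_fraction(job, resource):
        # Evaluated at match time; in JIM this would be a run-time query to the
        # data handling system rather than a static field of the advertisement.
        return min(resource["cached_input_bytes"] / job["input_bytes"], 1.0)

    def matchmake(job, resource_ads, rank):
        """Match a job against resource ads using a rank(job, resource) function,
        mimicking the MMS extension described above."""
        candidates = [r for r in resource_ads if r["free_slots"] > 0]
        return max(candidates, key=lambda r: rank(job, r), default=None)

    resource_ads = [                      # exist irrespective of any job
        {"site": "fnal-d0", "free_slots": 120, "cached_input_bytes": 400e9},
        {"site": "gridpp-icl", "free_slots": 300, "cached_input_bytes": 80e9},
    ]
    job = {"input_bytes": 500e9}
    print(matchmake(job, resource_ads, rank=cached_fraction)["site"])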

  15. Monitoring Highlights • Sites (resources) and jobs • Distributed knowledge about jobs etc • Incremental knowledge building • GMA for current state inquiries, Logging for recent history studies • All Web based Igor Terekhov, FNAL

  16. Information Management – Implementation and Technology Choices • XML for the representation of site configuration and (almost) all other information • XQuery and XSLT for information processing • Xindice and other native XML databases for database semantics • See the sketch after this slide Igor Terekhov, FNAL
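To make these choices concrete, a small illustration with an invented site-configuration schema (the element and attribute names are hypothetical); Python's standard-library ElementTree stands in here for the XQuery/XSLT processing and the Xindice native XML database actually named on the slide.

    import xml.etree.ElementTree as ET

    # Hypothetical site configuration document.
    site_config = """
    <site name="fnal-d0">
      <cluster name="central-analysis" batch="condor" cpus="400"/>
      <storage-element name="enstore" protocol="encp" cache-gb="2000"/>
      <storage-element name="dcache" protocol="dcap" cache-gb="500"/>
      <data-handling product="sam" station="central-analysis"/>
    </site>
    """

    root = ET.fromstring(site_config)
    # Query: which storage elements does the site advertise, and what cache sizes?
    for se in root.findall("storage-element"):
        print(se.get("name"), se.get("cache-gb"), "GB cache")
    # No fixed schema: new elements and attributes can appear without breaking readers.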

  17. Schema diagram: a meta-schema ties the main site/cluster configuration schema together with schemas for resource advertisement, monitoring, data handling, and the hosting environment. Igor Terekhov, FNAL

  18. JIM monitoring diagram: Web browsers contact JIM monitoring Web servers 1…N, which in turn query the information systems at sites 1…N over IP. Igor Terekhov, FNAL

  19. JIM Project Status • Delivered a prototype for D0 on Oct 10, 2002: • Remote job submission • Brokering based on the data cached • Web-based monitoring • SC-2002 demo – 11 sites (D0, CDF), a big success • April 2003 – production deployment of V1 (Grid analysis in production is a reality as of April 1) • Post V1 – OGSA, Web services, logging service Igor Terekhov, FNAL

  20. Grid Data Handling • We define GDH as a middleware service which: • Brokers storage requests • Maintains economic knowledge about the costs of access to different SEs • Replicates data as needed (not only as driven by administrators) • Generalizes or replaces some of the services of the data management part of SAM • A sketch of the idea follows this slide Igor Terekhov, FNAL
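A purely illustrative Python sketch of the GDH idea as stated above, with invented class names and cost figures: broker a read request by comparing access costs across storage elements, and replicate on demand rather than only by administrator action.

    class StorageElement:
        def __init__(self, name, access_cost, files=()):
            self.name = name
            self.access_cost = access_cost    # relative (economic) cost per GB read
            self.files = set(files)

    def broker_read(filename, storage_elements, replicate_to=None):
        """Return the cheapest SE holding the file; optionally replicate it."""
        holders = [se for se in storage_elements if filename in se.files]
        if not holders:
            raise LookupError(filename + " is not known to any storage element")
        best = min(holders, key=lambda se: se.access_cost)
        if replicate_to is not None and filename not in replicate_to.files:
            replicate_to.files.add(filename)  # demand-driven replication
        return best

    tape = StorageElement("enstore-tape", access_cost=5.0, files={"raw001.dat"})
    disk = StorageElement("dcache-disk", access_cost=1.0)
    print(broker_read("raw001.dat", [tape, disk], replicate_to=disk).name)
    print(broker_read("raw001.dat", [tape, disk]).name)  # now served from disk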

  21. Grid Data Handling, Initial Thoughts Igor Terekhov, FNAL

  22. The Necessary (Almost) Final Slide • Run II experiments' computing is highly distributed, so the Grid trend is very relevant • The JIM (Jobs and Information Management) part of SAMGrid addresses the needs for global and Grid computing at Run II • We use Condor and Globus middleware to schedule jobs globally (based on data) and provide Web-based monitoring • A demo is available – see me or Gabriele • SAM, the data handling system, is evolving towards the Grid, with access to modern storage elements enabled Igor Terekhov, FNAL

  23. P.S. – Related Talks • F. Wuerthwein, CAF (Cluster Analysis Facility) – job management on a cluster and interface to JIM/Grid • F. Ratnikov, Monitoring on CAF and interface to JIM/Grid • S. Stonjek, SAMgrid deployment experiences • L. Lueking, G. Garzoglio – SAM-related Igor Terekhov, FNAL

  24. Backup Slides

  25. Information Management • In JIM's view, this includes both: • Resource description for job brokering • Infrastructure for monitoring (a core project area) • GT MDS is not sufficient: • Need a (persistent) information representation that is independent of LDIF or other such formats • Need maximum flexibility in the information structure – no fixed schema • Need configuration tools, push operation, etc. Igor Terekhov, FNAL
