
Grid3: Practice and Principles


Presentation Transcript


  1. Grid3: Practice and Principles Rob Gardner University of Chicago rwg@hep.uchicago.edu April 12, 2004

  2. Introduction • Today I’d like to review a little about Grid3, its principles and practice • I’d also like to show how we intend to use it for ATLAS Data Challenge 2… • …and how we will evolve it over the course of the next several months • Acknowledgements • I’m entirely reporting on others’ work! • Much of the project was summarized at the iVDGL NSF review (Feb 04) and the US LHC Project reviews (Jan 04) • I am especially indebted to Marco Mambelli, Ian Fisk, Ruth Pordes, Jorge Rodriguez, and Leigh Grundhoefer for slides

  3. Outline • Intro/project background • Grid3 infrastructure and operations • Applications • Metrics and Lessons • Grid3+, development, and ATLAS DC2 • Conclusions

  4. Grid themes: then and now • [Diagram: tiered grid topology — Tier0/1, Tier2, and Tier3 facilities connected by 10 Gbps, 2.5 Gbps, 622 Mbps, and other links] • E.g. proposal (simple, naïve?) • Internationally integrated virtual data grid system, interoperable, controlled sharing of data and compute resources • Common grid infrastructure, export to other application domains • Large scale research “Laboratory” for VDT development • Now • Many new initiatives, in US & worldwide • Dynamic resources and organizations • Dynamic project priorities • Ability to adapt to change is the key factor for success → Grid2003 (Grid3)

  5. Grid2003 Project history • Joint project with USATLAS, USCMS, iVDGL, PPDG, GriPhyN • Organized as a “Project”: Grid2003 • Developed the Grid3 “grid” in Summer/Fall 2003 • Benefited from STAR cycles and local efforts at BNL/ITD • Uses US-developed components • VDT based (GRAM, GridFTP, MDS, monitoring components) + applications • iGOC monitoring and VO level services • Interoperate, or federate, with other grids such as LCG • ATLAS: successful use of Chimera+LCG-1 last December • USCMS: storage element interoperability

  6. 23 institutes • Argonne National Laboratory: Jerry Gieraltowski, Scott Gose, Natalia Maltsev, Ed May, Alex Rodriguez, Dinanath Sulakhe • Boston University: Jim Shank, Saul Youssef • Brookhaven National Laboratory: David Adams, Rich Baker, Wensheng Deng, Jason Smith, Dantong Yu • Caltech: Iosif Legrand, Suresh Singh, Conrad Steenberg, Yang Xia • Fermi National Accelerator Laboratory: Anzar Afaq, Eileen Berman, James Annis, Lothar Bauerdick, Michael Ernst, Ian Fisk, Lisa Giacchetti, Greg Graham, Anne Heavey, Joe Kaiser, Nickolai Kuropatkin, Ruth Pordes*, Vijay Sekhri, John Weigand, Yujun Wu • Hampton University: Keith Baker, Lawrence Sorrillo • Harvard University: John Huth • Indiana University: Matt Allen, Leigh Grundhoefer, John Hicks, Fred Luehring, Steve Peck, Rob Quick, Stephen Simms • Johns Hopkins University: George Fekete, Jan vandenBerg • Kyungpook National University / KISTI: Kihyeon Cho, Kihwan Kwon, Dongchul Son, Hyoungwoo Park • Lawrence Berkeley National Laboratory: Shane Canon, Jason Lee, Doug Olson, Iwona Sakrejda, Brian Tierney • University at Buffalo: Mark Green, Russ Miller • University of California San Diego: James Letts, Terrence Martin • University of Chicago: David Bury, Catalin Dumitrescu, Daniel Engh, Ian Foster, Robert Gardner*, Marco Mambelli, Yuri Smirnov, Jens Voeckler, Mike Wilde, Yong Zhao, Xin Zhao • University of Florida: Paul Avery, Richard Cavanaugh, Bockjoo Kim, Craig Prescott, Jorge L. Rodriguez, Andrew Zahn • University of Michigan: Shawn McKee • University of New Mexico: Christopher T. Jordan, James E. Prewett, Timothy L. Thomas • University of Oklahoma: Horst Severini • University of Southern California: Ben Clifford, Ewa Deelman, Larry Flon, Carl Kesselman, Gaurang Mehta, Nosa Olomu, Karan Vahi • University of Texas, Arlington: Kaushik De, Patrick McGuigan, Mark Sosebee • University of Wisconsin-Madison: Dan Bradley, Peter Couvares, Alan De Smet, Carey Kireyev, Erik Paulson, Alain Roy • University of Wisconsin-Milwaukee: Scott Koranda, Brian Moe • Vanderbilt University: Bobby Brown, Paul Sheldon • *Contact authors • ~60 people working directly: 8 full time, 10 half time, 20 site admins at ¼ time

  7. Grid3 services, roughly • Site Software • VO Services • Information Services • Monitoring • Applications

  8. Operations - Site Software • Design of the Grid3 site software distribution base • Largely based upon the successful WorldGrid installation deployment • Create and maintain installation guides • Coordinate upgrades after initial installation (“Installation Fests”) • [Diagram: Pacman delivers the iVDGL:Grid3 site software package — VDT, VO service, GIIS registration, information providers, Grid3 schema, log management — onto a Grid3 site’s compute facility and storage]
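The install step is driven by Pacman pulling the iVDGL:Grid3 package named on the slide. The following is a minimal sketch of how a site admin might script that step; the install directory and the exact `pacman` invocation are assumptions, not the documented Grid3 procedure.

```python
#!/usr/bin/env python
"""Hedged sketch: drive a Pacman-based Grid3 site install from a script.

Assumes `pacman` is on the PATH and that the package name is `iVDGL:Grid3`
(as named on the slide); the exact command-line options are illustrative.
"""
import subprocess
import sys

INSTALL_DIR = "/opt/grid3"          # hypothetical install location
PACKAGE = "iVDGL:Grid3"             # cache:package name from the slide

def install_site_software():
    # Run the Pacman fetch/install step inside the install directory.
    result = subprocess.run(
        ["pacman", "-get", PACKAGE],
        cwd=INSTALL_DIR,
    )
    if result.returncode != 0:
        sys.exit("Pacman install failed; check the Grid3 installation guide")

if __name__ == "__main__":
    install_site_software()
```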

  9. Operations – Security Base • The iVDGL Registration Authority is one of the Virtual Organization Registration Authorities (VO RAs) operating with delegated authority from the DOE Grids Certificate Authority • The iVDGL RA checks the identity of individuals requesting certificates • 282 certificates have been issued for iVDGL use

  10. Operations – Security Model • Provide Grid3 compute resources with an automated multi-VO authorization model, using VOMS and mkgridmap • Each VO manages a service and its members • Each Grid3 site is able to generate a Globus authorization file with an authenticated SOAP query to each VO service • [Diagram: the SDSS, USCMS, USAtlas, BTeV, LSC, and iVDGL VOMS servers feed the Globus authorization files at Grid3 sites]
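As a rough illustration of the end result of that process, the sketch below merges per-VO lists of certificate subject DNs into a Globus grid-mapfile (one `"subject DN" local-account` entry per line, which is the standard format). Reading member DNs from local text files and the VO-to-account mapping are assumptions standing in for the authenticated VOMS queries that mkgridmap actually performs.

```python
"""Hedged sketch: compose a Globus grid-mapfile from per-VO member lists.

In Grid3 this merging is done by mkgridmap after authenticated queries to
each VO's VOMS server; here the member DNs are simply read from local text
files (one DN per line), an assumption made purely for illustration.
"""
from pathlib import Path

# Hypothetical mapping of VO name -> (member list file, local pool account)
VO_SOURCES = {
    "usatlas": ("members/usatlas.txt", "usatlas1"),
    "uscms":   ("members/uscms.txt",   "uscms01"),
    "sdss":    ("members/sdss.txt",    "sdss"),
}

def build_gridmap(output="grid-mapfile"):
    lines = []
    for vo, (member_file, account) in VO_SOURCES.items():
        for dn in Path(member_file).read_text().splitlines():
            dn = dn.strip()
            if dn:
                # Standard grid-mapfile entry: quoted subject DN, local account
                lines.append(f'"{dn}" {account}')
    Path(output).write_text("\n".join(lines) + "\n")

if __name__ == "__main__":
    build_gridmap()
```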

  11. Operations - Support and Policy • Investigation and resolution of grid middleware problems at the level of 16-20 contacts per week • Develop Service Level Agreements for Grid service systems and the iGOC support service • and for other centralized support points such as the LHC Tier1 centers • Membership Charter completed, defining the process for adding new VOs, sites, and applications to the Grid Laboratory • Support Matrix defining Grid3 and VO service providers and contact information

  12. Operations - Index Services • Hierarchical Globus Information Index Service (GIIS) design • Automated resource registration to the Index Service • MDS Grid3 schema development and information provider verification • MDS tuning for a large heterogeneous grid • [Diagram: site GRIS servers (e.g. Boston U, UofChicago, UFL, FNAL, RiceU, CalTech, ANL, BNL) register to per-VO GIISes (USAtlas, USCMS, …), which register to the top-level Grid3 Index Service; published attributes include the Grid3 location, data, application, and temporary directories]
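MDS2 index services are LDAP servers, so the hierarchy above can be walked with an ordinary LDAP search. Below is a minimal sketch using the Python ldap3 library; the hostname, port 2135, and the `mds-vo-name=...` base DN follow common MDS2 conventions but should be treated as assumptions for any particular Grid3 index.

```python
"""Hedged sketch: query an MDS2-style index service over LDAP with ldap3.

MDS2 GIIS/GRIS servers are LDAP servers (conventionally on port 2135).
The host name and base DN below are placeholders, not actual Grid3 values.
"""
from ldap3 import Server, Connection, SUBTREE, ALL_ATTRIBUTES, NONE

GIIS_HOST = "giis.example.edu"            # hypothetical index service host
GIIS_PORT = 2135                           # conventional MDS2 port
BASE_DN = "mds-vo-name=local,o=grid"       # typical MDS2 base DN

def dump_index():
    server = Server(GIIS_HOST, port=GIIS_PORT, get_info=NONE)
    conn = Connection(server, auto_bind=True)   # anonymous bind
    conn.search(
        search_base=BASE_DN,
        search_filter="(objectClass=*)",
        search_scope=SUBTREE,
        attributes=ALL_ATTRIBUTES,
    )
    for entry in conn.entries:
        print(entry.entry_dn)
    conn.unbind()

if __name__ == "__main__":
    dump_index()
```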

  13. Grid Operations - Site Monitoring • Ganglia • Open source tool that collects cluster monitoring information such as CPU and network load, memory and disk usage • MonALISA • Monitoring tool supporting resource discovery, access to information, and gateways to other information-gathering systems • ACDC Job Monitoring System • Application that uses grid-submitted jobs to query the job managers and collect information about jobs; this information is stored in a DB and available for aggregated queries and browsing • Metrics Data Viewer (MDViewer) • Analyzes and plots information collected by the different monitoring tools, such as the DBs at the iGOC • Globus MDS • Grid2003 schema for information services and index services

  14. [Diagram: monitoring data flow — producers (OS syscalls and /proc, job managers, log files, system configuration, GRIS) feed intermediaries and monitoring services (Ganglia, ACDC Job DB, MonALISA and its ML repository, site catalog, VO GIIS/GIIS, information service clients), which are read by consumers and client tools (web pages, MonALISA client, user clients, MDViewer, reports)]

  15. Ganglia • Usage information • CPU load • NIC traffic • Memory usage • Disk usage • Used directly and indirectly • Site Web pages • Central Web pages • MonALISA agent

  16. MonALISA • Flexible framework • Java based • JINI directory • Multiple agents • Nice graphic interface • 3D globe • Elastic connection • ML repository • Persistent repository

  17. Metrics Data Viewer • Flexible tool for information analysis • Databases, log files • Multiple plot capabilities • Predefined plots, possibility to add new ones • Customizable, possibility to export the plots

  18. MDViewer (2) • CPU provided • CPU used • Load • Usage per VO • IO • NIC • File transfers per VO • Jobs • Submitted, Failed, … • Running, Idle, …

  19. MDViewer (3)

  20. Operations - Site Catalog Map

  21. Grid2003 Applications

  22. Application Overview • 7 scientific applications and 3 CS demonstrators • All iVDGL experiments participated in the Grid2003 project • A third HEP and two bio-chemical experiments also participated • Over 100 users authorized to run on Grid3 • Application execution performed by dedicated individuals • Typically 1, 2 or 3 users ran the applications from a particular experiment • Participation from all Grid3 sites • Sites categorized according to policies and resources • Applications ran concurrently on most of the sites • Large sites with generous local use policies were more popular

  23. Scientific Applications • High Energy Physics Simulation and Analysis • USCMS: MOP, GEANT-based full MC simulation and reconstruction • Workflow and batch job scripts generated by McRunJob • Jobs generated at the MOP master (outside of Grid3), which submits to Grid3 sites via Condor-G • Data products are archived at Fermilab: SRM/dCache • USATLAS: GCE, GEANT-based full MC simulation and reconstruction • Workflow is generated by Chimera VDS, with the Pegasus grid scheduler and Globus MDS for resource discovery • Data products archived at BNL: MAGDA and Globus RLS are employed • USATLAS: DIAL, distributed analysis application • Dataset catalogs built, n-tuple analysis and histogramming (data generated on Grid3) • BTeV: full MC simulation • Also utilizes the Chimera workflow generator and Condor-G (VDT)

  24. Scientific Applications, cont • Astrophysics and Astronomical • LIGO/LSC: blind search for continuous gravitational waves • SDSS: maxBcg, cluster finding package • Bio-Chemical • SnB: bio-molecular program that analyzes X-ray diffraction data to determine molecular structures • GADU/Gnare: genome analysis, compares protein sequences • Computer Science • Evaluation of adaptive data placement and scheduling algorithms

  25. CS Demonstrator Applications • Exerciser • Periodically runs low priority jobs at each site to test operational status • NetLogger-grid2003 • Monitored data transfers between Grid3 sites via NetLogger instrumented pyglobus-url-copy • GridFTP Demo • Data mover application using GridFTP designed to meet the 2TB/day metric

  26. Running on Grid3 • With information provided by the Grid3 information system, the user: • Composes a list of target sites • Resources available • Local site policies • Finds where to install the application and where to write data • The MDS information system is used • Provides pathnames for $APP, $DATA, $TMP and $WNTMP • The user sends and remotely installs the application from a local site • The user submits job(s) through Globus GRAM • The user does not need to interact with local site administrators
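For concreteness, here is a minimal sketch of the submission step, driving the pre-WS GRAM command-line client from Python. The gatekeeper contact string, jobmanager name, remote script path, and its arguments are placeholders; the only parts taken from the slide are that installation paths come from MDS and that jobs go through Globus GRAM.

```python
"""Hedged sketch: submit a job to a Grid3 site through pre-WS Globus GRAM.

Uses the classic `globus-job-submit` client (shipped with VDT/Globus 2.x).
The gatekeeper contact, jobmanager, and remote script path are placeholders.
"""
import subprocess

# Hypothetical values; in practice the site and its $APP path would be
# discovered from the MDS information system as described on the slide.
GATEKEEPER = "gate.example.edu/jobmanager-condor"
REMOTE_SCRIPT = "/grid3/app/myexpt/run_simulation.sh"

def submit(args):
    # globus-job-submit prints a job contact URL that can later be passed
    # to globus-job-status / globus-job-get-output.
    out = subprocess.run(
        ["globus-job-submit", GATEKEEPER, REMOTE_SCRIPT, *args],
        capture_output=True, text=True, check=True,
    )
    return out.stdout.strip()

if __name__ == "__main__":
    job_contact = submit(["200"])   # illustrative argument to the remote script
    print("submitted:", job_contact)
```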

  27. US CMS use of Grid3 • [Plot: CPU usage history (CPU-days) over the past three months, broken down by site, including BNL, IU, and PDSF]

  28. ATLAS PreDC2 on Grid3 (Fall 2003) • [Map: US ATLAS Testbed sites — Boston U, Michigan, Chicago-Argonne, BNL, LBL, Indiana, HU, OU, UNM, UTA, SMU — marked as Tier1, prototype Tier2, testbed, and outreach sites] • US ATLAS PreDC2 exercise: • Development of ATLAS tools for DC2 • Collaborative work on the Grid2003 project • Gain experience with the LCG grid • US ATLAS shared, heterogeneous resources contributed to Grid2003

  29. PreDC2 approach • Monte Carlo production on Grid3 sites • Dynamic software installation of ATLAS releases on Grid3 • Integrated “GCE” client-server based on VDT tools (Chimera, Pegasus, control database, lightweight grid scheduler, Globus MDS) • Measure job performance metrics using MonALISA and MDViewer (metrics data viewer) • Collect and archive output to BNL • MAGDA and Globus RLS used (distributed LRC & RLI at two sites) • Reconstruct a fraction of the data at CERN and other LCG-1 sites • To pursue Chimera/VDT interoperability issues with LCG-1 • Copy data back to BNL; exercise Tier1-CERN links • Successful integration post-SC2003! • Analysis: distributed analysis of datasets using DIAL • Dataset catalogs built, n-tuple analysis and histogramming • Metrics collected and archived on all of the above

  30. US ATLAS Datasets on Grid3 • Grid3 resources used • 16 sites, ~1500 CPUs exercised; peak ~400 jobs over a three-week period • Higgs → 4-lepton sample • Simulation and reconstruction • 2000 jobs (× 6 subjobs); 100-200 events per job (~200K events) • 500 GB output data files • Top sample • Reproduce DC1 dataset: simulation and reconstruction steps • 1200 jobs (× 6 subjobs); 100 events per job (120K sample) • 480 GB input data files • Data used by a PhD student at U. Geneva • [Plot annotation, 11/18/03: ~50% of the H → 4e sample, ~800 jobs]

  31. US ATLAS during SC2003 • [Plots: total CPU usage and CPU usage by day during SC2003, for ATLAS and CMS]

  32. US ATLAS and LCG • ATLSIM output staged on disk at BNL • Use the GCE-Client host to submit to the LCG-1 server • Jobs executing on LCG-1: • Input files registered at the BNL RLI • Stage data from BNL to the local scratch area • Run Athena reconstruction using release 6.5.0 • Write data to the local storage element • Copy back to the disk cache at BNL, register in RLS • Implementing 3rd-party transfer between the LCG SE and BNL GridFTP servers • Post job • MAGDA registration of “ESD” (combined ntuple output), verification • DIAL dataset catalog and analysis • Many lessons learned about site architectural differences, service configurations • Concrete steps towards combined use of LCG and US Grids!
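The copy-back step relies on GridFTP third-party transfer, where the client orchestrates a server-to-server copy without touching the data itself. A minimal sketch is below; the two gsiftp endpoints are placeholders, and it assumes a valid grid proxy and the standard globus-url-copy client from the VDT.

```python
"""Hedged sketch: GridFTP third-party transfer with globus-url-copy.

The client tells the source GridFTP server to stream directly to the
destination server. Hostnames/paths are placeholders and a valid grid
proxy (e.g. from grid-proxy-init) is assumed.
"""
import subprocess

SRC = "gsiftp://se.lcg-site.example/storage/atlas/esd/ntuple.root"
DST = "gsiftp://gridftp.bnl.example/cache/atlas/esd/ntuple.root"

def third_party_copy(src, dst):
    # -p 4 requests parallel data streams.
    subprocess.run(["globus-url-copy", "-p", "4", src, dst], check=True)

if __name__ == "__main__":
    third_party_copy(SRC, DST)
```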

  33. Grid2003 Metrics and Lessons

  34. Metrics Summary

  35. Grid3 Metrics Collection • Grid3 monitoring system • MonALISA • MetricsDataViewer • Queries to persistent storage DB • MonALISA plots • MDViewer plots

  36. Grid2003 Metrics Results • Hardware resources • Total of 2762 CPUs (maximum CPU count) • Off-project contribution > 60% • Total of 27 sites • 27 administrative domains with local policies in effect • Sites across the US and Korea • Running jobs • Peak number of jobs: 1100 • During SC2003 various applications were running simultaneously across various Grid3 sites

  37. Data transfer metric • [Plot: data transfers around Grid3 sites] • GridFTP demo • Data transfer application • Used concurrently with application runs • Target met 11.12.03 (4.4 TB)
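To put the 2 TB/day target from the GridFTP demo (slide 25) in perspective, the required sustained rate is a quick back-of-the-envelope calculation; the sketch below assumes decimal terabytes.

```python
# Rough arithmetic for the GridFTP demo target of 2 TB/day (decimal TB assumed).
TB = 1e12                      # bytes per terabyte (decimal convention)
seconds_per_day = 24 * 3600

target_bytes_per_s = 2 * TB / seconds_per_day
print(f"required average rate: {target_bytes_per_s / 1e6:.1f} MB/s "
      f"(~{8 * target_bytes_per_s / 1e6:.0f} Mbit/s)")
# -> roughly 23 MB/s, i.e. about 185 Mbit/s sustained across the grid
```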

  38. Global Lessons • Grid2003 was an evolutionary change in environment • The infrastructure was built on established testbed efforts from the participating experiments • However, it was a revolutionary change in scale • The available computing and storage resources increased • Factor of 4-5 over individual VO environments • The human interactions in the project increased • More participants from more experiments and closer interactions with CS colleagues • It is difficult to find many examples of successful common projects between experiments • Grid2003 is an exemplar of how this can work

  39. Lessons Categories • The Grid2003 lessons can be divided into two categories; these lessons are guiding the next phase of development • Architectural Lessons • Service setup on the clusters • Processing, storage, and information providing issues • Scale that can be realistically achieved • Interoperability issues with other grid projects • Identifying single point failures • Operational Lessons • Software configuration and software updates • Troubleshooting • Support and contacts • Information exchange and coordination • Service levels • Grid operations

  40. Identifying System Bottlenecks • Grid2003 attempted to keep requirements on sites light • Install software on interface nodes • Do not install software on worker nodes • Makes the system flexible and deployable on many architectures • Maximizes participation • Can place heavy loads on the gateway systems • As these services become overloaded they also become unreliable • Need to improve the scalability of the architecture • Difficult to make larger clusters • Head nodes already need to be powerful systems to keep up with the requirements • [Diagram: a head node carrying the Grid2003 packages sits between the WAN and the LAN of worker nodes and storage]

  41. Gateway Functionality • Gateway systems are currently expected to • Handle GRAM bridge for processing • Involves querying running and queued jobs regularly • Significant Load • Handle data transfers in and out for all jobs running on the cluster • Allows worker nodes to not need WAN access • Significant Load • Monitor and publish status of cluster elements • Audit and record cluster usage • Small Load • Collect and publish information • Site configuration and GLUE schema elements • Small Load

  42. Processing Requirements • Grid2003 is a diverse set of sites • We are attempting to provide applications with a richer description of resource requirements and to collect job execution requirements • We have not deployed resource brokers and have made limited use of job optimizers • Even with manually submitted grid jobs, it is hard to know and then satisfy the site requirements • Periodically we have fallen back to e-mail information exchange • It is equally hard for sites to know what the requirements and activities of the incoming processes will be • The Grid2003 lesson was that it is difficult to exchange requirements at job submission • Need to improve and automate information exchange

  43. Single Point Failures • The Grid2003 team already knew that single point failures are bad • Fortunately, we also learned that in our environment they are rare • The grid certificate authorities are, by design, a single point of information • The only grid-wide meltdowns were a result of certificate authority problems • Unfortunately, this happened more than once • The Grid2003 information providers were not protected against loss • Even during failures, grid functionality was not lost • The information provided is currently fairly static

  44. Operational Issue: Configuration • One of the first operational lessons Grid2003 learned was the need for better tools to check and verify the configuration • We had good packaging and deployment tools through Pacman, but we spent an enormous amount of time and effort diagnosing configuration problems • Early on Grid2003 implemented tools and scripts to check site status, but there were few tools that allowed the site admin to check site configuration • Too much manual information exchange was required before a site was operational • In the next phase of the project we expect to improve this.

  45. Software Updates • The software updates in Grid2003 varied from relatively easy to extremely painful • Some software packages could self-update • The monitoring framework MonALISA could upgrade its entire distribution • This allowed developers to make improvements and satisfy changing requirements without involving site admins • Could be set to update regularly with a cron job or triggered manually • Grid2003 is investigating whether this is a reasonable operating mode • This functionality is becoming more common

  46. Grid3 Evolution: Grid3+

  47. Grid3 → Grid3+ • Endorsement in December for extension of the project • US LHC and Trillium Grid projects support adiabatic upgrades to Grid3 and continued operation for the next 6 months • Also begin a development activity focused on grid/web services, but at a much reduced level • Planning exercise in January-February to collect “key issues from stakeholders” • Each VO: US ATLAS, US CMS, SDSS, LIGO, BTeV, iVDGL • iGOC coordinating operations • VDT as the main supplier of core middleware • Two key recommendations: • Development of a storage element interface and its introduction into Grid3 • Support for a common development test grid environment

  48. Grid Deployment Goals for 2004 • Detailed instructions for operating grid middleware through firewalls and other port-limiting software • Introduction of VOMS-extended X.509 proxies to allow callout authorization at the resource • Secure web service for user registration for VO members • Evaluate new methods for VO accounts at sites • Move to the new Pacman Version 3 for software distribution and installation • Adoption of new VDT core software which provides install-time service configuration • Support and testing for other distributed file system architectures • Install-time support for multi-homed systems
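A VOMS-extended proxy is something the user obtains before submitting work; a minimal sketch of that step is below, wrapping the standard voms-proxy-init client. The VO name and proxy lifetime are illustrative, and a configured VOMS client environment is assumed.

```python
"""Hedged sketch: obtain a VOMS-extended X.509 proxy before job submission.

Wraps the standard voms-proxy-init client; the VO name and lifetime are
illustrative, and a configured VOMS client environment is assumed.
"""
import subprocess

def make_voms_proxy(vo="usatlas", lifetime="12:00"):
    # Creates a proxy certificate carrying VOMS attributes for the given VO,
    # which the resource can use for callout authorization.
    subprocess.run(
        ["voms-proxy-init", "--voms", vo, "--valid", lifetime],
        check=True,
    )

if __name__ == "__main__":
    make_voms_proxy()
```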

  49. iGOC Infrastructure Goals for 2004 • Status of site monitoring to enable an effective job scheduling system - collection and archival of monitoring data • Reporting of available storage resources at each site • Status display of current and queued workload for each site • Integrate the Grid3 schema into the GLUE schema • Acceptable Use Policy in development • Grid3 Operational Model in negotiation

  50. New Development Grid Infrastructure • Grid3 Common Environment: • Authentication Service • Approved VOMS servers • Monitoring Service • catalog • MonALISA • ganglia • ACDC • Stable Grid3 software cache • grid3dev Environment: • Authentication Service • test VOMS server • Approved VOMS servers • new VOMS server(s) • Monitoring Service • catalog (test version) • MonALISA (test version) • ganglia (test version) • Development software caches
