The Virtual Data Grid: A New Model and Architecture for Data-Intensive Collaboration


Presentation Transcript


  1. The Virtual Data Grid: A New Model and Architecture for Data-Intensive Collaboration. Summer Grid 2004, UT Brownsville South Padre Island Center, 24 June 2004. Mike Wilde, Argonne National Laboratory, Mathematics and Computer Science Division

  2. GriPhyN: Grid Physics Network Mission
  • Enhance scientific productivity through discovery and processing of datasets, using the grid as a scientific workstation
  • Virtual Data enables this approach by creating datasets from workflow “recipes” and recording their provenance
  • GriPhyN works to “cross the chasm”: application and computer scientists create and field-test paradigms and toolkits together

  3. Acknowledgements: Virtual Data is a Large Team Effort
  • The Chimera Virtual Data System is the work of Ian Foster, Jens Voeckler, Mike Wilde and Yong Zhao
  • The Pegasus Planner is the work of Ewa Deelman, Gaurang Mehta, and Karan Vahi
  • Applications described are the work of many people, including: James Annis, Rick Cavanaugh, Dan Engh, Rob Gardner, Albert Lazzarini, Natalia Maltsev, Marge Bardeen, and their wonderful teams

  4. Virtual Data Scenario
  [Workflow diagram: simulate produces file1 and file2; reformat and conv derive file3, file4, file5 and file6 from file2; summarize produces file7 from file6; psearch combines file1, file3, file4, file5 and file7 into file8]
  • Manage workflow; update workflow following changes
  • On-demand data generation
  • Explain provenance, e.g. for file8:
  psearch –t 10 –i file3 file4 file5 –o file8
  summarize –t 10 –i file6 –o file7
  reformat –f fz –i file2 –o file3 file4 file5
  conv –l esd –o aod –i file2 –o file6
  simulate –t 10 –o file1 file2

  5. Virtual Data Describes Analysis Workflow
  [Same workflow diagram; the requested dataset is file8]
  • The recorded virtual data “recipe” here is:
  • Files: 8 < (1,3,4,5,7), 7 < 6, (3,4,5,6) < 2
  • Programs: 8 < psearch, 7 < summarize, (3,4,5) < reformat, 6 < conv, (1,2) < simulate

  6. Virtual Data Describes Analysis Workflow
  • To re-create file8: Step 1
  • simulate > file1, file2

  7. Virtual Data Describes Analysis Workflow
  • To re-create file8: Step 2
  • files 3, 4, 5, 6 are derived from file2
  • reformat > file3, file4, file5
  • conv > file6

  8. Virtual Data Describes Analysis Workflow
  • To re-create file8: Step 3
  • file7 depends on file6
  • summarize > file7

  9. Virtual Data Describes Analysis Workflow
  • To re-create file8: final step
  • file8 depends on files 1, 3, 4, 5, 7
  • psearch < file1, file3, file4, file5, file7 > file8
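
The re-creation steps above can be read directly off the recorded recipe. As a minimal illustration (not from the original slides; the dictionary layout and function name are assumptions), the following Python sketch encodes the file and program dependencies from slide 5 and derives the same rebuild order for file8:

# Minimal sketch (illustrative only): derive the rebuild order for file8
# from the recorded virtual data "recipe" on slide 5.

# Which files each file is derived from: 8 < (1,3,4,5,7), 7 < 6, (3,4,5,6) < 2.
derived_from = {
    "file8": ["file1", "file3", "file4", "file5", "file7"],
    "file7": ["file6"],
    "file3": ["file2"], "file4": ["file2"], "file5": ["file2"],
    "file6": ["file2"],
    "file1": [], "file2": [],          # produced by simulate from no prior files
}

# Which program produces each file.
producer = {
    "file8": "psearch", "file7": "summarize",
    "file3": "reformat", "file4": "reformat", "file5": "reformat",
    "file6": "conv",
    "file1": "simulate", "file2": "simulate",
}

def rebuild_order(target):
    """Post-order walk of the dependencies: producers of inputs run first."""
    steps, seen = [], set()
    def visit(f):
        if f in seen:
            return
        seen.add(f)
        for dep in derived_from[f]:
            visit(dep)
        if producer[f] not in steps:   # list each program invocation once
            steps.append(producer[f])
    visit(target)
    return steps

print(rebuild_order("file8"))
# -> ['simulate', 'reformat', 'conv', 'summarize', 'psearch'],
#    matching Step 1 through the final step above.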

  10. Grid3 – The Laboratory. Supported by the National Science Foundation and the Department of Energy.

  11. VDL: Virtual Data Language Describes Data Transformations
  • Transformation: abstract template of a program invocation; similar to a “function definition”
  • Derivation: a “function call” to a Transformation; stores past and future: a record of how data products were generated, and a recipe of how data products can be generated
  • Invocation: record of a Derivation execution
  • These XML documents reside in a “virtual data catalog” (VDC), a relational database
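
To picture how these three record kinds relate, here is a rough sketch of a toy catalog in Python (the field names and structure are assumptions for illustration, not the actual Chimera/VDC schema):

# Illustrative sketch only: the three VDL record kinds and how they link.
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Transformation:              # abstract template ("function definition")
    name: str
    formal_args: List[str]         # e.g. ["in a1", "out a2"]

@dataclass
class Derivation:                  # "function call" binding actual files to a TR
    name: str
    transformation: str            # name of the Transformation it calls
    actual_args: Dict[str, str]    # e.g. {"a1": "file1", "a2": "file2"}

@dataclass
class Invocation:                  # record of one execution of a Derivation
    derivation: str
    exit_status: int
    resource_usage: Dict[str, float] = field(default_factory=dict)

# A toy "virtual data catalog": recipes plus records of their executions.
catalog = {
    "transformations": [Transformation("tr1", ["in a1", "out a2"])],
    "derivations":     [Derivation("x1", "tr1", {"a1": "file1", "a2": "file2"})],
    "invocations":     [Invocation("x1", 0, {"cpu_seconds": 12.0})],
}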

  12. VDL Describes Workflow via Data Dependencies
  TR tr1(in a1, out a2) {
    argument stdin = ${a1};
    argument stdout = ${a2};
  }
  TR tr2(in a1, out a2) {
    argument stdin = ${a1};
    argument stdout = ${a2};
  }
  DV x1->tr1(a1=@{in:file1}, a2=@{out:file2});
  DV x2->tr2(a1=@{in:file2}, a2=@{out:file3});
  [Diagram: file1 → x1 → file2 → x2 → file3]

  13. Workflow example
  • Graph structure: fan-in and fan-out
  • "left" and "right" can run in parallel
  • Needs an external input file, located via the replica catalog
  • Data file dependencies form the graph structure
  [Diagram: preprocess → findrange ("left") and findrange ("right") → analyze]
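
A small sketch of that diamond-shaped graph in Python (illustrative only; the job names follow the slide, the scheduling helper is an assumption) shows how the file dependencies determine which jobs can run in parallel:

# Illustrative sketch of the diamond workflow: each job maps to
# (input files, output files).
jobs = {
    "preprocess":      ({"f.a"},          {"f.b1", "f.b2"}),
    "findrange_left":  ({"f.b1", "f.b2"}, {"f.c1"}),
    "findrange_right": ({"f.b1", "f.b2"}, {"f.c2"}),
    "analyze":         ({"f.c1", "f.c2"}, {"f.d"}),
}

def schedule(jobs, available):
    """Group jobs into waves; jobs in the same wave can run in parallel."""
    waves, remaining, have = [], dict(jobs), set(available)
    while remaining:
        ready = [j for j, (ins, _) in remaining.items() if ins <= have]
        if not ready:
            raise RuntimeError("unsatisfiable dependency (missing input file)")
        waves.append(ready)
        for j in ready:
            have |= remaining.pop(j)[1]
    return waves

# "f.a" is the external input located via the replica catalog.
print(schedule(jobs, {"f.a"}))
# -> [['preprocess'], ['findrange_left', 'findrange_right'], ['analyze']]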

  14. Complete VDL workflow
  • Generate appropriate derivations:
  DV top->preprocess( b=[ @{out:"f.b1"}, @{out:"f.b2"} ], a=@{in:"f.a"} );
  DV left->findrange( b=@{out:"f.c1"}, a2=@{in:"f.b2"}, a1=@{in:"f.b1"}, name="left", p="0.5" );
  DV right->findrange( b=@{out:"f.c2"}, a2=@{in:"f.b2"}, a1=@{in:"f.b1"}, name="right" );
  DV bottom->analyze( b=@{out:"f.d"}, a=[ @{in:"f.c1"}, @{in:"f.c2"} ] );

  15. Compound Transformations Enable Functional Abstractions
  • A compound TR encapsulates an entire sub-graph:
  TR rangeAnalysis( in fa, p1, p2, out fd, io fc1, io fc2, io fb1, io fb2 ) {
    call preprocess( a=${fa}, b=[ ${out:fb1}, ${out:fb2} ] );
    call findrange( a1=${in:fb1}, a2=${in:fb2}, name="LEFT",  p=${p1}, b=${out:fc1} );
    call findrange( a1=${in:fb1}, a2=${in:fb2}, name="RIGHT", p=${p2}, b=${out:fc2} );
    call analyze( a=[ ${in:fc1}, ${in:fc2} ], b=${fd} );
  }

  16. Derivation scripts
  • Representation of virtual data provenance:
  DV d1->diamond( fd=@{out:"f.00005"}, fc1=@{io:"f.00004"}, fc2=@{io:"f.00003"}, fb1=@{io:"f.00002"}, fb2=@{io:"f.00001"}, fa=@{io:"f.00000"}, p2="100", p1="0" );
  DV d2->diamond( fd=@{out:"f.0000B"}, fc1=@{io:"f.0000A"}, fc2=@{io:"f.00009"}, fb1=@{io:"f.00008"}, fb2=@{io:"f.00007"}, fa=@{io:"f.00006"}, p2="141.42135623731", p1="0" );
  ...
  DV d70->diamond( fd=@{out:"f.001A3"}, fc1=@{io:"f.001A2"}, fc2=@{io:"f.001A1"}, fb1=@{io:"f.001A0"}, fb2=@{io:"f.0019F"}, fa=@{io:"f.0019E"}, p2="800", p1="18" );
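
Scripts like this are typically generated rather than written by hand. Here is a hypothetical generator in Python (the DV syntax and file-naming pattern follow the slide; the generator itself and the parameter list are illustrative assumptions, not the project's tooling):

# Hypothetical generator for a batch of "diamond" derivations.
def make_dv(n, p1, p2):
    """Emit one DV statement; files get sequential hexadecimal labels."""
    base = 6 * (n - 1)                          # six files per derivation
    f = [f"f.{base + i:05X}" for i in range(6)]
    return (f'DV d{n}->diamond( fd=@{{out:"{f[5]}"}}, fc1=@{{io:"{f[4]}"}}, '
            f'fc2=@{{io:"{f[3]}"}}, fb1=@{{io:"{f[2]}"}}, fb2=@{{io:"{f[1]}"}}, '
            f'fa=@{{io:"{f[0]}"}}, p2="{p2}", p1="{p1}" );')

# Illustrative parameter sweep; only the first two pairs match the excerpt above.
params = [("0", "100"), ("0", "141.42135623731")]
for i, (p1, p2) in enumerate(params, start=1):
    print(make_dv(i, p1, p2))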

  17. Invocation Provenance
  • Completion status and resource usage
  • Attributes of the executable transformation
  • Attributes of input and output files

  18. Executing VDL Workflows
  [Diagram: an abstract workflow is mapped by the "Pegasus" global planner, using Grid information, into a concrete DAG, which DAGman / Condor-G executes as the local planner; a just-in-time ("jit") planner is a research topic]
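
Very roughly, the global planning step binds logical job and file names to concrete sites, executables, and physical file locations. The sketch below illustrates only that idea; the site names, replica entries, and executable paths are invented, and Pegasus itself does far more (site selection, data staging, cleanup):

# Illustration of "abstract workflow -> concrete DAG" planning; all data invented.
replica_catalog = {"f.a": "gsiftp://siteA/storage/f.a"}       # logical -> physical
transformation_catalog = {                                     # (logical, site) -> executable
    ("preprocess", "siteA"): "/opt/apps/preprocess",
    ("analyze", "siteB"): "/opt/apps/analyze",
}

abstract_jobs = [
    {"name": "preprocess", "inputs": ["f.a"], "site": "siteA"},
    {"name": "analyze", "inputs": ["f.c1", "f.c2"], "site": "siteB"},
]

def concretize(job):
    """Bind one abstract job to an executable and physical input locations."""
    exe = transformation_catalog[(job["name"], job["site"])]
    ins = [replica_catalog.get(f, f"(produced upstream) {f}") for f in job["inputs"]]
    return {"site": job["site"], "executable": exe, "inputs": ins}

for j in abstract_jobs:
    print(concretize(j))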

  19. GriPhyN-iVDGL Applications to date
  • ATLAS, BTeV, CMS – HEP event simulation
  • Argonne Computational Biology – sequence comparison and result capture
  • LIGO – pulsar search
  • Sloan Digital Sky Survey – cluster finding; near-earth object search planned
  • QuarkNet – science education – cosmic rays, HEP analysis

  20. Genome Analysis Database Update Application. Work by Alex Rodriguez, Dina Sulakhe, Natalia Maltsev, Argonne MCS. Described in a GGF10 workshop paper.

  21. Virtual Data Example: Galaxy Cluster Search DAG
  [Figure: Sloan data; galaxy cluster size distribution]
  Work by Jim Annis, Steve Kent, Vijay Sehkri (Fermilab), Michael Milligan, Yong Zhao (University of Chicago). Described in an SC2002 paper.

  22. Cluster Search Workflow Graph and Execution Trace
  [Figure: workflow jobs vs. time]

  23. Virtual Data Application: High Energy Physics Data Analysis
  [Figure: a tree of derived datasets, each annotated with the parameter metadata that produced it, e.g. mass = 200; decay = bb, ZZ, or WW; stability = 1 or 3; event = 8; plot = 1; LowPt = 20; HighPt = 10000]
  Work and slide by Rick Cavanaugh and Dimitri Bourilkov, University of Florida. Ref: CHEP 2002 paper.

  24. Using Virtual Data for Science Education
  • The QuarkNet-Trillium collaboration is using Grid virtual data tools and methods to enrich science education
  • It's an experiment to give students the means to:
  • discover and apply datasets, algorithms, and data analysis methods
  • collaborate by developing new ones and sharing results and observations
  • learn data analysis methods that will ready and excite them for a scientific career
  • And in later steps, we may actually use the Grid!

  25. QuarkNet Virtual Data Project
  [Diagram: student/teacher teams at Central High School (Reston, Virginia), Foothills High School (Great Falls, Montana), and the Yale / Middletown High Collaboration (Hartford, Connecticut), each with a cosmic ray detector and locally collected data, connect through standard Web access to the QuarkNet Virtual Data Portal, backed by the Virtual Data Toolkit and a Virtual Data Catalog holding student data, algorithms, results, notes, and communications]
  Student/teacher teams sharing data, methods, programs, and knowledge. Enabling collaboration-intensive science discovery with virtual data tools and methods.

  26. Detector Performance Study

  27. Example: BTeV Event Simulation

  28. Support for Search and Discovery
  • Goal: make it as easy to use as Google
  • More advanced capabilities lie below the surface (as with Google)
  • Understand the structure and meaning of the datasets and their fields
  • Advanced search, using SQL-like queries
  • Find both DATA and TRANSFORMATIONS
  • Create datasets from queries
  • Perform calculations on datasets, filtering results to look for patterns
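
To make the "SQL-like queries" point concrete, here is a toy sketch in Python; the table layout and column names are invented for illustration and are not the actual virtual data catalog schema:

# Toy metadata search over an in-memory catalog; the schema is invented.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE datasets        (name TEXT, experiment TEXT, mass INTEGER, decay TEXT);
CREATE TABLE transformations (name TEXT, description TEXT);
INSERT INTO datasets VALUES ('run1a.event', 'HEPSRCH', 200, 'WW'),
                            ('run2b.event', 'HEPSRCH', 200, 'ZZ');
INSERT INTO transformations VALUES ('ECalEnergySum', 'sum calorimeter energy'),
                                   ('ReconTotalEnergy', 'reconstruct total energy');
""")

# Find DATA matching physics metadata ...
rows = db.execute("SELECT name FROM datasets WHERE mass = ? AND decay = ?",
                  (200, "WW")).fetchall()
print("datasets:", [r[0] for r in rows])

# ... and TRANSFORMATIONS matching a keyword, from the same catalog.
rows = db.execute("SELECT name FROM transformations WHERE description LIKE ?",
                  ("%energy%",)).fetchall()
print("transformations:", [r[0] for r in rows])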

  29. Search by Metadata

  30. Deriving a new dataset… to find the mass of the “Z” particle:

  31. Workflow for missing energy calculations

  32. Virtual Provenance: list of derivations and files
  <job id="ID000001" namespace="Quarknet.HEPSRCH" name="ECalEnergySum" level="5"
       dv-namespace="Quarknet.HEPSRCH" dv-name="run1aesum">
    <argument><filename file="run1a.event"/> <filename file="run1a.esm"/></argument>
    <uses file="run1a.esm" link="output" dontRegister="false" dontTransfer="false"/>
    <uses file="run1a.event" link="input" dontRegister="false" dontTransfer="false"/>
  </job>
  <job id="ID000002" namespace="Quarknet.HEPSRCH" name="ECalEnergySum" level="7"
       dv-namespace="Quarknet.HEPSRCH" …>
    <argument><filename file="electron10GeV.event"/> <filename file="electron10GeV.sum"/></argument> …
  </job>
  <job id="ID000014" namespace="Quarknet.HEPSRCH" name="ReconTotalEnergy" level="3" …>
    <argument><filename file="run1a.mis"/> <filename file="run1a.ecal"/> …</argument>
    <uses file="run1a.muon" link="input" dontRegister="false" dontTransfer="false"/>
    <uses file="run1a.total" link="output" dontRegister="false" dontTransfer="false"/>
    <uses file="run1a.ecal" link="input" dontRegister="false" dontTransfer="false"/>
    <uses file="run1a.hcal" link="input" dontRegister="false" dontTransfer="false"/>
    <uses file="run1a.mis" link="input" dontRegister="false" dontTransfer="false"/>
  </job>
  <!-- list of all files used -->
  <filename file="ecal.pct" link="inout"/>
  <filename file="electron10GeV.avg" link="inout"/>
  <filename file="electron10GeV.sum" link="inout"/>
  <filename file="hcal.pct" link="inout"/> …
  (excerpted for display)

  33. Virtual Provenance in XML: control flow graph
  <child ref="ID000003"> <parent ref="ID000002"/> </child>
  <child ref="ID000004"> <parent ref="ID000003"/> </child>
  <child ref="ID000005"> <parent ref="ID000004"/> <parent ref="ID000001"/> …
  <child ref="ID000009"> <parent ref="ID000008"/> </child>
  <child ref="ID000010"> <parent ref="ID000009"/> <parent ref="ID000006"/> …
  <child ref="ID000012"> <parent ref="ID000011"/> </child>
  <child ref="ID000013"> <parent ref="ID000011"/> </child>
  <child ref="ID000014"> <parent ref="ID000010"/> <parent ref="ID000012"/> … <parent ref="ID000013"/> … </child> …
  (excerpted for display)
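
This child/parent listing is straightforward to turn back into a dependency graph. A minimal Python sketch, assuming the excerpt is wrapped in a root element (the <adag> wrapper below is an assumption made so the fragment parses as a complete document):

# Recover the job dependency graph from the <child>/<parent> listing above.
import xml.etree.ElementTree as ET

xml_text = """
<adag>
  <child ref="ID000003"> <parent ref="ID000002"/> </child>
  <child ref="ID000004"> <parent ref="ID000003"/> </child>
  <child ref="ID000005"> <parent ref="ID000004"/> <parent ref="ID000001"/> </child>
</adag>
"""

depends_on = {}                      # job id -> list of prerequisite job ids
for child in ET.fromstring(xml_text).findall("child"):
    depends_on[child.get("ref")] = [p.get("ref") for p in child.findall("parent")]

print(depends_on)
# {'ID000003': ['ID000002'], 'ID000004': ['ID000003'],
#  'ID000005': ['ID000004', 'ID000001']}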

  34. And writing the results up in a “poster”

  35. Poster describing analysis

  36. Using active data from Web Services

  37. Levels of Interaction
  • "Skins" – use it like a calculator; experiment with scenarios and settings; use virtual data like a log book to document, assess, and share parameter values
  • "Blocks" – re-assemble workflow pipelines using existing ones as patterns and pre-developed transforms as building blocks
  • "Code" – write new transforms in a variety of languages and data models

  38. Observations
  • A provenance approach based on interface definition and data flow declaration fits well with Grid requirements for code and data transportability and heterogeneity
  • Working in a provenance-managed system has many fringe benefits: uniformity, precision, structure, communication, documentation
  • The real world is messy – finding the right abstractions is hard, and handling "legacy" applications is even harder

  39. Vision for Provenance in the Large
  • Universal knowledge management and production systems
  • Vendors integrate the provenance tracking protocol into data processing products
  • Ability to run anywhere "in the Grid"

  40. Virtual Data Grid Vision

  41. Planned Dataset Model
  [Diagram of candidate dataset types: a file; a set of files; an object closure; an XML element (<FORM> <Title…> </FORM>); a relational query or spreadsheet range; a new user-defined dataset type such as a set of files with a relational index]
  Speculative model described in a CIDR 2003 paper by Foster, Voeckler, Wilde and Zhao.

  42. Planned Dataset Type Model
  [Type hierarchy diagram, with non-leaf types as superclasses; it distinguishes representational types (FileDataset, File, FileSet, MultiFileSet, TarFileSet) from logical types (EventCollection, RawEventSet, SimulatedEventSet, MonteCarloSimulation, DiscreteEventSimulation)]
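
Read as a class hierarchy, the planned type model might look roughly like the Python sketch below; the type names come from the slide, but the parent/child relationships are a guess at the diagram's structure:

# Rough sketch of the planned dataset type model; the hierarchy is a guess.
class Dataset: pass                          # assumed abstract root

# Representational types: how a dataset is physically stored.
class FileDataset(Dataset): pass
class File(FileDataset): pass
class FileSet(FileDataset): pass
class MultiFileSet(FileSet): pass
class TarFileSet(FileSet): pass

# Logical types: what a dataset means to the application
# (non-leaf types are superclasses).
class EventCollection(Dataset): pass
class RawEventSet(EventCollection): pass
class SimulatedEventSet(EventCollection): pass
class MonteCarloSimulation(SimulatedEventSet): pass
class DiscreteEventSimulation(SimulatedEventSet): pass

print([c.__name__ for c in SimulatedEventSet.__subclasses__()])
# -> ['MonteCarloSimulation', 'DiscreteEventSimulation']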

  43. Provenance Server Plans
  • OGSA-based Grid services: discovery, security, resource management
  • Supports code and data discovery and workflow management
  • Object names (TR, DS, TY, DV, IV) can be used as global cross-server links
  • Derivations can reference remote transformations and datasets
  • Structured object namespaces and object-level access control enable large VO collaboration
  • Generalize transforms to describe service calls, database queries and language interpreters

  44. Provenance Hyperlinks

  45. Indexing Servers to Support Discovery

  46. For Information and Software
  • Virtual Data System
  • www.griphyn.org/chimera – Chimera Virtual Data System: overview, papers, software
  • Grids and Grid Software
  • www.ivdgl.org/grid2003 – Using Grid3
  • www.griphyn.org/vdt – Virtual Data Toolkit
  • www.globus.org – The Globus Toolkit
  • www.cs.wisc.edu/condor – The Condor Project
  • www.ppdg.net – Particle Physics Data Grid

  47. Acknowledgements
  GriPhyN, iVDGL, and QuarkNet (in part) are supported by the National Science Foundation. The Globus Alliance, PPDG, and QuarkNet are supported in part by the US Department of Energy, Office of Science; by the NASA Information Power Grid program; and by IBM.
