The Caltech GriPhyN Analysis Workshop in June 2003 focused on the evolution of virtual data architectures for grid computing. Key topics included the development of robust catalog services, monitoring tools, and the integration of policies for scheduling and security in data management processes. Highlights featured methods for abstract workflow planning and provenance tracking. Participants explored the balance between flexibility and ease of construction in grid applications. Current and future research directions aimed at improving resource utilization and ensuring reliable data transfer were discussed.
Virtual Data Toolkit
R. Cavanaugh
GriPhyN Analysis Workshop, Caltech, June 2003
Very Early GriPhyN Data Grid Architecture

[Architecture diagram: an Application sits on top of Catalog Services (MCAT; GriPhyN catalogs), Monitoring and Info Services (MDS), Replica Management (GDMP), a Planner, an Executor (DAGMan, Condor-G), Policy/Security (GSI, CAS), a Reliable Transfer Service (GridFTP), and Compute/Storage Resources (Globus GRAM; GRAM; SRM). Legend: marked components have an initial solution that is operational.]
Currently Evolved GriPhyN Picture

[Architecture diagram; picture taken from Mike Wilde.]
Current VDT Emphasis

• Current reality
  • Easy grid construction
    • Strikes a balance between flexibility and "easibility"
    • Purposefully errs (just a little) on the side of "easibility"
  • Long-running, high-throughput, file-based computing
  • Abstract description of complex workflows
  • Virtual Data Request Planning
  • Partial provenance tracking of workflows
• Future directions (current research) include:
  • Policy-based scheduling (see the sketch after this list)
    • With notions of Quality of Service (advance reservation of resources, etc.)
  • Dataset-based (arbitrary type structures)
  • Full provenance tracking of workflows
  • Several others…
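To make the policy-based scheduling direction concrete, here is a minimal sketch, not VDT code, of how a planner might rank candidate sites by combining a local policy score with an advance-reservation (QoS) requirement. The site names, policy fields, and weighting are assumptions invented for this illustration.

    # Hedged illustration only: a toy policy-based site ranker, not part of the VDT.
    # Site names, policy fields, and the scoring rule are invented for this sketch.

    from dataclasses import dataclass

    @dataclass
    class Site:
        name: str
        free_cpus: int          # idle slots reported by monitoring
        policy_priority: float  # 0..1, how strongly local policy favours this VO
        can_reserve: bool       # supports advance reservation (QoS)

    def rank_sites(sites, need_reservation=False):
        """Return candidate sites ordered best-first under a simple policy."""
        candidates = [s for s in sites if s.can_reserve or not need_reservation]
        # Local policy dominates; free capacity breaks ties.
        return sorted(candidates,
                      key=lambda s: (s.policy_priority, s.free_cpus),
                      reverse=True)

    if __name__ == "__main__":
        sites = [Site("siteA", 120, 0.3, False),
                 Site("siteB",  40, 0.9, True),
                 Site("siteC", 300, 0.5, True)]
        for s in rank_sites(sites, need_reservation=True):
            print(s.name, s.policy_priority, s.free_cpus)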
Current VDT Flavors

Client:
  • Globus Toolkit 2 (GSI, globusrun, GridFTP client)
  • CA signing policies for DOE and EDG
  • Condor-G 6.5.1 / DAGMan
  • RLS 1.1.8 client
  • MonALISA client (soon)
  • Chimera 1.0.3

SDK:
  • Globus
  • ClassAds
  • RLS 1.1.8 client
  • Netlogger 2.0.13

Server:
  • Globus Toolkit 2.2.4 (GSI, gatekeeper, job-managers and GASS cache, MDS, GridFTP server)
  • MyProxy
  • CA signing policies for DOE and EDG
  • EDG Certificate Revocation List
  • Fault Tolerant Shell
  • GLUE Schema
  • mkgridmap
  • Condor 6.5.1 / DAGMan
  • RLS 1.1.8 server
  • MonALISA server (soon)
Chimera Virtual Data System

• Virtual Data Language
  • Textual form
  • XML form
• Virtual Data Catalog
  • MySQL- or PostgreSQL-based
  • A file-based version is also available
Virtual Data Language

    TR CMKIN( out a2, in a1 ) {
        argument file = ${a1};
        argument file = ${a2};
    }
    TR CMSIM( out a2, in a1 ) {
        argument file = ${a1};
        argument file = ${a2};
    }

    DV x1->CMKIN( a2=@{out:file2}, a1=@{in:file1} );
    DV x2->CMSIM( a2=@{out:file3}, a1=@{in:file2} );

[Diagram: the resulting dependency chain file1 → x1 → file2 → x2 → file3.]
Picture taken from Mike Wilde
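As an illustration of what these two derivations record, the following Python sketch (not Chimera code; the class and function names are invented, and it assumes one output file per step) recovers the file/derivation chain shown in the diagram above:

    # Illustrative sketch only: model the two VDL derivations above and recover
    # the dependency chain. Names and structure are invented for this example.

    from dataclasses import dataclass, field

    @dataclass
    class Derivation:
        name: str            # e.g. "x1"
        transformation: str  # e.g. "CMKIN"
        inputs: list = field(default_factory=list)   # logical files consumed
        outputs: list = field(default_factory=list)  # logical files produced

    def dependency_chain(derivations, start_file):
        """Follow producer/consumer links from an input file (one output per step)."""
        by_input = {f: d for d in derivations for f in d.inputs}
        chain, current = [start_file], start_file
        while current in by_input:
            d = by_input[current]
            current = d.outputs[0]
            chain += [d.name, current]
        return chain

    derivations = [
        Derivation("x1", "CMKIN", inputs=["file1"], outputs=["file2"]),
        Derivation("x2", "CMSIM", inputs=["file2"], outputs=["file3"]),
    ]

    print(" -> ".join(dependency_chain(derivations, "file1")))
    # file1 -> x1 -> file2 -> x2 -> file3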
Virtual Data Request Planning

• Abstract Planner
  • Graph traversal of (virtual) data dependencies
  • Generates the graph with maximal data dependencies
  • Somewhat analogous to "build" style
• Concrete (Pegasus) Planner
  • Prunes execution steps for which the data already exists (RLS lookup)
  • Binds all execution steps in the graph to a site
  • Adds "housekeeping" steps
    • Create environment, stage-in data, stage-out data, publish data, clean up environment, etc.
  • Generates a graph with minimal execution steps
  • Somewhat analogous to "make" style (both styles are sketched below)
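A rough Python sketch of the two planning styles, assuming a simple producer/consumer graph; this is illustrative only, not the abstract or Pegasus planner, and the set of "existing files" stands in for an RLS lookup:

    # Sketch only: "build" vs "make" style request planning over a virtual data
    # graph. existing_files stands in for an RLS lookup; names are illustrative.

    from collections import namedtuple

    Derivation = namedtuple("Derivation", "name inputs outputs")

    def build_plan(derivations, target):
        """Build style: every step needed to derive the target from scratch."""
        producer = {out: d for d in derivations for out in d.outputs}
        plan, stack = [], [target]
        while stack:
            d = producer.get(stack.pop())
            if d and d not in plan:
                plan.append(d)
                stack.extend(d.inputs)
        return list(reversed(plan))  # graph with maximal data dependencies

    def make_plan(derivations, target, existing_files):
        """Make style: drop steps whose outputs are already registered."""
        return [d for d in build_plan(derivations, target)
                if not all(out in existing_files for out in d.outputs)]

    derivations = [Derivation("x1", ["file1"], ["file2"]),
                   Derivation("x2", ["file2"], ["file3"])]

    print([d.name for d in build_plan(derivations, "file3")])            # ['x1', 'x2']
    print([d.name for d in make_plan(derivations, "file3", {"file2"})])  # ['x2']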
Chimera Virtual Data System: Mapping Abstract Workflows onto Concrete Environments

• Abstract DAGs (the virtual workflow)
  • Resource locations unspecified
  • File names are logical
  • Data destinations unspecified
  • "build" style
• Concrete DAGs (stuff for submission)
  • Resource locations determined
  • Physical file names specified
  • Data delivered to and returned from physical locations
  • "make" style

[Diagram: VDL → (XML) → VDC → (XML) → Abstract Planner → logical DAX → Concrete (Pegasus) Planner, consulting the RLS → physical DAG → DAGMan.]

In general there is a full range of planning steps between abstract workflows and concrete workflows.
Picture taken from Mike Wilde
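To show roughly what "concretizing" one step involves, here is a hedged sketch of wrapping a site-bound job with housekeeping steps. The step names and the dictionary layout are invented for illustration; the real Pegasus planner emits Condor/DAGMan submit material, not Python structures.

    # Illustration only: how a concrete planner might surround one execution
    # step with housekeeping jobs. Step names and layout are invented.

    def concretize(job_name, site, inputs, outputs):
        """Bind a job to a site and wrap it with stage-in/out, publish, cleanup."""
        return [
            {"name": f"create_dir_{site}", "site": site},
            *[{"name": f"stage_in_{f}",  "site": site, "file": f} for f in inputs],
            {"name": job_name,             "site": site},
            *[{"name": f"stage_out_{f}", "site": site, "file": f} for f in outputs],
            *[{"name": f"register_{f}",  "site": site, "file": f} for f in outputs],
            {"name": f"cleanup_{site}",    "site": site},
        ]

    for step in concretize("x2", "ufl-cluster", ["file2"], ["file3"]):
        print(step["name"])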
Supercomputing 2002

A virtual space of simulated data is generated for future use by scientists…

[Diagram: a tree of virtual data products labelled by parameters such as mass = 200; decay = bb, ZZ, WW; stability = 1, 3; event = 8; plot = 1 — each node a dataset derivable on demand.]
Supercomputing 2002 (continued)

Scientists may add new derived data branches…

[Diagram: the same virtual data tree with a new branch added, e.g. mass = 200, decay = WW, stability = 1, LowPt = 20, HighPt = 10000.]
Example CMS Data/Workflow

[Diagram: Generator → Formator → Simulator → Digitiser → writeESD / writeAOD / writeTAG → POOL → Analysis Scripts, with a Calibration DB feeding a second writeESD / writeAOD / writeTAG pass.]
Data/workflow is a collaborative endeavour!

[Diagram: the same CMS chain (Generator → Formator → Simulator → Digitiser → writeESD / writeAOD / writeTAG → POOL → Analysis Scripts, with the Calibration DB), annotated with the teams responsible for each stage: MC Production Team, (Re)processing Team, Physics Groups, and Online Teams.]
A "Concurrent Analysis Versioning System": Complex Data Flow and Data Provenance in HEP

• Family history of a data analysis
• Collaborative analysis development environment
• "Check-point" a data analysis
• Analysis development environment (like CVS)
• Audit a data analysis (see the sketch below)

[Diagram: Raw and TAG data flow through ESD and AOD into plots, tables, and fits, for both real and simulated data, with comparisons between the two.]
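As a toy illustration of the "audit" idea, the sketch below walks recorded derivations backwards from a final plot to the raw and simulated data it came from. Every name here is invented; it is not the Chimera provenance interface.

    # Toy sketch of auditing a data analysis: walk recorded derivations backwards
    # from a final product to its ancestors. All names here are invented.

    def provenance(records, product):
        """records maps each output to (transformation, list of inputs)."""
        history, frontier = [], [product]
        while frontier:
            item = frontier.pop()
            if item in records:
                transformation, inputs = records[item]
                history.append((item, transformation, inputs))
                frontier.extend(inputs)
        return history

    records = {
        "plot_1":   ("analysis_script", ["aod_real", "aod_sim"]),
        "aod_real": ("writeAOD",        ["esd_real"]),
        "esd_real": ("writeESD",        ["raw_data"]),
        "aod_sim":  ("writeAOD",        ["esd_sim"]),
        "esd_sim":  ("writeESD",        ["simulated_hits"]),
    }

    for product, transformation, inputs in provenance(records, "plot_1"):
        print(f"{product} <- {transformation}({', '.join(inputs)})")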
Current Prototype GriPhyN "Architecture"

[Picture; taken from Mike Wilde.]
Post-talk (my wandering mind…): Typical VDT Configuration

• Single public head node (gatekeeper)
  • VDT Server installed
• Many private worker nodes
  • Local scheduler software installed
  • No grid middleware installed
• Shared file system (e.g. NFS)
  • User area shared between the head node and the worker nodes
  • One or more RAID systems, typically shared
Default middleware configuration from the Virtual Data Toolkit

[Diagram: on the submit host, Chimera feeds DAGMan, which drives Condor-G and its gahp_server; jobs are sent to the gatekeeper on the remote host, which hands them to the local scheduler (Condor, PBS, etc.) for execution on the compute machines.]
EDG Configuration (for comparison)

• CPU separate from storage
  • CE: single gatekeeper for access to the cluster
  • SE: single gatekeeper for access to storage
• Many public worker nodes (at least NAT'd)
  • Local scheduler installed (LSF or PBS)
  • Each worker node runs a GridFTP client
• No shared file system is assumed
  • Data access is accomplished via globus-url-copy to local disk on the worker node (see the sketch below)
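To make the fabric difference concrete: under the default VDT layout a job can open its input straight off the shared file system, whereas under the EDG layout it must first copy the file to local disk. A minimal sketch, assuming placeholder hostnames and paths and that globus-url-copy is on the worker node's PATH:

    # Sketch of the worker-node difference between the two layouts.
    # Hostnames and paths are placeholders; assumes globus-url-copy is installed.

    import subprocess

    def open_input_vdt(logical_path):
        """Typical VDT site: input is visible on a shared (e.g. NFS) file system."""
        return open(f"/nfs/data/{logical_path}", "rb")

    def open_input_edg(logical_path):
        """EDG-style site: stage the file to local scratch first, then open it."""
        local = f"/tmp/{logical_path}"
        subprocess.run(
            ["globus-url-copy",
             f"gsiftp://se.example.org/data/{logical_path}",  # storage element
             f"file://{local}"],
            check=True)
        return open(local, "rb")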
Why Care?

• Data analyses would benefit from being fabric independent!
  • But… the devil is (still) in the details!
  • Assumptions in job descriptions/requirements currently lead to direct fabric-level consequences, and vice versa.
• Are existing middleware configurations sufficient for data analysis ("scheduled" and "interactive")?
  • We really need input from groups like this one!
  • What kind of fabric layer is necessary for "interactive" data analysis using PROOF or JAS?
• Does the VDT need multiple configuration flavors?
  • Production, batch oriented (the current default)
  • Analysis, interactive oriented