
Virtual Data Toolkit


Presentation Transcript


  1. Virtual Data Toolkit. R. Cavanaugh, GriPhyN Analysis Workshop, Caltech, June 2003

  2. Very Early GriPhyN Data Grid Architecture
  [Diagram: an Application sits above Catalog Services (MCAT; GriPhyN catalogs), a Planner, Monitoring (MDS), Info Services (MDS), Replica Management (GDMP), an Executor (DAGMan, Condor-G), Policy/Security (GSI, CAS), a Reliable Transfer Service (Globus), and Compute (GRAM) and Storage (GridFTP; GRAM; SRM) Resources; a legend marks components for which an initial solution is operational]

  3. Currently Evolved GriPhyN Picture
  [Picture taken from Mike Wilde]

  4. Current VDT Emphasis
  • Current reality
    • Easy grid construction
    • Strikes a balance between flexibility and “easibility”
      • purposefully errs (just a little bit) on the side of “easibility”
    • Long running, high-throughput, file-based computing
    • Abstract description of complex workflows
    • Virtual Data Request Planning
    • Partial provenance tracking of workflows
  • Future directions (current research) including:
    • Policy based scheduling
      • With notions of Quality of Service (advanced reservation of resources, etc.)
    • Dataset based (arbitrary type structures)
    • Full provenance tracking of workflows
    • Several others…

  5. Current VDT Flavors
  • Client: Globus Toolkit 2 (GSI, globusrun, GridFTP Client), CA signing policies for DOE and EDG, Condor-G 6.5.1 / DAGMan, RLS 1.1.8 Client, MonALISA Client (soon), Chimera 1.0.3
  • SDK: Globus, ClassAds, RLS 1.1.8 Client, Netlogger 2.0.13
  • Server: Globus Toolkit 2.2.4 (GSI, Gatekeeper, job-managers and GASS Cache, MDS, GridFTP Server), MyProxy, CA signing policies for DOE and EDG, EDG Certificate Revocation List, Fault Tolerant Shell, GLUE Schema, mkgridmap, Condor 6.5.1 / DAGMan, RLS 1.1.8 Server, MonALISA Server (soon)

  6. Chimera Virtual Data System
  • Virtual Data Language
    • textual
    • XML
  • Virtual Data Catalog
    • MySQL or PostgreSQL based
    • File-based version available

  7. Virtual Data Language

    TR CMKIN( out a2, in a1 ) {
      argument file = ${a1};
      argument file = ${a2};
    }
    TR CMSIM( out a2, in a1 ) {
      argument file = ${a1};
      argument file = ${a2};
    }

    DV x1->CMKIN( a2=@{out:file2}, a1=@{in:file1});
    DV x2->CMSIM( a2=@{out:file3}, a1=@{in:file2});

  [Figure: the resulting derivation chain file1 → x1 → file2 → x2 → file3; taken from Mike Wilde]
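  To make the chain in this example explicit, here is a small illustrative rendering in Python (not Chimera code; the dictionary layout is an assumption made for illustration) of the two derivations and the dependency that their out/in declarations induce between x1 and x2 through file2.

    # Illustrative only (not Chimera): the two derivations above as a dependency graph.
    derivations = {
        "x1": {"tr": "CMKIN", "in": ["file1"], "out": ["file2"]},
        "x2": {"tr": "CMSIM", "in": ["file2"], "out": ["file3"]},
    }

    # Which derivation produces each logical file?
    producer = {f: name for name, dv in derivations.items() for f in dv["out"]}

    # x2 depends on x1 because x2 consumes file2, which x1 produces.
    depends_on = {name: sorted({producer[f] for f in dv["in"] if f in producer})
                  for name, dv in derivations.items()}
    print(depends_on)   # {'x1': [], 'x2': ['x1']}

  This is exactly the information the abstract planner recovers when it traverses the virtual data dependencies (next slide).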

  8. Virtual Data Request Planning
  • Abstract Planner
    • Graph traversal of (virtual) data dependencies
    • Generates the graph with maximal data dependencies
    • Somewhat analogous to Build Style
  • Concrete (Pegasus) Planner
    • Prunes execution steps for which data already exists (RLS lookup; see the sketch after this list)
    • Binds all execution steps in the graph to a site
    • Adds “housekeeping” steps
      • Create environment, stage-in data, stage-out data, publish data, clean-up environment, etc.
    • Generates a graph with minimal execution steps
    • Somewhat analogous to Make Style
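  The make-style behaviour of the concrete planner can be sketched in a few lines. This is not Pegasus code; it is a minimal Python illustration in which exists_in_rls is a hypothetical stand-in for a real RLS lookup, showing how execution steps whose outputs are already materialized get pruned, while steps needed to produce missing inputs are kept.

    # Minimal sketch (not Pegasus itself): prune derivations whose outputs already
    # exist, in the spirit of the make-style concrete planner.
    # exists_in_rls is a hypothetical stand-in for a real RLS lookup.

    def exists_in_rls(lfn, materialized):
        """Pretend RLS lookup: is this logical file already registered?"""
        return lfn in materialized

    def prune(derivations, materialized):
        """Keep only the steps whose outputs are not yet materialized, walking the
        workflow backwards so that upstream steps producing missing inputs stay in."""
        needed = []
        missing = set()                       # logical files we still have to produce
        for dv in reversed(derivations):      # outputs first, as a make-style planner would
            outputs_missing = [f for f in dv["out"]
                               if not exists_in_rls(f, materialized) or f in missing]
            if outputs_missing:
                needed.append(dv)
                missing.update(f for f in dv["in"]
                               if not exists_in_rls(f, materialized))
        return list(reversed(needed))

    # The two derivations from the VDL example (slide 7):
    workflow = [
        {"name": "x1", "in": ["file1"], "out": ["file2"]},
        {"name": "x2", "in": ["file2"], "out": ["file3"]},
    ]

    # Build style: only file1 exists, so both steps run.
    print([dv["name"] for dv in prune(workflow, {"file1"})])            # ['x1', 'x2']
    # Make style: file2 is already registered, so only x2 remains.
    print([dv["name"] for dv in prune(workflow, {"file1", "file2"})])   # ['x2']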

  9. Chimera Virtual Data System: Mapping Abstract Workflows onto Concrete Environments
  • Abstract DAGs (virtual workflow)
    • Resource locations unspecified
    • File names are logical
    • Data destinations unspecified
    • build style
  • Concrete DAGs (stuff for submission)
    • Resource locations determined
    • Physical file names specified
    • Data delivered to and returned from physical locations
    • make style
  In general there is a full range of planning steps between abstract workflows and concrete workflows.
  [Figure: VDL → (XML) → VDC → (XML) → Abstract Planner → logical DAX → Concrete Planner (consulting RLS) → physical DAG → DAGMan; taken from Mike Wilde]

  10. Supercomputing 2002
  A virtual space of simulated data is generated for future use by scientists...
  [Figure: a tree of virtual data products, each node labelled by the parameters that define it, e.g. mass = 200 with decay = bb, ZZ, or WW, stability = 1 or 3, event = 8, plot = 1]

  11. Supercomputing 2002
  Scientists may add new derived data branches... (see the sketch after this figure)
  [Figure: the same virtual data tree with a newly added branch, e.g. mass = 200, decay = WW, stability = 1, LowPt = 20, HighPt = 10000]
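  The virtual-data idea behind these two slides can be reduced to a tiny sketch. This is purely an illustration (the names catalogue, request, and signature are made up, not Chimera's API): a data product is identified by the parameter settings of the derivation that produces it, and a request either finds an already-materialized product or registers a new virtual branch to be produced on demand.

    # Illustrative sketch only: a "virtual space" of data products keyed by the
    # parameters of the derivation that produces them. Names are hypothetical.

    def signature(params):
        """Canonical, hashable identity of a data product: its parameter settings."""
        return tuple(sorted(params.items()))

    catalogue = {}   # signature -> "materialized" or "virtual"

    def request(params):
        """Return an existing product, or register a new virtual branch to derive."""
        sig = signature(params)
        if sig not in catalogue:
            catalogue[sig] = "virtual"        # new derived branch, produced on demand
        return sig, catalogue[sig]

    # A product at one of the SC2002-style parameter points:
    request({"mass": 200, "decay": "WW", "stability": 1})
    # A scientist adds a new derived branch with extra cuts:
    request({"mass": 200, "decay": "WW", "stability": 1, "LowPt": 20, "HighPt": 10000})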

  12. Example CMS Data/Workflow
  [Figure: a CMS production and analysis chain built from Generator, Formator, Simulator, and Digitiser steps, with writeESD, writeAOD, and writeTAG outputs stored in POOL, a calibration DB, a second writeESD/writeAOD/writeTAG pass, and analysis scripts at the end]

  13. Data/workflow is a collaborative endeavour!
  [Figure: the same CMS chain (Generator, Formator, Simulator, Digitiser, writeESD/writeAOD/writeTAG into POOL, calibration DB, analysis scripts) annotated with the teams involved: the MC Production Team, the (Re)processing Team, Online Teams, and Physics Groups]

  14. A “Concurrent Analysis Versioning System”: Complex Data Flow and Data Provenance in HEP
  • Family History of a Data Analysis
  • Collaborative Analysis Development Environment
  • “Check-point” a Data Analysis
  • Analysis Development Environment (like CVS)
  • Audit a Data Analysis (see the sketch after this list)
  [Figure: real and simulated data flowing from Raw through ESD, AOD, and TAG into plots, tables, fits, and comparisons]
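  As a purely illustrative aside (this is not Chimera's catalog schema; the record layout and the plot-making step are invented for the example), the “family history” and audit ideas above amount to recording, for every derived object, the transformation, parameters, and inputs that produced it, and then walking that record back to raw or simulated data.

    # Illustrative sketch of provenance records for a data analysis "family history".
    # Field names and the plot-making step are invented; this is not Chimera's schema.

    provenance = {
        # derived object -> how it was produced
        "esd_run42": {"tr": "writeESD", "params": {"calib": "v3"},    "inputs": ["raw_run42"]},
        "aod_run42": {"tr": "writeAOD", "params": {},                 "inputs": ["esd_run42"]},
        "plot_mass": {"tr": "makePlot", "params": {"cut": "pt > 20"}, "inputs": ["aod_run42"]},
    }

    def audit(obj, depth=0):
        """Walk the family history of a derived object back to its primary inputs."""
        rec = provenance.get(obj)
        if rec is None:
            print("  " * depth + obj + "  (primary input)")
            return
        print("  " * depth + f"{obj}  <-  {rec['tr']} {rec['params']}")
        for parent in rec["inputs"]:
            audit(parent, depth + 1)

    audit("plot_mass")
    # plot_mass  <-  makePlot {'cut': 'pt > 20'}
    #   aod_run42  <-  writeAOD {}
    #     esd_run42  <-  writeESD {'calib': 'v3'}
    #       raw_run42  (primary input)

  With records of this kind an analysis can be check-pointed, compared, and audited much as source code is in CVS.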

  15. Current Prototype GriPhyN “Architecture”
  [Picture taken from Mike Wilde]

  16. Post-talk: My wandering mind… Typical VDT Configuration
  • Single public head-node (gatekeeper)
    • VDT-server installed
  • Many private worker-nodes
    • Local scheduler software installed
    • No grid middleware installed
  • Shared file system (e.g. NFS)
    • User area shared between head-node and worker-nodes
    • One or many RAID systems typically shared

  17. Default middleware configuration from the Virtual Data Toolkit
  [Figure: a submit host running Chimera, DAGMan, Condor-G, and a gahp_server, submitting through the remote host's gatekeeper to a local scheduler (Condor, PBS, etc.) that manages the compute machines]

  18. EDG Configuration (for comparison)
  • CPU separate from Storage
    • CE: single gatekeeper for access to cluster
    • SE: single gatekeeper for access to storage
  • Many public worker-nodes (at least NAT)
    • Local scheduler installed (LSF or PBS)
    • Each worker-node runs a GridFTP Client
  • No assumed shared file system
    • Data access is accomplished via globus-url-copy to local disk on the worker-node (see the sketch after this list)
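  For concreteness, staging a file in this EDG style might look like the following minimal Python sketch. Only the globus-url-copy source/destination URL form is taken as given; the hostname and paths are invented, and a valid grid proxy is assumed to exist on the worker-node.

    # Minimal sketch: stage a file from a storage element to worker-node local disk
    # with globus-url-copy, as in the EDG-style configuration described above.
    # The hostname and paths are invented for illustration.
    import subprocess

    def stage_in(gridftp_url, local_path):
        """Copy a remote file to local disk via globus-url-copy."""
        subprocess.run(["globus-url-copy", gridftp_url, "file://" + local_path],
                       check=True)

    stage_in("gsiftp://se.example.org/store/events/file2", "/tmp/file2")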

  19. Why Care?
  • Data Analyses would benefit from being fabric independent!
    • But… the devil is (still) in the details!
    • Assumptions in job descriptions/requirements currently lead to direct fabric-level consequences, and vice versa.
  • Are existing middleware configurations sufficient for Data Analysis (“scheduled” and “interactive”)?
    • Really need input from groups like this one!
    • What kind of fabric layer is necessary for “interactive” data analysis using PROOF or JAS?
  • Does the VDT need multiple configuration flavors?
    • Production, batch oriented (current default)
    • Analysis, interactive oriented
