
DIAL Distributed Interactive Analysis of Large datasets

GriPhyN/iVDGL All Hands Meeting, ANL. David Adams, BNL. October 15, 2003. Contents: Goals of DIAL; What is DIAL?; Design; Status; Development plans; Lessons learned (JDL, Datasets, Results, Interactivity, Job policy, User base).


Presentation Transcript


  1. DIAL: Distributed Interactive Analysis of Large datasets. GriPhyN/iVDGL All Hands Meeting, ANL. David Adams, BNL. October 15, 2003

  2. Contents
     • Goals of DIAL
     • What is DIAL?
     • Design
     • Status
     • Development plans
     • Lessons learned
       • JDL
       • Datasets
       • Results
       • Interactivity
       • Job policy
       • User base
     • Other projects
     • Analysis job rates
     More info: http://www.usatlas.bnl.gov/~dladams/dial
     DIAL GriPhyN/IVDGL All Hands

  3. Goals of DIAL
     1. Demonstrate the feasibility of interactive analysis of large datasets
        • How much data can we analyze interactively?
     2. Set requirements for GRID services
        • In particular those specific to interactive analysis
          • Job definition: application, task, dataset
          • Gathering and relaying results
          • Real-time monitoring (partial results)
          • Resource management: discovery, allocation, sharing
     3. Provide ATLAS with a useful analysis tool
        • For current and upcoming data challenges
        • Would like to add another experiment to show generality

  4. What is DIAL?
     • DIAL provides a connection between
       • an interactive analysis framework
         • Fitting, presentation graphics, …
         • E.g. ROOT, JAS, …
       • and a data processing application
         • Natural to the data of interest
         • E.g. athena for ATLAS
     • DIAL distributes processing
       • Among sites, farms, nodes
       • To provide the user with the desired response time
     • Look to other projects to provide most infrastructure


  6. Design
     • DIAL has the following major components
       • Dataset describes the data of interest
       • Application is defined by the experiment or site
       • Task is the user's extension to the application
       • Job uses an application and a task to process a dataset
       • Result is the output of a job
       • Scheduler creates and manages jobs
     • Together these define a high-level JDL (job definition language)
     • The figure on the next slide shows how these components interact

  7. [Figure: interaction of DIAL components in a user analysis (e.g. ROOT). Steps labeled in the figure:]
     1. Create or locate a dataset
     2. Select an application (e.g. athena)
     3. Create or select a task (code)
     4. Select a scheduler
     5. submit(app,tsk,ds)
     6. Scheduler splits the dataset into Dataset 1 and Dataset 2
     7. Scheduler creates Job 1 and Job 2
     8. run(app,tsk,ds1) and run(app,tsk,ds2)
     9. Each job fills a result
     10. Scheduler gathers the results into the final result
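
The submit, split, run and gather sequence in the figure can be illustrated with a minimal Python sketch. Every class and function name here is a hypothetical stand-in chosen for illustration (the DIAL classes themselves are C++ and are not reproduced), the "events" are plain numbers, and the sub-jobs run one after another rather than on separate nodes.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical illustration only: these are not the real DIAL classes.

@dataclass
class Dataset:
    events: List[float]

    def split(self, n: int) -> List["Dataset"]:
        """Split into at most n sub-datasets (step 6 in the figure)."""
        n = max(1, min(n, len(self.events) or 1))
        return [Dataset(self.events[i::n]) for i in range(n)]

@dataclass
class Result:
    counts: Dict[int, int] = field(default_factory=dict)

    def merge(self, other: "Result") -> None:
        """Add another result's bin counts into this one (step 10)."""
        for bin_id, count in other.counts.items():
            self.counts[bin_id] = self.counts.get(bin_id, 0) + count

# The task is the user's code: here a function mapping an event to a bin.
Task = Callable[[float], int]

def application(task: Task, ds: Dataset) -> Result:
    """Experiment-supplied event loop that calls the user task (steps 8-9)."""
    res = Result()
    for event in ds.events:
        bin_id = task(event)
        res.counts[bin_id] = res.counts.get(bin_id, 0) + 1
    return res

class Scheduler:
    def __init__(self, n_jobs: int) -> None:
        self.n_jobs = n_jobs

    def submit(self, app, task: Task, ds: Dataset) -> Result:
        total = Result()
        for sub in ds.split(self.n_jobs):   # steps 6-7: one job per sub-dataset
            total.merge(app(task, sub))     # steps 8-10: run, fill, gather
        return total

result = Scheduler(n_jobs=2).submit(application, int, Dataset([0.5, 1.2, 1.7, 2.1]))
print(result.counts)
```

The user only names an application, a task and a dataset; how the dataset is split and where the jobs run stays inside the scheduler, which is the point of the high-level JDL.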

  8. Status
     • DIAL 0.5 released in September
     • User interfaces
       • ROOT
         • Uses ACLiC to make all dataset/DIAL classes accessible
       • Command line
         • Command dial_submit processes one job
         • Application, task and dataset are taken from XML files
     • Schedulers
       • LocalScheduler: single local job
       • MasterScheduler: distributed processing at a site
         • Fork, LSF, lsrun or Condor
         • Run at both BNL and CERN

  9. Status (cont)
     • Datasets
       • CbntDataset describes a single CBNT hbook file
         • Inherits from EventDataset
       • EventMergeDataset holds multiple EventDataset objects
       • Together these can be used to build a dataset that corresponds to a collection of CBNT hbook files
       • A few files with XML descriptions of such datasets are available for demonstration
     • Results
       • HbookResult holds an hbook file assumed to hold a collection of histograms
       • Merge method uses PAW to add histograms

  10. Status (cont)
     • Application
       • dial_cbnt uses PAW to fill histograms using a user-supplied Fortran function
     • Task
       • Now a generic collection of named text files
       • May add a collection of logical files
       • For use with the above, the task holds an hbook file with empty histograms and Fortran code to fill them
     • Possible to do distributed processing of a collection of CBNT hbook files
       • See dial root demo 2

  11. Development plans
     • Add dataset catalog
       • Enable users to select a dataset
       • Fill with ATLAS DC1 datasets
     • Grid enable
       • GCE/Chimera/Pegasus and/or direct Condor-G
       • Scaling, reliability, response time issues
     • Grid service interface for DIAL scheduler
       • Client scheduler to connect from user interface
       • User could submit jobs to BNL or CERN from anywhere
     • DIAL grid scheduler
       • Use DIAL grid-service schedulers to distribute processing over multiple sites

  12. Development plans (cont)

  13. Development plans (cont)
     • Python user interface
       • Python buses can be used as the user interface
         • GANGA, ATLAS interactive, PI, …
     • ROOT as an application
       • In particular to support ATLAS CBNT
     • Athena application
       • Enable DIAL users to run distributed athena
       • Initial application to fill histograms
       • Later add support for creating new datasets (production)
     • Need athena-compatible dataset
       • ATLAS POOL event collection
       • Zebra, Athena-ROOT obsolete?

  14. Lessons learned (and being learned)
     • JDL
     • Datasets
     • Results
     • Interactivity
     • Job policy
     • User base

  15. JDL
     • High-level job definition language
       • Enable users to specify a task without reference to executables, data files or sites
       • Scheduler decides where and how to process data
       • Analysis implies that the user can easily customize the task
     • Common language
       • Enable different experiments and non-HEP activities to share schedulers
     • PPDG activity to define such a language
       • Led by Gabriele Carcassi (STAR)
       • Similar to DIAL (application, task, dataset, …)
       • XML based

  16. JDL (DIAL perspective)

  17. Datasets
     • Want to provide a high-level data view
       • Unit of processing is called a “dataset”
     • Many properties beyond data location
     • Location is not just a list of files (physical or logical)
       • Multiple logical file set representations
       • Representation might be tables in an RDB
       • Or object list in an ODB
       • Or …
     • Properties and categories follow
     • For more, see “Datasets for the GRID” at
       http://www.usatlas.bnl.gov/~dladams/dataset

  18. Dataset properties
     0. Identity
        • Dataset must have a unique index and/or name
     1. Content
        • Description of the type of data in the dataset
          • Event or non-event data
          • Simulation, reconstruction, …
          • ESD, AOD, …
          • Jets, tracks, electrons, …
     2. Location
        • Where to find the data
          • Logical files, physical files, site, …
     3. Mapping
        • Which content is at which location?

  19. Dataset properties (cont)
     4. Provenance
        • Prescription for creating the data
        • E.g. input dataset and transformation
     5. History
        • Details of production beyond provenance
          • How production was split into jobs
          • Processing node and time for each job, …
     6. Labels
        • Assigned metadata outside other categories, e.g.
          • Integrated luminosity
          • Result of quality checks
          • Flag indicating ok for use in published analyses

  20. Dataset properties (cont)
     7. Mutability
        • May the dataset be modified?
        • Possible states: locked, unlocked, extensible, …
     8. Compositeness
        • Dataset made up of other datasets
        • Two cases:
          • Construction: provenance is the list of sub-datasets
            • E.g. the summer dataset is defined to be the union of the June, July and August datasets.
          • Assignment: factorization into sub-datasets
            • Typically to reflect data placement
            • E.g. a representation of a global dataset might include sub-datasets in New York, Paris and Moscow.
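
The nine properties (0 to 8) listed on the preceding slides can be collected into a single record. The Python sketch below is one possible reading, not the DIAL implementation: the class name `DatasetRecord`, its field names and types, and the `union` helper are all assumptions made for illustration. The helper models the "construction" case of compositeness, where provenance is the list of sub-datasets.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List, Optional

class Mutability(Enum):
    LOCKED = "locked"
    UNLOCKED = "unlocked"
    EXTENSIBLE = "extensible"

@dataclass
class DatasetRecord:
    """Hypothetical record with one field per property on the slides."""
    identity: str                                              # 0. unique name or index
    content: List[str]                                         # 1. e.g. ["event", "AOD", "jets"]
    location: List[str] = field(default_factory=list)          # 2. files or site; empty means none
    mapping: Dict[str, List[str]] = field(default_factory=dict)  # 3. content -> locations
    provenance: Optional[str] = None                           # 4. prescription for creating the data
    history: List[str] = field(default_factory=list)           # 5. production details
    labels: Dict[str, object] = field(default_factory=dict)    # 6. e.g. luminosity, quality flags
    mutability: Mutability = Mutability.LOCKED                 # 7. may the dataset be modified?
    parts: List["DatasetRecord"] = field(default_factory=list)  # 8. sub-datasets, if composite

    def extend(self, files: List[str]) -> None:
        """Append files unless the dataset is locked."""
        if self.mutability is Mutability.LOCKED:
            raise PermissionError(f"dataset {self.identity} is locked")
        self.location.extend(files)

def union(name: str, parts: List[DatasetRecord]) -> DatasetRecord:
    """Composite by construction: provenance is the list of sub-datasets."""
    return DatasetRecord(
        identity=name,
        content=sorted({c for p in parts for c in p.content}),
        location=[f for p in parts for f in p.location],
        provenance="union of " + ", ".join(p.identity for p in parts),
        parts=list(parts),
    )

months = [DatasetRecord(m, ["event"], [m + ".lf"]) for m in ("june", "july", "august")]
summer = union("summer", months)
print(summer.provenance)
```

A virtual dataset in the sense of the next slide would simply be a record whose `location` is empty.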

  21. Dataset categories
     • Categorize datasets according to the extent of their location information:
       • Virtual
         • No location
       • Logical
         • Collection of logical files
       • Physical
         • Collection of physical files
         • Inferred from logical DS and file catalog (Magda, RLS, …)
       • Staged
         • Collection of “jobs”
         • Each sub-dataset matched to CPU/process
         • Not important for discussion here

  22. Dataset category associations
     • One-to-many association as we move down these categories
     • Virtual dataset may map to multiple logical datasets
       • Optimize file size for local mass store
       • Copy out only selected events (vs. all plus event list)
       • Move data into a DB at one site
       • Composite representation along placement boundaries
     • Logical dataset maps to many physical datasets
       • Many combinations inferred from file catalog
       • No need to record all these datasets
       • But system (or user) might record LDS used to process one task and reuse it for the next request

  23. Dataset category associations (example)
     • Virtual: VDS 1
       • Logical: LDS 1-1 {LF1 LF2}
         • Physical: PDS 1-1-1 {PF1A PF2A}
         • Physical: PDS 1-1-2 {PF1B PF2B}
         • Physical: PDS 1-1-3 {PF1A PF2B}
       • Logical: LDS 1-2 {LF3}
         • Physical: PDS 1-2-1 {PF3A}
         • Physical: PDS 1-2-2 {PF3B}
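
The statement that many physical datasets can be inferred from one logical dataset and a file catalog can be made concrete with a short Python sketch. The catalog contents below are an assumption read off the file names in the example (PF1A and PF1B taken as replicas of LF1, and so on), and `physical_datasets` is a hypothetical helper, not a DIAL, Magda or RLS call.

```python
from itertools import product
from typing import Dict, List, Tuple

# Assumed file catalog: logical file -> physical replicas.
FILE_CATALOG: Dict[str, List[str]] = {
    "LF1": ["PF1A", "PF1B"],
    "LF2": ["PF2A", "PF2B"],
    "LF3": ["PF3A", "PF3B"],
}

def physical_datasets(logical_files: List[str],
                      catalog: Dict[str, List[str]]) -> List[Tuple[str, ...]]:
    """Enumerate every way to pick one physical replica per logical file."""
    replica_lists = [catalog[lf] for lf in logical_files]
    return list(product(*replica_lists))

lds_1_1 = physical_datasets(["LF1", "LF2"], FILE_CATALOG)
lds_1_2 = physical_datasets(["LF3"], FILE_CATALOG)
print(lds_1_1)
print(lds_1_2)
```

Under this assumed catalog, LDS 1-1 yields four combinations, of which the example names three. The count multiplies with every additional file, which is why there is no need to record all physical datasets in advance.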

  24. Dataset implementation
     • Dataset implementation might include
       • Virtual dataset (VDS)
         • Portable representation of dataset without location
       • Logical dataset (LDS)
         • Adds location expressed in terms of logical files
       • Dataset selection catalog (DSC)
         • Enables users to select a VDS
       • Dataset replica catalog (DRC)
         • Enables the “system” to locate an LDS representation of a selected VDS
     • The following table shows the mapping of dataset properties to these components.

  25. Dataset implementation

  26. Results
     • Analysis task produces a result
       • Each sub-job produces a sub-result
     • Scheduler has responsibility to
       • Gather sub-results
       • Merge
       • Return combined result
     • Result merging
       • May require significant resources
       • Danger of bottleneck
       • Distribute this operation
         • Collection of sub-results can be a dataset
         • Merging would be an application
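
One way to distribute the merge step, sketched here in Python under stated assumptions: sub-results are modeled as histograms held in `collections.Counter` objects (bin index to count), and they are combined pairwise in rounds so that the merges within a round are independent of one another. The names `merge_pair` and `tree_merge` are hypothetical, and the code runs the rounds in a single process; it only shows the structure that would allow each pair to be merged on a different node.

```python
from collections import Counter
from typing import List

def merge_pair(a: Counter, b: Counter) -> Counter:
    """Add two histograms bin by bin, keeping bins whose count is zero."""
    out = Counter(a)
    out.update(b)
    return out

def tree_merge(sub_results: List[Counter]) -> Counter:
    """Merge sub-results pairwise in rounds (about log2(n) rounds for n inputs)."""
    if not sub_results:
        return Counter()
    level = list(sub_results)
    while len(level) > 1:
        # Pairs within one round do not depend on each other.
        merged = [merge_pair(level[i], level[i + 1])
                  for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:          # odd one out moves up unchanged
            merged.append(level[-1])
        level = merged
    return level[0]

subs = [Counter({0: 1, 1: 2}), Counter({1: 1}), Counter({2: 5})]
combined = tree_merge(subs)
print(dict(combined))
```

Treating the list of sub-results as the input and `merge_pair` as the processing step mirrors the idea that the collection of sub-results can be a dataset and merging an application.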

  27. Results (cont)
     • Result in JDL
       • At present, concrete DIAL results are described by subclasses that define the data and provide the means to merge
       • JDL (XML representation) is data only, so we need a non-C++ way to specify how to merge
       • Option 1: Add merge operation to application used for processing
       • Option 2: Define dedicated applications for merging
         • Dataset would be the collection of results
         • Result would be the combined result
         • Result XML holds application XML
         • Merging is just another type of job (nice)

  28. Interactivity
     • Definition
       • Job is interactive if the user has the patience to wait for the result
     • Response time
       • User has means to specify response time with task
       • System configures job accordingly
       • Faster response may “cost” more
       • Give up if task cannot be accomplished in the requested time
     • Monitoring
       • Time remaining, fraction processed
       • Partial results
       • Allow user to monitor sub-jobs
       • Change job configuration?
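
How a system might configure a job for a requested response time, and give up when the request cannot be met, can be sketched as follows. The cost model is an assumption made for this example and is not taken from DIAL: processing time scales linearly with the number of events, work divides evenly across sub-jobs, and there is a fixed overhead for splitting and gathering. The function name `plan_sub_jobs` and its parameters are hypothetical.

```python
import math
from typing import Optional

def plan_sub_jobs(n_events: int,
                  sec_per_event: float,
                  response_time: float,
                  max_workers: int,
                  overhead: float = 0.0) -> Optional[int]:
    """Return the number of sub-jobs needed to finish in response_time seconds.

    Returns None when the request cannot be met, either because the fixed
    overhead already exceeds the requested time or because more workers
    would be needed than are available.
    """
    budget = response_time - overhead
    if budget <= 0:
        return None
    n_jobs = max(1, math.ceil(n_events * sec_per_event / budget))
    return n_jobs if n_jobs <= max_workers else None

# 100 s of total work, 10 s requested, 2 s overhead, 100 workers available.
print(plan_sub_jobs(100_000, 0.001, 10.0, 100, overhead=2.0))
# Same work, but only 1 s requested: the overhead alone exceeds it.
print(plan_sub_jobs(100_000, 0.001, 1.0, 100, overhead=2.0))
```

The number of sub-jobs grows as the requested time shrinks, which is one sense in which a faster response "costs" more.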

  29. Job policy
     • User must be able to specify job policy
       • Response time
       • Location for new data
         • Site
         • File catalog
       • Resource usage limits
         • (What user is willing to “pay” for the task)
       • When to generate partial results
     • This is absent in DIAL
       • Add soon to DIAL and JDL
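
Since job policy is described as not yet present in DIAL, the following Python sketch is purely illustrative of what such a policy might carry. The class `JobPolicy`, its field names and its units are assumptions; each field corresponds to one item in the list above.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class JobPolicy:
    """Hypothetical container for the policy items listed on the slide."""
    response_time_s: float                          # requested response time
    output_site: Optional[str] = None               # location for new data: site
    output_file_catalog: Optional[str] = None       # location for new data: file catalog
    max_cpu_seconds: Optional[float] = None         # resource usage limit ("price")
    partial_result_every_s: Optional[float] = None  # when to generate partial results

    def __post_init__(self) -> None:
        if self.response_time_s <= 0:
            raise ValueError("response time must be positive")

    def within_budget(self, estimated_cpu_seconds: float) -> bool:
        """True if the estimated cost does not exceed the user's limit."""
        return (self.max_cpu_seconds is None
                or estimated_cpu_seconds <= self.max_cpu_seconds)

policy = JobPolicy(response_time_s=30.0, output_site="BNL",
                   max_cpu_seconds=3600.0, partial_result_every_s=5.0)
print(policy.within_budget(1200.0))
```

A scheduler could consult such an object before splitting a job, rejecting the request when the estimated cost exceeds what the user is willing to pay.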

  30. User base
     • User base
       • Analysis must support all users
       • Not just production managers
     • Roles
       • Need means to specify role
       • Not just identity
       • Used to determine authorization and priority

  31. Connection to other projects
     • Relevant projects
       • ARDA, PPDG, GriPhyN, and more
     • From DIAL
       • JDL
       • Interactive scheduler
       • DIAL interface for schedulers from other projects
       • End-to-end analysis system for ATLAS
     • From other projects
       • Identification of system components
       • Most components used in DIAL end-to-end system
       • See figure

  32. Connection to other projects (cont)
     [Figure: ARDA components. Labels in the figure: ROOT, AMI?, MDS, GANGA, GSI, Ganglia, dataset, MonaLisa, DIAL, AMI, Chimera, EDG RB, Magda, Condor-G, RLS, Pacman, RLS, GRAM]

  33. Analysis job rates
     • At what rate is a site processing sub-jobs?
     • Assume 1000 CPUs at a “site”
     • For production with 3–30 hours/job:
       • 30–300 jobs/hour (roughly 1 job/minute)
       • Fine for batch and grid schedulers
     • For interactive analysis with 1–10 seconds/job:
       • 100–1000 jobs/sec (of order 10000 jobs/minute)
       • Difficult for grid and batch schedulers
     • Also expect everything in between
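
The rates on this slide follow from dividing the number of CPUs by the job duration. The short Python check below assumes every CPU is continuously busy with exactly one job at a time; the function name is hypothetical.

```python
def jobs_per_second(n_cpus: int, seconds_per_job: float) -> float:
    """Steady-state completion rate when every CPU runs one job at a time."""
    return n_cpus / seconds_per_job

N_CPUS = 1000
HOUR = 3600.0

# Production: 3 to 30 hours per job, expressed in jobs per hour.
prod_fast = jobs_per_second(N_CPUS, 3 * HOUR) * HOUR
prod_slow = jobs_per_second(N_CPUS, 30 * HOUR) * HOUR

# Interactive: 1 to 10 seconds per job, expressed in jobs per second.
inter_fast = jobs_per_second(N_CPUS, 1.0)
inter_slow = jobs_per_second(N_CPUS, 10.0)

print(f"production:  {prod_slow:.0f} to {prod_fast:.0f} jobs/hour")
print(f"interactive: {inter_slow:.0f} to {inter_fast:.0f} jobs/sec "
      f"({inter_slow * 60:.0f} to {inter_fast * 60:.0f} jobs/minute)")
```

The exact values are about 33 to 333 jobs/hour for production and 6000 to 60000 jobs/minute for interactive analysis, consistent with the rounded figures on the slide.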
