Analytical Pipelines

Pipelines and Scientific Workflows with Ptolemy II. Deana Pennington, University of New Mexico, LTER Network Office. Shawn Bowers, UCSD, San Diego Supercomputer Center.

Presentation Transcript


  1. Pipelines and Scientific Workflows with Ptolemy II. Deana Pennington, University of New Mexico, LTER Network Office. Shawn Bowers, UCSD, San Diego Supercomputer Center.

  2. Analytical Pipelines [diagram]
     • Library of analysis steps and analytical pipelines: pipeline AP0 is built from analysis steps (ASx, ASy, ASz, ASr) and transformation steps (TS1, TS2); parameters carry semantics
     • Semantic Mediation System: logic rules, query processing
     • Ontologies and taxonomies: ECO, Taxon, Parameter
     • Legend: ASx = analysis step in an execution environment (SAS, MATLAB, etc.); TS1 = transformation step

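  3. Scientific Workflows [diagram]: workflow SW0 combines analysis steps (ASx, ASy, ASz, ASr) and transformation steps (TS1, TS2), several of which appear more than once. Labels: "Search for relevant data (Query)"; "Iterative".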
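The idea of chaining analysis steps (AS) and transformation steps (TS) can be sketched as plain function composition. This is only an illustration in Python, not Ptolemy II code; `run_pipeline` and the toy steps `as_x`, `ts_1`, and `as_y` are hypothetical names invented for the example.

```python
from typing import Any, Callable, List, Tuple

# A step is a (name, function) pair; names mirror the slide labels (ASx, TS1, ...).
Step = Tuple[str, Callable[[Any], Any]]

def run_pipeline(steps: List[Step], data: Any) -> Tuple[Any, List[str]]:
    """Apply each step in order and record which steps ran (simple provenance)."""
    log = []
    for name, func in steps:
        data = func(data)
        log.append(name)
    return data, log

# Hypothetical toy steps, invented for illustration only.
def as_x(values):
    """Analysis step: emits its readings as formatted strings."""
    return [f"{v:.1f}" for v in values]

def ts_1(strings):
    """Transformation step: converts strings to the floats the next step expects."""
    return [float(s) for s in strings]

def as_y(values):
    """Analysis step: computes the mean of the readings."""
    return sum(values) / len(values)

pipeline = [("ASx", as_x), ("TS1", ts_1), ("ASy", as_y)]
result, log = run_pipeline(pipeline, [14.0, 16.0, 22.0])
```

The transformation step exists only to reconcile the output type of one analysis step with the input type of the next, which is the role the TS boxes play in the diagrams.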

  4. Benefits
     • Reusable analysis steps, pipelines, and workflows
     • Formal documentation of methods (output in report format)
     • Reproducibility of methods
     • Visual creation and communication of methods
     • Versioning
     • Automated data typing and transformation

  5. Ptolemy II demo

  6. Ecological Niche Modeling [diagram; modified from B. Michener]
     • Inputs in geographic space: biodiversity information (e.g., data from museum specimens); geospatial and remotely sensed data (precipitation, vegetation class)
     • Ecological niche modeling produces a model of the niche in ecological dimensions (ecological space)
     • Model type: linear regression (GRASP); genetic algorithms (GARP)
     • Projection back onto geography: native range prediction; invaded range prediction
     • Results are used for integration with other data realms (e.g., human populations, public health, etc.)

  7. Ecological Niche Models [diagram]
     • Occurrence samples (from an Excel file and an Access file): Sample 1, lat, long, presence; Sample 2, lat, long, presence; Sample 3, lat, long, absence
     • Environmental layers: vegetation cover type; elevation (m); mean annual temperature (C)
     • Integrated data: P, juniper, 2200m, 16C; P, pinyon, 2320m, 14C; A, creosote, 1535m, 22C

  8. GARP Native-Species Pipeline (informal) [diagram]
     • Steps: EcoGrid Query, Layer Integration, Scaling, Sample Data, Data Calculation, Validation, Map Generation, Physical Transformation, Generate Metadata, Archive to EcoGrid
     • Data: species presence and absence points; environmental layers; integrated layers; training sample; test sample; GARP rule set; model quality parameters; native range prediction map; selected prediction maps
     • Other elements: EcoGrid database; user; annotations +A1, +A2, +A3

  9. GARP Native-Species Pipeline (informal) [same diagram as slide 8]. The callout "We will look at this analytic step" marks the step examined next (Sample Data).

  10. Sample Data: Basic Input/Output [diagram]
     • Input: species presence points (dependent-variable coordinates); environmental layers such as temperature and vegetation (independent-variable coordinates); parameters
     • Output: presence under environmental conditions, as a training sample and a test sample of conditioned data
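A rough sketch of what a step like Sample Data might do: look up each presence point in every environmental layer, then split the conditioned records into a training and a test sample. The function name `sample_data`, its parameters, the dictionary-based layers, and the last two points are assumptions made for illustration; the slides do not show the actual implementation.

```python
import random
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, float]

def sample_data(points: Sequence[Point],
                layers: Sequence[Dict[Point, float]],
                test_fraction: float = 0.5,
                seed: int = 0) -> Tuple[List[tuple], List[tuple]]:
    """Attach layer values to each presence point, then split the records.

    Each layer maps (lat, long) to a value. The leading 1 marks presence.
    Returns (training sample, test sample).
    """
    conditioned = [(1, *[layer[p] for layer in layers]) for p in points]
    rng = random.Random(seed)          # fixed seed keeps the split reproducible
    rng.shuffle(conditioned)
    n_test = int(len(conditioned) * test_fraction)
    return conditioned[n_test:], conditioned[:n_test]

# The first two points and temperatures echo the slide data; the rest are made up.
points = [(33.454606, 106.789098), (33.454606, 106.789097),
          (33.5, 106.8), (33.6, 106.9)]
temp = {points[0]: 56.25, points[1]: 56.37, points[2]: 57.0, points[3]: 55.5}
veg = {points[0]: 0, points[1]: 0, points[2]: 1, points[3]: 1}

train, test = sample_data(points, [temp, veg], test_fraction=0.25, seed=0)
```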

  11. Analytic-Step Abstractions
     • Physical Level: an analytic step is a particular software implementation that takes and produces physical data (for example, files)
     • Logical Level: defines the structure of input and output (like a database schema)
     • Semantic Level: uses ontological information to conceptually define the analytic step (for discovery and integration)

  13. Sample Data: Physical Level [diagram]
     Data as comma-delimited, plain text files; Sample Data is an actual program that implements the step.
     • Input (points): 33.454606, 106.789098; 33.454606, 106.789097; …
     • Input (three layer files, each of the form): 33.454606, 106.789098, 56.25; 33.454606, 106.789097, 56.37; …
     • Input: parameters
     • Output: 1, 56.25, 0, 20, …, 44; 0, 57.34, 0, 55, …, 14; …
     • Output: 0, 77.33, 1, 50, …, 44; 1, 56.01, 0, 55, …, 14; …
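Reading such physical data is mostly a parsing exercise. The sketch below assumes that records are separated by `;` and fields by `,`, as the slide text suggests; the real files may simply use one record per line. `parse_records` is a hypothetical helper name.

```python
from typing import List

def parse_records(text: str) -> List[List[float]]:
    """Parse 'a, b; c, d; ...' into a list of numeric records.

    Assumes ';' separates records and ',' separates fields.
    Empty chunks (for example after a trailing ';') are skipped.
    """
    records = []
    for chunk in text.split(";"):
        chunk = chunk.strip()
        if not chunk:
            continue
        records.append([float(field) for field in chunk.split(",")])
    return records

# Values taken from the slide; the elided parts ("…") are left out.
points = parse_records("33.454606, 106.789098; 33.454606, 106.789097;")
layer = parse_records(
    "33.454606, 106.789098, 56.25; 33.454606, 106.789097, 56.37;"
)
```

Nothing in the parsed result says what the columns mean. That gap is what the logical and semantic levels are meant to fill.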

  21. Logical descriptions
     Recall that a schema sets the allowable structure for data:
     Employee(name : string, age : integer, ssn : string, title : string, salary : integer)
     These tables are not allowable instances of the logical description:
     • Too many columns: (Clark, 50, 555-…, Mgr., 75000, Allen); (Lewis, 36, 555-…, Sales, 40000, Young)
     • Too few columns, wrong datatypes: (Smith, 40, 555-…, 5); (Jones, 36, 555-…, 4); (Davis, 22, 555-…, 2)
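Checking a row against a logical description can be sketched in a few lines. The schema below follows the Employee example on the slide; the helper `conforms` and the choice of Python types for "string" and "integer" are assumptions made for illustration.

```python
# Employee schema from the slide, as (attribute name, Python type) pairs.
EMPLOYEE_SCHEMA = [
    ("name", str),
    ("age", int),
    ("ssn", str),
    ("title", str),
    ("salary", int),
]

def conforms(row, schema=EMPLOYEE_SCHEMA) -> bool:
    """Return True if the row has the right number of columns and types."""
    if len(row) != len(schema):
        return False                      # too few or too many columns
    return all(isinstance(value, typ)
               for value, (_name, typ) in zip(row, schema))

ok_row = ("Clark", 50, "555-…", "Mgr.", 75000)
too_many = ("Clark", 50, "555-…", "Mgr.", 75000, "Allen")
too_few = ("Smith", 40, "555-…", 5)
wrong_type = ("Smith", 40, "555-…", 5, 40000)   # title given as an integer

results = [conforms(r) for r in (ok_row, too_many, too_few, wrong_type)]
```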

  22. Sample Data: Logical Level [diagram]
     • Input: matrix[x, y] (2-dimensional matrix)
     • Input: list(matrix[x, y, z]) (list of 3-dimensional matrices, one matrix per environmental layer)
     • Input: parameters
     • Output: sample1(pres, temp, veg, …, zn) and sample2(pres, temp, veg, …, zn) (relation of n+1 attributes for n environmental layers)

  23. Why have the Logical Level?
     • Data independence: separates how information is represented (text or binary files) from what is represented (a table of integers)
     • Reduced application development time: information is more easily reused, for example by other applications or services, given programs that handle the physical/logical mapping
     • Can help enable integration: explicit knowledge of the structure and types of data can help automate conversion, for example by using higher-level languages

  24. Choosing a logical representation [same diagram as slide 22]
     • Input: matrix[x, y]; list(matrix[x, y, z]); parameters
     • Output: sample1(pres, temp, veg, …, zn) and sample2(pres, temp, veg, …, zn)
     • Can you see any potential problems with this choice of logical output?

  25. Choosing a logical representation [diagram]
     • Sample Data output: sample1(pres, z1, z2, …, zn) and sample2(pres, z1, z2, …, zn)
     • A downstream service expects: avail(pres, temp, veg, elev)
     • The output structure is dependent on the input data, so it is unclear (?) whether the two can be connected

  37. Choosing a logical representation [diagram]
     • Input: matrix[x, y]; list(matrix[x, y, z])
     • Sample Data output: sample1(obs, property, value) and sample2(obs, property, value)
     • Downstream service input: avail(obs, property, value)
     • Reusability is easier when the logical representation is known ahead of time
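The difference between the two logical outputs can be made concrete. A relation such as sample1(pres, temp, veg, …) gains a column for every environmental layer, whereas (obs, property, value) always has three columns. The sketch below converts the first form into the second; `to_long` is a hypothetical helper, and mapping the first three fields of the slide data to `pres`, `temp`, and `veg` is an assumption.

```python
from typing import List, Sequence, Tuple

def to_long(rows: Sequence[Sequence],
            attribute_names: Sequence[str]) -> List[Tuple[int, str, object]]:
    """Turn wide rows into (obs, property, value) triples.

    The observation id is the 1-based position of the row.
    """
    long_rows = []
    for obs_id, row in enumerate(rows, start=1):
        for name, value in zip(attribute_names, row):
            long_rows.append((obs_id, name, value))
    return long_rows

# First three fields of the two sample records shown on the physical-level slide.
wide = [(1, 56.25, 0), (0, 57.34, 0)]
long_form = to_long(wide, ["pres", "temp", "veg"])
```

Adding another layer changes the number of triples but not the schema, so a consumer written against (obs, property, value) keeps working.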

  39. Sample Data: Semantic input/output [ontology diagram]
     • Statistical concepts: Statistical Context, Statistical Variable, Dependent Variable, Independent Variable, Statistical Model, Regression Model, Logistic Regression
     • Ecological concepts: Ecological Model, Biodiversity Model, EcoNiche Model, Regression Based ENM
     • Properties: hasContext, hasDepVar, hasIndVar, usesRegressionModel
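One way semantic annotations could support discovery is by walking an is-a hierarchy. The concept names below come from the slide, but the specific subclass links are assumptions (the diagram's edges did not survive in the transcript), and the step names in `steps` are hypothetical.

```python
from typing import Dict, List, Optional

# ASSUMED is-a links between concepts named on the slide.
SUBCLASS_OF: Dict[str, str] = {
    "Regression Model": "Statistical Model",
    "Logistic Regression": "Regression Model",
    "Biodiversity Model": "Ecological Model",
    "EcoNiche Model": "Biodiversity Model",
    "Regression Based ENM": "EcoNiche Model",
}

def is_a(concept: Optional[str], ancestor: str) -> bool:
    """True if `concept` equals `ancestor` or is a (transitive) subclass of it."""
    while concept is not None:
        if concept == ancestor:
            return True
        concept = SUBCLASS_OF.get(concept)
    return False

def discover(steps: Dict[str, str], wanted: str) -> List[str]:
    """Return the names of steps annotated with `wanted` or any subclass of it."""
    return [name for name, concept in steps.items() if is_a(concept, wanted)]

# Hypothetical analytic steps and their semantic annotations.
steps = {"niche_step": "EcoNiche Model", "logit_step": "Logistic Regression"}

ecological = discover(steps, "Ecological Model")
statistical = discover(steps, "Statistical Model")
```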

  40. Putting it all together [diagram]
     • Physical (the data): comma-delimited text; points "33.454606, 106.789098; 33.454606, 106.789097; …", layers "33.454606, 106.789098, 56.25; 33.454606, 106.789097, 56.37; …", samples "1, 56.25, 0, 20, …, 44; 0, 57.34, 0, 55, …, 14; …"
     • Logical: inputs matrix[x, y] and list(matrix[x, y, z]); outputs sample1(obs, property, value) and sample2(obs, property, value); parameters
     • Semantic: inputs are Grid Coordinates tied by hasContext to a Statistical Context (Dependent Variable, Independent Variable); each output is a Statistical Dataset with hasDepVar and hasIndVar links to a Dependent Variable and an Independent Variable
     • Physical = Data; Logical + Semantic = Metadata
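A sketch of how the three levels might be carried together as metadata on a step's ports, and used to decide whether two steps can be wired directly. `PortDescription`, `compatible`, and the decision rule are assumptions for illustration; the slides do not specify how the mediation system makes this decision.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PortDescription:
    """Metadata for one input or output of an analytic step."""
    physical: str   # how the data is stored
    logical: str    # its structure (schema)
    semantic: str   # the ontology concept it represents

def compatible(output: PortDescription, input_: PortDescription) -> str:
    """Classify a connection between an output port and an input port.

    ASSUMED rule: matching concepts are required; a logical or physical
    mismatch can be bridged by a transformation step.
    """
    if output.semantic != input_.semantic:
        return "incompatible"
    if output.logical != input_.logical or output.physical != input_.physical:
        return "needs transformation step"
    return "direct"

sample_out = PortDescription("comma-delimited text",
                             "sample1(obs, property, value)",
                             "Statistical Dataset")
service_in = PortDescription("comma-delimited text",
                             "avail(pres, temp, veg, elev)",
                             "Statistical Dataset")
grid_in = PortDescription("comma-delimited text",
                          "matrix[x, y]",
                          "Grid Coordinate")

verdicts = [compatible(sample_out, sample_out),
            compatible(sample_out, service_in),
            compatible(sample_out, grid_in)]
```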

  41. Domain Workflow [same diagram as slide 8: the GARP native-species pipeline, from species presence and absence points and environmental layers to a native range prediction map]

  42. Generic Workflow [same diagram as slide 41, with generalized labels]
     • Occurrence data: binary, categorical, or numeric
     • Environmental layers
     • GARP (or other) rule set
     • Prediction map

  43. Temperature Interpolation Workflow [same diagram as slide 42, relabeled for a new domain]
     • Occurrence data: weather station temperature data
     • Environmental layers: elevation, aspect, land cover
     • Prediction map: interpolated temperature grid

  44. Extending Workflows: Climate [diagram: two copies of the pipeline ASx, TS1, ASy, ASz, TS2, ASr]
     • Prediction model from native area
     • Current environmental layers: prediction maps under current conditions
     • Changed environmental layers: prediction maps under changed conditions
     • Compare to get predicted effect of environmental change on species
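The comparison step could be as simple as a cell-by-cell difference of two prediction maps. The sketch assumes each map is a grid of 0/1 presence predictions on the same cells; the grids shown are invented, and `compare_maps` is a hypothetical name.

```python
from typing import List

Grid = List[List[int]]

def compare_maps(current: Grid, changed: Grid) -> Grid:
    """Cell-by-cell difference: +1 = predicted gain, -1 = predicted loss, 0 = no change."""
    return [[after - before for before, after in zip(row_now, row_new)]
            for row_now, row_new in zip(current, changed)]

# Made-up 2x2 prediction maps (1 = species predicted present).
current_map = [[1, 1],
               [0, 0]]
changed_map = [[1, 0],
               [1, 0]]

effect = compare_maps(current_map, changed_map)
```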

  45. Extending Workflows: Invasion [diagram: two copies of the pipeline ASx, TS1, ASy, ASz, TS2, ASr]
     • Native area occurrence and environmental layers: prediction maps in native area
     • Prediction model from native area
     • Invasion area environmental layers: prediction maps in invasion area

  46. Process
     • Create the domain workflow at a conceptual level
     • Define the physical and logical data types for each step
     • Define the ontological data types for each step, for both the domain and a generic ontology
     • Map the domain workflow to a generic workflow
     • Map the generic workflow to other domain workflows
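The last two steps of the process amount to relabeling a workflow's inputs and outputs through a generic vocabulary. The labels below are taken from the domain, generic, and temperature-interpolation workflow slides; representing the mappings as dictionaries, and the helper `remap`, are assumptions for illustration.

```python
from typing import Dict, List

# Domain (ecological niche modeling) labels -> generic labels.
DOMAIN_TO_GENERIC: Dict[str, str] = {
    "Species pres. & abs. points": "Occurrence Data",
    "Env. layers": "Environmental layers",
    "GARP rule set": "GARP (or other) rule set",
    "Native range prediction map": "Prediction map",
}

# Generic labels -> labels of another domain (temperature interpolation).
GENERIC_TO_TEMPERATURE: Dict[str, str] = {
    "Occurrence Data": "Weather station temperature data",
    "Prediction map": "Interpolated temperature grid",
}

def remap(labels: List[str], mapping: Dict[str, str]) -> List[str]:
    """Translate labels through a mapping, leaving unmapped labels unchanged."""
    return [mapping.get(label, label) for label in labels]

niche_io = ["Species pres. & abs. points", "Env. layers",
            "Native range prediction map"]
generic_io = remap(niche_io, DOMAIN_TO_GENERIC)
temperature_io = remap(generic_io, GENERIC_TO_TEMPERATURE)
```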

  47. Exercise
     • Divide into two groups (roughly half in each):
       • Climate change
       • Invasive species
     • Download generic workflow from: ftp://ftp.lternet.edu/pub/outgoing/penningd
     • Work on conceptual workflows that:
       • Reuse the generic pipeline
       • Extend the generic pipeline
       • Create new pipelines
     • Use PowerPoint, Visio, or paper tablets: your choice!
