
Semantic Extensions for Scientific Workflows on the Grid


Presentation Transcript


  1. San Diego Supercomputer Center Semantic Extensions for Scientific Workflows on the Grid Bertram Ludäscher (ludaesch@ucdavis.edu) UC DAVIS Department of Computer Science Associate Professor Dept. of Computer Science & Genome Center University of California, Davis Fellow San Diego Supercomputer Center University of California, San Diego

  2. Overview • Science Environment for Ecological Knowledge (SEEK) • Scientific Workflows • What are they? • Why do we need them? • The Kepler Scientific Workflow System • Adding Semantics to Scientific Workflows

  3. Large collaborative NSF/ITR (2002-2007) Bringing together ecologists, IT experts, CS researchers, … SEEK.ecoinformatics.org Science Environment for Ecological Knowledge

  4. SEEK: Multidisciplinary research to facilitate … • Access to ecological, environmental, and biodiversity data • Enable data sharing & re-use • Enhance data discovery at global scales • Scalable analysis and synthesis • Taxonomic, Spatial, Temporal, Conceptual integration of data, addressing data heterogeneity issues • Enable communication and collaboration for analysis • Enable re-use of analytical components

  5. SEEK Main Components • Kepler • Problem-solving environment for scientific data analysis and visualization  “scientific workflows” • EcoGrid* • Distributed data network for environmental, ecological, and systematics data • Making diverse environmental data systems interoperate • Semantic Mediation System • “Smart” data discovery and integration • Knowledge Representation WG • Taxon WG • BEAM WG • Education, Outreach, Training *name-clash: cf. other Eco-Grid project!

  6. Overview • Science Environment for Ecological Knowledge (SEEK) • Scientific Workflows • What are they? • Why do we need them? • The Kepler Scientific Workflow System • Adding Semantics to Scientific Workflows

  7. Ecology Scientific Workflow: Invasive Species Prediction • [Workflow diagram: EcoGrid queries against registered EcoGrid databases retrieve species presence & absence points (a) and environmental layers (b) for both the native range and the invasion area; layer integration produces integrated layers (c); training and test samples (d) feed a data calculation step that yields a GARP rule set (e); map generation produces native-range and invasion-area prediction maps (f), which are validated against model quality parameters (g); selected prediction maps (h) are archived to EcoGrid with generated metadata] • Source: NSF SEEK (Deana Pennington et al., UNM)

  8. Scientific Workflows • Model the way scientists work with their data and tools • Mentally coordinate export and import of data among software systems • Scientific workflows emphasize data flow (≠ business workflows) • Metadata (incl. provenance info, semantic types etc.) is crucial for automated data ingestion, data analysis, … • Goals: • SWF automation, • SWF & component reuse, • SWF design & documentation • making scientists’ data analysis and management easier!

  9. Commercial & Open-Source "Scientific Workflow" (Dataflow) Systems • Kensington Discovery Edition (InforSense) • Triana • Taverna

  10. Overview • Science Environment for Ecological Knowledge (SEEK) • Scientific Workflows • What are they? • Why do we need them? • The Kepler Scientific Workflow System • Adding Semantics to Scientific Workflows

  11. Kepler Starting Point: UC Berkeley's Ptolemy II • Large, polymorphic libraries of components ("Actors") and "Directors" (drag & drop) • "Directors" define the component interaction & execution semantics
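The actor/director separation above can be illustrated with a minimal sketch. This is not Ptolemy II's actual API; the class names (`Actor`, `SDFDirector`) and scheduling loop are simplified stand-ins for the idea that actors only define *what* they compute, while a director decides *when* they fire.

```python
from collections import deque

class Actor:
    """An actor consumes one token from each input queue and emits one output token."""
    def __init__(self, func, inputs, output):
        self.func, self.inputs, self.output = func, inputs, output

    def ready(self):
        return all(q for q in self.inputs)          # a token waiting on every input port

    def fire(self):
        args = [q.popleft() for q in self.inputs]
        self.output.append(self.func(*args))

class SDFDirector:
    """Dataflow-style execution: repeatedly fire any actor whose inputs are available."""
    def run(self, actors):
        while any(a.ready() for a in actors):
            for a in actors:
                if a.ready():
                    a.fire()

# Wire two actors into a tiny pipeline: double each value, then add one.
src, mid, out = deque([1, 2, 3]), deque(), deque()
double = Actor(lambda x: 2 * x, [src], mid)
inc = Actor(lambda x: x + 1, [mid], out)
SDFDirector().run([double, inc])
print(list(out))  # [3, 5, 7]
```

Swapping in a different director (e.g., one that runs each actor in its own thread, process-network style) would change the execution semantics without touching the actors, which is the point of the design.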

  12. Kepler Scientific Workflows e.g. from Web Services 1 2 4 3 •  “Minute-made” (MM) WS-based application integration • Similarly: MM workflow design & sharing w/o implemented components

  13. Job Management (here: with NIMROD) • Job management infrastructure in place • Results database: under development • Goal: 1000’s of GAMESS jobs (quantum mechanics)

  14. Some Kepler Actor Additions

  15. Ecological Niche Model in Kepler (200 to 500 runs per species x 2000 mammal species x 3 minutes/run) = 833 to 2083 days

  16. KeplerGrid for Biodiversity (200 to 500 runs per species x 2000 mammal species x 3 minutes/run) / 100 nodes = 8 to 20 days Grid-enabled Kepler • Utilize distributed computing resources • Execute single steps or sub-workflows on distributed machines • Initially, focus on ‘trivially parallel’ workflows • Support collaboration through the formation of ad-hoc grids • Implementations • Peer to peer using JXTA • Traditional HPC-based batch job submission (e.g., NIMROD, Condor) KeplerGrid for Niche Modeling
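The throughput figures on slides 15 and 16 follow directly from the stated parameters; a quick check of the arithmetic:

```python
# Sanity-check of the niche-modeling throughput estimates.
runs_low, runs_high = 200, 500      # runs per species
species = 2000                      # mammal species
minutes_per_run = 3
minutes_per_day = 60 * 24

days_low = runs_low * species * minutes_per_run / minutes_per_day
days_high = runs_high * species * minutes_per_run / minutes_per_day
print(round(days_low), round(days_high))            # 833 2083 (sequential)

nodes = 100
print(round(days_low / nodes, 1), round(days_high / nodes, 1))  # 8.3 20.8 (on 100 nodes)
```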

  17. A GEON Data Analysis Workflow

  18. Statistics Packages (here: R) in Kepler Source: Dan Higgins, Kepler/SEEK

  19. ORB

  20. KEPLER: An OPEN SOURCE, cross-project collaboration (www.kepler-project.org) • Ilkay Altintas (SDM, Resurgence, NLADR, …) • Kim Baldridge (Resurgence, NMI) • Chad Berkley (SEEK) • Shawn Bowers (SEEK) • Terence Critchlow (SDM) • Tobin Fricke (ROADNet) • Jeffrey Grethe (BIRN) • Christopher H. Brooks (Ptolemy II) • Zhengang Cheng (SDM) • Dan Higgins (SEEK) • Efrat Jaeger (GEON) • Matt Jones (SEEK) • Werner Krebs (EOL) • Edward A. Lee (Ptolemy II) • Kai Lin (GEON) • Bertram Ludaescher (SDM, SEEK, GEON, BIRN, ROADNet) • Mark Miller (EOL) • Steve Mock (NMI) • Steve Neuendorffer (Ptolemy II) • Jing Tao (SEEK) • Mladen Vouk (SDM) • Xiaowen Xin (SDM) • Yang Zhao (Ptolemy II) • Bing Zhu (SEEK) • ••• Your Logos & Names HERE!!!

  21. GEON Dataset Generation & Registration (a co-development in KEPLER) • Command-line steps (% Makefile, $> ant run) and SQL database access (JDBC) • Contributors: Matt, Chad, Dan et al. (SEEK); Efrat (GEON); Ilkay (SDM); Yang (Ptolemy); Xiaowen (SDM); Edward et al. (Ptolemy)

  22. Kepler today • Supports scientific workflows • Ecology, molecular bio, geology, … • Variety of analytical components (including spatial data transformations) • Support for R scripts and Matlab scripts • Real-time data access via Antelope ORB • EcoGrid access to heterogeneous data • EML Data support • Experimental data, survey data, spatial raster and vector data, etc. • DarwinCore Data support • Museum collections • EcoGrid registry to discover data sources • Ontology-based browsing for analytical components • Exploit semantics to improve the user experience • Demonstration workflows • Ecology: Ecological Niche Modeling, Biodiversity Analysis, … • Genomics: Promoter Identification Workflow • Geology: Geologic Map Integration, Rock-type distribution analysis • Oceanography: Real-time Revelle example of data access

  23. Kepler soon (this year mostly …) • Usability engineering • Full evaluation and user-oriented customization of all UI components • Distributed computing/grid computing • Large jobs, lots of machines • Detached execution • “Smart” data and component discovery • Support annotating data sources • Component repository / downloadable components • Automated data and service integration and transformation using ontologies • Complete EcoGrid access • Full EML support • Support for “large” data and 3rd-party transfer • More data sources and types of data sources (e.g., JDBC, GEON data) • Provenance and metadata propagation

  24. Joint Ptolemy/Kepler Meeting (on our own behalf ;-)

  25. Overview • Science Environment for Ecological Knowledge (SEEK) • Scientific Workflows • What are they? • Why do we need them? • The Kepler Scientific Workflow System • Adding Semantics to Scientific Workflows

  26. Kepler Actor Library w/ Concept Index • How do you find the right component (actor)? → Ontology-based actor organization / browsing → Simple text-based and concept-based searching • Next: ontology-based workflow design • [Architecture: Workflow Components (MoML) carry urn ids; Semantic Annotations (instance expressions) link them to Ontologies (OWL), default + other]

  27. Ecological ontologies • What was measured (e.g., biomass) • Type of measurement (e.g., Energy) • Context of measurement (e.g., Psychotria limonensis) • How it was measured (e.g., dry weight) • SEEK intends to enable community-created ecological ontologies using OWL • Represents a controlled vocabulary for ecological metadata

  28. Ontology O (in Description Logic … cf. OWL-DL)

  29. SEEK KR (Knowledge Representation) Working Group Current Ontologies • Ecological Concepts, Models, Networks • Measurements • Properties • Statistical Analyses • Time and Space • Taxonomic Identifiers • Units • Symbiosis Recent Developments • Biodiversity (measured traits, computation of traits) • Descriptive Terminology for Plant Communities • Ontology documentation Future Goals • “Fill-in” existing concepts, evolve the ontology framework • More domains …

  30. Need for Semantic Annotations of Data & Actors • Label data with semantic types (concept expressions from an ontology) • Label inputs and outputs of analytical components with semantic types • Example: data has COUNT and AREA; the workflow wants DENSITY → via the ontology, the system "knows" that the data can still be used (because DENSITY := COUNT/AREA) • Use reasoning engines to generate transformation steps • Use reasoning engines to discover relevant components • [Diagram: Data ↔ Ontology ↔ Workflow Components]
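The DENSITY := COUNT/AREA example above can be sketched as a tiny derivation-rule lookup. This is an illustrative toy, not SEEK's reasoning engine: the rule table, `resolve` function, and concept names are invented for the example.

```python
# Derivation rules: target concept -> (required concepts, combining function).
# A real system would obtain such rules from the ontology.
RULES = {
    "DENSITY": (("COUNT", "AREA"), lambda count, area: count / area),
}

def resolve(concept, data):
    """Return a value for `concept`, deriving it via RULES when not directly present."""
    if concept in data:                       # directly available in the dataset
        return data[concept]
    if concept in RULES:                      # derivable from other concepts
        needed, combine = RULES[concept]
        return combine(*(resolve(c, data) for c in needed))
    raise KeyError(f"cannot derive {concept}")

sample = {"COUNT": 120, "AREA": 30.0}         # what the dataset actually provides
print(resolve("DENSITY", sample))             # 4.0
```

The workflow asked for a concept the data did not carry, and the rule bridged the gap, which is exactly the "data can still be used" scenario on the slide.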

  31. A Scientist's "Semantic" View of Actors • Actors/ports P1–P5 with S1 (life stage property) and S2 (mortality rate for period), e.g. [(nymphal, 0.44)]; output: k-value for each period of observation • Periods of development in terms of phases: period "Nymphal" comprises the phases {Instar I, Instar II, Instar III, Instar IV} • Population samples for life stages of the common field grasshopper [Begon et al., 1996]: Eggs 44,000; Instar I 3,513; Instar II 2,529; Instar III 1,922; Instar IV 1,461; Adults 1,300 • Source: [Bowers-Ludaescher, DILS'04]

  32. Structural Type (XML DTD) Annotations • structType(P2): root population = (sample)*; elem sample = (meas, lsp); elem meas = (cnt, acc); elem cnt = xsd:integer; elem acc = xsd:double; elem lsp = xsd:string • structType(P3): root cohortTable = (measurement)*; elem measurement = (phase, obs); elem phase = xsd:string; elem obs = xsd:integer • Example instances: <population> <sample> <meas> <cnt>44,000</cnt> <acc>0.95</acc> </meas> <lsp>Eggs</lsp> </sample> … </population> and <cohortTable> <measurement> <phase>Eggs</phase> <obs>44,000</obs> </measurement> … </cohortTable> • Source: [Bowers-Ludaescher, DILS'04]

  33. Semantic Type Annotations • Take concepts and relationships from an ontology to "semantically type" the data-in/out ports • Application, e.g., design support: smart/semi-automatic wiring, generation of "adaptor actors" • Example actor (normalize): input port pin takes Abundance Count Measurements for Life Stages; output port pout returns Mortality Rate Derived Measurements for Life Stages • Source: [Bowers-Ludaescher, DILS'04]
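Semantic port typing reduces to a subsumption check: an output port can feed an input port if its semantic type is the same concept or a subconcept of the input's type. A minimal sketch, with a hand-built toy concept hierarchy (the concept names and `can_connect` helper are invented for illustration; a real system would query a DL reasoner over the OWL ontology):

```python
# Toy ontology: concept -> list of direct superconcepts.
ONTOLOGY = {
    "AbundanceCount": ["Measurement"],
    "MortalityRate": ["DerivedMeasurement"],
    "DerivedMeasurement": ["Measurement"],
    "Measurement": [],
}

def subsumed_by(concept, ancestor):
    """True if `concept` equals `ancestor` or is a transitive subconcept of it."""
    if concept == ancestor:
        return True
    return any(subsumed_by(s, ancestor) for s in ONTOLOGY.get(concept, []))

def can_connect(out_port_type, in_port_type):
    """An output port may be wired to an input port if its type is subsumed."""
    return subsumed_by(out_port_type, in_port_type)

print(can_connect("MortalityRate", "Measurement"))   # True: valid wiring
print(can_connect("Measurement", "MortalityRate"))   # False: too general
```

This is the check behind "smart wiring": only type-compatible connections are offered to the workflow designer.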

  34. A KR+DI+Scientific Workflow Problem • Services can be semantically compatible, but structurally incompatible • Desired connection: source service output port Ps → target service input port Pt • Semantically: SemanticType Ps ⊑ SemanticType Pt (compatible, via the OWL ontologies) • Structurally: StructuralType Ps ⋠ StructuralType Pt (incompatible) • Source: [Bowers-Ludaescher, DILS'04]

  35. The Ontology-Driven Framework • Ontologies (OWL): SemanticType Ps ⊑ SemanticType Pt (compatible) • Registration mappings (input and output) associate StructuralType Ps and StructuralType Pt with their semantic types • From these, a correspondence between the structural types is derived for the desired connection Ps → Pt between source and target service

  36. Correspondence Example • We want to exploit the semantic information to obtain structural correspondences • Source: /population/sample == semType(P2); /population/sample/meas/cnt == semType(P2).itemMeasured; /population/sample/meas/cnt/text() == semType(P2).itemMeasured.hasCount; /population/sample/meas/acc == semType(P2).hasProperty; /population/sample/meas/acc/text() == semType(P2).hasProperty.hasValue; /population/sample/lsp/text() == semType(P2).hasContext.appliesTo • Target: /cohortTable/measurement == semType(P3); /cohortTable/measurement/obs == semType(P3).itemMeasured; /cohortTable/measurement/obs/text() == semType(P3).itemMeasured.hasCount; /cohortTable/measurement/phase/text() == semType(P3).hasContext.appliesTo • [Schema trees: population → sample* → meas (cnt: xsd:integer, acc: xsd:double) and lsp: xsd:string; cohortTable → measurement* → phase: xsd:string and obs: xsd:integer]

  37. Correspondence Example (continued) • These fragments correspond: /population/sample == semType(P2) and /cohortTable/measurement == semType(P3)

  38. Correspondence Example (continued) • These fragments correspond: /population/sample/meas/cnt == semType(P2).itemMeasured and /cohortTable/measurement/obs == semType(P3).itemMeasured

  39. Correspondence Example (continued) • These fragments correspond: /population/sample/meas/cnt/text() == semType(P2).itemMeasured.hasCount and /cohortTable/measurement/obs/text() == semType(P3).itemMeasured.hasCount

  40. Correspondence Example (continued) • These fragments correspond: /population/sample/lsp/text() == semType(P2).hasContext.appliesTo and /cohortTable/measurement/phase/text() == semType(P3).hasContext.appliesTo
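Once the path correspondences above are in hand, the structural transformation they induce is mechanical: each source sample becomes a target measurement, with cnt feeding obs and lsp feeding phase (acc has no target counterpart). A standalone sketch of that induced transformation, assuming the slide's two schemas:

```python
import xml.etree.ElementTree as ET

SOURCE = """<population>
  <sample><meas><cnt>44000</cnt><acc>0.95</acc></meas><lsp>Eggs</lsp></sample>
  <sample><meas><cnt>3513</cnt><acc>0.95</acc></meas><lsp>Instar I</lsp></sample>
</population>"""

def transform(xml_text):
    """Rewrite population/sample documents into cohortTable/measurement form."""
    pop = ET.fromstring(xml_text)
    table = ET.Element("cohortTable")
    # Correspondence: /population/sample == /cohortTable/measurement
    for sample in pop.findall("sample"):
        m = ET.SubElement(table, "measurement")
        # lsp/text() and cnt/text() correspond to phase/text() and obs/text()
        ET.SubElement(m, "phase").text = sample.findtext("lsp")
        ET.SubElement(m, "obs").text = sample.findtext("meas/cnt")
    return ET.tostring(table, encoding="unicode")

print(transform(SOURCE))
```

The ontology does the hard part (establishing which fragments correspond); the generated transformation itself is just a walk over matching fragments.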

  41. Ontology-Guided Data Transformation • Ontologies (OWL): SemanticType Ps ⊑ SemanticType Pt (compatible) • Structural/semantic associations link StructuralType Ps and StructuralType Pt to their semantic types • From the correspondence, generate a transformation over Ps that realizes the desired connection Ps → Pt between source and target service • Source: [Bowers-Ludaescher, DILS'04]

  42. Linking Structural and Semantic Types • An annotation α : S → O links schema elements (structural type S) to the ontology (semantic type O)

  43. Propagating Semantic Annotations • Given: structural schemas S (input) and S′ (output), and an ontology O; a semantic annotation α : S → O; a query annotation q : S → S′ • Problem: compute the derived annotation α′ : S′ → O
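In the simplest case, when q maps each source element to the target element it populates, α′ is just the composition of α with q's inverse. A toy sketch (the element paths, concept names, and `propagate` helper are invented; the real problem on the next slides also involves constraints and LAV-style reasoning):

```python
# alpha : source schema element -> ontology concept
alpha = {
    "sample/meas/cnt": "AbundanceCount",
    "sample/lsp": "LifeStage",
}
# q : source schema element -> target schema element it populates
q = {
    "sample/meas/cnt": "measurement/obs",
    "sample/lsp": "measurement/phase",
}

def propagate(alpha, q):
    """alpha'(t) = alpha(s) whenever the query maps source element s to target t."""
    return {target: alpha[src] for src, target in q.items() if src in alpha}

print(propagate(alpha, q))
# {'measurement/obs': 'AbundanceCount', 'measurement/phase': 'LifeStage'}
```

The payoff is that only the source data needs manual annotation; derived products inherit their semantic types for free.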

  44. Applications • WF design time: Actor → Actor connections • Data binding time: Actor → Data connections ("data binding") • WF runtime: "semantic tagging" of derived data products

  45. Semantic Propagation • Infer annotations for derived products: when a (partial) specification of an actor is given (e.g., as a query q), exploit it to propagate semantic annotations from S to T • Minimizes costly manual semantic annotation • Allows checking for consistency • [Diagram: annotated schemas S and T; the query q maps S to T, and the source annotation maps to a new target annotation — via Chase & Backchase (e.g., MARS) and traditional LAV query answering]

  46. Biodiversity Workflow w/ Query Annotations q

  47. Annotation Constraint • α : S → O • α = ∀x ((s(x) ∧ c(x)) → ∃y o(z)), where z = x ∪ y • s links the variables x to schema elements of S • c is a conjunction of comparisons over x and constants • o "populates" the ontology structure O • Example: s(x) is X : biom[seas=S]; c(x) is S = 'w'; o(z) is X : observation[temporalContext = S : WinterSeason]
