Improving the Reuse of Scientific Workflows and their By-products

Xiaorong Xiang, National Evolutionary Synthesis Center (NESCent), Duke University, University of North Carolina - Chapel Hill, and North Carolina State University
Gregory Madey, Department of Computer Science and Engineering, University of Notre Dame
2007 IEEE International Conference on Web Services (ICWS 2007), Salt Lake City, Utah, July 2007
Supported in part by the Indiana Center for Insect Genomics (ICIG) & the Indiana 21st Century Fund

Collaborators: Xiaorong Xiang & Jeanne Romero-Severson

Outline: two parts • Production system (MoGServ) for bioinformatics workflow • Bioinformatics application • Productivity improvement • Prototype system exploring ideas for end-user composition • Workflow reuse • Knowledge management/discovery

Bioinformatics today From the article “Genome Sequencing vs. Moore’s Law: Cyber Challenges for the Next Decade” by Folker Meyer in journal CTWatch Quarterly August, 2006 volume 2 number 3 • Rapidly accumulating data: DNA sequences, contigs, expression data, annotations, etc. • Non-standard independently developed heterogeneous data sources • Data sharing and security • Productivity Problem!

SOA in Bioinformatics Middleware projects Large public databases and bioinformatics tools MORE • Community efforts needed to provide more shared and reliable services • More demonstration projects needed => best practices, measured utility, feedback to middleware projects, etc. Recent exposure of data & analysis tools as services Others Others Provide infrastructure to compose, manage, execute, connect the distributed services

Mother of Green (MoG) project • Biological science • In collaboration with Prof. Jeanne Romero-Severson, Biological Sciences, University of Notre Dame • Study the deep phylogeny of plastids • Computer science • Provide an environment to support scientists’ investigations • A case study of using SOA for data and application integration • A prototype for future research in the service-oriented architecture domain

Mother of Green • Malaria causes 1.5 - 2.7 million deaths every year • 3,000 children under age five die of malaria every day • Plasmodium falciparum (a protozoan parasite) causes human malaria • Drug resistance is a world-wide problem • Targeted drug design through phylogenomics P. falciparum

Mother of Green • P. falciparum has three genomes • Nuclear, mitochondrial, plastid • Animals and insects have only two • Target the third genome • No harm to animals • New antimalarial drug • High risk, high tech, high payoff J. Romero-Severson Department of Biological Sciences Greg Madey & Xiaorong Xiang Department of Computer Science & Engineering

Mother of Green • Plastids are the third genome • Intracellular organelles • Terrestrial plants, algae, apicomplexans • Functions in plants and algae • Photosynthesis • Oxidation of water • Reduction of NADP • Synthesis of ATP • Fatty acid biosynthesis • Aromatic amino acid biosynthesis • Functions in apicomplexans ? Chloroplast in plant cell plastid Apicoplast in P. falciparum Plastid in Toxoplasma sp.

Mother of Green • The apicoplast appears to code for <30 proteins. • Repair, replication and transcription proteins • Why is the apicoplast essential?

Mother of Green Phylogenomics • Find the ancestors of the apicoplast • Identify genes in the ancestors • Determine gene function • Look for these genes in the P. falciparum nucleus • Then study regulatory mechanisms in candidate genes

Phylogenomics of plastids • Very old lineage (> 2.5 billion years) • Cyanobacterial ancestor • Three main plastid lineages • Glaucophytes • Group of freshwater algae • Chloroplast resembles intact cyanobacteria • Chlorophytes • Green plant lineage • Chloroplast genome reduced • Many chloroplast genes now in nuclear genome • Rhodophytes • Red algal lineage • Chloroplast genome bigger than in green plants • Oomycetes • Apicomplexans

One plastid origin Phylogenomics of plastids • One cyanobacterial ancestor ? • Many? • Lineages are not linear Multiple plastid origins

Nucleus The process of endosymbiosis. Horizontal Gene Transfer (arrows) from the plastid to the nucleus. The nucleomorph is a remnant of the original endosymbiont nucleus. Cyanobacteria Primitive eukaryote Endosymbiont plastid Nucleus Second eukaryote Nucleomorph Secondary endosymbionts Plastid disappears Secondary nonphotosynthetic endosymbiont

Tertiary endosymbiosis. Horizontal Gene Transfer Secondary endosymbiont Third eukaryote Tertiary endosymbionts Plastid disappears Tertiary nonphotosynthetic endosymbiont P. falciparum

The information gathering problem • Rapid accumulation of raw sequence information • ~100 sequenced chloroplast genomes • ~57 sequenced cyanobacterial genomes • Rate of accumulation is increasing • Information accumulates faster than analyses finish • Information in forms not readily accessible • Solution • Semi-automated web-services • “Smart” web-services • Semantic web

A typical in-silico investigation: data-driven research • A: Query complete genome sequences for a given taxon • B: Query protein-coding genes for each genome sequence • C: Eliminate vector sequences • D: Sequence alignment • E: Phylogenetic analysis

Time consuming manual web-based operations • Data collection • Copy & paste! • Analysis tool usage • Copy & paste! • Experiment data recording • Copy & paste! • Repetitive experiments for scientific discovery • Copy & paste! • Repeat as new data becomes available • Copy & paste!

MoGServ system architecture • MoGServ interface • Web interface • Application interface • MoGServ middle layer • Data access storage • Data and analysis services • Service and workflow registry • Indexing and querying metadata • Service and workflow enactment • Acting in two roles: service requester and service provider

Web Interface Applications Services Access Client Application Server Data Access Services Job Manager Service/Workflow Registry MoGServ System Architecture MoGServ Middle Layer Data Analysis Services Metadata Search Job Launcher Local Data Storage Workflow/Soap Engines Services Data/Services Providers NCBI DDBJ EMBL Others

Data storage and access services • Local database • Integrating data from multiple data sources according to scientists’ interests • Supporting repetitive investigations against several subsets of sequences • Avoiding network traffic and service failures when retrieving data on-the-fly from public data sources • Accessing the data in the local database through services

Service and workflow registry • A table-based description with necessary properties • Text description • Service location • Input/output • Provider • Version • Algorithm • Invocation method • Not intended for supporting service discovery or composition • To answer end-users’ questions about their results • Provenance: “Which algorithm was used to generate the data and what is the source of the input data?” • A repository of services and workflows used by local application developers
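A minimal sketch of how one row of such a table-based registry could be modelled, and how it could answer the provenance question quoted above. The fields mirror the listed properties; the class name `RegistryEntry`, the helper `provenance_answer`, and all sample values (including the URL and version string) are hypothetical and not taken from MoGServ.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class RegistryEntry:
    """One row of a table-based service/workflow registry (hypothetical layout)."""
    name: str
    description: str
    location: str            # e.g. a WSDL URL
    inputs: List[str]
    outputs: List[str]
    provider: str
    version: str
    algorithm: str
    invocation_method: str   # e.g. "SOAP" or "command line"


def provenance_answer(entry: RegistryEntry, input_source: str) -> str:
    """Which algorithm generated the data, and where did the input come from?"""
    return (f"Generated by {entry.algorithm} (version {entry.version}, "
            f"provider {entry.provider}) from input: {input_source}")


# Placeholder values for illustration only.
entry = RegistryEntry(
    name="ClustalW",
    description="Multiple sequence alignment",
    location="http://example.org/services/ClustalW?wsdl",
    inputs=["setid"],
    outputs=["alignment report"],
    provider="EBI",
    version="1.0",
    algorithm="ClustalW",
    invocation_method="SOAP",
)
print(provenance_answer(entry, "local gene set 42"))
```

Keeping the record flat, as the slide describes, is enough for answering end-user questions; it deliberately carries no semantic annotations, which matches the statement that the registry is not meant for discovery or composition.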

Indexing and querying metadata • Metadata • Service and workflow description • Description of sequence data in order to track the origination of data • Experimental data output, input, and intermediate data • Indexing and querying with keyword • Lucene • Implemented as services
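The keyword indexing idea can be illustrated with a tiny inverted index. This is only a sketch of the concept: it does not use or imitate the Lucene API, and the class `KeywordIndex`, its methods, and the sample metadata strings are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set


class KeywordIndex:
    """Toy inverted index: keyword -> ids of metadata records containing it."""

    def __init__(self) -> None:
        self._postings: Dict[str, Set[str]] = defaultdict(set)

    def add(self, doc_id: str, text: str) -> None:
        # Whitespace tokenisation and lower-casing only; a real indexer does far more.
        for token in text.lower().split():
            self._postings[token].add(doc_id)

    def search(self, query: str) -> List[str]:
        """Return ids of records that contain every keyword in the query."""
        tokens = query.lower().split()
        if not tokens:
            return []
        hits = set.intersection(*(self._postings.get(t, set()) for t in tokens))
        return sorted(hits)


index = KeywordIndex()
index.add("wf1", "ClustalW alignment workflow")
index.add("svc1", "ClustalW service")
print(index.search("clustalw workflow"))
```

Exposing `add` and `search` behind a service interface would correspond to the slide's "implemented as services" point.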

Service/Workflow Registry INPUT Parameters Task Name Timer Find the service/workflow definition using the task name Job Launcher Job Manager Job Information Form a Job Description Instances of Workflow/Service Engines Output Job ID Service and workflow enactment
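The enactment flow in this diagram (task name and parameters in, job ID out) might be sketched as follows. The class `JobManager`, its methods, and the in-memory registry of callables are assumptions for illustration; the timer and the actual workflow/SOAP engines from the diagram are left out.

```python
import itertools
from typing import Any, Callable, Dict


class JobManager:
    """Hypothetical job manager: look up a definition by task name, then run it."""

    def __init__(self, registry: Dict[str, Callable[..., Any]]) -> None:
        self._registry = registry
        self._ids = itertools.count(1)
        self.jobs: Dict[int, Dict[str, Any]] = {}

    def submit(self, task_name: str, **params: Any) -> int:
        """Form a job description from the task name and parameters; return a job ID."""
        definition = self._registry[task_name]  # raises KeyError for unknown tasks
        job_id = next(self._ids)
        self.jobs[job_id] = {
            "task": task_name,
            "params": params,
            "definition": definition,
            "status": "queued",
        }
        return job_id

    def launch(self, job_id: int) -> Any:
        """Stand-in for the job launcher handing the job to an engine instance."""
        job = self.jobs[job_id]
        job["output"] = job["definition"](**job["params"])
        job["status"] = "done"
        return job["output"]


# A trivial callable stands in for a registered service or workflow.
manager = JobManager({"reverse": lambda seq: seq[::-1]})
job_id = manager.submit("reverse", seq="ACGT")
print(job_id, manager.launch(job_id))
```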

Implementation • Development and deployment • J2EE, JSP, XSLT • Tomcat 5.0.18 / Axis 1.2 • Database • PostgreSQL 8.1 • Index and search of metadata • Apache Lucene library • Service implementation • Java2WSDL • Wrap command line applications with JLaunch library • Workflow • Taverna workbench, part of myGrid project • Freefluo workflow engine

Data and services

Taverna workbench

A workflow created using the Taverna workbench tool

Improvement opportunities • Use existing domain ontologies in the bioinformatics community to describe services, workflows, and data • Integrate semantic web technology to support end-user workflow creation based on users' knowledge of the scientific domain • Support users with limited knowledge of scientific processes • Record various workflow representations • Facilitate the discovery and reuse of prior workflows • Knowledge management • Knowledge discovery

Service composition and workflows • Service composition • Ad-hoc • Semi-automated • Semantic annotation + reasoning • Automated • Semantic annotation + planning • Scientific workflows • Workflows composed on a service-oriented architecture to assist scientists in accessing and analyzing data

Current workflow management systems • Existing workflow management systems and bioinformatics middleware • Taverna, Kepler, Triana, Pegasus • Design, execute, monitor, re-run • Support ad-hoc, semi-automated, and automated service discovery and composition from scratch

Our approach • Reuse verified knowledge and workflows from the community • Increase the correctness of composed workflows over time • Provide more accurate guidelines for users • A four-level hierarchical workflow structure • An enhanced workflow system

Three user-defined workflows from different views Question: “Are gene genealogies for ATP subunits α, β, and γ different?” Retrieving queryGene queryGene queryGene Aligning setIds Workflow A defined by a less experienced user using the functional definition of services setFilter queryGene clustalW clustalW clustalW clustalW Workflow C defined by an expert user with two extra executable services to ensure the accurate output of the biological process Workflow B defined by an intermediate user with executable services

MogServ Workflow composer (software agent/experienced users) Knowledge base management concrete workflow Workflow execution engine Knowledge discovery Collect and manage information about data origination Semantics enabled service discovery Find appropriate service Semantics enabled service registry Data provenance management Service matchmaking Abstract workflow Annotate services using ontology OWL DL reasoner Create abstract workflow using ontology Service Annotator User Ontology Enhanced workflow system

Task A Task B Encode, convert the High level definition To low-level executable Abstract workflow Service A Service B Service D Service C Replace individual Services with their optimal alternatives Concrete workflow Service A Service B Pegasus workflow structure Invoke a workflow with Specific input data and Record the data Provenance and Performance of services, workflows. Service D Service C’ Optimal workflow input Service A Service B output Our hierarchical workflow structure Service D Service C’ Workflow instance
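The four levels in this diagram (abstract workflow, concrete workflow, optimal workflow, workflow instance) can be read as three successive transformations. The sketch below assumes a purely linear chain of steps, which is simpler than the branching graph on the slide; the function names, the mapping tables, and the service name `ClustalW_fast` are hypothetical.

```python
from typing import Any, Callable, Dict, List, Tuple


def to_concrete(abstract: List[str], task_to_service: Dict[str, str]) -> List[str]:
    """Abstract -> concrete: map each high-level task to an executable service."""
    return [task_to_service[task] for task in abstract]


def to_optimal(concrete: List[str], alternatives: Dict[str, str]) -> List[str]:
    """Concrete -> optimal: swap a service for its preferred alternative, if any."""
    return [alternatives.get(service, service) for service in concrete]


def run_instance(
    workflow: List[str],
    implementations: Dict[str, Callable[[Any], Any]],
    data: Any,
) -> Tuple[Any, List[Dict[str, Any]]]:
    """Optimal -> instance: invoke with specific input and record provenance."""
    provenance = []
    for service in workflow:
        result = implementations[service](data)
        provenance.append({"service": service, "input": data, "output": result})
        data = result
    return data, provenance


abstract = ["retrieving", "aligning"]
concrete = to_concrete(abstract, {"retrieving": "QueryLocal", "aligning": "ClustalW"})
optimal = to_optimal(concrete, {"ClustalW": "ClustalW_fast"})
# String functions stand in for real service calls.
stubs = {"QueryLocal": lambda q: q.upper(), "ClustalW_fast": lambda s: s + "|aligned"}
output, provenance = run_instance(optimal, stubs, "atp")
print(optimal, output)
```

Keeping each level as a separate value, rather than rewriting one structure in place, is what allows the mapping between abstract and concrete workflows to be stored and reused later.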

Reusable knowledge • Connectivity • Helps to convert from abstract workflow to concrete workflow • Alternative services • Helps to convert from concrete workflow to optimal workflow • Quality profile of services • Helps discover optimal workflows • Mapping of abstract workflow and concrete workflow • Helps to choose reusable workflows

Connectivity identification (match detection)
Parameter notation: name(data type, semantic type)
Service: QueryLocal • Operation: createSet • performTask: mygrid:retrieving • inputPara: Settype(String, mog:gene), Queryterm(String, null) • outputPara: Setid(string, mog:geneset) • useResource: MoG
Service: ClustalW • Operation: runClustalWdf • performTask: mygrid:aligning • inputPara: Setid(String, mog:set), Sequencetype(String, mog:sequence) • outputPara: filen(string, mygrid:sequence_alignment_report) • useResource: EBI
Service: FormatConversion • Operation: convert • performTask: mygrid:translating • inputPara: filen(String, mygrid:sequence_alignment_report) • outputPara: Out(String, mygrid:nexus_paup_format) • useResource: MoG
Matching rule: operation_ij → operation_mn if there exist an output parameter_k of operation_ij and an input parameter_o of operation_mn such that data type(parameter_o) = data type(parameter_k) and semantic type(parameter_o) = semantic type(parameter_k)
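The matching rule can be written down directly as a predicate over annotated operations. This is a sketch under two assumptions that the slide does not state: data types are compared case-insensitively (the annotations mix `String` and `string`), and a parameter whose semantic type is null never matches. The classes `Parameter` and `Operation` and the function `matches` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Parameter:
    name: str
    data_type: str
    semantic_type: Optional[str]  # None models a missing (null) annotation


@dataclass(frozen=True)
class Operation:
    service: str
    name: str
    inputs: Tuple[Parameter, ...]
    outputs: Tuple[Parameter, ...]


def matches(producer: Operation, consumer: Operation) -> List[Tuple[str, str]]:
    """Return (output name, input name) pairs that satisfy the matching rule."""
    pairs = []
    for out in producer.outputs:
        for inp in consumer.inputs:
            if out.semantic_type is None or inp.semantic_type is None:
                continue  # assumption: unannotated parameters never match
            same_data = out.data_type.lower() == inp.data_type.lower()
            same_semantics = out.semantic_type == inp.semantic_type
            if same_data and same_semantics:
                pairs.append((out.name, inp.name))
    return pairs


create_set = Operation(
    "QueryLocal", "createSet",
    inputs=(Parameter("Settype", "String", "mog:gene"),
            Parameter("Queryterm", "String", None)),
    outputs=(Parameter("Setid", "string", "mog:geneset"),),
)
run_clustalw = Operation(
    "ClustalW", "runClustalWdf",
    inputs=(Parameter("Setid", "String", "mog:set"),
            Parameter("Sequencetype", "String", "mog:sequence")),
    outputs=(Parameter("filen", "string", "mygrid:sequence_alignment_report"),),
)
convert = Operation(
    "FormatConversion", "convert",
    inputs=(Parameter("filen", "String", "mygrid:sequence_alignment_report"),),
    outputs=(Parameter("Out", "String", "mygrid:nexus_paup_format"),),
)

print(matches(run_clustalw, convert))
print(matches(create_set, run_clustalw))
```

With the annotations as listed, strict equality connects `runClustalWdf` to `convert` but not `createSet` to `runClustalWdf`, because `mog:geneset` and `mog:set` are different terms. Connecting those two would require knowing how the terms relate in the ontology, which this sketch does not model.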

Need for verified service connectivity The mismatching problem Real match Inaccurate annotation Lack semantic annotation Inaccurate reasoning Accurate annotation Yes No May be detected by experts at design time or after run Yes Match Detection output No Accurate annotation Inaccurate annotation Lack of semantic annotation Inaccurate reasoning Can be detected automatically TN X Blastp In: protein sequence FP GenBankService Out:GenBank record NCBI blast In: sequence data record DDBJ-XML Out: sequence data record X Mediator, adaptor, shim Self-defined format fasta format

Workflow translation / service composition process • Registration process • Automatically identify the connectivity • Store the connectivity in the knowledge base • Refine, update, decompose the workflow • Connectivity graph implementation • Finding connectivity between services is converted to finding a path between two nodes in a graph • connect(servicea, operationai, parameterc, serviceb, operationbi, parameterd) • identifyConnect(single service, RDF repository) • Search at the syntactic level • Search for a path between two nodes • Search for the next available service • Automatic composition based on input and output • Implementation: Dijkstra's shortest path algorithm
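Once connectivity is stored as a graph, composing a workflow between two operations reduces to a path search. The sketch below applies Dijkstra's algorithm with every edge given weight 1, which is an assumption; the slide does not say what edge weights were used. The adjacency-list layout and the node labels are hypothetical, and the edges are taken as already verified.

```python
import heapq
from typing import Dict, List, Optional


def shortest_path(
    graph: Dict[str, List[str]], source: str, target: str
) -> Optional[List[str]]:
    """Dijkstra over a connectivity graph with unit edge weights."""
    dist = {source: 0}
    prev: Dict[str, str] = {}
    heap = [(0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == target:
            break
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry
        for nxt in graph.get(node, []):
            nd = d + 1
            if nd < dist.get(nxt, float("inf")):
                dist[nxt] = nd
                prev[nxt] = node
                heapq.heappush(heap, (nd, nxt))
    if target not in dist:
        return None  # no chain of connected operations exists
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return path[::-1]


# Each node is "Service.operation"; an edge means output feeds input.
connectivity = {
    "QueryLocal.createSet": ["ClustalW.runClustalWdf"],
    "ClustalW.runClustalWdf": ["FormatConversion.convert"],
    "FormatConversion.convert": [],
}
print(shortest_path(connectivity, "QueryLocal.createSet", "FormatConversion.convert"))
```

With unit weights a breadth-first search would return the same paths; Dijkstra becomes useful if edges later carry costs such as the quality profile of a service.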

Ontological modules used for semantic description of data, services & workflows MoGServ application Domain Ontology (MoGServ) Generic Service Description Ontology (myGrid/Feta model) Service Domain Ontology (myGrid) Software components for annotation RDF Store Data Services Workflows

MoGServ Application Domain Ontology Example concepts and properties defined in MoGServ • To better track the data origination • To support the automation of workflow creation • To better share the data on the web in the future

Sample service/workflow annotation Question: Which service has an operation that accepts nucleotide_sequence as a parameter Answer: Uri: http://www.ebi.ac.uk …/alignment:blastn_ncbi OperationName: Run Displayed by Rdf-Gravity

Implementation of annotation and query components for data, services & workflows Annotation Templates (Service) Annotation components Sesame RDF store Annotation Templates (Data) • Sesame 1.2.6 library • Supports files, RDBMS, SeRQL Query Components Query templates Service: http:host.cse.nd.edu/ axis/services/ClustalW?wsdl Operation: runClustalWdf inputParameter: setid result SeRQL Select Y, W, X from {Y} mg:hasOperation{W} mg:inputParameter {X} rdf:type {mog:set} using namespace rdf = <http://www.w3.org/1999/02/22-rdf-syntax-ns#>, mg = <http://www.mygrid.org.uk/ontology#>, mog = <http://almond.cse.nd.edu:10000/mog#>

Experiment • Used 418 concepts from the domain ontology for semantic types and defined 10 concepts for data types • Randomly generated service annotations, each with 1 input and 1 output • Connectivity graph of 1000 services (right side) • Intel Pentium mobile, 1.5 GHz • Length 0 = 724, length 1 = 587, length 2 = 448, length 3 = 281, length 4 = 114, length 5 = 71, length 6 = 28, length 7 = 16, length 8 = 4, length 9 = 2 • Conclusion: feasible solution
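The setup of this experiment can be approximated in a few lines. The sketch below generates random single-input, single-output annotations from 418 semantic and 10 data types, builds the connectivity graph by exact type equality, and counts how many services are reachable at each path length from one starting service. The uniform sampling, the seed, the helper names, and the choice to measure from a single start node are all assumptions, so it does not attempt to reproduce the counts reported on the slide.

```python
import random
from collections import Counter, deque
from typing import Dict, List, Tuple

Param = Tuple[int, int]            # (data type id, semantic type id)
Service = Tuple[Param, Param]      # (input parameter, output parameter)


def random_services(n: int, n_semantic: int = 418, n_data: int = 10,
                    seed: int = 0) -> List[Service]:
    """Randomly annotate n services, each with one input and one output."""
    rng = random.Random(seed)

    def param() -> Param:
        return (rng.randrange(n_data), rng.randrange(n_semantic))

    return [(param(), param()) for _ in range(n)]


def connectivity(services: List[Service]) -> Dict[int, List[int]]:
    """Edge i -> j when the output of service i equals the input of service j."""
    by_input: Dict[Param, List[int]] = {}
    for idx, (inp, _) in enumerate(services):
        by_input.setdefault(inp, []).append(idx)
    return {
        idx: [j for j in by_input.get(out, []) if j != idx]
        for idx, (_, out) in enumerate(services)
    }


def path_length_histogram(graph: Dict[int, List[int]], start: int) -> Counter:
    """Number of services first reached at each path length from start (BFS)."""
    depth = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph[node]:
            if nxt not in depth:
                depth[nxt] = depth[node] + 1
                queue.append(nxt)
    return Counter(depth.values())


services = random_services(1000)
graph = connectivity(services)
print(sorted(path_length_histogram(graph, 0).items()))
```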

Reuse of workflows • Reuse of abstract workflows • Reuse of concrete workflows • Compare structural similarity of two workflows • Implementation: SUBDUE algorithm • SUBDUE has a graph match utility that is part of its data mining system • A given workflow is converted to a graph and fed to the SUBDUE match algorithm
Abstract example, SUBDUE input format:
v 1 input
v 2 output
v 3 task
v 4 task
v 5 query_term
v 6 retrieving
v 7 aligning
v 8 multiple_aligning_report
e 3 4 hasNext
e 3 1 hasInput
e 4 2 hasOutput
e 3 6 performTask
e 4 7 performTask
e 1 5 hasParameter
e 2 8 hasParameter
[Figure: graph view of the same workflow, with nodes input, query_term, task, retrieving, task, aligning, output, multiple_alignment_report and edges hasInput, hasParameter, performTask, hasNext, hasOutput]
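Converting a workflow into the vertex and edge lines of the abstract example is mechanical. The sketch below emits the same `v <id> <label>` and `e <source> <target> <label>` shape as the excerpt; the function `to_subdue` and the vertex keys are hypothetical, and the code only produces text, it does not call SUBDUE.

```python
from typing import Dict, List, Tuple


def to_subdue(vertices: Dict[str, str], edges: List[Tuple[str, str, str]]) -> str:
    """Serialise a labelled graph as 'v id label' and 'e src dst label' lines.

    vertices maps a local key to its label; ids are assigned from 1 in
    insertion order. edges are (source key, target key, edge label).
    """
    ids = {key: i for i, key in enumerate(vertices, start=1)}
    lines = [f"v {ids[key]} {label}" for key, label in vertices.items()]
    lines += [f"e {ids[src]} {ids[dst]} {label}" for src, dst, label in edges]
    return "\n".join(lines)


vertices = {
    "in": "input", "out": "output", "t1": "task", "t2": "task",
    "q": "query_term", "r": "retrieving", "a": "aligning",
    "rep": "multiple_aligning_report",
}
edges = [
    ("t1", "t2", "hasNext"),
    ("t1", "in", "hasInput"),
    ("t2", "out", "hasOutput"),
    ("t1", "r", "performTask"),
    ("t2", "a", "performTask"),
    ("in", "q", "hasParameter"),
    ("out", "rep", "hasParameter"),
]
print(to_subdue(vertices, edges))
```

Because two vertices can share a label (both steps are `task`), vertices are keyed separately from their labels; structural comparison then depends on the labelled edges rather than on vertex identity.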

Conclusion • Pro • Increase the correctness of the formed workflows over time • Avoid incorrect or inaccurate semantic annotations • Take advantage of verified knowledge • Avoid the ontological reasoning process • Better support for semi-automated and automated service composition over time • Provide more accurate guidelines to users over time • Con • The connectivity graph can be big • Number of parameters • Number of services • Searching for the connectivity of a service when it is registered in the system may take a relatively long time • More complex matching rule • Number of parameters • May not have high accuracy at the beginning

Future work • Integrate GridSAM into MoGServ for execution and monitoring • Integrate Grid computing technology for resource allocation • Refine the MoGServ application domain ontology • Create an interface for end-user workflow creation • Create an interface for individual workspaces • Evaluate the scalability and accuracy of the connectivity graph approach and the graph matching approach with a large number of real workflows and services

Thank you Questions?
