Advancements in Workflow Systems for Bioinformatics: Taverna 1.4 and Future Directions

Workflows, Semantics & future eScience • Integrative Bioinformatics Workshop • Tom Oinn – tmo@ebi.ac.uk • 6th September 2006

Workflows • Data driven workflow system • Graph of operations (nodes) and data transfer (edges) • Operations are services, databases, command line tools, scripts… • Workflow engine software (enactor) responsible for coordination of operations • Enactor is data agnostic, apart from collection structures (for Taverna)
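A minimal sketch of the data driven model described on this slide, not Taverna's actual enactor: operations are plain Python callables, edges carry data between them, and the enactor only coordinates and moves values. The names `enact`, `fetch_sequence` and `sequence_length` are hypothetical, and Taverna's handling of collection structures is not modelled.

```python
from graphlib import TopologicalSorter  # Python 3.9+


def enact(operations, edges, workflow_inputs):
    """Run each operation once all of its upstream data is available.

    operations: node name -> callable taking keyword arguments (hypothetical)
    edges: (source node, target node, target port) data links
    workflow_inputs: node name -> value supplied by the user
    """
    # For every operation, collect the nodes whose data it consumes.
    depends_on = {name: set() for name in operations}
    for source, target, _port in edges:
        depends_on[target].add(source)

    results = dict(workflow_inputs)
    for node in TopologicalSorter(depends_on).static_order():
        if node in results:  # workflow input, nothing to invoke
            continue
        # The enactor only moves values along edges; it never inspects them.
        arguments = {port: results[source]
                     for source, target, port in edges if target == node}
        results[node] = operations[node](**arguments)
    return results


# Hypothetical two-step workflow: ID -> fetch_sequence -> sequence_length
operations = {
    "fetch_sequence": lambda identifier: f"SEQ({identifier})",
    "sequence_length": lambda sequence: len(sequence),
}
edges = [
    ("id", "fetch_sequence", "identifier"),
    ("fetch_sequence", "sequence_length", "sequence"),
]
results = enact(operations, edges, {"id": "P12345"})
```

In a real system each callable would wrap a web service, database query, command line tool or script; the coordination logic stays the same.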

Taverna 1.4 • Rewritten generic web service client • Rewritten BioMart client • Provenance capture system • Performance and usability enhancements • Groundwork for new architecture and build system

Web Service Support • Enhanced support for document / literal style services • e.g. NCBI eUtils services • More robust invocation • Copes better with various broken service types • Support for wsdl:documentation tags • Now shows free text service docs • Not ideal, but it is all that web services give us

BioMart Support • UI changed to reflect current website • Eases ‘techno-shock’ for users • Supports all Mart features • Data set linking and federation • Uses Mart service • Connects over HTTP, reducing firewall issues • Mart providers no longer need to open JDBC access ports; fewer open ports means better security for service providers

Provenance Capture & Browse • Observes events from the workflow engine • Populates a triple store with information from these events • Presents a simple browse interface over this metadata • Replicates Taverna’s existing result and status browser • Allows for more complex query interfaces in the future
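The capture pipeline on this slide can be sketched as follows. This is an illustration under assumptions, not Taverna's provenance code: the event dictionaries, the predicate names (`part_of_run`, `has_status`, `output:<port>`) and the in-memory `TripleStore` class are all hypothetical stand-ins.

```python
class TripleStore:
    """Tiny in-memory stand-in for a real triple store."""

    def __init__(self):
        self.triples = set()

    def add(self, subject, predicate, obj):
        self.triples.add((subject, predicate, obj))

    def match(self, subject=None, predicate=None, obj=None):
        """Return sorted triples matching the given fields; None is a wildcard."""
        wanted = (subject, predicate, obj)
        return sorted(
            t for t in self.triples
            if all(w is None or w == v for w, v in zip(wanted, t))
        )


def record_event(store, event):
    """Turn one (hypothetical) engine event into triples."""
    processor = f"{event['run']}/{event['processor']}"
    store.add(processor, "part_of_run", event["run"])
    store.add(processor, "has_status", event["status"])
    for port, value in event.get("outputs", {}).items():
        store.add(processor, f"output:{port}", value)


store = TripleStore()
record_event(store, {"run": "run1", "processor": "FetchSequence",
                     "status": "RUNNING"})
record_event(store, {"run": "run1", "processor": "FetchSequence",
                     "status": "COMPLETE",
                     "outputs": {"sequence": "MKTAYIAK"}})

# A simple browse view: every status recorded for one processor.
status_view = store.match(subject="run1/FetchSequence", predicate="has_status")
```

Because everything lands in one set of triples, a result browser and a status browser are just two different `match` calls, and richer query interfaces can be layered on later.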

Taverna is now part of OMII-UK • Taverna 1.4 production target: Sept 2006 • Packaging, Installation, Deployment, Maintenance, Testing • GridSAM, GRIMOIRES, BioMOBY registry integration • Semantic content for registry • Integration of discovery and metadata management • Security AA for KAVE data and metadata management • Taverna 2.0: Spring 2007 • Redevelopment of the plug-in and enactor framework, improved iteration events, data management • Close collaboration with pioneers • Incremental rollouts to early adopters

[Figure: release pipeline. Sources: myGrid Alliance, SourceForge community. Stages: myGrid pre-release, myGrid release, ingest, evaluation, OMII-UK release. Activities: prioritise and plan, software engineering (XP), software engineering quality and test, OMII software engineering quality and test, applications and professional services. Audiences: pioneers, early adopters, conservatives (production).]

Evolving challenges • Long running data intensive workflows • Manipulation of confidential or otherwise protected information • Use with classical grid systems • Interaction with users during workflows • Workflow authoring, service discovery and composition • Fine grained runtime updates • Data comprehension, provenance and visualization – the rest of this talk!

And now, the future… • Axes: Increasing Automation; Better Semantics and Understanding • ‘Data playground’ exploratory tool • Manual use of tools, web pages • Scripted tool invocation • Naïve workflow systems • Basic ‘discovery’ style service annotations • Guided workflow construction • Workflow design with annotation overlays • Automated hypothesis generation (really!) • Knowledge driven visualization • Hypothesis validation

Service Annotations • Immediate problem – too many services! • At workflow construction time users cannot isolate the services they need • Multiple levels of annotation • Interface and syntactic definitions, e.g. WSDL • Free text descriptions • Semantic annotation of operations
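To illustrate how layered annotations help a user isolate services, here is a small sketch of a registry filtered by semantic annotation. The `ServiceAnnotation` fields and `find_services` function are assumptions made for this example, and the two registry entries are hypothetical services, not entries from any real registry.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ServiceAnnotation:
    name: str                            # interface level, e.g. from WSDL
    description: str = ""                # free text level
    performs_task: Optional[str] = None  # semantic level
    input_types: dict = field(default_factory=dict)   # port -> type
    output_types: dict = field(default_factory=dict)  # port -> type


def find_services(registry, task=None, consumes=None, produces=None):
    """Return names of services matching every criterion that is given."""
    hits = []
    for service in registry:
        if task is not None and service.performs_task != task:
            continue
        if consumes is not None and consumes not in service.input_types.values():
            continue
        if produces is not None and produces not in service.output_types.values():
            continue
        hits.append(service.name)
    return hits


# Hypothetical registry with two annotated services.
registry = [
    ServiceAnnotation(
        name="align_pair",
        description="Compares two sequences and returns a distance",
        performs_task="alignment",
        input_types={"seq_a": "sequence", "seq_b": "sequence"},
        output_types={"score": "d_value"},
    ),
    ServiceAnnotation(
        name="find_motifs",
        description="Finds motif ranges in a sequence",
        performs_task="motif_search",
        input_types={"seq": "sequence"},
        output_types={"ranges": "range_set"},
    ),
]

alignment_services = find_services(registry, task="alignment")
sequence_consumers = find_services(registry, consumes="sequence")
```

Filtering on types alone still returns both services; only the task annotation narrows the choice, which is the argument for semantic annotation of operations.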

Service annotations (overlaid on the automation / semantics roadmap) • ‘myalignscript.pl’ • Natural language • ‘A tool to compare multiple protein structures’ • performs_task : alignment • input_type{seq_a} : sequence… output_type{score} : d_value • Requires type ontology or ontologies! • output{score} is_distance_between pair {input{sequence a}, input{sequence b}} • Also needs workflow level annotation!

Building the semantic network • Workflow engine uses service annotations to annotate the results of invocations of those services • For example, a workflow with input ID and the services Fetch Structure, Fetch Sequence, ExtractMotifRanges, InterproScan and GetGO (cellular location)
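A minimal sketch of the idea: each service carries a static annotation, and the engine applies it to every invocation to emit typed nodes and a labelled edge. The annotation table below is an assumption based on the type and relation labels used later in the talk (protein_identifier, has_sequence, and so on); the `is_a` predicate, the function name and the data values are hypothetical.

```python
# service -> (input type, relation, output type); hypothetical annotations
SERVICE_ANNOTATIONS = {
    "Fetch Sequence": ("protein_identifier", "has_sequence", "protein_sequence"),
    "Fetch Structure": ("protein_identifier", "has_structure", "3d_structure"),
}


def annotate_invocation(network, service, input_value, output_value):
    """Add the triples implied by one service invocation to the network."""
    input_type, relation, output_type = SERVICE_ANNOTATIONS[service]
    network.add((input_value, "is_a", input_type))
    network.add((output_value, "is_a", output_type))
    network.add((input_value, relation, output_value))


network = set()
annotate_invocation(network, "Fetch Sequence", "P12345", "MKTAYIAK")
annotate_invocation(network, "Fetch Structure", "P12345", "structure-0001")
```

The engine itself stays data agnostic: it never interprets the values, it only copies the service's annotation onto whatever passed through it.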

No service annotations • [Figure: example result graph in which every data node is an opaque ID and edges are labelled only with the service invoked: Fetch Sequence, Fetch Structure, InterproScan, GetGO, GetMotifRanges]

Input / Output type annotation • [Figure: example result graph with data nodes typed as protein_identifier, protein_sequence, 3d_structure, ipro_identifier, go_term and range_set; edges still labelled with the service invoked: Fetch Sequence, Fetch Structure, InterproScan, GetGO, GetMotifRanges]

Full static semantics • [Figure: example result graph with typed data nodes (protein_identifier, protein_sequence, 3d_structure, ipro_identifier, go_term, range_set) and edges labelled with relations: has_sequence, has_structure, has_ipro_hit, has_go, contains_domain]

Dynamic semantics • Driven by workflow level annotation • [Figure: example result graph with node types protein_identifier, protein_sequence, 3d_structure, ipro_identifier, range_set, location_prediction and go_term, and relations has_sequence, has_structure, has_ipro_hit, has_evidence, contains_domain, predicts_location and has_go (nodes omitted to prevent further insanity)]

Visualization • Naïve rendering of the graph isn’t good enough • Any scientific domain already has visualization mechanisms • Create an ecosystem of visualization agents • Iteratively consume the semantic network • Replace node(s) with markers into the visualizer’s space • Render any remaining edges using graph layout
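One way to read the "iteratively consume" step, sketched under assumptions: agents are offered the network in turn, each claims the nodes whose types it can render, and only edges not absorbed by a single agent are left for generic graph layout. The function name, agent names and sample data are hypothetical.

```python
def assign_to_agents(triples, node_types, agents):
    """agents: ordered list of (agent name, set of node types it can render).

    Returns (claimed, remaining): claimed maps each consumed node to the
    agent that now owns it; remaining lists edges for generic graph layout.
    """
    claimed = {}
    for agent_name, handled_types in agents:
        for node, node_type in node_types.items():
            if node not in claimed and node_type in handled_types:
                claimed[node] = agent_name
    # An edge is absorbed only when both ends belong to the same agent.
    remaining = sorted(
        (s, p, o) for s, p, o in triples
        if claimed.get(s) is None or claimed.get(s) != claimed.get(o)
    )
    return claimed, remaining


# Illustrative network; values and agent names are made up.
node_types = {
    "P12345": "protein_identifier",
    "seq1": "protein_sequence",
    "ranges1": "range_set",
    "GO:0005634": "go_term",
}
triples = {
    ("P12345", "has_sequence", "seq1"),
    ("seq1", "contains_domain", "ranges1"),
    ("P12345", "has_go", "GO:0005634"),
}
agents = [
    ("sequence_feature_renderer", {"protein_sequence", "range_set"}),
    ("go_subgraph_renderer", {"go_term"}),
]
claimed, remaining = assign_to_agents(triples, node_types, agents)
```

Here the sequence and its domain ranges collapse into one renderer's view, while the edges from the protein identifier to each renderer's marker remain for the graph layout.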

Rendering Agents • [Figure: the dynamic semantics graph with regions handed to three agents: Sequence + Feature Renderer, 3D Structure Renderer, Gene Ontology Subgraph Renderer]

Hypothesis Validation • Express hypothesis as a pattern that can match the semantic network topology • Combination of structure and node values • Need to use a rich graph aware query language, various options • For each object of a certain class test whether the structure around it matches • Link back to the visualization to show exceptions 
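A minimal sketch of validation as pattern matching, assuming a hand-rolled matcher rather than any particular graph query language. A hypothesis is a list of required outgoing edges, each combining structure (predicate and object class) with a test on the node value; the function name, pattern shape and data are hypothetical.

```python
def find_exceptions(triples, node_types, target_class, pattern):
    """pattern: list of (predicate, required object class, value test).

    Returns (node, first unmet predicate) for every node of target_class
    whose surrounding structure does not match the pattern.
    """
    exceptions = []
    for node in sorted(n for n, t in node_types.items() if t == target_class):
        for predicate, object_class, value_ok in pattern:
            satisfied = any(
                s == node and p == predicate
                and node_types.get(o) == object_class and value_ok(o)
                for s, p, o in triples
            )
            if not satisfied:
                exceptions.append((node, predicate))
                break
    return exceptions


# Illustrative data: P2 has a sequence but no GO term.
node_types = {
    "P1": "protein_identifier",
    "P2": "protein_identifier",
    "MKTAYIAK": "protein_sequence",
    "MSAGQLLV": "protein_sequence",
    "GO:0005634": "go_term",
}
triples = {
    ("P1", "has_sequence", "MKTAYIAK"),
    ("P1", "has_go", "GO:0005634"),
    ("P2", "has_sequence", "MSAGQLLV"),
}
# Hypothesis: every protein has a non-empty sequence and a GO term.
pattern = [
    ("has_sequence", "protein_sequence", lambda value: len(value) > 0),
    ("has_go", "go_term", lambda value: value.startswith("GO:")),
]
exceptions = find_exceptions(triples, node_types, "protein_identifier", pattern)
```

The returned node identifiers are what a visualization would highlight as exceptions to the hypothesis.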

Hypothesis Generation (!) • Use genetic algorithms to ‘evolve’ a suitable match for the previous stage • Relatively easy to create a fitness function (precision, specificity, match percentage) • Easy to ‘mutate’ patterns • ‘Tell me anything interesting you’ve noticed about protein structures in this workflow’ capability 
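The evolutionary loop can be sketched as below. Everything here is an assumption made for illustration: a pattern is reduced to a set of required outgoing predicates, mutation toggles one predicate, and the fitness (match fraction multiplied by pattern size) is a crude stand-in for a real balance of precision, specificity and match percentage.

```python
import random


def matches(node, pattern, triples):
    """True if the node has an outgoing edge for every predicate in pattern."""
    outgoing = {p for s, p, _ in triples if s == node}
    return pattern <= outgoing


def fitness(pattern, nodes, triples):
    """Reward patterns that are both specific and widely matched."""
    if not nodes:
        return 0.0
    matched = sum(matches(node, pattern, triples) for node in nodes)
    return (matched / len(nodes)) * len(pattern)


def mutate(pattern, vocabulary, rng):
    """Toggle one randomly chosen predicate in or out of the pattern."""
    return pattern ^ {rng.choice(vocabulary)}


def evolve(nodes, triples, vocabulary, generations=30, size=8, seed=0):
    rng = random.Random(seed)
    population = [frozenset() for _ in range(size)]
    for _ in range(generations):
        children = [mutate(p, vocabulary, rng) for p in population]
        # Elitist selection: parents compete with their mutated children.
        ranked = sorted(population + children,
                        key=lambda p: fitness(p, nodes, triples),
                        reverse=True)
        population = ranked[:size]
    return population[0]


# Illustrative data: all proteins have sequences, two have GO terms,
# one has a structure.
nodes = ["P1", "P2", "P3"]
triples = {
    ("P1", "has_sequence", "s1"), ("P2", "has_sequence", "s2"),
    ("P3", "has_sequence", "s3"),
    ("P1", "has_go", "g1"), ("P2", "has_go", "g2"),
    ("P1", "has_structure", "x1"),
}
vocabulary = ["has_sequence", "has_go", "has_structure"]
best = evolve(nodes, triples, vocabulary)
```

The evolved pattern can then be fed straight back into hypothesis validation, whose exceptions are the "interesting" cases to report to the user.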

Obtaining Taverna • Taverna is available under the LGPL from our project site on Sourceforge.net • http://taverna.sourceforge.net • Release 1.4 as of May 2006 • Win32, Solaris / Linux & OS-X • Includes online and downloadable user manual, examples etc. • Support via project mailing lists

myGrid acknowledgements Carole Goble, Norman Paton, Robert Stevens, Anil Wipat, David De Roure, Steve Pettifer • OMII-UK Tom Oinn, Katy Wolstencroft, Daniele Turi, June Finch, Stuart Owen, David Withers, Stian Soiland, Franck Tanoh, Matthew Gamble. • Research Martin Szomszor, Duncan Hull, Jun Zhao, Pinar Alper, Antoon Goderis, Alastair Hampshire, Qiuwei Yu, Wang Kaixuan. • Current contributors Matthew Pocock, James Marsh, Khalid Belhajjame, PsyGrid project, Bergen people, EMBRACE people. • User Advocates and their bosses Simon Pearce, Claire Jennings, Hannah Tipney, May Tassabehji, Andy Brass, Paul Fisher, Peter Li, Simon Hubbard, Tracy Craddock, Doug Kell. • Past Contributors Matthew Addis, Nedim Alpdemir, Tim Carver, Rich Cawley, Neil Davis, Alvaro Fernandes, Justin Ferris, Robert Gaizaukaus, Kevin Glover, Chris Greenhalgh, Mark Greenwood, Yikun Guo, Ananth Krishna, Phillip Lord, Darren Marvin, Simon Miles, Luc Moreau, Arijit Mukherjee, Juri Papay, Savas Parastatidis, Milena Radenkovic, Stefan Rennick-Egglestone, Peter Rice, Martin Senger, Nick Sharman, Victor Tan, Paul Watson, and Chris Wroe. • Industrial Dennis Quan, Sean Martin, Michael Niemi (IBM), Chimatica. • Funding EPSRC, Wellcome Trust.
