BioMOBY the one that almost got away Mark Wilkinson, iCAPTURE Centre UBC, Vancouver, Canada

MOBY-S Update for VanBug, Vancouver, BC, Canada, 2004

Make some sense of this mess!

Along came web services • Relatively recently added to the bioinformatics tool-belt • Didn’t help the situation much… • A web service that consumes “string” data types might be expecting a fasta sequence, or a keyword. • No clear way for a machine to know which • UDDI/WSDL is not very useful in solving this problem • Biology/Bioinformatics has a lot of data-types!

Who is MOBY’s audience? • Information is distributed • Beyond Flybase, MIPS, EnsEMBL and TAIR • MOST data never makes it off the scientist’s hard drive • This data should be added to the global scientific archive • Biologists, by and large, are willing and able, but… • The Web was embraced enthusiastically by biologists • Most wet labs run a website in which they present at least some of their results and data through HTML or CGI • Unfortunately, this only adds to the chaos… • The interoperability solution we design must be simple enough for a biologist with a little computer knowledge to implement on their own

The MOBY Plan • Define data-types commonly used in bioinformatics • Organize these into an ontology • Ontologically define web service inputs and outputs • Register the inputs and outputs of each service provider in a “yellow pages” registry • Machines can find an appropriate service • Machines can execute that service unattended

Overview of MOBY-S Transactions [Figure: diagram of MOBY hosts & services and MOBY Central. Labels: Sequence, Express., Protein, Alleles, …, Align, Phylogeny, Primers, Sequence, Alignment, Gene names]

MOBY-S Data Types • My disappointment with web services not being (easily) able to distinguish between a Fasta sequence and a keyword led me to spend a lot of time thinking about data-types. • This consideration became the core focus of MOBY-S • Constraints on MOBY-S are much more severe than on an “archetypal” computer-science solution • Our target audience is not high-level programmers • Defining data types with XML schema is a non-starter: IT WILL NEVER HAPPEN!

MOBY-S in detail • MOBY-S Data typing system: Semantic Type • MOBY-S Data typing system: Syntactic Type • The MOBY-S Service Ontology • The MOBY Central Registry

MOBY-S Semantic Typing: Namespaces • Any identifiable piece of data is an “entity” • Identifiers fall into particular “Namespaces” • NCBI has gi numbers (gi Namespace) • GO Terms have accession numbers (GO Namespace) • Namespaces indicate data’s semantic type. • GO:0003476 represents a Gene Ontology Term, not a sequence • gi|163483 represents a GenBank record • However, we cannot tell if it is protein, RNA, or DNA sequence • Namespace+ID is sufficient to specify a particular “entity” • The namespace is assumed to be sufficiently descriptive of the data’s semantic type that a service provider can define their interface in terms of Namespaces
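As a rough illustration of the Namespace+ID idea, the Python sketch below models an entity as a (namespace, id) pair and lets a service declare the namespaces it consumes. The `Entity` class, the `parse_identifier` helper, its separator handling, and the `ACCEPTED_NAMESPACES` set are assumptions made for illustration only; they are not part of the MOBY-S API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    """An identifiable piece of data: Namespace + ID (illustrative class)."""
    namespace: str
    id: str


def parse_identifier(text: str) -> Entity:
    """Split 'GO:0003476' or 'gi|163483' into namespace and ID.

    Hypothetical helper: the two separators handled here are simply the
    ones that appear in the examples on the slide.
    """
    for separator in ("|", ":"):
        if separator in text:
            namespace, _, identifier = text.partition(separator)
            return Entity(namespace, identifier)
    raise ValueError(f"no namespace found in {text!r}")


# A service provider could describe its interface purely by namespace.
# This set is an assumed example, not a registered service.
ACCEPTED_NAMESPACES = {"GO"}


def accepts(entity: Entity) -> bool:
    """True if the (hypothetical) service consumes this entity's namespace."""
    return entity.namespace in ACCEPTED_NAMESPACES


go_term = parse_identifier("GO:0003476")
genbank_record = parse_identifier("gi|163483")
print(go_term, accepts(go_term))
print(genbank_record, accepts(genbank_record))
```

Note that the namespace alone says nothing about whether `gi|163483` is protein, RNA, or DNA; that distinction is left to the syntactic type.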

MOBY-S Syntactic Typing: The Object Ontology • Syntactic types are defined by a GO-like ontology • Type (Class) name at each node • Edges define the relationships between one Class and another • Gene Ontology used as a model because of its obvious success and comprehension by the model organism community • Edges define one of three relationship types • ISA • Inheritance relationship • All properties of the parent are present in the child • HASA • Container relationship of ‘exactly 1’ • HAS • Container relationship with ‘1 or more’

A portion of the MOBY-S Object Ontology

ISA inheritance relationship • Classes become more specialized as you move along the ISA relationship hierarchy DNA_Sequence ISA Nucleotide_Sequence ISA Generic_Sequence ISA Virtual_Sequence ISA Object • Objects do not become more complex as a result of ISA relationships alone

HASA & HAS relationships • HASA and HAS relationships make Classes more complex by embedding Classes within Classes • Virtual_Sequence ISA Object • Virtual_Sequence HASA Length(Integer) • Generic_Sequence ISA Virtual_Sequence • Generic_Sequence HASA Sequence(String) • Annotated_GIF ISA Image(base_64_GIF) • Annotated_GIF HAS Description(String)
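The three relationship types can be pictured as a small data structure. The sketch below is a minimal model, not the MOBY-S implementation: the class names and the Length/Sequence members come from the slides, while `ObjectClass`, `lineage`, `members`, and the placement of Integer and String directly under Object are assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class ObjectClass:
    """One node of an illustrative object ontology."""
    name: str
    isa: Optional[str] = None                                   # parent class
    hasa: List[Tuple[str, str]] = field(default_factory=list)  # exactly 1
    has: List[Tuple[str, str]] = field(default_factory=list)   # 1 or more


ONTOLOGY: Dict[str, ObjectClass] = {
    c.name: c
    for c in [
        ObjectClass("Object"),
        ObjectClass("Integer", isa="Object"),   # assumed position
        ObjectClass("String", isa="Object"),    # assumed position
        ObjectClass("Virtual_Sequence", isa="Object",
                    hasa=[("Length", "Integer")]),
        ObjectClass("Generic_Sequence", isa="Virtual_Sequence",
                    hasa=[("Sequence", "String")]),
        ObjectClass("Nucleotide_Sequence", isa="Generic_Sequence"),
        ObjectClass("DNA_Sequence", isa="Nucleotide_Sequence"),
    ]
}


def lineage(name: str) -> List[str]:
    """Follow ISA edges from a class up to the root."""
    chain = []
    current: Optional[str] = name
    while current is not None:
        chain.append(current)
        current = ONTOLOGY[current].isa
    return chain


def members(name: str) -> List[Tuple[str, str, str]]:
    """Collect HASA/HAS members, inherited ones first."""
    result = []
    for class_name in reversed(lineage(name)):
        node = ONTOLOGY[class_name]
        result += [(a, t, "exactly 1") for a, t in node.hasa]
        result += [(a, t, "1 or more") for a, t in node.has]
    return result


print(lineage("DNA_Sequence"))
print(members("DNA_Sequence"))
```

In this model DNA_Sequence adds nothing of its own; it carries only the Length and Sequence members it inherits, which matches the point that ISA alone does not make objects more complex.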

Legacy file formats
• Inheriting from “String” allows us to define ontological classes that represent legacy data types
• NCBI_Blast_Report ISA text-formatted ISA String
<NCBI_Blast_Report namespace="NCBI_gi" id="115325">
TBLASTN 2.0.4 [Feb-24-1998]
Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A.
Schäffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman
(1997), "Gapped BLAST and PSI-BLAST: a new generation of protein database search
programs", Nucleic Acids Res. 25:3389-3402.
Query= gi|1401126
(504 letters)
Database: Non-redundant GenBank+EMBL+DDBJ+PDB sequences
336,723 sequences; 677,679,054 total letters
Searching...done
                                                                     Score     E
Sequences producing significant alignments:                          (bits)  Value
gb|U49928|HSU49928 Homo sapiens TAK1 binding protein (TAB1) mRNA...   1009   0.0
emb|Z36985|PTPP2CMR P.tetraurelia mRNA for protein phosphatase t...     58   4e-07
emb|X77116|ATMRABI1 A.thaliana mRNA for ABI1 protein                    53   1e-05
gb|U12856|ATU12856 Arabidopsis thaliana Col-0 abscisic acid inse...     53   1e-05
</NCBI_Blast_Report>

Binaries
• We base64 encode binaries, and again define data classes that inherit from String
• base64_encoded_jpeg ISA text/base64 ISA text/plain ISA String
<base64_encoded_jpeg namespace="TAIR_image" id="3343532">
MIAGCSqGSIb3DQEHAqCAMIACAQExCzAJBgUrDgMCGgUAMIAGCSqGSIb3DQEHAQAAoIIJQDCC
Av4wggJnoAMCAQICAwhH9jANBgkqhkiG9w0BAQQFADCBkjELMAkGA1UEBhMCWkExFTATBgNV
MIAGCSqGSIb3DQEHAqCAMIACAQExCzAJBgUrDgMCGgUAMIAGCSqGSIb3DQEHAQAAoIIJQDCC
Av4wggJnoAMCAQICAwhH9jANBgkqhkiG9w0BAQQFADCBkjELMAkGA1UEBhMCWkExFTATBgNV
BAgTDFdlc3Rlcm4gQ2FwZTESMBAGA1UEBxMJQ2FwZSBUb3duMQ8wDQYDVQQKEwZUaGF3dGUx
HTAbBgNVBAsTFENlcnRpZmljYXRlIFNlcnZpY2VzMSgwJgYDVQQDEx9QZXJzb25hbCBGcmVl
bWFpbCBSU0EgMjAwMC44LjMwMB4XDTAyMDkxNTIxMDkwMVoXDTAzMDkxNTIxMDkwMVowQjEf
MB0GA1UEAxMWVGhhd3RlIEZyZWVtYWlsIE1lbWJlcjEfMB0GCSqGSIb3DQEJARYQamprM0Bt
</base64_encoded_jpeg>
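A minimal sketch of the wrapping step, assuming Python's standard `base64` and `xml.etree.ElementTree` modules. The helper names `wrap_binary` and `unwrap_binary` and the four-byte payload are made up for illustration; only the element name and the namespace/id attributes follow the slide.

```python
import base64
import xml.etree.ElementTree as ET


def wrap_binary(data: bytes, class_name: str, namespace: str, ident: str) -> str:
    """Base64-encode raw bytes and place them in a lightweight XML wrapper.

    Hypothetical helper: the element is named after the ontology class and
    carries namespace/id attributes, as in the slide's example.
    """
    element = ET.Element(class_name, {"namespace": namespace, "id": ident})
    element.text = base64.b64encode(data).decode("ascii")
    return ET.tostring(element, encoding="unicode")


def unwrap_binary(xml_text: str) -> bytes:
    """Recover the original bytes from the wrapper."""
    return base64.b64decode(ET.fromstring(xml_text).text)


# Placeholder payload (the first bytes of a JPEG header), not a real image.
payload = b"\xff\xd8\xff\xe0"
wrapped = wrap_binary(payload, "base64_encoded_jpeg", "TAIR_image", "3343532")
print(wrapped)
print(unwrap_binary(wrapped) == payload)
```

Because the encoded content is plain text, the class can inherit from String and travel through the same machinery as any other legacy text format.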

Extending legacy data types
• With legacy data-types defined, we can extend them as we see fit
• annotated_jpeg ISA base64_encoded_jpeg
• annotated_jpeg HASA 2D_Coordinate_set
• annotated_jpeg HASA Description
<annotated_jpeg namespace="TAIR_Image" id="3343532">
  <CrossReference>
    <Object namespace="TAIR_Allele" id="ufo-1"/>
  </CrossReference>
  <2D_Coordinate_set namespace="" id="" articleName="pixelCoordinates">
    <CrossReference>
      <Object namespace="TAIR_Tissue" id="122"/>
    </CrossReference>
    <Integer namespace="" id="" articleName="x_coordinate">3554</Integer>
    <Integer namespace="" id="" articleName="y_coordinate">663</Integer>
  </2D_Coordinate_set>
  <String namespace="" id="" articleName="Description">
    This is the phenotype of a ufo-1 mutant under long daylength, 16°C
  </String>
  MIAGCSqGSIb3DQEHAqCAMIACAQExCzAJBgUrDgMCGgUAMIAGCSqGSIb3DQEHAQAAoIIJQDCC
  Av4wggJnoAMCAQICAwhH9jANBgkqhkiG9w0BAQQFADCBkjELMAkGA1UEBhMCWkExFTATBgNV
</annotated_jpeg>

The same object…
annotated_jpeg ISA base64_encoded_jpeg HASA 2D_Coordinate_set HASA Description
<annotated_jpeg namespace="TAIR_Image" id="3343532">
  <CrossReference>
    <Object namespace="TAIR_Allele" id="ufo-1"/>
  </CrossReference>
  <2D_Coordinate_set namespace="" id="" articleName="pixelCoordinates">
    <CrossReference>
      <Object namespace="TAIR_Tissue" id="122"/>
    </CrossReference>
    <Integer namespace="" id="" articleName="x_coordinate">3554</Integer>
    <Integer namespace="" id="" articleName="y_coordinate">663</Integer>
  </2D_Coordinate_set>
  <String namespace="" id="" articleName="Description">
    This is the phenotype of a ufo-1 mutant under long daylength, 16°C
  </String>
  MIAGCSqGSIb3DQEHAqCAMIACAQExCzAJBgUrDgMCGgUAMIAGCSqGSIb3DQEHAQAAoIIJQDCC
  Av4wggJnoAMCAQICAwhH9jANBgkqhkiG9w0BAQQFADCBkjELMAkGA1UEBhMCWkExFTATBgNV
</annotated_jpeg>

The Object Ontology: Defines an XML Schema • Object Ontology terms have semantically rich names, but this is for human intuition only • DNA Sequence • Annotated_GIF • Object Ontology does not define what these data-types mean – NO SEMANTICS • It does define the XML schema of their representation - SYNTAX

The Object Ontology: Defines an XML Schema! • The position of an ontology node precisely defines the syntax by which that node will be represented • End-users can define new data-types without having to write an XML schema! • This was an important aim of the project • Similarly you can, at run-time, determine the schema of any incoming XML by querying the ontology.
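One way to picture "querying the ontology for the schema" is the sketch below. It assumes a tiny in-memory dictionary standing in for a real ontology query, and it assumes the XML layout used in the examples on these slides (each member is a child element named after its class, with an `articleName` attribute). The functions `expected_members` and `check`, and the sample sequence, are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# class -> (parent, [(articleName, member class, relationship)])
# Stand-in for a query against the object ontology (assumed structure).
ONTOLOGY = {
    "Object": (None, []),
    "Virtual_Sequence": ("Object", [("Length", "Integer", "HASA")]),
    "Generic_Sequence": ("Virtual_Sequence", [("Sequence", "String", "HASA")]),
}


def expected_members(class_name):
    """Derive the expected child elements from the node's position."""
    members = []
    while class_name is not None:
        parent, own = ONTOLOGY[class_name]
        members = own + members  # inherited members come first
        class_name = parent
    return members


def check(xml_text):
    """Return a list of problems; an empty list means the XML fits."""
    root = ET.fromstring(xml_text)
    if root.tag not in ONTOLOGY:
        return [f"unknown class {root.tag}"]
    # Cross-references are annotations, so this sketch skips them.
    seen = Counter(
        (child.get("articleName"), child.tag)
        for child in root
        if child.tag != "CrossReference"
    )
    problems = []
    for article, member_class, relationship in expected_members(root.tag):
        count = seen.pop((article, member_class), 0)
        if relationship == "HASA" and count != 1:
            problems.append(
                f"expected exactly 1 {article!r} ({member_class}), found {count}")
        if relationship == "HAS" and count < 1:
            problems.append(f"expected 1 or more {article!r} ({member_class})")
    problems += [f"unexpected member {a!r} ({t})" for a, t in seen]
    return problems


GOOD = """
<Generic_Sequence namespace="NCBI_gi" id="163483">
  <Integer namespace="" id="" articleName="Length">12</Integer>
  <String namespace="" id="" articleName="Sequence">ATGGCCATTGTA</String>
</Generic_Sequence>
"""
print(expected_members("Generic_Sequence"))
print(check(GOOD))
```

The sketch requires an exact class match for each member; whether a more specialized child class should also be accepted is a design question it leaves open.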

The Service Ontology • A simple ISA hierarchy • Rooted in the base “Service” transformation (never instantiated) • Primitive types include: • Analysis • Parsing • Registration • Retrieval • Resolution

The Service Ontology [Figure: ISA hierarchy rooted at Service. Parsing ISA Service; Analysis ISA Service; Parse_NCBI_Blast ISA Parsing; Alignment ISA Analysis; Blast ISA Alignment; WU_Blast ISA Blast; NCBI_Blast ISA Blast]
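Because the Service Ontology is a plain ISA hierarchy, a subtype test is just a walk up the parent links. The sketch below assumes one plausible reading of the figure's labels (seven ISA edges over the eight service names); the exact edges, the `SERVICE_ISA` table, and the `is_a` and `subtypes` helpers are illustrative assumptions.

```python
from typing import List, Optional

# child -> parent; an assumed reading of the figure's seven ISA edges.
SERVICE_ISA = {
    "Parsing": "Service",
    "Analysis": "Service",
    "Parse_NCBI_Blast": "Parsing",
    "Alignment": "Analysis",
    "Blast": "Alignment",
    "WU_Blast": "Blast",
    "NCBI_Blast": "Blast",
}


def is_a(service_type: str, ancestor: str) -> bool:
    """True if service_type equals ancestor or descends from it via ISA."""
    current: Optional[str] = service_type
    while current is not None:
        if current == ancestor:
            return True
        current = SERVICE_ISA.get(current)  # None once we pass the root
    return False


def subtypes(ancestor: str) -> List[str]:
    """All service types strictly below ancestor, sorted by name."""
    return sorted(t for t in SERVICE_ISA if t != ancestor and is_a(t, ancestor))


print(is_a("NCBI_Blast", "Analysis"))
print(subtypes("Alignment"))
```

A client asking for any "Alignment" service could use such a test to accept services registered under the more specific Blast types as well.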

MOBY Central: The yellow pages • MOBY Central is a registry for MOBY-compliant services • Not UDDI-based • Services register: • “Service Signature” - a triple of [input, service_type, output] • A human readable description of the service • The URL to the service interface • Provides two types of interfaces: • Register/Deregister • Search/Retrieve
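The registry's two interfaces can be sketched as an in-memory class. This is a toy model under stated assumptions, not the MOBY Central API: `ServiceRecord`, `Registry`, the method names, the service name, and the example.org URL are all hypothetical, and matching is by exact string only.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class ServiceRecord:
    """What a service registers: signature triple, description, URL."""
    name: str
    input_type: str
    service_type: str
    output_type: str
    description: str
    url: str


class Registry:
    """Toy 'yellow pages' with register/deregister and search/retrieve."""

    def __init__(self) -> None:
        self._services: Dict[str, ServiceRecord] = {}

    def register(self, record: ServiceRecord) -> None:
        if record.name in self._services:
            raise ValueError(f"{record.name} is already registered")
        self._services[record.name] = record

    def deregister(self, name: str) -> bool:
        """Remove a service; returns False if it was not registered."""
        return self._services.pop(name, None) is not None

    def search(self, input_type: Optional[str] = None,
               service_type: Optional[str] = None,
               output_type: Optional[str] = None) -> List[ServiceRecord]:
        """Match on any part of the signature; None means 'any'."""
        return [
            s for s in self._services.values()
            if input_type in (None, s.input_type)
            and service_type in (None, s.service_type)
            and output_type in (None, s.output_type)
        ]


registry = Registry()
registry.register(ServiceRecord(
    name="exampleBlast",                      # hypothetical service
    input_type="DNA_Sequence",
    service_type="NCBI_Blast",
    output_type="NCBI_Blast_Report",
    description="Runs a BLAST search on a DNA sequence.",
    url="http://example.org/moby/exampleBlast",  # placeholder URL
))
print([s.name for s in registry.search(input_type="DNA_Sequence")])
```

A machine holding a DNA_Sequence can ask the registry what can be done with it, then call the returned URL unattended.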

A simple MOBY-S browser is embedded in Gbrowse • gbrowse_moby can be configured to execute MOBY Services in response to mouse-clicks in the Gbrowse sequence viewer. • It isn’t a powerful client, but it reveals some interesting MOBYesque behaviours…

Semantic Web “on the fly”! • This simple browser behaves very much like a semantic web browser • Information from non-coordinated service providers is discovered at run-time in response to queries. • It does so without semantics - Syntax only!

Semantic Web “on the fly”! • Perhaps Interoperability is not a semantic problem? • Data Integration may be more of a semantic problem (??) • Service Discovery, however, definitely is a semantic problem

Ugh… Tedious! • The simple browser is frustrating in many ways • Design once, run once • Analysis of only one data-element at a time • No way to extract the data at the end of the analysis • No provenance information is saved • myGrid has been working on similar problems • The BioMOBY project has secretly absconded with one of the myGrid employees, and he now works for us! Shhhhhhh! ;-)

TAVERNA • A fantastic client program that can talk to MOBY Central and execute MOBY Services • Taverna was written by Tom Oinn, with MOBY input by Martin Senger, as part of the myGrid project

MOBY-S: On reflection • Two years into the project • >140 services registered and growing • ~20 independent service providers (not part of the BioMOBY project) • Codebase not yet developed beyond a working prototype • myGrid is making great progress, and has 25X more funding than we have! • It is now time to step back and take a critical look at what we achieved, where we failed, and where to go from here

What MOBY got RIGHT • Open source, community driven • Involving the model organism community right from the start has made an enormous impact on the early acceptance and adoption of MOBY • Rapid feedback on success/failure • we had “real” users right from the prototype stage! • The community has been very forgiving of “hiccups” because they are included in the development process

What MOBY got RIGHT • Data typing • Does not attempt to re-structure legacy data-types • passed verbatim in a lightweight XML wrapper. • There are TONS of parsers out there • Entire software projects are built around extracting information from these legacy formats. • Ontology dictates data structure/sub-structure • XML can be parsed, with the “meaning” of each sub-structure encountered being defined by the ontology • Thus MOBY data is more “self-describing” than XML even with an XML schema

What MOBY got RIGHT • Data typing • Provides a foundation for future data-type definitions • New data-types can be defined by end-users • New data-types can be defined in a structured, machine-readable way, rather than by new ad hoc flat-file format. • Unsophisticated data providers have an “environment” that structures their thinking about the data they are providing. • XML schema creation is unnecessary • REMEMBER WHO OUR TARGET AUDIENCE IS!! • Object ontology simplifies creation of visualization tools in an environment where the number/nature of data types is changing daily.

What MOBY got RIGHT • Data typing • Provides a standard way of annotating the data object, and/or any of its sub-structures • Annotations are kept separate from the data itself (versus e.g. hypertext) • Multiple annotations per data component • Mechanism for indicating the semantic relationship between the annotation and the data being annotated • Separation of the semantic data-type from its syntax • The same data “entity” can be instantiated in a wide variety of ways
