Harnessing the Power of Communities: MOBY & Beyond. Mark Wilkinson, PI Bioinformatics, iCAPTURE Centre for Cardiovascular and Pulmonary Research; Assistant Professor, Dept. of Medical Genetics, UBC, Vancouver.
A brief history of BioMoby
• September 2001 – Model Organism Bring Your own Database Interface Conference (MOBY-DIC)
• May 21, 2002 – Genome Canada Platform Award
• May 25, 2002 – API Version 0.1 deployed, including the messaging layer that still exists today
• July 18, 2002 – first Moby Client released (now gbrowse_moby, part of gbrowse from GMOD)
• June 9, 2003 – API Version 0.5 deployed
• Currently, the API is at version 0.86; the version 1.0 API is in preparation for release at the end of November
The MOBY-S Plan
• Create an ontology of bioinformatics data-types
• Define a serialization of this ontology (data syntax)
• Create an open API over this ontology
• Define Web Service inputs and outputs vis-à-vis the ontology
• Register Services in an ontology-aware Registry
• Machines can find an appropriate service
• Machines can execute that service unattended
• The ontology is community-extensible
Overview of MOBY-S Transactions
[Diagram: MOBY hosts & services and MOBY Central. Labels: Sequence, Express., Protein, Alleles, …, Align, Phylogeny, Primers, Sequence Alignment, Gene names]
MOBY-S in detail • MOBY-S Data typing system: Semantic Type • MOBY-S Data typing system: Syntactic Type
Moby Namespaces (from GO)
• Any identifiable piece of data is an “entity”
• Identifiers for these entities fall under “Namespaces”
  • NCBI has gi numbers (gi Namespace)
  • GO Terms have accession numbers (GO Namespace)
• Namespaces indicate the data’s semantic type
  • GO:0003476 is a Gene Ontology Term
  • gi|163483 is a GenBank record
• Namespace + ID precisely specifies a data “entity”
• This differs from an LSID in that our identifiers ARE NOT OPAQUE: they are semantically rich
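As a minimal sketch of the "Namespace + ID" idea, the two identifiers quoted on this slide can be split into a namespace and an ID. The `Entity` type and the `parse_entity` function are hypothetical names used only for illustration, and the assumption that ":" or "|" separates the two parts is taken from these two examples alone, not from the MOBY-S API.

```python
import re
from typing import NamedTuple


class Entity(NamedTuple):
    """A data entity named by a namespace and an ID within that namespace."""
    namespace: str
    identifier: str


def parse_entity(text: str) -> Entity:
    """Split strings such as 'GO:0003476' or 'gi|163483' at the first ':' or '|'.

    Assumption: only these two separator styles occur (as in the slide examples).
    """
    match = re.fullmatch(r"([^:|]+)[:|](.+)", text.strip())
    if match is None:
        raise ValueError(f"not a namespace-qualified identifier: {text!r}")
    return Entity(match.group(1), match.group(2))


if __name__ == "__main__":
    for raw in ("GO:0003476", "gi|163483"):
        entity = parse_entity(raw)
        print(entity.namespace, entity.identifier)
```

Because the namespace is readable rather than opaque, a client can decide what kind of record it is holding before resolving anything.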
The MOBY-S Object Ontology
• Syntactic types are defined by a GO-like ontology
  • Data Class name at each node
  • Edges define the relationships between Classes
  • GO used as a model because of its familiarity in the community
• Edges define one of three relationships
  • IS A: inheritance relationship; all properties of the parent are present in the child
  • HAS A: container relationship of ‘exactly 1’
  • HAS: container relationship with ‘1 or more’
[Diagram: node, Edge, node]
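The three edge types above can be pictured as a small labelled graph. The following is only a sketch under assumptions: the edge list uses class names that appear in the examples of this talk, and the tuple layout and the helper names `lineage` and `is_a` are hypothetical, not part of the MOBY-S API.

```python
from typing import List, Optional, Tuple

# (subject class, relationship, object class); relationship is ISA, HASA or HAS.
EDGES: List[Tuple[str, str, str]] = [
    ("Integer", "ISA", "Object"),
    ("String", "ISA", "Object"),
    ("VirtualSequence", "ISA", "Object"),
    ("VirtualSequence", "HASA", "Integer"),
    ("GenericSequence", "ISA", "VirtualSequence"),
    ("GenericSequence", "HASA", "String"),
    ("DNASequence", "ISA", "GenericSequence"),
]


def parent_of(name: str) -> Optional[str]:
    """Return the class that `name` inherits from, or None at the root."""
    for subject, relationship, target in EDGES:
        if subject == name and relationship == "ISA":
            return target
    return None


def lineage(name: str) -> List[str]:
    """Follow ISA edges from `name` up to the root of the ontology."""
    chain = [name]
    parent = parent_of(name)
    while parent is not None:
        chain.append(parent)
        parent = parent_of(parent)
    return chain


def is_a(child: str, ancestor: str) -> bool:
    """True if `child` is `ancestor` or inherits from it through ISA edges."""
    return ancestor in lineage(child)


if __name__ == "__main__":
    print(lineage("DNASequence"))
    print(is_a("DNASequence", "VirtualSequence"))
```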
The Simplest Moby Data-Type
<Object namespace="NCBI_gi" id="111076"/>
The combination of a namespace and an identifier within that namespace uniquely identifies a data entity, not its location(s) or its representation.
[Diagram: Object]
A Primitive Data-Type
<Integer namespace="" id="">38</Integer>
[Diagram: DateTime ISA Object; Float ISA Object; Integer ISA Object; String ISA Object]
A Derived Data-Type
<VirtualSequence namespace="NCBI_gi" id="111076">
  <Integer namespace="" id="" articleName="length">38</Integer>
</VirtualSequence>
[Diagram: Integer ISA Object; String ISA Object; VirtualSequence ISA Object; VirtualSequence HASA Integer]
A Derived Data-Type
<GenericSequence namespace="NCBI_gi" id="111076">
  <Integer namespace="" id="" articleName="length">38</Integer>
  <String namespace="" id="" articleName="SequenceString">
    ATGATGATAGATAGAGGGCCCGGCGCGCGCGCGCGC
  </String>
</GenericSequence>
[Diagram: Integer ISA Object; String ISA Object; VirtualSequence ISA Object; VirtualSequence HASA Integer; GenericSequence ISA VirtualSequence; GenericSequence HASA String]
A Derived Data-Type
<DNASequence namespace="NCBI_gi" id="111076">
  <Integer namespace="" id="" articleName="length">38</Integer>
  <String namespace="" id="" articleName="SequenceString">
    ATGATGATAGATAGAGGGCCCGGCGCGCGCGCGCGC
  </String>
</DNASequence>
[Diagram: Integer ISA Object; String ISA Object; VirtualSequence ISA Object; VirtualSequence HASA Integer; GenericSequence ISA VirtualSequence; GenericSequence HASA String; DNASequence ISA GenericSequence]
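To illustrate how a client might read such an object, here is a minimal sketch that parses the DNASequence example with Python's standard `xml.etree.ElementTree`. The `read_object` helper and the shape of the dictionary it returns are assumptions made for illustration; they are not the BioMoby client API.

```python
import xml.etree.ElementTree as ET

# The DNASequence example from the slide, with plain ASCII quotes.
MESSAGE = """
<DNASequence namespace="NCBI_gi" id="111076">
  <Integer namespace="" id="" articleName="length">38</Integer>
  <String namespace="" id="" articleName="SequenceString">
    ATGATGATAGATAGAGGGCCCGGCGCGCGCGCGCGC
  </String>
</DNASequence>
"""


def read_object(xml_text: str) -> dict:
    """Return the class name, namespace, id and contained members of one object.

    Members are keyed by articleName; each value is (class name, text content).
    """
    root = ET.fromstring(xml_text.strip())
    members = {
        child.get("articleName", ""): (child.tag, (child.text or "").strip())
        for child in root
    }
    return {
        "type": root.tag,
        "namespace": root.get("namespace"),
        "id": root.get("id"),
        "members": members,
    }


if __name__ == "__main__":
    parsed = read_object(MESSAGE)
    print(parsed["type"], parsed["namespace"], parsed["id"])
    print(parsed["members"]["length"])
```

The element name carries the ontology class, while `articleName` distinguishes members that share a class.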
Legacy file formats
• Containing “String” allows us to define ontological classes that represent legacy data types (e.g. the 20 existing sequence formats!)
<NCBI_Blast_Report namespace="NCBI_gi" id="115325">
  <String namespace="" id="" articleName="content">
TBLASTN 2.0.4 [Feb-24-1998]
Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A.
Schäffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman
(1997), "Gapped BLAST and PSI-BLAST: a new generation of protein database search
programs", Nucleic Acids Res. 25:3389-3402.
Query= gi|1401126
(504 letters)
Database: Non-redundant GenBank+EMBL+DDBJ+PDB sequences
336,723 sequences; 677,679,054 total letters
Searching done
                                                                   Score     E
Sequences producing significant alignments:                        (bits)  Value
gb|U49928|HSU49928 Homo sapiens TAK1 binding protein (TAB1) mRNA...  1009  0.0
emb|Z36985|PTPP2CMR P.tetraurelia mRNA for protein phosphatase t...    58  4e-07
emb|X77116|ATMRABI1 A.thaliana mRNA for ABI1 protein                   53  1e-05
  </String>
</NCBI_Blast_Report>
Binaries – pictures, movies
• We base64 encode binaries, and then define a hierarchy of data classes that contain String
• base64_encoded_jpeg ISA text/base64 ISA text/plain HASA String
<base64_encoded_jpeg namespace="TAIR_image" id="3343532">
  <String namespace="" id="" articleName="content">
MIAGCSqGSIb3DQEHAqCAMIACAQExCzAJBgUrDgMCGgUAMIAGCSqGSIb3DQEHAQAAoIIJQDCC
Av4wggJnoAMCAQICAwhH9jANBgkqhkiG9w0BAQQFADCBkjELMAkGA1UEBhMCWkExFTATBgNV
MIAGCSqGSIb3DQEHAqCAMIACAQExCzAJBgUrDgMCGgUAMIAGCSqGSIb3DQEHAQAAoIIJQDCC
Av4wggJnoAMCAQICAwhH9jANBgkqhkiG9w0BAQQFADCBkjELMAkGA1UEBhMCWkExFTATBgNV
BAgTDFdlc3Rlcm4gQ2FwZTESMBAGA1UEBxMJQ2FwZSBUb3duMQ8wDQYDVQQKEwZUaGF3dGUx
HTAbBgNVBAsTFENlcnRpZmljYXRlIFNlcnZpY2VzMSgwJgYDVQQDEx9QZXJzb25hbCBGcmVl
bWFpbCBSU0EgMjAwMC44LjMwMB4XDTAyMDkxNTIxMDkwMVoXDTAzMDkxNTIxMDkwMVowQjEf
MB0GA1UEAxMWVGhhd3RlIEZyZWVtYWlsIE1lbWJlcjEfMB0GCSqGSIb3DQEJARYQamprM0Bt
  </String>
</base64_encoded_jpeg>
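The wrapping step can be sketched with the Python standard library. This is an illustration under assumptions: the helper names `wrap_binary` and `unwrap_binary` are hypothetical, and the element layout simply mirrors the slide (a class element containing one String with articleName "content").

```python
import base64
import xml.etree.ElementTree as ET


def wrap_binary(data: bytes, namespace: str, identifier: str,
                class_name: str = "base64_encoded_jpeg") -> str:
    """Base64-encode `data` and place it in a contained String named 'content'."""
    obj = ET.Element(class_name, {"namespace": namespace, "id": identifier})
    content = ET.SubElement(
        obj, "String", {"namespace": "", "id": "", "articleName": "content"}
    )
    content.text = base64.b64encode(data).decode("ascii")
    return ET.tostring(obj, encoding="unicode")


def unwrap_binary(xml_text: str) -> bytes:
    """Recover the original bytes from the 'content' String of a wrapped object."""
    root = ET.fromstring(xml_text)
    content = root.find("String[@articleName='content']")
    if content is None or content.text is None:
        raise ValueError("object has no 'content' String")
    return base64.b64decode(content.text)


if __name__ == "__main__":
    payload = b"\xff\xd8\xff\xe0 placeholder bytes, not a real image"
    wrapped = wrap_binary(payload, "TAIR_image", "3343532")
    print(wrapped)
    print(unwrap_binary(wrapped) == payload)
```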
Extending legacy data types
• With legacy data-types defined, we can extend them as we see fit
  • annotated_jpeg ISA base64_encoded_jpeg
  • annotated_jpeg HASA 2D_Coordinate_set
  • annotated_jpeg HASA Description
<annotated_jpeg namespace="TAIR_Image" id="3343532">
  <2D_Coordinate_set namespace="" id="" articleName="pixelCoordinates">
    <Integer namespace="" id="" articleName="x_coordinate">3554</Integer>
    <Integer namespace="" id="" articleName="y_coordinate">663</Integer>
  </2D_Coordinate_set>
  <String namespace="" id="" articleName="Description">
    This is the phenotype of a ufo-1 mutant under long daylength, 16°C
  </String>
  <String namespace="" id="" articleName="content">
MIAGCSqGSIb3DQEHAqCAMIACAQExCzAJBgUrDgMCGgUAMIAGCSqGSIb3DQEHAQAAoIIJQDCC
Av4wggJnoAMCAQICAwhH9jANBgkqhkiG9w0BAQQFADCBkjELMAkGA1UEBhMCWkExFTATBgNV
  </String>
</annotated_jpeg>
The same object…
annotated_jpeg ISA base64_encoded_jpeg HASA 2D_Coordinate_set HASA Description
<annotated_jpeg namespace="TAIR_Image" id="3343532">
  <2D_Coordinate_set namespace="" id="" articleName="pixelCoordinates">
    <Integer namespace="" id="" articleName="x_coordinate">3554</Integer>
    <Integer namespace="" id="" articleName="y_coordinate">663</Integer>
  </2D_Coordinate_set>
  <String namespace="" id="" articleName="Description">
    This is the phenotype of a ufo-1 mutant under long daylength, 16°C
  </String>
  <String namespace="" id="" articleName="content">
MIAGCSqGSIb3DQEHAqCAMIACAQExCzAJBgUrDgMCGgUAMIAGCSqGSIb3DQEHAQAAoIIJQDCC
Av4wggJnoAMCAQICAwhH9jANBgkqhkiG9w0BAQQFADCBkjELMAkGA1UEBhMCWkExFTATBgNV
  </String>
</annotated_jpeg>
The same object…
annotated_jpeg ISA base64_encoded_jpeg HASA 2D_Coordinate_set HASA Description
<annotated_jpeg namespace="TAIR_Image" id="3343532">
  <CrossReference>
    <Object namespace="TAIR_Allele" id="ufo-1"/>
  </CrossReference>
  <CrossReference>
    <Object namespace="TAIR_Tissue" id="122"/>
  </CrossReference>
  <2D_Coordinate_set namespace="" id="" articleName="pixelCoordinates">
    <Integer namespace="" id="" articleName="x_coordinate">3554</Integer>
    <Integer namespace="" id="" articleName="y_coordinate">663</Integer>
  </2D_Coordinate_set>
  <String namespace="" id="" articleName="Description">
    This is the phenotype of a ufo-1 mutant under long daylength, 16°C
  </String>
  <String namespace="" id="" articleName="content">
MIAGCSqGSIb3DQEHAqCAMIACAQExCzAJBgUrDgMCGgUAMIAGCSqGSIb3DQEHAQAAoIIJQDCC
Av4wggJnoAMCAQICAwhH9jANBgkqhkiG9w0BAQQFADCBkjELMAkGA1UEBhMCWkExFTATBgNV
  </String>
</annotated_jpeg>
How to think about MOBY Objects and Namespaces
[Diagram: a record in the “gi” Namespace (GenBank record), seen as Object X from data perspective X and as Object Y from data perspective Y]
Why define Objects in an ontology?
• Bioinformatics service providers are not all experienced programmers.
• The Moby Object Ontology provides an environment within which “naïve” service providers can create new complex data-types WITHOUT generating new flatfile formats, and without having to understand XML Schema.
• This minimizes future heterogeneity between new data-types, improving interoperability without requiring endless schema-to-schema mapping efforts.
The Object Ontology Defines an XML Schema
• Object Ontology terms have “meaningful” names, but these are for human intuition only
  • DNA Sequence, Annotated_GIF
• The Object Ontology does not define the biological meaning
• It does define the representation: SYNTAX
• Because it defines how every XML tag should be interpreted, it is superior to pure XML/XML Schema solutions
The Object Ontology Defines an XML Schema • The position of an ontology node precisely defines the syntax by which that node will be represented • End-users can define new data-types without having to write XML Schema! • This was an important aim of the project • A machine can “understand” the structure of any incoming message by querying its ontological type
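As a rough sketch of how ontology position can stand in for a schema, the code below derives the expected members of a class by walking its ISA chain and checks an incoming message against them. Everything here is an assumption for illustration: the `ONTOLOGY` table uses class names from this talk's examples, the helpers `expected_members` and `conforms` are hypothetical, and real MOBY rules (for example HAS cardinality or accepting subclasses of a member) are deliberately left out.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List, Optional, Tuple

# class name -> (ISA parent, [(articleName, member class)] contributed via HASA)
ONTOLOGY: Dict[str, Tuple[Optional[str], List[Tuple[str, str]]]] = {
    "Object": (None, []),
    "Integer": ("Object", []),
    "String": ("Object", []),
    "VirtualSequence": ("Object", [("length", "Integer")]),
    "GenericSequence": ("VirtualSequence", [("SequenceString", "String")]),
    "DNASequence": ("GenericSequence", []),
}


def expected_members(class_name: str) -> List[Tuple[str, str]]:
    """Collect HASA members from the root of the ISA chain down to `class_name`."""
    parent, own = ONTOLOGY[class_name]
    inherited = expected_members(parent) if parent is not None else []
    return inherited + own


def conforms(xml_text: str) -> bool:
    """True if the top-level element has exactly the members its class implies."""
    root = ET.fromstring(xml_text.strip())
    if root.tag not in ONTOLOGY:
        return False
    found = [(child.get("articleName", ""), child.tag) for child in root]
    return sorted(found) == sorted(expected_members(root.tag))


if __name__ == "__main__":
    good = """
    <DNASequence namespace="NCBI_gi" id="111076">
      <Integer namespace="" id="" articleName="length">38</Integer>
      <String namespace="" id="" articleName="SequenceString">ATGATG</String>
    </DNASequence>
    """
    print(expected_members("DNASequence"))
    print(conforms(good))
```

No per-type schema file is consulted: the check is driven entirely by where the class sits in the ontology.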
A portion of the MOBY-S Object Ontology …community-built!
Pipeline discovery “on the fly” • No explicit coordination between providers • Run-time discovery of appropriate Services • Automated execution of services • This is happening without semantics • Syntax only… well… almost… :-)
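Run-time discovery of this kind can be sketched as a search over a registry keyed by input and output type. The registry contents, the service names, and the function names below are all hypothetical, and matching is by exact type name only; a real MOBY Central query is richer than this.

```python
from collections import deque
from typing import List, NamedTuple, Optional


class Service(NamedTuple):
    name: str
    input_type: str
    output_type: str


# Hypothetical registry entries, registered independently by different providers.
REGISTRY: List[Service] = [
    Service("getSequence", "Object", "DNASequence"),
    Service("translate", "DNASequence", "AminoAcidSequence"),
    Service("runBlast", "AminoAcidSequence", "NCBI_Blast_Report"),
]


def find_services(input_type: str) -> List[Service]:
    """Return every registered service that consumes `input_type`."""
    return [service for service in REGISTRY if service.input_type == input_type]


def discover_pipeline(start: str, goal: str) -> Optional[List[str]]:
    """Breadth-first search for a chain of services turning `start` into `goal`."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        current, path = queue.popleft()
        if current == goal:
            return path
        for service in find_services(current):
            if service.output_type not in seen:
                seen.add(service.output_type)
                queue.append((service.output_type, path + [service.name]))
    return None


if __name__ == "__main__":
    print(discover_pipeline("Object", "NCBI_Blast_Report"))
```

The providers never coordinate with each other; the chain emerges only from the types they declared when registering.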
Conclusions from the Behaviour of this Simple Browser • Service discovery is a semantic problem • However, interoperability is not • Data integration is still a problem, both syntactic and semantic, and we’ve just made that problem worse! • SYNTAX IS NOT THE PROBLEM!!!!
Some “political” details about BioMoby as we are coming to the end of the current Genome Canada funding period and are trying to get renewal… hint, hint, if there are any GC external reviewers in the audience!
Moby: Breadth • Namespaces (semantic datatypes): 236 • Objects (data syntaxes): 161 • Service Types (analytical categories): 18 • Service Instances: 401 (+ ~200 Soaplab) • Hundreds more in “boutique” Moby registries serving specialized communities worldwide • All continents except Antarctica host Moby services
Moby: Impact • Mailing list count 175 members (84 on developers mailing list) • Google Scholar • ‘BioMOBY’ 147 • Citations of 2002 BioMOBY paper 72
Moby: Developer Activity • MOBY-DIC Chapter 7 meeting • Vancouver, May 6-8, 2005 • 23 Developers attending • Asia • USA • Canada • Germany • Spain • France • Mapped-out the route to the final 1.0 version of the API
Moby Registry Activity
[Chart, with annotation: PlaNet implements own MOBY Central]
Moby: Exemplar Users • PlaNet consortium (7+ sites, 100-130 services) • EBI – SOAPLAB – myGrid • Generation Challenge Programme of the CGIAR (18+ sites) • Genoma España uses MOBY for much of the bioinformatics service provision in the GE Bioinformatics Platform
Moby: Clients
• Gbrowse_moby (M Wilkinson): browser-style client
• Ahab & Ishmael (B Good, M Wilkinson): “BLAST” & Semantic Web style clients
• PlaNet Locus_View (H Schoof, R Ernst): aggregator-style client
• Blue-Jay (P Gordon) and Rat Genome Database prototype (S Twigger): menu-style clients
• MOBY Graphs (M Senger): auto-workflow discovery tool
• Taverna (T Oinn, M Senger, E Kawas) and MOWServ (INB, Spain): workflow builder/publisher/execution clients
  • Enhanced support for MOBY currently being built
• Eclipse plugins… etc…
Taverna Workbench Tom Oinn and Martin Senger myGrid Project
MOWServ Web interface to the Spanish Instituto Nacional de Bioinformatica MOBY Central installation
INB Collaboration: MOBY Enhancements • The INB has made several additions to the MOBY API • Detailed error reporting • Asynchronous service invocation • These will become part of the official API in the coming year.
Future plans for Moby • “Decentralization” and enrichment of the registry through distributed RDF-based service instance annotations + LSID resolution • Complete! • Mirroring of registries • RDF-based messaging • BioMoby pre-dates commodity Semantic Web tools like RDF/OWL by a couple of years…
Future plans for Moby • Mirroring of Services • Enhanced registry usage metadata capture • Ontological markup of Object Ontology Terms • Better support for Web Service tooling if possible • Unfortunately, W3C XML Schema is unable to describe MOBY messages… • Collaboration with the GBIF/DiGIR community: biodiversity information served through MOBY
A weakness of MOBY Automated service discovery is fatally flawed due to insufficiently rich semantics…
The problem with Moby Chickens go in; Pies come out!