BioMOBY: Interoperability today, Integration Tomorrow. Mark Wilkinson, iCAPTURE Centre, UBC, Vancouver, Canada. Presentation to the Australian Centre for Plant Functional Genomics Institute of Molecular Biosciences University of Queensland, Brisbane, Australia, February 29th, 2005.
…and along came Web Services • WWW forms defined in machine-readable terms together with a “yellow pages” • Define inputs and outputs of services as “primitives” in a document called an “XML Schema” • Integer, Date/Time, String • Don’t help the situation much… • A bioinformatics service that consumes a “string” might be expecting a FASTA sequence, or a keyword…?? • Web Service registries merely catalogue the chaos! • Bioinformatics has many different ‘strings’!
Who is MOBY’s audience? • Information is distributed • Beyond Flybase, MIPS, EnsEMBL and TAIR • MOST data never makes it off the scientist’s hard drive • This data should be added to the global scientific archive • Biologists, by and large, are willing and able, but… • The Web was embraced enthusiastically by biologists • In fact, most wet labs run a website! • Unfortunately, this only adds to the chaos… The interoperability solution must be simple enough for a Biologist, with a little bit of computer knowledge, to implement on their own
The MOBY-S Plan • Define data-types commonly used in bioinformatics • Organize these into an Ontology • Ontologically define web service inputs and outputs • Register the inputs and outputs in a “yellow pages” • Machines can find an appropriate service • Machines can execute that service unattended
Overview of MOBY-S Transactions [Figure: MOBY hosts & services and MOBY Central. Labels on the diagram: Sequence, Express., Protein, Alleles, …, Align, Phylogeny, Primers, Sequence, Alignment, Gene names]
What makes MOBY go? • My disappointment with archetypal web services not being (easily) able to distinguish between a FASTA sequence and a keyword led me to spend a lot of time thinking about data-types. • This consideration became the core focus of MOBY-S • Rich data-typing turns out to be largely sufficient! • Constraints on MOBY-S are much more severe than on the archetypal computer-science solution • our target audience is not high-level programmers • Defining data types with XML schema is a non-starter: IT WILL NEVER HAPPEN!
MOBY-S in detail • MOBY-S Data typing system: Semantic Type • MOBY-S Data typing system: Syntactic Type • The MOBY-S Service Ontology • The MOBY Central Registry
Define: Semantic • For a piece of data, its “semantics” are • its intention • its meaning • its raison d’être • its context • its relationship to other data
MOBY-S Semantic Typing: Namespaces • Any identifiable piece of data is an “entity” • Identifiers fall into particular “Namespaces” • NCBI has gi numbers (gi Namespace) • GO Terms have accession numbers (GO Namespace) • Namespaces indicate data’s semantic type. • GO:0003476 a Gene Ontology Term • gi|163483 a GenBank record • However, we cannot tell if it is protein, RNA, or DNA sequence • Namespace + ID precisely specifies a data “entity” • The Namespace is assumed to be sufficiently descriptive of the data’s semantic type that a service provider can define their interface in terms of Namespaces
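A minimal sketch of the Namespace + ID idea, assuming a service declares the Namespaces it accepts as a plain set. The names `Entity`, `accepts` and `ACCEPTED_NAMESPACES` are illustrative only and are not part of MOBY-S; the `NCBI_gi` spelling is borrowed from the BLAST report example later in the deck.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    """A data entity identified by Namespace + ID (illustrative class)."""
    namespace: str
    id: str


# Hypothetical service interface, declared purely in terms of Namespaces.
ACCEPTED_NAMESPACES = {"GO"}


def accepts(entity, accepted=ACCEPTED_NAMESPACES):
    """Return True if the entity's Namespace is one the service declared."""
    return entity.namespace in accepted


go_term = Entity("GO", "0003476")      # a Gene Ontology Term
genbank = Entity("NCBI_gi", "163483")  # a GenBank record

print(accepts(go_term))
print(accepts(genbank))
```

The Namespace tells the service what kind of thing the identifier refers to; it says nothing about how the data itself is represented, which is the job of the syntactic type.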
MOBY-S in detail • MOBY-S Data typing system: Semantic Type • MOBY-S Data typing system: Syntactic Type • The MOBY-S Service Ontology • The MOBY Central Registry
Define: Syntax • For a piece of data, its “syntax” is • its representation • its form • its structure • its language (of representation)
MOBY-S Syntactic Typing: The Object Ontology • Syntactic types are defined by a GO-like ontology • Type (“Class”) name at each node • Edges define the relationships between Classes • GO used as a model because of its comprehension & familiarity • Edges define one of three relationships • ISA • Inheritance relationship • All properties of the parent are present in the child • HASA • Container relationship of ‘exactly 1’ • HAS • Container relationship with ‘1 or more’
Define: Ontology • A systematic representation of the entities that exist in a domain of discourse, and the relationships between them. [Figure: example ontology with nodes Female, Mother, Child, Father, Male and edges hasGender, hasParent, partnerOf]
A portion of the MOBY-S Object Ontology …community-built!
The Object Ontology: A small slice Generic Sequence
What’s an “Object”? • The smallest unit of information that can be passed by MOBY-S • Consists simply of • Namespace • ID • Thus an Object is nothing more than a “reference” to a data entity
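As a rough illustration, a bare Object can be built with Python's standard `xml.etree.ElementTree`. The element and attribute names follow the `<Object namespace=… id=…/>` form used in the cross-reference example later in the deck; the helper `make_object` is an assumption made for this sketch.

```python
import xml.etree.ElementTree as ET


def make_object(namespace, id_):
    """Build a bare Object element: a reference carrying only Namespace and ID."""
    return ET.Element("Object", {"namespace": namespace, "id": id_})


obj = make_object("TAIR_Allele", "ufo-1")
xml_text = ET.tostring(obj, encoding="unicode")
print(xml_text)
```

The element has no content of its own, which matches the slide's point that an Object is only a reference to a data entity.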
ISA relationship - inheritance • Classes become more specialized as you move along the ISA relationship hierarchy • DNA_Sequence ISA Nucleotide_Sequence ISA Generic_Sequence ISA Virtual_Sequence ISA Object • Classes do not become more complex as a result of ISA relationships alone
HASA & HAS relationships • HASA and HAS relationships make Classes more complex by embedding Classes within Classes • Virtual_Sequence ISA Object • Virtual_Sequence HASA Length(Integer) • Generic_Sequence ISA Virtual_Sequence • Generic_Sequence HASA Sequence(String) • Annotated_GIF ISA Image(base_64_GIF) • Annotated_GIF HAS Description(String)
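The ISA and HASA edges listed on these slides can be sketched as two small lookup tables. The dictionary layout and the helper names `lineage` and `members` are assumptions made for illustration; only the edges themselves come from the slides.

```python
# ISA edges from the slides: child class -> parent class.
ISA = {
    "Virtual_Sequence": "Object",
    "Generic_Sequence": "Virtual_Sequence",
    "Nucleotide_Sequence": "Generic_Sequence",
    "DNA_Sequence": "Nucleotide_Sequence",
}

# HASA edges from the slides: class -> [(member name, member class)],
# each member present exactly once.
HASA = {
    "Virtual_Sequence": [("Length", "Integer")],
    "Generic_Sequence": [("Sequence", "String")],
}


def lineage(cls):
    """Follow ISA edges from a class up to the root of the ontology."""
    chain = [cls]
    while chain[-1] in ISA:
        chain.append(ISA[chain[-1]])
    return chain


def members(cls):
    """Collect HASA members inherited along the ISA chain, root first."""
    result = []
    for ancestor in reversed(lineage(cls)):
        result.extend(HASA.get(ancestor, []))
    return result


print(lineage("DNA_Sequence"))
print(members("DNA_Sequence"))
```

Walking the ISA chain adds no structure by itself; the members of DNA_Sequence come only from the HASA edges of its ancestors.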
The Object Ontology: A small slice Generic Sequence
Legacy file formats • Inheriting from “String” allows us to define ontological classes that represent legacy data types (e.g. the 20 existing sequence formats!) • NCBI_Blast_Report ISA text-formatted ISA String • <NCBI_Blast_Report namespace=‘NCBI_gi’ id=‘115325’> • TBLASTN 2.0.4 [Feb-24-1998] • Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. • Schäffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman • (1997), "Gapped BLAST and PSI-BLAST: a new generation of protein database search • programs", Nucleic Acids Res. 25:3389-3402. • Query= gi|1401126 • (504 letters) • Database: Non-redundant GenBank+EMBL+DDBJ+PDB sequences • 336,723 sequences; 677,679,054 total letters • Searchingdone • Score E • Sequences producing significant alignments: (bits) Value • gb|U49928|HSU49928 Homo sapiens TAK1 binding protein (TAB1) mRNA... 1009 0.0 • emb|Z36985|PTPP2CMR P.tetraurelia mRNA for protein phosphatase t... 58 4e-07 • emb|X77116|ATMRABI1 A.thaliana mRNA for ABI1 protein 53 1e-05 • gb|U12856|ATU12856 Arabidopsis thaliana Col-0 abscisic acid inse... 53 1e-05 • </NCBI_Blast_Report>
Binaries – pictures, movies • We base64 encode binaries, and then define data classes that inherit from String • base64_encoded_jpeg ISA text/base64 ISA text/plain ISA String • <base64_encoded_jpeg namespace=‘TAIR_image’ id=‘3343532’> • MIAGCSqGSIb3DQEHAqCAMIACAQExCzAJBgUrDgMCGgUAMIAGCSqGSIb3DQEHAQAAoIIJQDCC • Av4wggJnoAMCAQICAwhH9jANBgkqhkiG9w0BAQQFADCBkjELMAkGA1UEBhMCWkExFTATBgNV • MIAGCSqGSIb3DQEHAqCAMIACAQExCzAJBgUrDgMCGgUAMIAGCSqGSIb3DQEHAQAAoIIJQDCC • Av4wggJnoAMCAQICAwhH9jANBgkqhkiG9w0BAQQFADCBkjELMAkGA1UEBhMCWkExFTATBgNV • BAgTDFdlc3Rlcm4gQ2FwZTESMBAGA1UEBxMJQ2FwZSBUb3duMQ8wDQYDVQQKEwZUaGF3dGUx • HTAbBgNVBAsTFENlcnRpZmljYXRlIFNlcnZpY2VzMSgwJgYDVQQDEx9QZXJzb25hbCBGcmVl • bWFpbCBSU0EgMjAwMC44LjMwMB4XDTAyMDkxNTIxMDkwMVoXDTAzMDkxNTIxMDkwMVowQjEf • MB0GA1UEAxMWVGhhd3RlIEZyZWVtYWlsIE1lbWJlcjEfMB0GCSqGSIb3DQEJARYQamprM0Bt • </base64_encoded_jpeg>
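A short sketch of the encoding step, using Python's standard `base64` and `xml.etree.ElementTree` modules. The helper `wrap_binary` and the placeholder bytes are assumptions for illustration; the class name, namespace and id are taken from the slide's example.

```python
import base64
import xml.etree.ElementTree as ET


def wrap_binary(class_name, namespace, id_, raw):
    """Base64-encode raw bytes and carry them as the text of a String-derived class."""
    elem = ET.Element(class_name, {"namespace": namespace, "id": id_})
    elem.text = base64.b64encode(raw).decode("ascii")
    return elem


# Placeholder bytes standing in for a real image file.
fake_jpeg = b"\xff\xd8\xff\xe0 not a real image"

elem = wrap_binary("base64_encoded_jpeg", "TAIR_image", "3343532", fake_jpeg)

# A consumer reverses the encoding to get the original bytes back.
recovered = base64.b64decode(elem.text)
print(recovered == fake_jpeg)
```

Because the payload is plain text after encoding, the class can inherit from String like any legacy text format.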
Extending legacy data types • With legacy data-types defined, we can extend them as we see fit • annotated_jpeg ISA base64_encoded_jpeg • annotated_jpeg HASA 2D_Coordinate_set • annotated_jpeg HASA Description • <annotated_jpeg namespace=‘TAIR_Image’ id=‘3343532’> • <2D_Coordinate_set namespace=‘’ id=‘’ articleName=“pixelCoordinates”> • <Integer namespace=‘’ id=‘’ articleName=“x_coordinate”>3554</Integer> • <Integer namespace=‘’ id=‘’ articleName=“y_coordinate”>663</Integer> • </2D_Coordinate_set> • <String namespace=‘’ id=‘’ articleName=“Description”> • This is the phenotype of a ufo-1 mutant under long daylength, 16’C • </String> • MIAGCSqGSIb3DQEHAqCAMIACAQExCzAJBgUrDgMCGgUAMIAGCSqGSIb3DQEHAQAAoIIJQDCC • Av4wggJnoAMCAQICAwhH9jANBgkqhkiG9w0BAQQFADCBkjELMAkGA1UEBhMCWkExFTATBgNV • </annotated_jpeg>
The same object… • annotated_jpeg ISA base64_encoded_jpeg HASA 2D_Coordinate_set HASA Description • <annotated_jpeg namespace=‘TAIR_Image’ id=‘3343532’> • <2D_Coordinate_set namespace=‘’ id=‘’ articleName=“pixelCoordinates”> • <Integer namespace=‘’ id=‘’ articleName=“x_coordinate”> 3554 </Integer> • <Integer namespace=‘’ id=‘’ articleName=“y_coordinate”> 663 </Integer> • </2D_Coordinate_set> • <String namespace=‘’ id=‘’ articleName=“Description”> • This is the phenotype of a ufo-1 mutant under long daylength, 16’C • </String> • MIAGCSqGSIb3DQEHAqCAMIACAQExCzAJBgUrDgMCGgUAMIAGCSqGSIb3DQEHAQAAoIIJQDCC • Av4wggJnoAMCAQICAwhH9jANBgkqhkiG9w0BAQQFADCBkjELMAkGA1UEBhMCWkExFTATBgNV • </annotated_jpeg>
The same object… • annotated_jpeg ISA base64_encoded_jpeg HASA 2D_Coordinate_set HASA Description • <CrossReference> <Object namespace=“TAIR_Allele” id=“ufo-1”/> </CrossReference> • <CrossReference> <Object namespace=‘TAIR_Tissue’ id=‘122’/> </CrossReference> • <annotated_jpeg namespace=‘TAIR_Image’ id=‘3343532’> • <2D_Coordinate_set namespace=‘’ id=‘’ articleName=“pixelCoordinates”> • <Integer namespace=‘’ id=‘’ articleName=“x_coordinate”> 3554 </Integer> • <Integer namespace=‘’ id=‘’ articleName=“y_coordinate”> 663 </Integer> • </2D_Coordinate_set> • <String namespace=‘’ id=‘’ articleName=“Description”> • This is the phenotype of a ufo-1 mutant under long daylength, 16’C • </String> • MIAGCSqGSIb3DQEHAqCAMIACAQExCzAJBgUrDgMCGgUAMIAGCSqGSIb3DQEHAQAAoIIJQDCC • Av4wggJnoAMCAQICAwhH9jANBgkqhkiG9w0BAQQFADCBkjELMAkGA1UEBhMCWkExFTATBgNV • </annotated_jpeg>
The Object Ontology: Defines an XML Schema! • Object Ontology terms have semantically rich names, but this is for human intuition only • DNA Sequence • Annotated_GIF • Object Ontology does not define the meaning • NO SEMANTICS • (at least, to the machine…) • It does define the XML Schema of their representation • SYNTAX • An interesting discussion ensues from this • Does MOBY-S rely on human-readable semantics? • Does it matter?
The Object Ontology: Defines an XML Schema! • The position of an ontology node precisely defines the syntax by which that node will be represented • End-users can define new data-types without having to write XML Schema! • This was an important aim of the project • A machine can “understand” the structure of any incoming message by querying its ontological type!
MOBY-S in detail • MOBY-S Data typing system: Semantic Type • MOBY-S Data typing system: Syntactic Type • The MOBY-S Service Ontology • The MOBY Central Registry
The Service Ontology • A simple ISA hierarchy • Primitive types include: • Analysis • Parsing • Registration • Retrieval • Resolution • Conversion
A slice of the Service Ontology [Figure: ISA hierarchy with nodes Service, Parsing, Parse_NCBI_Blast, Analysis, Alignment, Blast, NCBI_Blast, WU_Blast]
MOBY-S in detail • MOBY-S Data typing system: Semantic Type • MOBY-S Data typing system: Syntactic Type • The MOBY-S Service Ontology • The MOBY Central Registry
MOBY Central: The yellow pages • A registry for MOBY-compliant services • Services register: • “Service Signature” - a triple of [input, service_type, output] • A human readable description of the service • The URL to the service interface • Provides two types of interfaces: • Register/Deregister • Search/Retrieve
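The registry's two interfaces can be sketched as a small in-memory class. This is only an illustration under stated assumptions: the `Service` and `Registry` classes, the exact-match search, the service name, the output type `Blast_Hit_List` and the example URL are all hypothetical and do not reflect the actual MOBY Central API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Service:
    """What a provider registers: a signature, a description, and a URL."""
    name: str
    input_type: str
    service_type: str
    output_type: str
    description: str
    url: str


class Registry:
    """Toy registry exposing Register/Deregister and Search/Retrieve."""

    def __init__(self):
        self._services = {}

    def register(self, service):
        self._services[service.name] = service

    def deregister(self, name):
        self._services.pop(name, None)

    def search(self, input_type=None, service_type=None, output_type=None):
        """Return services whose signature matches every criterion given.

        Matching here is exact string equality; None means "any".
        """
        wanted = (input_type, service_type, output_type)
        hits = []
        for s in self._services.values():
            signature = (s.input_type, s.service_type, s.output_type)
            if all(w is None or w == got for w, got in zip(wanted, signature)):
                hits.append(s)
        return hits


central = Registry()
central.register(Service(
    name="parse_blast_example",           # hypothetical service name
    input_type="NCBI_Blast_Report",
    service_type="Parsing",
    output_type="Blast_Hit_List",         # hypothetical output class
    description="Parses an NCBI BLAST report into a list of hits",
    url="http://example.org/moby/parse",  # placeholder URL
))

found = central.search(input_type="NCBI_Blast_Report")
print([s.name for s in found])
```

A client holding an NCBI_Blast_Report can ask what can be done with it without knowing any service in advance, which is the discovery step the browser slides rely on.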
A Simple MOBY-S Web Browser • It isn’t a particularly powerful program • It does not display the full “power” of the MOBY-S system • However, it reveals some interesting “behaviors” that have never been observed before… ever! • Biologists tend to find this interface “useless!” • Computer scientists think it’s “Neat!!”
Semantic Web “on the fly”! • This simple browser behaves very much like a semantic web browser • No explicit coordination • Dynamic discovery • Automatic retrieval and execution • This is happening without semantics • Syntax only! (well… almost…) • This is nice! • syntactic solutions are easy to build • semantic solutions are very Very VERY hard!
Conclusions from this Simple Browser Behavior • Perhaps service interoperability is not a significantly semantic problem?!? • Service discovery is definitely a semantic problem • Data integration is still a problem, and we’ve just made that problem worse!
Ugh…. Frustrating!! • The simple browser is too frustrating • design once, run once • Analysis of only one data-element at a time • No way to extract the data at the end of the analysis • No provenance information is saved • myGrid (UK) is working on similar problems • myGrid has built MOBY-S support into one of their new tools
TAVERNA • A fantastic client program that can now talk to MOBY Central and execute MOBY Services • Taverna was written by Tom Oinn, with MOBY-S input by Martin Senger, as part of the myGrid project