Bio-ontologies for Annotation and Service Discovery • Chris Wroe (+ material from Carole Goble, Alan Rector, Jeremy Rogers, Ian Horrocks), University of Manchester, UK
Overview • Example-driven tour of the why, what and how of ontologies in the life sciences • Cover the key features of an ontology • Vocabulary, definitions, hierarchies, grammar & reasoning • Cover the key targets of ontology use • Biological knowledge, service descriptions, (database schema)
Ontology – the discipline • Semantics – the meaning of meaning. • Philosophical discipline, branch of philosophy that deals with the nature and the organisation of reality. • Science of Being (Aristotle, Metaphysics, IV,1) • What is being? • What are the features common to all beings?
In science…ontology the thing • A resource to aid the precise communication and integration of information • Binds a community to communicate information in some domain of interest in a consistent manner.
Gene Ontology – a community effort • Model organism databases need to be integrated • Not possible if they all use a different vocabulary • Gene Ontology Consortium got together to form • “a dynamic controlled vocabulary that can be applied to all eukaryotes”
Gene Ontology – keeping it simple • Provide three separate vocabularies to describe: • The function a gene product is capable of. • The process a gene product takes part in. • The location at which the gene product has been found.
Annotation • Feature 1: Ontologies provide a shared controlled vocabulary of concepts. • Figure: GO annotations on the gene detail page in MGD for the vitamin D receptor gene, Vdr
Gene ontology - definitions • Feature 2: Ontologies provide an agreed definition for each concept to ensure each concept is used in the same way. • A diverse community, so explicit definitions are important. • 60% of GO concepts have a textual definition, e.g. • apoptotic nuclear changes (GO:0030262): Changes affecting the nucleus and its contents during apoptosis; includes condensation and fragmentation of nuclear DNA and of the nucleus itself.
Gene ontology – organisation • An alphabetical list of 11,000 terms is not enough • Hierarchies allow similar terms to be grouped together. • Example terms from the hierarchy: biological process; death; cell death; tissue death; necrosis; histolysis
Gene ontology – hierarchy use • GO hierarchy is used for • Navigation of concepts by users • Indexing of information in databases • Aggregating information
Taxonomy remark 1 • The world is not a tree, it’s a lattice • Diagram labels: animal, wild, vermin, domestic, pet, working, rodent, dog, cow, mouse, cat
Taxonomy remark 2 • What does the taxonomy mean? • Concept A is a parent of concept B iff every instance of B is also an instance of A • Superset/subset • ICONCLASS example, entries listed under Door: Closing the Door, Monumental Door, Metalwork of a Door, Door-Knocker, Threshold, Door-keeper • Callouts on the entries: action associated with a door; kind of a door; something attached to a door
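The superset/subset reading of "parent" can be checked mechanically on a toy example. The sketch below models each concept as a fixed set of instances; the individuals (rex, tom, and so on) are invented for illustration, and a real ontology requires the condition to hold for every possible instance, not just one snapshot.

```python
# Each concept is modelled as the set of its instances.
# All individuals here are hypothetical and exist only for this sketch.
instances = {
    "animal": {"rex", "tom", "daisy"},
    "dog": {"rex"},
    "cat": {"tom"},
    "pet": {"rex", "tom"},
    "door": {"front door", "barn door"},
    "door-keeper": {"porter"},
}


def is_parent(a: str, b: str) -> bool:
    """A is a parent of B iff every instance of B is also an instance of A."""
    return instances[b] <= instances[a]


print(is_parent("animal", "dog"))        # True
print(is_parent("dog", "pet"))           # False
print(is_parent("door", "door-keeper"))  # False
```

Under this test an entry such as Door-keeper filed under Door is not an is-a child: a door-keeper is not a kind of door.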
Classification trickiness • The Celestial Emporium of Benevolent Knowledge, Borges • "On those remote pages it is written that animals are divided into: a. those that belong to the Emperor b. embalmed ones c. those that are trained d. suckling pigs e. mermaids f. fabulous ones g. stray dogs h. those that are included in this classification i. those that tremble as if they were mad j. innumerable ones k. those drawn with a very fine camel's hair brush l. others m. those that have just broken a flower vase n. those that resemble flies from a distance"
Classification is task and culture specific • Dyirbal classification of objects in the universe: • Bayi: men, kangaroos, possums, bats, most snakes, most fishes, some birds, most insects, the moon, storms, rainbows, boomerangs, some spears, etc. • Balan: women, anything connected with water or fire, bandicoots, dogs, platypus, echidna, some snakes, some fishes, most birds, fireflies, scorpions, crickets, the stars, shields, some spears, some trees, etc. • Balam: all edible fruit and the plants that bear them, tubers, ferns, honey, cigarettes, wine, cake. • Bala: parts of the body, meat, bees, wind, yamsticks, some spears, most trees, grass, mud, stones, noises, language, etc.
Gene ontology – directed acyclic graphs • Each concept is explicitly grouped by either is-a or part-of relationships • Functions are often grouped by type • Cellular components are often grouped by part • Each concept can have multiple parents • A concept’s position is represented by a directed acyclic graph • Hierarchies are handcrafted so as to suit the ‘culture’ of biologists
Feature 3: Ontologies organise concepts in multiple ways for multiple uses. Principle of grouping should be explicit.
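A minimal sketch of what "multiple parents in a directed acyclic graph" means in practice, using the animal lattice as data. The specific edges are one plausible reading of that diagram and are an assumption, as are the function names; the point is only that ancestors are found by graph traversal and that information can be aggregated upwards without double counting.

```python
from collections import Counter

# child -> set of is-a parents. These edges are assumed for illustration;
# the slide's diagram may have been drawn differently.
PARENTS = {
    "animal": set(),
    "wild": {"animal"},
    "domestic": {"animal"},
    "rodent": {"animal"},
    "vermin": {"wild"},
    "pet": {"domestic"},
    "working": {"domestic"},
    "mouse": {"rodent", "vermin", "pet"},
    "dog": {"pet", "working"},
    "cat": {"pet"},
    "cow": {"working"},
}


def ancestors(term):
    """All concepts reachable by following parent links (term itself excluded)."""
    seen = set()
    stack = list(PARENTS[term])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS[current])
    return seen


def aggregate(direct_counts):
    """Roll direct counts up the graph; each term counts once per ancestor."""
    totals = Counter()
    for term, count in direct_counts.items():
        for target in {term} | ancestors(term):
            totals[target] += count
    return totals


totals = aggregate({"mouse": 2, "dog": 3, "cow": 1})
print(sorted(ancestors("mouse")))
print(totals["pet"], totals["domestic"], totals["animal"])  # 5 6 6
```

Because ancestors are collected into a set, "dog" contributes to "domestic" only once even though it reaches it through both "pet" and "working".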
Taking it further • GO concepts are often phrases • insulin control element activator complex, insulin processing, insulin receptor, insulin receptor complex, insulin receptor ligand, insulin receptor signalling pathway, insulin secretion, insulin-activated sodium/amino acid transporter • The components of each phrase are hidden from computer applications
Explicit conceptualisation • Semantic similarity searching • Automated maintenance of hierarchies • What we need is: • A formal grammar with which to compose phrases • Software which can interpret phrases and produce sound and complete hierarchies
The exploding bicycle • ICD-9 (E826): 8 • READ-2 (T30..): 81 • READ-3: 87 • ICD-10 (V10-19): 587 • Example codes: • V31.22 Occupant of three-wheeled motor vehicle injured in collision with pedal cycle, person on outside of vehicle, nontraffic accident, while working for income • W65.40 Drowning and submersion while in bath-tub, street and highway, while engaged in sports activity • X35.44 Victim of volcanic eruption, street and highway, while resting, sleeping, eating or engaging in other vital activities
Defusing the exploding bicycle: 500 codes in pieces • 10 things to hit… • Pedestrian / cycle / motorbike / car / HGV / train / unpowered vehicle / a tree / other • 5 roles for the injured… • Driving / passenger / cyclist / getting in / other • 5 activities when injured… • resting / at work / sporting / at leisure / other • 2 contexts… • In traffic / not in traffic • V12.24 Pedal cyclist injured in collision with two- or three-wheeled motor vehicle, unspecified pedal cyclist, nontraffic accident, while resting, sleeping, eating or engaging in other vital activities
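The arithmetic behind "500 codes in pieces" is a product of the axis sizes. The short sketch below uses only the counts stated on the slide; the dictionary keys are labels chosen here for readability.

```python
from math import prod

# Number of options on each axis, as stated on the slide.
axes = {
    "thing hit": 10,
    "role of the injured": 5,
    "activity when injured": 5,
    "context": 2,
}

building_blocks = sum(axes.values())  # pieces that have to be maintained
combinations = prod(axes.values())    # distinct codes obtainable by composition

print(building_blocks, combinations)  # 22 500
```

Maintaining 22 building blocks and a composition rule replaces the enumeration of 500 precoordinated codes.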
Coordination: Conceptual Lego • Building-block labels in the figure: hand, extremity, body, lung, inflammation, infection, abnormal, normal, gene, protein, cell, expression, chronic, acute, bacterial, deletion, polymorphism, ischaemic
Conceptual Lego • “SNPolymorphism of CFTRGene causing Defect in MembraneTransport of ChlorideIon causing Increase in Viscosity of Mucus in CysticFibrosis…” • “Hand which is anatomically normal”
DAML+OIL • Specifically designed for building concept descriptions compositionally • Becoming a standard ontology interchange language • Adopted by W3C and will soon become the Web Ontology Language (OWL)
Reasoning support • Consistency: check if knowledge is meaningful • Subsumption: structure knowledge, compute taxonomy • Equivalence: check if two classes denote the same set of instances • Instantiation: check if individual i is an instance of class C • Retrieval: retrieve the set of individuals that instantiate C • These problems are all reducible to consistency (satisfiability)
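The reductions to satisfiability can be written out explicitly. What follows is the standard description-logic formulation rather than anything taken from the slides; it assumes a logic with full negation, and the symbols $\mathcal{K}$ (knowledge base), $C$, $D$ (classes) and $i$ (individual) are notation chosen here.

```latex
\begin{align*}
C \sqsubseteq_{\mathcal{K}} D
  &\iff C \sqcap \neg D \text{ is unsatisfiable w.r.t. } \mathcal{K} \\
C \equiv_{\mathcal{K}} D
  &\iff C \sqsubseteq_{\mathcal{K}} D \text{ and } D \sqsubseteq_{\mathcal{K}} C \\
\mathcal{K} \models i : C
  &\iff \mathcal{K} \cup \{\, i : \neg C \,\} \text{ is inconsistent} \\
\mathrm{retrieve}_{\mathcal{K}}(C)
  &= \{\, i \mid \mathcal{K} \models i : C \,\}
\end{align*}
```

So a single satisfiability (consistency) procedure is enough to answer subsumption, equivalence, instantiation and retrieval queries.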
Gene Ontology Next Generation • Early aim • Proof of concept showing DAML+OIL & description logic can practically help in at least one aspect of GO maintenance. • In cooperation with Mike Ashburner and the GO editorial team • Further aims • Prototype an evolutionary environment in which the benefits can be replicated on a larger scale
Preliminary task • Providing an exhaustive is-a taxonomy • The GO is-a taxonomy is a poly-hierarchy • It becomes increasingly laborious to make sure that all concepts are linked to all possible is-a parents
Metabolism terms: e.g. heparin biosynthesis • Axis 1: Chemicals • [chemical] biosynthesis (GO:0009058) • [i] carbohydrate biosynthesis (GO:0016051) • [i] aminoglycan biosynthesis (GO:0006023) • [i] glycosaminoglycan biosynthesis (GO:0006024) • [i] heparin biosynthesis (GO:0030210) • Axis 2: Process • [i] heparin metabolism (GO:0030202) • [i] heparin biosynthesis (GO:0030210)
Is this important? • A complete taxonomy is not necessary for browsing by a biologist (and may actually get in the way) • BUT… it improves the fidelity of DB record retrieval. • Asking for records annotated with ‘glycosaminoglycan biosynthesis’ or a more specific term returns an additional result: O94923 SPTr ISS - D-glucuronyl C5-epimerase (Fragment)
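The retrieval effect can be reproduced on a toy graph. In this sketch the term names come from the heparin example, but two things are assumptions: that record O94923 is annotated directly with 'heparin biosynthesis', and the function names, which are invented here. The query is run once without and once with the is-a link from 'heparin biosynthesis' to 'glycosaminoglycan biosynthesis'.

```python
# child -> is-a parents, before the missing link is added.
IS_A = {
    "heparin biosynthesis": {"heparin metabolism"},
    "heparin metabolism": set(),
    "glycosaminoglycan biosynthesis": {"aminoglycan biosynthesis"},
    "aminoglycan biosynthesis": set(),
}

# Assumption for illustration: the record carries this direct annotation.
ANNOTATIONS = {"O94923": {"heparin biosynthesis"}}


def subsumed_by(term, query, is_a):
    """True if term equals query or reaches it by following is-a links."""
    stack, seen = [term], set()
    while stack:
        current = stack.pop()
        if current == query:
            return True
        if current not in seen:
            seen.add(current)
            stack.extend(is_a.get(current, ()))
    return False


def retrieve(query, is_a):
    """Records annotated with the query term or anything more specific."""
    return sorted(
        record
        for record, terms in ANNOTATIONS.items()
        if any(subsumed_by(term, query, is_a) for term in terms)
    )


before = retrieve("glycosaminoglycan biosynthesis", IS_A)

# Add the missing is-a link and repeat the same query.
completed = {child: set(parents) for child, parents in IS_A.items()}
completed["heparin biosynthesis"].add("glycosaminoglycan biosynthesis")
after = retrieve("glycosaminoglycan biosynthesis", completed)

print(before, after)  # [] ['O94923']
```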
How can we support the task? • Step 0. Translate to DAML+OIL syntax • Provided by OilEd • Provide DAML+OIL based definitions of GO concepts – initially in the metabolism area
DAML+OIL definitions for metabolism concepts • heparin biosynthesis • class heparin biosynthesis defined subClassOf biosynthesis restriction onProperty acts_on hasClass heparin (acts_on is unique) • Paraphrase: biosynthesis which acts solely on heparin • glycosaminoglycan biosynthesis • class glycosaminoglycan biosynthesis defined subClassOf biosynthesis restriction onProperty acts_on hasClass glycosaminoglycan • Feature 4: Ontologies provide a formal computer interpretable concept definition.
A chemical ontology • Initially used MeSH to create a DAML+OIL ontology from a subset of the chemical taxonomy (using UMLS tools/API) • Provides the following information: carbohydrates [i] polysaccharides [i] glycosaminoglycans [i] heparin
Reason over the combination • Combine GO definitions with chemical ontology using OilEd API • Send to FaCT DL reasoner…
Paraphrased reasoning process • heparin biosynthesis • class heparin biosynthesis defined subClassOf biosynthesis restriction onProperty acts_on hasClass heparin • glycosaminoglycan biosynthesis • class glycosaminoglycan biosynthesis defined subClassOf biosynthesis restriction onProperty acts_on hasClass glycosaminoglycan • Diagram: is-a link from heparin to glycosaminoglycan
Inferring a new is-a link • heparin biosynthesis • class heparin biosynthesis defined subClassOf biosynthesis restriction onProperty acts_on hasClass heparin • glycosaminoglycan biosynthesis • class glycosaminoglycan biosynthesis defined subClassOf biosynthesis restriction onProperty acts_on hasClass glycosaminoglycan • Diagram: is-a link from heparin to glycosaminoglycan, and a new inferred is-a link from heparin biosynthesis to glycosaminoglycan biosynthesis • Feature 5: Ontologies can become a dynamic service with reasoning support.
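The inference can be imitated with a very small structural check. This is a toy stand-in, not how the FaCT reasoner works: it assumes every definition has the shape "base class plus one acts_on filler", and it treats one definition as subsuming another when the bases match and the fillers are related in the chemical taxonomy. The class and function names are invented here, chemical names are written in the singular, and the 'carbohydrate biosynthesis' definition is assumed to follow the same pattern as the two shown.

```python
from dataclasses import dataclass

# Chemical taxonomy, child -> parent (None marks the root).
CHEMICAL_IS_A = {
    "heparin": "glycosaminoglycan",
    "glycosaminoglycan": "polysaccharide",
    "polysaccharide": "carbohydrate",
    "carbohydrate": None,
}


@dataclass(frozen=True)
class Definition:
    base: str     # class being specialised, e.g. "biosynthesis"
    acts_on: str  # filler of the acts_on restriction


DEFINITIONS = {
    "heparin biosynthesis": Definition("biosynthesis", "heparin"),
    "glycosaminoglycan biosynthesis": Definition("biosynthesis", "glycosaminoglycan"),
    "carbohydrate biosynthesis": Definition("biosynthesis", "carbohydrate"),  # assumed
}


def chemical_subsumes(general, specific):
    """True if `specific` equals `general` or lies below it in the taxonomy."""
    current = specific
    while current is not None:
        if current == general:
            return True
        current = CHEMICAL_IS_A[current]
    return False


def subsumes(general, specific):
    """Toy subsumption test between two defined process classes."""
    g, s = DEFINITIONS[general], DEFINITIONS[specific]
    return g.base == s.base and chemical_subsumes(g.acts_on, s.acts_on)


def inferred_parents(term):
    """Most specific defined classes that subsume `term`."""
    candidates = {c for c in DEFINITIONS if c != term and subsumes(c, term)}
    return {
        c for c in candidates
        if not any(o != c and subsumes(c, o) for o in candidates)
    }


print(inferred_parents("heparin biosynthesis"))
# {'glycosaminoglycan biosynthesis'}
```

'carbohydrate biosynthesis' also subsumes 'heparin biosynthesis', but it is filtered out as a direct parent because a more specific subsumer exists.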
Output • OilEd API reports additional inferred is-a relationships, e.g. heparin biosynthesis has new is-a parent glycosaminoglycan biosynthesis • Sanitised version sent to GO editorial team for comment. • The team (Jane Lomax) makes changes to GO if appropriate and sends back queries
Results • Carbohydrate metabolism • 22 additional is-a links, 17 of which are now in GO • Amino acid metabolism • A further 17 additional is-a links, now in GO • Currently preparing results for metabolism as a whole
Where next with GONG? • Moving from proof of concept requires dedicated software tools to support the process. • Authoring/ Curation of DAML+OIL definitions • Tracking GO as it evolves • Tracking suggested changes and response to changes.
myGrid & high level ontologies • myGrid: Personalised extensible environments for data-intensive in silico experiments in biology • Higher level services: workflow, databases, knowledge management, provenance… • Bioinformatics services are published as Web services (and soon Grid Services) • http://www.ebi.ac.uk/collab/mygrid/service0/axis/index.html
Ontologies for Service Discovery • Find appropriate type of services • sequence alignment • Find appropriate instances of that service • BLAST (an algorithm for sequence alignment), as delivered by NCBI • Assist in forming an appropriate assembly of discovered services. • Find, select and execute instances of services while the workflow is being enacted. Knowledge in the head of expert bioinformatician
An in silico experiment as a workflow • Workflow figure, labels: RASMOL, Similar, Structure, Protein, Fetch, Fetch, sequences, modelling, name, View, WF
Four-tiered service descriptions (from domain “semantic” at the top to business “operational” at the bottom) • Class of service: • a protein sequence alignment, a protein sequence database. • Specific example of an abstract service: • BLAST, SWISS-PROT. • Instance service description of a specific service: • BLAST, SWISS-PROT as offered by the EBI. • Invoked instance service description: • BLAST as offered by the EBI on a particular date, with particular parameters when a service was actually enacted.
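One way to picture the four tiers is as nested records, each pointing to the tier above it. The sketch below is illustrative only: the class names, field names and the registry are hypothetical and are not the myGrid data model. It shows discovery working from a class of service down to concrete instances.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceClass:        # tier 1: class of service
    name: str


@dataclass(frozen=True)
class AbstractService:     # tier 2: specific example of an abstract service
    name: str
    service_class: ServiceClass


@dataclass(frozen=True)
class ServiceInstance:     # tier 3: a specific service as offered by a provider
    service: AbstractService
    provider: str


@dataclass(frozen=True)
class Invocation:          # tier 4: one enactment of an instance
    instance: ServiceInstance
    date: str
    parameters: tuple


ALIGNMENT = ServiceClass("protein sequence alignment")
DATABASE = ServiceClass("protein sequence database")
BLAST = AbstractService("BLAST", ALIGNMENT)
SWISS_PROT = AbstractService("SWISS-PROT", DATABASE)

# Hypothetical registry of published instances.
REGISTRY = [
    ServiceInstance(BLAST, "EBI"),
    ServiceInstance(BLAST, "NCBI"),
    ServiceInstance(SWISS_PROT, "EBI"),
]


def find_instances(service_class, registry):
    """Instances whose abstract service belongs to the requested class."""
    return [i for i in registry if i.service.service_class == service_class]


for found in find_instances(ALIGNMENT, REGISTRY):
    print(found.service.name, "as offered by", found.provider)
```

An exact match on the class is the simplest possible lookup; with a classified ontology of service classes, the same query could also return instances of more specific classes.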
Service description phrases • Build up a phrase describing classes of service functionality. • Building blocks for phrase come from a suite of ontologies • Template for the description based on DAML-S specialised for bioinformatics. • Use reasoning to maintain a classification of services