10:30-12:00 How to Build an Ontology 1-2pm Best Practices and Lessons Learned 2-3pm BIRN Ontologies: An Overview. How to Build an Ontology. High quality shared ontologies build communities.
High quality shared ontologies build communities • General trend on the part of NIH, FDA and other bodies to consolidate ontology-based standards for the communication and processing of biomedical data. • NCIT / caBIG / NECTAR / BIRN / OBO ...
TWO STRATEGIES: • Ad hoc creation of new database schemas for each research group / research hypothesis vs. • Pre-established interoperable stable reference ontologies in terms of which all database schemas need to be defined
How to create the conditions for a step-by-step evolution towards gold standard reference ontologies in the biomedical domain • ... and why we need to create these conditions • OBO Core project
Ontology =def • A representation of the types of entities existing in a given domain of reality, and of the relations between these types
Types have instances • Ontologies are like science texts: they are about types • (Diaries, databases, clinical records are about instances)
The need • strong general-purpose classification hierarchies created by domain specialists • clear, rigorous definitions • thoroughly tested in real cases • ontologies teach us about the instances in reality by supporting cross-disciplinary (cross-ontology) reasoning about types
The actuality (too often) • myriad special purpose ‘light’ ontologies, prepared by ontology engineers and deposited in internet ‘repositories’ or ‘registries’
these light ontologies often do not generalize … • repeat work already done by others • are not interoperable • reproduce the very problems of communication which ontology was designed to solve • contain incoherent definitions • and incoherent documentation
BIRN Ontology Experiences • In the short-term, users will probably download the data or analyses and extract the results using their preferred methods. • In the long term, however, that will become infeasible • the databases will have to be made interoperable with standard datamining software. • This is where the neuroanatomy ontologies come in. • We will need to know what the ROI is and which naming scheme it came from (e.g., a Brodmann’s area, or a sulcal/gyral area, etc.). We’ll need to know how it was defined (Talairach atlas? MNI atlas? LONI atlas? Or subject-specific regions?) and what the statistic is.
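The metadata listed above (what the ROI is, which naming scheme it came from, how it was defined, and what the statistic is) could be recorded alongside each result. The following is only a sketch under assumptions: the class `RoiResult`, its field names and the helper `missing_metadata` are hypothetical and do not describe any actual BIRN schema.

```python
from __future__ import annotations

from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class RoiResult:
    """Hypothetical record of the metadata a shared ROI result would need."""
    roi_name: Optional[str] = None            # what the ROI is
    naming_scheme: Optional[str] = None       # e.g. Brodmann's area, sulcal/gyral area
    definition_source: Optional[str] = None   # e.g. Talairach, MNI or LONI atlas
    statistic: Optional[str] = None           # what the reported statistic is


def missing_metadata(result: RoiResult) -> list[str]:
    """Return the names of metadata fields that have not been filled in."""
    return [f.name for f in fields(result) if getattr(result, f.name) is None]


# A result that names its scheme and atlas but not the ROI or the statistic.
partial = RoiResult(naming_scheme="Brodmann's area", definition_source="Talairach atlas")
gaps = missing_metadata(partial)
```

A check of this kind only shows that a field is present; whether the values are interoperable still depends on shared reference ontologies for the names themselves.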
A methodology for quality-assurance of ontologies • tested thus far in the biomedical domain on: • FMA • GO + other OBO Ontologies • FuGO • SNOMED • UMLS Semantic Network • NCI Thesaurus • ICF (International Classification of Functioning, Disability and Health) • ISO Terminology Standards • HL7-RIM
A methodology for quality-assurance of ontologies • accepted need for application of this methodology: • FMA • GO + other OBO Ontologies • FuGO • SNOMED • UMLS Semantic Network • NCI Thesaurus • ICF (International Classification of Functioning, Disability and Health) • ISO Terminology Standards • HL7-RIM
A methodology for quality-assurance of ontologies • signs of hope: • FMA • GO + other OBO Ontologies • FuGO • SNOMED • UMLS Semantic Network • NCI Thesaurus • ICF (International Classification of Functioning, Disability and Health) • ISO Terminology Standards • HL7-RIM
We know that high-quality ontologies built according to this methodology can help in creating high-quality mappings between human and model organism phenotypes
“Alignment of Multiple Ontologies of Anatomy: Deriving Indirect Mappings from Direct Mappings to a Reference Ontology”, Songmao Zhang, Olivier Bodenreider, AMIA 2005
We also know that OWL is not enough to ensure high-quality ontologies • and that the use of a common syntax and logical machinery and the careful separating out of ontologies into namespaces does not solve the problem of ontology integration
A basic distinction • type vs. instance • science text vs. clinical document • man vs. Musen
Instances are not represented in an ontology • It is the generalizations that are important • (but instances must still be taken into account)
Ontology = A Representation of Types • Each node of an ontology consists of: • preferred term (aka term) • term identifier (TUI, aka CUI) • synonyms • definition, glosses, comments
Ontology = A Representation of Types • Nodes in an ontology are connected by relations: • primarily: is_a (= is subtype of) and part_of • designed to support search, reasoning and annotation
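A minimal sketch of such nodes and relations, assuming a plain in-memory representation. The class `OntologyNode`, its field names, the helper `ancestors` and the identifiers `EX:0001` etc. are hypothetical and not taken from any particular ontology format.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class OntologyNode:
    """One node of an ontology: a term standing for a type."""
    identifier: str                                     # term identifier (made-up format)
    preferred_term: str
    synonyms: list[str] = field(default_factory=list)
    definition: str = ""
    is_a: list[str] = field(default_factory=list)       # identifiers of parent types
    part_of: list[str] = field(default_factory=list)    # identifiers of wholes


def ancestors(nodes: dict[str, OntologyNode], identifier: str) -> list[str]:
    """Follow is_a links upward and return every ancestor identifier."""
    found: list[str] = []
    stack = list(nodes[identifier].is_a)
    while stack:
        current = stack.pop()
        if current not in found:
            found.append(current)
            stack.extend(nodes[current].is_a)
    return found


nodes = {
    "EX:0001": OntologyNode("EX:0001", "animal"),
    "EX:0002": OntologyNode("EX:0002", "mammal", is_a=["EX:0001"]),
    "EX:0003": OntologyNode("EX:0003", "cat", is_a=["EX:0002"]),
}
cat_ancestors = ancestors(nodes, "EX:0003")
```

Walking the is_a links is the simplest form of the reasoning these relations are meant to support: anything asserted of mammal can be carried down to cat.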
[Figure: hierarchy of types (substance, organism, animal, mammal, cat, siamese, frog) with the labels ‘types’, ‘leaf class’ and ‘instances’]
Rules for formatting terms • Terms should be in the singular • Terms should be lower case • Avoid abbreviations even when it is clear in context what they mean (‘breast’ for ‘breast tumor’) • Avoid acronyms • Avoid mass terms (‘tissue’, ‘brain mapping’, ‘clinical research’ ...) • Each term ‘A’ in an ontology is shorthand for a term of the form ‘the type A’
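Two of these rules (lower case, no acronyms) can be checked mechanically. The sketch below is a rough heuristic under assumptions: the function `format_problems` is hypothetical, it treats any all-caps word of two or more letters as an acronym, and it does not attempt the singular, abbreviation or mass-term rules, which need linguistic or domain knowledge.

```python
import re


def format_problems(term: str) -> list:
    """Return formatting problems found in a term, word by word (heuristic)."""
    problems = []
    for word in term.split():
        if re.fullmatch(r"[A-Z]{2,}", word):
            # All-caps words are assumed to be acronyms.
            problems.append("acronym: " + word)
        elif word != word.lower():
            # Proper nouns are flagged too and need human review.
            problems.append("not lower case: " + word)
    return problems


clean = format_problems("breast tumor")
with_acronym = format_problems("MRI scan")
capitalised = format_problems("Brain Region")
```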
Motivation: to capture reality • Inferences and decisions we make are based upon what we know of reality. • An ontology is a computable representation of the underlying biological reality. • Designed to enable a computer to reason over the data we derive from this reality in (some of) the ways that we do.
Concepts • Biomedical ontology integration will never be achieved through integration of meanings or concepts • The problem is precisely that different user communities use different concepts • Concepts are in your head and will change as your understanding changes
Concepts • Ontologies represent types: not concepts, meanings, ideas ... • Types exist, with their instances, in objective reality • – including types of image, of imaging process, of brain region, of clinical procedure, etc.
Rules on types • Don’t confuse types with words • Don’t confuse types with concepts • Don’t confuse types with ways of getting to know types • Don’t confuse types with ways of talking about types • Don’t confuse types with data about types
Univocity • Terms should have the same meanings on every occasion of use. • They should refer to the same kinds of entities in reality • Basic ontological relations such as is_a and part_of should be used in the same way by all ontologies
Positivity • Complements of types are not themselves types. • Hence terms such as • non-mammal • non-membrane • other metalworker in New Zealand • do not designate types in reality • There are also no conjunctive and disjunctive types: • protoplasmic astrocyte and Schwann cell • Purkinje neuron or dendritic shaft
Objectivity • Which types exist is not a function of our knowledge. • Terms such as ‘unknown’ or ‘unclassified’ or ‘unlocalized’ do not designate types in reality.
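The positivity and objectivity rules lend themselves to a simple screening pass over term names. This is a sketch only: the function `rule_violations` and its message strings are hypothetical, and the patterns are crude word-level heuristics that will produce false positives, so hits are candidates for review rather than confirmed errors.

```python
import re


def rule_violations(term: str) -> list:
    """Flag terms whose wording suggests a positivity or objectivity problem."""
    lowered = term.lower()
    violations = []
    # Complements of types: 'non-...' or 'other ...'
    if re.search(r"\bnon-", lowered) or lowered.startswith("other "):
        violations.append("positivity: complement")
    # Conjunctive or disjunctive terms: a standalone 'and' or 'or'
    if re.search(r"\b(and|or)\b", lowered):
        violations.append("positivity: conjunction or disjunction")
    # Terms that describe the state of our knowledge rather than a type
    if re.search(r"\b(unknown|unclassified|unlocalized)\b", lowered):
        violations.append("objectivity")
    return violations


examples = {
    term: rule_violations(term)
    for term in [
        "non-mammal",
        "other metalworker in New Zealand",
        "Purkinje neuron or dendritic shaft",
        "unlocalized",
        "mammal",
    ]
}
```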
Single Inheritance • No kind in a classificatory hierarchy should have more than one is_a parent on the immediate higher level
Multiple Inheritance • [Diagram: thing has the subtypes blue thing and car; blue car is linked to both by two is_a edges, labelled is_a1 and is_a2]
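The blue car case can be detected automatically. A minimal sketch, assuming the hierarchy is available as a mapping from each type to its direct is_a parents (the function name `multiple_parents` and the dictionary layout are hypothetical). It treats every direct parent as sitting on the immediate higher level.

```python
def multiple_parents(is_a: dict) -> dict:
    """Return the types that have more than one distinct direct is_a parent."""
    return {
        child: parents
        for child, parents in is_a.items()
        if len(set(parents)) > 1
    }


# The hierarchy from the slide: blue car has two is_a parents.
is_a = {
    "blue thing": ["thing"],
    "car": ["thing"],
    "blue car": ["blue thing", "car"],
}
offenders = multiple_parents(is_a)
```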
is_a Overloading • serves as obstacle to integration with neighboring ontologies • The success of ontology alignment demands that ontological relations (is_a, part_of, ...) have the same meanings in the different ontologies to be aligned. • See “Relations in Biomedical Ontologies”, Genome Biology May 2005. • DISEASE MAPS
General Rule • Formulate universal statements first • Move to A may be B in such and such a context later
Intelligibility of Definitions • The terms used in a definition should be simpler (more intelligible) than the term to be defined; otherwise the definition provides no assistance • to human understanding • to machine processing
Definitions should be intelligible to both machines and humans • Machines can cope with the full formal representation • Humans need clarity and modularity
But • Some terms are primitive (cannot be defined) • AVOID CIRCULAR DEFINITIONS • Avoid definitions of the forms: • An A is an A which is B (person = person with identity documents) • An A is the B of an A (heptolysis = the causes of heptolysis)
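The two circular forms above share one detectable symptom: the term being defined reappears in its own definition. A rough check, with the function name `is_circular` as an assumption; it catches only this direct case, not circles that run through several definitions.

```python
import re


def is_circular(term: str, definition: str) -> bool:
    """True if the defined term occurs as a whole phrase in its own definition."""
    pattern = r"\b" + re.escape(term.lower()) + r"\b"
    return re.search(pattern, definition.lower()) is not None


first_form = is_circular("person", "person with identity documents")
second_form = is_circular("heptolysis", "the causes of heptolysis")
acceptable = is_circular("disease progression", "The worsening of a disease over time.")
```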
Case Study: The National Cancer Institute Thesaurus (NCIT) • does not (yet) satisfy these and other simple principles
The NCIT reflects a recognition of the need • for high quality shared ontologies and terminologies the use of which by clinical researchers in large communities can ensure re-usability of data collected by different research groups
NCIT • “a biomedical vocabulary that provides consistent, unambiguous codes and definitions for concepts used in cancer research” • “exhibits ontology-like properties in its construction and use”.
Goals • to make use of current terminology “best practices” to relate relevant concepts to one another in a formal structure, so that computers as well as humans can use the Thesaurus for a variety of purposes, including the support of automatic reasoning; • to speed the introduction of new concepts and new relationships in response to the emerging needs of basic researchers, clinical trials, information services and other users.
Formal Definitions • of 37,261 nodes, 33,720 were stipulated to be primitive in the DL sense • Thus only a small portion of the NCIT ontology can be used for purposes of automatic classification and error-checking by using OWL.
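To make "a small portion" concrete, the two counts on the slide can be turned into a share. The variable names are illustrative; the figures are the ones quoted above.

```python
total_nodes = 37_261       # NCIT nodes, as quoted on the slide
primitive_nodes = 33_720   # nodes stipulated to be primitive in the DL sense

defined_nodes = total_nodes - primitive_nodes   # nodes with a formal definition
defined_share = defined_nodes / total_nodes     # fraction usable for classification

print(f"{defined_nodes} of {total_nodes} nodes ({defined_share:.1%}) are formally defined")
```

On these figures, roughly one node in ten carries a formal definition that a reasoner can use.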
Verbal Definitions • About half the NCIT terms are assigned verbal definitions • Unfortunately some are assigned more than one
Disease Progression • Definition1 • Cancer that continues to grow or spread. • Definition2 • Increase in the size of a tumor or spread of cancer in the body. • Definition3 • The worsening of a disease over time. This concept is most often used for chronic and incurable diseases where the stage of the disease is an important determinant of therapy and prognosis.
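Terms that carry more than one verbal definition, as Disease Progression does here, can be found by grouping definitions by term. A sketch assuming the definitions are available as (term, definition) pairs; the names `definitions` and `multiply_defined` are hypothetical and no actual NCIT export format is implied.

```python
from collections import defaultdict

# The three definitions quoted on the slide, as (term, definition) pairs.
definitions = [
    ("Disease Progression", "Cancer that continues to grow or spread."),
    ("Disease Progression", "Increase in the size of a tumor or spread of cancer in the body."),
    ("Disease Progression", "The worsening of a disease over time."),
]


def multiply_defined(pairs):
    """Return a mapping from each term with several distinct definitions to their count."""
    by_term = defaultdict(set)
    for term, definition in pairs:
        by_term[term].add(definition)
    return {term: len(texts) for term, texts in by_term.items() if len(texts) > 1}


flagged = multiply_defined(definitions)
```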