
The Development of an Ontology for Data Integration and Query in Comparative Genomics.
Trevor Paterson and Andy Law
Roslin Institute, Scotland

Aims:
- To develop ‘enabling technologies’ for comparative genomics.
- To integrate disparate resources (genomic mapping, DNA sequence, evolutionary relationships, functional information) across species boundaries.
- In order to inform and expedite genomic mapping, particularly in non-model organisms.

Collaborators: Farm animal, crop and microbial genomics; Bioinformatics; Computer Sciences; Statistics.

DISPARATE GENOMIC MAPPING DATA
- for individual species
- multiple datatypes
- in many non-standard formats and databases
- archived in many locations, with a variety of access protocols
- data of variable quality and completeness
PLUS ONLINE BIOINFORMATICS RESOURCES
- DNA sequence and genome projects
- Gene structure and function
- Protein structure, family, function
- Evolutionary history, orthology, homology
- Phenotypes (genetic traits and diseases)
- Population genetics
- Gene expression patterns
- Publications
Current integration between data sources and across species is largely manual, i.e. difficult, error-prone and very inefficient.

Why do Biologists want to integrate mapping data across species? What are they trying to do?
GOAL: MAP, IDENTIFY AND UNDERSTAND GENES BEHIND PHENOTYPES (i.e. DISEASES & TRAITS)
ComparaGrid aims to assist this process by exploiting existing mapping data across species boundaries.

UNDERLYING BIOLOGICAL PRINCIPLE BEHIND CROSS-SPECIES MAP COMPARISON
Conservation of Synteny: “Conservation of (blocks of) gene order throughout chromosomal evolution”
As species evolve and diverge, their chromosomes rearrange through duplications, inversions, translocations etc., but blocks of genes can be traced through evolutionary history between even relatively divergent species (e.g. chicken and man).
Therefore the known gene order in these blocks in one species can inform/predict the order of evolutionarily related genes (orthologues) in other species.
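A minimal Python sketch of how a known gene order in one species could suggest an order in another. The gene names, the orthologue table and the function `predict_order` are all hypothetical, invented for illustration; it assumes a single conserved block with no inversion, which real data would not guarantee.

```python
# Hypothetical data: gene order within one conserved block in a reference species.
reference_block = ["GENE_A", "GENE_B", "GENE_C", "GENE_D"]

# Hypothetical orthology assertions: reference gene -> gene in the target species.
orthologues = {
    "GENE_A": "target_a",
    "GENE_C": "target_c",
    "GENE_D": "target_d",
}


def predict_order(block, orthologue_map):
    """Predict target-species gene order from the reference order.

    Assumes the block is syntenic and not inverted; reference genes with
    no known orthologue are skipped.
    """
    return [orthologue_map[gene] for gene in block if gene in orthologue_map]


predicted = predict_order(reference_block, orthologues)
print(predicted)
```

The prediction is only as good as the orthology assertions fed into it, which is why those assertions matter so much for integration.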

[Figure: an ancestral chromosome diverging into modern species A, A’ and B. Labelled events: speciation event, duplicative inversion, breakage, inversion. Time axis: 20M years ago, 10M years ago, now.]

[Figure: the same ancestral chromosome diagram (species A, A’ and B; speciation event, duplicative inversion, breakage, inversion; 20M years ago, 10M years ago, now), annotated with the HYPOTHESIS: Sequence Similarity & Conserved Synteny => Orthology.]

COMPARATIVE GENOMICS USE CASE
Agribusiness wants to map the underlying genetic basis of the ‘Tasty Bacon’ trait (a QTL).
[Figure: the Tasty Bacon QTL on a QTL (genetic) map.]

COMPARATIVE GENOMICS USE CASE
The position of the QTL is correlated on various types of pig genetic maps.
[Figure: Tasty Bacon QTL map aligned with a linkage map and a radiation hybrid map.]

COMPARATIVE GENOMICS USE CASE
There is a ‘known’ homology between a pig marker/sequence in this region and the human genome.
DNA Sequence Similarity => Homology =>? Orthology
[Figure: pig QTL map, linkage map, cytogenetic map and radiation hybrid map, linked to the human genome.]

COMPARATIVE GENOMICS USE CASE
A physical map of BAC clones exists for this region of the human genome.
[Figure: pig QTL map, linkage map, cytogenetic map and radiation hybrid map, linked to a human physical map of BAC1, BAC2 and BAC3.]

COMPARATIVE GENOMICS USE CASE
There are known chicken expressed sequences homologous to human gene sequences in this region.
[Figure: pig maps (QTL, linkage, cytogenetic, radiation hybrid), the human physical map (BAC1, BAC2, BAC3) and chicken ESTs (EST1, EST2).]

COMPARATIVE GENOMICS USE CASE
Gene expression data for these chick ESTs might correlate with a trait similar to ‘Tastiness’.
[Figure: pig maps (QTL, linkage, cytogenetic, radiation hybrid), the human physical map (BAC1, BAC2, BAC3) and chicken ESTs (EST1, EST2), with expression analysis added.]

COMPARATIVE GENOMICS USE CASE
The literature may detail functions of human genes in this region, and homologies to genes in other species, helping the researcher predict candidate genes in pigs responsible for tastiness.
[Figure: pig maps (QTL, linkage, cytogenetic, radiation hybrid), the human physical map (BAC1, BAC2, BAC3) and chicken ESTs (EST1, EST2), with expression analysis and linked references added.]

COMPARATIVE GENOMICS USE CASE: HOW CAN WE AUTOMATE THIS?
- Provide Architecture to Link and Traverse Data Sources: GRID / Web services
- Provide Data Standards to allow this: Syntax and Semantics of Data
- Formalise the Links between Data:
  => these Relationships are Data too
  => these are what the Biologists care about
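To make the idea of relationships as data concrete, here is a minimal Python sketch that stores the use-case links as typed records and walks them from the QTL to a chicken EST. The identifiers, relationship names and the function `traverse` are hypothetical, chosen only to mirror the use case; they are not ComparaGrid's vocabulary or architecture.

```python
from collections import deque

# Each link is (source, relationship, target). All identifiers are hypothetical.
links = [
    ("pig:TastyBaconQTL", "mapsNear", "pig:MarkerX"),
    ("pig:MarkerX", "homologousTo", "human:SeqY"),
    ("human:SeqY", "locatedOn", "human:BAC2"),
    ("human:BAC2", "contains", "human:GeneZ"),
    ("chicken:EST1", "homologousTo", "human:GeneZ"),
]


def traverse(start, links):
    """Breadth-first walk over the links, treated as undirected.

    Returns, for every reachable node, the list of (relationship, node)
    steps taken from the start node.
    """
    neighbours = {}
    for source, rel, target in links:
        neighbours.setdefault(source, []).append((rel, target))
        neighbours.setdefault(target, []).append((rel, source))
    paths = {start: []}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for rel, nxt in neighbours.get(node, []):
            if nxt not in paths:
                paths[nxt] = paths[node] + [(rel, nxt)]
                queue.append(nxt)
    return paths


paths = traverse("pig:TastyBaconQTL", links)
for rel, node in paths["chicken:EST1"]:
    print(rel, "->", node)
```

Because each link carries its relationship type, the path itself records which homology assertions the conclusion depends on.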

WHAT DOES COMPARAGRID NEED TO INTEGRATE DATASOURCES IN A BIOLOGICALLY RELEVANT FASHION?
A lightweight Exchange Standard or a heavyweight Ontology in OWL-DL?
1. Lightweight Mapping from RDB Schema to a standard
Minimally: a data exchange standard (defines structure and vocabulary for data exchange): XML Schema? RDF?
(a ‘straightforward’ mapping by data providers; integration logic handling the meaning of ‘relationships’ must be in the Application)

WHAT DOES COMPARAGRID NEED TO INTEGRATE DATASOURCES?
A lightweight Exchange Standard or a heavyweight Ontology in OWL-DL?
2. More Heavyweight Mapping Capturing the Semantics of the Data
=> Defined RDFS Vocabulary?
(mapping still quite lightweight; data is better defined and more reliably integrated; integration of data can be automatic; Applications can rely on semantics)
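As an illustration of what a defined RDFS vocabulary buys, the sketch below applies two RDFS-style rules (transitivity of `rdfs:subClassOf`, and type inheritance through `rdfs:subClassOf`) to plain tuples. The class names and the instance are hypothetical, and this is a toy in plain Python rather than a real RDF toolkit.

```python
# Triples are (subject, predicate, object). The vocabulary is hypothetical.
triples = {
    ("LinkageMap", "rdfs:subClassOf", "GeneticMap"),
    ("GeneticMap", "rdfs:subClassOf", "Map"),
    ("pigLinkageMap1", "rdf:type", "LinkageMap"),
}


def infer_types(triples):
    """Apply two RDFS-style rules until no new triples appear:
    subClassOf is transitive, and instances inherit superclass types."""
    inferred = set(triples)
    changed = True
    while changed:
        changed = False
        subclass_pairs = [(s, o) for s, p, o in inferred if p == "rdfs:subClassOf"]
        new = set()
        for s, p, o in inferred:
            if p not in ("rdfs:subClassOf", "rdf:type"):
                continue
            for child, parent in subclass_pairs:
                if o == child:
                    new.add((s, p, parent))
        if not new <= inferred:
            inferred |= new
            changed = True
    return inferred


closure = infer_types(triples)
print(("pigLinkageMap1", "rdf:type", "Map") in closure)  # prints True
```

An application that asks for every `Map` would now find the pig linkage map without knowing anything about linkage maps itself.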

WHAT DOES COMPARAGRID NEED TO INTEGRATE DATASOURCES?
A lightweight Exchange Standard or a heavyweight Ontology in OWL-DL?
3. Heavyweight Mapping
Semantically represent the Relationships between Data (and Relationships between Relationships…): Formal Ontology (OWL-DL)
(mapping from datasource to Ontology is complex and specialist; automatic integration and inference is possible over data represented as individuals of the ontology)

DO WE NEED YET ANOTHER ONTOLOGY?
• We think comparative genomics is very different from other biological knowledge domains… (SO, OBO, GO…)
• We need to integrate both abstract and physical data – experimental observations positioning ‘markers’ on abstract maps, and physical locations of ‘features’ on representations of DNA sequences
• Metadata is important – we need to treat mapping data as assertions that might be accepted or rejected on the basis of quality, provenance and trust
• We need to represent evolutionary relationships between mapped objects – these are also assertions, not facts, based on the relatedness of underlying physical objects (sequence similarity).
• Integration between datasources depends on accepting these evolutionary assertions!
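One way to picture mapping data as assertions that can be accepted or rejected is sketched below in Python. The `Assertion` fields, the numeric `score`, the source labels and the function `accepted` are assumptions made for illustration; the slides do not specify how quality, provenance or trust would be encoded.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Assertion:
    subject: str
    relationship: str
    obj: str
    source: str   # provenance: who made the assertion (hypothetical labels)
    score: float  # hypothetical quality score in [0, 1]


assertions = [
    Assertion("pig:MarkerX", "orthologousTo", "human:GeneZ", "curated", 0.9),
    Assertion("pig:MarkerX", "orthologousTo", "human:GeneQ", "automated", 0.4),
]


def accepted(assertions, trusted_sources, min_score):
    """Keep only assertions from trusted sources at or above min_score."""
    return [
        a for a in assertions
        if a.source in trusted_sources and a.score >= min_score
    ]


kept = accepted(assertions, trusted_sources={"curated"}, min_score=0.5)
print([a.obj for a in kept])
```

Different users can apply different thresholds to the same data, so the integrated view depends on which evolutionary assertions each user is prepared to accept.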

IDEALIZED COMPARAGRID ARCHITECTURE: The OWL Ontology forms the 'semantic glue' to integrate data sources and express cross species queries. The mapping between the data source schema and the integration schema (the CG OWL Ontology) is critical.

COMPARAGRID STACK ARCHITECTURE:
- A publisher service automates mapping DB Schema to OWL
- Bespoke mapping rules map from DB-OWL to CG-OWL
[Figure: the stack from SQL DB up to CG. Layer labels: Raw data, Syntax, Semantics, Aggregation. Component labels: Publisher service, Transformer service, Integrator.]
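The step from DB-OWL to CG-OWL can be pictured as a term-rewriting pass driven by bespoke rules. The following Python sketch assumes the simplest possible rule form, a one-to-one term lookup; the `db:` names, the row identifiers and the function `transform` are hypothetical, and real mapping rules would likely need to be far richer.

```python
# Hypothetical bespoke rules: DB-OWL term -> CG-OWL term.
mapping_rules = {
    "db:MAP_TABLE": "cg:Map",
    "db:CHROMOSOME_TABLE": "cg:Chromosome",
    "db:map_chromosome_fk": "cg:isMapOf",
}


def transform(triples, rules):
    """Rewrite every term that has a rule; leave other terms untouched."""
    return [
        tuple(rules.get(term, term) for term in triple)
        for triple in triples
    ]


# Hypothetical publisher output: the database schema expressed as triples.
db_owl = [
    ("row:map42", "rdf:type", "db:MAP_TABLE"),
    ("row:map42", "db:map_chromosome_fk", "row:chr7"),
]

cg_owl = transform(db_owl, mapping_rules)
for triple in cg_owl:
    print(triple)
```

Keeping the rules as data rather than code means each data provider could supply their own table without changing the transformer.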

BUILDING THE COMPARAGRID ONTOLOGY
Stage I (Biologists & Bioinformaticians input)
• Define the Scope of the Domain
• Collect the terminology used in the Domain
• Interview practising experts
• Document some use cases
• Observe how the experts perform an analysis
• Define the terms and relationships necessary
• Model the knowledge domain
OUTPUTS:
- a model of the knowledge domain
- a prototype ontology (in OWL-DL): terms and relationships necessary to represent the data and the relationships between data (using Protégé).

BUILDING THE COMPARAGRID ONTOLOGY
Stage II (Biologists, Bioinformaticians, Ontologists)
• Hold workshops for panels of experts across the scope of the domain (animal, plant, microbe).
• Confirm the Concepts and Relationships that are required.
• Confirm our model of the knowledge domain.
• Iterate and refine the prototype ontology representing this model.
OUTPUT: version 1 prototype ComparaGrid OWL Ontology

HIERARCHY OF CONCEPTS IN THE COMPARAGRID ONTOLOGY

COMPARAGRID ONTOLOGY: Simple Relationships = Properties
[Figure: Hierarchy of Object to Object Properties; Hierarchy of Object to Value Properties.]

Simple RDF Statement Representation of a Relationship
[Figure: a unidirectional relationship between two DomainConcepts: Map isMapOf (property) Chromosome.]
Richer Representation as OWL Class
[Figure: the isMapOf relationship between Map and Chromosome drawn as a DomainConcept. Labels: relatesFrom (property), relatesTo (property), hasEvidence (property), Citation, identifier (property), Value <String of Characters>.]
In OWL-DL complex relationships can be modelled as Concepts
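The contrast between the two representations can be sketched with plain tuples in Python. The reading that `relatesFrom` and `relatesTo` hang off the relationship concept is an assumption drawn from the figure labels, and the identifiers `map1`, `chr7`, `rel1`, `citation1`, the class name `IsMapOfRelationship` and the function `collapse` are hypothetical.

```python
# Simple statement: one unidirectional triple, with no place to attach evidence.
simple = ("map1", "isMapOf", "chr7")

# Richer form: the relationship becomes a node of its own, so further
# statements (evidence, identifier) can be attached to it.
relationship_node = "rel1"  # hypothetical identifier
rich = [
    (relationship_node, "rdf:type", "IsMapOfRelationship"),
    (relationship_node, "relatesFrom", "map1"),
    (relationship_node, "relatesTo", "chr7"),
    (relationship_node, "hasEvidence", "citation1"),
    ("citation1", "identifier", "hypothetical-citation-id"),
]


def collapse(rich_triples, node, predicate):
    """Recover the simple triple from the richer, node-based form."""
    props = {p: o for s, p, o in rich_triples if s == node}
    return (props["relatesFrom"], predicate, props["relatesTo"])


print(collapse(rich, relationship_node, "isMapOf"))
```

The richer form costs four extra statements per relationship, but it is what lets evidence and provenance travel with the assertion.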

The Importance of Relationships
Biologists and Bioinformaticians see an important conceptual difference between:
- The ‘nuts and bolts’ relationships within the data (‘EXPERIMENTAL OBSERVATIONS’ and ‘FACTS’)
Vs
- The biological hypotheses (‘ASSERTIONS’)
Hopefully the richness and expressivity of OWL-DL will give us the opportunity to capture the subtleties of the different types of relationships and how they may relate to each other.
Critically, we want to infer over the data represented as individuals, not merely over properties of the ontology.

COMPARAGRID ONTOLOGY: Complex Relationships (as Concepts)

BUILDING THE COMPARAGRID ONTOLOGY
Stage III (Expert Ontologists)
• Refactor the prototype ontology according to good design principles
• Build a core upper-level comparative mapping domain ontology that will integrate with other domains
• Incorporate additional modules to represent specific subdomains (Genetic Variation, Abstract Mapping Concepts, Evidence, Evolutionary Relationships etc.)
OUTPUT: modularised ComparaGrid OWL Ontology

THE MODULARISED COMPARAGRID ONTOLOGY

BUILDING THE COMPARAGRID ONTOLOGY
Timescale
• Stage I: 6 months
• Stage II: 6 months
• Stage III: ongoing / 3 years
Problem
• How do we develop the architecture and software when we don’t have a final Ontology or model?
• Use the Prototype version?
• Use small hack ontologies for demonstration data?
• But can we be sure the principles will work for the final larger, more complex Ontology?

USING THE COMPARAGRID ONTOLOGY: Querying distributed resources through the ComparaGrid Stack Architecture
• Tools for converting DB schema to OWL ontology
• Tool support for mapping DB ontologies to CG ontology
• Automatic query translations up and down the stack
• Allows queries to be expressed and resolved in OWL – should allow automated reasoning and inference
Under Development… i.e. Fun Time for the Computer Scientists…

Roslin Ark Database’s experience as Data Providers (and Biologists/Users)
• We want to export and import data in a reusable format
• We could build all our own applications using a common data format, allowing us to traverse data sets according to assertions made between the data…
• …but we want to use ComparaGrid’s ‘clever’ integration and query through OWL
• i.e. we want to exchange data as OWL, so we have to incorporate mapping from schema to OWL into our service architecture

Roslin Ark Database’s experience as Data Providers
Problems:
• We are waiting for the ‘final’ ontology
• We are waiting for the stack architecture (…which is waiting for the ontology)
• The ComparaGrid Architecture/Toolset is being designed to map from DB schema to OWL, but our DB schema captures none of our domain model… our mapping should be from Object model to OWL…
• We have to implement our own mapping to OWL…
• We want to progress and ACTUALLY DO SOME BIOLOGY!

[Figure: ArkIIDB data flow. Labels: Schema, Relationship Table, Object Table, Ibatis, Java Objects, Object Model, Web App, Drawing Applet, Download App, Java Application, Betwixt / XSLT, XML/RDF, OWL API, OWL SERIALISATION, Jena Model, RDFS Vocabulary, Web Service, DB-OWL, CG-OWL, CG, and a ‘≠’ sign.]

ComparaGrid Ontology: Where are we at…and Why?
• Prototype OWL Ontology created:
  - used to demonstrate mapping of ArkDB to Web services.
  - Ontology is flabby and poorly designed?
  - Mapping from Java to OWL/XML is a cumbersome, manual process.
• Refactoring/modularising the ComparaGrid OWL Ontology is non-trivial (a research project in its own right!).
  - We are not able to use a ‘final’ ontology to drive the development of services.
• Until we have a working common data format or ontology we can’t start to import and export further data sources.

ComparaGrid Ontology: Where are we at…and Why?
• Implementation of the ComparaGrid stack integration and query architecture is ongoing.
• Automated / Assisted mapping tools under development (DB relational schema => DB-OWL => CG-OWL) [using hack ontology fragments in the interim].
• We need further tools to support mapping from any ad hoc database or object model to OWL

ComparaGrid Ontology: Where are we at…and Why?
• As data providers, Roslin ArkDB is dependent on the tools and infrastructure being developed by ComparaGrid, without knowing how much added value an ontology will give…
• We hope that the ontology will allow us to represent the ‘interesting’ biological relationships
• That it will facilitate automated integration and data traversal
• That it will allow inference of new knowledge automatically
• However… the burden is put on the data mapping process – a more lightweight approach would simplify this (e.g. RDF/RDFS), but might require that applications understand the context of information sources.
• RDF(S) is becoming quite well supported, and allows some inference over semantic relationships.
WOULD IT BE GOOD ENOUGH FOR US?
