Comparative Data Analysis Ontology (CDAO)
CDAO formalizes evolutionary biology knowledge through an ontology for comparative data analysis, facilitating format conversions and automated reasoning. Developed by Prosdocimi, Chisham, Thompson, Pontelli, and Stoltzfus.
Comparative Data Analysis Ontology (CDAO) Francisco Prosdocimi, Brandon Chisham, Julie Thompson, Enrico Pontelli, Arlin Stoltzfus
Objectives • Develop a framework to formalize knowledge in the evolutionary biology domain • Formalize an ontology for comparative data analysis • Comparative Data Analysis Ontology (CDAO) • Implement and Evaluate the ontology
Motivation • Interoperation • Ontologies formalize knowledge • Overcome ambiguities in data formats (e.g., the multiple interpretations of NEXUS) • Facilitate provably correct format conversions • Reasoning • Beyond relational queries • Automated generation of format converters • Advanced reasoning required for workflow constructions and validation • Miscellaneous • Guide development of new data formats • Lingua franca for knowledge exchange • …
Structure of CDAO • Current Focus: • Taxonomic units • Tree-like networks of relationships • Models of evolutionary changes
Structure of CDAO • Core Components • Representation of Networks and Trees (e.g., NEXUS TREE Block) • Representation of Character Data (e.g., NEXUS CHARACTERS Block) • Imported Components • Amino Acid Ontology • http://www.co-ode.org/ontologies/amino-acid • U. Manchester, 2006 • Nucleotide Ontology • http://www.co-ode.org/ontologies/basic-bio/
CDAO: Core Components • Network/Tree representation • Rooted and Unrooted Trees • Nodes • Edges • Sets of Nodes
[Figure: class diagram. Labelled classes: network, topology, rooted tree, unrooted tree, node, edge, directed edge, set of nodes, lineage, TU. Labelled relations: is_a, part_of, has_ancestor, has_descendant, has_element, mrca_of, has_annotation, represents (node to TU), Parent Node, Child Node. Annotations shown: Tree Procedure, Model…; Transformation, Length…]
CDAO: Core Components • Representation of a Directed Tree
[Figure: example rooted tree with nodes A to E. Labels shown: Rooted_tree, has_root_node, Node, Node (Ancestral), Edge, Directed edge or branch, has_parent_Node, has_child_Node, has_descendant (min 2 Nodes), Lineage, Subtree, MRCA_Node, Edge Transformation, Character, Ancestor state, Derived state…]
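The directed-tree concepts on this slide (parent and child nodes, lineage, MRCA node) can be illustrated with a small Python sketch. This is not CDAO code: the `RootedTree` class, its method names, and the toy topology with internal nodes `n1` and `n2` are assumptions made for illustration only.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class RootedTree:
    """Toy rooted tree: each directed edge is stored as a child -> parent link."""
    parent_of: Dict[str, str] = field(default_factory=dict)

    def add_edge(self, parent: str, child: str) -> None:
        # A directed edge has exactly one parent node and one child node.
        self.parent_of[child] = parent

    def lineage(self, node: str) -> List[str]:
        # Path from the node up to the root, the node itself included.
        path = [node]
        while path[-1] in self.parent_of:
            path.append(self.parent_of[path[-1]])
        return path

    def mrca(self, a: str, b: str) -> Optional[str]:
        # First node on b's lineage that also lies on a's lineage.
        seen = set(self.lineage(a))
        for node in self.lineage(b):
            if node in seen:
                return node
        return None


# Hypothetical topology: ((A,B)n1,C)n2 hanging below a root node.
tree = RootedTree()
for parent, child in [("root", "n2"), ("n2", "n1"),
                      ("n1", "A"), ("n1", "B"), ("n2", "C")]:
    tree.add_edge(parent, child)

print(tree.lineage("A"))    # path from A to the root
print(tree.mrca("A", "C"))  # most recent common ancestor of A and C
```

Storing only child-to-parent links keeps ancestor queries trivial; an ontology such as CDAO additionally names the edges themselves so that annotations (length, transformation) can be attached to them.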
CDAO: Core Components • Annotations • Edge Annotations • Length • Transformation • Model Description • Gap Cost • Substitution Model • TU Annotation • Taxonomic Link • Tree Annotation • Tree Procedure
[Figure: EdgeAnnotation diagram. A character state transformation is linked by transform_character to a character, by has_left_state and has_right_state to character states, and by has_left_node and has_right_node to nodes.]
CDAO: Core Components • Character State Data Matrix • Character • Taxonomic Units • Datum • State
[Figure: class diagram of the character state data matrix. Labelled classes: character state data matrix, taxonomic unit, character, character state datum, character state, node, coordinate system, and the subclasses compound, amino acid, nucleotide, discrete, continuous. Labelled relations: part_of, belongs_to, has datum, has, represented by, has coordinate, is_a, is_transformation_of, has annotation. Annotations shown: Alignment procedures…; TAXID, DB-XREF…]
Implementation Details • Formalization • OWL 1.1 • Tools • Protégé 4 [edit] • Swoop 2.3 [validation] • C++ and Perl+Prolog translators • Swoop 2.3 [reasoning] • Pellet [reasoning] • FaCT++ [reasoning]
Preliminary Evaluation • We are reaching the stage where concrete evaluation is possible • NEXUS converters • We hit several stumbling blocks • A good formalization of CDAO requires sophisticated features (OWL 1.1) • Most reasoning engines have not reached OWL 1.1 yet (even if they claim to…)
Some Examples • Simple NEXUS file

#NEXUS
BEGIN TAXA;
  DIMENSIONS ntax=10;
  TAXLABELS
    Arabidopsis_thaliana_AAD31363.1
    Arabidopsis_thaliana_CAB79970.1
    Oryza_sativa_BAB21282.1
    Dictyostelium_discoideum_AAO51107.1
    Caenorhabditis_elegans_CAA92686.1
    Drosophila_melanogaster_AAF55117.1
    Drosophila_melanogaster_AAF55115.1
    Mus_musculus_BAB61955.1
    Saccharomyces_cerevisiae_AAB68881.1
    Schizosaccharomyces_pombe_CAB16373.1;
END;
BEGIN CHARACTERS;
  TITLE dna;
  LINK taxa=PF00137_47;
  DIMENSIONS nchar=10;
  FORMAT datatype=dna gap=- missing=?;
  MATRIX
    Arabidopsis_thaliana_CAB79970.1      gtgtggttgc
    Schizosaccharomyces_pombe_CAB16373.1 tgtatatgct
    Drosophila_melanogaster_AAF55117.1   tgtacttcgt
    Arabidopsis_thaliana_AAD31363.1      gt---gtggc
    Oryza_sativa_BAB21282.1              ct--------
    Saccharomyces_cerevisiae_AAB68881.1  tgtacaagct
    Mus_musculus_BAB61955.1              tctgctacac
    Dictyostelium_discoideum_AAO51107.1  cacttactcc
    Caenorhabditis_elegans_CAA92686.1    tgttttacat
    Drosophila_melanogaster_AAF55115.1   ac------g-
  ;
END;
BEGIN TREES;
  TREE con_50_majrule = (((Arabidopsis_thaliana_AAD31363.1:0.004496,Arabidopsis_thaliana_CAB79970.1:0.009539)inode15:0.090479,Oryza_sativa_BAB21282.1:0.043596)inode14:0.219708,(Dictyostelium_discoideum_AAO51107.1:0.341768,(((Caenorhabditis_elegans_CAA92686.1:0.308884,(Drosophila_melanogaster_AAF55117.1:0.128132,Drosophila_melanogaster_AAF55115.1:0.384443)inode20:0.236060)inode19:0.093887,Mus_musculus_BAB61955.1:0.243982)inode18:0.150844,(Saccharomyces_cerevisiae_AAB68881.1:0.235101,Schizosaccharomyces_pombe_CAB16373.1:0.261646)inode21:0.225955)inode17:0.189073)inode16:0.127974)root;
END;
Some Examples • Node:

<cdao:Node rdf:ID="node_inode15">
  <cdao:part_of rdf:resource="#Tree"/>
  <cdao:belongs_to_Edge rdf:resource="#edge_inode15_inode14" />
  <cdao:belongs_to_Edge rdf:resource="#edge_Arabidopsis_thaliana_CAB79970_1_inode15" />
  <cdao:belongs_to_Edge rdf:resource="#edge_Arabidopsis_thaliana_AAD31363_1_inode15" />
  <cdao:belongs_to_Edge_as_Child rdf:resource="#edge_inode15_inode14" />
  <cdao:belongs_to_Edge_as_Parent rdf:resource="#edge_Arabidopsis_thaliana_CAB79970_1_inode15" />
  <cdao:belongs_to_Edge_as_Parent rdf:resource="#edge_Arabidopsis_thaliana_AAD31363_1_inode15" />
  <cdao:nca_node_of rdf:resource="#set_nca_44"/>
</cdao:Node>

• Directed_Edge:

<cdao:Directed_Edge rdf:ID="edge_Arabidopsis_thaliana_CAB79970_1_inode15">
  <cdao:part_of rdf:resource="#Tree"/>
  <cdao:has_Parent_Node rdf:resource="#node_inode15"/>
  <cdao:has_Child_Node rdf:resource="#node_Arabidopsis_thaliana_CAB79970_1"/>
  <cdao:has_Annotation rdf:resource="#edge_Arabidopsis_thaliana_CAB79970_1_inode15_length"/>
</cdao:Directed_Edge>
<cdao:Edge_Length rdf:ID="edge_Arabidopsis_thaliana_CAB79970_1_inode15_length">
  <cdao:has_Value rdf:datatype="&xsd;float"> 0.009539 </cdao:has_Value>
</cdao:Edge_Length>
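Each Directed_Edge instance above pairs a parent node with a child node taken from the Newick tree string in the NEXUS TREES block. As a rough illustration of that step, the Python sketch below walks a labelled Newick string and emits (parent, child, length) triples. It is not the C++ or Perl+Prolog translator mentioned in the slides; the function names `newick_edges` and `edge_id` are hypothetical, and the identifier scheme simply imitates the IDs visible in the example (`edge_<child>_<parent>`, with `.` replaced by `_`).

```python
from typing import List, Optional, Tuple

Edge = Tuple[str, str, Optional[float]]  # (parent label, child label, branch length)


def newick_edges(newick: str) -> List[Edge]:
    """Return directed edges for a Newick string with plain (unquoted) labels."""
    text = newick.strip().rstrip(";")
    pos = 0
    anonymous = 0
    edges: List[Edge] = []

    def parse_node() -> Tuple[str, Optional[float]]:
        nonlocal pos, anonymous
        children: List[Tuple[str, Optional[float]]] = []
        if pos < len(text) and text[pos] == "(":
            pos += 1  # consume "("
            while True:
                children.append(parse_node())
                if pos >= len(text):
                    raise ValueError("unbalanced parentheses")
                if text[pos] == ",":
                    pos += 1
                elif text[pos] == ")":
                    pos += 1
                    break
                else:
                    raise ValueError(f"unexpected character at {pos}")
        # The node label runs up to the next structural character.
        start = pos
        while pos < len(text) and text[pos] not in ",():":
            pos += 1
        label = text[start:pos].strip()
        if not label:  # unlabeled internal node: invent a placeholder name
            anonymous += 1
            label = f"anon{anonymous}"
        length: Optional[float] = None
        if pos < len(text) and text[pos] == ":":
            pos += 1
            start = pos
            while pos < len(text) and text[pos] not in ",()":
                pos += 1
            length = float(text[start:pos])
        for child, child_length in children:
            edges.append((label, child, child_length))
        return label, length

    parse_node()
    return edges


def edge_id(parent: str, child: str) -> str:
    # Imitates the IDs in the example: edge_<child>_<parent>, dots become underscores.
    return f"edge_{child}_{parent}".replace(".", "_")


# A subtree copied from the example tree (inode14 and everything below it).
subtree = ("((Arabidopsis_thaliana_AAD31363.1:0.004496,"
           "Arabidopsis_thaliana_CAB79970.1:0.009539)inode15:0.090479,"
           "Oryza_sativa_BAB21282.1:0.043596)inode14;")
for parent, child, length in newick_edges(subtree):
    print(edge_id(parent, child), length)
```

The sketch ignores quoted labels and comments, which full Newick and NEXUS parsers have to handle.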
Some Examples • TU

<cdao:TU rdf:ID="Caenorhabditis_elegans_CAA92686_1">
  <cdao:belongs_to_Character_State_Data_Matrix rdf:resource="#Matrix"/>
  <cdao:represented_by_Node rdf:resource="#node_Caenorhabditis_elegans_CAA92686_1"/>
  <cdao:has_Nucleotide_Datum rdf:resource="#datum_Caenorhabditis_elegans_CAA92686_1_char_0"/>
  <cdao:has_Nucleotide_Datum rdf:resource="#datum_Caenorhabditis_elegans_CAA92686_1_char_1"/>
  <cdao:has_Nucleotide_Datum rdf:resource="#datum_Caenorhabditis_elegans_CAA92686_1_char_2"/>
  …
</cdao:TU>

• Character

<cdao:Nucleotide_Character rdf:ID="char_2">
  <cdao:belongs_to_Character_State_Data_Matrix rdf:resource="#Matrix"/>
  <cdao:has_Nucleotide_Datum rdf:resource="#datum_Oryza_sativa_BAB21282_1_char_2"/>
  <cdao:has_Nucleotide_Datum rdf:resource="#datum_Arabidopsis_thaliana_CAB79970_1_char_2"/>
  <cdao:has_Nucleotide_Datum rdf:resource="#datum_Mus_musculus_BAB61955_1_char_2"/>
  …
</cdao:Nucleotide_Character>

• Datum

<cdao:Nucleotide_State_Datum rdf:ID="datum_Caenorhabditis_elegans_CAA92686_1_char_6">
  <cdao:belongs_to_Character rdf:resource="#char_6"/>
  <cdao:belongs_to_TU rdf:resource="#Caenorhabditis_elegans_CAA92686_1"/>
  <cdao:has_Nucleotide_State rdf:resource="#value_a"/>
</cdao:Nucleotide_State_Datum>

• State

<cdao:Nucleotide rdf:ID="value_a">
  <owl:sameAs rdf:resource="#dA"/>
</cdao:Nucleotide>
Simple Reasoning Tasks • Determine which TUs contain a gap in their tables: [FaCT++] (has_Datum some (has_State value gap)) and TU • Determine the ancestors of a TU in the tree: has_Descendant value node_Drosophila_melanogaster_AAF55115_1
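To make the intent of these two class expressions concrete, here is a procedural Python sketch that computes the same answers over plain dictionaries. It does not use an OWL reasoner; the function names and the dictionary layout are assumptions for illustration. The matrix rows and the parent links are copied from the example NEXUS file (a subset of rows, and only the path from the Drosophila leaves to the root).

```python
from typing import Dict, List

# Subset of the example matrix: TU -> row of character states.
matrix: Dict[str, str] = {
    "Arabidopsis_thaliana_CAB79970_1": "gtgtggttgc",
    "Arabidopsis_thaliana_AAD31363_1": "gt---gtggc",
    "Oryza_sativa_BAB21282_1": "ct--------",
    "Mus_musculus_BAB61955_1": "tctgctacac",
    "Drosophila_melanogaster_AAF55115_1": "ac------g-",
}

# Child -> parent links along one path of the example tree.
parent_of: Dict[str, str] = {
    "node_Drosophila_melanogaster_AAF55115_1": "node_inode20",
    "node_Drosophila_melanogaster_AAF55117_1": "node_inode20",
    "node_inode20": "node_inode19",
    "node_inode19": "node_inode18",
    "node_inode18": "node_inode17",
    "node_inode17": "node_inode16",
    "node_inode16": "node_root",
}


def tus_with_gap(rows: Dict[str, str], gap: str = "-") -> List[str]:
    # Counterpart of: (has_Datum some (has_State value gap)) and TU
    return sorted(tu for tu, states in rows.items() if gap in states)


def ancestors(node: str, parents: Dict[str, str]) -> List[str]:
    # Counterpart of: has_Descendant value <node>, i.e. every node above it.
    found: List[str] = []
    while node in parents:
        node = parents[node]
        found.append(node)
    return found


print(tus_with_gap(matrix))
print(ancestors("node_Drosophila_melanogaster_AAF55115_1", parent_of))
```

A reasoner reaches the second answer declaratively, for instance when has_Descendant is treated as transitive; the loop above hard-codes that traversal.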
Simple Reasoning Tasks • Extract the row of a specific TU:

SELECT ?z, ?y
WHERE (<base:Arabidopsis_thaliana_AAD31363_1>, cdao:has_Datum, ?x)
      (?x, cdao:has_State, ?y)
      (?x, cdao:belongs_to_Character, ?z)
USING base FOR <file:/C:/Users/epontell/Documents/Research/Proposals/NEXUS/Research/Perl/inst_matrix.owl#>,
      cdao FOR <http://www.cs.nmsu.edu/~epontell/CURRENT_matrix.owl#>
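The query joins three triple patterns on the shared variable ?x (the datum). A minimal Python sketch of the same join over an in-memory list of triples follows. It is not a query engine, and the helper names `build_triples` and `extract_row` are hypothetical. The identifiers follow the naming seen in the examples (`datum_<TU>_char_<i>`, `char_<i>`, `value_<state>`); using `gap` as the name of the gap state is an assumption.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


def build_triples(tu: str, row: str) -> List[Triple]:
    """Encode one matrix row as has_Datum / has_State / belongs_to_Character triples."""
    triples: List[Triple] = []
    for i, state in enumerate(row):
        datum = f"datum_{tu}_char_{i}"
        state_id = "gap" if state == "-" else f"value_{state}"  # assumed naming
        triples.append((tu, "has_Datum", datum))
        triples.append((datum, "has_State", state_id))
        triples.append((datum, "belongs_to_Character", f"char_{i}"))
    return triples


def extract_row(triples: List[Triple], tu: str) -> List[Tuple[str, str]]:
    """Return (character, state) pairs, i.e. the ?z, ?y bindings of the query."""
    # Pattern 1: (tu, has_Datum, ?x)
    data = [o for s, p, o in triples if s == tu and p == "has_Datum"]
    # Patterns 2 and 3, indexed by ?x so the join is a dictionary lookup.
    state_of = {s: o for s, p, o in triples if p == "has_State"}
    char_of = {s: o for s, p, o in triples if p == "belongs_to_Character"}
    return [(char_of[x], state_of[x]) for x in data
            if x in state_of and x in char_of]


store = build_triples("Arabidopsis_thaliana_AAD31363_1", "gt---gtggc")
store += build_triples("Oryza_sativa_BAB21282_1", "ct--------")
for character, state in extract_row(store, "Arabidopsis_thaliana_AAD31363_1"):
    print(character, state)
```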
Future Work • To facilitate evaluation • Create an OWL 1.0 edition of the ontology (and corresponding NEXUS translator) • Java-level reasoning • Aggregation • Etc. • Large scale NEXUS validation • NeXML Interface • OBO distribution