
Matúš Kalaš, University of Bergen, Norway. BioHackathon, Kyōto, August 21, 2011


Presentation Transcript


  1. EDAM ontology of bioinformatics data and methods, and BioXSD. Matúš Kalaš, University of Bergen, Norway. BioHackathon, Kyōto, August 21, 2011 (Extended version for discussions)

  2. EDAM ontology: EMBRACE Data And Methods ontology. An ontology for annotation of bioinformatics tools, resources, and data. Jon Ison, Peter Rice (PI), Hamish McWilliam, James Malone (EBI, EMBL, Hinxton); Matúš Kalaš, Inge Jonassen (PI) (CBU, Uni Bergen); Steve Pettifer (University of Manchester)

  3. Design principles of EDAM: • Bioinformatics specific, with as few exceptions as necessary • Well-defined scope: operations, types of data (including identifiers), topics, formats • Relevant and usable for users and annotators • Ontologically sane: well-defined concepts and relations, reflecting reality • Maintainable

  4. Scope of EDAM, and example concepts: • Topic: “Phylogenetics”, “Protein classification” • Operation: “Multiple sequence alignment”, “Molecular dynamics simulation” • Data: “Sequence trace”, “Position frequency matrix” • Format: “FASTQ”, “SBML”

  5. EDAM sub-ontologies and types of relations:

  6. Example annotation with EDAM: Examples: SAWSDL, EMBOSS (similarly also within data-resource annotation in DRCAT)

  7. 2nd example annotation with EDAM: Example: DRCAT

  8. BioXSD: An XML exchange format for basic bioinformatics data. Edita Bartaševičiūtė, Kristoffer Rapacki (PI) (CBS, DTU, Greater Copenhagen); Jon Ison (EBI, EMBL, Hinxton); Alexandre Joseph, Christophe Blanchet (PI) (IBCP, CNRS, Lyon); Steve Pettifer (University of Manchester); Matúš Kalaš, Jan Christian Bryne*, Armin Töpfer**, Pål Puntervoll (PI), Inge Jonassen (PI) (CBU, BCCS, Bergen). * now Oslo University Hospital ** now Uni Bielefeld, moving to Basel

  9. Goals of BioXSD: • Being an XSD-based XML format, to complement RDF and plain-text formats • Filling the gap between specialised XSD-based exchange formats (such as SBML, MAGE-ML, PDBML, phyloXML, PSI-MI MIF, GCDML, GLYDE-II, …) • Compatible with XML libraries for all main programming languages • As lightweight as possible, but fitting everyone • Developed and maintained in an open but organised collaboration, welcoming requests from the community • Detailed structure: in-depth validation, semantic annotation (EDAM), efficient compression (EXI)

  10. BioXSD is an exchange format for basic bioinformatics data: • biomolecular sequences • sequence alignments • sequence and genome features (annotation) • references to data, accessions, …

  11. http://EDAMontology.sourceforge.net • EDAM at NCBO BioPortal: http://bioportal.bioontology.org/ontologies/1498 • http://BioXSD.org • http://drcat.sourceforge.net (a catalogue of databases annotated with EDAM)

  12. Additional stuff

  13. EDAM- & BioXSD-related topics for BH11: • EDAM for new applications & for semantic data (plus: what is the best RDF representation for annotation of tools & resources?) • BioXSD to RDF, RDF to BioXSD, SPARQLing of BioXSD data (firstly: what is the best RDF representation for sequence/alignment/feature data?) • BioXSD support in Open Bio*: import & export of BioXSD into/from BioPython, BioRuby, BioPerl, BioJava • Compatibility of BioXSD with other bioinformatics XSDs: on the conceptual & design level, and on the level of data integration

  14. Acknowledgement: Big thanks to the BioHackathon organisers and to the sources of funding of BH! Projects contributing to EDAM & BioXSD, and their sources of funding:

  15. Example of an EDAM concept:

  16. Example of an EDAM concept:
      id: EDAM:0001099
      name: UniProt accession
      subset: identifier
      subset: data
      namespace: identifier
      def: "Accession number of a UniProt database entry."
      regex: "[A-NR-Z][0-9][A-Z][A-Z0-9][A-Z0-9][0-9]" "[OPQ][0-9][A-Z0-9][A-Z0-9][A-Z0-9][0-9]"
      example: P43353 Q9C199 A5A6J6
      synonym: "UniProtKB accession number" EXACT []
      synonym: "UniProtKB entry accession" EXACT []
      synonym: "UniProt accession number" EXACT []
      synonym: "Swiss-Prot accession" EXACT []
      is_a: EDAM:0002091 ! Accession
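The `regex` field of the concept on slide 16 can be used directly to check identifiers. The following is a minimal Python sketch, assuming that the two quoted patterns are alternatives and that each must match the whole string; the function name `is_uniprot_accession` is made up for this illustration and is not part of EDAM.

```python
import re

# The two alternative patterns listed in the "regex" field of EDAM:0001099.
UNIPROT_PATTERNS = (
    r"[A-NR-Z][0-9][A-Z][A-Z0-9][A-Z0-9][0-9]",
    r"[OPQ][0-9][A-Z0-9][A-Z0-9][A-Z0-9][0-9]",
)

# Assumption: the patterns are ORed together and anchored to the full string.
_UNIPROT_RE = re.compile("|".join(f"(?:{p})" for p in UNIPROT_PATTERNS))


def is_uniprot_accession(text: str) -> bool:
    """Return True if text matches one of the two patterns in full."""
    return _UNIPROT_RE.fullmatch(text) is not None


if __name__ == "__main__":
    # The first three values are the "example" entries of the concept.
    for candidate in ("P43353", "Q9C199", "A5A6J6", "p43353", "P4335"):
        print(candidate, is_uniprot_accession(candidate))
```

The three `example` values from the concept match, while a lower-case or shortened string does not.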

  17. Use cases driving EDAM are: • Searching for tools and resources, and categorising them • Tool & data integration: automation of data handling or even workflow composition; vocabulary for semantically rich data (incl. RDF) • Data provenance: how data was created and processed

  18. How to annotate the tools and resources? • Annotate a formal description of the tool: - WSDL (SOAP Web services) (SAWSDL standard) - WADL (Web applications, URL & REST services) • Annotate in a dedicated catalogue: - for example DRCAT (online bio databases), http://drcat.sourceforge.net - RDF (needs additional vocabulary/ies)

  19. Who should annotate the tools and resources with EDAM? • Providers and users, preferably not catalogue curators

  20. How to annotate SOAP Web services? wsdl:definitions wsdl:service wsdl:port wsdl:binding wsdl:portType *wsdl:operation wsdl:input wsdl:fault wsdl:output wsdl:message wsdl:part xs:element xs:complexType xs:sequence *xs:element more types, elements, attributes, enumerations .. SAWSDL Using URIs of concepts
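SAWSDL attaches concept URIs to WSDL and XML Schema components through the `sawsdl:modelReference` attribute. As a rough sketch of how a consumer might read such annotations, the Python below collects them from a small schema using only the standard library. The schema text is a simplified stand-in written for this example (the `base="xs:string"` restrictions and the `Unannotated` type are assumptions); only the type name `NucleotideSequence` and the EDAM URI are taken from slide 30.

```python
import xml.etree.ElementTree as ET

SAWSDL_NS = "http://www.w3.org/ns/sawsdl"
MODEL_REFERENCE = f"{{{SAWSDL_NS}}}modelReference"

# Simplified, hypothetical schema used only for illustration.
SCHEMA_SNIPPET = """\
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns:sawsdl="http://www.w3.org/ns/sawsdl">
  <xs:simpleType name="NucleotideSequence"
                 sawsdl:modelReference="http://purl.org/edam/data/0001211">
    <xs:restriction base="xs:string"/>
  </xs:simpleType>
  <xs:simpleType name="Unannotated">
    <xs:restriction base="xs:string"/>
  </xs:simpleType>
</xs:schema>
"""


def collect_model_references(schema_text: str) -> dict:
    """Map each annotated component's name to its list of concept URIs."""
    root = ET.fromstring(schema_text)
    annotations = {}
    for element in root.iter():
        value = element.get(MODEL_REFERENCE)
        if value is not None:
            # A modelReference may hold several whitespace-separated URIs.
            annotations[element.get("name")] = value.split()
    return annotations


if __name__ == "__main__":
    for name, uris in collect_model_references(SCHEMA_SNIPPET).items():
        print(name, uris)
```

The same loop would work on a WSDL document, since the attribute is looked up by its namespace-qualified name regardless of which element carries it.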

  21. Advantages of XML Schema * Some of these are advantages over OWL/RDF too, in addition to being advantages over plain-text or TSV format specifications (but on the other hand, OWL/RDF of course has some advantages over XSD/XML)

  22. BioXSD 1.1 beta1 types: • SimpleTypes: NucleotideSequence, AminoacidSequence, GeneralNucleotideSequence, GeneralAminoacidSequence, Biosequence, Accession(s); helper types: Name, Text, Uri, Integer(s), Decimal(s), … and a few more • ComplexTypes: NucleotideSequenceRecord, AminoacidSequenceRecord, GeneralNucleotideSequenceRecord, GeneralAminoacidSequenceRecord, BiosequenceRecord, ..SequenceAlignment(s), AnnotatedSequence, DatabaseReference, EntryReference, OntologyReference, OntologyConcept, Species, SequenceReference, Method; helper types: Score, SequencePosition(s), … a few more

  23. BioXSD:BiosequenceRecord BioXSD:BiosequenceAlignment

  24. BioXSD:AnnotatedSequence BioXSD:AnnotatedSequence

  25. BioXSD can be used: • Directly as an input/output format of tools • BioXSD can be extended, restricted, or included within other formats • BioXSD can serve as the intermediate canonical format

  26. >sp|P43353|AL3B1_HUMAN Aldehyde dehydrogenase family 3 member B1 OS=Homo sapiens GN=ALDH3B1 PE=1 SV=1 MDPLGDTLRRLREAFHAGRTRPAEFRAAQLQGLGRFLQENKQLLHDALAQDLHKSAFESEVSEVAISQGEVTLALRNLRAWMKDERVPKNLATQLDSAFIRKEPFGLVLIIAPWNYPLNLTLVPLVGALAAGNCVVLKPSEISKNVEKILAEVLPQYVDQSCFAVVLGGPQETGQLLEHRFDYIFFTGSPRVGKIVMTAAAKHLTPVTLELGGKNPCYVDDNCDPQTVANRVAWFRYFNAGQTCVAPDYVLCSPEMQERLLPALQSTITRFYGDDPQSSPNLGRIINQKQFQRLRALLGCGRVAIGGQSDESDRYIAPTVLVDVQEMEPVMQEEIFGPILPIVNVQSLDEAIEFINRREKPLALYAFSNSSQVVKRVLTQTSSGGFCGNDGFMHMTLASLPFGGVGASGMGRYHGKFSFDTFSHHRACLLRSPGMEKLNALRYPPQSPRRLRMLLVAMEAQGCSCTLL >AL3B1_HUMAN P43353 ALDEHYDE DEHYDROGENASE 3B1 (EC 1.2.1.5). - Homo sapiens (Human). MDPLGDTLRRLREAFHAGRTRPAEFRAAQLQGLGRFLQENKQLLHDALAQDLHKSAFESEVSEVAISQGEVTLALRNLRAWMKDERVPKNLATQLDSAFIRKEPFGLVLIIAPWNYPLNLTLVPLVGALAAGNCVVLKPSEISKNVEKILAEVLPQYVDQSCFAVVLGGPQETGQLLEHRFDYIFFTGSPRVGKIVMTAAAKHLTPVTLELGGKNPCYVDDNCDPQTVANRVAWFRYFNAGQTCVAPDYVLCSPEMQERLLPALQSTITRFYGDDPQSSPNLGRIINQKQFQRLRALLGCGRVAIGGQSDESDRYIAPTVLVDVQEMEPVMQEEIFGPILPIVNVQSLDEAIEFINRREKPLALYAFSNSSQVVKRVLTQTSSGGFCGNDGFMHMTLASLPFGGVGASGMGRYHGKFSFDTFSHHRACLLRSPGMEKLNALRYPPQSPRRLRMLLVAMEAQGCSCTLL >gi|4502043|ref|NP_000685.1| aldehyde dehydrogenase family 3 member B1 isoform a [Homo sapiens] MDPLGDTLRRLREAFHAGRTRPAEFRAAQLQGLGRFLQENKQLLHDALAQDLHKSAFESEVSEVAISQGEVTLALRNLRAWMKDERVPKNLATQLDSAFIRKEPFGLVLIIAPWNYPLNLTLVPLVGALAAGNCVVLKPSEISKNVEKILAEVLPQYVDQSCFAVVLGGPQETGQLLEHRFDYIFFTGSPRVGKIVMTAAAKHLTPVTLELGGKNPCYVDDNCDPQTVANRVAWFRYFNAGQTCVAPDYVLCSPEMQERLLPALQSTITRFYGDDPQSSPNLGRIINQKQFQRLRALLGCGRVAIGGQSDESDRYIAPTVLVDVQEMEPVMQEEIFGPILPIVNVQSLDEAIEFINRREKPLALYAFSNSSQVVKRVLTQTSSGGFCGNDGFMHMTLASLPFGGVGASGMGRYHGKFSFDTFSHHRACLLRSPGMEKLNALRYPPQSPRRLRMLLVAMEAQGCSCTLL >sp_ac|P43353 \ID= AL3B1_HUMAN \DE="Aldehyde dehydrogenase family 3 member B1 (Aldehyde dehydrogenase 7)" \NCBITAXID=9606 MDPLGDTLRRLREAFHAGRTRPAEFRAAQLQGLGRFLQENKQLLHDALAQDLHKSAFESEVSEVAISQGEVTLALRNLRAWMKDERVPKNLATQLDSAFIRKEPFGLVLIIAPWNYPLNLTLVPLVGALAAGNCVVLKPSEI

  27. Sequence record in BioXSD 1.0:
      <mySequence xsi:type="AminoacidSequenceRecord">
        <sequence>MDPLGDTLRRLREAFHAGRTRPAEFRAAQLQGLGRFLQENKQLLHDALAQDLHKSAFESEVSEVAISQGEVTLALRNLRAWMKDERVPKNLATQLDSAFIRKEPFGLVLIIAPWNYPLNLTLVPLVGALAAGNCVVLKPSEISKNVEKILAEVLPQYVDQSCFAVVLGGPQETGQLLEHRFDYIFFTGSPRVGKIVMTAAAKHLTPVTLELGGKNPCYVDDNCDPQTVANRVAWFRYFNAGQTCVAPDYVLCSPEMQERLLPALQSTITRFYGDDPQSSPNLGRIINQKQFQRLRALLGCGRVAIGGQSDESDRYIAPTVLVDVQEMEPVMQEEIFGPILPIVNVQSLDEAIEFINRREKPLALYAFSNSSQVVKRVLTQTSSGGFCGNDGFMHMTLASLPFGGVGASGMGRYHGKFSFDTFSHHRACLLRSPGMEKLNALRYPPQSPRRLRMLLVAMEAQGCSCTLL</sequence>
        <species>
          <databaseName>NCBI Taxonomy</databaseName>
          <accession>9606</accession>
          <entryUri>http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=9606</entryUri>
          <name>Human</name>
        </species>
        <customName>Aldehyde dehydrogenase family 3 member B1 (ALDH3B1)</customName>
        <formalReference>
          <databaseName>UniProt</databaseName>
          <accession xsi:type="UniprotAccession">P43353</accession>
          <entryUri>http://www.uniprot.org/uniprot/P43353</entryUri>
          <sequenceVersion>1</sequenceVersion>
          <isoformAccession xsi:type="ExtendedUniprotAccession">P43353-1</isoformAccession>
        </formalReference>
      </mySequence>

  28. Example data in BioXSD 1.1 beta1 format: a sequence record
      <exampleSequenceRecord xsi:type="bx:NucleotideSequenceRecord">
        <bx:sequence>gtgcgagaggcccgtgccgccgtgcgcgctgcctacgaggctttctgccgctggagggaggtc</bx:sequence>
        <bx:species
          dbName="NCBI Taxonomy"
          dbUri="http://www.ncbi.nlm.nih.gov/taxonomy"
          accession="9598"
          entryUri="http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=9598"
          speciesName="Chimp"
        />
        <bx:reference
          dbName="GenBank/Nucleotide"
          dbUri="http://www.ncbi.nlm.nih.gov/nuccore"
          accession="NM_001008991"
          entryUri="http://www.ncbi.nlm.nih.gov/nuccore/NM_001008991"
          sequenceVersion="1"
        >
          <bx:subsequencePosition>
            <bx:segment min="282" max="345"/>
          </bx:subsequencePosition>
        </bx:reference>
        <bx:name>snippet of aldehyde dehydrogenase 5 family, member A1 (ALDH5A1)</bx:name>
        <bx:note>nuclear gene encoding mitochondrial protein, mRNA (GI:57113868)</bx:note>
      </exampleSequenceRecord>
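One of the stated goals of BioXSD is compatibility with ordinary XML libraries. As an illustration, the sketch below reads a shortened copy of the record from slide 28 with Python's standard `xml.etree.ElementTree`. The slide does not show which namespace URI the `bx` prefix is bound to, so the value of `BX_NS` is a placeholder assumption, and `summarise_record` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# ASSUMPTION: placeholder URI; the slide does not show the real "bx" namespace.
BX_NS = "http://example.org/bioxsd-placeholder"
XSI_NS = "http://www.w3.org/2001/XMLSchema-instance"
NS = {"bx": BX_NS}

# Shortened copy of the slide's record, with namespace declarations added.
RECORD = f"""\
<exampleSequenceRecord xmlns:bx="{BX_NS}" xmlns:xsi="{XSI_NS}"
                       xsi:type="bx:NucleotideSequenceRecord">
  <bx:sequence>gtgcgagaggcccgtgccgccgtgcgcgctgcctacgaggctttctgccgctggagggaggtc</bx:sequence>
  <bx:species dbName="NCBI Taxonomy" accession="9598" speciesName="Chimp"/>
  <bx:reference dbName="GenBank/Nucleotide" accession="NM_001008991" sequenceVersion="1">
    <bx:subsequencePosition>
      <bx:segment min="282" max="345"/>
    </bx:subsequencePosition>
  </bx:reference>
</exampleSequenceRecord>
"""


def summarise_record(xml_text: str) -> dict:
    """Pull a few fields out of a BioXSD-style sequence record."""
    root = ET.fromstring(xml_text)
    segment = root.find("bx:reference/bx:subsequencePosition/bx:segment", NS)
    return {
        "type": root.get(f"{{{XSI_NS}}}type"),
        "sequence": root.findtext("bx:sequence", namespaces=NS),
        "taxon": root.find("bx:species", NS).get("accession"),
        "reference": root.find("bx:reference", NS).get("accession"),
        "segment": (int(segment.get("min")), int(segment.get("max"))),
    }


if __name__ == "__main__":
    for key, value in summarise_record(RECORD).items():
        print(key, value)
```

Because the structure is explicit, no header-parsing heuristics are needed to recover the species, database reference, or subsequence position.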

  29. Sequence record in BioXSD 1.1 beta1:

  30. Sequence-string restriction in BioXSD:
      <xs:simpleType name="NucleotideSequence" sawsdl:modelReference="http://purl.org/edam/data/0001211">
        <xs:annotation>
          <xs:documentation>Nucleotide sequence without ambiguous ("degenerate") bases</xs:documentation>
        </xs:annotation>
        <xs:restriction base="GenericNucleotideSequence">
          <xs:pattern value="[acgt]+"/>
          <xs:pattern value="[acgu]+"/>
        </xs:restriction>
      </xs:simpleType>
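In XML Schema, several `xs:pattern` facets in the same restriction are alternatives, and each pattern must match the whole value. The Python sketch below mirrors just that part of the `NucleotideSequence` type from slide 30. It does not reproduce whatever the base type `GenericNucleotideSequence` additionally requires, since that definition is not shown, and the function name is made up for this example.

```python
import re

# One compiled expression per xs:pattern facet of NucleotideSequence.
NUCLEOTIDE_PATTERNS = (re.compile(r"[acgt]+"), re.compile(r"[acgu]+"))


def is_nucleotide_sequence(text: str) -> bool:
    """Check text against the two pattern facets only (base type ignored)."""
    # fullmatch mirrors XSD's implicit anchoring of patterns;
    # any() mirrors the OR between pattern facets of one restriction.
    return any(p.fullmatch(text) for p in NUCLEOTIDE_PATTERNS)


if __name__ == "__main__":
    for candidate in ("acgt", "acgu", "acgtu", "ACGT", "acgn"):
        print(candidate, is_nucleotide_sequence(candidate))
```

A pure DNA or pure RNA string passes, while a string mixing `t` and `u`, using upper case, or containing an ambiguity code such as `n` is rejected.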

  31. Strategy recommended by the EMBRACE network of excellence (2005 - 2010)

  32. The EMBRACE project partners: EMBL-EBI, Hinxton, UK; EMBL, Heidelberg, Germany; ITB, CNR, Bari, Italy; University of Manchester, UK; SIB, Geneva, Switzerland; SLU, Sweden; CNRS, Clermont-Ferrand and Lyon, France; CBS, DTU, Lyngby, Denmark; CSIC, Madrid, Spain; University of Stockholm, Sweden; INRIA-UCBL, Lyon, France; MPIMG, Berlin, Germany; CSC, Espoo, Finland; UCL, London, UK; The Weizmann Institute, Rehovot, Israel; University of Nijmegen, Netherlands; INTA, Madrid, Spain; CBU, BCCS, Bergen, Norway
