Explore the processes of structural and functional annotation and learn about gene ontology with this comprehensive guide. Discover tools for managing large datasets, bio-ontologies, and resources for functional modeling in genomics.
Introduction to the GO: a user’s guide
Iowa State Workshop, 11 June 2009
Genomic Annotation • Genome annotation is the process of attaching biological information to genomic sequences. It consists of two main steps: • identifying functional elements in the genome: “structural annotation” • attaching biological information to these elements: “functional annotation” • biologists often use the term “annotation” when they are referring only to structural annotation
[Figure: structural annotation. DNA annotation (TRAF 1, 2 and 3; TRAF 1 and 2) and protein annotation (CHICK_OLF6). Data from Ensembl Genome browser.]
[Figure: functional annotation example: catenin.]
Structural & Functional Annotation Structural Annotation: • Open reading frames (ORFs) predicted during genome assembly • predicted ORFs require experimental confirmation • the Sequence Ontology (SO) provides a structured controlled vocabulary for sequence annotation Functional Annotation: • annotation of gene products = Gene Ontology (GO) annotation • initially, predicted ORFs have no functional literature and GO annotation relies on computational methods (rapid) • functional literature exists for many genes/proteins prior to genome sequencing • GO annotation does not rely on a completed genome sequence!
AgBase • Provides structural annotation for agriculturally important genomes • Provides functional annotation (GO) • Provides tools for functional modeling • Provides bioinformatics & modeling support for research community
Introduction to GO • pre-GO: managing large datasets • Bio-ontologies • the Gene Ontology (GO) • a GO annotation example • GO evidence codes • literature biocuration & computational analysis • ND vs no GO • sources of GO
AgBase User Support • Functional modeling training • Database ID mapping (approx. 75% of requests) • Providing GO annotation for datasets/arrays • Assistance with GO modeling tools • Intermediary between the research community and public databases (NCBI, UniProtKB, GO Consortium) • Computational assistance
Converting database accessions
1. UniProt database
2. Ensembl BioMart
3. Online analysis tools (DAVID, g:profiler, etc.)
4. AgBase database (ArrayIDer tool)
More information about these tools is available from the online workshop resources.
2. Ensembl BioMart NOTE: Ensembl is scheduled to add plant & microbe species in 2009.
3. Online analysis tools g:profiler conversion tool http://biit.cs.ut.ee/gprofiler/gconvert.cgi This tool works for all species found in Ensembl.
3. Online analysis tools Database for Annotation, Visualization and Integrated Discovery (DAVID) http://david.abcc.ncifcrf.gov/conversion.jsp This tool works for a wide range of species.
4. AgBase: ArrayIDer Contact AgBase to request additional species.
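Whichever conversion tool is used, the result is essentially a two-column mapping table. The sketch below shows one way such a table could be applied to a list of IDs, and how IDs with no mapping can be reported rather than silently dropped. The table contents, the column names and the `GENE_A`/`GENE_B`/`GENE_C` identifiers are hypothetical placeholders, not output from any of the tools above.

```python
import csv
import io

# Hypothetical two-column mapping table (source ID -> UniProt accession).
# In practice this would be exported from one of the conversion tools;
# the pairings below are placeholders for illustration only.
MAPPING_TSV = """source_id\tuniprot_accession
GENE_A\tP52505
GENE_B\tP05147
"""


def load_mapping(handle):
    """Read a tab-delimited mapping table into a dict."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["source_id"]: row["uniprot_accession"] for row in reader}


def convert_ids(ids, mapping):
    """Split input IDs into converted pairs and IDs with no mapping."""
    converted = {i: mapping[i] for i in ids if i in mapping}
    unmapped = [i for i in ids if i not in mapping]
    return converted, unmapped


mapping = load_mapping(io.StringIO(MAPPING_TSV))
converted, unmapped = convert_ids(["GENE_A", "GENE_C"], mapping)
print("converted:", converted)
print("unmapped:", unmapped)
```

Keeping the unmapped IDs in a separate list makes it easy to retry them with a second tool.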
Bio-ontologies • Bio-ontologies are used to capture biological information in a way that can be read by both humans and computers. • necessary for high-throughput “omics” datasets • allows data sharing across databases • Objects in an ontology (eg. genes, cell types, tissue types, stages of development) are well defined. • The ontology shows how the objects relate to each other.
Bio-ontologies: http://www.obofoundry.org/
Ontologies • relationships between terms • digital identifier (computers) • description (humans)
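The three ingredients of an ontology term (identifier, description, relationships) can be illustrated with a small data structure. This is a minimal sketch, assuming a hand-written excerpt of four GO terms in which intermediate terms and additional parents are omitted, so the direct links shown are a simplification of the real GO graph. The `Term` class and `ancestors` function are illustrative names, not part of any GO software.

```python
from dataclasses import dataclass, field


@dataclass
class Term:
    term_id: str                               # digital identifier (computers)
    name: str                                  # description (humans)
    is_a: list = field(default_factory=list)   # relationships to parent terms


# Simplified excerpt: the real GO graph has more terms and more parents.
TERMS = {
    t.term_id: t
    for t in [
        Term("GO:0008150", "biological_process"),
        Term("GO:0009058", "biosynthetic process", ["GO:0008150"]),
        Term("GO:0008610", "lipid biosynthetic process", ["GO:0009058"]),
        Term("GO:0006633", "fatty acid biosynthetic process", ["GO:0008610"]),
    ]
}


def ancestors(term_id, terms):
    """Return the IDs of all terms reachable by following is_a links upward."""
    seen = set()
    stack = list(terms[term_id].is_a)
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(terms[parent].is_a)
    return seen


for parent_id in sorted(ancestors("GO:0006633", TERMS)):
    print(parent_id, TERMS[parent_id].name)
```

Because the relationships are explicit, an annotation to a specific term can also be found when searching with a more general term.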
Functional Annotation • Gene Ontology (GO) is the de facto method for functional annotation • Widely used for functional genomics (high throughput) • Many tools available for gene expression analysis using GO • The GO Consortium homepage: http://www.geneontology.org
NDUFAB1 GO Mapping Example
NDUFAB1 (UniProt P52505): Bovine NADH dehydrogenase (ubiquinone) 1, alpha/beta subcomplex, 1, 8kDa
Biological Process (BP or P)
• GO:0006633 fatty acid biosynthetic process (TAS)
• GO:0006120 mitochondrial electron transport, NADH to ubiquinone (TAS)
• GO:0008610 lipid biosynthetic process (IEA)
Molecular Function (MF or F)
• GO:0005504 fatty acid binding (IDA)
• GO:0008137 NADH dehydrogenase (ubiquinone) activity (TAS)
• GO:0016491 oxidoreductase activity (TAS)
• GO:0000036 acyl carrier activity (IEA)
Cellular Component (CC or C)
• GO:0005759 mitochondrial matrix (IDA)
• GO:0005747 mitochondrial respiratory chain complex I (IDA)
• GO:0005739 mitochondrion (IEA)
Reading the NDUFAB1 GO mapping example: each annotation has four parts
• aspect or ontology: Biological Process (BP or P), Molecular Function (MF or F) or Cellular Component (CC or C)
• GO:ID (unique), e.g. GO:0006633
• GO term name, e.g. fatty acid biosynthetic process
• GO evidence code, e.g. TAS
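The NDUFAB1 example can be held as a list of records, one per annotation, each carrying the GO ID, term name, evidence code and aspect. The sketch below is one possible representation; the `Annotation` type and `group_by_aspect` function are illustrative names chosen here, while the values are taken from the example.

```python
from collections import defaultdict
from typing import NamedTuple


class Annotation(NamedTuple):
    accession: str   # gene product, here a UniProt accession
    go_id: str       # unique GO identifier
    term_name: str   # GO term name
    evidence: str    # GO evidence code
    aspect: str      # "P", "F" or "C"


ACC = "P52505"  # NDUFAB1
NDUFAB1 = [
    Annotation(ACC, "GO:0006633", "fatty acid biosynthetic process", "TAS", "P"),
    Annotation(ACC, "GO:0006120",
               "mitochondrial electron transport, NADH to ubiquinone", "TAS", "P"),
    Annotation(ACC, "GO:0008610", "lipid biosynthetic process", "IEA", "P"),
    Annotation(ACC, "GO:0005504", "fatty acid binding", "IDA", "F"),
    Annotation(ACC, "GO:0008137", "NADH dehydrogenase (ubiquinone) activity", "TAS", "F"),
    Annotation(ACC, "GO:0016491", "oxidoreductase activity", "TAS", "F"),
    Annotation(ACC, "GO:0000036", "acyl carrier activity", "IEA", "F"),
    Annotation(ACC, "GO:0005759", "mitochondrial matrix", "IDA", "C"),
    Annotation(ACC, "GO:0005747", "mitochondrial respiratory chain complex I", "IDA", "C"),
    Annotation(ACC, "GO:0005739", "mitochondrion", "IEA", "C"),
]


def group_by_aspect(annotations):
    """Group annotation records by their aspect letter (P, F or C)."""
    grouped = defaultdict(list)
    for ann in annotations:
        grouped[ann.aspect].append(ann)
    return dict(grouped)


for aspect, anns in group_by_aspect(NDUFAB1).items():
    print(aspect, [a.go_id for a in anns])
```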
GO EVIDENCE CODES
Direct Evidence Codes
• IDA - inferred from direct assay
• IEP - inferred from expression pattern
• IGI - inferred from genetic interaction
• IMP - inferred from mutant phenotype
• IPI - inferred from physical interaction
Indirect Evidence Codes
inferred from literature
• IGC - inferred from genomic context
• TAS - traceable author statement
• NAS - non-traceable author statement
• IC - inferred by curator
inferred by sequence analysis
• RCA - inferred from reviewed computational analysis
• IS* - inferred from sequence*
  • ISS - inferred from sequence or structural similarity
  • ISA - inferred from sequence alignment
  • ISO - inferred from sequence orthology
  • ISM - inferred from sequence model
• IEA - inferred from electronic annotation
Other
• NR - not recorded (historical)
• ND - no biological data available
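The grouping of evidence codes can be written down as a lookup table, which makes it straightforward to filter annotations by the kind of evidence behind them. This is a sketch that follows the grouping used in this presentation; the group labels and function names are chosen here for illustration, and the sample data are the NDUFAB1 molecular function annotations.

```python
# Evidence code groups, following the grouping used in this presentation.
EVIDENCE_GROUPS = {
    "direct": {"IDA", "IEP", "IGI", "IMP", "IPI"},
    "literature": {"IGC", "TAS", "NAS", "IC"},
    "sequence_analysis": {"RCA", "ISS", "ISA", "ISO", "ISM", "IEA"},
    "other": {"NR", "ND"},
}


def evidence_group(code):
    """Return the group label for an evidence code."""
    for group, codes in EVIDENCE_GROUPS.items():
        if code in codes:
            return group
    raise ValueError(f"unknown evidence code: {code}")


def direct_only(annotations):
    """Keep (go_id, evidence) pairs supported by a direct evidence code."""
    return [a for a in annotations if evidence_group(a[1]) == "direct"]


# NDUFAB1 molecular function annotations as (GO ID, evidence code) pairs.
NDUFAB1_MF = [
    ("GO:0005504", "IDA"),
    ("GO:0008137", "TAS"),
    ("GO:0016491", "TAS"),
    ("GO:0000036", "IEA"),
]

print(direct_only(NDUFAB1_MF))
```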
Biocuration of literature • detailed function • “depth” • slower (manual)
Biocuration of Literature: detailed gene function (example: P05147, PMID: 2976880)
• Find a paper about the protein.
• Read the paper to get experimental evidence of function.
• Use the most specific term possible.
• Example: the experiment assayed kinase activity, so use the IDA evidence code.
Sequence analysis • rapid (computational) • “breadth” of coverage • less detailed than biocuration of literature
Computational GO annotation (“breadth”) (pipelines: Ranjit Kumar)
IEA PIPELINE
1. fasta file of sequences (aa or nt)
2. InterPro analysis (domains/motifs)
3. domains/motifs in sequence: use the InterPro2GO mapping file to assign GO (IEA)
4. no GO: “ND”
5. ga file
ISO PIPELINE
1. accessions from your species (species 1)
2. public orthology prediction tool(s)
3. 1:1 orthologs
4. existing GO annotations for the orthologs
5. transfer GO annotation to your species (ISO)
6. accessions with no ISO are listed separately
7. ga file (integrate output into one ga file)
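The core step of each pipeline is a lookup: domains to GO terms for IEA, orthologs to existing GO annotations for ISO. The sketch below illustrates only that logic on in-memory dictionaries. It does not run InterPro or an orthology tool, and all identifiers (`seq1`, `IPR_EXAMPLE_A`, `myA`, `orthA`) and the domain-to-GO pairing are hypothetical placeholders.

```python
def iea_annotate(domain_hits, interpro2go):
    """Assign GO terms (IEA) from domain hits; sequences with no GO get ND.

    domain_hits: {sequence ID: [domain/motif IDs found in the sequence]}
    interpro2go: {domain/motif ID: [GO IDs]}
    """
    rows = []
    for seq_id, domains in domain_hits.items():
        go_ids = sorted({go for d in domains for go in interpro2go.get(d, [])})
        if go_ids:
            rows.extend((seq_id, go, "IEA") for go in go_ids)
        else:
            # This sketch leaves the term empty for ND rows.
            rows.append((seq_id, None, "ND"))
    return rows


def iso_transfer(orthologs, known_annotations):
    """Transfer existing GO annotations across 1:1 orthologs (ISO).

    orthologs: {accession in your species: 1:1 ortholog accession}
    known_annotations: {ortholog accession: [GO IDs]}
    Returns the transferred rows and the accessions with no ISO.
    """
    rows, no_iso = [], []
    for acc, orth in orthologs.items():
        go_ids = known_annotations.get(orth, [])
        if go_ids:
            rows.extend((acc, go, "ISO") for go in go_ids)
        else:
            no_iso.append(acc)
    return rows, no_iso


# Hypothetical inputs for illustration only.
iea_rows = iea_annotate(
    {"seq1": ["IPR_EXAMPLE_A"], "seq2": ["IPR_EXAMPLE_B"], "seq3": []},
    {"IPR_EXAMPLE_A": ["GO:0000036"]},
)
iso_rows, no_iso = iso_transfer(
    {"myA": "orthA", "myB": "orthB"},
    {"orthA": ["GO:0005739"]},
)
print(iea_rows)
print(iso_rows, no_iso)
```

Both functions return rows in the same shape, so their output can be combined before writing a single ga file.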
Unknown Function vs No GO • ND – no data • Biocurators have tried to add GO but there is no functional data available • Previously: “process_unknown”, “function_unknown”, “component_unknown” • Now: “biological process”, “molecular function”, “cellular component” • No annotations (including no “ND”): biocurators have not annotated
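The difference between “ND” and “no GO” can be made explicit when summarising a dataset. The sketch below assumes annotations are available as (GO ID, evidence code) pairs per accession and treats a gene product as “ND” only when every annotation it has carries the ND code. The accessions `ACC1` to `ACC3` are hypothetical; the three GO IDs are the root terms biological_process, molecular_function and cellular_component.

```python
def annotation_status(accession, annotations):
    """Classify a gene product as annotated, ND, or not annotated.

    annotations: {accession: [(GO ID, evidence code), ...]}
    """
    rows = annotations.get(accession, [])
    if not rows:
        # No annotations at all: biocurators have not annotated it.
        return "not annotated"
    if all(evidence == "ND" for _, evidence in rows):
        # Biocurators looked, but no functional data were available.
        return "no data (ND)"
    return "annotated"


# Hypothetical accessions for illustration only.
ANNOTATIONS = {
    "ACC1": [("GO:0005739", "IEA")],
    "ACC2": [("GO:0008150", "ND"), ("GO:0003674", "ND"), ("GO:0005575", "ND")],
}

for acc in ["ACC1", "ACC2", "ACC3"]:
    print(acc, annotation_status(acc, ANNOTATIONS))
```

A finer-grained version could report the status separately for each of the three aspects.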
Primary sources of GO: from the GO Consortium (GOC) & GOC members • most up to date • most comprehensive • Secondary sources: other resources that use GO provided by GOC members • public databases (eg. NCBI, UniProtKB) • genome browsers (eg. Ensembl) • array vendors (eg. Affymetrix) • GO expression analysis tools
Different tools and databases display the GO annotations differently. • Because GO terms are continually changing and GO annotations are continually being added, you need to know when the GO annotations in a given source were last updated.
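When annotations are available as a gene association (ga) file, the date and evidence code of each annotation can be read directly from the file. The sketch below assumes the tab-delimited gene association format, in which column 7 holds the evidence code and column 14 the date as YYYYMMDD, with comment lines starting with “!”. The sample rows, including their dates, reference and “EXAMPLE” source, are made up for illustration.

```python
from collections import Counter
from datetime import datetime


def summarise_gaf(lines):
    """Return (latest annotation date, evidence code counts) for ga-file lines."""
    latest = None
    evidence = Counter()
    for line in lines:
        if not line.strip() or line.startswith("!"):
            continue  # skip blank lines and comment/header lines
        cols = line.rstrip("\n").split("\t")
        evidence[cols[6]] += 1                                 # column 7: evidence code
        date = datetime.strptime(cols[13], "%Y%m%d").date()    # column 14: date
        if latest is None or date > latest:
            latest = date
    return latest, evidence


def make_row(go_id, evidence_code, date):
    """Build one made-up 15-column row for the example below."""
    cols = ["UniProtKB", "P52505", "NDUFAB1", "", go_id, "REF:placeholder",
            evidence_code, "", "C", "", "", "protein", "taxon:9913", date, "EXAMPLE"]
    return "\t".join(cols)


SAMPLE = [
    "!gaf-version: 1.0",
    make_row("GO:0005739", "IEA", "20090601"),
    make_row("GO:0005759", "IDA", "20080115"),
]

latest, evidence = summarise_gaf(SAMPLE)
print("most recent annotation:", latest)
print("evidence codes:", dict(evidence))
```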
Secondary Sources of GO annotation • EXAMPLES: • public databases (eg. NCBI, UniProtKB) • genome browsers (eg. Ensembl) • array vendors (eg. Affymetrix) • CONSIDERATIONS: • What is the original source? • When was it last updated? • Are evidence codes displayed?
For more information about GO
• GO Evidence Codes: http://www.geneontology.org/GO.evidence.shtml
• gene association file information: http://www.geneontology.org/GO.format.annotation.shtml
• tools that use the GO: http://www.geneontology.org/GO.tools.shtml
• GO Consortium wiki: http://wiki.geneontology.org/index.php/Main_Page
All websites are available from the workshop website & handout.