Introduction to the GO: a user’s guide

NCSU GO Workshop, 29 October 2009

Genomic Annotation
• Genome annotation is the process of attaching biological information to genomic sequences. It consists of two main steps:
  • identifying functional elements in the genome: “structural annotation”
  • attaching biological information to these elements: “functional annotation”
• Biologists often use the term “annotation” when they are referring only to structural annotation.

[Figure: Structural annotation, showing DNA annotation and protein annotation. Labels: CHICK_OLF6; TRAF 1, 2 and 3; TRAF 1 and 2. Data from the Ensembl genome browser.]

Functional annotation: catenin

Structural & Functional Annotation
Structural Annotation:
• Open reading frames (ORFs) are predicted during genome assembly
• predicted ORFs require experimental confirmation
• the Sequence Ontology (SO) provides a structured controlled vocabulary for sequence annotation
Functional Annotation:
• annotation of gene products = Gene Ontology (GO) annotation
• initially, predicted ORFs have no functional literature and GO annotation relies on computational methods (rapid)
• functional literature exists for many genes/proteins prior to genome sequencing
• GO annotation does not rely on a completed genome sequence!

Introduction to GO
• Bio-ontologies
• The Gene Ontology (GO)
  • a GO annotation example
  • GO evidence codes
  • literature biocuration & computational analysis
  • ND vs no GO
  • sources of GO
• Using the GO
• The gene association file

1. Bio-ontologies

Bio-ontologies
• Bio-ontologies are used to capture biological information in a way that can be read by both humans and computers.
  • necessary for high-throughput “omics” datasets
  • allows data sharing across databases
• Objects in an ontology (e.g. genes, cell types, tissue types, stages of development) are well defined.
• The ontology shows how the objects relate to each other.

Bio-ontologies: http://www.obofoundry.org/

Ontologies
• relationships between terms
• digital identifier (computers)
• description (humans)
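The three ingredients of an ontology term named here (a digital identifier, a human-readable description, and relationships to other terms) can be sketched as a small data structure. This is an illustrative toy, not the GO's own file format: the `Term` class, the `ONTOLOGY` dictionary, and the `ancestors` helper are hypothetical names, and the two relationships are included only as examples.

```python
from dataclasses import dataclass, field


@dataclass
class Term:
    term_id: str   # digital identifier (for computers)
    name: str      # description (for humans)
    # (relationship, parent_id) pairs linking this term to other terms
    parents: list = field(default_factory=list)


# Toy ontology fragment; the relationships are illustrative assumptions.
ONTOLOGY = {
    "GO:0008610": Term("GO:0008610", "lipid biosynthetic process"),
    "GO:0006633": Term("GO:0006633", "fatty acid biosynthetic process",
                       [("is_a", "GO:0008610")]),
    "GO:0005739": Term("GO:0005739", "mitochondrion"),
    "GO:0005759": Term("GO:0005759", "mitochondrial matrix",
                       [("part_of", "GO:0005739")]),
}


def ancestors(term_id, ontology):
    """Return the IDs of all terms reachable by following parent links."""
    found = set()
    stack = [term_id]
    while stack:
        current = stack.pop()
        for _relationship, parent_id in ontology[current].parents:
            if parent_id not in found:
                found.add(parent_id)
                stack.append(parent_id)
    return found


print(ancestors("GO:0006633", ONTOLOGY))
```

Because the relationships are stored explicitly, a program can move from a specific term to its broader parents, which is what lets tools group annotations at different levels of detail.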

2. The Gene Ontology

Functional Annotation
• Gene Ontology (GO) is the de facto method for functional annotation
• Widely used for functional genomics (high throughput)
• Many tools available for gene expression analysis using GO
• The GO Consortium homepage: http://www.geneontology.org

NDUFAB1 GO Mapping Example
NDUFAB1 (UniProt P52505): Bovine NADH dehydrogenase (ubiquinone) 1, alpha/beta subcomplex, 1, 8kDa

Each line gives the GO:ID (unique), the GO term name, and the GO evidence code, grouped by aspect or ontology.

Biological Process (BP or P)
• GO:0006633 fatty acid biosynthetic process TAS
• GO:0006120 mitochondrial electron transport, NADH to ubiquinone TAS
• GO:0008610 lipid biosynthetic process IEA
Molecular Function (MF or F)
• GO:0005504 fatty acid binding IDA
• GO:0008137 NADH dehydrogenase (ubiquinone) activity TAS
• GO:0016491 oxidoreductase activity TAS
• GO:0000036 acyl carrier activity IEA
Cellular Component (CC or C)
• GO:0005759 mitochondrial matrix IDA
• GO:0005747 mitochondrial respiratory chain complex I IDA
• GO:0005739 mitochondrion IEA

GO Evidence Codes
Direct Evidence Codes
• IDA - inferred from direct assay
• IEP - inferred from expression pattern
• IGI - inferred from genetic interaction
• IMP - inferred from mutant phenotype
• IPI - inferred from physical interaction
Indirect Evidence Codes
• IGC - inferred from genomic context
• inferred from literature:
  • TAS - traceable author statement
  • NAS - non-traceable author statement
  • IC - inferred by curator
• inferred by sequence analysis:
  • RCA - inferred from reviewed computational analysis
  • IS* - inferred from sequence*
  • IEA - inferred from electronic annotation
Other
• NR - not recorded (historical)
• ND - no biological data available
* The IS* codes are:
• ISS - inferred from sequence or structural similarity
• ISA - inferred from sequence alignment
• ISO - inferred from sequence orthology
• ISM - inferred from sequence model

GO Evidence Codes: biocuration of literature
• detailed function
• “depth”
• slower (manual)

Biocuration of Literature: detailed gene function
• Find a paper about the protein (example: P05147, PMID: 2976880).
• Read the paper to get experimental evidence of function.
• Use the most specific term possible.
• Example: the experiment assayed kinase activity, so use the IDA evidence code.

GO Evidence Codes
• Biocuration of literature
  • detailed function
  • “depth”
  • slower (manual)
• Sequence analysis
  • rapid (computational)
  • “breadth” of coverage
  • less detailed
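The contrast between literature biocuration and sequence analysis can be made concrete by sorting a protein's annotations by evidence code. The sketch below uses the NDUFAB1 annotations from the mapping example; the group names and the `evidence_group` and `group_by_evidence` helpers are hypothetical, and codes not named in a group (for example IGC, NR, ND) simply fall into "other".

```python
# Evidence code groups, following the evidence code slide.
DIRECT = {"IDA", "IEP", "IGI", "IMP", "IPI"}
LITERATURE = {"TAS", "NAS", "IC"}
SEQUENCE_ANALYSIS = {"RCA", "ISS", "ISA", "ISO", "ISM", "IEA"}

# (GO ID, term name, evidence code) for NDUFAB1, from the mapping example.
ANNOTATIONS = [
    ("GO:0006633", "fatty acid biosynthetic process", "TAS"),
    ("GO:0006120", "mitochondrial electron transport, NADH to ubiquinone", "TAS"),
    ("GO:0008610", "lipid biosynthetic process", "IEA"),
    ("GO:0005504", "fatty acid binding", "IDA"),
    ("GO:0008137", "NADH dehydrogenase (ubiquinone) activity", "TAS"),
    ("GO:0016491", "oxidoreductase activity", "TAS"),
    ("GO:0000036", "acyl carrier activity", "IEA"),
    ("GO:0005759", "mitochondrial matrix", "IDA"),
    ("GO:0005747", "mitochondrial respiratory chain complex I", "IDA"),
    ("GO:0005739", "mitochondrion", "IEA"),
]


def evidence_group(code):
    """Map one evidence code to a coarse group name."""
    if code in DIRECT:
        return "direct"
    if code in LITERATURE:
        return "literature"
    if code in SEQUENCE_ANALYSIS:
        return "sequence analysis"
    return "other"


def group_by_evidence(annotations):
    """Collect GO IDs under the group of their evidence code."""
    groups = {}
    for go_id, _name, code in annotations:
        groups.setdefault(evidence_group(code), []).append(go_id)
    return groups


for group, go_ids in group_by_evidence(ANNOTATIONS).items():
    print(group, len(go_ids), go_ids)
```

In this example the three IEA annotations are the broader terms (lipid biosynthetic process, mitochondrion), while the manually curated ones are more specific, which matches the "breadth" versus "depth" distinction.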

Unknown Function vs No GO
• ND – no data
  • Biocurators have tried to add GO but there is no functional data available
  • Previously: “process_unknown”, “function_unknown”, “component_unknown”
  • Now: “biological process”, “molecular function”, “cellular component”
• No annotations (including no “ND”): biocurators have not annotated
  • this is important for your dataset: what % has GO?
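The question "what % has GO?" can be answered with a short tally that keeps ND-only gene products separate from those with no annotation at all. This is a minimal sketch under assumed inputs: `go_coverage` is a hypothetical helper, the gene identifiers are made up, and the annotations are represented simply as a list of evidence codes per gene product.

```python
def go_coverage(gene_ids, annotations):
    """Return the percentage of gene products in each coverage class.

    gene_ids:    list of gene product identifiers in the dataset
    annotations: dict mapping identifier -> list of evidence codes
    """
    total = len(gene_ids)
    if total == 0:
        raise ValueError("gene_ids is empty")
    counts = {"with GO": 0, "ND only": 0, "no GO": 0}
    for gene in gene_ids:
        codes = annotations.get(gene, [])
        if not codes:
            counts["no GO"] += 1      # biocurators have not annotated
        elif all(code == "ND" for code in codes):
            counts["ND only"] += 1    # curated, but no functional data
        else:
            counts["with GO"] += 1
    return {label: 100.0 * n / total for label, n in counts.items()}


# Made-up dataset for illustration only.
dataset = ["geneA", "geneB", "geneC", "geneD"]
codes_by_gene = {
    "geneA": ["IDA", "IEA"],
    "geneB": ["ND"],
    "geneD": ["ND", "ND", "ND"],
}
print(go_coverage(dataset, codes_by_gene))
```

Reporting the three classes separately avoids treating "curated, nothing known" as if it were "never looked at".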

Sources of GO
• Primary sources of GO: from the GO Consortium (GOC) & GOC members
  • most up to date
  • most comprehensive
• Secondary sources: other resources that use GO provided by GOC members
  • public databases (e.g. NCBI, UniProtKB)
  • genome browsers (e.g. Ensembl)
  • array vendors (e.g. Affymetrix)
  • GO expression analysis tools

• Different tools and databases display GO annotations differently.
• GO terms change continually and new GO annotations are added continually, so you need to know when the GO annotations in a resource were last updated.

Secondary Sources of GO annotation
• EXAMPLES:
  • public databases (e.g. NCBI, UniProtKB)
  • genome browsers (e.g. Ensembl)
  • array vendors (e.g. Affymetrix)
• CONSIDERATIONS:
  • What is the original source?
  • When was it last updated?
  • Are evidence codes displayed?

For more information about GO
• GO Evidence Codes: http://www.geneontology.org/GO.evidence.shtml
• gene association file information: http://www.geneontology.org/GO.format.annotation.shtml
• tools that use the GO: http://www.geneontology.org/GO.tools.shtml
• GO Consortium wiki: http://wiki.geneontology.org/index.php/Main_Page

3. Using the GO

Use GO Browsers for:
• searching for GO terms
• searching for gene product annotation
• filtering sets of annotations and downloading results
• creating/using GO slims

GO Browsers
• QuickGO Browser (EBI GOA Project)
  • http://www.ebi.ac.uk/ego/
  • Can search by GO Term or by UniProt ID
  • Includes IEA annotations
• AmiGO Browser (GO Consortium Project)
  • http://amigo.geneontology.org/cgi-bin/amigo/go.cgi
  • Can search by GO Term or by UniProt ID
  • Does not include IEA annotations

Use GO for:
• Determining which classes of gene products are over-represented or under-represented.
• Grouping gene products by biological function.
• Relating a protein’s location to its function.
• Focusing on particular biological pathways and functions (hypothesis-driven data interrogation).
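To illustrate the first use, one common way to ask whether a GO term is over-represented in a gene list is a hypergeometric test. The sketch below is an assumption for illustration, not the method of any particular tool: individual tools may use other tests and usually also correct for multiple testing. The function name `hypergeom_upper_tail` and the numbers in the example are hypothetical; `math.comb` requires Python 3.8 or later.

```python
from math import comb


def hypergeom_upper_tail(population, annotated, sample, hits):
    """Probability of seeing at least `hits` annotated gene products
    when `sample` are drawn without replacement from a background of
    `population`, of which `annotated` carry the GO term."""
    total = comb(population, sample)
    upper = min(annotated, sample)
    favourable = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(hits, upper + 1)
    )
    return favourable / total


# Made-up numbers: 1000 background gene products, 50 annotated to the
# term, 40 in the list of interest, 8 of those annotated to the term.
p_value = hypergeom_upper_tail(1000, 50, 40, 8)
print(f"p = {p_value:.3g}")
```

A small probability suggests the term appears in the list more often than chance would predict; under-representation can be assessed the same way using the lower tail.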

http://www.geneontology.org/

However…
• many of these tools do not support non-model organisms
• the tools have different computing requirements
• it may be difficult to determine how up-to-date the GO annotations are
Need to evaluate tools for your system.

Evaluating GO tools
Some criteria for evaluating GO tools:
• Does it include my species of interest (or do I have to “humanize” my list)?
• What does it require to set up (computer usage/online)?
• What was the source for the GO (primary or secondary) and when was it last updated?
• Does it report the GO evidence codes (and is IEA included)?
• Does it report which of my gene products have no GO?
• Does it report both over- and under-represented GO groups, and how does it evaluate this?
• Does it allow me to add my own GO annotations?
• Does it represent my results in a way that facilitates discovery?

4. Gene association files

The gene association (ga) file
• standard file format used to capture GO annotation data
• tab-delimited file containing 15* fields of information:
  • information about the gene product: database, accession, name, symbol, synonyms, species
  • information about the function: GO ID, ontology, reference, evidence, qualifiers, context (with/from)
  • data about the functional annotation: date, annotator
* 2 additional fields will soon be added to capture information about isoforms and other ontologies.
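Because the ga file is tab-delimited with a fixed number of fields, one line can be read into a dictionary in a few lines of code. The sketch below assumes the 15-column order of the GAF 1.0 format; the field names in `GAF_FIELDS` and the `parse_ga_line` helper are hypothetical, and the column order should be checked against the format page on the GO website. The example line is built for illustration: its reference, date, and annotator values are placeholders, not real data.

```python
# Assumed column order (GAF 1.0, 15 tab-delimited fields).
GAF_FIELDS = [
    "db", "db_object_id", "db_object_symbol", "qualifier", "go_id",
    "reference", "evidence", "with_from", "aspect", "db_object_name",
    "db_object_synonym", "db_object_type", "taxon", "date", "assigned_by",
]


def parse_ga_line(line):
    """Parse one ga file line into a dict; return None for comments/blanks."""
    if line.startswith("!") or not line.strip():
        return None
    values = line.rstrip("\n").split("\t")
    if len(values) < len(GAF_FIELDS):
        raise ValueError(f"expected {len(GAF_FIELDS)} fields, got {len(values)}")
    # zip() ignores any extra trailing columns beyond the first 15.
    return dict(zip(GAF_FIELDS, values))


# Illustrative line; reference, date and annotator are placeholders.
EXAMPLE_LINE = "\t".join([
    "UniProtKB", "P52505", "NDUFAB1", "", "GO:0005759",
    "REF:placeholder", "IDA", "", "C",
    "NADH dehydrogenase (ubiquinone) 1, alpha/beta subcomplex, 1, 8kDa",
    "", "protein", "taxon:9913", "20091029", "EXAMPLE",
]) + "\n"

record = parse_ga_line(EXAMPLE_LINE)
print(record["db_object_symbol"], record["go_id"], record["evidence"])
```

Keeping every field, including empty ones, preserves the evidence code, reference, and date, so that the source and age of each annotation can still be checked later.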

[Figure: example gene association file (additional column added to this example). The columns are grouped into gene product information, function information, and metadata: when & who.]

Gene association files
• GO Consortium ga files
  • many organism-specific files
  • also includes EBI GOA files
• EBI GOA ga files
  • UniProt file contains GO annotation for all species represented in UniProtKB
• AgBase ga files
  • organism-specific files
  • AgBase GOC file – submitted to GO Consortium & EBI GOA
  • AgBase Community file – GO annotations not yet submitted or not supported
  • all files are quality checked
