Genome Annotation Continued

Genome Annotation Continued • This week’s lab. • Genome annotation: web-based databases for assigning gene function.

Last week’s lab • E-value • Score • Blastx • Taxonomy

Lab • Sequence assembly and analysis • Assemble individual sequence reads • Phred = 30 - good or bad?
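To answer the Phred question it helps to convert the score into an error probability. The sketch below assumes the standard Phred definition, Q = -10 * log10(P), where P is the probability that a base call is wrong; the function names are illustrative and do not come from the lab materials.

```python
import math

def phred_to_error_probability(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)

def error_probability_to_phred(p):
    """Phred score corresponding to an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)

# Compare a few common quality thresholds.
for q in (10, 20, 30, 40):
    p = phred_to_error_probability(q)
    print(f"Phred {q}: about 1 wrong call in {round(1 / p)} bases")
```

Under this definition a Phred score of 30 corresponds to roughly one miscalled base per 1000, which is generally treated as good quality.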

Linking Protein Sequence, Structure, and Function • Protein sequences • CDD: conserved functional domains in proteins, each represented by a PSSM • Domains: PSI-BLAST, RPS-BLAST, CDART • 3D Domains • Source: NCBI Field Guide

Position Specific Substitution Rates • Weakly conserved serine • Active site serine

Position Specific Score Matrix (PSSM)

Pos     A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
206 D   0  -2   0   2  -4   2   4  -4  -3  -5  -4   0  -2  -6   1   0  -1  -6  -4  -1
207 G  -2  -1   0  -2  -4  -3  -3   6  -4  -5  -5   0  -2  -3  -2  -2  -1   0  -6  -5
208 V  -1   1  -3  -3  -5  -1  -2   6  -1  -4  -5   1  -5  -6  -4   0  -2  -6  -4  -2
209 I  -3   3  -3  -4  -6   0  -1  -4  -1   2  -4   6  -2  -5  -5  -3   0  -1  -4   0
210 S  -2  -5   0   8  -5  -3  -2  -1  -4  -7  -6  -4  -6  -7  -5   1  -3  -7  -5  -6
211 S   4  -4  -4  -4  -4  -1  -4  -2  -3  -3  -5  -4  -4  -5  -1   4   3  -6  -5  -3
212 C  -4  -7  -6  -7  12  -7  -7  -5  -6  -5  -5  -7  -5   0  -7  -4  -4  -5   0  -4
213 N  -2   0   2  -1  -6   7   0  -2   0  -6  -4   2   0  -2  -5  -1  -3  -3  -4  -3
214 G  -2  -3  -3  -4  -4  -4  -5   7  -4  -7  -7  -5  -4  -4  -6  -3  -5  -6  -6  -6
215 D  -5  -5  -2   9  -7  -4  -1  -5  -5  -7  -7  -4  -7  -7  -5  -4  -4  -8  -7  -7
216 S  -2  -4  -2  -4  -4  -3  -3  -3  -4  -6  -6  -3  -5  -6  -4   7  -2  -6  -5  -5
217 G  -3  -6  -4  -5  -6  -5  -6   8  -6  -8  -7  -5  -6  -7  -6  -4  -5  -6  -7  -7
218 G  -3  -6  -4  -5  -6  -5  -6   8  -6  -7  -7  -5  -6  -7  -6  -2  -4  -6  -7  -7
219 P  -2  -6  -6  -5  -6  -5  -5  -6  -6  -6  -7  -4  -6  -7   9  -4  -4  -7  -7  -6
220 L  -4  -6  -7  -7  -5  -5  -6  -7   0  -1   6  -6   1   0  -6  -6  -5  -5  -4   0
221 N  -1  -6   0  -6  -4  -4  -6  -6  -1   3   0  -5   4  -3  -6  -2  -1  -6  -1   6
222 C   0  -4  -5  -5  10  -2  -5  -5   1  -1  -1  -5   0  -1  -4  -1   0  -5   0   0
223 Q   0   1   4   2  -5   2   0   0   0  -4  -2   1   0   0   0  -1  -1  -3  -3  -4
224 A  -1  -1   1   3  -4  -1   1   4  -3  -4  -3  -1  -2  -2  -3   0  -2  -2  -2  -3

• Serine is scored differently in these two positions • Active site nucleophile
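The position-specific scoring can be made concrete with a small lookup. This is a minimal sketch that copies rows 211 and 214 to 219 from the matrix above; the helper names (`residue_score`, `window_score`) are hypothetical and are not part of PSI-BLAST or any NCBI tool. Treating 211 as the weakly conserved serine and 216 as the active site serine is an assumption based on the scores, since the slide's arrows did not survive.

```python
# Column order used by the PSSM on the slide.
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

# Selected rows copied from the slide (position -> 20 scores).
PSSM_ROWS = {
    211: [4, -4, -4, -4, -4, -1, -4, -2, -3, -3, -5, -4, -4, -5, -1, 4, 3, -6, -5, -3],
    214: [-2, -3, -3, -4, -4, -4, -5, 7, -4, -7, -7, -5, -4, -4, -6, -3, -5, -6, -6, -6],
    215: [-5, -5, -2, 9, -7, -4, -1, -5, -5, -7, -7, -4, -7, -7, -5, -4, -4, -8, -7, -7],
    216: [-2, -4, -2, -4, -4, -3, -3, -3, -4, -6, -6, -3, -5, -6, -4, 7, -2, -6, -5, -5],
    217: [-3, -6, -4, -5, -6, -5, -6, 8, -6, -8, -7, -5, -6, -7, -6, -4, -5, -6, -7, -7],
    218: [-3, -6, -4, -5, -6, -5, -6, 8, -6, -7, -7, -5, -6, -7, -6, -2, -4, -6, -7, -7],
    219: [-2, -6, -6, -5, -6, -5, -5, -6, -6, -6, -7, -4, -6, -7, 9, -4, -4, -7, -7, -6],
}

def residue_score(position, residue):
    """Score for placing `residue` at `position` of the profile."""
    return PSSM_ROWS[position][AMINO_ACIDS.index(residue)]

def window_score(start, peptide):
    """Sum of per-position scores for `peptide` aligned at `start` (no gaps)."""
    return sum(residue_score(start + i, aa) for i, aa in enumerate(peptide))

# Serine versus alanine at the two serine positions.
for pos in (211, 216):
    print(pos, "S:", residue_score(pos, "S"), "A:", residue_score(pos, "A"))

# Effect of replacing the serine at 216 within the 214-219 window.
print(window_score(214, "GDSGGP"), window_score(214, "GDAGGP"))
```

At position 211 alanine scores as well as serine, while at position 216 any substitution for serine is penalized. A general substitution matrix cannot express this difference, because it assigns one serine-to-alanine score regardless of position.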

Hidden Markov Models • A statistical model that can be applied to any system that can be represented as a sequence of discrete states. • Applies to protein and nucleotide (nt) sequences. • Can be thought of much like the PSSMs that PSI-BLAST builds after several iterations. • Are used in gene finding and protein profile analysis.
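As a rough illustration of what "discrete states" means here, the sketch below computes the probability of a short nucleotide sequence under a toy two-state HMM using the forward algorithm. All state names and probabilities are invented for illustration; real gene finders and profile HMMs such as those behind Pfam and TIGRFAMs use far larger models trained on alignments.

```python
def forward_probability(observations, states, start_p, trans_p, emit_p):
    """Total probability of a non-empty observation sequence (forward algorithm)."""
    # alpha[s] = P(symbols seen so far, current hidden state is s)
    alpha = {s: start_p[s] * emit_p[s][observations[0]] for s in states}
    for symbol in observations[1:]:
        alpha = {
            s: emit_p[s][symbol]
            * sum(alpha[prev] * trans_p[prev][s] for prev in states)
            for s in states
        }
    return sum(alpha.values())

# Hypothetical model: every number below is made up for this example.
STATES = ("coding", "noncoding")
START = {"coding": 0.5, "noncoding": 0.5}
TRANS = {
    "coding": {"coding": 0.9, "noncoding": 0.1},
    "noncoding": {"coding": 0.1, "noncoding": 0.9},
}
EMIT = {
    "coding": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
    "noncoding": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
}

print(forward_probability("GCGC", STATES, START, TRANS, EMIT))
print(forward_probability("ATAT", STATES, START, TRANS, EMIT))
```

The hidden state is never observed directly; the algorithm sums over every possible path of states that could have emitted the sequence.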

Uses of HMMs in protein function analysis • TIGRFAMs: strive to annotate the function of an entire protein. • Pfam: strives to annotate the domains of proteins.

Homologs, orthologs, and paralogs • Homologous genes are genes that share a common evolutionary ancestor. • Orthologs are genes found in different organisms that arose from a common ancestral gene and diverged through speciation. • Paralogs are genes found in the same organism that arose from a common ancestral gene through duplication. The duplication could have occurred in that species or earlier, and paralogs have often diverged in function.

Orthologs may differ in function!

TIGRFAM • Curated so that proteins in a TIGRFAM should have the same function if they are equivalogs. • Member proteins share sequence similarity over their entire length. • Equivalog family = all proteins that have been conserved with respect to function since their last common ancestor. • Superfamily = all proteins with homology, which may have different biological functions. • Subfamily = an incomplete set of proteins with homology, which may have diverse biological functions.

Pfam • More likely to describe a protein domain than a family. • Pfam families do not overlap. • Cross-listed on TIGRFAM pages. • ~70% of proteins in Swiss-Prot have a Pfam match.

COGs • Clusters of Orthologous Groups • Built from pairwise comparisons of orthologs from many bacterial genomes. • Suggests function only (book example).

Gene Ontology (GO) • “The goal of the Gene Ontology project is to produce a controlled vocabulary that can be applied to all organisms even as knowledge of gene and protein roles in cells is accumulating and changing.” • Biological process, Molecular function, Cellular component

Literature Curation • Saccharomyces genome database (SGD) for example. • Manual curation of the literature for experimental evidence linking function to annotation.

Additional databases • SMART - Simple Modular Architecture Research Tool. • PROSITE - Protein motifs • PRODOM - A database based on PSI-BLAST PSSMs. • InterPro - A database that brings together many of the above databases so that you can search them all at once. • Others.

CDD • Conserved domain database - linking all of this information together. • Consists of SMART, Pfam, and COGs (KOGs). Searchable directly - automatically searched by BLAST. • Linked to CDART - allows the identification of proteins with a similar domain architecture.

Bottom line about databases • They are useful tools for assigning possible functions. • Be careful about annotations. • Example: proteins in the same COG can be orthologs that have evolved different functions. • Many annotations are not backed up by experimental data. • Some databases are automated and have not been checked for accuracy.

Annotation cannot be guaranteed without experimental evidence. • Functional genomics
