This week’s lab focuses on web-based databases and tools for annotating gene functions, highlighting the importance of E-value, score, and various alignment tools like BLASTx. We'll explore sequence assembly and analysis through methods like Phred scoring and discuss the roles of protein sequences and conserved functional domains using databases such as CDD and InterPro. Key concepts include homologs, orthologs, and paralogs, along with the integration of Hidden Markov Models in gene finding. Be cautious about the veracity of database annotations, as many lack experimental validation.
Genome Annotation Continued • This week’s lab. • Genome annotation - web-based databases for assigning gene function.
Last week’s lab • E-value • Score • Blastx • Taxonomy
Lab • Sequence assembly and analysis • Assemble individual sequence reads • Phred = 30 - good or bad?
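The Phred question above can be answered with the standard definition Q = -10 log10(P), where P is the probability that a base call is wrong. The sketch below is only an illustration; the function names are made up for this example and are not part of any lab software.

```python
import math


def phred_to_error_probability(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p):
    """Phred score corresponding to a base-call error probability p."""
    return -10 * math.log10(p)


# Print the error rate and accuracy for a few common Phred cutoffs.
for q in (10, 20, 30, 40):
    p = phred_to_error_probability(q)
    print(f"Q{q}: error probability {p:.4f}, accuracy {100 * (1 - p):.2f}%")
```

By this definition, Phred = 30 corresponds to about 1 wrong base call in 1,000, which is usually treated as good quality.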
Linking Protein Sequence, Structure, and Function • Protein sequences • Protein • CDD: conserved functional domains in proteins, represented by a PSSM • Domains • PSI-BLAST, RPS-BLAST, CDART • 3D Domains • NCBI Field Guide
Position Specific Substitution Rates • Weakly conserved serine • Active site serine
Position Specific Score Matrix (PSSM)
Pos Res   A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
206  D    0  -2   0   2  -4   2   4  -4  -3  -5  -4   0  -2  -6   1   0  -1  -6  -4  -1
207  G   -2  -1   0  -2  -4  -3  -3   6  -4  -5  -5   0  -2  -3  -2  -2  -1   0  -6  -5
208  V   -1   1  -3  -3  -5  -1  -2   6  -1  -4  -5   1  -5  -6  -4   0  -2  -6  -4  -2
209  I   -3   3  -3  -4  -6   0  -1  -4  -1   2  -4   6  -2  -5  -5  -3   0  -1  -4   0
210  S   -2  -5   0   8  -5  -3  -2  -1  -4  -7  -6  -4  -6  -7  -5   1  -3  -7  -5  -6
211  S    4  -4  -4  -4  -4  -1  -4  -2  -3  -3  -5  -4  -4  -5  -1   4   3  -6  -5  -3
212  C   -4  -7  -6  -7  12  -7  -7  -5  -6  -5  -5  -7  -5   0  -7  -4  -4  -5   0  -4
213  N   -2   0   2  -1  -6   7   0  -2   0  -6  -4   2   0  -2  -5  -1  -3  -3  -4  -3
214  G   -2  -3  -3  -4  -4  -4  -5   7  -4  -7  -7  -5  -4  -4  -6  -3  -5  -6  -6  -6
215  D   -5  -5  -2   9  -7  -4  -1  -5  -5  -7  -7  -4  -7  -7  -5  -4  -4  -8  -7  -7
216  S   -2  -4  -2  -4  -4  -3  -3  -3  -4  -6  -6  -3  -5  -6  -4   7  -2  -6  -5  -5
217  G   -3  -6  -4  -5  -6  -5  -6   8  -6  -8  -7  -5  -6  -7  -6  -4  -5  -6  -7  -7
218  G   -3  -6  -4  -5  -6  -5  -6   8  -6  -7  -7  -5  -6  -7  -6  -2  -4  -6  -7  -7
219  P   -2  -6  -6  -5  -6  -5  -5  -6  -6  -6  -7  -4  -6  -7   9  -4  -4  -7  -7  -6
220  L   -4  -6  -7  -7  -5  -5  -6  -7   0  -1   6  -6   1   0  -6  -6  -5  -5  -4   0
221  N   -1  -6   0  -6  -4  -4  -6  -6  -1   3   0  -5   4  -3  -6  -2  -1  -6  -1   6
222  C    0  -4  -5  -5  10  -2  -5  -5   1  -1  -1  -5   0  -1  -4  -1   0  -5   0   0
223  Q    0   1   4   2  -5   2   0   0   0  -4  -2   1   0   0   0  -1  -1  -3  -3  -4
224  A   -1  -1   1   3  -4  -1   1   4  -3  -4  -3  -1  -2  -2  -3   0  -2  -2  -2  -3
• Serine is scored differently in these two positions • Active site nucleophile
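To see how a PSSM scores the same residue differently depending on position, here is a minimal lookup sketch using two rows copied from the matrix above. It assumes that the two serine positions the slide highlights are 211 and 216; the slide text does not name them, so treat that choice as illustrative. The helper name `pssm_score` is hypothetical.

```python
# Column order of the matrix, as given in its header row.
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

# Two rows copied from the PSSM above (positions 211 and 216).
# Choosing these two positions is an assumption made for illustration.
PSSM = {
    211: [4, -4, -4, -4, -4, -1, -4, -2, -3, -3,
          -5, -4, -4, -5, -1, 4, 3, -6, -5, -3],
    216: [-2, -4, -2, -4, -4, -3, -3, -3, -4, -6,
          -6, -3, -5, -6, -4, 7, -2, -6, -5, -5],
}


def pssm_score(position, residue):
    """Look up the score for placing `residue` at `position`."""
    return PSSM[position][AMINO_ACIDS.index(residue)]


# Compare how serine and two possible substitutes score at each position.
for position in (211, 216):
    scores = {aa: pssm_score(position, aa) for aa in "SAT"}
    print(position, scores)
```

At position 211, alanine and threonine score almost as well as serine, so substitutions are tolerated there. At position 216 only serine scores positively, which is the pattern expected of a strictly conserved residue such as an active site nucleophile.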
Hidden Markov Models • A statistical model that can be applied to any system represented as a series of discrete states. • Applies to protein and nucleotide sequences. • Can be thought of much like the PSSMs that PSI-BLAST builds after several iterations. • Used in gene finding and protein profile analysis.
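As a rough illustration of how an HMM can be used in gene finding, the sketch below runs the Viterbi algorithm on a toy two-state model that labels each nucleotide as "coding" or "noncoding". Every probability here is invented for the example and the two-state design is an assumption; real gene finders use far richer models trained on data.

```python
import math

STATES = ("coding", "noncoding")

# Hypothetical parameters, chosen only for illustration.
START = {"coding": 0.5, "noncoding": 0.5}
TRANSITION = {
    "coding": {"coding": 0.9, "noncoding": 0.1},
    "noncoding": {"coding": 0.1, "noncoding": 0.9},
}
EMISSION = {
    "coding": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},     # GC-rich
    "noncoding": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},  # AT-rich
}


def viterbi(sequence):
    """Return the most probable state path for a nucleotide sequence."""
    # best[s]: log probability of the best path that ends in state s.
    best = {s: math.log(START[s]) + math.log(EMISSION[s][sequence[0]])
            for s in STATES}
    backpointers = []
    for symbol in sequence[1:]:
        new_best, pointers = {}, {}
        for s in STATES:
            # Pick the previous state that maximizes the path into s.
            prev = max(STATES,
                       key=lambda p: best[p] + math.log(TRANSITION[p][s]))
            new_best[s] = (best[prev] + math.log(TRANSITION[prev][s])
                           + math.log(EMISSION[s][symbol]))
            pointers[s] = prev
        best = new_best
        backpointers.append(pointers)
    # Trace back from the best final state to recover the full path.
    state = max(STATES, key=lambda s: best[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


print(viterbi("ATATATAT" + "GCGCGCGCGCGC"))
```

With these made-up parameters, the AT-rich stretch is labelled noncoding and the GC-rich stretch coding. The hidden state path is inferred from the observed sequence alone, which is the core idea behind HMM-based gene finding.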
Uses of HMMs in protein function analysis. • TIGRFAMs • Strive to annotate function of an entire protein • PFAMs • Strive to annotate domains of proteins.
Homologs, orthologs, and paralogs. • Homologous genes are genes that share a common evolutionary ancestor. • Orthologs are genes found in different organisms that arose from a common ancestor through speciation. • Paralogs are genes found in the same organism that arose from a common ancestor through duplication. The duplication could have occurred in the species or earlier, and paralogs have often diverged in function.
TIGRFAM • Curated such that proteins in a TIGRFAM should have the same function if they are equivalogs. • Proteins share identity over their entire length. • Equivalog family = all proteins that are conserved with respect to function since their last common ancestor. • Superfamily - all proteins with homology, though they may have different biological functions. • Subfamily - an incomplete set of proteins with homology, which may have diverse biological functions.
PFAM • More likely to describe a protein domain rather than a family. • Pfams will not overlap. • Cross-listed on the TIGRFAM page. • ~70% of proteins in Swiss-Prot have a Pfam match.
COGs • Clusters of Orthologous Groups • Pairwise comparison of orthologs from many bacterial genomes. • Suggests function only (book example).
Gene Ontology (GO) • “The goal of the Gene Ontology project is to produce a controlled vocabulary that can be applied to all organisms even as knowledge of gene and protein roles in cells is accumulating and changing.” • Biological process, Molecular function, Cellular component
Literature Curation • Saccharomyces Genome Database (SGD), for example. • Manual curation of the literature for experimental evidence linking function to annotation.
Additional databases • SMART - Simple Modular Architecture Research Tool. • PROSITE - Protein motifs • PRODOM - A database based on PSI-BLAST PSSMs. • InterPro - A database that brings together many of the above databases so that you can search them all at once. • Others.
CDD • Conserved domain database - linking all of this information together. • Consists of SMART, Pfam, and COGs (KOGs). Searchable directly - automatically searched by BLAST. • Linked to CDART - allows the identification of proteins with a similar domain architecture.
Bottom line about databases • They are useful tools for assigning possible functions. • Be careful about annotations. • Example: proteins in the same COG can be orthologs that have evolved different functions. • Many annotations are not backed up by experimental data. • Some databases are automated and have not been checked for accuracy.
Annotation cannot be guaranteed without experimental evidence. • Functional genomics