Slide 1:       March 28, 2003
       NIH Proteomics Workshop
       Bethesda, MD
    Anastasia Nikolskaya, Ph.D.
     Research Assistant Professor
     Protein Information Resource
       Department of Biochemistry and Molecular Biology
       Georgetown University Medical Center
 
Slide 2:Overview
Role of Bioinformatics/Computational Biology in Proteomics Research
Genomics
Functional Annotation of Proteins
Classification of Proteins
Bioinformatics Databases and Analytical Tools (Dr. Yeh and Dr. Hu)
     
Sequence                   function
 
Slide 3:Functional Genomics/Proteomics
Proteomics studies biological systems based on global knowledge of genomes, transcriptomes, proteomes, metabolomes.
Functional genomics studies biological functions of proteins, complexes, pathways based on the analysis of genome sequences. Includes functional assignments for protein sequences.
Genome: All the Genetic Material in the Chromosomes
Transcriptome: Entire Set of Gene Transcripts 
Proteome: Entire Set of Proteins
Metabolome: Entire Set of Metabolites 
Slide 4:Proteomics 
Data: Gene Expression Profiling
  - Genome-Wide Analyses of Gene Expression
Data: Structural Genomics
  - Determine 3D Structures of All Protein Families
Data: Genome Projects (Sequencing)
    - Functional genomics
    - Knowing the complete genome sequences of a number of organisms is the basis of proteomics research
Slide 6:Bioinformatics and Genomics/Proteomics 
Slide 7:Most new proteins come from genome sequencing projects
Mycoplasma genitalium - 484 proteins
Escherichia coli - 4,288 proteins
S. cerevisiae (yeast) - 5,932 proteins
C. elegans (worm) ~ 19,000 proteins
Homo sapiens ~ 40,000 proteins 
Slide 8:Advantages of knowing the complete genome sequence
All encoded proteins can be predicted and identified
The missing functions can be identified and analyzed
Peculiarities and novelties in each organism can be studied
Predictions can be made and verified
Slide 9:The changing face of protein science
20th century
Few well-studied proteins
Mostly globular with enzymatic activity
Biased protein set
21st century
Many “hypothetical” proteins
Various, often with no enzymatic activity
Natural protein set
Slide 10:Properties of the natural protein set
Unexpected diversity of even common enzymes (analogous, paralogous, xenologous, etc.)
Conservation of the reaction chemistry, but not the substrate specificity
Functional diversity in closely related proteins
Abundance of new structures
Slide 11:Experimentally characterized
    Best annotated protein database: SwissProt
“Knowns” = Characterized by similarity (closely related to experimentally characterized)
 Make sure the assignment is plausible
Function can be predicted
 Extract maximum possible information
 Avoid errors and overpredictions
 Fill the gaps in metabolic pathways
“Unknowns”  (conserved or unique)
 Rank by importance 
Slide 13:Problems in functional assignments for “knowns” Previous low quality annotations 
Slide 15:Problems in functional assignments for “knowns” 
Slide 17:Experimentally characterized
“Knowns” = Characterized by similarity (closely related to experimentally characterized)
 Make sure the assignment is plausible
Function can be predicted
 Extract maximum possible information
 Avoid errors and overpredictions
 Fill the gaps in metabolic pathways
“Unknowns”  (conserved or unique)
 Rank by importance 
Slide 18:Functional Prediction: Dealing with “hypothetical” proteins
Computational analysis
Sequence analysis of the new ORFs
Mutational analysis 
Functional analysis 
Expression profiling
Tracking of cellular localization
Structural analysis
Determination of the 3D structure 
Slide 19:Structural Genomics
Protein Structure Initiative: Determine 3D Structures of All Proteins
Family Classification: 
Organize Protein Sequences into Families, collect families without known structures
Target Selection: 
Select Family Representatives as Targets 
Structure Determination: 
X-Ray Crystallography or NMR Spectroscopy 
Homology Modeling: 
Build Models for Other Proteins by Homology
Functional prediction based on structure
Structural genomics is the systematic determination of 3-dimensional structures of proteins representative of the range of protein structure and function found in nature. The aim, ultimately, is to build a body of structural information that will facilitate prediction of a reasonable structure and potential function for almost any protein from knowledge of its coding sequence. Such information will be essential for understanding the functioning of the human proteome, the ensemble of tens of thousands of proteins specified by the human genome.
Slide 20:Structural Genomics: Structure-Based Functional Assignments 
Slide 22:Improving functional assignments for “unknowns” (Functional Prediction)
Detailed manual analysis of sequence similarities
Cluster analysis of protein families (family databases)
Use of sophisticated database searches (PSI-BLAST, HMM)
Slide 23:Those amino acids that are conserved in divergent proteins (archaeal and bacterial, hyperthermophilic and mesophilic) are likely to be important for catalytic activity.
Comparative analysis allows us to find subtle sequence similarities in proteins that would not have been noticed otherwise
Prediction of the 3D fold and general function is much easier than prediction of exact biological (or biochemical) function.
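The first point on this slide can be illustrated with a minimal sketch that scans a multiple alignment for columns where every sequence carries the same residue. The function name `conserved_columns` and the toy alignment are made up for illustration; real analyses usually score conservation more softly (for example, allowing conservative substitutions) rather than requiring strict identity.

```python
def conserved_columns(alignment, gap="-"):
    """Return 0-based indices of columns where all sequences share one residue.

    `alignment` is a list of equal-length aligned sequences (strings).
    Columns containing a gap character are never reported as conserved.
    """
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    conserved = []
    for i in range(length):
        column = {seq[i] for seq in alignment}
        if len(column) == 1 and gap not in column:
            conserved.append(i)
    return conserved


# Toy alignment (invented sequences, not real proteins).
toy_alignment = [
    "MKHDLG-SE",
    "MRHDIGASE",
    "LKHDVG-TE",
]
print(conserved_columns(toy_alignment))  # [2, 3, 5, 8]
```

The more divergent the aligned sequences, the fewer columns survive this filter, which is why the surviving positions are good candidates for catalytic residues.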
Slide 24:For some reason, the reaction chemistry often remains conserved even when sequence diverges almost beyond recognition
Sequence database searches that use exotic or highly divergent query sequences often reveal more subtle relationships than those using queries from humans or standard model organisms (E. coli, yeast, worm, fly).
Sequence analysis complements structural comparisons and can greatly benefit from them
Slide 25:Poorly characterized protein families
Enzyme activity can be predicted, the substrate remains unknown (ATPases, GTPases, oxidoreductases, methyltransferases, acetyltransferases)
Helix-turn-helix motif proteins (predicted transcriptional regulators)
Membrane transporters
Slide 26:Improving functional assignments for “unknowns”
Phylogenetic distribution
 Wide - most likely essential
 Narrow - probably clade-specific
 Patchy - most intriguing, niche-specific
Domain association – Rosetta Stone for multidomain proteins
Gene neighborhood (operon organization)
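The wide / narrow / patchy distinction can be sketched as a simple rule over a presence/absence table of genomes grouped by clade. The cutoffs used here (`wide=0.8`, a single clade for "narrow") and the function name `phyletic_category` are arbitrary assumptions for illustration; the slides do not give numeric thresholds.

```python
def phyletic_category(presence, wide=0.8, narrow_clades=1):
    """Classify a protein family by its phylogenetic distribution.

    `presence` maps a clade name to a list of booleans, one per sequenced
    genome in that clade (True = family member found in that genome).
    """
    total = sum(len(genomes) for genomes in presence.values())
    hits = sum(sum(genomes) for genomes in presence.values())
    if total == 0 or hits == 0:
        return "absent"
    clades_with_hits = [c for c, genomes in presence.items() if any(genomes)]
    if hits / total >= wide:
        return "wide"      # found almost everywhere: most likely essential
    if len(clades_with_hits) <= narrow_clades:
        return "narrow"    # confined to one clade: probably clade-specific
    return "patchy"        # scattered across clades: possibly niche-specific


# Invented presence/absence tables, for illustration only.
everywhere = {"Bacteria": [True, True, True], "Archaea": [True, True], "Eukarya": [True, False]}
one_clade = {"Bacteria": [False, False, False], "Archaea": [True, True], "Eukarya": [False, False]}
scattered = {"Bacteria": [True, False, False], "Archaea": [False, True], "Eukarya": [False, False]}

for table in (everywhere, one_clade, scattered):
    print(phyletic_category(table))
```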
Slide 28:Problems in functional assignments/predictions
Identification of protein-coding regions
Delineation of potential function(s) for distant paralogs
Identification of domains in the absence of close homologs
Analysis of proteins with low sequence complexity
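To make "low sequence complexity" concrete, the sketch below flags sliding windows whose Shannon entropy falls under a cutoff. This is a simplified stand-in, not the filter used by BLAST; the window size, the threshold and the function names are assumptions chosen for the toy example.

```python
import math
from collections import Counter


def shannon_entropy(segment):
    """Shannon entropy (bits per residue) of the residue composition of a segment."""
    counts = Counter(segment)
    n = len(segment)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def low_complexity_windows(sequence, window=12, threshold=2.2):
    """Return start positions (0-based) of windows with entropy below `threshold`."""
    return [
        i
        for i in range(len(sequence) - window + 1)
        if shannon_entropy(sequence[i:i + window]) < threshold
    ]


# Invented sequence: a mixed N-terminal stretch followed by a poly-Q run.
toy_sequence = "MKTAYIAKQR" + "Q" * 12
print(low_complexity_windows(toy_sequence))
```

Windows dominated by one or two residue types score close to zero bits and are reported; such regions tend to produce spurious database hits, which is why they are usually masked before searching.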
Slide 29:“Unknown unknowns”
Phylogenetic distribution
 Wide - most likely essential
 Narrow - probably clade-specific
 Patchy - most intriguing, niche-specific
Slide 30:To deal with the ocean of new sequences, we need a “natural” protein classification
Protein families are real and reflect evolutionary relationships
Protein classification systems can be used to
Improve sensitivity of protein identification
Provide new protein sequence annotation, simplifying the search for non-obvious relationships
Detect and correct genome annotation errors systematically
Drive other annotations (active site etc.)
Provide a basis for evolution, genomics and proteomics research
Slide 31:The ideal system would be:
Comprehensive, with each sequence classified either as a member of a family or as an “orphan” sequence, a family of one
Hierarchical, with families united into superfamilies on the basis of distant homology
Able to use whole-protein and domain information simultaneously (domains mapped onto proteins)
Able to classify and annotate new sequences automatically when they fit into existing families
Expertly curated (family name, function, evidence attribution (experimental vs. predicted), background, etc.). This is the only way to avoid annotation errors and prevent error propagation
Slide 32:The ideal system has yet to be created, but there are several very useful systems 
Slide 33:Levels of Protein Classification 
Slide 34:Protein Evolution
Tree of Life & Evolution of Protein Families (Dayhoff, 1978)
Can build a tree representing evolution of a protein family, based on sequences
Orthologous Gene Family: Organismal and Sequence Trees Match Well
Slide 35:Protein Evolution
Homolog
Common Ancestors
Common 3D Structure
Common Active Sites or Binding Domains
Ortholog
Derived from Speciation
Paralog
Derived from Duplication
Homology: Similarity in DNA or protein sequences between individuals of the same species or among different species that reflects common ancestry.
Evolutionary approaches, including cross-species sequence comparisons and studies on features of genome organization, evolution and conserved synteny, are critical.
 
Slide 36:Orthologs and Paralogs 
Slide 37:Orthologs and Paralogs 
Slide 38:Orthologs and Paralogs 
Slide 39:Orthologs and Paralogs 
Slide 40:Orthologs and Paralogs 
Slide 41:Levels of Protein Classification 
Slide 42:Protein Family-Domain-Motif
Domain: Evolutionary/Functional/Structural Unit
     Domain = structurally compact, independently folding unit that forms a stable three-dimensional structure and shows a certain level of evolutionary conservation. Usually corresponds to an evolutionary unit.
     A protein can consist of a single domain or multiple domains. Proteins have modular structure.
Motif: Conserved Functional/Structural Site
Here, for example, are diagrammatic representations of 5 superfamilies containing the calcineurin-like phosphoesterase domain, in several cases in association with other known types of domains.
Slide 43:Protein Evolution:Sequence Change vs. Domain Shuffling 
Slide 44:Recent Domain Shuffling 
Slide 45:Protein classification: proteins and domains
Option 1: classify domains
- take individual domain sequences, consider them as independently evolving units and build a classification system
- allows going all the way to the deepest possible level, the last point of traceable homology and common origin (fold)
- domain databases (Pfam, SMART, CDD) allow mapping domains onto a query sequence
 
Slide 46:Protein classification: proteins and domains
Option 2: classify full-length proteins
- for multidomain proteins, does not allow going deep along the evolutionary tree
- all proteins in a family will often have a common biological function, which is very convenient for annotation
- domains will be mapped onto protein families
 
Slide 47:Practical Classification of Proteins:Setting Realistic Goals 
Slide 48:Classification: current status
PIR Superfamilies:
     Proteins in PIRPSD: 283,289
     Proteins classified: 187,871 (about 2/3 of the PIR proteins)
COGs:
 ~ 70% of each microbial genome
 ~ 50% of each eukaryotic genome in 3-clade COGs
 ~ 20% ? of each eukaryotic genome in LSEs
 
Slide 49:PIR Web Site (http://pir.georgetown.edu) 
Slide 50:PIR Superfamily Concept 
Slide 51:PIR Superfamilies
Created by automated clustering by % identity with coverage-by-length requirements. Creation of new Superfamilies is an ongoing process.
Automated classification rules are refined by expert curation:
 -  Evolution rates are very different in different “branches” of the protein universe, so need very different score cutoffs
Verify/add members
Annotation (at level of orthology): Superfamily Name, Description, Bibliography
In some cases, more than one orthologous group will be included into a single Superfamily; these Superfamilies will often be very large and diverse
Depth of hierarchy will be different for single-domain and multidomain proteins
  This is work in progress and will become available through PIR (iProClass) and InterPro
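As a rough sketch of what "clustering by % identity with coverage-by-length requirements" can look like, the example below does single-linkage clustering over precomputed pairwise hits. It is not the PIR procedure: the function name, the input format and both cutoffs (50% identity, 80% coverage of the longer sequence) are assumptions made for illustration.

```python
def cluster_by_identity(lengths, hits, min_identity=50.0, min_coverage=0.8):
    """Single-linkage clustering of proteins from pairwise alignment hits.

    `lengths` maps protein id -> sequence length.
    `hits` is an iterable of (id_a, id_b, percent_identity, aligned_length).
    Two proteins are linked only if the alignment is similar enough AND
    covers most of the longer sequence (the coverage-by-length requirement).
    """
    parent = {p: p for p in lengths}

    def find(x):
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, identity, aligned in hits:
        coverage = aligned / max(lengths[a], lengths[b])
        if identity >= min_identity and coverage >= min_coverage:
            parent[find(a)] = find(b)

    clusters = {}
    for p in lengths:
        clusters.setdefault(find(p), set()).add(p)
    return sorted(sorted(members) for members in clusters.values())


# Invented data: P4 is a long multidomain protein sharing one domain with P1.
toy_lengths = {"P1": 300, "P2": 310, "P3": 295, "P4": 900, "P5": 120}
toy_hits = [
    ("P1", "P2", 62.0, 290),
    ("P2", "P3", 55.0, 280),
    ("P1", "P4", 70.0, 280),   # high identity, but covers under a third of P4
    ("P4", "P5", 35.0, 110),   # identity too low
]
print(cluster_by_identity(toy_lengths, toy_hits))
```

The coverage test is what keeps a multidomain protein from being pulled into a family on the strength of one shared domain. A single fixed identity cutoff is the weak point, which matches the note above that different branches of the protein universe need very different score cutoffs.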
 
Slide 52:CM-Related Superfamilies
Chorismate Mutase (CM), AroQ class
SF001501 – CM (Prokaryotic type) [PF01817]
SF001499 – tyrA bifunctional enzyme (Prok) [PF01817-PF02153]
SF001500 – pheA bifunctional enzyme (Prok) [PF01817-PF00800]
SF017318 – CM (Eukaryotic type) [Regulatory Dom-PF01817]
Chorismate Mutase, AroH class 
SF005965 – CM [PF01817]
CSM: Class: All alpha proteins
Fold: Chorismate mutase II; multihelical; core: 6 helices, bundle
1DBF: Class: Alpha and beta proteins (a+b); mainly antiparallel beta sheets (segregated alpha and beta regions)
Fold: Bacillus chorismate mutase-like; core: beta-alpha-beta-alpha-beta(2); mixed beta-sheet: order 1423, strand 4 is antiparallel to the rest
 
Slide 53:iProClass Superfamily Report (I) 
Slide 54:iProClass Superfamily Report (II) 
Slide 55:InterPro 
Slide 56:InterPro Entry 
Slide 57:PIR Superfamilies are being integrated into InterPro 
Slide 58:Comparative genomics - a branch of computational biology that uses complete genome sequences
- complete genomes
- reciprocal best hits
- no score cutoffs
Slide 60:Construction of COGs: 
Slide 62:Construction of COGs:Add all homologs 
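The "reciprocal best hits, no score cutoffs" idea can be sketched as follows. The code assumes precomputed pairwise similarity scores between proteins from complete genomes, finds reciprocal best hits between genomes, and then lists triangles of mutually consistent best hits across three genomes as seeds for orthologous groups. All names and scores are invented, and the later step of adding remaining homologs to each seed is not shown.

```python
from itertools import combinations


def best_hit_table(scores, genome_of):
    """For each query protein, keep its top-scoring hit in every other genome."""
    best = {}
    for (query, target), score in scores.items():
        if genome_of[query] == genome_of[target]:
            continue  # only between-genome comparisons count
        key = (query, genome_of[target])
        if key not in best or score > best[key][1]:
            best[key] = (target, score)
    return {key: target for key, (target, _) in best.items()}


def reciprocal_best_hits(scores, genome_of):
    """Pairs of proteins that are each other's best hit; no score cutoff is applied."""
    best = best_hit_table(scores, genome_of)
    pairs = set()
    for (query, _), target in best.items():
        if best.get((target, genome_of[query])) == query:
            pairs.add(frozenset((query, target)))
    return pairs


def rbh_triangles(pairs):
    """Triples of proteins in which every pair is a reciprocal best hit."""
    proteins = sorted({p for pair in pairs for p in pair})
    return [
        triple
        for triple in combinations(proteins, 3)
        if all(frozenset(edge) in pairs for edge in combinations(triple, 2))
    ]


# Invented proteins in three genomes (A, B, C); a2 is a paralog of a1.
toy_genome_of = {"a1": "A", "a2": "A", "b1": "B", "c1": "C"}
toy_scores = {
    ("a1", "b1"): 200, ("b1", "a1"): 190,
    ("a1", "c1"): 150, ("c1", "a1"): 160,
    ("b1", "c1"): 170, ("c1", "b1"): 165,
    ("a2", "b1"): 90,  ("b1", "a2"): 80,
}
toy_pairs = reciprocal_best_hits(toy_scores, toy_genome_of)
print(rbh_triangles(toy_pairs))  # [('a1', 'b1', 'c1')]
```

Because only the ranking of hits matters, slowly and rapidly evolving families are treated alike, which is the practical advantage of not using score cutoffs.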
Slide 69:Two Groups of Unusual RRs [Receiver-X]     SF006198,   COG3279 
   1.  AlgR-related
Pseudomonas aeruginosa (AlgR): alginate biosynthesis
Klebsiella pneumoniae (MrkE): formation of adhesive fimbriae
Clostridium perfringens (VirR): virulence factors
   2. Regulators of autoinduced peptide-controlled regulons
Staphylococcus aureus (AgrA): virulence factors 
Lactobacillus plantarum (PlnC, PlnD): bacteriocin production
Streptococcus pneumoniae (ComE): competence
   
   Properties of the CheY-LytTR transcriptional regulators
Regulate secreted and extracellular factors
Often regulate their own expression
Bind to imperfect direct repeat sites in the -80 to -40 region (or in UAS)
Can be phosphorylated by His kinases, but form operons with HisK-type sensor ATPases
Contain a conserved LytTR-type DNA-binding domain 
Slide 71:Domain organization of LytTR proteins other than CheY-LytTR
Stand-alone LytTR: Streptococcus pneumoniae BlpS; Pseudomonas phage D3 Orf50
40aa - LytTR: Lactococcus lactis L121252; Listeria monocytogenes Lmo0984; Staphylococcus aureus SA2153; Streptococcus pneumoniae SP0161
ABC - LytTR: Bacillus halodurans BH3894
MHYT - LytTR: Oligotropha carboxidovorans CoxC, CoxH
3TM - LytTR: Xanthomonas campestris RpfD; Caulobacter crescentus CC1610; Mesorhizobium loti mll0891
3TM - LytTR: Caulobacter crescentus CC0295
4TM - LytTR: Caulobacter crescentus CC0330, CC3036
8TM - LytTR: Caulobacter crescentus CC0551
PAS - LytTR: Burkholderia cepacia; Geobacter sulfurreducens
Slide 72:Consensus binding site for the LytTR domains
Slide 73:Predicted LytTR-regulated genes
Expected
Bacillus subtilis: natAB (Na+-ATPase)
Oligotropha carboxidovorans: comC, comH (CO growth)
Staphylococcus aureus: lrgAB (autolysis)
Streptococcus pneumoniae: hld (hemolysin delta)
Unexpected
Bacillus subtilis: alr, dinB, rapI, veg, ybaJ, ybbI, yceA, ydbS, ydjL, yebB, yfiV, ykuA
Staphylococcus aureus: capO, coa, hsdR, SA0096, SA0257, SA0285, SA0302, SA0357, SA0358, SA0513
Slide 75:Examples for analysis: 1.
Retrieve one of the following protein sequences:
     PIR: C69086     D64376    GenBank  GI:15679635
Using analysis tools available on the web, check whether the functional annotation is correct, and provide a correct annotation without looking at internal PIR or COG annotations (run BLAST with CD-search and SMART to start with).
When you are done, look at the PIR curated SF annotation (still at internal site only):
http://pir.georgetown.edu/test-cgi/sf/pirclassif.pl?id=SF006549
http://pir.georgetown.edu/cgi-bin/ipcSF1?id=SF006549  (compare with the original automatic SF annotation at the public site), and at the COG annotations.
What caused the wrong annotations? In BLAST outputs for these sequences, do you see other wrongly annotated proteins?
Next, analyze the C-terminal domain of these proteins by PSI-BLAST (and alignment analysis) and speculate as to its function (homework).
  
Slide 76:Examples for analysis: 2.
Retrieve the following sequence: GI:7019521
Take a look at the associated publication (reference).
Analyze the sequence to see if any additional information can be obtained (run PSI-BLAST and, as homework, construct a multiple alignment).
Take a look at the taxonomy report: what does it tell you?
Find an experimental paper associated with one of the sequences found by PSI-BLAST. What annotation is appropriate for this sequence and for the entire family?
 
Slide 77:Examples for analysis: 3.
Predict the function of the following proteins:
GenBank:  GI:  27716853
E. coli  YjeE protein
Verify and/or correct the following functional annotations.  Can you explain why the erroneous annotations were made?
PIR: H87387
GenBank:  GI:15606003  GI:15807219 
PIR: F70338 
 
Slide 78:Examples for analysis: 4. Homework: an exercise in transitive relationships
Start with
>gi|20093648|ref|NP_613495.1| Uncharacterized membrane protein, conserved in Archaea [Methanopyrus kandleri AV19]
(this is a short membrane protein); run PSI-BLAST, making sure you have filtering, complexity and CD-search off. There are no good hits but a bunch of sub-threshold ones. Collect "suspect" relations, use them as queries and expand the net. You will be able to come up with two proteins:
>gi|21227474|ref|NP_633396.1| hypothetical protein [Methanosarcina mazei Goe1]
and
>gi|14324537|dbj|BAB59464.1| hypothetical protein [Thermoplasma volcanium]
When used as a PSI-BLAST query, the first will tie the Methanopyrus protein into a group, while the second will tie this group to the Sec61 subunit of preprotein translocase.
Then, of course, you can obtain the same result with CD-search in a single step.
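The "expand the net" strategy in this exercise amounts to a breadth-first search over a graph of search hits: each sequence retrieved by one query becomes a query in turn until a well-characterized family is reached. The sketch below assumes the hit lists have already been collected by hand. The labels (`query`, `suspect_A` and so on) are placeholders, not real search results.

```python
from collections import deque


def link_path(hits, start, goal):
    """Shortest chain of searches connecting `start` to `goal`, or None.

    `hits` maps a query sequence to the sequences it retrieved
    (including sub-threshold "suspect" hits that were kept for follow-up).
    """
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in hits.get(path[-1], ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


# Placeholder hit lists, invented for illustration.
toy_hit_lists = {
    "query": ["suspect_A", "noise"],
    "suspect_A": ["suspect_B"],
    "suspect_B": ["known_family"],
    "noise": [],
}
print(link_path(toy_hit_lists, "query", "known_family"))
```

A chain found this way is only a hypothesis: every intermediate link rests on a weak hit, so each step should be checked by alignment before the annotation is transferred.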