Protein Structure, Classification and Prediction BMI 730

Victor Jin, Department of Biomedical Informatics, Ohio State University

Protein Structure • Protein Structure Determination • Protein Structure Classification (SCOP, CATH) • Secondary Structure Prediction • Tertiary Structure Prediction • Structure Prediction Evaluation (CASP)

Chemistry • Proteins are linear hetero-polymers of amino acids • Twenty different amino acids (building blocks) • 3-letter code: VAL ARG LYS ILE GLU PRO ARG GLU • 1-letter code: V R K I E P R E
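The correspondence between the two codes on this slide can be sketched as a lookup table. This is a minimal illustration covering only the 20 standard amino acids; the function name `three_to_one` is hypothetical.

```python
# Standard three-letter to one-letter codes for the 20 common amino acids.
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def three_to_one(residues: str) -> str:
    """Convert whitespace-separated three-letter codes to a one-letter string."""
    return "".join(THREE_TO_ONE[name.upper()] for name in residues.split())


# The peptide from the slide.
print(three_to_one("VAL ARG LYS ILE GLU PRO ARG GLU"))  # VRKIEPRE
```

A non-standard residue name raises `KeyError` here, which is usually preferable to silently dropping it.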

Peptide bond • Double bond character of the peptide bond • The peptide bond is planar • Peptide ~ 2-10 amino acids • Polypeptide ~ 10-50 amino acids • Protein ~ 50 or more amino acids • Of the three backbone torsion angles per residue, two (phi and psi) rotate freely and one (omega, about the peptide bond) is fixed • http://www.imb-jena.de/~rake/Bioinformatics_WEB/basics_peptide_bond.html

Amino acids Side chain properties • Size • Charge • Polarity http://www.ch.cam.ac.uk/SGTL/Structures/amino/

Hierarchical nature of protein structure Primary structure (amino acid sequence) ↓ Secondary structure (local conformations: α-helix, β-sheet, reverse turn and loop) ↓ Tertiary structure (global conformation: the three-dimensional structure that results from folding the secondary structures together) ↓ Quaternary structure (structure formed by more than one polypeptide chain)

Basic structural units of proteins: Secondary structure α-helix β-sheet Secondary structures, α-helix and β-sheet, have regular hydrogen-bonding patterns.

Tertiary structure • In globular proteins such as enzymes, the long chain of amino acids folds into a three-dimensional functional shape, the tertiary structure. • Cysteine residues, which carry sulfhydryl (SH) groups, can form covalent disulfide (S-S) bonds with other cysteines in the same chain. • Other interactions between the R groups of amino acids, such as hydrogen bonds, ionic bonds, and hydrophobic interactions, also contribute to the tertiary structure.

A few examples of tertiary structure Myoglobin Dihydrofolate reductase

Quaternary structure • Non-covalent interactions bind multiple polypeptides into a single, larger protein. • Hemoglobin has quaternary structure due to the association of two alpha globin and two beta globin polypeptides.

Structure Stabilizing Interactions Non-covalent • Van der Waals forces (transient, weak electrical attraction of one atom for another) • Hydrophobic (clustering of nonpolar groups) • Hydrogen bonding Covalent • Disulfide bonds

Protein structure determination • X-Ray crystallography • NMR (nuclear magnetic resonance) • Cryo-EM (electron microscopy) • Protein expression • membrane proteins • aggregation

Protein Structure Classification - SCOP • Structural Classification of Proteins database • http://scop.mrc-lmb.cam.ac.uk/scop/ • Hierarchical clustering • Family – clear evolutionary relationship • Superfamily – probable common evolutionary origin • Fold – major structural similarity • Boundaries between levels are more or less subjective • Conservative evolutionary classification leads to many new divisions at the family and superfamily levels, so it is recommended to focus first on the higher levels of the classification tree.

Protein Structure Classification - SCOP • a/a (all alpha)

Protein Structure Classification - SCOP • b/b (all beta)

Protein Structure Classification - SCOP • a/b

Protein Structure Classification - SCOP • a+b

Protein Structure Classification - SCOP • Misc • HIV protease complexed with pepstatin • T-cell receptor/MHC/CD8 complex

Protein Structure Classification - SCOP • SCOP classification statistics • SCOP: Structural Classification of Proteins, 1.69 release • 25973 PDB entries (1 Oct 2004) • 70859 domains • 1 literature reference • (excluding nucleic acids and theoretical models)

Protein Structure Classification - CATH • CATH Protein Structure Classification • http://www.cathdb.info/latest/index.html • CATH is a hierarchical classification of protein domain structures, which clusters proteins at four major levels, Class(C), Architecture(A), Topology(T) and Homologous superfamily (H). • Class, derived from secondary structure content, is assigned for more than 90% of protein structures automatically. • Architecture, which describes the gross orientation of secondary structures, independent of connectivities, is currently assigned manually. • The topology level clusters structures into fold groups according to their topological connections and numbers of secondary structures. • The homologous superfamilies cluster proteins with highly similar structures and functions. The assignments of structures to fold groups and homologous superfamilies are made by sequence and structure comparisons.
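The four-level hierarchy can be made concrete with CATH's dotted numeric identifiers, one number per level (C.A.T.H). The sketch below is illustrative only: the helper names are hypothetical, the class-name table covers just the four main classes, and the example identifiers are used as plausible inputs rather than as a statement about any particular CATH release.

```python
from typing import NamedTuple, Optional

# Names of the four main CATH classes (assumed here for illustration).
CLASS_NAMES = {
    1: "Mainly Alpha",
    2: "Mainly Beta",
    3: "Alpha Beta",
    4: "Few Secondary Structures",
}

LEVELS = ("Class", "Architecture", "Topology", "Homologous superfamily")


class CathCode(NamedTuple):
    class_: int
    architecture: int
    topology: int
    superfamily: int


def parse_cath(code: str) -> CathCode:
    """Split a dotted identifier such as '3.40.50.300' into its four levels."""
    parts = code.split(".")
    if len(parts) != 4:
        raise ValueError(f"expected four dot-separated numbers, got {code!r}")
    return CathCode(*(int(p) for p in parts))


def deepest_shared_level(a: str, b: str) -> Optional[str]:
    """Return the deepest CATH level at which two domains are grouped together."""
    shared = None
    for name, x, y in zip(LEVELS, parse_cath(a), parse_cath(b)):
        if x != y:
            break
        shared = name
    return shared


print(CLASS_NAMES[parse_cath("3.40.50.300").class_])
print(deepest_shared_level("3.40.50.300", "3.40.50.720"))
```

Two domains that agree down to the Topology level share a fold group but are not placed in the same homologous superfamily.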

Protein Structure Classification - CATH • http://www.cathdb.info/cgi-bin/cath/GotoCath.pl?link=cath_info.html • Only crystal structures solved to a resolution better than 4.0 angstroms are considered, together with NMR structures. • All non-proteins, models, and structures with greater than 30% "C-alpha only" are excluded from CATH. • The boundaries and assignments for each protein domain are determined using a combination of automated and manual procedures. These include computational techniques, empirical and statistical evidence, literature review and expert analysis. • Domains within each H-level are subclustered into sequence families using multi-linkage clustering at the following levels:

CATH vs. SCOP

Secondary Structure Prediction • AGADIR - An algorithm to predict the helical content of peptides • APSSP - Advanced Protein Secondary Structure Prediction Server • GOR - Garnier et al, 1996 • HNN - Hierarchical Neural Network method (Guermeur, 1997) • Jpred - A consensus method for protein secondary structure prediction at University of Dundee • JUFO - Protein secondary structure prediction from sequence (neural network) • nnPredict - University of California at San Francisco (UCSF) • Porter - University College Dublin • PredictProtein - PHDsec, PHDacc, PHDhtm, PHDtopology, PHDthreader, MaxHom, EvalSec from Columbia University • Prof - Cascaded Multiple Classifiers for Secondary Structure Prediction • PSA - BioMolecular Engineering Research Center (BMERC) / Boston • PSIpred - Various protein structure prediction methods at Brunel University • SOPMA - Geourjon and Deléage, 1995 • SSpro - Secondary structure prediction using bidirectional recurrent neural networks at University of California • DLP - Domain linker prediction at RIKEN • http://us.expasy.org/tools/#secondary

Determining the Residue Environment • Six basic environment classes (E, P1, P2, B1, B2 and B3) • The environment of each residue in the three-dimensional structure is first classified according to the area of the side chain that is buried in the protein. ---- A residue is considered exposed to solvent (environment class E) if the area buried is less than 40 Å². ---- It is considered partially buried (class P) if the area buried is between 40 and 114 Å². ---- It is considered buried (class B) if the area buried is greater than 114 Å². • The buried and partially buried classes are further subdivided according to the fraction of the side chain area that is exposed to polar atoms ("fraction polar", denoted f). ---- For this purpose polar atoms are defined as those of the solvent and the oxygen and nitrogen atoms of the protein. ---- The buried class is subdivided into classes B1 (f < 0.45), B2 (0.45 <= f < 0.58) and B3 (f >= 0.58). ---- The partially buried class is subdivided into classes P1 (f < 0.67) and P2 (f >= 0.67).
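The thresholds on this slide translate directly into a small decision function. The following is a minimal sketch using exactly those cutoffs; the function name `environment_class` is hypothetical, and computing the buried area and polar fraction from coordinates is assumed to happen elsewhere.

```python
def environment_class(area_buried: float, fraction_polar: float) -> str:
    """Assign one of the six environment classes to a residue.

    area_buried    -- buried side-chain area in square angstroms
    fraction_polar -- fraction f of the side-chain area exposed to polar atoms
    """
    if area_buried < 40.0:
        return "E"  # exposed to solvent
    if area_buried <= 114.0:
        # partially buried: split on f at 0.67
        return "P1" if fraction_polar < 0.67 else "P2"
    # buried: split on f at 0.45 and 0.58
    if fraction_polar < 0.45:
        return "B1"
    if fraction_polar < 0.58:
        return "B2"
    return "B3"


for area, f in [(20.0, 0.9), (80.0, 0.5), (150.0, 0.3), (150.0, 0.6)]:
    print(area, f, environment_class(area, f))
```

Combining the returned class with the residue's secondary structure state (helix, strand, other) would give the 18 classes used in 3D profiles.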

Secondary Structure Prediction - HNN • http://npsa-pbil.ibcp.fr/cgi-bin/secpred_hnn.pl • >gi|78099986|sp|P0ABK2|CYDB_ECOLI Cytochrome d ubiquinol oxidase subunit 2 (Cytochrome d ubiquinol oxidase subunit II) (Cytochrome bd-I oxidase subunit II) MIDYEVLRFIWWLLVGVLLIGFAVTDGFDMGVGMLTRFLGRNDTERRIMINSIAPHWDGNQVWLITAGGA LFAAWPMVYAAAFSGFYVAMILVLASLFFRPVGFDYRSKIEETRWRNMWDWGIFIGSFVPPLVIGVAFGN LLQGVPFNVDEYLRLYYTGNFFQLLNPFGLLAGVVSVGMIITQGATYLQMRTVGELHLRTRATAQVAALV TLVCFALAGVWVMYGIDGYVVKSTMDHYAASNPLNKEVVREAGAWLVNFNNTPILWAIPALGVVLPLLTI LTARMDKAAWAFVFSSLTLACIILTAGIAMFPFVMPSSTMMNASLTMWDATSSQLTLNVMTWVAVVLVPIILLYTAWCYWKMFGRITKEDIERNTHSLY

Secondary Structure Prediction - HNN
Sequence length: 379
HNN:
Alpha helix (Hh): 209 is 55.15%
310 helix (Gg): 0 is 0.00%
Pi helix (Ii): 0 is 0.00%
Beta bridge (Bb): 0 is 0.00%
Extended strand (Ee): 55 is 14.51%
Beta turn (Tt): 0 is 0.00%
Bend region (Ss): 0 is 0.00%
Random coil (Cc): 115 is 30.34%
Ambiguous states (?): 0 is 0.00%
Other states: 0 is 0.00%
        10        20        30        40        50        60        70
         |         |         |         |         |         |         |
MIDYEVLRFIWWLLVGVLLIGFAVTDGFDMGVGMLTRFLGRNDTERRIMINSIAPHWDGNQVWLITAGGA
ccchhhhhhhhhhhhhhheeeeehccchhcchhhhhheecccccceeeeeeccccccccceeeeeeccch
LFAAWPMVYAAAFSGFYVAMILVLASLFFRPVGFDYRSKIEETRWRNMWDWGIFIGSFVPPLVIGVAFGN
hhhhhhhhhhhhhhhhhhhhhhhhhhhhhcccccccccchhhhhhhhhhcceeehccchccheehhhhhc
LLQGVPFNVDEYLRLYYTGNFFQLLNPFGLLAGVVSVGMIITQGATYLQMRTVGELHLRTRATAQVAALV
hhcccccchhhhheeeeccchhhhhcchceccceeeeeeeeeccchhhhhhhchhhhhhchhhhhhhhhh
TLVCFALAGVWVMYGIDGYVVKSTMDHYAASNPLNKEVVREAGAWLVNFNNTPILWAIPALGVVLPLLTI
hhhhhhccceeeeeeccceeeeeccccccccccchhhhhhhhhhhheeccccceeeeccchhhhhhhhhh
LTARMDKAAWAFVFSSLTLACIILTAGIAMFPFVMPSSTMMNASLTMWDATSSQLTLNVMTWVAVVLVPI
hhhhhhhhhhhhhhhhhhhhhhhhhcchhhcccccccchhhccccchhcccchhhhhhhhhhhhhhhhhh
ILLYTAWCYWKMFGRITKEDIERNTHSLY
hhhhhhhhhhhhhhhcchhhhhhhccccc
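The summary percentages in an output like this are simply per-state counts divided by the sequence length. A minimal sketch of that tally follows; the function name `state_composition` is hypothetical, and the short prediction string is a toy example rather than the full output above.

```python
from collections import Counter


def state_composition(prediction: str) -> dict:
    """Return {state: (count, percent)} for a per-residue state string."""
    counts = Counter(prediction.lower())
    total = len(prediction)
    return {
        state: (n, round(100.0 * n / total, 2))
        for state, n in counts.items()
    }


# Toy example: 5 helix (h), 2 strand (e), 3 coil (c) residues.
print(state_composition("ccchhhhhee"))

# The reported helix share is consistent with 209 of 379 residues.
print(round(100.0 * 209 / 379, 2))  # 55.15
```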

Secondary Structure Prediction - PHD • PHDsec predicts secondary structure from multiple sequence alignments. Secondary structure is predicted by a system of neural networks rating at an expected average accuracy > 72% for the three states helix, strand and loop (Rost & Sander, PNAS, 1993 , 90, 7558-7562; Rost & Sander, JMB, 1993 , 232, 584-599; and Rost & Sander, Proteins, 1994 , 19, 55-72). • Evaluated on the same data set, PHDsec is rated at ten percentage points higher three-state accuracy than methods using only single sequence information, and at more than six percentage points higher than, e.g., a method using alignment information based on statistics (Levin, Pascarella, Argos & Garnier, Prot. Engng., 6, 849-54, 1993). • PHDsec predictions have three main features: • improved accuracy through evolutionary information from multiple sequence alignments • improved beta-strand prediction through a balanced training procedure • more accurate prediction of secondary structure segments by using a multi-level system
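Three-state accuracy figures such as the 72% quoted here are commonly reported as Q3: the percentage of residues whose predicted state (helix, strand or loop) matches the observed state. A minimal sketch, assuming both assignments are given as equal-length strings over H, E and L; the function name is hypothetical.

```python
def q3_accuracy(predicted: str, observed: str) -> float:
    """Percentage of residues whose predicted three-state label is correct."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    if not observed:
        raise ValueError("sequences must not be empty")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * matches / len(observed)


# Toy example: 8 of 9 residues agree.
print(q3_accuracy("HHHEEELLL", "HHHEELLLL"))
```

Q3 is a per-residue measure, so it does not capture whether whole segments are predicted correctly, which is the concern behind the multi-level system mentioned on the slide.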

Secondary Structure Prediction - PHD Rost B, Sander C. Prediction of protein secondary structure at better than 70% accuracy. J. Mol. Biol. 1993, 232, 584-599.

Motifs Readily Identified from Sequence • Zinc finger – characteristic order and spacing of cysteine and histidine residues. • Leucine zippers – two alpha helices (typically parallel) held together by interactions between hydrophobic leucine residues at every seventh position in each helix. • Coiled coils – 2-3 helices coiled around each other in a left-handed supercoil (3.5 residues/turn instead of 3.6, i.e. 7 residues per two turns); the first and fourth positions of each heptad are usually hydrophobic, the others hydrophilic; 5-10 heptads. • Transmembrane-spanning proteins – alpha helices comprising amino acids with hydrophobic side chains, typically 20-30 residues.
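The "leucine at every seventh position" rule lends itself to a simple pattern search. The sketch below scans for four leucines spaced seven residues apart, written in PROSITE-like notation as L-x(6)-L-x(6)-L-x(6)-L. It is only a sequence-level screen: a pattern this short also matches many sequences that do not form zippers. The function name and the test sequence are hypothetical.

```python
import re

# Four leucines, each separated by six arbitrary residues.
# The lookahead lets overlapping matches be reported as well.
LEUCINE_ZIPPER = re.compile(r"(?=(L.{6}L.{6}L.{6}L))")


def find_leucine_zippers(sequence: str) -> list:
    """Return (start_index, matched_segment) for every candidate site."""
    return [
        (m.start(), m.group(1))
        for m in LEUCINE_ZIPPER.finditer(sequence.upper())
    ]


# Toy sequence with one heptad-spaced run of leucines starting at index 2.
toy = "MK" + "LAAAAAA" * 3 + "L" + "GG"
print(find_leucine_zippers(toy))
```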

Tertiary Structure Prediction • Comparative modeling • SWISS-MODEL - An automated knowledge-based protein modelling server • 3Djigsaw - Three-dimensional models for proteins based on homologues of known structure • CPHmodels - Automated neural-network based protein modelling server • ESyPred3D - Automated homology modeling program using neural networks • Geno3d - Automatic modeling of protein three-dimensional structure • SDSC1 - Protein Structure Homology Modeling Server • Threading • 3D-PSSM - Protein fold recognition using 1D and 3D sequence profiles coupled with secondary structure information (Foldfit) • Fugue - Sequence-structure homology recognition • HHpred - Protein homology detection and structure prediction by HMM-HMM comparison • Libellula - Neural network approach to evaluate fold recognition results • LOOPP - Sequence to sequence, sequence to structure, and structure to structure alignment • SAM-T02 - HMM-based Protein Structure Prediction • Threader - Protein fold recognition • ProSup - Protein structure superimposition • SWEET - Constructing 3D models of saccharides from their sequences • Ab initio • HMMSTR/Rosetta - Prediction of protein structure from sequence • http://us.expasy.org/tools

Tertiary Structure Prediction – Comparative Modeling • Example: 3Djigsaw - Three-dimensional models for proteins based on homologues of known structure • Contreras-Moreira, B., Bates, P.A. (2002) Domain Fishing: a first step in protein comparative modelling. Bioinformatics 18: 1141-1142.

3D Protein Sequence Profiles • A 3D profile is based on a 3D structure-specific scoring matrix • A 3D scoring matrix is similar to the 1D scoring matrices we discussed in the multiple sequence alignment lectures, with the additional attribute of the structural environment of the amino acid side chain • There are 6 basic environment classes (E, P1, P2, B1, B2 and B3), differing in the area of the side chain that is buried, and by the fraction of the side chain that is exposed to polar atoms • Since amino acids can assume 3 different secondary structures, there are 3 x 6 = 18 different environmental classes • The log odds of each amino acid in each environment type gives the values for the 3D-1D scoring matrix -- calculated from database of protein structures
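The log-odds idea in the last bullet can be sketched as follows: for each amino acid and environment class, compare how often the amino acid occurs in that environment with how often it occurs overall. This is a simplified illustration, not the published procedure; the pseudocount, the function name and the toy observations are assumptions made for the example.

```python
import math
from collections import Counter


def log_odds_matrix(observations, pseudocount: float = 1.0) -> dict:
    """Build {(amino_acid, environment): ln(P(aa | env) / P(aa))}.

    observations -- iterable of (amino_acid, environment) pairs, as would be
                    collected from a database of known structures
    pseudocount  -- added to every pair count so unseen pairs stay finite
    """
    obs = list(observations)
    pair_counts = Counter(obs)
    aa_counts = Counter(aa for aa, _ in obs)
    env_counts = Counter(env for _, env in obs)
    total = len(obs)
    n_aa = len(aa_counts)

    scores = {}
    for env, env_total in env_counts.items():
        for aa, aa_total in aa_counts.items():
            p_aa_given_env = (pair_counts[(aa, env)] + pseudocount) / (
                env_total + pseudocount * n_aa
            )
            p_aa = aa_total / total
            scores[(aa, env)] = math.log(p_aa_given_env / p_aa)
    return scores


# Toy data: leucine mostly buried (B1), lysine mostly exposed (E).
toy = [("L", "B1")] * 8 + [("K", "B1")] * 2 + [("L", "E")] * 2 + [("K", "E")] * 8
scores = log_odds_matrix(toy)
print(scores[("L", "B1")], scores[("K", "B1")])
```

Positive scores mark amino acids that are over-represented in an environment, negative scores those that are under-represented.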

Using 3D Profiles in Structure Prediction • The alignment of an amino acid sequence with a 3D profile yields an overall 3D-1D score. The 3D-1D score is a measure of the compatibility of the sequence with the structure described by the profile • Given an amino acid sequence, find compatible structures ---- Useful for finding homologous structures when doing homology modeling • Given a preliminary or model structure, test its validity ---- Useful for the final phase of homology modeling • Given a structure, find compatible sequences ---- Useful for analyzing evolutionary relationships among proteins
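As a rough illustration of the first use case, the sketch below scores one sequence against two candidate structures, each described as a list of per-position environment classes, by summing 3D-1D scores position by position. It assumes an ungapped, equal-length alignment, which is a deliberate simplification: real profile alignment uses dynamic programming with gaps. The score values and profile names are invented for the example.

```python
def compatibility_score(sequence, environments, scores, default=0.0):
    """Sum the 3D-1D scores of a sequence laid onto a structure, ungapped."""
    if len(sequence) != len(environments):
        raise ValueError("sequence and profile must have the same length")
    return sum(
        scores.get((aa, env), default)
        for aa, env in zip(sequence, environments)
    )


# Hypothetical scoring values: leucine favours buried sites, lysine exposed.
toy_scores = {
    ("L", "B1"): 1.0, ("K", "B1"): -1.0,
    ("L", "E"): -0.5, ("K", "E"): 0.75,
}

# Two hypothetical structures described by their environment classes.
profiles = {
    "structure_a": ["B1", "B1", "E", "E"],
    "structure_b": ["E", "E", "B1", "B1"],
}

sequence = "LLKK"
ranked = sorted(
    profiles,
    key=lambda name: compatibility_score(sequence, profiles[name], toy_scores),
    reverse=True,
)
print(ranked)
```

The same function serves the second use case: a model whose own sequence scores poorly against its own environments is a candidate for refinement.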

Homology Modeling • Definition: Predicting the tertiary structure of an unknown protein using a known 3D structure of protein(s) with homologous sequence • Based on assumption that structure is more conserved than sequence • Important to use homologous proteins whose structures were determined by X-ray crystallography or NMR • Homology modeling is an important method since the number of different protein folds (unique structures) is much smaller than the number of different proteins • Likely that homologous protein sequences will share a common protein fold Some of the material from this section is from: http://www.cs.wright.edu/~mraymer/cs790/Homology_Modeling.ppt

Homology Modeling Procedure • Search databases for homologous protein sequences ---- The Protein Data Bank (PDB) is a good choice, since all of the sequences contained in PDB have solved 3D structures • Align homologous protein sequence with the sequence of interest ---- Pair-wise or Multiple Sequence Alignment can be used • Build a model of the structure of the protein of interest using the known structures of homologous proteins. Possible methods include: 1. Modeling by rigid body assembly 2. Modeling by segment matching or coordinate reconstruction 3. Modeling by satisfaction of spatial constraints • Evaluate and refine model structure
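The first two steps of this procedure (find candidate templates, align them to the target) can be sketched with a basic Needleman-Wunsch global alignment and a percent-identity ranking. This is a teaching sketch, not what a modeling server does: the match/mismatch/gap values stand in for a real substitution matrix, and the template names and sequences are invented rather than taken from PDB.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global pairwise alignment; returns the two gapped strings."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # gap in b
                score[i][j - 1] + gap,      # gap in a
            )

    # Trace back from the bottom-right corner to recover one best alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))


def percent_identity(aligned_a, aligned_b):
    """Identical, non-gap columns as a percentage of the alignment length."""
    same = sum(x == y and x != "-" for x, y in zip(aligned_a, aligned_b))
    return 100.0 * same / len(aligned_a)


def pick_template(target, templates):
    """Return the name of the template most identical to the target."""
    return max(
        templates,
        key=lambda name: percent_identity(
            *needleman_wunsch(target, templates[name])
        ),
    )


# Hypothetical template sequences (not real PDB entries).
templates = {"template_a": "MIDYEVLKF", "template_b": "GGSSPPTTA"}
print(needleman_wunsch("ACGT", "ACT"))
print(pick_template("MIDYEVLRF", templates))
```

In practice the search step would use a database tool and the chosen template's coordinates would feed one of the three model-building methods listed above.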
