Protein Structure Prediction and Determination Methods

513441 BIOCHEMISTRY II • Chapter 1.1 Protein structure prediction • Dr. PORNTIP CHAIMANEE, Chemistry Department, Faculty of Science, Silpakorn University

Why predict protein structure? • Structural knowledge gives some understanding of function and mechanism of action • It can help us understand the effects of mutations on structure and function • Predicted structures can be used in structure-based drug design • The study of protein structure is not only of fundamental scientific interest in terms of understanding biochemical processes, but also produces very valuable practical benefits: • Medicine: the understanding of enzyme function allows the design of new and improved drugs • Agriculture: therapeutic proteins and drugs for veterinary purposes and for treatment of plant diseases • Industry: protein engineering has potential for the synthesis of enzymes to carry out various industrial processes on a mass scale

Protein structure: Limitations • Not all proteins or parts of proteins assume a well-defined 3D structure in solution. • Protein structure is not static; different parts of the structure show various degrees of thermal motion. • There may be a number of slightly different conformations in solution. • Some proteins undergo conformational changes when interacting with other molecules.

What do we need to know in order to state that the tertiary structure of a protein has been solved? • Ideally: we need to determine the position of all atoms and their connectivity. • Less ideally: we need to determine the position of all Cα atoms (backbone structure).

Experimental techniques for structure determination • X-ray crystallography: the interaction of X-rays with the electrons of molecules arranged in a crystal produces an electron-density map, which can be interpreted as an atomic model. Crystals are very hard to grow. • Nuclear magnetic resonance (NMR): some atomic nuclei have a magnetic spin. The molecule is probed with radio-frequency pulses to obtain the distances between atoms. Only applicable to small proteins. • Cryo-electron microscopy (TEM): an imaging technology; low resolution

X-ray Crystallography • Applicable from small molecules and macromolecules (proteins, protein/DNA, protein/RNA and protein/small-molecule complexes) up to viruses • Information about the positions of individual atoms • Limited information about dynamics • Requires crystals

X-ray crystallography • X-ray wavelengths are short enough (about 0.1 nm) to resolve atoms • When an X-ray beam is directed at a crystal, most of it passes through, but a small fraction is scattered by atoms in the sample. • In a well-ordered crystal, the scattered waves reinforce each other and appear as diffraction spots • Diffraction patterns are converted into a 3D electron density map • 3D electron density maps, with knowledge of the amino acid sequence, can be used to determine a 3D protein structure.

Diffraction pattern

X-Ray Crystallography • crystallize the protein so that the molecules are immobilized in a single, well-ordered crystal • bombard with X-rays, record the diffraction pattern of the scattered rays • determine the electron density map from the scattering amplitudes and phases via Fourier transform • use the electron density and biochemical knowledge of the protein to build and refine a model
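The Fourier transform step above is usually written as the synthesis below. This is the standard textbook form and is not taken from the slide. Here V is the unit-cell volume, (x, y, z) are fractional coordinates in the cell, |F(hkl)| is the structure-factor amplitude obtained from the measured intensity of the diffraction spot with indices (h, k, l), and φ(hkl) is its phase, which is not measured directly and has to be estimated separately.

```latex
\rho(x, y, z) = \frac{1}{V} \sum_{h} \sum_{k} \sum_{l}
    \lvert F(hkl) \rvert \, e^{\, i \varphi(hkl)} \, e^{-2 \pi i (hx + ky + lz)}
```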

NMR = Nuclear Magnetic Resonance • A concentrated protein sample is placed in a strong magnetic field • NMR measures nuclear magnetism, or changes in nuclear magnetism, in a molecule • Atomic nuclei of hydrogen atoms exhibit magnetic spin, which aligns with a strong magnetic field • The spins become misaligned when radio-frequency pulses are applied • Radio-frequency radiation is emitted when the hydrogen nuclei return to the aligned state; it is measured and recorded as a spectrum

NMR = Nuclear Magnetic Resonance • The behavior of any atom is influenced by neighboring atoms, which leads to spin-spin coupling • More closely spaced residues are more perturbed than distant residues • Distances can be calculated from the perturbation • Combining NMR information with the amino acid sequence makes it possible to compute a 3D structure of a protein.

NMR • Limited to molecules up to ~50kDa (good quality up to 30 kDa) • Distances between pairs of hydrogen atoms • Lots of information about dynamics • Requires soluble, non-aggregating material • Assignment problem

NMR Spectroscopy: Spectroscopists Get NOESY for Structures • To determine the arrangement of the atoms in the molecule, scientists use a multi-dimensional NMR technique called NOESY (pronounced “nosy”) for Nuclear Overhauser Effect Spectroscopy. This technique works best on hydrogen atoms, which have the strongest NMR signal and are the most abundant atoms in biological systems. They are also the simplest: each hydrogen nucleus contains just a single proton. • The NOESY experiment reveals how close different protons are to each other in space. A pair of protons very close together (typically within 3 angstroms) will give a very strong NOESY signal. More separated pairs of protons will give weaker signals, out to the limit of detection for the technique, which is about 6 angstroms. • From there, the scientists (or, to begin with, their computers) must determine how the atoms are arranged in space. It’s like solving a complex, three-dimensional puzzle. • (Figure captions) Most NMR spectroscopists use magnets that are 500 megahertz to 900 megahertz. The magnet shown is 900 megahertz (Varian NMR Systems), of the kind used for high-resolution protein structures. NMR magnets are superconductors, so they must be cooled with liquid helium, which is kept at 4 Kelvin (-452 degrees Fahrenheit). Liquid nitrogen, which is kept at 77 Kelvin (-321 degrees Fahrenheit), helps keep the liquid helium cold.
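As a rough illustration of how the NOESY distance ranges above turn into a list of proton pairs, here is a minimal Python sketch. The 3 and 6 angstrom thresholds come from the slide. The proton names and coordinates are invented for the example, and the r^-6 falloff of signal intensity is the usual textbook approximation, not something stated on the slide.

```python
import math

STRONG_CUTOFF = 3.0    # angstroms, "typically within 3 angstroms" on the slide
DETECTION_LIMIT = 6.0  # angstroms, "about 6 angstroms" on the slide


def classify_noe(distance):
    """Return a qualitative NOESY signal class for a proton-proton distance."""
    if distance <= STRONG_CUTOFF:
        return "strong"
    if distance <= DETECTION_LIMIT:
        return "weak"
    return "not detected"


def relative_intensity(distance, reference=STRONG_CUTOFF):
    """Assumed r^-6 falloff, normalised so the reference distance gives 1.0."""
    return (reference / distance) ** 6


def noe_restraints(protons):
    """List (name_a, name_b, distance, class) for every detectable proton pair."""
    names = sorted(protons)
    restraints = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            d = math.dist(protons[a], protons[b])
            label = classify_noe(d)
            if label != "not detected":
                restraints.append((a, b, round(d, 2), label))
    return restraints


# Hypothetical proton coordinates in angstroms, for illustration only.
example_protons = {
    "HA_1": (0.0, 0.0, 0.0),
    "HN_2": (2.5, 0.0, 0.0),
    "HB_5": (0.0, 5.0, 0.0),
    "HG_9": (0.0, 0.0, 12.0),
}
restraints = noe_restraints(example_protons)
for restraint in restraints:
    print(restraint)
```

In a real structure calculation the direction is reversed: the measured signals give approximate distances, and the coordinates are what has to be found.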

NMR Spectroscopy • the protein is in aqueous solution, where it is mobile and tumbles/vibrates with thermal motion • NMR detects chemical shifts of atomic nuclei with non-zero spin; the shifts are due to the nearby electronic environment • distances between specific pairs of atoms are determined from the shifts (“constraints”) • constraints and biochemical knowledge of the protein are used to determine an ensemble of models • (Figure panels: determining constraints; using constraints to determine secondary structure)

An idealized NMR spectrum • Number of peaks = n + 1 • n = number of nearby H’s

NMR spectrum of a protein • 2D NMR can be used to resolve peaks • Kurt Wüthrich, Nobel Prize 2002

Electron Microscopy/ Diffraction • Low to medium resolution • Limited information about dynamics • Can use very small crystals (nm range) • Can be used for very large molecules and complexes

Protein data bank • http://www.rcsb.org/pdb/

Protein 3D structure data: The structure of a protein consists of the 3D (X, Y, Z) coordinates of each non-hydrogen atom of the protein. Some protein structures also include coordinates of covalently linked prosthetic groups, non-covalently linked ligand molecules, or metal ions. For some purposes (e.g. structural alignment) only the Cα coordinates are needed. Example of PDB format:

                           X        Y        Z    occupancy  temp. factor
ATOM   18  N   GLY  27   40.315  161.004   11.211    1.00      10.11
ATOM   19  CA  GLY  27   39.049  160.737   10.462    1.00      14.18
ATOM   20  C   GLY  27   38.729  159.239   10.784    1.00      20.75
ATOM   21  O   GLY  27   39.507  158.484   11.404    1.00      21.88

Note: the ATOM records carry no explicit information about connectivity between atoms. The last two numbers (occupancy, temperature factor) relate to disorder of atomic positions in the crystal.
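The four ATOM records in the example can be read with a few lines of Python. This is a minimal sketch that assumes the simplified, whitespace-separated layout shown on the slide. Real PDB files are fixed-column and carry extra fields such as a chain identifier, so whitespace splitting is not a safe general-purpose parser. The field names in the dictionary are chosen here for illustration.

```python
import math

# The four simplified ATOM records shown on the slide.
RECORDS = """\
ATOM 18 N GLY 27 40.315 161.004 11.211 1.00 10.11
ATOM 19 CA GLY 27 39.049 160.737 10.462 1.00 14.18
ATOM 20 C GLY 27 38.729 159.239 10.784 1.00 20.75
ATOM 21 O GLY 27 39.507 158.484 11.404 1.00 21.88
"""


def parse_atom_line(line):
    """Split one simplified ATOM record into named fields."""
    fields = line.split()
    return {
        "serial": int(fields[1]),
        "name": fields[2],
        "residue": fields[3],
        "res_num": int(fields[4]),
        "xyz": (float(fields[5]), float(fields[6]), float(fields[7])),
        "occupancy": float(fields[8]),
        "b_factor": float(fields[9]),  # temperature factor
    }


atoms = [parse_atom_line(line) for line in RECORDS.splitlines()
         if line.startswith("ATOM")]
by_name = {atom["name"]: atom for atom in atoms}

# Connectivity is not stored in the records, but bonded atoms sit close together.
n_ca = math.dist(by_name["N"]["xyz"], by_name["CA"]["xyz"])
print(f"N-CA distance in GLY 27: {n_ca:.2f} angstroms")
```

The same distance calculation underlies the interatomic distance tools in structure viewers.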

A PDB example file: 1dhy

HEADER    OXIDOREDUCTASE (OXYGENASE)    07-JUL-95    1DHY         1DHY   2
TITLE     KKS102 BPHC ENZYME                                      1DHY   3
COMPND    MOL_ID: 1;                                              1DHY   4
COMPND  2 MOLECULE: 2,3-DIHYDROXYBIPHENYL 1,2-DIOXYGENASE;        1DHY   5
…
SOURCE    MOL_ID: 1;                                              1DHY  11
SOURCE  2 ORGANISM_SCIENTIFIC: PSEUDOMONAS SP.;                   1DHY  12
…
SEQRES   1  292  SER ILE GLU ARG LEU GLY TYR LEU GLY PHE ALA VAL LYS   1DHY 131
…
                          X       Y       Z    occupancy  temp. factor
ATOM    1  N    SER  1   77.737  55.894  32.141   1.00      32.93     1DHY 188
ATOM    2  CA   SER  1   78.285  57.279  32.019   1.00      38.09     1DHY 189
ATOM    3  C    SER  1   79.410  57.462  30.998   1.00      33.00     1DHY 190
ATOM    4  O    SER  1   79.707  58.597  30.609   1.00      32.00     1DHY 191
ATOM    5  CB   SER  1   78.708  57.833  33.383   1.00      46.86     1DHY 192
ATOM    6  OG   SER  1   77.573  58.043  34.213   1.00      55.95     1DHY 193
ATOM    7  N    ILE  2   80.098  56.375  30.636   1.00      26.67     1DHY 194
ATOM    8  CA   ILE  2   81.120  56.469  29.589   1.00      19.65     1DHY 195
ATOM    9  C    ILE  2   80.322  56.614  28.286   1.00      18.52     1DHY 196
ATOM   10  O    ILE  2   79.369  55.857  28.058   1.00      18.42     1DHY 197
ATOM   11  CB   ILE  2   82.019  55.220  29.530   1.00      15.93     1DHY 198
ATOM   12  CG1  ILE  2   83.092  55.323  30.614   1.00      15.78     1DHY 199
ATOM   13  CG2  ILE  2   82.618  55.037  28.140   1.00      11.39     1DHY 200
…

Note: the ATOM records carry no explicit information about connectivity between atoms. The last two numbers (occupancy, temperature factor) relate to disorder of atomic positions in the crystal.

PDB New Fold Growth • The number of unique folds in nature is fairly small (possibly a few thousand) • 90% of new structures submitted to PDB in the past three years have similar structural folds in PDB • (Chart legend: old fold, new fold)

Viewing protein structures • When looking at a protein structure, we may ask the following types of questions: • Is a particular residue on the inside or outside of a protein? • Which amino acids interact with each other? • Which amino acids are in contact with a ligand (DNA, peptide hormone, small molecule, etc.)? • Is an observed mutation likely to disturb the protein structure? • Standard capabilities of protein structure software: • Display of protein structures in different ways (wireframe, backbone, sticks, spacefill, ribbon) • Highlighting of individual atoms, residues or groups of residues • Calculation of interatomic distances • Advanced feature: superposition of related structures

Display of protein structures in different ways (wireframe, backbone, sticks, spacefill, ribbon)

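Overall Approach (flowchart) • Protein Sequence → Database Searching and Multiple Sequence Alignment → Homologue in PDB? • Yes: Sequence-Structure Alignment → Homology Modelling → 3-D Protein Model • No: Secondary Structure Prediction → Fold Recognition → Predicted Fold? • Yes: Sequence-Structure Alignment → Homology Modelling → 3-D Protein Model • No: Ab-initio Structure Prediction → 3-D Protein Model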

Predicting protein 3D structure • Goal: 3D structure from 1D sequence • An existing fold: homology modeling or fold recognition • A new fold: ab initio

Protein Structure Prediction: State of the Art • Template-based (or knowledge-based) methods: • Homology modeling: sequence-sequence alignment; works if sequence identity > 25%. Used where there is a clear sequence relationship between the target structure and one or more known structures. • Fold recognition (“threading”): sequence-structure alignment; can go beyond the 25% limit. Used where there is no sequence homology with known structures; finds consistent folds. • Ab initio folding (“de novo”), a simulation-based method: • Deriving structures, approximate or otherwise, from sequence.
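The choice between the three families of methods can be sketched as a small decision function. This is only an illustration of the rule of thumb on this slide: the 25% cutoff is the figure quoted above (other slides in this deck quote 30%), and the function name and arguments are hypothetical.

```python
def choose_method(best_identity, fold_recognised, identity_cutoff=25.0):
    """Pick a prediction strategy for a target sequence.

    best_identity   -- percent identity to the best template found by
                       database searching, or None if no hit was found
    fold_recognised -- True if a threading search found a consistent fold
    identity_cutoff -- percent identity above which homology modeling is used
    """
    if best_identity is not None and best_identity > identity_cutoff:
        return "homology modeling"
    if fold_recognised:
        return "fold recognition (threading)"
    return "ab initio folding"


print(choose_method(62.0, fold_recognised=True))
print(choose_method(18.0, fold_recognised=True))
print(choose_method(None, fold_recognised=False))
```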

Homology modeling • Based on two major observations (and some simplifications): • The structure of a protein is uniquely defined by its amino acid sequence. • Similar sequences adopt similar structures. (Distantly related sequences may still fold into similar structures.)

Homology modeling needs three items of input: • The sequence of a protein with unknown 3D structure, the “target sequence” • A 3D “template”: a structure having the highest sequence identity with the target sequence (> 30% sequence identity) • A sequence alignment between the target sequence and the template sequence
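Sequence identity between target and template is read off the alignment. The sketch below counts identical residues over the aligned, non-gap columns. Note that the denominator is a convention (some tools divide by the shorter sequence length or the full alignment length instead), and the two aligned sequences are invented for the example.

```python
def percent_identity(aligned_target, aligned_template, gap="-"):
    """Percent of identical residues over columns where neither sequence has a gap."""
    if len(aligned_target) != len(aligned_template):
        raise ValueError("aligned sequences must have equal length")
    aligned = 0
    identical = 0
    for a, b in zip(aligned_target, aligned_template):
        if a == gap or b == gap:
            continue  # gap columns are not counted
        aligned += 1
        if a == b:
            identical += 1
    return 100.0 * identical / aligned if aligned else 0.0


# Hypothetical pairwise alignment, for illustration only.
target = "MTYKLILNGK-TLKGE"
template = "MSYKVILNGKATLRGE"

identity = percent_identity(target, template)
usable = identity > 30.0  # the threshold quoted on this slide
print(f"{identity:.1f}% identity, usable as template: {usable}")
```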

Homology Modeling: How it works • Find template • Align target sequence with template • Generate model: add loops, add side chains • Refine model

Two zones of homology modeling [Rost, Protein Eng. 1999]

Automated Web-Based Homology Modelling • SWISS Model : http://www.expasy.org/swissmod/SWISS-MODEL.html • WHAT IF : http://www.cmbi.kun.nl/swift/servers/ • The CPHModels Server : http://www.cbs.dtu.dk/services/CPHmodels/ • 3D Jigsaw : http://www.bmm.icnet.uk/~3djigsaw/ • SDSC1 : http://cl.sdsc.edu/hm.html • EsyPred3D : http://www.fundp.ac.be/urbm/bioinfo/esypred/

Fold recognition = Protein Threading Which of the known folds is likely to be similar to the (unknown) fold of a new protein when only its amino-acid sequence is known?

Fold recognition (Protein Threading) • Example target sequence: MTYKLILN … NGVDGEWTYTE • The goal: find the “correct” sequence-structure alignment between a target sequence and its native-like fold in PDB • Energy function: knowledge (or statistics) based rather than physics based • Should be able to distinguish correct structural folds from incorrect structural folds • Should be able to distinguish correct sequence-fold alignments from incorrect sequence-fold alignments

Fold recognition (Protein Threading) • Basic premise • Statistics from Protein Data Bank (~2,000 structures) • Chances for a protein to have a structural fold that already exists in PDB are quite good • The number of unique structural (domain) folds in nature is fairly small (possibly a few thousand) • 90% of new structures submitted to PDB in the past three years have similar structural folds in PDB

Protein Threading – structure database • Build a template database

Process • Threading: a protein fold recognition technique that involves incrementally replacing the sequence of a known protein structure with a query sequence of unknown structure. The new “model” structure is evaluated using a simple heuristic measure of protein fold quality. The process is repeated against all known 3D structures until an optimal fit is found. • Fold recognition methods: • 3D-PSSM: http://www.sbg.bio.ic.ac.uk/~3dpssm/ • Fugue: http://www-cryst.bioc.cam.ac.uk/~fugue/ • HHpred: http://protevo.eb.tuebingen.mpg.de/toolkit/index.php?view=hhpred
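The threading loop described above can be sketched with a deliberately crude scoring rule. Everything specific here is an assumption made for illustration: each template is reduced to a string of buried (B) and exposed (E) positions, the “fold quality” heuristic simply rewards hydrophobic residues in buried positions and polar residues in exposed ones, and the query is slid along each template without gaps. Real servers such as those listed above use gapped alignments and statistically derived potentials.

```python
HYDROPHOBIC = set("AVLIMFWC")  # a simple, assumed hydrophobic residue set


def position_score(residue, environment):
    """+1 if the residue suits the environment ('B' buried, 'E' exposed), else -1."""
    buried_ok = residue in HYDROPHOBIC and environment == "B"
    exposed_ok = residue not in HYDROPHOBIC and environment == "E"
    return 1 if (buried_ok or exposed_ok) else -1


def thread(query, environments):
    """Slide the query along one template without gaps; return (best_score, offset)."""
    best = None
    for offset in range(len(environments) - len(query) + 1):
        window = environments[offset:offset + len(query)]
        score = sum(position_score(r, e) for r, e in zip(query, window))
        if best is None or score > best[0]:
            best = (score, offset)
    return best  # None if the template is shorter than the query


def recognise_fold(query, templates):
    """Thread the query through every template; return (score, name, offset) of the best."""
    results = []
    for name, environments in templates.items():
        hit = thread(query, environments)
        if hit is not None:
            results.append((hit[0], name, hit[1]))
    return max(results) if results else None


# Hypothetical template library: burial pattern per position.
TEMPLATES = {
    "fold_A": "EEBBEEBBEEBB",
    "fold_B": "BEBEBEBEBEBE",
}
query = "KDLVSEIA"

best_score, best_fold, best_offset = recognise_fold(query, TEMPLATES)
print(best_fold, best_score, best_offset)
```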

Ab initio folding (“de novo”) • No template is available, so the structure is predicted by folding simulation • Rosetta [Simons et al. 1997]: • Based on the assumption that short sequence segments independently sample distinct distributions of local conformations, taken from known structures • Folding happens when the orientations and conformations of the segments allow low free energy interactions • Optimized by a Monte Carlo search procedure

Ab-initio, theoretical modeling, and conformation space search • Ab-initio = given amino acid primary structure, i.e. sequence, derive structure from first principles (e.g. treat amino acids as beads and derive possible structures by rotating through all possible φ, ψ angles using a “reliable” energy function, then optimize globally) • Theoretical modeling = subset of ab-initio; given amino acid primary structure and knowledge about characteristic features, derive a structure that has those features (e.g. protein has an iron binding site → possible heme substructure) • Conformation space search = subset of ab-initio, but a stochastic search in which the sample space is reduced by initial conditions/assumptions (e.g. reduce sample space to conform to the Ramachandran plot)

ab-initio folding Goal: Predict structure from “first principles” Requires: • A free energy function, sufficiently close to the “true potential” • A method for searching the conformational space Advantages: • Works for novel folds • Shows that we understand the process Disadvantages: • Applicable to short sequences only
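The two requirements above, an energy function and a search method, can be illustrated with a toy Metropolis Monte Carlo search over a handful of torsion angles. This is only a sketch of the search idea. The energy function is a stand-in that already knows the target angles, which a real free energy function of course does not, and the target values, step size and temperature are arbitrary choices made for the example.

```python
import math
import random


def toy_energy(angles, targets):
    """Toy energy: zero when every angle equals its target, positive otherwise."""
    return sum(1.0 - math.cos(a - t) for a, t in zip(angles, targets))


def metropolis_search(targets, steps=5000, step_size=0.3, temperature=0.1, seed=0):
    """Random-walk search over angles using the Metropolis acceptance rule."""
    rng = random.Random(seed)
    angles = [rng.uniform(-math.pi, math.pi) for _ in targets]
    energy = toy_energy(angles, targets)
    initial_energy = energy
    best_angles, best_energy = list(angles), energy

    for _ in range(steps):
        i = rng.randrange(len(angles))          # pick one angle to perturb
        old = angles[i]
        angles[i] = old + rng.uniform(-step_size, step_size)
        new_energy = toy_energy(angles, targets)
        delta = new_energy - energy
        # Always accept downhill moves; accept uphill moves with Boltzmann probability.
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            energy = new_energy
            if energy < best_energy:
                best_angles, best_energy = list(angles), energy
        else:
            angles[i] = old                      # reject: restore the old angle

    return initial_energy, best_energy, best_angles


# Hypothetical "native" torsion angles in radians, for illustration only.
TARGETS = [-1.0, -0.8, 2.3, -2.0, 1.0, 0.5]

initial_energy, best_energy, best_angles = metropolis_search(TARGETS)
print(f"initial energy {initial_energy:.3f}, best energy found {best_energy:.3f}")
```

Even in this toy the cost grows with the number of angles, which hints at why the approach is listed above as applicable to short sequences only.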

http://www.bioinfo.rpi.edu/~bystrc/hmmstr/server.php Qian et al. (Nature: 2007) used distributed computing* to predict the 3D structure of a protein from its amino-acid sequence. Here, their predicted structure (grey) of a protein is overlaid with the experimentally determined crystal structure (color) of that protein. The agreement between the two is excellent. *70,000 home computers for about two years.


Prediction Problems in Protein Structural Analysis • Protein secondary structure prediction • Protein phi-psi angle prediction • Predicting disulphide bridges • Predicting beta-turns • Domain recognition • Domain boundary detection • Protein structural classification • Mining structural motifs • … • Not easy: it is a grand challenge of computational biology

Protein Structure Prediction • Stage 1: Backbone Prediction • Ab initio folding • Homology modeling • Protein threading • Stage 2: Loop Modeling • Stage 3: Side-Chain Packing • Stage 4: Structure Refinement The picture is adapted from http://www.cs.ucdavis.edu/~koehl/ProModel/fillgap.html

End of Protein Structure Prediction

Sequence Similarity • Sequence similarity implies structural, functional, and evolutionary commonality

Sequences of Cytochrome C Cytochrome C is a protein which can be found in all aerobic organisms.

Homologous Proteins: Structures of Cytochrome C (figure panels: tuna-heart mitochondria, photosynthetic bacterium, denitrifying bacterium)

Homologous Proteins: Enterotoxin and Cholera toxin (figure panels: enterotoxin, cholera toxin; 80% homology)

Sequence Similarity • Sequence similarity implies structural, functional, and evolutionary commonality • Low sequence similarity implies little structural similarity
