Protein Structure Prediction Ram Samudrala University of Washington

Rationale for understanding protein structure and function
• Protein sequence: large numbers of sequences, including whole genomes
• From sequence to structure: structure determination, structure prediction
• Protein structure: three dimensional, complicated, mediates function
• From structure to function: homology, rational mutagenesis, biochemical analysis, model studies
• Protein function: rational drug design and treatment of disease; protein and genetic engineering; build networks to model cellular pathways; study organismal function and evolution
• Sequence directly to function: ?

Protein folding
• DNA (written as mRNA codons): …-CUA-AAA-GAA-GGU-GUU-AGC-AAG-GAU-…
• protein sequence: …-L-K-E-G-V-S-K-D-… (each letter is one amino acid)
• unfolded protein: not unique, mobile, inactive, expanded, irregular
• spontaneous self-organisation (~1 second)
• native state: unique shape, precisely ordered, stable/functional, globular/compact, helices and sheets
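The codon-to-amino-acid step on this slide can be sketched in a few lines of Python. This is only an illustration using the standard genetic code; the names `CODON_TABLE` and `translate` are made up for this sketch. The last codon is taken as GAU (Asp, D) so that the translation matches the protein sequence shown on the slide; GUU would encode Val (V).

```python
# Standard genetic code, built from the conventional U, C, A, G base ordering.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def translate(codons):
    """Translate hyphen-separated mRNA codons into one-letter amino acid codes."""
    return "-".join(CODON_TABLE[codon] for codon in codons.split("-"))


# Codons from the slide, with the final codon read as GAU (assumption, see above).
slide_codons = "CUA-AAA-GAA-GGU-GUU-AGC-AAG-GAU"
protein = translate(slide_codons)
print(protein)
```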

Protein folding landscape
• Large multi-dimensional space of changing conformations
• Figure: free energy along the folding reaction, showing the unfolded state, the molten globule, a barrier of height ΔG* and the native state; times of 10^-3 s and 10^-8 s are marked

Protein primary structure
• twenty types of amino acids
• two amino acids join by forming a peptide bond
• each residue in the main chain has two degrees of freedom (φ and ψ)
• the amino acid side chains can have up to four degrees of freedom (χ1-4)
• Figure: chemical structures of a peptide bond forming between two amino acids, and of a polypeptide chain with the φ, ψ and χ angles marked
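Each of these degrees of freedom is a torsion (dihedral) angle defined by four consecutive atoms. A minimal sketch of how such an angle can be computed from Cartesian coordinates follows; the function name `dihedral` and the helper names are illustrative, and the sign convention assumed is the usual one (0° for cis, 180° for trans, positive for a clockwise rotation when looking along the central bond).

```python
import math


def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral(p0, p1, p2, p3):
    """Signed torsion angle in degrees about the p1-p2 bond."""
    b1, b2, b3 = _sub(p1, p0), _sub(p2, p1), _sub(p3, p2)
    n1 = _cross(b1, b2)  # normal to the plane of p0, p1, p2
    n2 = _cross(b2, b3)  # normal to the plane of p1, p2, p3
    x = _dot(n1, n2)
    y = _dot(b2, _cross(n1, n2)) / math.sqrt(_dot(b2, b2))
    return math.degrees(math.atan2(y, x))


# psi of a residue uses N, CA, C of that residue and N of the next one;
# phi uses C of the previous residue and N, CA, C of the current one.
quarter_turn = dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1))
print(quarter_turn)
```

The same function applies to the side chain χ angles; only the choice of the four atoms changes.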

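Protein secondary structure
• many φ,ψ combinations are not possible
• Figure: Ramachandran plot of φ against ψ (each from -180 to +180) with the β, α and L regions marked
• Figure: example structures, with N and C termini labelled: α helix, β sheet (parallel), β sheet (anti-parallel)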

Protein tertiary and quaternary structures
• Ribonuclease inhibitor (2bnh)
• Haemoglobin (1hbh)
• Hemagglutinin (1hgd)

Methods for determining protein structure
• X-ray crystallography
• NMR spectroscopy

X-ray crystallography - concept
• X-rays interact with electrons in protein molecules arranged in a crystal to produce diffraction patterns
• The diffraction patterns of the X-rays can be used to determine the three-dimensional structure of proteins
• Provides a “static” picture
From <http://info.bio.cmu.edu/courses/03231/LecF01/Lec25/lec25.html>

X-ray crystallography - details
• Prepare protein crystals where the proteins are organised in a precise crystal lattice
• Shine X-rays on the crystals; the X-rays diffract off the electrons of atoms in the crystals, and the intensities of the individual reflections are measured
• Phases are usually obtained indirectly by isomorphous replacement, from the way one or a few heavy atoms incorporated into the same isomorphous crystal lattice affect the diffraction pattern
• Intensities and phases of all reflections are combined in a Fourier transform to provide maps of electron density
• Interpret the map by fitting the polypeptide chain to the contours
• Refine the model by minimising the difference between the observed amplitudes and the calculated amplitudes

NMR spectroscopy - concept
• The magnetic-spin properties of atomic nuclei within a molecule are used to obtain a list of distance constraints between atoms in the molecule, from which a three-dimensional structure of the protein molecule can be obtained
• Provides a “dynamic” picture
• Examples shown: NK-lysin (1nkl), S1 RNA binding domain (1sro)

NMR spectroscopy - details
• Protein molecules placed in a strong magnetic field have their hydrogen atoms aligned to the field; the alignment can be excited by applying radio frequency (RF) pulses
• It is possible to obtain a unique signal (chemical shift) for each hydrogen atom in a protein molecule
• Structural information arises primarily from the Nuclear Overhauser Effect (NOE), which gives information about distances between atoms in a molecule
• A pair of protons gives a detectable NOE cross-peak if they are within 5.0 Å of each other in space
• After obtaining NOE data for protons throughout the structure, a number of independent structures can be generated that are consistent with the distance constraints

Computer representation of protein structure
• Structures are stored in the protein data bank (PDB), a repository of mostly experimental models based on X-ray crystallographic and NMR studies
• <http://www.rcsb.org>
• Atoms are defined by their Cartesian coordinates:
    ATOM      1  N   GLU     1      18.222  18.496 -16.203  1.00 21.95
    ATOM      2  CA  GLU     1      17.706  17.982 -14.905  1.00 16.74
    ATOM      3  C   GLU     1      17.368  16.466 -15.121  1.00 15.45
    ATOM      4  O   GLU     1      16.780  16.073 -16.175  1.00 18.81
    ATOM      5  CB  GLU     1      16.552  18.744 -14.351  1.00 17.35
    ATOM      6  CG  GLU     1      16.952  20.118 -13.803  1.00 24.48
    ATOM      7  CD  GLU     1      15.881  21.145 -13.597  1.00 31.51
    ATOM      8  OE1 GLU     1      16.012  22.316 -13.292  1.00 29.12
    ATOM      9  OE2 GLU     1      14.701  20.768 -13.799  1.00 35.19
    ATOM     10  N   PHE     2      17.762  15.746 -14.052  1.00 15.83
    ATOM     11  CA  PHE     2      17.509  14.262 -14.184  1.00 13.24
• These structures provide the basis for most of the theoretical work in protein folding and protein structure prediction
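As a rough illustration of how such records can be read, the sketch below parses ATOM lines by splitting on whitespace. This works for the simplified records shown on this slide; real PDB files are fixed-column and carry extra fields (chain identifier, element and so on), so a production parser should slice columns or use an established library. The `Atom` tuple and `parse_atom_line` are names invented for this example.

```python
from typing import NamedTuple, Optional


class Atom(NamedTuple):
    serial: int
    name: str
    res_name: str
    res_seq: int
    x: float
    y: float
    z: float


def parse_atom_line(line: str) -> Optional[Atom]:
    """Parse one whitespace-separated ATOM record; return None for other lines."""
    fields = line.split()
    if len(fields) < 8 or fields[0] != "ATOM":
        return None
    return Atom(int(fields[1]), fields[2], fields[3], int(fields[4]),
                float(fields[5]), float(fields[6]), float(fields[7]))


# Four of the records from the slide.
records = """\
ATOM 1 N GLU 1 18.222 18.496 -16.203 1.00 21.95
ATOM 2 CA GLU 1 17.706 17.982 -14.905 1.00 16.74
ATOM 10 N PHE 2 17.762 15.746 -14.052 1.00 15.83
ATOM 11 CA PHE 2 17.509 14.262 -14.184 1.00 13.24
"""

atoms = [a for a in map(parse_atom_line, records.splitlines()) if a is not None]
ca_atoms = [a for a in atoms if a.name == "CA"]  # C-alpha trace
for atom in ca_atoms:
    print(atom.res_name, atom.res_seq, atom.x, atom.y, atom.z)
```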

Comparison of protein structures
• Need ways to determine if two protein structures are related and to compare predicted models to experimental structures
• A commonly used measure is the root mean square deviation (RMSD) of the Cartesian coordinates of the atoms in two structures after optimal superposition (McLachlan, 1979): RMSD = sqrt( (1/N) Σ d_i^2 ), where d_i is the distance between the i-th pair of equivalent atoms and N is the number of pairs
• Usually use Cα atoms
• Other measures include contact maps and torsion angle RMSDs
• Figure: NK-lysin (1nkl), Bacteriocin T102/as48 (1e68) and the T102 best model, with RMSD values of 3.6 Å and 2.9 Å marked
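A self-contained sketch of RMSD after optimal superposition is given below. It is not McLachlan's original procedure: it uses the quaternion formulation (Horn, 1987), in which the minimum sum of squared deviations equals G_a + G_b - 2λ_max, where G_a and G_b are the summed squared norms of the centred coordinates and λ_max is the largest eigenvalue of a 4x4 matrix built from their cross-covariance. The eigenvalue is found by Newton iteration on the characteristic polynomial, starting from an upper bound. All function names are illustrative.

```python
import math


def _det(m):
    """Determinant by cofactor expansion (fine for 3x3 and 4x4 matrices)."""
    if len(m) == 1:
        return m[0][0]
    return sum((-1) ** j * m[0][j] * _det([row[:j] + row[j + 1:] for row in m[1:]])
               for j in range(len(m)))


def _centred(points):
    n = len(points)
    c = [sum(p[k] for p in points) / n for k in range(3)]
    return [tuple(p[k] - c[k] for k in range(3)) for p in points]


def superposed_rmsd(a, b):
    """RMSD between paired 3D points after optimal translation and rotation."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    n = len(a)
    a, b = _centred(a), _centred(b)
    g = sum(c * c for p in a + b for c in p)  # G_a + G_b
    s = [[sum(p[j] * q[k] for p, q in zip(a, b)) for k in range(3)] for j in range(3)]
    (sxx, sxy, sxz), (syx, syy, syz), (szx, szy, szz) = s
    m = [  # Horn's symmetric, traceless 4x4 matrix
        [sxx + syy + szz, syz - szy, szx - sxz, sxy - syx],
        [syz - szy, sxx - syy - szz, sxy + syx, szx + sxz],
        [szx - sxz, sxy + syx, -sxx + syy - szz, syz + szy],
        [sxy - syx, szx + sxz, syz + szy, -sxx - syy + szz],
    ]
    # Characteristic polynomial: lam^4 + c2*lam^2 + c1*lam + c0.
    c2 = -2.0 * sum(v * v for row in s for v in row)
    c1 = -8.0 * _det(s)
    c0 = _det(m)
    lam = g / 2.0  # upper bound on the largest eigenvalue
    for _ in range(100):
        p = lam ** 4 + c2 * lam ** 2 + c1 * lam + c0
        dp = 4.0 * lam ** 3 + 2.0 * c2 * lam + c1
        if dp == 0:
            break
        step = p / dp
        lam -= step
        if abs(step) < 1e-12 * abs(lam):
            break
    return math.sqrt(max(0.0, g - 2.0 * lam) / n)


reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0), (0.0, 0.0, 3.0)]
# The same points rotated by 90 degrees about z and shifted: RMSD should be ~0.
rotated = [(5.0 - y, x - 1.0, z + 2.0) for x, y, z in reference]
print(superposed_rmsd(reference, rotated))
```

In practice the two coordinate lists would hold equivalent Cα atoms from the two structures, in the same order.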

Methods for predicting protein structure
• comparative modelling
• fold recognition
• ab initio prediction

Comparative modelling of protein structure
• Proteins that have similar sequences (i.e., related by evolution) have similar three-dimensional structures
• A model of a protein whose structure is not known can be constructed if the structure of a related protein has been determined by experimental methods
• Similarity must be obvious and significant for good models to be built
• Need ways to build regions that are not similar between the two related proteins
• Need ways to move model closer to the native structure

Comparative modelling of protein structure
• scan
• align (stars mark identical positions):
    KDHPFGFAVPTKNPDGTMNLMNWECAIP
    KDPPAGIGAPQDN----QNIMLWNAVIP
    ** * *   *  *     * * *   **
• build initial model
• construct non-conserved side chains and main chains
• refine
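The percentage identity (PID) quoted for alignments like the one above can be computed directly from the two aligned strings. The sketch below counts identical residues over the columns where neither sequence has a gap; other denominators (for example the length of the shorter sequence) are also in use, so the convention here is an assumption, as are the variable names and which sequence is called the target.

```python
def count_identities(seq_a, seq_b, gap="-"):
    """Return (identical positions, aligned non-gap positions) for two aligned strings."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    aligned = [(x, y) for x, y in zip(seq_a, seq_b) if x != gap and y != gap]
    identical = sum(1 for x, y in aligned if x == y)
    return identical, len(aligned)


# The two aligned sequences from the slide.
target = "KDHPFGFAVPTKNPDGTMNLMNWECAIP"
template = "KDPPAGIGAPQDN----QNIMLWNAVIP"

identical, aligned = count_identities(target, template)
percent_identity = 100.0 * identical / aligned
print(identical, aligned, round(percent_identity, 1))
```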

Fold recognition
• The number of possible protein structures/folds is limited (large number of sequences but few folds)
• Proteins that do not have similar sequences sometimes have similar three-dimensional structures
• A sequence whose structure is not known is fitted directly (or “threaded”) onto a known structure and the “goodness of fit” is evaluated using a discriminatory function
• Need ways to move model closer to the native structure
• Figure: NK-lysin (1nkl) and Bacteriocin T102/as48 (1e68), 3.6 Å, 5% ID

Fold recognition
• evaluate fit (stars mark identical positions):
    KDHPFGFAVPTKNPDGTMNLMNWECAIP
    KDPPAGIGAPQDN----QNIMLWNAVIP
    ** * *   *  *     * * *   **
• build initial model
• construct non-conserved side chains and main chains
• refine

Ab initio prediction of protein structure – concept
• Go from sequence to structure by sampling the conformational space in a reasonable manner and select a native-like conformation using a good discrimination function
• Problems: conformational space is astronomical, and it is hard to design functions that are not fooled by non-native conformations (or “decoys”)

Ab initio prediction of protein structure
• sample conformational space such that native-like conformations are found
• astronomically large number of conformations: 5 states per residue over 100 residues = 5^100 ≈ 10^70
• select: hard to design functions that are not fooled by non-native conformations (“decoys”)
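The size of the search space quoted above is easy to check with exact integer arithmetic. The variable names in this short sketch are illustrative.

```python
import math

states_per_residue = 5
residues = 100

# Exact count of conformations for a discrete-state chain.
conformations = states_per_residue ** residues

# Order of magnitude: log10(5^100) = 100 * log10(5).
exponent = residues * math.log10(states_per_residue)

print(f"{conformations:.3e}")
print(f"10^{exponent:.1f}")
```

The exact value is a 70-digit number, a little under 10^70, which is why the slide rounds it to 10^70.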

Sampling conformational space – continuous approaches
• Most work in the field
• Molecular dynamics
• Continuous energy minimisation (follow a valley)
• Monte Carlo simulation
• Genetic Algorithms
• Like real polypeptide folding process
• Cannot be sure if native-like conformations are sampled
• Figure: energy surface (axis label: energy)

Molecular dynamics
• Force = -dU/dx (slope of potential U); acceleration: m a(t) = force
• All atoms are moving so forces between atoms are complicated functions of time
• Analytical solution for x(t) and v(t) is impossible; numerical solution is trivial
• Atoms move for very short times of 10^-15 seconds or 0.001 picoseconds (ps)
• x(t+Δt) = x(t) + v(t)Δt + [4a(t) - a(t-Δt)] Δt^2/6 (new position = old position + old velocity term + acceleration term)
• v(t+Δt) = v(t) + [2a(t+Δt) + 5a(t) - a(t-Δt)] Δt/6 (new velocity = old velocity + acceleration term)
• U_kinetic = ½ Σ m_i v_i(t)^2 = ½ n k_B T, where n is the number of coordinates (not atoms)
• Total energy (U_potential + U_kinetic) must not change with time
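The position and velocity updates on this slide have the form of Beeman's algorithm. The sketch below applies them to a one-dimensional harmonic oscillator rather than a protein, purely to show the update order and the energy-conservation check. The function names, the toy potential and the choice to bootstrap a(t-Δt) with a(t) on the first step are assumptions of this example.

```python
import math


def beeman(x, v, accel, dt, steps):
    """Integrate one coordinate with the slide's update formulas."""
    a = accel(x)
    a_prev = a  # assumption: no earlier step exists, so reuse a(t) for a(t - dt)
    trajectory = [(x, v)]
    for _ in range(steps):
        x_new = x + v * dt + (4.0 * a - a_prev) * dt * dt / 6.0
        a_new = accel(x_new)  # force = -dU/dx at the new position
        v_new = v + (2.0 * a_new + 5.0 * a - a_prev) * dt / 6.0
        x, v, a_prev, a = x_new, v_new, a, a_new
        trajectory.append((x, v))
    return trajectory


# Toy system: U = 0.5 * k * x^2, so force = -k * x.
k, m = 1.0, 1.0


def accel(x):
    return -k * x / m


def total_energy(x, v):
    return 0.5 * k * x * x + 0.5 * m * v * v


traj = beeman(x=1.0, v=0.0, accel=accel, dt=0.01, steps=1000)
e_start = total_energy(*traj[0])
e_end = total_energy(*traj[-1])
drift = abs(e_end - e_start) / e_start
print(traj[-1][0], math.cos(10.0), drift)
```

For this toy system the exact solution is x(t) = cos(t), so the final position can be compared with cos(10); the relative energy drift should stay small for a time step this short.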

Energy minimisation
• For a given protein, the energy depends on thousands of x,y,z Cartesian atomic coordinates; reaching a deep minimum is not trivial
• With convergence, we have an accurate equilibrium conformation and a well-defined energy value
• Figure: energy and RMSD plotted against number of steps from the starting conformation, for steepest descent and conjugate gradient; labels: deep minimum, give up, converge

Monte Carlo simulation
• Discrete moves in torsion or Cartesian conformational space
• Evaluate energy after every move and compare to previous energy (ΔE)
• Accept conformation based on Boltzmann probability: P = exp(-ΔE / k_B T)
• Many variations, including simulated annealing (starting with a high temperature so more moves are accepted initially and then cooling)
• If run for infinite time, simulation will produce a Boltzmann distribution
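A minimal sketch of this procedure, using the standard Metropolis acceptance rule (downhill moves always accepted, uphill moves accepted with probability exp(-ΔE/T)) and a geometric cooling schedule. The one-dimensional double-well energy, the temperature units (k_B folded into T), the schedule and every function name are assumptions made for illustration; a real simulation would move torsion angles or Cartesian coordinates of a protein.

```python
import math
import random


def metropolis_accept(delta_e, temperature, rng):
    """Metropolis criterion: accept downhill always, uphill with Boltzmann probability."""
    if delta_e <= 0:
        return True
    return rng.random() < math.exp(-delta_e / temperature)


def simulated_annealing(energy, x0, t_start=2.0, t_end=0.01,
                        steps=5000, step_size=0.5, seed=0):
    rng = random.Random(seed)
    x, e = x0, energy(x0)
    best_x, best_e = x, e
    for i in range(steps):
        # Geometric cooling from t_start down to t_end.
        t = t_start * (t_end / t_start) ** (i / (steps - 1))
        x_new = x + rng.uniform(-step_size, step_size)  # discrete random move
        e_new = energy(x_new)
        if metropolis_accept(e_new - e, t, rng):
            x, e = x_new, e_new
            if e < best_e:
                best_x, best_e = x, e
    return best_x, best_e


def toy_energy(x):
    """Double well: local minimum near x = +1, global minimum near x = -1."""
    return (x * x - 1.0) ** 2 + 0.3 * x


# Start in the wrong (local) well and let annealing find the deeper one.
best_x, best_e = simulated_annealing(toy_energy, x0=1.0)
print(best_x, best_e)
```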

Genetic Algorithms
• Generate an initial pool of conformations
• Perform crossover and mutation operations on this set to generate a much larger pool of conformations
• Select a subset of the fittest conformations from this large pool
• Repeat above two steps until convergence
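The four steps above can be sketched as follows. Here a "conformation" is just a list of torsion-like numbers and fitness is a toy energy measuring distance from a made-up target; the operators (one-point crossover, Gaussian mutation), the pool sizes and the fixed number of generations used in place of a convergence test are all assumptions of this illustration.

```python
import random


def genetic_search(energy, n_genes, pool_size=20, offspring=100,
                   generations=50, seed=0):
    rng = random.Random(seed)
    # Step 1: initial pool of random conformations (angles in degrees).
    pool = [[rng.uniform(-180.0, 180.0) for _ in range(n_genes)]
            for _ in range(pool_size)]
    pool.sort(key=energy)
    history = [energy(pool[0])]
    for _ in range(generations):
        # Step 2: crossover and mutation generate a much larger pool.
        children = []
        for _ in range(offspring):
            mother, father = rng.sample(pool, 2)
            cut = rng.randrange(1, n_genes)  # one-point crossover
            child = mother[:cut] + father[cut:]
            if rng.random() < 0.3:  # occasional mutation of one angle
                i = rng.randrange(n_genes)
                child[i] += rng.gauss(0.0, 10.0)
            children.append(child)
        # Step 3: keep the fittest; parents compete too, so the best never gets worse.
        pool = sorted(pool + children, key=energy)[:pool_size]
        history.append(energy(pool[0]))
    return pool[0], history


# Hypothetical "native" torsion angles used only to define a toy energy.
target = [-60.0, -45.0, -60.0, -45.0]


def toy_energy(conformation):
    return sum((a - b) ** 2 for a, b in zip(conformation, target))


best, history = genetic_search(toy_energy, n_genes=len(target))
print(history[0], history[-1])
```

Because the parents are kept in the selection step, the best energy recorded in `history` can only stay the same or decrease from one generation to the next.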

Sampling conformational space – exhaustive approaches
• enumerate all possible conformations, then select
• view entire space (perfect partition function)
• must use discrete state models to minimise number of conformations explored
• computationally intractable: 5 states per residue over 100 residues = 5^100 ≈ 10^70 possible conformations

Scoring/energy functions
• Need a way to select native-like conformations from non-native ones
• Physics-based functions: electrostatics, van der Waals, solvation, bond/angle terms
• Knowledge-based scoring functions: derive information about atomic properties from a database of experimentally determined conformations; common parameters include pairwise atomic distances and amino acid burial/exposure

Requirements for sampling methods and scoring functions
• Sampling methods must produce good decoy sets that are comprehensive and include several native-like structures
• Scoring function scores must correlate well with RMSD of conformations (the better the score/energy, the lower the RMSD)
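One simple way to check the second requirement on a decoy set is to compute the correlation between score and RMSD. The sketch below uses a Pearson correlation coefficient; the decoy values are invented for illustration (lower score meaning better), and `pearson` is a name chosen for this example.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical decoy set: (C-alpha RMSD to native in Å, score; lower is better).
decoys = [(1.2, -95.0), (2.5, -80.0), (4.0, -62.0), (6.3, -41.0), (8.1, -30.0)]

rmsds = [rmsd for rmsd, _ in decoys]
scores = [score for _, score in decoys]
r = pearson(rmsds, scores)

# A useful function should also rank a native-like decoy first.
best_scoring = min(decoys, key=lambda d: d[1])
print(round(r, 3), best_scoring)
```

A correlation close to +1 here means that better (lower) scores go with lower RMSD, which is the behaviour the slide asks for.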

Overview of CASP experiment
• Three categories: comparative/homology modelling, fold recognition/threading, and ab initio prediction
• Goal is to assess structure prediction methods in a blind and rigorous manner; blind prediction is necessary for accurate assessment of methods
• Ask modellers to build models of structures as they are in the process of being solved experimentally
• After prediction season is over, compare predicted models to the experimental structures
• Discuss what went right, what went wrong, and why
• Compare progress from CASP1 to CASP4
• Results published in special issues of Proteins: Structure, Function, Genetics 1995, 1997, 1999, 2002

Comparative modelling at CASP - methods
• Alignment: PSI-BLAST, FASTA, CLUSTALW - multiple sequence alignments carefully hand-edited using secondary structure information
• More successful side chain prediction methods include:
  • backbone-dependent rotamer libraries (Bower & Dunbrack)
  • segment matching followed by energy minimisation (Levitt)
  • self-consistent mean field optimisation (Bates et al)
  • graph-theory + knowledge-based functions (Samudrala et al)
• More successful loop building methods include:
  • satisfaction of spatial restraints (Sali)
  • internal coordinate mechanics energy optimisation (Abagyan et al)
  • graph-theory + knowledge-based functions (Samudrala et al)
• Overall model building: there is no substitute for careful hand-constructed models (Sternberg et al, Venclovas)

A graph theoretic representation of protein structure
• represent residues as nodes
• weigh nodes
• construct graph
• find cliques
• Figure: example graph with node weights -0.6 (V1), -0.5 (I), -0.9 (V2), -0.7 (K) and -1.0 (F), edge weights between -0.1 and -0.4, and a clique of total weight W = -4.5
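The clique search can be illustrated on a graph of this size by brute force. In the sketch below the node weights are those on the slide, but the edge list and edge weights are hypothetical: the figure's edges cannot be recovered from the text, so they were chosen here so that the best clique has the total weight W = -4.5 shown on the slide. V1 and V2 are assumed to be alternatives for the same residue, so no edge joins them. Enumerating all subsets is exponential and only suitable for a toy example.

```python
from itertools import combinations

# Node weights from the slide.
node_weight = {"V1": -0.6, "V2": -0.9, "I": -0.5, "K": -0.7, "F": -1.0}

# Hypothetical edges (compatible pairs) and weights; not taken from the figure.
edges = [
    ("V1", "I", -0.2), ("V1", "K", -0.1), ("V1", "F", -0.1),
    ("V2", "I", -0.3), ("V2", "K", -0.4), ("V2", "F", -0.1),
    ("I", "K", -0.2), ("I", "F", -0.2), ("K", "F", -0.2),
]
edge_weight = {frozenset((a, b)): w for a, b, w in edges}


def is_clique(nodes):
    """True if every pair of nodes is joined by an edge."""
    return all(frozenset(pair) in edge_weight for pair in combinations(nodes, 2))


def clique_weight(nodes):
    """Sum of node weights plus the weights of all edges within the clique."""
    total = sum(node_weight[n] for n in nodes)
    total += sum(edge_weight[frozenset(pair)] for pair in combinations(nodes, 2))
    return total


def best_clique():
    """Enumerate all cliques and return the one with the lowest total weight."""
    names = sorted(node_weight)
    cliques = (subset
               for size in range(1, len(names) + 1)
               for subset in combinations(names, size)
               if is_clique(subset))
    return min(cliques, key=clique_weight)


best = best_clique()
print(best, round(clique_weight(best), 2))
```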

Historical perspective on comparative modelling
         alignment   side chain   short loops   longer loops
BC       excellent   ~ 80%        1.0 Å         2.0 Å
CASP1    poor        ~ 50%        ~ 3.0 Å       > 5.0 Å

Prediction for CASP4 target T128/sodm: Cα RMSD of 1.0 Å for 198 residues (PID 50%)

Prediction for CASP4 target T111/eno: Cα RMSD of 1.7 Å for 430 residues (PID 51%)

Prediction for CASP4 target T122/trpa: Cα RMSD of 2.9 Å for 241 residues (PID 33%)

Prediction for CASP4 target T125/sp18: Cα RMSD of 4.4 Å for 137 residues (PID 24%)

Prediction for CASP4 target T112/dhso: Cα RMSD of 4.9 Å for 348 residues (PID 24%)

Prediction for CASP4 target T92/yeco: Cα RMSD of 5.6 Å for 104 residues (PID 12%)

Comparative modelling at CASP - conclusions
         alignment   side chain   short loops   longer loops
BC       excellent   ~ 80%        1.0 Å         2.0 Å
CASP1    poor        ~ 50%        ~ 3.0 Å       > 5.0 Å
CASP2    fair        ~ 75%        ~ 1.0 Å       ~ 3.0 Å
CASP3    fair        ~ 75%        ~ 1.0 Å       ~ 2.5 Å
CASP4    fair        ~ 75%        ~ 1.0 Å       ~ 2.0 Å
• CASP4: overall model accuracy ranging from 1 Å to 6 Å for 50-10% sequence identity
  **T128/sodm – 1.0 Å (198 residues; 50%)
  **T111/eno – 1.7 Å (430 residues; 51%)
  **T122/trpa – 2.9 Å (241 residues; 33%)
  **T125/sp18 – 4.4 Å (137 residues; 24%)
  **T112/dhso – 4.9 Å (348 residues; 24%)
  **T92/yeco – 5.6 Å (104 residues; 12%)

Fold recognition at CASP - methods
• Visual inspection with sequence comparison (Murzin group)
• Procyon - potential of mean force based on pairwise interactions and global dynamic programming (Sippl group)
• Threader - potential of mean force and double dynamic programming (Jones group)
• Environmental 3D Profiles (Eisenberg group)
• NCBI Threading Program using contact potentials and models of sequence-structure conservation (Bryant group)
• Hidden Markov Models (Karplus group)
• Combination of threading with ab initio approaches (Friesner group)
• Environment-specific substitution tables and structure-dependent gap penalties (Blundell group)

Fold recognition at CASP - conclusions
• Fold recognition is one of the more successful approaches at predicting structure at all four CASPs
• At CASP2 and CASP4, one of the best methods was simple sequence searching with careful manual inspection (Murzin group)
• At CASP3 and CASP4, none of the threading targets could have been recognised by the best standard sequence comparison methods such as PSI-BLAST
• For the most difficult targets, the methods were able to predict 60 residues to 6.0 Å Cα RMSD, approaching comparative modelling accuracies as the similarity between proteins increased

Ab initio prediction at CASP – methods
• Assembly of fragments with simulated annealing (Simons et al)
• Exhaustive sampling and pruning using knowledge-based scoring functions (Samudrala et al)
• Constraint-based Monte Carlo optimisation (Skolnick et al)
• Thermodynamic model for secondary structure prediction with manual docking of secondary structure elements and minimisation (Lomize et al)
• Minimisation of a physical potential energy function with a simplified representation (Scheraga et al, Osguthorpe et al)
• Neural networks to predict secondary structure (Jones, Rost)

Semi-exhaustive segment-based folding
• sequence: EFDVILKAAGANKVAVIKAVRGATGLGLKEAKDLVESAPAALKEGVSKDDAEALKKALEEAGAEVEVK
• generate: fragments from database, 14-state φ,ψ model
• minimise: monte carlo with simulated annealing, conformational space annealing, GA
• filter: all-atom pairwise interactions, bad contacts, compactness, secondary structure

Historical perspective on ab initio prediction
• Before CASP (BC): “solved” (biased results)
• CASP1: worse than random
• CASP2: worse than random with one exception
• CASP3: consistently predicted correct topology, ~ 6.0 Å for 60+ residues
  *T56/dnab – 6.8 Å (60 residues; 67-126)
  **T61/hdea – 7.4 Å (66 residues; 9-74)
  **T64/sinr – 4.8 Å (68 residues; 1-68)
  **T75/ets1 – 7.7 Å (77 residues; 55-131)
  *T74/eps15 – 7.0 Å (60 residues; 154-213)
  **T59/smd3 – 6.8 Å (46 residues; 30-75)
• CASP4: ?

Prediction for CASP4 target T110/rbfa: Cα RMSD of 4.0 Å for 80 residues (1-80)

Prediction for CASP4 target T97/er29: Cα RMSD of 6.2 Å for 80 residues (18-97)

Prediction for CASP4 target T106/sfrp3: Cα RMSD of 6.2 Å for 70 residues (6-75)
