Protein structure prediction

Protein structure prediction Alexander Churbanov University of Nebraska at Omaha CSCI 8980 February 14, 2002

Structure of the presentation • Introduction • Protein native structure • Computational methods of finding a native structure • Common methods and principles • Specific methods • Homology finding • Threading • Modeling on lattice

Introduction • In Greek mythology, Sisyphus is condemned to an eternity of hard labor; his labor is frustrating and fruitless, for just as he is about to achieve his goal, his work is undone and he must start again from the beginning • Those who work in protein structure prediction seem to share the same fate

Problem of protein structure prediction • Proteins are key molecules in all life processes • The function of a protein is directly related to its three-dimensional structure • Knowing and understanding the structure of proteins will have a tremendous impact on the understanding of biological processes, medical discoveries, and biotechnological inventions

Problem of protein structure prediction • For over 30 years, there has been an ardent search for methods to predict the three-dimensional (3D) structure from the sequence • Many methods were found that initially looked very promising, but the hope has always been dashed

Problem of protein structure prediction • Given a sequence of amino acids, predict the unique 3D folding of the molecule that minimizes its free energy • [Figure: primary structure (Lys, Gly, Leu); computational methods of prediction; physical methods of prediction; practical use of the 3D structural knowledge]

General structure of an amino acid • Each amino acid consists of: • A common main chain part, containing the heavy atoms N, Cα, C and O, forming the amide plane • A chain residue (side chain) of 0–10 additional atoms • [Figure labels: common part; chain residue]

Peptide bond • A peptide bond connects the carboxyl group of the first amino acid with the amino group of the second amino acid • Peptide bonds are planar and rigid

Sequence of amino acids • A sequence of amino acids connected by peptide bonds forms a protein • There is no flexibility for rotation around the peptide bond • There is more flexibility for rotation around the N–Cα bond (called the φ angle) and around the Cα–C bond (the ψ angle) • These angles are restricted to small regions in natural proteins
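
The two backbone torsion angles can be computed from atom coordinates. The following is a minimal sketch, not part of the original slides: it evaluates a generic torsion angle from four points using the standard atan2 formula. For residue i, φ would use the atoms C(i-1), N(i), Cα(i), C(i) and ψ would use N(i), Cα(i), C(i), N(i+1). The function name `dihedral_degrees` and the coordinates in the example are made up for illustration.

```python
import math

def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def dihedral_degrees(p0, p1, p2, p3):
    """Torsion angle about the p1-p2 bond, in degrees between -180 and 180."""
    b1, b2, b3 = _sub(p1, p0), _sub(p2, p1), _sub(p3, p2)
    n1 = _cross(b1, b2)  # normal of the plane through p0, p1, p2
    n2 = _cross(b2, b3)  # normal of the plane through p1, p2, p3
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))

# Made-up coordinates: the last point is rotated a quarter turn about the central bond.
print(dihedral_degrees((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1)))
```

A cis arrangement (all four points on the same side of the central bond, in one plane) gives 0 degrees with this formula.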

Part of Protein (…|Phe|Asp|Ala|…)

Protein folding • Using the freedom of rotations, the protein can fold into a specific and unique three dimensional structure (called conformation), forming a native structure

Computational methods to find a protein structure • The unique 3D arrangement of protein corresponds to lowest free energy conformation • Most computational approaches for solving the protein folding problem look for the lowest free energy conformation • Two principal methods are currently in use for computing the lowest energy conformation: • Molecular dynamics • Monte Carlo

Molecular dynamics • Forces acting on each atom at a particular state of the system are calculated using an empirical force field • Atoms are allowed to move with the accelerations resulting from these forces, changing the conformation • Once the atoms have moved significantly, the acting forces are recalculated (every 10^-15 sec) • Even supercomputers can simulate only 10^-9 sec of folding time, which is insufficient
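
The loop described on this slide (compute forces, move atoms, recompute forces) can be sketched in a few lines. This is an illustrative toy only: it assumes the velocity Verlet integrator, which the slides do not name, and replaces the empirical force field with a single harmonic spring in one dimension. The names `spring_force`, `velocity_verlet` and `total_energy` are hypothetical.

```python
def spring_force(x, k=1.0):
    """Toy stand-in for a force field: one harmonic spring centred at x = 0."""
    return -k * x

def velocity_verlet(x, v, force, mass=1.0, dt=0.01, steps=1000):
    """Integrate Newton's equations of motion with the velocity Verlet scheme."""
    f = force(x)
    trajectory = [x]
    for _ in range(steps):
        v_half = v + 0.5 * dt * f / mass   # half-step velocity update
        x = x + dt * v_half                # move the atom
        f = force(x)                       # recalculate the force at the new position
        v = v_half + 0.5 * dt * f / mass   # complete the velocity update
        trajectory.append(x)
    return x, v, trajectory

def total_energy(x, v, mass=1.0, k=1.0):
    """Kinetic plus potential energy of the toy system."""
    return 0.5 * mass * v * v + 0.5 * k * x * x

x_end, v_end, traj = velocity_verlet(x=1.0, v=0.0, force=spring_force)
print(total_energy(1.0, 0.0), total_energy(x_end, v_end))  # should stay close
```

The force is recomputed at every step, which is what makes long simulations expensive: in a real system each recomputation involves interactions among all atoms.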

Monte Carlo method • Used with a simplified model of the protein (does not consider the structure of every amino acid) • The procedure makes a random move from the current conformation and evaluates the resulting energy change • If the new conformation is better (lower energy), it replaces the old one, and the process repeats • The method is not powerful enough to find an optimal conformation even for simple cases
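
The procedure on this slide can be sketched as follows. This is a minimal illustration under stated assumptions: the conformation is a plain list of angles, the energy function `toy_energy` is an invented rugged landscape rather than a protein model, and acceptance is the greedy rule described above (keep the move only if the energy drops).

```python
import math
import random

def toy_energy(angles):
    """Hypothetical energy with several local minima per angle."""
    return sum(math.cos(3 * a) + 0.5 * math.cos(a - 1.0) for a in angles)

def monte_carlo(energy, state, steps=2000, step_size=0.3, seed=0):
    """Random-move search that accepts a move only when the energy decreases."""
    rng = random.Random(seed)
    current = list(state)
    e_current = energy(current)
    for _ in range(steps):
        candidate = list(current)
        i = rng.randrange(len(candidate))            # pick one angle at random
        candidate[i] += rng.uniform(-step_size, step_size)
        e_candidate = energy(candidate)
        if e_candidate < e_current:                  # greedy acceptance rule
            current, e_current = candidate, e_candidate
    return current, e_current

start = [0.0] * 5
final, e_final = monte_carlo(toy_energy, start)
print(toy_energy(start), e_final)
```

With this acceptance rule the search can stall in a local minimum; Metropolis-style variants, which sometimes accept uphill moves, are a common way to reduce that risk.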

Knowledge-based structure prediction methods • The most successful structure prediction tools are knowledge-based, using a combination of statistical theory and empirical rules • The most successful theoretical approach is homology modelling

Homology modeling • Given a sequence of unknown fold (denote U), if U has significant sequence similarity to a protein of known structure (T) (i.e., if the pairwise sequence identity is >25%), it is possible to construct an approximate 3D model which has a correct fold but inaccurate loop regions
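
The identity threshold above is applied to an existing pairwise alignment. A minimal sketch of the calculation follows. It assumes the two sequences are already aligned, that gaps are written as "-", and that identity is counted over aligned (non-gap) columns only; other denominators are also in use, so this is one convention rather than the definitive one. The function name `percent_identity` and the example strings are hypothetical.

```python
def percent_identity(aligned_u, aligned_t, gap="-"):
    """Percent of identical residues over columns where neither sequence has a gap."""
    if len(aligned_u) != len(aligned_t):
        raise ValueError("aligned sequences must have equal length")
    columns = [(a, b) for a, b in zip(aligned_u, aligned_t)
               if a != gap and b != gap]
    if not columns:
        return 0.0
    matches = sum(1 for a, b in columns if a == b)
    return 100.0 * matches / len(columns)

identity = percent_identity("ACDE-F", "ACGEHF")
print(identity, identity > 25.0)  # 80.0 True
```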

Homology modeling • The basic assumption of homology modelling is that U and the homologous template protein of known structure (T) have nearly identical backbone structure in the aligned regions • One new generation of alignment methods is based on hidden Markov models, another on genetic algorithms

Homology modeling • For sequence identities down to about 30%, U and T will still have the same fold, but the number of loops inserted grows and the divergence between U and T becomes considerable • Modelling of loop regions is still a difficult problem; even the best methods only rarely achieve atomic accuracy, and the modelled loops are often completely different from the correct structure

Homology modeling • A pessimistic view is that the accuracy of resulting 3D predictions is typically at the level of ribbon plots, i.e. the mutual orientation of elements such as helices and sheets can be identified • The optimistic version is that even down to levels of 30% sequence identity homology modelling occasionally yields correct predictions at atomic resolution

Three difficult problems of homology modeling • Remote homology modelling (<25%) has three obstacles to overcome: • the remote homology between U and T has to be detected • U and T have to be aligned correctly • the homology modelling procedure has to be tailored to the harder problem of extremely low sequence identity

Solution to the first problem • In the early 1990s, there was a great deal of optimism that the first obstacle, the detection of similar folds, would be solved by threading methods • The basic idea is to thread the sequence of U into the backbone 3D structure of T, at each step evaluating the 'fitness of sequence for structure' using environment-based or knowledge-based mean-force-potentials

Protein threading • Many proteins in nature are homologous • They have different primary structures • They form similar conformations to carry out the same function in living matter • There are groups of proteins having the same evolutionary origin

Protein threading • Most proteins share the same secondary structure motifs: • Helices • Extended strands forming sheets • Specific turns • Random coils

Protein threading • Threading means mapping a given sequence to a given structure • To assign a structure to a sequence one would then need to thread the sequence through all known conformations, evaluating compatibility, and assign the most compatible structure to the sequence • When a structure completely different from any known one is discovered, it is entered into the database of structures

Protein threading • The structure is represented by the black trace • The sequence (at the top) is threaded through the structure, encoding an alignment (at the bottom) • Zero means structure deletion, values greater than one mean sequence deletion, while one is a fit

Protein threading • The size of the search space to thread sequence of length k into structure of size n could be found as a selection with repetition • Search space is huge and problem appears to be NP-complete [Unger,R., Moult,J. (1993)]
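
The formula itself did not survive in this text, so the following is a reading of "selection with repetition" rather than the slide's own expression. It assumes an alignment is encoded as one non-negative count per structure position (0, 1, or more sequence residues assigned to it) and that the counts sum to the sequence length k. Under that assumption the number of encodings is the number of multisets of size k drawn from n positions, C(n + k - 1, k). The function name `threading_search_space` is hypothetical.

```python
import math

def threading_search_space(k, n):
    """Number of ways to write k as an ordered sum of n non-negative integers."""
    return math.comb(n + k - 1, k)

# Tiny case that can be checked by hand: (2,0,0), (0,2,0), (0,0,2), (1,1,0), (1,0,1), (0,1,1)
print(threading_search_space(k=2, n=3))  # 6

# A sequence of 100 residues against a structure of 120 positions
print(threading_search_space(k=100, n=120))
```

Even for modest k and n the count has dozens of digits, which is why exhaustive enumeration is only feasible for very small problems.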

Protein threading • In order to reduce the complexity of the search task, (m - 1) core and m non-core regions are introduced • Usually α-helices and β-sheets are core regions, connected by loops • The total number of amino acids in core regions is c • [Figure labels: m - 1 core regions; m loops (non-core)]

Protein threading • Although suffering from some inherent limitations (such as prediction of the right structure with completely wrong threading), method became a significant tool in protein structure prediction • Any threading procedure must contain two major components: • An alignment algorithm to position a sequence on a structure • Score function to evaluate the “energy” of the sequence in given conformation
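
The two components named above can be illustrated with a deliberately simple sketch. Everything here is an assumption made for illustration: the structure is reduced to a string of per-position environments ("B" for buried, "E" for exposed), the score rewards hydrophobic residues in buried positions and polar residues in exposed ones, and the alignment step is gapless enumeration of offsets. Real threading methods use much richer potentials and allow gaps. The hydrophobic set and the names `toy_score` and `thread_gapless` are hypothetical.

```python
HYDROPHOBIC = frozenset("AVLIMFWC")  # assumed classification, for illustration only

def toy_score(window, environment):
    """Toy 'energy' of a sequence window placed on a structure (lower is better)."""
    energy = 0
    for aa, env in zip(window, environment):
        fits = (aa in HYDROPHOBIC) == (env == "B")
        energy += -1 if fits else 1
    return energy

def thread_gapless(sequence, environment):
    """Try every gapless placement and return (best_offset, best_energy)."""
    n = len(environment)
    best = None
    for offset in range(len(sequence) - n + 1):
        e = toy_score(sequence[offset:offset + n], environment)
        if best is None or e < best[1]:
            best = (offset, e)
    return best

print(thread_gapless("KKLLKKD", "BBEE"))  # (2, -4)
```

Here the window "LLKK" at offset 2 places both leucines in buried positions and both lysines in exposed ones, giving the lowest toy energy.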

Protein threading: possible implementations • Protein threading could be implemented using: • Enumeration for small problems • Dynamic programming to find core regions to “freeze” • Monte Carlo variants with Gibbs sampling • Branch and bound search • Genetic programming with constraints, which seems to be a decent alternative in comparison with other methods

Protein structure prediction on lattice • Another way to model protein folding in 3D space is to assume certain simplifications • Modeling on a lattice is a way to fight the complexity of the prediction problem • Though the problem on a lattice is still NP-complete, we can significantly expand the size of the protein modeled

Protein simplification for lattice model • Monomers (or residues) are represented using a unified size • Bond length is unified • The positions of the monomers are restricted to positions in a lattice • Simplified energy function

HP - model • 20 letter alphabet of amino acids is reduced to a two letter alphabet, namely H and P; • H represents non-polar or hydrophobic amino acid • P represents polar or hydrophilic amino acid
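
The alphabet reduction can be sketched as a simple lookup. Which residues count as hydrophobic differs between published classifications; the set used below is one assumed choice for illustration and is not taken from the slides. The function name `to_hp` is hypothetical.

```python
# Assumed hydrophobic set; other classifications differ.
HYDROPHOBIC = frozenset("AVLIMFWC")

def to_hp(sequence):
    """Map a one-letter amino acid sequence to an H/P string."""
    return "".join("H" if aa in HYDROPHOBIC else "P" for aa in sequence.upper())

print(to_hp("GAVLK"))  # PHHHP
```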

The energy function • The energy function for HP-model is given by the matrix • Energy contribution of a contact between two monomers is –1 if both are H-monomers, and 0 otherwise

Contact energy • Two monomers form a contact in some specific conformation if they are not connected via a bond, but occupy neighboring positions in the conformation • A conformation with minimal energy is just a conformation with the maximal number of contacts between H-monomers
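
The contact definition translates directly into code. The sketch below assumes a two-dimensional square lattice and a conformation given as a list of integer (x, y) positions, one per monomer; the function name `hp_energy` and the four-monomer example are made up for illustration.

```python
def hp_energy(sequence, coords):
    """HP-model energy: -1 for every H-H contact between non-bonded lattice neighbours."""
    if len(sequence) != len(coords):
        raise ValueError("one coordinate per monomer is required")
    if len(set(coords)) != len(coords):
        raise ValueError("conformation is not self-avoiding")
    index_at = {pos: i for i, pos in enumerate(coords)}
    energy = 0
    for i, (x, y) in enumerate(coords):
        if sequence[i] != "H":
            continue
        # Looking only right and up counts each neighbouring pair exactly once.
        for neighbour in ((x + 1, y), (x, y + 1)):
            j = index_at.get(neighbour)
            if j is not None and abs(i - j) > 1 and sequence[j] == "H":
                energy -= 1
    return energy

# HPPH folded into a unit square: the two H monomers at the chain ends touch.
print(hp_energy("HPPH", [(0, 0), (1, 0), (1, 1), (0, 1)]))  # -1
# The same chain stretched out has no contacts.
print(hp_energy("HPPH", [(0, 0), (1, 0), (2, 0), (3, 0)]))  # 0
```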

Sample conformation • A sample conformation for the sequence PHPPHHPH in the two-dimensional lattice with energy –2 • [Figure: sample conformation]
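
For a chain this short the whole conformation space can be enumerated. The sketch below assumes a two-dimensional square lattice and encodes a conformation as a string of moves (U, D, L, R). It gives one conformation that reaches energy -2, which is not necessarily the one drawn on the original slide, and then searches all self-avoiding walks for the minimum. The names `fold`, `energy` and `best_conformation` are hypothetical.

```python
MOVES = {"U": (0, 1), "D": (0, -1), "L": (-1, 0), "R": (1, 0)}

def fold(moves, start=(0, 0)):
    """Turn a move string into a list of lattice positions."""
    coords = [start]
    for m in moves:
        dx, dy = MOVES[m]
        x, y = coords[-1]
        coords.append((x + dx, y + dy))
    return coords

def energy(sequence, coords):
    """-1 per pair of H monomers that are lattice neighbours but not bonded."""
    index_at = {pos: i for i, pos in enumerate(coords)}
    total = 0
    for i, (x, y) in enumerate(coords):
        for neighbour in ((x + 1, y), (x, y + 1)):  # each pair counted once
            j = index_at.get(neighbour)
            if j is not None and abs(i - j) > 1 and sequence[i] == sequence[j] == "H":
                total -= 1
    return total

def best_conformation(sequence):
    """Exhaustively enumerate self-avoiding walks; return (min_energy, moves)."""
    best = (0, "")
    def extend(coords, moves):
        nonlocal best
        if len(coords) == len(sequence):
            e = energy(sequence, coords)
            if e < best[0]:
                best = (e, moves)
            return
        x, y = coords[-1]
        for name, (dx, dy) in MOVES.items():
            nxt = (x + dx, y + dy)
            if nxt not in coords:  # keep the walk self-avoiding
                extend(coords + [nxt], moves + name)
    extend([(0, 0)], "")
    return best

print(energy("PHPPHHPH", fold("LLURURD")))  # -2
print(best_conformation("PHPPHHPH")[0])     # -2
```

The search confirms that -2 is the minimum for this sequence on the square lattice. Exhaustive enumeration grows exponentially with chain length, so it is only practical for very short chains.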

Cubic lattice • Lattice 3D space

Native conformation

Vertical and horizontal contribution to the surface of a conformation in Z^2 • [Figure: vertical contribution to the surface; horizontal contribution to the surface]

Conclusions • Native 3D structures of proteins are encoded by a linear sequence of amino acid residues • Predicting 3D structure from sequence is a task challenging enough to have occupied a generation of researchers • Have they finally succeeded in their goal? The bad news is: no, we still cannot predict structure for any sequence • The good news is: we have come closer, and growing databases facilitate the task.
