Protein Structure Prediction

Protein Structure Prediction Charles Yan

Different Levels of Protein Structures • The primary structure is the sequence of residues in the polypeptide chain. • Secondary structure is a local, regularly occurring structure in proteins. • Alpha helices • Beta sheets • Loops (coils, turns)

Different Levels of Protein Structures • Tertiary structure describes the packing of alpha-helices, beta-sheets and random coils with respect to each other on the level of one whole polypeptide chain.

Different Levels of Protein Structures • Quaternary structure exists only if more than one polypeptide chain is present in a protein complex.

Question • Why and how can a sequence of amino acids fold into its functional native structure, given the abundance of geometrically possible structures?

Protein Structure Prediction • Anfinsen’s (1973) thermodynamic hypothesis: proteins are not assembled into their native structures by a biological process; folding is a purely physical process that depends only on the specific amino acid sequence of the protein. • Anfinsen’s hypothesis implies that, in principle, protein structure can be predicted if a model of the free energy is available and the global minimum of this function can be identified.

Protein Structure Prediction • Protein structure prediction remains extremely difficult, since even short amino acid sequences can form a vast number of geometric structures, among which the free energy minimum has to be identified.
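
To give a rough sense of that abundance, the sketch below counts conformations under a simplifying assumption that is not in the slides: every residue independently adopts one of a small, fixed number of discrete backbone states (3 here). The function name and the state count are illustrative only.

```python
def count_conformations(n_residues: int, states_per_residue: int = 3) -> int:
    """Count backbone conformations if each residue independently takes one
    of `states_per_residue` discrete states (a simplifying assumption)."""
    if n_residues < 0 or states_per_residue < 1:
        raise ValueError("need n_residues >= 0 and states_per_residue >= 1")
    return states_per_residue ** n_residues


if __name__ == "__main__":
    # The count grows exponentially with chain length.
    for n in (10, 50, 100):
        print(n, f"{count_conformations(n):.3e}")
```

Even with this coarse discretization, exhaustive enumeration becomes impractical well before the chain reaches the length of a typical protein.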

Structure Prediction Methods • Methods for structure prediction can be divided into four groups: • Comparative modeling • Fold recognition • Fragment-based methods • Ab initio (methods that do not use database information).

Comparative Modeling • The number of protein structures that have been determined experimentally continues to grow rapidly. At the end of 2004, the number of structures freely available from the Protein Data Bank (Berman et al., 2000) was approaching 28,000. • The availability of experimental data on protein structures has inspired the development of methods for computational structure prediction that are knowledge-based rather than physics-based.

Comparative Modeling • While such database methods have been criticized for not helping to obtain a fundamental understanding of the mechanisms that drive structure formation, these knowledge-based methods can often successfully predict unknown three-dimensional structures.

Comparative Modeling • In comparative modeling, the structure of a protein is predicted by comparing its amino acid sequence to sequences for which the native three-dimensional structure is already known. • Comparative modeling is based on the observation that sequence similarity implies structural similarity. • The accuracy of predictions by comparative modeling, however, strongly depends on the degree of sequence similarity.
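
As a minimal sketch of how the degree of sequence similarity might be quantified, the function below computes percent identity from two already-aligned sequences. The convention used (identical positions divided by positions where neither sequence has a gap) is one of several in use and is an assumption here, as are the function and parameter names.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent of identical residues over the aligned, gap-free columns."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    # Keep only columns where both sequences contribute a residue.
    columns = [(a, b) for a, b in zip(aligned_a, aligned_b)
               if a != gap and b != gap]
    if not columns:
        return 0.0
    matches = sum(a == b for a, b in columns)
    return 100.0 * matches / len(columns)


if __name__ == "__main__":
    print(percent_identity("ACDE-G", "ACDQFG"))
```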

Comparative Modeling • If the target and the template share more than 50% sequence identity, predictions usually are of high quality and have been shown to be as accurate as low-resolution X-ray predictions. • For 30–50% sequence identity, more than 80% of the Cα atoms can be expected to be within 3.5 Å of their true positions. • For less than 30% sequence identity, the prediction is likely to contain significant errors.
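
The thresholds above can be written down directly, together with the "fraction of atoms within 3.5 Å" measure they refer to. This is only a sketch: the band labels and function names are invented for illustration, and the distance check assumes the model and native coordinates are already superimposed and listed in the same residue order.

```python
import math


def expected_quality(identity_percent: float) -> str:
    """Map target/template sequence identity to the three bands on the slide."""
    if identity_percent > 50.0:
        return "high"
    if identity_percent >= 30.0:
        return "medium"
    return "low"


def fraction_within(model, native, cutoff: float = 3.5) -> float:
    """Fraction of atoms whose model position lies within `cutoff` angstroms
    of the native position (coordinates assumed pre-superimposed)."""
    if len(model) != len(native) or not model:
        raise ValueError("need two non-empty coordinate lists of equal length")
    close = sum(math.dist(m, n) <= cutoff for m, n in zip(model, native))
    return close / len(model)


if __name__ == "__main__":
    model = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
    native = [(0.0, 0.0, 1.0), (1.0, 0.0, 0.0), (0.0, 0.0, 0.0)]
    print(expected_quality(42.0), fraction_within(model, native))
```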

Comparative Modeling In general, comparative modeling consists of: • Selection of one or more templates from a database, using BLAST (for closely related sequences) or PSI-BLAST (for distantly related sequences). A single template rarely provides a complete model; alternative template structures may provide additional structural features. • Alignment of the template to the target sequence. This requires a correct alignment of the target and template sequences, which is not trivial, especially when the similarity is not very high. • Refinement of side-chain geometry and of regions of low sequence identity.

Comparative Modeling • Comparative modeling methods hardly differ with respect to template selection and alignment. • Little progress has been made in refining templates. Early hopes that molecular dynamics methods would allow refinement have not been fulfilled. The reasons for this are a matter of hot debate within the field, with three suggested inter-related explanations: inadequate sampling of alternative conformations, insufficiently accurate description of the inter-atomic forces, and trajectories that are too short.

Comparative Modeling Improved sequence comparison techniques have broadened the scope of comparative modeling. While 30% sequence similarity was once considered to be the threshold for successful comparative modeling, predictions for targets with as low as 17% sequence similarity were made during the CASP4 experiment, and as low as 6% during CASP5.

Comparative Modeling Challenges • Aligning the target sequence onto the template structure or structures is challenging, and typically results in very significant errors. • Generally, a significant fraction of residues in a target will have no structural equivalent in an available template. Reliably building regions of the structure not present in a template remains a challenge. • Side-chain accuracy of these approximate models is poor. • Refinement remains the principal bottleneck to progress.

Comparative Modeling The importance of comparative modeling will continue to grow as the number of experimentally determined structures grows steadily and, with it, the number of sequences that can be related to a known structure.

Comparative Modeling • SWISS-MODEL http://swissmodel.expasy.org//SWISS-MODEL.html

Fold Recognition • While similar sequence implies similar structure, the converse is in general not true. • Similar structures are often found for proteins for which no sequence similarity to any known structure can be detected. • As a consequence, the repertoire of different folds is more limited than suggested by sequence diversity.

Fold Recognition • Fold recognition methods are motivated by the notion that structure is evolutionarily more conserved than sequence. • Fold recognition methods are one class of methods that aim at predicting the three-dimensional folded structure for amino acid sequences for which comparative modeling methods provide no reliable prediction.

Fold Recognition • Since the number of sequences is much larger than the number of folds, fold recognition methods attempt to identify a model fold for a given target sequence among the known folds, even if no sequence similarity can be detected.

Fold Recognition • Do we have all the folds? • According to a recent assessment, the Protein Data Bank already contains enough structures to cover small protein structures up to a length of about a hundred residues.

Fold Recognition • One approach to fold recognition is based on secondary structure prediction and comparison. • This subclass of methods is based on the observation that secondary structure similarity can exceed 80% for sequences that exhibit less than 10% sequence similarity. • Clearly, any such approach can only be as good as the underlying secondary structure prediction method.

Fold Recognition Accuracy of secondary structure predictions. • 60% (1990s) • 76% (Current)
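
The slides do not name the accuracy measure behind these figures. A common choice is a per-residue three-state accuracy (often called Q3), and the sketch below assumes that definition, with states written as H (helix), E (strand) and C (coil). The function name is illustrative.

```python
def q3(predicted: str, observed: str) -> float:
    """Percent of residues whose predicted state (H, E or C) matches the
    observed state. Assumes both strings cover the same residues in order."""
    if len(predicted) != len(observed) or not observed:
        raise ValueError("need two non-empty state strings of equal length")
    if set(predicted + observed) - set("HEC"):
        raise ValueError("states must be H, E or C")
    correct = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * correct / len(observed)


if __name__ == "__main__":
    print(q3("HHHHCCEEEE", "HHHCCCEEEC"))
```

The same function can compare the secondary structure strings of two proteins, which is the kind of similarity that secondary-structure-based fold recognition relies on.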

Fold Recognition • Secondary structure information is often combined with other one-dimensional descriptors in fold recognition methods (e.g., with simple scores for solvent accessibility of each amino acid). • The approach is based on predicting one-dimensional descriptors for a target, and identifying a similar fold by comparing these descriptors to the descriptors of known folds.

Fold Recognition • Threading is an important representative of fold recognition methods. • Threading methods attempt to fit a target sequence to a known structure in a library of folds. • Threading-based methods are known to be computationally expensive. • Globally optimal protein threading is known to be NP-hard.

Fold Recognition • Several threading methods ignore pairwise interactions between residues. In doing so, the threading problem is simplified considerably, and the simplified problem can be solved with dynamic programming.

Fold Recognition • In early methods of this kind, a one-dimensional string of features was recorded for known folds and compared to the target sequence. • The recorded features comprise attributes like buried side-chain area, side-chain area covered by polar atoms including water, and the local secondary structure. • In this manner, the three-dimensional structure of known proteins is converted into a one-dimensional sequence of descriptors, and fold recognition is reduced to seeking the most favorable sequence alignment between the query sequence and a database of sequences.
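
The reduction to a one-dimensional alignment can be sketched with a small dynamic program. Everything specific here is an assumption made for illustration: each fold is encoded as a string of two environment classes (B for buried, E for exposed), the compatibility score simply rewards hydrophobic residues in buried positions and polar residues in exposed ones, and the gap penalty is linear. Real methods use richer descriptors and scoring tables.

```python
HYDROPHOBIC = set("AVLIMFWC")  # toy classification, not a standard scale


def fit_score(residue: str, environment: str) -> int:
    """+1 if a hydrophobic residue is buried ('B') or a polar residue is
    exposed ('E'); -1 otherwise."""
    return 1 if (residue in HYDROPHOBIC) == (environment == "B") else -1


def thread_score(sequence: str, environments: str, gap: int = -2) -> int:
    """Best global alignment score of a sequence against a fold's
    environment string (Needleman-Wunsch with a linear gap penalty)."""
    n, m = len(sequence), len(environments)
    # dp[i][j] = best score for sequence[:i] aligned to environments[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap
    for j in range(1, m + 1):
        dp[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = max(
                dp[i - 1][j - 1] + fit_score(sequence[i - 1], environments[j - 1]),
                dp[i - 1][j] + gap,   # residue aligned to a gap
                dp[i][j - 1] + gap,   # fold position left unfilled
            )
    return dp[n][m]


def best_fold(sequence: str, library: dict) -> str:
    """Name of the library fold with the highest threading score."""
    return max(library, key=lambda name: thread_score(sequence, library[name]))


if __name__ == "__main__":
    folds = {"fold_a": "BEBE", "fold_b": "EBEB"}  # hypothetical library
    print(best_fold("LKLK", folds))
```

Because each position is scored independently of all others, the optimum is found in time proportional to the product of the two lengths; this is exactly what is lost once pairwise residue interactions are included.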

Fold Recognition • Recent approaches take into account pairwise residue interaction potentials that describe a mean force derived from a database of known structures.
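
The slides do not say how such potentials are derived. One widely used idea is the inverse Boltzmann relation, in which the energy of a residue pair is the negative logarithm of how often the pair is observed in contact relative to a reference expectation. The sketch below assumes that relation and a simple composition-based reference state; the input format and all names are hypothetical.

```python
import math
from collections import Counter


def contact_potential(contacts, kT: float = 1.0) -> dict:
    """Pair energies E(a, b) = -kT * ln(observed / expected) from a list of
    (residue_a, residue_b) contacts taken from known structures.
    Expected frequencies assume residues pair at random by composition."""
    pair_counts = Counter(tuple(sorted(pair)) for pair in contacts)
    total_pairs = sum(pair_counts.values())
    if total_pairs == 0:
        raise ValueError("need at least one contact")
    # How often each residue type appears at either end of a contact.
    end_counts = Counter()
    for (a, b), count in pair_counts.items():
        end_counts[a] += count
        end_counts[b] += count
    total_ends = 2 * total_pairs
    energies = {}
    for (a, b), count in pair_counts.items():
        observed = count / total_pairs
        expected = (end_counts[a] / total_ends) * (end_counts[b] / total_ends)
        if a != b:
            expected *= 2  # unordered pair can arise as (a, b) or (b, a)
        energies[(a, b)] = -kT * math.log(observed / expected)
    return energies


if __name__ == "__main__":
    toy = [("L", "L")] * 6 + [("L", "K")] * 2 + [("K", "K")] * 2
    for pair, energy in sorted(contact_potential(toy).items()):
        print(pair, round(energy, 3))
```

Pairs seen more often than expected receive a negative (favorable) energy. Only observed pair types get a value here; handling unobserved pairs would need pseudocounts or a similar correction.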

Fragment Assembly Methods • These methods do not compare a target to a known protein; instead they compare fragments, that is, short amino acid subsequences, of a target to fragments of known structures obtained from the Protein Data Bank. • Once appropriate fragments have been identified, they are assembled into a structure.
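
The fragment identification step can be illustrated with a toy lookup. This sketch makes several assumptions not found in the slides: the library is a list of (sequence, structure) string pairs, the structure is represented by a secondary structure string instead of coordinates, fragments have a fixed length, and the best match is the library window with the most identical residues. The assembly step is not shown.

```python
def best_fragments(target: str, library, k: int = 3):
    """For each length-k window of the target, return
    (start, window, structure_of_best_matching_library_window)."""
    # Collect every length-k window from the known structures.
    windows = []
    for sequence, structure in library:
        if len(sequence) != len(structure):
            raise ValueError("sequence and structure must have equal length")
        for i in range(len(sequence) - k + 1):
            windows.append((sequence[i:i + k], structure[i:i + k]))
    if not windows:
        raise ValueError("library contains no fragment of length k")
    matches = []
    for start in range(len(target) - k + 1):
        query = target[start:start + k]
        # Score a library window by its number of identical positions.
        best = max(windows,
                   key=lambda w: sum(a == b for a, b in zip(query, w[0])))
        matches.append((start, query, best[1]))
    return matches


if __name__ == "__main__":
    toy_library = [("AAAKLL", "HHHHHH"), ("GVTVGS", "CEEEEC")]  # hypothetical
    for match in best_fragments("VTVG", toy_library):
        print(match)
```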

Ab Initio Methods • Methods of this type make direct use of Anfinsen’s thermodynamic hypothesis in that they attempt to identify the structure with minimum free energy. • They are computationally demanding. • They are an indispensable complement to any knowledge-based approach, for several reasons.

Ab Initio Methods • First, in some cases, even a remotely related structural homologue may not be available. • Second, new structures continue to be discovered which could not have been identified by methods that rely on comparison to known structures. • Third, knowledge-based methods have been criticized for predicting protein structures without having to obtain a fundamental understanding of the mechanisms and driving forces of structure formation. Ab initio methods, in contrast, base their predictions on physical models for these mechanisms.

Ab Initio Methods • Advantage: this class of methods can be applied to any given target sequence using only physically meaningful potentials and atom representations. • Disadvantage: these methods are the most difficult of the protein structure prediction methods.

Ab Initio Methods Challenges • Energy functions that can reliably discriminate between native and non-native structures. • An enormous amount of computation.
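
Both challenges show up even in a heavily simplified model. The sketch below uses the two-dimensional HP lattice model, which is a stand-in chosen for illustration and not a method described in the slides: residues are either hydrophobic (H) or polar (P), the chain is a self-avoiding walk on a square grid, and the energy is -1 for each pair of H residues that are lattice neighbors without being chain neighbors. The minimum is found by exhaustive enumeration, so it is only usable for very short chains.

```python
MOVES = ((1, 0), (-1, 0), (0, 1), (0, -1))


def energy(sequence: str, path) -> int:
    """-1 for every H-H pair adjacent on the lattice but not along the chain."""
    index_at = {position: i for i, position in enumerate(path)}
    total = 0
    for i, (x, y) in enumerate(path):
        if sequence[i] != "H":
            continue
        for dx, dy in MOVES:
            j = index_at.get((x + dx, y + dy))
            # j > i + 1 skips chain neighbors and counts each pair once.
            if j is not None and j > i + 1 and sequence[j] == "H":
                total -= 1
    return total


def fold(sequence: str):
    """Return (minimum energy, one minimizing path, conformations examined)."""
    if not sequence:
        raise ValueError("sequence must not be empty")
    best_energy, best_path, examined = 0, None, 0

    def extend(path, occupied):
        nonlocal best_energy, best_path, examined
        if len(path) == len(sequence):
            examined += 1
            e = energy(sequence, path)
            if best_path is None or e < best_energy:
                best_energy, best_path = e, list(path)
            return
        x, y = path[-1]
        for dx, dy in MOVES:
            nxt = (x + dx, y + dy)
            if nxt not in occupied:  # keep the walk self-avoiding
                path.append(nxt)
                occupied.add(nxt)
                extend(path, occupied)
                path.pop()
                occupied.remove(nxt)

    extend([(0, 0)], {(0, 0)})
    return best_energy, best_path, examined


if __name__ == "__main__":
    e, path, n = fold("HPHPPHHPH")
    print("minimum energy:", e, "conformations examined:", n)
```

The enumeration does not remove rotations or mirror images, so the count overstates the number of distinct shapes, but the growth with chain length is exponential either way.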

Ab Initio Methods Ab initio methods have recently received increased attention in the prediction of loops. • Loops exhibit greater structural variability than beta sheets and alpha helices. • Loop structure is therefore considerably more difficult to predict than the structure of the geometrically highly regular beta sheets and alpha helices. • Loops are often exposed on the surface of proteins and contribute to active and binding sites. Consequently, loops are crucial for protein function.

CASP Progress for all variants of computational protein structure prediction methods is assessed in the biennial, community-wide Critical Assessment of Protein Structure Prediction (CASP) experiments. In the CASP experiments, research groups are invited to apply their prediction methods to amino acid sequences for which the native structure is not yet known but is about to be determined and published.

CASP • Over 200 prediction teams from 24 countries participated in CASP6.
