Protein structure prediction

Protein structure prediction • May 26, 2011 • HW #8 due today • Quiz #3 on Tuesday, May 31 • Learning objectives: Understand the biochemical basis of secondary structure prediction programs. Become familiar with the databases that hold secondary structure information. Understand neural networks and how they help to predict secondary structure. • Workshop: Predict the secondary structure of p53. • Homework #9: Due June 2

What is secondary structure? • Three major types: • Alpha Helical Regions • Beta Strand Regions • Coils, Turns, Extended (anything else)

Can we predict the final structure? http://en.wikipedia.org/wiki/Protein_folding

Some Prediction Methods • ab initio methods: based on physical properties of amino acids and bonding patterns • Statistics of amino acid distributions in known structures: Chou-Fasman • Sequence similarity to sequences with known structure: PSIPRED

Chou-Fasman • First widely used procedure • Output: helix, strand, or turn • Percent accuracy: 60-65%

PSI-BLAST-based secondary structure prediction (PSIPRED) • Three steps: • 1) Generation of a position-specific scoring matrix (PSSM) • 2) Prediction of the initial secondary structure • 3) Filtering of the predicted structure

Conformational parameters for α-helical, β-strand, and turn amino acids (from Chou and Fasman, 1978)

PSIPRED • Uses multiple aligned sequences for prediction. • Uses a training set of folds with known structure. • Uses a two-stage neural network to predict structure from position-specific scoring matrices generated by PSI-BLAST (Jones, 1999). • The first network converts a window of 15 amino acids into raw scores for h (helix), e (sheet), and c (coil); an extra unit marks window positions that fall beyond a chain terminus. • The second network filters the output of the first. For example, an output of hhhhehhhh might be converted to hhhhhhhhh. • Can obtain a Q3 value of 70-78% (may be the highest achievable).
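The filtering example on this slide (hhhhehhhh becoming hhhhhhhhh) can be imitated with a simple rule. The sketch below is not PSIPRED's second network, which is a trained neural network; it is a hand-written stand-in that relabels any single residue whose two neighbors agree with each other but not with it. The function name `smooth_isolated` is hypothetical.

```python
def smooth_isolated(prediction):
    """Relabel single residues that disagree with two matching neighbors.

    A rule-based stand-in for the filtering idea, not a neural network.
    Decisions are made against the original string, so one fix never
    triggers another.
    """
    states = list(prediction)
    for i in range(1, len(prediction) - 1):
        left, here, right = prediction[i - 1], prediction[i], prediction[i + 1]
        if left == right and here != left:
            states[i] = left
    return "".join(states)


print(smooth_isolated("hhhhehhhh"))
```

Runs of two or more residues are left untouched, so genuine short strands are not erased by this rule.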

Neural networks • Computer neural networks are based on simulation of adaptive learning in networks of real neurons. • Neurons connect to each other via synaptic junctions, which are either stimulatory or inhibitory. • Adaptive learning involves the formation or suppression of the right combinations of stimulatory and inhibitory synapses so that a set of inputs produces an appropriate output.

Neural Networks (cont. 1) • The computer version of the neural network involves identification of a set of inputs (amino acids in the sequence), which transmit through a network of connections. • At each layer, inputs are numerically weighted and the combined result is passed to the next layer. • Ultimately a final output, a decision (helix, sheet, or coil), is produced.
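The weighted, layer-by-layer flow described on this slide can be sketched as a small feed-forward pass. This is a minimal illustration, assuming one hidden layer and sigmoid units; the weights, the two-value input, and the names `layer` and `decide` are hypothetical and are not taken from PSIPRED.

```python
import math

STATES = "HEC"  # helix, sheet (strand), coil


def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))


def layer(inputs, weights, biases):
    """Weight the inputs, sum them per unit, and squash the result."""
    return [
        sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def decide(inputs, hidden_w, hidden_b, output_w, output_b):
    """Pass inputs through a hidden layer, then pick the top-scoring state."""
    hidden = layer(inputs, hidden_w, hidden_b)
    scores = layer(hidden, output_w, output_b)
    return STATES[scores.index(max(scores))], scores


# Toy, hand-picked weights (hypothetical): 2 inputs, 2 hidden units, 3 outputs.
state, scores = decide(
    [1.0, 0.0],
    hidden_w=[[1.0, 0.0], [0.0, 1.0]],
    hidden_b=[0.0, 0.0],
    output_w=[[2.0, 0.0], [0.0, 2.0], [-1.0, -1.0]],
    output_b=[0.0, 0.0, 0.0],
)
print(state, [round(s, 3) for s in scores])
```

A real predictor would use far more inputs (one window of residues per decision) and weights learned from known structures rather than chosen by hand.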

Neural Networks (cont. 2) • 90% of the set of known structures was used for training. • The remaining 10% was used to evaluate the performance of the neural network after the training session.
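A 90%/10% split of this kind might be set up as follows. The protein identifiers, the fixed random seed, and the function name `split_train_test` are assumptions made for illustration only.

```python
import random


def split_train_test(items, test_fraction=0.10, seed=0):
    """Shuffle a copy of the items and hold out a fraction for evaluation."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical identifiers standing in for proteins of known structure.
proteins = [f"protein_{i:03d}" for i in range(100)]
training_set, held_out = split_train_test(proteins)
print(len(training_set), len(held_out))
```

Keeping the held-out proteins completely separate from training is what makes the measured accuracy meaningful.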

Neural Networks (cont. 3) • During the training phase, selected sets of proteins of known structure were scanned, and if the decisions were incorrect, the input weightings were adjusted by the software to produce the desired result. • Training runs were repeated until the success rate was maximized. • Careful selection of the training set is an important aspect of this technique. The set must contain as wide a range of different fold types as possible, without duplications of structural types that may bias the decisions.
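The idea of adjusting weights whenever a decision is wrong, and repeating until no further improvement is possible, can be shown with the classic perceptron rule. This is a simplified stand-in chosen for brevity, not the training procedure PSIPRED uses; the toy examples and the names `perceptron_decide` and `train_perceptron` are hypothetical.

```python
def perceptron_decide(weights, bias, inputs):
    """Return 1 if the weighted sum is positive, otherwise 0."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if total > 0 else 0


def train_perceptron(examples, n_inputs, max_epochs=100, rate=1.0):
    """Nudge the weights after every wrong decision until a pass has none."""
    weights = [0.0] * n_inputs
    bias = 0.0
    for _ in range(max_epochs):
        errors = 0
        for inputs, target in examples:
            delta = target - perceptron_decide(weights, bias, inputs)
            if delta != 0:
                errors += 1
                weights = [w + rate * delta * x for w, x in zip(weights, inputs)]
                bias += rate * delta
        if errors == 0:  # every training example is now decided correctly
            break
    return weights, bias


# Toy training set (hypothetical): the target simply equals the first input.
examples = [([1, 0], 1), ([0, 1], 0), ([1, 1], 1), ([0, 0], 0)]
weights, bias = train_perceptron(examples, n_inputs=2)
print(weights, bias)
```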

Neural Networks (cont. 4) • An additional component of the PSIPRED procedure involves sequence alignment with similar proteins. • The rationale is that some amino acid positions in a sequence contribute more to the final structure than others. (This has been demonstrated by systematic mutation experiments in which each consecutive position in a sequence is substituted by a spectrum of amino acids. Some positions are remarkably tolerant of substitution, while others have unique requirements.) • To predict secondary structure accurately, one should place less weight on the tolerant positions, which clearly contribute little to the structure. • One must also put more weight on the intolerant positions.

PSSM input and network layout (Jones, 1999) • Provides information on tolerant or intolerant positions • Each row specifies an amino acid position • 15 groups of 21 units (1 unit for each amino acid plus one specifying the end) • Filtering network: the three outputs are helix, strand, or coil
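The "15 groups of 21 units" layout can be made concrete by building the input vector for one window. This is a sketch under stated assumptions: the PSSM is a list of rows with 20 scores each, window slots that fall off either end of the chain get zero scores with the 21st unit set to 1, and no score scaling is applied. The name `encode_window` and the toy matrix are hypothetical.

```python
WINDOW = 15          # residues per window
SCORES_PER_ROW = 20  # one score per amino acid


def encode_window(pssm, center):
    """Flatten a 15-residue window into 15 x 21 = 315 input values."""
    half = WINDOW // 2
    inputs = []
    for pos in range(center - half, center + half + 1):
        if 0 <= pos < len(pssm):
            inputs.extend(pssm[pos])  # 20 profile scores for this residue
            inputs.append(0.0)        # 21st unit: inside the chain
        else:
            inputs.extend([0.0] * SCORES_PER_ROW)
            inputs.append(1.0)        # 21st unit: beyond the chain end
    return inputs


# Toy PSSM (hypothetical values): 30 positions, row i filled with i + 1.
toy_pssm = [[float(i + 1)] * SCORES_PER_ROW for i in range(30)]
first = encode_window(toy_pssm, 0)
middle = encode_window(toy_pssm, 15)
print(len(first), sum(first[20::21]), sum(middle[20::21]))
```

For the first residue of a chain, the seven slots to its left fall outside the sequence, so seven end-marker units are switched on.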

Example of Output from PSIPRED
PSIPRED PREDICTION RESULTS
Key
Conf: Confidence (0=low, 9=high)
Pred: Predicted secondary structure (H=helix, E=strand, C=coil)
  AA: Target sequence

Conf: 923788850068899998538983213555268822788714786424388875156215
Pred: CCEEEEEEEHHHHHHHHHHCCCCCCHHHHHHCCCCCEEEEECCCCCCHHHHHHHCCCCCC
  AA: KDIQLLNVSYDPTRELYEQYNKAFSAHWKQETGDNVVIDQSHGSQGKQATSSVINGIEAD
              10        20        30        40        50        60

How to calculate Q3?
Sequence:           MEETHAPYRGVCNNM
Actual structure:   CCCCCHHHHHHEEEE
PSIPRED prediction: CCCCCHHHHHHEEEH
Q3 = (correctly predicted residues / total residues) × 100 = 14/15 × 100 = 93%
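The Q3 calculation on this slide amounts to counting the positions where the predicted state matches the actual state. A minimal sketch, with the function name `q3` chosen for illustration:

```python
def q3(actual, predicted):
    """Percentage of residues whose predicted state (H, E, or C) is correct."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("structures must be non-empty and of equal length")
    matches = sum(a == p for a, p in zip(actual, predicted))
    return 100.0 * matches / len(actual)


actual = "CCCCCHHHHHHEEEE"
predicted = "CCCCCHHHHHHEEEH"
print(round(q3(actual, predicted), 1))  # 14 of 15 residues match
```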

Recognizing motifs in proteins. • PROSITE is a database of protein families and domains. • Most proteins can be grouped, on the basis of similarities in their sequences, into a limited number of families. • Proteins or protein domains belonging to a particular family generally share functional attributes and are derived from a common ancestor.

PROSITE Database • Contains 1612 documentation entries. • Signatures are produced by scanning the PROSITE database with your query. A “signature” of a protein allows one to place the protein within a specific functional class based on structure and/or function. • An example of a documentation entry in PROSITE is: http://ca.expasy.org/cgi-bin/nicedoc.pl?PDOC50020

Signatures are produced from profiles and patterns. • Profile: “a table of position-specific amino acid weights and gap costs. These numbers (also referred to as scores) are used to calculate a similarity score for any alignment between a profile and a sequence, or parts of a profile and a sequence. An alignment with a similarity score higher than or equal to a given cut-off value constitutes a motif occurrence.”

Sequences in one profile and the PSSM associated with the profile

Sequences:
    F   K   L   L   S   H   C   L   L   V
    F   K   A   F   G   Q   T   M   F   Q
    Y   P   I   V   G   Q   E   L   L   G
    F   P   V   V   K   E   A   I   L   K
    F   K   V   L   A   A   V   I   A   D
    L   E   F   I   S   E   C   I   I   Q
    F   K   L   L   G   N   V   L   V   C

PSSM (rows: amino acids; columns: positions 1-10):
    1   2   3   4   5   6   7   8   9  10
A -18 -10  -1  -8   8  -3   3 -10  -2  -8
C -22 -33 -18 -18 -22 -26  22 -24 -19  -7
D -35   0 -32 -33  -7   6 -17 -34 -31   0
E -27  15 -25 -26  -9  23  -9 -24 -23  -1
F  60 -30  12  14 -26 -29 -15   4  12 -29
G -30 -20 -28 -32  28 -14 -23 -33 -27  -5
H -13 -12 -25 -25 -16  14 -22 -22 -23 -10
I   3 -27  21  25 -29 -23  -8  33  19 -23
K -26  25 -25 -27  -6   4 -15 -27 -26   0
L  14 -28  19  27 -27 -20  -9  33  26 -21
M   3 -15  10  14 -17 -10  -9  25  12 -11
N -22  -6 -24 -27   1   8 -15 -24 -24  -4
P -30  24 -26 -28 -14 -10 -22 -24 -26 -18
Q -32   5 -25 -26  -9  24 -16 -17 -23   7
R -18   9 -22 -22 -10   0 -18 -23 -22  -4
S -22  -8 -16 -21  11   2  -1 -24 -19  -4
T -10 -10  -6  -7  -5  -8   2 -10  -7 -11
V   0 -25  22  25 -19 -26   6  19  16 -16
W   9 -25 -18 -19 -25 -27 -34 -20 -17 -28
Y  34 -18  -1   1 -23 -12 -19   0   0 -18
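To see how a PSSM turns a sequence into a similarity score, the matrix from this slide can be looked up one position at a time and the values summed. This is a simplified sketch: it scores only an ungapped 10-residue segment and ignores the gap costs that a full profile also carries. The names `PSSM` and `score_segment` are chosen for illustration, and no cut-off is applied because the slide does not give one.

```python
# Rows copied from the slide: amino acid -> scores at positions 1-10.
PSSM = {
    "A": [-18, -10, -1, -8, 8, -3, 3, -10, -2, -8],
    "C": [-22, -33, -18, -18, -22, -26, 22, -24, -19, -7],
    "D": [-35, 0, -32, -33, -7, 6, -17, -34, -31, 0],
    "E": [-27, 15, -25, -26, -9, 23, -9, -24, -23, -1],
    "F": [60, -30, 12, 14, -26, -29, -15, 4, 12, -29],
    "G": [-30, -20, -28, -32, 28, -14, -23, -33, -27, -5],
    "H": [-13, -12, -25, -25, -16, 14, -22, -22, -23, -10],
    "I": [3, -27, 21, 25, -29, -23, -8, 33, 19, -23],
    "K": [-26, 25, -25, -27, -6, 4, -15, -27, -26, 0],
    "L": [14, -28, 19, 27, -27, -20, -9, 33, 26, -21],
    "M": [3, -15, 10, 14, -17, -10, -9, 25, 12, -11],
    "N": [-22, -6, -24, -27, 1, 8, -15, -24, -24, -4],
    "P": [-30, 24, -26, -28, -14, -10, -22, -24, -26, -18],
    "Q": [-32, 5, -25, -26, -9, 24, -16, -17, -23, 7],
    "R": [-18, 9, -22, -22, -10, 0, -18, -23, -22, -4],
    "S": [-22, -8, -16, -21, 11, 2, -1, -24, -19, -4],
    "T": [-10, -10, -6, -7, -5, -8, 2, -10, -7, -11],
    "V": [0, -25, 22, 25, -19, -26, 6, 19, 16, -16],
    "W": [9, -25, -18, -19, -25, -27, -34, -20, -17, -28],
    "Y": [34, -18, -1, 1, -23, -12, -19, 0, 0, -18],
}


def score_segment(segment, pssm=PSSM):
    """Sum the position-specific score of each residue in a 10-mer."""
    if len(segment) != 10:
        raise ValueError("this matrix covers exactly 10 positions")
    return sum(pssm[residue][pos] for pos, residue in enumerate(segment))


print(score_segment("FKLLSHCLLV"))  # first sequence in the profile
print(score_segment("FKAFGQTMFQ"))  # second sequence in the profile
```

Sequences that resemble the aligned family score high because each of their residues lands on a positive entry in its column.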

A-T-H-[DE]-X-V-X(4)-{ED}
This pattern is translated as: Ala, Thr, His, [Asp or Glu], any, Val, any, any, any, any, any but Glu or Asp.
How are the patterns constructed? Sequences necessary for structure or function are aligned manually by experts in the field. Then a pattern is created.
ALRDFATHDDVCGK..
SMTAEATHDSVACY..
ECDQAATHEAVTHR..

Example of a pattern in a PROSITE record
ID   ZINC_FINGER_C3HC4; PATTERN.
PA   C-X-H-X-[LIVMFY]-C-X(2)-C-[LIVMYA]
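PROSITE pattern notation maps closely onto regular expressions, which gives one way to try a pattern against a sequence. The sketch below handles only the elements that appear in these slides: single residues, X for any residue, [..] for allowed residues, {..} for excluded residues, and (n) or (n,m) repeats. The function name `prosite_to_regex` and the test sequence are hypothetical; the test sequence extends the truncated first aligned sequence with invented residues purely so that the full pattern has something to match.

```python
import re


def prosite_to_regex(pattern):
    """Translate a simple PROSITE pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        match = re.fullmatch(r"([^()]+)(?:\((\d+(?:,\d+)?)\))?", element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, repeat = match.groups()
        if core.upper() == "X":
            token = "."                      # any amino acid
        elif core.startswith("{") and core.endswith("}"):
            token = "[^" + core[1:-1] + "]"  # any amino acid except these
        else:
            token = core                     # a residue or a [..] set
        if repeat:
            token += "{" + repeat + "}"
        parts.append(token)
    return "".join(parts)


regex = prosite_to_regex("A-T-H-[DE]-X-V-X(4)-{ED}")
zinc = prosite_to_regex("C-X-H-X-[LIVMFY]-C-X(2)-C-[LIVMYA]")
print(regex)
print(zinc)

# Hypothetical sequence: ALRDFATHDDVCGK plus three invented residues.
hit = re.search(regex, "ALRDFATHDDVCGKLMK")
print(hit.group(0), hit.start() + 1)  # matched region and 1-based start
```

Patterns can be matched this way because they need no scoring; profiles, by contrast, require an alignment score and a cut-off.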

Scanning the PROSITE database • “Scan a sequence against PROSITE patterns and profiles” allows the user to scan the PROSITE database to search for patterns and profiles. It uses dynamic programming to determine optimal alignments. If the alignment produces a high score (a hit), then the hit is shown to the user. http://www.expasy.ch/prosite/ • If a “hit” is generated, the program gives an output that shows the region of the query that contains the pattern and a reference to the 3-D structure database if available.

Example of output from Prosite Scan

RPS-BLAST • Reverse PSI-BLAST (RPS-BLAST; program name rpsblast) searches one or more query protein sequences against a database of position-specific scoring matrices. The PSSMs are derived from conserved protein sequences that have known function or structure.

3D structure data • The largest 3D structure database is the Protein Data Bank (PDB). • It contains over 20,000 records. • Each record contains 3D coordinates for macromolecules. • 80% of the records were obtained from X-ray diffraction studies, 20% from NMR.

Part of a record from the PDB
ATOM      1  N   ARG A  14      22.451  98.825  31.990  1.00 88.84           N
ATOM      2  CA  ARG A  14      21.713 100.102  31.828  1.00 90.39           C
ATOM      3  C   ARG A  14      22.583 101.018  30.979  1.00 89.86           C
ATOM      4  O   ARG A  14      22.105 101.989  30.391  1.00 89.82           O
ATOM      5  CB  ARG A  14      21.424 100.704  33.208  1.00 93.23           C
ATOM      6  CG  ARG A  14      20.465 101.880  33.215  1.00 95.72           C
ATOM      7  CD  ARG A  14      20.008 102.147  34.637  1.00 98.10           C
ATOM      8  NE  ARG A  14      18.999 103.196  34.718  1.00100.30           N
ATOM      9  CZ  ARG A  14      18.344 103.507  35.833  1.00100.29           C
ATOM     10  NH1 ARG A  14      18.580 102.835  36.952  1.00 99.51           N
ATOM     11  NH2 ARG A  14      17.441 104.479  35.827  1.00100.79           N
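ATOM records use fixed column positions rather than separators, which is why the occupancy and temperature factor of atom 8 run together as 1.00100.30. A minimal sketch of column-based parsing follows, assuming the standard PDB ATOM layout (serial in columns 7-11, atom name in 13-16, residue name in 18-20, chain in 22, residue number in 23-26, x/y/z in 31-54, occupancy in 55-60, temperature factor in 61-66, element in 77-78). The function name `parse_atom_line` and the dictionary keys are hypothetical.

```python
def parse_atom_line(line):
    """Slice one PDB ATOM record by its fixed column positions."""
    return {
        "serial": int(line[6:11]),
        "name": line[12:16].strip(),
        "res_name": line[17:20].strip(),
        "chain": line[21],
        "res_seq": int(line[22:26]),
        "x": float(line[30:38]),
        "y": float(line[38:46]),
        "z": float(line[46:54]),
        "occupancy": float(line[54:60]),
        "temp_factor": float(line[60:66]),
        "element": line[76:78].strip(),
    }


# Atom 8 from the slide, assembled field by field so the columns are explicit.
ATOM_8 = "".join([
    "ATOM  ",    # columns 1-6: record name
    "    8",     # 7-11: atom serial number
    " ",         # 12: blank
    " NE ",      # 13-16: atom name
    " ",         # 17: alternate location indicator
    "ARG",       # 18-20: residue name
    " ",         # 21: blank
    "A",         # 22: chain identifier
    "  14",      # 23-26: residue sequence number
    " " * 4,     # 27-30: insertion code and blanks
    "  18.999",  # 31-38: x coordinate
    " 103.196",  # 39-46: y coordinate
    "  34.718",  # 47-54: z coordinate
    "  1.00",    # 55-60: occupancy
    "100.30",    # 61-66: temperature factor
    " " * 10,    # 67-76: blanks
    " N",        # 77-78: element symbol
])

atom = parse_atom_line(ATOM_8)
print(atom["name"], atom["occupancy"], atom["temp_factor"])
print(ATOM_8.split()[9])  # whitespace splitting leaves the two fields fused
```

Splitting on whitespace would silently misread this record, so slicing by column is the safer habit for PDB files.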

Quiz #3 prep • BLAST • Three steps • Gapped BLAST • Heuristic program • Uses Smith-Waterman (S-W) algorithm for final scoring • CLUSTAL W • Pairwise alignments • Difference matrix • Guide tree • Importance of having highly similar sequences • Secondary structure prediction • Chou-Fasman • PSIPRED • Good for secondary structure • Protein analysis • PROSITE scan • RPS-BLAST
