Class 7: Protein Secondary Structure



  1. Class 7: Protein Secondary Structure

  2. Protein Structure • Amino-acid chains can fold to form 3-dimensional structures • Proteins are sequences that have a (more or less) stable 3-dimensional configuration

  3. Why Is Structure Important? The structure a protein takes is crucial for its function • Forms “pockets” that can recognize an enzyme substrate • Situates the side chains of specific groups so they co-locate to form areas with desired chemical/electrical properties • Creates firm structures such as collagen, keratins, and fibroins

  4. Determining Structure • X-ray and NMR methods allow us to determine the structure of proteins and protein complexes • These methods are expensive and difficult • It can take several months of work to process one protein • A centralized database (PDB) contains all solved protein structures • XYZ coordinates of atoms within a specified precision • ~23,000 proteins have solved structures

  5. Structure Is Sequence Dependent • Experiments show that for many proteins, the 3-dimensional structure is a function of the sequence • Force the protein to lose its structure by introducing agents that change the environment • After the sequences are put back in water, the original conformation/activity is restored • However, for complex proteins, there are cellular processes that “help” in folding

  6. Levels of structure

  7. Secondary Structure: α-helix, β-strands

  8. α-Helix • Single protein chain • Turn every 3.6 amino acids • Shape maintained by intramolecular H-bonding between -C=O and H-N-

  9. Hydrogen Bonds in α-Helices

  10. Amphipathic α-helix • Hydrophilic residues on one side • Hydrophobic residues on the other side

  11. β-Strands • Alternating 120° angles • Often form sheets

  12. β-Strands Form Sheets (anti-parallel or parallel) • The sheets are held together by hydrogen bonds across strands

  13. …which can form a β-barrel • Porin – a membrane transporter

  14. Angular Coordinates • Secondary structures force specific angles (the backbone dihedrals φ, ψ) between residues

  15. Ramachandran Plot • We can relate the (φ, ψ) angles to types of structures

  16. Define "secondary structure" • 3D protein coordinates may be converted to a 1D secondary structure representation using DSSP or STRIDE • DSSP: EEEE_SS_EEEE_GGT__EE_E_HHHHHHHHHHHHHHHGG_TT • DSSP = Define Secondary Structure of Proteins

  17. DSSP symbols
  • H = α-helix: backbone angles near (-60°, -50°) and an i → i+4 H-bonding pattern
  • E = extended strand: backbone angles near (-120°, +120°) with β-sheet H-bonds (parallel and anti-parallel are not distinguished)
  • B = β-bridge (isolated backbone H-bonds)
  • T = turn (specific sets of angles and one i → i+3 H-bond)
  • S = bend
  • G = 3-10 helix or turn (i → i+3 H-bonds)
  • I = π-helix (i → i+5 H-bonds) (rare!)
  • _ (sometimes written L) = unclassified: none of the above; a generic loop, or a β-strand with no regular H-bonding

  18. Labeling Secondary Structure • Using both hydrogen-bond patterns and angles, we can assign secondary structure tags from the XYZ coordinates of the amino acids • These do not lead to an absolute definition of secondary structure

  19. Prediction of Secondary Structure
  Input: • amino-acid sequence
  Output: • annotation sequence over three classes: • alpha • beta • other (sometimes called coil/turn)
  Measure of success: • percentage of residues that were correctly labeled

  20. Accuracy of 3-state predictions
  True SS:    EEEE_SS_EEEE_GGT__EE_E_HHHHHHHHHHHHHHHGG_TT
  Prediction: EEEELLLLHHHHHHLLLLEEEEEHHHHHHHHHHHHHHHHHHLL
  Q3-score = % of 3-state symbols that are correct, measured on a "test set" • Test set = an independent set of cases (proteins) that were not used to train, or in any way derive, the method being tested • Best methods: PHD (Burkhard Rost) – 72-74% Q3; Psi-pred (David T. Jones) – 76-78% Q3
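The Q3 computation is mechanical enough to show directly. Below is a minimal sketch (mine, not from the slides) that maps the 8-state DSSP strings above onto 3 states and counts matching positions; the DSSP-to-3-state mapping varies between papers, so the one used here is just a common convention.

```python
# Map DSSP 8-state symbols onto the 3 states used for scoring
# (a common convention; exact mappings differ between papers).
DSSP_TO_3STATE = {
    "H": "H", "G": "H", "I": "H",            # helices
    "E": "E", "B": "E",                      # strands/bridges
    "T": "L", "S": "L", "_": "L", "L": "L",  # everything else -> loop
}

def q3_score(true_dssp: str, predicted: str) -> float:
    """Percentage of positions whose 3-state label matches."""
    assert len(true_dssp) == len(predicted)
    true3 = [DSSP_TO_3STATE[s] for s in true_dssp]
    pred3 = [DSSP_TO_3STATE[s] for s in predicted]
    correct = sum(t == p for t, p in zip(true3, pred3))
    return 100.0 * correct / len(true3)

true_ss    = "EEEE_SS_EEEE_GGT__EE_E_HHHHHHHHHHHHHHHGG_TT"
prediction = "EEEELLLLHHHHHHLLLLEEEEEHHHHHHHHHHHHHHHHHHLL"
print(f"Q3 = {q3_score(true_ss, prediction):.1f}%")
```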

  21. What can you do with a secondary structure prediction? (1) Find out if a homolog of unknown structure is missing any of the SS (secondary structure) units, i.e. a helix or a strand. (2) Find out whether a helix or strand is extended/shortened in the homolog. (3) Model a large insertion or terminal domain (4) Aid tertiary structure prediction

  22. Statistical Methods • From the PDB database, calculate the propensity of a given amino acid to adopt a certain SS type • Example: #Ala = 2,000, #residues = 20,000, #helix = 4,000, #Ala in helix = 500 • P(α, Ala) = 500/20,000, P(α) = 4,000/20,000, P(Ala) = 2,000/20,000 • Propensity = P(α, Ala) / (P(α) · P(Ala)) = 0.025 / (0.2 · 0.1) = 1.25 • Used in the Chou-Fasman algorithm (1974)
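As a worked version of the slide's example, here is a minimal sketch of the propensity computation using the counts given above:

```python
# Propensity of an amino acid for a secondary-structure type:
# propensity = P(helix, aa) / (P(helix) * P(aa)), from database counts.

n_residues  = 20_000   # total residues in the database
n_ala       = 2_000    # alanines
n_helix     = 4_000    # residues in helices
n_ala_helix = 500      # alanines that sit in helices

p_joint = n_ala_helix / n_residues        # P(helix, Ala) = 0.025
p_helix = n_helix / n_residues            # P(helix)      = 0.20
p_ala   = n_ala / n_residues              # P(Ala)        = 0.10

propensity = p_joint / (p_helix * p_ala)  # 0.025 / 0.02
print(propensity)                         # 1.25 -> Ala favors helices (> 1.0)
```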

  23. Chou-Fasman: Initiation • Identify regions where 4 out of 6 consecutive residues have propensity P(H) > 1.00 • This forms an “alpha-helix nucleus”

  24. Chou-Fasman: Propagation • Extend the helix in both directions until a set of four residues has an average P(H) < 1.00

  25. Chou-Fasman Prediction • Predict as an α-helix a segment with • E[P(α)] > 1.03 • E[P(α)] > E[P(β)] • Not including proline • Predict as a β-strand a segment with • E[P(β)] > 1.05 • E[P(β)] > E[P(α)] • Others are labeled as turns/loops. (Various extensions appear in the literature)
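The nucleation-and-extension procedure of slides 23-25 can be sketched in a few lines. The propensity table below is a small illustrative subset (values roughly follow the published Chou-Fasman helix parameters but should not be taken as authoritative), and details such as merging overlapping nuclei and the final E[P(α)] vs E[P(β)] comparison are omitted.

```python
# A minimal sketch of Chou-Fasman helix nucleation and extension.
P_H = {"A": 1.42, "E": 1.51, "L": 1.21, "M": 1.45,
       "G": 0.57, "P": 0.57, "S": 0.77, "T": 0.83}

def avg_ph(window):
    return sum(P_H.get(aa, 1.0) for aa in window) / len(window)

def helix_regions(seq, nucleus=6, hits=4):
    """Nucleate where >= 4 of 6 consecutive residues have P(H) > 1.0,
    then extend both ways until a 4-residue edge window averages < 1.0."""
    regions = []
    for i in range(len(seq) - nucleus + 1):
        window = seq[i:i + nucleus]
        if sum(P_H.get(aa, 1.0) > 1.0 for aa in window) < hits:
            continue
        start, end = i, i + nucleus
        while start > 0 and avg_ph(seq[start - 1:start + 3]) >= 1.0:
            start -= 1                       # extend left
        while end < len(seq) and avg_ph(seq[end - 3:end + 1]) >= 1.0:
            end += 1                         # extend right
        regions.append((start, end))
    return regions

print(helix_regions("AELAMEGGPSSTAELAEM"))
```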

  26. Achieved accuracy: around 50% • Shortcoming of this method: it ignores the sequence context, predicting each residue from its own amino-acid propensities • We would like to use the sequence context as input to a classifier • There are many ways to address this • The most successful to date are based on neural networks

  27. A Neuron

  28. Artificial Neuron • A neuron is a multiple-input → single-output unit: output = f(Σi Wi·ai + b) • Wi = weights assigned to inputs a1 … ak; b = internal “bias” • f = output function (linear, sigmoid)
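As a concrete reading of the formula above, here is a minimal single-neuron sketch with a sigmoid output function (the weights and inputs are illustrative):

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def neuron(inputs, weights, bias):
    """Multiple-input -> single-output unit: f(sum_i w_i * a_i + b)."""
    return sigmoid(sum(w * a for w, a in zip(weights, inputs)) + bias)

print(neuron([1.0, 0.0, 1.0], [0.5, -0.3, 0.8], bias=-0.2))
```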

  29. Artificial Neural Network • Inputs a1 … ak feed a hidden layer, whose outputs feed output units o1 … om • Neurons in hidden layers compute “features” from the outputs of previous layers • Output neurons can be interpreted as a classifier

  30. Example: Fruit Classifier

           Shape     Texture   Weight   Color
  Apple    ellipse   hard      heavy    red
  Orange   round     soft      light    yellow

  31. Si-w o o Si oo ... Si+w ... ... ... Hidden Output Input Qian-Sejnowski Architecture

  32. Neural Network Prediction • A neural network defines a function from inputs to outputs • Inputs can be discrete or continuous valued • In this case, the network defines a function from a window of size 2w+1 around a residue to a secondary structure label for it • The structure element is determined by max(oα, oβ, ocoil)
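To make the windowing concrete, here is a minimal sketch in the spirit of this architecture: a one-hot window of 2w+1 residues is fed through one hidden layer to three output units, and the label is the arg-max. The shapes, the 21-symbol encoding (20 amino acids plus a padding symbol past the termini), and the random weights are assumptions for illustration, not the trained Qian-Sejnowski network.

```python
import numpy as np

AAS = "ACDEFGHIKLMNPQRSTVWY"
STATES = ["alpha", "beta", "coil"]

def one_hot_window(seq, i, w):
    """Encode the window seq[i-w .. i+w] as a flat 0/1 vector
    (21 symbols per position: 20 amino acids + 1 padding symbol)."""
    vec = np.zeros((2 * w + 1, 21))
    for k, pos in enumerate(range(i - w, i + w + 1)):
        if 0 <= pos < len(seq):
            vec[k, AAS.index(seq[pos])] = 1.0
        else:
            vec[k, 20] = 1.0              # window hangs off the sequence end
    return vec.ravel()

def predict(seq, i, w, W1, b1, W2, b2):
    """One hidden layer, three output units (o_alpha, o_beta, o_coil);
    the predicted state is the arg-max of the outputs."""
    x = one_hot_window(seq, i, w)
    h = np.tanh(W1 @ x + b1)
    o = W2 @ h + b2
    return STATES[int(np.argmax(o))]

# Random (untrained) weights just to exercise the shapes: w=6 -> 13*21 inputs.
rng = np.random.default_rng(0)
w, hidden = 6, 40
W1, b1 = rng.normal(size=(hidden, (2 * w + 1) * 21)), np.zeros(hidden)
W2, b2 = rng.normal(size=(3, hidden)), np.zeros(3)
print(predict("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ", 10, w, W1, b1, W2, b2))
```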

  33. Training Neural Networks • By modifying the network weights, we change the function • Training is performed by • defining an error score for training pairs <input, output> • performing gradient-descent minimization of the error score • The back-propagation algorithm allows us to compute the gradient efficiently • We have to be careful not to overfit the training data
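For a single sigmoid neuron and a squared-error score, one gradient step can be written out in a few lines. This toy sketch stands in for back-propagation over the full network; the learning rate and data are illustrative.

```python
import math

def sigmoid(x): return 1.0 / (1.0 + math.exp(-x))

def neuron(inputs, weights, bias):
    return sigmoid(sum(w * a for w, a in zip(weights, inputs)) + bias)

def train_step(inputs, target, weights, bias, lr=0.1):
    """One gradient-descent step on the error score 0.5*(out - target)^2."""
    out = neuron(inputs, weights, bias)
    grad_pre = (out - target) * out * (1.0 - out)  # chain rule through sigmoid
    new_w = [w - lr * grad_pre * a for w, a in zip(weights, inputs)]
    new_b = bias - lr * grad_pre
    return new_w, new_b

w, b = [0.1, -0.2, 0.05], 0.0
for _ in range(100):
    w, b = train_step([1.0, 0.0, 1.0], target=1.0, weights=w, bias=b)
print(neuron([1.0, 0.0, 1.0], w, b))  # output moves toward the target 1.0
```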

  34. Smoothing Outputs • The Qian-Sejnowski network assigns each residue a secondary structure by taking max(oα, oβ, ocoil) • Some sequences of secondary structure labels are impossible • To smooth the output of the network, another layer is applied on top of the three output units for each residue: • a neural network • a Markov model
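The slide leaves the smoothing layer abstract. As one toy stand-in (a simplification, not the second network or Markov model mentioned above), a sliding majority vote over the labels already removes impossible one-residue interruptions:

```python
from collections import Counter

def smooth(labels: str, w: int = 2) -> str:
    """Replace each label by the majority label in a window of 2w+1."""
    out = []
    for i in range(len(labels)):
        window = labels[max(0, i - w): i + w + 1]
        out.append(Counter(window).most_common(1)[0][0])
    return "".join(out)

print(smooth("HHHEHHHLLEELEE"))  # the lone E inside the helix is voted away
```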

  35. Success Rate • Variants of the neural network architecture and other methods achieved accuracy of about 65% on unseen proteins • Depending on the exact choice of training/test sets

  36. Breaking the 70% Threshold • An innovation that made a crucial difference uses evolutionary information to improve prediction • Key idea: • Structure is preserved more than sequence • Surviving mutations are not random • Suppose we find homologues (same structure) of the query sequence • The types of replacements at position i during evolution provide information about the role of residue i in the secondary structure

  37. Nearest Neighbor Approach • Select a window around the target residue • Perform local alignment to sequences with known structure • Choose the alignment weight matrix to match remote homologies • The alignment weight takes into account the secondary structure of the aligned sequence • Predict by max(nα, nβ, nc) (neighbor counts) or max(sα, sβ, sc) (alignment-score sums) • Key: a scoring measure of evolutionary similarity
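A minimal sketch of the voting step, under heavy assumptions: a toy substitution matrix instead of a real remote-homology matrix, and a tiny in-memory list of (window, label) pairs instead of aligned sequences of known structure.

```python
# Toy substitution matrix; a real method would use a full 20x20 matrix
# tuned for remote homology (pairs missing here score 0).
SCORE = {("A", "A"): 4, ("L", "L"): 4, ("V", "V"): 4,
         ("A", "L"): 1, ("L", "A"): 1, ("L", "V"): 2, ("V", "L"): 2}

def window_score(q, d):
    """Ungapped alignment score between two equal-length windows."""
    return sum(SCORE.get((a, b), 0) for a, b in zip(q, d))

def predict_nn(query_window, database, k=3):
    """database: list of (window, label) pairs, label in {'a', 'b', 'c'}.
    Take the majority label among the k best-scoring neighbors,
    i.e. max(n_alpha, n_beta, n_coil)."""
    ranked = sorted(database, key=lambda wd: window_score(query_window, wd[0]),
                    reverse=True)
    top = [label for _, label in ranked[:k]]
    return max(set(top), key=top.count)

db = [("ALLVA", "a"), ("ALLLA", "a"), ("VLVLV", "b"),
      ("AVAVA", "b"), ("LLLLL", "a")]
print(predict_nn("ALLVL", db, k=3))  # -> 'a'
```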

  38. PHD Approach • Multi-step procedure: • Perform a BLAST search to find local alignments • Remove alignments that are “too close” • Perform a multiple alignment of the sequences • Construct a profile (PSSM) of amino-acid frequencies at each residue • Use this profile as input to the neural network • A second network performs “smoothing”
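The profile-building step is easy to illustrate. The sketch below computes per-column amino-acid frequencies from a toy multiple alignment; real PHD/PSI-BLAST profiles add sequence weighting and pseudocounts, which are omitted here.

```python
from collections import Counter

def profile(alignment):
    """alignment: list of equal-length aligned sequences ('-' = gap).
    Returns one {amino acid: frequency} dict per column."""
    ncols = len(alignment[0])
    prof = []
    for j in range(ncols):
        column = [seq[j] for seq in alignment if seq[j] != "-"]
        counts = Counter(column)
        total = sum(counts.values())
        prof.append({aa: c / total for aa, c in counts.items()})
    return prof

aln = ["MKVLA", "MRVLA", "MKILS", "MKV-A"]
for j, freqs in enumerate(profile(aln)):
    print(j, freqs)
```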

  39. PHD Architecture

  40. Psi-pred: same idea • (Step 1) Run PSI-BLAST → output a sequence profile • (Step 2) A 15-residue sliding window (15 × 21 = 315 inputs) is multiplied by the hidden weights of the 1st neural net; the output is 3 weights (one per state H, E, or L) per position • (Step 3) 60 inputs are multiplied by the weights of the 2nd neural network and summed; the output is the final 3-state prediction • Performs slightly better than PHD

  41. Other Classification Methods • Neural networks were used as the classifier in the methods described • We can apply the same idea with other classifiers, e.g. SVMs • Advantages: • effectively avoids overfitting • supplies a prediction confidence

  42. SVM-based approach • Suggested by S. Hua and Z. Sun (2001) • Multiple sequence alignment from the HSSP database (same as PHD) • Sliding window of w residues × 21 values each → 21·w input dimensions • Apply an SVM with an RBF kernel • Multiclass problem: • training: one-against-others (e.g. H/~H, E/~E, L/~L) and binary (e.g. H/E) classifiers • combined by maximum output score, a decision tree, or a jury decision method
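A minimal sketch of the classifier stage using scikit-learn's SVC rather than the authors' original code. The random features and labels only stand in for real 21·w window profiles, and scikit-learn's built-in multiclass handling (internally one-vs-one) differs from the one-against-others and tree/jury schemes compared in the paper.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 21 * 11))   # toy profiles for a window of w = 11
y = rng.integers(0, 3, size=300)      # 0 = H, 1 = E, 2 = L (toy labels)

clf = SVC(kernel="rbf", gamma="scale", C=1.0)  # RBF kernel, as in the paper
clf.fit(X, y)
print(clf.predict(X[:5]))
```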

  43. Decision tree • A cascade of binary classifiers (E/~E, H/~H, C/~C, then H/E, E/C, C/H) whose yes/no answers determine the final H/E/C label

  44. Accuracy on CB513 set

  45. State of the Art • Both PHD and the nearest-neighbor method get about 72-74% accuracy • Both predicted well in CASP2 (1996) • PSI-Pred is slightly better (around 76%) • Recent trend: combining classification methods • Best predictions in CASP3 (1998) • Failures: • long-range effects: S-S bonds, parallel strands • chemical patterns • wrong predictions at the ends of helices/strands
