
Presentation Transcript


  1. A Novel Method of Protein Secondary Structure Prediction with High Segment Overlap Measure: SVM Approach, by S. Hua & Z. Sun. Presenter: Sancar Adali

  2. Outline • Problem Statement • SVM primer • Methods and Parameters • Results • Conclusion • Discussion

  3. Problem Statement • Predicting protein secondary structure from the amino acid sequence • Secondary structure: the local shape of the polypeptide chain, determined by hydrogen bonds between different amino acids • In the big picture, the goal is to solve the tertiary/quaternary structure in 3D; predicting secondary structure first attacks a more easily tackled, lower-complexity version of that problem

  4. Some important terminology • residue: when two or more amino acids combine to form a peptide, the elements of water are removed, and what remains of each amino acid is called an amino acid residue • segment: a group of consecutive residues that form a secondary structure such as a helix, sheet, or coil • C-terminus: the residue in a peptide that has a free carboxyl group • N-terminus: the residue in a peptide that has a free amino group [International Union of Pure and Applied Chemistry definitions]

  5. Problem Statement • Our main goal is to find the sequence-to-structure relationship. We usually know the amino acid sequence of a protein. The full problem is complicated because of the many interactions/bonds between amino acid residues, so we start at a lower level of complexity and try to predict secondary structure.

  6. Amino acids to Proteins [Diagram: two amino acids with side chains R1 and R2 join when the carboxyl (COO) group of one bonds to the amino (NH3) group of the other, forming a peptide bond; the resulting chain has a free amino group at the N-terminus and a free carboxyl group at the C-terminus.]

  7. Linear Decision Boundary • # of classes: 2 • Output y ∈ {−1, +1}; input x ∈ ℝⁿ; training set L = ((x₁, y₁), (x₂, y₂), …, (x_k, y_k)) • Weight vector w ∈ ℝⁿ • Hyperplane: H = {x : w·x + b = 0}

  8. SVM (Support Vector Machine) • Linearly Separable Training Samples

  9. SVM (Support Vector Machine) • Original idea: choose H by maximizing min{‖x − xᵢ‖ : x ∈ H, i = 1, …, k} with respect to w and b • That is, maximize the margin (the distance from the hyperplane to the closest positive and negative training points)

  10. Solving the Linearly Separable Case • Hyperplane: H = {x : w·x + b = 0} • Margin M = d₊ + d₋, where d₊ and d₋ are the distances from H to the closest positive and negative training points • Choose w, b to maximize the margin; then f_SVM(x) = sign(w·x + b) • Since the data points are linearly separable, there exist w and b such that yᵢ(w·xᵢ + b) > 0 for all i = 1, …, k
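
As a concrete illustration of the decision rule f_SVM(x) = sign(w·x + b), here is a minimal sketch on toy linearly separable data; scikit-learn and all data values are my own illustrative choices, not the paper's.

```python
# Minimal sketch (illustrative, not from the paper): fit a linear SVM on
# toy separable data, then read off w, b and apply sign(w.x + b).
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0],        # class +1
              [-1.0, -1.0], [-2.0, -0.5], [-1.5, -2.0]])  # class -1
y = np.array([1, 1, 1, -1, -1, -1])

clf = SVC(kernel="linear", C=1e6)  # very large C ~ hard margin for separable data
clf.fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]
x_new = np.array([0.5, 0.8])
print("f_SVM(x) =", np.sign(w @ x_new + b))  # the decision rule sign(w.x + b)
```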

  11. Solving the Linearly Separable Case • Rescale w, b so that the points closest to the hyperplane satisfy |w·xᵢ + b| = 1 (standardizing the distance), giving yᵢ(w·xᵢ + b) ≥ 1 for all i = 1, 2, …, k • We now have two parallel planes H₁, H₂ on either side of H

  12. Linearly Separable Case Soln. • The distance between H₁ and H₂ is 2/‖w‖.

  13. Linearly Separable Case Soln. • We want to maximize the distance between the hyperplanes H₁ and H₂, which means minimizing ‖w‖ • The problem has become an optimization problem: minimize (1/2)‖w‖² subject to yᵢ(w·xᵢ + b) ≥ 1 for all i

  14. The decision boundary depends only on a subset of the training samples. These are called “Support Vectors”. • Introduce nonnegative Lagrange multipliers αᵢ ≥ 0 to solve the constrained problem • The solution is w = Σᵢ αᵢyᵢxᵢ, and the decision function is f(x) = sign(Σᵢ αᵢyᵢ(xᵢ·x) + b); only the support vectors have αᵢ > 0
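
A short sketch of the support-vector idea, again with scikit-learn as an illustrative stand-in: after training, only the support vectors carry nonzero multipliers, and the decision value can be reconstructed from them alone.

```python
# Sketch (illustrative): the fitted model exposes the support vectors and
# the signed multipliers alpha_i * y_i; the decision function is
# f(x) = sign(sum_i alpha_i y_i (x_i . x) + b) over support vectors only.
import numpy as np
from sklearn.svm import SVC

X = np.array([[1, 1], [2, 2], [2, 0], [-1, -1], [-2, -2], [-2, 0]], dtype=float)
y = np.array([1, 1, 1, -1, -1, -1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)
print("support vectors:", clf.support_vectors_)  # the x_i with alpha_i > 0
print("alpha_i * y_i  :", clf.dual_coef_[0])     # signed multipliers

# Reconstruct the decision value for a new point from the dual form:
x = np.array([0.5, 0.5])
f = clf.dual_coef_[0] @ (clf.support_vectors_ @ x) + clf.intercept_[0]
print("f(x) =", np.sign(f))
```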

  15. Error bound on SVM using LOOCV • Leave-one-out cross-validation gives the bound E[P(error)] ≤ E[# support vectors] / (# training samples)

  16. Not Linearly Separable Case • Introduce slack variables ξᵢ ≥ 0 to take classification errors into account: yᵢ(w·xᵢ + b) ≥ 1 − ξᵢ • If ξᵢ > 1, we make an error • Add the error term to the objective function: minimize (1/2)‖w‖² + C Σᵢ ξᵢ

  17. Not Linearly Separable Case • The solution is again f(x) = sign(Σᵢ αᵢyᵢ(xᵢ·x) + b), now with the multipliers bounded by 0 ≤ αᵢ ≤ C

  18. Nonlinear SVM • Map the training samples to a higher-dimensional space where they become linearly separable, and solve the decision problem there • Important observation: training samples only appear in dot products, xᵢ·xⱼ or, after mapping, Φ(xᵢ)·Φ(xⱼ)

  19. Nonlinear SVM • We don’t need to know Φ explicitly; we just need the dot products Φ(xᵢ)·Φ(xⱼ), which can be represented by a kernel function K(xᵢ, xⱼ) = Φ(xᵢ)·Φ(xⱼ)

  20. Nonlinear SVM • Φ appears explicitly neither in the decision function nor in the training step; only K is needed

  21. Typical Kernels • Polynomial: K(x, y) = (x·y + 1)^d • Gaussian radial basis function (RBF): K(x, y) = exp(−γ‖x − y‖²) ← THE ONE USED IN THE ARTICLE • Sigmoid: K(x, y) = tanh(κ x·y + θ)
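
A minimal sketch of the RBF kernel named above; the γ value is illustrative (slide 23 reports r = 0.10 for the paper's Gaussian RBF).

```python
# Sketch: the kernel trick replaces dot products Phi(x).Phi(y) with
# K(x, y), so Phi itself is never computed.
import numpy as np

def rbf_kernel(x, y, gamma=0.10):
    """Gaussian RBF kernel K(x, y) = exp(-gamma * ||x - y||^2)."""
    return np.exp(-gamma * np.sum((x - y) ** 2))

x1, x2 = np.array([1.0, 0.0, 1.0]), np.array([0.0, 1.0, 1.0])
print(rbf_kernel(x1, x2))  # similarity in the implicit feature space
```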

  22. Methods • Use of SVMs with a radial basis function kernel • Three classes: H (α-helix), E (β-sheet), and C (coil, the remaining part) • Construction of tertiary (three-class) classifiers from multiple binary classifiers (SVMs are binary classifiers)

  23. SVM Implementation • Gaussian RBF with r = 0.10 • The authors claim this parameter is not critical, and they have tried other values • C: the penalty for classification errors. It must be chosen carefully to find a balance between overfitting (penalty too high) and poor generalization (penalty too low). The optimal value of C was chosen using 7-fold cross-validation in this paper.
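
A hedged sketch of this kind of model selection, assuming scikit-learn; the data, the γ value standing in for the paper's r, and the C grid are illustrative stand-ins, not the paper's values.

```python
# Sketch: choose the penalty C by 7-fold cross-validation, as the
# authors describe. Random data below is a stand-in for encoded residues.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))            # stand-in feature vectors
y = rng.integers(0, 2, size=200) * 2 - 1  # stand-in +/-1 labels

search = GridSearchCV(
    SVC(kernel="rbf", gamma=0.10),        # gamma plays the role of r
    param_grid={"C": [0.1, 1.0, 1.5, 2.0, 10.0]},  # illustrative grid
    cv=7,                                  # 7-fold CV, as in the paper
)
search.fit(X, y)
print("best C:", search.best_params_["C"])
```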

  24. Encoding Amino Acid Sequence Information • 20 amino acids + 1 symbol for the N or C terminus: 21 binary unit vectors (1,0,…,0), (0,1,…,0), … • Concatenate these vectors for each sliding window • For window length ℓ, the dimension of the feature vector is 21·ℓ

  25. Encoding Example • Assume ℓ = 5 and there are 3 amino acids (for our purposes), named Arg (1,0,0,0), Lys (0,1,0,0), Met (0,0,1,0). The two ends of the chain are encoded separately as (0,0,0,1). • The feature vector for the Arg* residue in the sequence Arg, Lys, Arg*, Met, Arg, Met, Met is xᵢ = (1,0,0,0; 0,1,0,0; 1,0,0,0; 0,0,1,0; 1,0,0,0) • This is our input to the SVM.
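
A small sketch reproducing this toy encoding; the helper names and the handling of positions past the chain ends are my own illustrative choices.

```python
# Sketch of the orthogonal (one-hot) window encoding from slides 24-25,
# using the toy 3-amino-acid alphabet plus a terminus symbol.
import numpy as np

ALPHABET = ["Arg", "Lys", "Met", "END"]  # END marks positions past either terminus

def one_hot(symbol):
    v = np.zeros(len(ALPHABET), dtype=int)
    v[ALPHABET.index(symbol)] = 1
    return v

def encode_window(sequence, center, window=5):
    """Concatenate one-hot vectors for a window around `center`;
    positions outside the chain are encoded as the terminus symbol."""
    half = window // 2
    parts = []
    for i in range(center - half, center + half + 1):
        symbol = sequence[i] if 0 <= i < len(sequence) else "END"
        parts.append(one_hot(symbol))
    return np.concatenate(parts)

seq = ["Arg", "Lys", "Arg", "Met", "Arg", "Met", "Met"]
print(encode_window(seq, center=2))  # Arg* is the third residue
# -> [1 0 0 0  0 1 0 0  1 0 0 0  0 0 1 0  1 0 0 0]
```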

  26. Determination of Optimal Window Length • Determines dimension of feature vector • Determined using 7-fold cross-validation

  27. Data Sets • RS126 set: 126 nonhomologous proteins, where non-homologous means no two proteins in the set share more than 25% sequence identity over a length of more than 80 residues • CB513 set: 513 nonhomologous proteins; almost all the sequences in the RS126 set are included in the CB513 set • CB513 uses a more stringent definition of homology: an SD score of ≥ 5 is regarded as homology. The SD score (the distance of the alignment score from the mean score of randomized sequences, in standard deviations) is a more stringent measure than percentage identity; in fact, 11 pairs of proteins in the RS126 set are sequence-similar under the SD score. • The CB513 set contains 16 chains of fewer than 30 residues, and short chains usually decrease accuracy.

  28. Protein Secondary Structure Definition • Eight DSSP states: H (α-helix), G (3₁₀-helix), I (π-helix), E (β-strand), B (isolated β-bridge), T (turn), S (bend), and – (rest) • Two reduction schemes (Reduction 1 and Reduction 2) map the eight states onto the three classes H, E, and C
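
A sketch of one common 8-to-3 reduction (H, G, I → H; E, B → E; everything else → C); this particular mapping is an assumption for illustration, since the slide does not spell out which states each reduction assigns where.

```python
# Sketch: map DSSP's 8 states onto the 3 prediction classes.
# This mapping is one common scheme, shown for illustration only.
REDUCTION = {"H": "H", "G": "H", "I": "H",   # helices -> H
             "E": "E", "B": "E",             # strands/bridges -> E
             "T": "C", "S": "C", "-": "C"}   # everything else -> coil

print("".join(REDUCTION[s] for s in "HGIEBTS-"))  # -> HHHEECCC
```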

  29. Reliability Index (of single classifications) • RI is a discretized function of the sample’s distance to the separating hyperplane • Comparing RI with the accuracy measure makes the point: accuracy increases monotonically with increasing RI
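
A sketch of the idea, with an invented binning scheme; the bin width and cap are illustrative assumptions, not the paper's definition.

```python
# Sketch: discretize the absolute distance of a sample to the separating
# hyperplane into integer bins, so a larger RI means a more confident
# prediction. Bin width and cap below are made-up illustrative values.
def reliability_index(decision_value, bin_width=0.5, max_ri=9):
    """Map |w.x + b| to an integer reliability score 0..max_ri."""
    return int(min(abs(decision_value) / bin_width, max_ri))

print(reliability_index(0.3), reliability_index(2.7))  # -> 0 5
```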

  30. Accuracy Measures • Q₃: an estimate of P(Y = Ŷ), the fraction of residues predicted correctly over all three states • Q_α: an estimate of P(Ŷ = α | Y = α) for each state α • Y is the class/state variable and Ŷ is the output of the classifier
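
A minimal sketch of per-residue Q₃ (my own helper, not the paper's code):

```python
# Sketch: Q3 is the fraction of residues whose predicted state (H/E/C)
# matches the true state.
def q3(true_states, pred_states):
    assert len(true_states) == len(pred_states)
    matches = sum(t == p for t, p in zip(true_states, pred_states))
    return matches / len(true_states)

print(q3("HHHEECCC", "HHHECCCC"))  # 7 of 8 residues correct -> 0.875
```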

  31. Segment Overlap Measure • The segment overlap measure (SOV) is a more realistic way of measuring accuracy than per-residue accuracy, in terms of relevance to 3D structure • Problems with per-residue accuracy: proteins with the same 3D fold differ by about 12% in secondary structure, so the maximum meaningful Q₃ is about 88%; the ends of segments also vary among proteins with the same 3D structure, so their classification is less relevant to determining protein structure • SOV accounts for the type and position of secondary structure segments, rather than per-residue agreement, and for the variation of the residues at the ends of segments

  32. Comparison of SOV and per-residue

  33. Accuracy Measures • Correlation measure (Matthews correlation coefficient): C_α = (p_α n_α − o_α u_α) / √((p_α + o_α)(p_α + u_α)(n_α + o_α)(n_α + u_α)) • where p_α: # of true positives; n_α: # of true negatives; o_α: # of false positives; u_α: # of false negatives • α stands for H, C, or E
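
A sketch computing this correlation from the four counts on the slide; the counts below are made-up illustrative numbers.

```python
# Sketch: per-state Matthews correlation from true positives (p), true
# negatives (n), false positives (o), and false negatives (u).
import math

def matthews_corr(p, n, o, u):
    denom = math.sqrt((p + o) * (p + u) * (n + o) * (n + u))
    return (p * n - o * u) / denom if denom else 0.0

print(matthews_corr(p=50, n=120, o=10, u=20))  # correlation for one state
```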

  34. Tertiary Classifier Design • Three one-vs-rest binary classifiers: H/~H, E/~E, C/~C • Choose the class whose classifier gives the maximum margin (distance to its hyperplane): SVM_MAX_D
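
A sketch of the SVM_MAX_D combination rule, assuming scikit-learn and random stand-in data: train one one-vs-rest SVM per state and pick the state whose classifier gives the largest decision value.

```python
# Sketch: three one-vs-rest SVMs (H/~H, E/~E, C/~C); assign the class
# whose classifier places the sample farthest on the positive side of
# its hyperplane. Data and parameters are illustrative stand-ins.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 8))          # stand-in encoded windows
y = rng.choice(list("HEC"), size=300)  # stand-in true states

classifiers = {}
for state in "HEC":
    clf = SVC(kernel="rbf", gamma=0.10, C=1.0)
    clf.fit(X, (y == state).astype(int) * 2 - 1)  # state vs. not-state
    classifiers[state] = clf

def predict_max_d(x):
    # decision_function returns the signed score w.x + b in feature space
    scores = {s: clf.decision_function(x.reshape(1, -1))[0]
              for s, clf in classifiers.items()}
    return max(scores, key=scores.get)

print(predict_max_d(X[0]))
```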

  35. Tertiary Classifier Design • SVM_VOTE: the one-vs-rest classifiers (H/~H, E/~E, C/~C) and the pairwise classifiers (H/C, E/H, E/C) vote on the class • SVM_NN: feed the binary classifier outputs into a neural network • SVM_JURY: combine all of the tertiary classifiers designed so far to form a jury

  36. Results

  37. Results(Comparison with PHD)

  38. Results, Details – Support Vectors • In typical pattern recognition problems, the ratio (# SVs)/(# training samples) is much lower • Interpretation: recall the LOOCV error bound E[# SVs]/(# training samples); a large fraction of support vectors means this error bound is very high compared to typical pattern recognition problems

  39. Conclusions • Application of the SVM method to secondary structure prediction, with good results (an improvement over PHD) • Importance of the accuracy measure used to evaluate algorithms, and of the class reduction method • Use of a machine learning method to attack the protein structure problem at a lower level of complexity • The general protein folding problem in three dimensions requires too much computational power, so this is a computationally cheap alternative

  40. Critique of article • The number of SVs is too large, which calls the performance of the SVM into question • The article doesn’t put the machine learning aspects into clear form; it is too focused on results
