
Protein Classification Using Averaged Perceptron SVM

Eugene Ie. CS6772 Project Presentation, 12/03/2003. Protein sequence classification: Protein = Σ*, |Σ| = 20 amino acids. Easy to sequence proteins, difficult to obtain structure.


Presentation Transcript


  1. Protein Classification Using Averaged Perceptron SVM
     Eugene Ie, CS6772 Project Presentation, 12/03/2003

  2. Protein Sequence Classification
     • Protein = Σ*, |Σ| = 20 amino acids
     • Easy to sequence proteins, difficult to obtain structure
     Sequence: VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR
     3D Structure: ?
     Class: Globin family, Globin-like superfamily
     Function: Oxygen transport

  3. Sequence Alignment vs. Classification
     • Sequence similarity through alignment
       SGFIEEDELKLFL
       SGFIEEEELKFVL
     • Sequence classification for remote homology
     Diagram labels: distant homology, close homology, Classifier

  4. Structural Hierarchy of Proteins (SCOP)
     • Remote homologs:
       • Structure and function conserved
       • Sequence similarity: low
     Diagram labels: Fold, Superfamily, Family; Positive Training Set, Positive Test Set, Negative Training Set, Negative Test Set

  5. Remote Homology Detection
     • Discriminative supervised learning approach to protein classification
     Approach: Support Vector Machines with String Kernels
     • C. Leslie, E. Eskin, J. Weston, and W. Noble, Mismatch String Kernels for SVM Protein Classification.
     • C. Leslie and R. Kuang, Fast Kernels for Inexact String Matching.

  6. QP SVM Training
     Sequence Training Data (total: n sequences + n labels):
     >VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR
     …
     >TYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR
     QP Solver (slow)
     Learned Weights and Bias (From KKT)

  7. Averaged Perceptron SVM Training
     Training Algorithm: Y. Freund and R. Schapire, Large Margin Classification Using the Perceptron Algorithm.

  8. Averaged Perceptron SVM Training
     Sequence Training Data (total: n sequences + n labels):
     >VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR
     …
     >TYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR
     Run Perceptron Algorithm, Iterate t Epochs
     Final Weight Vector, Voting Weights
     Generalized Bound for k
     • s = no. of dimensions in feature space
     • k = no. of mistakes made during perceptron run
     SCOP experiments show:
     • For average n ~ 1000
     • Average k ~ 50-60
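The training loop on this slide can be sketched in kernel form. The following is a minimal illustration in the style of Freund and Schapire's voted perceptron, not the presenter's protclass code: the function name `train_voted_perceptron`, the toy `dot` kernel, and the convention of treating a zero score as a mistake are all assumptions made for the example. It records the mistake indices and, for each intermediate prediction vector, how many examples it survived (the voting weights).

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

X = TypeVar("X")


def train_voted_perceptron(
    xs: Sequence[X],
    ys: Sequence[int],
    kernel: Callable[[X, X], float],
    epochs: int = 1,
) -> Tuple[List[int], List[int]]:
    """Kernel perceptron that remembers every mistake.

    Returns (mistakes, counts):
      mistakes[j] is the training index that triggered update j+1,
      counts[j]   is the voting weight of prediction vector v_j,
                  where v_0 = 0 and v_{j+1} = v_j + y * phi(x) at a mistake.
    """
    mistakes: List[int] = []
    counts: List[int] = [0]  # v_0 = 0 has survived nothing yet
    for _ in range(epochs):
        for i, (x, y) in enumerate(zip(xs, ys)):
            # Score of the current prediction vector, via kernel products
            # with the examples on which mistakes were made so far.
            score = sum(ys[m] * kernel(xs[m], x) for m in mistakes)
            if y * score > 0:
                counts[-1] += 1      # current vector survives this example
            else:
                mistakes.append(i)   # assumption: a zero score counts as a mistake
                counts.append(1)     # a new prediction vector starts with weight 1
    return mistakes, counts


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Toy linear kernel standing in for a string kernel."""
    return sum(p * q for p, q in zip(a, b))


if __name__ == "__main__":
    xs = [(1.0, 0.0), (-1.0, 0.0)]
    ys = [1, -1]
    print(train_voted_perceptron(xs, ys, dot, epochs=2))
```

The length of `mistakes` is the k on the slide, so the cost of scoring an example grows with the number of mistakes rather than with n.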

  9. Averaged Perceptron SVM Classification
     Testing Algorithm:
     • Note: Only k kernel products with unknown sequence x need to be computed.
     • Recurrence relation: M is the set of “mistake indices”
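The note about needing only k kernel products can be illustrated as follows. This is a hedged sketch of an averaged prediction, assuming the voted-perceptron bookkeeping of Freund and Schapire (prediction vectors v_0 = 0, v_{j+1} = v_j + y_j φ(x_j) at each mistake, with voting weights c_j); the names `averaged_score` and `predict` and the argument layout are hypothetical, and the slide's exact recurrence is not reproduced here.

```python
from typing import Callable, Sequence, TypeVar

X = TypeVar("X")


def averaged_score(
    x: X,
    support: Sequence[X],      # training examples at the mistake indices M, in order
    labels: Sequence[int],     # their labels, +1 or -1
    counts: Sequence[int],     # voting weights c_0..c_k (len(support) + 1 entries)
    kernel: Callable[[X, X], float],
) -> float:
    """Return sum_j c_j * (v_j . phi(x)) using one kernel product per mistake."""
    total = 0.0
    partial = 0.0  # v_0 . phi(x) = 0, so c_0 contributes nothing
    for j, (xm, ym) in enumerate(zip(support, labels)):
        # Recurrence: v_{j+1} . phi(x) = v_j . phi(x) + y_j * K(x_j, x)
        partial += ym * kernel(xm, x)
        total += counts[j + 1] * partial
    return total


def predict(x, support, labels, counts, kernel) -> int:
    """Sign of the averaged score; a zero score maps to -1 (an arbitrary choice)."""
    return 1 if averaged_score(x, support, labels, counts, kernel) > 0 else -1


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Toy linear kernel standing in for a string kernel."""
    return sum(p * q for p, q in zip(a, b))


if __name__ == "__main__":
    support = [(1.0, 0.0), (0.0, 1.0)]
    labels = [1, -1]
    counts = [0, 1, 3]
    print(averaged_score((1.0, 1.0), support, labels, counts, dot))
```

Because the running dot product is updated incrementally, each kernel value is computed once and reused for every later prediction vector.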

  10. Implementation Details
     • Built on top of protclass (Protein Classification) platform
     • Java Platform
     • Classification Task
     • Hash table scan instead of Mismatch Trie
     • Generate mismatch mappings once using shifts
     • Dynamic kernel matrix storage
     • Still needs debugging
     • Speed/Space Performance
     • ~80% reduction in space requirement
     • ~50% reduction in training time
     • ~50% reduction in testing time
     • Mainly from simple online algorithm
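As an illustration of computing a mismatch string kernel with a hash table rather than a trie, here is a small sketch. It is not the protclass implementation (which is in Java), and it does not reproduce the shift-based generation mentioned on the slide; the names `neighborhood`, `feature_map`, and `mismatch_kernel` are hypothetical. Each k-mer's mismatch neighborhood is generated once and cached, and feature counts live in a dictionary.

```python
from collections import Counter
from functools import lru_cache
from itertools import combinations, product
from typing import FrozenSet

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20-letter alphabet


@lru_cache(maxsize=None)
def neighborhood(kmer: str, m: int) -> FrozenSet[str]:
    """All strings over AMINO_ACIDS within m substitutions of kmer (cached)."""
    result = {kmer}
    for n_sub in range(1, m + 1):
        for positions in combinations(range(len(kmer)), n_sub):
            for letters in product(AMINO_ACIDS, repeat=n_sub):
                chars = list(kmer)
                for pos, letter in zip(positions, letters):
                    chars[pos] = letter
                result.add("".join(chars))
    return frozenset(result)


def feature_map(seq: str, k: int, m: int) -> Counter:
    """Hash table from k-mer to count, summed over every window of seq."""
    features: Counter = Counter()
    for i in range(len(seq) - k + 1):
        for beta in neighborhood(seq[i:i + k], m):
            features[beta] += 1
    return features


def mismatch_kernel(a: str, b: str, k: int, m: int) -> int:
    """Inner product of the two (k, m)-mismatch feature maps."""
    fa, fb = feature_map(a, k, m), feature_map(b, k, m)
    if len(fa) > len(fb):
        fa, fb = fb, fa  # scan the smaller table
    return sum(count * fb.get(beta, 0) for beta, count in fa.items())


if __name__ == "__main__":
    print(mismatch_kernel("SGFIEEDELKLFL", "SGFIEEEELKFVL", k=3, m=1))
```

Enumerating neighborhoods explicitly is only practical for small k and m, since the neighborhood size grows quickly with both.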
