
Protein Homology Detection Using String Alignment Kernels

Jean-Philippe Vert, Tatsuya Akutsu. Protein Homology Detection Using String Alignment Kernels. Learning sequence-based protein classification. Problem: classification of protein sequence data into families and superfamilies.


Presentation Transcript


  1. Jean-Philippe Vert, Tatsuya Akutsu. Protein Homology Detection Using String Alignment Kernels

  2. Learning Sequence-Based Protein Classification • Problem: classify protein sequence data into families and superfamilies • Motivation: many proteins have been sequenced, but their structure/function often remains unknown • Goal: infer structure/function from sequence-based classification

  3. Sequence Data Versus Structure and Function • Sequences for the four chains of human hemoglobin:
  >1A3N:A HEMOGLOBIN
  VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR
  >1A3N:B HEMOGLOBIN
  VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH
  >1A3N:C HEMOGLOBIN
  VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR
  >1A3N:D HEMOGLOBIN
  VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH
  Tertiary structure: [figure]. Function: oxygen transport.

  4. Structural Hierarchy • SCOP: Structural Classification of Proteins • Interested in superfamily-level homology, i.e. remote evolutionary relationships (difficult!)

  5. Learning Problem • Reduce to a binary classification problem: an example is positive (+) if it belongs to a given family (e.g. G proteins) or superfamily (e.g. nucleoside triphosphate hydrolases), negative (-) otherwise • Focus on remote homology detection • Use a supervised learning approach to train a classifier: labeled training sequences → learning algorithm → classification rule

  6. Two Supervised Learning Approaches to Classification • Generative model approach: build a generative model for a single protein family and classify each candidate sequence based on its fit to the model; uses only positive training sequences • Discriminative approach: the learning algorithm tries to learn the decision boundary between positive and negative examples; uses both positive and negative training sequences

  7. Targets of the current methods, by SCOP level: Class → secondary structure prediction; Fold → threading; Superfamily → HMM, PSI-BLAST, SVM; Family → SW, BLAST, FASTA

  8. Discriminative Learning • Discriminative approach: train on both positive and negative examples to learn a classifier • Modern computational learning theory • Goal: learn a classifier that generalizes well to new examples • Do not use the training data to estimate the parameters of a probability distribution ("curse of dimensionality")

  9. SVM for Protein Classification • Want to define a feature map from the space of protein sequences to a vector space • Goals: computational efficiency; competitive performance with known methods; no reliance on a generative model (a general method for sequence-based classification problems)

  10. Summary of the current kernel methods • Feature vector from an HMM: Fisher kernel (Jaakkola et al., 2000); marginalized kernel (Tsuda et al., 2002) • Feature vector from the sequence: spectrum kernel (Leslie et al., 2002); mismatch kernel (Leslie et al., 2003) • Feature vector from other scores: SVM-pairwise (Liao & Noble, 2002)

  11. String Alignment Kernels • Observation: the SW alignment score provides a measure of similarity that incorporates biological knowledge about protein evolution • It cannot be used directly as a kernel because it lacks positive definiteness • A family of local alignment (LA) kernels that mimic the SW score is presented

  12. LA Kernels • Other kernels: choose a feature-vector representation, then obtain the kernel as an inner product of feature vectors • LA kernel: measure similarity between sequences directly, then show that the result is a valid kernel

  13. LA Kernels • Pair score kernel K_a^β(x, y): equal to exp(β · s(x, y)) when x and y are single residues (|x| = |y| = 1), and 0 otherwise, where β ≥ 0 and s is a symmetric substitution score (e.g. BLOSUM62) • Gap kernel K_g^β(x, y): equal to exp(−β · [g(|x|) + g(|y|)]) for the affine gap penalty model g(0) = 0, g(n) = d + e·(n − 1) for n ≥ 1, where d is the gap-opening cost and e the gap-extension cost (see the sketch below)
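
A minimal sketch of these two building blocks in Python. The toy substitution scores, parameter values, function names and the sign convention (positive gap costs d, e entering with a minus sign) are assumptions for illustration, not the authors' implementation.

```python
import math

def pair_kernel(a, b, subst, beta):
    """K_a^beta on two single residues: exp(beta * s(a, b))."""
    return math.exp(beta * subst[a][b])

def gap_cost(n, d, e):
    """Affine gap cost: g(0) = 0, g(n) = d + e*(n - 1) for n >= 1."""
    return 0.0 if n == 0 else d + e * (n - 1)

def gap_kernel(gx, gy, beta, d, e):
    """K_g^beta on two (possibly empty) gapped segments gx, gy:
    exp(-beta * (g(|gx|) + g(|gy|))), so longer gaps contribute less."""
    return math.exp(-beta * (gap_cost(len(gx), d, e) + gap_cost(len(gy), d, e)))

# Toy two-letter substitution score (made up, not BLOSUM62):
subst = {"A": {"A": 4, "G": 0}, "G": {"A": 0, "G": 6}}
print(pair_kernel("A", "A", subst, beta=0.5))      # exp(0.5 * 4)
print(gap_kernel("AG", "", beta=0.5, d=11, e=1))   # exp(-0.5 * 12)
```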

  14. LA Kernels • Kernel convolution: (K_1 ⋆ K_2)(x, y) = Σ_{x = x₁x₂, y = y₁y₂} K_1(x₁, y₁) · K_2(x₂, y₂), summing over all ways of splitting x and y into a prefix and a suffix • For n ≥ 1, the n-th string kernel is K^(n) = K_0 ⋆ (K_a^β ⋆ K_g^β)^(n−1) ⋆ K_a^β ⋆ K_0, with K_0 ≡ 1: an initial part K_0, a succession of n aligned residues K_a^β separated by n − 1 possible gaps K_g^β, and a terminal part K_0 (see the brute-force sketch below)
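
The convolution can be written down directly as a brute-force sum over all splits; this is exponentially slower than the O(|x| · |y|) dynamic program mentioned on slide 22 and is only meant to make the definition concrete. The toy substitution score below is made up.

```python
import math

def convolve(K1, K2):
    """(K1 * K2)(x, y) = sum over splits x = x1 + x2, y = y1 + y2 of
    K1(x1, y1) * K2(x2, y2)."""
    def K(x, y):
        return sum(K1(x[:i], y[:j]) * K2(x[i:], y[j:])
                   for i in range(len(x) + 1)
                   for j in range(len(y) + 1))
    return K

# Toy building blocks: the constant kernel K_0 and a pair kernel that scores
# +2 for a match and -1 for a mismatch (beta = 0.5), zero unless both
# arguments are single residues.
K0 = lambda x, y: 1.0
Ka = lambda x, y: (math.exp(0.5 * (2.0 if x == y else -1.0))
                   if len(x) == 1 and len(y) == 1 else 0.0)

# n = 1 term of the slide: K^(1) = K_0 * K_a * K_0 sums exp(beta * s) over
# every single aligned residue pair of the two strings.
K_1 = convolve(convolve(K0, Ka), K0)
print(K_1("AG", "AV"))
```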

  15. LA Kernels • The local alignment kernel is the sum K_LA^β = Σ_{n ≥ 0} K^(n) • The sum converges for any x and y because only a finite number of terms are non-zero (K^(n)(x, y) = 0 as soon as n > min(|x|, |y|)) • K_LA^β is a point-wise limit of Mercer kernels, hence a valid (positive definite) kernel

  16. LA Kernel and the SW Score • π: a local alignment of x and y; Π(x, y): the set of all possible local alignments of x and y; p(x, y, π): the score of local alignment π • The LA kernel sums over all alignments: K_LA^β(x, y) = Σ_{π ∈ Π(x, y)} exp(β · p(x, y, π)) • The SW score keeps only the best alignment: SW(x, y) = max_{π ∈ Π(x, y)} p(x, y, π) • As β → ∞, (1/β) · ln K_LA^β(x, y) → SW(x, y) (illustrated below)
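
The β → ∞ statement is the usual soft-max-to-max relation, which a few lines of Python illustrate. The alignment scores below are hypothetical numbers, not computed from real sequences.

```python
import math

# Hypothetical local-alignment scores p(x, y, pi) for one fixed pair (x, y).
scores = [12.0, 9.5, 7.0, 3.2]
sw = max(scores)   # SW keeps only the best alignment

# (1/beta) * ln K_LA^beta = (1/beta) * ln sum_pi exp(beta * p) -> max(p)
for beta in (0.1, 0.5, 1.0, 5.0, 20.0):
    soft = math.log(sum(math.exp(beta * p) for p in scores)) / beta
    print(f"beta = {beta:5.1f}   (1/beta) ln K_LA = {soft:7.3f}   SW = {sw}")
```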

  17. Why SW Cannot Be a Kernel • 1. SW keeps only the best alignment (a max) instead of summing over all alignments of x and y • 2. The logarithm can destroy positive definiteness

  18. Example: LA kernel versus SW score [figure]

  19. SVM-pairwise versus LA kernel • SVM-pairwise: represent x and y explicitly by vectors of SW scores against a set of reference sequences, e.g. (0.2, 0.3, 0.1, 0.01) and (0.9, 0.05, 0.3, 0.2), and take their inner product (0.227) • LA kernel: compare x and y directly, in the spirit of a pair HMM (0.253 in the example), and feed the kernel values to the SVM (see the sketch below)
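
The contrast between the two pipelines can be sketched as follows. The feature vectors are the ones shown on the slide, while the 2×2 Gram matrix, the labels and the use of scikit-learn are placeholders for illustration, not the authors' toolchain.

```python
import numpy as np
from sklearn.svm import SVC

# (a) SVM-pairwise: each sequence is represented by an explicit vector of
#     SW scores against a reference set; the kernel is the inner product.
phi_x = np.array([0.2, 0.3, 0.1, 0.01])
phi_y = np.array([0.9, 0.05, 0.3, 0.2])
print(phi_x @ phi_y)                        # 0.227, as on the slide

# (b) LA kernel: K(x, y) is computed directly from the two sequences
#     (pair-HMM style); the SVM then consumes a precomputed Gram matrix.
gram = np.array([[1.0, 0.253],              # hypothetical 2x2 Gram matrix
                 [0.253, 1.0]])
labels = np.array([1, -1])
clf = SVC(kernel="precomputed").fit(gram, labels)
print(clf.predict(gram))
```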

  20. Diagonal Dominance Issue • K(x, x) is easily orders of magnitude larger than K(x, y) even for sequences similar to x, which biases the performance of the SVM

  21. Diagonal Dominance Issue • To counter diagonal dominance, kernel values are first passed through (1/β) · ln, which may break positive definiteness; two remedies are used: • (1) The eigen kernel LA-eig: (a) if the training Gram matrix has negative eigenvalues, subtract the smallest (most negative) eigenvalue from the diagonal; (b) LA-eig is then equal to the original kernel except possibly on the diagonal (see the sketch below) • (2) The empirical kernel map LA-ekm
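
A sketch of the LA-eig shift, assuming the training Gram matrix is available as a NumPy array; this is illustrative, not the authors' code.

```python
import numpy as np

def la_eig(gram):
    """If the symmetric Gram matrix has negative eigenvalues, subtract the
    smallest (most negative) one from the diagonal; off-diagonal entries are
    untouched, and the shifted matrix is positive semi-definite."""
    lam_min = np.linalg.eigvalsh(gram).min()
    if lam_min < 0:
        gram = gram - lam_min * np.eye(gram.shape[0])
    return gram

# Toy symmetric matrix with a negative eigenvalue:
K = np.array([[1.0, 2.0],
              [2.0, 1.0]])              # eigenvalues 3 and -1
print(np.linalg.eigvalsh(la_eig(K)))    # shifted spectrum: [0., 4.]
```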

  22. Methods • Implementation: the kernel (and therefore its log-transformed variant) is computed with complexity O(|x| · |y|) by dynamic programming, using a slight modification of the SW algorithm • Normalization (see the sketch below) • Dataset: 4,352 sequences extracted from the Astral database (www.cs.columbia.edu/compbio/svmpairwise), grouped into families and superfamilies
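
For normalization, one standard choice is to rescale so that every sequence has unit self-similarity; whether this matches the exact normalization used in the paper is an assumption. A minimal sketch:

```python
import numpy as np

def normalize_gram(K):
    """K_norm(x, y) = K(x, y) / sqrt(K(x, x) * K(y, y)); the diagonal of the
    result is all ones."""
    d = np.sqrt(np.diag(K))
    return K / np.outer(d, d)

K = np.array([[4.0, 1.0],
              [1.0, 9.0]])
print(normalize_gram(K))   # [[1.    0.166...], [0.166... 1.]]
```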

  23. ROC Curve

  24. ROC Curve

  25. Summary of the kernels
