
Using the Fisher kernel method to detect remote protein homologies

Using the Fisher kernel method to detect remote protein homologies. Tommi Jaakkola, Mark Diekhans, David Haussler. ISMB ’99. Talk by O, Jangmin (2001/01/16). Abstract: detecting remote protein homologies; the Fisher kernel method; a variant of Support Vector Machines using a new kernel function.


Presentation Transcript


  1. Using the Fisher kernel method to detect remote protein homologies • Tommi Jaakkola, Mark Diekhans, David Haussler • ISMB ’99 • Talk by O, Jangmin (2001/01/16)

  2. Abstract • Detecting remote protein homologies • Fisher kernel method • Variant of Support Vector Machines using a new kernel function • Kernel derived from Hidden Markov Models

  3. Introduction (1) • Detecting protein homologies (sequence-based algorithms) • BLAST, FASTA, PROBE, templates, profiles, position-specific weight matrices, HMMs • Comparison by (Brenner 1996; Park et al. 1998) • SCOP classification of protein structures • Remote protein homologies exist between protein domains in the same structural superfamily. • Statistical models like PSI-BLAST and HMMs are better than simple pairwise comparison methods.

  4. Introduction (2) • Generative statistical models (HMMs) • Extract features from protein sequences • Map all protein sequences to points in a Euclidean feature space of fixed dimension. • A general discriminative statistical method classifies the points. • Improvement obtained • Over HMMs alone.

  5. Methods • How generative models (HMMs) work • Training examples (sequences known to be members of the protein family): positive • Parameters are tuned with a priori knowledge • The model assigns a probability to any given protein sequence. • Sequences from the family should yield a higher probability than sequences from outside the family. • Log-likelihood ratio as score
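
The scoring step on this slide can be illustrated with a toy example. This is a minimal sketch, assuming a position-independent residue model as a stand-in for a real HMM; the function names, the two-letter alphabet, and the probabilities are all hypothetical.

```python
import math

def log_likelihood(seq, probs):
    """Log-probability of seq under an i.i.d. residue model (toy stand-in for an HMM)."""
    return sum(math.log(probs[residue]) for residue in seq)

def log_odds_score(seq, family_probs, null_probs):
    """Log-likelihood ratio of the family model against a null model.

    Positive values mean the family model explains the sequence better.
    """
    return log_likelihood(seq, family_probs) - log_likelihood(seq, null_probs)

# Hypothetical models: the family model prefers residue 'A', the null model is uniform.
family = {"A": 0.8, "B": 0.2}
null = {"A": 0.5, "B": 0.5}

member_score = log_odds_score("AAAB", family, null)
outsider_score = log_odds_score("BBBA", family, null)
```

With these made-up numbers the A-rich sequence gets a positive score and the B-rich one a negative score, which is the behavior the slide describes for family members versus nonmembers.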

  6. Discriminative approaches • Use both positive and negative examples • Parameters are tuned so that the model optimally discriminates members of the family from nonmembers. • When training examples are few • The likelihood ratio is optimal if the generative model fits the data perfectly, but… • Discriminative methods often perform better.

  7. Kernel methods • Discriminant function L(X) • Where {Xi, i = 1,…,n} are the training examples and H1, H2 the hypothesis classes • + : sequence in the family, - : sequence outside the family • Contribution of the kernel • λi : overall importance of the example Xi. • Measure of pairwise similarity : K(Xi, X) • The user supplies the type of kernel for the application area!!
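
The slide's formula for L(X) did not survive extraction. The sketch below assumes the usual kernel expansion, in which L(X) is a signed, weighted sum of similarities K(Xi, X) to the training examples; the names `dot` and `discriminant`, the plain dot-product kernel, and the example vectors are illustrative assumptions, not the talk's own choices.

```python
def dot(u, v):
    """Plain dot-product kernel, used here only as a placeholder similarity."""
    return sum(a * b for a, b in zip(u, v))

def discriminant(x, examples, labels, weights, kernel=dot):
    """Kernel discriminant: sum over i of weight_i * label_i * K(X_i, x).

    labels are +1 (in the family) or -1 (outside the family);
    weights play the role of the per-example importance.
    """
    return sum(w * y * kernel(xi, x) for xi, y, w in zip(examples, labels, weights))

# Hypothetical 2-D feature vectors: one positive and one negative training example.
examples = [(1.0, 0.0), (0.0, 1.0)]
labels = [+1, -1]
weights = [1.0, 1.0]

score_member = discriminant((0.9, 0.2), examples, labels, weights)
score_outsider = discriminant((0.1, 0.8), examples, labels, weights)
```

The sign of the result gives the predicted class. Swapping in a different `kernel` changes the notion of similarity without touching the rest of the function, which is the point of the slide's last bullet.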

  8. The Fisher kernel (1) • Deriving the kernel function from generative models • Advantage 1 : handles variable-length protein sequences!! • Advantage 2 : encodes prior knowledge about protein sequences • Difference from HMMs • The kernel function specifies a similarity score for any pair of sequences. • The likelihood score from an HMM only measures the closeness of the sequence to the model itself.

  9. The Fisher kernel (2) • Sufficient statistics • For each parameter in the HMM : posterior frequencies • Of a particular transition. • Of generating one of the residues of the query sequence. • Reflect the process of generating the query sequence from the HMM. • Alternative to sufficient statistics : the Fisher score • Magnitude of the components : how much each parameter contributes to generating the query sequence.
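
The Fisher score is the gradient of the log-likelihood of a sequence with respect to the model parameters. A minimal sketch, assuming a toy i.i.d. residue model instead of an HMM (so the gradient has a closed form) and ignoring the normalization constraint on the probabilities; `fisher_score` and the three-letter alphabet are hypothetical.

```python
from collections import Counter

def fisher_score(seq, probs):
    """Gradient of log P(seq | probs) with respect to each residue probability.

    For an i.i.d. model, log P = sum_a n_a * log(theta_a),
    so the component for residue a is n_a / theta_a.
    Components are ordered by sorted residue name.
    """
    counts = Counter(seq)
    return [counts[a] / probs[a] for a in sorted(probs)]

# Hypothetical model over a three-letter alphabet.
theta = {"A": 0.5, "B": 0.25, "C": 0.25}

short_vec = fisher_score("AAB", theta)
long_vec = fisher_score("ABCABCAA", theta)
```

Both sequences map to vectors of the same length (one component per parameter) even though the sequences differ in length, which is what lets a standard kernel be applied afterwards.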

  10. The Fisher kernel (3) • Kernel function used in this paper. Note that the Fisher score is a fixed-length vector. • Summary • Train an HMM with positive examples. • Map each new protein sequence X into a fixed-length vector, the Fisher score. • Calculate the kernel function. • Obtain the resulting discriminant function (SVM-Fisher).
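
The kernel formula on this slide was lost in extraction. The sketch below assumes a Gaussian kernel on Fisher score vectors, with the width set from the median distance between each positive example and its nearest negative example; both the kernel form and the width heuristic are assumptions here, and the function names and score vectors are hypothetical.

```python
import math
from statistics import median

def sq_dist(u, v):
    """Squared Euclidean distance between two score vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def gaussian_kernel(u, v, sigma):
    """exp(-||u - v||^2 / (2 * sigma^2)) on fixed-length score vectors."""
    return math.exp(-sq_dist(u, v) / (2.0 * sigma ** 2))

def median_nearest_negative_distance(positives, negatives):
    """Median, over positives, of the distance to the closest negative."""
    return median(
        min(math.sqrt(sq_dist(p, n)) for n in negatives) for p in positives
    )

# Hypothetical Fisher score vectors for positive and negative training sequences.
positives = [[0.0, 0.0], [0.0, 2.0]]
negatives = [[3.0, 0.0], [0.0, 6.0]]

sigma = median_nearest_negative_distance(positives, negatives)
k_same_class = gaussian_kernel(positives[0], positives[1], sigma)
k_cross_class = gaussian_kernel(positives[0], negatives[0], sigma)
```

In this toy setup the two positives are closer to each other than to the nearest negative, so `k_same_class` exceeds `k_cross_class`. The kernel values would then feed a kernel discriminant trained on both positive and negative examples.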

  11. The Fisher kernel (4) • Combination of scores • There may be more than one HMM for the family or superfamily of interest. • Average score • Maximum score
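
The two combination rules listed on this slide can be sketched as follows. This is an illustration only; `combine_scores` and the example scores are hypothetical, and scores are assumed to be "higher is better".

```python
def combine_scores(scores, method="max"):
    """Combine one query's scores from several models into a single score.

    method="max" keeps the best score; method="average" takes the mean.
    """
    if not scores:
        raise ValueError("need at least one score")
    if method == "max":
        return max(scores)
    if method == "average":
        return sum(scores) / len(scores)
    raise ValueError(f"unknown method: {method}")

# Hypothetical scores for one query sequence from three models of the same superfamily.
per_model_scores = [0.2, 1.5, -0.3]

best = combine_scores(per_model_scores, "max")
mean = combine_scores(per_model_scores, "average")
```

The maximum rewards a query that matches any single model well, while the average is pulled down by models that do not match.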

  12. Experimental Methods • Methods • SVM-Fisher (this paper) • BLAST (Altschul et al. 1990; Gish & States 1993) • HMMs using the SAM-T98 methodology (Park et al. 1998; Karplus, Barrett, & Hughey 1998; Hughey & Krogh 1995; 1996) • Measurement of recognition rate for members of superfamilies of the SCOP protein structure classification (Hubbard et al. 1997) • Withhold all members of a SCOP family • Train with the remaining members of the SCOP superfamily • Test with the withheld data • Question: “Could the method discover a new family of a known superfamily?”

  13. Overview of experiments • Database • SCOP version 1.37 PDB90 : consists of protein domains, no two of which have 90% or more residue identity • PDB90 eliminates redundant sequences. • Generative models • SAM-T98 HMMs • Data selection • 33 test families from 16 superfamilies. • Evaluation strategy • Assess to what extent each method gave better scores to the positive test examples than to the negative test examples.

  14. SCOP: a Structural Classification of Proteins database • Hierarchical levels • Family: proteins clustered by common evolutionary origin: residue identities above 30%, or lower sequence identities but very similar functions and structures • Superfamily: low sequence identities but probably a common evolutionary origin • Fold: same major secondary structures in the same arrangement and with the same topological connections

  15. Figure 1: Separation of the SCOP PDB90 database into training and test sequences, shown for the G proteins test family

  16. Multiple models used • Modeling a superfamily • SAM-T98 : starts with a single sequence (the guide sequence for the domain) and builds a model • Too many sequences! • Use a subset of PDB90. • Train the SVM-Fisher method using each of the models in turn

  17. Details on the training and test sets • All PDB90 sequences outside the fold of the test family were used as either negative training or negative test examples. • Reverse the test/training allocation of negative examples and repeat the experiments. • Negative examples are split on a fold-by-fold basis. • For positive examples • PDB90 sequences in the superfamily of the test family are used. • Homologs found by each individual SAM-T98 model are used.

  18. BLAST methods • WU-BLAST version 2.0a16 (Altschul & Gish 1996) • The PDB90 database was queried with each positive training example, and E-values were recorded. • BLAST:SCOP-only • BLAST:SCOP+SAM-T98-homologs • Scores were combined by the maximum method

  19. Generative HMM models • SAM-T98 method • Null model: reverse sequence model • Same data and same set of models as in SVM-Fisher • Scores combined by the maximum method

  20. Results • Metric : the rate of false positives (RFP) • RFP for a positive test sequence : the fraction of negative test sequences that score as well as or better than the positive sequence.
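
The RFP definition above translates directly into code. A minimal sketch, assuming higher scores are better (for E-value based methods such as BLAST the comparison would be reversed); the function name and the scores are hypothetical.

```python
from statistics import median

def rate_of_false_positives(positive_score, negative_scores):
    """Fraction of negative test sequences scoring as well as or better than the positive."""
    hits = sum(1 for s in negative_scores if s >= positive_score)
    return hits / len(negative_scores)

# Hypothetical scores for four negative and three positive test sequences.
negatives = [0.1, 0.5, 0.9, 1.2]
positives = [2.0, 0.9, 0.3]

rfps = [rate_of_false_positives(p, negatives) for p in positives]
median_rfp = median(rfps)
max_rfp = max(rfps)
```

An RFP of 0 means the positive sequence outscored every negative one. The median and maximum over a family's positive test sequences summarize typical and worst-case performance.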

  21. G-proteins • Results for the G proteins family of the nucleotide triphosphate hydrolases SCOP superfamily • Tests the ability to distinguish 8 PDB90 G proteins from 2439 sequences in other SCOP folds. • Table 1 • In SVM-Fisher • 5 of the 8 G proteins score better than all 2439 negative test sequences. • Maximum RFP • Median RFP • Figure 2 • RFP curve

  22. Table 1. Rate of false positives for the G proteins family. BLAST = BLAST:SCOP-only, B-Hom = BLAST:SCOP+SAM-T98-homologs, S-T98 = SAM-T98, and SVM-F = SVM-Fisher method

  23. Figure 2: 4 methods on the 33 test families. Curve of median RFP

  24. Discussion • New approach to recognition of remote protein homologies • A discriminative method built on top of a generative model (HMMs) • Significant improvement over HMM methods alone • The combination of multiple scores could be improved. • Allocation problem • Use one training set for tuning the HMM and a different one for the discriminative model • Extend the method to identify multiple domains within large protein sequences
