CISC 841 Bioinformatics Combining HMMs with SVMs

Presentation Transcript


  1. CISC 841 Bioinformatics Combining HMMs with SVMs Li Liao, CISC841, F07

  2. HMM gradients • Fisher score: $\langle X \rangle = \nabla_\theta \log P(X \mid H, \theta)$, where $\theta$ denotes the parameters of the HMM $H$. • The gradient of the log-likelihood of a sequence X with respect to the parameters of a given model is computed using the forward-backward algorithm. • Each dimension corresponds to one parameter of the model. • The feature space is tailored to the sequences from which the model was trained.
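
As a rough illustration of the gradient computation on this slide, the sketch below runs an unscaled forward-backward pass on a small discrete HMM and returns the Fisher score with respect to the emission parameters only. It assumes the emission-gradient form U_ij = E_j(i)/e_j(i) - sum_k E_j(k) usually attributed to Jaakkola, Diekhans and Haussler, where E_j(i) is the expected number of times state j emits symbol i. The function names and the toy two-state model are hypothetical and are not taken from the course material.

```python
def forward_backward(obs, init, trans, emit):
    """Unscaled forward-backward pass for a discrete HMM (short sequences only)."""
    n_states, length = len(init), len(obs)
    alpha = [[0.0] * n_states for _ in range(length)]
    beta = [[1.0] * n_states for _ in range(length)]
    for j in range(n_states):
        alpha[0][j] = init[j] * emit[j][obs[0]]
    for t in range(1, length):
        for j in range(n_states):
            incoming = sum(alpha[t - 1][i] * trans[i][j] for i in range(n_states))
            alpha[t][j] = incoming * emit[j][obs[t]]
    for t in range(length - 2, -1, -1):
        for i in range(n_states):
            beta[t][i] = sum(
                trans[i][j] * emit[j][obs[t + 1]] * beta[t + 1][j]
                for j in range(n_states)
            )
    likelihood = sum(alpha[length - 1])
    return alpha, beta, likelihood


def fisher_score_emissions(obs, init, trans, emit):
    """Fisher score w.r.t. emission parameters, flattened state by state."""
    alpha, beta, likelihood = forward_backward(obs, init, trans, emit)
    n_states, n_symbols = len(emit), len(emit[0])
    # expected[j][i]: expected number of times state j emits symbol i
    expected = [[0.0] * n_symbols for _ in range(n_states)]
    for t, symbol in enumerate(obs):
        for j in range(n_states):
            expected[j][symbol] += alpha[t][j] * beta[t][j] / likelihood
    score = []
    for j in range(n_states):
        total = sum(expected[j])  # expected number of visits to state j
        for i in range(n_symbols):
            # Assumed form: U_ij = E_j(i) / e_j(i) - sum_k E_j(k)
            score.append(expected[j][i] / emit[j][i] - total)
    return score


# Toy model (hypothetical numbers): 2 states, 2 symbols.
init = [0.6, 0.4]
trans = [[0.7, 0.3], [0.4, 0.6]]
emit = [[0.9, 0.1], [0.2, 0.8]]
u = fisher_score_emissions([0, 1, 1, 0], init, trans, emit)
print(len(u))  # 4: one dimension per emission parameter (2 states x 2 symbols)
```

A real profile HMM would need scaled or log-space recursions to avoid underflow on protein-length sequences; the unscaled version is kept here only for readability.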

  3. SVM-Fisher discrimination • A probabilistic hidden Markov model with parameters $\theta$ is trained from some example sequences $x_1, x_2, x_3, \ldots, x_N$. • Usually the model probability $P(x_i \mid \theta)$ (or a function of $P(x_i \mid \theta)$) is used as a measure of sequence-model membership, and a threshold on this measure decides membership. • The Fisher vector is the vector of gradients of $P(x_i \mid \theta)$ (or of a function of $P(x_i \mid \theta)$, typically its logarithm) with respect to the parameters of the model: $U_{x_i} = \nabla_\theta P(x_i \mid \theta)$. • One can take the training example sequences (positive set) and other sequences that are known to be non-members (negative set), and transform them into Fisher vectors. • A support vector machine (SVM) can be trained on the positive and negative Fisher vectors and then used to classify other sequences.
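
To make the last two bullets concrete, here is a minimal sketch that trains a classifier on labeled vectors and applies it to new ones. It swaps in a plain linear SVM trained by subgradient descent on the regularized hinge loss, written in pure Python, rather than the kernel SVM packages used in practice. The two-dimensional toy vectors only stand in for Fisher vectors, and the names `train_linear_svm` and `classify` as well as all hyperparameter values are assumptions.

```python
def train_linear_svm(vectors, labels, lam=0.01, eta=0.1, epochs=100):
    """Subgradient descent on lam/2 * |w|^2 + hinge loss; labels are +1 or -1."""
    w = [0.0] * len(vectors[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(vectors, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            # The regularizer shrinks the weights on every step.
            w = [wi * (1.0 - eta * lam) for wi in w]
            if margin < 1.0:
                # The example violates the margin: move the boundary toward it.
                w = [wi + eta * y * xi for wi, xi in zip(w, x)]
                b += eta * y
    return w, b


def classify(w, b, x):
    """Return +1 (member) or -1 (non-member) for one vector."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1


# Toy stand-ins for Fisher vectors of the positive and negative sets.
positive = [[2.0, 2.0], [3.0, 2.0], [2.0, 3.0]]
negative = [[-2.0, -2.0], [-3.0, -2.0], [-2.0, -3.0]]
w, b = train_linear_svm(positive + negative, [1] * 3 + [-1] * 3)
print(classify(w, b, [2.5, 2.0]), classify(w, b, [-2.0, -2.5]))
```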

  4. Application: Protein remote homology detection

  5. SVM-Pairwise method [Figure: workflow in three numbered steps. Labels: positive training set (protein homologs) and negative training set (protein non-homologs); positive and negative pairwise score vectors; support vector machine; testing data (target protein of unknown function); binary classification.]

  6. Experiment: known protein families (Jaakkola, Diekhans and Haussler 1999)

  7. Sample family sizes

  8. A measure of sensitivity and specificity • ROC: the receiver operating characteristic score is the normalized area under a curve that plots true positives as a function of false positives. [Figure: three example panels labeled ROC = 1, ROC = 0.67 and ROC = 0.]
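
The ROC score defined on this slide can be computed directly from a ranked list. The sketch below sorts examples by classifier score, accumulates the area under the true-positive versus false-positive step curve, and normalizes by the number of positive-negative pairs. The function name `roc_score` is hypothetical, and the code assumes there are no tied scores.

```python
def roc_score(scores, labels):
    """Normalized area under the true-positives vs. false-positives curve.

    labels: 1 marks a positive example; any other value marks a negative.
    Assumes all scores are distinct (ties are not averaged).
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    n_pos = sum(1 for _, label in ranked if label == 1)
    n_neg = len(ranked) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative example")
    true_positives = 0
    area = 0
    for _, label in ranked:
        if label == 1:
            true_positives += 1
        else:
            # Each false positive adds a column as tall as the positives seen so far.
            area += true_positives
    return area / (n_pos * n_neg)


# Perfect ranking: every positive is scored above every negative.
print(roc_score([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0]))  # 1.0
# Worst ranking: every negative is scored above every positive.
print(roc_score([0.9, 0.8, 0.2, 0.1], [0, 0, 1, 1]))  # 0.0
```

A score of 1 means all members are ranked ahead of all non-members, and a score of 0 means the reverse; a random ranking gives about 0.5.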

  9. Application: Discriminating signal peptide from transmembrane proteins

  10. Feature selection • We expect gradients w.r.t. transition parameters to be better discrimination features. • We look for those transitions that are used differently by TM proteins and SP proteins: - transform each signal peptide sequence (1275 in total) into a Fisher vector w.r.t. the transition parameters and find the resultant vector; - transform each TM sequence into a Fisher vector w.r.t. the transition parameters and find the resultant vector; - compare the two resultant vectors. [Figure labels: SignalP, TM protein.]
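
The comparison of resultant vectors could look roughly like the sketch below. It sums the Fisher vectors of each class component by component and ranks the transition parameters by how much the two classes differ. Dividing each resultant by its class size is an assumption added here so that the larger SP set does not dominate; the slide does not say how the two resultants were compared. The function names and the toy vectors are hypothetical.

```python
def resultant(vectors):
    """Component-wise sum of a list of equal-length vectors."""
    return [sum(column) for column in zip(*vectors)]


def rank_transitions(sp_vectors, tm_vectors):
    """Rank transition parameters by the gap between per-class mean gradients."""
    # Assumption: normalize each resultant by class size before comparing.
    sp_mean = [value / len(sp_vectors) for value in resultant(sp_vectors)]
    tm_mean = [value / len(tm_vectors) for value in resultant(tm_vectors)]
    gaps = [(index, s - t) for index, (s, t) in enumerate(zip(sp_mean, tm_mean))]
    # Largest absolute gap first: these transitions are used most differently.
    return sorted(gaps, key=lambda item: abs(item[1]), reverse=True)


# Toy Fisher vectors over three transition parameters.
sp_vectors = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
tm_vectors = [[0.0, 0.0, 2.0]]
print(rank_transitions(sp_vectors, tm_vectors)[0])  # (0, 2.0)
```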

  11. Gradients of P(s|x) In pattern recognition problems, we are interested in $P(s \mid x, \theta)$ rather than $P(x \mid \theta)$: $U_{s|x} = \nabla_\theta \log P(s \mid x, \theta) = \nabla_\theta \log P(s, x \mid \theta) - \nabla_\theta \log P(x \mid \theta)$
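
The decomposition on this slide follows from the definition of conditional probability. Written out step by step, with $s$ read as the label (or state path) assigned to sequence $x$, which is an interpretation the slide does not spell out:

```latex
\begin{align}
P(s \mid x, \theta) &= \frac{P(s, x \mid \theta)}{P(x \mid \theta)} \\
\log P(s \mid x, \theta) &= \log P(s, x \mid \theta) - \log P(x \mid \theta) \\
U_{s|x} = \nabla_\theta \log P(s \mid x, \theta)
  &= \nabla_\theta \log P(s, x \mid \theta) - \nabla_\theta \log P(x \mid \theta)
\end{align}
```

The second term is the ordinary Fisher score of $x$, so the new feature vector is a difference of two gradients with respect to the same parameters.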

  12. Classification experiment • 10-fold cross-validation experiment using - positive set (247 TM proteins) - negative set (1275 signal-peptide-containing proteins) • The SVM-light package is used. [Figure: TMMOD maps each sequence $x$ to a vector $U_{s|x}$; subsets of the 247 TM proteins and of the 1275 SP proteins feed SVM Learn, and the resulting SVM classifier labels the held-out sequences.]
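
A possible way to prepare such an experiment is sketched below: examples are dealt into 10 folds class by class, and each vector is written as one line in the SVM-light input format (`<target> <index>:<value> ...` with 1-based feature indices). Stratifying the folds by class is an assumption, since the slide only says "10-fold", and the helper names are hypothetical. In practice the examples would be shuffled before being dealt into folds.

```python
def stratified_folds(labels, k=10):
    """Deal example indices into k folds, round-robin within each class."""
    folds = [[] for _ in range(k)]
    seen_per_class = {}
    for index, label in enumerate(labels):
        count = seen_per_class.get(label, 0)
        folds[count % k].append(index)
        seen_per_class[label] = count + 1
    return folds


def to_svmlight_line(label, vector):
    """Format one example for SVM-light; zero-valued features are omitted."""
    features = " ".join(
        f"{index}:{value:.6g}"
        for index, value in enumerate(vector, start=1)
        if value != 0.0
    )
    return f"{label:+d} {features}".rstrip()


# 247 TM proteins (positive) and 1275 SP proteins (negative), as on the slide.
labels = [1] * 247 + [-1] * 1275
folds = stratified_folds(labels, k=10)
print(len(folds), len(folds[0]))
print(to_svmlight_line(1, [0.5, 0.0, -2.0]))  # +1 1:0.5 3:-2
```

For each fold, the other nine folds would be written to a training file and the held-out fold to a test file before running the SVM-light tools.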

  13. Discrimination results • A third (68) of the SP proteins that had been incorrectly classified as TM proteins are now identified correctly.

  14. Application: Protein-Protein Interaction Prediction

  15. Interaction Profile Hidden Markov Model (ipHMM), Friedrich et al. (2006)

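  16. Knowledge transfer • Build an ipHMM from proteins whose structural information is available. • Align the sequences of proteins whose structural information is not available to the model. • Likelihood score vector: $\langle LS_{a_i,A}, LS_{a_i,B}, LS_{b_j,A}, LS_{b_j,B} \rangle$ • Fisher score vector: $U(x) = \nabla_\theta \log P(x \mid \theta)$, with components $U_{ij} = E_j(i) / e_j(i) - \sum_k E_j(k)$, where $e_j(i)$ is the emission probability of symbol $i$ in state $j$ and $E_j(i)$ is the expected number of times state $j$ emits symbol $i$.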

  19. Data set • Friedrich et al. (2006): 2018 proteins in 36 domain families

  20. Conclusions • Structural information at binding sites enhances protein-protein interaction prediction. • The interaction profile HMM can transfer structural information. • Fisher scores extracted from domain profiles further enhance protein-protein interaction prediction for proteins with no available structural information.