1 / 33

Profile HMMs

Profile HMMs. Biology 162 Computational Genetics Todd Vision 16 Sep 2004. Outline. Profile HMMs generate MSAs States and transitions for Matches, Insertions, Deletions, Silent and Flanking states Statistics Null model, E values Training

anevay
Télécharger la présentation

Profile HMMs

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Profile HMMs Biology 162 Computational Genetics Todd Vision 16 Sep 2004

  2. Outline • Profile HMMs generate MSAs • States and transitions for • Matches, Insertions, Deletions, Silent and Flanking states • Statistics • Null model, E values • Training • Model construction, Weighting training sequences and including pseudocounts (which have a Bayesian interpretation) • Existing tools • Interpro, including Pfam and HMMER

  3. Globins Helix 1111111222222222222222222222 3333333333333 HBA_HUMAN -DLS-----HGSAQVKGHGKKVADALTNAVAHV---D--DMPNALSALSDLHAHKL- HBB_HUMAN GDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHL---D--NLKGTFATLSELHCDKL- MYG_PHYCA KHLKTEAEMKASEDLKKHGVTVLTALGAILKK----K-GHHEAELKPLAQSHATKH- LGB2_LUPLU LK-GTSEVPQNNPELQAHAGKVFKLVYEAAIQLQVTGVVVTDATLKNLGSVHVSKG- Consensus .l.t ... .kHg.kV. a. ..... . ..l. L. .H. K.

  4. Hidden Markov models • Observed sequence of symbols • Hidden sequence of underlying states • Transition probabilities govern transitions among states • Emission probabilities govern the likelihood of observing a symbol in a particular state

  5. Profile HMMs • Use scores rather than emission probabilities directly

  6. A PSSM as a simple HMM begin M1 Mi Mi+1 ML end … … With emission probabilities unique to each match state

  7. begin M1 Mi ML end … … But what about gaps? • Ignore them (BLOCKS database) OR • Model them • Insert states have background emission probabilities Ii

  8. Gap scores • For an insert of length k with background emission probabilities, we have affine gap scores

  9. Length distribution of inserts • Geometric distribution I aII aIM aMI

  10. Which columns are match states? Helix 1111111222222222222222222222 3333333333333 HBA_HUMAN -DLS-----HGSAQVKGHGKKVADALTNAVAHV---D--DMPNALSALSDLHAHKL- HBB_HUMAN GDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHL---D--NLKGTFATLSELHCDKL- MYG_PHYCA KHLKTEAEMKASEDLKKHGVTVLTALGAILKK----K-GHHEAELKPLAQSHATKH- LGB2_LUPLU LK-GTSEVPQNNPELQAHAGKVFKLVYEAAIQLQVTGVVVTDATLKNLGSVHVSKG- Consensus .l.t ... .kHg.kV. a. ..... . ..l. L. .H. K. • Options • Assign columns to be match states by eye • Heuristic i.e. no more than 50% gaps per column • Maximum a posteriori (MAP) model construction • O(L2) dynamic programming algorithm exists to find model that optimizes score on training data

  11. Two ways to handle deletions • Transitions between match states • Silent deletion states (no emission) begin M1 Mi ML end … … D2 D3 D1 begin M1 M2 M3 end

  12. Profile HMM

  13. Flanking states • Many sites in a sequence may be assigned to 'flanking' states (N, C, or J) • Transitions should force one or more match states to be traversed at least once

  14. Local or global alignment? • Are transitions allowed • From start to internal match? • From internal match to end? • Are there states that can emit sequences before and after the profile? • Do transitions allow the profile to be repeated? • In HMMs • Global/local behavior governed by model not algorithm • Behavior may differ w.r.t. the profile and the sequence

  15. Null model N S T

  16. Extreme value problems • How to convert Sbit to an expect value? • Since alignment is not truly local, theory used for BLAST does not hold here • Solutions (both available in HMMER) • Conservative approximation valid for any profile • Empirically fit extreme value distribution using simulated sequences • Must be done once for every profile HMM

  17. Why not always use full model? • The sum of probabilities is constrained to be one • Spreading probability among many paths decreases power to discriminate among them • You should always choose the most restrictive model (fewest transitions) consistent with your purpose

  18. Which algorithm to use? • Three choices • Viterbi: maximum likelihood path • Forward: sum of probabilities of all possible paths • Forward-backward: prob of each state at each pos • For database search • Query sequence against a database of profile HMMs • Profile HMM against a database of sequences? • For alignment • Adding new sequence(s) to an existing alignment

  19. Training a profile HMM • Weighting training sequences • We saw the same problem when scoring multiple alignments • Same approaches are used for profile HMMs • Estimating transition probabilities • Taken care of by MAP model construction • Estimating emission probabilities • We will assume the alignment is correct • Only issue is how to add pseudocounts

  20. Better pseudocounts • Laplace's Rule • Ignores background frequencies of residues • Background frequency pseudocounts

  21. Pseudocounts as Bayesian priors • Bayes' rule Likelihood Prior Posterior

  22. Dirichlet mixture pseudocounts • Background probabilities are not uniform throughout the protein • eg exposed loops (hydrophilic residues abundant) vs. buried core (small side chains abundant) • Different sets of pseudocount priors (Dirichlet distributions) for each environment • Pseudocounts for Ii are determined by a mixture of Dirichlet distributions fit to position i

  23. Evolutionary pseudocounts • Related to phylogenetic methods we will see later • Calculate probability of each residue having been the common ancestor of the residues in a column • Calculate probability of each residue as a descendent • Use these probabilities as priors with appropriate weighting • Requires use of a position-independent scoring matrix (eg PAM)

  24. Queries vs. subjects • Two directions of search are possible • Sequence query against database of profile HMMs • Profile HMM against a database of sequences • Bit scores will be the same regardless • But E-values will differ • Search space (ie number of subjects in database) can differ considerably • It is usually more sensitive to search a database of profile HMMs

  25. Interpro • Regular Expressions • PROSITE • PSSMs, other motifs • PROSITE, PRINTS, PRODOM • Profile HMMs • Pfam • SMART • TIGRFAMs • PIR SuperFamily • SUPERFAMILY

  26. Interpro v8

  27. Pfam • A profile HMM database • Based on Swissprot and TREMBL • Current version (v15) has 7503 families. • ~75% of all new protein sequences match an existing Pfam profile • Profiles constructed semi-automatically • New families identified • Seed alignment manually optimized • Profile HMM constructed • All matching sequences aligned to HMM

  28. HMMER • Used in construction of Pfam • Can build a profile (with MAP algorithm) • Can search a sequence against a profile and vice versa (i.e. with forward algorithm) • Can add new sequences to an alignment (via Viterbi) • Uses Plan 7 profiles • User sets the local/global behavior

  29. HMMER2.0 [2.3.1] NAME fn3 ACC PF00041.8 DESC Fibronectin type III domain LENG 84 ALPH Amino RF no CS no MAP yes COM hmmbuild -F HMM_ls.ann SEED.ann COM hmmcalibrate --seed 0 HMM_ls.ann NSEQ 108 DATE Mon Jul 26 14:10:07 2004 CKSUM 1153 GA 8.0 -1.0 TC 8.0 0.0 NC 7.9 7.9 XT -8455 -4 -1000 -1000 -8455 -4 -8455 -4 NULT -4 -8455 NULE 595 -1558 85 338 -294 453 -1158 197 249 902 -1085 -142 -21 -313 45 531 201 384 -1998 -644 EVD -45.006527 0.260185 HMM A C D E F G H I K L M N P Q R S T V W Y m->m m->i m->d i->m i->i d->m d->d b->m m->e -13 * -6758 1 -1698 -4236 -5400 -853 -4220 -2885 -1258 -930 -2438 408 -3428 -4769 3631 -1835 -4774 -1187 -1320 -120 -4666 -1510 1 - -150 -501 232 46 -382 399 104 -628 211 -461 -722 274 395 44 95 358 118 -368 -296 -251 - -144 -3402 -12951 -19 -6284 -701 -1378 -13 * 2 -613 -5389 1601 -868 -5707 553 -3558 -5456 -3139 -894 -4479 -526 1872 -1543 -2014 1940 -582 -1300 -5575 -1474 3 - -149 -500 233 43 -381 399 106 -626 210 -466 -720 275 394 45 96 359 117 -369 -294 -249

  30. Summary • Profile HMMs generate MSAs • States and transitions for • Matches, Insertions (which can model affine gaps), Deletions, which allow local alignment, Silent states and flanking states • Statistics • Scored relative to a null model and E values must be determined empirically • Training • MAP model construction, Training sequence weighting and pseudocounts (which have a Bayesian interpretation) • Existing tools • Interpro, including Pfam and HMMER

  31. Assignment • Look over study guide • posted on Blackboard • Turn in lab/problem set on Tuesday • Midterm on Thursday

More Related