Hidden Markov Models
Sasha Tkachev and Ed Anderson
Presenter: Sasha Tkachev
Forward algorithm
• We want to find P(sequence | HMM)
• Naïve way: sum the probabilities of all possible state paths; the number of paths grows exponentially with sequence length
• Recursion makes this far more efficient: the probability of being in the cloudy state at t=2 depends only on the state probabilities at t=1 and the observation at t=2
• At the final time step (t=3 in the example), P(sequence | HMM) is simply the sum of the probabilities of being sunny, cloudy or rainy at t=3
http://www.comp.leeds.ac.uk/roger/HiddenMarkovModels/html_dev/forward_algorithm/s1_pg1.html
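The recursion on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the code behind the linked tutorial: the three weather states come from the slide, but the observation symbols (`dry`, `damp`, `soggy`) and every probability in `INITIAL`, `TRANSITION` and `EMISSION` are assumptions chosen only to make the example runnable. The `brute_force` function implements the naïve sum over all paths so the two results can be compared.

```python
from itertools import product

# Hypothetical toy weather HMM; all numbers are illustrative assumptions.
STATES = ("sunny", "cloudy", "rainy")
INITIAL = {"sunny": 0.5, "cloudy": 0.3, "rainy": 0.2}
TRANSITION = {  # TRANSITION[prev][cur] = P(cur at t | prev at t-1)
    "sunny": {"sunny": 0.6, "cloudy": 0.3, "rainy": 0.1},
    "cloudy": {"sunny": 0.3, "cloudy": 0.4, "rainy": 0.3},
    "rainy": {"sunny": 0.2, "cloudy": 0.3, "rainy": 0.5},
}
EMISSION = {  # EMISSION[state][obs] = P(obs | state)
    "sunny": {"dry": 0.8, "damp": 0.15, "soggy": 0.05},
    "cloudy": {"dry": 0.3, "damp": 0.4, "soggy": 0.3},
    "rainy": {"dry": 0.05, "damp": 0.35, "soggy": 0.6},
}


def forward(observations):
    """Return P(observations | HMM) using the forward recursion."""
    # alpha[s] = P(o_1 .. o_t, state at time t is s)
    alpha = {s: INITIAL[s] * EMISSION[s][observations[0]] for s in STATES}
    for obs in observations[1:]:
        # Each new alpha depends only on the previous alphas and the current observation.
        alpha = {
            cur: EMISSION[cur][obs]
            * sum(alpha[prev] * TRANSITION[prev][cur] for prev in STATES)
            for cur in STATES
        }
    # At the last time step, sum over the states we could be in.
    return sum(alpha.values())


def brute_force(observations):
    """Naive check: sum the joint probability over every possible state path."""
    total = 0.0
    for path in product(STATES, repeat=len(observations)):
        p = INITIAL[path[0]] * EMISSION[path[0]][observations[0]]
        for t in range(1, len(observations)):
            p *= TRANSITION[path[t - 1]][path[t]] * EMISSION[path[t]][observations[t]]
        total += p
    return total


obs = ["dry", "damp", "soggy"]
print(forward(obs), brute_force(obs))
```

For three states and three observations the naïve sum visits 27 paths, while the recursion does a fixed amount of work per time step; the two printed values should agree up to floating-point rounding.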
Pfam
• Database of protein domains and domain families
• Contains a multiple sequence alignment and a profile HMM for every domain
• “Seed” and “full” alignments: the seed alignment is relatively small; the full alignment contains every member sequence and is built with HMMER from the seed alignment
http://www.sanger.ac.uk/Software/Pfam
Using Pfam
• For known proteins, get the pre-calculated domain structure
• For new sequences, get a list of matching domains
• Analyse domain structure, e.g. find all proteins with a similar domain structure, or all proteins containing both domains A and B
• Species-specific analysis, e.g. find all domains unique to a certain virus
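Once domain assignments have been retrieved, the analyses listed above reduce to simple set and list comparisons. The sketch below assumes the assignments are already in memory as a plain dictionary; it does not call any Pfam interface, and the protein names, domain labels and function names are all hypothetical.

```python
# Hypothetical protein -> ordered list of domains; not real Pfam data.
ARCHITECTURES = {
    "prot1": ["A", "B", "C"],
    "prot2": ["B", "C"],
    "prot3": ["A", "B", "C"],
    "prot4": ["D"],
}

# Hypothetical species -> set of domains found in that species.
DOMAINS_BY_SPECIES = {
    "virus_x": {"A", "D", "E"},
    "virus_y": {"A", "B"},
    "host": {"B", "C", "D"},
}


def with_domains(architectures, required):
    """Proteins that contain every domain in `required`."""
    required = set(required)
    return sorted(p for p, doms in architectures.items() if required <= set(doms))


def same_architecture(architectures, query):
    """Other proteins whose ordered domain list equals that of `query`."""
    target = architectures[query]
    return sorted(
        p for p, doms in architectures.items() if p != query and doms == target
    )


def unique_domains(domains_by_species, species):
    """Domains seen in `species` and in no other species in the table."""
    others = set()
    for name, doms in domains_by_species.items():
        if name != species:
            others |= doms
    return domains_by_species[species] - others


print(with_domains(ARCHITECTURES, ["A", "B"]))
print(same_architecture(ARCHITECTURES, "prot1"))
print(unique_domains(DOMAINS_BY_SPECIES, "virus_x"))
```

Here "similar domain structure" is taken to mean an identical ordered domain list, which is the strictest reading; a looser comparison (same set of domains in any order, for instance) would be an equally reasonable choice.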
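Gene prediction, GENSCAN (1997)
[Figure: generalized HMM (GENSCAN)]
• “Explicit state duration HMM”, also called a generalized HMM (GHMM)
• Joint probability of a parse $\Phi$ and the sequence $S$:
$$P(\Phi, S) = P(s_1 \mid q_1, d_1)\, f(d_1)\, T(q_2 \mid q_1) \times P(s_2 \mid q_2, d_2)\, f(d_2) \cdots T(q_N \mid q_{N-1}) \times P(s_N \mid q_N, d_N)\, f(d_N)$$
  $\Phi$: sequence of states $\{q_1 \dots q_N\}$
  $S$: the sequence, split into segments $s_1 \dots s_N$, where segment $s_i$ has length $d_i$ and is emitted by state $q_i$
  $T(q \mid q')$: transition probability $q' \to q$
  $f(d)$: state duration probability according to a distribution
• Individual states can themselves be HMMs, e.g. coding exon states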
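To make the product above concrete, the sketch below evaluates log P(Φ, S) for one parse of a short sequence. It is not GENSCAN's model: the two states, the geometric length distribution used for f(d), the independent-base emission used for P(s | q, d), and every number are assumptions for illustration. Any other length distribution or segment model could be substituted in `duration_prob` and `segment_prob`.

```python
import math

# Hypothetical two-state toy model; every number is an illustrative assumption.
# Keys are (q_prev, q_next), i.e. the transition q_prev -> q_next.
TRANSITION = {("intergenic", "exon"): 1.0, ("exon", "intergenic"): 1.0}
BASE_PROBS = {
    "intergenic": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
    "exon": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
}
MEAN_LENGTH = {"intergenic": 10.0, "exon": 4.0}


def duration_prob(state, d):
    """f(d): geometric length distribution with the state's mean length (assumed)."""
    p = 1.0 / MEAN_LENGTH[state]
    return (1.0 - p) ** (d - 1) * p


def segment_prob(segment, state):
    """P(s | q, d): bases emitted independently within the segment (assumed)."""
    return math.prod(BASE_PROBS[state][base] for base in segment)


def parse_log_prob(parse):
    """log P(Phi, S) for a parse given as a list of (state, segment) pairs."""
    log_p = 0.0
    prev = None
    for state, segment in parse:
        if prev is not None:
            log_p += math.log(TRANSITION[(prev, state)])  # T(q_k | q_{k-1})
        log_p += math.log(duration_prob(state, len(segment)))  # f(d_k)
        log_p += math.log(segment_prob(segment, state))  # P(s_k | q_k, d_k)
        prev = state
    return log_p


parse = [("intergenic", "ATAT"), ("exon", "GCG")]
print(parse_log_prob(parse))
```

Working in log space avoids underflow on long sequences. A gene finder compares this score across all candidate parses of the same sequence rather than evaluating a single one.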
Modelling Internal Coding Exons
• Decide whether the evaluated sequence looks like a coding or a non-coding region by comparing its hexamer (6-bp “word”) frequencies with those seen in exons and introns. This is done with a 5th-order Markov model, in which each base is conditioned on the five preceding bases
• Take into account splice signals and translational start and stop signals (all modelled without HMMs)
• Use a modified Viterbi algorithm to get the optimal parse
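The hexamer idea in the first bullet can be sketched as a log-odds score between two 5th-order Markov chains, one estimated from exons and one from introns. This is a simplified illustration rather than GENSCAN's actual coding model: it ignores reading frame, assumes sequences contain only A, C, G and T, uses add-one pseudocounts, and trains on two made-up sequences that are far too small for real use.

```python
import math
from collections import Counter
from itertools import product

K = 6  # hexamer length: 5 bases of context plus the base being predicted


def train(sequences, pseudocount=1.0):
    """Estimate P(base | previous 5 bases) from hexamer counts."""
    hexamers = Counter()
    for seq in sequences:
        for i in range(len(seq) - K + 1):
            hexamers[seq[i:i + K]] += 1
    model = {}
    for context in map("".join, product("ACGT", repeat=K - 1)):
        total = sum(hexamers[context + b] for b in "ACGT") + 4 * pseudocount
        for b in "ACGT":
            model[context + b] = (hexamers[context + b] + pseudocount) / total
    return model


def log_odds(seq, coding, noncoding):
    """Sum of log(P_coding / P_noncoding) over every hexamer in seq."""
    return sum(
        math.log(coding[seq[i:i + K]] / noncoding[seq[i:i + K]])
        for i in range(len(seq) - K + 1)
    )


# Hypothetical toy training data (assumption, not real annotation).
exons = ["ATGGCCGCTGCCGAGGCCGCTGCC" * 3]
introns = ["GTAAGTATTTTTATATTTTAATTTTCAG" * 3]
coding_model = train(exons)
noncoding_model = train(introns)

print(log_odds("GCCGCTGCCGAG", coding_model, noncoding_model))    # exon-like
print(log_odds("TTTTATATTTTAAT", coding_model, noncoding_model))  # intron-like
```

A positive score means the hexamer content looks more like the exon training set, a negative score more like the intron set. In a gene finder this score would be one term among several, alongside the signal models mentioned in the second bullet.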
Comparative genomic methods
• The mouse and human genome sequences provide new data. How can they be used?
• Use a generalized pair HMM (GPHMM) to align both genomes and predict genes in them at the same time (SLAM)
• Or modify the GENSCAN scoring scheme with alignment scores (TWINSCAN)
• Methods that can use more than two genomes are being developed, e.g. TWINSCAN 3.0
[Figure: generalized pair HMM (SLAM)]