
Genome evolution: a sequence-centric approach

Lecture 3 discusses the transition from trees to HMMs in the context of genome evolution, covering topics like inference, mutations, parameter estimation, population, selection, and stochastic processes.



Presentation Transcript


  1. Genome evolution: a sequence-centric approach Lecture 3: From Trees to HMMs

  2. Web: www.wisdom.weizmann.ac.il/~atanay/GenomeEvo Access ppts and ex. directly: /home/atanay/public_html/GenomeEvo/ Subscribe to course messages: amos.tanay@weizmann.ac.il

  3. Course outline (background: probability, calculus/matrix theory, some graph theory, some statistics): Probabilistic models, Genome structure, (Continuous time) Markov Chains, Inference, Mutations, Simple Tree Models, Parameter estimation, Population, Inferring Selection

  4. Stochastic Processes and Stationary Distributions [figure: a process model evolving over time t converges to its stationary model]

  5. Inference on trees and the EM algorithm: summary. Inference uses dynamic programming (up-down message passing) to compute marginal/posterior probabilities; the EM update rule re-estimates the conditional probabilities per lineage.
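The formulas on this slide were images and are missing from the transcript; a minimal reconstruction in generic notation (our notation, not necessarily the slide's):

```latex
% Up ("pruning") message at node v, over its children ch(v):
up_v(a) \;=\; \prod_{c \in ch(v)} \sum_b P(x_c=b \mid x_v=a)\, up_c(b)
% Down message for a child c of v (at the root, down equals the prior):
down_c(b) \;=\; \sum_a P(x_c=b \mid x_v=a)\, down_v(a)
  \prod_{c' \in ch(v),\, c' \neq c} \sum_{b'} P(x_{c'}=b' \mid x_v=a)\, up_{c'}(b')
% Marginal / posterior probability:
P(x_v=a \mid \text{data}) \;\propto\; up_v(a)\, down_v(a)
% EM update rule, per lineage v -> c:
P^{new}(x_c=b \mid x_v=a) \;=\;
  \frac{\sum_{\text{positions}} P(x_v=a,\, x_c=b \mid \text{data})}
       {\sum_{\text{positions}} P(x_v=a \mid \text{data})}
```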

  6. Bayesian Networks: we define the joint probability for a set of random variables given (a) a directed acyclic graph and (b) conditional probabilities. Claim: if G is a tree (no cycles), we can compute marginals and the total probability efficiently. Proof: exactly what we did last time (whiteboard/exercise). Claim: for a general G, inference is NP-hard. Why will the up-down algorithm not work? (whiteboard/exercise) We will discuss methods for approximate inference in detail later; for now, let's look for more easy cases.

  7. Markov Models: x_t is the state at time t. The transition probabilities P(x_{t+1} | x_t) define the process. Adding an initial condition defines a distribution on infinite sequences. Problem: we observe finite sequences, and infinite probability spaces are difficult to work with. Solution: add an absorbing finish state, and add a start state to express the probability at time 0.
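To make the start/finish construction concrete, here is a minimal sketch in Python; the alphabet, probabilities, and names are illustrative, not from the course materials:

```python
import numpy as np

# Illustrative Markov chain over {A, C, G, T} plus an absorbing 'finish'
# state, so the chain generates finite sequences with well-defined probability.
states = ["A", "C", "G", "T", "finish"]
start = np.array([0.25, 0.25, 0.25, 0.25, 0.0])   # distribution at time 0
P = np.array([
    [0.23, 0.24, 0.24, 0.24, 0.05],   # each row sums to 1; the last column
    [0.24, 0.23, 0.24, 0.24, 0.05],   # is the probability of stopping
    [0.24, 0.24, 0.23, 0.24, 0.05],   # (moving to 'finish')
    [0.24, 0.24, 0.24, 0.23, 0.05],
    [0.0,  0.0,  0.0,  0.0,  1.0],    # 'finish' is absorbing
])

def sample_sequence(rng):
    """Sample one finite sequence by running the chain until it is absorbed."""
    seq, s = [], rng.choice(len(states), p=start)
    while states[s] != "finish":
        seq.append(states[s])
        s = rng.choice(len(states), p=P[s])
    return "".join(seq)

print(sample_sequence(np.random.default_rng(0)))
```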

  8. Hidden Markov Models: we observe only the emissions of the states into some probability space E. Each state is equipped with an emission distribution P(e | x) (x a state, e an emission). Caution! The state-transition diagram is NOT the HMM Bayes net: (1) it has cycles; (2) its nodes are states, NOT random variables!
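In the usual notation (a standard factorization, not transcribed from the slide), the joint probability of a hidden state path and its emissions is:

```latex
P(x_{0:T},\, e_{0:T}) \;=\; P(x_0)\, \prod_{t=1}^{T} P(x_t \mid x_{t-1})\; \prod_{t=0}^{T} P(e_t \mid x_t)
```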

  9. Hidden Markov Models: the HMM can be viewed as a template model. Given a sequence of observations, or just its length, we can form a BN (start, states, finish, emissions). Since this BN has a tree topology, we know how to compute posteriors.

  10. Inference in HMMs: basically, exactly what we saw for trees. The forward formula plays the role of the down algorithm; the backward formula plays the role of the up algorithm.
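The forward/backward formulas were images on the slide; the standard recursions they refer to are:

```latex
% Forward (like the down algorithm):
f_t(i) \;=\; P(e_{1:t},\, x_t=i) \;=\;
  \Big[\sum_j f_{t-1}(j)\, P(x_t=i \mid x_{t-1}=j)\Big]\, P(e_t \mid x_t=i)
% Backward (like the up algorithm):
b_t(i) \;=\; P(e_{t+1:T} \mid x_t=i) \;=\;
  \sum_j P(x_{t+1}=j \mid x_t=i)\, P(e_{t+1} \mid x_{t+1}=j)\, b_{t+1}(j)
% Posterior: P(x_t=i \mid e_{1:T}) \;=\; f_t(i)\, b_t(i) \,/\, P(e_{1:T})
```

A minimal Python sketch of the same recursions, assuming discrete states and emissions (all names and matrices here are illustrative):

```python
import numpy as np

def forward_backward(pi, A, E, obs):
    """Forward/backward tables and posteriors for a discrete HMM.

    pi  : (S,) initial state distribution
    A   : (S, S) transitions, A[i, j] = P(x_{t+1}=j | x_t=i)
    E   : (S, K) emissions,   E[i, k] = P(e_t=k | x_t=i)
    obs : length-T sequence of observation indices
    """
    T, S = len(obs), len(pi)
    f = np.zeros((T, S))          # f[t, i] = P(e_1..e_t, x_t=i)
    b = np.zeros((T, S))          # b[t, i] = P(e_{t+1}..e_T | x_t=i)

    f[0] = pi * E[:, obs[0]]
    for t in range(1, T):
        f[t] = (f[t - 1] @ A) * E[:, obs[t]]

    b[-1] = 1.0
    for t in range(T - 2, -1, -1):
        b[t] = A @ (E[:, obs[t + 1]] * b[t + 1])

    likelihood = f[-1].sum()           # total probability of the sequence
    posterior = f * b / likelihood     # posterior[t, i] = P(x_t=i | e_1..e_T)
    return f, b, posterior

if __name__ == "__main__":
    pi = np.array([0.5, 0.5])
    A = np.array([[0.9, 0.1], [0.2, 0.8]])
    E = np.array([[0.8, 0.2], [0.3, 0.7]])
    f, b, post = forward_backward(pi, A, E, [0, 1, 1, 0])
    print(post)
```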

  11. EM for HMMs (start, states, finish, emissions): can we apply the tree EM verbatim? Almost, but we have to share parameters across positions. Claim: the HMM EM monotonically improves the likelihood (i.e., sharing parameters is OK).
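Concretely, a sketch of the standard Baum-Welch update with parameter sharing (a standard form, not transcribed from the slide): all time positions t pool their expected counts into a single shared transition table:

```latex
a^{new}_{ij} \;=\; \frac{\sum_{t} P(x_t=i,\, x_{t+1}=j \mid e_{1:T})}
                        {\sum_{t} P(x_t=i \mid e_{1:T})},
\qquad
P(x_t=i,\, x_{t+1}=j \mid e_{1:T}) \;=\;
  \frac{f_t(i)\, a_{ij}\, P(e_{t+1} \mid x_{t+1}=j)\, b_{t+1}(j)}{P(e_{1:T})}
```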

  12. Example: two Markov models describe our data, and switching between the models occurs at random. How do we model this? [figure: hidden states, including a hidden state with no emission]

  13. What about hidden cycles? [figure: hidden (non-emitting) states alongside emitting states]

  14. Profile HMMs for protein or DNA motifs [figure: a chain of M (match) states with parallel I (insert) and D (delete) states, from start S to finish F]
  • M (Match) states emit a position-specific amino acid/nucleotide profile
  • I (Insert) states emit some background profile
  • D (Delete) states are silent (they emit nothing)
  • We can use EM to train the parameters from a set of examples
  • Then use the model for classification or annotation
  • (Both the emission and the transition probabilities are informative!)
  • (How do we determine the right size of the model?)
  • (Google PFAM, Prosite, “HMM profile protein domain”)
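In standard profile-HMM notation (a sketch, not transcribed from the slide), the joint probability of a sequence s and a state path π through the M/I/D chain is:

```latex
P(s, \pi) \;=\; \prod_{i=1}^{L} a_{\pi_{i-1}\,\pi_i}
  \prod_{i:\;\pi_i \in \{M_j,\, I_j\}} e_{\pi_i}\!\big(s_{k(i)}\big)
```

Here a are the transition probabilities along the path (with π_0 = S), e_{π_i} is the match or insert emission profile, k(i) is the sequence position consumed at step i, and D states contribute only a transition factor, since they emit nothing.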

  15. Common error: N-order Markov models. For evolutionary models (in time), the Markov property makes much sense; for spatial (sequence) effects, the Markov property is a (horrible) heuristic. N-order relations can be modeled naturally. Forward/backward in an N-order HMM: would dynamic programming still work?
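One way to see why dynamic programming still works (our sketch, not from the slide): an N-order chain over an alphabet Σ is a first-order chain over N-tuples, so forward/backward runs unchanged, at the cost of a state space of size |Σ|^N:

```latex
P(x_t \mid x_{t-1}, \ldots, x_1) \;=\; P(x_t \mid x_{t-1}, \ldots, x_{t-N})
\quad\Longrightarrow\quad
y_t := (x_{t-N+1}, \ldots, x_t) \ \text{is first-order Markov}
```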

  16. [figure: the 1-HMM Bayes net (start, states, finish, emissions) next to the 2-HMM Bayes net, where each emission depends on two consecutive states] (You will explore the inference problem in Ex. 2)

  17. Pair-HMMs. Given two sequences s1 and s2, an alignment is defined by a set of ‘gaps’ (or indels) in each of the sequences:
  ACGCGAACCGAATGCCCAA---GGAAAACGTTTGAATTTATA
  ACCCGT-----ATGCCCAACGGGGAAAACGTTTGAACTTATA
  The standard distance metric combines a substitution matrix with an affine gap cost; the standard dynamic programming algorithm computes the best alignment under such a metric.
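The scoring formulas on this slide were images; the standard forms they refer to, reconstructed in common notation, are:

```latex
% Affine cost of a gap of length k (d = gap open, e = gap extension):
\gamma(k) \;=\; d + (k-1)\,e
% Standard alignment score: substitution matrix minus gap costs:
S \;=\; \sum_{\text{aligned pairs } (a,b)} s(a,b) \;-\; \sum_{\text{gaps}} \gamma(k_{\text{gap}})
```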

  18. Pair-HMMs generalize the HMM concept to probabilistically model alignments. Problem: we are observing two sequences, not a priori related; what will be emitted from our HMM? [figure: states S, M, G1, G2, F] Match (M) states emit an aligned nucleotide pair; gap (G1, G2) states emit a nucleotide from one of the sequences only. Pr(M->Gi) plays the role of the “gap open cost”; Pr(G1->G1) plays the role of the “gap extension cost”. Is it a BN template? Forward-backward formulas? (whiteboard/exercise)
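The correspondence with the affine-gap dynamic program, in a hedged sketch (standard, not transcribed from the slide): taking logs of the pair-HMM path probability recovers affine gap scores, roughly

```latex
d \;\approx\; -\log \Pr(M \to G_i), \qquad
e \;\approx\; -\log \Pr(G_1 \to G_1), \qquad
s(a,b) \;\approx\; \log P_{M}(a,b)
```

(up to normalization by background emission probabilities), so the Viterbi path through the pair-HMM corresponds to the best alignment under the standard DP.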

  19. Mixture models: inference? EM for parameter estimation? What about very high dimensions? (whiteboard/exercise)
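For reference, the standard mixture-model form and its EM responsibilities (our sketch; the slide's formulas are not in the transcript):

```latex
P(e) \;=\; \sum_{k=1}^{K} \pi_k\, P(e \mid \theta_k), \qquad
r_{nk} \;=\; \frac{\pi_k\, P(e_n \mid \theta_k)}{\sum_j \pi_j\, P(e_n \mid \theta_j)}, \qquad
\pi_k^{new} \;=\; \frac{1}{N} \sum_n r_{nk}
```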
