1 / 26

Gibbs sampling

Gibbs sampling. The Motif Finding Problem. Given a set of DNA sequences: cctgatagacgctatctggctatccacgtacgtaggtcctctgtgcgaatctatgcgtttccaaccat agtactggtgtacatttgatacgtacgtacaccggcaacctgaaacaaacgctcagaaccagaagtgc aaacgtacgtgcaccctctttcttcgtggctctggccaacgagggctgatgtataagacgaaaatttt

mahogany
Télécharger la présentation

Gibbs sampling

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Gibbs sampling

  2. The Motif Finding Problem • Given a set of DNA sequences: cctgatagacgctatctggctatccacgtacgtaggtcctctgtgcgaatctatgcgtttccaaccat agtactggtgtacatttgatacgtacgtacaccggcaacctgaaacaaacgctcagaaccagaagtgc aaacgtacgtgcaccctctttcttcgtggctctggccaacgagggctgatgtataagacgaaaatttt agcctccgatgtaagtcatagctgtaactattacctgccacccctattacatcttacgtacgtataca ctgttatacaacgcgtcatggcggggtatgcgttttggtcgtcgtacgctcgatcgttaacgtacgtc • Find the motif in each of the individual sequences

  3. The Motif Finding Problem • If starting positions s=(s1, s2,… st) are given, finding consensus is easy because we can simply construct (and evaluate) the profile to find the motif. • But… the starting positions s are usually not given. How can we find the “best” profile matrix? • Gibbs sampling • Expectation-Maximization algorithm

  4. Notations • Set of symbols: • Sequences: S = {S1, S2, …, SN} • Starting positions of motifs: A = {a1, a2, …, aN} • Motif model ( ) : qij = P(symbol at the i-th position = j) • Background model: pj = P(symbol = j) • Count of symbols in each column: cij= count of symbol, j, in the i-th column in the aligned region

  5. Motif Finding Problem • Problem: find starting positions and model parameters simultaneously to maximize the posterior probability: • This is equivalent to maximizing the likelihood by Bayes’ Theorem, assuming uniform prior distribution:

  6. Equivalent Scoring Function • Maximize the log-odds ratio:

  7. Sampling and optimization • To maximize a function, f(x): • Brute force method: try all possible x • Sample method: sample x from probability distribution: p(x) ~ f(x) • Idea: suppose xmax is argmax of f(x), then it is also argmax of p(x), thus we have a high probability of selecting xmax

  8. Gibbs Sampling • Idea: a joint distribution may be hard to sample from, but it may be easy to sample from the conditional distributions where all variables are fixed except one • To sample from p(x1, x2, …xn), let each state of the Markov chain represent (x1, x2, …xn), the probability of moving to a state (x1, x2, …xn) is: p(xi |x1, …xi-1,xi+1,…xn). It is also called Markov Chain Monte Carlo (MCMC) method.

  9. Gibbs Sampling

  10. Gibbs Sampling in Motif Finding Randomly initialize A0; Repeat: (1) randomly choose a sequence z from S; A* = At \ az; compute θt = estimator of θ given S and A*; (2) sample az according to P(az = x), which is proportional to Qx/Px; update At+1 = A* union x; Select At that maximizes F; Qx: the probability of generating x according to θt; Px: the probability of generating x according to the background model

  11. Estimator of θ • Given an alignment A, i.e. the starting positions of motifs, θ can be estimated by its MLE with smoothing (e.g. Dirichlet prior with parameter bj):

  12. The Motif Finding Problem • Given a set of DNA sequences: cctgatagacgctatctggctatccacgtacgtaggtcctctgtgcgaatctatgcgtttccaaccat agtactggtgtacatttgatacgtacgtacaccggcaacctgaaacaaacgctcagaaccagaagtgc aaacgtacgtgcaccctctttcttcgtggctctggccaacgagggctgatgtataagacgaaaatttt agcctccgatgtaagtcatagctgtaactattacctgccacccctattacatcttacgtacgtataca ctgttatacaacgcgtcatggcggggtatgcgttttggtcgtcgtacgctcgatcgttaacgtacgtc • Find the motif in each of the individual sequences

  13. TF CACGTGT TF CACGTGA TF CAAGTGA TF CAGGTGA Gene Regulation Gene 1 Gene 2 Gene 3 Gene 4 Transcription factor binding site, or motif instances

  14. Evolutionary Conservation CACGTGACC CACGTGAAC CACGTGAAC

  15. Overview of TGS How did the motifs evolve? How to find the ancestral motif instances? Colored lines: regulatory regions of genes Colored boxes: motif instances

  16. Ancestral motif profile: 1 2 3 4 5 6 7 8 9 A .036 .892 .036 .036 .036 .036 .892 .036 .036 C .892 .036 .892 .036 .036 .036 .036 .75 .75 G .036 .036 .036 .892 .036 .892 .036 .036 .036 T .036 .036 .036 .036 .892 .036 .036 .178 .178 CACGTGACC CACGTGAAC CACGTGAAC How to find the ancestral motif instances?

  17. How did the motifs evolve? Background substitution matrix A C G T A 0.8515 0.0278 0.0775 0.0432 C 0.0464 0.8026 0.0344 0.1167 G 0.1167 0.0350 0.8023 0.0460 T 0.0429 0.0785 0.0264 0.8522 Motif substitution matrix A C G T A 0.9802 0.0066 0.0066 0.0066 C 0.0120 0.9640 0.0120 0.0120 G 0.0120 0.0120 0.9640 0.0120 T 0.0066 0.0066 0.0066 0.9802

  18. Evolution of motifs 250 million years Distant species

  19. In order to draw: draw ~ Overview of Gibbs Sampler Implementation Iteratively sample from conditional distribution when other parameters are fixed.

  20. Ancestral motif weight matrix at the root Background distribution (multinomial) Probability that a gene in the i-th species will contain the motif Motif width Background substitution matrix for the i-th branch Motif substitution matrix for the i-th branch Implementation Parameters

  21. Implementation Prior distribution Beta(1,1) Poisson distribution

  22. Implementation Initialization Parameters are sampled using prior distributions; Motif instances in current species are sampled from sequences directly for each current species; Motif instances in ancestral species are randomly assigned with one of its immediate child motif instances.

  23. M11 M12 Implementation Motif instance updating Updating motif instances in ancestral species Updating motif instances in current species

  24. 2th position A: 0.932… C: 0.067 G: 8.4e-6 T: 2.5e-4 M11 M12 C A Implementation Updating motif instanceinancestral species Ancestral Motif Weight Matrix 1 2 3 4 5 6 7 8 9 A .036 .892 .036 .036 .036 .036 .892 .036 .036 C .892 .036 .892 .036 .036 .036 .036 .75 .75 G .036 .036 .036 .892 .036 .892 .036 .036 .036 T .036 .036 .036 .036 .892 .036 .036 .178 .178 M11 M12 CCCGTGACC CACGTGAAC

  25. Implementation Updating motif instances for current species Updated ancestral motif instance CACTTGAAC M11 M12 …CACACCACGTGAGCTT... …CACATCACGTGAACTT…

  26. Multiple Species? ? CAGGTGATC CACGTGAAC CACGTGAAC CACGTGAAC CACGTGATC CAGGTGATC

More Related