
Motif finding: Lecture 1


betty_james


Presentation Transcript


  1. Motif finding: Lecture 1 CS 498 CXZ

  2. From DNA to Protein: In words • DNA = nucleotide sequence • Alphabet size = 4 (A,C,G,T) • DNA → mRNA (single stranded) • Alphabet size = 4 (A,C,G,U) • mRNA → amino acid sequence • Alphabet size = 20 • Amino acid sequence “folds” into 3-dimensional molecule called protein • Example: DNA AATACGAAGTAA → mRNA AAUACGAAGUAA → amino acids Asn Thr Lys Stop
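The DNA → mRNA → amino acid example on this slide can be traced with a short Python sketch. It assumes the DNA string is the coding strand (so transcription is just T → U), and the codon table is deliberately partial: it holds only the four codons from the slide. The names `CODON_TABLE`, `transcribe`, and `translate` are illustrative, not from the lecture.

```python
# Partial codon table: only the codons used in the slide's example.
# A real translator needs all 64 codons.
CODON_TABLE = {"AAU": "Asn", "ACG": "Thr", "AAG": "Lys", "UAA": "Stop"}


def transcribe(dna: str) -> str:
    """Coding-strand DNA to mRNA: same letters, with T replaced by U."""
    return dna.upper().replace("T", "U")


def translate(mrna: str) -> list[str]:
    """Read the mRNA three bases at a time, stopping at a stop codon."""
    residues = []
    for i in range(0, len(mrna) - 2, 3):
        residue = CODON_TABLE[mrna[i:i + 3]]
        residues.append(residue)
        if residue == "Stop":
            break
    return residues


mrna = transcribe("AATACGAAGTAA")   # "AAUACGAAGUAA"
protein = translate(mrna)           # ["Asn", "Thr", "Lys", "Stop"]
```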

  3. Gene expression • Process of making a protein from a gene as template • Transcription, then translation • Can be regulated

  4. Transcription • Process of making a single stranded mRNA using double stranded DNA as template • Only genes are transcribed, not all DNA

  5. Step 1: From DNA to mRNA Transcription SOURCE: http://academy.d20.co.edu/kadets/lundberg/DNA_animations/rna.dcr

  6. Transcriptional regulation (figure labels: TRANSCRIPTION FACTOR, PROTEIN, ACAGTGA, GENE)

  7. Transcriptional regulation (figure labels: TRANSCRIPTION FACTOR, PROTEIN, ACAGTGA, GENE)

  8. The importance of gene regulation

  9. Genetic regulatory network controlling the development of the body plan of the sea urchin embryo Davidson et al., Science, 295(5560):1669-1678.

  10. That was the “circuit” responsible for development of the sea urchin embryo • Nodes = genes • Switches = gene regulation • Change the switches and the circuit changes • Gene regulation significance: • Development of an organism • Functioning of the organism • Evolution of organisms

  11. Binding sites and motifs

  12. Binding sites • Binding sites of transcription factor “Bicoid”, collected experimentally

  13. http://webdisk.berkeley.edu/~dap5/data_04/motifs/bicoid.gif

  14. Motif (“Consensus String”): T A A T C C C http://webdisk.berkeley.edu/~dap5/data_04/motifs/bicoid.gif

  15. Motif: W A A T C C N, where W = T or A and N = A, C, G, or T http://webdisk.berkeley.edu/~dap5/data_04/motifs/bicoid.gif

  16. Motif • Common sequence “pattern” in the binding sites of a transcription factor • A succinct way of capturing variability among the binding sites

  17. Alternative way to represent motif Position weight matrix (PWM) Or simply, “weight matrix”

  18. Motif representation • Consensus string • May allow “degenerate” symbols in string, e.g., N = A/C/G/T; W = A/T; S = C/G; R = A/G; Y = T/C etc. • Position weight matrix • More powerful representation • Probabilistic treatment

  19. The motif finding problem • Suppose a transcription factor (TF) controls five different genes • Each of the five genes should have binding sites for TF in their promoter region Gene 1 Gene 2 Gene 3 Gene 4 Gene 5 Binding sites for TF

  20. The motif finding problem • Now suppose we are given the promoter regions of the five genes G1, G2, …, G5 • Can we find the binding sites of TF, without knowing about them a priori? • Binding sites are similar to each other, but not necessarily identical • This is the motif finding problem • To find a motif that represents binding sites of an unknown TF

  21. A variant of motif finding • Given a motif (e.g., consensus string, or weight matrix), find the binding sites in an input sequence • For consensus string, problem is trivial • For each position l in input sequence, check if substring starting at position l matches the motif. • For weight matrix, not so trivial
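The consensus-string case above can be sketched in a few lines of Python: slide a window along the input and test each position against the motif. The degenerate symbols are the ones listed on the “Motif representation” slide; the function names and the input sequence are made up for illustration.

```python
# Degenerate symbols from the "Motif representation" slide.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "N": "ACGT", "W": "AT", "S": "CG", "R": "AG", "Y": "CT",
}


def matches(motif: str, site: str) -> bool:
    """True if every base of `site` is allowed by the motif symbol at that position."""
    return len(motif) == len(site) and all(
        base in IUPAC[symbol] for symbol, base in zip(motif, site)
    )


def find_sites(motif: str, sequence: str) -> list[int]:
    """Start positions (0-based) where the substring matches the motif."""
    l = len(motif)
    return [
        i for i in range(len(sequence) - l + 1)
        if matches(motif, sequence[i:i + l])
    ]


# Hypothetical input sequence containing two matches to WAATCCN.
hits = find_sites("WAATCCN", "GGTAATCCCATAAATCCGTT")   # [2, 11]
```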

  22. Binding sites from a weight matrix motif • Weight matrix W, built from the counts of each base in each column: $W_{\beta k}$ = probability of base $\beta$ in column k • Given a string s of length l = 8 • s = s_1 s_2 … s_l • $\Pr(s \mid W) = \prod_{k=1}^{l} W_{s_k k}$ • Example: • Pr(CTAATCCG | W) = 0.67 × 0.89 × 1 × 1 × 0.89 × 1 × 0.89 × 0.11 ≈ 0.052

  23. Binding sites from a weight matrix motif • Given sequence S (e.g., 1000 base-pairs long) • For each substring s of S, • Compute Pr(s|W) • If Pr(s|W) > some threshold, call that a binding site • Look at S, as well as its “reverse complement” • Rev.Compl. of AGTTACACCA is TGGTGTAACT • (That’s what is on the other strand of DNA)
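A minimal sketch of this scan, in Python. The count matrix is hypothetical: it assumes 9 aligned sites and is chosen only so that CTAATCCG gives factors 6/9, 8/9, 1, 1, 8/9, 1, 8/9, 1/9, which round to the 0.67, 0.89, … in the earlier example. The threshold and the input sequence are likewise placeholders.

```python
BASES = "ACGT"

# Hypothetical counts of A, C, G, T in each of 8 motif columns (9 sites per column).
COUNTS = [
    [1, 6, 1, 1],
    [1, 0, 0, 8],
    [9, 0, 0, 0],
    [9, 0, 0, 0],
    [1, 0, 0, 8],
    [0, 9, 0, 0],
    [0, 8, 0, 1],
    [3, 3, 1, 2],
]
# W[k][b] = probability of base BASES[b] in column k.
W = [[c / sum(col) for c in col] for col in COUNTS]


def prob(s, W):
    """Pr(s | W): product over positions of the probability of the observed base."""
    p = 1.0
    for k, base in enumerate(s):
        p *= W[k][BASES.index(base)]
    return p


def reverse_complement(s):
    """Sequence of the opposite DNA strand, read 5' to 3'."""
    return s.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def scan(S, W, threshold):
    """Report windows on either strand whose Pr(s | W) exceeds the threshold.

    Positions on the "-" strand index into the reverse complement of S.
    """
    l = len(W)
    hits = []
    for strand, seq in (("+", S), ("-", reverse_complement(S))):
        for i in range(len(seq) - l + 1):
            p = prob(seq[i:i + l], W)
            if p > threshold:
                hits.append((strand, i, seq[i:i + l], p))
    return hits


hits = scan("GGCTAATCCGTT", W, threshold=0.01)
```

In practice the product of many small probabilities is usually computed as a sum of logarithms to avoid underflow; the plain product is kept here to mirror the slide.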

  24. Ab initio motif finding • The original motif finding problem • To find a motif that represents binding sites of an unknown TF

  25. Ab initio motif finding • Define a motif score, find the motif with maximum score over all possible motifs in search space (motif model) • Consensus string model => exhaustive search algorithm, guarantee on finding the optimal motif • PWM model => local search, not guaranteed to find optimal motif.

  26. Ab initio motif finding - consensus string motifs • Define motif model precisely • A precise motif model defines the search space (i.e., a list of all candidate motifs). • The motif model also prescribes exactly how to determine if a substring is a match to a particular motif.

  27. Ab initio motif finding - consensus string motifs • E.g., string over alphabet {A,C,G,T} of fixed length l. If l = 4, all 256 strings AAAA, AAAT, AAAC, …, TTTT are “candidate motifs”. • E.g., string over alphabet {A,C,G,T} of fixed length l, and allowing up to d mismatches. If AAAA is a motif, and d = 1, then AAAT, AATA, etc. are also counted as matches to the motif. • E.g., string over extended alphabet {A,C,G,T,N} of fixed length l. Here “N” stands for any character (A, C, G, or T). • If AANAA is the motif, then AACAA, AAGAA, AATAA, and AAAAA are all counted as matches to this motif.
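The three motif models above differ only in their candidate list and their match rule. A small Python sketch makes that concrete; the function names are illustrative.

```python
from itertools import product


def candidates(l, alphabet="ACGT"):
    """All strings of length l over the alphabet: the search space of the model."""
    return ["".join(p) for p in product(alphabet, repeat=l)]


def matches_with_mismatches(motif, site, d):
    """Match rule for the 'up to d mismatches' model."""
    return len(motif) == len(site) and sum(m != b for m, b in zip(motif, site)) <= d


def matches_with_wildcard(motif, site):
    """Match rule for the extended alphabet, where N stands for any base."""
    return len(motif) == len(site) and all(
        m == "N" or m == b for m, b in zip(motif, site)
    )


all_4mers = candidates(4)   # 4**4 = 256 candidate motifs, from AAAA to TTTT
```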

  28. Ab initio motif finding - consensus string motifs • Define a motif score, i.e., a real number associated with each candidate motif, in relation to the input sequences. • E.g., count $N_s$ of a motif s in the input sequence(s). • E.g., some function of the motif count $N_s$. • E.g., the z-score $Z_s = (N_s - E_s)/\sigma_s$ • $E_s$ is the expected count of motif s in random sequences; and • $\sigma_s$ is the standard deviation of the count in random sequences (the square root of its variance)

  29. Ab initio motif finding - consensus string motifs • For each motif s in the search space, • Compute the score of s • Output the highest scoring motifs. • This is the “enumerative” algorithm. • Guaranteed to produce the optimal motif, since every possible motif is considered. • Guarantee possible due to small search space (e.g., $4^l$ candidates, where l is the motif length). • Cannot handle large values of l (e.g., l > 10): the running time grows exponentially with l.
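A minimal Python sketch of the enumerative algorithm, assuming the simplest score mentioned earlier: the raw count of matches in the input sequences, optionally allowing d mismatches. The input sequences are made up, and a z-score could be substituted for the count without changing the loop.

```python
from itertools import product


def count_matches(motif, sequences, d=0):
    """Number of windows in the input sequences within d mismatches of the motif."""
    l = len(motif)
    n = 0
    for seq in sequences:
        for i in range(len(seq) - l + 1):
            if sum(m != b for m, b in zip(motif, seq[i:i + l])) <= d:
                n += 1
    return n


def enumerate_motifs(sequences, l, d=0, top=1):
    """Score every one of the 4**l candidate strings and return the best `top`."""
    scored = []
    for p in product("ACGT", repeat=l):
        motif = "".join(p)
        scored.append((count_matches(motif, sequences, d), motif))
    # Highest count first; ties broken alphabetically.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return scored[:top]


# Hypothetical promoter fragments that share the site TAATCC.
sequences = ["GGTAATCCAG", "CTAATCCGTT", "ACGTAATCCA"]
best = enumerate_motifs(sequences, l=6)   # [(3, "TAATCC")]
```

The outer loop visits 4**l candidates, which is why the approach stops being practical for longer motifs.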

  30. Ab initio motif finding - PWM motifs • Local search techniques, e.g., • Gibbs sampling • Expectation Maximization • Greedy

  31. Gibbs sampling: The search space • Input: a set of sequences {S1,S2,…,Sn} • Input: motif length l • Candidate motif: A set of substrings {s1,s2,…,sn}, each of length l, one from each Si. • Search space: all possible candidate motifs • $O(L^n)$ where L is length of each Si.

  32. Gibbs sampling: algorithm • Consider any candidate motif {s1,s2,…,sn}, where each si is of length l • Let $W_{\beta k}$ be the frequency of base $\beta$ in the kth position of the candidate motif • $\Pr(s \mid W) = \prod_{k=1}^{l} W_{s_k k}$ • Let “background” (genome-wide) frequency of nucleotide $\beta$ be $q_\beta$

  33. Gibbs sampling: algorithm • Let current motif be $M_t$ = {s1,s2,…,sn} • Pick one si to replace • For each substring s’ in Si, replace si with s’ and compute Pr(s’)

  34. Gibbs sampling: algorithm • Pick s’ with probability proportional to Pr(s’) as computed • Replace si with s’ to obtain new current motif $M_{t+1}$ • Keep updating motif • Report the motif with maximum score
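The Gibbs sampling loop can be sketched in Python as follows. Several choices here are assumptions, since the transcript does not preserve the exact formulas: the sampling weight for a substring s’ is taken to be Pr(s’ | W) divided by its background probability, W is built from the other n − 1 substrings with a pseudocount of 1, and the motif score is the summed log of that ratio. All function names, parameters, and input sequences are hypothetical.

```python
import math
import random

BASES = "ACGT"


def build_pwm(sites, pseudocount=1.0):
    """Column-wise base frequencies of the aligned sites (pseudocounts avoid zeros)."""
    W = []
    for k in range(len(sites[0])):
        col = {b: pseudocount for b in BASES}
        for s in sites:
            col[s[k]] += 1
        total = sum(col.values())
        W.append({b: c / total for b, c in col.items()})
    return W


def likelihood_ratio(s, W, q):
    """Assumed weight: Pr(s | W) divided by the background probability of s."""
    r = 1.0
    for k, b in enumerate(s):
        r *= W[k][b] / q[b]
    return r


def motif_score(sites, q):
    """Assumed score: summed log-likelihood ratio of the sites under their own PWM."""
    W = build_pwm(sites)
    return sum(math.log(likelihood_ratio(s, W, q)) for s in sites)


def gibbs_sample(seqs, l, q, iterations=200, seed=0):
    """Return (start positions, score) of the best motif seen. Needs >= 2 sequences."""
    rng = random.Random(seed)
    pos = [rng.randrange(len(S) - l + 1) for S in seqs]

    def current_sites():
        return [S[p:p + l] for S, p in zip(seqs, pos)]

    best_pos, best_score = list(pos), motif_score(current_sites(), q)
    for _ in range(iterations):
        i = rng.randrange(len(seqs))                      # pick one s_i to replace
        others = [s for j, s in enumerate(current_sites()) if j != i]
        W = build_pwm(others)
        starts = list(range(len(seqs[i]) - l + 1))
        weights = [likelihood_ratio(seqs[i][p:p + l], W, q) for p in starts]
        pos[i] = rng.choices(starts, weights=weights)[0]  # sample s' by weight
        score = motif_score(current_sites(), q)
        if score > best_score:                            # remember the best motif
            best_pos, best_score = list(pos), score
    return best_pos, best_score


# Hypothetical input: four short sequences and a uniform background.
seqs = ["GGTAATCCAG", "CTAATCCGTT", "ACGTAATCCA", "TTTAATCCGG"]
q = {b: 0.25 for b in BASES}
positions, score = gibbs_sample(seqs, l=6, q=q, iterations=50, seed=1)
```

Because each step samples rather than maximizes, different seeds can end at different motifs; this is the sense in which local search is not guaranteed to find the optimum.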
