
Motif Finding

Motif Finding. Yueyi Irene Liu, CS374 Lecture, Oct. 17, 2002.


Presentation Transcript


  1. Motif Finding Yueyi Irene Liu CS374 Lecture Oct. 17, 2002

  2. Outline • Background biology • Motif-finding methods • Word enumeration • Gibbs sampling • Random projection • Phylogenetic footprinting • Reducer

  3. Regulation of Gene Expression • Chromatin structure • Transcription initiation • Transcript processing and modification • RNA transport • Transcript stability • Translation initiation • Post-Translational Modification • Protein Transport • Control of Protein Stability

  4. Typical Structure of a Eukaryotic mRNA Gene

  5. Control of Transcription Initiation

  6. Motif • A conserved pattern that is found in two or more sequences • Can be found in • DNA (e.g., transcription factor binding sites) • Protein • RNA

  7. Models for Representing Motifs • Regular expression • Consensus: TGACGCA • Degenerate: WGACRCA • Position-specific matrix, built from aligned sites: TGACGCA, TGACGCA, AGACGCA, TGACACA, AGACGCA
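The three representations on this slide can be derived from the same five aligned sites. The following is a minimal sketch; the function names and the partial IUPAC table are illustrative assumptions, not part of the lecture.

```python
from collections import Counter

# Aligned binding sites taken from the slide.
sites = ["TGACGCA", "TGACGCA", "AGACGCA", "TGACACA", "AGACGCA"]

# IUPAC codes for a few base sets (a partial table, enough for this example).
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AT"): "W", frozenset("AG"): "R", frozenset("CT"): "Y",
}

def count_matrix(seqs):
    """Return one Counter of base counts per alignment column."""
    return [Counter(col) for col in zip(*seqs)]

def consensus(counts):
    """Most frequent base in each column."""
    return "".join(c.most_common(1)[0][0] for c in counts)

def degenerate(counts):
    """IUPAC letter covering every base observed in each column."""
    return "".join(IUPAC[frozenset(c)] for c in counts)

counts = count_matrix(sites)
print(consensus(counts))   # TGACGCA
print(degenerate(counts))  # WGACRCA
```

Dividing each column's counts by the number of sites turns the count matrix into the position-specific frequency matrix.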

  8. Where to look for motifs? • Gene families: a set of genes controlled by a common transcription factor or common environmental stimulus • How do you construct gene families? • Microarray experiments

  9. Microarrays [Figure: microarray workflow. Labels: cells of interest, reference sample, isolate mRNA, known DNA sequences, glass slide; resulting data: a genes × experiments table of numeric values]

  10. Motif-finding Methods • Goal: Look for motifs (5-15bp) in the data set • Methods: • Word enumeration method • Gibbs sampling • Random projection • Phylogenetic footprinting • Reducer

  11. Word Enumeration • For every word w, calculate: • Expected frequency p, based on the entire upstream region of the yeast genome • E.g., P(ATTGA) = (0.4)^4 (0.1)^1 = 0.00256, given P(A) = P(T) = 0.4, P(G) = P(C) = 0.1 • Expected number of occurrences of ATTGA among n word positions: E = n * P(ATTGA) • Observed frequency O in the data set • Statistical significance of enrichment: Z = (O - E) / sqrt[n p (1 - p)] ~ N(0, 1) • Disadvantage: only exact words are considered • E.g., the degenerate motif YCTGCA matches both TCTGCA and CCTGCA, which are counted as separate words
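The Z-score calculation above can be sketched as follows. The background probabilities and the word ATTGA come from the slide; the two toy sequences and all function names are made up for illustration, and the background is assumed to be a simple independent-base model.

```python
import math

def word_probability(word, base_probs):
    """Probability of a word under an independent-base background model."""
    p = 1.0
    for base in word:
        p *= base_probs[base]
    return p

def count_occurrences(word, seqs):
    """Number of (possibly overlapping) exact occurrences of word in seqs."""
    k = len(word)
    return sum(1 for s in seqs
               for i in range(len(s) - k + 1) if s[i:i + k] == word)

def z_score(observed, n, p):
    """Z = (O - E) / sqrt(n p (1 - p)) with E = n p."""
    expected = n * p
    return (observed - expected) / math.sqrt(n * p * (1 - p))

background = {"A": 0.4, "T": 0.4, "G": 0.1, "C": 0.1}
p = word_probability("ATTGA", background)      # 0.4**4 * 0.1

# Hypothetical data set, for illustration only.
seqs = ["ATTGAATTGA", "CCATTGATT"]
n = sum(len(s) - 5 + 1 for s in seqs)          # number of 5-mer positions
observed = count_occurrences("ATTGA", seqs)
print(p, n, observed, z_score(observed, n, p))
```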

  12. Gibbs Sampling • Matrix to capture a motif • Goal: find the best start positions a1, ..., ak that maximize the difference between the motif and the background base distribution. [Figure: sequences with motif start positions a1, a2, a3, a4, ..., ak] Liu, X

  13. Gibbs Sampling (Lawrence et al., 1993) • Step 1: Pick random start positions, compute current motif matrix • Step 2: Iterative update • Take one sequence out, update motif matrix • Calculate fitness score of each position of the left-out sequence • Pick start position in the left-out sequence based on weight Ax • Take out another sequence, …, until convergence • Step 3: Reset starting positions Liu, X

  14. Gibbs Sampling Initialization • Pick random start positions, compute motif matrix [Figure: sequences with start positions a1, a2, a3, a4, ..., ak and a1', a2', a3', a4', ..., ak'] Liu, X

  15. Gibbs Sampling Iteration Steps • 1) Take out one sequence, calculate the fitness score of every subsequence relative to the current motif [Figure: the first sequence is taken out and its start position is undetermined; a2', a3', a4', ..., ak' are kept] Liu, X

  16. Fitness Score • Ax = Qx / Px • Qx: probability of generating subsequence x from the current motif • Px: probability of generating subsequence x from the background • Background: P(A) = P(T) = 0.4, P(G) = P(C) = 0.1 • Exercise: for x = GGA, what are Qx and Px? [Figure: current motif matrix]
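A worked version of the GGA exercise, as a sketch only. The motif matrix shown on the slide is not preserved in this transcript, so the 3-column matrix below is a hypothetical stand-in; only the background probabilities come from the slide.

```python
background = {"A": 0.4, "T": 0.4, "G": 0.1, "C": 0.1}

# Hypothetical current motif matrix (one dict of base probabilities per
# column). This is NOT the matrix from the slide.
motif = [
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
]

def q_motif(x, matrix):
    """Qx: probability of generating x from the motif matrix."""
    prob = 1.0
    for column, base in zip(matrix, x):
        prob *= column[base]
    return prob

def p_background(x, base_probs):
    """Px: probability of generating x from the background."""
    prob = 1.0
    for base in x:
        prob *= base_probs[base]
    return prob

def fitness(x, matrix, base_probs):
    """Ax = Qx / Px."""
    return q_motif(x, matrix) / p_background(x, base_probs)

x = "GGA"
print(q_motif(x, motif))              # 0.7 * 0.7 * 0.7, about 0.343
print(p_background(x, background))    # 0.1 * 0.1 * 0.4, about 0.004
print(fitness(x, motif, background))  # about 85.75
```

With this stand-in matrix, GGA is roughly 86 times more likely under the motif than under the background, so it would receive a large sampling weight.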

  17. Gibbs Sampling Iteration Steps • 2) Pick a new start position by sampling in proportion to the fitness score [Figure: a1' updated to a1''; a2', a3', a4', ..., ak' unchanged] Liu, X
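Steps 1 and 2 of the sampler can be put together in a short sketch. It assumes exactly one site per sequence, adds pseudocounts to the motif matrix, runs a fixed number of sweeps in place of a convergence test, and omits Step 3 (restarts); these choices, the toy sequences, and all names are assumptions for illustration, not the published implementation.

```python
import random

BASES = "ACGT"

def build_matrix(seqs, starts, w, skip=None, pseudo=1.0):
    """Column-wise base frequencies of the current sites, leaving out
    sequence number `skip`. Pseudocounts avoid zero probabilities."""
    cols = [{b: pseudo for b in BASES} for _ in range(w)]
    for i, (s, a) in enumerate(zip(seqs, starts)):
        if i == skip:
            continue
        for j in range(w):
            cols[j][s[a + j]] += 1
    return [{b: c[b] / sum(c.values()) for b in BASES} for c in cols]

def fitness(x, matrix, background):
    """Ax = Qx / Px for subsequence x."""
    score = 1.0
    for column, base in zip(matrix, x):
        score *= column[base] / background[base]
    return score

def gibbs_sampler(seqs, w, background, n_sweeps=200, seed=0):
    rng = random.Random(seed)
    # Step 1: random start positions.
    starts = [rng.randrange(len(s) - w + 1) for s in seqs]
    # Step 2: iterative update, one left-out sequence at a time.
    for _ in range(n_sweeps):
        for i, s in enumerate(seqs):
            matrix = build_matrix(seqs, starts, w, skip=i)
            weights = [fitness(s[a:a + w], matrix, background)
                       for a in range(len(s) - w + 1)]
            # Sample the new start in proportion to the fitness scores.
            starts[i] = rng.choices(range(len(weights)), weights=weights)[0]
    return starts

# Toy input (hypothetical): each sequence carries a variant of TGACGCA.
seqs = ["CCTTGACGCATT", "ATGACGCAGGCC", "GGCAAGACGCAT", "TTGACACACCGG"]
w = 7
uniform = {b: 0.25 for b in BASES}
starts = gibbs_sampler(seqs, w, uniform)
print([s[a:a + w] for s, a in zip(seqs, starts)])
```

Because the start positions are sampled rather than chosen greedily, different seeds can give different answers; restarting from new random positions (Step 3) is the usual remedy.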

  18. Recent Development • Random Projection • Phylogenetic Footprinting • Reducer

  19. Random Projection (Buhler, 2002) • (l, d)-motif problem: • M is an (unknown) motif of length l • Each occurrence of M is corrupted by exactly d point substitutions in random positions • No known biological motifs are (l, d)-motifs [Figure: example sequences CCcaAG, CCcgAG, CCgcAG, CCtaAG, CCtgAG, CtATgG, CCctAc, tCtTAG, CaAcAG, CCAgAa]

  20. Random Projection Algorithm • Guiding principle: Some instances of a motif agree on a subset of positions. • Use information from multiple motif instances to construct model. [Figure: a (7,2)-motif M = ATGCGTC with instances in input sequences x(1), x(2), x(5), x(8): ...ccATCCGACca..., ...ttATGAGGCtc..., ...ctATAAGTCgc..., ...tcATGTGACac...] Buhler, J

  21. k-Projections • Choose k positions in a string of length l. • Concatenate the nucleotides at the chosen k positions to form a k-tuple. • In l-dimensional Hamming space, this is a projection onto a k-dimensional subspace. • Example: l = 15, k = 7, P = (2, 4, 5, 7, 11, 12, 13) maps ATGGCATTCAGATTC to TGCTGAT Buhler, J
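The k-projection example on this slide can be reproduced in a few lines. This is a sketch; `project` and `random_projection` are illustrative names, and positions are 1-based to match the slide.

```python
import random

def project(ltuple, positions):
    """Concatenate the letters of ltuple at the given 1-based positions."""
    return "".join(ltuple[p - 1] for p in positions)

def random_projection(l, k, rng=random):
    """Choose k of the l positions uniformly at random, in sorted order."""
    return tuple(sorted(rng.sample(range(1, l + 1), k)))

P = (2, 4, 5, 7, 11, 12, 13)
print(project("ATGGCATTCAGATTC", P))  # TGCTGAT
```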

  22. Random Projection Algorithm • Choose a projection by selecting k positions uniformly at random. • For each l-tuple in the input sequences, hash it into a bucket based on the letters at the k selected positions. • Recover the motif from a bucket containing multiple l-tuples. [Figure: the l-tuple TGCACCT from input sequence x(i): …TCAATGCACCTAT... is hashed to bucket TGCT] Buhler, J

  23. Example • l = 7 (motif size), k = 4 (projection size) • Choose projection (1,2,5,7) • Input sequence: ...TAGACATCCGACTTGCCTTACTAC... • Buckets: ATCCGAC goes to bucket ATGC; GCCTTAC goes to bucket GCTC Buhler, J

  24. Hashing and Buckets • Hash function h(x) obtained from the k positions of the projection. • Buckets are labeled by values of h(x), e.g., GCTC, CATC, ATTC, ATGC. • Enriched buckets: contain more than s l-tuples, for some parameter s. Buhler, J
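Hashing and bucket enrichment can be sketched using the four sequence fragments from slide 20 and the projection (1,2,5,7) from slide 23. The projection is fixed here rather than drawn at random so that the result is reproducible; the function names are illustrative.

```python
from collections import defaultdict

def project(ltuple, positions):
    """h(x): letters of the l-tuple at the given 1-based positions."""
    return "".join(ltuple[p - 1] for p in positions)

def hash_ltuples(seqs, l, positions):
    """Hash every l-tuple of every sequence into its bucket."""
    buckets = defaultdict(list)
    for s in seqs:
        for i in range(len(s) - l + 1):
            x = s[i:i + l]
            buckets[project(x, positions)].append(x)
    return buckets

def enriched(buckets, s):
    """Buckets holding more than s l-tuples."""
    return {h: xs for h, xs in buckets.items() if len(xs) > s}

# Fragments from slide 20, upper-cased.
seqs = ["CCATCCGACCA", "TTATGAGGCTC", "CTATAAGTCGC", "TCATGTGACAC"]
buckets = hash_ltuples(seqs, l=7, positions=(1, 2, 5, 7))
print(enriched(buckets, s=3))
# {'ATGC': ['ATCCGAC', 'ATGAGGC', 'ATAAGTC', 'ATGTGAC']}
```

All four planted instances agree on positions 1, 2, 5, and 7, so they land in the same bucket even though each differs from the motif ATGCGTC at two other positions.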

  25. Motif Refinement • How do we recover the motif from the sequences in the enriched buckets? • k nucleotides are known from the hash value of the bucket. • Use information in the other l - k positions as the starting point for a local refinement scheme, e.g., EM or Gibbs sampler [Figure: enriched bucket ATGC containing ATCCGAC, ATGAGGC, ATAAGTC, ATGTGAC; the local refinement algorithm yields candidate motif ATGCGTC] Buhler, J

  26. Parameter Selection • Projection size k • Choose k small so that several motif instances hash to the same bucket (k < l - d) • Choose k large to avoid contamination by spurious l-mers (4^k > t(n - l + 1), for t input sequences of length n) • Bucket threshold s (s = 3, s = 4) Buhler, J
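The two constraints on k can be combined into a small helper. This is a sketch: it assumes t is the number of input sequences and n their length, and the numeric values in the example are illustrative rather than taken from the slide.

```python
def valid_projection_sizes(l, d, t, n):
    """All k with k < l - d and 4**k > t * (n - l + 1)."""
    total_ltuples = t * (n - l + 1)   # number of l-mers in the input
    return [k for k in range(1, l - d) if 4 ** k > total_ltuples]

# Illustrative values: a (15, 4)-motif in 20 sequences of length 600.
print(valid_projection_sizes(l=15, d=4, t=20, n=600))  # [7, 8, 9, 10]
```

An empty list means no k satisfies both constraints for the given input size, in which case one of the two goals has to be relaxed.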

  27. Recent Development • Random Projection • Phylogenetic Footprinting • Reducer

  28. Conservation of Regulatory Elements Upstream of the ApoAI Gene [Figure: alignments of mouse, rabbit, human, and chicken upstream sequences showing the hepatic site C, the CCAAT box, and the TATA box]

  29. [Figure: phylogenetic tree with nodes labeled AAGCA, AAGCA, AAGCA, ACGCA, AAGCA]

  30. Substring Parsimony Problem Given: • orthologous upstream sequences S1,…Sn • phylogenetic tree T of the n species • size k of the motif, threshold d Problem: Find all sets of substrings s1,…sn of S1,…Sn , each of size k, such that the parsimony score of s1,…sn on T is at most d Blanchette, M

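  31. Parsimony Score • Parsimony score of s1, …, sn on tree T = minimum, over all possible labelings of internal nodes, of the sum over the edges (u, v) of T of d(l(u), l(v)) • l(v) – label of node v • d(l1, l2) – Hamming distance [Figure: tree T with leaves s1, s2, s3, s4, s5, s6] Blanchette, M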

  32. Substring Parsimony Problem: Example • S1: AAAGCATTC • S2: TACGCACCC • S3: GAAGCAGGG • k = 5, d = 1 [Figure: tree over S1, S2, S3 with nodes labeled AAGCA, AAGCA, AAGCA, ACGCA, AAGCA]
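The example can be checked by brute force. The sketch below tries every combination of k-mers and scores each with Fitch's small-parsimony algorithm applied column by column, which is valid because Hamming distance is a sum over positions. Brute force is a stand-in here, not the dynamic program presented in the lecture, and the tree topology ((S1, S2), S3) is an assumption since the figure is not preserved.

```python
from itertools import product

def fitch_column(tree, column):
    """Return (candidate base set, substitution count) for one column.
    A tree is either a leaf index (int) or a pair of subtrees."""
    if isinstance(tree, int):
        return {column[tree]}, 0
    left, right = tree
    lset, lcost = fitch_column(left, column)
    rset, rcost = fitch_column(right, column)
    common = lset & rset
    if common:
        return common, lcost + rcost
    return lset | rset, lcost + rcost + 1

def parsimony_score(tree, substrings):
    """Sum of per-column Fitch scores."""
    return sum(fitch_column(tree, col)[1] for col in zip(*substrings))

def kmers(s, k):
    return [s[i:i + k] for i in range(len(s) - k + 1)]

def substring_parsimony(seqs, tree, k, d):
    """All k-mer combinations (one per sequence) with score at most d."""
    hits = []
    for combo in product(*(kmers(s, k) for s in seqs)):
        score = parsimony_score(tree, combo)
        if score <= d:
            hits.append((combo, score))
    return hits

seqs = ["AAAGCATTC", "TACGCACCC", "GAAGCAGGG"]
tree = ((0, 1), 2)   # assumed topology: S1 and S2 are siblings
print(substring_parsimony(seqs, tree, k=5, d=1))
# [(('AAGCA', 'ACGCA', 'AAGCA'), 1)]
```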

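  33. Algorithm: version I • Root the tree at an arbitrary internal node r • Compute a table Wu of size 4^k for each node u, where Wu[s] is the best parsimony score for the subtree rooted at u when u is labeled with s • Recursion: Wu[s] = sum over children v of u of min over t of (Wv[t] + d(s, t)); at a leaf, Wu[s] = 0 if s is a substring of the leaf's sequence and infinity otherwise • Direct implementation of this recursion takes O(n∙k∙(4^(2k) + l)) time, where l is the average sequence length Blanchette, M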
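A direct implementation of the W tables can be sketched for small k. The nested-tuple tree encoding and the topology ((S1, S2), S3) are assumptions for illustration; the sequences are those of slide 32, with k = 4 to keep the 4^(2k) work per edge small.

```python
import math
from itertools import product

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def leaf_table(seq, k, words):
    """W at a leaf: 0 for k-mers present in the sequence, else infinity."""
    present = {seq[i:i + k] for i in range(len(seq) - k + 1)}
    return {w: (0 if w in present else math.inf) for w in words}

def node_table(tree, k, words):
    """W_u[s] = sum over children v of min over t of (W_v[t] + d(s, t)).
    A tree is either a sequence (leaf) or a tuple of subtrees."""
    if isinstance(tree, str):
        return leaf_table(tree, k, words)
    table = dict.fromkeys(words, 0)
    for child in tree:
        child_w = node_table(child, k, words)
        finite = [(t, c) for t, c in child_w.items() if c < math.inf]
        for s in words:
            table[s] += min(c + hamming(s, t) for t, c in finite)
    return table

k = 4
words = ["".join(w) for w in product("ACGT", repeat=k)]
tree = (("AAAGCATTC", "TACGCACCC"), "GAAGCAGGG")  # assumed topology
w_root = node_table(tree, k, words)
best = min(w_root.values())
print(best, sorted(s for s, v in w_root.items() if v == best))
```

The minimum entry of the root table is the best achievable parsimony score; recovering the substrings that achieve it needs a traceback, which this sketch leaves out.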

  34. Algorithm: version II • Define X(u, v)[s] – best parsimony score for the subtree consisting of edge (u, v) and the subtree rooted at v, when u is labeled s [Figure: node u labeled s, with children w and v] Blanchette, M

  35. Algorithm: version II (continued) • Update X(u, v) in phases: in phase p, maintain the set Bp of sequences t such that X(u, v)[t] = p • Define: • Ra = {s: Wv[s] = a} • N(s) = {t in ∑^k: d(s, t) = 1} • Start in phase m and let Bm = Rm • Update: B(p+1) = R(p+1) ∪ {t in N(s): s in Bp, t not in any Bj with j ≤ p} • Computation of X(u, v) takes O(k∙4^k) Blanchette, M
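The phase-by-phase computation of X(u, v) amounts to a breadth-first search over the Hamming graph that starts from the lowest-scoring strings. The sketch below assumes X(u, v)[s] = min over t of (Wv[t] + d(s, t)), integer scores, and a table with an entry for every string of length k; the leaf table used as input is hypothetical.

```python
import math
from itertools import product

ALPHABET = "ACGT"

def neighbors(s):
    """N(s): all strings at Hamming distance exactly 1 from s."""
    for i, c in enumerate(s):
        for b in ALPHABET:
            if b != c:
                yield s[:i] + b + s[i + 1:]

def edge_table(w_v):
    """Compute X[s] = min over t of (W_v[t] + d(s, t)) phase by phase."""
    by_score = {}                     # R_a: strings whose W_v score is a
    for s, a in w_v.items():
        if a != math.inf:
            by_score.setdefault(a, []).append(s)
    x = {}
    p = min(by_score)                 # first phase m: smallest finite score
    frontier = []                     # unassigned neighbors of B_(p-1)
    while len(x) < len(w_v):
        b_p = []                      # B_p: strings with X value exactly p
        for t in by_score.get(p, []) + frontier:
            if t not in x:
                x[t] = p
                b_p.append(t)
        frontier = [t for s in b_p for t in neighbors(s) if t not in x]
        p += 1
    return x

# Hypothetical leaf table for k = 3 built from the sequence of slide 32.
k = 3
leaf = "AAAGCATTC"
present = {leaf[i:i + k] for i in range(len(leaf) - k + 1)}
w_leaf = {"".join(w): (0 if "".join(w) in present else math.inf)
          for w in product(ALPHABET, repeat=k)}
x_leaf = edge_table(w_leaf)
print(x_leaf["GCA"], x_leaf["GGG"])
```

Each string is assigned once and has 3k neighbors, which is where the O(k∙4^k) bound comes from, compared with 4^(2k) for the direct minimization.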

  36. Improvements • Reduce the size of Bp when sequences contribute to X(u, v) greater than threshold d • In phase p, only care for sequence X(u, v)[s] if … • Leads to significant reductions in stages d/2 … d • Reduce the number of substrings inserted in W at the leaves • For substring s of Si, if its best match against any Sj has Hamming distance at least d, s can be discarded Blanchette, M

  37. Results • Practical limit on k = 10 • There appeared to be a threshold d0 with very few solutions below and many above • Algorithm found ~80% of known binding sites • Performed better than ClustalW, MEME, Consensus Blanchette, M

  38. Recent Development • Random Projection • Phylogenetic Footprinting • Reducer

  39. Reducer (Bussemaker et al., 2001) • Links motif finding to expression level • Ag = C + Σ Fu Nug, summed over motifs u = 1, …, M • Ag: expression level of gene g (logarithm of expression ratio) • M: number of significant motifs • Nug: number of occurrences of motif u in gene g • C: baseline expression level (same for all genes) • Fu: increase/decrease of expression level caused by the presence of motif u

  40. Reducer (Cont’d) Liu, X

  41. Reducer (Cont’d) • Normalize expression (A) and motif (n) vectors • Linear regression between A vector and every n vector to find the best fit n to A • Step-wise regression to combine effects of motifs • Subtract the effect of one motif • Find the next best motif Liu, X
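The step-wise procedure can be sketched in plain Python. This is a simplified illustration, not the published implementation: "normalize" is taken to mean centering to mean zero and scaling to unit length, the best motif is the one with the largest absolute fit, and the expression values and motif counts are made up.

```python
import math

def normalize(v):
    """Center to mean 0 and scale to unit length (fails on constant v)."""
    mean = sum(v) / len(v)
    centered = [x - mean for x in v]
    norm = math.sqrt(sum(x * x for x in centered))
    return [x / norm for x in centered]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def stepwise_fit(expression, motif_counts, n_motifs):
    """Greedily pick motifs; returns a list of (motif, F) pairs."""
    residual = normalize(expression)
    vectors = {m: normalize(c) for m, c in motif_counts.items()}
    selected = []
    for _ in range(n_motifs):
        chosen = {m for m, _ in selected}
        # For unit-length centered vectors the least-squares slope F of
        # the residual on a motif vector is their dot product.
        best = max((m for m in vectors if m not in chosen),
                   key=lambda m: abs(dot(residual, vectors[m])))
        f = dot(residual, vectors[best])
        # Subtract the effect of this motif before looking for the next.
        residual = [r - f * x for r, x in zip(residual, vectors[best])]
        selected.append((best, f))
    return selected

# Hypothetical data: six genes, two motifs.
motif_counts = {"motifA": [2, 1, 0, 0, 1, 0], "motifB": [0, 0, 1, 2, 0, 1]}
expression = [1.6, 0.8, -0.5, -1.0, 0.8, -0.5]
print(stepwise_fit(expression, motif_counts, n_motifs=2))
```

In this toy data motifA is picked first with a positive F (associated with higher expression) and motifB second with a negative F.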

  42. Acknowledgement • People from whom I borrowed slides: • Xiaole Liu (Reducer) • Olga Troyanskaya (Microarray) • Jeremy Buhler (Random projections) • Mathieu Blanchette (Phylogenetic footprinting) • Various web sources

  43. [Figure: microarray workflow. Labels: cDNA clones (probes); PCR product amplification; purification; printing (0.1 nl/spot); mRNA (target); hybridise target to microarray; excitation by laser 1 and laser 2; emission; scanning; overlay images and normalise; microarray analysis]

  44. Information Content of Motifs • Uncertainty: H = -Σ p(b) log2 p(b), summed over the bases b • Information = Hbefore - Hafter
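The information content of a motif column can be computed as follows. This is a sketch that assumes Hbefore is the entropy of a uniform background (2 bits for DNA) and Hafter is the entropy of the observed column; the aligned sites are those of slide 7.

```python
import math
from collections import Counter

def entropy(probs):
    """Shannon entropy in bits: H = -sum p * log2(p)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def column_information(column, background=(0.25, 0.25, 0.25, 0.25)):
    """Information = H_before - H_after for one alignment column."""
    counts = Counter(column)
    n = len(column)
    h_after = entropy(c / n for c in counts.values())
    h_before = entropy(background)
    return h_before - h_after

sites = ["TGACGCA", "TGACGCA", "AGACGCA", "TGACACA", "AGACGCA"]
for i, col in enumerate(zip(*sites), start=1):
    print(i, round(column_information(col), 3))
```

Fully conserved columns carry the full 2 bits, while the variable columns 1 and 5 carry less.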

  45. Improvements on the Original Gibbs Sampler • 0 to n copies of sites in each sequence • Iterative masking to find multiple motifs • Use higher-order Markov models to improve motif specificity

  46. Clinical Importance of Defects in Regulatory Elements • Burkitt’s lymphoma

  47. Statistical Methods • Expectation Maximization (EM) • MEME • Gibbs sampling • BioProspector • AlignACE

  48. Motifs are not limited to DNA • RNA motifs • RNA–RNA interaction motifs, e.g., intron-exon splice sites • RNA–protein interaction motifs, e.g., binding of proteins to the RNA polyA tail • Protein motifs • E.g., helix-turn-helix motif
