
Regulatory Motif Finding

Regulatory Motif Finding. Wenxiu Ma CS374 Presentation 11/03/2005. Outline. Regulation of genes Regulatory Motifs Motif Representation Current Motif Discovery Methods. Regulation of Genes. What turns genes on (producing a protein) and off? When is a gene turned on or off?


Presentation Transcript


  1. Regulatory Motif Finding Wenxiu Ma CS374 Presentation 11/03/2005

  2. Outline • Regulation of genes • Regulatory Motifs • Motif Representation • Current Motif Discovery Methods

  3. Regulation of Genes • What turns genes on (producing a protein) and off? • When is a gene turned on or off? • Where (in which cells) is a gene turned on? • How many copies of the gene product are produced?

  4. Overview of Gene Control • The mechanisms that control the expression of genes operate at many levels. source: Molecular Biology of the Cell (4th ed.), A. Johnson, et al.

  5. Transcriptional Regulation • The transcription of each gene is controlled by a regulatory region of DNA relatively near the transcription start site (TSS). • two types of fundamental components • short DNA regulatory elements • gene regulatory proteins that recognize and bind to them.

  6. Regulation of Genes Transcription Factor (Protein) RNA polymerase (Protein) DNA Gene Regulatory Element source: M. Tompa, U. of Washington

  7. Regulation of Genes Transcription Factor (Protein) RNA polymerase DNA Regulatory Element Gene source: M. Tompa, U. of Washington

  8. Regulation of Genes New protein RNA polymerase Transcription Factor DNA Regulatory Element Gene source: M. Tompa, U. of Washington

  9. Outline • Regulation of genes • Regulatory Motifs • Motif Representation • Current Motif Discovery Methods

  10. What is a motif? • A subsequence (substring) that occurs in multiple sequences with a biological importance. • Motifs can be totally constant or have variable elements. • Protein Motifs often result from structural features. • DNA Motifs (regulatory elements) • Binding sites for proteins • Short sequences (5-25) • Up to 1000 bp (or farther) from gene • Inexactly repeating patterns

  11. daf-19 Binding Sites in C. elegans • che-2: GTTGTCATGGTGAC • daf-19: GTTTCCATGGAAAC • osm-1: GCTACCATGGCAAC • osm-6: GTTACCATAGTAAC • F02D8.3: GTTTCCATGGTAAC • (figure axis: positions -150 to -1) source: Peter Swoboda

  12. Motif Representation • Consensus sequence: a single string with the most likely sequence (+/- wildcards) • Regular expression: a string with wildcards, constrained selection • Profile: a list of the letter frequencies at each position • Sequence Logo: • graphical depiction of a profile • conservation of elements in a motif.
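
To make the difference between a profile and a consensus sequence concrete, here is a minimal Python sketch that derives both from the five daf-19 binding sites listed on slide 11. The function names `profile` and `consensus` are illustrative choices, not from the presentation, and ties between equally frequent bases are broken arbitrarily (alphabetically).

```python
from collections import Counter

# The five daf-19 binding sites shown on slide 11.
SITES = [
    "GTTGTCATGGTGAC",
    "GTTTCCATGGAAAC",
    "GCTACCATGGCAAC",
    "GTTACCATAGTAAC",
    "GTTTCCATGGTAAC",
]

BASES = "ACGT"


def profile(sites):
    """Return one {base: frequency} dict per alignment column."""
    columns = []
    for column in zip(*sites):
        counts = Counter(column)
        columns.append({b: counts[b] / len(sites) for b in BASES})
    return columns


def consensus(prof):
    """Pick the most frequent base per column (ties go to the first base in BASES)."""
    return "".join(max(BASES, key=lambda b: col[b]) for col in prof)


prof = profile(SITES)
cons = consensus(prof)
print(cons)
```

A regular-expression representation would instead keep every base seen in a column, for example `[AGT]` at the fourth position, which is exactly the information a single consensus letter throws away.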

  13. Motif Logos: an Example (http://www-lmmb.ncifcrf.gov/~toms/sequencelogo.html)

  14. Measure of Conservation • Relative heights of letters reflect their abundance in the alignment. • Total height = entropy-based measurement of conservation. • Entropy(i) = -SUM{ f(base, i) * log2[f(base, i)] } over all bases • Conservation(i) = 2 - Entropy(i) • Units of conservation = bits of information (hence the base-2 logarithm; 2 bits is the maximum for a 4-letter alphabet) • Entropy measures variability/disorder. • Highly conserved = low entropy = tall stack • Very variable = high entropy = low stack
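
The stack heights of a sequence logo can be computed directly from the column frequencies. The sketch below applies the entropy formula to the daf-19 sites from slide 11, using base-2 logarithms so the result is in bits. It is a simplified illustration: the small-sample correction that logo tools usually apply is left out, and the names `column_entropy` and `conservation` are assumptions made for this example.

```python
import math
from collections import Counter

# The five daf-19 binding sites shown on slide 11.
SITES = [
    "GTTGTCATGGTGAC",
    "GTTTCCATGGAAAC",
    "GCTACCATGGCAAC",
    "GTTACCATAGTAAC",
    "GTTTCCATGGTAAC",
]


def column_entropy(column):
    """Shannon entropy of one alignment column, in bits."""
    counts = Counter(column)
    total = len(column)
    # Bases with zero count contribute nothing (0 * log 0 is taken as 0).
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def conservation(sites):
    """Per-position conservation: 2 bits minus the column entropy."""
    return [2.0 - column_entropy(col) for col in zip(*sites)]


bits = conservation(SITES)
for position, value in enumerate(bits, start=1):
    print(f"position {position:2d}: {value:.2f} bits")
```

Fully conserved columns reach the maximum of 2 bits (a tall stack), while the variable fourth column, which mixes G, T and A, scores well under 1 bit.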

  15. Outline • Regulation of genes • Regulatory Motifs • Motif Representation • Current Motif Discovery Methods

  16. Finding Regulatory Motifs . . . Given a collection of genes with common expression, Find the (TF-binding) motif in common

  17. Identifying Motifs: Complications • We do not know the motif sequence • We do not know where it is located relative to the gene's start • Motifs can differ slightly from one gene to another • How to discern it from “random” motifs?

  18. Current Motif Discovery Methods • GOAL: comprehensive identification of all the regulatory motifs in genomes. • by overrepresentation • MEME, Gibbs sampling • by phylogenetic footprinting • Footprinter • Cross species comparative analysis • Combine structure information

  19. Motif Finding: Comparative Analysis • Systematic discovery of regulatory motifs in human promoters and 3' UTRs by comparison of several mammals. • Xie, X. et al., Nature (2005). • Identify motifs based on comparative analysis of human, mouse, rat and dog genomes • A systematic catalogue of human gene regulatory motifs • Short, functional sequences (6-10bp) used many times in a genome • Focus regions • Promoters • 3’ untranslated regions (3’ UTRs) • microRNAs (miRNAs) • post-transcriptional regulation

  20. Motif Discovery Procedure • Alignment of promoters & 3’ UTRs • Motif conservation score (MCS) • Measure the extent of excess conservation • “Highly conserved motifs” • MCS>6 • Clustering

  21. Alignment of promoters & 3’ UTRs • Construct a whole-genome alignment for the four mammalian genomes • BLASTZ and MULTIZ • Extract the aligned promoter and 3’ UTR portions respectively. • Coordinates: the annotation of NCBI reference sequences (RefSeq)

  22. Motif Conservation Score (MCS) • Consensus sequence representation • Alphabet size: 11 (A,C,G,T,[AC], [AG], [AT], [CG], [CT], [GT], [ACGT]) • conserved occurrence of a motif m is an instance in which an exact match to this motif is found in all four species. • conservation rate p = ratio of conserved occurrences to total occurrences in human • Expected conservation rate p0 = avg. conservation rate of 100 random motifs, given same length and redundancy.
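
As a rough illustration of how conserved occurrences and the conservation rate p could be counted, the following Python sketch scans a toy four-species alignment. It makes several simplifying assumptions that are not from the paper: the alignment is gap-free, only the forward strand is scanned, and the degenerate alphabet is written as ordinary regular-expression character classes such as `[CT]`. The sequences and the function name `conservation_counts` are invented for the example.

```python
import re


def conservation_counts(motif, aligned):
    """Count conserved and total human occurrences of a motif.

    motif   : regex-style consensus, e.g. "TGAC[CT]TTG"
    aligned : dict of species -> aligned sequence (equal length, no gaps)
    """
    # A lookahead lets overlapping occurrences be found as well.
    scanner = re.compile(f"(?=({motif}))")
    exact = re.compile(motif)
    total = conserved = 0
    for match in scanner.finditer(aligned["human"]):
        start, end = match.start(1), match.end(1)
        total += 1
        # Conserved: the same aligned window matches in every species.
        if all(exact.fullmatch(seq[start:end]) for seq in aligned.values()):
            conserved += 1
    return conserved, total


# Toy alignment (hypothetical sequences, for illustration only).
alignment = {
    "human": "AATGACCTTGCCTGACCTTGAA",
    "mouse": "AATGACCTTGCCTGACGTTGAA",
    "rat":   "ATTGACCTTGCCTGACCTTGAA",
    "dog":   "AATGACCTTGGCTGACCTAGAA",
}

k, n = conservation_counts("TGACCTTG", alignment)
print(f"conserved {k} of {n}, p = {k / n:.2f}")
```

Here the first human occurrence is matched in all four species and the second is disrupted in mouse and dog, so one of two occurrences counts as conserved.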

  23. MCS • MCS = # of s.d. by which the observed conservation rate of a motif p exceeds the expected conservation rate p0. • p = k/n • Binomial probability of observing k out of n • Estimated by way of Normal approximation to the binomial Dist.

  24. Conservation Properties of Regulatory Motifs • Known 8-mer TGACCTTG • Conservation rate 37% (162 out of 434) • random rate 6.8% • MCS = 25.2 s.d. • Promoter Region • TRANSFAC: 446 motifs • MCS>3: 63% • MCS>5: ~50% • 3’ UTR • no database analogous to TRANSFAC • some known motifs
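
The TGACCTTG figures above can be checked against the MCS definition with a few lines of Python. This sketch assumes the standard deviation is taken under the null model, that is sqrt(n · p0 · (1 − p0)) from the normal approximation to the binomial. The exact variance convention used in the paper is not stated on the slides, so treat that as an assumption.

```python
import math


def motif_conservation_score(k, n, p0):
    """Number of standard deviations by which k conserved occurrences
    out of n exceed the expectation under random conservation rate p0."""
    expected = n * p0
    std_dev = math.sqrt(n * p0 * (1.0 - p0))
    return (k - expected) / std_dev


# Values from the slide: 162 conserved out of 434 occurrences, random rate 6.8%.
mcs = motif_conservation_score(k=162, n=434, p0=0.068)
print(f"MCS = {mcs:.1f} s.d.")
```

Under these assumptions the result comes out at roughly 25 standard deviations, in line with the 25.2 s.d. quoted on the slide; the small difference is plausibly due to rounding of p0.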

  25. Motif Discovery Procedure • Alignment of promoters & 3’ UTRs • Motif conservation score (MCS) • “Highly conserved motifs” • MCS>6 • Clustering

  26. Results: motifs in promoters • 174 highly conserved motifs • 59 strong matches to known motifs, 10 weaker matches. • 105 potential new regulatory motifs Xie, X. et al., Nature, 2005

  27. Results: motifs in 3’ UTRs • 106 highly conserved motifs • Two unusual properties • Strand specificity • Unusual length distribution

  28. Property 1: strand specificity Xie, X. et al., Nature, 2005

  29. Property 2: unusual length distribution Xie, X. et al., Nature, 2005

  30. Properties => miRNA • Strand specificity • 3’-UTR motifs act at the level of RNA rather than DNA • have a role in post-transcriptional regulation • Length distribution • Many mature miRNAs start with U followed by a 7-base “seed” complementary to a site in the 3’ UTR of target mRNAs. • Hypothesis: many of the highly conserved 8-mer motifs might be binding sites for conserved miRNAs.

  31. The microRNA pathway • pri-miRNA (5’ cap 7mG(5’)ppp(5’)G, 3’ tail nA…AAA) • Drosha / Pasha • pre-miRNA • Dicer • miR/miR* duplex • mature miRNA • miRNP Adapted from Tomari & Zamore, Curr Biol 2004

  32. Relationship with miRNA • 72 highly conserved 8-mer motifs • Contiguous, non-degenerate • ~46% of all 3’-UTR motifs • 207 distinct human miRNAs • From the current registry • Complementary matches • Exact match: ~43.5% • One mismatch: ~50% • 95% of matches begin at nucleotide 1 or 2 of the miRNA • 8-mer motifs represent target sites for miRNAs
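
The complementarity test between an 8-mer 3’-UTR motif and the 5’ end of a miRNA can be sketched as a reverse-complement comparison. The Python below handles only exact matches starting at miRNA nucleotide 1 or 2 (the one-mismatch case is left out), and the miRNA-like sequence and motifs are illustrative inputs rather than data from the paper. The function name `seed_match_start` is an assumption for this example.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Reverse complement of a DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def seed_match_start(motif, mirna):
    """Return 1 or 2 if the motif is exactly complementary to the miRNA
    starting at nucleotide 1 or 2 (1-based), otherwise None."""
    mirna_dna = mirna.upper().replace("U", "T")
    target = reverse_complement(motif.upper())
    for start in (1, 2):
        window = mirna_dna[start - 1:start - 1 + len(motif)]
        if window == target:
            return start
    return None


# Illustrative miRNA-like sequence (RNA alphabet) and candidate 8-mers.
mirna = "UGAGGUAGUAGGUUGUAUAGUU"
for motif in ("CTACCTCA", "ACTACCTC", "AAAAAAAA"):
    print(motif, seed_match_start(motif, mirna))
```

The motif is reverse-complemented because the target site in the mRNA pairs antiparallel to the miRNA, so reading the site 5’ to 3’ corresponds to reading the miRNA seed 3’ to 5’.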

  33. 8-mer motifs -> new miRNA genes • RNAfold program • 242 conserved and stable stem-loop sequences • 113 known, 129 potential new miRNAs • Biological validation • 12 selected new miRNA genes • 6 (50%) show clear expression in tissues.

  34. Prevalence of miRNA regulation • 20% of 3’ UTRs may be targets for conserved miRNA-based regulation at the 8-mer motifs. • Unbiased assessment of the relative importance of miRNA-based regulation in the human genome

  35. Summary: comparative genome analysis • 4 mammalian species • an initial systematic catalogue • Promoters • 3’ UTRs • Importance of the new miRNA regulatory mechanism • Future directions: • genome-wide discovery • more genome alignments: the primates

  36. Now… • Motif Finding Methods • Cross species comparative analysis • Combine structure information

  37. Motif Finding: Structural Knowledge • Ab initio prediction of transcription factor targets using structural knowledge • Kaplan T, et al., PLoS Comput Biol (2005) • Proposes a general framework for predicting DNA binding-site (BS) sequences of novel TFs from a known family • Structure-based approach • No prior TF binding data or target genes required • Family-wise probabilistic model • Context-specific amino acid-nucleotide recognition preferences

  38. Structure-based approach • Family-wise probabilistic model • Input: • pairs of TFs and their target DNA sequences • structural information • Output: Context-specific amino acid-nucleotide recognition preferences • Position specificity • Then, discover TFBSs of other TFs from the same family

  39. Cys2His2 Zinc Finger protein family • largest known DNA-binding family in multicellular organisms • common, strict binding models source: Molecular Biology of the Cell (4th ed.), A. Johnson, et al.

  40. Cys2His2 Zinc Finger: Canonical DNA binding model Residues at positions 6, 3, 2, and -1 (relative to the beginning of the α-helix) of each finger interact with adjacent nucleotides in the DNA molecule (interactions shown with arrows). Kaplan et al., PLoS Comput Biol, 2005

  41. Cys2His2 Zinc Finger: DNA Binding Model source: Molecular Biology of the Cell (4th ed.), A. Johnson, et al.

  42. Cys2His2 Zinc Finger: Compiling dataset • Goal: DNA-recognition preferences for each of the four key positions • every amino acid (AA) vs. every nucleotide (NT) • too few solved protein-DNA complexes • Known protein sequence data and their DNA targets • TRANSFAC: 455 protein-DNA pairs • Non-canonical model • Profile HMM • No exact binding locations • CX(2-4)CX(11-13)HX(3-5)H
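
The pattern CX(2-4)CX(11-13)HX(3-5)H translates directly into a regular expression, where X stands for any amino acid and the ranges become bounded repeats. The Python sketch below is only a simple stand-in for locating candidate fingers; the slides indicate a profile HMM was used for this purpose, which is more tolerant of variation than a fixed pattern. The test protein fragment is invented for illustration.

```python
import re

# CX(2-4)CX(11-13)HX(3-5)H, with "." standing for any residue X.
ZINC_FINGER = re.compile(r"C.{2,4}C.{11,13}H.{3,5}H")


def find_zinc_fingers(protein):
    """Return (start, end) spans of non-overlapping candidate fingers."""
    return [m.span() for m in ZINC_FINGER.finditer(protein)]


# Hypothetical fragment containing one Cys2His2-like finger.
fragment = "MK" + "C" + "PE" + "C" + "GKAFSRSDELTR" + "H" + "IRI" + "H" + "TG"
spans = find_zinc_fingers(fragment)
print(spans)
```

Once the fingers are located, the residues at the four key positions can be read off relative to the start of each finger's helix, which is the input the probabilistic model needs.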

  43. Profile HMM • build a model representing the consensus sequence for a family, rather than the sequence of any particular member • Find potential alignment for new sequences “Silent” deletion states Insertion states Match states

  44. Example: full profile HMM

  45. Structure-based approach • Input: set of pairs of TFs and their target DNA sequences • Output: Context-specific amino acid-nucleotide recognition preferences • Iterative Expectation Maximization (EM) algorithm

  46. Cys2His2 Zinc Finger: Probabilistic Model • Let A be the set of interacting residues at the 4 key positions of the k fingers • Let N1, …, NL be a target DNA sequence • The probability of an interaction starting from the jth position in the DNA is the product, over the interacting residue-nucleotide pairs, of the terms Pp(N|A) • where Pp(N|A) is the conditional probability of nucleotide N given amino acid A at position p. Kaplan et al., PLoS Comput Biol, 2005

  47. EM algorithm • Iterative EM algorithm • Exact binding locations for all protein-DNA pairs • recognition preferences: Pp(N|A) • E-step • Compute expected posterior probability of binding locations, based on current preferences • M-step • Update DNA-recognition preferences to maximize the likelihood of current binding locations based on the distribution of possible binding locations in previous E-step • Local optima
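
The E-step and M-step above can be sketched in a few dozen lines of Python. This is a deliberately simplified illustration, not the authors' implementation: each TF is reduced to an ordered list of (position, amino acid) contacts, each contact is assumed to read exactly one consecutive nucleotide, every start position is a priori equally likely, and a small pseudocount keeps probabilities non-zero. All names (`site_likelihood`, `em_recognition_preferences`) and the toy data are assumptions made for the example.

```python
from collections import defaultdict

NUCLEOTIDES = "ACGT"


def uniform():
    return {n: 0.25 for n in NUCLEOTIDES}


def site_likelihood(contacts, dna, j, prefs):
    """Product of P_p(N | A) over all contacts for a site starting at index j."""
    prob = 1.0
    for i, (position, amino_acid) in enumerate(contacts):
        prob *= prefs[(position, amino_acid)][dna[j + i]]
    return prob


def em_recognition_preferences(pairs, n_iter=20, pseudocount=0.1):
    """pairs: list of (contacts, dna); returns {(position, aa): {nt: prob}}."""
    prefs = defaultdict(uniform)  # start from uniform preferences
    for _ in range(n_iter):
        counts = defaultdict(lambda: {n: pseudocount for n in NUCLEOTIDES})
        for contacts, dna in pairs:
            starts = range(len(dna) - len(contacts) + 1)
            # E-step: posterior over binding locations under current preferences.
            weights = [site_likelihood(contacts, dna, j, prefs) for j in starts]
            total = sum(weights)
            for j, weight in zip(starts, weights):
                posterior = weight / total
                # Accumulate expected residue-nucleotide counts for the M-step.
                for i, (position, amino_acid) in enumerate(contacts):
                    counts[(position, amino_acid)][dna[j + i]] += posterior
        # M-step: re-estimate preferences from the expected counts.
        prefs = defaultdict(uniform)
        for key, nt_counts in counts.items():
            norm = sum(nt_counts.values())
            prefs[key] = {n: nt_counts[n] / norm for n in NUCLEOTIDES}
    return prefs


# Toy data (hypothetical): contacts at positions 6, 3 and -1 of one finger.
toy_pairs = [
    ([("6", "R"), ("3", "E"), ("-1", "R")], "TTGCGTA"),
    ([("6", "R"), ("3", "N"), ("-1", "R")], "ACGAGTT"),
]
learned = em_recognition_preferences(toy_pairs)
print(learned[("6", "R")])
```

Because the procedure only climbs to a local optimum, as the slide notes, results can depend on the starting preferences; running from several initialisations and keeping the best likelihood is a common remedy.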

  48. Estimate DNA-recognition preferences Kaplan et al., PLoS Comput Biol, 2005

  49. Apply on TFs from the same family Kaplan et al., PLoS Comput Biol, 2005

  50. Evaluation • compatible with experimental results • 10-fold cross validation • genome-wide scan of Drosophila melanogaster • 29 canonical Cys2His2 TFs • GO enrichment of predicted target genes • 21 enriched with at least one GO term. • mRNA expression profile of target genes • 21 showed significant associations in at least one embryogenesis experiment.
