
Computational Genomics and Proteomics


Presentation Transcript


  1. Center for Integrative Bioinformatics VU • Computational Genomics and Proteomics • Lecture 8: Motif Discovery

  2. Outline • Gene Regulation • DNA • Transcription factors • Motifs • What are they? • Binding Sites • Combinatoric Approaches • Exhaustive searches • Consensus • Comparative Genomics • Example • Probabilistic Approaches • Statistics • EM algorithm • Gibbs Sampling

  3. www.accessexcellence.org

  4. www.accessexcellence.org

  5. www.accessexcellence.org

  6. Four DNA nucleotide building blocks • G-C is more strongly hydrogen-bonded than A-T

  7. Degenerate code • Four bases: A, C, G, T • Two-fold degenerate IUB codes: • R=[AG] -- Purines • Y=[CT] -- Pyrimidines • K=[GT] • M=[AC] • S=[GC] • W=[AT] • Four-fold degenerate: N=[AGCT]
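The degenerate codes above map naturally onto regular-expression character classes. The following is a minimal sketch of that idea; the helper names `iub_to_regex` and `find_sites` are hypothetical, and the table covers only the codes listed on the slide.

```python
import re

# IUB codes exactly as listed on the slide, written as regex character classes.
IUB = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "K": "[GT]",
    "M": "[AC]", "S": "[GC]", "W": "[AT]",
    "N": "[ACGT]",
}

def iub_to_regex(pattern):
    """Translate a degenerate IUB pattern into a regular-expression string."""
    return "".join(IUB[ch] for ch in pattern.upper())

def find_sites(pattern, sequence):
    """Return 0-based start positions of all matches, including overlapping ones."""
    # A lookahead makes each match zero-width, so overlapping sites are reported.
    regex = re.compile("(?=" + iub_to_regex(pattern) + ")")
    return [m.start() for m in regex.finditer(sequence.upper())]

if __name__ == "__main__":
    print(iub_to_regex("TATRNT"))
    print(find_sites("TATRNT", "GGTATAATCCTATGTTAA"))
```

The example sequence passed to `find_sites` is made up for illustration.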

  8. Transcription Factors • Required but not a part of the RNA polymerase complex • Many different roles in gene regulation • Binding • Interaction • Initiation • Enhancing • Repressing • Various structural classes (e.g. zinc finger domains) • Consist of both a DNA-binding domain and an interactive domain

  9. Motifs • Short sequences of DNA or RNA (or amino acids) • Often consist of 5-16 nucleotides • May contain gaps • Examples include: • Splice sites • Start/stop codons • Transmembrane domains • Centromeres • Phosphorylation sites • Coiled-coil domains • Transcription factor binding sites (TFBS – regulatory motifs)

  10. TFBSs • Difficult to identify • Each transcription factor may have more than one binding site • Degenerate • Most occur upstream of the transcription start site (TSS) but are known to also occur in: • introns • exons • 3’ UTRs • Usually occur in clusters, i.e. collections of sites within a region (modules) • Often repeated • Sites can be experimentally verified

  11. Why are TFBSs important? • Aid in identification of gene networks/pathways • Determine correct network structure • Drug discovery • Switch production of gene product on/off • [Figure: Gene A, Gene B]

  12. Consensus sequences • Match all of the example sequences closely but not exactly • A single site: TACGAT • A set of sites: TACGAT, TATAAT, TATAAT, GATACT, TATGAT, TATGTT • Consensus sequence: TATAAT or TATRNT • Trade-off between the number of mismatches allowed, the ambiguity in the consensus sequence, and the sensitivity and precision of the representation.
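As an illustration of how a consensus such as TATAAT can be derived from the six sites on this slide, here is a small sketch that takes the most frequent base in each column. The alphabetical tie-breaking rule is an assumption made for this example; the tie at position 4 (A and G, three each) is the position the slide's degenerate form writes as R.

```python
from collections import Counter

# The six example sites from the slide.
sites = ["TACGAT", "TATAAT", "TATAAT", "GATACT", "TATGAT", "TATGTT"]

def majority_consensus(sites):
    """Most frequent base per column; ties are broken alphabetically (an assumption)."""
    consensus = []
    for column in zip(*sites):
        counts = Counter(column)
        best = max(counts.values())
        consensus.append(min(b for b, c in counts.items() if c == best))
    return "".join(consensus)

def mismatches(a, b):
    """Number of positions at which two equal-length words differ."""
    return sum(x != y for x, y in zip(a, b))

if __name__ == "__main__":
    cons = majority_consensus(sites)
    print(cons)
    for site in sites:
        print(site, mismatches(site, cons))
```

Allowing more mismatches against the consensus recovers more of the true sites but also admits more false hits, which is the trade-off the slide describes.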

  13. Information Content and Entropy
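The slide body is not preserved in this transcript. As a reminder of the usual definitions (stated here as the standard ones, not necessarily the exact form the lecture used), the entropy and information content of column $i$ of a DNA motif with base frequencies $f_{b,i}$ can be written as:

```latex
H_i = -\sum_{b \in \{A,C,G,T\}} f_{b,i} \log_2 f_{b,i},
\qquad
I_i = \log_2 4 - H_i = 2 - H_i \ \text{bits}
```

with the convention $0 \log_2 0 = 0$. For the six example sites TACGAT, TATAAT, TATAAT, GATACT, TATGAT, TATGTT, column 2 is always A, so $H_2 = 0$ and $I_2 = 2$ bits; column 4 is half A and half G, so $H_4 = 1$ and $I_4 = 1$ bit.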

  14. Sequence Logos

  15. Frequency Matrices • Given a collection of motifs: TACGAT, TATAAT, TATAAT, GATACT, TATGAT, TATGTT • Create the matrix with one row per base (T, A, C, G) and one column per position
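The matrix itself did not survive in this transcript, but it can be recomputed from the six sites listed. The sketch below counts each base at each position and then converts counts to relative frequencies; the function names are hypothetical and the row order T, A, C, G follows the slide.

```python
# The six example sites from the slide.
sites = ["TACGAT", "TATAAT", "TATAAT", "GATACT", "TATGAT", "TATGTT"]
BASES = "TACG"  # row order as given on the slide

def count_matrix(sites):
    """Count how often each base occurs at each position of the aligned sites."""
    width = len(sites[0])
    counts = {b: [0] * width for b in BASES}
    for site in sites:
        for i, base in enumerate(site):
            counts[base][i] += 1
    return counts

def frequency_matrix(counts):
    """Divide every count by the number of sites to obtain relative frequencies."""
    total = sum(counts[b][0] for b in BASES)
    return {b: [c / total for c in counts[b]] for b in BASES}

if __name__ == "__main__":
    counts = count_matrix(sites)
    for base in BASES:
        print(base, counts[base])
```

Every column of the count matrix sums to the number of sites (six here), which is a quick sanity check on the alignment.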

  16. Position weight matrices
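The slide body is not preserved here. One common way to build a position weight matrix is to turn column frequencies into log-odds scores against a background model and then slide the matrix along a sequence. The sketch below assumes a pseudocount of 1 and a uniform background of 0.25 per base; both are illustrative choices, not values taken from the lecture, and the scanned sequence is made up.

```python
import math

# The six example sites from the frequency-matrix slide.
sites = ["TACGAT", "TATAAT", "TATAAT", "GATACT", "TATGAT", "TATGTT"]
BASES = "ACGT"

def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Log2-odds score per base and position (pseudocount and background are assumptions)."""
    n = len(sites)
    pwm = []
    for i in range(len(sites[0])):
        column = [s[i] for s in sites]
        scores = {}
        for b in BASES:
            p = (column.count(b) + pseudocount) / (n + pseudocount * len(BASES))
            scores[b] = math.log2(p / background)
        pwm.append(scores)
    return pwm

def score(pwm, word):
    """Sum the per-position log-odds scores of a word as long as the matrix."""
    return sum(col[b] for col, b in zip(pwm, word))

def best_hit(pwm, sequence):
    """Start position of the highest-scoring window in the sequence."""
    w = len(pwm)
    starts = range(len(sequence) - w + 1)
    return max(starts, key=lambda i: score(pwm, sequence[i:i + w]))

if __name__ == "__main__":
    pwm = build_pwm(sites)
    print(best_hit(pwm, "CCGCGTATAATGCGC"))
```

The pseudocount keeps unseen bases from receiving a score of minus infinity, so a single unexpected base lowers a window's score without ruling it out entirely.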

  17. Finding Motifs • Two problems: • Given a collection of known motifs, develop a representation of the motifs such that additional occurrences can reliably be identified in new promoter regions • Given a collection of genes thought to be related somehow, find the location of the motif common to all and a representation for it. • Two approaches: • Combinatorial • Probabilistic

  18. Combinatorial Approach

  19. Exhaustive Search

  20. Exhaustive Search • Sample-driven here refers to trying all the words as they occur in the sequences, instead of trying all possible (4^W) words exhaustively
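To make the difference concrete, the sketch below enumerates only the words of width W that actually occur in the input (sample-driven) and compares their number with the 4^W words a pattern-driven search would try. The three toy sequences are invented for this example, and only exact matches are counted; a real search would also allow mismatches.

```python
from itertools import product

def sample_driven_words(sequences, w):
    """Map each length-w word seen in the input to the number of sequences containing it."""
    support = {}
    for seq in sequences:
        # A set, so that a word repeated within one sequence is counted once.
        seen = {seq[i:i + w] for i in range(len(seq) - w + 1)}
        for word in seen:
            support[word] = support.get(word, 0) + 1
    return support

def pattern_driven_words(w, alphabet="ACGT"):
    """Generate all 4^w possible words of length w."""
    return ("".join(p) for p in product(alphabet, repeat=w))

if __name__ == "__main__":
    # Hypothetical toy sequences, each containing TATAAT.
    sequences = ["GCTATAATGC", "ATATAATCCG", "CGGTATAATA"]
    support = sample_driven_words(sequences, 6)
    shared = [word for word, n in support.items() if n == len(sequences)]
    print(len(support), "candidate words instead of", 4 ** 6)
    print(shared)
```

The number of sample-driven candidates grows with the total input length rather than exponentially with W, which is what makes longer motif widths tractable.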

  21. Greedy Motif Clustering

  22. Greedy Motif Clustering

  23. Greedy Motif Clustering

  24. Comparative Genomics • Main Idea: Conserved non-coding regions are important • Align the promoters of orthologous co-expressed genes from two (or more) species, e.g. human and mouse • Search for TFBS only in conserved regions • Problems: • Not all regulatory regions are conserved • Which genomes to use?

  25. Phylogenetic Footprinting • Phylogenetic footprinting refers to the task of finding conserved motifs across different species. Common ancestry and selection on these motifs have resulted in these “footprints”.

  26. Phylogenetic Footprinting: An Example • Xie et al. 2005 • Genome-wide alignments for four species (human, mouse, rat, dog) • Promoter regions and 3’ UTRs then extracted for 17,700 well-annotated genes • Promoter region taken to be (-2000, 2000) • This set of sequences then searched exhaustively for motifs • Nature 434, 338-345, 2005

  27. The Search Xie et al. 2005

  28. Expected Rate

  29. Probabilistic Approach

  30. Gibbs Sampling (applied to Motif Finding)

  31. Gibbs Sampling Algorithm
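The algorithm slide itself is not preserved in this transcript. The following is a minimal sketch of a generic Gibbs site sampler for motif finding, assuming exactly one site per sequence, a fixed motif width, a uniform background and a pseudocount of 1. It is not AlignACE and not necessarily the variant presented in the lecture; all function names are hypothetical.

```python
import random

BASES = "ACGT"

def build_profile(words, pseudocount=1.0):
    """Per-position base probabilities estimated from the current sites."""
    total = len(words) + pseudocount * len(BASES)
    profile = []
    for i in range(len(words[0])):
        column = [w[i] for w in words]
        profile.append({b: (column.count(b) + pseudocount) / total for b in BASES})
    return profile

def window_weight(profile, word, background=0.25):
    """Likelihood ratio of a window under the profile versus a uniform background."""
    weight = 1.0
    for col, base in zip(profile, word):
        weight *= col[base] / background
    return weight

def gibbs_sample(sequences, width, iterations=500, seed=0):
    """Return one sampled motif start position per sequence."""
    rng = random.Random(seed)
    positions = [rng.randrange(len(s) - width + 1) for s in sequences]
    for _ in range(iterations):
        held = rng.randrange(len(sequences))  # sequence left out this round
        others = [s[p:p + width]
                  for k, (s, p) in enumerate(zip(sequences, positions)) if k != held]
        profile = build_profile(others)
        seq = sequences[held]
        starts = list(range(len(seq) - width + 1))
        weights = [window_weight(profile, seq[i:i + width]) for i in starts]
        # Sample the new position in proportion to its weight, not greedily.
        positions[held] = rng.choices(starts, weights=weights)[0]
    return positions

if __name__ == "__main__":
    # Hypothetical toy sequences for illustration.
    seqs = ["GCTATAATGCGG", "ATATAATCCGCA", "CGGTATAATACC", "GGCCTATGATCG"]
    print(gibbs_sample(seqs, 6))
```

Because each new position is drawn at random rather than chosen as the best-scoring window, the sampler can leave poor local optima. In practice it is usually run several times from different starting positions.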

  32. Gibbs Sampling – Motif Positions

  33. AlignACE - Gibbs Sampling

  34. Remainder of the lecture: Maximum likelihood and the EM algorithm • The remaining slides are for your information only and will not be part of the exam

  35. Basic Statistics

  36. Maximum Likelihood Estimates

  37. EM Algorithm

  38. Basic idea (MEME) http://meme.nbcr.net/meme/meme-intro.html

  39. Basic idea (MEME) MEME is a tool for discovering motifs in a group of related DNA or protein sequences. A motif is a sequence pattern that occurs repeatedly in a group of related protein or DNA sequences. MEME represents motifs as position-dependent letter-probability matrices which describe the probability of each possible letter at each position in the pattern. Individual MEME motifs do not contain gaps. Patterns with variable-length gaps are split by MEME into two or more separate motifs. MEME takes as input a group of DNA or protein sequences (the training set) and outputs as many motifs as requested. MEME uses statistical modeling techniques to automatically choose the best width, number of occurrences, and description for each motif. http://meme.nbcr.net/meme/meme-intro.html

  40. Basic MEME Model

  41. MEME Background frequencies

  42. MEME – Hidden Variable

  43. MEME – Conditional Likelihood

  44. EM algorithm

  45. Example

  46. E-step of EM algorithm

  47. Example

  48. M-step of EM Algorithm

  49. Example

  50. Characteristics of EM
