
cisGreedy Motif Finder for Cistematic



Presentation Transcript


  1. cisGreedy Motif Finder for Cistematic Sarah Aerni Mentors: Ali Mortazavi Barbara Wold

  2. cisGreedy • De novo motif finder that implements a greedy algorithm similar to the Consensus motif finder • Goal: provide an efficient algorithm, included in the Cistematic package, that performs comparably to Consensus and MEME

  3. Cistematic • Integrates visualization and refinement of motifs, and improves the performance of multiple motif finders, in a single package (Mortazavi, 2006)

  4. Cistematic • cisGreedy becomes part of the “Bottom Tier” • The motif finder would be included in the Cistematic package (avoids the need for complicated installations) Image: Ali Mortazavi

  5. What is a Motif? • cis-Regulatory elements • Transcription Factor Binding Sites (TFBS) • Binding by transcription factors may increase or decrease transcription of genes

  6. What is a Motif? • GAL4 in Yeast • Activator of galactose-induced genes (convert galactose to glucose) • Protein structure determines motif • DNA-protein interactions require certain bases at specified locations • Motif reflects homodimer structure

  7. What is a Motif? • cis-Regulatory elements • Transcription Factor Binding Sites (TFBS) • Binding by transcription factors may increase or decrease transcription of genes • Gene regulation is believed to be a major source of complexity • Plants may have more genes or larger genomes than humans: are they more complex? • Identification of cis-regulatory elements will help us understand gene regulatory networks (the bigger picture)

  8. How do we find motifs? • Hard to identify • Relatively short sequences (as small as 6 bases) • Many positions not well conserved • Factors improving identification • Usually localized in certain proximity of a gene (search within 3 kb upstream) • Some positions highly conserved • Use other data (Microarray?)

  9. Motif Finders • Greedy • Maximizes similarity of motifs from sequences through a greedy approach • Eliminates background modeling by using Cistematic package preprocessing steps • Improves speed • Prevents false negatives • Implements multiple models (ZOOPS, OOPS, TCM)

  10. Consensus Scoring • Uses an equation similar to log likelihood, called Information Content: $I = \sum_{j=1}^{L} \sum_{i \in A} f_{ij} \ln \frac{f_{ij}}{p_i}$, where $L$ is the number of columns in the matrix, $A = \{A, C, G, T\}$ is the alphabet, $f_{ij}$ is the frequency of each letter $i$ at each position $j$, and $p_i$ is the a priori probability of letter $i$ • Hertz, Gerald Z., and Gary D. Stormo. "Identifying DNA and protein patterns with statistically significant alignments of multiple sequences." Bioinformatics 15.7/8 (1999): 563-577.
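As a rough illustration of the Information Content score described on this slide, the sketch below evaluates it for a set of equal-length aligned sites. The function name `information_content`, the uniform default background, and the convention of skipping letters with zero frequency are assumptions made for this example; they are not taken from Consensus or Cistematic.

```python
import math
from collections import Counter

def information_content(sites, background=None):
    """Sum over columns j and letters i of f_ij * ln(f_ij / p_i)."""
    if background is None:
        # Assumption: uniform a priori probabilities for A, C, G, T.
        background = {base: 0.25 for base in "ACGT"}
    n_sites = len(sites)
    total = 0.0
    for j in range(len(sites[0])):
        # Count each letter in column j; unseen letters contribute 0.
        counts = Counter(site[j] for site in sites)
        for base, count in counts.items():
            f_ij = count / n_sites
            total += f_ij * math.log(f_ij / background[base])
    return total
```

A fully conserved column contributes ln 4 under the uniform background, while a column in which all four letters are equally frequent contributes 0.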

  11. Removing Background • Goal of a background model: differentiate noise from signal • Issues with background: • What background should be used? • Whole genome? Conserved regions? • Selective pressures maintain conserved regions • Arguably searching in conserved regions guarantees there is little noise (it has been maintained) • Solution: • Search in conserved regions • Use simple repeat masking • Sequences which reoccur are likely TFBS

  12. cisGreedy scoring • Scoring focuses on maximizing number of identical bases • Percent identity is dependent on number of deviations from the strict consensus • Background adds complexity that may lead to false negatives

  13. cisGreedy • Input sequences are analyzed • Randomly select 2 sequences to be compared

  14. cisGreedy • The two selected sequences are analyzed independently of the remaining sequences

  15. cisGreedy • The two selected sequences are analyzed independently of the remaining sequences • Windows of motif size are scanned starting at the beginning of each sequence

  16. cisGreedy • Sequences are scanned in an attempt to locate the highest scoring alignment • Alignments are ungapped • Score is established as the number of sequences containing the most frequently occurring base at each position
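The scoring and scanning steps above can be sketched as follows. This is a minimal illustration, assuming an exhaustive comparison of every pair of windows and a score summed over columns; `alignment_score` and `best_pair` are hypothetical names, not the cisGreedy API.

```python
from collections import Counter

def alignment_score(windows):
    """For each column, count the sequences that carry the most
    frequent base, then sum those counts over all columns."""
    return sum(Counter(column).most_common(1)[0][1]
               for column in zip(*windows))

def best_pair(seq_a, seq_b, width):
    """Scan all ungapped windows of the given width in two sequences
    and return (score, start_a, start_b) of the first top-scoring pair."""
    best = (-1, 0, 0)
    for i in range(len(seq_a) - width + 1):
        for j in range(len(seq_b) - width + 1):
            score = alignment_score([seq_a[i:i + width],
                                     seq_b[j:j + width]])
            if score > best[0]:  # strict ">" keeps the earliest tie
                best = (score, i, j)
    return best
```

For two sequences, a matching column scores 2 and a mismatching column scores 1, so identical windows of width 4 reach the maximum score of 8.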

  17. cisGreedy • Reverse Complements are analyzed (user specified) • Once start locations are established with a top alignment score, these are left unchanged (Greedy)
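Analyzing reverse complements amounts to scanning a second copy of each sequence. A minimal sketch, using only the standard library (the helper name `reverse_complement` is chosen for illustration):

```python
# Translation table mapping each base to its complement.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq):
    """Complement every base, then reverse the string."""
    return seq.translate(COMPLEMENT)[::-1]
```

For instance, `reverse_complement("AAATTGGCTTCCTCAAA")` gives `"TTTGAGGAAGCCAATTT"`, the pair of strands reported for AIY later in this presentation.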

  18. cisGreedy • Select an additional sequence in which to identify the location of the motif • Windows in the additional sequence are aligned to previously established windows (Greedy)

  19. cisGreedy • Additional sequence scanned as before, reverse complement (user specified) • Alignment score established as before
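The greedy extension step might look like the sketch below: the windows already chosen stay fixed, and only the new sequence is scanned. The function names, the return tuple, and the choice to report reverse-strand starts in reverse-complement coordinates are all assumptions for illustration.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def column_score(windows):
    """Sum over columns of the count of the most frequent base."""
    return sum(Counter(col).most_common(1)[0][1] for col in zip(*windows))

def add_sequence(fixed_windows, new_seq, both_strands=False):
    """Pick the window of new_seq that best aligns to the fixed windows.

    Returns (score, strand, start, window). For strand "-", start is an
    offset into the reverse complement of new_seq (an assumed convention).
    """
    width = len(fixed_windows[0])
    candidates = [("+", new_seq)]
    if both_strands:  # user-specified option
        candidates.append(("-", new_seq.translate(COMPLEMENT)[::-1]))
    best = None
    for strand, seq in candidates:
        for start in range(len(seq) - width + 1):
            window = seq[start:start + width]
            # Previously established windows are never moved (greedy).
            score = column_score(fixed_windows + [window])
            if best is None or score > best[0]:
                best = (score, strand, start, window)
    return best
```

Because earlier choices are frozen, each added sequence costs only one linear scan, at the price of possibly missing a better global alignment.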

  20. cisGreedy • Final motif locations are used in order to build position specific frequency matrices • Reverse complement sequence used in building PSFM if used
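Once the final windows are known, a position specific frequency matrix can be built by counting letters per column. A minimal sketch, assuming the sites are already oriented on the same strand and that the matrix is stored as one dictionary per position (`build_psfm` is a hypothetical name, not the Cistematic API):

```python
def build_psfm(sites, alphabet="ACGT"):
    """Return one {base: frequency} dictionary per motif position."""
    n_sites = len(sites)
    return [
        {base: sum(site[j] == base for site in sites) / n_sites
         for base in alphabet}
        for j in range(len(sites[0]))
    ]
```

A site found on the reverse strand would be reverse complemented before being passed in, so that all rows describe the same orientation of the motif.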

  21. Testing cisGreedy • AIY • 16bp cis-regulatory motif drives expression • Experimentally verified • Gene battery consists of a set of genes bound by AIY • Orthologous genes contain highly specified binding sites • Individual binding sites of battery genes within a single species can vary considerably (Wenick and Hobert 759)

  22. Cistematic Results for AIY [Figure: regions of conservation in orthologous genes; hen-1]

  23. Results for AIY AIY Identified AAATTGGCTTCCTCAAA cisGreedy TTTGAGGAAGCCAATTT (reverse comp) AAATTGGCTTCCTCAAA meme AAATTGGCTTCCTCAAA AIY- Battery Consensus

  24. Cistematic Results for AIY hen-1

  25. Results for AIY hen-1

  26. Tompa Bakeoff • 3 benchmark datasets • Real • Markov Chain • Generic • 4 organisms • Human • Mouse • Fruitfly • Yeast • Each dataset contains 0-1 motifs. • Each sequence can have 0 or multiple motifs • Report 0-1 motif per dataset and locations of motifs • Use statistical tools provided by bakeoff to analyze runs

  27. Bakeoff example (hm03) • Identify the most reasonable motif in each dataset independently

  28. Real • Interesting pattern appears in 3 of 10 sequences

  29. Markov

  30. Generic

  31. Bakeoff example (hm03) • Identify the most reasonable motif in each dataset independently • Determine which motif appears most reasonable across the 3 benchmarks and map the motif in the sequences using Cistematic • Compare results to the actual locations (provided in the bakeoff package)

  32. Solution

  33. Real

  34. Solution

  35. Markov

  36. Bakeoff results • Correlation coefficient: $nCC = \frac{nTP \cdot nTN - nFN \cdot nFP}{\sqrt{(nTP+nFN)(nTN+nFP)(nTP+nFP)(nTN+nFN)}}$ • Sensitivity (fraction of known sites that are predicted): $sSn = \frac{sTP}{sTP + sFN}$ • Positive predictive value (fraction of predicted sites that are known): $sPPV = \frac{sTP}{sTP + sFP}$
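The three statistics above reduce to simple arithmetic on true/false positive and negative counts. The sketch below is illustrative only: the function names are hypothetical, and returning 0.0 when a denominator is zero is an assumption of this example rather than the bakeoff's documented behavior.

```python
import math

def correlation_coefficient(tp, tn, fp, fn):
    """nCC = (TP*TN - FN*FP) / sqrt((TP+FN)(TN+FP)(TP+FP)(TN+FN))."""
    denom = math.sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    # Assumption: report 0.0 when the statistic is undefined.
    return (tp * tn - fn * fp) / denom if denom else 0.0

def sensitivity(tp, fn):
    """Fraction of known sites that are predicted."""
    return tp / (tp + fn) if (tp + fn) else 0.0

def positive_predictive_value(tp, fp):
    """Fraction of predicted sites that are known."""
    return tp / (tp + fp) if (tp + fp) else 0.0
```

With 10 true positives, 80 true negatives, 5 false positives, and 5 false negatives, the correlation coefficient is 775 / 1275, or roughly 0.61.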

  37. Bakeoff results • cisGreedy overall 7th best performer (excluding those with no data): • Overall top performer in fly • Worst performer in yeast • 3rd worst performer in mouse • 4th best performer in human

  38. Bakeoff results • When running programs in parallel, correlation of motif finder results to true binding sites improves (adapted from Tompa, 2005)

  39. Future goals • Complete analysis of results for cisGreedy using benchmarks established by Tompa paper (Nature Biotech, 2005) • Document results and algorithm development • Continue improving cisGreedy

  40. References • Bioalgorithms.info • Jones, Neil C., and Pavel A. Pevzner. An Introduction to Bioinformatics Algorithms. MIT Press, 2004. • Hertz, Gerald Z., and Gary D. Stormo. "Identifying DNA and protein patterns with statistically significant alignments of multiple sequences." Bioinformatics 15.7/8 (1999): 563-577. • Tompa, Martin, et al. "Assessing computational tools for the discovery of transcription factor binding sites." Nature Biotechnology, January 2005: 137-144. • Wenick, Adam S., and Oliver Hobert. "Genomic cis-Regulatory Architecture and trans-Acting Regulators of a Single Interneuron-Specific Gene Battery in C. elegans." Developmental Cell 6 (2004): 757-770. • http://cistematic.caltech.edu

  41. Acknowledgements • Ali Mortazavi • Barbara Wold • Wold Lab funding provided by DOE & NASA • Additional funding by NSF & NIH • SoCalBSI faculty, staff and fellow students
