Class: Motif Finding CS-67693, Spring 2005

School of Computer Science & Engineering, Hebrew University, Jerusalem
*A few slides were adapted and edited from www.cs.ucsb.edu/~ambuj/Courses/bioinformatics/motif%20finding.ppt
cbio course, spring 2005, Hebrew University

Background
• Basic dogma: information is coded in the genome
• Information includes:
  • Where the genes are coded, including:
    • Transcription start
    • UTR
    • Exons and introns
    • Alternative splicing

Eukaryotic Gene
Adapted in part from http://online.itp.ucsb.edu/online/infobio01/burge/

Background
• Basic dogma: information is coded in the genome
• Information includes:
  • Where the genes are coded, including:
    • Transcription start
    • UTR
    • Exons and introns
    • Alternative splicing
  • Functional units in proteins

Proteins: Local Structure Motifs
I-sites Library = a catalog of local sequence-structure correlations
Examples: diverging type-2 turn, frayed helix, type-I hairpin, serine hairpin, glycine helix N-cap, alpha-alpha corner, proline helix C-cap

Background
• Basic dogma: information is coded in the genome
• Information includes:
  • Where the genes are coded, including:
    • Transcription start
    • UTR
    • Exons and introns
    • Alternative splicing
  • Functional units in proteins
  • RNA family structure

RNA – Multiple Alignment + Structure
From: Biological Sequence Analysis; Durbin, Eddy, Krogh, Mitchison; Cambridge University Press, 1998

Background
• Basic dogma: information is coded in the genome
• Information includes:
  • Where the genes are coded, including:
    • Transcription start
    • UTR
    • Exons and introns
    • Alternative splicing
  • Functional units in proteins
  • RNA family structure
  • How to control which gene to turn on/off and when

Background
• In many cases, we can relate such functions to recurring “motifs” in the genome:
  • Splice/start/end site signals in coding genes
  • Binding sites of regulatory elements controlling transcription of nearby genes
  • A certain function of a protein “domain”
The definition of a sequence “motif” depends on the context!

Background
• Basic dogma: information is coded in the genome
• Information includes:
  • Where the genes are coded, including:
    • Transcription start
    • UTR
    • Exons and introns
    • Alternative splicing
  • Functional units in proteins
  • RNA family structure
  • How to control which gene to turn on/off and when
Future Classes

Regulation of Gene Expression
• Gene regulatory proteins bind to specific places (regulatory sites) on DNA. These sites are usually close to the gene.
[Figure labels: off / on, site, gene, regulatory protein]

Regulatory Sites
• Regulatory sites are sometimes divided into two types:
  • Promoter sites – usually upstream of a gene in non-translated (non-coding) regions. In some cases, these sites can be in exonic or intronic regions.
  • Enhancer sites – can be very far away (either upstream or downstream).
• Regulatory proteins recognize sites by conserved DNA patterns, which consist of a short stretch of “partially specific” nucleotide sequences.

lac operon in E. coli

Figure 13.16 The lac Operon of E. coli

Promoter…

Transcription Factor Binding Sites
We want to describe this site
Non-coding regions → gene regulation

Difficulty of Finding Regulatory Elements
• Regulatory sites are short (up to 30 nucleotides).
• Non-coding regions are very long (they include all regions that are not translated into proteins).
• Experiments to find regulatory sites are tedious and time-consuming. One approach is to mutate different combinations of nucleotides until functionality changes.
• We do not have a good understanding of what makes a site active, or how active, in terms of chemical/physical constraints.

Why Not Use Multiple Alignment?
• The motif is short and may appear at different locations in different sequences. Most other areas are random.
• Not all positions within a binding site should be treated in the same way, and usually we don’t know in advance how to treat them. Therefore a general scoring matrix is not adequate.
• The problem is further complicated because not every sequence contains a motif:
  • The upstream region used may not be long enough to include a regulatory site in every sequence.
  • Usually, potentially co-regulated genes are used to construct the sample, so we don’t know for sure whether all these genes are really co-regulated.

Computational Approach
• Identify a set of genes believed to be controlled by the same regulatory mechanism (co-regulated genes).
• Extract regulatory regions of the genes (usually upstream sequences) to form a sample of sequences.
• Find some way to identify “conserved” elements in these sequences, resulting in a list of potential regulatory sites.

How to Find Regulatory Sites
[Figure: a sample of five sequences, each showing a site and a gene]

Formulating Motif Finding Task
• Given a set of sequences, find a common motif shared by these sequences.
• Steps:
  • Construct a model of what we mean by a common motif.
  • Solve the problem within the model on simulated samples.
  • Evaluate performance on real-life biological samples.

Formulating Motif Finding Task (2)
• This means we need to define:
  • Input of the algorithm: this implicitly defines the assumptions we make about the problem (e.g. do we hold a different belief for each sequence that it belongs to the group?)
  • Type of “motif” class
  • Search algorithm: how do we search the space of possible motifs?
  • Scoring function: how do we score putative motifs?
  • Output of the algorithm: should it give us just putative sites, or a binding site model that can predict sites?
  • Evaluation technique: how do we test our algorithm?

Task Definition Example
• Given a sample of sequences and an unknown pattern (motif) that appears at different unknown positions in each sequence, can we find the unknown pattern?
• Input: a set of sequences, each one with an unknown pattern at an unknown position.
• Output: a set of starting positions of the pattern in each sequence.

Pattern == Subsequence
Subsequence = AAAAAAAAGGGGGGG
atgaccgggatactgatAAAAAAAAGGGGGGGggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccgacccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaataAAAAAAAAGGGGGGGatgagtatccctgggatgacttAAAAAAAAGGGGGGGtgctctcccgatttttgaatatgtaggatcattcgccagggtccgagctgagaattggatgAAAAAAAAGGGGGGGtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggagatcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatAAAAAAAAGGGGGGGcttataggtcaatcatgttcttgtgaatggatttAAAAAAAAGGGGGGGgaccgcttggcgcacccaaattcagtgtgggcgagcgcaacggttttggcccttgttagaggcccccgtAAAAAAAAGGGGGGGcaattatgagagagctaatctatcgcgtgcgtgttcataacttgagttAAAAAAAAGGGGGGGctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgtattggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatAAAAAAAAGGGGGGGaccgaaagggaagctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttAAAAAAAAGGGGGGGa

Pattern == (l,d)
• First formulated by Pevzner (ISMB 2000)
• Pattern = a subsequence of length l; each occurrence carries exactly d random mismatches
• The rest of each sequence is assumed random
• Assumes exactly one “true” occurrence of the motif in each sequence
atgaccgggatactgatAgAAgAAAGGttGGGggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccgacccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaatacAAtAAAAcGGcGGGatgagtatccctgggatgacttAAAAtAAtGGaGtGGtgctctcccgatttttgaatatgtaggatcattcgccagggtccgagctgagaattggatgcAAAAAAAGGGattGtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggagatcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatAtAAtAAAGGaaGGGcttataggtcaatcatgttcttgtgaatggatttAAcAAtAAGGGctGGgaccgcttggcgcacccaaattcagtgtgggcgagcgcaacggttttggcccttgttagaggcccccgtAtAAAcAAGGaGGGccaattatgagagagctaatctatcgcgtgcgtgttcat
All occurrences are variants of AAAAAAAAGGGGGGG. Two of them compared position by position:
AgAAgAAAGGttGGG
..|..|||.|..|||
cAAtAAAAcGGcGGG
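
The (l,d) model above can be simulated directly, which is useful for the "solve on simulated samples" step. The following is a minimal sketch under the slide's assumptions (uniform random background, exactly one occurrence per sequence, exactly d mismatches); the function names `hamming`, `mutate` and `plant` and the sample sizes are illustrative choices, not part of the original formulation.

```python
import random

def hamming(a, b):
    """Number of mismatching positions, ignoring case."""
    return sum(x.upper() != y.upper() for x, y in zip(a, b))

def mutate(pattern, d, rng):
    """Copy of pattern with exactly d positions changed (shown in lowercase)."""
    letters = list(pattern)
    for p in rng.sample(range(len(pattern)), d):
        letters[p] = rng.choice([c for c in "acgt" if c != letters[p].lower()])
    return "".join(letters)

def plant(pattern, d, n_seqs, seq_len, rng):
    """Random sequences, each holding one mutated copy of pattern."""
    sample, starts = [], []
    for _ in range(n_seqs):
        background = [rng.choice("acgt") for _ in range(seq_len)]
        start = rng.randrange(seq_len - len(pattern) + 1)
        background[start:start + len(pattern)] = mutate(pattern, d, rng)
        sample.append("".join(background))
        starts.append(start)
    return sample, starts

rng = random.Random(0)  # fixed seed so the sample is reproducible
pattern = "AAAAAAAAGGGGGGG"
sample, starts = plant(pattern, d=4, n_seqs=10, seq_len=60, rng=rng)
for seq, start in zip(sample, starts):
    site = seq[start:start + len(pattern)]
    print(start, site, hamming(site, pattern))
```

Note that two planted occurrences can differ from each other in up to 2d positions even though each is only d away from the pattern, which is part of what makes the problem hard.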

Formulating Motif Finding Task (2)
• We need to define:
  • Input of the algorithm: this implicitly defines the assumptions we make about the problem (e.g. do we hold a different belief for each sequence that it belongs to the group?)
  • Type of “motif” class
  • Search algorithm: how do we search the space of possible motifs?
  • Scoring function: how do we score putative motifs?
  • Output of the algorithm: should it give us just putative sites, or a binding site model that can predict sites?
  • Evaluation technique: how do we test our algorithm?
• Think:
  • How does the (l,d) problem define these?
  • How does it relate to “real” biology?

How to Define Motif Class?
• Subsequences: ACTCTT
• IUPAC alphabet: {A, C, G, T, R, Y, M, K, S, W, B, D, H, V, N} = all non-empty subsets of {A,C,G,T}
• PSSM / PWM (Position Specific Score Matrix or Position Weight Matrix)
• More general probabilistic/other models: e.g. using the Bayesian networks modeling language
• Refined definition based on prior knowledge:
  • Homo/hetero dimers
  • Variable gaps
  • Bias to some characteristic information profile (Van, 2003)
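
To make the IUPAC motif class concrete, here is a small sketch that maps each of the 15 IUPAC symbols to the set of bases it stands for and scans a sequence for matches. The helper names `matches` and `find_sites` are hypothetical; the symbol table itself is the standard IUPAC nucleotide code.

```python
# Standard IUPAC nucleotide codes: each symbol is a non-empty subset of ACGT.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "M": "AC", "K": "GT", "S": "CG", "W": "AT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def matches(motif, kmer):
    """True if every base of kmer is allowed by the IUPAC symbol at that position."""
    return len(motif) == len(kmer) and all(b in IUPAC[m] for m, b in zip(motif, kmer))

def find_sites(motif, seq):
    """Start positions of all windows of seq that match the IUPAC motif."""
    k = len(motif)
    return [i for i in range(len(seq) - k + 1) if matches(motif, seq[i:i + k])]

print(find_sites("RY", "ACGT"))
```

An IUPAC motif is a yes/no model: a window either matches or it does not. The PSSM class relaxes this by giving every window a graded score.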

PSSM Representation of Binding Sites
[Figure: a matrix with rows A, C, G, T and columns 1, 2, …, k]
Position Specific Score Matrix: each possible k-mer gets a “score” for being a binding site, which is the sum of its per-position weights: $\mathrm{Score}(s_1 \dots s_k) = \sum_{i=1}^{k} w[i, s_i]$
• Probabilistic interpretation: w[i,c] – weight of letter c at position i
• NOTE:
  • Independence assumption between binding site positions!
  • The score used in a probabilistic setting is the log-odds score
  • In many cases the BG is a simple, fixed background distribution (Q) over {A,C,G,T}
  • The entries in the matrix can be Pi(a), log(Pi(a)) or log(Pi(a)/Q(a)), depending on the context of its usage!
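
As an illustration of the log-odds variant, the sketch below builds the matrix entries log(Pi(a)/Q(a)) from per-position probabilities and scores every window of a sequence. The two-column matrix and the uniform background are made-up toy values, and the function names are assumptions for this example only.

```python
import math

def log_odds_matrix(probs, background):
    """probs: list of dicts Pi(a), one per position; background: dict Q(a)."""
    return [{a: math.log(p[a] / background[a]) for a in "ACGT"} for p in probs]

def score(pssm, kmer):
    """Sum of per-position weights (positions are assumed independent)."""
    return sum(column[c] for column, c in zip(pssm, kmer))

def scan(pssm, seq):
    """(start, score) for every window of the sequence."""
    k = len(pssm)
    return [(i, score(pssm, seq[i:i + k])) for i in range(len(seq) - k + 1)]

# Toy 2-position motif that prefers "AG"; uniform background Q.
probs = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
]
Q = {a: 0.25 for a in "ACGT"}
pssm = log_odds_matrix(probs, Q)
hits = scan(pssm, "TTAGTT")
best_start, best_score = max(hits, key=lambda h: h[1])
print(best_start, round(best_score, 3))
```

A positive score means the window is more likely under the motif model than under the background; choosing a score threshold is what trades sensitivity against specificity in a genome-wide scan.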

PSSM vs. IUPAC
ABF1 example (targets by Lee et al., 2002)
√ PSSM:
+ Enables representing low/high affinity in different positions
+ Trade-off between sensitivity and specificity in genome-wide scans
- Huge search space; how to cover it efficiently?
? >YAL011W: CGTGTTAGATGA

How to Learn PSSM Motif?
Easier task: we have aligned samples to learn from.
• We have a set of known BS, all of length k (e.g. verified by some biological experiment).
• Compute counts for each base in each position, and normalize == ML estimator:
  • With N the number of sequences and N_a the number of “a”s in position i: $P_i(a) = N_a / N$
• Note:
  • This is the ML solution. As in many other cases, it can be problematic when we have very few samples to learn from (e.g. we can get probability 0 for base A in position i simply because we did not see enough examples).
  • Solution: use pseudo-counts or some prior (e.g. a Dirichlet prior).
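
The counting estimator and the pseudo-count fix can be sketched in a few lines. The function name `estimate_pssm`, the four toy sites, and the choice of adding the same pseudo-count to every base (a symmetric prior) are assumptions made for illustration.

```python
def estimate_pssm(sites, pseudocount=0.0):
    """Per-position base probabilities from aligned sites of equal length.

    pseudocount=0 gives the ML estimate N_a / N; a positive value is added
    to every base count before normalizing.
    """
    k = len(sites[0])
    n = len(sites)
    columns = []
    for i in range(k):
        counts = {a: pseudocount for a in "ACGT"}
        for site in sites:
            counts[site[i]] += 1
        total = n + 4 * pseudocount
        columns.append({a: counts[a] / total for a in "ACGT"})
    return columns

sites = ["ACG", "ACG", "ATG", "CCG"]
ml = estimate_pssm(sites)
smoothed = estimate_pssm(sites, pseudocount=1.0)
print(ml[0])        # G and T get probability 0 at position 1
print(smoothed[0])  # no zero entries after smoothing
```

With only four sites the ML estimate assigns probability 0 to any base not seen at a position, so a single unseen base would rule out an otherwise good site; the pseudo-count keeps every entry positive.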

How to Learn PSSM Motif? (2)
Remember: in the motif finding problem we have a much harder task.
• The input: a set of (long) sequences suspected to contain a common motif (a PSSM, according to our current model assumption), but we don’t know where!
• The output: prediction of new BS based on our learned PSSM motif
[Figure: input sequences → BS model (matrix with rows A, C, G, T and columns 1 to 7) → predictions. Dark blue marks BS positions, which are hidden from us and which we are trying to learn]

How to Learn PSSM Motif? (3) MEME Algorithm (Bailey T.L. and Elkan C.P., 1995)
• (Still) one of the most commonly used tools for motif (PSSM) search

How to Learn PSSM Motif? (3) MEME Algorithm (Bailey T.L. and Elkan C.P., 1995)
• The basic probabilistic framework used by MEME:
  • Input: N sequences
  • Assume each sequence has exactly 1 BS in it.
  • Assume a generative model: sequence is generated either by the BS model M (PSSM) or from a fixed background distribution BG
  • Scoring function: P(Seq | M, BG)
  • Try to maximize the likelihood scoring function by adjusting the parameters of M (the PSSM).

How to Learn PSSM Motif? (4)
• What’s the problem? Why is it hard?
• Think of the positions of the BS in each sequence as H, where H is a vector of dimension N.
• Given H we have complete data. Inferring M’s ML parameters is then just as we saw for the aligned case → easy.
• Problem 1: We don’t have H; we are trying to learn it too. When H is not given, the ML parameters of M for the different positions become dependent, we have no closed form to compute them analytically, and going over all possible H assignments is not feasible → we need to resort to some method to search the space of possible assignments to M’s parameters.
• Problem 2: The landscape of the likelihood function is typically far from convex → many local optima.

How to Learn PSSM Motif? (5) MEME Algorithm
• MEME uses a technique called EM to search the space of model M’s parameters
• EM = Expectation Maximization
• We review how EM is used in the MEME algorithm in class…
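
Since the EM details are left for class, the following is only a rough sketch of EM under the simplified framework listed earlier (exactly one BS per sequence, a PSSM M, a fixed background), not MEME's actual implementation. The uniform background, the 0.7/0.1 initialization around a seed k-mer, the pseudo-count, and the name `em_one_site_per_seq` are all assumptions of this example.

```python
import math

ALPHABET = "ACGT"

def em_one_site_per_seq(seqs, k, init_kmer, n_iter=30, pseudocount=0.1):
    """EM for a PSSM when each sequence holds exactly one hidden site."""
    bg = {a: 0.25 for a in ALPHABET}  # fixed uniform background (assumption)
    # Initialize the PSSM softly around a starting k-mer.
    model = [{a: (0.7 if a == c else 0.1) for a in ALPHABET} for c in init_kmer]
    posteriors = []
    for _ in range(n_iter):
        counts = [{a: pseudocount for a in ALPHABET} for _ in range(k)]
        posteriors = []
        for s in seqs:
            # E-step: posterior over the site's start position in this sequence,
            # proportional to the motif/background likelihood ratio of each window.
            log_w = [
                sum(math.log(model[i][s[j + i]] / bg[s[j + i]]) for i in range(k))
                for j in range(len(s) - k + 1)
            ]
            top = max(log_w)
            w = [math.exp(x - top) for x in log_w]
            z = sum(w)
            w = [x / z for x in w]
            posteriors.append(w)
            # Accumulate expected base counts for the M-step.
            for j, wj in enumerate(w):
                for i in range(k):
                    counts[i][s[j + i]] += wj
        # M-step: re-estimate the PSSM from the expected counts.
        model = [
            {a: col[a] / sum(col.values()) for a in ALPHABET} for col in counts
        ]
    starts = [max(range(len(w)), key=w.__getitem__) for w in posteriors]
    return model, starts

motif = "GATTACA"
seqs = ["AAAAA" + motif + "AAAA", "CC" + motif + "CCCCCCC", "T" * 9 + motif]
model, starts = em_one_site_per_seq(seqs, k=len(motif), init_kmer=motif)
consensus = "".join(max(col, key=col.get) for col in model)
print(starts, consensus)
```

The toy run starts EM at the true motif, so it only shows the mechanics. With a poor `init_kmer` the same code can settle on a different local optimum, which is why a good starting point matters.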

Problems with MEME & Other Models
• Think: in light of what we discussed, what assumptions are made in this model? What might cause us problems in “real” life data?
• MEME also has other variants we did not discuss here (OOPS, ZOOPS, etc.)
• Also: EM is very sensitive to the starting point → we need a good way to find good ones

Other Algorithmic Techniques for Motif Finding
• MEME (Expectation Maximization)
• GibbsDNA, AlignACE (Gibbs sampling)
• CONSENSUS (greedy multiple alignment)
• WINNOWER (clique finding in graphs)
• SP-STAR (sum of pairs scoring)
• MITRA (mismatch trees to prune exhaustive search space)
More than one way to skin a cat…

How to Find Binding Sites - Revisited
“Classical” solutions: find a common motif in the gene set (CONSENSUS, MITRA, MEME, AlignACE…)
• Main problem: in many cases the motif is common not just to the subset of sequences we have, but to many others as well → not a good candidate to explain regulation
Discriminative solutions: find a common & unique motif in the genes
[Figure labels: gene set; promoter; extract the relevant bit from sequences]
“A simple hyper-geometric approach for discovering putative transcription factor binding sites”, WABI 01

Finding Discriminative Motifs
Step 1:
• Define space of motifs: “mimic” motifs with a simpler class for efficient search
• Search space, evaluate motifs using discriminative scoring
• Choose significant motifs: correct for multiple hypotheses (Bonferroni or FDR criteria)
Step 2: Refine motifs
“A simple hyper-geometric approach for discovering putative transcription factor binding sites”, WABI 01
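
One way to read "discriminative scoring" with a hypergeometric test is sketched below: ask how surprising it is that k of the n genes in our set contain a motif, given that K of all N genes do. This is an illustration of the general idea, not the cited paper's exact procedure; the function names and the numbers in the example are made up.

```python
from math import comb

def hypergeom_tail(N, K, n, k):
    """P(X >= k): chance that at least k of n genes drawn from N contain the
    motif, when K of the N genes contain it."""
    total = comb(N, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(K, n) + 1))
    return hits / total

def bonferroni(p_values, alpha=0.05):
    """Keep a motif only if its p-value passes alpha / (number of motifs tested)."""
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]

# Toy numbers: 1000 genes, 50 contain the motif, our set has 20 genes, 8 with the motif.
p = hypergeom_tail(N=1000, K=50, n=20, k=8)
print(p)
print(bonferroni([p, 0.03]))
```

A motif that is frequent in the gene set but equally frequent elsewhere gets a large p-value here, which is the weakness of the "classical" formulation that this scoring is meant to address.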

Binding Sites - Revisited
PSSM → independence assumption
Two relevant questions:
• Are there dependencies in binding sites?
• Do we gain an edge in computational tasks if we model such dependencies?
[Figure labels: ?T, ?C; gene A, binding site, promoter]
“Modeling Dependencies in Protein-DNA Binding Sites”, RECOMB 03

How to Model Binding Sites?
Goal: represent a distribution of binding sites
• Profile: independence model
• Tree: direct dependencies
• Mixture of Profiles: global dependencies
• Mixture of Trees: both types of dependencies
[Figure: the four model structures over positions X1 … X5; two of them include a node T]
“Modeling Dependencies in Protein-DNA Binding Sites”, RECOMB 03

Learning Models: Aligned Binding Sites
Aligned binding sites:
GCGGGGCCGGGC
TGGGGGCGGGGT
AGGGGGCGGGGG
TAGGGGCCGGGC
TGGGGGCGGGGT
AAAGGGCCGGGC
GGGAGGCCGGGA
GCGGGGCGGGGC
GAGGGGACGAGT
CCGGGGCGGTCC
ATGGGGCGGGGC
Learning machinery: learning procedure for Bayesian networks; select the maximum likelihood model
[Figure: aligned binding sites → learning machinery → models over X1 … X5]
“Modeling Dependencies in Protein-DNA Binding Sites”, RECOMB 03

Arabidopsis ABA binding factor 1 (49 examples)
[Figure panels: Profile; Tree (over X4 … X12); Mixture of Profiles (components 76% and 24%)]
Test LL per instance:
• Profile: -19.93
• Dependency models (Tree, Mixture of Profiles): -18.70 (+1.23; improvement in likelihood > 2-fold) and -18.47 (+1.46; improvement in likelihood > 2.5-fold)
“Modeling Dependencies in Protein-DNA Binding Sites”, RECOMB 03

Rap1 example (Harbison et al., 2004) (171 examples)
[Figure panels: Profile; Tree (over X4 … X12); Mixture of Profiles]

Likelihood Improvement over Profiles
[Figure: fold change in likelihood (held-out test data) for each dataset of binding sites; legend: significant / not significant]
Significant improvement in generalization → data often exhibit dependencies
“Modeling Dependencies in Protein-DNA Binding Sites”, RECOMB 03

Learning Models: Unaligned Data
Use the EM algorithm to simultaneously:
• Identify binding site positions
• Learn a dependency model
[Figure: unaligned data → EM algorithm (learn a model ↔ identify binding sites) → models over X1 … X5]
“Modeling Dependencies in Protein-DNA Binding Sites”, RECOMB 03

Evaluating Performance
Detect target genes on a genomic scale:
[Figure: promoter sequence ACGTAT…AGGGATGC GAGC spanning positions -1000 to 0, with a site marked at -473]
Scoring rule: probability by binding site model vs. background model (order-3 Markov chain)
Crucial issue: p-value of scores
“CIS: Compound Importance Sampling Method for Protein-DNA Binding Site p-value Estimation”, Bioinformatics, 2004, ISMB 04

Example: ROC curve of HSF1
[Figure: true positive rate (sensitivity), 0% to 90%, against false positive rate, 0% to 5%; curves: Profile, Tree, Mixture of Profiles, Mixture of Trees; annotation: ~60 FP]
“Modeling Dependencies in Protein-DNA Binding Sites”, RECOMB 03
