
Part II Algorithms for string motif finding

Part II Algorithms for string motif finding. Jaime Seguel, PhD Electrical and Computer Engineering Department University of Puerto Rico at Mayaguez. Disclaimer.


Presentation Transcript


  1. Part II: Algorithms for string motif finding. Jaime Seguel, PhD, Electrical and Computer Engineering Department, University of Puerto Rico at Mayaguez. Summer Institute in Bioinformatics PSC - 2008

  2. Disclaimer: Some slides presented in this talk have been taken, with minor or no modifications, from PowerPoint presentations published on the website http://www.bioalgorithms.info/

  3. Outline
     • The problem of finding small common patterns in a set of DNA sequences
     • Brute force approach:
         • Consensus maximization
         • Hamming distance minimization
     • Branch-and-Bound approach:
         • Consensus maximization
         • Hamming distance minimization
     • Consensus and Pattern Branching:
         • Greedy Motif Search
     • Summary

  4. Problem: Given the following 10 DNA sequences, each with 82 characters: Is there a 15-character common pattern? atgaccgggatactgataaaaaaaagggggggggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccgacccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaataaaaaaaaagggggggatgagtatccctgggatgacttaaaaaaaagggggggtgctctcccgatttttgaatatgtaggatcattcgccagggtccgagctgagaattggatgaaaaaaaagggggggtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggagatcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaataaaaaaaagggggggcttataggtcaatcatgttcttgtgaatggatttaaaaaaaaggggggggaccgcttggcgcacccaaattcagtgtgggcgagcgcaacggttttggcccttgttagaggcccccgtaaaaaaaagggggggcaattatgagagagctaatctatcgcgtgcgtgttcataacttgagttaaaaaaaagggggggctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgtattggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcataaaaaaaagggggggaccgaaagggaagctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttaaaaaaaaggggggga Summer Institute in Bioinformatics PSC - 2008

  5. YES, of course! It is AAAAAAAAGGGGGGG: atgaccgggatactgatAAAAAAAAGGGGGGGggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccgacccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaataAAAAAAAAGGGGGGGatgagtatccctgggatgacttAAAAAAAAGGGGGGGtgctctcccgatttttgaatatgtaggatcattcgccagggtccgagctgagaattggatgAAAAAAAAGGGGGGGtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggagatcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatAAAAAAAAGGGGGGGcttataggtcaatcatgttcttgtgaatggatttAAAAAAAAGGGGGGGgaccgcttggcgcacccaaattcagtgtgggcgagcgcaacggttttggcccttgttagaggcccccgtAAAAAAAAGGGGGGGcaattatgagagagctaatctatcgcgtgcgtgttcataacttgagttAAAAAAAAGGGGGGGctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgtattggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatAAAAAAAAGGGGGGGaccgaaagggaagctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttAAAAAAAAGGGGGGGa

  6. That was an easy one! A general algorithm for finding an l-character common pattern is the Common l-character Pattern Detection Algorithm:
     Input: t DNA sequences, each of length n; and l < n, the length of the pattern
     Procedure:
       1. Compare the first two strings using a pattern matching algorithm
       2. If no l-character common pattern is found, return “NO”
       3. Otherwise, save the l-character common pattern
       4. For j = 3,…,t
       5.   Check if the pattern appears in the jth sequence
       6.   If it does not, return “NO”
       7. End For
       8. Return “Yes, of course! It is {pattern}”
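A minimal Python sketch of the exact-match procedure above. The names `lmers` and `common_pattern` are illustrative and do not come from the slides. As an assumption beyond the slide, the sketch keeps every l-mer shared by the first two sequences as a candidate, instead of a single saved pattern, so that a shared pattern is not missed when the first two sequences have several l-mers in common.

```python
def lmers(seq, l):
    """Return the set of all length-l substrings of seq."""
    return {seq[i:i + l] for i in range(len(seq) - l + 1)}


def common_pattern(seqs, l):
    """Return an l-mer that occurs in every sequence, or None if there is none."""
    # Steps 1-3: candidates shared by the first two sequences.
    candidates = lmers(seqs[0], l) & lmers(seqs[1], l)
    # Steps 4-7: keep only candidates that also occur in the remaining sequences.
    for seq in seqs[2:]:
        candidates = {p for p in candidates if p in seq}
        if not candidates:
            return None
    # Step 8: report one surviving pattern (smallest in alphabetical order).
    return min(candidates) if candidates else None


print(common_pattern(["TTACGTA", "ACGTGG", "CACGTC"], 4))
```

Exact matching like this only works when the pattern is identical in every sequence.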

  7. Complexity of the Common l-character Pattern Detection Algorithm. The time complexity of the algorithm can be estimated as follows: • Step 1 is computed in time […] • Steps 4–7 are computed in […] • The whole algorithm takes time […] • Therefore, the algorithm is polynomial (indeed, quadratic)

  8. Unfortunately… Real-life problems are not that simple:
     • The pattern is not exactly the same in each sequence because random point mutations may occur
     • The length of the pattern is usually unknown
     • It is not known where the pattern is located relative to the gene's start
     These facts of life make the motif (i.e., pattern) finding problem much more complex

  9. Same sequences except by a few point mutations: Is there a motif? atgaccgggatactgatagaagaaaggttgggggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccgacccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaatacaataaaacggcgggatgagtatccctgggatgacttaaaataatggagtggtgctctcccgatttttgaatatgtaggatcattcgccagggtccgagctgagaattggatgcaaaaaaagggattgtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggagatcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatataataaaggaagggcttataggtcaatcatgttcttgtgaatggatttaacaataagggctgggaccgcttggcgcacccaaattcagtgtgggcgagcgcaacggttttggcccttgttagaggcccccgtataaacaaggagggccaattatgagagagctaatctatcgcgtgcgtgttcataacttgagttaaaaaatagggagccctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgtattggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatactaaaaaggagcggaccgaaagggaagctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttactaaaaaggagcgga Summer Institute in Bioinformatics PSC - 2008

  10. Well, there are 15-character patterns that look pretty much alike. Indeed, they differ for at most 4 characters. Is that what you are asking for? atgaccgggatactgatAgAAgAAAGGttGGGggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccgacccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaatacAAtAAAAcGGcGGGatgagtatccctgggatgacttAAAAtAAtGGaGtGGtgctctcccgatttttgaatatgtaggatcattcgccagggtccgagctgagaattggatgcAAAAAAAGGGattGtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggagatcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatAtAAtAAAGGaaGGGcttataggtcaatcatgttcttgtgaatggatttAAcAAtAAGGGctGGgaccgcttggcgcacccaaattcagtgtgggcgagcgcaacggttttggcccttgttagaggcccccgtAtAAAcAAGGaGGGccaattatgagagagctaatctatcgcgtgcgtgttcataacttgagttAAAAAAtAGGGaGccctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgtattggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatActAAAAAGGaGcGGaccgaaagggaagctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttActAAAAAGGaGcGGa AgAAgAAAGGttGGG ..|..|||.|..||| cAAtAAAAcGGcGGG Summer Institute in Bioinformatics PSC - 2008

  11. Instead of a Pattern, what we get is a Motif Logo
     • Motifs can mutate at unimportant bases
     • The illustration shows five motifs in five different genes that have mutations in positions 3 and 5:
         TGGGGGA
         TGAGAGA
         TGGGGGA
         TGAGAGA
         TGAGGGA
     • Representations called motif logos illustrate the conserved and variable regions of a motif

  12. A larger motif logo (figure)

  13. Consensus strings
     • The largest characters in a motif logo represent a consensus string, that is, a string containing the most frequent character at each position
     • The “quality” of a motif logo as a “generalized common pattern” in a family of DNA sequences can be assessed by scoring consensus strings as follows

  14. Selecting a t x l “window”
     Parameters: t = 3, l = 9, n = 18
     Sequences:
         S1: TTGAGGTACACCTATAAC
         S2: TAGCTCCACTCATATCAG
         S3: TATCGCATGTACAATCAC
     Selected window: s = (4, 2, 7) (initial positions)
         ***AGGTACACC******
         *AGCTCCACT********
         ******ATGTACAAT***

  15. Alignment, profile, consensus and scoring of the selected window
     Alignment A(4, 2, 7):
         A G G T A C A C C
         A G C T C C A C T
         A T G T A C A A T
     Profile P(4, 2, 7):
         A: 3 0 0 0 2 0 3 1 0
         C: 0 0 1 0 1 3 0 2 1
         G: 0 2 2 0 0 0 0 0 0
         T: 0 1 0 3 0 0 0 0 2
     Consensus: A G G T A C A C T
     Score A(4, 2, 7): 3+2+2+3+2+3+3+2+2 = 22
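The window, profile, consensus and score of slides 14 and 15 can be reproduced with a short Python sketch. The function names are illustrative, start positions are 1-based as on the slides, and ties in a column are broken alphabetically (an assumption; the slides do not say how ties are resolved).

```python
from collections import Counter


def window(seqs, starts, l):
    """Cut one l-mer per sequence; starts are 1-based, as on the slide."""
    return [seq[s - 1:s - 1 + l] for seq, s in zip(seqs, starts)]


def profile(alignment):
    """Count the nucleotides in every column of the alignment."""
    return [Counter(column) for column in zip(*alignment)]


def consensus(alignment):
    """Most frequent nucleotide per column (ties broken in A, C, G, T order)."""
    return "".join(max("ACGT", key=lambda base: col[base]) for col in profile(alignment))


def score(alignment):
    """Sum over columns of the count of the most frequent nucleotide."""
    return sum(max(col.values()) for col in profile(alignment))


seqs = ["TTGAGGTACACCTATAAC", "TAGCTCCACTCATATCAG", "TATCGCATGTACAATCAC"]
alignment = window(seqs, (4, 2, 7), 9)
print(alignment)
print(consensus(alignment), score(alignment))
```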

  16. Motif finding problem as a maximization problem. Given a set of t DNA sequences of length n and a segment length l < n, find a set of t subsequences of length l, one from each of the given DNA sequences, whose consensus score is maximal

  17. Brute force approach to maximum consensus score
     Input: t DNA sequences of length n, and the pattern’s length l
     • Initialize bestScore ← 0
     • For s = (s1, s2, …, st) from (1, 1, …, 1) to (n-l+1, …, n-l+1)
         • Compute Score ← score of alignment matrix A(s)
         • If Score > bestScore
             • bestScore ← Score
             • bestMotif ← (s1, s2, …, st)
     • Return bestMotif
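A hedged Python sketch of this brute force search. Names are illustrative, start positions are 0-based here (add 1 to compare with the slides), and all sequences are assumed to have the same length n. The toy input is made up for illustration; it is not from the slides.

```python
from itertools import product


def score(seqs, starts, l):
    """Consensus score of the window given by 0-based start positions."""
    rows = [seq[s:s + l] for seq, s in zip(seqs, starts)]
    return sum(max(col.count(base) for base in "ACGT") for col in zip(*rows))


def brute_force_motif_search(seqs, l):
    """Try every vector of start positions and keep the best-scoring one."""
    n = len(seqs[0])
    best_score, best_motif = 0, None
    for starts in product(range(n - l + 1), repeat=len(seqs)):
        current = score(seqs, starts, l)
        if current > best_score:
            best_score, best_motif = current, starts
    return best_motif, best_score


toy = ["CCACGTCC", "ACGTGGGG", "TTTTACGT"]  # illustrative data
print(brute_force_motif_search(toy, 4))
```

The loop visits (n - l + 1)^t start vectors, so this is only usable on very small inputs.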

  18. Complexity
     • Count the windows: varying (n - l + 1) positions in each of t sequences produces (n - l + 1)^t windows (or sets of starting positions). The order is O(n^t)
     • For each set of starting positions, the scoring function makes O(l) operations, so the complexity is O(l n^t)
     • That means that for t = 8, n = 1000, l = 10 we must perform approximately 10^25 computations!!!
     • Even at a billion operations per second, this would take hundreds of millions of years!!!

  19. A different approach. Instead of examining all windows, why not compare each of the possible l-character patterns over the alphabet {A, G, T, C} with each of the l-mers (subsequences of l characters) in each of the t DNA sequences, and find the pattern that appears in all t sequences with the minimum number of mutations? The question is: will this approach yield a better brute force algorithm?

  20. Hamming distances. The Hamming distance dH(v, w) between two strings v and w of equal length is the number of nucleotide pairs that do not match when v and w are aligned. For example: dH(AAAAAA, ACAAAC) = 2. The Hamming distance between a pattern V and a DNA sequence S is the minimum of the distances dH(X, V) taken over all substrings X of S that have the same length as V
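Both definitions translate directly into Python. This is a minimal sketch with illustrative function names; it assumes the pattern is no longer than the sequence.

```python
def hamming(v, w):
    """Number of mismatching positions between two equal-length strings."""
    if len(v) != len(w):
        raise ValueError("strings must have the same length")
    return sum(a != b for a, b in zip(v, w))


def distance_to_sequence(pattern, seq):
    """Minimum Hamming distance between pattern and any same-length substring of seq."""
    l = len(pattern)
    return min(hamming(pattern, seq[i:i + l]) for i in range(len(seq) - l + 1))


print(hamming("AAAAAA", "ACAAAC"))
print(distance_to_sequence("AGGTATACG", "TTGAGGTACACCTAT"))
```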

  21. Illustration of a total distance: some computations
     Parameters: t = 3, l = 9, n = 15
     Sequences:
         S1: TTGAGGTACACCTAT
         S2: TAGCTCCACTCATAT
         S3: TATCGCATGTACAAT
     Proposed pattern: V = AGGTATACG

  22. The distance from pattern AGGTATACG to sequence S1 = TTGAGGTACACCTAT is 2
     Choice of X (start)   X            d(X, AGGTATACG)
     First   (1)           TTGAGGTAC    8
     Second  (2)           TGAGGTACA    5
     Third   (3)           GAGGTACAC    8
     Fourth  (4)           AGGTACACC    2  ← minimum
     Fifth   (5)           GGTACACCT    7
     Sixth   (6)           GTACACCTA    8
     Seventh (7)           TACACCTAT    9

  23. The total distance
     • In the previous example, d(S1, AGGTATACG) = d(TTGAGGTACACCTAT, AGGTATACG) = 2, achieved when X is the 9-letter segment starting at position 4 in the DNA string
     • Similarly, we get d(S2, AGGTATACG) = d(S3, AGGTATACG) = 4
     • The total Hamming distance over the set of DNA sequences {S1, S2, S3} is defined to be TotalDistance(AGGTATACG, {S1, S2, S3}) = d(S1, AGGTATACG) + d(S2, AGGTATACG) + d(S3, AGGTATACG) = 2 + 4 + 4 = 10
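The worked example can be checked with a small Python sketch. The helper names are illustrative. `best_matches` additionally reports, for each sequence, the 1-based start of the closest segment, which is a convenience not on the slide.

```python
def best_matches(pattern, seqs):
    """For each sequence, return (distance, 1-based start) of its closest segment."""
    l = len(pattern)
    results = []
    for seq in seqs:
        candidates = [
            (sum(a != b for a, b in zip(pattern, seq[i:i + l])), i + 1)
            for i in range(len(seq) - l + 1)
        ]
        results.append(min(candidates))  # smallest distance, then leftmost start
    return results


def total_distance(pattern, seqs):
    """Sum of the per-sequence minimum Hamming distances."""
    return sum(distance for distance, _ in best_matches(pattern, seqs))


seqs = ["TTGAGGTACACCTAT", "TAGCTCCACTCATAT", "TATCGCATGTACAAT"]
print(best_matches("AGGTATACG", seqs))
print(total_distance("AGGTATACG", seqs))
```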

  24. Motif finding problem as a Hamming distance minimization problem. Given a set of t DNA sequences of length n and a segment length l < n, find a string v over the DNA alphabet (that is, a string of nucleotides) of length l that minimizes TotalDistance(v, set of DNA sequences). That is, find: min {TotalDistance(v, {S1,…,St}) : v is a DNA string of length l}

  25. Brute force implementation of the total-distance minimization method
     Input: t DNA sequences of length n, and the pattern length l
     • Initialize bestWord ← AAA…A (l characters)
     • Initialize bestDistance ← highest integer in your system
     • For each l-mer v from AAA…A to TTT…T
         • Compute TotalDistance(v, DNA set)
         • If TotalDistance(v, DNA set) < bestDistance
             • bestDistance ← TotalDistance(v, DNA set)
             • bestWord ← v
     • Return bestWord
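A minimal Python sketch of this brute force loop, with illustrative names. `itertools.product` enumerates the l-mers in the same AAA…A to TTT…T order, and `float("inf")` stands in for the "highest integer in your system". The toy input is made up for illustration.

```python
from itertools import product


def total_distance(pattern, seqs):
    """Sum over sequences of the minimum Hamming distance to any l-mer."""
    l = len(pattern)
    return sum(
        min(sum(a != b for a, b in zip(pattern, seq[i:i + l]))
            for i in range(len(seq) - l + 1))
        for seq in seqs
    )


def median_string(seqs, l):
    """Examine all 4**l candidate patterns and keep the closest one."""
    best_word, best_distance = None, float("inf")
    for letters in product("ACGT", repeat=l):
        word = "".join(letters)
        distance = total_distance(word, seqs)
        if distance < best_distance:
            best_word, best_distance = word, distance
    return best_word, best_distance


toy = ["CCACGTCC", "ACGTGGGG", "TTTTACGT"]  # illustrative data
print(median_string(toy, 4))
```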

  26. Complexity
     • Minimizing the total Hamming distance requires examining all 4^l candidate patterns v, and each pattern is compared against O(t(n-l+1)) l-mers. That is, the method’s complexity is O(4^l t(n-l+1))
     • Conclusion: the complexity of the brute-force total distance minimization is dominated by an exponential factor as well, but the actual count is much smaller in this case

  27. It’s all in what affects the exponential growth!!!
     • In most practical situations n is significantly larger than l. Recall that l is usually a number between 7 and 15
     • The advantage of 4^l over (n - l + 1)^t is that the former does not depend exponentially on either the number of sequences in the set (t) or the sequence length (n)
     • The latter parameters (t and n) are less likely to be bounded in practice
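To make the comparison concrete, the two counts can be evaluated for the parameters used earlier (t = 8, n = 1000, l = 10). This is a rough sketch with illustrative names; the operation counts follow the O(l n^t) and O(4^l t (n-l+1)) estimates from the slides and ignore constant factors.

```python
def window_count(n, l, t):
    """Start-position vectors examined by the brute force consensus search."""
    return (n - l + 1) ** t


def pattern_count(l):
    """Candidate l-mers over the alphabet {A, C, G, T}."""
    return 4 ** l


n, l, t = 1000, 10, 8
brute_force_ops = l * window_count(n, l, t)             # O(l n^t) estimate
median_string_ops = pattern_count(l) * t * (n - l + 1)  # O(4^l t (n-l+1)) estimate

print(f"consensus brute force: about {brute_force_ops:.1e} operations")
print(f"median string search:  about {median_string_ops:.1e} l-mer comparisons")
```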

  28. Mathematical Equivalence
     • Motif Finding is a maximization problem, while Median String is a minimization problem. Computationally, Median String allows searches over much larger data sets. Are the results comparable?
     • Indeed, the Motif Finding problem and the Median String problem are mathematically equivalent. Next we show that minimizing TotalDistance is equivalent to maximizing Score

  29. Proof of the mathematical equivalence
     Alignment (t = 5 rows, l = 8 columns):
         a G g t a c T t
         C c A t a c g t
         a c g t T A g t
         a c g t C c A t
         C c g t a c g G
     Profile:
         A: 3 0 1 0 3 1 1 0
         C: 2 4 0 0 1 4 0 0
         G: 0 1 4 0 0 0 3 1
         T: 0 0 0 5 1 0 1 4
     Consensus:     a c g t a c g t
     Score:         3+4+4+5+3+4+3+4 = 30
     TotalDistance: 2+1+1+0+2+1+2+1 = 10
     Sum:           5 5 5 5 5 5 5 5
     • At any column i: Score_i + HammingDistance_i = t
     • Because there are l columns: Score + TotalDistance = l * t
     • Rearranging: Score = l * t - TotalDistance
     • Since l * t is constant, minimizing TotalDistance is equivalent to maximizing Score
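The identity Score + TotalDistance = l * t can be checked numerically on the alignment of this slide. The sketch below uses illustrative names and writes the rows in upper case; the distance is measured from each row to the consensus string.

```python
alignment = ["AGGTACTT", "CCATACGT", "ACGTTAGT", "ACGTCCAT", "CCGTACGG"]


def consensus_and_score(rows):
    """Consensus string and its score (sum of the per-column majority counts)."""
    columns = list(zip(*rows))
    consensus = "".join(max("ACGT", key=col.count) for col in columns)
    score = sum(col.count(base) for col, base in zip(columns, consensus))
    return consensus, score


def distance_to_rows(pattern, rows):
    """Sum of Hamming distances between pattern and every row."""
    return sum(sum(a != b for a, b in zip(pattern, row)) for row in rows)


t, l = len(alignment), len(alignment[0])
consensus, score = consensus_and_score(alignment)
distance = distance_to_rows(consensus, alignment)
print(consensus, score, distance, score + distance == l * t)
```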

  30. Structuring the Search. Let’s take a closer look at the pseudo-code line “For each l-mer v from AAA…A to TTT…T”
     • There is more than one way to navigate over all possible l-mers
     • We need a navigation method able to exhibit intermediate approximations, so that potentially “low scoring or highly distant” l-mers can be eliminated as early as possible in the search

  31. Structuring the Search • For the Median String Problem we need to consider all 4^l possible l-mers (each of length l): aa…aa, aa…ac, aa…ag, aa…at, …, tt…tt. How to organize this search?

  32. Alternative Representation of the Search Space
     • Let A = 1, C = 2, G = 3, T = 4
     • Then the sequences from AA…A to TT…T (each of length l) become: 11…11, 11…12, 11…13, 11…14, …, 44…44
     • Notice that the sequences above simply list all numbers as if we were counting in base 4 without using 0 as a digit

  33. Linked lists don’t exhibit intermediate approximations
     • Suppose l = 2. Start: aa ac ag at ca cc cg ct ga gc gg gt ta tc tg tt
     • We need to visit all the predecessors of a sequence before visiting the sequence itself

  34. Trees do!!! • Linked lists only organize the patterns. A tree, instead, can show the patterns and their prefixes: aa ac ag at ca cc cg ct ga gc gg gt ta tc tg tt

  35. Search Tree
     root:     --
     prefixes: a- c- g- t-
     leaves:   aa ac ag at | ca cc cg ct | ga gc gg gt | ta tc tg tt

  36. Moving through a Search Tree
     • Four common moves in a search tree that we are about to explore:
         • Move to the next leaf
         • Visit all the leaves
         • Visit the next node
         • Bypass the children of a node

  37. Visit the Next Leaf. Given a current leaf a, we need to compute the “next” leaf:
     NextLeaf(a, L, k)            // a: the array of digits
       for i ← L to 1             // L: length of the array
         if a_i < k               // k: max digit value
           a_i ← a_i + 1
           return a
         a_i ← 1
       return a
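A Python rendering of NextLeaf, as a minimal sketch. The function name and the choice to return a new list instead of modifying the input are illustrative assumptions. Digits run from 1 to k, as on the slides.

```python
def next_leaf(a, k):
    """Return the leaf that follows a, counting in radix k with digits 1..k."""
    a = list(a)  # work on a copy so the caller's leaf is not modified
    for i in range(len(a) - 1, -1, -1):  # least significant digit first
        if a[i] < k:
            a[i] += 1
            return a
        a[i] = 1  # digit was at its maximum: reset it and carry
    return a  # every digit overflowed: wrapped around to (1, ..., 1)


print(next_leaf([1, 4], 4))
print(next_leaf([4, 4], 4))
```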

  38. NextLeaf (cont’d)
     • The algorithm is ordinary addition in radix k:
         • Increment the least significant digit
         • “Carry the one” to the next digit position when the digit is at its maximal value

  39. NextLeaf: Example • Moving to the next leaf. Search tree (the current location is marked in the original figure): root --; prefixes 1- 2- 3- 4-; leaves 11 12 13 14 21 22 23 24 31 32 33 34 41 42 43 44

  40. NextLeaf: Example (cont’d) • Moving to the next leaf. Search tree (the next location is marked in the original figure): root --; prefixes 1- 2- 3- 4-; leaves 11 12 13 14 21 22 23 24 31 32 33 34 41 42 43 44

  41. Visit All Leaves
     • Printing all leaves (all L-digit sequences) in ascending order:
     AllLeaves(L, k)              // L: length of the sequence
       a ← (1,...,1)              // k: max digit value
       while forever              // a: array of digits
         output a
         a ← NextLeaf(a, L, k)
         if a = (1,...,1)
           return
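AllLeaves can be sketched in Python as a generator, which is an illustrative design choice and not part of the slides. The helper `to_lmer` maps digits back to nucleotides using A = 1, C = 2, G = 3, T = 4 from the earlier slide.

```python
def next_leaf(a, k):
    """Radix-k increment with digits 1..k; wraps to (1, ..., 1) at the end."""
    a = list(a)
    for i in range(len(a) - 1, -1, -1):
        if a[i] < k:
            a[i] += 1
            return a
        a[i] = 1
    return a


def all_leaves(L, k):
    """Yield every leaf (digit tuple of length L) in ascending order."""
    first = [1] * L
    a = first
    while True:
        yield tuple(a)
        a = next_leaf(a, k)
        if a == first:  # wrapped around: every leaf has been visited
            return


def to_lmer(digits):
    """Translate digits 1..4 into the nucleotides A, C, G, T."""
    return "".join("ACGT"[d - 1] for d in digits)


print([to_lmer(leaf) for leaf in all_leaves(2, 4)])
```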

  42. Visit All Leaves: Example • Moving through all the leaves in order. Search tree: root --; prefixes 1- 2- 3- 4-; leaves 11 12 13 14 21 22 23 24 31 32 33 34 41 42 43 44. Order of steps: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

  43. Depth First Search
     • So we can search all leaves
     • How about searching all vertices of the tree?
     • We can do this with a depth first search

  44. Visit the Next Vertex
     NextVertex(a, i, L, k)       // a: the array of digits
       if i < L                   // i: prefix length
         a_(i+1) ← 1              // L: max length
         return (a, i+1)          // k: max digit value
       else
         for j ← L to 1
           if a_j < k
             a_j ← a_j + 1
             return (a, j)
       return (a, 0)
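A Python sketch of NextVertex together with a small driver that lists the vertices in depth-first order. The driver `dfs_order` and the convention of representing a vertex by its first i digits are illustrative assumptions; the root is the empty prefix (i = 0).

```python
def next_vertex(a, i, L, k):
    """Return (a, i) for the vertex after prefix a[:i] in depth-first order."""
    a = list(a)
    if i < L:
        a[i] = 1  # descend to the first child (position i + 1 in 1-based terms)
        return a, i + 1
    for j in range(L, 0, -1):  # at a leaf: move to the next sibling or back up
        if a[j - 1] < k:
            a[j - 1] += 1
            return a, j
    return a, 0  # traversal finished, back at the root


def dfs_order(L, k):
    """List all vertices (as digit tuples) in the order next_vertex visits them."""
    a, i = [1] * L, 0
    order = [()]
    while True:
        a, i = next_vertex(a, i, L, k)
        if i == 0:
            return order
        order.append(tuple(a[:i]))


print(dfs_order(2, 4)[:7])
```

For L = 2 and k = 4 the traversal visits the root, the 4 prefixes and the 16 leaves.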

  45. Example • Moving to the next vertex. Search tree (the current location is marked in the original figure): root --; prefixes 1- 2- 3- 4-; leaves 11 12 13 14 21 22 23 24 31 32 33 34 41 42 43 44

  46. Example • Moving to the next vertices. Search tree (the location after 5 next-vertex moves is marked in the original figure): root --; prefixes 1- 2- 3- 4-; leaves 11 12 13 14 21 22 23 24 31 32 33 34 41 42 43 44

  47. Bypass Move
     • Given a prefix (internal vertex), find the next vertex after skipping all its children
     Bypass(a, i, L, k)           // a: array of digits
       for j ← i to 1             // i: prefix length
         if a_j < k               // L: maximum length
           a_j ← a_j + 1          // k: max digit value
           return (a, j)
       return (a, 0)
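Bypass in Python, as a minimal sketch. The parameter L is dropped because the loop never uses it, which is a simplification relative to the slide's signature. The example reproduces the move from prefix “2-” to prefix “3-”.

```python
def bypass(a, i, k):
    """Skip the subtree below prefix a[:i]; return (a, j) for the next vertex."""
    a = list(a)
    for j in range(i, 0, -1):
        if a[j - 1] < k:
            a[j - 1] += 1  # move to the next sibling at depth j
            return a, j
    return a, 0  # no sibling left at any depth: the search is over


a, i = bypass([2, 1], 1, 4)
print(a[:i])
```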

  48. Bypass Move: Example • Bypassing the descendants of “2-”. Search tree (the current location is marked in the original figure): root --; prefixes 1- 2- 3- 4-; leaves 11 12 13 14 21 22 23 24 31 32 33 34 41 42 43 44

  49. Example • Bypassing the descendants of “2-”. Search tree (the next location is marked in the original figure): root --; prefixes 1- 2- 3- 4-; leaves 11 12 13 14 21 22 23 24 31 32 33 34 41 42 43 44

  50. Improving brute force search: The Branch and Bound approach
     • A set of starting positions s = (s1, s2, …, st) may have a weak profile for the first i positions (s1, s2, …, si)
     • Every row of the alignment may add at most l to Score
     • Optimism: at best, the remaining (t - i) positions (si+1, …, st) add (t - i) * l to Score(s, i, DNA)
     • If Score(s, i, DNA) + (t - i) * l < BestScore, it makes no sense to search the descendants of the current vertex
     • Use Bypass()
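Putting the pieces together, the bound above can be sketched in Python as a branch-and-bound version of the consensus search. This is one possible arrangement under stated assumptions, not necessarily the exact procedure of the original slides: the function names are illustrative, start positions are 1-based digits in 1..n-l+1, all sequences have the same length, and the toy input is made up.

```python
def next_vertex(a, i, L, k):
    """Depth-first successor of prefix a[:i] (digits 1..k, depth at most L)."""
    a = list(a)
    if i < L:
        a[i] = 1
        return a, i + 1
    for j in range(L, 0, -1):
        if a[j - 1] < k:
            a[j - 1] += 1
            return a, j
    return a, 0


def bypass(a, i, k):
    """Successor of prefix a[:i] that skips all of its descendants."""
    a = list(a)
    for j in range(i, 0, -1):
        if a[j - 1] < k:
            a[j - 1] += 1
            return a, j
    return a, 0


def partial_score(seqs, s, i, l):
    """Consensus score of the first i rows of the window with 1-based starts s."""
    rows = [seqs[r][s[r] - 1:s[r] - 1 + l] for r in range(i)]
    return sum(max(col.count(base) for base in "ACGT") for col in zip(*rows))


def branch_and_bound_motif_search(seqs, l):
    t, k = len(seqs), len(seqs[0]) - l + 1
    s, i = [1] * t, 1
    best_score, best_motif = 0, None
    while i > 0:
        if i < t:
            # Optimistic bound: each remaining row adds at most l to the score.
            optimistic = partial_score(seqs, s, i, l) + (t - i) * l
            if optimistic < best_score:
                s, i = bypass(s, i, k)  # prune the whole subtree
            else:
                s, i = next_vertex(s, i, t, k)
        else:
            current = partial_score(seqs, s, t, l)
            if current > best_score:
                best_score, best_motif = current, tuple(s)
            s, i = next_vertex(s, i, t, k)
    return best_motif, best_score


toy = ["CCACGTCC", "ACGTGGGG", "TTTTACGT"]  # illustrative data
print(branch_and_bound_motif_search(toy, 4))
```

Pruning only happens when the optimistic bound is strictly below the best score found so far, so a maximal-score window is never discarded.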
