Finding approximate occurrences of a pattern that contains gaps

Inbok Lee, Costas S. Iliopoulos, Alberto Apostolico, Kunsoo Park



  1. Finding approximate occurrences of a pattern that contains gaps. Inbok Lee, Costas S. Iliopoulos, Alberto Apostolico, Kunsoo Park

  2. Contents • The exact/approximate gapped pattern matching problem • Previous approaches • Our contributions

  3. Exact gapped pattern matching problem. Definition: find the occurrences in the text of a pattern that contains gaps. Pattern P = AA *(2,3) GC *(1,3) TT has subpatterns P1 = AA, P2 = GC, P3 = TT. The gap *(2,3) is any string whose length is between 2 and 3; *(1,3) is any string whose length is between 1 and 3.

  4. Example: Exact matching. Pattern P = AA *(2,3) GC *(1,3) TT. Text T = GCAATTGCACTTC. P occurs in T as AATTGCACTT: AA, then the gap TT, then GC, then the gap AC, then TT.

  5. Approximate gapped pattern matching problem. Definition: find all the substrings of the text that match the pattern when each subpattern Pi is allowed at most ki insertions, deletions, and substitutions. Pattern P = AA *(2,3) GC *(1,3) TT with P1 = AA (k1 = 0), P2 = GC (k2 = 1), P3 = TT (k3 = 0). The gap *(2,3) is any string whose length is between 2 and 3; *(1,3) is any string whose length is between 1 and 3.

  6. Example: Approximate matching. Pattern P = AA *(2,3) GC *(1,3) TT, k1 = k3 = 0, k2 = 1. Text T = GCAATTGTACTTC. P occurs in T as AATTGTACTT: AA, then the gap TT, then GT (P2 = GC with 1 substitution), then the gap AC, then TT.

  7. Class of characters. Allow two or more different characters at a position of the pattern. Pattern P = AA *(2,3) G[CT] *(1,3) TT, where [CT] matches C or T. P1 = AA (k1 = 0), P2 = G[CT] (k2 = 1), P3 = TT (k3 = 0). The gap *(2,3) is any string whose length is between 2 and 3; *(1,3) is any string whose length is between 1 and 3.

  8. Example: Class of characters. Pattern P = AA *(2,3) G[CT] *(1,3) TT. Text T = GCAATTGTACTTC. P occurs in T as AATTGTACTT: G[CT] matches GT because the class [CT] accepts T.

  9. Applications of gapped pattern matching • Information retrieval • Data mining • Computational biology, especially finding motifs in a sequence

  10. Motifs. Sometimes overall sequence alignment doesn't show the relation between biologically related sequences. [Figure: Sequences 1 to 4, each containing motifs (biologically important common regions).]

  11. PROSITE database • Database of protein families, domains and motifs • http://www.expasy.ch/prosite • Motifs are represented as gapped patterns over the alphabet of 20 amino acids. • Prion protein (Creutzfeldt-Jakob disease): E*(1,1)[ED]*(1,1)K[LIVM][LIVM]*(1,1) [KR][LIVM][LIVM]*(1,1) [QE]MC*(2,2)QY • Ribosomal protein L1: [IM]*(2,2)[LIVA]*(2,3) [LIVM][GA]*(2,2)[LMS] [GSNH][PTKR][KRAV]G*(1,1) [LIMF]P[DENSTKQ]

  12. Finding hidden motifs. Given a set of sequences, how do we find unknown motifs?

  13. Finding motifs in a sequence (our topic). [Figure: a known motif located in a new sequence.] As biological sequences may contain errors, we should consider approximate occurrences.

  14. Previous approaches • Regular expression approaches: exact matching • Navarro and Raffinot's approach [RECOMB 2002]: exact and approximate matching • Akutsu's approach [IEICE Trans. Information and Systems 1996]: approximate matching

  15. Regular expression approach. Pattern P = AA *(2,3) GC *(1,3) TT is written as the regular expression AA**(*|e)GC*(*|e)(*|e)TT, where * stands for any single character and e for the empty string. The expression is run as a nondeterministic finite automaton (NFA) or its equivalent deterministic finite automaton (DFA). Too general!
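
As a rough illustration of the regular expression approach, the gapped pattern can be translated into the syntax of Python's `re` module, where `.{2,3}` plays the role of `**(*|e)`. The helper name `gapped_to_regex` is hypothetical, and the sketch treats every subpattern as a literal string, so classes of characters are not handled.

```python
import re

def gapped_to_regex(subpatterns, gaps):
    """Build a regex string from literal subpatterns and (min, max) gap bounds."""
    parts = [re.escape(subpatterns[0])]
    for (lo, hi), sub in zip(gaps, subpatterns[1:]):
        parts.append(".{%d,%d}" % (lo, hi))  # a gap of lo to hi arbitrary characters
        parts.append(re.escape(sub))
    return "".join(parts)

regex = gapped_to_regex(["AA", "GC", "TT"], [(2, 3), (1, 3)])
match = re.search(regex, "GCAATTGCACTTC")
print(regex)          # AA.{2,3}GC.{1,3}TT
print(match.group())  # AATTGCACTT
```

Note that `re.search` reports only the leftmost match, so listing every occurrence would need extra work.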

  16. Navarro and Raffinot's approach. An NFA is not easy to run and a DFA can be large. Simulate the NFA with the bit-parallelism technique: the states are kept in a bit vector (for example 0 0 1 0 0 1 0 0 0 1 1 0 0), and all the bits of a machine word can be read and written simultaneously.
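
To show what bit-parallel NFA simulation looks like, here is a minimal sketch of the classic Shift-And algorithm for a plain, ungapped string. It is not Navarro and Raffinot's gap-aware automaton; it only illustrates how one integer can hold all active NFA states and update them in a single step. The function name `shift_and` is an assumption made for this example.

```python
def shift_and(pattern, text):
    """Return the 0-indexed end positions of exact occurrences of pattern in text."""
    # Bit i of masks[c] is set when pattern[i] == c.
    masks = {}
    for i, c in enumerate(pattern):
        masks[c] = masks.get(c, 0) | (1 << i)
    accept = 1 << (len(pattern) - 1)
    state = 0  # bit i set means pattern[:i + 1] matches a suffix of the text read so far
    ends = []
    for j, c in enumerate(text):
        # Advance every active state at once and start a new attempt at bit 0.
        state = ((state << 1) | 1) & masks.get(c, 0)
        if state & accept:
            ends.append(j)
    return ends

print(shift_and("GC", "GCAATTGCACTTC"))  # [1, 7]
```

In a language with fixed-width integers, the update costs one shift, one OR and one AND per text character as long as the pattern fits in a word.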

  17. Navarro and Raffinot's approach. Allow k errors over the whole pattern. [Figure: one copy of the NFA per error count (0 errors, 1 error), with transitions labeled S, e leading from each copy to the next.] Works for small patterns and small numbers of errors. O(km'n / w) time algorithm (m' is the total length of the pattern, n is the length of the text, w is the word size).

  18. Akutsu's approach. Combination of dynamic programming and a balanced search tree. O(mn log n) time. [Figure: dynamic table for the text and the pattern P1 *(a1, b1) P2 *(a2, b2) P3, annotated "use the tree to compute the smallest values here".]

  19. Drawbacks of the previous approaches. With k = 3 for the whole pattern, the errors can cluster in one place: O O O O O X X X O O O O O O. With k1 = 1, k2 = 1, k3 = 1, the errors are spread over the subpatterns: O X O O O O X O O O X O. The second setting is more sensitive and desirable.

  20. Our contributions • O(ln + m) time algorithm for the exact gapped pattern matching problem (l: number of subpatterns, n: length of the text, m: length of the pattern). • O(mn) time algorithm for the approximate gapped pattern matching problem.

  21. Graph Modeling 1. Create a node wherever a subpattern appears (exactly or approximately) in the text. 2. Link two nodes with an edge if they represent two consecutive subpatterns and satisfy the gap condition. 3. If there is a path P1 - P2 - … - Pl in the graph, where l is the number of subpatterns, there is an occurrence of the pattern in the text.

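  22. Exact matching. Step 1: create nodes. P = AA *(2,3) GC *(1,3) TT, with P1 = AA, P2 = GC, P3 = TT. Text: GCAATTGCACTTC. Counting positions from 1, there is one node for P1 (position 3), two for P2 (positions 1 and 7), and two for P3 (positions 5 and 11).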

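  23. Exact matching. Step 2: connect the nodes with edges. P = AA *(2,3) GC *(1,3) TT, with P1 = AA, P2 = GC, P3 = TT. Text: GCAATTGCACTTC. Counting positions from 1, the gap conditions give the edges P1 (3) to P2 (7), P2 (1) to P3 (5), and P2 (7) to P3 (11).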

  24. Exact matching. Step 3: find the path by depth-first search. P = AA *(2,3) GC *(1,3) TT, with P1 = AA, P2 = GC, P3 = TT. Text: GCAATTGCACTTC. Counting positions from 1, the path P1 (3) - P2 (7) - P3 (11) is an occurrence of P.
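
The three graph steps can be sketched as follows. This is a naive illustration under assumed names (`find_all`, `build_graph`, `find_paths`), with 0-indexed start positions; it compares every pair of nodes, so it does not aim at the time bounds quoted in the slides.

```python
def find_all(sub, text):
    """Start positions (0-indexed) of every exact occurrence of sub in text."""
    return [i for i in range(len(text) - len(sub) + 1) if text.startswith(sub, i)]

def build_graph(subpatterns, gaps, text):
    # Step 1: nodes[i] holds the start positions of subpattern i.
    nodes = [find_all(s, text) for s in subpatterns]
    # Step 2: an edge joins consecutive subpatterns whose distance fits the gap.
    edges = {}
    for i, (lo, hi) in enumerate(gaps):
        for u in nodes[i]:
            for v in nodes[i + 1]:
                gap = v - (u + len(subpatterns[i]))
                if lo <= gap <= hi:
                    edges.setdefault((i, u), []).append((i + 1, v))
    return nodes, edges

def find_paths(nodes, edges):
    # Step 3: depth-first search from each P1 node to any node of the last subpattern.
    last = len(nodes) - 1
    paths = []

    def dfs(node, path):
        if node[0] == last:
            paths.append(path)
            return
        for nxt in edges.get(node, []):
            dfs(nxt, path + [nxt[1]])

    for start in nodes[0]:
        dfs((0, start), [start])
    return paths

nodes, edges = build_graph(["AA", "GC", "TT"], [(2, 3), (1, 3)], "GCAATTGCACTTC")
print(find_paths(nodes, edges))  # [[2, 6, 10]]
```

The printed path lists the 0-indexed start of each subpattern, which corresponds to positions 3, 7 and 11 when counting from 1.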

  25. A better idea. No need to build the graph explicitly. Step 1: find P1 = AA and compute the candidate range for P2. P = AA *(2,3) GC *(1,3) TT. Text: GCAATTGCACTTC. Counting positions from 1, P1 ends at position 4, so the gap *(2,3) lets P2 start at position 7 or 8.

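  26. A better idea. Step 2: find P2 = GC within the candidate range and compute the candidate range for P3. P = AA *(2,3) GC *(1,3) TT. Text: GCAATTGCACTTC. Counting positions from 1, P2 is found at position 7 and ends at position 8, so the gap *(1,3) lets P3 start at positions 10 to 12.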

  27. A better idea. Step 3: after finding P3 = TT within the candidate range, we have found an occurrence of P. P = AA *(2,3) GC *(1,3) TT. Text: GCAATTGCACTTC. Counting positions from 1, P3 is found at position 11.
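
A minimal sketch of the candidate-range idea, assuming a hypothetical function `gapped_search` and 0-indexed positions: instead of a graph, it keeps one boolean array saying where the next subpattern may start. Each subpattern is verified with a plain `startswith` check, so the sketch is simpler than, and not as fast as, the O(ln + m) algorithm the slides claim.

```python
def gapped_search(subpatterns, gaps, text):
    """Return the exclusive end positions of occurrences of the gapped pattern."""
    n = len(text)
    # allowed[i] is True when the current subpattern may start at position i.
    allowed = [True] * (n + 1)
    for k, sub in enumerate(subpatterns):
        ends = [False] * (n + 1)
        for i in range(n - len(sub) + 1):
            if allowed[i] and text.startswith(sub, i):
                ends[i + len(sub)] = True
        if k == len(subpatterns) - 1:
            return [e for e in range(n + 1) if ends[e]]
        lo, hi = gaps[k]
        # Candidate range for the next subpattern: [e + lo, e + hi] after each end e.
        # A difference array marks all ranges in time linear in n.
        diff = [0] * (n + 2)
        for e in range(n + 1):
            if ends[e] and e + lo <= n:
                diff[e + lo] += 1
                diff[min(e + hi, n) + 1] -= 1
        allowed, running = [], 0
        for i in range(n + 1):
            running += diff[i]
            allowed.append(running > 0)
    return []

print(gapped_search(["AA", "GC", "TT"], [(2, 3), (1, 3)], "GCAATTGCACTTC"))  # [12]
```

The result 12 is the exclusive end of the occurrence AATTGCACTT, which spans 0-indexed positions 2 to 11.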

  28. Approximate matching. Almost the same idea as the exact matching case: find the approximate occurrences of the subpatterns instead of the exact ones. [Figure: P1 (k1 = 0) located in the text, followed by the candidate range labeled *(2,3), k2 = 1.]

  29. Approximate matching. [Figure: P2 (k2 = 1) matched against the text, followed by the candidate range labeled *(1,3), k3 = 0. Entries marked infinity: no alignment can start from there.]

  30. Approximate matching. [Figure: P3 (k3 = 0) matched against the text, completing an approximate occurrence of the pattern.]
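
One way to read the approximate case is sketched below: a standard edit-distance table per subpattern, where the first row is 0 at positions inside the candidate range and infinity elsewhere, so no alignment can start outside the range. The function names `approx_ends` and `approx_gapped_search` are assumptions, positions are 0-indexed, and the details may differ from the authors' algorithm.

```python
INF = float("inf")

def approx_ends(sub, k, text, allowed):
    """Exclusive end positions where sub matches with at most k edits,
    with the match starting at a position j for which allowed[j] is True."""
    n = len(text)
    prev = [0 if allowed[j] else INF for j in range(n + 1)]
    for i in range(1, len(sub) + 1):
        cur = [prev[0] + 1] + [INF] * n
        for j in range(1, n + 1):
            cost = 0 if sub[i - 1] == text[j - 1] else 1
            cur[j] = min(prev[j - 1] + cost,  # match or substitution
                         prev[j] + 1,         # pattern character missing from the text
                         cur[j - 1] + 1)      # extra character in the text
        prev = cur
    return [j for j in range(n + 1) if prev[j] <= k]

def approx_gapped_search(subpatterns, ks, gaps, text):
    n = len(text)
    allowed = [True] * (n + 1)
    ends = []
    for idx, (sub, k) in enumerate(zip(subpatterns, ks)):
        ends = approx_ends(sub, k, text, allowed)
        if idx < len(gaps):
            lo, hi = gaps[idx]
            # Candidate range for the next subpattern after each end position.
            allowed = [False] * (n + 1)
            for e in ends:
                for s in range(e + lo, min(e + hi, n) + 1):
                    allowed[s] = True
    return ends

print(approx_gapped_search(["AA", "GC", "TT"], [0, 1, 0],
                           [(2, 3), (1, 3)], "GCAATTGTACTTC"))  # [12]
```

Each table costs time proportional to the subpattern length times n, so the tables together cost O(mn), in line with the bound stated in the slides; the range-marking loop here is left unoptimized.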

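  31. Handling class of characters. Represent characters as bit masks and AND each text character with the pattern class. For the class [GC]: G & [GC] = 0100 & 0101 = 0100, which is nonzero, so G matches; T & [GC] = 0010 & 0101 = 0000, which is zero, so T does not match.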
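
The bit-mask test can be written in a few lines. The codes for G, T and C follow from the slide (0100, 0010, and 0101 for [GC]); the code 1000 for A is an assumption, chosen as the remaining bit, and the names `CODE`, `class_mask` and `matches` are hypothetical.

```python
# One bit per nucleotide; the value for A is assumed, the others follow the slide.
CODE = {"A": 0b1000, "G": 0b0100, "T": 0b0010, "C": 0b0001}

def class_mask(chars):
    """OR together the codes of every character in the class, e.g. 'GC' for [GC]."""
    mask = 0
    for c in chars:
        mask |= CODE[c]
    return mask

def matches(text_char, mask):
    """A text character matches the class when the AND is nonzero."""
    return (CODE[text_char] & mask) != 0

gc = class_mask("GC")
print(format(gc, "04b"))  # 0101
print(matches("G", gc))   # True
print(matches("T", gc))   # False
```

A single character is just a class of size one, so the same test covers ordinary pattern positions as well.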

  32. Time Complexity. O(mn), where m is the length of the pattern and n is the length of the text, but faster in practice.

  33. Conclusion • O(ln + m) time algorithm for the exact gapped pattern matching problem. • O(mn) time algorithm for the approximate gapped pattern matching problem. • Open problem: time complexity in the average case?
