String Data Structures and Algorithms
String Data Structures and Algorithms David Fernández-Baca UNAM (Mexico) (based on notes by Srinivas Aluru) slightly modified by Benny Chor
Why Strings?
• Biological sequences can be viewed as strings, or finite series of characters, over an alphabet Σ.
• There is a wealth of algorithmic theory developed for general strings that we can apply to specific biological problems.
BBSI Summer School - Iowa State University
Suffix Trees
S = M A L A Y A L A M $ (positions 1 to 10)
[Figure: suffix tree of S; each leaf is labeled with the starting position of its suffix]
Suffix tree properties
• For a string S of length n, there are n leaves and at most n internal nodes.
• The tree therefore requires only linear space.
• Each leaf represents a unique suffix.
• Concatenation of edge labels from the root to a leaf spells out the suffix.
• Each internal node represents a distinct common prefix of at least two suffixes.
Edge Encoding
S = M A L A Y A L A M $ (positions 1 to 10)
[Figure: the suffix tree of S with each edge label replaced by a pair (start, end) of positions in S, e.g. (5, 10) for YALAM$ and (3, 4) for LA]
Naïve Suffix Tree Construction
Before starting: why exactly do we need this $, which is not part of the alphabet?
Naïve Suffix Tree Construction
[Figure: the tree after inserting suffixes 1 to 4 of MALAYALAM$ one at a time]
Finding a (short) Pattern in a (long) String
• Build a suffix tree of the string.
• Starting from the root, traverse a path matching the characters of the pattern.
• If stuck, the pattern is not present in the string. Otherwise, each leaf below the point reached gives a position of the pattern in the string.
Finding a Pattern in a String
Find “ALA”
[Figure: suffix tree of MALAYALAM$ with the path spelling ALA traced from the root]
Two matches, at positions 6 and 2
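
The search steps above can be sketched in Python. This is only an illustration: it builds an uncompressed suffix trie by naive insertion (quadratic time and space), not the compact linear-size suffix tree of the slides, and the names build_suffix_trie and find_pattern are made up for this sketch.

```python
def build_suffix_trie(text):
    """Insert every suffix of text character by character (naive, O(n^2)).

    Assumption: text ends with a unique terminator such as '$'.
    The multi-character key "leaf" stores the 1-indexed suffix start,
    so it cannot clash with single-character edge keys.
    """
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
        node["leaf"] = start + 1
    return root


def find_pattern(root, pattern):
    """Walk down from the root along pattern, then collect all leaves below."""
    node = root
    for ch in pattern:
        if ch not in node:
            return []  # stuck: the pattern does not occur
        node = node[ch]
    positions = []
    stack = [node]
    while stack:
        current = stack.pop()
        for key, child in current.items():
            if key == "leaf":
                positions.append(child)
            else:
                stack.append(child)
    return sorted(positions)


trie = build_suffix_trie("MALAYALAM$")
print(find_pattern(trie, "ALA"))  # [2, 6]
```

The walk down costs time proportional to the pattern length; collecting the leaves costs time proportional to the size of the subtree, which in a compact suffix tree is proportional to the number of occurrences.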
Finding Common Substrings
• Construct a generalized suffix tree for two strings (each suffix of each string is represented).
• Label each leaf with the suffix number and string label.
• Each internal node with a leaf from both strings in its subtree gives a common substring.
Generalized Suffix Tree
WINDOW$ (string 1, positions 1 to 7) and INDIGO$ (string 2, positions 1 to 7)
[Figure: generalized suffix tree of the two strings; each leaf is labeled with a pair (string, suffix position)]
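
As a rough sketch of the common-substring idea, the Python below inserts all suffixes of both strings into one uncompressed trie and records which string passes through each node. A node reached by both strings spells a common substring. The trie is a quadratic-size stand-in for the generalized suffix tree, and longest_common_substring is a name invented for this example. No terminator is needed here because every prefix of every suffix gets its own node.

```python
def longest_common_substring(s1, s2):
    """Return one longest substring shared by s1 and s2 ('' if none)."""
    def new_node():
        return {"children": {}, "owners": set()}

    root = new_node()
    for owner, text in ((1, s1), (2, s2)):
        for start in range(len(text)):
            node = root
            for ch in text[start:]:
                node = node["children"].setdefault(ch, new_node())
                node["owners"].add(owner)  # this string reaches the node

    best = ""
    stack = [(root, "")]
    while stack:
        node, label = stack.pop()
        for ch, child in node["children"].items():
            if child["owners"] == {1, 2}:  # seen in both strings
                candidate = label + ch
                if len(candidate) > len(best):
                    best = candidate
                stack.append((child, candidate))
    return best


print(longest_common_substring("WINDOW", "INDIGO"))  # IND
```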
Lowest Common Ancestors
• The lowest common ancestor (lca) of two nodes x and y in a rooted tree is the deepest node (farthest away from the root) that is an ancestor of both x and y.
• Concatenation of edge labels from the root to the lca of two leaves spells out the longest common prefix (lcp) of the two corresponding suffixes.
• lca(x, y) can be found in constant time after linear preprocessing [Bender00].
A Useful Property
String depth(lca(i, j)) = lcp(suffix i, suffix j)
[Figure: suffix tree of MALAYALAM$ with an lca node of string depth 3 marked]
Longest Common Extension
RAILWAY$ (positions 1 to 8) and GRAINY$ (positions 1 to 7) share the substring RAI.
With i indexing GRAINY$ and j indexing RAILWAY$: lce(1, 1) = 0 and lce(2, 1) = 3.
We’ll soon find lce’s useful in reconstructing phylogenetic trees based on whole genome/proteome sequences.
lce’s and lca’s
To compute lce’s for two strings S1 and S2:
• Build the generalized suffix tree, T, of S1 and S2.
• Compute the string depth of each node in T.
• Preprocess T for lca queries.
• lce(i, j) = string depth of the lca of suffix i of S1 and suffix j of S2.
Example
WINDOW$ (string 1) and INDIGO$ (string 2)
[Figure: the generalized suffix tree of the two strings, as on the Generalized Suffix Tree slide]
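
To make the quantity being computed concrete, here is a direct character-by-character definition of lce in Python. It is a reference sketch only: it takes time proportional to the answer per query, whereas the suffix tree plus lca method described above answers each query in constant time after linear preprocessing. The function name and the 1-indexed convention (matching the slides) are choices made for this sketch, and the strings are passed without the $ terminator.

```python
def lce(s1, s2, i, j):
    """Longest common extension of suffix i of s1 and suffix j of s2.

    Positions are 1-indexed, as on the slides. This is the naive
    definition, not the constant-time lca-based method.
    """
    a, b = i - 1, j - 1
    length = 0
    while (a + length < len(s1) and b + length < len(s2)
           and s1[a + length] == s2[b + length]):
        length += 1
    return length


print(lce("WINDOW", "INDIGO", 2, 1))  # 3, the shared prefix IND
```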
lce’s, revisited
Given two strings S1 and S2, we are now interested in finding, for each i, the index j such that lce(i, j) is maximal.
• What is the meaning of this task?
• How do we accomplish it efficiently?
• Notice that computing the values lce(i, j) for all j is inefficient!
Palindromes
• A palindrome is a string that reads the same in both directions.
• E.g., CATGTAC
• red rum, sir, is murder
• Palindrome problem: find all maximal palindromes in a string S.
Finding Palindromes in S
• Construct the reverse S’ of S.
• Build the generalized suffix tree T of S and S’.
• Preprocess T for lce queries.
• Now what? Left as homework.
Requirement: linear time (constant per query).
[Figure: the string S with a position q + 1 marked]
Palindromes in DNA sequences
• We sometimes need to deal with complemented palindromes, where A pairs with T and C pairs with G.
• E.g., ATCATGAT is a complemented palindrome.
• All complemented palindromes in S can be found using a GST of S and the complement of S’.
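
A complemented palindrome is a string equal to its own reverse complement, which is easy to check directly. The short Python sketch below shows only that check for a single string; it does not find all complemented palindromes inside a longer sequence, and the helper names are illustrative.

```python
# Watson-Crick base pairing: A with T, C with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(seq):
    """Reverse the sequence and complement every base."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def is_complemented_palindrome(seq):
    """True if seq reads the same as its reverse complement."""
    return seq == reverse_complement(seq)


print(is_complemented_palindrome("ATCATGAT"))  # True
print(is_complemented_palindrome("CATGTAC"))   # False: an ordinary palindrome only
```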
Suffix Array – Reducing Space
S = M A L A Y A L A M $ (positions 1 to 10)
[Figure: the suffix array and the longest common prefix array of S]
Suffixes 6 and 2 share “ALA”. Suffixes 2 and 8 share just “A”. Each lcp value is taken with the adjacent suffix in the array.
Pattern Search in Suffix Array
• All suffixes that share a common prefix appear in consecutive positions in the array.
• Pattern P can be located in the string using a binary search on the suffix array.
Naïve run-time = O(|P| log n). Improved to O(|P| + log n) [Manber&Myers93], and to O(|P|) [Abouelhoda et al. 02].
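
The naïve O(|P| log n) search can be sketched as two binary searches that bracket the block of suffixes starting with P. The Python below is an illustration under stated assumptions: the suffix array is built by plain sorting (not a linear-time construction), the terminator $ is treated as larger than every letter to match the ordering used in these slides (done here by mapping it to '~', a trick that assumes uppercase text), and all function names are invented for this sketch.

```python
def _key(s):
    # Assumption: '$' sorts after every letter, as in the slides' example.
    return s.replace("$", "~")


def build_suffix_array(text):
    """0-indexed suffix starts in lexicographic order (naive sort)."""
    return sorted(range(len(text)), key=lambda i: _key(text[i:]))


def find_occurrences(text, sa, pattern):
    """Return the sorted 1-indexed positions where pattern occurs in text."""
    m = len(pattern)
    p = _key(pattern)

    # First suffix whose length-m prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if _key(text[sa[mid]:sa[mid] + m]) < p:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # First suffix whose length-m prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if _key(text[sa[mid]:sa[mid] + m]) <= p:
            lo = mid + 1
        else:
            hi = mid

    return sorted(sa[k] + 1 for k in range(first, lo))


text = "MALAYALAM$"
sa = build_suffix_array(text)
print(find_occurrences(text, sa, "ALA"))  # [2, 6]
```

Each comparison inspects at most |P| characters and each binary search makes O(log n) comparisons, which gives the naïve bound quoted above.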
Computing longest common prefix Values
• Find where suffix 1 is in the suffix array.
• Compute the lcp value of suffix 1 with its adjacent suffix in the array.
• Find suffix 2 in the suffix array.
• Compute the lcp value of suffix 2.
• Repeat for all suffixes, in text order.
Run-time is linear (why?)
Example
Text:          M A L A Y A L A M $
Position:      1 2 3 4 5 6 7 8 9 10
Suffix Array:  6 2 8 4 7 3 1 9 5 10
lcp Array:     3 1 1 0 2 0 1 0 0
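
One standard way to realize the procedure of visiting suffixes in text order is the algorithm usually attributed to Kasai et al.; the slides do not name it, so treat the attribution and the code below as an assumption-laden sketch. The key observation, which also answers the "why linear?" question, is that if suffix i shares h characters with its neighbor in the array, then suffix i + 1 shares at least h - 1 characters with its own neighbor, so the match length never has to restart from zero. Here lcp[k] is the lcp of the suffixes at array positions k and k + 1, matching the example above, and $ is again sorted after all letters.

```python
def build_suffix_array(text):
    """Naive sort; '$' is mapped to '~' so it sorts after uppercase letters."""
    return sorted(range(len(text)), key=lambda i: text[i:].replace("$", "~"))


def lcp_array(text, sa):
    """lcp[k] = longest common prefix of suffixes sa[k] and sa[k + 1].

    Assumes text ends with a unique terminator. Runs in O(n) overall
    because h drops by at most 1 per step and never exceeds n.
    """
    n = len(text)
    rank = [0] * n
    for k, start in enumerate(sa):
        rank[start] = k

    lcp = [0] * (n - 1)
    h = 0
    for i in range(n):            # suffixes in text order
        if rank[i] == n - 1:      # last in the array: no right neighbor
            h = 0
            continue
        j = sa[rank[i] + 1]       # right neighbor in the suffix array
        while i + h < n and j + h < n and text[i + h] == text[j + h]:
            h += 1
        lcp[rank[i]] = h
        if h > 0:
            h -= 1                # suffix i + 1 keeps at least h - 1 matches
    return lcp


text = "MALAYALAM$"
sa = build_suffix_array(text)
print([i + 1 for i in sa])  # [6, 2, 8, 4, 7, 3, 1, 9, 5, 10]
print(lcp_array(text, sa))  # [3, 1, 1, 0, 2, 0, 1, 0, 0]
```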
Suffix Trees vs. Suffix Arrays
Suffix Array = lexicographic order of the leaves of the Suffix Tree
Suffix Tree = Suffix Array + lcp Array (why? Wait for the next slide)
Building a ST from a SA and lcp
[Figure: part of the suffix tree of MALAYALAM$ being built from the SA and lcp arrays, with internal nodes annotated by string depth D = 0, 1, 2, 3]
Some Results
• A suffix tree can be constructed in O(n) time and O(n |Σ|) space [Weiner73, McCreight76, Ukkonen92].
• Suffix arrays can be constructed without using suffix trees in O(n) time [Ko&Aluru03].
More Applications
• Suffix-prefix overlaps in fragment assembly
• Maximal and tandem repeats
• Shortest unique substrings
• Maximal unique matches [MUMmer]
• Approximate matching
Dealing with errors
• The basic string data structures can only extract information in the absence of errors.
• To deal with errors, decompose into parts that do not involve errors.
The k-mismatch problem
• Given a pattern P, a text T, and a number k, find all occurrences of P in T with at most k mismatches.
Example: P = bend, T = abentbananaend, k = 2
Match 1: bent
Match 2: bana
Match 3: aend
Solution
• Build the GST of P and T and preprocess it for lce queries.
• For each starting index i in T, do at most k + 1 lce queries to determine whether there is a k-mismatch occurrence beginning at i: each query jumps over a run of matching characters to the next mismatch.
[Figure: P aligned against T at index i]
Time = O(k |T|)
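
The jump-over-matches idea can be sketched in Python as follows. To keep the example short, the lce helper compares characters directly; with the GST and lca preprocessing described above, each call would instead take constant time, which is where the O(k |T|) bound comes from. The function names and the 1-indexed output are choices made for this sketch.

```python
def lce(pattern, text, i, j):
    """Naive stand-in for a constant-time lce query (0-indexed positions)."""
    length = 0
    while (i + length < len(pattern) and j + length < len(text)
           and pattern[i + length] == text[j + length]):
        length += 1
    return length


def k_mismatch(pattern, text, k):
    """1-indexed starts in text where pattern matches with <= k mismatches."""
    m = len(pattern)
    hits = []
    for start in range(len(text) - m + 1):
        pos = 0          # characters of the pattern accounted for so far
        mismatches = 0
        while True:
            pos += lce(pattern, text, pos, start + pos)  # skip a matching run
            if pos >= m:
                hits.append(start + 1)                   # reached the end
                break
            mismatches += 1                              # pos is a mismatch
            if mismatches > k:
                break
            pos += 1                                     # step over it
    return hits


print(k_mismatch("bend", "abentbananaend", 2))  # [2, 6, 11]
```

The three reported positions correspond to bent, bana, and aend from the example slide.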
References
• M. I. Abouelhoda, S. Kurtz and E. Ohlebusch, The enhanced suffix array and its applications to genome analysis, 2nd Workshop on Algorithms in Bioinformatics, pp. 449-463, 2002.
• M. A. Bender and M. Farach-Colton, The LCA problem revisited, LATIN, pp. 88-94, 2000.
• P. Ko and S. Aluru, Linear time suffix sorting, CPM, pp. 200-210, 2003.
• U. Manber and G. Myers, Suffix arrays: a new method for on-line search, SIAM J. Comput., 22:935-948, 1993.
• E. M. McCreight, A space-economical suffix tree construction algorithm, J. ACM, 23(2):262-272, 1976.
• E. Ukkonen, Constructing suffix trees on-line in linear time, Intern. Federation of Information Processing, pp. 484-492, 1992. Also in Algorithmica, 14(3):249-260, 1995.
• P. Weiner, Linear pattern matching algorithms, Proc. of the 14th IEEE Annual Symp. on Switching and Automata Theory, pp. 1-11, 1973.