Advanced String Matching Techniques
Explore suffix trees, arrays, and advanced algorithms for efficient string matching operations. Learn about subsequence and substring identification, with practical applications in genomics and databases.
Presentation Transcript
Suffix Trees • Suffix trees • Linearized suffix trees • Virtual suffix trees • Suffix arrays • Enhanced suffix arrays • Suffix cactus, suffix vectors, …
Strings and Substrings • String … any sequence of characters. • Substring of string S … string composed of characters i through j of S, i <= j. • S = cater => ate is a substring. • car is not a substring. • The empty string is a substring of S.
Subsequence • Subsequence of string S … string composed of characters i1 < i2 < … < ik of S. • S = cater => ate is a subsequence. • car is a subsequence. • The empty string is a subsequence.
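The two definitions above can be checked mechanically. The following is a minimal sketch; the function names `is_substring` and `is_subsequence` are illustrative and do not come from the slides.

```python
def is_substring(s: str, p: str) -> bool:
    """True if p occurs as a contiguous block of s (the empty string always does)."""
    return any(s[i:i + len(p)] == p for i in range(len(s) - len(p) + 1))


def is_subsequence(s: str, p: str) -> bool:
    """True if the characters of p appear in s in order, not necessarily adjacent."""
    chars = iter(s)
    # 'c in chars' consumes the iterator up to the first match, so order is enforced.
    return all(c in chars for c in p)
```

With S = cater, `is_substring("cater", "car")` is false while `is_subsequence("cater", "car")` is true, matching the slide examples.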
String/Pattern Matching • You are given a source string S. • Answer queries of the form: is the string pi a substring of S? • Knuth-Morris-Pratt (KMP) string matching. • O(|S| + |pi|) time per query. • O(n|S| + Σi |pi|) time for n queries. • Suffix tree solution. • O(|S| + Σi |pi|) time for n queries.
String/Pattern Matching • KMP preprocesses the query string pi, whereas the suffix tree method preprocesses the source string S. • An application of string matching. • Genome project. • Databank of strings (gene sequences). • Character set is ATGC. • Determine if a “new” sequence is a substring of a databank sequence.
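To make the contrast concrete, here is a minimal sketch of KMP, which preprocesses the pattern into a failure table and then scans S once. The names `failure_table` and `kmp_contains` are illustrative, and this is a textbook formulation rather than code from the presentation.

```python
def failure_table(p: str) -> list[int]:
    """fail[i] = length of the longest proper prefix of p[:i+1] that is also its suffix."""
    fail = [0] * len(p)
    k = 0
    for i in range(1, len(p)):
        while k > 0 and p[i] != p[k]:
            k = fail[k - 1]
        if p[i] == p[k]:
            k += 1
        fail[i] = k
    return fail


def kmp_contains(s: str, p: str) -> bool:
    """Answer one query 'is p a substring of s?' in O(|s| + |p|) time."""
    if not p:
        return True
    fail = failure_table(p)  # preprocessing depends only on the pattern
    k = 0                    # number of pattern characters currently matched
    for c in s:
        while k > 0 and c != p[k]:
            k = fail[k - 1]
        if c == p[k]:
            k += 1
            if k == len(p):
                return True
    return False
```

Because the whole of S is rescanned for every pattern, n queries cost O(n|S| + Σi |pi|), which is the bound the suffix tree improves on.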
Definition Of Suffix Tree • Compressed trie with edge information. • Keys are the nonempty suffixes of a given string S. • Nonempty suffixes of S = sleeper are: • sleeper • leeper • eeper • eper • per, er, and r.
String Matching & Suffixes • pi is a substring of S iff pi is a prefix of some suffix of S. • Nonempty suffixes of S = sleeper are: • sleeper • leeper • eeper • eper • per, er, and r. • Which of these are substrings of S? • leep, eepe, pe, leap, peel
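The prefix-of-a-suffix characterization can be tried directly on the slide's question. This is a brute-force sketch for illustration only (the function names are hypothetical); it is not how a suffix tree performs the check.

```python
def nonempty_suffixes(s: str) -> list[str]:
    """All nonempty suffixes of s, longest first."""
    return [s[i:] for i in range(len(s))]


def is_prefix_of_some_suffix(s: str, p: str) -> bool:
    """p is a substring of s iff p is a prefix of some suffix of s."""
    return any(suffix.startswith(p) for suffix in nonempty_suffixes(s))


for p in ["leep", "eepe", "pe", "leap", "peel"]:
    print(p, is_prefix_of_some_suffix("sleeper", p))
```

For S = sleeper this reports leep, eepe, and pe as substrings, and leap and peel as non-substrings.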
Last Character Of S Repeats • When the last character of S appears more than once in S, S has at least one suffix that is a proper prefix of another suffix. • S = creeper • creeper, reeper, eeper, eper, per, er, r • When the last character of S appears more than once in S, use an end of string character # to overcome this problem. • S = creeper# • creeper#, reeper#, eeper#, eper#, per#, er#, r#, #
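The problem and its fix can be observed with a small brute-force check. The helper name below is hypothetical, and the sketch assumes # occurs nowhere else in S.

```python
def suffix_that_prefixes_another(s: str):
    """Return (shorter, longer) if some suffix is a proper prefix of another suffix, else None."""
    suffixes = [s[i:] for i in range(len(s))]
    for a in suffixes:
        for b in suffixes:
            if len(a) < len(b) and b.startswith(a):
                return (a, b)
    return None


print(suffix_that_prefixes_another("creeper"))   # ('r', 'reeper')
print(suffix_that_prefixes_another("creeper#"))  # None
```

Without the end marker, the suffix r would end in the middle of the path for reeper instead of at its own element node; appending # gives every suffix a distinct ending.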
Suffix Tree For S = abbbabbbb# • [Figure: suffix tree for S = abbbabbbb#, character positions 1 through 10. Edge labels include abbb, b, #, abbbb#, and b#. The element nodes, left to right, hold the suffix start positions 10, 1, 5, 9, 4, 8, 3, 7, 2, 6.]
Suffix Tree Construction • See Web write up for algorithm. • Time complexity • |S| = n, alphabet size = r. • O(nr) using array nodes. • This is O(n) for r a constant (or r <= c). • O(n) expected time using a hash table. • O(n) time algorithm for large r in reference cited in Web write up.
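The linear-time construction is left to the write-up the slide refers to. As a stand-in, the sketch below builds an uncompressed suffix trie by inserting every suffix, using a Python dict (a hash table) for each node's children. It takes O(n^2) time and space, so it illustrates only the data structure and the O(|p|) search, not the O(n) algorithm; `build_suffix_trie` and `trie_contains` are illustrative names.

```python
def build_suffix_trie(s: str) -> dict:
    """Naive O(n^2) suffix trie: nested dicts keyed by character."""
    root: dict = {}
    for i in range(len(s)):
        node = root
        for c in s[i:]:
            node = node.setdefault(c, {})  # hash-table lookup per character
    return root


def trie_contains(root: dict, p: str) -> bool:
    """Walk down from the root following p: O(|p|) expected time."""
    node = root
    for c in p:
        if c not in node:
            return False
        node = node[c]
    return True


root = build_suffix_trie("abbbabbbb#")
print(trie_contains(root, "babb"), trie_contains(root, "abbba"), trie_contains(root, "baba"))
```

A real suffix tree compresses each chain of single-child nodes into one edge, which is what brings the node count down to O(n).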
Suffix Array • Array that contains the start position of suffixes in lexicographic order. • abbbabbbb# • Assume # < a < b • # < abbbabbbb# < abbbb# < b# < babbbb# < bb# < bbabbbb# < bbb# < bbbabbbb# < bbbb# • SA = [10, 1, 5, 9, 4, 8, 3, 7, 2, 6] • LCP = length of longest common prefix between adjacent entries of SA. • LCP = [0, 4, 0, 1, 1, 2, 2, 3, 3, -]
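The SA and LCP values above can be reproduced with a naive sketch that sorts the suffixes directly. This is not a linear-time construction, and the function names are illustrative. It relies on Python comparing characters by code point, which happens to give # < a < b.

```python
def suffix_array(s: str) -> list[int]:
    """1-based start positions of the suffixes of s in lexicographic order (naive sort)."""
    return sorted(range(1, len(s) + 1), key=lambda i: s[i - 1:])


def lcp_array(s: str, sa: list[int]) -> list[int]:
    """LCP[k] = length of the longest common prefix of suffixes SA[k] and SA[k+1]."""
    def common(a: str, b: str) -> int:
        n = 0
        while n < len(a) and n < len(b) and a[n] == b[n]:
            n += 1
        return n

    return [common(s[sa[k] - 1:], s[sa[k + 1] - 1:]) for k in range(len(sa) - 1)]


s = "abbbabbbb#"
sa = suffix_array(s)
print(sa)                # [10, 1, 5, 9, 4, 8, 3, 7, 2, 6]
print(lcp_array(s, sa))  # [0, 4, 0, 1, 1, 2, 2, 3, 3]
```

The returned LCP list has one fewer entry than SA; the slide marks the missing last entry with "-".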
Suffix Array • Less space than suffix tree. • Linear time construction. • Can be used to solve several of the problems solved by a suffix tree with the same asymptotic complexity. • Substring matching: binary search for p using SA takes O(|p| log |S|) time.
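A sketch of the binary search follows. Each probe compares p with the first |p| characters of one suffix, giving O(|p| log |S|) overall. The suffix array is built by naive sorting to keep the example self-contained, and the names are illustrative.

```python
def build_sa(s: str) -> list[int]:
    """1-based suffix array by naive sorting (for illustration only)."""
    return sorted(range(1, len(s) + 1), key=lambda i: s[i - 1:])


def sa_contains(s: str, sa: list[int], p: str) -> bool:
    """Binary search for the first suffix whose |p|-character prefix is >= p."""
    def head(k: int) -> str:
        start = sa[k] - 1
        return s[start:start + len(p)]

    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if head(mid) < p:
            lo = mid + 1
        else:
            hi = mid
    return lo < len(sa) and head(lo) == p


s = "abbbabbbb#"
sa = build_sa(s)
print(sa_contains(s, sa, "babb"), sa_contains(s, sa, "baba"))  # True False
```

The search is valid because truncating lexicographically sorted suffixes to their first |p| characters leaves them in non-decreasing order.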
O(|pi|) Time Substring Matching • [Figure: suffix tree for S = abbbabbbb#, searched for the example patterns babb, abbba, and baba.]
Find All Occurrences Of pi • Search suffix tree for pi. • Suppose the search for pi is successful. • When search terminates at an element node, pi appears exactly once in the source string S.
Search Terminates At Element Node • [Figure: suffix tree for S = abbbabbbb#; the search for abbbb# terminates at an element node.]
Search Terminates At Branch Node • When the search for pi terminates at a branch node, each element node in the subtree rooted at this branch node gives a different occurrence of pi.
Search Terminates At Branch Node • [Figure: suffix tree for S = abbbabbbb#; the search for ab terminates at a branch node.]
Find All Occurrences Of pi • To find all occurrences of pi in time linear in the length of pi and linear in the number of occurrences of pi, augment the suffix tree: • Link all element nodes into a chain in inorder. • Each branch node keeps a pointer to the leftmost and rightmost element node in its subtree.
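The idea that every element node below the point where the search stops is one occurrence can be sketched on an uncompressed suffix trie. This version simply walks the subtree instead of following the chain and leftmost/rightmost pointers described above, so it does not achieve the stated time bound; the class and function names are illustrative.

```python
class Node:
    def __init__(self):
        self.children = {}
        self.start = None  # set where a suffix ends: its 1-based start position


def build(s: str) -> Node:
    """Naive O(n^2) suffix trie; s is assumed to end with a unique marker such as '#'."""
    root = Node()
    for i in range(len(s)):
        node = root
        for c in s[i:]:
            node = node.children.setdefault(c, Node())
        node.start = i + 1
    return root


def occurrences(root: Node, p: str) -> list[int]:
    """Search for p, then report every suffix start found below the stopping point."""
    node = root
    for c in p:
        node = node.children.get(c)
        if node is None:
            return []
    found, stack = [], [node]
    while stack:
        n = stack.pop()
        if n.start is not None:
            found.append(n.start)
        stack.extend(n.children.values())
    return sorted(found)


root = build("abbbabbbb#")
print(occurrences(root, "ab"))      # [1, 5]
print(occurrences(root, "abbbb#"))  # [5]
```

In the augmented suffix tree, the same answer is read off by walking the element-node chain from the branch node's leftmost element node to its rightmost one.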
Augmented Suffix Tree • [Figure: augmented suffix tree for S = abbbabbbb#, with the element nodes chained in inorder; example pattern b.]
Longest Repeating Substring • Find the longest substring of S that occurs at least m times in S, m > 1. • Label branch nodes with the number of element nodes in their subtree. • Find the branch node with label >= m and max char# field.
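The labeling idea can be sketched on an uncompressed suffix trie, where a node's count is the number of suffixes passing through it (equal to the number of element nodes below it) and its depth plays the role of char#. This is an O(n^2) illustration under those assumptions, with a hypothetical function name, not the suffix tree procedure itself.

```python
def longest_repeated(s: str, m: int) -> str:
    """Longest substring of s occurring at least m times (m > 1); '' if none."""
    root = [0, {}]  # each node is [count, children]
    best = ""
    for i in range(len(s)):
        node = root
        for depth, c in enumerate(s[i:], start=1):
            node = node[1].setdefault(c, [0, {}])
            node[0] += 1  # one more suffix passes through this node
            if node[0] >= m and depth > len(best):
                best = s[i:i + depth]
    return best


print(longest_repeated("abbbabbbb#", 2))  # abbb
print(longest_repeated("abbbabbbb#", 5))  # bb
```

Counts only grow as suffixes are inserted, so every node whose final count is at least m is examined at the moment it reaches m.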
Longest Repeating Substring • [Figure: suffix tree for S = abbbabbbb# with branch nodes labeled 10, 5, 7, 2, and 3; examples shown for m = 2 and m = 5.]
Longest Common Substring • Given two strings S and T. • Find the longest common substring. • S = carport, T = airports • Longest common substring = rport • Longest common subsequence = arport • Longest common subsequence may be found in O(|S|*|T|) time using dynamic programming. • Longest common substring may be found in O(|S|+|T|) time using a suffix tree.
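For the subsequence side of the comparison, the O(|S|*|T|) dynamic program can be sketched as follows. It returns only the length, keeps one row of the table at a time, and uses an illustrative function name.

```python
def lcs_length(s: str, t: str) -> int:
    """Length of the longest common subsequence of s and t, O(|s|*|t|) time."""
    prev = [0] * (len(t) + 1)  # prev[j] = LCS length of the processed part of s and t[:j]
    for a in s:
        cur = [0]
        for j, b in enumerate(t, start=1):
            if a == b:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


print(lcs_length("carport", "airports"))  # 6, the length of arport
```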
Longest Common Substring • Let $ be a new symbol. • Construct the suffix tree for the string U = S$T#. • U = carport$airports# • Find the longest repeating substring that occurs both to the left and to the right of $. • No repeating substring includes $. • Find the branch node that has max char# and has, in its subtree, at least one element node for a suffix that begins in S and at least one for a suffix that begins in T.
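The U = S$T# construction can be sketched with an uncompressed suffix trie in which each node records whether a suffix beginning in S, a suffix beginning in T, or both pass through it. This naive version takes O(|U|^2) time and space rather than O(|S|+|T|); it assumes that $ and # occur in neither input, and the function name is illustrative.

```python
def longest_common_substring(s: str, t: str) -> str:
    """Deepest trie node reached by a suffix starting in s and one starting in t."""
    u = s + "$" + t + "#"  # assumes '$' and '#' appear in neither s nor t
    root = [0, {}]         # each node is [origin mask, children]; bit 1 = S, bit 2 = T
    best = ""
    for i in range(len(u)):
        bit = 1 if i <= len(s) else 2  # the suffix starting at '$' is counted on the S side
        node = root
        for depth, c in enumerate(u[i:], start=1):
            node = node[1].setdefault(c, [0, {}])
            node[0] |= bit
            if node[0] == 3 and depth > len(best):
                best = u[i:i + depth]
    return best


print(longest_common_substring("carport", "airports"))  # rport
```

No suffix that begins in T contains $, so a node reached from both sides always spells a string that lies entirely within S and entirely within T.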