240 likes | 312 Vues
Streamline the sorting of n strings efficiently by extracting the next lexicographically smallest string using our Heap of Strings (HoS) method, combining Heap and Longest Common Prefix properties for optimal performance. The algorithm works for unbounded alphabets and ensures quick processing time with minimal overhead.
 
                
                E N D
On Demand String Sorting over Unbounded Alphabets Carmel Kent Moshe Lewenstein Dafna Sheinwald
On-Demand String Sorting • Preprocessing a set of n strings for an efficient subsequent repeated: • extract the next lexicographically smallest string • Motivation, e.g.: Search engines recurrently return the next best k (k < n)pages • Pages typically ranked by relevance, but can also by values of specified field • Our Heap of Strings (HoS) preprocesses in O(n) time and extracts next smallest in O(log n) time + amortized O(N) time, over all operations on all n strings • N = total length of the n strings. • Combines Heap and Longest Common Prefix properties • Works for unbounded alphabets • e Sorting of Strings is possible in O(n log n + N) time. • Implementing all classic Heap operations, only paying extra O(N) amortized, is not simple
LCP • Def: lcp(S1,S2) denotes the length of the largest common prefix of S1 and S2 • Folklore Lemma: • For strings S1,S2,…,Sm , lcp(S1,Sm) = min1[i<m lcp(Si ,Si+1) • e For strings S1m S2, S3 • S2[ S3t lcp(S1,S2) [ lcp(S1,S3) • Equivalently: lcp(S1,S2) >lcp(S1,S3) t S2> S3 • e For strings S1[ S2, S3 • S2[ S3t lcp(S1,S2) m lcp(S1,S3) • Equivalently: lcp(S1,S2) <lcp(S1,S3) t S2> S3
Heap of Strings (HoS) • Binary (balanced) tree • Each node n holds a string S(n) and an lcp value lcp(n) • Each node n satisfies the HoS property: • S(n) m S(p) for parent p of n • lcp(n)= lcp (S(n), S(p) )
Procedures for Heapify • Basic Step for 2 (BS2): • Given strings S1, S2, • compare the two, letter by letter, • until smaller is identified, and the lcp of the two is determined. • Basic Step for 3 (BS3): • Given strings S1, S2, and S3, • compare all three, letter by letter, • until smallest, Si, is identified, and the lcp of each of the other strings with Si is determined.
More Procedures for Heapify • Basic Step for 2, starting in position l, BS2(l) : • Given strings S1, S2, with common prefix of length l • compare the two, letter by letter, starting from position l • until smaller, Si, is identified, and the lcp of the two is determined. • Basic Step for 3, starting in position l, BS3(l) : • Given strings S1, S2, and S3, with common prefix of length l • compare all three, letter by letter, starting from position l • until smallest, Si, is identified, and the lcp of each of the other strings with Si is determined.
Strings thrown into a binary balanced tree Heapify - based on the classic O(n) process Bottom up, Subtrees are made into HoS-s
Merge two little HoS-es and a root node into one big HoS BS3 of three nodes.(two larger get lcp wrt smallest) Swap, if needed, to getsmallest positioned as root At most one subtree gets a new root This new root, as well as two children, all have larger strings than grand root, and have lcp-s wrt it
Merge two little HoS-es and a root node into one big HoS Comparing lcp-s suffices. On equality, read, from lcp on, BS2(l), or BS3(l), until smallest is found and updated lcp determined. Always record lcp with strings found larger,and swap, if needed, to make smallest the parent. Thus maintaining HoS property If swap needed, continue recursivelysifting down
Merge two little HoS-es and a root node into one big HoS: Sifting Down • On each swap, a sub-HoS gets a new root. • That new root, as well as its two children, all have larger strings than, and lcp wrt, old root • now positioned as parent of new root. • e Comparing lcp suffices to tell smallest of • On tie, read more of two (or three) strings, starting at common lcp, until smallestfound and updated lcp is determined • Record updated lcp with larger string(s).
Sifting down in a HoS of height h • O(h) node operations • For each string comparison, at least one string has its lcp field increase by the number of letter comparisons made.
Heapifying into a HoS of n strings of total length N takes O(n+N) time • A string with lcp = l never gets its prefix of length l participating in any letter comparison • e No more than a total of O(N) letter comparisons • e Heapifying completes in O(n) node operations + a total of O(N) letter comparisons.
Extracting next smallest string from a HoS • Extract the root • Both (now orphan) children are larger than, and have lcp wrt, their gone parent • Comparing lcp-s suffices to find smaller of the two. • In case of a tie, BS2(l) finds smaller and updates lcp;record updated lcp with the larger child • Promote smaller child to vacant parent position • Recurse in subtree rooted by promoted child • HoS property maintained. • Tree might become unbalanced, but not higher.
Extracting next smallest string from a HoS For a HoS of height h O(h) node operations. Thm: Sorting of n strings of total length N,over unbounded alphabet, is possible in O(n log n + N) time, using O(n) space. Letter comparison only from common lcp on. For each letter comparison, at least one lcp grows. lcp never decreases. HoS becomes unbalanced But height does not grow. eNo more than a total of O(N) letter comparisons for heapify followed by sequential extraction of all strings
O(nlogn + N) string sorting • String sorting is a classical problem, appears in textbooks [Knuth, AHU] • Variants: multikey sorting, parallel sorting • Weight balanced ternary search trie [M 79] achieves this runtime • QuickSort with average sorting time of O(nlogn + N) [BS, 97] • Multi phase merge sort, for enhancing cache utilization [I, 05. IBM in the ’80s]
O(nlogn + N) string sorting • Indexing data structures: suffix trees, suffix arrays, BIS [AKLL, 05] • for suffixes of same string • Allow O(nlogn + N) sorting • Some can adjust to a general set of strings (BIS) • All use O (n log n + N) just to build the data structure and get the first result out.
Efficient On Demand Sorting • Thm: On Demand Sorting of n strings of total length Ncan be done with the extraction of the first result in O(n + N1) time, after which the retrieval of further results in O(log n + Ni) time for the i-th result, with Si Ni [ N.
Variation:Find the smallest k < n strings • Maintain a HoS of k elements, with parents LARGER than children. • root holds largest of smallest k • Build a HoS from arbitrary k of the set • For each remaining string in the set: • compare with root and determine lcp of the two • if new is larger than root – discard new • otherwise, discard root, and sift down
Find the smallest k < n strings O (n log k + N) to identify k smallest of n+O ( k log k) to get these sorted.
Can HoS do additional operations of the ordinary (integer) Heap? • Already seen: Extract min costs O(height) yet does not maintain tree balanced • The classic delete takes the last leaf and sifts it down from root, thus maintaining balance • Leaf loses the lcp it has gained, and compares again its leading letters • Classic insertion by sifting up • Some nodes get their grandparent becoming their parent, need to decrease their lcp
Insertion, creating embedded data structure Original nodes do not move. Grandparent do not become parents BIS BIS When pumping up forextraction, smallest node leavesBIS and becomes HoS node. BIS BIS HoS node never gets intoa BIS. BIS
BIS Balanced Indexing StructuresAKLL, 2005 • Adapted here from suffixes to any set of strings • BIS is an AVL tree with fixed size extra info (lcps and pointers) in nodes • allows to insert a string of length l to a tree of size n in O(log n + l ) time • Deletion in O(log n)
Altogether • Thm: It is possible to construct a heap of n strings in O(n) time and support further string insertions and smallest string extractions in time O(log(n) + log(m)) + O(N) amortized over the whole sequence of heap operations, where m is the number of strings inserted post heapifying and were not extracted yet.
Conclusion • Combining basic elements, we support a modern concept: On Demand • lcp is proved again an interesting, useful measure (lcp with what?) • This is a real need: basic sort and k smallest sort are implemented in a search engine product