Data Structures and Algorithms

Data Structures and Algorithms Course slides: Radix Search, Radix Sort, Bucket Sort, Huffman Compression

Radix Searching • For many applications, keys can be thought of as numbers • Searching methods that take advantage of digital properties of these keys are called radix searches • Radix searches treat keys as numbers in base M (the radix) and work with individual digits Lecture 10: Searching

Radix Searching • Provide reasonable worst-case performance without complication of balanced trees. • Provide way to handle variable length keys. • Biased data can lead to degenerate data structures with bad performance.

The Simplest Radix Search • Digital Search Trees: like BSTs, but branch according to the key’s bits. • Key comparison replaced by a function that accesses the key’s next bit.

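Digital Search Example • Keys and their 5-bit codes, in the order listed on the slide: A 00001, S 10011, E 00101, R 10010, C 00011, H 01000 • [Figure: digital search tree; nodes from top to bottom and left to right: A; E, S; C, H, R]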

Digital Search Trees • Consider BST search for key K • For each node T in the tree we have 4 possible results • T is empty (or a sentinel node) indicating item not found • K matches T.key and item is found • K < T.key and we go to left child • K > T.key and we go to right child • Consider now the same basic technique, but proceeding left or right based on the current bit within the key

Digital Search Trees • Call this tree a Digital Search Tree (DST) • DST search for key K • For each node T in the tree we have 4 possible results • T is empty (or a sentinel node) indicating item not found • K matches T.key and item is found • Current bit of K is a 0 and we go to left child • Current bit of K is a 1 and we go to right child • Look at example on board
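The search rule above can be sketched in Python. This is a minimal illustration, not the course's reference code: the names `DSTNode`, `bit`, `dst_insert`, and `dst_search` are hypothetical, keys are assumed to be non-negative integers that fit in `width` bits, and bits are read from the most significant end.

```python
class DSTNode:
    """One node of a digital search tree: a key, its value, two children."""
    def __init__(self, key, value):
        self.key = key
        self.value = value
        self.left = None    # followed when the current bit is 0
        self.right = None   # followed when the current bit is 1


def bit(key, i, width=5):
    """Return bit i of key, where bit 0 is the most significant of `width` bits."""
    return (key >> (width - 1 - i)) & 1


def dst_insert(root, key, value, width=5):
    """Insert (key, value) and return the (possibly new) root."""
    if root is None:
        return DSTNode(key, value)
    node, depth = root, 0
    while True:
        if node.key == key:            # key already present: overwrite value
            node.value = value
            return root
        go_right = bit(key, depth, width)
        child = node.right if go_right else node.left
        if child is None:              # empty slot found: attach new node here
            new_node = DSTNode(key, value)
            if go_right:
                node.right = new_node
            else:
                node.left = new_node
            return root
        node, depth = child, depth + 1


def dst_search(root, key, width=5):
    """Return the value stored under key, or None if the key is absent."""
    node, depth = root, 0
    while node is not None:
        if node.key == key:            # full key comparison at every node
            return node.value
        node = node.right if bit(key, depth, width) else node.left
        depth += 1
    return None
```

Inserting the example keys A, S, E, R, C, H (with the 5-bit codes from the example slide) in that order gives a tree with A at the root, E and S below it, and C, H, R on the third level. Note that every step of the search still compares the whole key, which is the cost the radix trie slides address.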

Digital Search Trees • Run-times? • Given N random keys, the height of a DST should average O(log_2 N) • Think of it this way: if the keys are random, at each branch it should be equally likely that a key will have a 0 bit or a 1 bit • Thus the tree should be well balanced • In the worst case, we are bound by the number of bits in the key (say it is b) • So in a sense we can say that this tree has a constant run-time, if the number of bits in the key is a constant • This is an improvement over the BST

Digital Search Trees • But DSTs have drawbacks • Bitwise operations are not always easy • Some languages do not provide for them at all, and for others it is costly • Handling duplicates is problematic • Where would we put a duplicate object? • Follow bits to new position? • Will work but Find will always find first one • Actually this problem exists with BST as well • Could have nodes store a collection of objects rather than a single object

Digital Search Trees • Similar problem with keys of different lengths • What if a key is a prefix of another key that is already present? • Data is not sorted • If we want sorted data, we would need to extract all of the data from the tree and sort it • May do b comparisons (of entire key) to find a key • If a key is long and comparisons are costly, this can be inefficient

Digital Search • Requires O(log N) comparisons on average • Requires b comparisons in the worst case for a tree built with N random b-bit keys

Digital Search • Problem: At each node we make a full key comparison, which may be expensive, e.g. for very long keys • Solution: store keys only at the leaves, use radix expansion to do intermediate key comparisons

Radix Tries • Used for Retrieval [sic] • Internal nodes used for branching, external nodes used for final key comparison, and to store data

Radix Trie Example • Keys: A 00001, S 10011, E 00101, R 10010, C 00011, H 01000 • [Figure: radix trie with the keys H, E, A, C, S, R stored at the leaves]

Radix Tries • Left subtree has all keys which have 0 for the leading bit, right subtree has all keys which have 1 for the leading bit • An insert or search requires O(log N) bit comparisons in the average case, and b bit comparisons in the worst case

Radix Tries • Problem: lots of extra nodes for keys that differ only in low order bits (See R and S nodes in example above) • This is addressed by Patricia trees, which allow “lookahead” to the next relevant bit • Practical Algorithm To Retrieve Information Coded In Alphanumeric (Patricia) • In the slides that follow the entire alphabet would be included in the indexes

Radix Search Tries • Benefit of simple Radix Search Tries • Fewer comparisons of entire key than DSTs • Drawbacks • The tree will have more overall nodes than a DST • Each external node with a key needs a unique bit-path to it • Internal and External nodes are of different types • Insert is somewhat more complicated • Some insert situations require new internal as well as external nodes to be created • We need to create new internal nodes to ensure that each object has a unique path to it • See example

Radix Search Tries • Run-time is similar to DST • Since tree is binary, average tree height for N keys is O(log_2 N) • However, paths for nodes with many bits in common will tend to be longer • Worst case path length is again b • However, now at worst b bit comparisons are required • We only need one comparison of the entire key • So, again, the benefit to RST is that the entire key must be compared only one time
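A radix search trie with keys only at the leaves might look like the sketch below. The class and function names (`Leaf`, `Internal`, `trie_insert`, `trie_search`) are hypothetical, and the code assumes distinct fixed-width integer keys read from the most significant bit. It illustrates the two points made above: insertion may have to create a chain of internal nodes, and search does one full key comparison at the end.

```python
class Leaf:
    """External node: stores a full key."""
    def __init__(self, key):
        self.key = key


class Internal:
    """Internal node: only branches; child[0] for bit 0, child[1] for bit 1."""
    def __init__(self):
        self.child = [None, None]


def bit(key, i, width):
    """Bit i of key, counting from the most significant of `width` bits."""
    return (key >> (width - 1 - i)) & 1


def _split(old_key, new_key, width, depth):
    """Build internal nodes until the two (distinct) keys part ways."""
    parent = Internal()
    b_old, b_new = bit(old_key, depth, width), bit(new_key, depth, width)
    if b_old == b_new:
        # Same bit again: one more one-way internal node is needed.
        parent.child[b_old] = _split(old_key, new_key, width, depth + 1)
    else:
        parent.child[b_old] = Leaf(old_key)
        parent.child[b_new] = Leaf(new_key)
    return parent


def trie_insert(node, key, width, depth=0):
    """Insert key below node and return the (possibly replaced) node."""
    if node is None:
        return Leaf(key)
    if isinstance(node, Leaf):
        if node.key == key:
            return node
        return _split(node.key, key, width, depth)
    b = bit(key, depth, width)
    node.child[b] = trie_insert(node.child[b], key, width, depth + 1)
    return node


def trie_search(node, key, width):
    """Follow bits through internal nodes, then compare the full key once."""
    depth = 0
    while isinstance(node, Internal):
        node = node.child[bit(key, depth, width)]
        depth += 1
    return node is not None and node.key == key
```

With the example keys, S (10011) and R (10010) differ only in their last bit, so both leaves end up at depth 5 behind a chain of one-way internal nodes. That is the extra-node cost that motivates the improvements discussed next.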

Improving Tries • How can we improve tries? • Can we reduce the heights somehow? • Average height now is O(log_2 N) • Can we simplify the data structures needed (so different node types are not required)? • Can we simplify the Insert? • We will examine a couple of variations that improve over the basic Trie

Bucket-Sort • Let S be a sequence of n (key, element) entries with keys in the range [0, N - 1] • Bucket-sort uses the keys as indices into an auxiliary array B of sequences (buckets) • Phase 1: Empty sequence S by moving each entry (k, o) into its bucket B[k] • Phase 2: For i = 0, …, N - 1, move the entries of bucket B[i] to the end of sequence S • Analysis: Phase 1 takes O(n) time; Phase 2 takes O(n + N) time; bucket-sort takes O(n + N) time
Algorithm bucketSort(S, N)
    Input: sequence S of (key, element) items with keys in the range [0, N - 1]
    Output: sequence S sorted by increasing keys
    B ← array of N empty sequences
    while ¬S.isEmpty()
        f ← S.first()
        (k, o) ← S.remove(f)
        B[k].insertLast((k, o))
    for i ← 0 to N - 1
        while ¬B[i].isEmpty()
            f ← B[i].first()
            (k, o) ← B[i].remove(f)
            S.insertLast((k, o))
Bucket-Sort and Radix-Sort
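The two phases translate almost directly into Python. The sketch below is only an illustration: the function name `bucket_sort` is hypothetical, and it returns a new list instead of emptying and refilling S in place as the pseudocode does.

```python
def bucket_sort(entries, N):
    """Sort (key, element) pairs whose keys are integers in [0, N - 1].

    Runs in O(n + N) time and is stable: entries with equal keys keep
    their original relative order.
    """
    # B: array of N empty sequences (buckets).
    buckets = [[] for _ in range(N)]

    # Phase 1: move each entry (k, o) into its bucket B[k].
    for key, element in entries:
        buckets[key].append((key, element))

    # Phase 2: for i = 0, ..., N - 1, append bucket B[i] to the output.
    result = []
    for bucket in buckets:
        result.extend(bucket)
    return result
```

Because each bucket is filled and emptied in arrival order, the three entries with key 7 in a sequence such as (7, d), (1, c), (3, a), (7, g), (3, b), (7, e) come out in the order d, g, e.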

Bucket Sort Each element of the array is put in one of the N “buckets”

Bucket Sort Now, pull the elements from the buckets into the array At last, the sorted array (sorted in a stable way):

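Example • Sorting a sequence of 4-bit integers with one stable pass per bit, least significant bit first • Input: 1001, 0010, 1101, 0001, 1110 • After bit 0: 0010, 1110, 1001, 1101, 0001 • After bit 1: 1001, 1101, 0001, 0010, 1110 • After bit 2: 1001, 0001, 0010, 1101, 1110 • After bit 3: 0001, 0010, 1001, 1101, 1110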

Example • Key range [0, 9] • Input S: (7, d), (1, c), (3, a), (7, g), (3, b), (7, e) • Phase 1 fills the buckets: B[1] = (1, c); B[3] = (3, a), (3, b); B[7] = (7, d), (7, g), (7, e); all other buckets of B[0..9] are empty (∅) • Phase 2 yields S: (1, c), (3, a), (3, b), (7, d), (7, g), (7, e)

Properties and Extensions • Key-type Property: the keys are used as indices into an array and cannot be arbitrary objects; there is no external comparator • Stable Sort Property: the relative order of any two items with the same key is preserved after the execution of the algorithm • Extensions: • Integer keys in the range [a, b]: put entry (k, o) into bucket B[k - a] • String keys from a set D of possible strings, where D has constant size (e.g., names of the 50 U.S. states): sort D and compute the rank r(k) of each string k of D in the sorted sequence, then put entry (k, o) into bucket B[r(k)]

Lexicographic Order • A d-tuple is a sequence of d keys (k_1, k_2, …, k_d), where key k_i is said to be the i-th dimension of the tuple • Example: • The Cartesian coordinates of a point in space are a 3-tuple • The lexicographic order of two d-tuples is recursively defined as follows: (x_1, x_2, …, x_d) < (y_1, y_2, …, y_d) ⇔ x_1 < y_1 ∨ (x_1 = y_1 ∧ (x_2, …, x_d) < (y_2, …, y_d)) • I.e., the tuples are compared by the first dimension, then by the second dimension, etc.
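The recursive definition can be written out as a short Python function. The name `lex_less` is hypothetical, and the sketch assumes both tuples have the same length d. Python's built-in tuple comparison already behaves this way, so the function is shown only to mirror the definition.

```python
def lex_less(x, y):
    """Return True if tuple x precedes tuple y in lexicographic order.

    Mirrors the recursive definition: compare the first dimension, and
    only if it ties, compare the remaining dimensions.
    Assumes len(x) == len(y).
    """
    if not x:                      # no dimensions left: the tuples are equal
        return False
    if x[0] != y[0]:               # first dimension decides
        return x[0] < y[0]
    return lex_less(x[1:], y[1:])  # tie: recurse on (x_2, ..., x_d)
```

For instance, `lex_less((2, 1, 4), (2, 4, 6))` is true because the first dimensions tie and 1 < 4 in the second.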

Lexicographic-Sort • Let C_i be the comparator that compares two tuples by their i-th dimension • Let stableSort(S, C) be a stable sorting algorithm that uses comparator C • Lexicographic-sort sorts a sequence of d-tuples in lexicographic order by executing d times algorithm stableSort, one per dimension • Lexicographic-sort runs in O(dT(n)) time, where T(n) is the running time of stableSort
Algorithm lexicographicSort(S)
    Input: sequence S of d-tuples
    Output: sequence S sorted in lexicographic order
    for i ← d downto 1
        stableSort(S, C_i)
Example:
    Input:                (7,4,6) (5,1,5) (2,4,6) (2,1,4) (3,2,4)
    After dimension 3:    (2,1,4) (3,2,4) (5,1,5) (7,4,6) (2,4,6)
    After dimension 2:    (2,1,4) (5,1,5) (3,2,4) (7,4,6) (2,4,6)
    After dimension 1:    (2,1,4) (2,4,6) (3,2,4) (5,1,5) (7,4,6)
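A compact way to try lexicographic-sort is to use Python's built-in `sorted`, which is documented to be stable, as the stableSort step. The function name `lexicographic_sort` is hypothetical; dimensions are 0-indexed here, so the loop runs from d - 1 down to 0.

```python
def lexicographic_sort(tuples, d):
    """Sort d-tuples lexicographically with d stable passes, last dimension first."""
    result = list(tuples)
    for i in range(d - 1, -1, -1):
        # sorted() is stable, so ties in dimension i keep the order
        # established by the passes on later dimensions.
        result = sorted(result, key=lambda t: t[i])
    return result
```

Running it on the five tuples of the example reproduces the final row of the trace above. Sorting the most significant dimension last works only because each pass is stable.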

Radix-Sort • Radix-sort is a specialization of lexicographic-sort that uses bucket-sort as the stable sorting algorithm in each dimension • Radix-sort is applicable to tuples where the keys in each dimension i are integers in the range [0, N - 1] • Radix-sort runs in time O(d(n + N))
Algorithm radixSort(S, N)
    Input: sequence S of d-tuples such that (0, …, 0) ≤ (x_1, …, x_d) and (x_1, …, x_d) ≤ (N - 1, …, N - 1) for each tuple (x_1, …, x_d) in S
    Output: sequence S sorted in lexicographic order
    for i ← d downto 1
        bucketSort(S, N), using dimension i as the key
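As a sketch of radix-sort on tuples, the code below runs one bucket-sort pass per dimension. The helper `bucket_sort_by` and its `key` parameter are assumptions made for this illustration (the pseudocode leaves the choice of dimension implicit), and dimensions are 0-indexed.

```python
def bucket_sort_by(items, N, key):
    """Stable bucket sort of items whose key(item) is an integer in [0, N - 1]."""
    buckets = [[] for _ in range(N)]
    for item in items:
        buckets[key(item)].append(item)
    return [item for bucket in buckets for item in bucket]


def radix_sort(tuples, d, N):
    """Sort d-tuples with entries in [0, N - 1] in O(d(n + N)) time."""
    result = list(tuples)
    for i in range(d - 1, -1, -1):          # last dimension first
        result = bucket_sort_by(result, N, key=lambda t, i=i: t[i])
    return result
```

Each pass costs O(n + N), and there are d passes, which gives the O(d(n + N)) bound stated above.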

Radix-Sort for Binary Numbers • Consider a sequence of n b-bit integers x = x_{b-1} … x_1 x_0 • We represent each element as a b-tuple of integers in the range [0, 1] and apply radix-sort with N = 2 • This application of the radix-sort algorithm runs in O(bn) time • For example, we can sort a sequence of 32-bit integers in linear time
Algorithm binaryRadixSort(S)
    Input: sequence S of b-bit integers
    Output: sequence S sorted
    replace each element x of S with the item (0, x)
    for i ← 0 to b - 1
        replace the key k of each item (k, x) of S with bit x_i of x
        bucketSort(S, 2)
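The binary case needs only two buckets per pass, so it can be sketched very compactly. The function name `binary_radix_sort` is hypothetical, and the two list comprehensions stand in for the `bucketSort(S, 2)` call; inputs are assumed to be non-negative integers below 2^b.

```python
def binary_radix_sort(values, b):
    """Sort non-negative integers of at most b bits in O(b * n) time."""
    result = list(values)
    for i in range(b):                       # bit 0 is the least significant
        # Two buckets: elements whose bit i is 0, then those whose bit i is 1.
        # Each bucket preserves arrival order, so the pass is stable.
        zeros = [x for x in result if not (x >> i) & 1]
        ones = [x for x in result if (x >> i) & 1]
        result = zeros + ones
    return result
```

For example, `binary_radix_sort([0b1001, 0b0010, 0b1101, 0b0001, 0b1110], 4)` returns the values in increasing order after four passes.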

Does it Work for Real Numbers? • What if keys are not integers? • Assumption: input is n reals from [0, 1) • Basic idea: • Create N linked lists (buckets) to divide interval [0, 1) into subintervals of size 1/N • Add each input element to the appropriate bucket and sort each bucket with insertion sort • Uniform input distribution ⇒ O(1) expected bucket size (taking N proportional to n) • Therefore the expected total time is O(n) • Distribution of keys in buckets similar with …. ?
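The idea for real keys can be sketched as follows. The names `insertion_sort` and `bucket_sort_reals` are hypothetical, Python lists stand in for the linked lists, and the default of one bucket per input element is an assumption made here so that buckets stay small on uniformly distributed input.

```python
def insertion_sort(a):
    """Sort list a in place; efficient when a is short."""
    for j in range(1, len(a)):
        x = a[j]
        i = j - 1
        while i >= 0 and a[i] > x:
            a[i + 1] = a[i]   # shift larger elements one slot right
            i -= 1
        a[i + 1] = x


def bucket_sort_reals(xs, N=None):
    """Sort reals from [0, 1) using N equal-width buckets."""
    if N is None:
        N = max(len(xs), 1)                 # assumed default: N = n
    buckets = [[] for _ in range(N)]
    for x in xs:
        # Bucket i covers [i/N, (i+1)/N); min() guards against rounding.
        buckets[min(int(x * N), N - 1)].append(x)
    result = []
    for bucket in buckets:
        insertion_sort(bucket)
        result.extend(bucket)
    return result
```

If the input is heavily skewed, most elements land in a few buckets and the insertion sorts dominate, so the linear expected time depends on the uniformity assumption.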

Radix Sort • What sort will we use to sort on digits? • Bucket sort is a good choice: • Sort n numbers on digits that range over k values (k plays the role of N in the bucket-sort slides) • Time: O(n + k) • Each pass over n numbers takes time O(n + k); with d digits there are d passes, so total time O(dn + dk) • When d is constant and k = O(n), takes O(n) time

Radix Sort Example • Problem: sort 1 million 64-bit numbers • Treat as four-digit radix 2^16 numbers • Can sort in just four passes with radix sort! • Running time: 4(1 million + 2^16) ≈ 4 million operations • Compare with typical O(n lg n) comparison sort • Requires approx lg n = 20 operations per number being sorted • Total running time ≈ 20 million operations
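The arithmetic behind this comparison can be checked with a few lines of Python. The function names are hypothetical, and the "operation counts" are the same rough estimates used on the slide (constant factors ignored), not measured running times.

```python
import math


def radix_sort_ops(n, key_bits, digit_bits):
    """Estimated operations: one O(n + 2^digit_bits) pass per digit."""
    passes = math.ceil(key_bits / digit_bits)
    return passes * (n + 2 ** digit_bits)


def comparison_sort_ops(n):
    """Estimated operations for an O(n lg n) comparison sort."""
    return n * math.ceil(math.log2(n))


n = 1_000_000
radix_estimate = radix_sort_ops(n, key_bits=64, digit_bits=16)
comparison_estimate = comparison_sort_ops(n)
```

Under these assumptions the radix estimate is 4 × (1,000,000 + 65,536) = 4,262,144 and the comparison estimate is 20,000,000, which matches the roughly 4 million versus 20 million figures above.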

Radix Sort • In general, radix sort based on bucket sort is • Asymptotically fast (i.e., O(n) when d is constant and k = O(n)) • Simple to code • A good choice • Can radix sort be used on floating-point numbers?

Summary: Radix Sort • Radix sort: • Assumption: input has d digits ranging from 0 to k • Basic idea: • Sort elements by digit starting with least significant • Use a stable sort (like bucket sort) for each stage • Each pass over n numbers with 1 digit takes time O(n+k), so total time O(dn+dk) • When d is constant and k=O(n), takes O(n) time • Fast, Stable, Simple • Doesn’t sort in place

Multiway Tries • RST that we have seen considers the key 1 bit at a time • This causes a maximum height in the tree of up to b, and gives an average height of O(log_2 N) for N keys • If we considered m bits at a time, then we could reduce the worst and average heights • Maximum height is now b/m since m bits are consumed at each level • Let M = 2^m • Average height for N keys is now O(log_M N), since we branch in M directions at each node

Multiway Tries • Let's look at an example • Consider 2^20 (1 meg) keys of length 32 bits • Simple RST will have • Worst Case height = 32 • Ave Case height = O(log_2 2^20) ≈ 20 • Multiway trie using 8 bits would have • Worst Case height = 32/8 = 4 • Ave Case height = O(log_256 2^20) ≈ 2.5 • This is a considerable improvement • Let's look at an example using character data • We will consider a single character (8 bits) at each level • Go over on board
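The height figures in this example follow from two formulas: the worst case is b/m levels, and the average-case estimate is log_M N = (log_2 N)/m. A small sketch, with the hypothetical function name `trie_heights` and constant factors inside the O(·) ignored:

```python
import math


def trie_heights(num_keys, key_bits, bits_per_level):
    """Return (worst-case height, average-height estimate) for a trie.

    worst   = ceil(b / m)        since m bits are consumed per level
    average = log_M N = log_2 N / m,   with M = 2^m
    """
    worst = math.ceil(key_bits / bits_per_level)
    average = math.log2(num_keys) / bits_per_level
    return worst, average


binary_heights = trie_heights(2 ** 20, 32, 1)    # simple RST
multiway_heights = trie_heights(2 ** 20, 32, 8)  # 8 bits per level
```

With 2^20 keys of 32 bits this gives (32, 20.0) for the binary trie and (4, 2.5) for the 8-bit multiway trie, the same values as on the slide.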

Multiway Tries • So what is the catch (or cost)? • Memory • Multiway Tries use considerably more memory than simple tries • Each node in the multiway trie contains M pointers/references • In example with ASCII characters, M = 256 • Many of these are unused, especially • During common paths (prefixes), where there is no branching (or "one-way" branching) • Ex: through and throughout • At the lower levels of the tree, where previous branching has likely separated keys already
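To make the memory cost concrete, here is a minimal multiway trie with M = 256 child slots per node, plus a helper that counts how many slots are actually used. All names (`MNode`, `mt_insert`, `mt_search`, `pointer_usage`) are hypothetical, keys are assumed to be ASCII strings, and the `is_key` flag is an assumption of this sketch for marking where a stored key ends.

```python
M = 256  # one child slot per possible byte value


class MNode:
    def __init__(self):
        self.children = [None] * M   # M pointers, most of them unused
        self.is_key = False          # True if a stored key ends at this node


def mt_insert(root, word):
    node = root
    for byte in word.encode("ascii"):
        if node.children[byte] is None:
            node.children[byte] = MNode()
        node = node.children[byte]
    node.is_key = True


def mt_search(root, word):
    node = root
    for byte in word.encode("ascii"):
        node = node.children[byte]
        if node is None:
            return False
    return node.is_key


def pointer_usage(root):
    """Return (number of nodes, number of non-empty child pointers)."""
    nodes = used = 0
    stack = [root]
    while stack:
        node = stack.pop()
        nodes += 1
        for child in node.children:
            if child is not None:
                used += 1
                stack.append(child)
    return nodes, used
```

After inserting "through" and "throughout", the trie has 11 nodes holding 11 × 256 = 2816 pointer slots, of which only 10 are in use, because the whole structure is a single one-way path.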

Patricia Trees • Idea: • Save memory and height by eliminating all nodes in which no branching occurs • See example on board • Note now that since some nodes are missing, level i does not necessarily correspond to bit (or character) i • So to do a search we need to store in each node which bit (character) the node corresponds to • However, the savings from the removed nodes is still considerable

Patricia Trees • Also, keep in mind that a key can match at every character that is checked, but still not be actually in the tree • Example for tree on board: • If we search for TWEEDLE, we will only compare the T**E**E • However, the next node after the E is at index 8. This is past the end of TWEEDLE so it is not found • Run-time? • Similar to those of RST and multiway trie, depending on how many bits are used per node

Patricia Trees • So Patricia trees • Reduce tree height by removing "one-way" branching nodes • Text also shows how "upwards" links enable us to use only one node type • TEXT VERSION makes the nodes homogeneous by storing keys within the nodes and using "upwards" links from the leaves to access the nodes • So every node contains a valid key. However, the keys are not checked on the way "down" the tree – only after an upwards link is followed • Thus Patricia saves memory but makes the insert rather tricky, since new nodes may have to be inserted between other nodes • See text

PATRICIA TREE • A particular type of “trie” • Example, trie and PATRICIA TREE with content ‘010’, ‘011’, and ‘101’.

PATRICIA TREE • Therefore, a PATRICIA TREE has the following attributes in its internal nodes: • Index bit (check bit) • Child pointers (each node must have exactly 2 children) • On the other hand, leaf nodes must store the actual content for the final comparison
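The node layout described here (an index bit plus exactly two children in internal nodes, full keys in the leaves) can be sketched in Python. This is an illustrative variant, not the homogeneous-node version with upward links from the text: the names `PLeaf`, `PNode`, `pat_insert`, and `pat_search` are hypothetical, and keys are assumed to be bit strings of equal length such as '010'.

```python
class PLeaf:
    def __init__(self, key):
        self.key = key               # full key, used for the final comparison


class PNode:
    def __init__(self, index):
        self.index = index           # which bit this node tests
        self.child = [None, None]    # always filled: exactly 2 children


def bit(key, i):
    return int(key[i])


def pat_search(root, key):
    node = root
    while isinstance(node, PNode):   # only the indexed bits are inspected
        node = node.child[bit(key, node.index)]
    return node is not None and node.key == key


def pat_insert(root, key):
    """Insert key and return the (possibly new) root."""
    if root is None:
        return PLeaf(key)
    # Pass 1: find the leaf that a search for key would reach.
    node = root
    while isinstance(node, PNode):
        node = node.child[bit(key, node.index)]
    other = node.key
    if other == key:
        return root
    # First bit position where the new key differs from that leaf's key.
    d = next(i for i in range(len(key)) if key[i] != other[i])
    # Pass 2: descend again, stopping before the first node testing a bit > d.
    parent, node = None, root
    while isinstance(node, PNode) and node.index < d:
        parent, node = node, node.child[bit(key, node.index)]
    branch = PNode(d)
    branch.child[bit(key, d)] = PLeaf(key)
    branch.child[1 - bit(key, d)] = node
    if parent is None:
        return branch
    parent.child[bit(key, parent.index)] = branch
    return root
```

With the contents '010', '011', and '101', the root tests bit 0, its 0-side child tests bit 2 (bit 1 is skipped because no branching occurs there), and '101' sits in a leaf directly under the root. A search for '111' reaches the leaf '101' after checking only bit 0, and it is rejected only by the final full-key comparison.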

SISTRING • Sistring is short for ‘Semi-Infinite String’ • Any string, whatever it actually represents, is a binary bit pattern (e.g. 11001) • One of the sistrings in the above example is 11001000… • There are 5 sistrings in total in this example

SISTRING • Sistrings are theoretically of infinite length • 110010000… • 10010000… • 0010000… • 010000… • 10000… • In practice, we cannot store infinitely long strings. For the above example, we only need to store each sistring up to 5 bits long; that is enough to distinguish them from one another.

SISTRING • The bit level is too abstract; depending on the application, we rarely apply sistrings at the bit level. The character level is a better idea! • e.g. CUHK • The corresponding sistrings would be • CUHK000… • UHK000… • HK000… • K000… • We require each to be at least 4 characters long. • (Why do we pad 0/NULL at the end of a sistring?)

SISTRING (USAGE) • SISTRINGs are efficient in storing substring information. • A string with n characters has n(n+1)/2 sub-strings. Since the longest one has size n, the storage requirement for the sub-strings would be O(n^3) • e.g. ‘CUHK’ is 4 characters long and consists of 4(5)/2 = 10 different sub-strings: C, U, …, CU, UH, …, CUH, UHK, CUHK. • Storage requirement is O(n^2) · max(length) -> O(n^3)

SISTRING (USAGE) • We may instead store the sistrings of ‘CUHK’, which requires O(n^2) storage. • CUHK <- represents C, CU, CUH, CUHK at the same time • UHK0 <- represents U, UH, UHK at the same time • HK00 <- represents H, HK at the same time • K000 <- represents K only • Prefix matching on sistrings is equivalent to exact matching on the sub-strings. • In conclusion, sistrings are a better representation for storing sub-string information.
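The relationship between sistrings and substrings can be tried out with a short sketch. The function names are hypothetical; sistrings are truncated to the length of the text, and the NUL character is used as the default padding on the assumption that it never occurs in the text or in a search pattern.

```python
def sistrings(text, pad="\0"):
    """Return the sistrings of text, each truncated/padded to len(text)."""
    return [text[i:] + pad * i for i in range(len(text))]


def contains_substring(text, pattern):
    """Exact substring test done as prefix matching on the sistrings."""
    return any(s.startswith(pattern) for s in sistrings(text))


def substrings_via_prefixes(text):
    """Enumerate all substrings as the unpadded prefixes of each sistring."""
    n = len(text)
    found = set()
    for i, s in enumerate(sistrings(text)):
        for length in range(1, n - i + 1):   # stop before the padding
            found.add(s[:length])
    return found
```

Calling `sistrings("CUHK", pad="0")` gives CUHK, UHK0, HK00, K000, and the same function applied to the bit string "11001" gives the five 5-bit sistrings from the earlier slide. The prefixes of the four CUHK sistrings cover all 10 substrings, while only n strings of length n are stored.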

PAT Tree • Now it is time for the PAT Tree again • A PAT Tree is a PATRICIA TREE that stores every sistring of a document • What if the document contains simply ‘CUHK’? • We prefer characters at this point, but PATRICIA works on bits; therefore, we have to know the bit pattern of each sistring in order to know the actual shape of the resulting PAT tree • It looks frustrating even for a small example, but it is how the PAT tree works!

PAT Tree (Example) • By digitizing the string, we can manually visualize what the PAT Tree looks like. • Following is the actual bit pattern of the four sistrings • Once we understand how the PAT tree works, we won’t detail it in later examples.

PAT Tree • We don’t view a document as a packed string of characters. A document consists of words, e.g. “Hello. This is a simple document.” • In this case, sistrings can be applied at the ‘document level’: the document is treated as one big string, and we may tokenize it word by word instead of character by character.
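Word-level sistrings can be sketched as below. The function name `word_sistrings` is hypothetical, and splitting on runs of word characters with a regular expression is an assumed tokenization; the slides do not specify how punctuation or case should be handled.

```python
import re


def word_sistrings(document):
    """Return one sistring per word: the text from that word to the end."""
    words = re.findall(r"\w+", document)   # assumed tokenizer: drop punctuation
    return [" ".join(words[i:]) for i in range(len(words))]


example = word_sistrings("Hello. This is a simple document.")
```

For the example sentence this produces six sistrings, one starting at each word, instead of one per character. These are the strings a word-level PAT tree would index.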
