
Presentation Transcript


  1. Tonga Institute of Higher Education Design and Analysis of Algorithms IT 254 Lecture 4: Data Structures

  2. Data Structures • Data structures are important for many reasons in computer science. The way data is stored can determine how fast you find it and what you can do with it • The data structures we will focus on are hash tables, binary search trees and Red-Black trees • These data structures have a special name: • Dynamic Sets • They all support the following operations: • Search(S, k), FindMin(S), FindMax(S), Insert(S, x), Remove(S, x)

  3. Binary Search Trees • Binary Search Trees (BSTs) are an important data structure in dynamic sets • Each piece of data in the tree will contain: • key: an identifying field that allows ordering • left: pointer to a left child (may be NULL) • right: pointer to a right child (may be NULL) • p: pointer to a parent node (NULL for root)
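
As a concrete reference, here is one way these fields might look in C (a minimal sketch; the typedef name, int keys, and spelling the "p" field out as "parent" are assumptions):

    typedef struct Node {
        int key;              /* identifying field that allows ordering */
        struct Node *left;    /* left child (may be NULL) */
        struct Node *right;   /* right child (may be NULL) */
        struct Node *parent;  /* parent node (NULL for the root) */
    } Node;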

  4. Binary Search Trees • BST property: key[left(x)] ≤ key[x] ≤ key[right(x)] • Example: [tree diagram: root 15 with children 11 and 16; 11 has children 9 and 13; 16 has right child 17]

  5. Inorder Tree Walk • What does the following code do?

    TreeWalk(x)
        if (x != NULL) {
            TreeWalk(left[x]);
            print(x);
            TreeWalk(right[x]);
        }

• Answer: prints the elements in sorted (increasing) order • This is called an inorder tree walk • Preorder tree walk: print root, then left, then right • Postorder tree walk: print left, then right, then root
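
A runnable C version of the walk, a minimal sketch assuming the Node struct sketched after slide 3; the NULL check is the base case that stops the recursion:

    #include <stdio.h>

    void tree_walk(const Node *x) {
        if (x == NULL)                /* base case: empty subtree */
            return;
        tree_walk(x->left);           /* 1. everything smaller */
        printf("%d\n", x->key);       /* 2. the node itself */
        tree_walk(x->right);          /* 3. everything larger */
    }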

  6. Searching a BST • We can search for an item in a tree by using a key and a pointer to a node • The search will return an element with that key, or NULL if not found:

    TreeSearch(x, k)
        if (x == NULL or k == key[x])
            return x;
        if (k < key[x])
            return TreeSearch(left[x], k);
        else
            return TreeSearch(right[x], k);

  7. More BST Searching • Here is another function that does the same:

    TreeSearch(x, k)
        while (x != NULL and k != key[x])
            if (k < key[x])
                x = left[x];
            else
                x = right[x];
        return x;

• Which of these two functions is more efficient?
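
A C translation of the iterative version, again assuming the Node struct sketched after slide 3 (the loop avoids the recursive version's function-call overhead, which is usually why it is preferred):

    /* Returns the node with key k, or NULL if k is not in the tree. */
    Node *tree_search(Node *x, int k) {
        while (x != NULL && k != x->key) {
            if (k < x->key)
                x = x->left;          /* k can only be in the left subtree */
            else
                x = x->right;         /* k can only be in the right subtree */
        }
        return x;
    }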

  8. BST Insert() • To insert an element into the tree, we just need to make sure that the tree still satisfies the BST property after inserting • How can we insert and maintain the property? • Use the search algorithm from the previous slides • Because we will not find the item, we will reach a NULL. Insert the item where the NULL is • Use a "trailing pointer" to keep track of where you came from

  9. BST Insert

    0: TreeInsert(Tree, z)
    1:     y = NULL
    2:     x = root[Tree]
    3:     while (x != NULL) {
    4:         y = x
    5:         if z->key < x->key
    6:             x = x->left
    7:         else
    8:             x = x->right
    9:     }
    10:    z->parent = y
    11:    if y == NULL
    12:        root[Tree] = z
    13:    else if z->key < y->key
    14:        y->left = z
    15:    else
    16:        y->right = z

• Like searching, TreeInsert begins at the root and moves downward • The pointer x traces the path and the pointer y is the parent of x • Lines 3-9 move x and y down the tree using the BST property until x is NULL • That NULL is the place where we want to put the item z • Lines 10-16 set the pointers for item z
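
The same algorithm in C, a sketch assuming the Node struct from slide 3 plus a Tree struct holding the root pointer (the slides write root[Tree]; the Tree struct is an assumption):

    #include <stddef.h>   /* NULL */

    typedef struct Tree { Node *root; } Tree;   /* holds root[Tree] */

    void tree_insert(Tree *t, Node *z) {
        Node *y = NULL;              /* trailing pointer: will be z's parent */
        Node *x = t->root;
        while (x != NULL) {          /* lines 3-9: walk down to a NULL spot */
            y = x;
            if (z->key < x->key)
                x = x->left;
            else
                x = x->right;
        }
        z->parent = y;               /* lines 10-16: hook z onto y */
        if (y == NULL)
            t->root = z;             /* the tree was empty */
        else if (z->key < y->key)
            y->left = z;
        else
            y->right = z;
    }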

  10. Running Time of Search/Insert • What is the running time of TreeSearch() or TreeInsert()? • Answer: O(h), where h = height of the tree • What is the height of a binary search tree? • Answer: worst case h = O(n), when the tree is just a linear chain of left or right children • Most of the time, though, 2^h ≈ # of items, so if there are 64 items, 2^h = 64 and h = 6 • For now, we'll just state running times in terms of h • Later we'll see how to make sure h is always O(lg n)

  11. Sorting with BSTs • An algorithm for sorting an array A of length n:

    BSTSort(A)
        for i = 1 to n
            TreeInsert(A[i]);
        InorderTreeWalk(root);

• What is the running time of this? • Worst case? • Average case?
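
A sketch of BSTSort in C, reusing the hypothetical tree_insert and tree_walk helpers from the earlier sketches:

    #include <stdlib.h>   /* calloc */

    void bst_sort(Tree *t, const int *A, int n) {
        for (int i = 0; i < n; i++) {
            Node *z = calloc(1, sizeof *z);   /* left/right/parent start NULL */
            z->key = A[i];
            tree_insert(t, z);                /* O(h) per insert */
        }
        tree_walk(t->root);                   /* prints the keys in sorted order */
    }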

  12. Sorting with BSTs • Average case analysis • It's a form of quicksort!

    for i = 1 to n
        TreeInsert(A[i]);
    InorderTreeWalk(root);

[diagram: inserting 3 1 8 2 6 7 5 — 3 becomes the root and splits off {1, 2} from {8, 6, 7, 5}; 1 then splits off {2}; 8 splits off {6, 7, 5}; 6 splits {5} from {7}, mirroring quicksort's partitions]

  13. Sorting with BSTs • The same partitions are made as in quicksort, but in a different order • In the previous example: • Everything was compared to 3 once • Then the items < 3 were compared to 1 once • Etc. • These are basically the same comparisons that quicksort makes, just in a different order • Since it uses the same idea as quicksort, BSTSort is also O(n lg n) on average

  14. Deletion in BST • Deletion is a bit tricky • 3 cases: • x has no children: • Remove x • x has one child: • Splice out x, move its child into x's spot • x has two children: • Swap x with its successor • Perform case 1 or 2 to delete it [diagram: example tree with root F, children B and H; B has children A and D, D has child C; H has right child K. Example: delete K (no children), H (one child), or B (two children)]
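
A C sketch of the three cases, using the Node and Tree structs from the earlier sketches. For the two-children case it copies the successor's key into x and then deletes the successor, which has at most one child; real implementations (e.g. CLRS) relink nodes instead of copying keys, so the key-copying here is a simplification:

    #include <stdlib.h>   /* free */

    static Node *tree_min(Node *x) {          /* leftmost node = minimum */
        while (x->left != NULL)
            x = x->left;
        return x;
    }

    void tree_delete(Tree *t, Node *z) {
        if (z->left != NULL && z->right != NULL) {
            /* Case 3: two children. The successor is the minimum of the
               right subtree and has no left child, so after swapping keys
               we fall through to case 1 or 2 on the successor. */
            Node *s = tree_min(z->right);
            z->key = s->key;
            z = s;
        }
        /* Cases 1 and 2: z now has at most one child. */
        Node *child = (z->left != NULL) ? z->left : z->right;
        if (child != NULL)
            child->parent = z->parent;        /* move child into z's spot */
        if (z->parent == NULL)
            t->root = child;                  /* z was the root */
        else if (z == z->parent->left)
            z->parent->left = child;
        else
            z->parent->right = child;
        free(z);
    }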

  15. Red-Black Trees • Red-Black trees • A special form of binary search tree • The difference is that an RB tree has properties which guarantee that the tree is always "balanced" • Balanced means that all operations happen in O(lg n) time in the worst case • Nodes in the tree are given different colors • By using the colors to organize the tree, we can make sure the height of the tree is always O(lg n)

  16. Red-Black Trees • Our goals for Red-Black Trees • First: describe the properties of red-black trees • Then: prove that the properties guarantee that h = O(lg n) • Finally: describe operations that are done with red-black trees

  17. Red-Black Properties 1. Every node is either red or black 2. Every leaf (NULL pointer) is black • Note: this means every "real" node has 2 children 3. If a node is red, both children are black • Note: you can't have 2 consecutive red nodes on a path 4. Every path from a node to a descendant leaf contains the same number of black nodes 5. The root is always black • The black-height bh(x): the number of black nodes on a path from x down to a leaf • Now every node has 5 fields: (color, key, left, right and parent)
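
One way to represent these five fields in C (the enum and the names are assumptions):

    typedef enum { RED, BLACK } Color;

    typedef struct RBNode {
        Color color;                          /* property 1: red or black */
        int key;
        struct RBNode *left, *right, *parent;
    } RBNode;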

  18. Height of Red-Black Trees • Theorem: A red-black tree with n nodes has height at most 2 lg(n + 1) • This is a good thing to know. It says that our operations never have to go farther down the tree than 2 lg(n + 1) levels, which is O(lg n) • We will prove this is true by using induction.

  19. Proving Red-Black Height • First we will show that the subtree rooted at any node x contains at least 2^bh(x) − 1 internal nodes • To prove this we will use induction: • Base case: If the height of x is 0, then x is a leaf, and the subtree below it contains 2^0 − 1 = 0 internal nodes, which is true (there are no nodes below x if its height is 0)

  20. Proving Red-Black Height • Inductive case: • Suppose there is a node x that has: • A positive height • 2 children • Each child has a black-height of either bh(x) or bh(x) − 1, depending on whether it is red or black • The height of a child of x must be less than the height of x itself, so by the inductive hypothesis: • Each child's subtree has at least 2^(bh(x)−1) − 1 nodes • So both children plus x give at least • (2^(bh(x)−1) − 1) + (2^(bh(x)−1) − 1) + 1 = 2^bh(x) − 1 • This proves the claim about how many nodes there are

  21. Proving Red-Black Height • Now we are ready to prove that the height is at most 2 lg(n + 1) • Property 3 says at least half the nodes on any root-to-leaf path must be black, so the black-height of the root is at least h/2 • Plugging h/2 in for bh(x) in the claim from the last slide: n ≥ 2^(h/2) − 1 • Rearranging: • n + 1 ≥ 2^(h/2) • lg(n + 1) ≥ h/2 • 2 lg(n + 1) ≥ h • This is what we wanted to prove: the height is always O(lg n) • All our operations (search, insert, delete, …) are now guaranteed to be O(lg n) as well

  22. Red-Black Trees Example • Example red-black tree [diagram: the tree with root 7, children 5 and 9, and 12 as the right child of 9, shown with two colorings] • Is it colored correctly? Red-black properties: 1. Every node is either red or black 2. Every leaf (NULL pointer) is black 3. If a node is red, both children are black 4. Every path from a node to a descendant leaf contains the same number of black nodes 5. The root is always black

  23. Inserting in Red-Black Trees • Insert 8 • Where does it go? • What color is it? [diagram: the tree from slide 22] 1. Every node is either red or black 2. Every leaf (NULL pointer) is black 3. If a node is red, both children are black 4. Every path from a node to a descendant leaf contains the same number of black nodes 5. The root is always black

  24. Inserting in Red-Black Trees • Insert 8 [diagram: 8 inserted as the left child of 9]

  25. Insert in Red-Black Trees • How about inserting 11? • What color? • It can't be black! • It can't be red! [diagram: 11 inserted as the left child of 12] 1. Every node is either red or black 2. Every leaf (NULL pointer) is black 3. If a node is red, both children are black 4. Every path from a node to a descendant leaf contains the same number of black nodes 5. The root is always black

  26. Insert in Red-Black Trees • Insert 11 • Where does it go? • What color? • Solution: Change the colors in the tree [diagram: 11 as the left child of 12, with the colors adjusted] 1. Every node is either red or black 2. Every leaf (NULL pointer) is black 3. If a node is red, both children are black 4. Every path from a node to a descendant leaf contains the same number of black nodes 5. The root is always black

  27. Inserting Problems • Insert 10 • Where does it go? • What color? [diagram: 10 inserted as the left child of 11] 1. Every node is either red or black 2. Every leaf (NULL pointer) is black 3. If a node is red, both children are black 4. Every path from a node to a descendant leaf contains the same number of black nodes 5. The root is always black

  28. Insertion Problems • Insert 10 • Where does it go? • What color? • Answer: no color! • The tree is too imbalanced • We must change the tree structure to allow re-coloring • Goal: change the tree in O(lg n) time so that all properties hold [diagram: the imbalanced tree 7, 5, 9, 8, 12, 11, 10]

  29. Red-Black Rotation • Our basic operation for changing the tree is called rotation: • Trees are rotated by changing the pointers in the tree [diagram: rightRotate(y) turns a node y with left child x into x with right child y; leftRotate(x) is the inverse; the subtrees A, B, C keep their left-to-right order]

  30. Rotation • What does rightRotate() do? • Answer: It changes a lot of pointers • x keeps its left child • y keeps its right child • x's right child becomes y's left child • x's and y's parents change • What is the running time? O(1) [diagram: the same rotation picture as slide 29]

  31. Rotation Example • Goal: rotate left about 9 [diagram: tree with root 7, children 5 and 9; 9 has children 8 and 12; 12 has left child 11]

  32. Rotation Example • Rotate left around 9 [diagram: after the rotation, 12 takes 9's place as 7's right child; 9 is 12's left child with children 8 and 11]

  33. Rotation Code • Left rotate example [diagram: leftRotate(x) turns x with right child y into y with left child x; the subtrees A, B, C keep their order]

    LeftRotate(Tree, x)
        y = x->right                   // we only know x at the start, so get y
        x->right = y->left             // start the rotation: B becomes x's right subtree
        if y->left != NULL             // if B is not NULL ...
            y->left->parent = x        // ... switch B's parent to be x
        y->parent = x->parent          // y takes over x's parent
        if x->parent == NULL           // if x was the root node ...
            root[Tree] = y             // ... make y the new root
        else if x == x->parent->left   // if x was on the left side ...
            x->parent->left = y        // ... put y on the left side of the parent
        else
            x->parent->right = y       // else put y on the right side of the parent
        y->left = x                    // now x is the child of y
        x->parent = y                  // and x's parent is y
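
RightRotate is the mirror image: every left and right is swapped. A C sketch, assuming the Node and Tree structs from the earlier sketches:

    void right_rotate(Tree *t, Node *y) {
        Node *x = y->left;                 /* x moves up, y moves down */
        y->left = x->right;                /* subtree B becomes y's left child */
        if (x->right != NULL)
            x->right->parent = y;
        x->parent = y->parent;             /* x takes over y's old position */
        if (y->parent == NULL)
            t->root = x;                   /* y was the root */
        else if (y == y->parent->left)
            y->parent->left = x;
        else
            y->parent->right = x;
        x->right = y;                      /* y becomes x's right child */
        y->parent = x;
    }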

  34. Red-Black Insertion • The idea of insertion is to: • Insert x into the tree, color x red • The only red-black property that might be violated is property 3: x's parent might also be red • If so, move the violation up the tree (with re-coloring and rotations) until a place is found where it can be fixed • Total time will be O(lg n) • Actually writing the code for this is complicated, because there are many cases we must be careful of, depending on the colors of nearby nodes (for example, whether x's uncle is red or black) • Deletion is also O(lg n), and is also quite complicated

  35. Hashing and Hash Tables • Sometimes, people just need a few operations: Insert, Search, Delete. A hash table is a good solution, because searching is usually O(1) • At its most basic level, a hash table is a data structure like an array. Data is stored in the array at specific indexes. The indexes are chosen by a hash function • A hash function is a mapping between the input set of data and a set of integers • This allows for getting the data in O(1) time • You just take the thing you are looking for, apply the hash function, and see if it is at the right place in the array

  36. Hashing and Hash Tables • Definition: a hash function takes a value and converts it to an integer, so that the value can be put in an array • With hash tables, there is always the chance that two elements will hash to the same integer value • This is called a collision, and special things have to be done to make sure that no data is lost or overwritten

  37. What can we do with hash tables? • One idea: A dictionary • Goal: We want to be able to lookup any word in a dictionary very fast. • Problem: If the words are in an array, we must go through each word before we find the correct one. We cannot use trees because words cannot be sorted very well (most of the time) • Solution: We can use a hash table to map words to integers and store them at a place in an array. • We choose a hash function that changes letters to numbers. • Then when we need to look something up, we use the hash function, check that place in the array and see if it exists.

  38. Hash Tables • To use mathematical language: • Given a table T and a record x, with key and any other data, we need to support: • Insert (T, x) • Delete (T, x) • Search(T, x) • We want these to be fast, but don’t care about sorting the records • The structure we can use is a hash table • Supports all the above in O(1) expected time!

  39. A Simple Hash: Direct Addressing • Suppose: • The range of keys is 0..m • And all keys are different • The idea: • Set up an array T[0..m] in which • T[i] = x if x is in the set and key[x] = i • T[i] = NULL otherwise • This is called a direct-address table • Operations take O(1) time! • So what's the problem?
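
A minimal direct-address table in C, where the key itself is the array index, so every operation is a single array access (M, the names, and the string payload are assumptions):

    #include <stddef.h>   /* NULL */

    #define M 100         /* key range 0..M-1 */

    static const char *table[M];          /* NULL = empty slot */

    void da_insert(int key, const char *value) { table[key] = value; }
    void da_delete(int key)                    { table[key] = NULL; }
    const char *da_search(int key)             { return table[key]; }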

  40. Direct Addressing Problems • Direct addressing works well when the range m of keys is small • But what if the keys are 32-bit integers? • Problem 1: the direct-address table will have 2^32 entries, more than 4 billion • Problem 2: even if memory is not an issue, the time to initialize all the elements to NULL may be really long • Solution: map the keys to a smaller range 0..n where n < m • This mapping is called a hash function

  41. Hash Tables • Now the only problem is: what happens if two keys go to the same place in the array (called a collision)? • We can fix collisions with policies. Examples: • Chaining • Open Addressing [diagram: a universe U of keys, with the actual keys K = {k1, …, k5} hashed into slots 0..m − 1 of a table T; h(k2) = h(k5) shows a collision]

  42. Open Addressing • Basic idea of open addressing to resolve collisions: • On an insert: if the index is full, try another index, …, until an open index is found (called probing) • To search, follow the same sequence of probes as would be used when inserting the element • If you reach the element with the correct key, return it • If you reach a NULL pointer, the element is not in the table • Good for fixed sets • Good for sets where there is no deleting, only adding • Deleting creates NULL holes in the probe sequences, and searching might then fail to find elements that are actually in the table • Tables don't need to be much bigger than n
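
A minimal open-addressing sketch in C using linear probing (try the next slot on each collision); insert and search only, since, as noted above, deletion is the weak point. The names, the prime table size, and the convention that keys are positive with 0 marking an empty slot are all assumptions:

    #define M 101         /* table size: a prime */

    static int table[M];  /* file scope, so zero-initialized: 0 = empty */

    static unsigned hash(int k) { return (unsigned)k % M; }

    /* Returns the slot used, or -1 if the table is full. */
    int oa_insert(int k) {
        for (int i = 0; i < M; i++) {          /* probe at most M slots */
            int j = (hash(k) + i) % M;
            if (table[j] == 0) { table[j] = k; return j; }
        }
        return -1;
    }

    /* Returns the slot holding k, or -1 if k is not in the table. */
    int oa_search(int k) {
        for (int i = 0; i < M; i++) {          /* same probe sequence */
            int j = (hash(k) + i) % M;
            if (table[j] == k) return j;       /* found */
            if (table[j] == 0) return -1;      /* empty slot: k is absent */
        }
        return -1;
    }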

  43. Chaining • Chaining will put elements in a linked list if there is a collision [diagram: a table whose slots point to linked lists; colliding keys such as k5, k2, and k7 share a slot and form a chain, as do k8 and k6]
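
A minimal chaining sketch in C: each slot heads a linked list, and a colliding key is simply pushed onto the front of its slot's list in O(1). Names and table size are assumptions:

    #include <stdlib.h>

    #define M 101                        /* number of slots: a prime */

    typedef struct Entry {
        int key;
        struct Entry *next;              /* next key in this slot's chain */
    } Entry;

    static Entry *table[M];              /* zero-initialized: empty chains */

    static unsigned hash(int k) { return (unsigned)k % M; }

    void chain_insert(int k) {
        Entry *e = malloc(sizeof *e);    /* push onto the chain head: O(1) */
        e->key = k;
        e->next = table[hash(k)];
        table[hash(k)] = e;
    }

    int chain_search(int k) {            /* walk the chain: O(1 + n/m) */
        for (Entry *e = table[hash(k)]; e != NULL; e = e->next)
            if (e->key == k)
                return 1;
        return 0;
    }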

  44. Analysis of Chaining • Insert, delete and search are simple: apply the hash function to the data, then walk down that slot's list • To analyze: • Assume that each key in the table is equally likely to be hashed to any slot • Given n keys and m slots in the table: • n/m = the average # of keys per slot • How long does a search take when it does not find something? • You apply the hash function and search the whole chain: O(1 + n/m) • How long for something that is found? (on average) • You search half the chain: O(1 + (n/m) · (1/2)) = O(1 + n/m)

  45. Analysis of Chaining • So the cost of searching is O(1 + n/m) • If the number of keys is about the same as the number of slots in the table, what is n/m? • Answer: n/m = O(1) • In other words, we can make the average cost of searching constant if we keep n/m constant

  46. Making a hash function • Choosing a good hash function will be important, because it will determine how many collisions you have • One idea: The Division Method: • Hash(k) = k mod m • In words: hash integer k into a table with m slots using the slot given by the remainder of k divided by m • This is a decent function, but if m is a power of 2 or power of 10, there might be a lot of collisions. • Solution: pick table size m = prime number not too close to a power of 2 (or 10)
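As code, the division method is a one-liner; the example values in the comment are just an illustration:

    /* hash(k) = k mod m; e.g. with m = 701 (a prime not too close to a
       power of 2), hash_div(123456, 701) == 80. */
    unsigned hash_div(unsigned k, unsigned m) { return k % m; }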

  47. Hash Functions: Polynomial Hash Codes • A modulus hash function might be good for whole numbers, but what if you have strings? • A polynomial hash code can take into account the characters of a string and the position of each character within the string to produce a relatively unique hash code • Choose a constant a > 1, take each character, convert it to its ASCII code, and use those ASCII codes as the coefficients xi: • F(x) = x0·a^(k−1) + x1·a^(k−2) + … + x(k−2)·a + x(k−1) • This produces a number that has few collisions for the most part, but if the string is really long (longer than 31 characters), the number produced can overflow a machine integer and might not be useful for a hash function
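
A C sketch of a polynomial hash code, evaluated with Horner's rule so each character is folded in with one multiply and one add; unsigned arithmetic wraps around rather than overflowing, which is one common way to cope with the long-string problem mentioned above. The constant a = 33 is an assumption (any constant > 1 works):

    #include <stdio.h>

    unsigned poly_hash(const char *s, unsigned m) {
        const unsigned a = 33;               /* the constant "a" from the slide */
        unsigned h = 0;
        for (; *s != '\0'; s++)
            h = h * a + (unsigned char)*s;   /* Horner step */
        return h % m;                        /* fold into a table of m slots */
    }

    int main(void) {
        printf("%u\n", poly_hash("algorithm", 101));  /* a slot in 0..100 */
        return 0;
    }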

  48. Using Hash Tables • Very rarely will you need to write a hash table or hash function yourself • Most languages (C++, Java and Visual Basic, for example) have built-in hash tables that can be used easily • Remember, hash tables are useful because they can quickly insert, remove and find elements.

  49. Summary • Data structures allow a programmer to control and manipulate their data in different ways that can improve performance and solve problems • We looked at dynamic sets: binary search trees, red-black trees and hash tables. You should be familiar with how all three of them work • You should also know the running times of these data structures and be aware of when you should use them.
