
Introduction to Computer Science 2 Balanced Binary Search Trees (2) & Extended Binary Trees



  1. Introduction to Computer Science 2 Balanced Binary Search Trees (2) & Extended Binary Trees Prof. Neeraj Suri Brahim Ayari

  2. Height of AVL Trees • AVL trees are defined by the height difference of subtrees • Original goal: the tree should be as “balanced” as possible • How balanced is an AVL tree? • The answer is given by the theorem on the height of an AVL tree: Theorem: for the height h(T) of an AVL tree with n nodes: ⌊log2n⌋ + 1 ≤ h(T) ≤ 1.44 log2(n+1)

  3. Fibonacci Trees • The lower bound ⌊log2n⌋ + 1 ≤ h(T) comes from the minimal height of a balanced binary tree (already shown) • For the proof of the upper bound one needs a special class of AVL trees: Fibonacci trees • Fibonacci numbers: F0 = 0, F1 = 1, Fn = Fn-1 + Fn-2 • Definition: Fibonacci trees are constructed as follows: • The empty tree T0 is a Fibonacci tree (height 0) • The tree T1, which contains only one node, is a Fibonacci tree of height 1 • If Th-1 and Th-2 are Fibonacci trees of heights h-1 and h-2, and x is a node, then Th = (Th-1, x, Th-2) is a Fibonacci tree of height h • No other trees are Fibonacci trees • Observe: the number of nodes on the path from the root to the deepest leaf gives the height of the Fibonacci tree!
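The recursive construction can be checked with a short program. A minimal sketch (the class name FibTreeCount is illustrative, not from the slides) that computes the node counts nh from the definition above and compares them against the identity nh = Fh+2 - 1 used on a later slide:

```java
public class FibTreeCount {
    // Number of nodes in the Fibonacci tree T_h = (T_{h-1}, x, T_{h-2}):
    // n_h = n_{h-1} + n_{h-2} + 1
    static long nodes(int h) {
        if (h == 0) return 0;   // T_0 is the empty tree
        if (h == 1) return 1;   // T_1 is a single node
        return nodes(h - 1) + nodes(h - 2) + 1;
    }

    // Fibonacci numbers: F_0 = 0, F_1 = 1, F_n = F_{n-1} + F_{n-2}
    static long fib(int n) {
        long a = 0, b = 1;
        for (int i = 0; i < n; i++) { long t = a + b; a = b; b = t; }
        return a;
    }

    public static void main(String[] args) {
        // Check the identity n_h = F_{h+2} - 1 for small h
        for (int h = 0; h <= 10; h++)
            System.out.println("n_" + h + " = " + nodes(h)
                               + ", F_" + (h + 2) + " - 1 = " + (fib(h + 2) - 1));
    }
}
```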

  4. Fibonacci Trees • Number of nodes: n0 = 0 (F0 = 0), n1 = 1 (F1 = 1), n2 = 2 (F2 = 1), n3 = 4 (F3 = 2) • T0: empty tree • T1: one node x • T2 = (T1, x, T0) • T3 = (T2, x, T1)

  5. Fibonacci Trees • Number of nodes: n4 = 7 (F4 = 3), n5 = 12 (F5 = 5) • T4 = (T3, x, T2) • T5 = (T4, x, T3) • T6, T7, etc. analogously

  6. Fibonacci and AVL Trees • To prove: every Fibonacci tree is an AVL tree • Proof (by induction over h): • Note: Th is always a tree of height h • T0 and T1 are AVL trees • If Th-1 and Th-2 are AVL trees, build Th = (Th-1, x, Th-2) according to the rules • As Th-1 and Th-2 are AVL trees, only the balance factor of the root must still be checked • BF(Th) = | h(Th-1) - h(Th-2) | = | (h - 1) - (h - 2) | = 1 ∎

  7. Fibonacci and AVL Trees • Special note: for a given Fibonacci tree there is no AVL tree with the same height and fewer nodes • The construction gives AVL trees of maximal height • One can add nodes while keeping the height, but cannot remove any without violating the AVL criterion (at unchanged height) • Fibonacci trees give the maximal height of an AVL tree for a given number of nodes • Note: the number of nodes nh in Th is the (h+2)-th Fibonacci number minus 1, i.e., nh = Fh+2 - 1 (for h ≥ 0)

  8. Fibonacci and AVL Trees • The following inequality holds for Fibonacci numbers: Fh ≥ Φ^(h-2) for h ≥ 2 and Φ = ½ (1 + √5) • Let n be the number of nodes in an AVL tree of height h. As Th contains a minimal number of nodes: n ≥ nh • Insert nh = Fh+2 - 1: n ≥ nh = Fh+2 - 1 ≥ Φ^h - 1, thus n + 1 ≥ Φ^h • The number of nodes grows exponentially with the height • Conversely: h ≤ logΦ(n + 1) = (1 / log2Φ) log2(n+1) = 1.44... log2(n+1) • Thus: the search path in an AVL tree is in the worst case 44% longer than in a complete tree

  9. Cost Analysis of AVL Trees • h ≤ c·log2(n+1) means: the height of an AVL tree is bounded by O(log2n) • Cost of insertion is in O(log2n) • Only the path from the root to the insertion point must be considered • Rotations have constant cost • Cost of deletion is in O(log2n) • Every node on the path from the root to the deleted node causes at most one rotation • AVL trees are worst-case efficient implementations of binary search trees • Natural trees need Θ(n) steps in the worst case • Calculating the average height is still an open problem • Empirical results give h = c + log2n for c ≈ 0.2

  10. Weight Balanced Binary Search Trees • Treat the “weight difference” of two subtrees as the measure of balance • Weight = number of nodes in a subtree • The properties are very similar to height balanced binary trees • Let T be a binary search tree, TL its left subtree and n(X) the number of nodes in a tree X • Definition: the value ρ(T) = (n(TL) + 1) / (n(T) + 1) is the root balance of T • Definition: a tree T is α-balanced if for every subtree T’: α ≤ ρ(T’) ≤ 1 - α
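The two definitions translate directly into code. A minimal sketch assuming a bare node class (WBNode is a hypothetical name, not from the slides); it computes the root balance ρ and checks the α-balance condition on every subtree:

```java
class WBNode {
    WBNode left, right;

    // n(X): number of nodes in the subtree X (null = empty tree)
    static int size(WBNode t) {
        return t == null ? 0 : 1 + size(t.left) + size(t.right);
    }

    // rho(T) = (n(T_L) + 1) / (n(T) + 1), the root balance from the definition
    static double rho(WBNode t) {
        return (size(t.left) + 1.0) / (size(t) + 1.0);
    }

    // T is alpha-balanced if alpha <= rho(T') <= 1 - alpha for every subtree T'
    static boolean isBB(WBNode t, double alpha) {
        if (t == null) return true;
        double r = rho(t);
        return alpha <= r && r <= 1 - alpha
               && isBB(t.left, alpha) && isBB(t.right, alpha);
    }
}
```

A single node has ρ = ½ and lies in BB(α) for any α ≤ ½, while a left chain of three nodes has root balance ¾ and already falls out of BB(0.3).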

  11. Condition α ≤ ρ(T’) ≤ 1 - α • The set of all α-balanced binary trees is called BB(α) („bounded balance“) • The definition of balance only considers the left subtree, but for a BB(α) tree every subtree also satisfies α ≤ 1 - ρ’(T’) ≤ 1 - α, where ρ’ is defined analogously to ρ on the right subtree • The parameter α defines the “distance” from a complete tree: • α = ½: only complete trees allowed • α < ½: relaxed condition • α = 0: no structural conditions • α > ½: makes no sense to consider

  12. Example • Tree with keys Mars, Jupiter, Pluto, Earth, Mercury, Uranus, Neptune, Saturn, Venus • ρ(T) = (n(TL) + 1) / (n(T) + 1) • Choose α = 0.3; then for every subtree: α = 0.3 ≤ ρ ≤ 1 - α = 0.7 • The tree is in BB(α) for α = 0.3

  13. Notes • Already noted: ρ = ½ holds for complete trees • Root balance < ½ means: there are fewer nodes in the left subtree • α bounds the root balance symmetrically from both sides • If the left subtree is complete, the root balance tends towards 1 with an increasing number of nodes • Only α = 0 allows all “degenerations” • Not every tree (with n nodes) can be transformed into a BB(α) tree for every α • There is at least one tree in BB(α) for every n when 0.25 ≤ α ≤ 1 - ½√2 ≈ 0.292

  14. Height of Weight Balanced Trees • Note: when traversing the path from the root to a leaf, one “loses”, depending on α, a fraction of the nodes at every step • Consider the path p = v1, v2, ..., vh • For the left and right subtrees TL and TR of a tree T, the BB(α) condition gives: n(TL) + 1 ≤ (1 - α)(n(T) + 1) and n(TR) + 1 ≤ (1 - α)(n(T) + 1) • Traversal of path p: n(v2) + 1 ≤ (1 - α)(n(v1) + 1), n(v3) + 1 ≤ (1 - α)(n(v2) + 1), ..., n(vh) + 1 ≤ (1 - α)(n(vh-1) + 1)

  15. Height of Weight Balanced Trees • As v1 is the root and vh a leaf: n(T) + 1 = n(v1) + 1 and n(vh) + 1 = 2 • Inserting into the chained inequality: 2 = n(vh) + 1 ≤ (1 - α)^(h-1) (n(v1) + 1) = (1 - α)^(h-1) (n(T) + 1) • Taking logarithms on both sides: 1 ≤ (h - 1) log2(1 - α) + log2(n(T) + 1) • Thus (note: log2(1 - α) < 0 for α > 0), with c = -log2(1 - α): h - 1 ≤ log2(n(T) + 1) / c ∈ O(log2n) • The height of the tree is logarithmic in the number of nodes

  16. Operations on Weight Balanced Binary Trees • Search works as for AVL trees • Cost is logarithmic • For insertion/deletion the root balances must be updated along the path from the root to the corresponding position • If the criterion is violated: rotations as for AVL trees • Open issues: • Are rotations appropriate measures for restructuring BB(α) trees? • How does one efficiently compute the root balance? • The number of rotations on the path to the root is bounded: search/insertion/deletion are all in O(log2n)

  17. Position Search in Balanced Binary Search Trees • Comparison: tree implementations vs. linked lists • Balanced trees allow (almost) all operations in O(log2n) • Linked lists need O(n) for search/insertion/deletion! • For sequential traversal both perform in O(n) • Should sorted data therefore always be stored in trees? • One should not underestimate the implementation cost • The “last” operation where lists “win” is positional search (finding the k-th element) • Positional search: find the k-th element in a list • For trees, the “list” is the inorder traversal

  18. The Problem • For lists: • Traverse k elements in O(k) • For trees: • One does not “know” whether to go left or right, and one knows nothing about the number of nodes in the subtrees • In the worst case all nodes must be visited: O(n)! • That can be improved ...

  19. Rank of a Node • Definition: the rank of a node is the number of nodes in its left subtree plus 1 • Rank = position of node x in the subtree where x is the root

class BinarySearchTree {
    int K;                 /* key */
    Info info;             /* info */
    int balance;           /* BF, for AVL trees: -1, 0, +1 */
    int rank;              /* rank of the node */
    BinarySearchTree L, R;
    /* constructor and methods ... */
    public BinarySearchTree posFind(int pos) { ... }
}

  20. Algorithm • Pseudo code: • Start at the root • If pos < rank: search in the left subtree • If pos > rank: subtract the rank from the position and search in the right subtree • The search stops when pos = rank • Correctness: • The rank of a node is always its position in the subtree where it is the root • Note: when inserting/deleting in a left subtree, the nodes on the path up to the root must update their ranks
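The rank bookkeeping described in the note above can be sketched as follows. RankNode is a stripped-down, hypothetical node class (AVL rebalancing is omitted to keep the rank-update idea visible):

```java
class RankNode {
    int key;
    int rank = 1;            // rank = nodes in left subtree + 1
    RankNode left, right;

    RankNode(int key) { this.key = key; }

    static RankNode insert(RankNode t, int key) {
        if (t == null) return new RankNode(key);
        if (key < t.key) {
            t.rank++;        // insertion goes left: this node's position shifts up
            t.left = insert(t.left, key);
        } else {
            t.right = insert(t.right, key);
        }
        return t;
    }

    // positional search exactly as in the pseudo code: stop when pos == rank
    static RankNode findPos(RankNode t, int pos) {
        while (t != null && pos != t.rank) {
            if (pos < t.rank) {
                t = t.left;
            } else {
                pos -= t.rank;
                t = t.right;
            }
        }
        return t;
    }
}
```

Inserting 50, 30, 70, 20, 40 gives the inorder sequence 20, 30, 40, 50, 70, and findPos returns the node at each position accordingly.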

  21. Example • Tree with city keys Prague, Athens, Tokyo, Rome, Cairo, Paris, Sofia, Lima, Oslo, Bonn, Bern and positions pos = 1 ... 11 • pos = 4 -> Cairo • pos = 9 -> Rome

  22. Java Method

public BinarySearchTree findPos(int pos) {
    BinarySearchTree root = this;
    while ((root != null) && (pos != root.rank)) {
        if (pos < root.rank) {
            root = root.L;
        } else {
            pos = pos - root.rank;
            root = root.R;
        }
    }
    return root;
}

• Complexity in a balanced tree: O(log2n)

  23. Summary: Balanced Search Trees

  24. Extended Binary Trees

  25. Extended binary trees • Replace NULL-pointers with special (external) nodes • A binary tree to which external nodes are added is called an extended binary tree • The data can be stored either in the internal or in the external nodes • The length of the path to a node represents the cost of searching for it

  26. External and internal path length • The cost of searching in extended binary trees depends on the following parameters: • External path length = the sum of the path lengths from the root to all external nodes Si (1 ≤ i ≤ n+1): Extn = Σi=1...n+1 depth(Si) • Internal path length = the sum of the path lengths from the root to all internal nodes Ki (1 ≤ i ≤ n): Intn = Σi=1...n depth(Ki) • Extn = Intn + 2n (proof by induction) • Extended binary trees with minimal external path length also have minimal internal path length
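Both path lengths can be computed by a simple traversal. A minimal sketch in which null children play the role of the external nodes (ExtTree is an illustrative name, not from the slides); it also lets one check the identity Extn = Intn + 2n on concrete trees:

```java
class ExtTree {
    ExtTree left, right;

    // internal path length: sum of depths of all internal (real) nodes
    static int internal(ExtTree t, int depth) {
        if (t == null) return 0;   // external nodes contribute nothing here
        return depth + internal(t.left, depth + 1) + internal(t.right, depth + 1);
    }

    // external path length: sum of depths of all external (null) positions
    static int external(ExtTree t, int depth) {
        if (t == null) return depth;
        return external(t.left, depth + 1) + external(t.right, depth + 1);
    }

    // n: number of internal nodes
    static int count(ExtTree t) {
        return t == null ? 0 : 1 + count(t.left) + count(t.right);
    }
}
```

For a left chain of three nodes this gives Int = 0 + 1 + 2 = 3, Ext = 9 and indeed 9 = 3 + 2·3.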

  27. Example • n = 7 • External path length: Extn = 3 + 4 + 4 + 2 + 3 + 3 + 3 + 3 = 25 • Internal path length: Intn = 0 + 1 + 1 + 2 + 2 + 2 + 3 = 11 • 25 = Extn = Intn + 2n = 11 + 14 = 25

  28. Minimal and maximal path length • For a given n, a balanced tree has the minimal internal path length • Example: for a complete tree with height h (n = 2^h - 1): Intn = Σi=0...h-1 i · 2^i • The internal path length becomes maximal if the tree degenerates to a linear list: Intn = Σi=1...n-1 i = n(n-1)/2 • Example: h = 4, n = 15, Int = 34, Ext = 16·4 = 64 • For comparison: a list with n = 15 nodes has Int = 105, Ext = 105 + 30 = 135

  29. 25 15 8 15 3 25 8 3 Weighted binary trees • Often weights qi are assigned to the external nodes ( 1  i  n+1 ). • The weighted external path length is defined as Extw = i = 1 ... n+1 depth( Si )  qi • Within weighted binary trees the properties of minimal and maximal path lengths do not apply any more. • The determination of the minimal external path length is an important practical problem... Extw = 88 (less than 102 although linear list) Extw = 102

  30. Application example: optimal codes • To convert a text file efficiently into bit strings, there are two alternatives: • Fixed-length coding: each character has the same number of bits (e.g., ASCII) • Variable-length coding: some characters are represented using fewer bits than others • Example of fixed-length coding: 3-bit code for the alphabet A, B, C, D: • A = 001, B = 010, C = 011, D = 100 • The message ABBAABCDADA is converted to • 001010010001001010011100001100001 (length 33 bits) • Using a 2-bit code the same message can be coded with only 22 bits • For decoding, group the bit string into 3-bit (respectively 2-bit) blocks and use a table mapping each code to its character
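Fixed-length decoding as described in the last bullet can be sketched as follows (FixedLength and the method name are illustrative): split the bit string into blocks of the code width and look each block up in the table.

```java
import java.util.*;

class FixedLength {
    // Decode a bit string by cutting it into fixed-width groups and
    // looking each group up in the code table.
    static String decode(String bits, Map<String, Character> table, int width) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i + width <= bits.length(); i += width)
            out.append(table.get(bits.substring(i, i + width)));
        return out.toString();
    }
}
```

With the 3-bit table from the slide, decoding the 33-bit string yields ABBAABCDADA again.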

  31. Application example: optimal codes (2) • Idea: more frequently used characters are coded using fewer bits • Message: ABBAABCDADA • Coding: 01010001011111001100 • Length: 20 bits! • Variable-length coding can reduce the memory needed for storing the file • How can this special coding be found, and why is the decoding unique?

  32. Application example: optimal codes (3) • Represent the frequencies and the coding as a weighted binary tree (external nodes: A with weight 5, B with weight 3, D with weight 2, C with weight 1) • First, decoding: given a bit string: • Use the successive bits to traverse the tree starting from the root • When you arrive at an external node, output the character stored there • Example: 010100010111... • 1st bit = 0: external node, A • 2nd bit = 1: from the root to the right • 3rd bit = 0: left, external node, B • 4th bit = 1: from the root to the right • 5th bit = 1: right • ...
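The decoding walk can be sketched without an explicit tree by extending the current bit sequence until it matches a complete code word, which is equivalent for a prefix code (each match corresponds to reaching an external node). The code table A = 0, B = 10, D = 110, C = 111 is reconstructed from the 20-bit message on the previous slide; class and method names are illustrative:

```java
import java.util.*;

class PrefixDecode {
    // Decode a bit string against a prefix-free code table: grow the
    // current code word bit by bit; a hit in the table means an
    // external node of the code tree was reached.
    static String decode(String bits, Map<String, Character> code) {
        StringBuilder out = new StringBuilder();
        String cur = "";
        for (char b : bits.toCharArray()) {
            cur += b;                     // walk one edge down the code tree
            Character c = code.get(cur);
            if (c != null) {              // external node reached
                out.append(c);
                cur = "";                 // restart at the root
            }
        }
        return out.toString();
    }
}
```

Decoding 01010001011111001100 with this table reproduces the message ABBAABCDADA.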

  33. Correctness condition • Observation: with variable-length coding, the code of one character must not be a prefix of the code of any other character • If the coding is represented as an extended binary tree, uniqueness is guaranteed (only one character per external node) • If the frequency of the characters in the original text is taken as the weight of the external nodes, then a tree with minimal weighted external path length yields an optimal code • How is a tree with minimal external path length generated?

  34. Huffman Code • Idea: characters are weighted and sorted according to their frequency • This also works independently of a specific text, e.g., for English (characters with relative weights) • A binary tree with minimal external path length is constructed as follows: • Each character is represented by a tree with its corresponding weight (a single external node) • The two trees with the smallest weights are merged into a new tree • The root of the new tree is marked with the sum of the weights of the original roots • Continue until only one tree remains

  35. Example 1: Huffman • Alphabet and frequencies: weights (4, 5, 9, 10, 29) • Step 1: (4, 5, 9, 10, 29) • Merge 4 and 5; new weight: 9 • Step 2: (9, 9, 10, 29) • Merge 9 and 9; new weight: 18

  36. Example 1: Huffman (2) • Step 3: (18, 10, 29) → (10, 18, 29) • Merge 10 and 18; new weight: 28 • Step 4: (28, 29) • Merge 28 and 29; new weight: 57 • Finished!

  37. Resulting tree • Coding: E = 1, T = 00, N = 011, S = 0100, I = 0101 • Extw = 112 • Using this coding, e.g.: • TENNIS = 00101101101010100 • SET = 0100100 • NET = 011100 • Decoding as described before

  38. Some remarks • The resulting tree is not regular • Regular trees are not always optimal • Example: the best nearly complete tree has Extw = 123 • For the message ABBAABCDADA, 20 bits is optimal (see previous slides)

  39. Example 2: Huffman • Average number of bits without Huffman: 3 (because 2^3 = 8) • Average number of bits using the Huffman code: • There are other “valid” solutions! But the average number of bits is the same for all of them (equal to Huffman)

  40. Analysis

/* Algorithm Huffman */
for (int i = 1; i <= n-1; i++) {
    p1 = smallest element in list L
    remove p1 from L
    p2 = smallest element in L
    remove p2 from L
    create node p
    add p1 and p2 as left and right subtrees to p
    weight p = weight p1 + weight p2
    insert p into L
}

• The run time depends in particular on the implementation of the list: • Time required to find the node with the smallest weight • Time required to insert a new node • “Naive” implementations give O(n^2), “smarter” ones O(n log2n)
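The pseudo code above becomes one of the "smarter" O(n log2n) variants when the list L is kept in a priority queue. A sketch with illustrative names (the letter-to-weight assignment below is assumed to match example 1):

```java
import java.util.*;

class Huffman {
    static class Node {
        int weight;
        Character sym;       // non-null only for leaves (external nodes)
        Node left, right;
        Node(int w, Character s) { weight = w; sym = s; }
        Node(Node l, Node r) { weight = l.weight + r.weight; left = l; right = r; }
    }

    static Node build(Map<Character, Integer> freq) {
        // the list L, ordered by weight
        PriorityQueue<Node> list =
            new PriorityQueue<>(Comparator.comparingInt((Node n) -> n.weight));
        for (Map.Entry<Character, Integer> e : freq.entrySet())
            list.add(new Node(e.getValue(), e.getKey()));
        while (list.size() > 1) {
            Node p1 = list.poll();       // smallest element in L
            Node p2 = list.poll();       // second smallest element in L
            list.add(new Node(p1, p2));  // weight p = weight p1 + weight p2
        }
        return list.poll();
    }

    // collect the code words: left edge = 0, right edge = 1
    static void codes(Node t, String prefix, Map<Character, String> out) {
        if (t.sym != null) { out.put(t.sym, prefix); return; }
        codes(t.left, prefix + "0", out);
        codes(t.right, prefix + "1", out);
    }
}
```

With the weights from example 1 (4, 5, 9, 10, 29) the root weight is 57 and the weighted external path length Σ qi·|code(qi)| is 112, matching the resulting tree on slide 37.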

  41. Optimality • Observation: the weight of a node K in the Huffman tree is equal to the external path length of the subtree having K as its root • Theorem: a Huffman tree is an extended binary tree with minimal weighted external path length Extw • Proof outline (by induction over n, the number of characters in the alphabet): • The statement to prove is A(n) = “A Huffman tree for n characters has minimal external path length Extw” • Consider first n = 2: prove A(2) = “A Huffman tree for 2 characters has minimal external path length”

  42. Optimality (2) • Proof: • n = 2: only two characters with weights q1 and q2 result in a tree with Extw = q1 + q2. This is minimal, because there are no other trees • Induction hypothesis: for all i ≤ n, A(i) is true • To prove: A(n+1) is true

  43. Optimality (3) • Proof: • Consider a Huffman tree T for n+1 characters. This tree has a root V and two subtrees T1 and T2 with weights q1 and q2, respectively • From the construction method we can deduce that for the weights qi of all internal nodes ni of T1 and T2: qi ≤ min(q1, q2) • Therefore, for these weights qi: q1 + q2 > qi. So if V were exchanged with any node in T1 or T2, the resulting tree would have a greater weight • Exchanging nodes within T1 and T2 cannot help either, because T1 and T2 are already optimal (both are trees for n characters or fewer, so the induction hypothesis holds for them) • Hence T is an optimal tree for n+1 characters

  44. Huffman Code: Applications • Fax machine

  45. Huffman: Other applications • ZIP coding (at least a similar technique) • In principle: most coding techniques with data reduction (lossless compression) • NOT Huffman: lossy compression techniques like JPEG, MP3, MPEG, ...
