Advanced Searching Techniques and Methods

Chapter 7 Searching

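Table (file) and keys
• internal key (embedded key): stored together with the whole record
  [Figure: a table of four records, numbered 1 to 4, with columns no., name, age]
• external key: kept in a separate table of its own, with pointers to the records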

Terminology of searching
• primary key: unique
• secondary key: may not be unique
• internal search: data stored in main memory
• external search: data stored in auxiliary memory
• retrieval: a successful search
• a search and insertion algorithm: retrieve the data if the search is successful; insert the data if the search is unsuccessful

Abstract data type
    typedef KEYTYPE ...      // a type of key
    typedef RECTYPE ...      // a type of record
    RECTYPE nullrec = ...    // a "null" record

    KEYTYPE keyfunct(r)
    RECTYPE r;
    { ... };

    abstract typedef [rectype] TABLE (RECTYPE);

    abstract member(tbl, k)
    TABLE(RECTYPE) tbl;
    KEYTYPE k;
    postcondition if (there exists an r in tbl such that keyfunct(r) == k)
                      then member = TRUE
                      else member = FALSE

    abstract RECTYPE search(tbl, k)
    TABLE(RECTYPE) tbl;
    KEYTYPE k;
    postcondition (not member(tbl, k) && (search == nullrec))
                  || (member(tbl, k) && keyfunct(search) == k);

    abstract insert(tbl, r)
    TABLE(RECTYPE) tbl;
    RECTYPE r;
    precondition  member(tbl, keyfunct(r)) == FALSE
    postcondition inset(tbl, r);
                  (tbl - [r]) == tbl';

    abstract delete(tbl, k)
    TABLE(RECTYPE) tbl;
    KEYTYPE k;
    postcondition tbl == (tbl' - [search(tbl, k)]);

Sequential search (linear search)
• Applied to an array or a linked list
• Data are not sorted.
e.g. 9 5 6 8 7 2
(1) search 6: successful
(2) search 4: unsuccessful
(3) delete 6: 9 5 2 8 7 (the last key fills the vacated position)
(4) insert 4: 9 5 2 8 7 4 (appended at the end)
• time complexity:
  successful search: (n+1)/2 comparisons on average = O(n)
  unsuccessful search: n comparisons = O(n)

Sequential search in C
algorithm:
    for (i = 0; i < n; i++)
        if (key == k[i])
            return(i);
    return(-1);
sentinel: an extra key inserted at the end of the array
    k[n] = key;
    for (i = 0; key != k[i]; i++)
        ;
    if (i < n)
        return(i);
    else
        return(-1);
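The sentinel fragment can be wrapped into a complete function. The following is a minimal sketch; the function name sentinel_search and the use of int keys are assumptions, and the caller is assumed to have reserved one extra slot k[n] for the sentinel.

```c
/* Sequential search with a sentinel.
 * Assumption: the array k has room for n + 1 elements, so that k[n]
 * can hold the sentinel. Returns the index of key, or -1 if absent. */
int sentinel_search(int k[], int n, int key)
{
    int i;

    k[n] = key;                      /* the sentinel guarantees the loop stops */
    for (i = 0; key != k[i]; i++)
        ;                            /* one comparison per element, no bounds test */
    return (i < n) ? i : -1;         /* i == n means only the sentinel matched */
}
```

Compared with the plain loop, each iteration saves the test i < n, at the cost of one extra array slot.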

Move-to-front method
• Let p(i) be the probability that record i is retrieved.
• p(0) + p(1) + ... + p(n-1) = 1.
• average number of comparisons: p(0) + 2p(1) + 3p(2) + ... + np(n-1)
  This number is minimized if p(0) ≧ p(1) ≧ p(2) ≧ ... ≧ p(n-1).
• move-to-front method: the retrieved record is moved to the head of the list.
e.g. 9 5 6 8 7 2
(1) search 6: 6 9 5 8 7 2
(2) search 8: 8 6 9 5 7 2
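A minimal sketch of move-to-front on an array of int keys follows; the function name search_move_to_front is an assumption made for illustration.

```c
/* Search k[0..n-1] for key; on success move the key to the front.
 * Returns 0 (the new position of the key) if found, -1 otherwise. */
int search_move_to_front(int k[], int n, int key)
{
    int i, j;

    for (i = 0; i < n; i++) {
        if (k[i] == key) {
            for (j = i; j > 0; j--)  /* shift k[0..i-1] one place to the right */
                k[j] = k[j - 1];
            k[0] = key;
            return 0;
        }
    }
    return -1;
}
```

On an array the shift costs O(i) moves; on a linked list the same step needs only a constant number of pointer changes, which is why the method is usually applied to lists.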

Transposition method
• The retrieved record is interchanged with the preceding record.
e.g. 9 5 6 8 7 2
(1) search 6: 9 6 5 8 7 2
(2) search 8: 9 6 8 5 7 2
• The transposition method is more efficient under an unchanging probability distribution.
• The move-to-front method is better for a small to medium number of requests and for a quickly changing probability distribution.
• Mixed method: use the move-to-front method for the first s searches, then use the transposition method.
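The transposition step can be sketched in the same style; the function name search_transpose is an assumption.

```c
/* Search k[0..n-1] for key; on success swap it with its predecessor.
 * Returns the new position of the key, or -1 if it is not present. */
int search_transpose(int k[], int n, int key)
{
    int i, tmp;

    for (i = 0; i < n; i++) {
        if (k[i] == key) {
            if (i == 0)
                return 0;            /* already at the front, nothing to swap */
            tmp = k[i - 1];
            k[i - 1] = k[i];
            k[i] = tmp;
            return i - 1;
        }
    }
    return -1;
}
```

Unlike move-to-front, the swap costs O(1) on an array as well as on a linked list.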

Searching in an ordered table
[Figure: a table of (key, record) entries in sorted order]
• linear (sequential) searching: about n/2 comparisons on average, whether the search is successful or unsuccessful

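Indexed sequential search (1)
• Indexed sequential file: a sorted table of (key, record) entries together with an index
[Figure: index entries, each a key with a pointer into the sorted table]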

Indexed sequential search (2)
• The use of an index is applicable to a sorted table stored as an array or a linked list.
• Deletion: mark the entry with a flag.
• Insertion:
  • shift some elements if there exist some deleted entries (pointers in the index file may need to be changed)
  • keep an overflow area

A secondary index
[Figure: sequential table, primary index, and secondary index]

Binary search
e.g. 2 5 6 7 8 9
search 7: needs 3 comparisons
• Time complexity: O(log n)
• Used only if the table is sorted and stored in an array.
• An insertion or a deletion requires O(n) time.
• Improvement: two arrays, one for flags and the other for the sorted keys with some "empty holes".
  [Figure: a flag array (e: empty, f: full) alongside the data array]
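A minimal sketch of binary search on a sorted int array follows. The function name binary_search and the optional probes counter are assumptions added for illustration; the counter lets the "3 comparisons" in the example be checked.

```c
#include <stddef.h>

/* Binary search in the sorted array k[0..n-1].
 * Returns the index of key, or -1 if it is absent. If probes is not
 * NULL, it receives the number of array elements examined. */
int binary_search(const int k[], int n, int key, int *probes)
{
    int low = 0, high = n - 1, count = 0, found = -1;

    while (low <= high) {
        int mid = low + (high - low) / 2;   /* avoids overflow of low + high */
        count++;
        if (key == k[mid]) {
            found = mid;
            break;
        }
        if (key < k[mid])
            high = mid - 1;
        else
            low = mid + 1;
    }
    if (probes != NULL)
        *probes = count;
    return found;
}
```

For the table 2 5 6 7 8 9, searching 7 examines 6, then 8, then 7.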

Binary search tree
[Figure: a binary search tree with root 6; 6 has sons 2 and 8; 2 has right son 5; 8 has sons 7 and 9]
• inorder traversal: 2 5 6 7 8 9
• The binary search uses a sorted array as an implicit binary search tree. (The middle element of the array is the root.)

Insertion in a binary search tree
• Insert 4
[Figure: the tree 6 / 2 8 / 5 7 9 before the insertion, and after it with 4 added as the left son of 5]
• The inserted key is added to the tree as a leaf node.
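Insertion as a new leaf can be sketched as follows. The struct layout and the name bst_insert are assumptions, as is the choice to ignore duplicate keys.

```c
#include <stdlib.h>

struct node {
    int key;
    struct node *left, *right;
};

/* Insert key as a new leaf and return the (possibly new) root.
 * Assumptions: duplicate keys are ignored, and the tree is left
 * unchanged if memory allocation fails. */
struct node *bst_insert(struct node *root, int key)
{
    struct node **p = &root;         /* link that will receive the new leaf */

    while (*p != NULL) {
        if (key == (*p)->key)
            return root;             /* already present */
        p = (key < (*p)->key) ? &(*p)->left : &(*p)->right;
    }
    *p = malloc(sizeof **p);
    if (*p != NULL) {
        (*p)->key = key;
        (*p)->left = (*p)->right = NULL;
    }
    return root;
}
```

Inserting 6, 2, 8, 5, 7, 9 in that order into an empty tree and then inserting 4 places 4 as the left son of 5.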

Deletion in a binary search tree (1)
Case 1: The deleted node has no sons. Delete it directly.
Deleting node with key 15.
[Figure, level by level: before 8 / 3 11 / 1 5 9 14 / 6 10 12 15 / 7 13; after 8 / 3 11 / 1 5 9 14 / 6 10 12 / 7 13]

Deletion in a binary search tree (2)
Case 2: The deleted node has only one subtree. Delete it and move the subtree up.
Deleting node with key 5.
[Figure, level by level: before 8 / 3 11 / 1 5 9 14 / 6 10 12 15 / 7 13; after 8 / 3 11 / 1 6 9 14 / 7 10 12 15 / 13]

Deletion in a binary search tree (3)
Case 3: The deleted node has two subtrees. Its inorder successor s takes its place. The right son of s takes the place of s. (s has no left son.)
Deleting node with key 11.
[Figure, level by level: before 8 / 3 11 / 1 5 9 14 / 6 10 12 15 / 7 13; after 8 / 3 12 / 1 5 9 14 / 6 10 13 15 / 7]

Deletion in a binary search tree (4)
• Asymmetric deletion: the deleted node is always replaced by its inorder successor.
• Symmetric deletion: the deleted node is replaced by its inorder predecessor and its inorder successor alternately.
• Average search time in a binary search tree: O(log n)
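The three deletion cases can be combined into one routine. The sketch below uses asymmetric deletion (inorder successor); the struct layout and the name bst_delete are assumptions, and nodes are assumed to have been allocated with malloc.

```c
#include <stdlib.h>

struct node {
    int key;
    struct node *left, *right;
};

/* Delete key from the tree and return the (possibly new) root. */
struct node *bst_delete(struct node *root, int key)
{
    struct node **p = &root;         /* link pointing at the node to delete */
    struct node **sp, *d, *s;

    while (*p != NULL && (*p)->key != key)
        p = (key < (*p)->key) ? &(*p)->left : &(*p)->right;
    if (*p == NULL)
        return root;                 /* key not found */

    d = *p;
    if (d->left == NULL) {
        *p = d->right;               /* case 1 (no sons) or case 2 */
    } else if (d->right == NULL) {
        *p = d->left;                /* case 2: only a left subtree */
    } else {
        sp = &d->right;              /* case 3: locate inorder successor s */
        while ((*sp)->left != NULL)
            sp = &(*sp)->left;
        s = *sp;
        *sp = s->right;              /* right son of s takes the place of s */
        s->left = d->left;           /* s takes the place of d */
        s->right = d->right;
        *p = s;
    }
    free(d);
    return root;
}
```

Applied to the example tree, deleting 15, 5, and 11 exercises cases 1, 2, and 3 respectively.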

Optimum binary search trees
e.g. sorted data: 2 3 5 7
[Figure: some binary search trees built from these keys]
• In an optimum binary search tree, the expected number of comparisons is minimized for a given set of keys and probabilities.

pi: probability of a successful search for key ki
qi: probability of an unsuccessful search
e.g. [Figure: root k2 with sons k1 and k3; failure nodes q0, q1, q2, q3 below them]
expected number of comparisons: 2p1 + p2 + 2p3 + 2q0 + 2q1 + 2q2 + 2q3
e.g. [Figure: root k3 with left son k1; k1 has right son k2]
expected number of comparisons: 2p1 + 3p2 + p3 + 2q0 + 3q1 + 3q2 + q3

Construction of (near) optimum search trees (1)
(1) Balancing method
e.g.
key (data):                        1   2   3   4   5   6   7
frequency of successful search:    2  10   3   1   4   8   9
partial sum:                       2  12  15  16  20  28  37
• Select i as the root such that the difference of the costs on the left and the right is minimized.
• The binary search tree can be constructed recursively.
• Time complexity: O(n)
[Figure: the resulting tree with root 5 (total frequency 16 on the left, 17 on the right); left son 2 with sons 1 and 3, where 3 has right son 4; right son 7 with left son 6]

Construction of (near) optimum search trees (2)
(2) Median split tree
e.g.
key (data):    1   2   3   4   5   6   7
frequency:     2  10   3   1   4   8   9
• The most frequent key is stored in the root (the node key).
• The split key is the median of all remaining keys.
• The binary search tree can be constructed recursively.
• The tree is a balanced tree.
• Time complexity: O(n log n)
[Figure: median split tree, each internal node shown as (node key, split key): root (2, 4); sons (3, 1) and (7, 5); leaves 1, 4, 5, 6]
How to search?

Balanced binary tree (AVL tree)
• The heights of the two subtrees of every node never differ by more than 1.
  balance = (height of left subtree) - (height of right subtree)
  Each node in a balanced binary tree has a balance of 1, -1, or 0.
• A balanced binary tree:
  [Figure: a balanced binary tree with each node labeled by its balance]

Rotations of a binary tree
[Figure: (a) Original tree (b) Right rotation (c) Left rotation]
left rotation of the subtree rooted at p:
• q = right(p)
• r = left(q)
• left(q) = p
• right(p) = r
• The inorder traversal is the same after a rotation is performed.
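The four assignments of the left rotation translate directly into C, and the right rotation is the mirror image. The struct layout and the function names rotate_left and rotate_right are assumptions; each function returns the new root of the subtree so the caller can relink it.

```c
#include <stddef.h>

struct node {
    char key;
    struct node *left, *right;
};

/* Left rotation of the subtree rooted at p.
 * Assumption: p has a right son. Returns the new subtree root q. */
struct node *rotate_left(struct node *p)
{
    struct node *q = p->right;
    struct node *r = q->left;

    q->left = p;
    p->right = r;
    return q;
}

/* Right rotation, the mirror image.
 * Assumption: p has a left son. Returns the new subtree root q. */
struct node *rotate_right(struct node *p)
{
    struct node *q = p->left;
    struct node *r = q->right;

    q->right = p;
    p->left = r;
    return q;
}
```

Applying rotate_right to the result of rotate_left restores the original shape, which is one way to check that the inorder sequence is preserved.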

Insertion of an AVL tree (1)
Case 1: Node C is the first unbalanced node traced up from the newly inserted node.
• right rotation on the subtree rooted at C
• After the rotation, the height of the subtree is the same as it was before the insertion.
[Figure: the subtree rooted at C (balance 1) with left son A (balance 0) and subtrees T1, T2, T3, each of height n, shown before and after the right rotation, with the newly inserted node marked]

Insertion of an AVL tree (2)
Case 2:
First rotation: left rotation on the subtree rooted at A
[Figure: the subtree rooted at C with left son A and right subtree T4 (height n); A has left subtree T1 (height n) and right son B, whose subtrees T2 and T3 have height n-1; shown before and after the left rotation at A, with the newly inserted node marked]

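Second rotation: right rotation on the subtree rooted at C
[Figure: after the second rotation B is the root (balance 0) with sons A (balance 0) and C (balance -1); T1 and T4 have height n, T2 and T3 have height n-1; the newly inserted node is marked]
• After the rotations, the height of the subtree is the same as it was before the insertion.
• Insertion requires at most 2 rotations.
• Deletion is more complex; it may require O(log n) rotations.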

Multiway search trees
• A multiway search tree of order n:
  • at most n subtrees
  • at most n-1 keys in a node
[Figure: a multiway search tree with nodes labeled A through H]

B-trees
• B-tree of order m:
  (m-1)/2 ≦ # of keys in a nonroot node ≦ m-1
  1 ≦ # of keys in the root node ≦ m-1
• a B-tree of order 5:
  (a) Initial portion of a B-tree
  [Figure: node (320 540) at the top; below it node (430 480) with leaves (380 395 406 412), (451 472), (493 506 511)]

  (b) After inserting 382
  [Figure: node (395 430 480) with leaves (380 382), (406 412), (451 472), (493 506 511)]
  (c) After inserting 518 and 508
  [Figure: node (395 430 480 508) with leaves (380 382), (406 412), (451 472), (493 506), (511 518)]

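a B-tree of order 4:
  (a) An initial B-tree twig
  [Figure: father (87 140) with leaves (23 61 74), (90 100 106), (152 186 194)]
  (b) Inserting 102 with a left bias
  [Figure: father (87 102 140) with leaves (23 61 74), (90 100), (106), (152 186 194)]
  (c) Inserting 102 with a right bias
  [Figure: father (87 100 140) with leaves (23 61 74), (90), (102 106), (152 186 194)]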

Deletion in multiway search trees
(1) The simplest method
• Mark a deleted key; do not remove it.
• Disadvantages:
  • It wastes space.
  • In a nonleaf node, only the same key can reuse the "deleted" space.
(2) A technique similar to binary search trees, used in an unrestricted multiway search tree
• If the key has an empty left or right subtree, remove it. If it is the only key in the node, remove the node.
• Otherwise, its successor takes its place. (The successor has an empty left subtree.)

Deletion in B-trees
(i) Shift a key from its father and its brother (borrow)
Before: father (80 120 150); leaf (90 113); brother (126 135 142)
Delete key 113
After: father (80 126 150); leaf (90 120); brother (135 142)

(ii) Take a key from its father and combine with its brother
Before: father (80 126 150); leaves (68 73), (90 120), (135 142)
Delete key 120 and consolidate
After: father (80 150); leaves (68 73), (90 126 135 142)

(iii) do (ii), then do (i) for its father
Before: root (60 170); second level (30 50), (80 150), (180 220 280); leaves shown (65 72), (87 96), (153 162), (173 178), (187 202)
Delete 65, consolidate and borrow
After: root (60 180); second level (30 50), (150 170), (220 280); leaves shown (72 80 87 96), (153 162), (173 178), (187 202)

(iv) do (ii), then do (ii) for its father
Before: root (60 180 300); second level shown (30 50), (150 170), (220 280); leaves shown (153 162), (173 178), (187 202)
Deleting 173 and a double consolidation
After: root (60 300); second level shown (30 50), (150 180 220 280); leaves shown (153 162 170 178), (187 202)

Deletion in B-trees
• Consolidation may propagate up to the root.
  If the root has more than one key, there is no problem.
  If the root has only one key, the root is removed and the tree becomes one level shorter.
• Insertion, deletion or searching in a B-tree requires O(log n) time, where n denotes the number of nodes in the B-tree.

B+-tree
• All keys are maintained in leaf nodes, and keys are also replicated in nonleaf nodes.
• Finding the next record: O(1) time

Digital search tree
Keys: 180 185 1867 195 207 217 2174 21749 217493 226 27 274 278 279 2796 281 284 285 286 287 288 294 307 768
[Figure: the digital search tree for these keys, one digit per node; eok = end of key]

Trie
• This is one kind of digital search tree.
• Each node contains exactly m pointers. (Some of them are null.)
  e.g. m = 10 for numerical data.
• It is useful when the set of keys is dense.
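A trie with m = 10 can be sketched as below. The struct layout, the eok flag stored in each node, and the function names are assumptions; keys are assumed to be strings of decimal digits only.

```c
#include <stdlib.h>

#define M 10                         /* one pointer per decimal digit */

struct trie {
    struct trie *child[M];           /* exactly M pointers, some of them NULL */
    int eok;                         /* nonzero if a key ends at this node */
};

/* Allocate a node with all pointers NULL; returns NULL on failure. */
static struct trie *trie_new(void)
{
    struct trie *t = malloc(sizeof *t);
    int i;

    if (t != NULL) {
        for (i = 0; i < M; i++)
            t->child[i] = NULL;
        t->eok = 0;
    }
    return t;
}

/* Insert a key made of decimal digits. Returns 0 on success, -1 on failure. */
int trie_insert(struct trie *root, const char *key)
{
    for (; *key != '\0'; key++) {
        int d = *key - '0';
        if (root->child[d] == NULL && (root->child[d] = trie_new()) == NULL)
            return -1;
        root = root->child[d];
    }
    root->eok = 1;
    return 0;
}

/* Return nonzero if key was inserted earlier. */
int trie_member(const struct trie *root, const char *key)
{
    for (; *key != '\0' && root != NULL; key++)
        root = root->child[*key - '0'];
    return root != NULL && root->eok;
}
```

A lookup follows one pointer per digit, so its cost depends on the key length rather than on the number of keys. When the key set is sparse, most of the M pointers in each node stay null, which is why the structure suits dense key sets.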

Hashing
• hash function: transforms a key into a table index
  e.g. data: 18 23 33 13 24 10
  hash function: h(k) = k mod 10
• hash collision: two records (keys) hash to the same position.

Resolution of hash collision
(1) open addressing (rehashing)
a) linear probing: place the collided record in the next available position in the array
b) rehashing function: ...
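Linear probing with h(k) = k mod 10 can be sketched as follows. The names TABLE_SIZE, EMPTY, table_init, probe_insert and probe_search are assumptions, and keys are assumed to be nonnegative so that -1 can mark a free slot. Deletion is not handled.

```c
#define TABLE_SIZE 10
#define EMPTY (-1)                   /* assumption: keys are nonnegative */

static int h(int key)
{
    return key % TABLE_SIZE;
}

/* Mark every slot as free. */
void table_init(int table[])
{
    int i;

    for (i = 0; i < TABLE_SIZE; i++)
        table[i] = EMPTY;
}

/* Insert key and return the slot used, or -1 if the table is full. */
int probe_insert(int table[], int key)
{
    int i, pos = h(key);

    for (i = 0; i < TABLE_SIZE; i++) {
        if (table[pos] == EMPTY || table[pos] == key) {
            table[pos] = key;
            return pos;
        }
        pos = (pos + 1) % TABLE_SIZE;    /* try the next position, wrapping */
    }
    return -1;
}

/* Return the slot holding key, or -1 if key is not in the table. */
int probe_search(const int table[], int key)
{
    int i, pos = h(key);

    for (i = 0; i < TABLE_SIZE; i++) {
        if (table[pos] == key)
            return pos;
        if (table[pos] == EMPTY)
            return -1;                   /* an empty slot ends the probe chain */
        pos = (pos + 1) % TABLE_SIZE;
    }
    return -1;
}
```

With the data 18 23 33 13 24 10 inserted in that order, 18 lands in slot 8, 23 in slot 3, 33 in slot 4, 13 in slot 5, 24 in slot 6, and 10 in slot 0.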

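(2) chaining: records that hash to the same position are kept in a linked list
[Figure: list nodes, each with the fields k (key), r (record), and next (pointer)]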
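Chaining can be sketched with the node fields k, r and next named on the slide. The table size, the int record type and the function names are assumptions; keys are assumed to be nonnegative, and new nodes are added at the head of their list.

```c
#include <stdlib.h>

#define TABLE_SIZE 10

struct entry {
    int k;                           /* key */
    int r;                           /* record (an int here, for illustration) */
    struct entry *next;              /* next entry with the same hash value */
};

/* Insert (k, r) at the head of its chain. Returns 0 on success, -1 on failure. */
int chain_insert(struct entry *table[], int k, int r)
{
    struct entry *e = malloc(sizeof *e);

    if (e == NULL)
        return -1;
    e->k = k;
    e->r = r;
    e->next = table[k % TABLE_SIZE];
    table[k % TABLE_SIZE] = e;
    return 0;
}

/* Return the entry with key k, or NULL if there is none. */
struct entry *chain_search(struct entry *table[], int k)
{
    struct entry *e;

    for (e = table[k % TABLE_SIZE]; e != NULL; e = e->next)
        if (e->k == k)
            return e;
    return NULL;
}
```

With h(k) = k mod 10, the keys 23, 33 and 13 all end up in the list at position 3, so no other slot is disturbed by the collision.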

Issues of hashing
• How to choose a hash function?
  the division method: h(key) = key mod m
  It is best that the table size m is prime.
• Advantage of hashing: faster than binary search
• Disadvantages of hashing:
  1. It needs more memory.
  2. Deleting a record is difficult.
