Indexing and Hashing: Basic Concepts and Comparison

Chapter 12: Indexing and Hashing Rev. Sep 17, 2008

Chapter 12: Indexing and Hashing • Basic Concepts • Ordered Indices • B+-Tree Index Files • B-Tree Index Files • Static Hashing • Dynamic Hashing • Comparison of Ordered Indexing and Hashing • Index Definition in SQL • Multiple-Key Access

Basic Concepts • Indexing mechanisms used to speed up access to desired data. • E.g., author catalog in library • Textbook • Search Key - attribute to set of attributes used to look up records in a file. • An index fileconsists of records (called index entries) of the form • Index files are typically much smaller than the original file further reducing the effort needed to find particular result. • Two basic kinds of indices: • Ordered indices: search keys are stored in sorted order • Hash indices: search keys are distributed uniformly across “buckets” using a “hash function”. pointer search-key

Index on Sequential File • An index is another file containing key-pointer pairs of the form (K,a) • K is a search key • a is an address (pointer) • The record at address a has search key K • Particularly useful when the search key is the primary key of the relation • Possible Data structure • Simple index on sorted file • Secondary index on unsorted file • B – tree • Hash table

Index Evaluation Metrics • Access types supported efficiently. E.g., • records with a specified value in the attribute • or records with an attribute value falling in a specified range of values (e.g. 10000 < salary < 40000) • Access time • The time it takes to find a particular data item, or set of data item. • Insertion time • Time it takes to insert a new data item.This value includes the time it takes to find the correct place to insert the new data item, as well as the time it take to update the index structure • Deletion time • Space overhead

Ordered Indices • In an ordered index, index entries are stored sorted on the search key value. E.g., author catalog in library, index of book. • Primary index: in a sequentially ordered file, the index whose search key specifies the sequential order of the file. • Also called clustering index • The search key of a primary index is usually but not necessarily the primary key. • Secondary index:an index whose search key specifies an order different from the sequential order of the file. Also called non-clustering index. • Index-sequential file: ordered sequential file with a primary index.

Dense Index Files • Dense index — Index record appears for every search-key value in the file. In a dense primary index, the index record contains the search-key value and a pointer to first data record with that search-key value.

Sparse Index Files • Sparse Index: contains index records for only some search-key values. • Applicable when records are sequentially ordered on search-key • As is true in dense indices each index record contains a search-key value and a pointer to first data record with the search-key value. • To locate a record with search-key value K we: • Find index record with largest search-key value < K • We start at the record pointed to by that index entry, follow the pointers in the file until we find the desired record.

Sparse Index example • If we want to find the records for Perryridge branch. • Since there is no index entry for “Perryridge” the last entry before “Perryridge” is “Mianus” we follow that pointer. Then read the file in sequential order until we find the first Perryridge record, and begin processing at that point.

Sparse Index Files (Cont.) • Compared to dense indices: • Less space and less maintenance overhead for insertions and deletions. • Generally slower than dense index for locating records. • Good tradeoff: sparse index with an index entry for every block in file, corresponding to least search-key value in the block.

Example • Let assume file(database) have 100,000 records with 100 record per block. • Assign one index entry per block • By applying binary search it requires as many as [ log2 (b)] where b is no of block. • For our 100-block index, binary search requires seven block reads. • Assume per block on disk it takes 30 milliseconds, search will take 210 milliseconds, which is long.

Multilevel Index • If primary index does not fit in memory, access becomes expensive. • Solution: treat primary index kept on disk as a sequential file and construct a sparse index on it. • outer index – a sparse index of primary index • inner index – the primary index file • If even outer index is too large to fit in main memory, yet another level of index can be created, and so on. • Searching for records with a multilevel index requires significantly fewer I/O operations • Indices at all levels must be updated on insertion or deletion from the file.

Multilevel Index (Cont.)

Sequential File Sparse 2nd level 330 170 10 90 130 10 50 490 210 170 70 110 30 570 230 410 150 250 90 190 90 70 50 30 10 100 80 60 40 20 Two-Level Index Example

Index Update: Record Deletion • If deleted record was the only record in the file with its particular search-key value, the search-key is deleted from the index also. • Single-level index deletion: • Dense indices – deletion of search-key: similar to file record deletion. • Sparse indices – • If the index does not contain an index record with the search-key value of the deleted record, nothing needs to be done to index. • Otherwise the system takes the following actions: • If the deleted record was the only record with its search key the system replaces the corresponding index record with an index record for the next search-key value. If the next search-key value already has an index entry, the entry is deleted instead of replaced. • Otherwise, if the index record for the search-key value points to the record being deleted, the system updates the index record to point to the next record with the same search-key value.

10 70 50 20 60 80 40 40 Deletion from dense index • delete record 30 10 20 30 30 40 40 50 60 70 80

70 10 30 50 20 40 60 80 Deletion from sparse index 10 30 50 70 90 110 130 150

70 10 30 50 20 40 60 80 Deletion from sparse index • delete record 40 10 30 50 70 90 110 130 150

10 30 50 70 20 40 60 80 40 40 Deletion from sparse index • delete record 30 10 30 50 70 90 110 130 150

10 30 50 70 20 40 60 80 50 70 Deletion from sparse index • delete records 30 & 40 10 30 50 70 90 110 130 150

Index Update: Record Insertion • Single-level index insertion: • Perform a lookup using the key value from inserted record • Dense indices – if the search-key value does not appear in the index, insert it. • Sparse indices – if index stores an entry for each block of the file, no change needs to be made to the index unless a new block is created. • If a new block is created, the first search-key value appearing in the new block is inserted into the index. • Multilevel insertion (as well as deletion) algorithms are simple extensions of the single-level algorithms

10 30 40 60 20 50 Insertion into sparse index 10 30 40 60

10 30 40 60 20 50 34 • our lucky day! we have free space where we need it! Insertion into sparse index • insert record 34 10 30 40 60

10 30 40 60 20 50 15 20 20 30 Insertion into sparse index • insert record 15 10 30 40 60 • Illustrated: Immediate reorganization • Variation: • insert new block (chained file) • update index

Types of Single-Level Indexes • Secondary Index • A secondary index provides a secondary means of accessing a file for which some primary access already exists. • The secondary index may be on a field which is a candidate key and has a unique value in every record, or a nonkey with duplicate values. • The index is an ordered file with two fields. • The first field is of the same data type as some nonordering field of the data file that is an indexing field. • The second field is either a block pointer or a record pointer. There can be many secondary indexes (and hence, indexing fields) for the same file. • Includes one entry for each record in the data file; hence, it is a dense index

FIGURE 14.4A dense secondary index (with block pointers) on a nonordering key field of a file.

Secondary Indices Example • Index record points to a bucket that contains pointers to all the actual records with that particular search-key value. • Secondary indices have to be dense Secondary index on balance field of account

Primary and Secondary Indices • Indices offer substantial benefits when searching for records. • BUT: Updating indices imposes overhead on database modification --when a file is modified, every index on the file must be updated, • Sequential scan using primary index is efficient, but a sequential scan using a secondary index is expensive • Each record access may fetch a new block from disk • Block fetch requires about 5 to 10 micro seconds, versus about 100 nanoseconds for memory access

B+-Tree Index Files • Disadvantage of indexed-sequential files • performance degrades as file grows, since many overflow blocks get created. • Periodic reorganization of entire file is required. • Advantage of B+-treeindex files: • automatically reorganizes itself with small, local, changes, in the face of insertions and deletions. • Reorganization of entire file is not required to maintain performance. • (Minor) disadvantage of B+-trees: • extra insertion and deletion overhead, space overhead. • Advantages of B+-trees outweigh disadvantages • B+-trees are used extensively B+-tree indices are an alternative to indexed-sequential files.

Three Types B-Tree Nodes • Root node - contains node pointers to branch nodes. • Branch node - contains pointers to leaf nodes • Leaf node - contains index items and horizontal pointers to other leaf nodes.

A node in a search tree with pointers to subtrees below it.

B+-Tree Index Files (Cont.) A B+-tree is a rooted tree satisfying the following properties: • All paths from root to leaf are of the same length • Each node that is not a root or a leaf has between n/2 and n children. • A leaf node has between (n–1)/2 and n–1 values • Special cases: • If the root is not a leaf, it has at least 2 children. • If the root is a leaf (that is, there are no other nodes in the tree), it can have between 0 and (n–1) values. • Minimum no of keys in leaf node (n/2) • Minimum no of keys in non-leaf node is (n+1)/2 - 1

B+-Tree Node Structure • Typical node • Ki are the search-key values • Pi are pointers to children (for non-leaf nodes) or pointers to records or buckets of records (for leaf nodes). • The search-keys in a node are ordered K1 < K2 < K3 < . . .< Kn–1

Example of a B+-tree B+-tree for account file (n = 3)

Example of B+-tree • Leaf nodes must have between 2 and 4 values ((n–1)/2 and n –1, with n = 5). • Non-leaf nodes other than root must have between 3 and 5 children ((n/2 and n with n =5). • Root must have at least 2 children. B+-tree for account file (n = 5)

B+Tree Example n=4 Root 100 120 150 180 30 3 5 11 120 130 180 200 100 101 110 150 156 179 30 35

Sample non-leaf to keys to keys to keys to keys < 120 120 k<150 150k<180 180 120 150 180

Sample leaf node: From non-leaf node to next leaf in sequence 120 130 unused To record with key 120 To record with key 130

Observations about B+-trees • Since the inter-node connections are done by pointers, “logically” close blocks need not be “physically” close. • The non-leaf levels of the B+-tree form a hierarchy of sparse indices. • The B+-tree contains a relatively small number of levels • Level below root has at least 2* n/2 values • Next level has at least 2* n/2 * n/2 values • .. etc. • If there are K search-key values in the file, the tree height is no more than  logn/2(K) • thus searches can be conducted efficiently. • Insertions and deletions to the main file can be handled efficiently, as the index can be restructured in logarithmic time (as we shall see).

Queries on B+-Trees • Find all records with a search-key value of k. • N=root • Repeat • Examine N for the smallest search-key value > k. • If such a value exists, assume it is Ki. Then set N = Pi • Otherwise k Kn–1. Set N = Pn Until N is a leaf node • If for some i, key Ki = k follow pointer Pito the desired record or bucket. • Else no record with search-key value k exists.

Queries on B+-Trees (Cont.) • If there are K search-key values in the file, the height of the tree is no more than logn/2(K). • A node is generally the same size as a disk block, typically 4 kilobytes • and n is typically around 100 (40 bytes per index entry). • With 1 million search key values and n = 100 • at most log50(1,000,000) = 4 nodes are accessed in a lookup. • Contrast this with a balanced binary tree with 1 million search key values — around 20 nodes are accessed in a lookup • above difference is significant since every node access may need a disk I/O, costing around 20 milliseconds

Updates on B+-Trees: Insertion • Find the leaf node in which the search-key value would appear • If the search-key value is already present in the leaf node • Add record to the file • If the search-key value is not present, then • add the record to the main file (and create a bucket if necessary) • If there is room in the leaf node, insert (key-value, pointer) pair in the leaf node • Otherwise, split the node (along with the new (key-value, pointer) entry) as discussed in the next slide.

Updates on B+-Trees: Insertion (Cont.) • Splitting a leaf node: • take the n (search-key value, pointer) pairs (including the one being inserted) in sorted order. Place the first n/2 in the original node, and the rest in a new node. • let the new node be p, and let k be the least key value in p. Insert (k,p) in the parent of the node being split. • If the parent is full, split it and propagate the split further up. • Splitting of nodes proceeds upwards till a node that is not full is found. • In the worst case the root node may be split increasing the height of the tree by 1. Result of splitting node containing Brighton and Downtown on inserting Clearview Next step: insert entry with (Downtown,pointer-to-new-node) into parent

Updates on B+-Trees: Insertion (Cont.) B+-Tree before and after insertion of “Clearview”

Insertion in B+-Trees (Cont.) • Splitting a non-leaf node: when inserting (k,p) into an already full internal node N • Copy N to an in-memory area M with space for n+1 pointers and n keys • Insert (k,p) into M • Copy P1,K1, …, K n/2-1,P n/2 from M back into node N • Copy Pn/2+1,Kn/2+1,…,Kn,Pn+1 from M into newly allocated node N’ • Insert (Kn/2,N’) into parent N • Read pseudocode in book! Mianus Downtown Mianus Perryridge Downtown Redwood

Inserting a Data Entry into a B+ Tree 1)Find correct leaf node 2) Add index entry to the node 3) If enough space, done! 4) Else, split the node Redistribute entries evenly between the current node and the new node 5) Insert <middle key, ptr to new node> to the parent 6) Go to Step 3

Insert into B+tree First lookup the proper leaf; (a) simple case • leaf not full: just insert (key, pointer-to-record) (b) leaf overflow (c) non-leaf overflow (d) new root

32 n=4 (a) Insert key = 32 100 30 3 5 11 30 31

7 3 5 7 n=4 (b) Insert key = 7 100 30 3 5 11 30 31

160 180 160 179 n=4 (c) Insert key = 160 100 120 150 180 180 200 150 156 179

Indexing and Hashing: Basic Concepts and Comparison

Indexing and Hashing: Basic Concepts and Comparison

Presentation Transcript

Chapter 12: Indexing and Hashing

CS4432: Database Systems II

Indexing and Hashing

B+-Trees and Static Hashing

Chapter 5: Hashing

Chapter 5: Hashing

Hashing and Indexing

Indexing and Hashing

Chapter 7: Equity Portfolio Management

Appendix E-A Hashing

Chapter 12: Indexing and Hashing

Chapter 20

B-trees - Hashing

Web indexing

indexing and hashing

Chapter 8. Storage Management and Indexing Techniques

Chapter 9: Maps, Dictionaries, Hashing

Indexing