1.37k likes | 1.61k Vues
File Storage and Indexing. CMPT 454. Files and Indexing. File Organizations Indices Types of index Tree based indexing Hash based indexing. Finding Records. Accessing a table Consider SELECT * FROM Customer To make access to a table efficient Store the table on adjacent blocks
E N D
File Storage and Indexing CMPT 454
Files and Indexing • File Organizations • Indices • Types of index • Tree based indexing • Hash based indexing
Finding Records • Accessing a table • Consider SELECT * FROM Customer • To make access to a table efficient • Store the table on adjacent blocks • On the same cylinder, or • Adjacent cylinders • But many queries include where clauses • File organizations should support the efficient retrieval of records within a file
File Organizations • Data in a database consists of collections of records, or data files • Each file consists of one or more blocks on a disk • Afile organization is a method of arranging records in a file • File organizations make some operations more efficient • In addition, files can be indexed, to provide multiple ways to access records efficiently • Index files contain search key values and references to records
External Storage - Reminder • Database data must be persistent, so must be stored on secondary memory, such as a hard disk • Disk access is relatively inefficient, in the order of 10 to 15 milliseconds to access a single page • Hundreds of thousands of times more than an equivalent access to a main memory location • The cost of disk I/O dominates the cost of database operations • The unit that data is read from or written to disk is a block, typically 8 kilobytes • Reading several pages in sequence from a disk takes much less time than reading several random pages
File Organizations • The minimum set of file operations is: • Create and destroy files • Insert and delete records • Scan an entire file • A scan brings all of the records in the file into main memory • A single DB table is usually stored as a single file • Every record in a file has a unique record ID, or rid • A rid consists of a block address, and a slot number • The simplest file structure is an unordered file, or heap file
Heap Files • Heap files support insertion and deletion of records and file scans • Because the entire file can be scanned, individual records or collections of records can be found • New records are inserted where there is room • Either in slots that contained previously deleted records, or • At the end of the file • When a record is deleted, no other records are affected • That is, there is no need to reorganize remaining records
Sorted Files • The records in a sorted file, or sequential file, are stored in order • Based on the sort key of the file • The attribute of the record that the file is sorted on • The basic organization assumes that file pages are full, to conserve space • Pages should be maintained in sequence • To allow for more efficient disk access • Insertions result in records being shuffled up • Deletions result in records being shuffled down
Sorted Files in Practice • To avoid inefficiencies involved in inserting and deleting records from sorted files • Pages have only partial occupancy • i.e. space for future insertions is left in each page • Overflow pages can be attached (by pointers) to pages that become full • Records can be locally reorganized in adjacent pages • Sorted files may need to be periodically re-ordered
Indexes • An index is a data structure that organizes data to optimize the retrieval of records on some criteria • An index supports efficient retrieval of records based on the search key of the index • An index can be created for a file to speed up searches that are not efficiently supported by the file's organization • A file can have more than one index • An index is a collection of data entries which must contain: • A search key value, k, and • Information to find data records with that search key value
Primary Indexes • An index on a sequential file • The search key of the index is the same as the sort key of the file • Primary indexes can be either dense or sparse There are different kinds of keys Primary key Candidate key Superkey Sort key Search key
Dense Indexes • A dense index is a sequence of blocks containing search key : rid pairs • The rid contains addresses to records with the search key • The blocks of the index are in the same order as the file • Searching an index is faster than searching the file • The index is smaller • The index is sorted so binary search can be used • The index may be small enough to fit in main memory • If so, once the index has been read, records can be found with one disk I/O
Dense Index Two data records or four index records fit on one block file index
Sparse Indexes • A sparse index usually contains one data entry for each blockof records in a data file • It is only possible to use a sparse index if the data file is sorted by the search key of the index • Sparse indexes are smaller than dense indexes • Sparse indexes are searched in much the same way as dense indexes • Except that the index is searched for the largest key less than or equal to the target value • The rid is then followed to a block of the data file
Sparse Index Two data records or four index records fit on one block file index
Multiple Level Indices • An index on a large data file can cover many blocks • Even using binary search, multiple disk I/Os may be needed to find a record • An alternative is to build a multiple level index • The first level of the index may be dense or sparse • Subsequent levels of the index are sparse indexes on the preceding level of the index
Multiple Level Index multiple level index file
Secondary Indexes • We define a primary index is an index whose search key is the same as the sort key of a sequential file • There can only be one primary index for a file • Terminology – a primary index sometimes refers to an index where the search key includes the primary key, we do not use this definition • More terminology – primary indices are also referred to as clustered • A secondary indexis an index whose search key is not the sort key of the file • Secondary indexes must be dense • Why? • Secondary indices are also referred to as unclustered
Secondary Index file index
Using Secondary Indexes • The pointers in a block of a secondary indexes may go to many blocks of the data file • Making secondary indexes less efficient than primary indexes for retrieval of a range of records • Heap (data) files are not ordered so require secondary indexes
Clustered Files • A clustered file contains the records of two tables • Consider two tables in a one to many relationship • Where queries containing a join between the two tables are made frequently • Note the unfortunate re-use of the word clustered ... SELECT pet_name, species FROM Owner, Pet WHERE Owner.sin = 111 AND Owner.sin = Pet.sin
Clustered File Organization Pets of owner1 Pets of owner2 Pets of owner3 Pets of owner4
Clustered Files and Indexes • Consider a primary index on Owner • The file is sorted by owner • Individual owners can be retrieve rapidly • Pets of owners can also be retrieved rapidly • Efficiency is reduced when retrieving a range of owner data • Any index on attributes of Pets would be a secondary index
Indirection • A secondary index may waste space if search key values are repeated • Each record ID is paired with a search key • Create a bucket for each set of rids associated with a search key • Follow a pointer to the bucket, then • Follow the rids in the bucket to the records • Saves space if search key values are larger than record IDs • And each key appears at least twice on average
Indirection file index buckets
Multiple Secondary Indexes • Multiple secondary indexes using indirection can improve efficiency on queries with complex criteria • Collect all the rids from the buckets that meet each of the criteria • Then intersect them • And only retrieve records using the result • This avoids retrieving records that match some, but not all, of the criteria
Document Retrieval • There are issues related to the storage and efficient retrieval of documents • Keywords are used to identify documents • More and more documents are maintained on the web • A document can be considered as a record in a table • The record can be thought of as having Boolean attributes for each possible word in the document • An attribute is true if the word is in the document and false otherwise
Inverted Indexes • There is a secondary index on each attribute (word) of a document • Only those records where an attribute is true is contained in the index • So the index leads to documents with particular words • Indices for all attributes (words) are combined into a single index • Known as an inverted index • The index uses indirect buckets for space efficiency
Matching Documents • Inverted indexes of words in multiple documents can be used to satisfy queries • Using intersection documents that contain multiple target words can be found • Additional information can be maintained to match words in sections of documents • Titles, headers, and other sections • Number of occurrences of words • ...
Composite Indices • An index’s search key can contain several fields • Such search keys are referred to as composite search keys or concatenated keys • e.g. {fName, lName} • For searches on equality the values for each field in the search key must match the values in record • e.g. 'Joe Smith' does not match 'Joe Jones' or 'Fred Smith' • For range queries, ranges may be specified for any fields in the search key. • If no values are specified for a field it implies that any value is acceptable for that field
Tree Index Introduction • Multiple level indexes can be very useful in speeding up queries • There is a general data structure that is used in commercial DBMSs • Known as B trees • We will look at B+ trees, a commonly used variant • B trees have two desirable properties • They keep as many levels as are required for the file being indexed • Space on tree blocks is managed so that each block is at least ½ full
B Tree Structure • B trees are balanced structures • All paths from the root to a leaf have the same length • Most B trees have three levels • But any number of levels is possible • B trees are similar to binary search trees • Except that B tree nodes contain more than two children • That is, they have greater fan-out • B tree node size is chosen to be the same as a disk block
B+ Tree Node Structure • The number of data entries in a node is determined by the size of the search key • Up to nsearch key values and n + 1 pointers • The value n is chosen to be as large as possible and still allow n search keys and n + 1 pointers to fit on a block • Example • If block size is 4,096, and the keys are 4 byte integers and pointers are 8 byte addresses • Find the largest value n such that 4n + 8(n + 1) ≤ 4,096 • n = 340
Leaf Nodes • Search keys in leaf nodes are copies of the keys in the data file • The leaf nodes contain the keys in order • The left most n pointers point to records in the data file • A leaf node must use at least (n + 1)/2 of these pointers • The right most pointer points to the next leaf next leaf in the tree record with key 12 record with key 24 record with key 29
Interior Nodes • In interior nodes, pointers point to next level nodes • Label the search keys K1 to Kn, and pointers p0 to pn • Pointer p0 points to nodes whose search key values are less than K1 • Other pointers, pi, point to nodes with search keys greater than or equal to Kiand less than Ki+1 • An interior node must use at least (n + 1)/2 pointers K < 12 12 ≤ K < 24 24 ≤ K < 29 K ≥ 29
Example B+ Tree in this example n = 3 note that (n+1)/2 = 2
B+ Tree Applications • B+ trees can be used to build many different indexes • The B tree could be a sparse index on a sorted data file, or • A dense index on a data file • We will assume for now that there are no duplicate values for search keys • That is the search key is a candidate key for the relation • The meaning of interior nodes changes slightly if there are duplicate search key values
Searching a B Tree • The B+ tree search algorithm is similar to a BST • To search for a value K start at the root and end at a leaf • If the node is a leaf and the ith key has the value K then follow the ith pointer to the record • If the node is an interior node follow the appropriate pointer to the next (interior or leaf) node • Searching a B+ tree index requires a number of disk I/O operatuions equal to the height of the tree • Plus one I/O to retrieve the record • However the root of the tree may be in main memory
B+ Tree Searches which nodes are visited in a search for 22? which nodes are visited in a search for 16?
Range Queries • B trees are useful for processing range queries • A range query typically has a WHERE clause that specifies a range of values • Assume query specifies values from x to y • Search the tree for the leaf that should contain value x • Follow the leaf pointers until a key greater than y is found • The tree can also be used to satisfy queries that have no lower bound or no upper bound
B+ Tree Insertions • Insert the record in the in the data file • Retaining the ridand the search key value, K • Insert the entry in the appropriate place in a leaf • Use the search algorithm to find the leaf node • Insert a data entry, if it fits the process is complete • If the target leaf node is full then split it • The first (n + 1) / 2 entries stay in the original node • Create a new node with the remaining (n + 1) / 2 entries to the right of the original node • Insert an entry with the first search key value from the new leaf in its parent node that points to the new leaf
B +Tree Insertions • Adding an entry to an interior node may cause it to split • After inserting a new entry there should be n + 1 keys (and n + 2 pointers) • The first (n + 2) / 2 pointers stay in the original node • Create a new node with the remaining (n + 2) / 2 pointers to the right of the original node • Leave the first n / 2 keys in the original node and move the last n / 2 keys to the new node • The remaining key 's value falls between the values in the original and new node • This left over key is inserted into the parent of the node along with a pointer to the new interior node
B+ Tree Insertions continued • Moving a value to a higher, interior level of the tree, may again cause a split • The same process is repeated until no further splits are required, or until a new root node has been created • If a new root is created it will initially have just one key and two children • So will be less than half full • This is permitted for the root (only)
B+ Tree Insertion Example insert 2, 21 and 11 the values are maintained in order in the index pages data file
B+ Tree Insertion Example insert 8 create new root with the first value of the new leaf node create a new node with the last ½ of the values chain the new node to the original node
B+ Tree Insertion Example insert 64, then 5 insert 23 ... both leaf nodes are now full ...
B+ Tree Insertion Example ... inserting 23 ...