Database Implementation Issues
E N D
Presentation Transcript
Database ImplementationIssues CSCE 315 – Programming Studio, Spring 2019 Robert Lightfoot Slides adapted from those used by Jennifer Welch and John Keyser
Database Implementation • Typically, we assume databases are very large, used by many people, etc. • So, specialized algorithms are usually used for databases • Efficiency • Reliability
Storing Data • Other terminology for implementation • Relation is a table • Tuple is a record • Attribute is a field
Storing a Record (Tuple) • Often can assume all the fields are fixed (maximum) length. • For efficiency, usually concatenate all fields in each tuple. • Variable length: store max length possible, plus one bit for termination • Store the offsets for concatenation in a schema
Example: tuple storage • Senator • Name – variable character (100 + 1 bytes) • State – fixed character (2 bytes) • YearsInSenate – integer (1 byte) • Party – variable character (11 + 1 bytes) 116 0 101 103 104
More on tuples/records • So, schema would store: • Name: 0 • State: 101 • YearsInSenate: 103 • Party: 104 • Note that HW/efficiency considerations might give minimum sizes for each field • e.g. multiple of 4 or 8 bytes
Variable Length Fields • Storing max size may be problematic • Usually nowhere close – waste space • Could make record too large for a “unit” of storage (i.e., disk block) • Storing pointers may be problematic • Record is not known size, thus variable length records • Not much locality in memory • Extra storage for pointer • Dynamic memory allocation is slow
Variable Length Fields • Compromise: “local” pointers • Store fixed-length records, followed by variable-length • Variable data still kept locally • Header stores info about variable fields • Pointer to start of each • Not great if data in variable field will change
Record Headers • Might want to store additional key information in header of each record • Schema information (or pointer to schema) • Record size (if variable length) • Timestamp of last modification
Record Headers and Blocks • Records grouped into blocks • Correspond with a “unit” of disk/storage • Header information with record positions • Also might list which relation it is part of. • Concatenate records Header Record 1 Record 2 … Record n
Addresses • Addresses of (pointers to) data often represented • Two types of address • Location in database (on disk) • Location in memory • Translation table usually kept to map items currently in virtual memory to the overall database. • Pointer swizzling: updating pointers to refer to disk vs. memory locations
Records and Blocks • Sometimes want records to span blocks • Generally try to keep related records in the same block, but not always possible • Record too large for one block • Too much wasted space • Split parts are called fragments • Header information of record • Is it a fragment • Store pointers to previous/next fragments
Adding, Deleting, Modifying Records • Insertion • If order doesn’t matter, just find a block with enough free space • Later come back to storing tables • If want to keep in order: • If room in block, just do insertion sort • If need new block, go to overflow block • Might rearrange records between blocks • Other variations
Adding, Deleting, Modifying Records • Deletion • If want to keep space, may need to shift records around in block to fill gap created • Can use “tombstone” to mark deleted records • Modifying • For fixed-length, straightforward • For variable-length, like adding (if length increases) or deleting (if length decreases)
Keeping Track of Tables • We have a bunch of records stored (somehow). • We need to query them (SELECT * FROM table WHERE condition) • Scanning every block/record is far too slow • Could store each table in a subset of blocks • Saves time, but still slow • Use an index
Index • A general term – applies to other settings (e.g., index of a file, index of a dictionary, index of search engine) • It is a separate file describing the main data file • Describe here means giving random access to the data file by some search keys • Example: Apple Apricot Ball Banana Blazer Cannot Cat Code Coil …. A B C Index file Data file
Index in Database Terminology • Special data structures to find all records that satisfy some condition • Possible indexes • Simple index on sorted data • Secondary index on unsorted file • Trees (B-trees) • Hash Tables
Sorted files • Sort records of the relation according to field (attribute) of interest. • Attribute of interest is search key • Might not be a “true” key • Index stores (K,a) values • K = search key • a = address of record with K
Dense Index • One index entry per record • Useful if records are huge, and index can be small enough to fit in memory • Can search efficiently and then examine/retrieve single record only 1 5 7 7 10 12 18 18 18 27 30 35 43 44 65 73 1 5 7 7 10 12 18 18 18 27 30 35 43 44 65 73
Sparse Index(on sequential file) • Store an index for only every n records • Use that to find the one before, then search sequentially. 1 7 12 27 44 1 5 7 7 10 12 18 18 18 27 30 35 43 44 65 73
Multiple Indices • Indices in hierarchy • B-trees are an example 1 27 1 7 12 27 44 1 5 7 7 10 12 18 18 18 27 30 35 43 44 65 73
Duplicate Keys • Can cause issues, in both dense and sparse indexes, need to account for 1 7 12 27 44 1 5 7 7 10 12 18 18 18 27 30 35 43 44 65 73
What if not sorted? • Can be the case when we want two or more indices on the same data • e.g. Senator.name, Senator.party • Must be dense (sparse would make no sense) • Can sort the index by the search key • This second level index can be sparse
Example – Secondary Index 1 27 1 7 12 27 44 1 5 7 7 10 12 18 18 18 27 30 35 43 44 65 73 5 35 18 43 12 44 73 1 65 10 7 18 30 27 7 18
Buckets • If there are lots of repeated keys, can use buckets • Buckets are in between the secondary index and the data file • One entry in index per key – points to bucket file • Bucket file lists all records with that key
Example – Buckets 1 7 12 27 44 1 5 7 10 12 18 27 30 35 43 44 65 73 7 18 18
Putting It All Together • To handle variable length records (e.g. VARCHAR(x)), we now have a header for every record • If everything were fixed length and we did not care about wasted space, only per-table header would suffice • To deal with long disk latency, group as many records in a disk block as possible • The rest of the block is empty • We need a block header to keep track header Record 3 Record 2 Record 1 A disk block containing records from a table
Putting It All Together (contd.) • Insert/Delete/Update operations put this design into test • Delete is easier, just mark as deleted • Insert/Update might require a lot of “move-to-right” operations for making room for new/increased data • But empty spaces in most blocks, the “move-to-right” operation does not propagate too much • Usually an empty spot is found in the current or surrounding blocks • If no surrounding blocks found, overflow blocks can be used • Thus, insert/update operations does not undo sorted files header Record 3 Record 2 Record 1
Putting It All Together - Indices • Index is a separate file along with the data file • Often small enough to fit in the RAM • The main purpose of indices is to avoid scanning all disk blocks • Each index entry contains a key and a pointer to the record containing the key • Index often resides in the RAM, while the record in the disk • Two types of indices: • Dense Index (applies to sorted or unsorted files) • Sparse (only for sorted files)
Storage Considerations • Memory Hierarchy • Cache • Main Memory • Secondary storage (disk) • Tertiary storage (e.g. tape) • Smaller amounts but faster access • Need to organize information to minimize “cache misses”
Storage Considerations:Making things efficient • Placing records together in blocks for group fetch • Prefetching • Prediction algorithm • Parallelism/Striping • placing across multiple disks to read/write faster • Example: RAID • Mirroring • double the reliability
Storage ConsiderationsMaking it reliable • Checksums • Mirroring disks • Parity bits • RAID levels • Mostly, out of the scope of this course