
File Structures Course: 03-60-415


Presentation Transcript


  1. File Structures Course: 03-60-415. Dr. Joan Morrissey, School of Computer Science, University of Windsor, Windsor, Canada.

  2. What are file structures and why we need them. • A data structure is a method of organizing data in RAM/main memory. Examples: arrays, stacks, queues, binary trees, AVL trees, etc. • A file structure is a method of organizing data on secondary storage devices (SSDs) such as a hard disk, tape or CD-ROM. Examples: a simple index, secondary indexes, the family of B-trees, and hashed files. NO commercial DB is going to fit into RAM, and RAM is volatile. • File structures are used to minimize accesses to SSDs so that data can be retrieved quickly. Why SSDs? • RAM capacity is limited for very large databases, which may be held on several hard disks. • RAM is volatile, but we need permanent storage for our DB. • Cheaper for backup, archiving, and distribution of software, games, etc.

  3. Hard Disks. • The most common method of secondary storage for DBs. • However, disks are very slow, since we have physical parts to move – unlike RAM. • We must use buffering techniques to load sector(s) into RAM and back to disk – this also slows down access. • We can have zoned sectoring, which stores the maximum amount of data per track, or "hard" sectors, which all contain the same amount of data – more later! • What does a basic disk look like?

  4. A disk….

  5. An example of a disk with 7 cylinders….

  6. Tracks, cylinders and sectors with "hard" sectors. • A cylinder on a hard-sectored disk is all of the tracks numbered X. For example, cylinder 200 consists of track 200 on each side of each platter. • If you can get a file onto a single cylinder (or contiguous cylinders), this speeds up access to the data, as you don't have to move the R/W heads.

  7. Hard and Zoned sectors. • With hard sectors, each sector contains the same amount of data. Wasted space at the edge of the platters! • With zoned sectoring, each zone holds a different number of sectors per track, so outer tracks store more data. This gives better use of space, and we still have cylinders, but there is still some wasted storage. • The disk spins at a constant speed in both cases. Different to a CD.

  8. Advantages… and disadvantages of disks. • Fast (though slower than RAM) and cheap. Non-volatile. • With hard sectoring, tracks are organized into concentric circles and divided into sectors, each containing the same amount of data, with the same number of sectors on each track. This wastes space, as the sectors on the outside of the platter are not packed maximally with data. • Zone recording is more efficient, as we can store more data on the outer tracks. • Always have to consider how to minimize accesses to disk (seeks). • Seek = time to move to the right cylinder + rotational delay (so the required sector is under the R/W head) + head settle time. • A sector is the least amount of data that can be written or read. We don't want BIG sectors, because this causes problems with buffering to and from RAM. • Random and sequential access are both possible. More later!
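To put rough numbers on why we minimize seeks, here is a back-of-envelope Python sketch; the timing figures (8 ms average seek, 7200 RPM, 0.1 ms per sector) are illustrative assumptions, not values from the course.

```python
# Illustrative figures only (assumed): a 7200 RPM disk.
AVG_SEEK_MS = 8.0      # move the R/W heads to the right cylinder
ROT_DELAY_MS = 4.17    # rotational delay: half a revolution at 7200 RPM
TRANSFER_MS = 0.1      # stream one sector past the head

def random_read_ms(sectors: int) -> float:
    """Every random read pays a full seek plus rotational delay."""
    return sectors * (AVG_SEEK_MS + ROT_DELAY_MS + TRANSFER_MS)

def sequential_read_ms(sectors: int) -> float:
    """One seek, then the sectors stream past the head in turn."""
    return AVG_SEEK_MS + ROT_DELAY_MS + sectors * TRANSFER_MS

print(random_read_ms(100))      # ~1227 ms for 100 scattered sectors
print(sequential_read_ms(100))  # ~22 ms for the same data on one cylinder
```

The two orders of magnitude between the results are the whole motivation for file structures: pay for one seek, not one per record.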

  9. Other types of SSDs. Magnetic tape: • As of 2014, the highest-capacity tape cartridges can store 185 TB of data (Sony). Price? In 2013 a 4 TB cartridge was $30K… but prices will drop. • Only sequential access is possible. • Usually very robust. Good for backup and storage. • Nine parallel (horizontal) tracks on the tape. One bit slice = a byte + a parity bit. • Can have even or odd parity. If using even parity, the parity bit is set to make the number of ones even – a check on the correctness of the data. Format (elements on the tape): load point marker, file header, file info, data separated by inter-block gaps (IBGs), EOF marker, then the next file, and so on to the end-of-tape marker.
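A minimal Python sketch of the even-parity rule just described; the function name and sample byte values are illustrative.

```python
def even_parity_bit(byte: int) -> int:
    """Parity bit that makes the total number of 1 bits even."""
    return bin(byte & 0xFF).count("1") % 2

# One bit slice across the nine tracks = 8 data bits + 1 parity bit.
for b in (0b01101001, 0b11110000):
    p = even_parity_bit(b)
    total_ones = bin(b).count("1") + p
    print(f"{b:08b} parity={p} ones_even={total_ones % 2 == 0}")
```

If a single bit flips on the tape, the slice's count of ones becomes odd and the error is detected.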

  10. CDs – not used much except for s/w delivery. • Can be read-only or read/write. About 700 MB for single density. • Cheap, but very slow for random access to data. A sector is identified by minute:second, so random access is by "trial and error". • Not robust: easy to scratch, dust can be a problem, and they degrade over time. • A single spiral track consisting of pits and lands. Pits are nanometres deep – the space between two pits is a land. • Read by a red laser beam. A change in light intensity, moving from a pit to a land or from a land to a pit, = 1. • The disc speeds up as you read towards the centre, to ensure that the same amount of data passes the read head in a given period of time. Important with data, but not with music! • Double density is possible: narrower tracks, or two layers of material on top of one another where the laser can read through the top layer. The latter is basically a DVD!

  11. Flash drives or USB drives. • Usually 32/64 GB of storage, but more is available for a price! 1 TB is already available for $2000, with plans for larger capacities. Prices will drop. • Fairly robust – human error more often leads to them being broken or lost. • Very fast, as there are no moving parts. • A limited number of r/w cycles before degrading – but they usually get lost before that! • The term "drive" persists because computers read and write flash-drive data using the same system commands as for any other drive. • Non-volatile storage. • Very convenient! Also… • External hard drives: important for safe backup. • Cloud storage: becoming cheaper! • Solid-state disks: non-volatile storage with no moving parts. Buffering is faster, as is seek time. Degrades with time. 250 GB for $100+. (A 4 TB hard drive: $200!)

  12. Indexing… important as it gives faster access to data. A simple index for fixed-length records. Data File: consists of entry-sequenced records (not sorted). The records are held on disk, but only the PK (the name) is shown. Data is always appended at EOF.

  Data file:
  RRN   Record (only the PK shown)
  0     Jones record…
  1     Burke record…
  2     Adams record…
  3     Smith record…

  13. Primary index for fixed-length records. Primary Index: consists of PK and RRN (which gives the position of a record relative to the beginning of the file – assumes fixed-length records). It is sorted and fixed length. (For the data on the previous slide.)

  PK      RRN
  Adams   2
  Burke   1
  Jones   0
  Smith   3

  14. How do we find records in the data file? • Load the primary index into RAM from the disk! • Do a binary search of the PKs. • Pick up the corresponding RRN. • Seek to the record's position on disk and buffer the record into RAM. • Note that the first RRN is always 0, and RRNs are only used with fixed-length records: RRN × record length = byte offset. • Note also that new data is always added at the end of the data file on the disk – faster!
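As a concrete sketch of these steps, here is a minimal Python version using the index from slide 13. The record length, file name and function name are illustrative assumptions; the data file is assumed to hold fixed-length records back to back.

```python
import bisect

RECORD_LENGTH = 64                       # assumed fixed record length in bytes

# Primary index loaded into RAM: sorted PKs with their parallel RRNs (slide 13).
pks  = ["Adams", "Burke", "Jones", "Smith"]
rrns = [2, 1, 0, 3]

def find_record(datafile, key):
    i = bisect.bisect_left(pks, key)     # binary search of the sorted PKs
    if i == len(pks) or pks[i] != key:
        return None                      # key not in the index
    offset = rrns[i] * RECORD_LENGTH     # RRN x record length = byte offset
    datafile.seek(offset)                # one seek on the disk
    return datafile.read(RECORD_LENGTH)  # buffer the record into RAM

# Usage (hypothetical file name):
#   with open("suppliers.dat", "rb") as f:
#       print(find_record(f, "Adams"))
```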

  15. A simple index for variable-length records. Data File: consists of entry-sequenced records (not sorted). (0) Jones record…… (200) Adams record……… (350) Smith record……

  16. Primary index for variable-length records, for the data on the last slide. Primary Index: consists of PK and byte offset (from the start of the file – byte 0). It is sorted (by PK) and fixed length.

  Primary Key   Byte offset
  Adams         200
  Jones         0
  Smith         350

  17. Finding a variable-length record. For example, retrieve the Smith record: 1. Load the primary index into RAM. 2. Use a binary search to find the PK (Smith) in the primary index – we can do so because the index entries are fixed length. 3. Retrieve the corresponding byte offset. 4. Seek to the record in the data file using the byte offset. 5. Read the Smith record.
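The same steps in Python for variable-length records, using the index from slide 16. Since the slides don't say how record boundaries are marked, a 2-byte length prefix in front of each record is assumed here purely for illustration.

```python
import bisect

# Primary index (sorted, fixed-length entries) for the data file on slide 15.
pks     = ["Adams", "Jones", "Smith"]
offsets = [200, 0, 350]                 # byte offsets from the start of the file

def find_variable_record(datafile, key):
    i = bisect.bisect_left(pks, key)    # binary search works: entries are fixed length
    if i == len(pks) or pks[i] != key:
        return None
    datafile.seek(offsets[i])           # seek straight to the record: no RRN arithmetic
    size = int.from_bytes(datafile.read(2), "big")  # assumed 2-byte length prefix
    return datafile.read(size)          # read exactly one variable-length record
```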

  18. Note: • We assume that the simple index can fit in RAM, but not the data file. • A "seek" is an access to a secondary storage device. • We can't use RRNs with variable-length records. • What if the primary index is too large to fit into RAM? First, consider using secondary keys to access records: for example, get all the records where the supplier is in Paris. Note that secondary keys do not have to be unique. • Note: in the examples shown, the data file consists of records where the PK is the name, but there is also other information in each record.

  19. Secondary indexes – work with a primary key file. • The secondary index is fixed length and sorted, so that it can fit into RAM all at once. • The primary key file is fixed length and entry sequenced – i.e. data is added at EOF. • The data file is never sorted and never fits into RAM. • "Next RRN" in the primary key file is simply a linked list. • The primary index, the secondary index and the primary key file are all needed to access the data file, but we don't need them all in RAM at once. Only the needed data is buffered into RAM.

  20. Data file for the secondary index & primary key file. Note that the data is not sorted; data is always added at the end of the file for efficiency. The first record is RRN 0, and so on. The RRN is not part of the file. Now look at constructing a secondary index based on city… [Diagram: data file records showing Name (PK), City and other data, with RRNs 0, 1, 2, … alongside.]

  21. Secondary Index based on City & primary key file. [Diagram: the secondary index (City, first RRN) next to the primary key file (PK, Next RRN).]

  22. Finding Records. How do we retrieve all the records where the supplier is in Paris? 1. We do a binary search of the secondary index to find the SK "Paris". 2. We retrieve the corresponding first RRN, 1. 3. We seek to RRN 1 in the primary key file and get the PK "Jones" – the first supplier in Paris. 4. We follow the linked list, using Next RRN, to pick up Black (6) and Liu (7) – the other suppliers in Paris. 5. We now have 3 PKs: Jones, Black and Liu. 6. The -1 indicates the end of the linked list. 7. For each of the PKs we do a binary search of the primary index and pick up the corresponding byte offset or RRN. 8. Finally, we retrieve the records from the data file on disk using the RRN or offset.
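Steps 1-6 as a Python sketch, using the Paris chain described above (Jones at RRN 1, Black at 6, Liu at 7, with -1 ending the chain); the remaining PKs and the city list are made-up filler, since the transcript doesn't reproduce the full tables.

```python
import bisect

# Secondary index: sorted cities with the first RRN of each chain.
cities    = ["London", "Paris"]
first_rrn = [0, 1]

# Primary key file (entry sequenced): (PK, Next RRN); -1 ends a chain.
pk_file = [
    ("Smith", -1),   # RRN 0
    ("Jones",  6),   # RRN 1: first Paris supplier
    ("Adams", -1),   # RRN 2  (filler)
    ("Burke", -1),   # RRN 3  (filler)
    ("Chan",  -1),   # RRN 4  (filler)
    ("Clark", -1),   # RRN 5  (filler)
    ("Black",  7),   # RRN 6: next Paris supplier
    ("Liu",   -1),   # RRN 7: -1 marks the end of the Paris chain
]

def pks_for_city(city):
    i = bisect.bisect_left(cities, city)      # 1. binary search of the secondary index
    if i == len(cities) or cities[i] != city:
        return []
    pks, rrn = [], first_rrn[i]               # 2. first RRN of the chain
    while rrn != -1:                          # 4./6. follow Next RRN until -1
        pk, rrn = pk_file[rrn]
        pks.append(pk)                        # 5. collect the PKs
    return pks                                # 7./8. each PK then goes through the primary index

print(pks_for_city("Paris"))                  # ['Jones', 'Black', 'Liu']
```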

  23. B-Trees. • A B-tree is simply a large index. It will never fit in RAM – only parts of the B-tree will fit – and the aim is to reduce seeks while it is being used. A seek involves the physical movement of a read/write head on the device and is thus very slow in comparison to RAM. • The basic unit of the B-tree is the node – conceptually an ordered sequence of keys, references (RRN or byte offset) and pointers. • For example, an order-7 B-tree node has 6 keys, 6 corresponding references and 7 pointers. The order is the maximum number of pointers that the node can have. Each pointer points to another node in the B-tree. Very efficient. For example, take K: the pointer on its left points to the node containing keys that are greater than F but less than K. [Node diagram, references left out for clarity: keys A F K O U Z in order, with one pointer before A, one between each adjacent pair of keys, and one after Z – 7 pointers in all.]

  24. Example of a complete (small) B-tree of order 4. [Diagram: the root (node 7) holds N; below it, interior nodes such as node 6 (whose rightmost key is W) route the search; the leaf nodes – which have no pointers, e.g. node 9, which holds Z – carry the remaining keys A–Z. Node numbers come from how the tree is built and are left out to simplify the diagram.]

  25. Finding a Record. For example, find the record with key "Z". • Load the root (node 7) into RAM. The root node is always loaded into RAM first; we may even keep it there while we are using the tree, if we have space – improves efficiency. • N is less than Z, so follow the right pointer to node 6. • Load node 6 into RAM. Do a binary search of the node and follow the pointer to the right of W, to node 9, since Z is greater than W. • Load node 9 into RAM and do a binary search to find Z. Follow the reference (pointer to disk) to find the "Z" record. • Load the Z block into RAM. Note that we use a "modified" binary search. B-trees are very powerful when the node size is large: for example, with node size 512 we can access more than 134 million records with a maximum of 3 seeks.
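A sketch of this search in Python. The Node class, reference values and the toy two-level tree are illustrative assumptions (the real tree on slide 24 has more nodes); each recursive call stands for one seek that loads a node from disk into RAM.

```python
import bisect
from dataclasses import dataclass, field

@dataclass
class Node:
    keys: list                                    # sorted keys in the node
    refs: list                                    # disk reference (RRN/byte offset) per key
    children: list = field(default_factory=list)  # child nodes; empty in a leaf

def search(node: Node, key: str):
    # Loading `node` models one seek; the search inside it happens in RAM.
    i = bisect.bisect_left(node.keys, key)        # the "modified" binary search
    if i < len(node.keys) and node.keys[i] == key:
        return node.refs[i]                       # found at this level: follow the reference
    if not node.children:
        return None                               # reached a leaf without finding the key
    return search(node.children[i], key)          # child i holds the keys below keys[i]

# Toy tree echoing the shape of slide 24: root "N" with two leaves.
left  = Node(keys=["A", "B"], refs=[10, 11])
right = Node(keys=["Q", "W", "Z"], refs=[12, 13, 14])
root  = Node(keys=["N"], refs=[9], children=[left, right])
print(search(root, "Z"))  # 14: two node loads, then one more seek for the record itself

# With order 512 and height 3: 512**3 = 134,217,728 records in at most 3 seeks.
```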

  26. When to use a B-tree. • When you need very fast indexed access to records. • The objective is to keep the B-tree as shallow as possible (the lowest possible number of levels). This is achieved by increasing the node size – which is fixed for a particular tree. For example, 512 (or more) pointers per node. • Limited only by what size of node can fit in RAM and the time needed to do a binary search of a node in RAM. • Note that nodes need not be full (and probably won't be) – it depends on the order in which records are added. But, as a property, each node MUST be at least half full. • The root (the top of the tree) can have as few as two pointers. The leaves have no pointers. • You must have a primary key and a reference (RRN or byte offset). A pointer will be a cylinder, track and sector number (on a hard disk), giving where the next node is located.

  27. Advantages: • Can find the information to get the record at any level in the tree. (Not true of B+-Trees.) • Can, if needed, be used for sequential access to data, if you do an in-order traversal of the tree and the data is stored in the tree. (A B+-Tree is much more efficient for this task.) • The tree is always balanced – all leaves are always on the same level, because the tree is built from the leaf nodes up to the root, not the other way around.

  28. B+-Trees: indexed sequential access. • Many applications need both indexed access (for example, through a B-tree) and sequential (in-order) access. • Example: student records. Indexed: print the transcript for an individual student. Sequential: update grades for all students registered in 60-415-01. • Therefore, we need a file structure which allows both (a) random (indexed) access to a single record and (b) sequential access to all records by primary key. • Solution: B+-Trees. • Problem: keeping the records in physical order by key. Do we sort the file every time we get a new record? No! Too expensive. Solution: keep records in sorted blocks connected by a linked list, so that the blocks are logically kept in sorted order. Note that the blocks can be anywhere on the disk, but ideally close together.

  29. The sequence set. Each block contains records and is sorted by PK. We must be able to fit at least two blocks in RAM together, to merge blocks and move records. Advantage: you never have to keep the whole file sorted – just the blocks.

  30. Disadvantages of the sequence set: • Blocks may not be full – internal fragmentation, i.e. space wasted in the file. However, blocks must be at least half full. • The linked list must be maintained as records are inserted and deleted – this may cause the addition or deletion of blocks. Records are moved to keep blocks half full. • Blocks are not stored in physical order, so more seeks may be necessary to print in sorted order. What's a good block size? • Requires no more than one seek. • Must be able to fit two blocks in RAM (+ code) so that blocks can be merged or split – caused by the deletion or insertion of a record.

  31. How do we access the blocks? • We place a B-tree (the index set) on top of the blocks = a B+-Tree. • The purpose of the B+-Tree is to locate a block of records, which is then loaded into RAM and (a) searched for the required record or (b) processed record by record in order. Do (b) by following the linked list of blocks. • The most common type of B+-Tree is the simple prefix B+-Tree, but it is only used when the keys (separators) can be shortened. • We don't use all the keys in the B-Tree part. We use strings called separators to distinguish between one block (of records) and another – see the sketch below. • We use the shortest possible string as the separator. This is what makes it "simple prefix". • Height balanced too, the same as a B-Tree: all leaves at the same level. • We also want to keep the B+-Tree as shallow as possible – easier, since we use separators rather than all keys as in a B-Tree.
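A sketch of how the shortest separator might be chosen; the function name and example keys are illustrative assumptions. One common approach is to take the shortest prefix of the right-hand block's first key that still sorts above everything in the left-hand block.

```python
def shortest_separator(last_in_left: str, first_in_right: str) -> str:
    """Shortest prefix of the right block's first key that still sorts
    strictly above the left block's last key."""
    for n in range(1, len(first_in_right) + 1):
        prefix = first_in_right[:n]
        if last_in_left < prefix:
            return prefix            # short string stored in the index set
    return first_in_right            # no shorter prefix works: keep the full key

print(shortest_separator("CAMP", "CARTWRIGHT"))  # 'CAR', not the full key
```

Storing "CAR" instead of "CARTWRIGHT" is exactly the compression that lets each index-set node hold more separators, keeping the tree shallow.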

  32. Example of a simple prefix B+-Tree. [Diagram: the index set of separators above, pointing down to the sequence set of linked blocks.] Remember that the sequence set is in logical but not physical order.

  33. Properties of, and when to use, a B+-Tree. • Use when both indexed and sequential access to the records is needed. • You always need to go to the leaf level to retrieve a block of records. Not true of B-trees. • Separators rather than keys are used – a more efficient tree. • Usually shallower than a B-Tree. • Use a simple prefix B+-Tree when the keys will compress and space is a problem. The cost is a more complex structure and code. However, sometimes we need really fast access to data. An example is reading price-code labels at a supermarket checkout. Any type of B-tree would be too slow. Which brings us to…

  34. Hashing on a disk… • The best method for really fast access to records stored on an SSD is hashing. Note that hashing on SSDs is done somewhat differently from hashing in RAM, as the objective is to minimize disk accesses. Advantages: • Direct access to the record, as no index is used. • Saves space, since we have no index (simple, B-tree or B+-Tree). • Fast inserts and deletes in the data file (the file of records). • An average of less than 2 seeks to retrieve any record. Disadvantages: • Can't be used with variable-length records. • Very difficult to sort the data file. • Secondary keys are not possible with a simple hashed file.

  35. What is hashing? • A hash function, h(key), transforms the key into a home address – an address on a secondary storage device, for example on a hard disk. • The addresses produced are "random". That is, every address is equally likely to be produced by the hash function. • Two or more keys may hash to the same home address. This is called a collision, and the keys are called synonyms. We must have methods for dealing with this problem.
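A minimal sketch of one possible h(key) (fold-and-add); the folding scheme and the 49-slot file size are assumptions for illustration, not the course's prescribed function.

```python
def h(key: str, file_size: int) -> int:
    """Fold pairs of characters into numbers, add them up,
    and take the remainder as the home address."""
    total = 0
    for i in range(0, len(key), 2):
        pair = key[i:i + 2].ljust(2)     # pad a trailing odd character
        total += ord(pair[0]) * 256 + ord(pair[1])
    return total % file_size             # address in 0 .. file_size - 1

print(h("LOWELL", 49), h("LOCK", 49))    # two keys that land on the same
                                         # address would be synonyms
```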

  36. Hashing – a simplified diagram. [Diagram: keys passed through h(key) to home addresses in the hash file on disk.]

  37. Building the file & retrieving a record. • Assume no collision resolution at this stage. • Set aside a number of addresses – always more than the number of records (usually about twice as many, for optimal reduction of collisions). • Apply the hash function to all keys and place the records at those addresses in the data file on the hard disk. Retrieving a record from the file: • Apply the hash function to the supplied key (from the query) to get the corresponding address. • Seek to the address and move the record into RAM. • If you come to the end of the file, simply start again at the beginning and keep searching until you reach the point where you started. If that happens, the record is not in the file!

  38. Collisions happen! • Collisions must be resolved, since (for the moment) we can't have 2 records at the same address. • Ideally: find the perfect hash function, which never produces collisions – impossible! • Solution: develop algorithms – called collision resolution methods – which minimize collisions. Methods include: • Choose a hash function which distributes the records at least randomly. In this case, every address is just as likely to be produced. • Use extra addresses, so that collisions are less likely. The cost is space which will never be used – but we get fewer seeks. • Use progressive overflow and chained progressive overflow. • Put more than one record at an address – the address is then called a bucket.

  39. Progressive overflow. To place a record: apply the hash function to the PK to produce an address. If the address is already in use (busy), continue searching forward in the file for an empty slot in which to place the record. To find a record: 1. Apply the hash function to get the home address. 2. Perform a sequential search from the home address until the record is found. 3. What if you come to the end of the file? Wrap around to the first address in the file and continue the search. A sketch follows.
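A minimal sketch of progressive overflow, with a Python list standing in for the hash file on disk; the slot count and hash function are assumptions, and the file is assumed never completely full (slide 37 sets aside roughly twice as many addresses as records).

```python
FILE_SIZE = 11
slots = [None] * FILE_SIZE           # None = empty slot in the hash file

def home(key: str) -> int:           # any hash function will do for the sketch
    return sum(map(ord, key)) % FILE_SIZE

def insert(key, record):
    a = home(key)
    while slots[a] is not None:      # home address busy: a collision
        a = (a + 1) % FILE_SIZE      # walk forward, wrapping at end of file
    slots[a] = (key, record)         # assumes the file is never completely full

def find(key):
    start = a = home(key)
    while slots[a] is not None:      # an empty slot ends the search: not in file
        if slots[a][0] == key:
            return slots[a][1]
        a = (a + 1) % FILE_SIZE      # sequential search from the home address
        if a == start:
            return None              # wrapped all the way round: not in file
    return None
```

Every extra probe in `find` would be one more seek on a real disk, which is why the method is simple but slow.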

  40. Progressive overflow… continued. How do you know if a record is not in the file? Stop searching when one of the following happens: • You return to the home address. • You find an empty slot – the record would have been stored there if it were in the file. (Another good reason for a low packing density.) Advantage of progressive overflow: very simple to implement. Disadvantages of progressive overflow: • Very slow, because of the sequential search to find an empty slot – and slow means expensive! • Can cause clusters of "overflow" records, and thereby increase the number of seeks. This happens because you always use the next available empty slot with this method – it does not spread out the records.

  41. Chained progressive overflow – an improvement. A variation in which we use a linked list of synonyms to reduce the number of seeks. [Table: each slot shows its RRN, home address, PK, record and a Next RRN field linking the chain of synonyms.]

  42. Advantages & Disadvantages. Advantage: a reduced number of seeks. Look at finding Liu! Disadvantages: • Still get clustering of records. • A linked list to maintain – makes inserts more complicated. • Can't always get into the right linked list (of synonyms) by starting at the home address of a record. The problem occurs when there is already another record at the home address as a result of chained progressive overflow – for example, see the next slide. Solution: do a sequential search to find the right record and thus get into the right linked list. • Better solution: have a primary data area (home addresses only) and a separate overflow data area where synonyms are placed and linked by pointers.

  43. Chained progressive overflow – the problem. Won't be able to find the Wang record by starting at home address 2! A huge problem. [Table: slots showing RRN, home address, PK, record and Next RRN; the Wang record's home slot already holds a record from another chain.]

  44. Primary Data Area & Overflow Area. [Diagram: the primary data area holds records at their home addresses; synonyms are chained by pointers into a separate overflow area.]

  45. Primary data area & overflow data area. Advantage: you can always find a record by starting at its home address – you will always get into the correct linked list. Disadvantage: there are now two files to maintain – more overhead and more complicated code. There is also a linked list to maintain in the overflow area, but it is smaller than in chained progressive overflow.

  46. Other collision resolution methods. Buckets: store more than one record at each address. A bucket is usually one or two sectors, or a block, on the disk. Don't make them too big, as there is a trade-off between bucket size and the time required to buffer a bucket into RAM. • The hash function now produces a home bucket address. • We still get some collisions – but far fewer. Progressive overflow can deal with them, and clusters of buckets are rare. Double hashing: if a collision occurs, a second hash function is applied to the key to give a number X. X is then added to the home address to give the actual address (if this address is occupied, X is added again) – see the sketch below. • Advantage: spreads out the records, making collisions (and clusters) less likely, and reduces the average number of seeks needed to find a record. • Disadvantage: removes locality – a record may be placed on a different cylinder, which causes an extra seek. So try to keep synonyms on the same cylinder.
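A sketch of the double-hashing probe sequence just described; both hash functions are made up for illustration. A prime file size helps the step X eventually visit every address.

```python
FILE_SIZE = 11                          # prime, so any step size cycles through all slots

def h1(key: str) -> int:                # first hash: the home address
    return sum(map(ord, key)) % FILE_SIZE

def h2(key: str) -> int:                # second hash: the step X, never zero
    return 1 + (sum(ord(c) * 31 for c in key) % (FILE_SIZE - 1))

def probes(key: str, max_probes: int = 5):
    """Addresses tried: home, home+X, home+2X, ... all mod FILE_SIZE."""
    a, x = h1(key), h2(key)
    for _ in range(max_probes):
        yield a
        a = (a + x) % FILE_SIZE         # address occupied? add X again

print(list(probes("LIU")))              # e.g. [3, 8, 2, 7, 1]: spread out, not clustered
```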

  47. How do we handle deletions of records from the file? Two issues to consider: • We want to reuse the slot (space on disk). • We don't want deletions to interfere with the search for a record in the file (remember, in progressive and chained progressive overflow we stop searching when we find an empty slot). Solution: insert a special marker (called a tombstone) when we delete, to indicate that a record was there but has been deleted. However, we do not put in a tombstone if the slot after it is empty, as this would lengthen the search for a (non-existent) record – see the sketch below.
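A sketch of the tombstone rule, reusing the in-RAM layout of the earlier progressive-overflow sketch; the "#" marker is an arbitrary choice.

```python
EMPTY, TOMBSTONE = None, "#"          # '#': a record was here but was deleted

def find(slots, key, home):
    a = home
    while slots[a] is not EMPTY:      # only a truly empty slot stops the search
        if slots[a] is not TOMBSTONE and slots[a][0] == key:
            return a
        a = (a + 1) % len(slots)      # probe straight past tombstones
        if a == home:
            return None               # searched the whole file
    return None

def delete(slots, key, home):
    a = find(slots, key, home)
    if a is not None:
        nxt = (a + 1) % len(slots)
        # Leave a tombstone only if the next slot is occupied; otherwise an
        # empty slot ends searches here anyway, so don't lengthen them.
        slots[a] = TOMBSTONE if slots[nxt] is not EMPTY else EMPTY
```

A later insert can reuse a tombstoned slot, which addresses the first issue (reclaiming space) without breaking the search rule.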

  48. File degradation. Problem: performance deteriorates over time as records are added and deleted. Specifically, tombstoned slots can be taken over by overflow records, which makes search lengths longer than they need to be. Solutions: • Reorganize (move records around) after a delete – expensive and complicated code. • Use a different collision resolution method. However, you can still run into problems after time has elapsed. • Rehash the file when the average search length (the average number of seeks to find a record) becomes unacceptable. The best solution!
