
Chapter 6: Organizing Files for Performance
Objectives: To get familiar with:
• Data compression
• Storage management
• Internal sorting and binary search

Outline
• Data compression
• Reclaiming space in files
  • Record deletion
  • Dynamic space reclaiming for fixed-length records
  • Dynamic space reclaiming for variable-length records
  • Storage fragmentation
• Internal sorting and binary search
• Keysorting

Data Compression
• Data compression: encoding a file so that it takes up less space. A smaller file
  • uses less storage,
  • can be transmitted faster,
  • can be processed faster sequentially.
• Encoding with a different notation
  • The “State” field in the address file requires two bytes. However, 50 states can be encoded using 6 bits (2^6 = 64 ≥ 50), so the field fits in one byte: a 50% space saving for each occurrence of the state field.
  • The compact notation is a redundancy reduction technique.
  • Costs:
    • The file is not readable by humans.
    • The overhead of encoding and decoding operations.
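As a rough illustration of the compact notation, the sketch below maps a two-letter state abbreviation to a one-byte code. The table `STATES` is a hypothetical, truncated list used only for the example (a real table would list all 50 states), and the function names are assumptions, not part of the original slides.

```python
# Hypothetical, truncated lookup table; a real one would hold all 50 states.
STATES = ["AL", "AK", "AZ", "AR", "CA", "IA", "OK"]
CODE_OF = {abbr: code for code, abbr in enumerate(STATES)}

# 50 distinct values need 6 bits, because codes 0..49 fit in 6 bits.
BITS_NEEDED = (50 - 1).bit_length()


def encode_state(abbr: str) -> bytes:
    """Replace the 2-byte abbreviation with a 1-byte numeric code."""
    return bytes([CODE_OF[abbr]])


def decode_state(code: bytes) -> str:
    """Recover the abbreviation; this is the decoding overhead the slide mentions."""
    return STATES[code[0]]


packed = encode_state("OK")
print(BITS_NEEDED, len("OK".encode("ascii")), len(packed), decode_state(packed))
```

The encoded field is no longer readable without the table, which is the cost noted in the slide.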

Data Compression (cont’d)
• Suppressing repeating sequences
  • Suitable for sparse arrays or images with regions of the same color.
  • Run-length encoding: choose an unused byte value to indicate that a run-length code follows that byte.
  • Encoding algorithm:
    • Read through the data (pixels or values) that make up the image or data content, copying the data values to the file in sequence, except where the same value occurs more than once in succession.
    • Where the same value occurs more than once in succession, substitute the following three entries:
      • the special run-length code indicator,
      • the data value that is repeated, and
      • the number of times that the value is repeated.
  • Example (hexadecimal byte values, with ff as the run-length indicator):
    Input:   50 51 52 52 52 52 52 53 54 54 54 54 54 54 54 55 52 52 53 53 53 54
    Encoded: 50 51 ff 52 05 53 ff 54 07 55 ff 52 02 ff 53 03 54
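The encoding algorithm above can be sketched as follows. This is a minimal illustration, assuming 0xff never occurs as a data value and that runs are capped at 255 so the count fits in one byte; the function names are hypothetical.

```python
RUN_MARKER = 0xFF  # assumed to be a byte value that never appears in the data


def rle_encode(data: bytes) -> bytes:
    """Replace every run of two or more equal bytes with marker, value, count."""
    out = []
    i = 0
    while i < len(data):
        value = data[i]
        run = 1
        # Extend the run while the next byte matches (cap at 255: one count byte).
        while i + run < len(data) and data[i + run] == value and run < 255:
            run += 1
        if run > 1:
            out += [RUN_MARKER, value, run]
        else:
            out.append(value)
        i += run
    return bytes(out)


def rle_decode(encoded: bytes) -> bytes:
    """Expand marker, value, count triples back into runs."""
    out = []
    i = 0
    while i < len(encoded):
        if encoded[i] == RUN_MARKER:
            out += [encoded[i + 1]] * encoded[i + 2]
            i += 3
        else:
            out.append(encoded[i])
            i += 1
    return bytes(out)


pixels = bytes.fromhex(
    "50 51 52 52 52 52 52 53 54 54 54 54 54 54 54 55 52 52 53 53 53 54"
)
encoded = rle_encode(pixels)
print(" ".join(f"{b:02x}" for b in encoded))
# prints: 50 51 ff 52 05 53 ff 54 07 55 ff 52 02 ff 53 03 54
```

Note that a run of only two values (52 52) grows from two bytes to three under this rule; an implementation could choose to encode only longer runs.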

Data Compression (cont’d)
• Variable-length encoding
  • Letters with high frequency are encoded using shorter symbols.
  • Letters with low frequency are encoded using longer symbols.
  • Huffman code (for a set of seven letters):
    • Codes are at most four bits per letter (a fixed-length code would need at least 3 bits per letter).
    • The string “abefd” is encoded as “1010000100100000”.
  • Huffman codes are used in some UNIX systems for data compression.
• Irreversible compression techniques
  • Voice coding
  • Some image coding schemes that change pixel granularity or reduce color quality
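The slide's code table did not survive, so the sketch below uses an assumed table for the seven letters a to g. It was chosen because it is prefix-free, uses at most four bits per letter, and reproduces the encoding of “abefd” given above; the actual table on the slide may differ.

```python
# Assumed code table (not recovered from the slide): the frequent letter "a"
# gets the shortest code, and no code is a prefix of another.
CODES = {
    "a": "1",
    "b": "010",
    "c": "011",
    "d": "0000",
    "e": "0001",
    "f": "0010",
    "g": "0011",
}


def encode(text: str) -> str:
    """Concatenate the variable-length code of each letter."""
    return "".join(CODES[ch] for ch in text)


def decode(bits: str) -> str:
    """Read bits until they match a code; the prefix property makes this unambiguous."""
    letter_of = {code: letter for letter, code in CODES.items()}
    out = []
    current = ""
    for bit in bits:
        current += bit
        if current in letter_of:
            out.append(letter_of[current])
            current = ""
    if current:
        raise ValueError("trailing bits do not form a complete code")
    return "".join(out)


print(encode("abefd"))  # prints: 1010000100100000
```

Five letters take 16 bits here versus 15 bits with a fixed 3-bit code, so the saving only appears when short-coded letters such as "a" dominate the text.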

Reclaiming Space in Files
• File organization with the following operations:
  • record insertion
  • record deletion
  • record modification
• Space reclaiming is needed when
  • deleting fixed-length and variable-length records
  • modifying variable-length records
    • a modification can be treated as a deletion followed by an insertion

Dynamic Space Reclaiming -- Fixed-Length Records
• A naive approach: when inserting a new record,
  • search the file record by record;
  • if a deleted record is found, insert the new record in the place of the deleted record;
  • otherwise, insert the new record at the end of the file.
• Issues in reclaiming space quickly:
  • How to know immediately if there are empty slots in the file?
  • How to jump to one of those slots, if they exist?
• Link all deleted records together using a linked list:
  [Figure: head pointer → deleted record → deleted record → ... → deleted record → -1]

Dynamic Space Reclaiming -- Fixed-Length Records (cont’d)
• Use the linked list of deleted records as a stack:
  [Figure: head pointer → RRN 5 → RRN 2 → -1]
• Add (push) a recently deleted record of RRN 3 to the top of the stack:
  [Figure: head pointer → RRN 3 → RRN 5 → RRN 2 → -1]
• Remove (pop) the free slot at the top of the stack for an inserted record:
  [Figure: head pointer → RRN 5 → RRN 2 → -1]

Dynamic Space Reclaiming -- Fixed-Length Records (cont’d)
• Use the linked list of deleted records as a stack.
• Add (push) a recently deleted record of RRN 3 to the top of the stack.
• Insert three new records into the space of the deleted records.
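A minimal in-memory sketch of the stack of deleted slots follows. It models the file as a Python list indexed by RRN; the class name, the `("*", next_rrn)` deletion marker, and the sample records are assumptions made for the illustration, not details from the slides.

```python
class FixedLengthFile:
    """Toy model of a fixed-length record file with an avail list kept as a stack."""

    def __init__(self, records):
        self.slots = list(records)  # slot index == RRN
        self.head = -1              # RRN of the first free slot, -1 if none

    def delete(self, rrn):
        # Push: the freed slot stores the old head, then becomes the new head.
        self.slots[rrn] = ("*", self.head)
        self.head = rrn

    def insert(self, record):
        # No free slot: append at the end of the file.
        if self.head == -1:
            self.slots.append(record)
            return len(self.slots) - 1
        # Pop: reuse the slot at the top of the stack.
        rrn = self.head
        self.head = self.slots[rrn][1]
        self.slots[rrn] = record
        return rrn


f = FixedLengthFile([f"rec{i}" for i in range(6)])
f.delete(2)
f.delete(5)
f.delete(3)            # list is now: head -> 3 -> 5 -> 2 -> -1
reused = [f.insert(name) for name in ("new-a", "new-b", "new-c")]
print(reused, f.head)  # prints: [3, 5, 2] -1
```

Both operations touch only the head pointer and one slot, which answers the two "how to know" and "how to jump" questions without scanning the file.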

Dynamic Space Reclaiming -- Variable-Length Records (cont’d)
• When inserting a new record, we need to search the available list for a deleted record that is large enough.
• The current available list:
  [Figure: Size 47 → Size 38 → Size 72 → Size 68 → -1]
• Insert a record of 55 bytes:
  [Figure: Size 47 → Size 38 → (new link) → Size 68 → -1; removed record: Size 72]
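The search-and-unlink step can be sketched as below, using the sizes from the figure. The byte offsets (100, 200, ...) and the dictionary representation of the available list are hypothetical stand-ins for pointers stored inside the deleted records.

```python
def take_first_fit(slots, head, needed):
    """Walk the avail list and unlink the first slot with size >= needed.

    slots maps offset -> [size, next_offset]; -1 ends the list.
    Returns (new_head, offset_of_taken_slot), with offset -1 if nothing fits.
    """
    prev, cur = -1, head
    while cur != -1:
        size, nxt = slots[cur]
        if size >= needed:
            if prev == -1:
                head = nxt             # the first slot was taken
            else:
                slots[prev][1] = nxt   # the "new link" that bypasses the slot
            del slots[cur]
            return head, cur
        prev, cur = cur, nxt
    return head, -1


# Sizes from the figure; the offsets are made up for the example.
slots = {100: [47, 200], 200: [38, 300], 300: [72, 400], 400: [68, -1]}
head, taken = take_first_fit(slots, 100, 55)

remaining = []
cur = head
while cur != -1:
    remaining.append(slots[cur][0])
    cur = slots[cur][1]
print(taken, remaining)  # prints: 300 [47, 38, 68]
```

Unlike the fixed-length case, the list cannot be treated as a pure stack, because the slot at the head may be too small for the new record.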

Storage Fragmentation
• Internal fragmentation caused by fixed-length records:
    Ames|John|123 Maple|Stillwater|OK|74075|...................................
    Morrison|Sebastian|9035 South Hillcrest|Forest Village|OK|78420|
    Brown|Martha|625 Kimbark|Des Moines|IA|50311|.........................
• Internal fragmentation caused by variable-length records:
  • The inserted record is shorter than the deleted record it replaces:
    HEAD.FIRST_AVAILABLE:-1
    40 Ames|John|123 Maple|Stillwater|OK|74075|64 Ham|Al|28 Elm|Ada|OK|70332|.....................................................|45 Brown|Martha|625 Kimbark|Des Moines|IA|50311|
  • Reclaim the unused part of the deleted record by returning it to the available list:
    HEAD.FIRST_AVAILABLE:43
    40 Ames|John|123 Maple|Stillwater|OK|74075|35 *|-1................................26 Ham|Al|28 Elm|Ada|OK|70332|45 Brown|Martha|625 Kimbark|Des Moines|IA|50311|

Storage Fragmentation (cont’d)
• External fragmentation is caused by continuing to insert records until some free space is too fragmented to be useful:
  • Insert a record of 25 bytes:
    HEAD.FIRST_AVAILABLE:43
    40 Ames|John|123 Maple|Stillwater|OK|74075|8 *|-1.....25 Lee|Ed|Rt 2|Ada|OK|74820|26 Ham|Al|28 Elm|Ada|OK|70332|45 Brown|Martha|625 Kimbark|Des Moines|IA|50311|
• How to handle external fragmentation:
  • Storage compaction: regenerate the file when external fragmentation becomes intolerable.
  • Coalescing the holes: combine two record slots on the available list if they are physically adjacent.
  • Placement strategy: adopt a placement strategy that minimizes fragmentation.

Placement Strategies
• First-fit placement strategy: take the first available slot that is large enough for the inserted record.
  • Least amount of work when we place a newly available slot on the list.
• Best-fit placement strategy: take the smallest available slot that is large enough for the inserted record.
  • Order the available list in ascending order by size, then use the first-fit placement strategy.
  • After inserting the new record, the free area left over may be too small to be useful. This may cause serious external fragmentation.
  • The small free slots accumulate at the beginning of the available list, making the search for the first fitting slot increasingly long as time goes on.
• Worst-fit placement strategy:
  • Order the available list in descending order by size, then use the first-fit placement strategy.
  • Always insert the new record into the first slot. If the first slot is not large enough, the new record is inserted at the end of the file.
  • Decreases the chance of external fragmentation.
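To compare the three strategies, the sketch below picks a slot size from an unordered list of free-slot sizes and reports the leftover space. It is a simplification: it selects by scanning rather than by keeping the list sorted as the slides describe, and the function name and the 30-byte request are assumptions for the example.

```python
def choose_slot(avail, needed, strategy):
    """Return the size of the slot the strategy would pick, or None if none fits.

    avail is a list of free-slot sizes in list order. None means the record
    would be appended at the end of the file.
    """
    fits = [size for size in avail if size >= needed]
    if not fits:
        return None
    if strategy == "first":
        return fits[0]      # first slot in list order that is large enough
    if strategy == "best":
        return min(fits)    # tightest fit, smallest leftover
    if strategy == "worst":
        return max(fits)    # largest slot, largest leftover
    raise ValueError(f"unknown strategy: {strategy}")


avail = [47, 38, 72, 68]
for strategy in ("first", "best", "worst"):
    size = choose_slot(avail, 30, strategy)
    print(strategy, size, size - 30)
# prints:
# first 47 17
# best 38 8
# worst 72 42
```

The 8-byte leftover under best fit is the kind of fragment that may be too small to reuse, while the 42-byte leftover under worst fit is more likely to hold another record.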

Binary Search
• Search by guessing.
  • Use the RRN to jump around the file.
• Searching a file of n records takes
  • in the worst case: ⌊log₂ n⌋ + 1 comparisons,
  • in the average case: about ⌊log₂ n⌋ + 1/2 comparisons.
• Requirements
  • Works only for fixed-length records.
  • The records must be sorted on the search field.
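A sketch of binary search by RRN over fixed-length records is shown below. It uses an in-memory `io.BytesIO` buffer in place of a disk file; the 16-byte record length, the "key|" record layout, and the sample names are assumptions for the illustration.

```python
import io

RECORD_LEN = 16  # assumed fixed record length in bytes


def read_record(f, rrn):
    """Jump straight to a record: byte offset = RRN * record length."""
    f.seek(rrn * RECORD_LEN)
    return f.read(RECORD_LEN).decode("ascii").rstrip()


def binary_search(f, n, key):
    """Return the RRN of the record whose first field equals key, or -1."""
    low, high = 0, n - 1
    while low <= high:
        mid = (low + high) // 2
        record_key = read_record(f, mid).split("|")[0]
        if record_key == key:
            return mid
        if record_key < key:
            low = mid + 1
        else:
            high = mid - 1
    return -1


# Records must already be sorted on the key and padded to the fixed length.
names = ["Ames", "Brown", "Ham", "Lee", "Morrison"]
data = io.BytesIO(
    b"".join(f"{name}|".ljust(RECORD_LEN).encode("ascii") for name in names)
)
print(binary_search(data, len(names), "Lee"))   # prints: 3
print(binary_search(data, len(names), "Zed"))   # prints: -1
```

Every probe is a separate seek, so on a real disk each comparison can cost one disk access.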

Sorting a Disk File in RAM
• If the records are not in order, they must be sorted before we can use binary search.
• Consider any internal sorting algorithm: bubble sort, quick sort, bucket sort, etc.
  • If applied directly to data stored on disk, these algorithms require many disk accesses (seeking, rotational delay) and multiple passes over the list, which is extremely slow.
• If the entire file fits into RAM, load the entire contents of the file into RAM and perform an internal sort.
  • The records can be read sequentially.
  • Much faster if the file is stored sequentially.
• This is an example of a general rule: minimize disk access cost by forcing disk accesses into a sequential mode and performing complex, direct access in memory.

Limitations of Binary Searching and Internal Sorting
• Binary searching requires more than one or two disk accesses
  • Accessing records by relative record number (RRN), we can retrieve a record with a single disk access.
  • Ideally, we can combine RRN retrieval (single access) and search by key (ease of use).
• Keeping a file sorted is very expensive
  • If record insertion is as frequent as record search, it is expensive to keep records sorted.
  • Alternative: keep records unsorted and use sequential search.
• An internal sort works only on small files
  • It is not possible to read all records of a large file into main memory.
  • Load only the keys into main memory -- keysorting.

Keysorting
• Load only the record keys into RAM.
• Each element of the KEYNODES[ ] array has two fields: KEY and RRN. There is a one-to-one correspondence between the elements of KEYNODES[ ] and the records in the actual file.
• The actual sorting process simply sorts the KEYNODES[ ] array on the KEY field.
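The idea can be sketched in a few lines. Here a Python list stands in for the file on disk, with the list index acting as the RRN; the sample records and the choice of the first field as the key are assumptions for the example.

```python
# A list stands in for the file on disk; the index of each record is its RRN.
records = [
    "Morrison|Sebastian",
    "Ames|John",
    "Lee|Ed",
    "Brown|Martha",
    "Ham|Al",
]

# Pass 1: read the file sequentially, keeping only (KEY, RRN) pairs in memory.
keynodes = [(record.split("|")[0], rrn) for rrn, record in enumerate(records)]

# Sort the small in-memory array on KEY; the records themselves do not move.
keynodes.sort()

# Pass 2: fetch the records in key order, one direct access per record.
sorted_records = [records[rrn] for _key, rrn in keynodes]

print([rrn for _key, rrn in keynodes])  # prints: [1, 3, 4, 2, 0]
```

The RRN sequence printed at the end shows why the second pass jumps around the file instead of reading it sequentially.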

Limitation of Keysorting
• The keysort method requires two reads and one write for each record.
  • The first pass of reads can be done sequentially, sector by sector.
  • The second pass of reads cannot be done sequentially. It may require many random seeks.
  • Since the write operations interleave with the reads in the second pass, these writes also require separate seeks.
• If only one copy of the records is kept on disk, it is not easy to create a sorted version of the file from the KEYNODES[ ] array.
• Solution:
  • Do not write the sorted file back to disk.
  • Write only the KEYNODES[ ] array back to disk, as an index file.
