File Structures by Folk, Zoellick, and Ricarrdi Chapter 7Indexing
Chapter Objectives • Index files. • Operations Required to Maintain an Index File. • Primary keys. • Secondary keys.
Index • a tool for finding records in a file • consists of: • key field • field on which the index is searched • reference (address or RRN) field • tells where to find the data file record associated with a particular key.
Examples of an Index • book index • usually at the end of the book • arranged alphabetically by topic • The index in a library (an on-line catalog) allows you to locate items by an author, by a title, or by a call number. • photo thumbnails • usually represents a link to the actual photo • much smaller file, can be loaded quickly • actual photo takes much longer to load • if index was actual photos, would take long to load
Example: Index in Databases • University uses an index file to keep track of its courses. • The data file consists of the following fields in each record: • Department • Title • Professor • Student List • Room & Time
Example: Primary key • Primary key could be • Department • not specific enough • Course Number • not unique • Professor • not unique • Room & Time • possible • classes aren’t identified this way • Department + Course Number • Obvious?
Index file • used to provide rapid access to individual records in the data file via the keys • Example index file consists of the following fields: • key (e.g. CIS402) • reference (address) =address of the corresponding record in the data file
Operations on an Indexed File • Create (when data file is created) • Load into memory (whole file, if possible and prudent) • Write updated file to permanent storage • Record(s) added to data file • Record(s) deleted from data file • Update record(s) in data file • Searches
Creating Files of Data Load Index • Create files • index file • data record file • via • buffer I/O • an array. Writing Back Index File • Can be part of the close operation for the index file • close function in index object can write the buffer/array to the disk before closing file
Record addition • Adding a new data record to the data file requires adding a new record to the index file • If the index file is sorted: • adding a new record may require rearranging the records in this file. • depends upon index file representation in memory • if sort necessary, easily done if the indices are in main memory
Record deletion • Deletion of a data record requires deletion of the corresponding index record. • Can space in data file be reclaimed? • Difficult, as with index file organization all data records are pinned • a pinned data record is one that has a reference to its address in an index file • Other consequences • Resorting difficulty • Solution: Sort the file via the indices
Record Updating • Two categories of updates: • modification of key value • re-ordering of the index file might be required • two possible situations • modifying key reorders file • see below • modification of non-key value • might still require reordering of records in the data file. (WHY?) • size of data record might increase, requiring moving it to space that can hold it • must reset index for that record
Indexes too large for Memory • kept on the secondary storage • disadvantages • time consumption • searching the index file • requires disk accesses instead of just memory accesses • rearranging indexes • requires disk accesses
Solutions to Index Files in 2ndary storage • If the index file is too large to be kept in main memory than the following alternative organizations should be considered: • a hashed organization (if access speed is very important) • probably won’t get to this • a tree structured organization, or a multilevel index such as a B-tree • get ready
Pros of a simple index file • allows for use of binary search • sorting and maintaining an index is much easier than for a data file • true if index entries are much smaller than data records, • if data records are pinned, can rearrange keys without moving data records • apply them to multiple simple indexes...
Indexing with Multiple Key Access • unique primary key often used as a search keyword. • Example primary key • CIS 402 • What if you’d like to include the prof in the search? • Two keys: Course & Prof • Could also be: • Course & Time • Location & Time (?)
Secondary key • A secondary key is a key for which multiple records may exist in the data file. • Example: • Sorting an Excel sheet using two fields (e.g. name & section) • A professor teaches more than one class
Secondary Index File • create for the possible secondary indexes. • secondary keys can be shared • primary keys were unique • Example: • Professor Spiegel 2ndary keys: • Primary keys containing this prof: • CIS 136 • CIS 235 • CIS 402 • Can access those courses via the secondary key • What if course has multiple sections?
Record Addition • Adding a record to the data file likely requires adding a record to the secondary index file. • Costs are similar to the cost of adding a record in the primary index file. • records might have to be shifted • indexes may have to be rearranged
Record Deletion • must remove all references to that record in the file system. • search for primary key in primary index file • remove index • search in secondary index file • for the primary key of the record to be deleted • remove index from the secondary index file. • what if secondary keys are maintained? • secondary key refers to primary key • primary key will have been deleted, and will not exist • if we consider this possiblility, don’t have to delete secondary key • pitfalls?
Record Updating • There are three possible situations: • secondary key altered • may have to rearrange the secondary key index so it stays in sorted order • primary key altered • big impact on the primary key index • in the secondary key index only need to update the affected primary key field • confined to non-key fields • all updates that do not affect either the primary or secondary key fields do not affect the secondary key index, even if the update is substantial. • recall, can affect primary index, since that refers to location in data file
Retrieving Data with Multiple Secondary Keys • Example: • All courses taught by Spiegel or Gordon • Requires two searches • searches produce a list of courses by providing primary keys. • Spiegel: CIS136, CIS235, CIS402 • Gordon: CIS425, CIS520, CIS243
Boolean AND in searches • Example: • Search for courses: • taught by Spiegel • located in Lytle Hall • Courses found are in intersection of • courses taught by Spiegel • courses offered in Lytle Hall
Boolean OR searches • Example: • Search for courses: • taught by Spiegel • located in Lytle Hall • Courses found are in union of • courses taught by Spiegel • courses offered in Lytle Hall
Cons of the Current Secondary Index Structure • index file has to be rearranged every time a new record is added to the file. • for duplicate secondary keys, secondary key field is repeated for each entry.
Improvements to the secondary index key structure • Solution 1 • Allow for multiple primary keys to be associated with a single secondary key by allocating a primary key list (STL vector is best; why?) for each secondary key entry. • Solves the problem of sorting each time when an new entry is added. • According to text: Suffers from internal fragmentation due to fixed nature of list, and the number of allocated entries in the array may prove too small. • STL (or Java) vector fixes this: How?
Improvements to the secondary index key structure • Solution 2 • Create an inverted list of indexes. Have each secondary key point to a list of primary key references associated with it. • This method eliminates most of the problems associated with maintaining a secondary index file. • Which solution is better?
Selective Index • contains keys for only a portion of the records in the data file. • fitting certain criteria • e.g. • Courses offered after 12 noon • states with a ‘G’ in their name
Binding • takes place when a key is associated with a particular physical record in the data file. • takes place: • during the preparation of the data file and indexes • allows faster access • later on during program execution • secondary keys can be bound to primaries, or to actual addresses • advantage: quicker access • disadvantage: if record edited/deleted