CT213 – File Management

CT213 – File Management Petronel Bigioi

Content • File Management Overview • File Management Functional Requirements • File Management Architecture • File Organization – pile, sequential, indexed sequential, indexed and hashed • File Directory • Content and structure • File Sharing • Access rights and simultaneous access • Record blocking • Secondary Storage management • File Allocation and Free Space Management

File Management Overview • Filing systems provide • A storage service – clients don’t know about the physical characteristics of the disks or where the files have been stored on them • The filing systems must make sure that a file is not lost, even if there are hardware failures or software crashes • A directory service – clients can give convenient text names to files and group them in directories (establish some relationship between them) • Clients should be able to access the sharing of their files with other by specifying who can access a given file and in what way

File Management Overview • (1) – client calls an operation such as open-file with the text name as an argument • Another argument could be a field specifying the type of access (read-only or read-write) • The directory service will carry out an access check to ensure that the client is authorized to access the file • The directory service will translate the text name into a form that enables the file storage service to locate the file on disk (name resolution) • In order to do the name resolution, the directory service may call the file storage service, the file storage service may call the disk handler to access the disk and find the required information • (2) The filing systems stores a very large numbers of files, only a few being used at any time. • The filing system is ready for the client to use the file (It will have set up the information about this file in its tables in main memory) • It will return a user-file-identifier (UFID) for the client to use in subsequent requests to read or write the file • (3) Request to read the file • (4) The storage service returns the portion of the file that was requested at (3)

File Management Overview • Concurrency control must be provided • One approach is to allow multiple clients read control, but only one client the right to write • An operating system will deal with such approach by noting if a file system has been opened for reading access, in which case it will allow multiple clients or for writing, in which case it would refuse any subsequent requests to read or write from other clients; this is called mandatory concurrency control. A file is said to be locked for reading or writing. • Many applications may need to have write access to different parts of the same file (the above approach is inflexible in this case) • Many operating systems allow simultaneous write access to files, being the application job to make sure it will sync the access • Eventually the OS will provide some extra locking service to help the clients to cooperate with each other.

Files Common Terms • Field – basic element of data, containing a single value (i.e. an employees last name) • Characterized by its length and data-type (i.e. ASCII string, decimal, etc..) • Depending on file design, it may be fixed or variable length • Record – a collection of related fields that can be treated as single unit • The record of an employee can contain fields such as name, social security number, etc.. • Up to the design of the file system it may be fixed or variable in length

Files Common terms • File – is a collection of similar records • The file is treated as a single entity by users and applications and may be referenced by name • Files have unique file names and may be created and deleted • Access control restrictions usually apply at the file level • Database – a collection of related data • The essential aspect is the relations that exists between elements of data are explicit • The data base itself consists of one or more types of files • Usually there is a separate database management system that is independent of the file system management and the operating system

File Management Requirements • To meet the data management needs and requirements from the user • Create, delete, read and change files; controlled access to other users files; control the type of access to own files; restructure the files; move data between files; backup and recovery files in case of damage; file access by using symbolic file names • To guarantee that the data in the file is valid • To optimize performance • To provide I/O support for a wide range of storage device types • To provide a standardized set of I/O interface routines • To provide I/O support for multiple users, in the case of multiple-user systems

File System Architecture • The Basic I/O Supervisor is responsible for all file I/O initiation and termination • Maintains control structures that deal with device I/O scheduling and file status • Is concerned with the selection of the device on which file I/O is to be performed, on the basis of which file has been selected • Is also concerned with scheduling disk and tape access to optimize performance • I/O buffers and secondary memory is allocated at this level • Logical I/O enables users and applications to access records. Thus, whereas the basic file system deals with blocks of data, the logical I/O module deals with file records. • Logical I/O provides a general purpose record I/O capability and maintains basic data about files • Access method is the closest level of the file system to the user • Provides a standard interface between applications and file system • Different access methods: pile, sequential, indeed sequential, indexed, hashed • At the lowest level we have device drivers, that do communicate directly with the peripheral device or their controllers or channels • A device driver is responsible for staring I/O operations on a device and processing the completion of an I/O request • Device drivers are considered to be part of the operating system, and in the case of file system, the devices controlled are disks and tapes • Basic File System deals with blocks of data that are exchanged with the disk or tape systems. • It is concerned with the placement of those blocks on the secondary storage device and on the buffering of those blocks in main memory • It doesn’t understand the content of the data structure of the files involved

File Organization • Criteria: • Rapid access • Ease of update • Economy of storage • Simple maintenance • Reliability • These criteria may vary in importance • CD-ROM – Ease of update irrelevant • Indexes – Faster but use more storage • We will outline five common organizations (the actual number that have been implemented or proposed is unimaginably large) • Pile • Sequential file • Indexed sequential file • Indexed file • Direct (Hashed) file

File Organization • Pile • Add data to the file as it arrives (chronological order) • Record size and field order may vary (variable length records, variable set of fields) • Requires use of exhaustive search • Sequential File • Fixed length record format • Size and order of fields fixed • Key field - unique record ID • Records stored in order based on key • Handles random requests poorly • Must use sequential search (batch system) • Hard to insert new records

File Organization • Indexed Sequential File • Still maintains the organization of records in sequence, based on a key • Adds an index to the file to speed lookup • Index provides a lookup capability to reach quickly the vicinity of a desired record • The index file contains two records: the key and a pointer to the main file • May have multiple levels of indexes • Overflow area to handle new records • Each record in the main file contains a hidden pointer to the overflow file (used if needed) • Link from main records to overflow, and back • Operations: • Search: To find a specific field, a search begins in the index file. The highest key value that is less than or equal to the desired key record is looked up in the index file. A pointer to the main file is retrieved and the search continues in the main file. • Additions: Each record in the main file contains an additional field (not visible to the application) that is a pointer to the overflow file; When a new record is to be inserted, it is actually added to the overflow file. The record in the main file, that immediately precedes the new record in logical sequence is updated to contain a pointer to the new record (in the overflow file).

File Organization • Indexed File • Useful when is necessary to search for a record on the basis of some other attribute than the key field • May have multiple indexes • One for each field we may search • Records accessed only through the indexes • Each index may be: • Exhaustive – contains one entry for every record in the main file • Partial – contains entries only for records where the field of interest exists • Used in applications where time is critical and where data is rarely processed exhaustively (such as reservation systems, inventory controls, etc…) • Direct (Hashed) file • Use hashing on a key to find the record • No notion of sequential access • Generally used when rapid access to one record is required (directory)

Hashing • It can find most of the items with a single seek • Insertions and deletions can be handled without added complexity • Assuming that a number of N items are to be inserted into a hash table of length M, with M>=N • Insert an item into the hash table: • Convert the label of the item to near random number n (between 0 and M-1) (i.e. if the label is numeric, then a popular mapping function is to divide the label by M and take the reminder as the value of n • Use n as index into the hash table • If the entry is empty, then store the item • If the entry is occupied, then store the item according to the hashing criteria (linear or overflow with chaining) • Table lookup of an item whose label is known • Convert the label of the item to a near random number n (using same mapping function as for insertion) • Use n as index into the hash table • If the corresponding entry is empty, then the item hasn’t been inserted • If the corresponding entry is occupied, and the labels match, then retrieve the value • If the corresponding entry is occupied and the labels are not matching, then continue the search according to the hashing criteria ( linear or overflow with chaining)

Linear Hashing Labels of the items to be stored are numeric and the hash table has eight positions (M=8). The hashing function takes the reminder upon division by 8 • In the linear hashing schema, if the entry is already occupied, set n=(n+1)(mod M) and try again. Perform this step until we will find an empty entry • The figure assumes that the entries have been inserted in the ascending order • Item 50 and 51 maps in positions 2 and 3 • Item 74 maps in position 2, position 2 is taken, so we try position 3 (taken). Next is position 4 we need to try and is empty, so we write it on position 4 • The average search is not depending on the table size, is dependent of how full the table is (at 80%, we are getting an average for the search around 3)

Hashing using Overflow with Chaining • A separate table in which overflow entries are inserted is kept. This table includes pointers passing down the chain of entries associated with any positions in the hash table. • For large values of N and M, for N=M, the average search is around 1.5 • This method provides for compact storage with fast lookup.

File Directory • A file directory is a structure associated with any file management system and collection of files • It contains information about the files, including attributes, location and ownership. Most of this information (especially the one concerned with the storage is handled by the operating system) • The directory itself is a file, owned by the operating system and accessible by various file management routines. • Some of the information in directories is available to users through system routines • The users cannot directly access the directory even in read-only mode

Typical Directory Entries • Basic • Name – Unique in directory (some systems permit file versions) • Type – Text, binary, load module, etc. • Organization – Sequential, indexed, etc. • Address • Device – Which disk holds the file • Often this must be the same device as the directory is on • Starting address/Blocks used • Block #, cylinder #, or other location id • Size used – Current file size • May be in bytes or blocks • Size allocated – Maximum space allocated for this file • Not used on all file systems

Typical Directory Entries • Access Control • Owner – Who has control of the file • Access Information – What users are allowed to work with the file • Permitted Actions – Controls reading, writing, etc. • Usage Information • Date Created • Identity of Creator • Date Last Read Access • Identity of Last Reader • Date Last Modified • Identity of Last Modifier • Date of Last Backup • Current Usage – Who has the file open, is the file locked, are there updates waiting in main memory?

Directory Structure • Operations to support: • Search for the file entry (open) • Create a new file • Delete a file • List the files in the directory • May be for all or part of the directory • May also include attribute information • Simplest form • A list of directory entries, one for each file (CP/M, DOS 1.0) • Difficult to handle large numbers of files or multiple users • The directory would be very large and held on the disk (looking up a given filename in it would take the directory service a long time) • Different users might use the same text names for their files. Unique text names would be achieved by appending the username to each filename. • Some support for organizing the information is desirable. Convenient grouping within user’s files should be supported for easy location and access control • More complex form • One directory for each user • Easier to manage access information • Users still can’t structure files

Directory Structure • Tree-structured file system • Single master (root) directory • DOS: Master directory for each drive • Each directory may contain files and other subdirectories • Names only unique in directory • Each directory often stored as a sequential file • Less effective when there are a large number of files in a given directory

Directory Tree Structure

Directory Tree Example • Path - following set of directories from master directory to file • Example: /UserB/Word/UnitA/ABC • “/” often used to separate directories • Working directory • Current directory for files: /UserB/Word • Files in this directory unless path given

File Sharing • Rights that may be granted: • None – Others don’t know it exists • Often done by preventing user from reading the parent directory (Unix) • May have an explicit permission bit for access to the file name (Novell) • Knowledge – Know it is there and who the owner is • Execution – Able to run a program • Read – Look at/copy contents • Execute and Read may be independent • Append – Add data to the file • Cannot modify existing contents • Update – Modify/delete/add data • Change Protection – Grant rights to file • Owner can specify what other users have rights to this file • Deletion – Can delete file

File Sharing • Who to grant rights to: • Specific user • May allow different users to have distinct permissions • Group of users • World (public files) • Simultaneous Access • Multiple users may want to access or modify the same file • Example: Airline reservation database • Locking: Entire file vs. Records • Easier to lock entire file • Locking records allows more concurrency • Instance of reader/writer problem • Must address mutual exclusion and deadlock

Record Blocking • A record is the logical access unit of a file • Blocks are unit of I/O with secondary storage. For I/O to be performed, records must be organized as blocks. • Issues to consider: • Should be blocks be fixed or variable length • On most systems blocks are fixed length • Simplifies I/O, buffer allocation in main memory and organization of blocks on secondary storage • What should the relative size of a block be compared to the average record size • The larger the block the more records that are passed in one I/O operation • If a file is being processed or searched sequentially, than this is an advantage • If records are being accessed randomly, it will result in unnecessary transfer of unused records, than this is a disadvantage • Three methods of blocking • Fixed blocking, variable-length spanned blocking and variable-length un-spanned blocking

Fixed Blocking • Fixed length records are used and an integral number of records are stored in a block • There may be unused space at the and of each block (internal fragmentation)

Variable Length Spanned Blocking • Variable length records are used and are packed into blocks with no unused space • Two records may span across two blocks with the continuation indicated by a pointer to the successor block • Wastes space only at the end of the file

Variable Length Unspanned Blocking • Variable length records are used, but spanning is not employed. • There is wasted space inmost blocks because of the inability to use the remainder of a block if the next record is larger than the remaining unused space

Secondary Storage Management • On secondary storage a file is a collection of blocks; the operating system or file management system is responsible for allocating blocks to files • Two management issues • Space on secondary storage must be allocated to files • Keep track of the space available for allocation • The approach taken for file allocation may influence the approach taken for available space management

File Allocation • Issues in file allocation • When a new file is created, do we specify the maximum size? Is that space allocated at once? • Space is allocated to a file as one or more contiguous units (portions). How big of a unit should we use when allocating space for a file? • How do we keep track of what space has been allocated to a given file (what kind of structure or table is used to keep track for a unit allocated to a file)? • Pre-Allocation • Declare max size in advance • May be hard to guess space needed • Tendency to overestimate space needed • Ok if the file will never change • Dynamic allocation • Get space as the file needs it • Files are often no longer contiguous

File Allocation • Portion (unit) size • At one extreme, a single unit large enough to hold the entire file, while at the other extreme space on disk is allocated one block at a time. • In choosing the unit allocation size, there is a tradeoff between efficiency from the point of view of a single file versus overall system efficiency • Few items to be considered: • Having lots of small units requires more space for allocation tables • Fixed-size portions simplifies the reallocation of space • Variable-sized units or small fixed-size units reduces wasted space

File Allocation • Two common alternatives: • Variable-sized large contiguous portions • Minimizes waste, allocation overhead • Have to deal with fragmentation • First-Fit – choose the first unused contiguous group of blocks of sufficient size from a free block list. • Best-Fit – choose the smallest unused group that is of sufficient size • Nearest-Fit allocation – choose the unused group of sufficient size that is closest to the previous allocation for the file to increase locality • Blocks – Small fixed-size portions • May require large tables or complex structures for their allocation • Abandons contiguity • Allocate blocks as needed • Either strategy is compatible with pre-allocation and dynamic allocation. Not clear which strategy is best.

File Allocation Methods • Three methods are in common use • Contiguous allocation • Single contiguous set of blocks is allocated to a file at the time of file creation • Chained allocation • Each block contains a pointer to the next block in the chain • Indexed allocation • The file allocation table contains a separate one level index for each file; the index has one entry for each allocated portion (unit) to the file.

Contiguous allocation • A single contiguous set of blocks assigned to a file when it is created • Pre-allocation strategy with variable-sized portions (units) • Good performance (especially for sequential files) • External fragmentation tends to occur • Use compaction to combine free space • Need to specify the size of the file at the time of creation • Used by CD-ROMs (ISO 9660) • Before compaction (left) • After compaction (right)

Chained Allocation • Allocate on the basis of individual blocks • Directory only links to the first block • Each block points to the next block • Easy to add blocks to a file • No external fragmentation • MSDOS FAT12/16/32 is a variation • Best suited to sequential files • File B, with start=1and len=5 • No accommodation for locality • If necessary to brig in several blocks of a file at a time, then series of access to different parts of the disk is necessary • To overcome this problem, files are “consolidated” by some systems

Indexed Allocation • File allocation table contains one level index for each file; the index has just one entry for portion (unit) allocated to the file • Typically the file indexes are not stored as part of the allocation table • The file index for a file is kept in a separate block and the entry for the file allocation table points to that block • Supports both sequential and random access to a file • In the figure above, a fixed size blocks allocation schema is presented • Eliminates external fragmentation

Indexed Allocation • Indexed allocation supports also a variable size portions (units) allocation schema • Improves locality • It is the most popular form of file allocation • In both cases, from time to time consolidation may be done • It reduces the size of the index for the variable sized portions schema

Free Space Management • Same as managing the space allocated to files, the free space that is not currently allocated needs to be managed • In order to be able to perform file allocation, we need to know what blocks on the disk are available, therefore we need to keep a disk allocation table, in addition to file allocation table • A number to methods to record free space: • Bit Tables – Free bit for each block • Works well with any of the presented allocation methods • It is as small as possible • Still, it can be large. The amount (in bits) required for a block bitmap is as follows: [disk size (Bytes)/ (8 *file system block size(BYTES))]. For a 16GBytes disk, we would get a 4MBytes large table … it is large to hold in memory and also large to search • To speed up the search in the bit tables, the OS may divide disk into sections • Additional data structures must be kept to summarize the status of each section (i.e. the number of free blocks and the maximum sized contiguous number of free blocks)

Bit Table Example • Example bit vector for the above figure • 00111000011111000011111111111011000

Free Space Management • A number to methods to record free space: • Chained Free List • Free portions may be chained together by using a pointer and length value in each free portion • It can produce fragmentation of the disk and many portions (units) will be a single block long. In this situation, every time a block gets allocated, it needs to be first read, to find out the pointer to the next free block. If this is done for a file creation (for multiple blocks) it can slow down the operation. Similarly, deleting highly fragmented files, it is time consuming • Indexing • Free space is treated as a “file”, store list of blocks in the same manner as ordinary files • For efficiency, the index should be on the basis of variable-size portions rather than blocks • One entry in the file for each free portion on the disk • Provides efficient support for all known file allocation methods • Free Block List • Each block is assigned a number sequentially and the list of the numbers of all free blocks is maintained in a reserved portion of the disk • Can treat it as a stack and re-allocate recently freed blocks – only the last few blocks need to be kept in memory • Can use FIFO structure (a block is allocated from the head of the FIFO, and de-allocated by adding it to the tail of the queue) • May have a background process that works to facilitate contiguous allocation

Free Space Management • Where to store allocation tables • Memory • Table size may be a problem • Information may be lost if it crashes • Disk • Requires extra read/write to allocate a block - slows system down dramatically • Handling system crashes • Lock the allocation table on disk, do the allocation, update the disk • Will make the system very slow • May choose to pre-allocate a batch of blocks, then allocate to files on demand • Mark it “in use” on disk • Clean it up when a crash occurs

References • “Operating Systems”, William Stallings, ISBN 0-13-032986-X • “Operating Systems”, Jean Bacon and Tim Harris, ISBN 0-321-11789-1

CT213 – File Management